Troubleshooting Problems in the Cloud
Imagine that you are managing a large cloud-based solution with lots of moving parts, thousands of users and fairly complicated business logic going on all the time. Now an irate customer phones support complaining that something bad happened. Sometimes you can figure out what happened quite easily from the explanation. Sometimes what the user describes sounds completely inexplicable and impossible. Now what do you do to troubleshoot the problem?
First let’s review the easy problems, the ones that can be dealt with quickly:
- From the description of the problem, the programmer goes, “oh, of course, we should have thought of that,” and on reviewing the code finds an easy fix which can be deployed quickly. (Too bad this doesn’t happen more often.)
- The programmer scratches his head, but carefully goes through the customer’s steps and reproduces the bug on his local system, so he can easily investigate in the debugger and solve the problem.
- The problem is a bit of a mystery, so a programmer and a support analyst use GoToMeeting to watch the customer work. On watching them work, they realize what this customer is doing differently to cause the problem, and then the programmer can reproduce and debug it.
- On examining the standard application logging, an unhandled exception is found in the log around the time the customer reported the problem. From this the code can be examined and the cause of the exception determined and fixed.
These are the standard and generally easy problems to fix. But what happens when these methods don’t yield any results?
Here are some examples of harder problems that can drive people crazy.
- The problem happens infrequently and can’t be consistently reproduced. But it happens frequently enough that customers are getting rather annoyed.
- The system works fine most of the time, but every now and again, the system goes berserk for some reason. Then it just magically goes back to normal, or you need to restart the servers to get things back on track.
- The problem only happens to certain users, perhaps at certain times. You can watch it happening to them via GoToMeeting, but can’t for the life of you get it to happen to you or to happen in your test environment and many other users never experience this.
- Often problems aren’t outright bugs, but other strange behaviors, like the whole system suddenly slowing down for no apparent reason and then some time later it goes back to normal, again for no apparent reason.
A lot of the time these sorts of things are due to interactions between what people are doing in different parts of the system. Predicting and testing all these situations is very difficult, and the problems are often a result of emergent phenomena, which we talk about a bit later.
Generally the only way to solve the hard problems is to instrument your web application so you know exactly what is going on. Not only does this help with solving hard problems, it also helps make your web site a better experience for users, since you can log what they do, how long it takes and where they are getting stuck. Often usability problems are more serious to users than program failures or other bugs.
Instrumenting your application generally means having good logging and making sure you log a lot of metrics that you can monitor. It can also mean having extra APIs to the running application where you can inquire about the state of various components, for instance to find out how many objects of one type are currently in use. Often your infrastructure, like your web server, will do a lot of logging for you, so be aware of this as there is no need to log things twice.
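As a rough illustration of the "extra API" idea, here is a minimal Python sketch that keeps a count of live objects by type and exposes it as JSON. The names `ResourceTracker`, `tracker` and `diagnostics_json` are made up for this example, and how you would actually expose the result (an admin-only HTTP endpoint, for instance) is left as an assumption.

```python
import json
import threading
from collections import Counter


class ResourceTracker:
    """Thread-safe counts of live objects, keyed by a resource type name."""

    def __init__(self):
        self._lock = threading.Lock()
        self._live = Counter()

    def acquired(self, kind):
        with self._lock:
            self._live[kind] += 1

    def released(self, kind):
        with self._lock:
            self._live[kind] -= 1

    def snapshot(self):
        # Copy under the lock so callers get a consistent view.
        with self._lock:
            return dict(self._live)


tracker = ResourceTracker()


def diagnostics_json():
    """What a hypothetical diagnostics endpoint could return."""
    return json.dumps({"live_objects": tracker.snapshot()}, sort_keys=True)


# Example: two connections opened, one closed.
tracker.acquired("db_connection")
tracker.acquired("db_connection")
tracker.released("db_connection")
print(diagnostics_json())
```

A count that only ever climbs between snapshots is usually the first visible sign of a leak.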
Having a dashboard to track these metrics is very helpful. The dashboard can both make reports and graphs from the application logs as well as use the application’s diagnostic API to provide useful information about what is happening. Often you can integrate with third party vendors like New Relic or Microsoft Application Insights, but one way or another you need this.
One objection to logging a lot of information is that the process of doing this can slow things down so much it becomes unacceptable. This is certainly true if you are logging synchronously to a file (i.e. not continuing on until the data is written). But modern systems get around this problem by logging asynchronously. The system doing the logging just fires the log message at a listener and goes on without waiting for or needing a reply. This makes logging much less of a burden on the application. It does move the problem to the listener application, which has to write the messages, and if it gets too far behind it usually starts spilling (dropping) messages. Most common logging infrastructures, like Microsoft’s ETW, already have this ability for you to take advantage of.
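To make the fire-and-forget pattern concrete, here is a small sketch using Python's standard `logging.handlers.QueueHandler` and `QueueListener`. The `DroppingQueueHandler` subclass, the queue size and the logger name are assumptions chosen for illustration; it is one way to get the "spill when too far behind" behavior described above, not the only one.

```python
import logging
import logging.handlers
import queue


class DroppingQueueHandler(logging.handlers.QueueHandler):
    """Queue handler that counts, rather than blocks on, overflowed records."""

    def __init__(self, q):
        super().__init__(q)
        self.dropped = 0

    def enqueue(self, record):
        try:
            self.queue.put_nowait(record)
        except queue.Full:
            # The listener has fallen behind: spill the message.
            self.dropped += 1


log_queue = queue.Queue(maxsize=10_000)  # bounded so memory cannot grow forever
queue_handler = DroppingQueueHandler(log_queue)

# The slow work (here, writing to stderr) happens on the listener's thread.
# A file or network handler could be used instead.
sink = logging.StreamHandler()
sink.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
listener = logging.handlers.QueueListener(log_queue, sink)

logger = logging.getLogger("app")
logger.setLevel(logging.INFO)
logger.addHandler(queue_handler)
logger.propagate = False

listener.start()
logger.info("order %s submitted", 1234)  # returns as soon as the record is queued
listener.stop()  # processes what is still queued; call at shutdown
```

The `dropped` counter is itself a metric worth putting on the dashboard, since it tells you when the logs you are reading are incomplete.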
Some things you must track via APIs or logging:
- Any resource used in the system. Java and C# programmers often think they can’t have leaks because of garbage collection, but the garbage collector only frees what is no longer reachable. Important resources still leak when an unexpected reference (a cache, a static collection, an event handler) keeps them alive, or when something like a connection or file handle is never closed. In a 24×7 web application that needs to essentially run forever in a busy state, this is incredibly important.
- Generally what users are doing, so you know what the possible interactions might be and which concurrent actions you may need to perform to replicate the problem.
- Any exceptions or errors thrown by the program. This tends to be a given, but make sure you include programmatically handled errors since these can be useful as well.
- Performance metrics. If something takes longer than expected, log it.
- Any calls to external systems and the results (with performance stats).
- Assert-type conditions where the program finds itself in an unexpected state (i.e. make sure these get logged and not compiled out for production).
- Anything the programmer considers a sensitive area of the program that they are worried about.
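Two items from the list above, slow operations and assert-type conditions, can be sketched in a few lines of Python. The helper names `timed` and `check`, the logger name and the time budget are all hypothetical; treat this as one possible shape rather than a recommended library.

```python
import logging
import time
from contextlib import contextmanager

log = logging.getLogger("app.metrics")


@contextmanager
def timed(operation, warn_after_s):
    """Log the duration of a block, at WARNING if it exceeds warn_after_s."""
    start = time.perf_counter()
    try:
        yield
    except Exception:
        # Record the failure, then let the caller handle it as before.
        log.exception("%s failed", operation)
        raise
    finally:
        elapsed = time.perf_counter() - start
        level = logging.WARNING if elapsed > warn_after_s else logging.DEBUG
        log.log(level, "%s took %.3fs (budget %.3fs)", operation, elapsed, warn_after_s)


def check(condition, message, *args):
    """Assert-like check that is logged instead of compiled out.

    Unlike the `assert` statement, this still runs under `python -O`.
    """
    if not condition:
        log.error("unexpected state: " + message, *args, stack_info=True)
    return bool(condition)


# Example usage
with timed("call to billing service", warn_after_s=0.5):
    time.sleep(0.01)  # stand-in for a real external call

balance = -5
check(balance >= 0, "negative balance %s", balance)
```

Wrapping every call to an external system in something like `timed` covers both the "results" and the "performance stats" for that call in one place.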
There are also some things that must not be logged. These include any sort of passwords, decryption keys or sensitive customer data (i.e. nearly all customer data). Generally a lot of people need to work with the logs, so there can’t be any sensitive information in them, as this would be considered a major security problem.
Make sure you have some good software (either home-grown, open source or commercial) to search and analyze your logs. These logs will get incredibly big and will need to be carefully managed (usually archiving them each day). There are many good packages to make this chore far easier.
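The safest approach is never to pass secrets to the logger in the first place, but a scrubbing filter makes a reasonable backstop. The sketch below uses Python's `logging.Filter`; the list of key names and the `key=value` / `key: value` pattern are assumptions, and it will not catch every format (quoted JSON keys, for example).

```python
import logging
import re

# Assumed key names; extend this to match whatever your application handles.
SENSITIVE_KEYS = ("password", "passwd", "secret", "token", "api_key")
_PATTERN = re.compile(
    r"(?i)\b(" + "|".join(SENSITIVE_KEYS) + r")(\s*[=:]\s*)(\S+)"
)


def redact(text):
    """Replace the value in key=value or key: value pairs for sensitive keys."""
    return _PATTERN.sub(r"\1\2[REDACTED]", text)


class RedactingFilter(logging.Filter):
    """Scrub the fully formatted message before it is written out."""

    def filter(self, record):
        record.msg = redact(record.getMessage())
        record.args = None  # message is already formatted
        return True


# Attach to a handler so every record passing through it is scrubbed.
handler = logging.StreamHandler()
handler.addFilter(RedactingFilter())
```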
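As a small example of managing log volume, Python's standard `TimedRotatingFileHandler` can start a new file each day and keep a fixed number of old ones, and writing one JSON object per line tends to make the logs easier for search tools to ingest. The field names, the `daily_file_handler` helper and the 14-file retention below are assumptions for illustration only.

```python
import json
import logging
import logging.handlers


class JsonLineFormatter(logging.Formatter):
    """One JSON object per line, a format most log search tools can ingest."""

    def format(self, record):
        return json.dumps({
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })


def daily_file_handler(path, keep_days=14):
    # Rolls the file over at midnight and keeps the newest keep_days rotated files.
    # delay=True means the file is not opened until the first record is written.
    handler = logging.handlers.TimedRotatingFileHandler(
        path, when="midnight", backupCount=keep_days, encoding="utf-8", delay=True
    )
    handler.setFormatter(JsonLineFormatter())
    return handler
```

Older files that fall outside the retention window are deleted by the handler, so anything you need to keep longer should be shipped to an archive before then.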
Emergent behavior refers to complex unexpected behaviors arising from large sets of much simpler processes. Modern web applications are getting bigger and bigger with many moving parts and many interactions between different systems and subsystems. We are already starting to see emergent behavior arising. Nothing on the scale of the system suddenly becoming intelligent, but on the scale where predicting what will happen in all cases has become impossible. It doesn’t matter how much QA you apply; as chaos theory suggests, a system like this is beyond simple prediction. That doesn’t mean it is unstable. If done properly your system should still be quite stable, meaning small changes won’t cause radically different things to happen.
The key point is to keep this in mind when building your system, and to make sure you have plenty of instrumentation in place so that even if you can’t predict what will happen, you can still see what is happening and act on it.
Diagnosing and solving hard problems in a large web application with thousands of concurrent users can be a lot of fun and very challenging. Having good preventative measures in place can make life a lot easier. You don’t want to be continually pushing out newer versions of your application just to add more logging because you can’t figure out what is going on.