Stability Testing of Sage 300 ERP
As we roll out Web versions of the Sage 300 ERP (Accpac) accounting modules, we need to ensure that the system can run for a long time without requiring a server reboot or other manual corrective action. This is far more important in a Web environment than in client/server. In client/server deployments the Accpac business logic typically runs on the client, so if it crashes (which is bad), it only affects one person. On the Web, a crash can affect every user of the system. Many ordinary GPFs can be caught and handled, costing one person some redone work; more severe errors, however, can take down a server process or even the server itself. Rebooting a server forces everyone to wait before they can continue working, especially if there is no one around to do the reboot.
As we develop Sage 300 ERP we want to ensure that the system is completely stable and can run for months under heavy multi-user load. Ideally the only time you need to reboot your server is when it installs Microsoft's monthly Patch Tuesday updates.
Why Do Things Crash?
In both client/server and web computing we need to deal with basic GPFs, which are usually caused by programming mistakes like memory overruns. In the early days of Windows these were very prevalent, but nowadays they can often be caught and handled within the program. Modern programming systems like Java and .Net also work to prevent these sorts of problems by not supporting the programming paradigms that lead to them. Even for C programs, there are better static code analysis tools and methods to test for and eliminate these sorts of bugs. Generally these problems are fairly reproducible and easily diagnosed and solved in a source code debugger.
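For example, an out-of-bounds write that would silently corrupt memory in a C program surfaces as a catchable exception in Java. This is a small illustrative sketch, not code from the product:

```java
public class BoundsDemo {
    // Attempt an out-of-bounds write; in Java this raises a catchable
    // exception instead of silently overwriting memory as it could in C.
    static String tryOverrun() {
        int[] buf = new int[4];
        try {
            buf[10] = 1; // index 10 is past the end of a 4-element array
            return "no error";
        } catch (ArrayIndexOutOfBoundsException e) {
            return "caught: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(tryOverrun());
    }
}
```

A server process can log such an exception and carry on serving other requests, rather than crashing outright.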
The harder problems to reproduce and diagnose usually involve running the system for a long time with a lot of users before something bad suddenly happens. The problem can't be reproduced with just one user, and it isn't known how to reproduce it in a shorter time. Often, once we understand what the problem really is, we can construct an easier way to reproduce it; by then we no longer need it to find the bug, but turning it into a unit test to make sure the problem doesn't come back is a good idea.
These sorts of problems usually start to occur when the system is stressed and low on resources. Sometimes our program has a resource leak and is itself the cause of the system being low on resources. Another cause is multi-user contention. We run as a multi-threaded process on the server, where individual threads handle individual web requests. If these aren't coded properly they can conflict and cause problems. The Java and .Net environments don't really help in these situations, and often make things harder by hiding the details of what is going on. These are the problems we really need to avoid.
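The classic example of such a threading conflict is an unsynchronized read-modify-write on shared state: each thread works correctly on its own, and the bug only appears under concurrent load. A minimal sketch (the counters are hypothetical, not from the product):

```java
import java.util.concurrent.atomic.AtomicInteger;

public class CounterDemo {
    static int unsafeCount = 0;                          // shared, unsynchronized
    static final AtomicInteger safeCount = new AtomicInteger();

    public static void main(String[] args) throws InterruptedException {
        Thread[] threads = new Thread[8];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(() -> {
                for (int j = 0; j < 100_000; j++) {
                    unsafeCount++;               // read-modify-write race: updates can be lost
                    safeCount.incrementAndGet(); // atomic, always correct
                }
            });
            threads[i].start();
        }
        for (Thread t : threads) t.join();
        // safeCount is always 800000; unsafeCount is typically less
        System.out.println("unsafe: " + unsafeCount + "  safe: " + safeCount.get());
    }
}
```

With one thread both counters agree, which is exactly why single-user testing never finds this class of bug.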
How to Find These Problems
How do we solve these problems? How do we test for them? We use a combination of manual and automated testing to find these problems, and then employ a number of tools to track down what happened.
Usually every Friday we get a large number of people to participate in manual multi-user testing, where everyone accesses a common server and performs transactions, either as part of functional testing that needs to be done anyway or by following scripts that concentrate effort in areas we are concerned about. Anything weird that happens is recorded for later investigation.
We also use automated testing. We make extensive use of Fitnesse and Selenium, as I blogged about here and here. We can string all these tests together, or run them repeatedly, to create long-running tests that verify the system stays stable and nothing bad happens over time.

Another very powerful tool is JMeter, which performs load testing on Web Servers. We record the HTTP requests that the browser makes to the server as users perform their tasks, then build these into tests that we can run for as long as we want. JMeter plays the HTTP requests back to the server, and it can simulate any number of users doing this. This gives us a way to easily automate multi-user testing and run with far more users, and for longer periods of time, than manual testing allows. Below is a screen shot of JMeter with one of our tests loaded. It might be hard to read, but notice the Pick One Randomly item under the Runtime item. This item picks one of several tests at random and runs it. We then simulate a large number of users doing this for long periods of time.
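Conceptually, what JMeter does is replay recorded requests from many simulated users, each randomly picking which recorded scenario to run. The self-contained sketch below illustrates the idea against a throwaway stub server; the paths, user count, and iteration count are made up for illustration, and the real tool is of course driven by a recorded test plan rather than code like this:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URL;
import java.util.Random;
import java.util.concurrent.atomic.AtomicInteger;

public class MiniLoadReplay {
    static final AtomicInteger requestsServed = new AtomicInteger();

    public static void main(String[] args) throws Exception {
        // Stub server standing in for the web server under test.
        HttpServer server = HttpServer.create(new InetSocketAddress(0), 0);
        server.createContext("/", exchange -> {
            requestsServed.incrementAndGet();
            byte[] body = "ok".getBytes();
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) { os.write(body); }
        });
        server.start();
        int port = server.getAddress().getPort();

        // "Recorded" requests; each simulated user replays one chosen at
        // random on every iteration, like JMeter's random controller.
        String[] recorded = { "/orders/list", "/orders/post", "/customers/list" };
        Random rnd = new Random();
        Thread[] users = new Thread[10];
        for (int i = 0; i < users.length; i++) {
            users[i] = new Thread(() -> {
                try {
                    for (int j = 0; j < 20; j++) {
                        String path = recorded[rnd.nextInt(recorded.length)];
                        HttpURLConnection conn = (HttpURLConnection)
                            new URL("http://localhost:" + port + path).openConnection();
                        conn.getResponseCode(); // issue the request, ignore the body
                        conn.disconnect();
                    }
                } catch (Exception e) { throw new RuntimeException(e); }
            });
            users[i].start();
        }
        for (Thread t : users) t.join();
        server.stop(0);
        System.out.println("served " + requestsServed.get() + " requests");
    }
}
```

Scaling the user count and running the loops for hours instead of seconds is what turns this pattern into a genuine stability test.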
How to Fix These Problems
The key to fixing these sorts of problems is diagnosing what is going on. If you can find the correct place in the code where things break down, then fixing the problem is usually fairly easy; the real trick is finding the one line of code that requires fixing. A major refactoring is required now and then, and then you need to re-test to ensure you have really fixed the problem and not introduced any new ones.
At the most basic level we include a lot of logging in the product, so when we run these tests we have logging turned on and can study the logs to see what happened. This only gives clues about where to look, and often we have to add more logging and re-run the test. Logging isn't glamorous, but it's tried and true and often gets you there in the end. We use log4j for our server logging.
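The pattern is level-gated diagnostic messages around the interesting operations. The product uses log4j; this minimal sketch uses the JDK's built-in java.util.logging instead so it stands alone, and the invoice operation is purely hypothetical:

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class LoggingDemo {
    private static final Logger log = Logger.getLogger(LoggingDemo.class.getName());

    // Hypothetical business operation instrumented with logging.
    static double postInvoice(String id, double amount) {
        // Supplier form: the message string is only built if FINE is enabled,
        // so detailed logging costs almost nothing when turned off.
        log.fine(() -> "posting invoice " + id + " for " + amount);
        if (amount < 0) {
            log.log(Level.WARNING, "negative amount {0} on invoice {1}",
                    new Object[] { amount, id });
        }
        return amount;
    }

    public static void main(String[] args) {
        System.out.println(postInvoice("INV-001", 125.50));
    }
}
```

Because the detailed level is cheap when disabled, the instrumentation can ship in the product and be switched on in the field when a stability test or a customer site starts misbehaving.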
For memory leaks, the Java JDK provides a number of good diagnostic tools. The JDK includes JConsole, which you can connect to the JVM used by Tomcat to monitor memory usage. It also reports the number of classes loaded, CPU utilization and a few other interesting statistics. One really cool feature is the ability to generate a dump file of the Java heap at any time: from the MBeans tab in JConsole choose com.sun.management – HotSpotDiagnostic – Operations – dumpHeap. The p0 parameter is the name of the dump file and should have an .hprof extension. By default the file ends up in the Tomcat folder. Invoking this gives you a binary dump of the Java heap; now what do you do with it? Fortunately the JDK includes another tool, jhat, which analyzes the heap and then runs a web server that lets you browse its contents from a web browser. From this you can see all the Java objects in the heap and hopefully determine what is leaking (usually one of the objects with a very high instance count). You can also see the contents of each object and who references it.
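The same dumpHeap operation can also be invoked programmatically through the platform MBean server, which is handy for snapshotting the heap from inside a long-running automated test rather than clicking through JConsole. The file name below is just an example:

```java
import java.lang.management.ManagementFactory;
import com.sun.management.HotSpotDiagnosticMXBean;

public class HeapDump {
    // Write a heap dump to the given .hprof file.
    // Fails with an IOException if the file already exists.
    static void dump(String path, boolean liveOnly) throws Exception {
        HotSpotDiagnosticMXBean bean = ManagementFactory.newPlatformMXBeanProxy(
                ManagementFactory.getPlatformMBeanServer(),
                "com.sun.management:type=HotSpotDiagnostic",
                HotSpotDiagnosticMXBean.class);
        // liveOnly = true dumps only reachable objects, keeping the file smaller
        bean.dumpHeap(path, liveOnly);
    }

    public static void main(String[] args) throws Exception {
        String file = "heap-" + System.currentTimeMillis() + ".hprof";
        dump(file, true);
        System.out.println("wrote " + file);
    }
}
```

The resulting .hprof file is the same format JConsole produces, so jhat (or a newer analyzer) can browse it the same way.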
One good tool for solving these problems is ReplayDirector from Replay Solutions. This is a commercial product that records everything that happens in the JVM. You configure your Tomcat service so that ReplayDirector records everything that happens as you run. It doesn't matter if you call outside of Java, since it records the call and its result. You can then play the recording back in the regular Eclipse Java debugger as often as you like to figure out the problem. While testing, the testers can put marks in the recording to indicate where and what happened. You can set breakpoints on any REST or other HTTP call into Tomcat, and you can expand log4j messages to see them even if they weren't logged. Below is a screenshot of the main ReplayDirector Eclipse plugin. This was recorded on a server being driven by JMeter. The green line is CPU usage; the blue line is memory usage. From the look of the blue line, you can probably guess that I'm looking for a memory leak. I can set a breakpoint at a later SData request, break into the program, and look around in the debugger to see what is going on.
Ensuring a Web Server can keep running reliably for long periods of time can be quite a difficult task. Fortunately the tools that help with this are getting better and better, so we can monitor, record and reproduce all the weird things that testing reveals, and ensure that our customers never see these sorts of problems themselves.