Stephen Smith's Blog

Musings on Machine Learning…

Posts Tagged ‘troubleshooting’

Troubleshooting Problems in the Cloud


Introduction

Imagine that you are managing a large cloud based solution with lots of moving parts, thousands of users and fairly complicated business logic going on all the time. Now an irate customer phones into support complaining that something bad happened. Sometimes from the explanation you can figure out what happened quite easily. Sometimes what the user described sounds completely inexplicable and impossible. Now what do you do to troubleshoot the problem?

Easy Problems

First let’s review how easy problems are solved and can be dealt with quickly:

  • From the description of the problem, the programmer says, “oh, of course, we should have thought of that,” and on reviewing the code finds an easy fix which can be deployed quickly. (Too bad this doesn’t happen more often.)
  • The programmer scratches his head, but carefully goes through the customer’s steps and reproduces the bug on his local system, so he can easily investigate in the debugger and solve the problem.
  • The problem is a bit of a mystery, so a programmer and a support analyst use GoToMeeting to watch the customer work. On watching, they realize what this customer is doing differently to cause the problem, and then the programmer can reproduce and debug it.
  • On examining the standard application logging, an unhandled exception is found in the log around the time the customer reported the problem and from this the code can be examined and the cause of the exception can be determined and fixed.

These are the standard and generally easy problems to fix. But what happens when these methods don’t yield any results?

Harder Problems

Here are some examples of harder problems that can drive people crazy.

  • The problem happens infrequently and can’t be consistently reproduced. But it happens frequently enough that customers are getting rather annoyed.
  • The system works fine most of the time, but every now and again, the system goes berserk for some reason. Then it just magically goes back to normal, or you need to restart the servers to get things back on track.
  • The problem only happens to certain users, perhaps at certain times. You can watch it happening to them via GoToMeeting, but can’t for the life of you get it to happen to you or to happen in your test environment and many other users never experience this.
  • Often problems aren’t outright bugs, but other strange behaviors, like the whole system suddenly slowing down for no apparent reason and then some time later it goes back to normal, again for no apparent reason.

A lot of the time these sorts of things are due to interactions between what people are doing in different parts of the system. Predicting and testing all these situations is very difficult, and the problems are often a result of emergent phenomena, which we talk about a bit later.

Preventative Measures

Generally the only way to solve the hard problems is to instrument your web application so you know exactly what is going on. Not only does this help with solving hard problems, but it also helps with making using your web site a better experience for users, since you can log what they do, how long it takes and where they are getting stuck. Often usability problems are more serious to users than program failures or other bugs.

Instrumenting your application generally means having good logging and making sure you log a lot of metrics that you can monitor. This can also mean having extra APIs to the running application where you can inquire on the state of various components, for instance to find out how many objects of one type are currently in use. Often your infrastructure, like your web server, will do a lot of logging for you, so be aware of this as there is no need to log things twice.
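To make the idea of a diagnostic API a little more concrete, here is a minimal Python sketch of an in-process registry that counts how many objects of each kind are currently in use. The names `ResourceTracker`, `acquire`, `release` and `snapshot` are invented for this illustration; a real application would expose the snapshot through whatever admin endpoint or dashboard feed it already has.

```python
import threading
from collections import Counter


class ResourceTracker:
    """Thread-safe count of live resources, keyed by a kind name."""

    def __init__(self):
        self._lock = threading.Lock()
        self._in_use = Counter()

    def acquire(self, kind):
        # Call when a resource of this kind is created or checked out.
        with self._lock:
            self._in_use[kind] += 1

    def release(self, kind):
        # Call when the resource is closed or returned; never go below zero.
        with self._lock:
            if self._in_use[kind] > 0:
                self._in_use[kind] -= 1

    def snapshot(self):
        """Return a plain dict copy that is safe to hand to a dashboard."""
        with self._lock:
            return {kind: count for kind, count in self._in_use.items() if count}


tracker = ResourceTracker()
tracker.acquire("db_connection")
tracker.acquire("db_connection")
tracker.acquire("session")
tracker.release("db_connection")
print(tracker.snapshot())
```

A count that only ever climbs over days of uptime is usually the first visible sign of a leak.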

Having a dashboard to track these metrics is very helpful.  The dashboard can both make reports and graphs from the application logs as well as use the application’s diagnostic API to provide useful information about what is happening. Often you can integrate with third party vendors like New Relic or Microsoft Application Insights, but one way or another you need this.


One objection to logging a lot of information is that the process of doing this can slow things down so much it becomes unacceptable. This is certainly true if you are logging synchronously to a file (i.e. not continuing on until the data is written). But modern systems get around this problem by logging asynchronously. The system doing the logging just fires the log message at a listener and goes on without waiting for or needing a reply. This makes logging much less of a burden on the application. It does move the problem of writing the messages to the listener application, and if the listener gets too far behind it usually starts spilling messages. Most common logging infrastructures, like Microsoft’s ETW, already have this ability for you to take advantage of.
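As a small illustration of the fire-and-forget pattern, here is a sketch using Python's standard `logging.handlers.QueueHandler` and `QueueListener`. This is not ETW, just the same idea in a different stack. The logger name `app` and the in-memory `sink` are stand-ins chosen for the example; in production the slow handler would be a file or network handler.

```python
import io
import logging
import logging.handlers
import queue

# Unbounded queue between the application and the listener thread.
log_queue = queue.Queue(-1)

# Application side: the handler only puts the record on the queue and returns.
app_logger = logging.getLogger("app")
app_logger.setLevel(logging.INFO)
app_logger.propagate = False
app_logger.addHandler(logging.handlers.QueueHandler(log_queue))

# Listener side: this handler does the (potentially slow) writing.
sink = io.StringIO()  # stand-in for a file or network destination
slow_handler = logging.StreamHandler(sink)
slow_handler.setFormatter(logging.Formatter("%(levelname)s %(name)s %(message)s"))

listener = logging.handlers.QueueListener(log_queue, slow_handler)
listener.start()

app_logger.info("order %s saved", 1234)  # returns without waiting for the write

listener.stop()  # drains the queue before returning
print(sink.getvalue(), end="")
```

If you give the queue a maximum size instead of leaving it unbounded, records that arrive while it is full are dropped rather than blocking the application, which is roughly the "spilling" trade-off described above.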

Some things you must track via APIs or logging:

  • Any resource used in the system. Java and C# programmers often think they can’t have leaks because of garbage collection, but garbage collection isn’t very smart and often important resources will leak due to unexpected references or a circular dependency. In a 24×7 web application that needs to essentially run forever in a busy state, this is incredibly important.
  • Generally what users are doing, so you know what the possible interactions might be and what are the concurrent things you may need to do to replicate the problem.
  • Any exceptions or errors thrown by the program. This tends to be a given, but make sure you include programmatically handled errors since these can be useful as well.
  • Performance metrics. If something takes longer than expected, log it.
  • Any calls to external systems and the results (with performance stats).
  • Assert type conditions where the program finds itself in an unexpected state (i.e. make sure these get logged and not compiled out for production).
  • Anything the programmer considers a sensitive area of the program that they are worried about.
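Two of the items above, performance metrics and calls to external systems, can be covered with one small wrapper. The following is a sketch only: the `timed` helper, the `perf` logger name and the threshold are all invented for this example.

```python
import logging
import time
from contextlib import contextmanager

perf_logger = logging.getLogger("perf")


@contextmanager
def timed(operation, warn_after_seconds):
    """Log a warning when the wrapped block runs longer than expected."""
    start = time.perf_counter()
    try:
        yield
    except Exception:
        # Record the failure even if the caller goes on to handle it.
        perf_logger.exception("%s failed", operation)
        raise
    finally:
        elapsed = time.perf_counter() - start
        if elapsed > warn_after_seconds:
            perf_logger.warning("%s took %.3fs (expected under %.3fs)",
                                operation, elapsed, warn_after_seconds)


# Usage: wrap a call to an external system (time.sleep is a stand-in here).
with timed("call to tax service", warn_after_seconds=0.5):
    time.sleep(0.01)
```

Because the threshold is passed in at each call site, the programmer who knows what "longer than expected" means for that operation decides it.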

There are also some things that must not be logged. These include any sort of passwords, decryption keys or sensitive customer data (i.e. nearly all customer data). Generally a lot of people need to work with the logs, so there can’t be any sensitive information in them; if there is, it will be considered a major security problem.
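One possible safety net is a logging filter that masks obviously sensitive values before they are written. The sketch below uses Python's standard `logging.Filter`; the list of key names in `SENSITIVE` is an assumption made for the example and would need to match your own application. Pattern matching like this only catches what it knows about, so it backs up the rule of not logging such data in the first place rather than replacing it.

```python
import logging
import re

# Hypothetical key names; extend to match what your application handles.
SENSITIVE = re.compile(r"(?i)\b(password|passwd|secret|api_key|token)\s*=\s*\S+")


def redact(text):
    """Mask the value of any key=value pair whose key looks sensitive."""
    return SENSITIVE.sub(lambda m: m.group(1) + "=***", text)


class RedactingFilter(logging.Filter):
    """Apply redact() to every record passing through a handler or logger."""

    def filter(self, record):
        record.msg = redact(record.getMessage())  # merge %-style args first
        record.args = None
        return True  # keep the record, just with values masked


print(redact("login user=bob password=hunter2 ok"))
```

Attaching it with `handler.addFilter(RedactingFilter())` applies the masking to everything that handler writes.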

Make sure you have some good software (either home grown, open source or commercial) to search and analyze your logs. These logs will get incredibly big and will need to be carefully managed (usually archiving them each day). There are many good packages to make this chore far easier.

Emergent Behaviors

Emergent behavior refers to complex unexpected behaviors arising from large sets of much simpler processes. Modern web applications are getting bigger and bigger with many moving parts and many interactions between different systems and subsystems. We are already starting to see emergent behavior arising. Nothing on the scale of the system suddenly becoming intelligent, but on the scale where predicting what will happen in all cases has become impossible. It doesn’t matter how much QA you apply, chaos theory and mathematics quite clearly state that the system is beyond simple prediction. That doesn’t mean it is unstable. If done properly your system should still be quite stable, meaning small changes won’t cause radically different things to happen.

The key point is to keep this in mind when building your system, and to make sure you have plenty of instrumentation in place so that even if you can’t predict what will happen, you can still see what is happening and act on it.

Summary

Diagnosing and solving hard problems in a large web application with thousands of concurrent users can be a lot of fun and very challenging. Having good preventative measures in place can make life a lot easier. You don’t want to be continually pushing out newer versions of your application just to add more logging because you can’t figure out what is going on.


Diagnosing Problems with Sage ERP Accpac 6.0A


Now that people are installing and deploying Accpac 6, hopefully, to get the new Web technologies working you just need to install, play with Database Setup and then away you go with no problems. Or at least hopefully this is how it works out for most people. But for those that it doesn’t, this blog posting is to help you figure out what has gone wrong. The main purpose of this blog posting is to help you track down and solve unforeseen problems on your own, hopefully giving you tools that will give you clues to what is wrong and allow you to come up with solutions.

This blog posting assumes that you can run both regular Database Setup and the classic Accpac Desktop. If you are having trouble at this step then check out these blog postings on diagnosing problems in Accpac part 1, part 2, part 3 and part 4.

Database Setup

The first time you hit some of the new technologies is when you configure the Portal’s database from Database Setup’s Portal… button. When you click here it runs a Java program that configures the portal for JDBC and tests the connection. Often if you are going to have trouble with the Java runtime or with accessing the database server from Java, this is where you will run into problems.

Bad Java Runtime

The Accpac 6 installation installs the Java Runtime version 6 update 17. If you previously had a later version of the Java runtime installed, or you let the Java Updater run, then you will have a newer version. If you get an error about having trouble with a Java class such as com.sage.orion.connection.services.CommandLine, then chances are you have a problem with the runtime. We’ve seen systems where many versions of the Java runtime are installed, and even though they are supposed to co-exist, there appear to be problems. Plus, now and then a bad Java update gets installed from the Java Updater. This was especially a problem shortly after Oracle bought Sun, at which point Oracle renamed quite a few things from Sun to Oracle and put out a couple of bad updates. They seem to be getting better at this again, but this is something to watch out for. A good remedy is to un-install all versions of the Java runtime and delete the program files\java directories (or the x86 variant if running 64 bit). Then either re-install Accpac or install a known good version of Java directly. Remember that Accpac is a 32 bit application and as such we require the 32 bit version of Java.

Troubles Connecting to SQL Server from Java

If your Java runtime is ok, then the next step is that the Java classes test your settings by connecting to the Portal database. If you get an error about “can’t locate server or user id/password are incorrect”, it might be a SQL Server configuration problem. The configuration of SQL Server from Database Setup for normal Accpac companies is fairly forgiving; it is quite smart about finding SQL Servers and SQL Server instances and can use any communications protocol that is configured. But we access the Portal database using JDBC, which is a bit more stringent. If you can set up regular Accpac companies, but not the Portal database, then it is usually due to a few SQL Server configuration properties:

  1. In the SQL Server Configuration Manager – SQL Server Network Configuration – Protocols for serverinstance: make sure TCP/IP is enabled (usually it isn’t for SQL Server express and a few other varieties). Make note of the Port number used to ensure you are using the correct one (right click on TCP/IP and choose properties).
  2. Use the servername alone and the correct port number. Do NOT use server/instancename, this isn’t currently recognized (we will probably fix this for Product Update 1).
  3. In SQL Server Management Studio: in the properties for the server, security tab, ensure “SQL Server and Windows authentication mode” is selected for Server authentication.
  4. In SQL Server Management Studio: in the properties for the server, connections tab, make sure “Allow remote connections to this server” is checked.
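Before reaching for a SQL Trace, it can be worth confirming that the TCP port from step 1 is reachable at all from the machine running Database Setup. Here is a minimal Python sketch for that. The helper name `port_is_open` is invented for this example, and it only proves that something accepts TCP connections on that port; it says nothing about authentication.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and names that do not resolve.
        return False


# Example call, with a placeholder server name and the port you noted in step 1:
#   port_is_open("dbserver", 1433)
```

If this returns False for the server name and port you entered, the problem is in the network or the TCP/IP protocol settings rather than in the user id and password.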

If none of this works, then you are going to have to run a SQL Trace to see what is failing.

Hopefully this helps you get past Database Setup.

Tracing Web Requests

Below is a deployment diagram of the new Web components included with Sage ERP Accpac 6.0A. For more details look here.

As you can see requests come from the Internet and enter the Web Server via Microsoft Internet Information Services (IIS). Static content like HTML files and bitmap images will be served up directly from here. Requests for data are sent through the Jakarta ISAPI re-director to Tomcat which runs our SDataServlet which in turn calls the Accpac business logic objects (Views) to get the actual accounting data.

The general approach to troubleshooting here is to use logging to figure out how far requests from the Browser are getting and then to see why they are failing at that point. All the components in the above deployment diagram have logging capabilities. Usually by default they only log errors, which at this point is often all you need; however, they can all be configured to provide much more detail on everything that is going on in the system.

Logging

First I’ll quickly summarize how to turn up and control the logging levels for the various services.  When running normally you won’t want to leave all these logging levels high since it wastes a lot of disk space and causes a performance drag writing out all this information. Usually you want to leave IIS logging on for every request so you can audit and keep an eye out for malicious attacks. But the other logs may as well be left at the error level.

IIS’s logging is controlled from the IIS Manager via the Logging icon for the web server. By default it’s enabled and the logs are stored in: C:\Inetpub\logs\LogFiles\W3SVC1. Note that the various versions of IIS are different so you might need to look around a bit for the settings.

The Jakarta IIS redirector is configured via the file: C:\Program Files (x86)\Common Files\Sage\Sage Accpac\Tomcat6\Jakarta\isapi_redirect.properties. Change the log_level from error to debug. Note that this setting causes an extremely large amount of output and will affect server performance, so remember to set this back to error when you are done. The log file is AccpacRedirector.log in the same C:\Program Files (x86)\Common Files\Sage\Sage Accpac\Tomcat6\Jakarta folder.

For Tomcat and SDataServlet, the log files are stored in: C:\Program Files (x86)\Common Files\Sage\Sage Accpac\Tomcat6\logs . To configure the logging level edit: C:\Program Files (x86)\Common Files\Sage\Sage Accpac\Tomcat6\webapps\SDataServlet\WEB-INF\classes\log4j.properties and change “log4j.rootLogger=ERROR, logfile” to “log4j.rootLogger=ALL, logfile”.
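If you find yourself switching this level back and forth a lot, the edit can be scripted. The following Python sketch rewrites only the level on the `log4j.rootLogger` line and leaves the rest of the file text untouched. The function name `set_root_level` is invented for this example, and reading and writing the actual file (and keeping a backup copy) is left to the caller.

```python
import re

# Matches an uncommented rootLogger line: prefix, level, then the appender list.
ROOT_LOGGER = re.compile(r"^([ \t]*log4j\.rootLogger[ \t]*=[ \t]*)(\w+)(.*)$",
                         re.MULTILINE)


def set_root_level(properties_text, level):
    """Return the properties text with the root logger level replaced."""
    new_text, count = ROOT_LOGGER.subn(
        lambda m: m.group(1) + level + m.group(3), properties_text)
    if count == 0:
        raise ValueError("no log4j.rootLogger line found")
    return new_text


print(set_root_level("log4j.rootLogger=ERROR, logfile\n", "ALL"), end="")
```

Calling it again with "ERROR" puts the setting back once you are done.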

Diagnosing Problems

The general approach is to follow messages through the system by looking at the log files. Having a set of log files from a working system helps greatly, because you can compare the two sets and the differences often give a clue to the cause of problems.
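Comparing logs by eye gets tedious because timestamps and request numbers differ on every line. One rough way to automate the comparison, sketched here in Python, is to replace every run of digits with a placeholder and then list the lines from the failing system that have no counterpart in the good one. The function names and the sample log lines are invented for this illustration; they are not the actual format of any of the logs above.

```python
import re

DIGITS = re.compile(r"\d+")


def normalize(line):
    """Replace runs of digits so timestamps and ids don't cause false differences."""
    return DIGITS.sub("#", line.strip())


def unusual_lines(good_log_lines, bad_log_lines):
    """Return lines from the failing log whose normalized form never occurs in the good log."""
    known = {normalize(line) for line in good_log_lines}
    seen = set()
    result = []
    for line in bad_log_lines:
        key = normalize(line)
        if key and key not in known and key not in seen:
            seen.add(key)  # report each distinct message only once
            result.append(line.rstrip("\n"))
    return result


# Made-up sample lines, for illustration only.
good = ["2011-03-05 10:00:01 INFO request 17 ok\n"]
bad = ["2011-03-06 11:20:45 INFO request 99 ok\n",
       "2011-03-06 11:20:46 ERROR cannot load library\n"]
print(unusual_lines(good, bad))
```

This is deliberately crude, but it tends to shrink a large debug log down to the handful of messages that only the broken system produces.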

IIS

First, do the Browser requests even get to IIS? If not, try using the server’s IP address instead of the server name in the URL, and try browsing from the server itself using localhost. This indicates whether IIS isn’t working at all, or whether you have DNS problems or some other problem resolving your URL. Obvious things to check are that the IIS service is running (check the Windows Event Log if it won’t start) and that IIS is configured on port 80 (HTTP) and 443 (HTTPS). Note that only one program can use a given port, so if something else is using port 80, say Apache acting as a web server for something else, then only one of them can have the port and the other process will fail to start. If you have to use a different port, this is ok, just remember to put :portnumber after the servername in your URLs.
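The same name-versus-address comparison can be done from a script, which is handy when you need to check several workstations. The Python sketch below separates "the name does not resolve" from "the name resolves but nothing accepts connections on that port". The function name `diagnose` and its return strings are invented for this example.

```python
import socket


def diagnose(host, port=80, timeout=3.0):
    """Classify a failure as name resolution, connection, or none."""
    try:
        addresses = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return "name does not resolve"
    # Try every address the name resolved to until one accepts a connection.
    for family, socktype, proto, _canonname, sockaddr in addresses:
        try:
            with socket.socket(family, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
                return "ok"
        except OSError:
            continue
    return "resolves but nothing accepted the connection"


# Example calls, with a placeholder server name:
#   diagnose("accpacserver")       # by name
#   diagnose("localhost")          # run on the server itself
```

A name that fails while the IP address works points at DNS; both failing while localhost works on the server points at a firewall or binding problem.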

Next, what if the browser requests get to IIS but are rejected with a 403 error (access forbidden)? There are two cases. If you get 403 errors on requests for static content like HTML files or image files, then the file permissions on the WebUIs folder under the Accpac directory are too restrictive. This folder only contains HTML, JavaScript and image files. Neither data nor anything valuable is stored in this folder, so it is ok to relax the permissions and grant read-only access for everyone (don’t grant write since you certainly don’t want to risk a hacker writing files here). The other case is when you see isapi_redirect.dll as part of the failing URLs in the log file. This means you don’t have sufficient permissions to load and execute the Jakarta redirector DLL, or there is some other sort of access problem. We install IIS with all our processes running as the local system account. This account should have permission to do anything on the machine, but it won’t have access to network resources, so the most common cause of this is having key files installed on a network share. The best solution here is to ensure the Accpac programs are installed on the web server so it only needs to access the LAN to get to the database server. You can also change the user the Sage Accpac Application pool runs under to something with domain privileges.
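To find the rejected requests quickly, you can filter the IIS log for the 403 status. IIS writes its logs in the W3C extended format by default, where a `#Fields:` line names the columns, so the Python sketch below reads the column names from that line instead of assuming positions. The function name `find_status` and the sample lines (including the URLs in them) are made up for illustration.

```python
def find_status(log_lines, status="403"):
    """Yield (date, time, uri) for requests that ended with the given HTTP status."""
    fields = []
    for line in log_lines:
        line = line.strip()
        if line.startswith("#Fields:"):
            # The directive lists the column names for the rows that follow.
            fields = line[len("#Fields:"):].split()
            continue
        if not line or line.startswith("#"):
            continue  # skip other directives and blank lines
        row = dict(zip(fields, line.split()))
        if row.get("sc-status") == status:
            yield row.get("date"), row.get("time"), row.get("cs-uri-stem")


# Made-up sample in W3C extended format, for illustration only.
sample = [
    "#Fields: date time cs-method cs-uri-stem sc-status",
    "2011-03-05 17:24:01 GET /static/page.html 200",
    "2011-03-05 17:24:02 POST /redirect/isapi_redirect.dll 403",
]
print(list(find_status(sample)))
```

Whether the failing URLs are static files or contain isapi_redirect.dll tells you which of the two cases above you are in.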

Another problem we’ve seen with IIS is that in some cases the Accpac install fails to create the various virtual directories and application pools inside IIS. We haven’t figured out why this happens yet, but it’s worth quickly checking that the SageERPAccpac and SDataServlet applications are added to the default web site and that Sage ERP Accpac App Pool is added to the application pools. Right now the only solution we have for this is to create these entries manually, copying from a good system. I’ll update this blog posting with more details once we figure this out. It’s strange because we don’t get any errors back from IIS when this happens.

Jakarta

So far we haven’t seen problems with Jakarta. This is an ISAPI module that loads into IIS and redirects all the Accpac SData requests to the Tomcat Server. I would first look in Tomcat’s logs and then only come back here if nothing is appearing in the Tomcat logs.

Generally here you need to turn on the logging and then look to ensure messages are making it to Jakarta (i.e. the log isn’t empty). Then usually Jakarta provides fairly good diagnostics in its log when things don’t work. These are usually along the lines that the Tomcat service isn’t running, there is a port conflict or something of this nature.

Tomcat/SDataServlet

Apache Tomcat is a Java application server that runs the server side of Java web applications. The Accpac SData server is written in Java and runs under this. Tomcat’s own log files give good information if there are problems with Tomcat. Tomcat can have problems if a bad version of the Java runtime is installed, hopefully this was resolved back when you ran Database Setup, but if you get a bunch of weird Java errors in the Tomcat logs then check out the info in the Database Setup section above on cleaning up the Java runtimes.

Tomcat is also the point where we need to load DLLs from the Accpac\runtime directory. If you see errors about loading Java classes and you see a4wapiShim.dll mentioned, then you probably don’t have the Accpac\runtime folder in your system PATH. Note that since Tomcat runs as a system service under Windows, this must be in the system PATH and not your user PATH.
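A quick way to check a PATH value for the runtime folder is sketched below in Python. The function name `directory_on_path` is invented for this example. One caveat: a script run from your own login sees the system and user PATH merged together, so this cannot tell you which of the two an entry came from; for that you still need to look at the system environment variables directly.

```python
import os


def directory_on_path(directory, path_value=None, separator=os.pathsep):
    """Return True if directory is one of the entries in a PATH-style string."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    wanted = os.path.normcase(os.path.normpath(directory))
    for entry in path_value.split(separator):
        entry = entry.strip().strip('"')  # tolerate quoted or padded entries
        if entry and os.path.normcase(os.path.normpath(entry)) == wanted:
            return True
    return False


# Example call, with a placeholder install location:
#   directory_on_path(r"C:\Accpac\runtime")
```

Normalizing both sides means a trailing slash or, on Windows, a difference in letter case will not cause a false negative.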

Next when you look in the SDataServlet log files you will see various Accpac error messages. Make sure you can login to the standard Windows desktop and access the company you are using. You can’t perform data activation from the web portal, so this needs to be done from the Windows desktop.

If you installed an alpha, beta or release candidate of Accpac 6, try uninstalling, deleting the C:\Program Files (x86)\Common Files\Sage\Sage Accpac\Tomcat6 folder and then re-installing. There might be some old leftover files that need deleting.

Watch out for IE6

We see people trying to use IE6 a surprising amount. It’s always surprising to me that people are still using IE 6, but they are. If the reported symptom is that the browser is hanging, then it is probably IE6. We don’t support IE6, and IE6 definitely doesn’t work in any way. Worse, the IE6 JavaScript engine can completely lock up Windows, causing you to need to re-boot your workstation. The usual symptom of IE6 is that the first time you hit the portal it creates the uiContent table and then locks up, not creating the other three tables in the PORTAL database.

Remember we only officially support IE 7 and 8 for Sage ERP Accpac 6.0A. We will add support for Safari, Chrome and Firefox in Sage ERP Accpac 6.1A. However we won’t ever be adding support for IE 6. We will validate IE 9 once it’s released. Also if you are running IE 7, although it works, why not upgrade to IE 8? It does work quite a bit better.

Quotes to Orders

Diagnosing problems with Quotes to Orders is done the same as above. But just a couple of notes:

  • Ensure you use .Net remoting as the protocol from the Web Deployment Manager.
  • Make sure that you have activated and setup the SageCRM (EW) module within Accpac.

Fiddler

Fiddler is a great tool for tracking down problems. It acts as a proxy and spies on the HTTP traffic going through IE. Note that you have to use the proper server name in the URL; you can’t use localhost. This will show you all the requests and responses made from the Browser, often giving important clues when solving problems.

Summary

I hope that as you deploy Sage ERP Accpac 6.0A you don’t need to refer to this blog posting very often. But if you do run into problems, I hope you find this information helpful.

Now off to TPAC.

Update 2011/03/29: If you have an individual workstation that is having trouble loading the Portal, try clearing the Browser cache. Sometimes if the Portal failed to load half way through then IE caches that and won’t finish until the cache is cleared and it can start clean.

Update 2014/10/02: SQL Server Express uses dynamic ports by default. To work with the Portal you need to configure it to use a specific port. To do this run the SQL Server Configuration Manager. Choose “SQL Server Network Configuration”. First ensure TCP/IP is enabled, then right click on it and choose properties. Choose the IP Addresses tab, scroll to the bottom and in the IPAll setting, blank out the TCP Dynamic Ports entry and set the TCP Port to 1433 (or whatever you need).

Written by smist08

March 5, 2011 at 5:24 pm

Posted in sage 300
