Stephen Smith's Blog

Musings on Machine Learning…


Scaling and Availability for the new Sage 300 Web UIs


Introduction

I introduced our new Sage 300 Web UIs, talked about installing them and then discussed security implications. Now suppose you have hundreds of users and things are starting to run quite slowly: what do you do? Similarly, suppose you are all happily using the Web UIs and the web server hardware breaks down, Windows Update kicks in, or Windows fails for some other reason. These two problems are closely related, since the solution to both is the same: add another Web Server. If you have two Web Servers and one breaks down, things might run a bit slower since everyone is on the remaining server, but at least they keep running. If the Web UIs slow down once you reach a certain number of users, add another web server to distribute the load.

This article will look at the various issues around adding Web Servers. For the other parts of the system, I talked about various techniques here.

Poor Man’s Scaling

Later in the article we’ll talk about automatic failover and automatic ways to distribute load. In this section I just wanted to point out that you can do this manually without requiring any extra configuration, servers or hardware.

Basically, just set up two Web Servers, each with its own URL (which might just be //servername/sage300), and assign which server each of your users signs on to. Each server should have the Sage 300 programs installed locally, but use the same shared data folder and the same databases on the same database server.

Then if one server fails, just send an email telling everyone using that server to use the other one. This way it’s pretty easy to add servers, but it’s up to you to distribute your users over the servers and to switch them from one server to another when a server goes down or you want to do maintenance.
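To make the manual split concrete, here is a minimal Python sketch of assigning users to servers deterministically so the distribution stays stable. The server URLs and user names are hypothetical; in practice a simple spreadsheet works just as well.

import hashlib

# Hypothetical server URLs; each server runs its own copy of the Sage 300 programs.
SERVERS = ["http://server1/sage300", "http://server2/sage300"]

def assign_server(username):
    # Hash the user name so the same user always lands on the same server.
    digest = hashlib.md5(username.encode("utf-8")).hexdigest()
    return SERVERS[int(digest, 16) % len(SERVERS)]

for user in ["alice", "bob", "carol"]:   # hypothetical user names
    print(user, "->", assign_server(user))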

Sticky Load Balancer

Ideally we would like to have a pool of Web Servers that are all behind the same URL. Then as users access the URL they will be distributed among the working servers in the pool. If a server fails, this will be automatically detected and it will be removed from the pool.


There are quite a few hardware load balancers on the market, most of which have the “sticky” feature that we require. Sticky means that once a user starts talking to one server, all their traffic is directed to that same server unless it fails. Most of the Sage 300 Web UIs are what are called stateless and don’t require this feature. However, we do have a number of stateful UIs that must keep communicating with the same server to do things like build up an Invoice or another accounting document.

Most load balancers will detect when a server fails (usually by pinging it regularly) and remove it from the pool.
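As a rough illustration of both ideas, here is a small Python sketch of sticky routing combined with a health check. The class, server URLs and /health endpoint are illustrative, not the API of any real load balancer.

import urllib.request

class StickyPool:
    def __init__(self, servers):
        self.servers = list(servers)   # e.g. ["http://web1", "http://web2"]
        self.healthy = set(servers)
        self.affinity = {}             # session id -> server

    def health_check(self):
        # Probe each server; drop ones that fail, re-add ones that recover.
        for server in self.servers:
            try:
                urllib.request.urlopen(server + "/health", timeout=2)
                self.healthy.add(server)
            except OSError:
                self.healthy.discard(server)

    def route(self, session_id):
        # Keep a session on its original server unless that server failed.
        server = self.affinity.get(session_id)
        if server not in self.healthy:
            # Otherwise pick the healthy server with the fewest sessions.
            server = min(self.healthy,
                         key=lambda s: list(self.affinity.values()).count(s))
            self.affinity[session_id] = server
        return server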

Many load balancers also have the feature of terminating HTTPS for you: you have an HTTPS connection to the load balancer and then an HTTP connection from the load balancer to the Web Server. This improves the performance of the Web Server, since decrypting HTTPS traffic can be quite computationally intensive.


If you are thinking “high availability”, you might now ask: what happens if the load balancer fails? To cover this, you would run two load balancers in an active/passive configuration, where the passive one takes over if it detects that the active one has failed.
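Illustratively, the passive balancer’s heartbeat logic might look like the Python sketch below. The hostname and the takeover step are hypothetical placeholders; real appliances typically use protocols such as VRRP for this.

import socket
import time

ACTIVE_ADDR = ("active-lb.example.local", 80)   # hypothetical address
MISSED_LIMIT = 3                                # missed heartbeats before failover

def heartbeat_ok(addr, timeout=2.0):
    # A plain TCP connect serves as the heartbeat probe.
    try:
        with socket.create_connection(addr, timeout=timeout):
            return True
    except OSError:
        return False

missed = 0
while missed < MISSED_LIMIT:
    missed = 0 if heartbeat_ok(ACTIVE_ADDR) else missed + 1
    time.sleep(5)

# This is where a real passive node would claim the shared address.
print("Active balancer unreachable; initiating takeover")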

You also wouldn’t want all your servers to run Windows Update automatically; otherwise they will all do it at the same time and the whole system will be unavailable while it runs. It’s a good idea to take control of when Windows Update happens and to stagger it across your infrastructure.

IIS ARR

There are a number of software solutions for load balancing. After all, a hardware load balancer is really just a computer with the required network ports running load balancing software. One software solution, built into IIS, is called Application Request Routing (ARR). You place a server running ARR in front of your Web Servers, configure it to know about your pool of servers, enable sticky sessions, and off you go.

If you want to make the ARR server itself HA (Highly Available), you can add a second one. If you have them work in parallel, they need to share a SQL database, which you might want to make HA as well.

Geographic HA

You might want to be highly available across geographic locations. However, keep in mind that there is only one SQL database: the location that hosts it will perform really well, while the other location will probably have terrible performance. Generally, for disaster recovery, if something catastrophic happens to one location, you would have a second location that you can bring online reasonably quickly, probably by restoring the SQL Server database from an off-site backup.

The Cloud

Rather than managing all these servers in your own datacenter, you might consider running them all as virtual servers in a Cloud such as AWS or Azure. Here you can create all these servers and configurations fairly easily. You can also add web servers when you need extra capacity and delete a few when you aren’t using them and want to save a bit of money.

There are lots of arguments between running in the cloud versus running in your own data center. These often revolve around data security, data privacy, cost, and the skills needed to maintain the given system. Whichever is right for you, you will still want to make sure you can configure the correct capacity for your needs and that you have the correct level of disaster recovery, failover and backup for your needs.

Summary

This was just a quick introduction to how you increase the capacity of your Sage 300 Web Servers, along with a quick discussion of High Availability. As in all things, the most deluxe solution will be very expensive, while having no solution at all will likely be unacceptable. So you will need to find the correct balance for your business.

 


Written by smist08

August 16, 2015 at 12:35 am

Disaster Recovery


Introduction

In a previous blog article I talked about business continuity: what you need to do to keep Sage 300 ERP up and running with little or no downtime. However, I mushed together two concepts, namely keeping a service highly available and having a disaster recovery plan. In this article I want to separate these two concepts and consider them individually.

We’ve had to give these two concepts a lot of thought when crafting our Sage 300 Online product offering, since we want this service to be available as close to 100% of the time as possible and then, if something truly catastrophic happens, back on its feet as quickly as possible.

Terminology

There is some common terminology which you always see in discussions on this topic (a short worked example follows the definitions):

RPO – Recovery Point Objective: this is the maximum tolerable period in which data might be lost due to a major incident. For instance, if you have to restore from a backup, how long ago was that backup made?

RTO – Recovery Time Objective: this is the duration of time within which a business process must be restored after a disaster (or disruption) in order to avoid unacceptable consequences. For instance, if a computer fails, how long can you wait to replace it?

HA – High Availability: usually concerns keeping a system running with little or no downtime. This doesn’t include scheduled downtime, and it usually doesn’t include a major disaster like an earthquake destroying a datacenter.

DR – Disaster Recovery: this is the process, policies and procedures related to preparing for the recovery or continuation of the technology infrastructure that is vital to an organization after a natural or human-induced disaster.
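As a worked example with hypothetical numbers: a nightly backup gives a worst-case RPO of 24 hours, and a DR drill tells you your actual RTO. In Python:

BACKUP_INTERVAL_HOURS = 24      # nightly backup
RPO_TARGET_HOURS = 24
RTO_TARGET_HOURS = 8
measured_recovery_hours = 6.5   # observed in the last DR drill

# Worst case, everything written since the last backup is lost.
worst_case_rpo = BACKUP_INTERVAL_HOURS
print("RPO: worst case %d h vs target %d h -> %s" %
      (worst_case_rpo, RPO_TARGET_HOURS,
       "OK" if worst_case_rpo <= RPO_TARGET_HOURS else "MISSED"))
print("RTO: measured %.1f h vs target %d h -> %s" %
      (measured_recovery_hours, RTO_TARGET_HOURS,
       "OK" if measured_recovery_hours <= RTO_TARGET_HOURS else "MISSED"))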

High Availability

HA means creating a system that can keep running when individual components fail (no single point of failure), such as one computer’s motherboard frying, a power supply failing or a hard disk dying. These are reasonably rare events, but systems in data centers often run on dozens of individual computers, things do fail, and you don’t want to be down for a day waiting for a new part to be delivered.

Of course, if you don’t mind being down for a day or two when things fail, then there is no point spending the money to protect against it. This is why most businesses set RPO and RTO targets for these types of failures.

Some of this comes down to procedures as well. For instance, if you have fully redundant components but run Windows Update on them all at once, they will all reboot at the same time, bringing your system down. You could schedule a maintenance window for this, but generally, if you have redundant components, you can run Windows Update on the first one and then, once it is fine and back up, do the secondary.
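Sketched in Python, that staggered procedure looks something like this; the drain, update and health functions are placeholders for whatever tooling you actually use:

import time

SERVERS = ["web1", "web2"]   # hypothetical server names

def drain(server): print("draining", server)               # stop sending new sessions
def run_windows_update(server): print("updating", server)  # reboot happens inside
def is_healthy(server): return True                        # a real probe goes here

for server in SERVERS:
    drain(server)
    run_windows_update(server)
    while not is_healthy(server):   # wait until it serves traffic again
        time.sleep(30)
    print(server, "is back in the pool; moving to the next one")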

If you are running Sage ERP on a newer Windows Server and using SQL Server as your database, then there are really good hardware/software combinations of all the standard components to give you solid high availability. I talked about some of these in this article.

Disaster Recovery

This usually refers to having a tested plan to spin up your IT infrastructure at an alternate site in the case of a major disaster, like an earthquake or hurricane wiping out your currently running systems.


Again, your RPO/RTO requirements will determine how much money you spend on this. For instance, do you purchase backup hardware and have it ready to go in an alternate geographic region (far enough away that the same disaster couldn’t take out both locations)?

At a minimum, you need complete backups of everything, stored far away, that you can recover from. Then it’s a matter of acquiring the hardware and restoring all your backups. These days people often store these backups in the cloud, because cloud storage has become quite inexpensive and most cloud storage solutions provide redundancy across multiple geographies.

The key point here is to test your procedure. If your DR plan isn’t tested, then chances are it won’t work when it’s needed. Performing a DR drill is quite time consuming, but it’s essential if you are serious about business continuity.
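One concrete check to include in a drill is restoring the latest backup to a staging database and comparing row counts against production. The sketch below assumes Python with the pyodbc package to reach SQL Server; the connection strings are placeholders and the table names are just samples.

import pyodbc

TABLES = ["ARCUS", "OEORDH"]   # sample table names

def row_counts(conn_str):
    conn = pyodbc.connect(conn_str)
    try:
        counts = {}
        cursor = conn.cursor()
        for table in TABLES:
            cursor.execute("SELECT COUNT(*) FROM " + table)
            counts[table] = cursor.fetchone()[0]
        return counts
    finally:
        conn.close()

prod = row_counts("DSN=Production")    # placeholder connection strings
restored = row_counts("DSN=DrStaging")
for table in TABLES:
    status = "OK" if restored[table] == prod[table] else "MISMATCH"
    print(table, prod[table], restored[table], status)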

Azure

One of the attractions of the cloud is having a lot of these things done for you. Sage 300 Online handles setting up all its systems for HA, as well as having a tested DR plan ready to execute. Azure helps by having many data centers in different locations and by building a lot of HA and DR features into its components (especially the PaaS ones). This removes a lot of management and procedural headaches from running your business.

Hard Decisions

If a data center is completely wiped out, then the decision to execute the DR plan is easy. The harder decision comes when the primary site has been down for a few hours, people are working hard to restore service, but it seems to be dragging on. Then you face a hard choice: kick in the DR plan, or wait to see whether the people recovering the primary site can succeed. These sorts of outages are often caused by electrical problems or by problems with large SANs.

One option is to start spinning up the alternate site, restoring backups if necessary and getting everything ready, so that when you do make the decision you can switch over quickly. This way you can often delay the hard decision and give the people fixing the problem a bit more time.

Dangers

Having a good, tested DR plan is the first step, but businesses need to realize that if a major disaster like an earthquake wipes out a lot of data centers, then many companies are going to activate their DR plans at once. That scenario won’t have been tested. We could easily see a cascading outage, where the sudden high usage causes many other sites to go down until the initial wave passes. Generally, businesses have to be prepared to receive poor service until everyone is moved over and things settle down again.

Summary

Responsible companies should have solid plans for both high availability and disaster recovery. At the same time, they need to weigh the cost of these plans against the time they can afford to be down and the probability of these scenarios happening to them. Due to the costs and complexities involved, many companies are moving to the cloud to offload these concerns to their cloud application provider. Of course, when choosing a cloud provider, make sure you check the RPO and RTO that they provide.

Written by smist08

April 12, 2014 at 5:56 pm