Stephen Smith's Blog

Musings on Machine Learning…

Posts Tagged ‘rpo

Disaster Recovery

with 4 comments

Introduction

In a previous blog article I talked about business continuity, what you need to do to keep Sage 300 ERP up and running with little or no downtown. However I mushed together two concepts, namely keeping a service highly available along with having a disaster recovery plan. In this article I want to separate these two concepts apart and consider them separately.

We’ve had to give these two concepts a lot of thought when crafting our Sage 300 Online product offering, since we want to have this service available as close to 100% as possible and then if something truly catastrophic happens, back on its feet as quickly as possible.

Terminology

There is some common terminology which you always see in discussions on this topic:

RPO – Recovery Point Objective: this is the maximum tolerable period in which data might be lost due to a major incident. So for instance if you have to restore from a backup, how long ago was that backup made.

RTO – Recovery Time Objective: this is the duration of time within which a business process must be restored after a disaster (or disruption) in order to avoid unacceptable consequences. For instance if a computer fails, how long can you wait to replace it.

HA – High Availability: usually concerns keeping a system running with little or no downtime. This doesn’t include scheduled downtime and it usually doesn’t include a major disaster like an earthquake eating a datacenter.

DR – Disaster Recovery: this is the process, policies and procedures that are related to preparing for recovery or continuation of technology infrastructure which are vital to an organization after a natural or human-induced disaster.

High Availability

HA means creating a system that can keep running when individual components fail (no single point of failure), like one computer’s motherboard frying, a power supply failing or a hard disk failure. These are reasonably rare events, but often systems in data centers run on dozens of individual computers and things do fail and you don’t want to be down for a day waiting for a new part to be delivered.

Of course if you don’t mind being down for a day or two when things fail, then there is no point spending the money to protect against this. Which is why most businesses set RPO and RTO targets for these type of things.

Some of this comes down to procedures as well. For instance if you have all redundant components but then run Windows Update on them all at once, they will reboot all at once bringing your system down. You could schedule a maintenance windows for this, but generally if you have redundant components you can Windows update the first and when its fine and back up, then you do the secondary.

If you are running Sage ERP on a newer Windows Server and using SQL Server as your database then there are really good hardware/software combinations of all the standard components to give you really good solid high availability. I talked about some of these in this article.

Disaster Recovery

This usually refers to having a tested plan to spin up your IT infrastructure at an alternate site in the case of a major disaster like an earthquake or hurricane wiping out you currently running systems.

natural_disaster

Again depending on your RPO/RTO requirements will depend on how much money you spend on this. For instance do you purchase backup hardware and have it ready to go in an alternate geographic region (far enough away that the same disaster couldn’t take out both locations)?

For sure you need to have complete backups of everything that are stored far away that you can recover from. Then it’s a matter of acquiring the hardware and restoring all your backups. Often people are storing these backups in the cloud these days, this is because cloud storage has become quite inexpensive and most cloud storage solutions provide redundancy across multiple geographies.

The key point here is to test your procedure. If your DR plan isn’t tested then chances are it won’t work when it’s needed. Performing a DR drill is quite time consuming, but really essential if you are serious about business continuity.

Azure

One of the attractions of the cloud is having a lot of these things done for you. Sage 300 Online handles setting up all its systems HA, as well as having a tested DR plan ready to implement. Azure helps by having many data centers in different locations and then having a lot of HA and DR features built into their components (especially the PaaS ones). This then removes a lot of management and procedural headaches from running your business.

Hard Decisions

If a data center is completely wiped out, then the decision to execute the DR plan is easy. However the harder decision comes in when the primary site has been down for a few hours, people are working hard to restore service, but it seems to be dragging on. Then you can have a hard decision to kick in the DR plan or to wait to see if the people recovering the primary can be successful. These sort of things are often caused by electrical problems, or problems with large SANs.

One option is to start spinning up the alternative site, restoring backups if necessary and getting ready, so when you do make the decision you can do the switch over quickly. This way you can often delay the hard decision and give the people fixing the problem a bit more time.

Dangers

Having a good tested DR plan is the first step, but businesses need to realize that if a major disaster like an earthquake wiping out a lot of data centers, then many people are going to activate their DR plans at once. This scenario won’t have been tested. We could easily experience a cascading outage from the high usage that causes many other sites to go down, until the initial wave passes. Generally businesses have to be prepared to not receive good service until everyone is moved over and things settle down again.

Summary

Responsible companies should have solid plans for both high availability and disaster recovery. At the same time they need to compare the cost of these against the time they can afford to be down against the probability of these scenarios happening to them. Due to the costs and complexities of handling these scenarios, many companies are moving to the cloud to offload these concerns to their cloud application provider. Of course when choosing a cloud provider make sure you check the RPO and RTO that they provide.

Written by smist08

April 12, 2014 at 5:56 pm