Stephen Smith's Blog

Musings on Machine Learning…

Posts Tagged ‘sla

The Road to DevOps Part 2

leave a comment »


Last week we looked at an introduction to DevOps and concentrated on the issues around frequently deploying new versions of the software. This week we are continuing to look at DevOps but concentrating on issues with maintaining and monitoring the system during normal operations. This includes ensuring the system is available, provisioning new users, removing delinquent users and generally monitoring the system and ensuring it is healthy.


Generally everyone wants a service that is always available always healthy and working well. But this is too vague a statement and the reality is that things happen and need to be dealt with. This has to be acknowledged up front and strategies put into place to deal with them. First you need stronger guidelines and this usually starts with a Service Level Agreement (SLA) that is laid out for your customers. This details various metrics that you are promising to achieve and what happens when you don’t. Generally you need a good set of performance metrics to judge your service against.

Some of the common performance metrics are:

  • Throughput: System response speed.
  • Response Time: How quickly will a given issue be resolved?
  • Reliability: System availability.
  • Load balancing: When elasticity kicks in.
  • Time Outages: Will services be unavailable during that time?
  • Service Slow down: Will services be available, but with much lower throughput?
  • Durability: How likely to lose data.
  • Elasticity: How much a resource can grow?
  • Linearity: System performance as the load increases.
  • Agility: How quickly the we responds to load changes.
  • Automation: Percent of requests handled without human interaction.

Even if you don’t publish these metrics externally you need to track these to know how you are doing.  Generally a DevOps team takes the approach of continual improvement (like Kaizan). A good DevOps team has dashboards that track these metrics and are always looking for ways to improve them.


A basic rule with cloud applications is that you need to instrument and monitor everything. First this allows you to generate your SLA dashboard and ensure you are meeting your SLA. Second this lets you provide feedback back into development on what is working well and what is working badly. For instance you can track how much a given feature is being used, or perhaps how many people start using a feature, but don’t complete the operation. This could highlight a usability problem that needs to be addressed.

Similarly for performance optimizations. You don’t want to bother optimizing something that is infrequently used. But form good monitoring, for instance you can see a query that is being run very frequently and not delivering good performance. Attacking this would be helpful, both for people issuing the query and for other people perhaps slowed down while these slower queries are being processed.

The key point being to address what really matters, based on hard facts gathered by good instrumentation on what is really affecting your users. Perhaps you don’t need a monitoring center like the one below, but it sure would be cool.


Another operation that hopefully is going on at a rapid pace is provisioning new users. And then the reverse, hopefully at a very slow pace, is removing users. In normal operations, users should be able to sign up for your service very easily, perhaps filling out a web page, and then acknowledging a confirmation e-mail. All this should be done pretty much instantly.

What should not happen, is that the user fills out a web form, which is then submitted to a queue to be processed, then in the data center, someone reads this request and performs a number of manual steps to setup the user. Then hours or days later an e-mail is sent to the customer letting them know they can use the service.

This is really a matter of a DevOps team’s focus on automation. Chances are the steps of the manual process need to be performed, which is ok, as long as they can be automated (scripted) and performed automatically quickly eliminating any time dely. Generally any DevOps team is always looking to find any manual process and eliminate (automate) it.

Elastic Operations

Most people don’t buy their own data centers or their own server hardware anymore. Especially when starting up, you don’t want to make a huge investment in capital equipment. Most people use an IaaS or PaaS service like Amazon or Azure. These services are “elastic” in that you can run scripts to add capacity or remove capacity so then stretch and shrink with demand. You pay for what you use, so for each server you have running in this environment, you are paying some fee.

Generally a DevOps team should be monitoring the system load and when it hits a certain level, scripts are run that create a new server and add it to the system, so the load is shared by more computing resources. By the same token when usage drops, perhaps on the weekend or late at night, you would like to drop some of these computing resources to save money. Again the DevOps team needs to develop the necessary programs and scripts to support this sort of operation in an automatic manner (you don’t want to be paying someone to juggle these system resources). Adding resources is usually easier since they come up empty and once they are known to the load balancer they will start being used. When shutting down, you have to monitor and ensure no one is using the system before shutting it down (or have a method to move the active users to another server). Often this is done by stopping new requests going to the server and then just waiting for all the users to logoff or become inactive. Generally if your application in completely stateless, then this is all much easier.

Disaster Recovery

It is also the responsibility of the DevOps team to ensure there is a good disaster recovery plan in place. For instance the Azure and Amazon services have multiple datacenters. You need to control how you application is deployed to these and how backups and redundancy is managed. Generally the higher level of redundancy and the quicker the switch over can cost more money, so you need to make sure your plan is sufficient, but not overdone.

Suppose you use Azure or Amazon and even though you have a redundant deployment to multiple datacenters, the whole service goes down? Some companies actually have redundancy across IaaS providers so if Azure goes down, then they can still run on Amazon. Practically speaking, unless you are very large or have a very tight SLA, this tends to be overkill. Generally you just put in your SLA that you aren’t responsible for the provider being systemically down.


The transitions from Waterfall to Agile development was an interesting one with a lot of pitfalls along the way. The transition from Agile development to DevOps is a bigger steps and will involve many new learning opportunities. It takes a bit of patience, but in the end should lead to an improved Development organization and happier customers.