Stephen Smith's Blog

Musings on Machine Learning…

Posts Tagged ‘devops

The Road to DevOps Part 2

leave a comment »


Last week we looked at an introduction to DevOps and concentrated on the issues around frequently deploying new versions of the software. This week we are continuing to look at DevOps but concentrating on issues with maintaining and monitoring the system during normal operations. This includes ensuring the system is available, provisioning new users, removing delinquent users and generally monitoring the system and ensuring it is healthy.


Generally everyone wants a service that is always available always healthy and working well. But this is too vague a statement and the reality is that things happen and need to be dealt with. This has to be acknowledged up front and strategies put into place to deal with them. First you need stronger guidelines and this usually starts with a Service Level Agreement (SLA) that is laid out for your customers. This details various metrics that you are promising to achieve and what happens when you don’t. Generally you need a good set of performance metrics to judge your service against.

Some of the common performance metrics are:

  • Throughput: System response speed.
  • Response Time: How quickly will a given issue be resolved?
  • Reliability: System availability.
  • Load balancing: When elasticity kicks in.
  • Time Outages: Will services be unavailable during that time?
  • Service Slow down: Will services be available, but with much lower throughput?
  • Durability: How likely to lose data.
  • Elasticity: How much a resource can grow?
  • Linearity: System performance as the load increases.
  • Agility: How quickly the we responds to load changes.
  • Automation: Percent of requests handled without human interaction.

Even if you don’t publish these metrics externally you need to track these to know how you are doing.  Generally a DevOps team takes the approach of continual improvement (like Kaizan). A good DevOps team has dashboards that track these metrics and are always looking for ways to improve them.


A basic rule with cloud applications is that you need to instrument and monitor everything. First this allows you to generate your SLA dashboard and ensure you are meeting your SLA. Second this lets you provide feedback back into development on what is working well and what is working badly. For instance you can track how much a given feature is being used, or perhaps how many people start using a feature, but don’t complete the operation. This could highlight a usability problem that needs to be addressed.

Similarly for performance optimizations. You don’t want to bother optimizing something that is infrequently used. But form good monitoring, for instance you can see a query that is being run very frequently and not delivering good performance. Attacking this would be helpful, both for people issuing the query and for other people perhaps slowed down while these slower queries are being processed.

The key point being to address what really matters, based on hard facts gathered by good instrumentation on what is really affecting your users. Perhaps you don’t need a monitoring center like the one below, but it sure would be cool.


Another operation that hopefully is going on at a rapid pace is provisioning new users. And then the reverse, hopefully at a very slow pace, is removing users. In normal operations, users should be able to sign up for your service very easily, perhaps filling out a web page, and then acknowledging a confirmation e-mail. All this should be done pretty much instantly.

What should not happen, is that the user fills out a web form, which is then submitted to a queue to be processed, then in the data center, someone reads this request and performs a number of manual steps to setup the user. Then hours or days later an e-mail is sent to the customer letting them know they can use the service.

This is really a matter of a DevOps team’s focus on automation. Chances are the steps of the manual process need to be performed, which is ok, as long as they can be automated (scripted) and performed automatically quickly eliminating any time dely. Generally any DevOps team is always looking to find any manual process and eliminate (automate) it.

Elastic Operations

Most people don’t buy their own data centers or their own server hardware anymore. Especially when starting up, you don’t want to make a huge investment in capital equipment. Most people use an IaaS or PaaS service like Amazon or Azure. These services are “elastic” in that you can run scripts to add capacity or remove capacity so then stretch and shrink with demand. You pay for what you use, so for each server you have running in this environment, you are paying some fee.

Generally a DevOps team should be monitoring the system load and when it hits a certain level, scripts are run that create a new server and add it to the system, so the load is shared by more computing resources. By the same token when usage drops, perhaps on the weekend or late at night, you would like to drop some of these computing resources to save money. Again the DevOps team needs to develop the necessary programs and scripts to support this sort of operation in an automatic manner (you don’t want to be paying someone to juggle these system resources). Adding resources is usually easier since they come up empty and once they are known to the load balancer they will start being used. When shutting down, you have to monitor and ensure no one is using the system before shutting it down (or have a method to move the active users to another server). Often this is done by stopping new requests going to the server and then just waiting for all the users to logoff or become inactive. Generally if your application in completely stateless, then this is all much easier.

Disaster Recovery

It is also the responsibility of the DevOps team to ensure there is a good disaster recovery plan in place. For instance the Azure and Amazon services have multiple datacenters. You need to control how you application is deployed to these and how backups and redundancy is managed. Generally the higher level of redundancy and the quicker the switch over can cost more money, so you need to make sure your plan is sufficient, but not overdone.

Suppose you use Azure or Amazon and even though you have a redundant deployment to multiple datacenters, the whole service goes down? Some companies actually have redundancy across IaaS providers so if Azure goes down, then they can still run on Amazon. Practically speaking, unless you are very large or have a very tight SLA, this tends to be overkill. Generally you just put in your SLA that you aren’t responsible for the provider being systemically down.


The transitions from Waterfall to Agile development was an interesting one with a lot of pitfalls along the way. The transition from Agile development to DevOps is a bigger steps and will involve many new learning opportunities. It takes a bit of patience, but in the end should lead to an improved Development organization and happier customers.


The Road to DevOps Part 1

with 5 comments


Historically Development and Operations have been two separate departments that rarely talk to each other. Development produces a product using some sort of development methodology and when they’ve finished it, they release it and hand it off to Operations. Operations then installs the product, upgrades the users from the old version and then maintains the system, fixing hardware problems, logging software defect complaints and keeping backups. This worked fine when Development produced a release every 18 months or so, but doesn’t really work in the modern Web world where software can often be released hourly.

In the new world Development and Operations need to be combined into one team and one department. The goal being to remove any organizational or bureaucratic barriers between the two. It also recognizes that the goal of the company isn’t just to produce the software but to run it and to keep it available, up to date and healthy for all its customers. Operations has to provide feedback immediately to Development and Development has to quickly address any issues and provide updates quickly back to Operations.

In this posting I’m covering the aspect of DevOps concerned with frequently rolling out new versions to the cloud. In a future posting I’ll cover the aspects of DevOps concerned with the normal running of a cloud based service, like provisioning new users, monitoring the system, scaling the system and such.

Agile to DevOps

We have transitioned our software development methodology from Waterfall to Agile. A key idea of Agile is that you break work into small stories that you complete in a single short sprint (in our case two weeks). A key goal is that during the sprint each story is done completely including testing, documentation and anything else required. This then leads to the product being in a releasable state at the end of each sprint. In an ideal world you could then release (or deploy) the product at the end of each sprint.

This is certainly an improvement over Waterfall where the product is only releasable every year or 18 months or so. But generally with Agile this is the ideal, but not really what is quite achieved. Generally a release consists of the outcome of a number of sprints (usually around 10) followed by a short regression, held concurrently with a beta test followed by release.

So the question is how do we remove this extra overhead and release continuously (or much more frequently)? This is where DevOps comes in. It is a methodology that has been developed to extend Agile Development and principles straight through to actually encompassing the deployment and maintenance of the product. DevOps does require Development change the way it does things as it requires Operations to change. DevOps requires a much more team bases approach to doing things and requires organizational boundaries be removed. DevOps requires a razor focus on automating all manual processes in the build to deploy to monitor to get feedback cycle.


Most development processes have an automated build system, usually built on something like Jenkins. The idea is that when you check in source code, the build system sees you checked it in, then it has rules that tell it what module it is part of, and rebuilds those modules, then it sees what modules depend on those modules and rebuilds those and so on. As part of the build process, unit tests are run and various automated tests are set off. The idea is that if anything goes wrong, it is very clear which check-in caused it and things can be quickly fixed and/or rolled back.

This is a good starting point but for effective DevOps it needs to be refined. Most modern source control systems support branching (most famously git). For DevOps it becomes even more crucial that the master branch of the product is always in releasable state and can be deployed at any time. The way this is achieved is by developing each feature as a separate branch. Then when a feature is completely ready for release it can be pulled into the master branch, which means it can be deployed at any time. Below is a diagram of how this process typically works:

Automated Testing

Obviously in this environment, it isn’t going to work if for every frequent release, you need to run a complete thorough manual test of the entire product. In an ideal world you have very complete test coverage using various levels of automated testing. I covered some aspects of this here. You still want to do some manual testing to ensure that things still feel right, but you don’t want to have to be repeating a lot of functional testing.


Operations can then take any release and in consultation with the various stakeholders release it to production. Operations is then in charge of this part of things and ensures the new versions in installed, data is converted and everything is running smoothly.

Some organizations release everything that is built which means their web site can be updated hourly. Github is a good example of this. But generally for ERP or CRM type software we want to be a bit more controlled. Generally there is a schedule of releases (perhaps one release every two weeks) and then a schedule of when things need to be done to be included in that release, which then controls which branches get pulled into the master branch. This is to ensure that there aren’t any disruptions to business customers. Again you can see that this process is blending elements of QA along with Operations which is why the DevOps team approach works so well.

A key idea that has emerged is the concept: “Infrastructure as Code”. In this world all Operations tasks are performed by code. This way things become much more repeatable and much more automated. It’s really this whole idea that you build your infrastructure by writing programs that then do the real work, that has largely led to movement of Developers into Operations.


This is where Development and Operations must be working as a team. Operations has to let Developers know what tools (scripts) they require to deploy the new version. They need automated procedures to roll out the new version, convert data and such. They have to be working together very closely to develop all these scripts in the various tools like Jenkins, PowerShell, Maven, Ant, Chef, Puppet or Nexus.

Performing all this work takes a lot of effort. It has to be people’s full time job, or it just won’t get done properly. If people aren’t fully applied to this, manual processes will start to creep in, productivity and quality will suffer.

Beyond successfully deploying the software, this team has to handle things when they go wrong. They need to be able to rollback a version. Reverse the database schema changes and return to a known stable good state.


DevOps is a whole new profession. Combining many of the skills of Development with those of Operations. People with these skill sets will be in high demand as this is becoming an area that is making or breaking companies on the Web and in the Cloud. No one likes to have outages; no one likes to roll out bad upgrades. In today’s fast paced world, these can put huge pressures on a company. DevOps as a profession and set of operating procedures is a good way to alleviate this pressure while keep up to the fast pace.