Monday, December 28, 2009

Continuous deployment for mission-critical applications

Having evangelized the concept of continuous deployment for the past few years, I've come into contact with almost every conceivable question, objection, or concern that people have about it. The most common reaction I get is something like, "that sounds great - for your business - but that could never work for my application." Or, phrased more hopefully, "I see how you can use continuous deployment to run an online consumer service, but how can it be used for B2B software?" Or variations thereof.

I understand why people would think that a consumer internet service like IMVU isn't really mission critical. I would posit that those same people have never been on the receiving end of a phone call from a sixteen-year-old girl complaining that your new release ruined their birthday party. That's where I learned a whole new appreciation for the idea that mission critical is in the eye of the beholder. But, even so, there are key concerns that lead people to conclude that continuous deployment can't be used in mission critical situations.

Implicit in these concerns are two beliefs:

1. That mission critical customers won't accept new releases on a continuous basis.
2. That continuous deployment leads to lower quality software than software built in large batches.

These beliefs are rooted in fears that make sense. But, as is often the case, the right thing to do is to address the underlying cause of the fear rather than use it as an excuse to avoid improving the process. Let's take each in turn.

Another release? Do I have to?
Most customers of most products hate new releases. That's a perfectly reasonable reaction, given that most releases of most products are bad news. It's likely that the new release will contain new bugs. Even worse, the sad state of product development generally means that the new "features" are as likely to be ones that make the product worse, not better. So asking customers if they'd like to receive new releases more often usually leads to a consistent answer: "No, thank you." On the other hand, you'll get a very different reaction if you ask customers "next time you report an urgent bug, would you prefer to have it fixed immediately or to wait for a future arbitrary release milestone?"

Most enterprise customers of mission critical software mitigate these problems by insisting on releases on a regular, slow schedule. This gives them plenty of time to do stress testing, training, and their own internal deployment. Smaller customers and regular consumers rely on their vendors to do this for them, and are otherwise at their mercy. Switching these customers directly to continuous deployment sounds harder than it really is, and the reason lies in the anatomy of a release. A typical "new feature" release is, in my experience, about 80% changes to underlying APIs or architecture. That is, the vast majority of the release is not actually visible to the end-user. Most of these changes are supposed to be "side effect free," although few traditional development teams actually achieve that level of quality. So the first shift in mindset required for continuous deployment is this: if a change is supposedly "side effect free," release it immediately. Don't wait to bundle it up with a bunch of other related changes. Bundling only makes it harder to figure out which change caused the unexpected side effects.

The second shift in mindset required is to separate the concept of a marketing release from the concept of an engineering release. Just because a feature is built, tested, integrated and deployed doesn't mean that any customers should necessarily see it. When deploying end-user-visible changes, most continuous deployment teams keep them hidden behind "flags" that allow for a gradual roll-out of the feature when it's ready. (See this blog post from Flickr for how they do this.) This allows the concept of "ready" to be much more all-encompassing than the traditional "developers threw it over the wall to QA, and QA approved of it." You might have the interaction designer who designed it take a look to see if it really conforms to their design. You might have the marketing folks who are going to promote it double-check that it does what they expect. You can train your operations or customer service staff on how it works - all live in the production environment. Although this sounds similar to a staging server, it's actually much more powerful. Because the feature is live in the real production environment, all kinds of integration risks are mitigated. For example, many features have decent performance themselves, but interact badly when sharing resources with other features. Those kinds of features can be immediately detected and reverted by continuous deployment. Most importantly, the feature will look, feel, and behave exactly like it does in production. Bugs that are found in production are real, not staging artifacts.
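To make the flag mechanism concrete, here is a minimal sketch (in Python) of what such a gate might look like. The FeatureFlags class, the flag name, and the rollout_percentage setting are hypothetical illustrations rather than how Flickr or IMVU actually implement it; real systems typically read flag state from a config service or database so a flag can be flipped without a deploy.

    import hashlib

    class FeatureFlags:
        """A minimal in-process feature-flag gate (illustrative only)."""

        def __init__(self, flags):
            # e.g. {"new_checkout": {"enabled": True, "rollout_percentage": 5}}
            self.flags = flags

        def is_enabled(self, flag_name, user_id=None):
            flag = self.flags.get(flag_name)
            if not flag or not flag.get("enabled"):
                return False
            percentage = flag.get("rollout_percentage", 100)
            if user_id is None:
                return percentage >= 100
            # Hash the user id so each user gets a stable answer, letting the
            # feature roll out to a gradually growing cohort of users.
            bucket = int(hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest(), 16) % 100
            return bucket < percentage

    flags = FeatureFlags({"new_checkout": {"enabled": True, "rollout_percentage": 5}})

    def render_checkout(user_id):
        # The code is deployed to production, but only 5% of users see it.
        if flags.is_enabled("new_checkout", user_id):
            return "new checkout page"
        return "old checkout page"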

Plus, you want to get good at selectively hiding features from customers. That skill set is essential for gradual roll-outs and, most importantly, A/B split-testing. In traditional large batch deployment systems, split-testing a new feature seems like considerably more work than just throwing it over the wall. Continuous deployment changes that calculus, making split-tests nearly free. As a result, the amount of validated learning a continuous deployment team achieves per unit time is much higher.
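Once every user-visible change already lives behind a flag, a split test is just a flag whose state is assigned deterministically per user and logged alongside the metric you care about. A rough sketch, with made-up names and numbers:

    import hashlib
    import random
    from collections import Counter

    def assign_variant(experiment, user_id, variants=("control", "treatment")):
        # Hashing (experiment, user_id) means a given user always sees the same
        # variant, without having to store the assignment anywhere.
        bucket = int(hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest(), 16)
        return variants[bucket % len(variants)]

    # Toy analysis: simulate purchases and compare conversion per variant.
    random.seed(0)
    conversion_rate = {"control": 0.020, "treatment": 0.025}  # invented numbers
    visitors, purchases = Counter(), Counter()
    for user_id in range(100_000):
        variant = assign_variant("new_checkout", user_id)
        visitors[variant] += 1
        if random.random() < conversion_rate[variant]:
            purchases[variant] += 1

    for variant in sorted(visitors):
        print(variant, purchases[variant] / visitors[variant])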

The QA dilemma
A traditional QA process works through a checklist of key features, making sure that each feature works as specified before allowing the release to go forward. This makes sense, especially given how many bugs in software involve "action at a distance" or unexpected side-effects. Thus, even if a release is focused on changing Feature X, there's every reason to be concerned that it will accidentally break Feature Y. Over time, the overhead of this approach to QA becomes very expensive. As the product grows, the checklist has to grow proportionally. Thus, in order to get the same level of coverage for each release, the QA team has to grow (or, equivalently, the amount of time the product spends in QA has to grow). Unfortunately, it gets worse. In a successful startup, the development team is also growing. That means that there are more changes being implemented per unit time as well. Which means that either the number of releases per unit time is growing or, more likely, the number of changes in each release is growing. So for a growing team working on a growing product, the QA overhead is growing polynomially, even if the team is only growing linearly.
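A back-of-the-envelope model shows the shape of that growth. Suppose the manual checklist has one entry per shipped feature, the product grows roughly linearly, and a growing team pushes out more releases per month; every number below is invented purely for illustration:

    # Toy model of manual-QA overhead; all numbers are invented for illustration.
    hours_per_check = 0.5

    for month in (6, 12, 24, 36):
        checklist_items = 20 * month         # the product (and checklist) grow ~linearly
        releases_per_month = 1 + month // 6  # a bigger team ships more releases to verify
        qa_hours = checklist_items * hours_per_check * releases_per_month
        print(f"month {month:2d}: {checklist_items:4d} checks x {releases_per_month} releases "
              f"-> {qa_hours:5.0f} QA hours/month")

Both factors grow together, which is why the overhead grows polynomially even though each input grows only linearly.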

For organizations that have the highest standards for mission-critical reliability, and the budget to match, full coverage can work. In fact, that's just what happens for organizations like the US Army, which has to do a massive amount of integration testing of products built by its vendors. Having those products fail in the field would be unacceptable. In order to achieve full coverage, the Army has a process for certifying these products. The whole process takes a massive amount of manpower and requires a cycle time that would be lethal for most startups (the major certifications take approximately two years). And even the Army recognizes that improving this cycle time would have major benefits.

Very few startups can afford this overhead, and so they simply accept a reduction in coverage instead. That solves the problem in the short term, but not in the long term - because the extra bugs that get through the QA process wind up slowing the team down over time, imposing extra "firefighting" overhead, too.

I want to directly challenge the belief that continuous deployment leads to lower quality software. I just don't believe it. Continuous deployment offers five significant advantages over large batch development systems. Some of these benefits are shared by agile systems that have continuous integration but large batch releases, but others are unique to continuous deployment.
  1. Faster (and better) feedback. Engineers working in a continuous deployment environment are much more likely to get individually tailored feedback about their work. When they introduce a bug, performance problem, or scalability bottleneck, they are likely to know about it immediately. They're also much less able to hide behind the work of others: with large batch releases, a bug tends to be attributed to whoever contributed the most to that release, even if someone else's change actually caused it. 
  2. More automation. Continuous deployment requires living the mantra: "have every problem only once." This requires a commitment to realistic prevention and learning from past mistakes. That necessarily means an awful lot of automation. That's good for QA and for engineers. QA's job gets a lot more interesting when we use machines for what machines are good for: routine repetitive detailed work, like finding bug regressions. 
  3. Monitoring of real-world metrics. In order to make continuous deployment work, teams have to get good at automated monitoring and reacting to business and customer-centric metrics, not just technical metrics. That's a simple consequence of the automation principle above. There are huge classes of bugs that "work as designed" but cause catastrophic changes in customer behavior. My favorite: changing the checkout button in an e-commerce flow to appear white on a white background. No automated functional test is going to catch that, but it will still drive revenue to zero. Continuous deployment teams will get burned by that class of bug only once. (A sketch of this kind of business-metric guard follows this list.)
  4. Better handling of intermittent bugs. Most QA teams are organized around finding reproduction paths for bugs that affect customers. This made sense in an era when successful products tended to be used by a small number of customers. These days, even niche products - and certainly big enterprise products - rack up a lot of man-hours of use by end-users. And that, in turn, means that rare bugs are actually quite exasperating. For example, consider a bug that happens only once in a million uses. A traditional QA team is never going to find a reproduction path for that bug; it will never show up in the lab. But for a product with millions of customers, it's happening (and being reported to customer service) multiple times a day! Continuous deployment teams are much better able to find and fix these bugs.
  5. Smaller batches. Continuous deployment tends to drive the batch size of work down to an optimal level, whereas traditional deployment systems tend to drive it up. For more details on this phenomenon, see Work in small batches and the section on the "batch size death spiral" in Product Development Flow.
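To illustrate point 3, here is a minimal sketch of the kind of business-metric guard a deployment pipeline might run after each push. The function name, thresholds, and hard-coded numbers are hypothetical; in a real system the baseline and current values would come from your analytics pipeline, and the alert would trigger an automatic revert.

    def guard_business_metric(name, baseline, current, max_relative_drop=0.5):
        """Return True if the metric looks healthy, False if the deploy should be reverted.

        In a real pipeline, baseline and current would be read from monitoring;
        here they are plain numbers so the logic is easy to follow.
        """
        if baseline <= 0:
            return True  # nothing to compare against yet
        drop = (baseline - current) / baseline
        if drop >= max_relative_drop:
            print(f"ALERT: {name} dropped {drop:.0%} "
                  f"(baseline {baseline}, now {current}) - reverting deploy")
            return False
        return True

    # The white-on-white checkout button passes every functional test, but the
    # conversion metric collapses within minutes of the deploy:
    guard_business_metric("checkout conversions per minute", baseline=120, current=3)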
For those of you who are new to continuous deployment, these benefits may not sound realistic. In order to make sense of them, you have to understand the mechanics of continuous deployment. To get started, I recommend these three posts: Continuous deployment in 5 easy steps, Timothy Fitz's excellent Continuous Deployment at IMVU: Doing the impossible fifty times a day, and Why Continuous Deployment?

Let me close with a question. Imagine with me for a moment that continuous deployment doesn't prevent us from doing staged releases for customers, and it actually leads to higher quality software. What's preventing you from using it for your mission-critical application today? I hope you'll share your thoughts in a comment.