Monday, January 18, 2010

Case Study: Continuous deployment makes releases non-events

The following is a case study of one entrepreneur's transition from a traditional development cycle to continuous deployment. Many people still find this idea challenging, even for companies that operate solely on the web. This case presents a further complication: desktop software. Without being able to transparently modify the software in situ, is it still possible to deploy on a continuous basis? Read on to find out.

Ash Maurya is the founder of WiredReach, a bootstrapped startup that he has been running for seven years. Recently, he was bitten by the lean startup bug and has started writing about his experiences attempting to apply lean startup and customer development principles. I've previously named his post Achieving Flow in a Lean Startup as one of my favorite blog posts of 2009. 

What follows is his own account of the challenges he faced as well as the solutions he adopted, lightly edited for style. If you're interested in contributing a case study for publication here, consider getting started by adding it to the Lean Startup Wiki case study section. -Eric

Of all the Lean Startup techniques, Continuous Deployment is by far the most controversial. Continuous Deployment is a process by which software is released several times throughout the day – in minutes versus days, weeks, or months. Continuous Flow Manufacturing is a Lean technique that boosts productivity by rearranging manufacturing processes so products are built end-to-end, one at a time (using single-piece flow), versus the more prevalent batch and queue approach.

Continuous Deployment is Continuous Flow applied to software. The goal of both is to eliminate waste. The biggest waste in manufacturing is created from having to transport products from one place to another. The biggest waste in software is created from waiting for software as it moves from one state to another: Waiting to code, waiting to test, waiting to deploy. Reducing or eliminating these waits leads to faster iterations which is the key to success.

My transition to Continuous Deployment

Prior to adopting continuous deployment, I used to release software on a weekly schedule (come rain or shine), which I viewed as pretty agile, disciplined, and aggressive. I identified the must-have code updates on Monday, official code cutoff was on Thursday, and Friday was slated for the big release event. The release process took at least half a day and sometimes the whole day. Dedicating up to 20% of the week to releasing software is incredibly wasteful for a small team. This is not counting the ongoing coordination effort also needed in prioritizing the ever-changing release content for the week as new critical issues are discovered. Despite these challenges, I fought the temptation to move to a longer bi-weekly or monthly release cycle because I wanted to stay highly responsive to customers (something our customers repeatedly appreciate). Managing weekly releases got a lot harder once I started doing customer development. Spending more time outside the building meant less time for coding, testing, and deploying. Things started to slip. That is when I devised a set of work hacks to manage my schedule (described here), and it is what drove me to adopt Continuous Deployment.

My transition from staged releases to continuous deployment took roughly 2 weeks. I read Eric Ries' 5 step primer to getting started with Continuous Deployment and found that I already had a lot of the necessary pieces. Continuous integration, deployment scripts, monitoring, and alerting are all best practices for any release process - staged or continuous.

The fundamental challenge with Continuous Deployment is getting comfortable with releasing all the time.
Continuous deployment makes releases non-events and checking in code is synonymous with triggering a release. On the one hand, this is the ultimate in customer responsiveness. On the other hand, it is scary as hell. With staged releases, time provides a (somewhat illusory) safety net. There is also comfort in sharing test responsibility with someone else (the QA team). No one wants to be solely responsible for bringing a production system down. For me neither was a consideration. I didn't have time or a QA team.

I took things easy at first - made small changes and audited the release process maniacally. I started relying heavily on functional tests (over unit tests) which allowed me to test changes as a user would. I also identified a set of events that would indicate something going terribly wrong (e.g. no users on the system) and built real-time alerting around them (using nagios/ganglia). As we built confidence, we started committing bigger and multi-part changes, each time building up our suite of testing and monitoring scripts. After a few iterations, our fear level was actually lower than how we used to feel after a staged release. Because we were committing less code per release, we could correlate issues to a release with certainty.

These days, we never wonder if unexpected errors could have been introduced as a result of a large code merge (since there is no branching). We also rely on more testing and monitoring automation, which is way more robust and consistent than what we were doing before.

All that said, mistakes are still made and we commit bad code now and then. None has taken the system down (not yet anyway). Rather than seeing these as a shortcoming of the process, we view them as an opportunity to build up our Cluster Immune System. We try to follow a Five Whys approach to keep these errors from recurring. There is always some action to take: writing more tests, more monitoring, more alerts, more code, or more process.

Looking back, I struggled to balance the opposing pulls of "outside the building" versus "inside the building" activities. Adopting Continuous Deployment has allowed me to build "flow" into my day which allows me to do both. But easier releases are not the only benefit of Continuous Deployment. Smaller releases lead to faster build/measure/learn loops. I've used these faster build/measure/learn loops to optimize my User Activation flow, delight customers with "near-instant" fixes to issues, and even eliminate features that no one was using.

While it is somewhat easier to continuously deploy web-based software, with a little discipline, desktop-based software too can be built to flow. Here's how I implement continuous deployment for my desktop-based application (CloudFire).

My Continuous Deployment process

Don't push features

If you've followed a customer discovery process, identified a problem worth solving, and built out your minimum viable product, DON'T keep adding features until you've validated the MVP, or more specifically the unique value proposition of the MVP. Unneeded features are waste and not only create more work but can needlessly complicate the product and prolong the "customer validation" phase.

Every new feature should ideally be pulled by more than one customer before showing up in a release.
Build in response to a signal from the "customer", and otherwise rest or improve.
As a technologist, I too love to measure progress based on how much stuff I build. But instead of channeling all my energy towards building new features, I channel roughly 80% of it towards measuring and optimizing existing features. I am not advocating adding no features at all. Users will naturally ask for more stuff and your MVP by definition is minimal and needs more love. Just don't push it.

Code in small batches

I've previously described my 2 hour blocks of maker time for maximizing my work "flow". Prior to starting any maker activity, I clearly identify what needs to get done (the goal) and sketch out how it needs to get done (the design).

It is important to point out that the goal of the maker activity need not be a user-facing feature or even a complete feature. There is inherent value in committing incremental work into production to defuse future integration surprises. During the maker activity, I code, unit test, and create or update functional tests, as needed. At the end of the maker activity, I check in code, which automatically triggers a build on a continuous integration server that is then run through a battery of unit and functional tests. The artifacts created at the end of the build are installers for Mac and Windows (for new users) along with an Eclipse P2 repository (OSGi) for automatic software updates (for current users). The release process takes ~15 minutes and runs in the background.

Prefer functional tests over unit tests whenever possible

I don't believe in blindly writing unit tests just to achieve 100% code coverage as reported by some tool. To do that I would have to mock (simulate) too many critical components. I deem excessive unit testing a form of waste. Whenever possible, I rely on functional tests that verify user actions. I use Selenium, which lets me control the application on multiple browsers and OS platforms, just as a user would. One thing to be wary of is that functional tests are longer running than unit tests and will gradually increase the release cycle time. Parallelization of tests with multiple test machines is a way to address this. I am not at that point yet but Selenium Grid looks like a good option. So does Go Test It.

Always test the User Activation flow

After the integration tests are run and the software packaged, I always verify my User Activation flow before going live. The user activation flow is the most critical path towards achieving initial user gratification or product/market fit. My user activation flow is automatically tested on both a Mac and a Windows machine.

Utilize automagic software updates

A major challenge with desktop-based (versus web-based) software is propagating software updates. Studies have shown that users find traditional software update dialogs annoying. To overcome this, I am using a software update strategy that works silently without ever interrupting the user, much like an appliance. Google Chrome utilizes a similar update process. The biggest risk with this approach is that users will find it Orwellian. So far no one has complained, and many users like the auto-update feature. It helps that CloudFire, being a p2web app, runs headlessly with a browser-based UI.

This is how the software update process currently works:
  1. At the end of each build, we push an Eclipse P2 repository (OSGi) which is a set of versioned plug-ins that make up the application. Because the application is composed of many small plug-ins and we commit code in small batches, each software update is small and can be downloaded quickly.
  2. Every time the user starts up the application, it checks for a new update, downloads and installs one if available. Depending on the type of update, it could take effect immediately or require an application restart. If an application restart is required, we wait until the next user initiated relaunch of the application or trigger one silently when the system is idle.
  3. If the application is already running, it periodically polls for new updates. If an update is found, it is also downloaded and installed in the background (as above) without interrupting the user.
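
The three steps above can be sketched roughly as follows. This is an illustration only and does not use the Eclipse P2 API: it assumes plug-in versions are plain dotted numbers, and the names `parse_version`, `plan_update`, `apply_policy`, and the example plug-ins are all hypothetical.

```python
def parse_version(text):
    """Turn '1.4.10' into (1, 4, 10) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def plan_update(installed, available, needs_restart):
    """Return (plug-ins to download, whether a restart is needed).

    installed / available map plug-in name -> version string.
    needs_restart is the set of plug-ins that cannot be swapped while running.
    """
    stale = sorted(
        name
        for name, version in available.items()
        if name not in installed
        or parse_version(version) > parse_version(installed[name])
    )
    return stale, any(name in needs_restart for name in stale)


def apply_policy(restart_required, system_idle):
    """Decide when an already-downloaded update takes effect."""
    if not restart_required:
        return "apply now"
    return "restart silently" if system_idle else "wait for next launch"


if __name__ == "__main__":
    installed = {"core": "1.4.9", "ui": "2.0.0", "sync": "0.9.1"}
    available = {"core": "1.4.10", "ui": "2.0.0", "sync": "0.9.2"}
    stale, restart = plan_update(installed, available, needs_restart={"core"})
    print(stale, restart, apply_policy(restart, system_idle=False))
```

Only the plug-ins whose version changed are fetched, which is why small commits and small plug-ins keep each download small. The same check can run at startup and on a timer while the application is running.
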

Alerts and monitoring

I use nagios and ganglia to implement both system and application level monitoring and alerting on the overall health of the production cluster. Examples of things I monitor are the numbers of user activations, active users, and aggregate page hits to user galleries. Any out-of-the-norm dip in these numbers immediately alerts us (via twitter/SMS) to a potential issue.
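
As a rough sketch of what "out-of-the-norm dip" could mean in code, the check below compares the latest reading of a metric against its recent history. It is not a nagios or ganglia plug-in; the function name `is_abnormal_dip`, the three-sigma threshold, and the sample numbers are assumptions chosen for illustration.

```python
from statistics import mean, stdev


def is_abnormal_dip(history, current, sigmas=3.0, min_samples=5):
    """True if `current` falls well below the recent baseline in `history`."""
    if len(history) < min_samples:
        return False  # not enough data yet to define "normal"
    baseline = mean(history)
    spread = stdev(history)
    # A reading of zero (e.g. no active users) is always alarming
    # once a non-zero baseline exists.
    if current == 0 and baseline > 0:
        return True
    return current < baseline - sigmas * spread


if __name__ == "__main__":
    # Hypothetical hourly counts of active users.
    active_users = [120, 130, 125, 118, 127, 122]
    for reading in (121, 60, 0):
        print(reading, is_abnormal_dip(active_users, reading))
```

A check like this would be run per metric (user activations, active users, page hits) and wired to whatever notification channel is in use. Tuning `sigmas` trades false alarms against missed dips.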

Application level diagnostics

Despite the best testing, defects still happen. More testing is not always the answer as some defects are intermittent and a function of the end-user's environment. It is virtually impossible to test all combinations of hardware, OS, browsers, and third party apps (e.g. Norton anti-virus, Zone Alarm, etc.).

Relying on users to report errors doesn't always work in practice. To compensate, we've had to build basic diagnostics right into the application itself. They can notify both the user and us of unexpected errors, and allow us to pull configuration information and logs remotely. We can also do remote rollbacks this way.
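
A minimal sketch of such built-in diagnostics, under the assumption that the application keeps a short in-memory log and packages it with the error details when something unexpected happens. The `Diagnostics` class and its fields are hypothetical and not CloudFire's actual implementation; how the report is sent back (and remote rollback) is left out.

```python
import json
import platform
import traceback
from collections import deque


class Diagnostics:
    """Keeps recent log lines and turns an exception into a report."""

    def __init__(self, app_version, max_log_lines=200):
        self.app_version = app_version
        # Bounded buffer: old lines drop off so memory use stays flat.
        self.recent = deque(maxlen=max_log_lines)

    def log(self, message):
        self.recent.append(message)

    def report(self, exc):
        """Build a JSON-serializable description of an unexpected error."""
        return {
            "app_version": self.app_version,
            "os": platform.system(),
            "error": type(exc).__name__,
            "message": str(exc),
            "traceback": traceback.format_exception(
                type(exc), exc, exc.__traceback__
            ),
            "recent_log": list(self.recent),
        }


if __name__ == "__main__":
    diag = Diagnostics(app_version="1.4.10", max_log_lines=2)
    diag.log("starting")
    diag.log("opening local port")
    diag.log("port check failed")
    try:
        raise ConnectionError("local port blocked")
    except Exception as exc:
        print(json.dumps(diag.report(exc), indent=2))
```

Capturing the environment alongside the error is what makes intermittent, machine-specific defects diagnosable without asking the user to describe them.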

Tolerate unexpected errors exactly once

Unexpected errors provide the opportunity to learn and bullet-proof a system early. Ignoring them or implementing quick-and-dirty patches inevitably leads to repeat errors, which are another form of waste. I try to follow a formalized Five Whys process (using our internal wiki) for every error. This forces us to really stop, think, and fix the right problem(s).

My continuous deployment process is summarized below:

[Figure: summary of my continuous deployment process]

So why is Continuous Deployment so controversial?

Eric has addressed a lot of the objections already on his blog. One that I hear a lot is the belief that you need a massive team to pull off continuous deployment. I would argue that the earlier in the development cycle and the smaller the team, the easier it is to implement a continuous deployment process. If you are a start-up with an MVP, there is no better time to adopt a continuous deployment process than the present. You don't yet have hundreds of customers, dozens of peers, or dozens of features. It is a lot easier to lay the groundwork now with time on your side.

If you enjoyed Ash's writing in this case study, I suggest you subscribe to his blog. -Eric