Friday, February 20, 2009

Work in small batches

Software should be designed, written, and deployed in small batches.

Of all of the insights I've contributed to the companies I've worked at over the years, the one I am most proud of is the importance of working in small batches. It's had tremendous impact in many areas: continuous deployment, just-in-time scalability, and even search engine marketing, to name a few. I owe it originally to lean manufacturing books like Lean Thinking and Toyota Production System.

The batch size is the unit at which work-products move between stages in a development process. For software, the easiest batch to see is code. Every time an engineer checks in code, they are batching up a certain amount of work. There are many techniques for controlling these batches, ranging from the tiny batches needed for continuous deployment to more traditional branch-based development, where all of the code from multiple developers working for weeks or months is batched up and integrated together.

It turns out that there are tremendous benefits from working with a batch size radically smaller than traditional practice suggests. In my experience, a few hours of coding is enough to produce a viable batch and is worth checking in and deploying. Similar results apply in product management, design, testing, and even operations. Normally I focus on the techniques you need to reduce batch size, like continuous integration. Today, I want to talk about the reasons smaller batches are better. This is actually a hard case to make, because most of the benefits of small batches are counter-intuitive.

Small batches mean faster feedback. The sooner you pass your work on to a later stage, the sooner you can find out how they will receive it. If you're not used to working in this way, it may seem annoying to get interrupted so soon after you were "done" with something, instead of just working it all out by yourself. But these interruptions are actually much more efficient when you get them soon, because you're that much more likely to remember what you were working on. And, as we'll see in a moment, you may also be busy building subsequent parts that depend on mistakes you made in earlier steps. The sooner you find out about these dependencies, the less time you'll waste having to unwind them.

Take the example of a design team prepping mock-ups for their development team. Should they spend a month doing an in-depth set of specifications and then hand them off? I don't think so. Give the dev team your very first sketches and let them get started. Immediately they'll have questions about what you meant, and you'll have to answer them. You may surface assumptions you had about how the project was going to go that are way off. If so, you can immediately evolve the design to take the new facts into account. Every day, give them the updated drawings, always with the proviso that everything is subject to change. Sometimes that will require the team to build something over again, but that's rarely very expensive, because the second time is so much more efficient, thanks to the knowledge gained the first time through. And over time, the development team may be able to start anticipating your needs. Imagine not having to finish the spec at all, because the team has already found an acceptable solution. I've witnessed that dozens of times, and it's a huge source of time-savings.

Small batches mean problems are instantly localized. This is easiest to see in deployment. When something goes wrong with production software, it's almost always because of an unintended side-effect of some piece of code. Think about the last time you were called upon to debug a problem like that. How much of the time you spent debugging was actually dedicated to fixing the problem, compared to the time it took to track down where the bug originated?

Small batches reduce risk. An example of this is integration risk, which we use continuous integration to mitigate. Integration problems happen when two people make incompatible changes to some part of the system. This comes in all shapes and sizes. You can have code that depends on a certain configuration that's deployed on production. If that configuration changes before your code is deployed, the person who changes it won't know they've introduced a problem. Your code is now a ticking time bomb, waiting to cause trouble when it's deployed.

Or consider the case of code that changes the signature of a commonly-called function. It's easy to find collisions if you make a drastic change, but harder when we do things like add new default parameters. Imagine a branch-based development system with two different developers who each added a new, but different, default-value argument to the end of the signature, and then went through and updated all its callers. Anyone who has had to spend hours late at night resolving one of these conflicts knows how painful they are. The smaller the batch size, the sooner these kinds of errors are caught, and the easier the integration is. When operating with continuous deployment, it's almost impossible to have integration conflicts.
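To make the default-parameter collision concrete, here is a minimal sketch in Python. Every name in it (`send`, `retries`, `verbose`) is hypothetical and chosen only for illustration, and the merged signature is just one plausible outcome of resolving the conflict by hand.

```python
# Hypothetical example: all names here are illustrative, not from a real codebase.

# Branch A adds a default-value argument at the end of the signature.
def send_branch_a(message, retries=3):
    return {"message": message, "retries": retries}

# Branch B adds a different default-value argument in the same position.
def send_branch_b(message, verbose=False):
    return {"message": message, "verbose": verbose}

# One plausible merge keeps both new parameters, in some order.
def send_merged(message, retries=3, verbose=False):
    return {"message": message, "retries": retries, "verbose": verbose}

# A caller updated on branch B, passing the new argument positionally.
def caller_from_branch_b(send):
    return send("hello", True)

before = caller_from_branch_b(send_branch_b)  # True is bound to verbose
after = caller_from_branch_b(send_merged)     # True is now bound to retries

print(before)
print(after)
```

Nothing fails at merge time or at call time; the value simply lands in the wrong parameter. The longer the two branches live apart, the more callers like this one accumulate.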

Small batches reduce overhead. In my experience, this is the most counter-intuitive of their effects. Most organizations have their batch size tuned so as to reduce their overhead. For example, if QA takes a week to certify a release, it's likely that the company does releases no more than once every 30 or 60 days. Telling a company like that they should work in a two-week batch size will sound absurd - they'd spend 50% of their time waiting for QA to certify the release! But this argument is not quite right. This is something so surprising that I didn't really believe it the first few times I saw it in action. It turns out that organizations get better at those things that they do very often. So when we start checking in code more often, release more often, or conduct more frequent design reviews, we can actually do a lot to make those steps dramatically more efficient.
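The naive arithmetic behind that objection can be sketched as follows. The function name is made up for illustration, and the calculation assumes certification always takes a fixed seven days regardless of batch size, which is exactly the assumption this post argues does not hold in practice.

```python
def certification_overhead(certify_days, cycle_days):
    """Fraction of each release cycle spent waiting on certification.

    Assumes certification time stays constant as the cycle shrinks.
    """
    return certify_days / cycle_days

# One week of QA certification against 60-day, 30-day, and two-week cycles.
for cycle_days in (60, 30, 14):
    share = certification_overhead(7, cycle_days)
    print(f"{cycle_days}-day cycle: {share:.0%} of the cycle is certification")
```

With a fixed week of certification, the share rises from roughly 12% at 60 days to 50% at two weeks, which is why the proposal sounds absurd until the certification step itself starts to shrink.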

Of course, that doesn't necessarily mean we will make those steps more efficient. A common line of argument is: if we have the power to make a step more efficient, why don't we invest in that infrastructure first, and then reduce batch size as we lower the overhead? This makes sense, and yet it rarely works. The bottlenecks that large batches cause are often hidden, and it takes work to make them evident, and even more work to invest in fixing them. When the existing system is working "good enough" these projects inevitably languish and get deprioritized.

Take the example of the team that needs a week to certify a new release. Imagine moving to a two-week release cycle, with the rule that no additional work can take place on the next iteration until the current iteration is certified. The first time through, this is going to be painful. But very quickly, probably even by the second iteration, the weeklong certification process will be shorter. The development team that is now clearly bottlenecked will have the incentive needed to get involved and help with the certification process. They'll be able to observe, for example, that most of the certification steps are completely automatic (and horribly boring for the QA staff) and start automating them with software. But because they are blocked from being able to get their normal work done, they'll have a strong incentive to invest quickly in the highest ROI tests, rather than overdesigning a massive new testing system which might take ages to make a difference.

These changes pay increasing dividends, because each improvement now directly frees up somebody in QA at the same time as reducing the total time of the certification step. Those freed up QA resources might be able to spend some of that time helping the development team actually prevent bugs in the first place, or just take on some of their routine work. That frees up even more development resources, and so on. Pretty soon, the team can be developing and testing in a continuous feedback loop, addressing micro-bottlenecks the moment they appear. If you've never had the chance to work in an environment like this, I highly recommend you try it. I doubt you'll go back.

If you're interested in getting started with the transition to small batches, I'd recommend beginning with Five Whys.

(I have infuriated many coworkers by advocating for smaller batch sizes without always being able to articulate why they work. Usually, I have to resort to some form of "try it, you'll like it," and that's often sufficient. Luckily, I now have the benefit of a forthcoming book, The Principles of Product Development Flow. It's really helped me articulate my thinking on this topic, and includes an entire chapter on the topic of reducing batch size.)

11 comments:

  1. Interesting post. As a developer I find that forcing myself to work in small batches allows me to solve the problem quicker by avoiding the analysis paralysis. Each time I encounter a complex task I immediately force myself to think what are the independent small steps I can take to finish it. I try to keep each step solvable under 1-2 hours.

  2. No doubt about it, small batches are the way to go

    RT
    www.anonymity.eu.tc

  3. I'm a huge fan of small batches as well. This is a fantastic article.

  4. Sounds very similar to agile development which is the way. Short iterations, quick commits (some harmful) but at least it's alive, moving, and nobody's losing interest.

  5. Good analysis. Game development in general has worked this way for at least a decade, if not two. Junior developers tend to take a more traditional approach and have to be coached into following this line of thinking. The only disadvantage that comes to mind is the potential bottleneck of submission reviews, when that is deemed necessary.

  6. So how many releases per year would you recommend?

  7. Small batches also help deliver value earlier in the project. If you can start getting ROI on a feature in month one of a twelve month project versus waiting until the end, you've comparatively reduced the cost of development by the revenue generated by that feature over 11 months.

  8. Small is beautiful.

    However, consider a situation where a team has many inexperienced/careless engineers who commit modifications in small batches...but unfortunately a very large percentage of those batches keep breaking the continuous build.

    So now trunk branch is broken almost all the time and progress is difficult because either you have to commit to an already broken tree (so you don't know what else you might break) or you have to wait for it to get fixed...which might take a while because naturally the fixes tend to break something else.

    So now the organization decides that small batches are bad and two things happen:

    1. Modifications are gathered into large infrequent batches. Weeks/months of boredom are broken by blood-curdling terror when the Big Batch lands.

    2. Everybody stops using trunk and squirrels their work away in private branches. Trunk is desiccated and all the cool action is somewhere else...if you only knew where.

    So what's a small batch advocate to do?

    I think small batch advocacy contains implicit assumptions (reasonable level of skill/discipline, desire to avoid breaking the build, desire to fix the build when it's broken) that may not be widely shared in an organization. And then it can get ugly.

  9. @Anonymous - appreciate the thoughtful comment. I think what you've described is a common approach to getting quality (or other) problems under control: increase batch size, increase audits and other controls.

    But I think this is the wrong approach. The key to maintaining small batches is to ensure that the team works only as fast as it can reliably produce quality work, and no faster.

    So, for example, I would not allow anyone to commit to a broken trunk (or any other branch). Take a look at:

    http://startuplessonslearned.blogspot.com/2008/12/continuous-integration-step-by-step.html

    for a suggestion for how to implement this. Even if you have untrained, undisciplined engineers, you can create a system that incrementally, over time, addresses those root causes.

  10. However, a technological solution can't necessarily resolve all human root causes.

    So, we have a system that disallows commits to broken branches. That doesn't prevent an engineer from breaking a branch, going home and leaving the mess for someone else to fix.

    If peer pressure doesn't work then it looks like a management problem. And I'm afraid those can only be solved by managers who are willing to solve them.

  11. This comment has been removed by the author.
