Monday, October 5, 2009

The curse of prevention

Beware! I have detected a secret virus in your CPU. Due to an interaction effect between your hardware, solar flares, and quantum flux, this virus will crash your computer and erase your hard drive sometime soon. There is only one way to prevent disaster: you must click the subscribe button over on the right there. Go ahead, I’ll wait.

Did you do it? Good. Now you’re safe from that dastardly virus. How do you know my solution worked? Just wait. See, no crashing. You should really say thank you.

Now, I know some of you didn’t believe my urgent virus warning, and therefore didn’t take my proposed solution. But you’re not safe. That virus is still out there, lurking. It could strike at any minute. And when your computer eventually crashes, you should feel bad that you didn’t listen to me.

OK, I admit it. There is no virus. I did my best to exaggerate this claim without saying anything disprovable, in order to illustrate the curse of prevention. Imagine for a moment that you believed my claim about the dangerous virus. After investing in my proposed solution, you probably would be grateful that I “prevented” the problem from happening. In an example this ludicrous, that hopefully sounds funny. But companies make this mistake repeatedly.

Let’s take a common real-world example. It’s important to invest in good architecture so that your website will scale once customers arrive. If you make that investment, and then customers arrive, and the site stays up, most companies will reward the people who built the architecture and, thus, prevented the scaling problems. That’s every bit as crazy as the bogus claim I made earlier. How do you know the problem was actually prevented? Isn’t it just as possible that it never would have occurred in the first place? Or, if it really was prevented, what was the opportunity cost of choosing to prevent it ahead of time?

In other words, there is a formula for evaluating the success of any proposed prevention:

if (cost of prevention) < (probability of problem occurring) * (cost of problem)
    do it
else
    ignore it

The killer thing about this formula is that every single term in it is unknown. And in most situations, there is significant cost involved in negotiating over the right estimates to plug in.
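To make the rule concrete, here is a minimal Python sketch of the comparison. Because every input is a guess, it evaluates the rule across a range of estimates rather than a single one. The function names and all of the dollar figures are hypothetical, chosen only for illustration.

```python
def should_prevent(prevention_cost: float, probability: float, problem_cost: float) -> bool:
    """Return True when prevention costs less than the expected loss."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return prevention_cost < probability * problem_cost


def decision_range(prevention_cost, probabilities, problem_costs):
    """Apply the rule to every combination of guessed probability and cost."""
    return {
        (p, c): should_prevent(prevention_cost, p, c)
        for p in probabilities
        for c in problem_costs
    }


# Hypothetical estimates: a $10,000 fix, two guesses at the probability,
# and two guesses at what the problem would cost if it happened.
outcomes = decision_range(10_000, [0.05, 0.5], [50_000, 500_000])
for (p, c), do_it in outcomes.items():
    print(f"p={p}, cost={c}: {'do it' if do_it else 'ignore it'}")
```

With these made-up numbers the answer flips depending on which estimates you accept, which is exactly why the argument over the inputs tends to be so expensive.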

I have been present for these kinds of negotiations many times in my career. They are usually among the most heated arguments a company has. Like other situations that I’ve written about, they tend to devolve into competing all-or-nothing camps. One side insists that we should build things the right way, and that failure to anticipate problems is an abdication of responsibility. But the other side wants to get things done, and doing things right somehow, always, every time seems to involve postponing useful work. Both sides suspect that deep down, secretly, the other side is using their arguments over architecture (or planning, or roadmaps, or specifications) to advance a secret agenda. Ever notice how people’s pet projects seem to be exempt?

Why do they harbor that paranoia? It’s easy to see. Say you want to derail someone else’s project. Just start enumerating corner cases. Imagine everything that might go wrong, and insist that those things be prevented before the project is launched. It’s a win-win: you either dramatically increase the proposed cost of the project, making it easier to get cancelled, or you can rely on some “I-told-you-so’s” when the project does launch and encounters inevitable problems, which gives you credibility in future such arguments. On the other side, if you want a project to go forward, you can suddenly "discover" all kinds of extra efficiencies that make this particular project an especially good deal. In the past, we invested in brilliant architecture, code reuse, refactoring, modular design, etc. that now makes it a simple matter to add this feature without much risk of corner cases. Right.

Managing these situations is hard for any company, but potentially lethal for a startup. There are just so many ways for a startup to fail. I’ve lived through the over-architecture failure – where attempting to prevent all kinds of problems wound up delaying the company from putting out any product at all. And I’ve seen companies fail the other way – the so-called Friendster effect: having a high-profile technical failure just when customer adoption is going wild.

Most of the advice I’ve heard on this topic has been a kind of split-the-difference approach. The theory is that there is some truth in both camps, and the right way to manage the disagreement is to sprinkle a little bit of both into our plans. A little planning, but not too much. Prevent some corner cases, but not others. The problem with this advice, as I’ve experienced it, is that it’s pretty hard to give a rationale for why we should anticipate this problem but ignore another one. To the people being managed that way, it feels like the boss is being capricious or arbitrary. And that feeds the conspiracy feeling that decisions have an ulterior motive.

So I’d like to lay out a systematic way to avoid death-by-corner-case without sacrificing the company’s ability to grow. In other words, a principled way to combine agility with stability.

The first shift required is a change in orientation from prevention to fast response. Many problems are catastrophic only if allowed to fester. Imagine you hear from an engineer that they are worried that a certain payment subsystem is unreliable, and will therefore double-charge some customers. One way to evaluate this fear is to spend time on analysis: how many customers will be affected? What is the maximum amount of overcharging that will happen? How upset will those customers be? How much will it cost to solve this problem now? In this framework, we’ll tend to either invest in the proposed prevention or do nothing.

But there is another way. Imagine we asked the following question: if this problem does materialize in the future, how will we know? In a lot of systems, it might take days or weeks to uncover a problem-in-action. Maybe we already have a mechanism for customers to report this kind of problem, or maybe we could invest in a simple alert counter that increments whenever the problem happens, and sends a notification if it happens often. Then, we’d know immediately if the problem ever manifests, and get a simultaneous report on its severity.
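As a rough illustration of such an alert counter, here is a minimal Python sketch. It counts every occurrence and calls a notification function when the number of occurrences inside a sliding time window reaches a threshold. The class name, the threshold, the window size, and the `notify` callback are all assumptions made for this example; a real system would plug in its own paging or email mechanism.

```python
import time
from collections import deque
from typing import Callable, Optional


class AlertCounter:
    """Counts occurrences of a problem and notifies when it happens often."""

    def __init__(self, threshold: int, window_seconds: float,
                 notify: Callable[[str], None]):
        self.threshold = threshold            # occurrences that count as "often"
        self.window_seconds = window_seconds  # how far back to look
        self.notify = notify                  # hypothetical notification hook
        self.total = 0                        # lifetime count, for severity reports
        self._recent = deque()                # timestamps inside the window

    def record(self, now: Optional[float] = None) -> None:
        """Call this wherever the feared problem is detected."""
        now = time.time() if now is None else now
        self.total += 1
        self._recent.append(now)
        # Drop timestamps that have fallen out of the window.
        while self._recent and now - self._recent[0] > self.window_seconds:
            self._recent.popleft()
        # Notifies on every occurrence at or above the threshold.
        if len(self._recent) >= self.threshold:
            self.notify(
                f"{len(self._recent)} occurrences in the last "
                f"{self.window_seconds:g}s"
            )


# Example: three suspected double charges within a minute trigger one alert.
messages = []
double_charge = AlertCounter(threshold=3, window_seconds=60, notify=messages.append)
for t in (0, 10, 20):
    double_charge.record(now=t)
print(messages)
```

The lifetime `total` gives a crude severity report, and the windowed count keeps a handful of isolated incidents from paging anyone.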

We can also ask: how would we fix the problem if it does occur? If we’re practicing continuous deployment, we can be confident that we’ll be able to rush an emergency fix into production without risking introducing further problems. If not, maybe an investment in that direction would be more warranted. In other words, you can always invest in process, batch size reduction, and agility as an alternative to preventing a specific problem.

There are two principal reasons why this second approach is better. The first is that it allows us to make variable-sized investments in response to a feared corner case. Instead of “do the fix” or “hack it up” we can choose increments of investment anywhere in-between. That gives teams a lot more flexibility in the face of the numerous corner cases that come up. Second, investing in fast response is a more resilient strategy. If we’re wrong about the corner case, the investments we’ve made in fast response will allow us to respond faster to whatever problems do appear. By contrast, most investments in traditional prevention are designed to anticipate and fix a specific problem.

But investing in fast response doesn’t solve the whole problem. That’s because there’s still a lot of judgment involved in choosing the right level of investment to make in any given case. It can feel incongruous to people who are used to the traditional model because it has a built-in paradox: you will encounter a lot of cases where you know a problem exists, and you know how to solve that problem, and you are investing time related to that problem but are not investing in the solution. To a lot of smart engineers, that sounds crazy.

That’s why it’s essential to pair the fast response aspect of this approach with a disciplined commitment to root cause analysis. Regular readers of this blog will know the specific methodology I recommend, called Five Whys. But regardless of the technique you use, it’s essential that you get regular feedback about how your prevention decisions are turning out in practice. When you’re heavily investing in prevention, you need to evaluate whether that’s causing your team to go faster. If it’s not, then you’re investing too much in one-off solutions and not enough in process. And if you’re having a lot of problems, you need to have a mechanism for ramping up your investment in prevention to avoid having your whole team dragged down into firefighting. Systems like Five Whys create a natural feedback loop: when you're going too fast, causing a lot of new problems, it slows you down to invest in prevention. As those preventative efforts pay off, the team naturally speeds up.

The most dangerous situation you can find yourself in is investing in prevention and also firefighting all the time. That’s why there is a third essential component to this approach. You need to have a long-term vision of where you’re headed. That’s because not all investments are created equal. In most real-world situations, any particular problem (or proposed problem) will have multiple kinds of solutions that you could invest in. Take your typical scalability bottleneck. It could be fixed by refactoring the code itself, or by partitioning the data horizontally or vertically, or by adding additional capacity at the point of the bottleneck, or by shaping end-user demand, or even by removing the feature itself. At any given point in time, which is the right solution? Here’s my belief: the right solution is always the one that moves you closest to your vision while simultaneously solving the problem. Thus it is unacceptable to choose a solution that solves the problem but makes no progress towards the end-state, just as it is unacceptable to invest in a solution that builds a beautiful vision but doesn’t solve today’s problem. Finding such a solution is sometimes challenging, but that’s the moment when it really pays to spend some time thinking through alternative approaches. In my experience, where there is a will to find a synthesis solution, there is always a way.
