Thursday, July 2, 2009

How to conduct a Five Whys root cause analysis

In the lean startup workshops, we’ve spent a lot of time discussing the technique of Five Whys. It allows teams to diagnose sources of waste in their development process and continuously improve, reversing the usual trend of teams getting slower over time. With Five Whys, teams can accelerate, even as they scale.

In a previous post, I outlined the benefits of Five Whys: that it allows you to make large investments in infrastructure incrementally, takes advantage of the 80/20 rule to free up resources immediately, and helps organizations become built to learn. Today, I want to talk about the mechanics of Five Whys in greater detail.

First, a caveat. My intention is to describe a full working process, similar to what I’ve seen at IMVU and other lean startups. But as with all process changes, it should not be interpreted as a spec to be implemented right away. In fact, trying too much is just as dangerous as not doing enough. Just as the lean movement has taught us to build incrementally, it has also taught us to attempt process changes incrementally. You need to transition to a work flow of small batches – in small batches.

Five Whys involves holding meetings immediately following the resolution of problems the company is facing. These problems can be anything: development mistakes, site outages, marketing program failures, or even internal missed schedules. Any time something unexpected happens, we could do some root cause analysis. Yet it’s helpful to begin by tackling a specific class of problems. For example, a common place to get started is with this rule: any time we have a site outage of any duration, we will hold a post-mortem meeting immediately afterwards.

The first step is to identify three things about the meeting: what problem we are trying to solve, who will run the meeting, and who was affected by the problem. For the problem, it’s essential to hold the meeting immediately following a specific symptom. Five Whys rarely works for general abstract problems like “our product is buggy” or “our team moves too slow.” Instead, we want to hold it for a specific symptom, like “we missed the Jan 6 deadline by two weeks” or “we had a site outage on Nov 10.” Have faith that if a large general problem is really severe, it will be generating many symptoms that we can use to achieve a general solution.

Always explicitly identify the person running the meeting. Some organizations like to appoint a “Five Whys master” for a specific area of responsibility. For example, at IMVU we had masters appointed for topics like site scalability or unit test failures. The advantage of having an expert run each meeting is that this person can get better and better at helping the team find good solutions. The downside is the extra coordination required to get that person in the room each time. Either way can work. In any event, nobody should hold a master position for too long. Rotation is key to avoid having a situation where one person becomes a bottleneck or single point of failure.

The person running the meeting does not have to be a manager or executive. They do need to have the authority necessary to assign tasks across the organization. That’s because Five Whys will often pierce the illusion of separate departments and discover the human problems that lurk beneath the surface of supposedly technical problems. In order to make Five Whys successful, the person running the meeting has to have the backing of an executive sponsor who has the requisite authority to back them up if they wind up stepping in political landmines. But this executive sponsor doesn’t need to be in the room – what matters is that everyone in the room understands that the person running the meeting has the authority to do so. This means that if you are trying to introduce Five Whys into an organization that is not yet bought-in, you have to start small.

In order to maximize the odds of success, we want to have everyone affected by the problem in the meeting. That means having a representative of every department or function that was affected. When customers are affected, try to have someone who experienced the customer problem first-hand, like the customer service rep who took the calls from angry customers. At a minimum, you have to have the person who discovered the problem. Otherwise, key details are likely to be missed. For example, I have seen many meetings analyzing a problem that took a long time to be diagnosed. In hindsight, the problem was obvious. If the people responsible for diagnosis aren’t in the post-mortem meeting, it’s too easy to conclude, “those people were just too stupid to find the problem” instead of focusing on how our tools could make problems more evident and easier to diagnose.

A root cause analysis meeting has a clear problem, leader, and stakeholders. The most important guideline for the meeting itself is that the purpose of the meeting is to learn and to improve, not to assign blame or to vent. Assume that any problem is preventable and is worth preventing. Problems are caused by insufficiently robust systems rather than individual incompetence. Even in the case of a person making a mistake, we have to ask “why do our tools make that mistake so easy to make?”

The heart of the meeting is the analysis itself. For each problem, we want to ask “why did that happen?” and “why wasn’t it prevented by our process?” We do that iteratively until we have at least five levels of analysis. Of course, the number five is not sacrosanct; it’s just a guideline. What’s critical is that we don’t do too few levels, and we don’t do too many. One hundred whys would be overwhelming. But if we stay stuck at the technical parts of the problem, and never uncover the human problems behind them, we’re not going far enough. So I would keep the meeting going until we’re talking about human problems, and preferably system-level problems. For example, a site outage may seem like it was caused by a bad piece of code, but: why was that code written? Why didn’t the person who wrote it know that it would be harmful? Why didn’t our tests/QA/immune system catch and prevent the problem? Why wasn’t it immediately obvious how to fix the problem?

Pay attention to whether people are comfortable “naming names” in the meeting. If people are afraid of blame, they’ll try to phrase statements in vague, generic terms or use the passive voice, as in “a mistake was made” rather than “So-and-so failed to push the right button.” There’s no easy fix to this problem. Trust takes time to build up, and my experience is that it may take months to establish enough trust that people are confident that there won’t be retribution for speaking up candidly. Stay patient, and be on alert for blame-type talk or for post-meeting revenge. I recommend a zero-tolerance policy for these behaviors – otherwise our Five Whys meetings can descend into Five Blames.

Another common issue is the tendency of root causes to sprout branches. Complex problems rarely have only one cause, and looking for the primary cause is easier in theory than in practice. The branching of causes is also a prime target for so-called “anchor draggers” – people who aren’t really on board with the exercise in the first place. An easy way to derail the meeting is to keep insisting that more and more lateral causes be considered, until the team is running around in circles. Even well intentioned people can wreak the same havoc by simply staying over-focused on technical or ancillary issues. Try to stay focused on just one line of inquiry. Remember, Five Whys is not about making an exhaustive survey of all the problems. It’s about quickly identifying the likely root cause. That’s why it’s more important to do Five Whys frequently than to get it exactly right. It’s a very forgiving practice, because the most wasteful problems will keep clamoring for attention. Have faith that you’ll have many more opportunities to tackle them, and don’t get hung up on any particular solution.

Once you’ve found approximately five levels of the problem, which includes at least one or two human-level issues, it’s time to turn to solutions. The overall plan is to make a proportional investment in each of the five levels. The two major guidelines are: don’t do too much, and don’t do nothing. Almost anything in between will work.
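As a rough illustration, the output of the analysis can be written down as a small record: one entry per level, each with its own modest fix. The sketch below is hypothetical. The `Why` class, the "technical"/"human" labels, the `ready_for_solutions` check, and every answer and fix in the sample outage are invented for this example; only the questions echo the ones listed above.

```python
from dataclasses import dataclass


@dataclass
class Why:
    """One level of the analysis. All field names are assumptions for this sketch."""
    question: str
    answer: str
    kind: str  # "technical" or "human" (labels invented for illustration)
    fix: str   # the one proportional fix chosen for this level


def ready_for_solutions(chain, min_levels=5, min_human=1):
    """True once the chain is about five levels deep and reaches a human-level cause.

    Five is only a guideline, so both thresholds are adjustable.
    """
    human_levels = sum(1 for why in chain if why.kind == "human")
    return len(chain) >= min_levels and human_levels >= min_human


# An invented site outage, used purely as sample data.
chain = [
    Why("Why did the site go down?", "A bad piece of code was deployed.",
        "technical", "Revert the change."),
    Why("Why was that code written?", "It was meant to speed up a slow page.",
        "technical", "Add the slow page to the performance checklist."),
    Why("Why didn't the author know it would be harmful?",
        "The author was new and had not been told about the pitfall.",
        "human", "Add a bullet point to the new-engineer wiki page."),
    Why("Why didn't our tests catch it?", "No test covered that code path.",
        "technical", "Write one test for that path."),
    Why("Why wasn't the fix obvious?",
        "Nobody was responsible for keeping the alerts readable.",
        "human", "Spend an hour improving the alert message."),
]

print(ready_for_solutions(chain))      # True: five levels, two of them human
print(ready_for_solutions(chain[:2]))  # False: stopped at the technical layer
```

The point of the check is only to flag an analysis that stopped too early; it says nothing about whether the chosen fixes are the right size.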

For example, I often cite a real example of a problem that has as its root cause a new employee who was not properly trained. I pick that example on purpose, for two reasons: 1) most of the companies I work with deal with this problem and yet 2) almost none of them have any kind of training program in place for new employees. The reason is simple: setting up a training program is seen as too much work to be justified by the problem. Yet in every situation where I have asked, nobody has been tasked with making a realistic estimate, either of the impact of this lack of training or the real costs of the solution. In fact, even the investigation itself is considered too much work. Five Whys is designed to avoid these nebulous arguments. If new employees are causing problems, that will be a routine topic. If those problems are minor, each time it happens we’ll budget a small amount of time to make progress on the solution.

Let’s imagine the ideal solution would be to spend six weeks setting up a training program for new employees. You can almost hear a manager now: “sure, if you want me to spend the next six weeks setting this up, just let me know. It’s just a matter of priorities. If you think it’s more important than everything else I’m working on, go right ahead and find someone else to take over my other responsibilities…” This logic is airtight, and has the effect of preventing any action. But Five Whys gives us an alternative. If we’ve just analyzed a minor problem that involved a new employee, we should make a minor investment in training. To take an extreme example, let’s say we’ve decided to invest no more than one hour in the solution. Even in that case, we can ask the manager involved to simply spend the first hour of the six-week ideal solution. The next time the problem comes up, we’ll do the next hour, and so on.
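To make the arithmetic concrete, here is a minimal sketch of that "one hour per occurrence" budget. The function name and the 40-hour work week (so six weeks is 240 hours) are my assumptions for illustration, not figures from any real team.

```python
IDEAL_TOTAL_HOURS = 6 * 40  # six weeks, assuming a 40-hour week


def hours_invested(occurrences, hours_per_occurrence=1.0,
                   ideal_total_hours=IDEAL_TOTAL_HOURS):
    """Total time spent so far, capped at the size of the ideal solution."""
    return min(occurrences * hours_per_occurrence, ideal_total_hours)


# A problem that rarely recurs gets almost no investment;
# one that keeps coming back steadily earns the full solution.
for n in (1, 5, 20):
    done = hours_invested(n)
    print(f"after {n} occurrences: {done:.0f}h of {IDEAL_TOTAL_HOURS}h")
```

The cap matters only for problems that recur constantly. For everything else, the work stops when the problem stops showing up.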

In fact, at IMVU, we did exactly that. We started with a simple wiki page with a few bullet points of things that new engineers had tripped over recently. As we kept doing root cause analysis, the list grew. In response to Five Whys that noticed that not all new engineers were reading the list, we expanded it into a new engineer curriculum. Soon, each new engineer was assigned a mentor, and we made it part of the mentor’s job to teach the curriculum. Over time, we also made investments in making it easier to get a new engineer set up with their private sandbox, and even dealt with how to make sure they’d have a machine on their desk when they started. The net effect of all this was to make new engineers incredibly productive right away – in most cases, we’d have them deliver code to production on their very first day. We never set out to build a world-class engineering-training process. Five Whys simply helped us eliminate tons of waste by building one.

Returning to the meeting itself, the person running the meeting should lead the team in brainstorming solutions for each of the problems selected. It’s important that the leader be empowered to pick one and only one solution for each problem, and then assign it to someone to get done. Remember that the cost of the solutions is proportional to the problem caused. This should make it easy to get buy-in from other managers or executives. After all, if it’s a severe problem like a site outage, do they really want to be seen as the person getting in the way of solving it? And if it’s a minor problem, are they really going to object to a few hours of extra work here and there, if it’s towards a good cause? My experience is: usually not.

There are no fixed rules for what constitutes a proportional investment. As teams get experience doing Five Whys, they start to develop rules of thumb for what is reasonable and what isn’t. To restate: the key is that all parties, including the non-technical departments, see the investments as reasonable. As long as we don’t veer to either extreme, the 80/20 rule will make sure that we don’t under-invest over the long term. Remember that if something is a serious problem, it will keep coming up over and over in these meetings. Each time, we’ll get to chip away at it, until it’s no longer a problem.

The last element of a good Five Whys process is to share the results of the analysis widely. I generally recommend sending out the results to the whole company, division, or business unit. This accomplishes two important things: it diffuses knowledge throughout the organization, and it provides evidence that the team in question is taking problems seriously. This latter point can eliminate a lot of waste. I have been amazed how many teams have severe inter-departmental trust issues caused by a lack of communication about problems. For example, engineering feels that they are constantly being pressured to take shortcuts that lower the quality of the product. At the same time, the very marketing people who are applying that pressure think the engineering team doesn’t take quality seriously, and doesn’t respond appropriately when their shoddy work leads to customer problems. Sharing Five Whys can alleviate this problem, by letting everyone know exactly how seriously problems are taken. I say exactly, because it may actually reveal that problems are not taken seriously. In fact, I have seen people in other departments sometimes catch sloppy thinking in a Five Whys report. By sharing the analysis widely, that feedback can flow on a peer-to-peer basis, quickly and easily.

Most organizations are unaware of how much time they spend firefighting, reworking old bugs, and generally expending energy on activities that their customers don’t care about. Yet getting a handle on these sources of waste is hard, especially because they are dynamic. By the time a top-down company-wide review figured out the main problems, they’d have shifted to another location. Five Whys allows teams to react much faster and then constantly adapt. Without all that waste in their way, they simply go faster.

(If you’re new to Five Whys, I’m eager to hear your feedback. Does this help you get started? What questions or concerns do you have? Leave your thoughts in a comment. If you’ve tried Five Whys, please share your experiences so far. I’ll do my best to help.)


15 comments:

  1. "Problems are caused by insufficiently robust symptoms"
    symptoms => systems

  2. Great article as always, Eric. Have you considered putting all these into a book?

    One question - at what team size do you think that it makes sense to go through this process formally? In theory it works with any team size, but when you only have 2 or 3 people in the meeting, the formality of the 5-whys seems, intuitively, a bit stifling. Then again, perhaps it's worth it. What do you think? Would you apply this technique in a start-up that was just you and two other guys, for example?

  3. I have not read the previous post(s) on 5 why's (about to now though...), but I wanted to comment and let you know that I thought this was really great stuff. Furthermore, I think most of what you are suggesting is applicable in companies/ industries that are not software development. I work in advertising and this system would work to analyze failures.

    Two questions:
    1) What happens when the root cause of the issue is clearly something that a/the client did?
    2) How can you keep employees from being fearful of these post-mortems? Obviously if you have a sub-par employee who makes too many mistakes over and over he/she has to be terminated ...a couple times that employees see "5-Whys = John Got Fired" and employees may start fearing the 5-Why and its benefits.

    Thanks again for great post.

    Eric
    http://www.pixelmaverick.com

  4. Great article, Eric.

    I have one comment on style: could you break up your articles with subheadings? It would make the article more approachable initially, and easier to scan when I go back to reference it.

    For instance, above the "Let’s imagine the ideal solution would ..." paragraph have the heading "Training New Hires", above the "Returning to the meeting itself ..." have the heading "Chipping away at the problem", etc.

    It's just a thought.

    Other than that, you consistently have good content. Keep it up.

  5. @Anonymous - thanks!

    @Daniel - yes, I think it's worth doing even at very small sizes. You can be a lot less rigid about it, of course, but the fact that we are blind to sources of waste is something we all share. I recently met with a two-person-no-funding team that, after some hesitating, jumped in and tried it. They seem to be liking it.

    (oh, and on the book front, you'll be able to get all of my essays so far in ebook, print-on-demand, and kindle editions - coming soon. let me know if you're interested)

  6. @Wyatt - if you are willing to suggest specific subheads for this article, I will add them. Hopefully I'll learn from the experience and will think of adding them in the future. Deal?

    @Eric -

    1) if the client is the cause of waste, that's good to know. You can ask "how can we serve this client while mitigating all this waste?" or even "is serving this client really worth it?"

    For example, periodically somebody at IMVU suggests that we stop supporting Windows 98 as a platform. It's so old and so buggy that it's a real pain to ship 3D software on it. But since we have a really good idea of the problems it causes, we can do the ROI analysis whenever somebody suggests that. We just add up the revenue we've made in the past few months from Win98 users, and compare to the pain that Win98 has caused as identified in 5Ys. So far, it's been a net win to keep supporting it.

    2) On the trust issue, you simply have to avoid punitive measures altogether. If you have somebody on your team who is not pulling their weight - everybody knows it. 5Ys is not going to change that, although it might give you an opportunity to help that person become more effective. You never know if somebody's incompetence is due to their skills/talents or due to a bad context. You'd be surprised how much more productive they can become if you invest in tools, process, and infrastructure.

  7. Eric,

    Could you please give some concrete examples of Five Whys analysis? I understand the theory, but I am curious what it ends up like in practice once you hit the third or fourth why.

  8. Thanks for a great post.

    I have a question regarding the "chipping at the problem" approach.

    Say solving the issue requires a "big" investment (let's say 20 hours), but after each session of 5-whys I decide to invest only an hour in the solution. All the setup time and context switching may turn this 20-hour task into a 40- or 60-hour task.
    Batching would have saved this setup time, but I invest my time sporadically, so there's a waste involved in this "chipping at the problem" approach also.

    What's your experience with this?

  9. Joseph, you might find these articles helpful:

    http://entrepreneur.venturebeat.com/2009/06/05/palantir-keeps-it-lean-and-mean-on-five-year-journey-from-zero-to-150-employees/

    http://startuplessonslearned.blogspot.com/2009/07/how-to-conduct-five-whys-root-cause.html

    http://www.joelonsoftware.com/items/2008/01/22.html

    If you have a question about a specific technique/situation, let me know. I'm hoping to find other examples to share.

  10. Eitan, that's a real issue. It makes sense to do prevention work in rational batch sizes, and teams that do routine root cause analysis get better at finding the right batch size over time. Luckily, in most prevention situations, even the first few steps in prevention can pay time-savings dividends quickly. So even if the total cost of a given solution is higher when done incrementally, this is offset by the reduced risk of preventing something unimportant.

    Plus, the real driver of context switching overall is firefighting and other bugs.

  11. I have numerous ideas to work on, mostly related to solving the problems I face. But only a limited time.

    I think I should take the similar approach of limited investment to my problems, and when the problem recurs I put another bit of effort to improve my solution. Great approach! A sad state that we don't have more time.

  12. Eric,
    Our site went down for about 30 minutes this weekend. I'm going to do a Five Why's analysis with the tech team today. Will let you know what happens.

    best
    steven
    TakeLessons.com

  13. Hi, Eric. As usual, this is a great post.

    I'm a big fan of doing a Five Whys analysis, and wanted to add one tip. Instead of telling people to look for *the* root cause, I have them search for *a* or *some* root causes.

    My goal with this minor wording change is to avoid arguments over the right hierarchy and focus on the goal, getting some solutions to try. When people are wound up about a problem, and especially when dealing with engineers, it can be easy to analyze past the point of diminishing returns.

  14. Hi Eric..I enjoyed your article. I work for a major pharma company. We use 5Whys (along with fishbone) as part of a 7Step root cause analysis approach. I am constantly amazed about how difficult people find the 5Whys tool. I am no expert myself and believe the tool IS difficult to use. In my mind its so-called simplicity has been totally over-publicised. The difficulty is that people tend to jump steps in the causal chain and start placing solutions as answers to whys. I have yet to see a 5whys that I would consider to be a good one - including anything I have generated myself. I have read extensively on the Toyota approach - they have root cause analysis sensei who challenge every minute step of the analysis until it is correct - I feel we are a billion miles away from that. However your article makes good points about insufficiently robust systems - indeed most root causes lie in management failings - but all too often 'human error' is an easy fall back.

  15. Eric,
    Thank you for this article. Like some of the other commenters, I see mainly obstacles to implementing this in my organization:

    dispersed teams - in some cases not all people are possible to be gotten into the same room, and in many cases this involves working with vendors. I suppose we could ask "why does our relationship with this vendor allow these problems to continually arise?"

    resistance/distrust to any sort of predetermined meeting structure--anything that seems canned or gimmicky. I feel like I would have to introduce this by stealth.

    I'd be interested to see more examples of how you would start to get this off the ground in a company--as a counterpoint to these examples of how it works in a mature form.

    Sam
