Lessons Learned: How to conduct a Five Whys root cause analysis

Thursday, July 2, 2009

How to conduct a Five Whys root cause analysis

In the lean startup workshops, we’ve spent a lot of time discussing the technique of Five Whys. It allows teams to diagnose sources of waste in their development process and continuously improve, reversing the usual trend of teams getting slower over time. With Five Whys, teams can accelerate, even as they scale.

In a previous post, I outlined the benefits of Five Whys: that it allows you to make large investments in infrastructure incrementally, takes advantage of the 80/20 rule to free up resources immediately, and helps organizations become built to learn. Today, I want to talk about the mechanics of Five Whys in greater detail.

First, a caveat. My intention is to describe a full working process, similar to what I’ve seen at IMVU and other lean startups. But as with all process changes, it should not be interpreted as a spec to be implemented right away. In fact, trying too much is just as dangerous as not doing enough. Just as the lean movement has taught us to build incrementally, it has also taught us to attempt process changes incrementally. You need to transition to a workflow of small batches – in small batches.

Five Whys involves holding meetings immediately following the resolution of problems the company is facing. These problems can be anything: development mistakes, site outages, marketing program failures, or even internal missed schedules. Any time something unexpected happens, we could do some root cause analysis. Yet it’s helpful to begin by tackling a specific class of problems. For example, a common place to get started is with this rule: any time we have a site outage of any duration, we will hold a post-mortem meeting immediately afterwards.

The first step is to identify three things about the meeting: what problem we are trying to solve, who will run the meeting, and who was affected by the problem. For the problem, it’s essential to hold the meeting immediately following a specific symptom. Five Whys rarely works for general abstract problems like “our product is buggy” or “our team moves too slow.” Instead, we want to hold it for a specific symptom, like “we missed the Jan 6 deadline by two weeks” or “we had a site outage on Nov 10.” Have faith that if a large general problem is really severe, it will be generating many symptoms that we can use to achieve a general solution.

Always explicitly identify the person running the meeting. Some organizations like to appoint a “Five Whys master” for a specific area of responsibility. For example, at IMVU we had masters appointed for topics like site scalability or unit test failures. The advantage of having an expert run each meeting is that this person can get better and better at helping the team find good solutions. The downside is the extra coordination required to get that person in the room each time. Either way can work. In any event, nobody should hold a master position for too long. Rotation is key to avoid having a situation where one person becomes a bottleneck or single point of failure.

The person running the meeting does not have to be a manager or executive. They do need to have the authority necessary to assign tasks across the organization. That’s because Five Whys will often pierce the illusion of separate departments and discover the human problems that lurk beneath the surface of supposedly technical problems. In order to make Five Whys successful, the person running the meeting has to have the backing of an executive sponsor who has the requisite authority to back them up if they wind up stepping on political landmines. But this executive sponsor doesn’t need to be in the room – what matters is that everyone in the room understands that the person running the meeting has the authority to do so. This means that if you are trying to introduce Five Whys into an organization that is not yet bought in, you have to start small.

In order to maximize the odds of success, we want to have everyone affected by the problem in the meeting. That means having a representative of every department or function that was affected. When customers are affected, try to have someone who experienced the customer problem first-hand, like the customer service rep who took the calls from angry customers. At a minimum, you have to have the person who discovered the problem. Otherwise, key details are likely to be missed. For example, I have seen many meetings analyzing a problem that took a long time to be diagnosed. In hindsight, the problem was obvious. If the people responsible for diagnosis aren’t in the post-mortem meeting, it’s too easy to conclude, “those people were just too stupid to find the problem” instead of focusing on how our tools could make problems more evident and easier to diagnose.

A root cause analysis meeting has a clear problem, leader, and stakeholders. The most important guideline for the meeting itself is that the purpose of the meeting is to learn and to improve, not to assign blame or to vent. Assume that any problem is preventable and is worth preventing. Problems are caused by insufficiently robust systems rather than individual incompetence. Even in the case of a person making a mistake, we have to ask “why do our tools make that mistake so easy to make?”

The heart of the meeting is the analysis itself. For each problem, we want to ask “why did that happen?” and “why wasn’t it prevented by our process?” We do that iteratively until we have at least five levels of analysis. Of course, the number five is not sacrosanct; it’s just a guideline. What’s critical is that we don’t do too few levels, and we don’t do too many. One hundred whys would be overwhelming. But if we stay stuck at the technical parts of the problem, and never uncover the human problems behind them, we’re not going far enough. So I would keep the meeting going until we’re talking about human problems, and preferably system-level problems. For example, a site outage may seem like it was caused by a bad piece of code, but: why was that code written? Why didn’t the person who wrote it know that it would be harmful? Why didn’t our tests/QA/immune system catch and prevent the problem? Why wasn’t it immediately obvious how to fix the problem?
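
As a rough illustration (not something taken from the IMVU process itself), the chain of questions can be written down as a small data structure, which makes it easy to see when the analysis has moved past purely technical causes. The names `Kind`, `Why`, and `ready_for_solutions`, the default of five levels, and every answer below are hypothetical; only the questions follow the outage example above.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    TECHNICAL = "technical"
    HUMAN = "human"
    SYSTEM = "system"


@dataclass
class Why:
    question: str
    answer: str  # hypothetical answers, invented for illustration
    kind: Kind


def ready_for_solutions(chain: list[Why], min_levels: int = 5) -> bool:
    """True once the chain is deep enough AND has reached a non-technical cause."""
    reached_people = any(w.kind is not Kind.TECHNICAL for w in chain)
    return len(chain) >= min_levels and reached_people


outage = [
    Why("Why did the site go down?",
        "A bad piece of code reached production.", Kind.TECHNICAL),
    Why("Why was that code written?",
        "It was part of a feature shipped in a hurry.", Kind.TECHNICAL),
    Why("Why didn't the author know it would be harmful?",
        "They were new and had not been trained on this subsystem.", Kind.HUMAN),
    Why("Why didn't our tests catch it?",
        "No test covers this code path.", Kind.TECHNICAL),
    Why("Why wasn't it obvious how to fix the problem?",
        "Nothing alerts us to this failure mode.", Kind.SYSTEM),
]

print(ready_for_solutions(outage))      # full chain: deep enough, reaches people
print(ready_for_solutions(outage[:2]))  # two technical levels only: keep asking
```

The check is deliberately loose. Five is a guideline, so `min_levels` is a parameter, and the only hard condition is that at least one level is about people or systems rather than code.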

Pay attention to whether people are comfortable “naming names” in the meeting. If people are afraid of blame, they’ll try to phrase statements in vague, generic terms or use the passive voice, as in “a mistake was made” rather than “So-and-so failed to push the right button.” There’s no easy fix to this problem. Trust takes time to build up, and my experience is that it may take months to establish enough trust that people are confident that there won’t be retribution for speaking up candidly. Stay patient, and be on alert for blame-type talk or for post-meeting revenge. I recommend a zero-tolerance policy for these behaviors – otherwise our Five Whys meetings can descend into Five Blames.

Another common issue is the tendency of root causes to sprout branches. Complex problems rarely have only one cause, and looking for the primary cause is easier in theory than in practice. The branching of causes is also a prime target for so-called “anchor draggers” – people who aren’t really on board with the exercise in the first place. An easy way to derail the meeting is to keep insisting that more and more lateral causes be considered, until the team is running around in circles. Even well-intentioned people can wreak the same havoc by simply staying over-focused on technical or ancillary issues. Try to stay focused on just one line of inquiry. Remember, Five Whys is not about making an exhaustive survey of all the problems. It’s about quickly identifying the likely root cause. That’s why it’s more important to do Five Whys frequently than to get it exactly right. It’s a very forgiving practice, because the most wasteful problems will keep clamoring for attention. Have faith that you’ll have many more opportunities to tackle them, and don’t get hung up on any particular solution.

Once you’ve found approximately five levels of the problem, including at least one or two human-level issues, it’s time to turn to solutions. The overall plan is to make a proportional investment in each of the five levels. The two major guidelines are: don’t do too much, and don’t do nothing. Almost anything in between will work.

I often cite a real example of a problem whose root cause is a new employee who was not properly trained. I pick that example on purpose, for two reasons: 1) most of the companies I work with deal with this problem and yet 2) almost none of them have any kind of training program in place for new employees. The reason is simple: setting up a training program is seen as too much work to be justified by the problem. Yet in every situation where I have asked, nobody has been tasked with making a realistic estimate, either of the impact of this lack of training or the real costs of the solution. In fact, even the investigation itself is considered too much work. Five Whys is designed to avoid these nebulous arguments. If new employees are causing problems, that will be a routine topic. If those problems are minor, each time it happens we’ll budget a small amount of time to make progress on the solution.

Let’s imagine the ideal solution would be to spend six weeks setting up a training program for new employees. You can almost hear a manager now: “sure, if you want me to spend the next six weeks setting this up, just let me know. It’s just a matter of priorities. If you think it’s more important than everything else I’m working on, go right ahead and find someone else to take over my other responsibilities…” This logic is airtight, and has the effect of preventing any action. But Five Whys gives us an alternative. If we’ve just analyzed a minor problem that involved a new employee, we should make a minor investment in training. To take an extreme example, let’s say we’ve decided to invest no more than one hour in the solution. Even in that case, we can ask the manager involved to simply spend the first hour of the six-week ideal solution. The next time the problem comes up, we’ll do the next hour, and so on.
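
To make the arithmetic concrete, here is a minimal sketch of chipping away at the six-week ideal solution one incident at a time. The `Remedy` class, the 40-hour work week, and the incident descriptions and budgets are assumptions made up for this illustration, not figures from any real team.

```python
from dataclasses import dataclass, field

HOURS_PER_WEEK = 40  # assumption: a 40-hour work week


@dataclass
class Remedy:
    name: str
    ideal_hours: float
    invested_hours: float = 0.0
    log: list[tuple[str, float]] = field(default_factory=list)

    def invest(self, incident: str, budget_hours: float) -> float:
        """Spend up to budget_hours on the next slice of the ideal solution."""
        spent = min(budget_hours, self.ideal_hours - self.invested_hours)
        self.invested_hours += spent
        self.log.append((incident, spent))
        return spent

    @property
    def remaining_hours(self) -> float:
        return self.ideal_hours - self.invested_hours


training = Remedy("new-employee training program", ideal_hours=6 * HOURS_PER_WEEK)

# Each Five Whys meeting that touches this root cause buys a slice of work,
# sized to the severity of the incident that triggered it (hypothetical sizes).
training.invest("minor bug introduced by a new hire", budget_hours=1)
training.invest("site outage involving a new hire", budget_hours=8)

print(training.invested_hours, "hours invested,", training.remaining_hours, "remaining")
```

After one minor and one severe incident, 9 of the 240 hours have been spent, and nobody had to drop their other responsibilities for six weeks. If the problem never recurs, the remaining hours are never spent at all.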

In fact, at IMVU, we did exactly that. We started with a simple wiki page with a few bullet points of things that new engineers had tripped over recently. As we kept doing root cause analysis, the list grew. In response to Five Whys that noticed that not all new engineers were reading the list, we expanded it into a new engineer curriculum. Soon, each new engineer was assigned a mentor, and we made it part of the mentor’s job to teach the curriculum. Over time, we also made investments in making it easier to get a new engineer set up with their private sandbox, and even dealt with how to make sure they’d have a machine on their desk when they started. The net effect of all this was to make new engineers incredibly productive right away – in most cases, we’d have them deliver code to production on their very first day. We never set out to build a world-class engineering-training process. Five Whys simply helped us eliminate tons of waste by building one.

Returning to the meeting itself, the person running the meeting should lead the team in brainstorming solutions for each of the problems selected. It’s important that the leader be empowered to pick one and only one solution for each problem, and then assign it to someone to get done. Remember that the cost of the solutions is proportional to the problem caused. This should make it easy to get buy-in from other managers or executives. After all, if it’s a severe problem like a site outage, do they really want to be seen as the person getting in the way of solving it? And if it’s a minor problem, are they really going to object to a few hours of extra work here and there, if it’s towards a good cause? My experience is: usually not.

There are no fixed rules for what constitutes a proportional investment. As teams get experience doing Five Whys, they start to develop rules of thumb for what is reasonable and what isn’t. To restate: the key is that all parties, including the non-technical departments, see the investments as reasonable. As long as we don’t veer to either extreme, the 80/20 rule will make sure that we don’t under-invest over the long term. Remember that if something is a serious problem, it will keep coming up over and over in these meetings. Each time, we’ll get to chip away at it, until it’s no longer a problem.

The last element of a good Five Whys process is to share the results of the analysis widely. I generally recommend sending out the results to the whole company, division, or business unit. This accomplishes two important things: it diffuses knowledge throughout the organization, and it provides evidence that the team in question is taking problems seriously. This latter point can eliminate a lot of waste. I have been amazed how many teams have severe inter-departmental trust issues caused by a lack of communication about problems. For example, engineering feels that they are constantly being pressured to take shortcuts that lower the quality of the product. At the same time, the very marketing people who are applying that pressure think the engineering team doesn’t take quality seriously, and doesn’t respond appropriately when their shoddy work leads to customer problems. Sharing Five Whys can alleviate this problem, by letting everyone know exactly how seriously problems are taken. I say exactly, because it may actually reveal that problems are not taken seriously. In fact, I have seen people in other departments sometimes catch sloppy thinking in a Five Whys report. By sharing the analysis widely, that feedback can flow on a peer-to-peer basis, quickly and easily.

Most organizations are unaware of how much time they spend firefighting, reworking old bugs, and generally expending energy on activities that their customers don’t care about. Yet getting a handle on these sources of waste is hard, especially because they are dynamic. By the time a top-down company-wide review figured out the main problems, they’d have shifted to another location. Five Whys allows teams to react much faster and then constantly adapt. Without all that waste in their way, they simply go faster.

(If you’re new to Five Whys, I’m eager to hear your feedback. Does this help you get started? What questions or concerns do you have? Leave your thoughts in a comment. If you’ve tried Five Whys, please share your experiences so far. I’ll do my best to help.)

15 comments:

Anonymous, July 3, 2009 at 12:29 AM
"Problems are caused by insufficiently robust symptoms"
symptoms => systems
Daniel Tenner, July 3, 2009 at 3:11 AM
Great article as always, Eric. Have you considered putting all these into a book?

One question - at what team size do you think that it makes sense to go through this process formally? In theory it works with any team size, but when you only have 2 or 3 people in the meeting, the formality of the 5-whys seems, intuitively, a bit stifling. Then again, perhaps it's worth it. What do you think? Would you apply this technique in a start-up that was just you and two other guys, for example?
Eric Williamson, July 3, 2009 at 5:13 AM
I have not read the previous post(s) on 5 why's (about to now though...), but I wanted to comment and let you know that I thought this was really great stuff. Furthermore, I think most of what you are suggesting is applicable in companies/ industries that are not software development. I work in advertising and this system would work to analyze failures.

Two questions:
1) What happens when the root cause of the issue is clearly something that a/the client did?
2) How can you keep employees from being fearful of these post-mortems? Obviously if you have a sub-par employee who makes too many mistakes over and over he/she has to be terminated ...a couple times that employees see "5-Whys = John Got Fired" and employees may start fearing the 5-Why and its benefits.

Thanks again for great post.

Eric
http://www.pixelmaverick.com
Wyatt O'Day, July 3, 2009 at 7:49 AM
Great article, Eric.

I have one comment on style: could you break up your articles with subheadings? It would make the article more approachable initially, and easier to scan when I go back to reference it.

For instance, above the "Let’s imagine the ideal solution would ..." paragraph have the heading "Training New Hires", above the "Returning to the meeting itself ..." have the heading "Chipping away at the problem", etc.

It's just a thought.

Other than that, you consistently have good content. Keep it up.
Eric, July 3, 2009 at 8:25 AM
@Anonymous - thanks!

@Daniel - yes, I think it's worth doing even at very small sizes. You can be a lot less rigid about it, of course, but the fact that we are blind to sources of waste is something we all share. I recently met with a two-person-no-funding team that, after some hesitating, jumped in and tried it. They seem to be liking it.

(oh, and on the book front, you'll be able to get all of my essays so far in ebook, print-on-demand, and kindle editions - coming soon. let me know if you're interested)
Eric, July 3, 2009 at 8:31 AM
@Wyatt - if you are willing to suggest specific subheads for this article, I will add them. Hopefully I'll learn from the experience and will think of adding them in the future. Deal?

@Eric -

1) if the client is the cause of waste, that's good to know. You can ask "how can we serve this client while mitigating all this waste?" or even "is serving this client really worth it?"

For example, periodically somebody at IMVU suggests that we stop supporting Windows 98 as a platform. It's so old and so buggy that it's a real pain to ship 3D software on it. But since we have a really good idea of the problems it causes, we can do the ROI analysis whenever somebody suggests that. We just add up the revenue we've made in the past few months from Win98 users, and compare to the pain that Win98 has caused as identified in 5Ys. So far, it's been a net win to keep supporting it.

2) On the trust issue, you simply have to avoid punitive measures altogether. If you have somebody on your team who is not pulling their weight - everybody knows it. 5Ys is not going to change that, although it might give you an opportunity to help that person become more effective. You never know if somebody's incompetence is due to their skills/talents or due to a bad context. You'd be surprised how much more productive they can become if you invest in tools, process, and infrastructure.
Joseph Turian, July 3, 2009 at 1:27 PM
Eric,

Could you please give some concrete examples of Five Whys analysis? I understand the theory, but I am curious what it ends up like in practice once you hit the third or fourth why.
Eitan Burcat, July 3, 2009 at 2:27 PM
Thanks for a great post.

I have a question regarding the "chipping at the problem" approach.

If solving the issue requires a "big" investment (let's say 20 hours), but after each session of 5-whys I decide to invest only an hour in the solution, all the setup time and context switching may turn this 20-hour task into a 40- or 60-hour task.
Batching would have saved this setup time, but I invest my time sporadically, so there's waste involved in this "chipping at the problem" approach too.

What's your experience with this?
Eric, July 3, 2009 at 3:35 PM
Joseph, you might find these articles helpful:

http://entrepreneur.venturebeat.com/2009/06/05/palantir-keeps-it-lean-and-mean-on-five-year-journey-from-zero-to-150-employees/

http://startuplessonslearned.blogspot.com/2009/07/how-to-conduct-five-whys-root-cause.html

http://www.joelonsoftware.com/items/2008/01/22.html

If you have a question about a specific technique/situation, let me know. I'm hoping to find other examples to share.
Eric, July 3, 2009 at 3:45 PM
Eitan, that's a real issue. It makes sense to do prevention work in rational batch sizes, and teams that do routine root cause analysis get better at finding the right batch size over time. Luckily, in most prevention situations, even the first few steps in prevention can pay time-savings dividends quickly. So even if the total cost of a given solution is higher when done incrementally, this is offset by the reduced risk of preventing something unimportant.

Plus, the real driver of context switching overall is firefighting and other bugs.
Unknown, July 3, 2009 at 11:23 PM
I have numerous ideas to work on, mostly related to solving the problems I face. But only a limited time.

I think I should take the similar approach of limited investment to my problems, and when the problem recurs I put another bit of effort to improve my solution. Great approach! A sad state that we don't have more time.
Anonymous, July 6, 2009 at 7:48 AM
Eric,
Our site went down for about 30 minutes this weekend. I'm going to do a Five Whys analysis with the tech team today. Will let you know what happens.

best
steven
TakeLessons.com
William Pietri, July 10, 2009 at 11:55 PM
Hi, Eric. As usual, this is a great post.

I'm a big fan of doing a Five Whys analysis, and wanted to add one tip. Instead of telling people to look for *the* root cause, I have them search for *a* or *some* root causes.

My goal with this minor wording change is to avoid arguments over the right hierarchy and focus on the goal, getting some solutions to try. When people are wound up about a problem, and especially when dealing with engineers, it can be easy to analyze past the point of diminishing returns.
DOC, September 25, 2009 at 7:03 AM
Hi Eric. I enjoyed your article. I work for a major pharma company. We use 5Whys (along with fishbone) as part of a 7Step root cause analysis approach. I am constantly amazed about how difficult people find the 5Whys tool. I am no expert myself and believe the tool IS difficult to use. In my mind its so-called simplicity has been totally over publicised. The difficulty is that people tend to jump steps in the causal chain and start placing solutions as answers to whys. I have yet to see a 5whys that I would consider to be a good one - including anything I have generated myself. I have read extensively on the Toyota approach - they have root cause analysis sensei who challenge every minute step of the analysis until it is correct - I feel we are a billion miles away from that. However your article makes good points about insufficiently robust systems - indeed most root causes lie in management failings - but all too often 'human error' is an easy fall back.
sam, March 3, 2010 at 7:32 PM
Eric,
Thank you for this article. Like some of the other commenters, I see mainly obstacles to implementing this in my organization:

dispersed teams - in some cases it is not possible to get everyone into the same room, and in many cases this involves working with vendors. I suppose we could ask "why does our relationship with this vendor allow these problems to continually arise?"

resistance/distrust to any sort of predetermined meeting structure--anything that seems canned or gimmicky. I feel like I would have to introduce this by stealth.

I'd be interested to see more examples of how you would start to get this off the ground in a company--as a counterpoint to these examples of how it works in a mature form.

Sam