Monday, September 15, 2008

The one line split-test, or how to A/B all the time

Split-testing is a core lean startup discipline, and it's one of those rare topics that comes up just as often in a technical context as in a business-oriented one when I'm talking to startups. In this post I hope to talk about how to do it well, in terms appropriate for both audiences.

First of all, why split-test? In my experience, the majority of changes we make to products have no effect at all on customer behavior. This can be hard news to accept, and it's one of the major reasons people avoid split-testing. Who among us really wants to find out that our hard work is for nothing? Yet building something nobody wants is the ultimate form of waste, and the only way to get better at avoiding it is to get regular feedback. Split-testing is the best way I know to get that feedback.

My approach to split-testing is to try to make it easy in two ways: incredibly easy for the implementers to create the tests and incredibly easy for everyone to understand the results. The goal is to have split-testing be a continuous part of our development process, so much so that it is considered a completely routine part of developing a new feature. In fact, I've seen this approach work so well that it would be considered weird and kind of silly for anyone to ship a new feature without subjecting it to a split-test. That's when this approach can pay huge dividends.

Let's start with the reporting side of the equation. We want a simple report format that anyone can understand, and that is generic enough that the same report can be used for many different tests. I usually use a "funnel report" that looks like this:

            Control        Hypothesis A   Hypothesis B
Registered  1000 (100%)    1000 (100%)    500 (100%)
Downloaded   650 (65%)      750 (75%)     200 (40%)
Chatted      350 (35%)      350 (35%)     100 (20%)
Purchased    100 (10%)      100 (10%)      25 (5%)

In this case, you could run the report for any time period. The report shows what happened to customers who registered in that period (a so-called cohort analysis): for each cohort, we can see what percentage went on to do each action we care about. This particular report is about new customers, but you can build the same funnel for any sequence of actions.
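To make this concrete, here is a minimal sketch (in Python, with hypothetical event data and step names) of how such a funnel report can be computed from a raw event log, counting distinct users per hypothesis at each step:

```python
from collections import defaultdict

# Hypothetical event log: (user_id, hypothesis, action) tuples.
EVENTS = [
    ("u1", "control", "registered"), ("u1", "control", "downloaded"),
    ("u2", "control", "registered"),
    ("u3", "design1", "registered"), ("u3", "design1", "downloaded"),
    ("u3", "design1", "chatted"),
]

FUNNEL = ["registered", "downloaded", "chatted"]

def funnel_report(events, funnel):
    """For each hypothesis, count how many distinct users reached each
    funnel step, expressed as a percentage of that hypothesis's cohort."""
    reached = defaultdict(lambda: defaultdict(set))
    for user, hypothesis, action in events:
        reached[hypothesis][action].add(user)
    report = {}
    for hypothesis, steps in reached.items():
        base = len(steps[funnel[0]]) or 1  # cohort = users who did step one
        report[hypothesis] = {
            step: (len(steps[step]), 100.0 * len(steps[step]) / base)
            for step in funnel
        }
    return report
```

Printing each hypothesis in funnel order yields rows in the same count (percentage) shape as the table above.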

If you take a look at the dummy data above, you'll see that Hypothesis A is clearly better than Hypothesis B, because it beats B at each stage of the funnel. But compared to control, it only wins through the "Chatted" stage. This kind of result is typical when you ship a redesign of some part of your product: the new design improves on the old one in several ways, but those improvements don't translate all the way through the funnel. Usually, I think that means you've lost some good aspect of the old design. In other words, you're not done with your redesign yet. The designers might be telling you that the new design looks much better than the old one, and that's probably true. But it's worth conducting some more experiments to find a new design that beats the old one all the way through.

In my previous job, this led us to confront the disappointing reality that sometimes customers actually prefer an uglier design to a pretty one. Without split-testing, your product tends to get prettier over time. With split-testing, it tends to get more effective.

One last note on reporting. Sometimes it makes sense to measure the micro-impact of a micro-change. For example, by making this button green, did more people click on it? But in my experience this is not useful most of the time. That green button was part of a customer flow, a series of actions you want customers to complete for some business reason. If it's part of a viral loop, it's probably trying to get them to invite more friends (on average). If it's part of an e-commerce site, it's probably trying to get them to buy more things. Whatever its purpose, try measuring it only at the level that you care about. Focus on the output metrics of that part of the product, and the problem becomes a lot clearer. It's one of those situations where more data can impede learning.

I had the opportunity to pioneer this approach to funnel analysis at IMVU, where it became a core part of our customer development process. To promote this metrics discipline, we would present the full funnel to our board (and advisers) at the end of every development cycle. It was actually my co-founder Will Harvey who taught me to present this data in the simple format we've discussed in this post. And we were fortunate to have Steve Blank, the originator of customer development, on our board to keep us honest.

To make split-testing pervasive, it has to be incredibly easy. With an online service, we can make it as easy to do a split-test as to not do one. Whenever you are developing a new feature, or modifying an existing feature, you already have a split-test situation: you have the product as it will exist (in your mind), and the product as it exists already. The only change you have to get used to as you start to code in this style is to wrap your changes in a simple one-line condition. Here's what the one-line split-test looks like in pseudocode:

if( setup_experiment(...) == "control" ) {
    // do it the old way
} else {
    // do it the new way
}
The call to setup_experiment has to do all of the work, which for a web application involves a sequence something like this:
  1. Check if this experiment exists. If not, make an entry in the experiments list that includes the hypotheses included in the parameters of this call.
  2. Check if the currently logged-in user is part of this experiment already. If she is, return the name of the hypothesis she was exposed to before.
  3. If the user is not part of this experiment yet, pick a hypothesis using the weightings passed in as parameters.
  4. Make a note of which hypothesis this user was exposed to. In the case of a registered user, this could be part of their permanent data. In the case of a not-yet-registered user, you could record it in their session state (and translate it to their permanent state when they do register).
  5. Return the name of the hypothesis chosen or assigned.
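The five steps above can be sketched in a few lines. This is an illustrative Python version, not the post's actual implementation; the `user_id` parameter and the two in-memory stores are assumptions standing in for a real app's database and session layer:

```python
import random

# Hypothetical in-memory stores; a real web app would use its
# database and session state instead.
EXPERIMENTS = {}   # experiment name -> list of (hypothesis, weight)
ASSIGNMENTS = {}   # (user_id, experiment name) -> hypothesis

def setup_experiment(user_id, name, hypotheses, rng=random):
    """Return the hypothesis this user should see, following the
    post's five steps: register the experiment, reuse any prior
    assignment, otherwise pick by weight and record the choice."""
    # 1. Register the experiment and its hypotheses on first sight.
    EXPERIMENTS.setdefault(name, hypotheses)
    # 2. If this user was already assigned, keep them in that branch
    #    so they never see mixed behavior.
    key = (user_id, name)
    if key in ASSIGNMENTS:
        return ASSIGNMENTS[key]
    # 3. Pick a hypothesis using the weightings passed in.
    names = [h for h, _ in hypotheses]
    weights = [w for _, w in hypotheses]
    choice = rng.choices(names, weights=weights, k=1)[0]
    # 4. Record the assignment (session or permanent user data in practice).
    ASSIGNMENTS[key] = choice
    # 5. Return the chosen hypothesis.
    return choice
```

A call like `setup_experiment("u42", "new_design", [("control", 50), ("design1", 50)])` returns "control" or "design1", and always the same answer for the same user thereafter.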
From the point of view of the caller of the function, they just pass in the name of the experiment and its various hypotheses. They don't have to worry about reporting, or assignment, or weighting, or, well, anything else. They just ask "which hypothesis should I show?" and get the answer back as a string. Here's what a more fleshed-out example might look like in PHP:

$hypothesis = setup_experiment("FancyNewDesign1.1",
                  array(array("control", 50),
                        array("design1", 50)));
if( $hypothesis == "control" ) {
    // do it the old way
} elseif( $hypothesis == "design1" ) {
    // do it the fancy new way
}
In this example, we have a simple even 50-50 split test between the way it was (called "control") and a new design (called "design1").

Now, it may be that these code examples have scared off our non-technical friends. But for those who persevere, I hope this will prove helpful as an example you can show to your technical team. Most of the time when I am talking to a mixed team with both technical and business backgrounds, the technical people start worrying that this approach will mean massive amounts of new work for them. But the discipline of split-testing should be just the opposite: a way to save massive amounts of time. (See "Ideas. Code. Data. Implement. Measure. Learn" for more on why these savings are so valuable.)

Hypothesis testing vs hypothesis generation
I have sometimes opined that split-testing is the "gold standard" of customer feedback. This gets me into trouble, because it conjures up for some the idea that product development is simply a rote mechanical exercise of linear optimization: you just constantly test little micro-changes and follow a hill-climbing algorithm to build your product. This is not what I have in mind. Split-testing is ideal when you want to put your ideas to the test, to find out whether what you think is really what customers want. But where do those ideas come from in the first place? You need to make sure you don't shy away from trying bold new things, using some combination of your vision and in-depth customer conversations to come up with the next idea to try.

Split-testing doesn't have to be limited to micro-optimizations, either. You can use it to test out large changes as well as small. That's why it's important to keep the reporting focused on the macro statistics that you care about. Sometimes, small changes make a big difference. Other times, large changes make no difference at all. Split-testing can help you tell which is which.

Further reading
The best paper I have read on split-testing is "Practical Guide to Controlled Experiments on the Web: Listen to Your Customers not to the HiPPO" - it describes the techniques and rationale used for experiments at Amazon. One of the key lessons they emphasize is that, in the absence of data about what customers want, companies generally revert to the Highest Paid Person's Opinion (hence, HiPPO). But an even more important idea is the discipline to insist that any product change that doesn't move metrics in a positive direction should be reverted. Even if the change is "only neutral" and you really, really, really like it better, force yourself (and your team) to go back to the drawing board and try again. When you started working on that change, surely you had some idea in mind of what it would accomplish for your business. Check your assumptions: what went wrong? Why did customers like your change so much that they didn't change their behavior one iota?


  1. Thanks for writing this article. I found your specific insights on a/b testing to be quite valuable.

Max Levchin of Slide and PayPal has noted that 10% of Slide's headcount is devoted to metrics only. This means optimizing everything possible in order to attract the maximum number of users for the longest amount of time.

    In traditional marketing there are three main objectives. These involve persuading people to:

    1. Buy a higher quantity
    2. of higher priced items
    3. More often

    This logic can be extrapolated to the web world to mean: more users, visiting your site more often, for longer amounts of time to maximize advertising exposure.

    Multivariate and A/B split testing combined with vision and lots of customer interactions are the best ways I know of to make something people want.

  2. Hi Eric! It's cool to hear you talk about this, because I just spent a week working on the experiment system at IMVU. What motivated this work was what I imagine is a common pitfall in experiment systems like the one you outlined:

    The experiment branch assignment that chooses hypothesis vs control for each customer needs to know what probabilities it gives each group.
    Every time you change those probabilities after the experiment is live, you bias the experiment.

To explain why, I'll use a pathological example. Suppose your new experiment starts with 100% of users in the control branch. This is fairly common, because you're not ready to start the experiment yet. Then, a week later, you switch it to 50/50. But wait! During that first week, everyone was getting assigned to the control branch. Since you don't want to give anyone mixed behavior, all those users are stuck there. Only new users get the 50/50 split, so your hypothesis branch overrepresents new users and your control branch overrepresents existing ones.

  3. Hitchens,

I don't see how that matters. When you test for statistical significance, that all falls out.

In your example the control samples will have a lower variance than the test sample; that's all.

  4. jesse,

    What you said is true, but the bias that hitchens is talking about would still exist if you include in your final measurement ALL the visitors who were assigned into either group.

    I think the change hitchens is suggesting, and which you may already be assuming, is pretty straightforward. In hitchens' example, where in month 1, 100% of visitors were assigned into the control group, and in month 2, the split is 50/50 between control and your new hypothesis, you'd get a biased result if at the end of month 2 you examined all the visitors' actions from both months 1 and 2. You remove the bias if you only look at the performance of visitors who were assigned into a group during month 2.

Since you're looking for big improvements, the results should jump off the page, and slight changes in variance and other statistical measures should not be a serious barrier to learning.

    Of course, when you set up your A/B comparison, you need critical thinking to ensure you are not making silly mistakes.
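For readers who want a quick numerical check on whether a difference "jumps off the page," here is a minimal sketch (in Python, not from the original post) of the pooled two-proportion z-test, applied to the 65% vs. 75% second step of the dummy funnel table:

```python
import math

def two_proportion_z(successes_a, n_a, successes_b, n_b):
    """Pooled two-proportion z statistic for comparing conversion
    rates between two branches of a split test."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se

# Second funnel step from the dummy table: Hypothesis A 750/1000
# vs. control 650/1000.
z = two_proportion_z(750, 1000, 650, 1000)
significant = abs(z) > 1.96   # ~95% confidence, two-sided
```

Here z comes out near 4.9, well past the 1.96 threshold, so this particular difference would indeed jump off the page.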

  6. Great post Eric! I saw you speak yesterday at the FB garage and it reminded me of this article.

Our AB stuff always seemed to get pushed back a bit due to higher priority needs. So when we started chatting about AB testing here, the discussion led to how fast an AB testing system could be implemented. To rectify this, a wager was made for a Slurpee from 7-11 if a working AB test system could be completed in under 30 minutes.

The results... Had to drop down to a medium Slurpee for an extra 10 minutes, but it was only 40 minutes from "Experiment" table creation to having a working AB testing system as described. It's not ready for production yet (it's missing nice reporting), but it goes to show there's no reason NOT to implement and start using something like this ASAP.

    Thanks for the inspiration (and the slurpee)!

If you are trying to find a product that sells, then each test would require a different product to be created (or at least a different sales page). In many cases an example is required - a demo, or in my case a video guitar lesson - along with Google ads for quick test data. Maybe some things can be sold with no sample for customers to see whatsoever.

This adds to the complexity of decision making, since you don't know whether it's the ad, the sales page, the product, the keywords, and so on.

  8. I like the theory of A/B testing, but don't understand how it could be used for much more than cosmetic changes such as text and layout.

    For example, I have a service that pays people money for performing an online activity. For each user's reported activity, my service sends them a check every thirty days. Feedback at first was truly adoring and uptake was huge. Two years ago.

    Now new sign-ups are flat, even though the efficiency and reliability of my service has improved ten-fold. I'm thinking of building an online administrative section of my service where users can log in and actually see how much money they have earned historically and are earning in the current period. It will take more work to build this section than it took to build my original service. How can I use split-testing to determine whether I should build this new section before I actually spend time and money on it?

  9. Hi Todd,
    I think you're right that A/B testing is about optimization (choosing which option is optimal) - whereas you are asking a different question "Should I build an admin system at all?".

    This article on product discovery might be helpful -

You still need quick feedback before committing a lot of resources to the opinion/guess/hunch that building an admin system is the right thing to do.