After talking with Barry O’Reilly, the creator of this format, I realized that I often conflate a hypothesis format with experiment design. This is a fair criticism, so let’s get clear on the difference.

Google tells me that a hypothesis is defined as follows:

a supposition or proposed explanation made on the basis of limited evidence as a starting point for further investigation.

Product teams that adopt an experimental mindset start with hypotheses rather than assuming their beliefs are facts.

Experiment design, on the other hand, is the plan that a product team puts in place to test a specific hypothesis.

I like that the “We believe…” hypothesis format is simple enough that it encourages teams to commit their beliefs to paper and encourages them to treat their beliefs as suppositions rather than as assumed facts.

My concerns are that the format encourages teams to test the wrong things and it doesn’t require that teams get specific enough to lead to sound experiment design.

Test Specific Assumptions, Not Ideas

The “We believe…” format does encourage teams to think about outcomes and particularly how to measure them. This is good. Too many teams still think producing features is adequate.

But I don’t like that this format starts with a statement about a capability. This keeps us fixated on our ideas, whereas we are better off identifying our key assumptions.

When we test an idea, we get stuck asking, “Will this feature work or not?” The best way to answer that question is to build it and test it. However, this requires that we spend the time to build the feature before we learn whether or not it will work.

Additionally, this is a “whether or not” question. Chip and Dan Heath remind us in Decisive that “whether or not” questions lead to too narrow a framing. When we consider a series of “whether or not” questions (should we build feature A, should we build feature B, and so on), we forget to account for opportunity cost.

Instead, we should frame our questions as “compare and contrast” decisions: “Which of these ideas look most promising?” We should design our experiments to answer this broader question. The best way to do that is to test the assumptions that need to be true for each idea to work.

Because ideas often share assumptions, this allows us to experiment quickly, ruling out sets of ideas when we find faulty assumptions. Additionally, as we build support for key assumptions, we can use those assumptions as building blocks to generate new ideas.
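To make the shared-assumption point concrete, here is a minimal Python sketch. The idea names and assumption labels are invented purely for illustration; the only point is that one faulty assumption can rule out every idea that depends on it.

```python
# Hypothetical ideas, each mapped to the assumptions it depends on.
ideas = {
    "Idea 1": {"A1", "A2"},
    "Idea 2": {"A1", "A3"},
    "Idea 3": {"A2", "A3"},
}


def surviving_ideas(ideas, refuted):
    """Return the ideas that do not depend on any refuted assumption."""
    return sorted(name for name, needs in ideas.items() if not (needs & refuted))


# Suppose a quick test shows that assumption A1 is faulty.
print(surviving_ideas(ideas, {"A1"}))  # ['Idea 3']
```

One assumption test removed two of the three ideas from consideration, without building any of them.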

Assumption testing is a faster path to success than idea testing.

As soon as we shift our focus to testing assumptions, the “We believe…” format falls apart. It’s rare that an outcome is dependent upon a single assumption, so the second and third parts of the hypothesis don’t hold up.

A more accurate format might be:

We believe that [Assumption A] and [Assumption B]… and [Assumption Z] are true

Therefore we believe [this capability]

Will result in [this outcome]

We will have confidence to proceed when [we see a measurable signal]

But again, it’s not the idea you should be testing. You should be testing each of the assumptions that need to be true for your idea to work. So we need a hypothesis format that works for each assumption.

Test each of the assumptions that need to be true in order for your idea to work.

Let’s look at an example. Imagine you are working at Facebook before they added the additional reaction options (e.g. love, haha, wow, sad, angry). I suspect Facebook was inundated with “dislike button” requests, as I heard this complaint often.

Imagine you started with this modified hypothesis:

We believe that

Assumption A: People either like or dislike a story.

Assumption B: People don’t want to click like on a story they dislike.

Assumption C: Some people who dislike a story would engage with the story if doing so were easier than writing a comment.

Therefore we believe that adding a dislike button

Will result in more engagement on newsfeed stories

We will have confidence to proceed when we see a 10% increase in newsfeed clicks.

Now imagine you do what most teams do and you test your new capability. You add a dislike button and you see a 5% increase in newsfeed clicks.

You didn’t see the engagement you expected, but you aren’t sure why. Is one of your assumptions false? Are they all true and they just didn’t have the effect size you expected? You have to do more research to answer these questions.

Now imagine you tested each of your assumptions individually. To test Assumption A, you could have a Facebook user review the stories in their newsfeed and share out loud their emotional reaction to each story. You’d uncover pretty quickly that Assumption A is not true. People have many different emotional responses to newsfeed stories.

Now I’m not saying that you wouldn’t have figured this out by running your capability test. You could easily run the same think-aloud study after you built the dislike button. However, you would learn what won’t work only after you had already built the wrong capability.

The advantage of testing the individual assumptions is that you avoid building the wrong capability in the first place.

I don’t like that the “We believe…” format encourages us to test capabilities and not assumptions. However, this isn’t my only concern with the “We believe…” format.

Align Around Your Experiment Design Before You Run Your Experiment

Have you ever run an experiment only to have key stakeholders or other team members argue with the results? I see it all the time.

You run an A/B test. Your variant loses to your control and the designer argues you tested with the wrong audience. Or an engineer argues that it will perform better once it’s optimized. Or the marketing team argues that even though it lost, it’s better for the brand. And so on.

We’ve all been there.

Here’s the thing. If you are going to ignore your experiment results, you might as well skip the experiment in the first place.

If you are going to ignore your results, you might as well skip the experiment.

Now that doesn’t mean that these objections to the experiment design aren’t valid. They may be. But if the objections arise after the experiment was run, it’s difficult to separate valid concerns from confirmation bias.

Remember, as we invest in our ideas, we fall in love with them (we all do this; it’s a human bias called escalation of commitment). The more committed we are to an idea, the more likely we are to see only the data that confirms the idea and to overlook the disconfirming data, no matter how much of it there is. This is another human bias, commonly known as confirmation bias.

Our goal should be to surface these objections before we run the experiment. That way we can modify the design of our experiment to account for them.

Let’s return to our hypothesis:

We believe that adding a dislike button

Will result in more engagement on Facebook

We will have confidence to proceed when we see clicks on newsfeed stories increase by 10%

This looks like a good hypothesis. It includes a clear outcome (i.e. more engagement) and it defines a clear threshold for a specific metric (i.e. a 10% increase in clicks on newsfeed stories).

Remember, we tested this capability and we got a 5% increase in engagement, not a 10% increase. If we trust our experiment design, we need to conclude based on our data that our hypothesis is false as we didn’t clear our threshold.

But for most teams, this is not how they would interpret the results.

If you like the change, you’ll argue:

We didn’t run the test for long enough.

People didn’t have enough time to learn that the dislike button exists.

The design was bad. People couldn’t find the dislike button.

People hate all change for the first little while.

Maybe the users we tested with are all optimists, liking everything.

The news cycle that day was overly positive and skewed the results.

5% is pretty good, we can optimize our way to 10%.

Any increase is good, let’s release it.

If you don’t like the change, you’ll argue:

It didn’t work. We didn’t get to 10%.

People don’t want to dislike things.

Facebook is a happy place where people want to like things.

Any increase is not good, because more options detract from the UI. We need to only add things that move the needle a lot.

And where do you end up? Exactly where you were before you ran the experiment—with a team who still can’t agree on what to do next.

Now this confusion isn’t necessarily because we didn’t frame our hypothesis well. It’s because we didn’t get alignment from our team on a sound experiment design before we ran the experiment. If everyone agreed that the experiment design was sound, we’d have no choice but to conclude based on our data that our hypothesis was false.

This isn’t a problem with the “We believe…” format per se, but I see many teams conflate a good hypothesis with a good experiment design, just like I did. They believe they have a sound hypothesis and therefore they conclude their experiment design is sound as well. However, this is not necessarily true.

Invest the Time to Get Your Experiment Design Right

To ensure that your team won’t argue with your experimental results, take the time to define and get alignment around the following elements:

The Assumption: Be explicit about the assumption you are testing. Be specific.

Experiment Design: Describe the experiment stimulus and/or the data you plan to collect.

Participants: Define who is participating in the experiment. Be specific. All customers? Specific types of customers? And be sure to include how many.

Key Metrics and Thresholds: Be explicit about how you will evaluate the assumption. Define which metric(s) you will use and any relevant thresholds. For example, “increase engagement” is not specific enough. How do you measure engagement? “Increase clicks on newsfeed stories by 10%” is more specific and sets a clear threshold. For some types of metrics, it is also important to define when you will take the measurement. For example, if you are measuring open rates on an email, you’ll need to define how long you’ll give people to open the email (e.g. 3 days after it was sent).

Have a clear rationale for why your experiment design/data collected will impact your metric. Don’t over-test. Be sure to have a strong theory for why you think this metric will move. Many teams get too enthusiastic about testing and test every variation without any rhyme or reason. Changing the button color from blue to red increased conversions, so now they want to try green, purple, yellow, and orange. However, every additional test run without a rationale increases your chance of false positives and leads to many wasted experiments.

Decide upfront how you will act on the data you collect. Before you run your experiment, define what you will do if your assumption is supported, if it’s refuted, or in the case of a split test, if the results are flat. If the answer is the same in all three instances, skip your experiment and take action now. If you don’t know how you will use the data, you aren’t ready to run your experiment.
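The false-positive risk from testing many variations can be made concrete with a little arithmetic. The sketch below assumes each test is independent and uses a 5% significance level. Both are illustrative assumptions, not figures from any particular experiment. Under those assumptions, the chance of at least one false positive across k tests is 1 - (1 - alpha)^k.

```python
def chance_of_false_positive(num_tests, alpha=0.05):
    """Probability of at least one false positive across independent tests.

    alpha is the per-test false-positive rate (assumed to be 5% here).
    """
    return 1 - (1 - alpha) ** num_tests


for k in (1, 5, 10):
    print(k, round(chance_of_false_positive(k), 3))
# 1 0.05
# 5 0.226
# 10 0.401
```

In other words, under these assumptions, running five button-color tests with no theory behind them gives you roughly a one-in-four chance of "finding" a winner that isn't real.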

This list is more complex than the “We believe…” format and I don’t expect it to spread like wildfire. However, if you want to get more out of your experiments (and you want to build more trust amongst your team in your experimental results), defining (and getting alignment around) these elements upfront will help.
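If it helps to see the elements side by side, here is one way they could be written down as a small Python record. This is only a sketch: the field names, the participant numbers, and the planned actions are all made up for illustration, and the threshold reuses the 10% figure from the dislike button example.

```python
from dataclasses import dataclass


@dataclass
class ExperimentDesign:
    assumption: str          # the specific assumption being tested
    stimulus: str            # experiment stimulus and/or data to collect
    participants: str        # who takes part
    sample_size: int         # how many take part
    metric: str              # how the assumption is evaluated
    threshold: float         # e.g. 0.10 for a 10% relative increase
    measure_after_days: int  # when the measurement is taken
    if_supported: str        # planned action for each outcome
    if_refuted: str
    if_flat: str

    def worth_running(self):
        """False when every outcome leads to the same action."""
        return len({self.if_supported, self.if_refuted, self.if_flat}) > 1

    def is_supported(self, observed_lift):
        """Compare the observed lift with the threshold agreed upfront."""
        return observed_lift >= self.threshold


# All values below are hypothetical.
design = ExperimentDesign(
    assumption="Some people who dislike a story would engage if it were easier than commenting",
    stimulus="Split test of a one-click negative reaction",
    participants="Newsfeed users who commented in the last 30 days",
    sample_size=20000,
    metric="clicks on newsfeed stories",
    threshold=0.10,
    measure_after_days=14,
    if_supported="Explore lightweight reaction ideas",
    if_refuted="Drop ideas that depend on this assumption",
    if_flat="Run a think-aloud study to learn why",
)

print(design.worth_running())     # True
print(design.is_supported(0.05))  # False
```

Writing the three planned actions down before the experiment runs is the part that matters. If all three are the same, worth_running returns False, which is the signal to skip the experiment and act now.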

I’ll be sharing a one-page experiment design template with my mailing list members. If you want to snag a copy, use the form below to sign up.

A special thanks to Barry O’Reilly who read an early draft of this blog post. His feedback led to significant revisions that made this article better. Barry’s a thoughtful product leader. Be sure to check out his blog.

Hi Teresa! Thanks for this article, it arrived at just the right moment. It was particularly interesting (and funny!) to read the part about all the explanations we as a team can find when we see the real outcome of a test. I think I’ve heard them all at some point or said something similar myself.
Thanks for your insightful work!
Amaia

I could not agree more. When I read Lean Startup a couple of years ago, I liked the overall approach, but I was concerned about the oversimplification of hypothesis testing. At the time, I felt that product teams that did not include a member with a solid understanding of proper market research and experimentation techniques are likely to make poor decisions (and/or waste a ton of time). When I wrote our latest round of lean hypotheses for our startup, I created clear, specific metrics, but I failed to establish a plan for each possible outcome of the experiment. Some of our hypotheses were validated, while others were not. We were forced to make decisions based on inconclusive data, and those decisions were likely biased by escalation of commitment. If we had decided how to act on our data ahead of time, we probably would have discovered the “inconclusive” outcome and modified the experiment to avoid it. Lesson learned!

Thank you for sharing your thoughtful analysis and the template. Arguing over results based on existing cognitive biases is a common issue. Looking forward to trying your template and sharing results. I also like how it puts some seriousness-tax on running an experiment and should inherently stop a team from running the redundant ones.

One question: The template seems to fit well for an optimization experiment where we have a running product and we would like to change or improve it, which is great. However, discovery is also about new product initiatives when there’s no data to directly compare. Do you have any thoughts or advice on that?

I use the same template for experiments on new products as well. It’s harder to determine a threshold when you don’t have baseline data, but you can use analogous products (or past experience) to set baselines on new products. Other than that, there is no real difference.

great article, thank you!
having a background in scientific research myself I used to underestimate the importance of alignment on all these elements, because for me they had become a natural part of my thinking process.
and having the results debated after the perfectly designed and clear tests had been done was indeed the reason tests weren’t fully acted upon, exactly as you said. A great reminder of a human factor, the main factor 🙂

Teresa is a product coach helping teams adopt user-centered, hypothesis-driven product development practices. She works with companies of all sizes on integrating user research, experimentation, and the right analytics into the product development process resulting in better product decisions.
