Experiment

One sure way to kill a joke is to explain it. I hate to kill this great and clever joke on statistical significance, but here it goes. Maybe you want to just read the joke, love it, treasure it and move on without reading the rest of the article.

Love it? I love this one for its simple elegance. You can leave now if you do not want to see this dissected.

First the good things.

The “scientists” start with hypotheses external to the data source, collect data and test for statistical significance. They likely used a one-tailed t-test and ran a between-groups experiment.

One group was the control group that did not eat the jelly beans. The other was the treatment group that was treated with jelly beans.

The null hypothesis H0 is, “Any observed difference in the number of occurrences of acne between the two groups is just due to coincidence”.

The alternative hypothesis H1 is, “The differences are statistically significant. The jelly beans made a difference”.

They use a p-value threshold of 0.05 (a 95% confidence level). A p-value of 0.05 means that, if the null hypothesis were true, there is only a 5% probability of seeing a result at least as extreme purely by chance. If the computed p-value is less than 0.05 (p < 0.05), they reject H0 and accept H1. If the computed p-value is greater than 0.05 (p > 0.05), H0 cannot be rejected; the result is consistent with randomness.
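As an illustration of this decision rule (not the scientists' actual procedure; the per-subject acne counts below are made up), a one-sided permutation test computes the same kind of p-value in a few lines of Python:

```python
import random

def permutation_p_value(control, treatment, n_perm=10_000, seed=42):
    """One-sided permutation test: how often does a random relabeling
    of subjects produce a treatment-minus-control difference at least
    as large as the observed one? That frequency is the p-value."""
    rng = random.Random(seed)
    observed = sum(treatment) / len(treatment) - sum(control) / len(control)
    pooled = control + treatment
    n_treat = len(treatment)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        perm_treat = pooled[:n_treat]
        perm_ctrl = pooled[n_treat:]
        diff = sum(perm_treat) / n_treat - sum(perm_ctrl) / len(perm_ctrl)
        if diff >= observed:
            extreme += 1
    return extreme / n_perm

# Hypothetical acne counts per subject (invented numbers):
control = [2, 3, 1, 2, 4, 2, 3, 1, 2, 3]
treatment = [3, 4, 2, 5, 4, 3, 4, 2, 5, 3]

p = permutation_p_value(control, treatment)
print("p =", p)  # reject H0 if p < 0.05
```

The permutation test makes the logic of H0 concrete: if group labels were truly interchangeable (pure coincidence), a difference this large would show up in random relabelings about p of the time.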

They run a total of 21 experiments.

The first is the aggregate. They likely used a large jar of mixed-color jelly beans, ran the test and found no compelling evidence to overthrow the null hypothesis that any difference was just coincidence.

Then they run 20 more experiments, one for each color. They find that in 19 of the experiments (with 19 different colors) they cannot rule out coincidence. But in one experiment using green jelly beans they find p less than 0.05. They reject H0 and accept H1 that green jelly beans made a difference in the number of occurrences of acne.

In 20 out of 21 experiments (95.24%), the results were not significant enough to toss out coincidence as the reason. In 1 out of 21 experiments (4.76%) they were, and hence green was linked to acne.

In other words, 20 of the 21 results (95.24% of them) say that any observed link between jelly beans and acne is just random.
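This is the classic multiple comparisons problem, and the arithmetic is worth seeing: even if jelly beans do nothing at all, running 20 independent tests at a 0.05 threshold produces at least one false positive more often than not.

```python
# Chance that at least one of 20 independent tests crosses the 0.05
# significance threshold purely by luck, when nothing is going on:
alpha = 0.05
n_tests = 20
p_at_least_one_fluke = 1 - (1 - alpha) ** n_tests
print(round(p_at_least_one_fluke, 2))  # 0.64
```

A roughly 64% chance of a "green jelly bean" result is the expected cost of running twenty color-by-color tests without correcting for multiple comparisons.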

However, the media reports, “Green jelly beans linked to acne at 95% confidence level”, because that one experiment found p < 0.05.

Green is the spurious variable. The fact that the green experiment had p < 0.05 could easily be because that particular run happened to catch a random fluke.

That is exactly what significance testing on random samples permits: at a 0.05 threshold, roughly one test in twenty will look “significant” by chance alone.

If we had not seen the first experiment or the 19 others with p > 0.05, we would be tempted to accept the link between green jelly beans and acne. Since we saw all the negative results, we know better.

In reality, we don’t see most, if not all, of the negative findings. Only the positive results get written up – be it the results of an A/B test that magically increased conversion, or scientific research.
After all, it is not interesting to read how changing just one word did not have an effect on conversion rates. Such negative findings deserve their place in the round filing cabinet.

By rejecting all negative findings and choosing only positive findings, the experimenters violate the rule of random sampling and present random flukes as breakthroughs.

The next step down this slippery slope of pseudo statistical testing is Data Dredging. Here one skips the initial hypotheses altogether and simply dives into the data to find “interesting correlations”.
Data Dredging is slicing up data along every possible dimension to find something – anything.

For example, “Eating green jelly beans with the left hand while standing up on Tuesdays” causes acne.
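A quick simulation makes the point. The attributes and the 30% acne base rate below are invented, and the outcomes are pure noise by construction; dredging across all the slices still surfaces a "headline" cell sitting well above the base rate.

```python
import random
from itertools import product

rng = random.Random(0)

# Pure noise: every subject gets random attributes and an acne outcome
# drawn from the same 30% base rate. Nothing is actually linked.
colors = ["red", "green", "blue", "yellow"]
hands = ["left", "right"]
days = ["Mon", "Tue", "Wed", "Thu", "Fri"]
subjects = [
    {"color": rng.choice(colors), "hand": rng.choice(hands),
     "day": rng.choice(days), "acne": rng.random() < 0.3}
    for _ in range(2000)
]

# Dredge: compute the acne rate in every (color, hand, day) slice
# and report the most "striking" one.
rates = {}
for combo in product(colors, hands, days):
    cell = [s for s in subjects
            if (s["color"], s["hand"], s["day"]) == combo]
    if cell:
        rates[combo] = sum(s["acne"] for s in cell) / len(cell)

headline, headline_rate = max(rates.items(), key=lambda kv: kv[1])
print(headline, round(headline_rate, 2))
```

With 40 slices to choose from, some slice will always beat the base rate by luck alone; report only that one and you have a "finding".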

If you find this claim so ridiculous that you would never fall for it, consider all the articles you have seen about the best days to run email marketing, the best days to tweet, or how to do marketing like certain brands.

In this age of instant publication, easy experimentation, Big Data and social media echo chamber, how can you spot and stay clear of Random Flukes reported as scientific research?

You can start with this framework:

What are the initial hypotheses before collecting data? If there are none, thanks but no thanks. (Data Dredging)

How were these hypotheses arrived at? If they were derived from the very data they are tested with, keep moving. A great example of this is the class of books, “7 Habits of …”

See a study from an extremely large sample? With very large samples, even tiny, practically meaningless differences become statistically significant – a mathematical artifact, not evidence of a real effect. Again, thanks but no thanks.

Very narrow finding? It is the green jelly bean again; ask about the other dimensions that were tested.

Or you can just plain ignore all these nonsensical findings camouflaged in analytics.

“We all have an internal price that we are willing to pay for a product. It is our Willingness To Pay (WTP). We have one such number for each product. We buy a product as long as its price is just below our willingness to pay”.

For all practical purposes, this is assumed to be a static and magical number. It differs from customer to customer, with no explanation why, but marketers are told to deal with it.

Hence there are methods and experiments that try to elicit this distribution across customers. Once you gather enough data on how many customers are willing to buy your product at different price points, you have the demand curve.

So we see the many different ways marketers go about asking customers about their willingness to pay, be it for a soda or a webapp.

One common and incorrect method is surveys that ask for our attitudinal willingness to pay – that is, our stated intention:

On a scale of 1 to 10, how likely are you to buy …

Of the following 5 prices, please indicate the price above which you will not buy

For a product that delivers x, y and z, how much more will you pay?

These fail to take into account that actual customer behavior is much different from stated intention. For instance, the context customers are in while answering the survey is different from their buying context.

In general these studies overestimate the price customers are willing to pay.

Then there are the experimental methods that are popular with web startups. These include:

Showing different prices to different customers and measuring conversion

Showing multiple versions, presenting them in different orders and using visual nudges to find the one with higher conversion

These are variations of what brands like P&G and Unilever used to do in the offline world. They are definitely better than the survey-based approach but still fail to uncover true customer willingness to pay.

The experimental price points could still be way off. These methods also cannot tell whether the customer will really end up buying the product at the stated price, given everything else that is competing for their wallet.

If you are a web startup and want to use experiments to find demand curve, there is one very simple and straightforward method I recommend (see research reference here).

This is most suited for web startups, especially those that allow a free trial period or a free version (freemium). It is best done during your long beta period, or with those freeloaders who steadfastly remain on the free version even after a year.

Let the users sign up for free and use the product for a period of time. Then, the next time they use the product, put an ultimatum question to them:

You have been using the product for 3 months. We want you to upgrade to the paid version. Please think of a price from $1 to $20 that you are willing to pay for continued use of the product.

While you do that we will pick a random number R from 1 to 20 as well.

If your price P is less than R: for example, you picked $4 and we picked $13. Sorry, we have to let you go. It has been a great journey and we are happy to have delivered you value over the past 3 months.

If your price P is R or more: for example, you picked $14 and we picked $7. Congratulations, you only need to pay $7 to use our product.

Customers are most likely to reveal their true willingness to pay for your product. If they state a number below their true value, they risk losing the product. If they state a higher number, they risk paying it. So their best option is to state the price at which they will continue to be delighted to use your product.
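The scheme above is a variant of the Becker–DeGroot–Marschak (BDM) mechanism, and its incentive property can be checked directly. A minimal sketch in Python (the $1–$20 price grid follows the article's example; the $12.50 valuation and the function names are mine):

```python
import random

def ultimatum_offer(stated_price, rng=random):
    """The random-price mechanism from the article: the user states a
    price P in 1..20, we draw R uniformly from 1..20. If P < R the
    user loses access; otherwise the user keeps it and pays R."""
    r = rng.randint(1, 20)
    if stated_price < r:
        return {"random_price": r, "keeps_product": False, "pays": 0}
    return {"random_price": r, "keeps_product": True, "pays": r}

def expected_payoff(stated_price, true_value):
    """Expected surplus for a user whose true willingness to pay is
    true_value: they gain (true_value - R) whenever R <= stated price."""
    return sum(true_value - r for r in range(1, stated_price + 1)) / 20

# Truth-telling is the best strategy: for a user who values the product
# at $12.50, no stated price beats the nearest dollar to the true value.
best = max(range(1, 21), key=lambda p: expected_payoff(p, 12.5))
print(best)  # 12
```

Understating risks losing a product worth more than the price drawn; overstating risks paying more than it is worth. The expected-payoff search confirms both deviations only hurt.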

Once you collect enough data points, you have an almost perfect demand curve, far more accurate than one obtained through A/B testing.

Not to mention this is also a great way to find segment-version fit (if you have only one version, then Product-Market fit).

Once you know the demand curve, you know the price that maximizes your profit (or prices for different segments).
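As a toy illustration with made-up numbers (ten elicited prices and an assumed $2 unit cost to serve a paying customer), reading the profit-maximizing price off the demand curve is a direct search:

```python
# Hypothetical elicited prices from ten users (invented numbers).
stated_prices = [3, 5, 5, 7, 8, 9, 10, 12, 14, 18]
unit_cost = 2  # assumed cost to serve one paying customer

def profit_at(price):
    """Profit at a single price: everyone whose stated willingness to
    pay is at least the price buys; each buyer yields price - cost."""
    buyers = sum(1 for w in stated_prices if w >= price)
    return buyers * (price - unit_cost)

best_price = max(range(1, 21), key=profit_at)
print(best_price, profit_at(best_price))  # 8 36
```

With real data the same search runs per segment, giving the segment-specific prices the article alludes to.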

But all these methods continue to treat willingness to pay as an immovable, given number, something the marketer cannot control. Despite the research and analytics, they ignore the value to different customer segments and fail to ask the question, “What job do I want my target segment to hire my product for?”

One final example of putting experiments to work:

Be it a new menu item in a fast food chain or a new pricing scheme for your product lines, experiment first on a limited scale for a limited time to test your theories before rolling out sweeping changes. New York City Mayor Bloomberg has proposed dropping fares on certain bus routes all the way to $0. His argument, based on models built by his team, is that most people already commute for free using their transfer passes, and that there are savings to be had by eliminating the delays caused by people searching for exact fares. The NYTimes article points out the problems in the argument; the most significant flaw is the possibility that a large number of people would ride the buses just because they are free.

Whether the models are right or the claims of increased ridership are true need not be debated; they can simply be tested by rolling out a limited experiment for a limited time.

Experiments will go a long way toward testing your hypotheses and making course corrections – a far cheaper and more effective method than managing the unexpected consequences of a full-fledged rollout and then rolling back the changes.