Designed To Fail: Why Many Tests Give You Meaningless Results

You built out your new ad copy, tested a bidding strategy, and measured both web and store sales to capture the online-to-offline effect; yet in the end you got the worst outcome possible: inconclusive results.

A negative result would have been better; at least you would have known that your hypothesis was wrong or that your strategy was not effective. But an inconclusive result tells you nothing, which can be incredibly frustrating as a marketer.

There are many reasons why a well-designed test might fail: seasonal effects might be ignored, the dataset might be too small, or the marketplace might change during the test.

However, a very common error in test design is not accounting for volatility – fluctuations in performance due to unpredictable events in the marketplace.

In this post, I shall delve into the issue of volatility: how it can lead to inconclusive results and, finally, how you can mitigate its effect on your test.

A Thought Experiment

To understand the issue better, let us assume that you want to test the hypothesis that online SEM spending leads to offline store sales. To test this hypothesis, you ramp up your online budgets in increments every week.

Your plan is to run the test for five weeks, collect the data, run a regression analysis and answer the question, "What does one dollar spent online lead to in offline sales?" Now let us put some real numbers into this thought experiment.

Daily offline store revenue = $550,000

Daily baseline online SEM spend = $10,000

Your plan is to spend $10,000, $15,000, $20,000, $30,000, $40,000 per day on SEM in weekly increments, i.e. you spend $10,000 per day on week 1, $15,000 per day on week 2 and so on.

Taking the thought experiment further, let us assume that one dollar spent online leads to $3 in offline revenue. If this were the case, then your offline store revenue would look as follows:

Week   Daily Store Revenue   Daily Online-Attributable Store Revenue   Daily Total Store Revenue

1      $550,000              $30,000                                   $580,000

2      $550,000              $45,000                                   $595,000

3      $550,000              $60,000                                   $610,000

4      $550,000              $90,000                                   $640,000

5      $550,000              $120,000                                  $670,000
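The arithmetic behind the table can be sketched in a few lines. This is a minimal illustration of the post's assumptions ($550,000 baseline, $3 of offline revenue per online dollar); the variable names are my own.

```python
# Noiseless version of the thought experiment: each week's attributable
# revenue is just spend times the assumed $3-per-dollar return.
baseline_revenue = 550_000
return_per_dollar = 3
daily_spend = [10_000, 15_000, 20_000, 30_000, 40_000]  # weeks 1-5

for week, spend in enumerate(daily_spend, start=1):
    attributable = spend * return_per_dollar
    total = baseline_revenue + attributable
    print(f"Week {week}: ${attributable:,} attributable, ${total:,} total")
```

With no volatility, regressing total revenue on spend would recover the slope of 3 exactly.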

If we were to plot these numbers on a chart, we would get a straight line with a slope of 3, telling us that one dollar spent online leads to $3 in offline revenue.

Volatility & Its Effect On Your Experiments

Volatility means that your store revenue would never be exactly $550,000 every day. Instead, it will be a number close to the $550,000 average and will fluctuate daily.

It also means that the online contribution will never be exactly $3 for every dollar spent, but a number that fluctuates around $3. Let us assume that the daily volatility in the offline store revenue is 15% of the average. The experimental results will now look something like this:

It is unclear from the graph if there is any relationship at all. Further, even the statistical confidence measure (R squared) is 5.25% indicating that we are not confident that the regression is meaningful.

So why did this happen? The 15% volatility in offline store revenue masked any effect that the online SEM spend had.

For instance, if the SEM spend contributed $60,000 in store revenue but the store revenue was $60,000 lower on the same day due to volatility, the two effects would cancel each other out and you would see no change in total revenue.
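This masking effect is easy to reproduce. Below is a sketch of the five-week test with 15% daily volatility added to the baseline; I have assumed Gaussian noise and seven readings per week, neither of which the post specifies, and fit an ordinary least squares line by hand.

```python
# Simulate 35 daily observations: baseline + noise + $3 per SEM dollar.
import random
import statistics

random.seed(42)
baseline, ret, vol = 550_000, 3.0, 0.15
weekly_spend = [10_000, 15_000, 20_000, 30_000, 40_000]

x, y = [], []
for spend in weekly_spend:
    for _ in range(7):  # 7 days per week
        noise = random.gauss(0, vol * baseline)  # ~$82,500 std dev
        x.append(spend)
        y.append(baseline + noise + ret * spend)

# Ordinary least squares: slope = cov(x, y) / var(x)
mx, my = statistics.fmean(x), statistics.fmean(y)
sxx = sum((xi - mx) ** 2 for xi in x)
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
slope = sxy / sxx
# For simple regression, R^2 is the squared correlation
syy = sum((yi - my) ** 2 for yi in y)
r2 = sxy ** 2 / (sxx * syy)
print(f"estimated slope: {slope:.2f}, R^2: {r2:.1%}")
```

Re-running with different seeds shows the estimated slope swinging far from the true value of 3, with R² typically small — exactly the inconclusive picture described above.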

Clearly, this would be an expensive, time-consuming experiment that leads to inconclusive results. Moreover, this could happen in any experiment, including an ad copy test, a landing page test, a promotion, etc.

What Can Advertisers Do To Prevent This?

Before running the experiment, measure the volatility of the variable you plan to measure. In our example, we would measure the volatility of total store revenue.

Check the minimum impact your test would need to have in order to be measurable. In our example, we would need to estimate the minimum impact online spend must have on store revenue for us to detect the effect.
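A rough back-of-envelope version of that check: with 15% daily volatility, a week of daily readings averages away some of the noise, and an uplift needs to clear roughly two standard errors of that average to stand out. The two-standard-error rule of thumb is my own illustrative choice, not something the post specifies.

```python
# Minimum detectable daily uplift over one week of readings,
# given 15% daily volatility on a $550,000 baseline.
import math

baseline = 550_000
daily_sigma = 0.15 * baseline          # $82,500 of daily noise
n_days = 7
se_of_mean = daily_sigma / math.sqrt(n_days)
min_uplift = 2 * se_of_mean            # ~2 standard errors to stand out
print(f"minimum detectable daily uplift over one week: ${min_uplift:,.0f}")
```

That works out to roughly $62,000 per day — at $3 per dollar, the incremental spend would need to be on the order of $20,000 per day before a single week could plausibly reveal it.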

Another parameter to experiment with is the number of days you run the experiment. Experiment duration is always a trade-off with conflicting issues to be considered. Running the experiment for a longer period might give you a more robust answer, but would you be willing to wait longer for the result? Further, would marketplace forces such as CPC inflation and seasonality lead to more or less volatility over the longer duration?

Test your assumptions before running an experiment. You can build a simple experimental simulation in Excel or a statistical package like R and check to see if your test will give you meaningful results.
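The same kind of simulation can double as a crude power check before you commit: run the simulated test many times and count how often the estimated slope lands near the truth. The spend schedule and 15% volatility are from the post; the "within ±1 of the true slope" success criterion and the day counts are my own illustrative choices.

```python
# Repeat the simulated test many times at two durations and compare
# how often the OLS slope estimate lands within +/-1 of the true 3.0.
import random
import statistics

def run_once(days_per_week, rng, baseline=550_000, ret=3.0, vol=0.15):
    spends = [10_000, 15_000, 20_000, 30_000, 40_000]
    x, y = [], []
    for s in spends:
        for _ in range(days_per_week):
            x.append(s)
            y.append(baseline + rng.gauss(0, vol * baseline) + ret * s)
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sxx  # OLS slope estimate

rng = random.Random(0)
for days in (7, 28):
    trials = [run_once(days, rng) for _ in range(1000)]
    hit = sum(abs(t - 3.0) <= 1.0 for t in trials) / len(trials)
    print(f"{days} days per step: slope within +/-1 of truth in {hit:.0%} of runs")
```

Quadrupling the duration roughly halves the standard error of the slope, so the longer test hits the target far more often — which makes the duration trade-off above concrete before any budget is spent.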

Following these steps will help you avoid the heartache of expensive, time-consuming and inconclusive tests.


http://docsheldon.com Doc Sheldon

Interesting, Siddharth. Analytics not being my strong point, I would never have predicted such a scatter as shown in your second example. Plotting out projections in Excel definitely sounds like a worthwhile exercise, before starting the experiment.

http://www.epiphanysolutions.co.uk SteveBaker

Hi Siddharth,

Unforeseen events have wreaked havoc on many tests that we’ve run in the past.

Most recently, we were running a test to improve the conversion rate on a client’s home page, but at the end of February, the agency managing their banner advertising found that they had some spare budget, and ramped up the traffic dramatically. Inevitably, this crashed the conversion rate, making any kind of significance test meaningless.

Fortunately, as we’ve been caught out like this before, we tracked the performance of the various test versions for each major traffic source, so we were able to deal with it (this is a good idea if possible for website optimisation tests, since the results may differ between traffic sources anyway).

Advert tests can also be a problem for a PPC campaign. Clearly, changing your bids changes your positions, and hence your click-through rates. As a result, we very rarely run advert tests in the formative stages of account optimisation, when changes to the bids are likely to be greater and more frequent…

The assumption of stable means is obviously critical to an effective significance test, but in the ‘real world’, it tends to be the first thing to go…

http://bit.ly/VS_Blog Wilson Kanaday

Siddharth –

Would there be a way to use multiple regression analysis to look at the different variables affecting the test?

http://braddlibby.wordpress.com Bradd Libby

“the statistical confidence measure (R squared) is 5.25% indicating that we are not confident that the regression is meaningful.”

R-squared is not a measure of statistical confidence. It only says what portion of the variability in a given data set is explainable by the regression model. To assert anything else is appallingly ill-informed.

http://searchengineland.com Siddharth Shah

Bradd,

Nice catch. R^2 being 5.25% means that only that much variability is explained by the independent variable. Actually, the p-value for that regression coefficient is well over 0.3, so my point about the confidence is still valid.
