About Us

How to Measure A/B Test Performance for Display Advertising Campaigns

Alice Wu

Marketing Strategy

The mission for the Performance Targeting team is to continuously improve the machine learning systems that drive the advertising products offered by Quantcast. The goal of our machine learning systems is to target the most relevant audiences for our advertisers, and as we improve these systems, we usually face the following problems:

What metrics should we use to measure the effectiveness of our ad targeting?

The performance of machine learning systems is not deterministic. In this setting, can we be confident that an observed improvement is not due to random luck or some seasonality effect, but is a true improvement?

Even if a new machine learning algorithm fares better on average across all the advertising campaigns that we tested, there could be specific campaigns for which the algorithm might be detrimental. Can we correctly identify these campaigns?

Measuring Ad Targeting Effectiveness

Let’s illustrate these issues with an example. Assume that we have developed a new prediction algorithm to estimate the probability that a person on the web will buy a product after seeing a display ad for a media campaign. We would now like to test whether this algorithm will improve the relevancy of our ad targeting. Let’s call this new algorithm the algorithm A and our existing algorithm in production the algorithm B.

A commonly used metric to estimate the effectiveness of a display advertising campaign is the cost per acquisition (CPA) of that campaign:

In CPA we define an acquisition (or “conversion”) as a purchase of a product on the web for that advertiser. The CPA is a measure of how much we have to spend to get a given number of conversions. As our algorithms improve, we should get more conversions for the same cost, and our CPA should decrease.

To assess which algorithm performs better on a given campaign, we can run an A/B test on that campaign. To this effect, one metric we want to look at is the CPA lift:

Assuming the budgets are equal, if the lift is equal to 1, then the two algorithms have driven the same number of customers for the campaign. If the lift is greater than 1, then algorithm A has driven more customers, and vice versa.

We also do not have to restrict ourselves to running our A/B test on only one campaign. We could test our new algorithm on any of the thousands of campaigns that are run daily at Quantcast. Using the lift metric, we can identify how many campaigns outperformed by using algorithm A, how many underperformed and how many were neutral. We can also compute the average or median lift across all of our campaigns.

Random Luck vs. True Improvement

Let’s assume we ran an A/B test across multiple campaigns as described above, and we found that our average lift is 1.3. This means our new algorithm A is on average 30% better than our old algorithm B. However, is this performance improvement real, or is it just due to random fluctuations in real-time bidding markets? If we were to repeat the same experiment under the same conditions, would we get a 30% increase again?

A typical answer to this question is to do hypothesis testing, where our null hypothesis would be that algorithms A and B have the same average performance, and our two-tailed alternative hypothesis would be that the average performance of algorithms A and B are different. If we assume that the campaign lifts are independent and identically distributed random variables, then this could be done using a simple t-test, from which we could derive confidence intervals for a given significance level.

However, the “identically distributed” assumption is very unlikely to be true. Different advertising campaigns have different budgets, different impression constraints and in general very different behaviors. Let’s consider two advertisers:

First advertiser: a company selling pizzas across the United States on a CPM (cost-per-thousand impressions) contract to show 10 million impressions per month.

Second advertiser: a company selling cars in California on a CPM contract to show 100,000 impressions per month.

Clearly the spend behavior of the two campaigns will be different, as we will show 100 times more impressions for the first advertiser than for the second. Moreover, the pizza campaign will most likely have a lot more conversions than the car campaign, as pizzas are cheaper and customers are more likely to order multiple times per month.

One way to remove the identically distributed assumption is to numerically estimate the actual lift distribution for each campaign. This will not only provide a way to estimate confidence intervals for each campaign’s lift, but also estimate the distribution of the average lift across campaigns. Here are steps that one could follow to do so:

Assume a distribution either on the lift value itself or on the CPA of each group (e.g. conversions of group A follow a Binomial distribution with respect to impressions…).

Sample points from each campaign lift distribution and compute the mean. Do this a large number of times (for example, 100,000 times).

Use the simulated mean values to construct a distribution of the average lift across campaigns.

Once we have the distribution of the average lift, we are free to compute the mean of that distribution as well as confidence intervals, and use those to check whether our average lift is statistically significant.

New Algorithm, New Problems

However, outperforming the current production algorithm based on the average lift, even if statistically significant, is not a sufficient condition to promote the new algorithm to production. We also want to identify campaigns that underperformed significantly when running under the new algorithm. Such campaigns may need a closer inspection in order to figure out what exactly went wrong, and how the new model can be improved upon. A key point to keep in mind here is that analyzing those campaigns is “expensive” in terms of resources (time, human intervention, etc.) and hence we want to minimize the number of “false positives” as much as possible, while still being able to generate enough possible underperformers.

The above method of using confidence intervals and doing hypothesis testing for each campaign individually breaks down as the number of campaigns gets bigger and bigger. Under the null hypothesis (experiment has same performance as production), we expect a certain fraction of the campaigns to over perform and/or underperform significantly. Moreover, that fraction stays more or less constant with the number of campaigns (since it is based on the individual confidence interval only), and hence more and more campaigns could be “false positive” while looking like “true positives” — meaning they look like they are underperforming, while they are not. The impact of this effect gets more pronounced as our business accelerates and we continue working with more and more clients. So how can we correct for it?

Alternative Methods

Here’s an alternative approach using a multiple hypothesis testing framework. First, for each campaign, we compare the individual lift values we get for the experimental model with that campaign’s lift distribution under the null hypothesis (experiment is identical to production). From here we can compute a p-value for each campaign, which corresponds to the probability of observing a lift at least as extreme as the one measured if we were to accept the null hypothesis.

There are several methods we can use to come up with a set of “significant underperformers”. Maybe the simplest one would be to use the Bonferroni correction, namely to classify as “significant” all underperforming campaigns with p-value less than α/N, where α is the desired statistical significance if we were looking at only one campaign. This guarantees that the Family-Wise Error Rate (FWER = the probability of at least one false positive overall) is below α. In practice, this is overly conservative and would generate many false negatives (i.e. will come up with too few underperformers). There are a few other methods of controlling the FWER, but in the end the result is still the same, too conservative for our goals.

Instead of requiring an almost certainty that there are no false positives, a less restrictive approach is to require that not too many of the potential underperforming campaigns are false positives. In other words, we can pick an acceptable false discovery rate (FDR = expected number of “false positives” / number of campaigns identified). For instance, if FDR = 0.1, then we expect 90% of the identified campaigns to be “true underperformers”. We can then use the Benjamini–Hochberg(–Yekutieli) procedure to identify such campaigns for further testing.

Summary

The methodologies described above are but one way of promoting a better model to production status. The end result is the same, a binary decision: should we keep the current production model or switch to the new model? A possible alternative is to come up with a “continuous” decision, for instance by using a multi-armed bandit approach, wherein we run both models simultaneously on different user bases whose sizes depend on individual model performance. More details to come in a future blog post!