Data science, statistics or machine learning in broken English

In digital marketing, especially online, marketers and analysts love to apply A/B tests in order to find which metric most influences their KGI/KPIs, out of a huge set of explanatory metrics such as creative components of the UI, the choice of ads, the background image of the page, and so on.

Such an influential metric is sometimes called a "golden feature" or "golden metric" (even though it sounds ridiculous), and many people hunt for it very hard, as they firmly believe that "once the metric is found, we can easily raise revenue and/or profit just by raising the golden metric!!". Ironically, not a few A/B tests are run on exactly this premise.

But is it really true? If you find such a golden metric, can you really raise revenue, gather more users, or get more conversions? Yes, in some cases it may be true; however, here we look at a case in which, in principle, it cannot be.

Below is a link to the dataset we use here. Please download "men.txt" and "women.txt" and import them as "dm" and "dw" respectively (these are the names used in the code below).
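For instance, in R (a minimal sketch; I assume here that both files are comma-separated with a header row):

> dm <- read.csv("men.txt")   # men's matches
> dw <- read.csv("women.txt") # women's matches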

Result   Result of the match (0/1), referenced to Player 1: Result = 1 if Player 1 wins (FNL.1 > FNL.2)
FSP.1    First Serve Percentage for player 1 (Real Number)
FSW.1    First Serve Won by player 1 (Real Number)
SSP.1    Second Serve Percentage for player 1 (Real Number)
SSW.1    Second Serve Won by player 1 (Real Number)
ACE.1    Aces won by player 1 (Numeric-Integer)
DBF.1    Double Faults committed by player 1 (Numeric-Integer)
WNR.1    Winners earned by player 1 (Numeric)
UFE.1    Unforced Errors committed by player 1 (Numeric)
BPC.1    Break Points Created by player 1 (Numeric)
BPW.1    Break Points Won by player 1 (Numeric)
NPA.1    Net Points Attempted by player 1 (Numeric)
NPW.1    Net Points Won by player 1 (Numeric)
FSP.2    First Serve Percentage for player 2 (Real Number)
FSW.2    First Serve Won by player 2 (Real Number)
SSP.2    Second Serve Percentage for player 2 (Real Number)
SSW.2    Second Serve Won by player 2 (Real Number)
ACE.2    Aces won by player 2 (Numeric-Integer)
DBF.2    Double Faults committed by player 2 (Numeric-Integer)
WNR.2    Winners earned by player 2 (Numeric)
UFE.2    Unforced Errors committed by player 2 (Numeric)
BPC.2    Break Points Created by player 2 (Numeric)
BPW.2    Break Points Won by player 2 (Numeric)
NPA.2    Net Points Attempted by player 2 (Numeric)
NPW.2    Net Points Won by player 2 (Numeric)

Our task here consists of three steps:

1. To determine "golden" metrics, or to build a model, from the men's dataset
2. To predict the women's results using the rules or the model obtained from the men's dataset
3. To evaluate the result with a confusion matrix

A/B testing and rule-based prediction

OK, first let's run a t-test as a univariate analysis on each explanatory variable. With this style of analysis, we expect to end up with some "golden" metrics from which we can derive rules for predicting the outcome of new datasets. The structure of the men's dataset is as listed above; you can confirm it as below.
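(A quick sanity check, assuming the data frames dm and dw created above.)

> str(dm)   # shows each column with its name and type, matching the table above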

In principle we have to run a t-test on each pair, such as FSP.1 vs. FSP.2, one by one; then, if a test shows a significant difference in means between the two, we can take that metric as one of the "golden" metrics and set up a rule-based predictor as below.

> table(dw$Result,ifelse(dw$FSP.1>=dw$FSP.2,1,0))

This is a very simple rule-based predictor that returns 1 (won) if FSP.1 >= FSP.2 and 0 (lost) otherwise. Let's run a series of t-tests.
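Here is a compact way to run them all at once (a sketch, assuming the .1/.2 column naming listed above; the metrics vector below is just a helper I introduce for illustration):

> metrics <- c("FSP","FSW","SSP","SSW","ACE","DBF","WNR","UFE","BPC","BPW","NPA","NPW")
> pvals <- sapply(metrics, function(m)
+     t.test(dm[[paste0(m, ".1")]], dm[[paste0(m, ".2")]], paired = TRUE)$p.value)
> sort(pvals)   # the smaller the p-value, the more "golden" the metric looks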

OMG, even with a "paired" t-test, no significant difference in means appears. This result is not so surprising: see the plot below, which simply shows the mean and the standard deviation (as an error bar) of each metric.
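Such a plot can be drawn in base R like this (a sketch; the original figure may have been produced differently):

> cols <- c(rbind(paste0(metrics, ".1"), paste0(metrics, ".2")))  # FSP.1, FSP.2, FSW.1, ...
> mu <- sapply(cols, function(cl) mean(dm[[cl]], na.rm = TRUE))
> s  <- sapply(cols, function(cl) sd(dm[[cl]], na.rm = TRUE))
> bp <- barplot(mu, las = 2, cex.names = 0.7, ylim = range(0, mu - s, mu + s))
> arrows(bp, mu - s, bp, mu + s, angle = 90, code = 3, length = 0.03)  # +/- 1 SD error bars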

Almost all the metrics show very large error bars. :( Just for your information, I tried to build a rule-based predictor with the metric pair showing the lowest p-value, BPC.1 and BPC.2.
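In the same style as the FSP rule above, applied to the women's dataset (I assume the same thresholding direction, i.e. predicting a win for player 1 when BPC.1 >= BPC.2):

> table(dw$Result, ifelse(dw$BPC.1 >= dw$BPC.2, 1, 0))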

It appears that even a metric whose means do not differ significantly can predict the women's results to some extent... but do you really want to conclude that these match stats are never useful for predicting the result of a match?

Multivariate modeling

In short, I don't think so. I know that in such a case multivariate modeling works well. Below are examples of such multivariate modeling.
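For instance, a logistic regression (a GLM with binomial family) can be trained on the men's dataset and evaluated on the women's dataset. A minimal sketch, assuming both data frames contain only Result and the 24 metrics listed above (drop any extra columns first); this is not necessarily the exact call behind the results discussed here:

> dm.glm <- glm(Result ~ ., data = dm, family = binomial)
> pred <- ifelse(predict(dm.glm, newdata = dw, type = "response") >= 0.5, 1, 0)
> table(dw$Result, pred)   # confusion matrix on the women's dataset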

(Actually, I had already tried a wide variety of machine learning classifiers, and the model shown here was the best one for this tennis dataset. :P)
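As an illustration of one such classifier (hypothetical; not necessarily the best one I mention above), a random forest can be run in the same train-on-men, test-on-women manner:

> library(randomForest)
> set.seed(71)
> dm.rf <- randomForest(as.factor(Result) ~ ., data = dm, na.action = na.omit)
> dw.cc <- dw[complete.cases(dw), ]   # randomForest cannot handle NAs at prediction time
> table(dw.cc$Result, predict(dm.rf, newdata = dw.cc))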

Conclusions

The results tell us that univariate statistics, and rule-based predictors derived from the usual hypothesis testing on them, sometimes fail, while multivariate models such as (generalized) linear models or machine learning classifiers can work well.

In general, multi-dimensional, multivariate features represent more of the complex information and internal structure of a dataset than univariate features do. But in many marketing situations, not a few people neglect the importance of multivariate information and persist in running univariate A/B tests, hunting for "golden" features or metrics.

Even when multiple features have only "partial" correlations with the outcome, such univariate A/B testing can go wrong, because the partial correlation structure easily distorts what the usual univariate correlation (and hence univariate testing) shows.
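A toy simulation makes the point (hypothetical numbers, purely for illustration): here x1 genuinely drives y, but its marginal correlation with y is close to zero because of a correlated companion variable x2.

> set.seed(17)
> x1 <- rnorm(1000)
> x2 <- x1 + rnorm(1000)                 # x2 is strongly correlated with x1
> y  <- x2 - x1 + rnorm(1000, sd = 0.5)  # y depends on both x1 and x2
> cor(x1, y)                # marginal correlation: close to zero by construction
> summary(lm(y ~ x1 + x2))  # multivariate view: x1 is clearly significant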

If you have multivariate datasets, please try multivariate modeling, and don't persist in univariate A/B testing.