I have a number of sources of traffic with different conversion rates. I have good evidence that conversion rates vary based on the source. For each traffic source I have how many leads we got, and how many converted. I want an estimate of each source's conversion rate. I do not need individual sources to be accurately estimated (this would be impossible given that many traffic sources only have 0-5 measured conversions). Instead I would like to be able to say that, across all traffic sources estimated at 1%-1.5% average conversion rates, the average future conversion rate is likely to be somewhere in that general ballpark.

The obvious approach of using past performance to predict future performance for each individual source fails badly because of the well-known phenomenon of regression to the mean. For instance, among small traffic sources with no conversions, there is a reasonable expectation that future performance will be better than zero. Likewise, small traffic sources with lots of conversions are unlikely to maintain their record.

My naive idea is to take my data, and use it to produce a reasonable Bayesian prior for the true conversion rate of a random traffic source. Then for each source I can start with that prior and produce a posterior distribution for the true conversion rate of that source. And then my estimate of the average conversion rate for that source will be the average of the posterior.
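That posterior-mean step is mechanical once a prior is in hand. A minimal sketch over a discrete grid of candidate rates (the grid range and the uniform placeholder prior here are assumptions for illustration, not a fitted prior):

```python
import numpy as np

def posterior_mean_rate(k, n, rates, prior):
    """Posterior mean conversion rate for one source with k conversions
    out of n leads, given a discrete prior over candidate rates."""
    # Binomial likelihood of k out of n at each candidate rate; the
    # binomial coefficient cancels in the normalization below.
    like = rates ** k * (1.0 - rates) ** (n - k)
    post = prior * like
    post /= post.sum()
    return float((post * rates).sum())

# Hypothetical grid: rates from 0.1% to 5%, uniform placeholder prior.
rates = np.linspace(0.001, 0.05, 200)
prior = np.ones_like(rates) / len(rates)

# A small source with 0 conversions out of 20 leads is pulled toward
# the prior's mass instead of being estimated at exactly 0%.
est = posterior_mean_rate(0, 20, rates, prior)
```

With a real fitted prior in place of the uniform one, this gives exactly the per-source shrunken estimates the question describes.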

My initial idea for how to fit a reasonable prior to the data works like this. The prior will be a sum of piecewise linear functions, each of which is 0 up to a starting point, rises to a midpoint, then falls to an ending point, after which it is 0 again. For a given division of the likely range of conversion rates into these intervals and midpoints, the prior I'd produce is the one that maximizes the sum of the logs of the likelihoods that each traffic source would have the conversions that it did.
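As a sketch of that maximization step, here is a discretized stand-in: free weights on a grid of candidate rates in place of the piecewise-linear basis (an assumption to keep the code short), fit by EM to maximize the sum of log marginal likelihoods:

```python
import numpy as np
from scipy.stats import binom

def fit_discrete_prior(ks, ns, rates, iters=200):
    """Fit prior weights over a grid of candidate rates by maximizing
    the summed log marginal likelihood of the observed (k, n) pairs,
    using the standard EM updates for a finite mixture."""
    ks, ns = np.asarray(ks), np.asarray(ns)
    # L[i, j] = P(k_i conversions | n_i leads, rate_j)
    L = binom.pmf(ks[:, None], ns[:, None], rates[None, :])
    w = np.full(len(rates), 1.0 / len(rates))
    for _ in range(iters):
        # E-step: responsibility of each candidate rate for each source
        post = w * L
        post /= post.sum(axis=1, keepdims=True)
        # M-step: new weights are the average responsibilities
        w = post.mean(axis=0)
    return w
```

Each EM iteration never decreases the objective. Note that with a fine enough grid this estimator will happily overfit, which is exactly the concern the question raises.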

My problem is that the more pieces I divide my interval into, the more closely I can fit my existing data set. But at some point I'm clearly overfitting. Are there any guidelines I can use to get a sense of when I'm overfitting my prior to my data?

I'm thinking there might be a statistic I can compute to test how closely my measured data fits the prior: if the measured data fits the prior better than a "random" data set should, then I've probably gone too far.
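One concrete version of that check is held-out marginal likelihood: fit the prior on some sources, score it on the rest, and stop adding pieces when the held-out score stops improving. A sketch of the scoring function, assuming the prior is represented as weights over a grid of candidate rates:

```python
import numpy as np
from scipy.stats import binom

def heldout_log_marginal(ks, ns, rates, weights):
    """Summed log marginal likelihood of held-out sources under a
    fitted discrete prior (weights over candidate rates). Higher is
    better; compare this across priors of different complexity."""
    ks, ns = np.asarray(ks), np.asarray(ns)
    L = binom.pmf(ks[:, None], ns[:, None], rates[None, :])
    return float(np.log(L @ np.asarray(weights)).sum())

# Selection loop, in outline (fit_prior stands for whatever fitting
# routine is used, e.g. the piecewise-linear maximizer described above):
#   for n_pieces in candidate_complexities:
#       w = fit_prior(train_ks, train_ns, rates, n_pieces)
#       score = heldout_log_marginal(test_ks, test_ns, rates, w)
#   keep the complexity with the best held-out score
```

Unlike training-set likelihood, the held-out score eventually goes down as pieces are added, which gives a natural stopping point.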

I would be grateful if anyone can suggest a statistic, an alternative rule of thumb to avoid overfitting, or a different approach to the original problem. Since I do not have access to a university library, please only suggest specific books or papers behind paywalls if they are guaranteed to be very relevant to my problem.

1 Answer

A data-driven prior is not necessarily what you need. It sounds more like you need a hierarchical prior, or a multi-level model. This involves the use of so-called "random effects" in your model. The most basic form of such a model is given as

$$(y_{ij}|\beta_0,u_i,\sigma^2)\sim\mathcal{N}(\beta_0+u_i,\sigma^2)$$

where $i$ indexes your source, and $j$ indexes the specific measured conversion. We then specify a prior for the $u_i$ by using another parameter:

$$(u_i|\tau^2)\sim\mathcal{N}(0,\tau^2)$$

Then you complete the prior by specifying a distribution for $p(\beta_0,\sigma^2,\tau^2)$. Usually uniform is fine. What this structure does is cause the prediction for a given source to be a weighted average of the prediction using all the data and the prediction using only the data from the $i$th source. This sounds like what you're looking for.
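Under this model the weights in that average are the usual precision weights. A minimal sketch, assuming $\sigma^2$ and $\tau^2$ are known (in practice they are estimated as part of the posterior), with made-up numbers:

```python
def shrunk_estimate(ybar_i, n_i, grand_mean, sigma2, tau2):
    """Partial-pooling estimate for source i under the normal
    random-effects model: a precision-weighted average of the source's
    own mean and the grand mean across all sources."""
    w = (n_i / sigma2) / (n_i / sigma2 + 1.0 / tau2)
    return w * ybar_i + (1.0 - w) * grand_mean

# Made-up numbers: grand mean rate 1.2%. A tiny source with zero
# conversions is pulled almost all the way to the grand mean, while a
# large source mostly keeps its own observed rate.
small = shrunk_estimate(0.00, 5, 0.012, sigma2=0.012, tau2=1e-5)
large = shrunk_estimate(0.02, 5000, 0.012, sigma2=0.012, tau2=1e-5)
```

This is the qualitative behavior the question asks for: small sources regress toward the overall mean, large sources are mostly trusted on their own record.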

A hierarchical prior is theoretically better but computationally harder for me. I don't have a 1-parameter family of reasonable distributions. If I just use the same "fit linear curve" idea as above with, say, 5 parameters (one of which depends on the other 4), then at a grid of 1% probabilities I'd have 4,598,126 possible priors where all 5 parameters are at least 0, each of which requires a complex computation over all of my observations...
– btilly Dec 30 '12 at 19:43

Upon further thought, most of those priors will be extremely unlikely. I may have a computationally feasible way to discover which of those priors are likely enough to matter, and then do the computation over just those. It should be doable, though more computational work than I expected. I'd still like an answer to my original problem, though, because as the number of traffic sources goes up, the number of parameters I'd want also goes up, and this full analysis quickly becomes infeasible.
– btilly Dec 30 '12 at 20:32