Saturday, May 17, 2008

A drawer is filled with socks and you remove eight of them randomly. Four are black and four are white. How confident are you in estimating the proportion of white and black socks in the drawer?

The standard statistical approach is to assume that the number of socks in the drawer is infinite, and to use the formula for the standard error of a proportion: SQRT([proportion] * [(1 - [proportion])/[number taken out]) or, more simply, SQRT(p*q/n). In this case, the standard error is SQRT(0.5*0.5/8) = 17.7%

However, this approach clearly does not work in all cases. For instance, if there are exactly eight socks in the drawer, then the sample consists of all of them. We are 100% sure that the proportion is exactly 50%.

If there are ten socks in the drawer, then the proportion of black socks ranges from 4/10 to 6/10. These extremes are within one standard error of the observed average. Or to phrase it differently, any reasonable confidence interval (80%, 90%, 95%) contains all possible values. The confidence interval is wider than what is possible.

What does this have to do with business problems? I encountered essentially the same situation when looking at the longitudinal behavior of patients visiting physicians. I had a sample of patients who had visited the physicians and was measuring the use of a particular therapy for a particular diagnosis. Overall, about 20-30% of all patients where in the longitudinal data. And, I had pretty good estimates of the number of diagnoses for each physician.

There are several reasons why this is important. For the company that provides the therapy, knowing which physicians are using it is important. In addition, if the company does any marketing efforts, they would like to see how they perform. So, the critical question is: how well does the observed patient data characterize the physician behavior.

This is very similar to the question posed earlier. If the patient data contains eight new diagnoises and four start on the therapy of interest, how confident am I that the doctor is starting 50% of new patients on the therapy?

If there are eight patients in total, then I am 100% confident, since all of them managed to be in my sample. On the other hand, if the physician has 200 patients, then the statistical measures of standard error are more appropriate.

The situation is exacerbated by another problem. Although the longitudinal data contains 20%-30% of all patients, the distribution over the physicians is much wider. Some physicians have 10% of their patients in the data and some have 50% or more.

The solution is actually quite simple, but not normally taught in early statistics or business statistics courses. There is something called the finite population correction for exactly this situation.[stderr-finite] = [stderr-infinite]*fpcfpc = SQRT(([population size]- [sample size])/([population size] - 1))

So, we simply adjust the standard error and continue with whatever analysis we are using.

There is one caveat to this approach. When observed proportion is 0% or 100%, then the standard error will always be 0, even with the correction. In this case, we need to have a better estimate. In practice, I add or subtract 0.5 from the proportion to calculate the standard error.

This problem is definitely not limited to physicians and medical therapies. I think it becomes an issue in many circumstances where we want to project a global number onto smaller entities.

So, an insurance company may investigate cases for fraud. Overall, they have a large number of cases, but only 5%-10% are in the investigation. If they want to use this information to understand fraud at the agent level, then some agents will have 1% investigated and some 20%. For many of these agents, the correction factor is needed to understand our confidence in their customers' behavior.

The problem occurs because the assumption of an infinite population is reasonable over everyone. However, when we break it into smaller groups (physicians or agents), then the assumption may no longer be valid.

Saturday, May 3, 2008

A few days ago, a reader of this blog used the "ask a data miner" link on the right to mail us this question. (Or, these questions.)

Question:

When modeling rare events in marketing, it has been suggested by many to take a sample stratified by the dependent variable(s) in order to allow the modeling technique a better chance of detecting a difference (or differences in the case of k-level targets). The guidance for the proportion of the event in the sample seems to range between 15-50% for a binary outcome (no guidance have I seen for a k-level target). I am confused by this oversampling and have a couple questions I am hoping you can help with:

Is there a formula for adjusting the predicted probability of the event (s) when there has been oversampling on the dependent?

Is this correction only needed when ascertaining the lift of the model or comparing to other models which were trained on a dataset with the same oversampling proportion OR would you need the adjustment to be done to the predicted value when you score a new dataset – such as when you train a model on a previous campaign and use the model to score the candidates for the upcoming campaign?

I use logistic regression and decision trees for classification of categorical dependent variables – does the adjustment from (question 1) apply to both of these? Does the answer to (question 2) also apply to both of these techniques?

Especially with decision tree models, we suggest using stratified sampling to construct a model set with approximately equal numbers of each of the outcome classes. In the most common case, there are two classes and one, the one of interest, is much rarer than the other. People rarely respond to direct mail; loan recipients rarely default; health care providers rarely commit fraud. The difficulty with rare classes is that decision tree algorithms keep splitting the model set into smaller and smaller groups of records that are purer and purer. When one class is very rare, the data passes the purity test before any splits are made. The resulting model always predicts the common outcome. If only 1% of claims are fraudulent, a model that says no claims are fraudulent will be correct 99% of the time! It will also be useless. Creating a balanced model set where half the cases are fraud forces the algorithm to work harder to differentiate between the two classes.

To answer the first question, yes, there is a formula for adjusting the predicted probability produced on the oversampled data to get the predicted probability on data with the true distribution of classes in the population. Suppose there is 1% fraud in the population and 50% fraud in the model set. Each example of fraud in the model set represents a single actual case of fraud in the population while each non-fraud case in the model set represents 99 cases of fraud in the population. We say the oversampling rate is 99. So, if a certain leaf in a decision tree built on the balanced data has 95 fraudulent cases and 5 non-fraudulent cases, the actual probability of fraud predicted by that leaf is 95/(95+5*99) or about 0.16 because each of the 5 non-fraudulent cases represents 99 such cases. We discuss this at length in Chapter 7 of our book, Mastering Data Mining. You can also arrive at this result by applying the model to the original, non-oversampled data and simply counting the number of records of each class found at each leaf. This is sometimes called backfitting the model.

To answer the second question, this calculation is only necessary if you are actually trying to estimate the probability of the classes. If all you want to do is generate scores that can be used to rank order a list or compare lift for several models all built from the oversampled data, there is no need to correct for oversampling because the order or results will not change.

Using oversampled data also changes the results of logistic regression models, but in a more complicated way. As it happens, this is a particular interest of Professor Gary King, who taught the only actual class in statistics that I have ever taken. He has written several papers on the subject.

Thursday, May 1, 2008

If I want to test the effect of return of investment on a mail/ no mail sample, however, I cannot use a parametric test since the distribution of dollar amounts do not follow a normal distribution. What non-parametric test could I use that would give me something similar to a hypothesis test of two samples?

Recently, we received an email with the question above. Since it was addressed to bloggers@data-miners.com, it seems quite reasonable to answer it here.

First, I need to note that Michael and I are not statisticians. We don't even play one on TV (hmm, that's an interesting idea). However, we have gleaned some knowledge of statistics over the years, much from friends and colleagues who are respected statisticians.

Second, the question I am going to answer is the following: Assume that we do a test, with a test group and a control group. What we want to measure is whether the average dollars per customer is significantly different for the test group as compared to the control group. The challenge is that the dollar amounts themselve do not follow a known distribution, or the distribution is known not to be a normal distribution. For instance, we might only have two products, one that costs $10 and one that costs $100.

The reason that I'm restating the problem is because a term such as ROI (return on investment) gets thrown around a lot. In some cases, it could mean the current value of discounted future cash flows. Here, though, I think it simply means the dollar amount that customers spend (or invest, or donate, or whatever depending on the particular business).

The overall approach is that we want to measure the average and standard error for each of the groups. Then, we'll apply a simple "standard error" of the difference to see if the difference is consistently positive or negative. This is a very typical use of a z-score. And, it is a topic that I discuss in more detail in Chapter 3 of my book "Data Analysis Using SQL and Excel". In fact, the example here is slightly modified from the example in the book.

A good place to start is the Central Limit Theorem. This is a fundamental theorem for statistics. Assume that I have a population of things -- such as customers who are going to spend money in response to a marketing campaign. Assume that I take a sample of these customers and measure an average over the sample. Well, as I take more an more samples, the distribution of the averages follows a normal distribution regardless of the original distribution of values. (This is a slight oversimplification of the Central Limit Theorem, but it captures the important ideas.)

In addition, I can measure the relationship between the characteristics of the overall population and the characteristics of the sample:

(1) The average of the sample is as good an approximation as any of the average of the overall population.

(2) The standard error on the average of the sample is the standard deviation of the overall population divided by the square root of the size of the sample. Alternatively, we can phrase this in terms of variance: the variance of the sample average is the variance of the population average divided by the size of the sample.

Well, we are close. We know the average of each sample, because we can measure the average. If we knew the standard deviation of the overall population, then we could get the standard error for each group. Then, we'd know the standard error and we would be done. Well, it turns out that:

(3) The standard deviation of the sample is as good an approximation as any for the standard deviation of the population. This is convenient!

Let's assume that we have the following scenario.

Our test group has 17,839 customers, and the overall average purchase is $85.48. The control group has 53,537 customers, and the average purchase is $70.14. Is this statistically different?

We need some additional information, namely the standard deviation for each group. For the test group, the standard deviation is $197.23. For the control group, it is $196.67.

The standard error for the two groups is then $197.23/sqrt(17,839) and $196.67/sqrt(53,537), which comes to $1.48 and $0.85, respectively.

So, now the question is: is the difference of the means ($85.48 - $70.14 = $15.34) significantly different from zero. We need another formula from statistics to calculate the standard error of the difference. This formula says that the standard error is the square root of the sums of the squares of standard errors. So the value is $1.71 = sqrt(0.85^2 + 1.48^2).

And we have arrived at a place where we can use the z-score. The difference of $15.34 is about 9 standard deviations from 0 (that is, 9*1.71 is about 15.34). It is highly, highly, highly unlikely that the difference includes 0, so we can say that the test group is significantly better than the control group.

In short, we can apply the concepts of normal distributions, even to calculations on dollar amounts. We do need to be careful and pay attention to what we are doing, but the Central Limit Theorem makes this possible. If you are interested in this subject, I do strongly recommend Data Analysis Using SQL and Excel, particularly Chapter 3.