Oversampling for Rare Events

This tutorial describes the effects of oversampling on a rare-event model. Suppose you are building a logistic regression model in which the percentage of events (the desired outcome) is very low (less than 1%). You need to treat the data so that enough events are used to train the model and the model is robust. Oversampling is one of the treatments for the rare-event problem.

Oversampling

Suppose you are working on a retail customer attrition (churn) problem for a telecom company. You start building a logistic regression model in which the target (dependent) variable is defined as whether a customer is active or not. If a customer is NOT active, the target variable is 1. Otherwise it is 0. You calculate the attrition percentage (i.e. the mean of the target variable) and find it is 1% of a 10,000 customer base. It means there are 9,900 active customers and 100 attritors in the 10k cases of your target variable. Since the distribution of the target variable is highly skewed, you need to oversample the event (attritors). Here, oversampling means decreasing the volume of non-events so that the proportion of events and non-events becomes balanced or less skewed.

You take a small proportion of the non-event cases and a large proportion (or all the records) of the relatively few event cases.

When should we perform oversampling?

In logistic regression, a common rule of thumb is to have a minimum of 10 events per independent variable. Many people get confused about how this rule limits the number of independent variables (predictors). A commonly asked question is: 'Does this rule apply before variable selection or after variable selection?' The answer is that it applies to all the candidate variables. Suppose you have 30 events and you are running a stepwise selection method for variable selection (i.e. adding variables one by one and checking the significance level of the previously entered variables at every addition). The rule applies to all the candidate variables in the stepwise algorithm, so you should limit the algorithm to consider only 3 independent variables.

Some researchers do not follow this rule strictly and oversample even if the rule is met. It is advisable to build two models (with and without oversampling), test both on a population that has not been oversampled, and compare the accuracy of the models.
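The arithmetic behind this rule of thumb can be sketched in a few lines. This is an illustration in Python rather than SAS, and the function name max_predictors is hypothetical, not part of any library.

```python
def max_predictors(n_events, events_per_variable=10):
    """Number of candidate predictors allowed by the
    events-per-variable rule of thumb (10 by default)."""
    return n_events // events_per_variable

print(max_predictors(30))    # 3 candidate variables for 30 events
print(max_predictors(100))   # 10 candidate variables for 100 events
```

With the 100 attritors from the churn example, the rule would allow about 10 candidate variables.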

Terminologies related to oversampling

Prior Probability : The prior probability is the probability of an event before you oversample the data, i.e. the event rate in the original data.

Posterior Probability : The posterior probability is the probability of an event after you have oversampled the data. It is a conditional probability because it conditions on the observed data.

Posterior probability is normally calculated by updating the prior probability by using Bayes' theorem.

Method I : With PROC SURVEYSELECT

With PROC SURVEYSELECT, you perform stratified sampling on the target variable. The option n = (number of 0s you want to keep, number of 1s you want to keep) sets the sample size for each stratum. Instead, you can use the rate option, e.g. rate = (50,50). It means you want to retain 50% of the 0s and 50% of the 1s.

Method II : Without PROC SURVEYSELECT

data sub;
    set full;
    /* keep every event; keep each non-event with probability 1/9 */
    if y=1 or (y=0 and ranuni(75302)<1/9) then output;
run;

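Note : The 1/9 rate assumes 10% events and 90% non-events in the original data (before sampling). Keeping all the events and about one ninth of the non-events leaves roughly as many non-events as events, so after running the above code the distribution of events and non-events would be approximately 50:50.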
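The same logic can be sketched outside SAS. The Python snippet below is only an illustration: the helper names nonevent_keep_rate and oversample and the toy dataset are assumptions, and Python's random generator will not reproduce the stream of ranuni even with the same seed. It shows where the 1/9 comes from and what rate would be needed for other event rates.

```python
import random

def nonevent_keep_rate(event_rate, target_event_share=0.5):
    """Fraction of non-events to keep so that, with all events retained,
    events make up target_event_share of the sample."""
    odds_original = event_rate / (1.0 - event_rate)
    odds_target = target_event_share / (1.0 - target_event_share)
    return odds_original / odds_target

def oversample(rows, keep_rate, seed=75302):
    """Keep every event (y == 1) and each non-event with probability keep_rate."""
    rng = random.Random(seed)
    return [r for r in rows if r["y"] == 1 or rng.random() < keep_rate]

# Hypothetical data: 1,000 events and 9,000 non-events (10% event rate).
full = [{"id": i, "y": 1 if i < 1000 else 0} for i in range(10000)]

rate = nonevent_keep_rate(0.10)      # 1/9 for a 10% event rate
sub = oversample(full, rate)
n_events = sum(r["y"] for r in sub)
n_nonevents = len(sub) - n_events
print(round(rate, 4), n_events, n_nonevents)
```

For the 1% attrition example, nonevent_keep_rate(0.01) gives 1/99, i.e. about 100 of the 9,900 active customers would be kept alongside the 100 attritors.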

Effect of oversampling

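Oversampling does not affect the slopes (parameter estimates), but it does affect the intercept, making it too high. In other words, the slope estimates stay essentially the same after sampling (up to sampling variation), while the intercept increases substantially.

Predicted probabilities are affected because they are calculated from both the parameter estimates and the intercept, and the intercept is incorrect as stated above. They increase after sampling because the intercept is overestimated.

Oversampling does not affect sensitivity or specificity, because each is computed within a single class (events only, or non-events only). The false positive and false negative rates are affected when they are defined as the share of predicted events that are actually non-events and the share of predicted non-events that are actually events, because these depend on the mix of events and non-events.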

ROC curve is not affected by oversampling.

Oversampling does not affect rank ordering (sorting based on predicted probability) because adjusting for oversampling is a monotonic transformation of the predicted probability (a constant shift on the log-odds scale). Hence, it does not affect gain and lift charts if you score an out-of-time sample or an unsampled validation dataset. However, if you compare the lift on the unsampled and sampled versions of the training dataset, gain and lift charts are affected because the proportion of events has changed. For example, suppose the event rate in the original data is 10% and the predicted probability for one observation, scored by the model built on 50:50 oversampled data, is 80%. The lift on the sampled data is 80% / 50% = 1.6. After adjusting for oversampling, the probability is 30.8%, so the lift on the original data is 30.8% / 10% = 3.08.
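The 30.8% figure can be reproduced with the standard prior-correction formula for oversampled scores. The Python sketch below is an illustration only. The function name adjust_probability is hypothetical, and the 10% prior and 50% sampled event rates are the ones used in the example above.

```python
def adjust_probability(p_sampled, prior_event, sample_event):
    """Map a probability scored on oversampled data back to the original
    population. prior_event is the event share before sampling,
    sample_event is the event share after sampling."""
    num = p_sampled * prior_event / sample_event
    den = num + (1.0 - p_sampled) * (1.0 - prior_event) / (1.0 - sample_event)
    return num / den

p_sampled = 0.80
p_adjusted = adjust_probability(p_sampled, prior_event=0.10, sample_event=0.50)

print(round(p_adjusted, 3))          # 0.308
print(round(p_sampled / 0.50, 2))    # 1.6  (lift on sampled data)
print(round(p_adjusted / 0.10, 2))   # 3.08 (lift on original data)
```

Because the adjustment is monotonic, a higher sampled score always maps to a higher adjusted score, which is why rank ordering is preserved.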

Correcting Confusion Matrix

Suppose π0 is the proportion of non-events before sampling, π1 is the proportion of events before sampling, ρ1 is the proportion of events after sampling, and ρ0 is the proportion of non-events after sampling.
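One common way to correct the confusion matrix is to keep the sensitivity and specificity measured on the oversampled data and reweight them by the original proportions π1 and π0. The Python sketch below illustrates that approach under stated assumptions: the function name is hypothetical, and the sensitivity and specificity values are made up for the example.

```python
def corrected_confusion_matrix(sensitivity, specificity, prior_event):
    """Confusion-matrix cells as proportions of the original population.
    sensitivity and specificity come from the oversampled data;
    prior_event is pi1, so pi0 = 1 - pi1."""
    pi1 = prior_event
    pi0 = 1.0 - prior_event
    return {
        "TP": pi1 * sensitivity,
        "FN": pi1 * (1.0 - sensitivity),
        "TN": pi0 * specificity,
        "FP": pi0 * (1.0 - specificity),
    }

# Hypothetical sensitivity 0.70 and specificity 0.80, with pi1 = 0.10.
cells = corrected_confusion_matrix(0.70, 0.80, 0.10)
for name, value in cells.items():
    print(name, round(value, 2))
```

With these assumed inputs, true positives are 7% and false positives 18% of the original population, so only 28% of predicted events are real events, far fewer than the same cut-off would suggest on a 50:50 sample.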

Note : You do not need to adjust for oversampling if your goal is to select the top 30% of customers based on their predicted probability. The adjustment is a monotonic transformation, so it does not affect rank ordering. It should be performed only when you need to know the “correct” probability for each customer.

I. Implementing Offset Method in SAS :

You can use the PRIOREVENT= option in the SCORE statement to specify the prior event probability.
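The arithmetic behind the offset correction can be sketched as follows. This is a Python illustration, not the SAS code. The function names are hypothetical, and the 10% prior and 50% sampled event rates are taken from the lift example earlier in the tutorial. The offset is the amount by which oversampling inflates the intercept, and subtracting it on the log-odds scale recovers the adjusted probability.

```python
import math

def oversampling_offset(prior_event, sample_event):
    """Intercept inflation caused by oversampling:
    log((rho1 * pi0) / (rho0 * pi1))."""
    pi1, rho1 = prior_event, sample_event
    return math.log((rho1 * (1.0 - pi1)) / ((1.0 - rho1) * pi1))

def adjust_with_offset(p_sampled, prior_event, sample_event):
    """Subtract the offset on the logit scale, then convert back to a probability."""
    logit = math.log(p_sampled / (1.0 - p_sampled))
    corrected = logit - oversampling_offset(prior_event, sample_event)
    return 1.0 / (1.0 + math.exp(-corrected))

offset = oversampling_offset(0.10, 0.50)
print(round(offset, 3))                                 # 2.197
print(round(adjust_with_offset(0.80, 0.10, 0.50), 3))   # 0.308
```

The slopes are untouched. Only the intercept moves, which matches the effect of oversampling described above.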

Deepanshu founded ListenData with a simple objective - Make analytics easy to understand and follow. He has over 7 years of experience in data science and predictive modeling. During his tenure, he has worked with global clients in various domains like banking, Telecom, HR and Health Insurance.


Hi, I have come across similar problem where I have 1.4 % churn rate (event) for around 3 million obs. I have taken 50-50 (all events and some non events). So in this case is it correct to use priorevent=0.016 in the score statement ( because my event rate was 1.6% before over sampling )?. Another question, if I do oversampling on training data and NOT on validation data, wouldn't event rate be very low in the validation dataset for sas to do validation? Many thanks.

Yes, priorevent = 0.016 is correct. The idea of using validation dataset is to validate the model and fitting equation derived from the training dataset on validation dataset. You have built your model on training data and now you are checking whether model works well on data outside training. If you do oversampling on validation data as well, it would NOT be a right method of validation of your model. It is because the real desired outcome rate (event rate) is 1.6% which you are trying to predict for the future population. Hope it helps!

Another question, does event rate matter if you have enough volume of events in the model? I am working on a churn model for telecom (as in the example you have given). The churn (event) rate is 0.7% but I have around 10,000 events in around 1 million observations. I am testing around 20 variables in the model and the final model has around 10 variables. My understanding is that if you have enough event volume, like 10K in this case, relative to the number of independent variables, a low event rate should not matter?

Yes, your understanding is correct. A low event rate does not matter if you have enough events relative to the number of variables. This rule applies only to logistic regression. It's not safe to generalize it to all algorithms.

Hi Deepanshu, If I have a case where I am using sample of 150k from the base and my churn rate is 1%, so 1500 cases of churners (events), do I really need to oversample if I am testing around 30 variables and final model has <20 variables. Also, as my probabilities are very low, my confusion matrix is super screwed at 0.4 cut off. How do I explain this ?