Imbalanced classes put “accuracy” out of business. They’re a surprisingly common problem in machine learning (specifically in classification), arising whenever a dataset has a disproportionate ratio of observations in each class.

Intuition: Disease Screening Example

Let’s say your client is a leading research hospital, and they’ve asked you to train a model for detecting a disease based on biological inputs collected from patients.

But here’s the catch… the disease is relatively rare; it occurs in only 8% of patients who are screened.

Now, before you even start, do you see how the problem might break? Imagine if you didn’t bother training a model at all. Instead, what if you just wrote a single line of code that always predicts ‘No Disease?’

A crappy, but accurate, solution

Python

def disease_screen(patient_data):
    # Ignore patient_data
    return 'No Disease.'

Well, guess what? Your “solution” would have 92% accuracy!

Unfortunately, that accuracy is misleading.

For patients who do not have the disease, you’d have 100% accuracy.

For patients who do have the disease, you’d have 0% accuracy.

Your overall accuracy would be high simply because most patients do not have the disease (not because your model is any good).

This is clearly a problem because many machine learning algorithms are designed to maximize overall accuracy. The rest of this guide will illustrate different tactics for handling imbalanced classes.

Important notes before we begin:

First, please note that we’re not going to split out a separate test set, tune hyperparameters, or implement cross-validation. In other words, we’re not going to follow best practices (which are covered in our Data Science Primer).

Instead, this tutorial is focused purely on addressing imbalanced classes.

In addition, not every technique below will work for every problem. However, 9 times out of 10, at least one of these techniques should do the trick.

Balance Scale Dataset

For this guide, we’ll use a synthetic dataset called Balance Scale Data, which you can download from the UCI Machine Learning Repository.

This dataset was originally generated to model psychological experiment results, but it’s useful for us because it’s a manageable size and has imbalanced classes.

Import libraries and read dataset

Python

import pandas as pd
import numpy as np

# Read dataset
df = pd.read_csv('balance-scale.data',
                 names=['balance', 'var1', 'var2', 'var3', 'var4'])

# Display example observations
df.head()

The dataset contains information about whether a scale is balanced or not, based on weights and distances of the two arms.

It has 1 target variable, which we've labeled balance.

It has 4 input features, which we've labeled var1 through var4.

The target variable has 3 classes.

R for right-heavy, i.e. when var3 * var4 > var1 * var2

L for left-heavy, i.e. when var3 * var4 < var1 * var2

B for balanced, i.e. when var3 * var4 = var1 * var2

Count of each class

Python

df['balance'].value_counts()
# R    288
# L    288
# B     49
# Name: balance, dtype: int64

However, for this tutorial, we're going to turn this into a binary classification problem.

We're going to label each observation as 1 (positive class) if the scale is balanced or 0 (negative class) if the scale is not balanced:

Transform into binary classification

Python

# Transform into binary classification
df['balance'] = [1 if b == 'B' else 0 for b in df.balance]

df['balance'].value_counts()
# 0    576
# 1     49
# Name: balance, dtype: int64
# About 8% were balanced

As you can see, only about 8% of the observations were balanced. Therefore, if we were to always predict 0, we'd achieve an accuracy of 92%.

The Danger of Imbalanced Classes

Now that we have a dataset, we can really show the dangers of imbalanced classes.

First, let's import the Logistic Regression algorithm and the accuracy metric from Scikit-Learn.

Import algorithm and accuracy metric

Python

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

Next, we'll fit a very simple model using default settings for everything.

Train model on imbalanced data

Python

# Separate input features (X) and target variable (y)
y = df.balance
X = df.drop('balance', axis=1)

# Train model
clf_0 = LogisticRegression().fit(X, y)

# Predict on training set
pred_y_0 = clf_0.predict(X)

As mentioned above, many machine learning algorithms are designed to maximize overall accuracy by default.

We can confirm this:

Python

# How's the accuracy?
print(accuracy_score(y, pred_y_0))
# 0.9216

So our model has 92% overall accuracy, but is it because it's predicting only 1 class?

Python

# Should we be excited?
print(np.unique(pred_y_0))
# [0]

As you can see, this model is only predicting 0, which means it's completely ignoring the minority class in favor of the majority class.
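A confusion matrix makes this failure mode explicit. Here's a quick sketch using our dataset's 576/49 class counts directly, rather than the fitted model itself:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Simulate our class distribution: 576 negatives, 49 positives
y_true = np.array([0] * 576 + [1] * 49)

# A "model" that always predicts the majority class
y_pred = np.zeros_like(y_true)

print(confusion_matrix(y_true, y_pred))
# [[576   0]
#  [ 49   0]]
```

Every one of the 49 positive observations lands in the bottom-left cell as a false negative, with zero true positives. Overall accuracy hides exactly this.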

Next, we'll look at the first technique for handling imbalanced classes: up-sampling the minority class.

1. Up-sample Minority Class

Up-sampling is the process of randomly duplicating observations from the minority class in order to reinforce its signal.

There are several heuristics for doing so, but the most common way is to simply resample with replacement.

First, we'll import the resampling module from Scikit-Learn:

Module for resampling

Python

from sklearn.utils import resample

Next, we'll create a new DataFrame with an up-sampled minority class. Here are the steps:

First, we'll separate observations from each class into different DataFrames.

Next, we'll resample the minority class with replacement, setting the number of samples to match that of the majority class.

Finally, we'll combine the up-sampled minority class with the original majority class into a new DataFrame.
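Those steps can be sketched in code. (For illustration this uses a stand-in DataFrame with the same 576/49 class split; with the real data, you'd use the df loaded earlier. The random_state value is arbitrary, chosen only for reproducibility.)

```python
import pandas as pd
import numpy as np
from sklearn.utils import resample

# Stand-in for the Balance Scale DataFrame: same 576/49 class split
df = pd.DataFrame({'balance': [0] * 576 + [1] * 49,
                   'var1': np.arange(625)})

# Separate majority and minority classes
df_majority = df[df.balance == 0]
df_minority = df[df.balance == 1]

# Up-sample minority class: resample with replacement until it
# matches the majority class count
df_minority_upsampled = resample(df_minority,
                                 replace=True,      # sample with replacement
                                 n_samples=576,     # match majority class
                                 random_state=123)  # reproducible results

# Combine majority class with up-sampled minority class
df_upsampled = pd.concat([df_majority, df_minority_upsampled])

# Both classes now have 576 observations
print(df_upsampled.balance.value_counts())
```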

2. Down-sample Majority Class

Down-sampling works in the opposite direction: we randomly remove observations from the majority class so that its signal no longer drowns out the minority class. This time, the new DataFrame has fewer observations than the original, and the ratio of the two classes is now 1:1.
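The down-sampling steps mirror the up-sampling ones, except we sample the majority class without replacement. (Again sketched on a stand-in DataFrame with the same 576/49 split; with the real data, you'd use the df loaded earlier to produce the df_downsampled used below.)

```python
import pandas as pd
import numpy as np
from sklearn.utils import resample

# Stand-in for the Balance Scale DataFrame: same 576/49 class split
df = pd.DataFrame({'balance': [0] * 576 + [1] * 49,
                   'var1': np.arange(625)})

# Separate majority and minority classes
df_majority = df[df.balance == 0]
df_minority = df[df.balance == 1]

# Down-sample majority class: sample without replacement down to
# the minority class count
df_majority_downsampled = resample(df_majority,
                                   replace=False,     # sample without replacement
                                   n_samples=49,      # match minority class
                                   random_state=123)  # reproducible results

# Combine minority class with down-sampled majority class
df_downsampled = pd.concat([df_majority_downsampled, df_minority])

# Both classes now have 49 observations
print(df_downsampled.balance.value_counts())
```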

Again, let's train a model using Logistic Regression:

Train model on downsampled dataset

Python

# Separate input features (X) and target variable (y)
y = df_downsampled.balance
X = df_downsampled.drop('balance', axis=1)

# Train model
clf_2 = LogisticRegression().fit(X, y)

# Predict on training set
pred_y_2 = clf_2.predict(X)

# Is our model still predicting just one class?
print(np.unique(pred_y_2))
# [0 1]

# How's our accuracy?
print(accuracy_score(y, pred_y_2))
# 0.581632653061

The model is no longer predicting just one class. And while its 58% accuracy is lower than the original 92%, it's now a meaningful number: with balanced classes, a model that always predicted one class would score only 50%.

We'd still want to validate the model on an unseen test dataset, but the results are more encouraging.

3. Change Your Performance Metric

So far, we've looked at two ways of addressing imbalanced classes by resampling the dataset. Next, we'll look at using other performance metrics for evaluating the models.

There's a quote, often attributed to Albert Einstein: “if you judge a fish on its ability to climb a tree, it will live its whole life believing that it is stupid.” It highlights the importance of choosing the right evaluation metric.

For a general-purpose metric for classification, we recommend Area Under ROC Curve (AUROC).

We won't dive into its details in this guide, but you can read more about it here.

Intuitively, AUROC represents the likelihood of your model distinguishing observations from two classes.

In other words, if you randomly select one observation from each class, what's the probability that your model will be able to "rank" them correctly?
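We can check that interpretation on a toy example: compute the fraction of (positive, negative) pairs that the scores rank correctly, and compare it against roc_auc_score. (The labels and scores below are made up purely for illustration.)

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and predicted scores (made up for illustration)
y = np.array([0, 0, 1, 1, 0])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2])

# Fraction of (positive, negative) pairs ranked correctly,
# counting ties as half credit
pos, neg = scores[y == 1], scores[y == 0]
pairwise = np.mean([(p > n) + 0.5 * (p == n) for p in pos for n in neg])

print(pairwise)                  # 0.8333...
print(roc_auc_score(y, scores))  # 0.8333...
```

Both computations agree: AUROC really is the probability of ranking a random positive above a random negative.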

We can import this metric from Scikit-Learn:

Area Under ROC Curve

Python

from sklearn.metrics import roc_auc_score

To calculate AUROC, you'll need predicted class probabilities instead of just the predicted classes. You can get them using the .predict_proba() function like so:

Get class probabilities

Python

# Predict class probabilities
prob_y_2 = clf_2.predict_proba(X)

# Keep only the positive class
prob_y_2 = [p[1] for p in prob_y_2]

prob_y_2[:5]  # Example
# [0.45419197226479618,
#  0.48205962213283882,
#  0.46862327066392456,
#  0.47868378832689096,
#  0.58143856820159667]

So how did this model (trained on the down-sampled dataset) do in terms of AUROC?

AUROC of model trained on downsampled dataset

Python

print(roc_auc_score(y, prob_y_2))
# 0.568096626406

Ok... and how does this compare to the original model trained on the imbalanced dataset?

AUROC of model trained on imbalanced dataset

Python

prob_y_0 = clf_0.predict_proba(X)
prob_y_0 = [p[1] for p in prob_y_0]

print(roc_auc_score(y, prob_y_0))
# 0.530718537415

Remember, our original model trained on the imbalanced dataset had an accuracy of 92%, which is much higher than the 58% accuracy of the model trained on the down-sampled dataset.

However, the latter model has an AUROC of 57%, which is higher than the 53% of the original model (but not by much).

Note: if you got an AUROC of 0.47, it just means you need to invert the predictions because Scikit-Learn is misinterpreting the positive class. AUROC should be >= 0.5.

4. Penalize Algorithms (Cost-Sensitive Training)

The next tactic is to use penalized learning algorithms that increase the cost of classification mistakes on the minority class.

A popular algorithm for this technique is Penalized-SVM:

Support Vector Machine

Python

from sklearn.svm import SVC

During training, we can use the argument class_weight='balanced' to penalize mistakes on the minority class by an amount proportional to how under-represented it is.

We also want to include the argument probability=True if we want to enable probability estimates for SVM algorithms.
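For intuition, 'balanced' weights each class inversely proportional to its frequency, using the heuristic n_samples / (n_classes * class_count). A quick sketch with NumPy shows that, for our 576/49 split, minority-class mistakes cost roughly 12 times more:

```python
import numpy as np

# Class counts from our dataset: 576 negatives, 49 positives
y = np.array([0] * 576 + [1] * 49)

n_samples = len(y)       # 625
n_classes = 2
counts = np.bincount(y)  # [576, 49]

# scikit-learn's 'balanced' class-weight heuristic
weights = n_samples / (n_classes * counts)
print(weights)  # ~[0.54, 6.38]: minority mistakes weigh ~12x more
```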

Let's train a model using Penalized-SVM on the original imbalanced dataset:

Train Penalized SVM on imbalanced dataset

Python

# Separate input features (X) and target variable (y)
y = df.balance
X = df.drop('balance', axis=1)

# Train model
clf_3 = SVC(kernel='linear',
            class_weight='balanced',  # penalize
            probability=True)

clf_3.fit(X, y)

# Predict on training set
pred_y_3 = clf_3.predict(X)

# Is our model still predicting just one class?
print(np.unique(pred_y_3))
# [0 1]

# How's our accuracy?
print(accuracy_score(y, pred_y_3))
# 0.688

# What about AUROC?
prob_y_3 = clf_3.predict_proba(X)
prob_y_3 = [p[1] for p in prob_y_3]
print(roc_auc_score(y, prob_y_3))
# 0.5305236678

Again, our purpose here is only to illustrate this technique. To really determine which of these tactics works best for this problem, you'd want to evaluate the models on a hold-out test set.

5. Use Tree-Based Algorithms

The final tactic we'll consider is using tree-based algorithms. Decision trees often perform well on imbalanced datasets because their hierarchical structure allows them to learn signals from both classes.

Now, let's train a model using a Random Forest on the original imbalanced dataset.

Train Random Forest on imbalanced dataset

Python

from sklearn.ensemble import RandomForestClassifier

# Separate input features (X) and target variable (y)
y = df.balance
X = df.drop('balance', axis=1)

# Train model
clf_4 = RandomForestClassifier()
clf_4.fit(X, y)

# Predict on training set
pred_y_4 = clf_4.predict(X)

# Is our model still predicting just one class?
print(np.unique(pred_y_4))
# [0 1]

# How's our accuracy?
print(accuracy_score(y, pred_y_4))
# 0.9744

# What about AUROC?
prob_y_4 = clf_4.predict_proba(X)
prob_y_4 = [p[1] for p in prob_y_4]
print(roc_auc_score(y, prob_y_4))
# 0.999078798186

Wow! 97% accuracy and nearly 100% AUROC? Is this magic? A sleight of hand? Cheating? Too good to be true?

Well, tree ensembles have become very popular because they perform extremely well on many real-world problems. We certainly recommend them wholeheartedly.

However:

While these results are encouraging, the model could be overfit, so you should still evaluate your model on an unseen test set before making the final decision.

Note: your numbers may differ slightly due to the randomness in the algorithm. You can set a random seed for reproducible results.

Honorable Mentions

There were a few tactics that didn't make it into this tutorial:

Create Synthetic Samples (Data Augmentation)

Creating synthetic samples is a close cousin of up-sampling, and some people might categorize them together. For example, the SMOTE algorithm is a method of resampling from the minority class while slightly perturbing feature values, thereby creating "new" samples.

*Update: One of our readers, Marco, brought up a great point about the risks of using SMOTE without proper cross-validation. Check out the comments section for more details or read his blog post on the topic.
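The core idea behind SMOTE is to interpolate between a minority observation and one of its nearest minority-class neighbors. That interpolation step can be sketched in a few lines of NumPy (a simplified illustration, not the full algorithm; for real use, reach for a dedicated implementation such as the imbalanced-learn package):

```python
import numpy as np

rng = np.random.RandomState(0)

# Made-up minority-class points in a 2-D feature space
X_minority = rng.normal(0, 1, size=(10, 2))

def smote_like(X, n_new, k=3, rng=rng):
    """Generate n_new synthetic points by interpolating between a
    randomly chosen point and one of its k nearest neighbors."""
    synthetic = []
    for _ in range(n_new):
        i = rng.randint(len(X))
        # Indices of the k nearest neighbors (excluding the point itself)
        dists = np.linalg.norm(X - X[i], axis=1)
        neighbors = np.argsort(dists)[1:k + 1]
        j = rng.choice(neighbors)
        # New point lies somewhere on the segment between X[i] and X[j]
        gap = rng.rand()
        synthetic.append(X[i] + gap * (X[j] - X[i]))
    return np.array(synthetic)

X_synthetic = smote_like(X_minority, n_new=5)
print(X_synthetic.shape)  # (5, 2)
```

Each synthetic point stays within the neighborhood of real minority observations, which is what makes these "new" samples plausible.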

Combine Minority Classes

Combining minority classes of your target variable may be appropriate for some multi-class problems.

For example, let's say you wished to predict credit card fraud. In your dataset, each method of fraud may be labeled separately, but you might not care about distinguishing them. You could combine them all into a single 'Fraud' class and treat the problem as binary classification.
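A quick sketch of that relabeling with pandas (the category names here are hypothetical):

```python
import pandas as pd

# Hypothetical multi-class fraud labels
labels = pd.Series(['legit', 'card_skimming', 'legit', 'phishing',
                    'stolen_card', 'legit'])

# Collapse all fraud categories into a single 'Fraud' class
binary = labels.apply(lambda l: 'legit' if l == 'legit' else 'Fraud')

print(binary.tolist())
# ['legit', 'Fraud', 'legit', 'Fraud', 'Fraud', 'legit']
```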

Reframe as Anomaly Detection

Anomaly detection, a.k.a. outlier detection, is for detecting outliers and rare events. Instead of building a classification model, you'd have a "profile" of a normal observation. If a new observation strays too far from that "normal profile," it would be flagged as an anomaly.
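Here's a minimal sketch of that idea using scikit-learn's IsolationForest on made-up data: the model builds its profile from "normal" observations only, then flags new points that stray too far from it:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)

# "Normal" observations cluster around the origin
X_normal = rng.normal(0, 1, size=(500, 2))

# New observations that stray far from the normal profile
X_new = rng.uniform(5, 8, size=(10, 2))

# Build the "normal profile" from ordinary observations only
iso = IsolationForest(random_state=0).fit(X_normal)

# predict() returns 1 for inliers and -1 for anomalies
preds = iso.predict(X_new)
print(preds)
```

Note that no labeled "anomaly" examples were needed for training, which is exactly why this reframing suits extremely imbalanced problems.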

Conclusion & Next Steps

In this guide, we covered 5 tactics for handling imbalanced classes in machine learning:

Up-sample the minority class

Down-sample the majority class

Change your performance metric

Penalize algorithms (cost-sensitive training)

Use tree-based algorithms

These tactics are subject to the No Free Lunch theorem, and you should try several of them and use the results from the test set to decide on the best solution for your problem.