Scale, Standardize, or Normalize with Scikit-Learn

When to use MinMaxScaler, RobustScaler, StandardScaler, and Normalizer

Jeff Hale, Mar 4

Many machine learning algorithms work better when features are on a relatively similar scale and close to normally distributed.

Which method you need, if any, depends on your model type and your feature values.

This guide will highlight the differences and similarities among these methods and help you learn when to reach for which tool.

[Image: Scales]

As often as these methods appear in machine learning workflows, I found it difficult to find information about which of them to use when.

Commentators often use the terms scale, standardize, and normalize interchangeably.

However, there are some differences, and the four scikit-learn functions we will examine do different things.

First, a few housekeeping notes:

- The Jupyter Notebook on which this article is based can be found here.
- In this article, we aren’t looking at log transformations or other transformations aimed at reducing the heteroscedasticity of the errors.
- This guide is current as of scikit-learn v0.20.3.

What Do These Terms Mean?

Scale generally means to change the range of the values.

The shape of the distribution doesn’t change.

Think about how a scale model of a building has the same proportions as the original, just smaller.

That’s why we say it is drawn to scale.

The range is often set at 0 to 1.

Standardize generally means changing the values so that the distribution’s standard deviation equals one.

Standardizing does not change the shape of the distribution, so the output is close to a standard normal distribution only if the original values were roughly normal.

Scaling is often implied.

Normalize can be used to mean either of the above things (and more!).

I suggest you avoid the term normalize, because it has many definitions and is prone to create confusion.

If you use any of these terms in your communication, I strongly suggest you define them.

Why Scale, Standardize, or Normalize?

Many machine learning algorithms perform better or converge faster when features are on a relatively similar scale and/or close to normally distributed.

Examples of such algorithm families include:

- linear and logistic regression
- nearest neighbors
- neural networks
- support vector machines with radial basis function kernels
- principal components analysis
- linear discriminant analysis

Scaling and standardizing can help features arrive in more digestible form for these algorithms.

The four scikit-learn preprocessing methods we are examining follow the API shown below.

X_train and X_test are the usual numpy ndarrays or pandas DataFrames.

    from sklearn import preprocessing

    mm_scaler = preprocessing.MinMaxScaler()
    # Fit the scaler on the training data and transform it in one step
    X_train_minmax = mm_scaler.fit_transform(X_train)
    # Apply the already-fitted scaler to the test data
    mm_scaler.transform(X_test)

We’ll look at a number of distributions and apply each of the four scikit-learn methods to them.
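To make the fit/transform pattern concrete, here is a minimal sketch on toy data. The arrays are hypothetical values invented for illustration, and the name X_test_minmax is an assumption added here to hold the transformed test set.

```python
import numpy as np
from sklearn import preprocessing

# Hypothetical toy data: two features on very different scales
X_train = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])
X_test = np.array([[2.5, 200.0], [4.0, 600.0]])

mm_scaler = preprocessing.MinMaxScaler()

# The minimum and maximum are learned from the training data only
X_train_minmax = mm_scaler.fit_transform(X_train)

# The test data is scaled with the training minimum and maximum
X_test_minmax = mm_scaler.transform(X_test)

print(X_train_minmax)
print(X_test_minmax)
```

Because the scaler is fit only on the training data, test values outside the training range can land outside 0 to 1, as the second test row does here.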

Original Data

I created five distributions with different characteristics.

The distributions are:

- beta: with negative skew
- exponential: with positive skew
- normal_p: normal, platykurtic
- normal_l: normal, leptokurtic
- bimodal: bimodal

The values are all of relatively similar scale, as can be seen on the X axis of the Kernel Density Estimate plot (kdeplot) below.

Then I added a sixth distribution with much larger values (normally distributed): normal_big.

Now our kdeplot looks like this:

Squint hard at the monitor and you might notice the tiny green bar of big values to the right.

Here are the descriptive statistics for our features.
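As a rough sketch, features with the characteristics described above could be generated with NumPy along these lines. The sample size, the seed, and every distribution parameter are assumptions chosen for illustration, not the values used in the notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)  # arbitrary seed, for reproducibility
n = 1000  # assumed sample size

# All parameters below are illustrative assumptions
df = pd.DataFrame({
    "beta": rng.beta(5, 1, n) * 60,          # negative (left) skew
    "exponential": rng.exponential(10, n),   # positive (right) skew
    "normal_p": rng.normal(10, 10, n),       # wide normal, standing in for normal_p
    "normal_l": rng.normal(10, 2, n),        # narrow normal, standing in for normal_l
    "bimodal": np.concatenate(
        [rng.normal(20, 2, n // 2), rng.normal(-20, 2, n // 2)]
    ),                                       # two well-separated modes
    "normal_big": rng.normal(1_000_000, 10_000, n),  # much larger values
})

# Descriptive statistics for each feature
print(df.describe())
```

The output of describe() makes the difference in scale between normal_big and the other columns easy to see.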

Alright, let’s start scaling!

MinMaxScaler

For each value in a feature, MinMaxScaler subtracts the minimum value in the feature and then divides by the range.

The range is the difference between the original maximum and original minimum.

MinMaxScaler preserves the shape of the original distribution.

It doesn’t meaningfully change the information embedded in the original data.

Note that MinMaxScaler doesn’t reduce the importance of outliers.

The default range for the feature returned by MinMaxScaler is 0 to 1.
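The calculation described above can be checked by hand. This is a minimal sketch that compares a manual computation with MinMaxScaler on a single hypothetical feature containing one large value.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical feature (one column) with a single large value
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Manual version: subtract the minimum, then divide by the range
x_min = x.min(axis=0)
x_range = x.max(axis=0) - x_min
x_manual = (x - x_min) / x_range

# scikit-learn version with the default feature_range of (0, 1)
x_sklearn = MinMaxScaler().fit_transform(x)

print(np.allclose(x_manual, x_sklearn))
print(x_sklearn.ravel())
```

In this example the large value maps to 1 and the four small values are squeezed to roughly 0.03 or below, which illustrates why MinMaxScaler does not reduce the importance of outliers.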

Here’s the kdeplot after MinMaxScaler has been applied.

Notice how the features are all on the same relative scale.

The relative spaces between each feature’s values have been maintained.

MinMaxScaler is a good place to start unless you know you want your feature to have a normal distribution or want outliers to have reduced influence.

[Image: Different types of scales]

RobustScaler

RobustScaler transforms the feature vector by subtracting the median and then dividing by the interquartile range (75th percentile value minus 25th percentile value).

As with MinMaxScaler, our feature with large values, normal_big, is now of similar scale to the other features.

Note that RobustScaler does not scale the data into a predetermined interval like MinMaxScaler.

It does not meet the strict definition of scale I introduced earlier.

Note that the range for each feature after RobustScaler is applied is larger than it was for MinMaxScaler.

Use RobustScaler if you want to reduce the effects of outliers, relative to MinMaxScaler.
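Here is a minimal sketch of the median and interquartile range calculation, compared against RobustScaler with its default settings. The feature values are hypothetical and include one outlier.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical feature (one column) with a single outlier
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Manual version: subtract the median, divide by the interquartile range
median = np.median(x, axis=0)
q25, q75 = np.percentile(x, [25, 75], axis=0)
x_manual = (x - median) / (q75 - q25)

# scikit-learn version (default quantile_range is 25 to 75)
x_sklearn = RobustScaler().fit_transform(x)

print(np.allclose(x_manual, x_sklearn))
print(x_sklearn.ravel())
```

The median and interquartile range are barely affected by the outlier, so the bulk of the values stays spread out around zero, while the outlier itself remains far away rather than being forced into a fixed interval.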

Now let’s turn to StandardScaler.

StandardScaler

StandardScaler standardizes a feature by subtracting the mean and then scaling to unit variance.

Scaling to unit variance means dividing all the values by the standard deviation.

StandardScaler does not meet the strict definition of scale I introduced earlier.

StandardScaler results in a distribution with a standard deviation equal to 1.

The variance is equal to 1 also, because variance = standard deviation squared.

And 1 squared = 1.

StandardScaler makes the mean of the distribution 0.

If a feature is roughly normally distributed, about 68% of its values will lie between -1 and 1.
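The following sketch checks these properties on a hypothetical, normally distributed feature. The mean, spread, sample size, and seed are assumptions chosen for illustration. Note that StandardScaler divides by the population standard deviation, which matches NumPy's default of ddof=0.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(seed=0)  # arbitrary seed
# Hypothetical normally distributed feature (one column)
x = rng.normal(loc=50.0, scale=5.0, size=(10_000, 1))

# Manual version: subtract the mean, divide by the standard deviation
x_manual = (x - x.mean(axis=0)) / x.std(axis=0)

x_sklearn = StandardScaler().fit_transform(x)

print(np.allclose(x_manual, x_sklearn))
print(x_sklearn.mean(), x_sklearn.std())

# Share of values between -1 and 1 (close to 0.68 for normal data)
share_within_one = np.mean(np.abs(x_sklearn) <= 1)
print(share_within_one)
```

The 68% figure depends on the data being close to normal. For a strongly skewed or bimodal feature, the share between -1 and 1 can differ noticeably.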

In the plot above, you can see that all of the distributions have a mean close to zero and unit variance.

The values are on a similar scale, but the range is larger than after MinMaxScaler.

Deep learning algorithms often call for zero mean and unit variance.

With small sample sizes, regression-type algorithms also benefit from normally distributed data.

StandardScaler does distort the relative distances between the feature values, so it’s generally my second choice in this family of transformations.

Now let’s have a look at Normalizer.

Normalizer

Normalizer works on the rows, not the columns! I find that very unintuitive.

It’s easy to miss this information in the docs.

By default, L2 normalization is applied to each observation so that the values in a row have a unit norm.

Unit norm with L2 means that if each element were squared and summed, the total would equal 1.
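A minimal sketch of the row-wise behavior, using a small hypothetical array whose rows have easy-to-check L2 norms:

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Hypothetical data: each row is one observation
X = np.array([[3.0, 4.0], [1.0, 0.0], [5.0, 12.0]])

# Default norm is "l2"; each row is divided by its own L2 norm
X_norm = Normalizer().fit_transform(X)

# Manual version: divide every row by the square root of its sum of squares
row_norms = np.linalg.norm(X, axis=1, keepdims=True)
X_manual = X / row_norms

print(X_norm)
print((X_norm ** 2).sum(axis=1))  # squared values in each row sum to 1
```

The first row, for example, has an L2 norm of 5, so it becomes 0.6 and 0.8. Nothing is computed per column, which is what sets Normalizer apart from the three scalers above.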