Gentle Introduction to the Bias-Variance Trade-Off in Machine Learning

Supervised machine learning algorithms can best be understood through the lens of the bias-variance trade-off.

In this post, you will discover the Bias-Variance Trade-Off and how to use it to better understand machine learning algorithms and get better performance on your data.

Let’s get started.

Photo by Matt Biddulph, some rights reserved.

Overview of Bias and Variance

In supervised machine learning, an algorithm learns a model from training data.

The goal of any supervised machine learning algorithm is to best estimate the mapping function (f) for the output variable (Y) given the input data (X). The mapping function is often called the target function because it is the function that a given supervised machine learning algorithm aims to approximate.

The prediction error for any machine learning algorithm can be broken down into three parts:

Bias Error

Variance Error

Irreducible Error

The irreducible error cannot be reduced regardless of what algorithm is used. It is the error introduced from the chosen framing of the problem and may be caused by factors like unknown variables that influence the mapping of the input variables to the output variable.
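For the special case of squared-error loss, the three-part breakdown above can be written out explicitly. The block below is a sketch under stated assumptions, not something taken from the original post: the symbols \hat{f} (the model learned from a random training set), \varepsilon (noise) and \sigma^2 (the noise variance) are introduced here for illustration, and the outputs are assumed to follow y = f(x) + \varepsilon with zero-mean noise.

```latex
% Assumes y = f(x) + \varepsilon with E[\varepsilon] = 0 and Var(\varepsilon) = \sigma^2.
% Expectations are taken over random training sets and over the noise.
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible error}}
```

The last term does not depend on \hat{f} at all, which is why no choice of algorithm can reduce it.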

In this post, we will focus on the two parts we can influence with our machine learning algorithms: the bias error and the variance error.

Bias Error

Bias refers to the simplifying assumptions made by a model to make the target function easier to learn.

Generally, parametric algorithms have a high bias, which makes them fast to learn and easier to understand but less flexible. In turn, they have lower predictive performance on complex problems that fail to meet the simplifying assumptions of the algorithm's bias.

Low Bias: Suggests fewer assumptions about the form of the target function.

High Bias: Suggests more assumptions about the form of the target function.

Variance Error

Variance is the amount that the estimate of the target function will change if different training data were used.

The target function is estimated from the training data by a machine learning algorithm, so we should expect the algorithm to have some variance. Ideally, it should not change too much from one training dataset to the next, meaning that the algorithm is good at picking out the hidden underlying mapping between the inputs and the output variables.

Machine learning algorithms that have a high variance are strongly influenced by the specifics of the training data. This means that the specifics of the training data influence the number and types of parameters used to characterize the mapping function.

Low Variance: Suggests small changes to the estimate of the target function with changes to the training dataset.

High Variance: Suggests large changes to the estimate of the target function with changes to the training dataset.

Generally, nonparametric machine learning algorithms that have a lot of flexibility have a high variance. For example, decision trees have a high variance, which is even higher if the trees are not pruned before use.
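The idea of variance can be made concrete with a small simulation. The sketch below is illustrative only and uses just the Python standard library: the target function sin(2*pi*x), the noise level, the sample sizes and every function name are assumptions made up for the example. It draws many independent training sets, fits a straight line (parametric) and a 1-nearest-neighbor model (nonparametric) to each, and measures how much the prediction at a single input changes from one training set to the next.

```python
import math
import random
import statistics


def target(x):
    # Hypothetical "true" mapping function, chosen only for illustration.
    return math.sin(2 * math.pi * x)


def sample_training_set(rng, n=30, noise=0.3):
    # Draw n noisy observations of the target function on [0, 1).
    xs = [rng.random() for _ in range(n)]
    ys = [target(x) + rng.gauss(0, noise) for x in xs]
    return xs, ys


def fit_line(xs, ys):
    # Ordinary least squares for y = a + b * x (a parametric model).
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    return lambda x: a + b * x


def fit_1nn(xs, ys):
    # 1-nearest neighbor: predict the output of the closest training point.
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def prediction_variance(fit, x0=0.25, trials=300, seed=0):
    # Variance of the prediction at x0 across independently drawn training sets.
    rng = random.Random(seed)
    preds = [fit(*sample_training_set(rng))(x0) for _ in range(trials)]
    return statistics.pvariance(preds)


line_var = prediction_variance(fit_line)
nn_var = prediction_variance(fit_1nn)
print(f"line: {line_var:.4f}  1-NN: {nn_var:.4f}")
```

Under these assumptions the 1-nearest-neighbor prediction should vary noticeably more across training sets than the straight line does, because it copies the noise of whichever single point happens to be closest.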

Bias-Variance Trade-Off

The goal of any supervised machine learning algorithm is to achieve low bias and low variance. In turn, the algorithm should achieve good prediction performance.

You can see a general trend in the examples above:

Parametric or linear machine learning algorithms often have a high bias but a low variance.

Non-parametric or non-linear machine learning algorithms often have a low bias but a high variance.

The parameterization of machine learning algorithms is often a battle to balance out bias and variance.

Below are two examples of configuring the bias-variance trade-off for specific algorithms:

The k-nearest neighbors algorithm has low bias and high variance, but the trade-off can be changed by increasing the value of k, which increases the number of neighbors that contribute to the prediction and in turn increases the bias of the model.

The support vector machine algorithm has low bias and high variance, but the trade-off can be changed by adjusting the C parameter, which controls how many violations of the margin are allowed in the training data. Allowing more violations increases the bias but decreases the variance. Note that the direction depends on how C is defined: where C is a budget for violations, this means increasing C; where C is a penalty on violations, as in scikit-learn's SVC, it means decreasing C.
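The effect of k in k-nearest neighbors can be checked with a rough simulation. This is a sketch under assumptions, not a definitive experiment: the target function, noise level, query point and helper names are all invented for the example, and the bias can only be computed because the target function is known here. For each value of k, the code estimates the squared bias and the variance of the prediction at one input over many simulated training sets.

```python
import math
import random
import statistics


def target(x):
    # Hypothetical target function; in a real problem it is unknown.
    return math.sin(2 * math.pi * x)


def knn_predict(xs, ys, x0, k):
    # Average the outputs of the k training points closest to x0.
    nearest = sorted(zip(xs, ys), key=lambda p: abs(p[0] - x0))[:k]
    return statistics.fmean(y for _, y in nearest)


def bias2_and_variance(k, x0=0.25, n=50, noise=0.3, trials=300, seed=1):
    # Estimate squared bias and variance of the k-NN prediction at x0
    # over many independently drawn training sets of size n.
    rng = random.Random(seed)
    preds = []
    for _ in range(trials):
        xs = [rng.random() for _ in range(n)]
        ys = [target(x) + rng.gauss(0, noise) for x in xs]
        preds.append(knn_predict(xs, ys, x0, k))
    mean_pred = statistics.fmean(preds)
    bias2 = (mean_pred - target(x0)) ** 2
    variance = statistics.pvariance(preds)
    return bias2, variance


results = {k: bias2_and_variance(k) for k in (1, 5, 25, 50)}
for k, (bias2, variance) in results.items():
    print(f"k={k:2d}  bias^2={bias2:.4f}  variance={variance:.4f}")
```

With k=1 the prediction follows individual noisy points, so the variance should be high and the bias small. With k=50 every training point is averaged, so the prediction barely changes between training sets but sits far from the true value at the query point.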

There is no escaping the relationship between bias and variance in machine learning.

Increasing the bias will decrease the variance.

Increasing the variance will decrease the bias.

There is a trade-off at play between these two concerns, and the algorithms you choose and the way you configure them strike different balances in this trade-off for your problem.

In reality, we cannot calculate the real bias and variance error terms because we do not know the actual underlying target function. Nevertheless, as a framework, bias and variance provide the tools to understand the behavior of machine learning algorithms in the pursuit of predictive performance.

Further Reading

This section lists some recommended resources if you are looking to learn more about bias, variance and the bias-variance trade-off.

Nice work! I have one question: if we decrease the variance, the bias increases, and vice versa, but is the rate of rise and fall the same or constant, or does it depend on the specific algorithm used? Can we tune a model based on the bias-variance trade-off?

Hello, I am still a little confused about this. Please help me out by reading the following situation.

Suppose there are 5 parties standing in an election and we want to predict beforehand who will win. We choose 5 people from a community and ask each of them separately which party they are going to vote for. Suppose all 5 of them choose different parties. If we treat each person as a machine learning model and their answer as training data, and make predictions accordingly, then across the 5 models we will end up predicting 5 different outputs. This makes it a high variance case, because the results vary, and high bias, because we chose only a particular type of people (all are from one community). Next, suppose we choose 1000 people for the poll and there are 200 people for each party. Now if we treat all 1000 people as models, we will have 200 predictions for each party. We can call it low bias because we chose a larger group, which is more representative of the population, but this is still high variance, right? Because the results are still equally varied. Lastly, if 700 of those people chose one party (and that party actually wins) and the remaining 300 are distributed among the other parties, is this what we call low variance? What would we call it if that party loses?

A question on this statement. You are saying if there are only two choices, then bias and variance don’t mean much. Do you mean two choices as in the outputs? The reason I’m asking is for various sentiment analysis ideas, wherein you have two choices (outputs): 0 or 1. Basically “does not predict” or “does predict.”

But it would seem that in such binary situations, bias could creep in. Such as in data that uses movie reviews, wherein a bias may creep in if the word “bad” was previously only used in clearly negative reviews. But then a new review comes in that says “Wanted to see this movie really bad. Not disappointed.” Here the use of “bad” is not in a negative context, but might be subject to bias.

Yes, I suppose. The reason I am calling it high variance is that high variance is the spread of predictions by different models from the target output. And in this case, out of 1000 only 200 will be on target and the remaining 800 will vary. But probably I am getting it all wrong; I will think about it some more. Thanks for the quick reply.

Bias is when we assume certain things about the training data (its shape, for example) and choose a model accordingly. But then we get predictions far away from the expected values and realise that we made a mistake in certain assumptions about our training data?

This post is clear and easy to understand. I have one question regarding your statement about how SVM manages the variance issue. As you said in this article, by increasing the penalty parameter C, SVM could decrease its variance.

From my perspective, C is the penalty parameter, and it is different from the regularization lambda; by decreasing C, we could narrow the margin, and the learner could underfit a little, which would decrease the variance.

So from the description, can I say that linear algorithms (Linear/Logistic Regression and LDA) will only underfit and never face an overfitting problem, because you said that these algorithms have high bias and low variance, and vice versa for non-linear problems (decision tree, KNN, SVM)?

I am struggling to calculate the bias/discrimination of the ‘Adult Dataset’, downloaded from the UCI machine learning repository. Do you know how to calculate the discrimination using a matrix? Thanks in advance.