
I've heard that decision trees can have a high amount of variance, and that for a data set $D$ split into train and test sets, the resulting decision tree could be quite different depending on how the data was split. Apparently, this provides motivation for algorithms such as Random Forest.

Is this correct? Why does a decision tree suffer from high variability?

Edit

Just to note that I don't really follow the current answer, and haven't been able to resolve that in the comments.

2 Answers

The point is that if your training data does not contain identical inputs with different labels (i.e. a $0$ Bayes error on the training set is achievable), a decision tree can learn the training data entirely, and that leads to overfitting, also known as high variance. This is why people usually prune trees, typically using cross-validation to decide how much, to keep them from overfitting the training data.
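As a rough sketch of that pruning idea (this is my own illustration, not part of the answer: it assumes scikit-learn and a synthetic dataset), you could pick the cost-complexity pruning strength by cross-validation:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV
    from sklearn.tree import DecisionTreeClassifier

    # Synthetic data, purely for illustration
    X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                               random_state=0)

    # Cross-validate over the cost-complexity pruning parameter ccp_alpha:
    # larger values prune the tree more aggressively.
    search = GridSearchCV(
        DecisionTreeClassifier(random_state=0),
        param_grid={"ccp_alpha": [0.0, 0.001, 0.005, 0.01, 0.05]},
        cv=5,
    )
    search.fit(X, y)
    print("best ccp_alpha:", search.best_params_["ccp_alpha"])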

Decision trees are powerful classifiers. Ensemble methods such as bagging combine many of these powerful classifiers in order to obtain a model that does not have high variance. One approach is to ignore some features and use only the rest at each split, as Random Forest does, in order to find the subsets of features that generalize well. Another is to sample the training data at random with replacement (each example is drawn and then put back), so that every tree is trained on a different bootstrap sample.
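To make the bagging/Random Forest contrast concrete (again only a hedged sketch with an invented dataset, using scikit-learn), one could compare the cross-validated accuracy of a single tree, a bagged ensemble of trees (bootstrapping only), and a random forest (bootstrapping plus random feature subsets):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                               random_state=0)

    models = {
        "single tree": DecisionTreeClassifier(random_state=0),
        "bagged trees (bootstrap only)": BaggingClassifier(
            DecisionTreeClassifier(), n_estimators=100, random_state=0),
        "random forest (bootstrap + feature subsets)": RandomForestClassifier(
            n_estimators=100, random_state=0),
    }

    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=5)
        print(f"{name}: mean accuracy {scores.mean():.3f} (std {scores.std():.3f})")

Typically the ensembles show a smaller spread across folds than the single tree, which is exactly the variance reduction described above.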

The reason that decision trees can overfit is their VC dimension. Although it is not infinite, unlike 1-NN, it is very large, which leads to overfitting; in practice this means you have to provide a great deal of data in order not to overfit. For more on the VC dimension of decision trees, take a look at Are decision tree algorithms linear or nonlinear.

$\begingroup$"the same input features with different labels which leads to 0 Bayes error", I'm not sure what you mean by this.$\endgroup$
– baxxMar 28 '19 at 18:29

@baxx I meant that your training data of different classes do not intersect in the current feature space; namely, in the current space the distribution of each class does not overlap with the distributions of the others.
– Media, Mar 28 '19 at 19:34

If there's a way to explain this in more "plain English" then I can follow; currently this answer is a bit abstract. The data of different classes (are you just referring to variables here?) in the current feature space (is this the data set?) do not have intersection (not sure what you're referring to there – they're mutually exclusive? Why wouldn't they be if they're different variables?)
– baxx, Mar 28 '19 at 20:08

Suppose your input feature space is $\mathbb{R}$, i.e. you have a single variable that can take any real value. This number can be temperature, for instance; suppose it can take any value, and forget about $-273$. Now you have an output label which can be cold or hot. Suppose you have a training set which consists of the opinions of different people. Different people may have different opinions. Consequently, in the current feature space, which consists of only the temperature, you may have the value $4$ labelled both cold and hot. This means that if you plot the histogram of each class, cold and hot, you have
– Media, Mar 28 '19 at 20:13

intersection. This means even the best possible brain cannot achieve $100\%$ accuracy, let alone an ML algorithm.
– Media, Mar 28 '19 at 20:14

It is relatively simple if you understand what variance refers to in this context. A model has high variance if it is very sensitive to (small) changes in the training data.

A decision tree has high variance because, if you imagine a very large tree, it can basically adjust its predictions to every single input.
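One way to see this in practice (a small sketch with made-up data, not from the answer itself) is to grow the same unrestricted tree on two different random halves of one dataset and count how often the two resulting trees disagree:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=400, n_features=10, n_informative=4,
                               random_state=0)

    predictions = []
    for seed in (1, 2):
        # Two different random halves of the same data set
        X_train, _, y_train, _ = train_test_split(X, y, train_size=0.5,
                                                  random_state=seed)
        tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
        predictions.append(tree.predict(X))

    disagreement = np.mean(predictions[0] != predictions[1])
    print(f"the two trees disagree on {disagreement:.1%} of all points")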

Suppose you wanted to predict the outcome of a soccer game. A decision tree could make decisions like:

IF
    player X is on the field AND
    team A has a home game AND
    the weather is sunny AND
    the number of attending fans >= 26000 AND
    it is past 3pm
THEN team A wins.

If the tree is very deep, it will get very specific and you may only have one such game in your training data. It probably would not be appropriate to base your predictions on just one example.

Now, if you make a small change, e.g. set the number of attending fans to 25999, a decision tree might give you a completely different answer (because the game now doesn't meet the 4th condition).

Linear regression, for example, would not be so sensitive to a small change because it is limited ("biased" -> see bias-variance tradeoff) to linear relationships and cannot represent sudden changes from 25999 to 26000 fans.
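As a hedged illustration of that contrast (the attendance figures and labels below are invented for this sketch), a regression tree fitted to data with a jump at 26000 fans behaves like a step function, while linear regression barely notices the one-fan difference:

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.tree import DecisionTreeRegressor

    # Hypothetical training data: attendance -> did team A win (1) or not (0)
    attendance = np.array([[20000], [22000], [24000], [25999],
                           [26000], [27000], [29000], [31000]])
    win = np.array([0, 0, 0, 0, 1, 1, 1, 1])

    tree = DecisionTreeRegressor().fit(attendance, win)
    linear = LinearRegression().fit(attendance, win)

    for fans in (25999, 26000):
        x = np.array([[fans]])
        print(f"{fans} fans -> tree: {tree.predict(x)[0]:.2f}, "
              f"linear: {linear.predict(x)[0]:.2f}")

The tree's prediction jumps from 0 to 1 between the two inputs, while the linear model's prediction changes only marginally.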

That's why it is important not to make decision trees arbitrarily large/deep. This limits their variance.
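For example (a minimal sketch, again with assumed synthetic data), capping max_depth keeps the tree from simply memorizing the training set:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                               random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    for depth in (None, 3):
        tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
        tree.fit(X_train, y_train)
        print(f"max_depth={depth}: train {tree.score(X_train, y_train):.2f}, "
              f"test {tree.score(X_test, y_test):.2f}")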

(See e.g. here for more on how random forests can help with this further.)