Data science, statistics or machine learning in broken English

Notice

The {mvpart} package has been removed from CRAN because its maintenance expired. To install it, 1) download the latest (but archived) package tarball from the CRAN archive site, and 2) install it following the procedure shown below.

On Windows, you first have to edit PATH to include the R folder, and then type the following in Command Prompt or PowerShell.

R CMD INSTALL XYZ.tar.gz

In the previous post, we saw how to evaluate a machine learning classifier by running it on typical XOR patterns and drawing its decision boundary on the XY plane. In this post, let's see how Decision Tree, one of the lightest machine learning classifiers, works.

A brief trial on a short version of MNIST datasets

I think it's good to see how each classifier works on the typical MNIST datasets. Prior to writing this post, I prepared a short version of the MNIST dataset and uploaded it to my GitHub repository below.

OMG... but don't worry; Decision Tree works much better with various ensemble methods than by itself.

Algorithm summary

In general, the term "Decision Tree" usually refers to CART, a representative Decision Tree algorithm. It can be summarized as follows*1:

While searching the independent variables of the samples based on a certain criterion (e.g. the Gini index),

the algorithm determines how to split them so as to maximize the "purity" of the resulting sample subsets with respect to the dependent variable at each step, separating "deviant" sample sets from the rest, and

repeats the same procedure recursively;

each branch finishes when a certain terminal condition is satisfied,

until all splitting procedures are terminated.
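The procedure above can be sketched with the standard {rpart} package, which shares its interface with {mvpart}; the built-in iris dataset is used purely for illustration.

```r
# A minimal CART sketch with {rpart} (same interface as {mvpart})
library(rpart)

fit <- rpart(Species ~ ., data = iris,
             method = "class",              # classification tree
             parms  = list(split = "gini")) # Gini index as the splitting criterion

print(fit)  # each node shows the chosen variable, split point and class purity
```

Printing the fitted object shows the recursive splits: at each node one independent variable and one cutoff are chosen, until the terminal conditions stop further splitting.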

At any rate, such repeated splitting results in a tree-shaped classification structure, which is why this algorithm is called a Decision "Tree".

Of course, almost the same procedure can be applied to regression problems, in which case it is called a "Regression Tree". Unlike other regression methods, a regression tree produces step-like (piecewise constant) regression functions.
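The step-like shape is easy to see on a one-dimensional toy problem; this sketch (my own synthetic data, not from the repository) fits a regression tree with {rpart} and overlays its piecewise constant prediction.

```r
# Regression tree on 1-D toy data: the fit is a step function
library(rpart)

set.seed(1)
x <- seq(0, 10, length.out = 200)
y <- sin(x) + rnorm(200, sd = 0.2)
d <- data.frame(x = x, y = y)

fit  <- rpart(y ~ x, data = d, method = "anova")  # "anova" = regression tree
pred <- predict(fit, d)

plot(x, y, col = "grey")                 # raw samples
lines(x, pred, type = "s", col = "red")  # piecewise constant (step-like) fit
```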

How it works on XOR patterns

Next, let's see what kind of decision boundary Decision Tree draws. Please download "xor_simple.txt" and "xor_complex.txt" from the repository below.
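A sketch of fitting a tree to one of those files and drawing its decision boundary on a grid, again with {rpart}; the column names x, y and label are my assumptions about the file format, so adjust them to the actual files.

```r
# Fit a tree to the XOR data and paint its decision regions on a grid
# (column names x, y, label are assumed; check the downloaded files)
library(rpart)

d   <- read.table("xor_simple.txt", header = TRUE)
fit <- rpart(factor(label) ~ x + y, data = d, method = "class")

# predict the class over a fine grid covering the data range
gx <- seq(min(d$x), max(d$x), length.out = 200)
gy <- seq(min(d$y), max(d$y), length.out = 200)
grid <- expand.grid(x = gx, y = gy)
z <- matrix(as.numeric(predict(fit, grid, type = "class")),
            nrow = length(gx))  # x varies fastest in expand.grid

image(gx, gy, z, col = c("#ffdddd", "#ddddff"),
      xlab = "x", ylab = "y")                        # decision regions
points(d$x, d$y, pch = 19, col = as.numeric(factor(d$label)))  # samples
```

Because each split is axis-parallel, the painted boundary is always made of horizontal and vertical segments.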

Oh my god, what's going on??? This decision boundary looks really crazy... it almost never follows the assumed true boundary.

Pruning

Please remember that I defined this kind of decision boundary as "overfitted". If so, we have to make it "generalized"... but how can we do that for this classifier?

The primary solution for decision trees is "pruning": a method that evaluates the complexity of the tree based on its cross-validation error rate and determines the hyperparameter giving the best model. With the {mvpart} package, we can run it using the plotcp() function and the "cp" argument of rpart().
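A minimal pruning sketch, written against the standard {rpart} package (same interface as {mvpart}); the file name and column names x, y, label are assumptions about the data format.

```r
# Pruning sketch: choose the cp value with the lowest cross-validated error
library(rpart)

d   <- read.table("xor_complex.txt", header = TRUE)  # assumed format
fit <- rpart(factor(label) ~ x + y, data = d, method = "class")

plotcp(fit)  # plots cross-validation error (xerror) against cp values

# pick the cp minimizing xerror from the complexity table and prune with it
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)
```

Equivalently, you can refit with rpart(..., control = rpart.control(cp = best_cp)); both cut off the splits that don't pay for their complexity.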

I feel it's still a little overfitted, but at any rate it got a little more generalized, i.e. some meaningless decision boundaries were removed compared with the previous model.

How it works on linearly separable patterns

As you saw above, decision trees can classify linearly non-separable patterns. But you may wonder how they work on ordinary linearly separable patterns. OK, let's see. Please download "linear_bi.txt" and "linear_multi.txt" from the GitHub repository mentioned above and import them.
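The same sketch works for the linearly separable data; again the column names x, y, label are assumed, and {rpart} stands in for {mvpart}.

```r
# Fit and visualize a tree on the linearly separable data
# (column names x, y, label are assumed; check the downloaded file)
library(rpart)

d   <- read.table("linear_bi.txt", header = TRUE)
fit <- rpart(factor(label) ~ x + y, data = d, method = "class")

plot(fit)                 # draw the tree structure
text(fit, use.n = TRUE)   # label nodes with splits and sample counts
```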