For experimental purposes, I want to intentionally overfit my training data with CART. With rpart in R, however, I cannot achieve 100% accuracy. Why?

table(d$classes, predict(fit, d, type = "class"))
       1    2
  1 2544   21
  2   33 2402

The data are generated from two Gaussians, so there is essentially no chance that two data points with different class labels coincide. The complexity parameter is set to 0 and minsplit to 1. As discussed in the comments, I tried every combination of the control settings (not shown in the code), but it did not help.

Why is pruning still happening on the tree? Or why does the tree stop growing before it achieves 100% accuracy?
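
The original code is not shown, so the following is only a minimal sketch of the kind of setup described above. The seed, the sample size, the Gaussian means, and the column names `x1` and `x2` are all assumptions made for illustration, and the run may or may not reproduce the imperfect fit shown in the table.

```r
library(rpart)  # recommended package shipped with standard R installations

set.seed(1)     # assumed seed, only for reproducibility
n <- 2500       # assumed number of points per class

# Two overlapping 2-D Gaussian clouds (assumed means 0 and 1, unit variance)
d <- data.frame(
  x1 = c(rnorm(n, mean = 0), rnorm(n, mean = 1)),
  x2 = c(rnorm(n, mean = 0), rnorm(n, mean = 1)),
  classes = factor(rep(1:2, each = n))
)

# Most permissive growth settings: no complexity penalty, no cross-validation.
# minsplit = 2 is used here because a node with one observation cannot be split.
fit <- rpart(classes ~ x1 + x2, data = d, method = "class",
             control = rpart.control(cp = 0, minsplit = 2, minbucket = 1,
                                     xval = 0, maxdepth = 30))

# Confusion matrix on the training data
conf <- table(d$classes, predict(fit, d, type = "class"))
print(conf)
```

If the off-diagonal entries are non-zero, the tree stopped growing before every training point was isolated.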

Happy times. I appreciate a person willing to abuse an algorithm knowingly. Not enough depth though for this distribution and sample size; the maxdepth in rpart cannot go above 30. If you used a sample of say 1600, your confusion matrix on the training-set would be diagonal. (+1 for unruly statistical behaviour)
– usεr11852 Nov 29 '16 at 22:24

gbm is much easier to over-fit.
– EngrStudent Nov 29 '16 at 22:32

@usεr11852 $2^{30}$ is a huge number, and I think the tree we have is far shallower than depth 30 (as you can see in the CP plot, the tree size is <1000). So the depth may not be the problem?
– Haitao Du Nov 29 '16 at 22:52

Well.. you do use only 1509 leaves (sum((fit$frame$var) == '<leaf>')). So clearly you are not using that number (or that depth).
– usεr11852 Nov 29 '16 at 23:32

1 Answer
I figured out the reason: it is the maxdepth problem as suggested by @usεr11852.

We thought a maximum depth of $30$ was large enough, since $2^{30}$ is a huge number. However, in many cases depth $30$ is not enough, because the tree is not a complete binary tree. Only a complete binary tree with $n$ layers below the root has $2^n$ terminal nodes; an unbalanced tree of the same depth can have far fewer.
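
To make the gap concrete, here is a small arithmetic sketch in base R. The "chain" tree is a hypothetical worst case introduced only for illustration: at every level, one child is a leaf and the other is split again.

```r
depth <- 30

# Complete binary tree: every level is full
leaves_complete <- 2^depth     # 1073741824

# Degenerate "chain" tree: one leaf is split off per level
leaves_chain <- depth + 1      # 31

# Smallest depth at which a complete tree could hold the 1509 leaves
# counted in the comments on the question
min_depth_1509 <- ceiling(log2(1509))   # 11
```

So a tree with 1509 leaves could fit within depth 11 if it were perfectly balanced, but nothing forces the greedy splits to be balanced, and a lopsided tree can exhaust depth 30 with very few leaves.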

Here is the verification:

There is a hidden function in rpart that returns the depth of each node, as suggested in this post.

# Row names of fit$frame are the node numbers of the fitted tree
nodes <- as.numeric(rownames(fit$frame))
# tree.depth() is an unexported rpart helper, hence the ':::'
max(rpart:::tree.depth(nodes))

Using this function, we find that the depth of the tree is $30$! Plotting the tree confirms this, and the figure shows that the tree is far from a complete binary tree.
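
As a rough illustration of what this helper measures, the sketch below uses rpart's bundled `kyphosis` data rather than the data from the question. It assumes rpart's node-numbering convention, in which the children of node $k$ are numbered $2k$ and $2k+1$, so that the depth of a node is $\lfloor \log_2 k \rfloor$. The name `fit_small` is made up for this example.

```r
library(rpart)

# Grow a deliberately large tree on a small built-in data set
fit_small <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
                   control = rpart.control(cp = 0, minsplit = 2, minbucket = 1))

# Node numbers are stored as the row names of the frame
nodes <- as.numeric(rownames(fit_small$frame))

# Depth from the node number alone (root = node 1 = depth 0)
depth_by_log <- floor(log2(nodes))

# Depth according to rpart's internal helper
depth_by_rpart <- rpart:::tree.depth(nodes)

# Number of terminal nodes, for comparison with 2^depth
n_leaves <- sum(fit_small$frame$var == "<leaf>")

cat("max depth:", max(depth_by_rpart), " leaves:", n_leaves, "\n")
```

Comparing `n_leaves` with `2^max(depth_by_rpart)` shows how far a fitted tree is from being complete.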

What we learned from this experiment:

The rpart documentation on maxdepth says:

Set the maximum depth of any node of the final tree, with the root node counted as depth 0. Values greater than 30 rpart will give nonsense results on 32-bit machines.

This limit may be more restrictive than necessary: since the tree can be far from a complete binary tree, values greater than 30 would make sense in many cases, and rpart should allow the user to set a larger number.
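
One possible reading of the 32-bit caveat, offered as an assumption rather than something confirmed from the rpart sources: if the children of node $k$ are numbered $2k$ and $2k+1$, then node numbers at depth $d$ run up to $2^{d+1}-1$, and depth 30 is the last level whose node numbers still fit in a signed 32-bit integer. The helper `largest_node` below is hypothetical and only illustrates the arithmetic.

```r
# Largest node number at depth d, assuming children of node k
# are numbered 2k and 2k + 1 (root is node 1 at depth 0)
largest_node <- function(d) 2^(d + 1) - 1

# R integers are signed 32-bit, so .Machine$integer.max is 2^31 - 1
fits_30 <- largest_node(30) <= .Machine$integer.max
fits_31 <- largest_node(31) <= .Machine$integer.max

print(c(depth30 = fits_30, depth31 = fits_31))
```

Under this assumption the cap would come from how nodes are labelled, not from how many leaves the tree has, which would explain why a sparse, lopsided tree still hits the limit.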

+1 because it is quite useful, but I am a bit uncertain that you answer the question fully... You answer the question of where the pruning occurs, not how or why it happens. For example, simply rewriting rpart.control to accept maxdepth arguments above 30, say 130, does not suffice to use a larger tree properly. It really seems that something is hard-coded within the C code of rpart, as using maxdepth larger than 30 produces fit$frame$yval that has zeros (effectively predicting a third class - nice undefined behaviour there).
– usεr11852 Dec 2 '16 at 22:01