#================================================================
TREE-BASED REGRESSION
We make an excursion into the very opposite of smoothing:
CART -- Classification and Regression Trees
It might as well be called "roughing".
In what follows we consider the regression version only.
This is our first example of "multiple regression",
apart from the bivariate smoother at the end of the previous section.
Tree-based regression is simpler than multiple linear regression,
but more unusual for those who grew up with the latter.
The idea:
When data are heterogeneous, clustered, messy,
it might not be the best idea to fit smooth functions;
in particular, it might not be a good idea to fit linear functions,
as is done in linear regression.
Instead, one might consider breaking the data up into
subsets that are homogeneous.
Once we have homogeneous subsets, we might do the simplest possible thing:
fit a constant, that is, a mean.
Therefore, tree-based regression does local averaging, like smoothers,
but this time on non-overlapping subsets.
Q: How do we arrive at homogeneous subsets?
A: By repeatedly breaking the data up along axis splits.
# But how would we choose such axis splits?
# Example: In the following plot, where would we reasonably place a split?
dim(boston)
pairs.plus(boston, pch=16, cex=.1, gap=0)
plot(boston[,"LSTAT"], boston[,"MEDV"],
pch=16, xlab="LSTAT", ylab="MEDV", cex=.5)
# Here is a hand-chosen split at LSTAT=10:
spl = 10
sel.left  = boston[,"LSTAT"] < spl    # cases left of the split
sel.right = !sel.left                 # cases right of the split
plot(boston[,"LSTAT"], boston[,"MEDV"], xlab="LSTAT", ylab="MEDV",
col=c("red","green")[sel.left+1], pch=16, cex=.5)
# We said we will fit only the simplest possible model, a
# mean in each subset:
lines(c(spl,spl), c(0,50))
lines(c(0,spl), rep(mean(boston[sel.left,"MEDV"]), 2), lwd=2)
lines(c(spl,40), rep(mean(boston[sel.right,"MEDV"]), 2), lwd=2)
#
# How do we formalize a rule for selecting the axis splits?
# That's the big question.
# Here is a recipe that is used most often:
#
# For a given split on a variable "xj" (="LSTAT" in the plot)
# and threshold "th" (=10 in the plot),
# evaluate a criterion, such as the residual sum of squares,
# after fitting a separate mean left and right of the threshold:
sum((boston[sel.left,"MEDV"] - mean(boston[sel.left,"MEDV"]))^2) +
sum((boston[sel.right,"MEDV"] - mean(boston[sel.right,"MEDV"]))^2)
# which should be the same as
var(boston[sel.left,"MEDV"]) * (sum(sel.left)-1) +
var(boston[sel.right,"MEDV"]) * (sum(sel.right)-1)
# Indeed!
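# A quick numerical check of this identity on synthetic data
# (my own sketch, independent of the boston data):

```r
set.seed(0)
y = rnorm(20)
rss      = sum((y - mean(y))^2)        # residual sum of squares around the mean
var.form = var(y) * (length(y) - 1)    # var() uses the divisor n-1
all.equal(rss, var.form)               # TRUE
```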
#
# Then:
# For all possible splits (505 for "boston")
# of all x-variables ( 13 for "boston"),
# evaluate the criterion,
# and pick the split that has the lowest value of the criterion:
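# In code, the search over all splits of a single x-variable might look
# like this (a sketch; the function best.split and its names are mine,
# not the course code):

```r
best.split = function(x, y) {
  vals = sort(unique(x))
  # candidate thresholds: midpoints between consecutive distinct values
  ths = (vals[-1] + vals[-length(vals)]) / 2
  rss = sapply(ths, function(th) {
    left = x < th
    sum((y[ left] - mean(y[ left]))^2) +
    sum((y[!left] - mean(y[!left]))^2)
  })
  c(threshold = ths[which.min(rss)], rss = min(rss))
}
```

# Doing this for every x-variable and taking the overall minimizer gives
# the split; then one recurses into the two resulting subsets.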
# (Reconstructed fit; assuming the standard rpart package here, which
#  performs exactly this exhaustive search recursively:)
library(rpart)
boston.tree = rpart(MEDV ~ ., data=boston)
boston.tree
# The top split is on RM (average number of rooms) at 6.94:
#    85% with RM < 6.94       15% with RM > 6.94
# [small/normal homes]       [large homes]
# The 85% of smaller homes is split on LSTAT, % of lower status people,
# 51% with LSTAT < 14.55, 34% with LSTAT > 14.55
# [^^few poor people^^^] [^^more poor people^^]
# The 15% of larger homes is split on RM again,
# 9.1% with RM < 7.42, 5.9% with RM > 7.42
# [medium large homes] [largest homes]
# This makes some sense:
# among large homes, size drives the price;
# among small/normal homes, social environment drives the price.
# We had found something like this when studying scatterplots of MEDV, RM, LSTAT.
# The variable of interest was really NOX, but it is only of
# low importance in the tree.
# Let's move on to one of our new methods:
# It tries to repeatedly peel off subsets that have
# HIGH response values:
boston.XM = ...   # (garbled in the source: the peeling-tree call on boston
                  #  and the switch to the swiss fertility data are missing)
# Reading the tree for the swiss fertility data:
# - less educated       ==> higher fertility
# more highly educated ==> lower fertility
# - The tree is interesting because it seems to suggest that
# Agriculture does in fact have an unexpected correlation with
# fertility after adjustment for Education and Catholic!
# ==> a case of Simpson's paradox:
# Marginally, the higher agriculture, the higher fertility,
# but conditionally on things like education and religion,
# it's the reverse: the higher agriculture, the lower fertility.
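# A toy illustration of such a sign flip (synthetic data, my own sketch --
# nothing to do with the actual swiss data):

```r
set.seed(1)
g = rep(0:1, each = 50)               # a confounder (think: education level)
x = rnorm(100) + 3*g                  # "agriculture", shifted up in group 1
y = -x + 6*g + rnorm(100, sd = 0.3)   # within each group, y DEcreases in x
cor(x, y)                             # marginally: positive
cor(x[g==0], y[g==0])                 # within group 0: negative
cor(x[g==1], y[g==1])                 # within group 1: negative
```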
# Now we use another option in our routine to look for high Fertility buckets:
swiss.XM = ...    # (garbled in the source: the call and the printed tree,
                  #  whose visible output ends in "... 23.2]", are missing)
# - Again, at the top a high Education bucket is broken off,
# followed by a high Examination bucket,
# followed by two buckets with low Catholic.
# - There are two splits on Agriculture, once again both in the unexpected
# direction: high Agriculture - low Fertility.
# We may have to conclude that other things being equal (education, religion),
# agriculture in 1888 was such a taxing environment that it
# depressed fertility
#
# Looking ahead, let's use a linear model fit to see whether the
# overall qualitative behaviors are confirmed, in particular that
# of Agriculture:
summary(lm(Fertility ~ ., data=swiss))
# (We'll say more about the linear model function "lm" later.)
#------------------------ output begin
...
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 66.91518 10.70604 6.250 1.91e-07 ***
Agriculture -0.17211 0.07030 -2.448 0.01873 *
Examination -0.25801 0.25388 -1.016 0.31546
Education -0.87094 0.18303 -4.758 2.43e-05 ***
Catholic 0.10412 0.03526 2.953 0.00519 **
Infant.Mortality 1.07705 0.38172 2.822 0.00734 **
...
#------------------------ output end
# Yes, Agriculture has a significant negative slope!
# Note another feature of the linear model, though:
# Examination is insignificant!
# After looking at the trees and the plots we know better than that.
# Here is why Education towers over Examination: Education
# isolates the Geneva area
plot(Fertility ~ Education, data=swiss)
# (garbled in the source: the definition of 'swiss.geneva' is missing;
#  the surviving fragment suggests a fit like:)
summary(lm(Fertility ~ ., subset=(Education > 40 & Catholic < 60), data=swiss))
# Let us remind ourselves of a couple of plots:
windows(width=7, height=14)   # MS-Windows device; use x11() or quartz() elsewhere
par(mfrow=c(2,1), mgp=c(1.5,.5,0), mar=c(2.5,2.5,1,1))
plot(swiss[,c("Examination","Fertility")], pch=16)
plot(swiss[,c("Education","Fertility")], pch=16)
# Education comes in so strong because of the 7 high Education provinces
# By comparison, Examination has a weaker but more robust correlation
# with Fertility: no 7 provinces would account for the correlation.
#----------------
# FURTHER PERSPECTIVES ON TREES:
#
# - The name 'CART' (Breiman, Friedman, Olshen & Stone 1984) stands for
# 'Classification And Regression Trees'.
# We only considered regression trees where the response is quantitative.
# Classification trees address the case of categorical responses.
# Actually, there are two types of trees addressing this case:
# (1) CLASSIFICATION TREES: They 'estimate' actual class labels.
# (2) CLASS PROBABILITY TREES: They estimate class probabilities.
# Note: 'Class' = one of the category labels of the categorical response.
# You can always create a classification tree from a class probability tree
# by picking the class with the highest class probability as the class estimate.
# The reverse is not possible: a class label is insufficient information
# to derive a set of class probabilities.
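# For instance, with hypothetical class probabilities in a leaf
# (toy values of my own):

```r
probs = c(low = 0.1, medium = 0.7, high = 0.2)  # hypothetical leaf probabilities
names(probs)[which.max(probs)]                  # the classification-tree label
```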
# There is one more difference between classification and class probability trees:
# in the way in which their performance is measured.
# (1) Classification trees are judged by misclassification rates
# (possibly cost-weighted).
# (2) Class probability trees are judged by measures of class purity
# in the leaves. 'Class purity' can be something like the
# 'Gini index', which is 2p(1-p) in the 2-class case, or
# 'entropy', which is -p*log(p)-(1-p)*log(1-p).
# Both look very similar as a function of p in [0,1].
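# The similarity of the two impurity curves is easy to check numerically
# (my own quick sketch):

```r
p    = seq(0.01, 0.99, by = 0.01)
gini = 2 * p * (1-p)                  # 2-class Gini: 1 - p^2 - (1-p)^2
entr = -p*log(p) - (1-p)*log(1-p)     # 2-class entropy
# both vanish at p = 0, 1 and peak at p = 1/2:
p[which.max(gini)]
p[which.max(entr)]
cor(gini, entr)                       # nearly 1
```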
#
#
# - Note that non-cascading CART trees really discover only
# interactions. Every time there is a switch from one splitting
# variable to the next descending down the tree, one implicitly
# forms a product of dummy variables in the different predictors:
# products of predictors = interactions
# Cascading trees that re-use the same predictor down a cascade
# avoid this and instead create a dose-response effect in that
# predictor.
#
# - We focused on the interpretability aspect of trees:
# By using unconventional tree-splitting criteria,
# we can create surprisingly interpretable unbalanced
# cascading trees that capture 'dose-response' effects.
#
# - If the game is prediction, then raw trees may not be the best
# approach. Here is Breiman's proposal:
# 'BOOTSTRAP AGGREGATION' or 'BAGGING'
# Simply put:
# . Generate Nboot=500 (e.g.) trees from bootstrap datasets.
# . Average their predictions.
# Q: Why would this be successful?
# A: Because it creates smoother prediction functions.
# See an AoS article by Bühlmann & Yu, 2002, "Analyzing Bagging".
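# A self-contained bagging sketch in base R (names are mine; in practice
# one would bag full rpart() trees rather than this one-split stump):

```r
fit.stump = function(x, y) {               # base learner: single best split
  v   = sort(unique(x))
  ths = (v[-1] + v[-length(v)]) / 2
  rss = sapply(ths, function(th) {
    l = x < th
    sum((y[l] - mean(y[l]))^2) + sum((y[!l] - mean(y[!l]))^2)
  })
  thr = ths[which.min(rss)]
  list(th = thr, left = mean(y[x < thr]), right = mean(y[x >= thr]))
}
pred.stump = function(st, x) ifelse(x < st$th, st$left, st$right)

bagg = function(x, y, xnew, Nboot = 200) {
  preds = replicate(Nboot, {               # one stump per bootstrap dataset
    i = sample(length(x), replace = TRUE)
    pred.stump(fit.stump(x[i], y[i]), xnew)
  })
  rowMeans(preds)                          # averaging smooths the hard steps
}
```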
#
# - Another development, due to Breiman building on work of Yali Amit
# and Donald Geman, is
# RANDOM FORESTS
# The idea is to tweak bagging by making the bootstrap trees
# more independent of each other: at every split of every
# bootstrap tree, only a random subset of the predictors is
# considered as splitting candidates.
# This de-correlates the trees somewhat, with apparently
# beneficial effects.
# Breiman recommends growing the trees all the way to the
# bottom, so each leaf contains just one observation.
# The smoothing is achieved by the averaging across trees.
# Breiman also uses the fact that each bootstrap sample
# leaves out a fraction of about 1/e (~37%) of the data for
# a built-in form of cross-validation of the forest performance.
# The left-out cases are called 'out-of-bag' or OOB.
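# The 1/e fraction is easy to verify empirically (my own quick check):

```r
set.seed(3)
n   = 10000
oob = mean(!((1:n) %in% sample(n, replace = TRUE)))  # fraction never drawn
oob                      # close to exp(-1) = 0.368...
```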
#
# - Yet another development by Schapire and Freund is
# BOOSTING
# This was initially meant for classification, not regression.
# Here the idea is to grow many shallow trees and combine them,
# but unlike in bagging, the cases are reweighted at each step.
# Schapire and Freund's idea was to upweight the misclassified
# cases so the next tree can get them right.
# Because this is about classification, each tree produces
# a class label in the leaves. Their final classifier
# is a 'voting' scheme: classify a case according to the
# majority vote of trees.
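# A bare-bones version of this reweighting scheme in base R (my own
# sketch, not Schapire and Freund's code; labels are coded -1/+1 and the
# weak learner is a decision stump on a single x):

```r
fit.wstump = function(x, y, w) {           # weighted stump: threshold + sign
  v   = sort(unique(x))
  ths = (v[-1] + v[-length(v)]) / 2
  best = NULL; best.err = Inf
  for (th in ths) for (s in c(-1, 1)) {
    err = sum(w * (s * sign(x - th) != y))
    if (err < best.err) { best.err = err; best = list(th = th, s = s) }
  }
  best
}

adaboost = function(x, y, M = 10) {
  n = length(x); w = rep(1/n, n)
  stumps = vector("list", M); alpha = numeric(M)
  for (m in 1:M) {
    st   = fit.wstump(x, y, w)
    miss = (st$s * sign(x - st$th)) != y   # the misclassified cases
    err  = max(sum(w * miss), 1e-10)
    alpha[m] = 0.5 * log((1 - err) / err)
    w = w * exp(alpha[m] * miss)           # upweight the misclassified
    w = w / sum(w)
    stumps[[m]] = st
  }
  list(stumps = stumps, alpha = alpha)
}

pred.ada = function(fit, x) {              # weighted 'vote' of the stumps
  f = 0
  for (m in seq_along(fit$alpha))
    f = f + fit$alpha[m] * fit$stumps[[m]]$s * sign(x - fit$stumps[[m]]$th)
  sign(f)
}
```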
#
# - Random forests work with many overfitting trees, averaging
# them to reduce variance.
# Boosting works with many underfitting biased trees, creating
# many predictors that jointly reduce bias.
#
# - In 2000 Friedman, Hastie and Tibshirani published an article
# in AoS, 28 (2), 337-407, with title
# "Additive logistic regression: a statistical view of boosting".
# They showed that boosting can be interpreted as a form of
# 'stagewise' logistic regression, using not the logistic loss
# but a so-called 'exponential loss'.
# 'Stagewise' is in contra-distinction to 'stepwise':
# In stagewise regression we add a new predictor
# WITHOUT UPDATING THE PREVIOUS PREDICTORS!
# That is, we form a residual and fit to the residual.
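# A tiny stagewise sketch (my own, base R): each predictor is regressed
# on the CURRENT residual, and earlier coefficients are never revisited:

```r
stagewise = function(X, y) {
  r = y                                      # start with y as the residual
  b = numeric(ncol(X))
  for (j in 1:ncol(X)) {
    b[j] = sum(X[, j] * r) / sum(X[, j]^2)   # simple regression on the residual
    r    = r - X[, j] * b[j]                 # update the residual only
  }
  b
}
```

# With orthogonal columns this reproduces the least-squares fit; with
# correlated columns it does not -- that is the 'stagewise' compromise.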
#
# - Note on classification: .. deep learning
#================================================================