Wednesday, June 6, 2012

Predictive Analytics: Decision Tree and Ensembles

Continuing from my last post walking down the list of machine learning techniques, in this post I will cover Decision Trees and Ensemble methods. We'll continue using the iris data we prepared in this earlier post.

Decision Tree

The decision tree model is one of the oldest machine learning models and is usually used to illustrate the very basic idea of machine learning. Based on a tree of decision nodes, the learning approach is to recursively divide the training data into buckets of homogeneous members through the most discriminative dividing criteria. The measurement of "homogeneity" is based on the output label: when it is a numeric value, the measurement is the variance of the bucket; when it is a category, the measurement is the entropy or Gini index of the bucket.
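As a minimal sketch of the two homogeneity measures for a categorical output (function and label names are my own, not from any particular library):

```python
import math
from collections import Counter

def gini(labels):
    """Gini index of a bucket: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Entropy of a bucket: -sum(p * log2(p)) over class proportions."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

# A pure bucket is perfectly homogeneous; a 50/50 mix is the least so.
print(gini(["setosa"] * 4))                          # 0.0
print(gini(["setosa"] * 2 + ["virginica"] * 2))      # 0.5
print(entropy(["setosa"] * 2 + ["virginica"] * 2))   # 1.0
```

Both measures are zero for a pure bucket and maximal for an even mix, which is why minimizing them drives the splits toward homogeneous leaves.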

During training, various dividing criteria based on the input are tried (in a greedy manner): when an input is a category (Mon, Tue, Wed ...), it is first turned into binary flags (isMon, isTue, isWed ...) and the true/false value is used as a decision boundary to evaluate homogeneity; when an input is a numeric or ordinal value, a lessThan/greaterThan test at each training data input value is used as the decision boundary.
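The greedy search over numeric boundaries can be sketched as follows (my own illustration, using the Gini index as the homogeneity measure and hypothetical petal-length values):

```python
from collections import Counter

def gini(labels):
    """Gini index of a bucket: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_numeric_split(values, labels):
    """Greedily try a lessThan boundary at each observed input value and
    return the threshold with the lowest weighted Gini of the two buckets."""
    best_t, best_score = None, float("inf")
    n = len(values)
    for t in sorted(set(values)):
        left = [y for x, y in zip(values, labels) if x < t]
        right = [y for x, y in zip(values, labels) if x >= t]
        if not left or not right:
            continue  # skip boundaries that leave one bucket empty
        score = len(left) / n * gini(left) + len(right) / n * gini(right)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score

# Hypothetical petal lengths with their species labels:
t, score = best_numeric_split(
    [1.4, 1.3, 4.7, 5.1],
    ["setosa", "setosa", "virginica", "virginica"])
print(t, score)  # 4.7 0.0 (this boundary separates the two species perfectly)
```

A real implementation would repeat this search over every input attribute at every node and pick the best attribute/threshold pair.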

The training process stops when there is no significant gain in homogeneity from further splitting the tree. The members of the bucket represented at a leaf node vote for the prediction: the majority wins when the output is a category, and the members' average is taken when the output is numeric.

A strength of decision trees is that they can take input and output variables of different data types: categorical, binary, and numeric. They handle missing attributes and outliers well. A decision tree is also good at explaining the reasoning behind its prediction and therefore gives good insight into the underlying data.

One limitation of decision trees is that the decision boundary at each split point is a concrete binary decision. Also, the decision criteria consider only one input attribute at a time, not a combination of multiple input variables. Another weakness is that a tree, once learned, cannot be updated incrementally: when new training data arrives, you have to throw away the old tree and retrain from scratch. In practice, standalone decision trees are rarely used because their predictive accuracy is relatively low; tree ensembles (described below) are the common way to use decision trees.

Tree Ensembles

Instead of picking a single model, an ensemble method combines multiple models in a certain way to fit the training data. There are two primary ways: "bagging" and "boosting". In "bagging", we take a random subset of the training data (n samples drawn out of the N training records, with replacement) to train each model. After multiple models are trained, we use a voting scheme to predict future data.
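Bagging boils down to two pieces: bootstrap sampling and a majority vote. A minimal sketch, with three hypothetical one-level trees (stumps) standing in for models trained on different bootstrap samples:

```python
import random
from collections import Counter

def bootstrap(data, rng):
    """Pick n random samples out of the N training records, with replacement."""
    return [rng.choice(data) for _ in data]

def bagging_predict(models, x):
    """Majority vote across the ensemble for a categorical output."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

# Three hypothetical stumps on petal length, as if each were trained
# on a different bootstrap sample:
stumps = [
    lambda x: "virginica" if x > 2.5 else "setosa",
    lambda x: "virginica" if x > 2.0 else "setosa",
    lambda x: "virginica" if x > 4.0 else "setosa",
]
print(bagging_predict(stumps, 3.0))  # virginica (wins 2 votes to 1)
```

Because each bootstrap sample omits some records and repeats others, the individual trees disagree slightly, and the vote averages out their individual errors.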

“Boosting” is another ensemble approach. Instead of drawing independent random samples, it keeps all the training records but puts more emphasis on those that were wrongly predicted in previous iterations. Initially, each training record is equally weighted; at each iteration, the records that are wrongly classified have their weights increased. The final prediction is a vote among the trees learned at each iteration, weighted by the accuracy of each tree.
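The reweighting step can be sketched as follows (an AdaBoost-style update of my own construction; the source does not specify the exact formula). The tree's vote weight alpha grows with its accuracy, and the same alpha scales up the weights of the records it got wrong:

```python
import math

def boost_reweight(weights, correct):
    """One boosting iteration: compute the tree's weighted error, derive its
    vote weight alpha from its accuracy, then up-weight the misclassified
    records and renormalize. Assumes 0 < weighted error < 1."""
    err = sum(w for w, ok in zip(weights, correct) if not ok)
    alpha = 0.5 * math.log((1 - err) / err)
    new = [w * math.exp(alpha if not ok else -alpha)
           for w, ok in zip(weights, correct)]
    total = sum(new)
    return [w / total for w in new], alpha

weights = [0.25] * 4  # initially, each training record is equally weighted
weights, alpha = boost_reweight(weights, [True, True, True, False])
print(weights)  # the one misclassified record now carries half the total weight
```

Note how the single misclassified record ends up with as much weight as the three correct ones combined, forcing the next tree to focus on it.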

Random Forest

Random Forest is one of the most popular bagging models. In addition to selecting n training records out of N for each tree, at each decision node of the tree it randomly selects m input features from the total M input features (m ~ M^0.5) and learns the split from those alone. Finally, each tree in the forest votes for the result.
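The per-node feature subsampling is the only piece Random Forest adds on top of bagging, and it is tiny (a sketch with names of my own choosing; with the 4 iris features, each node considers only 2 of them):

```python
import math
import random

def candidate_features(M, rng):
    """At each decision node, consider only m ~ sqrt(M) randomly chosen
    features out of the M available, rather than all of them."""
    m = max(1, round(math.sqrt(M)))
    return rng.sample(range(M), m)

# For the 4 iris features, each node searches for its best split
# among just 2 randomly chosen ones:
print(candidate_features(4, random.Random(42)))
```

Restricting each node to a random feature subset decorrelates the trees: without it, every tree would tend to pick the same strong feature at the root, and their votes would add little over a single tree.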