What’s a decision tree?

A decision tree is a flowchart-like structure in which each internal node represents a "test" on an attribute (e.g. whether a coin flip comes up heads or tails), each branch represents an outcome of the test, and each leaf node represents a class label (the decision taken after computing all attributes). The paths from root to leaf represent classification rules.

Overview

A decision tree consists of 3 types of nodes:

Decision nodes - commonly represented by squares

Chance nodes - represented by circles

End nodes - represented by triangles

An example of a decision tree is shown below:

We build a decision tree from the training data for Abe, Barb, Colette, and Don, and we can then use it to predict the Home value for Sally.

Metrics

Information Gain

What’s entropy?

Entropy (a central concept in information theory) characterizes the impurity of an arbitrary collection of examples.

We can calculate the entropy as follows:

\(entropy(p)=-\sum\limits_{i=1}^n{P_i}\log{P_i}\), where \(P_i\) is the proportion of examples belonging to class \(i\).

For example, for the set \(R = \{a, a, a, b, b, b, b, b\}\):

\(entropy(R)=-\frac{3}{8}\log\frac{3}{8}-\frac{5}{8}\log\frac{5}{8}\)
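
As a quick check, here is a minimal Python sketch (the helper name `entropy` is purely illustrative) that computes this value with base-2 logarithms:

```python
import math
from collections import Counter

def entropy(items):
    """Shannon entropy (base 2) of the class distribution in `items`."""
    counts = Counter(items)
    total = len(items)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

R = ["a", "a", "a", "b", "b", "b", "b", "b"]
print(entropy(R))  # -3/8*log2(3/8) - 5/8*log2(5/8) ≈ 0.954
```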

What’s information gain?

In general terms, the expected information gain is the change in information entropy H from a prior state to a state that takes some information as given:

\(IG(D,A)=\Delta{Entropy}=Entropy(D)-Entropy(D|A)\)

Take the image below as an example:

The entropy before splitting: \(Entropy(D)=-\frac{14}{30}\log\frac{14}{30}-\frac{16}{30}\log\frac{16}{30}\approx{0.996}\)

The entropy after splitting: \(Entropy(D|A)=-\frac{17}{30}(\frac{13}{17}\log\frac{13}{17}+\frac{4}{17}\log\frac{4}{17})-\frac{13}{30}(\frac{1}{13}\log\frac{1}{13}+\frac{12}{13}\log\frac{12}{13})\approx{0.615}\)

So the information gain should be: \(IG(D,A)=Entropy(D)-Entropy(D|A)\approx0.381\)
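
Reading the counts off the formulas above (30 examples, split into one branch holding 13 examples of the first class and 4 of the second, and another branch holding 1 and 12), a short Python sketch reproduces these numbers:

```python
import math

def entropy(counts):
    """Entropy (base 2) of a class distribution given as a list of counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

H_D = entropy([14, 16])                                              # before the split, ≈ 0.996
H_D_given_A = 17/30 * entropy([13, 4]) + 13/30 * entropy([1, 12])    # after the split, ≈ 0.615
print(H_D - H_D_given_A)                                             # information gain ≈ 0.381
```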

Information Gain Ratio

Problem of information gain approach

Information gain is biased towards tests with many outcomes (attributes having a large number of values).

E.g. an attribute acting as a unique identifier:

It produces a large number of partitions (1 tuple per partition).

Each resulting partition D is pure (entropy(D) = 0), so the information gain is maximized. For instance, splitting on the Day column of the play-ball table below would give 14 pure one-example partitions and therefore the maximum possible information gain, even though Day is useless for prediction.

What’s information gain ratio?

The information gain ratio overcomes this bias by applying a kind of normalization to information gain using a split information value.

The split information value represents the potential information generated by splitting the training data set D into v partitions, corresponding to v outcomes on attribute A:
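
In the standard C4.5 formulation, this split information value is

\(SplitInfo_A(D)=-\sum\limits_{j=1}^{v}\frac{|D_j|}{|D|}\log\frac{|D_j|}{|D|}\)

and the attribute is then scored by the gain ratio

\(GainRatio(A)=\frac{IG(D,A)}{SplitInfo_A(D)}\)

The attribute with the maximum gain ratio is selected as the splitting attribute.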

Decision Tree Algorithms

ID3 Algorithm

ID3 (Iterative Dichotomiser 3) is an algorithm used to generate a decision tree from a dataset.

The ID3 algorithm begins with the original set S as the root node. On each iteration, it goes through every unused attribute of the set S and calculates the entropy of the split on that attribute (equivalently, the information gain \(IG(S,A)\)). It then selects the attribute with the smallest post-split entropy, i.e. the largest information gain. The set S is then split by the selected attribute (e.g. age < 50, 50 <= age < 100, age >= 100) to produce subsets of the data.

The algorithm continues to recurse on each subset, considering only attributes never selected before. Recursion on a subset may stop in one of these cases (a minimal code sketch follows the list):

every element in the subset belongs to the same class (+ or -), then the node is turned into a leaf and labelled with the class of the examples;

there are no more attributes to be selected, but the examples still do not belong to the same class (some are + and some are -), then the node is turned into a leaf and labelled with the most common class of the examples in the subset;
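
Here is a minimal Python sketch of this procedure (names such as `id3` and `information_gain` are illustrative, not from any particular library); it assumes categorical attributes, with examples given as dictionaries and class labels in a parallel list:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in Counter(labels).values())

def information_gain(examples, labels, attribute):
    """Information gain of splitting (examples, labels) on a categorical attribute."""
    subsets = {}
    for example, label in zip(examples, labels):
        subsets.setdefault(example[attribute], []).append(label)
    remainder = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder

def id3(examples, labels, attributes):
    """Return a nested-dict decision tree, or a bare class label for a leaf."""
    # Case 1: every example belongs to the same class -> leaf with that class.
    if len(set(labels)) == 1:
        return labels[0]
    # Case 2: no attributes left -> leaf with the most common class.
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Otherwise split on the attribute with the largest information gain.
    best = max(attributes, key=lambda a: information_gain(examples, labels, a))
    tree = {best: {}}
    for value in set(e[best] for e in examples):
        indices = [i for i, e in enumerate(examples) if e[best] == value]
        tree[best][value] = id3([examples[i] for i in indices],
                                [labels[i] for i in indices],
                                [a for a in attributes if a != best])
    return tree
```

Calling `id3` on the play-ball table below with the attributes ["Outlook", "Temperature", "Humidity", "Wind"] should reproduce the tree derived step by step in the rest of this section.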

Let's take the following data as an example:

| Day | Outlook | Temperature | Humidity | Wind | Play ball |
| --- | --- | --- | --- | --- | --- |
| D1 | Sunny | Hot | High | Weak | No |
| D2 | Sunny | Hot | High | Strong | No |
| D3 | Overcast | Hot | High | Weak | Yes |
| D4 | Rain | Mild | High | Weak | Yes |
| D5 | Rain | Cool | Normal | Weak | Yes |
| D6 | Rain | Cool | Normal | Strong | No |
| D7 | Overcast | Cool | Normal | Strong | Yes |
| D8 | Sunny | Mild | High | Weak | No |
| D9 | Sunny | Cool | Normal | Weak | Yes |
| D10 | Rain | Mild | Normal | Weak | Yes |
| D11 | Sunny | Mild | Normal | Strong | Yes |
| D12 | Overcast | Mild | High | Strong | Yes |
| D13 | Overcast | Hot | Normal | Weak | Yes |
| D14 | Rain | Mild | High | Strong | No |

The information gain is calculated for all four attributes:

\(IG(S, Outlook)=0.246\)

\(IG(S, Temperature)=0.029\)

\(IG(S, Humidity)=0.151\)

\(IG(S, Wind)=0.048\)

So we choose the Outlook attribute for the root node.
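
These numbers can be checked directly against the table with a small, self-contained Python sketch (the row encoding below is just one convenient choice):

```python
import math
from collections import Counter

# (Outlook, Temperature, Humidity, Wind, Play ball) for days D1..D14
data = [
    ("Sunny", "Hot", "High", "Weak", "No"),          ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),      ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),       ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),      ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),    ("Rain", "Mild", "High", "Strong", "No"),
]
attributes = ["Outlook", "Temperature", "Humidity", "Wind"]

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, col):
    labels = [r[-1] for r in rows]
    groups = {}
    for r in rows:
        groups.setdefault(r[col], []).append(r[-1])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

for col, name in enumerate(attributes):
    print(name, round(information_gain(data, col), 3))
# Outlook 0.247, Temperature 0.029, Humidity 0.152, Wind 0.048
# (matches the values above up to rounding)
```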

For the node where Outlook = Overcast, all of the examples have the same class (Play ball = Yes), so it becomes a leaf.
The other two nodes need to be split further.
Let's take the node where Outlook = Sunny as an example:

\(IG(S_{sunny}, Temperature)=0.570\)

\(IG(S_{sunny}, Humidity)=0.970\)

\(IG(S_{sunny}, Wind)=0.019\)

So we choose the Humidity attribute for this node.
Repeating these steps, we get the decision tree below:

C4.5 Algorithm

C4.5 builds decision trees from a set of training data in the same way as ID3, using an extension to information gain known as gain ratio.
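
As a rough sketch of the difference (with illustrative names, not C4.5's actual implementation), the selection criterion changes from information gain to gain ratio:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(examples, labels, attribute):
    """Information gain divided by split information, as in C4.5."""
    subsets = {}
    for example, label in zip(examples, labels):
        subsets.setdefault(example[attribute], []).append(label)
    remainder = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    gain = entropy(labels) - remainder
    # Split information: entropy of the attribute's own value distribution.
    split_info = entropy([e[attribute] for e in examples])
    return gain / split_info if split_info > 0 else 0.0

# E.g. for Outlook on the play-ball data: gain ≈ 0.247, split info ≈ 1.577,
# so the gain ratio is ≈ 0.156.
```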

Improvements from ID3 algorithm

Handling both continuous and discrete attributes - for a continuous attribute, C4.5 creates a threshold and splits the examples into those whose value is above the threshold and those whose value is less than or equal to it

Handling training data with missing attribute values

Pruning trees after creation - C4.5 goes back through the tree once it has been created and attempts to remove branches that do not help, replacing them with leaf nodes

Using gain ratio instead of plain information gain as the splitting criterion

CART Algorithm

The CART (Classification and Regression Trees) algorithm is a binary decision tree algorithm. It recursively partitions the data into two subsets so that the cases within each subset are more homogeneous than in the parent. It allows consideration of misclassification costs, prior distributions, and cost-complexity pruning.
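
As a usage sketch, scikit-learn's DecisionTreeClassifier is based on an optimised version of CART; the snippet below trains it on the play-ball table, using a simple hand-rolled integer encoding of the categorical attributes (a simplification for illustration only):

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Play-ball data from the table above, encoded as integers:
# Outlook: Sunny=0, Overcast=1, Rain=2; Temperature: Hot=0, Mild=1, Cool=2;
# Humidity: High=0, Normal=1; Wind: Weak=0, Strong=1.
X = [
    [0, 0, 0, 0], [0, 0, 0, 1], [1, 0, 0, 0], [2, 1, 0, 0], [2, 2, 1, 0],
    [2, 2, 1, 1], [1, 2, 1, 1], [0, 1, 0, 0], [0, 2, 1, 0], [2, 1, 1, 0],
    [0, 1, 1, 1], [1, 1, 0, 1], [1, 0, 1, 0], [2, 1, 0, 1],
]
y = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
     "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]

clf = DecisionTreeClassifier(criterion="gini")  # Gini impurity, as in classic CART
clf.fit(X, y)
print(export_text(clf, feature_names=["Outlook", "Temperature", "Humidity", "Wind"]))
```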