Select an attribute for the root node and create a branch for each possible
value of that attribute.

Split the instances into subsets (one for each branch extending from the
node).

Repeat the procedure recursively for each branch, using only instances
that reach the branch (those that satisfy the conditions along the path
from the root to the branch).

Stop if all instances have the same class.
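A minimal sketch of this recursive procedure (Python, assuming each instance is a dict of attribute values plus a "class" key; select_attribute stands for any attribute-selection heuristic, such as the information-gain function sketched further below):

# A minimal sketch, assuming instances are dicts of attribute values plus a
# "class" key; select_attribute is any attribute-selection heuristic.
def build_tree(instances, attributes, select_attribute):
    classes = {inst["class"] for inst in instances}
    if len(classes) == 1:                  # all instances have the same class: stop
        return {"leaf": classes.pop()}
    if not attributes:                     # nothing left to split on: majority leaf
        majority = max(classes,
                       key=lambda c: sum(i["class"] == c for i in instances))
        return {"leaf": majority}
    attr = select_attribute(instances, attributes)
    node = {"attribute": attr, "branches": {}}
    for value in {inst[attr] for inst in instances}:       # one branch per value
        subset = [inst for inst in instances if inst[attr] == value]
        node["branches"][value] = build_tree(
            subset, [a for a in attributes if a != attr], select_attribute)
    return node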

A criterion for attribute selection

Basic idea: choose the attribute which will result in the smallest tree.

Heuristic: choose the attribute that produces the “purest” nodes.

Properties we require from a purity measure:

When a node is pure, the measure should be zero.

When impurity is maximal (i.e. all classes are equally likely), the measure
should be maximal.

The measure should obey the multistage property (i.e. decisions can be made
in several stages). For example, assume [2,3,4] is the distribution of three
classes in a set of 9 instances. Then this property states that

measure([2,3,4]) = measure([2,7]) + (7/9)*measure([3,4]).

Entropy is the only function that satisfies all three properties!

Given a probability distribution (P1, P2, ..., Pn), the information required
to predict an event is the distribution's entropy:

Entropy(P1, P2, ..., Pn) = -P1*log(P1) - P2*log(P2) - ... - Pn*log(Pn)

When the base of the log is 2, entropy is measured in bits.
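As an illustration, a small Python sketch of entropy over class counts (log base 2, so the result is in bits), together with a check of the multistage property on the [2,3,4] example and an information-gain function that could serve as the select_attribute heuristic in the sketch above (the function names are illustrative):

import math

def entropy(counts):
    # Entropy of a class distribution given as counts, in bits (log base 2).
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# Multistage property on the [2,3,4] example over 9 instances:
lhs = entropy([2, 3, 4])
rhs = entropy([2, 7]) + (7 / 9) * entropy([3, 4])
print(round(lhs, 6), round(rhs, 6))        # 1.530493 1.530493

def class_counts(instances):
    counts = {}
    for inst in instances:
        counts[inst["class"]] = counts.get(inst["class"], 0) + 1
    return list(counts.values())

def information_gain(instances, attr):
    # Expected reduction in entropy from splitting on attribute attr.
    gain = entropy(class_counts(instances))
    for value in {inst[attr] for inst in instances}:
        subset = [inst for inst in instances if inst[attr] == value]
        gain -= len(subset) / len(instances) * entropy(class_counts(subset))
    return gain

def select_attribute(instances, attributes):
    # Choose the attribute producing the "purest" child nodes (largest gain).
    return max(attributes, key=lambda a: information_gain(instances, a))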

Prepruning (stopping criteria applied while the tree is being grown):

Assign a leaf label to a node whose error rate is already below a
prespecified level.

Stop splitting when the gain falls below a prespecified threshold.

Stop when a node represents fewer than some threshold number of instances
(say 10, or 5% of the total training set); a sketch of such checks follows
below.
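For illustration, a sketch of how such prepruning checks might be wired into the tree builder above; the threshold values are assumptions for this sketch, not values prescribed here:

# Illustrative prepruning thresholds (example values only).
MIN_GAIN = 0.01        # stop splitting when the best gain falls below this
MIN_INSTANCES = 10     # stop when a node holds fewer instances than this
MAX_ERROR = 0.05       # label the node a leaf if its error is already this low

def should_stop(instances, best_gain):
    counts = {}
    for inst in instances:
        counts[inst["class"]] = counts.get(inst["class"], 0) + 1
    error = 1 - max(counts.values()) / len(instances)   # error of a majority-class leaf
    return (error <= MAX_ERROR
            or best_gain < MIN_GAIN
            or len(instances) < MIN_INSTANCES)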

Postpruning (cutting subtrees after the complete tree has been built).
Usually error-based: replace a subtree with a leaf node, if the error on
the test data is the same or lower (use cross-validation). Computationally
expensive.
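A rough sketch of such error-based pruning on the dict-based tree used above, replacing a subtree with a majority-class leaf whenever that does not increase the error on held-out data (a single held-out set is used here for brevity, rather than full cross-validation; the helper names are assumptions for this sketch):

def classify(tree, inst):
    while "leaf" not in tree:
        tree = tree["branches"].get(inst[tree["attribute"]], {"leaf": None})
    return tree["leaf"]

def error_rate(tree, data):
    return sum(classify(tree, i) != i["class"] for i in data) / len(data)

def majority_class(instances):
    counts = {}
    for inst in instances:
        counts[inst["class"]] = counts.get(inst["class"], 0) + 1
    return max(counts, key=counts.get)

def postprune(tree, train, heldout):
    # Bottom-up: prune the children first, then try replacing this subtree.
    if "leaf" in tree:
        return tree
    attr = tree["attribute"]
    for value in tree["branches"]:
        tree["branches"][value] = postprune(
            tree["branches"][value],
            [i for i in train if i[attr] == value],
            [i for i in heldout if i[attr] == value])
    leaf = {"leaf": majority_class(train)}
    if heldout and error_rate(leaf, heldout) <= error_rate(tree, heldout):
        return leaf
    return tree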

Generating rules from decision trees

Direct approach: each leaf is represented by a rule whose antecedent includes
all the tests along the path from the root to that leaf.

Rule optimization: deleting conditions from a rule if this does not affect
the error rate.
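A brief sketch of the direct approach on the dict-based tree used above, emitting one rule per leaf with the path tests as the antecedent:

def extract_rules(tree, path=()):
    # One rule per leaf: antecedent = all tests along the path from the root.
    if "leaf" in tree:
        return [(list(path), tree["leaf"])]
    rules = []
    for value, subtree in tree["branches"].items():
        rules += extract_rules(subtree, path + ((tree["attribute"], value),))
    return rules

# A rule ([("outlook", "overcast")], "yes") reads:
# If outlook=overcast Then play=yes.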

Discussion

The basic ideas of TDIDT were developed in the 1960s (CLS, 1966).

The algorithm for top-down induction of decision trees using information gain
for attribute selection (ID3) was developed by Ross Quinlan (1981).

Gain ratio and other modifications and improvements led to the development
of C4.5, which can deal with numeric attributes, missing values, and noisy
data, and can also extract rules from the tree (one of the best concept
learners).

There are many other attribute selection criteria (but they make almost no
difference in the accuracy of the results).

Covering algorithms

General strategy: for each class, find a rule set that covers all instances
in it (while excluding instances not in the class). This approach is called a
covering approach because at each stage a rule is identified that "covers"
some of the instances.

General to specific rule induction (PRISM algorithm):

For each class C:
    Initialize E to the instance set.
    While E contains instances in class C:
        Create a rule R with an empty left-hand side that predicts class C.
        While R covers instances from classes other than C:
            For each attribute A not mentioned in R, and each value v,
                consider adding the condition A = v to the left-hand side of R.
            Select the A and v that maximize the accuracy, i.e.
                (# of instances from C) / (total # of instances covered by R);
                for equal accuracies, choose the condition providing the
                largest coverage.
            Add A = v to R.
        Remove the instances covered by R from E.
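A sketch of this procedure in Python, under the same assumption that instances are dicts with a "class" key; ties in accuracy are broken by coverage, as specified above:

def prism(instances, attributes, target_class):
    # Returns a list of rules for target_class; each rule is a dict of
    # attribute: value conditions whose conjunction covers only target_class.
    E, rules = list(instances), []
    while any(inst["class"] == target_class for inst in E):
        rule, covered = {}, list(E)
        # Specialize the rule until it covers no instance of another class
        # (or no attribute is left to add).
        while (len(rule) < len(attributes)
               and any(inst["class"] != target_class for inst in covered)):
            best, best_key = None, (-1.0, -1)    # (accuracy, positives covered)
            for attr in attributes:
                if attr in rule:
                    continue
                for value in {inst[attr] for inst in covered}:
                    subset = [inst for inst in covered if inst[attr] == value]
                    positives = sum(inst["class"] == target_class for inst in subset)
                    key = (positives / len(subset), positives)
                    if key > best_key:
                        best, best_key = (attr, value), key
            rule[best[0]] = best[1]
            covered = [inst for inst in covered if inst[best[0]] == best[1]]
        rules.append(rule)
        # Remove the instances covered by the finished rule from E.
        E = [inst for inst in E if not all(inst[a] == v for a, v in rule.items())]
    return rules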

Example: covering class "play=yes" in the weather data.

Rule 1: If {outlook=overcast} Then play=yes
(error = 0/4; covers 4 of the 9 "yes" instances)

Rule 2: If {humidity=normal, windy=false} Then play=yes
(error = 0/3; covers 3 of the 5 remaining "yes" instances)
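For instance, running the prism sketch above on the standard 14-instance weather data (with the usual attribute values, listed here as an assumption) reproduces these two rules first:

weather = [
    {"outlook": "sunny",    "temp": "hot",  "humidity": "high",   "windy": "false", "class": "no"},
    {"outlook": "sunny",    "temp": "hot",  "humidity": "high",   "windy": "true",  "class": "no"},
    {"outlook": "overcast", "temp": "hot",  "humidity": "high",   "windy": "false", "class": "yes"},
    {"outlook": "rainy",    "temp": "mild", "humidity": "high",   "windy": "false", "class": "yes"},
    {"outlook": "rainy",    "temp": "cool", "humidity": "normal", "windy": "false", "class": "yes"},
    {"outlook": "rainy",    "temp": "cool", "humidity": "normal", "windy": "true",  "class": "no"},
    {"outlook": "overcast", "temp": "cool", "humidity": "normal", "windy": "true",  "class": "yes"},
    {"outlook": "sunny",    "temp": "mild", "humidity": "high",   "windy": "false", "class": "no"},
    {"outlook": "sunny",    "temp": "cool", "humidity": "normal", "windy": "false", "class": "yes"},
    {"outlook": "rainy",    "temp": "mild", "humidity": "normal", "windy": "false", "class": "yes"},
    {"outlook": "sunny",    "temp": "mild", "humidity": "normal", "windy": "true",  "class": "yes"},
    {"outlook": "overcast", "temp": "mild", "humidity": "high",   "windy": "true",  "class": "yes"},
    {"outlook": "overcast", "temp": "hot",  "humidity": "normal", "windy": "true",  "class": "yes"},
    {"outlook": "rainy",    "temp": "mild", "humidity": "high",   "windy": "true",  "class": "no"},
]
rules = prism(weather, ["outlook", "temp", "humidity", "windy"], "yes")
print(rules[0])   # {'outlook': 'overcast'}
print(rules[1])   # {'humidity': 'normal', 'windy': 'false'}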