perClass sdtree classifier offers a fast and scalable decision tree
implementation. A decision tree is built by finding the best threshold on
one of the features that improves class separation. The process is applied
recursively until a stopping condition is met. If the tree is fully
built, each data sample ends in a separate terminal node. This solution,
however, does not generalize well when classes overlap. As with other
classifiers, perClass strives to provide a good solution by default.
Therefore, sdtree applies a pruning strategy that stops tree growth at an
early stage. The pruning uses a separate validation set to estimate the
tree generalization error.

To illustrate the basic use of the decision tree, let us consider the
fruit_large data set.

We train the tree on the training subset tr. By default, the data set
passed to sdtree is split internally into two subsets. The first part is
used to grow the tree and the second part to limit its growth by
identifying a sufficient number of thresholds. This process happens
inside sdtree. We will see later how to take closer control of the
pruning process.
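A minimal training sketch (assuming tr is a perClass training subset; the
exact pipeline display may differ between perClass versions):

>> p = sdtree(tr)

The returned pipeline p may then be executed on the test subset to
estimate classification performance.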

Inspecting the pipeline steps, we can see that the number of thresholds and
the features used are available for direct query. Note how the number of
thresholds in the fully grown tree is much higher than in the pruned one.

>> [p(1).thresholds p2(1).thresholds]
ans =
    14    97

We may compare the fully grown tree with the tree built using the default
pruning strategy:

We may control the sdtree pruning algorithm in several ways. Firstly, we
may specify the fraction of the input data set used for training the tree
using the 'trfrac' option. By default, 80% of input data is used to grow
the tree and 20% to stop the growth process.
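For example, to grow the tree on 90% of the input data and validate on the
remaining 10%, we may pass the 'trfrac' option (a sketch; the 0.9 value is
only illustrative):

>> p = sdtree(tr, 'trfrac', 0.9)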

We may also inspect the pruning process in detail. The second output
argument returned by sdtree is a structure containing the full tree
and the error criterion estimated on the validation set at each tree
threshold.
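A sketch of retrieving this structure (only the full_tree field is
discussed here; other field names are version-dependent):

>> [p, res] = sdtree(tr)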

The res.full_tree field contains the fully grown tree. To be more
precise, it contains the tree grown as far as possible given the
'maxsamples' option. By default, at least 10 samples must be present in a
node to continue the growing process.
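To let the tree grow deeper before pruning, we may lower this limit via
the 'maxsamples' option (a sketch; the value 5 is only illustrative):

>> [p, res] = sdtree(tr, 'maxsamples', 5)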

To prune the tree manually, we may select a specific number of thresholds
by passing the tree pipeline to the sdtree function. Here, we select 10
thresholds:
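One plausible form of this call, assuming the fully grown tree is
available in res.full_tree and that sdtree accepts the desired threshold
count as a second argument (check your perClass version for the exact
syntax):

>> p10 = sdtree(res.full_tree, 10)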

Decision trees optimize class separability considering all
features. Therefore, a trained tree allows us to identify the features
that provide the most separability in our problem. In other words, we may
use it for feature selection.
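Assuming the trained step exposes the used features in a field analogous
to thresholds (the field name below is hypothetical), they could be
queried directly from the pipeline:

>> p(1).features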

The random forest classifier sdrandforest combines a large number of
specifically-built decision trees. Each tree is built by considering only
a randomly selected subset of features at each tree node. The combination
of their outputs is based on the sum rule.

By default, sdrandforest builds 20 trees and considers a subset of 20% of
the features at each node. Let us train a random forest classifier on the
medical data set. We use the data of the first five patients as a test set
and the remaining data as a training set:
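A minimal training sketch (assuming tr holds the training patients; the
defaults of 20 trees and 20% of features per node apply):

>> p = sdrandforest(tr)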