Seed for random numbers to select the NXTRY X-variables and NUNITSTRY units; default 0

OWNBSELECT = string token

Indicates whether or not your own version of the BSELECT procedure is to be used, as explained in the Method section (yes, no); default no

OUTOFBAGERROR = scalar

Saves the “out-of-bag” error rate

CONFUSION = matrix

Saves the confusion matrix

SAVE = pointer

Saves details of the forest that has been constructed

Parameters

X = factors or variates

X-variables available for constructing the tree

ORDERED = string tokens

Whether factor levels are ordered (yes, no); default no

IMPORTANCE = scalars

Saves the importance of each x-variable

Description

The data used to construct a random classification forest are a sample of individuals from several groups. The characteristics of the individuals are described in Genstat by a set of factors or variates which are specified by the X parameter of BCFOREST. The GROUPS option of BCFOREST defines the group to which each individual in the sample belongs, and the aim is to be able to identify the groups to which new individuals belong.

A random classification forest is a set of classification trees that are used collectively to identify the group to which an individual specimen belongs (see e.g. Breiman 2001). The identification is obtained by running a new individual through each tree to obtain that tree’s “vote” for the group of the individual. The identification is then taken as the group with most votes.
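The voting step can be illustrated with a minimal Python sketch. It is not Genstat code: the function name forest_vote is hypothetical, and the tie-breaking rule (the group seen first among the tied groups wins) is an assumption of the sketch, since the text above does not say how ties are resolved.

```python
from collections import Counter

def forest_vote(tree_votes):
    """Return the group that received the most votes from the trees.

    Assumption: ties go to the tied group that was encountered first.
    """
    counts = Counter(tree_votes)
    return counts.most_common(1)[0][0]

# One vote per tree for a single new individual (made-up labels).
votes = ["A", "B", "A", "C", "A", "B"]
identified_group = forest_vote(votes)
```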

Each classification tree is formed using a random sample of the X variables in the data set, and a bootstrap random sample of their units (i.e. sampled with replacement). The NXTRY option defines how many X variables to select, and the NUNITSTRY option defines how many units to take. The default for NXTRY is the square root of the number of variables, and the default for NUNITSTRY is two thirds of the number of units. The SEED option specifies a seed for the random numbers that are used to select the variables and to select the units. The default of zero continues an existing sequence of random numbers, if any of the random functions (GRSELECT etc) has already been used in the current Genstat run. Otherwise a seed is chosen at random.
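As an illustration only, the sampling for one tree might be sketched in Python as follows. The function draw_tree_sample is hypothetical, not part of Genstat. The sketch assumes that the defaults are rounded to the nearest whole number and that the X variables are drawn without replacement; the text above states only that the units are sampled with replacement.

```python
import math
import random

def draw_tree_sample(n_vars, n_units, nxtry=None, nunitstry=None, seed=None):
    """Pick the variable and unit indices used to build one tree."""
    rng = random.Random(seed)
    if nxtry is None:
        # Default: square root of the number of variables (rounding assumed).
        nxtry = max(1, round(math.sqrt(n_vars)))
    if nunitstry is None:
        # Default: two thirds of the number of units (rounding assumed).
        nunitstry = max(1, round(2 * n_units / 3))
    # Variables: assumed to be drawn without replacement.
    var_idx = rng.sample(range(n_vars), nxtry)
    # Units: bootstrap sample, i.e. drawn with replacement.
    unit_idx = [rng.randrange(n_units) for _ in range(nunitstry)]
    return var_idx, unit_idx

var_idx, unit_idx = draw_tree_sample(n_vars=16, n_units=30, seed=1)
```

With 16 variables and 30 units, the defaults in this sketch give 4 variables and 20 units per tree.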

A classification tree progressively splits the individuals into subsets based on their values for the factors or variates. Construction starts at a node known as the root, which contains all of the individuals. A factor or variate is chosen to use there that “best” divides the individuals into two subsets. Suppose the available X vectors are all factors with two levels: the first subset will then contain the individuals with level 1 of the factor, and the second will contain those with level 2. Also, any individual with a missing value for the factor is put into both subsets; so you can use a missing value to denote either variable or unknown observations. Factors may have either ordered or unordered levels, according to whether the corresponding value of the ORDERED parameter is set to yes or no. For example, a factor called Dose with levels 1, 1.5, 2 and 2.5 would usually be treated as having ordered levels, whereas levels labelled 'Morphine', 'Amidone', 'Phenadoxone' and 'Pethidine' of a factor called Drug would be regarded as unordered. For unordered factors, all possible ways of dividing the levels into two sets are tried. With variates or ordered factors with more than 2 levels, a suitable value p is found to partition the individuals into those with values less than or greater than p. The tree is then extended to contain two new nodes, one for each of the subsets, and factors or variates are selected for use at each of these nodes to subdivide the subsets further.
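To show what "all possible ways of dividing the levels into two sets" involves for an unordered factor, here is a small Python sketch. The function two_way_splits is hypothetical and does not reproduce Genstat's internal search. It lists each division into two non-empty sets once, which gives 2^(k-1) - 1 candidate splits for a factor with k levels.

```python
from itertools import combinations

def two_way_splits(levels):
    """List every division of the levels into two non-empty sets, each once."""
    levels = list(levels)
    if len(levels) < 2:
        return []
    first, rest = levels[0], levels[1:]
    splits = []
    # Keep the first level in the left-hand set so that mirror-image
    # divisions are not listed twice; stop before the left set takes everything.
    for r in range(len(rest)):
        for extra in combinations(rest, r):
            left = {first, *extra}
            right = set(levels) - left
            splits.append((left, right))
    return splits

drug_splits = two_way_splits(["Morphine", "Amidone", "Phenadoxone", "Pethidine"])
```

For the four-level Drug factor this yields 7 candidate splits, whereas an ordered four-level factor such as Dose has only 3 possible cut points.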

The effectiveness of the factor or variate to be chosen for each node depends on how the groups are split between the resulting subsets – the aim is to form subsets that are each composed of individuals from the same group. By default, this is assessed using Gini information (see Breiman et al. 1984, Chapter 4) but you can set option METHOD=mpi to use the mean posterior improvement criterion devised by Taylor & Silverman (1993). The ANTIENDCUTFACTOR option allows you to request Taylor & Silverman’s adaptive anti-end-cut factors (by default these are not used). The process stops when either no factor or variate provides any additional information, or the subset contains individuals all from the same group, or the subset contains fewer individuals than a limit specified by the NSTOP option (default 1). These nodes where the construction ends are known as terminal nodes.
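The following Python sketch shows the Gini criterion in its usual textbook form: the impurity of a node is one minus the sum of squared group proportions, and a split is scored by the decrease in impurity that it achieves. The function names are hypothetical, and the exact calculation performed inside Genstat may differ in detail.

```python
from collections import Counter

def gini(groups):
    """Gini impurity of a node: 1 - sum of squared group proportions."""
    n = len(groups)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(groups).values())

def gini_improvement(left, right):
    """Decrease in impurity when a node is split into left and right subsets."""
    parent = list(left) + list(right)
    n = len(parent)
    return (gini(parent)
            - len(left) / n * gini(left)
            - len(right) / n * gini(right))

# A split that separates the two groups perfectly, and one that does nothing.
perfect = gini_improvement(["a", "a"], ["b", "b"])
useless = gini_improvement(["a", "b"], ["a", "b"])
```

A larger improvement indicates subsets that are closer to being composed of a single group, which is the aim described above.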

The resulting forest (and its associated information) can be saved using the SAVE option. This can then be used in the BCFDISPLAY procedure to produce further output, or in the BCFIDENTIFY procedure to identify the groups for new values of the x-variables.

The OUTOFBAGERROR option can save the “out-of-bag” error rate. This is calculated using the individuals that were not involved in the construction of each tree. So, it gives an independent measure of the reliability of the forest. The idea is to put each individual through all of the trees where it was not used, and accumulate its votes for each of the groups. The individual is then identified by taking the group where it had most votes, and the error rate is calculated by comparing the identifications of the individuals with their true group (as defined by the GROUPS factor).
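The calculation just described can be sketched in Python as below. This is an illustration under stated assumptions, not the Genstat implementation: the function out_of_bag_error and its arguments are hypothetical, each tree is represented by a function returning its vote for a unit, ties go to the group seen first, and an individual that was used in every tree is left out of the error rate.

```python
from collections import Counter

def out_of_bag_error(true_groups, tree_units, tree_predict):
    """Proportion of individuals misidentified by their out-of-bag votes.

    true_groups  : true group of each individual
    tree_units   : for each tree, the unit indices used to build it
    tree_predict : for each tree, a function mapping a unit index to a vote
    """
    errors = 0
    scored = 0
    for i, truth in enumerate(true_groups):
        # Collect votes only from trees that did not use individual i.
        votes = Counter(
            predict(i)
            for units, predict in zip(tree_units, tree_predict)
            if i not in units
        )
        if not votes:
            continue  # assumption: skip individuals used in every tree
        scored += 1
        if votes.most_common(1)[0][0] != truth:
            errors += 1
    return errors / scored if scored else float("nan")
```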

The CONFUSION option can save the confusion matrix. This is a groups-by-groups matrix that can be calculated at the same time as the out-of-bag error. The rows represent the true groups, and the columns represent the out-of-bag identifications obtained using the forest. The diagonal of the matrix records the number of individuals correctly identified in each group, while the off-diagonal elements show the numbers that have been identified incorrectly (i.e. that have been “confused” with other groups).
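A minimal Python sketch of how such a matrix can be tallied is given below. The function confusion_matrix and the example labels are hypothetical, chosen only to illustrate the row and column layout described above.

```python
def confusion_matrix(true_groups, identified, labels):
    """Rows are true groups; columns are out-of-bag identifications."""
    index = {group: k for k, group in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for truth, ident in zip(true_groups, identified):
        matrix[index[truth]][index[ident]] += 1
    return matrix

true_groups = ["A", "A", "B", "B", "B"]
identified = ["A", "B", "B", "B", "A"]
matrix = confusion_matrix(true_groups, identified, labels=["A", "B"])
# matrix is [[1, 1], [1, 2]]: three individuals lie on the diagonal and two
# off it, so the corresponding error rate is 2/5.
```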

The IMPORTANCE parameter can save a variate giving the “importance” of each X variate or factor in the forest. This is calculated by accumulating the sum of the values of the selection function (see METHOD) over the times when the X variable is used in the forest.
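The accumulation can be pictured with the following Python sketch, which assumes that every split in every tree has been recorded as a pair of the X variable used and the value of the selection function at that split. The function names, variable names and values are hypothetical.

```python
from collections import defaultdict

def accumulate_importance(splits):
    """Sum the selection-function values over the splits that use each variable."""
    importance = defaultdict(float)
    for variable, value in splits:
        importance[variable] += value
    return dict(importance)

def ordered_importance(importance):
    """Importance ratings sorted in decreasing order."""
    return sorted(importance.items(), key=lambda item: item[1], reverse=True)

# Made-up record of splits across the whole forest.
splits = [("Dose", 0.5), ("Drug", 0.25), ("Dose", 0.125), ("Drug", 0.125)]
importance = accumulate_importance(splits)
ranking = ordered_importance(importance)
```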

Printed output is controlled by the PRINT option, with settings:

outofbagerror

out-of-bag error rate,

confusion

confusion matrix,

importance

importance ratings of the X variates and factors,

orderedimportance

importance ratings of the X variates and factors in decreasing order, and

Method

BCFOREST calls procedure BCONSTRUCT to form each tree. This uses a special-purpose procedure BSELECT, which is customized specifically to select splits for use in classification trees. You can use your own method of selection by providing your own BSELECT and setting option OWNBSELECT=yes. In the standard version of BSELECT, the BASSESS directive is used to assess the potential splits.