
By default, CART will not allow cross validation (CV) for any dataset that has more than 3,000 observations. The n-fold cross-validation technique is designed to get the most out of datasets that are too small to accommodate a hold-out or test sample. Once you have 3,000 records or more, we recommend that a separate test set be used.

For large datasets, it is recommended that a separate error set be used, either by manually splitting the dataset into learn and test samples (ERROR TEST or ERROR SEPVAR) or by using a randomly-selected test set (ERROR PROPORTION).

However, you can persist in using CV with the command:

BOPTIONS CVLEARN = n

The default value for n is 3000 but it can be reset to a larger value. For example, if you have 50,000 observations and want to use the entire dataset in a cross-validation run, issue the command:

BOPTIONS CVLEARN = 50000

CART is capable of determining the number of records in your data sets, and uses this information to predict the memory and workspace requirements for trees that you build. Also, CART will read your entire data set each time a tree is built. At times these actions may be problematic, especially if you have enormous data sets.

If you only wish to use the first N records, perhaps due to memory limitations or because you wish for faster turnaround during early exploratory analysis, you can direct CART to treat your data sets as if they have fewer records than actually exist in the data. (Another option is to contact Salford Systems regarding a memory compile upgrade so that CART can accommodate all your data; CART can be compiled to utilize up to 32 gigabytes of RAM. For further information on problem sizes and scalability, see CART Technical Overview - Scalability.)

There are two options on the LIMIT command to consider:

LIMIT DATASET = N, ERRORSET = N

These options tell CART to act as if your main data set (and error/test data set if you have one) has fewer observations than it actually does. For instance, if your data set has 500,000 observations but you wish to only use the first 25,000, issue the command:

LIMIT DATASET = 25000

Similarly, if you have an enormous separate test set and wish to only use 75,000 records from it, issue the command:

LIMIT ERRORSET = 75000

CART will now treat these data sets as if they were only 25000 and 75000 records in length. Any other records will be totally ignored.

CART uses strictly binary, or two-way, splits that divide each parent node into exactly two child nodes by posing questions with yes/no answers at each decision node. CART searches for questions that split nodes into relatively homogeneous child nodes, such as a group consisting largely of responders, or high credit risks, or people who bought sport-utility vehicles. As the tree evolves, the nodes become increasingly homogeneous, identifying important segments. Other methods, such as CHAID, favor multi-way splits that can paint visually appealing trees but that can bog models down with less accurate splits.

One of the strengths of CART is that, for ordered predictors, the only information CART uses is the rank order of the data, not the actual values. In other words, if you replace a predictor with its rank order, the CART tree will be unchanged.

Thus, CART splits cannot be affected by any transform of the data that preserves order (a monotone transform). For instance, AGE, log(AGE), AGE^2, etc., all would yield the same split. If you have an ordinal variable with values 1,2,3,4,5, then as long as the value "5" properly represents the highest value in the data, "1" represents the lowest value, and so forth, monotone transforms of the data (transforms that preserve rank ordering) will not alter how the predictor acts in the tree.
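The invariance described above can be checked with a small sketch. It is not CART's code: it assumes a textbook Gini search over one ordered predictor, and the names best_split_partition, age and bought are made up for illustration. Because the search only looks at the sort order of the values, AGE, log(AGE) and AGE^2 send exactly the same records to the left child.

```python
import math

def gini(idx, labels):
    """Gini impurity of the records whose row numbers are in idx."""
    counts = {}
    for i in idx:
        counts[labels[i]] = counts.get(labels[i], 0) + 1
    n = len(idx)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split_partition(values, labels):
    """Return the set of row numbers sent left by the best binary split."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])  # only rank order is used
    best_score, best_left = None, None
    for cut in range(1, n):
        if values[order[cut - 1]] == values[order[cut]]:
            continue  # cannot split between tied values
        left, right = order[:cut], order[cut:]
        score = (len(left) * gini(left, labels)
                 + len(right) * gini(right, labels)) / n
        if best_score is None or score < best_score:
            best_score, best_left = score, frozenset(left)
    return best_left

# Hypothetical data: AGE and a 0/1 target.
age = [23, 35, 47, 52, 61, 29, 44, 70]
bought = [0, 0, 1, 1, 1, 0, 1, 1]

raw = best_split_partition(age, bought)
logged = best_split_partition([math.log(a) for a in age], bought)
squared = best_split_partition([a ** 2 for a in age], bought)
print(raw == logged == squared)
```

The threshold value reported for the split differs between the three versions of the predictor, but the partition of the records does not.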

Of course, if a variable is categorical (discrete, unordered, nominal) then the values are just arbitrary labels. Simply indicating to CART that a predictor is categorical is sufficient for proper handling.

RESAMPLE: If the node is subsampled, this will choose a new subsample and generate new splits.

Several important caveats about interactive splitting:

You must be in command mode to use this feature. Future versions of CART will enable this feature via the GUI.

Because the interactively split tree is an exploratory tree, it will not be pruned back. To avoid growing a tree that is too large, be sure to limit the size of the tree by setting the complexity, depth, or number of nodes prior to building the tree.

If you only want to interactively split the top three nodes, use ABOVE DEPTH=2 to avoid having to interactively split other nodes on the left side of the tree before returning to the second node on the right side of the tree.

CART supports three "improvement penalties." The "natural" improvement for a splitter is always computed according to the CART methodology. A penalty may be imposed, however, that lessens the improvement, affecting the penalized splitter's relative ranking among competitor splits. If the penalty is enough to cause the top splitter to be replaced by a competitor, the tree is changed.

Improvement Penalties

Variable-Specific Penalty

This penalizes a given predictor (perhaps because it is expensive to collect and we do not want it serving as a splitter unless it is a really powerful predictor). If the user-defined variable-specific penalty is in the range [0,1] inclusive, then the natural improvement is adjusted as:

improve-adj = improve * (1 - variable_specific_penalty)

If the user-specified penalty falls outside of [0,1] then no penalty is imposed.
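As a minimal sketch of the rule just described (the function name is hypothetical, and this is an illustration of the formula rather than CART's own code):

```python
def variable_penalty_adjust(improve, penalty):
    """Apply a variable-specific penalty to a natural improvement.

    A penalty inside [0, 1] scales the improvement by (1 - penalty);
    a penalty outside that range is ignored, as stated in the text.
    """
    if 0.0 <= penalty <= 1.0:
        return improve * (1.0 - penalty)
    return improve

print(variable_penalty_adjust(0.20, 0.25))  # penalized
print(variable_penalty_adjust(0.20, 1.50))  # out of range, unchanged
```

A penalty of 1 reduces the improvement to zero, which effectively keeps the predictor from being chosen over any competitor with a positive improvement.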

Missing-value Penalty

This penalizes the improvement of a competitor based on the proportion of missing values for the competitor in the node in question. This makes it difficult, but not impossible, for a competitor with many missing values in a node to rise to the top of the competitor list and assume the role of primary splitter. If there are missing values, the improvement is adjusted as:

improve-adj = improve * SW1 * [ (Ngood/N) ^ SW2 ]

in which SW1 and SW2 are controlled in the PENALTY command, N is the size of the node, and Ngood is the number of records with nonmissing values for the variable in question. If there are no missing values (Ngood = N), no adjustment is made.
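The missing-value adjustment can be sketched as follows. The function name and the SW1/SW2 values in the example are assumptions chosen for illustration; the text does not state the defaults used by the PENALTY command.

```python
def missing_penalty_adjust(improve, n_good, n, sw1, sw2):
    """Scale an improvement by SW1 * (Ngood / N) ** SW2.

    n_good is the number of records in the node with a nonmissing value
    for the splitter, n is the node size. No adjustment is made when the
    splitter has no missing values in the node.
    """
    if n_good == n:
        return improve
    return improve * sw1 * (n_good / n) ** sw2

# Hypothetical node: 100 records, 80 with a nonmissing value.
print(missing_penalty_adjust(0.10, 80, 100, sw1=1.0, sw2=1.0))
print(missing_penalty_adjust(0.10, 80, 100, sw1=1.0, sw2=2.0))
print(missing_penalty_adjust(0.10, 100, 100, sw1=1.0, sw2=2.0))
```

With SW1 = 1, a larger SW2 punishes missingness more heavily, since Ngood/N is below 1 whenever values are missing.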

High level Categorical Penalty

This penalizes a categorical variable that has many levels relative to the size (unweighted N) of the node in question. For a categorical variable:

ratio = log_base_2(N) / (Nlevels - 1)

in which Nlevels is the number of levels for the categorical predictor and N is the number of learn sample records in the node.

improve-adj = improve * [ 1 - SW3 + SW3 * (ratio ^ SW4) ]

in which SW3 and SW4 are controlled on the PENALTY command.
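A small sketch of the high-level categorical adjustment follows. It applies the two formulas literally; the text does not say how a ratio above 1 is treated, so the sketch does not handle that case specially. The function name and the SW3/SW4 values are assumptions for illustration.

```python
import math

def categorical_penalty_adjust(improve, n, n_levels, sw3, sw4):
    """Scale an improvement by 1 - SW3 + SW3 * ratio ** SW4.

    ratio = log2(N) / (Nlevels - 1), where N is the unweighted number of
    learn records in the node and Nlevels is the number of levels of the
    categorical predictor.
    """
    if n_levels < 2:
        raise ValueError("a categorical splitter needs at least two levels")
    ratio = math.log2(n) / (n_levels - 1)
    return improve * (1.0 - sw3 + sw3 * ratio ** sw4)

# Hypothetical node: 64 records, 13-level predictor, so ratio = 6 / 12 = 0.5.
print(categorical_penalty_adjust(0.20, 64, 13, sw3=1.0, sw4=1.0))
print(categorical_penalty_adjust(0.20, 64, 13, sw3=0.5, sw4=2.0))
print(categorical_penalty_adjust(0.20, 64, 13, sw3=0.0, sw4=2.0))
```

Setting SW3 to 0 switches the penalty off, because the bracketed factor then equals 1 regardless of the ratio.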

Note that all three penalties can be in effect, in which case they all serve to decrease the "freely computed" improvement, resulting in an "adjusted" improvement, which is what appears in the competitor table and is used to rank the competitors.

These penalties are first used in adjusting the improvements evaluated for the competitors in a node. When generating surrogates, the penalties will affect the improvements computed for the surrogates in the same way — unless PENALTY SURROGATES=NO is specified, in which case improvements are not adjusted for surrogates even if missing values or high level categoricals are involved.

Note that the associations for surrogates are not penalized, so these penalties will not change the ordering of surrogates for a given primary splitter. They will only affect the improvement listed for a surrogate.

Unlike many data-mining tools, CART can accommodate situations in which some misclassifications, or cases that have been incorrectly classified, are more serious than others. CART users can specify a higher penalty for misclassifying certain data, and the software will steer the tree away from that type of error. Further, when CART cannot guarantee a correct classification, it will try to ensure that the error it does make is less costly. If credit risk is classified as low, moderate, or high, for example, it would be much more costly to classify a high-risk person as low-risk than as moderate-risk. Traditional data mining tools cannot distinguish between these errors.
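The credit-risk example can be made concrete with a small sketch of cost-sensitive class assignment. This shows only the general decision rule (pick the class with the lowest expected misclassification cost); it does not reproduce how CART uses costs while growing the tree. The class probabilities and cost values below are invented for illustration.

```python
def expected_costs(class_probs, cost):
    """Expected cost of predicting each class; cost[true][pred] is the penalty."""
    return {
        pred: sum(p * cost[true][pred] for true, p in class_probs.items())
        for pred in class_probs
    }

def min_cost_class(class_probs, cost):
    """Return the class whose prediction has the lowest expected cost."""
    expected = expected_costs(class_probs, cost)
    return min(expected, key=expected.get)

classes = ["low", "moderate", "high"]
# Hypothetical class mix in one terminal node.
node_probs = {"low": 0.6, "moderate": 0.3, "high": 0.1}

# Equal costs: every error costs 1.
unit_cost = {t: {p: 0 if t == p else 1 for p in classes} for t in classes}

# Unequal costs: calling a high-risk person low-risk is far worse than
# calling that person moderate-risk (values are assumptions).
risk_cost = {t: dict(row) for t, row in unit_cost.items()}
risk_cost["high"]["low"] = 10
risk_cost["high"]["moderate"] = 2

print(min_cost_class(node_probs, unit_cost))
print(min_cost_class(node_probs, risk_cost))
```

With equal costs the node is labeled with its most frequent class; with the unequal costs the label shifts away from "low" because the rare high-risk cases make that error too expensive.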

CART handles missing values in the database by substituting "surrogate splitters," which are back-up rules that closely mimic the action of primary splitting rules. Suppose that, in a given model, CART splits data according to household income. If a value for income is not available, CART might substitute education level as a good surrogate.

The surrogate splitter contains information that is typically similar to what would be found in the primary splitter. Other products' approaches treat all records with missing values as if the records all had the same unknown value; with that approach all such "missings" are assigned to the same bin. In CART, each record is processed using data specific to that record. This allows records with different data patterns to be handled differently, which results in a better characterization of the data.

By using surrogates to stand in for missing values, CART generates robust and reliable predictive models, even when applied to very large databases with hundreds of variables and many missing values. CART's identification of surrogate predictor variables also provides an effective way to discover low-cost predictive mechanisms. If the best splitting criterion in a tree involves an expensive or difficult-to-obtain measure, a less-expensive surrogate can be considered instead.
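How a record might be routed through a node with surrogates can be sketched as below. This is an illustration, not CART's implementation: the variable names and thresholds are invented, every rule is assumed to send small values left, and sending a record left when the primary splitter and all surrogates are missing is an assumption of the sketch.

```python
def route(record, primary, surrogates, default="left"):
    """Send a record left or right using the first rule whose value is present.

    primary and each surrogate are (variable_name, threshold) pairs.
    A value of None in the record means "missing".
    """
    for name, threshold in [primary] + list(surrogates):
        value = record.get(name)
        if value is not None:
            return "left" if value <= threshold else "right"
    return default  # assumption: fall back to a fixed direction

primary = ("income", 40000)
surrogates = [("education_years", 12)]

print(route({"income": 30000, "education_years": 16}, primary, surrogates))
print(route({"income": None, "education_years": 16}, primary, surrogates))
print(route({"income": None, "education_years": None}, primary, surrogates))
```

Each record is handled with whatever data it actually has, so two records that are both missing income can still end up in different child nodes.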

CART uses two test procedures to select the "optimal" tree, which is the tree with the lowest overall misclassification cost, thus the highest accuracy. Both test disciplines, one for small datasets and one for large, are entirely automated, ensuring that the optimal tree model will accurately classify existing data and predict results.

For smaller datasets and cases when an analyst does not wish to set aside a portion of the data for test purposes, CART automatically employs cross validation. While this frequently occurs in medical research, a shortage of training data can occur in the study of any rare event, such as specific types of fraud. In cross validation, ten different trees are typically grown, each built from a different 90 percent of the total sample and tested on the remaining ten percent. When the results of the ten trees are put together, a highly reliable determination of the optimal tree size is obtained. For large datasets, CART automatically selects test data or uses pre-defined test records or test files to self-validate results.

You wish to apply your results to new data, but CASE will not accept the data.

SOLUTION 1: Check For TR1 Tree File

Make sure the .TR1 tree file, in which CART stores the tree structure, variable list and category specifications, exists.

SOLUTION 2: Check Variable List on USE File

Every variable on the .TR1 tree file, except the target variable, must also exist on the USE file. The variables need not all have non-missing values, but they must exist. If the tree was originally grown with many extraneous variables of low or zero importance, you may wish to re-build the tree with a smaller MODEL list.

If CART determines that the optimal tree has a very large number of nodes and therefore is too complex for practical use or easy intelligibility, two solutions are possible.

SOLUTION 1: Pick a Smaller Tree

CART is designed to identify a set of candidate predictive trees complete with honest estimates of costs and standard errors. There is no reason not to decide to accept a higher error rate in exchange for a simpler tree, so long as you remain aware of the costs of the simplification.

SOLUTION 2: Parametric Model

Consider fitting a parametric model using CART to select variables and possible interactions. We have often used CART regression trees to partition a sample into subsets on which separate linear models are fit. CART can assist in specifying a switching regression.

CART will only search over all possible subsets of a categorical predictor for a limited number of levels. Beyond a threshold set by computational feasibility, CART will simply reject the problem. You can control this limit with the BOPTIONS NCLASSES = m command, but be aware that for m larger than 15, computation times increase dramatically.
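To see why computation times grow so quickly, note that a categorical predictor with m levels can be divided into two non-empty groups in 2^(m-1) - 1 distinct ways. The sketch below just evaluates that count; it says nothing about any shortcuts CART itself may use.

```python
def n_binary_partitions(m):
    """Number of distinct ways to split m levels into two non-empty groups."""
    if m < 2:
        return 0
    return 2 ** (m - 1) - 1

for m in (4, 15, 25, 50):
    print(m, n_binary_partitions(m))
```

Each additional level roughly doubles the number of candidate splits that an exhaustive search has to evaluate.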

SOLUTION: Convert The Variable Into Dummies

The ideal solution is to work with a supercomputer implementation of Salford Systems CART, because this will provide the optimal tree. Other alternatives are compromises that might not yield satisfactory results. One such compromise is to break the categorical variable into a vector of dummies. For example, a 50-level occupation variable could be coded into 50 separate indicators.
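The dummy-coding compromise can be sketched as follows. The helper name and the occupation labels are hypothetical, and in practice the recoding would be done in whatever tool prepares the data before it is passed to CART.

```python
def to_dummies(values, prefix):
    """Recode one categorical column as a 0/1 indicator per level."""
    levels = sorted(set(values))
    return [
        {f"{prefix}_{level}": int(value == level) for level in levels}
        for value in values
    ]

occupation = ["clerk", "nurse", "clerk", "pilot"]
rows = to_dummies(occupation, "occ")
for row in rows:
    print(row)
```

Each indicator can then only split one level off from all the others, which is why the result may be less satisfactory than a search over all subsets of levels.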

The use of multiple trees in a committee of experts is a relatively new technique, and one of CART's creators has developed a dramatically effective way of combining trees in CART. Prediction errors can be reduced as much as 50 percent by directing CART to draw 50 or more different random samples from the training data, grow a different tree on each sample, and then allow the different trees to "vote" on the best classification. When appropriate, combining trees can yield a substantial performance edge over any other data mining procedure. For more information, see Committee of Experts.
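The resample-and-vote idea can be illustrated with a toy sketch. To keep it short, each "tree" here is a one-split stump on a single predictor, which is a simplification and not what CART grows; the data, function names and seed are all assumptions made for the example.

```python
import random
from collections import Counter

def fit_stump(rows):
    """Fit a one-split tree on (x, label) rows by minimizing training errors."""
    majority = Counter(label for _, label in rows).most_common(1)[0][0]
    best = (None, majority, majority)
    best_err = sum(label != majority for _, label in rows)
    xs = sorted({x for x, _ in rows})
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2
        left = Counter(label for x, label in rows if x <= t)
        right = Counter(label for x, label in rows if x > t)
        left_label = left.most_common(1)[0][0]
        right_label = right.most_common(1)[0][0]
        err = (sum(left.values()) - left[left_label]
               + sum(right.values()) - right[right_label])
        if err < best_err:
            best, best_err = (t, left_label, right_label), err
    return best

def predict_stump(stump, x):
    t, left_label, right_label = stump
    if t is None:
        return left_label  # no usable split was found in this sample
    return left_label if x <= t else right_label

def fit_committee(rows, n_trees=50, seed=0):
    """Grow one stump per random sample drawn with replacement."""
    rng = random.Random(seed)
    return [fit_stump([rng.choice(rows) for _ in rows]) for _ in range(n_trees)]

def vote(committee, x):
    """Let the trees vote and return the most common prediction."""
    return Counter(predict_stump(s, x) for s in committee).most_common(1)[0][0]

data = [(x, "yes" if x >= 5 else "no") for x in range(10)]
committee = fit_committee(data)
print(vote(committee, 0), vote(committee, 9))
```

Individual trees may disagree near the class boundary because each sees a different sample; the vote smooths out those disagreements.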

A decision tree is a flow chart or diagram representing a classification system or predictive model. The tree is structured as a sequence of simple questions, and the answers to these questions trace a path down the tree. The end point reached determines the classification or prediction made by the model, which can be a qualitative judgment (e.g., these are responders) or a numerical forecast (e.g., sales will increase 15 percent).

CART is an acronym for Classification and Regression Trees, a decision-tree procedure introduced in 1984 by world-renowned UC Berkeley and Stanford statisticians, Leo Breiman, Jerome Friedman, Richard Olshen, and Charles Stone. Their landmark work created the modern field of sophisticated, mathematically- and theoretically-founded decision trees. The CART methodology solves a number of performance, accuracy, and operational problems that still plague many other current decision-tree methods.

Cross-validation is a method for estimating what the error rate of a sub-tree (of the maximal tree) would be if you had test data. Regardless of what value you set for V-fold cross validation, CART grows the same maximal tree. The monograph provides evidence that using a V of 10-20 gives better results than using a smaller number, but each number could result in a slightly different error estimate. The optimal tree — which is derived from the maximal tree by pruning — could differ from one V to another because each cross-validation run will come up with slightly different estimates of the error rates of sub-trees and thus might differ in which tree was actually best.

Normally, a test sample is used to prune the maximal tree down to an "optimal" tree. This is especially recommended for large data sets, from which a test sample can be withdrawn. However, there are times when the size of the data set makes withdrawing a test sample difficult. In the absence of a test sample and without using cross validation, no pruning is done — this is called EXPLORATORY — and the maximal tree is the result. Note that the maximal tree in an exploratory run is identical to the maximal tree when using a test sample, provided that the learn sample is the same for each run.

When you are unwilling to use a test sample but still desire estimates of the error rates of each tree in the sequence, cross validation may be used. In a nutshell, cross validation establishes how much to prune the maximal tree by building a series of "ancillary cross-validation trees" from which error rates of the maximal tree and its subtrees can be estimated. Cross validation does not affect the growth of the maximal tree at all because it is conducted after the maximal tree is grown. The V ancillary cross-validation trees may be similar to the maximal tree, but not necessarily. Here is how it works:

- The maximal tree is grown and saved. Note that we do not have any "independent" estimate of the error rates for each node in the maximal tree, because we do not have a test sample. A pruning sequence is defined based on node complexities of the maximal tree, although the error rate for each tree in the sequence is not yet known. In other words, we know which nodes to prune off the tree and in what order, and we have a series of subtrees defined by the pruning sequence, but we do not know how far to prune.

- V ancillary cross-validation trees are then grown, each on a partition of the learn sample. For instance, if 10 cross-validation trees are grown, each uses 90% of the learn sample for tree growth and the remaining 10% as a pseudo test sample with which to estimate error rates for the nodes in the cross-validation tree.

- Error rates from each of the V cross-validation trees are combined and mapped to the nodes in the original maximal tree. The V cross-validation trees are then discarded.

Now that estimates of the error/cost for each node in the maximal tree are known, we are in a position to prune the maximal tree and declare an optimal tree.
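The partitioning used in the second step can be sketched as follows. The sketch only shows how V learn/pseudo-test pairs might be formed; growing the ancillary trees and mapping their error rates back to the maximal tree are not shown. The function name and the use of a seeded shuffle are assumptions.

```python
import random

def cv_folds(n_records, v=10, seed=0):
    """Yield (learn, test) row-number lists for V-fold cross validation."""
    idx = list(range(n_records))
    random.Random(seed).shuffle(idx)
    folds = [idx[k::v] for k in range(v)]  # V roughly equal parts
    for k in range(v):
        test = folds[k]
        learn = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield learn, test

splits = list(cv_folds(100, v=10))
for learn, test in splits[:3]:
    print(len(learn), len(test))
```

Every record serves as pseudo-test data exactly once and as learn data in the other V - 1 runs, which is why changing V or the seed changes the error estimates but not the maximal tree.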

Q: We typically use the default of 10-fold cross validation in CART. However, when we change to, say, 20-fold cross validation, CART indicates a different optimal tree. Why?

A: In both cases the maximal tree is the same. 20-fold cross validation will partition the learning sample into 20 subsets and will generate 20 ancillary cross-validation trees. These trees, each with their own error rates, will be combined to yield estimated error rates for the maximal tree. Since we are combining 20 trees rather than 10, it is almost certain that the 20-fold combined error rates estimated for the maximal tree will differ from those estimated by combining 10-fold cross-validation trees. Although the pruning sequence is the same in both runs, a different tree may be chosen as optimal between the two runs due to the differing error rate estimates. In other words, the maximal tree and pruning sequence are the same, but the 10- and 20-fold cross-validation procedures can result in a different amount of pruning.

Q: In the tree sequence and on the "select tree" dialog we see "cross-validated relative cost" (with confidence intervals) and "resubstitution relative cost," for each tree in the tree sequence, e.g.:

Tree   Terminal Nodes   Cross-Validation Relative Cost   Resubstitution Relative Cost   Complexity Parameter
1      15               0.7457930 +/- 0.0142744          0.6738151                      0.0019035
2      10               0.7506419 +/- 0.0135887          0.6981514                      0.0024436
3      9                0.7533725 +/- 0.0136544          0.7033467                      0.0026077
4      7                0.7476655 +/- 0.0137743          0.7145392                      0.0028081
5**    6                0.7439012 +/- 0.0135847          0.7221265                      0.0038037
6      3                0.7605784 +/- 0.0142045          0.7499018                      0.0046392
7      1                1.0000000 +/- 0.0000896          1.0000000                      0.0625345

A: Cross-validated relative cost is the error rate of the tree, relative to the root node, using the cross-validation method. If you had used a test sample instead of cross validation, you would have been presented with “test sample relative cost.” The resubstitution relative cost depicts the error rate that would be estimated had you used a copy of the learn sample as your test sample. Note that this rate always decreases as the tree gets larger. This is a property of using the same data to estimate errors that were used to build the tree in the first place. The +/- number is a measure of the uncertainty around the actual (cross-validation or test sample) error rate of the tree in question when confronted with new data. The cross-validation error rate is derived from one cross-validation procedure, whereas a test sample error rate is derived from one test sample. Either way, if you ran another cross-validation procedure or used a different test sample you would likely see another (slightly) different error rate. The +/- gives an idea of the uncertainty of the error rate estimate.
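Using the figures from the tree sequence in the question, a short sketch can confirm two points made above: the tree flagged with ** is the one with the lowest cross-validated relative cost, and the resubstitution cost falls steadily as the number of terminal nodes grows. The tuple layout is an assumption of the sketch; the numbers are copied from the table.

```python
# (tree number, terminal nodes, cross-validated cost, resubstitution cost)
tree_sequence = [
    (1, 15, 0.7457930, 0.6738151),
    (2, 10, 0.7506419, 0.6981514),
    (3, 9, 0.7533725, 0.7033467),
    (4, 7, 0.7476655, 0.7145392),
    (5, 6, 0.7439012, 0.7221265),
    (6, 3, 0.7605784, 0.7499018),
    (7, 1, 1.0000000, 1.0000000),
]

# Tree with the lowest cross-validated relative cost.
best = min(tree_sequence, key=lambda row: row[2])
print("lowest cross-validated cost: tree", best[0], "with", best[1], "nodes")

# Resubstitution cost, ordered from the smallest tree to the largest.
by_size = sorted(tree_sequence, key=lambda row: row[1])
resub = [row[3] for row in by_size]
always_decreasing = all(a > b for a, b in zip(resub, resub[1:]))
print("resubstitution cost always decreases with size:", always_decreasing)
```

Note that the cross-validated costs of trees 1, 4 and 5 differ by much less than their +/- values, so the choice among them is not clear-cut.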

CART and MARS continue to read data stored in the legacy SYSTAT format, a binary (i.e., not human-readable) format widely used by statisticians and researchers using the SYSTAT statistical programs. Relative to comma-separated-text and some other binary formats, the legacy SYSTAT format is quite restrictive (limited variable name lengths, limited lengths of character data). We do not recommend that you use it. However, for our clients that do need to work with this format, we provide the following C and Fortran programs that illustrate how legacy SYSTAT datasets are structured. Originally, legacy SYSTAT format was written and read with Fortran code. Thus, because the format must accommodate the record segmentation and padding typical of Fortran I/O, the C version handles these issues explicitly.

Obtain LFR.F here (www.salford-systems.com/programs/lfr.txt): the Fortran 77 legacy SYSTAT file reader source. LFR is written entirely in Fortran 77. Record padding is handled implicitly by the Fortran I/O system. The user is prompted for an input legacy SYSTAT dataset. The data are echoed to the console in comma-delimited form as confirmation.

Obtain CLFR.C here (www.salford-systems.com/programs/clfr.c): the C legacy SYSTAT file reader source. CLFR is written entirely in C. Record padding is handled explicitly to match the Fortran conventions. You must give the input file name as the first command line argument. If you give an output file name as the second command line argument, data will be written to that file in comma-delimited form as confirmation; otherwise data are written to stdout.

Both are portable in the sense that they accommodate both Unix and Windows compilation platforms.

CART automatically produces a predictor ranking (also known as variable importance) based on the contribution predictors make to the construction of the tree. Predictor rankings are strictly relative to a specific tree; change the tree and you might get very different rankings. Importance is determined by playing a role in the tree, either as a main splitter or as a surrogate. CART users have the option of fine tuning the variable importance algorithm.

Variable importance for a particular predictor is the sum across all nodes in the tree of the improvement scores that the predictor has when it acts as a primary or a surrogate (but not as a competitor) splitter. Specifically, for node i, if the predictor appears as the primary splitter, then it has a contribution toward the importance of:

importance_contribution_node_i = improvement

If instead, the predictor appears as the nth surrogate instead of as the primary predictor, the expression is:

importance_contribution_node_i = (p ^ n) * improvement

in which p is the "surrogate improvement weight": a user-controlled parameter that is equal to 1.0 by default and can be set anywhere between 0 and 1. Thus, you are able to specify that surrogate splits contribute less towards a predictor's importance than do primary splits. This parameter is controlled with the BOPTIONS IMPORTANCE option.

Linear combination splits do not contribute in any way to variable importance.

If, in the absence of linear combinations, the surrogate improvement weight is greater than 0 and a variable has importance = 0.0, then that variable does not appear in the tree as a primary or surrogate splitter, although it may appear as a competitor.
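The importance calculation can be sketched as below. The node structure, variable names and improvement values are invented for illustration, and the sketch returns raw sums without any rescaling that CART's reports may apply.

```python
def variable_importance(nodes, p=1.0):
    """Sum improvements over nodes, weighting the nth surrogate by p ** n.

    Each node is a dict with a "primary" (name, improvement) pair and a
    ranked list of "surrogates" given as (name, improvement) pairs.
    Competitors are not included.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("the surrogate improvement weight must be in [0, 1]")
    totals = {}
    for node in nodes:
        name, improvement = node["primary"]
        totals[name] = totals.get(name, 0.0) + improvement
        for n, (name, improvement) in enumerate(node["surrogates"], start=1):
            totals[name] = totals.get(name, 0.0) + (p ** n) * improvement
    return totals

# Hypothetical two-node tree.
nodes = [
    {"primary": ("income", 0.20),
     "surrogates": [("education", 0.15), ("age", 0.10)]},
    {"primary": ("age", 0.05),
     "surrogates": [("income", 0.04)]},
]

print(variable_importance(nodes, p=1.0))
print(variable_importance(nodes, p=0.5))
```

Lowering p shrinks the credit given to lower-ranked surrogates fastest, since the weight for the nth surrogate is p raised to the nth power.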

As illustrated above, the results of a decision-tree data-mining project are displayed as a tree-shaped visual diagram. Discovered relationships and patterns in the data - even in massively complex datasets with hundreds of variables - are presented as a flow chart. Compare this to complex parameter coefficients in a logistic regression output or a stream of numbers in a neural-net output, and the appeal of decision trees is readily apparent.

The visual display enables users to see the hierarchical interaction of the variables. In addition, the display often confirms previous knowledge about important data relationships, which adds confidence in the reliability and utility of the CART model. Further, because simple if-then rules can be read right off the tree, models are easy to grasp and easy to apply to new data.

Salford Systems' CART is the only decision tree based on the original code of Breiman, Friedman, Olshen, and Stone. Because the code is proprietary, CART is the only true implementation of this classification-and-regression-tree methodology. In addition, the procedure has been substantially enhanced with new features and capabilities in exclusive collaboration with CART's creators. While some other decision-tree products claim to implement selected features of this technology, they are unable to reproduce genuine CART trees and lack key performance and accuracy components. Further, CART's creators continue to collaborate with Salford Systems to refine CART and to develop the next generation of data-mining tools.

CART includes seven single-variable splitting criteria - Gini, Symgini, twoing, ordered twoing and class probability for classification trees, and least squares and least absolute deviation for regression trees - and one multi-variable splitting criterion, the linear combinations method. The default Gini method typically performs best, but, given specific circumstances, other methods can generate more accurate models. CART's unique "twoing" procedure, for example, is tuned for classification problems with many classes, such as modeling which of 170 products would be chosen by a given consumer.
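For readers unfamiliar with the Gini criterion, the sketch below computes the textbook Gini impurity and the improvement of a single binary split from class counts. It is a generic illustration; the improvement values CART reports may also reflect priors, costs and penalties, which are not modeled here.

```python
def gini(counts):
    """Gini impurity 1 - sum(p_k ** 2) for a list of class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_improvement(left_counts, right_counts):
    """Parent impurity minus the size-weighted impurity of the two children."""
    parent = [l + r for l, r in zip(left_counts, right_counts)]
    n_left, n_right = sum(left_counts), sum(right_counts)
    n = n_left + n_right
    return (gini(parent)
            - (n_left / n) * gini(left_counts)
            - (n_right / n) * gini(right_counts))

# Hypothetical split of 100 records (50 per class) into two child nodes.
print(gini([50, 50]))
print(gini_improvement([40, 10], [10, 40]))
print(gini_improvement([25, 25], [25, 25]))  # uninformative split
```

A split that leaves both children with the same class mix as the parent has zero improvement, while purer children give a larger value.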

Other splitting criteria are available for inherently difficult problems in which even the best models are expected to have a relatively low accuracy. Demographics, for example, are often weak predictors of attitude- and preference-based segments. Special CART tree-growing options can dramatically increase the predictive accuracy of such demographic-based models. Additional unique tree-growing criteria are available for problems involving unequal misclassification costs, ordered target variables, and continuous dependent variables. To deal more effectively with select data patterns, CART also offers splits on linear combinations of continuous predictor variables. For this option, CART looks for weighted averages of predictor variables to use as splitters; these weighted averages can reveal important database structure and can uncover new critical measures.

Most data-mining projects involve classification for gaining insight into existing data and turning that knowledge into a predictive model. Typical classification projects include sifting profitable from unprofitable; detecting fraudulent claims; identifying repeat buyers; profiling high-value customers who are likely to churn; and flagging high-risk credit applications. CART is a state-of-the-art classification tool that, as a standalone package, can investigate any classification task and provide a robust, accurate predictive model. The software tackles the core data-mining challenges by accommodating classification for categorical variables, such as responder and non-responder, and regression for continuous variables, such as sales revenue.

In addition to delivering accuracy, CART offers three distinct advantages over other data-mining tools. First, CART is easily accessible to beginning users and does not require a high level of technical expertise to operate. CART's new, user-friendly GUI and reference manual guide users through a quick process. In addition, the default settings perform so well that many highly experienced experts do not change them. Second, CART results are extremely easy to interpret; the tree-shaped flow chart easily identifies the most important predictors. Lastly, CART costs thousands of dollars less than a data-mining suite, while handling classification projects comparably.

Rarely, computational instability or a precision problem will result in a message stating that an "internal error" has been encountered. To help Salford Systems resolve the problem, please attempt the following:

- Rebuild as an exploratory tree to see if CART completes the building process.

- Rebuild the tree with a randomly-selected subset for testing (ERROR p=.2, for example).

- Rebuild a cross-validation tree with an explicit SEED to perturb the selection of cross-validation subsets.

Please send Salford Systems via email (or diskette):

- command files or commands used prior to the build;

- output generated during the run;

- frequency table of your target variable;

- frequency table of your categorical predictor variables; and

- if possible, your data set, or any subset that can reproduce the problem.

If a variable does not enter the tree as a primary node splitter, it may still play an important role in the tree as a surrogate splitter. If you have turned the displaying of surrogate splitters off, you will not see how these variables affect the tree but they will still be used internally by CART when applying the tree to data. The Variable Importance Table produced by CART ranks the variables in the tree by their importance, a statistic measuring how strongly a variable acts as a primary or surrogate splitter.

Suppose a variable enters the tree as the top surrogate splitter in many nodes, but never as the primary splitter. If this variable is removed from the list of potential predictor variables and the tree is rebuilt, it will probably be a very different tree, and certainly will be if there are missing values in the data for the primary node-splitting variables.

Another possibility is due to the way CART grows trees. Normally, CART first grows a maximal tree and then tests it either through cross validation or a separate test sample. If a split does not hold up to testing, it is removed from the model. Thus, if a model splits one or more times on a particular variable, but none of these splits hold up to testing, the variable will not appear as a primary splitter in the final model. However, if the variable is dropped, the splits involving that variable in the maximal tree might be replaced by others, which may appear in the final tree.

CART is based on a decade of research, assuring stable performance and reliable results. CART's proven methodology is characterized by:

- Reliable pruning strategy – CART's developers determined definitively that no stopping rule could be relied on to discover the optimal tree, so they introduced the notion of over-growing trees and then pruning back; this idea, fundamental to CART, ensures that important structure is not overlooked by stopping too soon. Other decision-tree techniques use problematic stopping rules.

- Powerful binary-split search approach – CART's binary decision trees are more sparing with data and detect more structure before too little data are left for learning. Other decision-tree approaches use multi-way splits that fragment the data rapidly, making it difficult to detect rules that require broad ranges of data to discover.

- Automatic self-validation procedures – In the search for patterns in databases it is essential to avoid the trap of "overfitting," or finding patterns that apply only to the training data. CART's embedded test disciplines ensure that the patterns found will hold up when applied to new data. Further, the testing and selection of the optimal tree are an integral part of the CART algorithm. Testing in other decision-tree techniques is conducted after the fact and tree selection is left up to the user.

- In addition, CART accommodates many different types of real-world modeling problems by providing a unique combination of automated solutions:

1. surrogate splitters intelligently handle missing values;

2. adjustable misclassification penalties help avoid the most costly errors;

3. multiple-tree, committee-of-expert methods increase the precision of results; and

4. alternative splitting criteria make progress when other criteria fail.