Data defines the model by dint of genetic programming, producing the best decile table.

When Statistical Model Performance is Poor: Try Something New, and Try It Again
Bruce Ratner, Ph.D.

Typically, the data analyst approaches a problem directly with an (inflexible) procedure designed specifically for that purpose. For example, the everyday statistical problems of classification (i.e., assigning class membership with a categorical target variable) and prediction of a continuous target variable (e.g., sales or profit) are solved by the “old” standard binary or polytomous logistic regression (LR) models and the ordinary least-squares (OLS) regression model, respectively. This is in stark contrast to the newer machine learning “algorithmic” methods, which are only nominally statistical models, or more aptly non-statistical models, in that no effort is made to represent how the data were generated. They are nonparametric, assumption-free, “flexible” procedures that let the data define the form of the model itself. The working assumption that today’s (big) data fit the OLS and LR models, which were formulated within the small-data settings of their day (over 200 years ago and about 50 years ago, respectively), is not tenable. A flexible, any-size-data model that is self-defining clearly offers the potential for building a reliable, highly predictive model, something that was unimaginable two centuries ago, or even half a century ago. The most notable algorithmic methods for the everyday statistical problems are decision trees (sets of if-then rules), such as CART, CHAID, and C5.0.

The data come from a study of multiple sclerosis (MS) that asks whether the disease is infectious. The possibility that the disease is infectious rests on the apparently sudden occurrence of MS in the Faroe Islands after British troops arrived there in 1941. There were two groups of patients: those in Group A (coded GROUP_ = 0) had not been off the islands before onset, while those in Group B (coded GROUP_ = 1) had been off the islands for less than two years before onset. Is there any evidence that MS is infectious? Effectively, if a model can be built to distinguish between the groups, then the model is the evidence. The “MS” data are in Table 1, below (from “Is multiple sclerosis an infectious disease?” Biometrics, 46, 337-349).

Table 1. MS Data

I built the easy-to-interpret logistic regression model (LRM), and the not-so-easy-to-interpret GenIQ Model for the target variable GROUP_.

The results of the GROUP_-LRM are in Table 2, Rank-order Prediction of GROUP_ based on log_of_odds_of_GROUP_, below. There is not a perfect rank-order prediction of GROUP_. Patient ID #14 is considered the data mass, as the patient is poorly ranked at #26.

Table 2. Rank-order Prediction of GROUP_ based on log_of_odds_of_GROUP_.
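A "perfect rank-order prediction" can be checked mechanically: sort the patients by model score, highest first, and confirm that every GROUP_ = 1 patient sits above every GROUP_ = 0 patient. Because the log of odds is a monotone function of the predicted probability, ranking by either gives the same order. The following is a minimal Python sketch of that check, not the procedure used to produce the tables. It assumes that GROUP_ = 1 patients belong at the top of the ranking. The function names and the six-patient scores are hypothetical, made up purely for illustration, and ties in score are not handled.

```python
def rank_patients(ids, scores, labels):
    """Sort patients by score, highest first; return (rank, id, label) tuples."""
    ordered = sorted(zip(scores, ids, labels), key=lambda t: t[0], reverse=True)
    return [(rank, pid, label)
            for rank, (_, pid, label) in enumerate(ordered, start=1)]


def is_perfect_rank_order(ids, scores, labels):
    """True if every GROUP_ = 1 patient is ranked above every GROUP_ = 0 patient."""
    n_target = sum(labels)  # number of GROUP_ = 1 patients
    ranking = rank_patients(ids, scores, labels)
    return all(label == 1 for _, _, label in ranking[:n_target])


def misranked_targets(ids, scores, labels):
    """GROUP_ = 1 patients that fall below the top n_target ranks, as (id, rank)."""
    n_target = sum(labels)
    return [(pid, rank)
            for rank, pid, label in rank_patients(ids, scores, labels)
            if label == 1 and rank > n_target]


# Hypothetical toy data (NOT the MS data): six patients, three with GROUP_ = 1.
toy_ids = [1, 2, 3, 4, 5, 6]
toy_group = [1, 1, 0, 0, 1, 0]
scores_a = [2.1, 1.4, -0.3, 0.9, -1.2, -2.0]  # hypothetical scores, model A
scores_b = [3.0, 2.0, -1.0, -0.5, 1.0, -2.0]  # hypothetical scores, model B

print(is_perfect_rank_order(toy_ids, scores_a, toy_group))  # False
print(misranked_targets(toy_ids, scores_a, toy_group))      # [(5, 5)]
print(is_perfect_rank_order(toy_ids, scores_b, toy_group))  # True
```

In the toy data, model A places patient 5 at rank 5 even though only three patients have GROUP_ = 1, so its ranking is not perfect; model B places all three target patients in the top three ranks.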

V. GROUP_-GenIQ Model Results

The results of the GROUP_-GenIQ Model are in Table 3, GenIQ Model GenIQvar Rank-order Prediction of GROUP_, below. There is a perfect rank-order prediction of GROUP_. Note that Patient ID #14, who was poorly ranked at #26 by the LRM, is now reliably ranked #3.

The results of the GROUP_-GenIQ GenIQvar1 Model are in Table 4, GenIQ Model GenIQvar1 Rank-order Prediction of GROUP_, below. There is not a perfect rank-order prediction of GROUP_. Patient ID #6 is considered the data mass, as she is misranked at #8. Interestingly, as per the LRM, Patient ID #6 is not a data mass, as she is positioned at rank #3. Yet, as per the correct GenIQ GenIQvar Model, she is reliably placed at the end of the perfect ranking, at position #7. As for Patient ID #14, who was poorly ranked by the LRM at position #26, the patient is ranked #2 and #3 by the GenIQvar1 and the correct GenIQvar models, respectively.

Table 4. GenIQ Model GenIQvar1 Rank-order Prediction of GROUP_.

IX. Summary

The machine learning paradigm (MLP), "let the data suggest the model," is a practical alternative to the statistical paradigm, "fit the data to the equation," which has its roots in a time when data were only "small." It was, and still is, reasonable to fit small data in a rigid, parametric, assumption-filled model. However, today's big data require a paradigm shift. MLP is a utile approach for analyzing and modeling big data, as big data can be difficult to fit in a specified model. Thus, MLP can function alongside the regnant statistical approach when the data, big or small, simply do not "fit." And when the best-laid statistical models oft go astray and perform poorly, the approach is to try something new, and try it again.

For more information about this article, call Bruce Ratner at 516.791.3544, or e-mail br@dmstat1.com. If you would like to see how GenIQ works, sign up for a GenIQ webcast. I promise not to waste your time, as neither of us has time to waste.