I use the machine learning GenIQ Model to build a classification model, which predicts the rank-order likelihood of being a male, to illustrate the advantages and to highlight the singular weakness of the machine learning paradigm. Specifically, the GenIQ Model shows the superiority of the machine learning paradigm over the statistical paradigm: it not only specifies the true model form (a computer program), but simultaneously performs variable selection (which in this example is trivial because only two predictor variables are considered) and data mining, and builds the model. It is like a Genetic Jackknife 3-in-1 Method. The difficulty in interpreting the computer program often accounts for the limited use of the machine learning paradigm.

Outline of Article

I. Situation

The data come from a study investigating a new method of measuring body composition, and give the body fat percentage (PERCENT_FAT), AGE, and gender (MALE=1 if male, MALE=0 if female) for eighteen normal adults aged between 23 and 61 years. How are AGE and PERCENT_FAT related, and is there any evidence that the relationship is different for males and females? Effectively, if a model that can distinguish between males and females can be built, then the model is the evidence. The “Phat Example” data are in Table 1, below (from American Journal of Clinical Nutrition, 40, 834-839).

Table 1. The “Phat Example” Data

I built the easy-to-interpret logistic regression model (LRM), and the not-so-easy-to-interpret GenIQ Model for the target variable MALE. This creates a counterpoint where the data analyst now can choose between a good interpretable model and a potentially better, unexplainable model.

The results of the Phat-LRM are in Table 2 (LRM log_of_odds_of_MALE Rank-order Prediction of MALE), below. The rank-order prediction of MALE is not perfect: adult ID #7 is in the sixth rank, not the fourth rank, which would have made the Phat-LRM results perfect.
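The idea of a perfect rank-order prediction can be made concrete with a minimal Python sketch. The function names and the toy records are hypothetical illustrations, not the study data and not the Phat-LRM itself. Records are sorted by descending model score, and the prediction is called perfect when every male is ranked above every female.

```python
def rank_order(ids, scores, is_male):
    """Sort records by descending model score and attach ranks (1 = top)."""
    rows = sorted(zip(ids, scores, is_male), key=lambda r: r[1], reverse=True)
    return [(rank, rec_id, score, male)
            for rank, (rec_id, score, male) in enumerate(rows, start=1)]


def is_perfect_rank_order(ranked):
    """True when all males (MALE=1) occupy the top ranks, above all females."""
    flags = [male for _, _, _, male in ranked]
    n_males = sum(flags)
    return all(flags[:n_males]) and not any(flags[n_males:])


# Hypothetical toy records (NOT the Phat Example data): three males, three females.
toy_ids = ["a", "b", "c", "d", "e", "f"]
toy_male = [1, 1, 1, 0, 0, 0]

# One male ("c") scores below a female ("d"), so the ordering is imperfect.
imperfect_scores = [2.1, 1.4, -0.3, 0.2, -1.0, -1.7]
# Every male scores above every female, so the ordering is perfect.
perfect_scores = [2.1, 1.4, 0.5, 0.2, -1.0, -1.7]

for rank, rec_id, score, male in rank_order(toy_ids, imperfect_scores, toy_male):
    print(rank, rec_id, score, male)

print(is_perfect_rank_order(rank_order(toy_ids, imperfect_scores, toy_male)))
print(is_perfect_rank_order(rank_order(toy_ids, perfect_scores, toy_male)))
```

The check looks only at the order of the scores, not at their magnitudes, which matches how the tables in this article are read.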

GenIQ variable selection provides a rank-ordering of variable importance for a predictor variable with respect to other predictor variables considered jointly. This is in stark contrast to the well-known, always-used statistical correlation coefficient, which only provides a simple correlation between a predictor variable and the target variable, independent of the other predictor variables under consideration. Because this study has only two predictor variables, the rank-ordering of variable importance is trivial.

Variable Importance (w/r/to other variables considered jointly)

1. PERCENT_FAT
2. AGE

VI. GenIQ Data Mining

GenIQ data mining is directly apparent from the GenIQ tree itself. Because this study has only two predictor variables, there are no signature GenIQ branches (genetically data-mined structures, i.e., new variables, the "golden nuggets" desired from a data mining effort), only a sine transformation of AGE, sin(AGE), denoted by sine_of_AGE, which is nonetheless representative of data mining, albeit in its simplest form.
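As a minimal sketch of what a derived variable such as sine_of_AGE looks like in code, the Python snippet below applies the sine function to a few ages. It rests on an assumption the article does not state: AGE is passed to sin() as-is, with the argument interpreted in radians. The ages 23 and 61 are the endpoints of the study's age range; 39 is a hypothetical value added for illustration.

```python
import math


def sine_of_age(age_years):
    """Derived variable sin(AGE).

    Assumption: AGE is used unscaled and treated as radians; the article
    does not say how GenIQ evaluates the sine function.
    """
    return math.sin(age_years)


# 23 and 61 bound the study's age range; 39 is a hypothetical in-between age.
ages = [23, 39, 61]

# The transformed variable can rank-order adults differently than raw AGE does.
by_age = sorted(ages, reverse=True)
by_sine = sorted(ages, key=sine_of_age, reverse=True)

print("ranked by AGE:        ", by_age)
print("ranked by sine_of_AGE:", by_sine)
```

Because the sine function is periodic, the new variable reorders the adults rather than simply rescaling AGE, which is why it can carry information that raw AGE does not.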

To appreciate the predictive power of the GenIQ Model, it is enlightening to see the single relationships for each predictor variable with the target variable in Tables 3, 4 and 5, which show the Rank-order Predictions of MALE based on AGE, on sine_of_AGE, and on PERCENT_FAT, respectively. Then, imagine the brilliance of the built-in IQ of GenIQ in how it uncovers and ties together the individual data-mined relationships into its final model output in Section IV (GenIQ Model Tree Display and Computer Program) above, and in the GenIQ Model Results in Table 6 below.

Table 3. Rank-order Prediction of MALE based on AGE

Table 4. Rank-order Prediction of MALE based on sine_of_AGE

Table 5. Rank-order Predictions of MALE based on PERCENT_FAT

VII. Phat-GenIQ Model Results

The results of the Phat-GenIQ Model are in Table 6 (GenIQ Model GenIQvar Rank-order Prediction of MALE), below. There is a perfect rank-order prediction of MALE.

Table 6. GenIQ Model GenIQvar Rank-order Prediction of MALE

VIII. Phat-GenIQ Model Version #2 Output and Results

GenIQ modeling is like all other (non-physical science) modeling: there is no unique model, but there are comparable, if not exact, results from alternative methods or different versions of the modeling process. To that end, I built a Phat-GenIQ Model Version #2. The Phat-GenIQ Model Version #2 Tree Display and Computer Program (which includes Int, the Integer function that takes the integer part of the number at hand) and its corresponding Table 7 (GenIQ Model Version #2 GenIQvar2 Rank-order Prediction of MALE) are below. GenIQ Model Version #2 produces a perfect rank-order prediction of MALE.

However, I prefer the first Phat-GenIQ Model over the Version #2 model because the first model is more compact (a desirable property of any model) and produces more precise model scores (obviously a desirable property of any model) than the second model. The first model is compact, albeit at the expense of the unexpected appearance of the sine function. Also, its model scores for the top two adult IDs, #3 and #4, have precisely distinguishing GenIQvar score values, 0.25638 and -0.74362, respectively. The Phat-GenIQ Model Version #2 is definitely not easy on the eyes (not compact), although it uses the easy-to-understand Integer function. But it is not as precise as the first model, as it assigns the same GenIQvar2 score value of 0.00000 to the top two adult IDs, #3 and #1.

The less precise Phat-GenIQ Model Version #2 prompts the question of whether the model is also less precise, or less discriminating, than the first Phat-GenIQ Model among the females (MALE=0). This can be addressed by the Coefficient of Variation (CV). (Recall, the CV is the ratio of the standard deviation to the mean, a dimensionless number that allows comparison of the variation of populations with different positive mean values. It is often reported as a percentage by multiplying the ratio by 100. The smaller the CV, the less variation among the population/sample values.) I use the CV to see if the variation, as an indicator of spread or diversity of model scores, is less for the second model than it is for the first model. I disregard the negative sign of the model scores to have positive mean values. The CVs are 22.97 and 23.08 for the GenIQvar2 and GenIQvar scores, respectively. Thus, Phat-GenIQ Model Version #2 is not as precise as the first Phat-GenIQ Model in severalizing the adult females.
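The CV comparison can be sketched in Python as follows. This is an illustration under stated assumptions, not the calculation behind the reported CVs: the function name is hypothetical, the sample standard deviation (n - 1 denominator) is assumed because the article does not say which form was used, and the score lists are made-up placeholders rather than the GenIQvar or GenIQvar2 values from Tables 6 and 7.

```python
import statistics


def coefficient_of_variation(scores, as_percent=True):
    """CV = standard deviation / mean, computed on absolute score values.

    Assumptions: the sign of each score is disregarded (as in the article),
    and the sample standard deviation (n - 1 denominator) is used.
    """
    magnitudes = [abs(s) for s in scores]
    mean = statistics.mean(magnitudes)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    cv = statistics.stdev(magnitudes) / mean
    return cv * 100 if as_percent else cv


# Hypothetical female-only model scores (NOT the values from Tables 6 and 7).
scores_model_1 = [-2.0, -4.0, -6.0]
scores_model_2 = [-3.0, -4.0, -5.0]

cv_1 = coefficient_of_variation(scores_model_1)
cv_2 = coefficient_of_variation(scores_model_2)

# The model with the larger CV spreads its scores more, relative to its mean.
print(f"CV model 1: {cv_1:.2f}")
print(f"CV model 2: {cv_2:.2f}")
```

Taking absolute values first mirrors the article's step of disregarding the negative sign so that the mean is positive; without it, a mean near zero would make the ratio unstable.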

As a counterpoint to analysis and modeling tasks in the non-physical sciences, consider:

The world's most famous equation: E = mc**2

It is unique, precise, and beautifully compact.

IX. Summary

The machine learning paradigm (MLP), “let the data suggest the model,” is a practical alternative to the statistical paradigm, “fit the data to the equation,” which has its roots in the era when data were only “small.” It was, and still is, reasonable to fit small data in a rigid parametric, assumption-filled model. However, the current information (big data) in, say, cyberspace requires a paradigm shift. MLP is a utile approach for database modeling when dealing with big data, as big data can be difficult to fit in a specified model. Thus, MLP can function alongside the regnant statistical approach when the data, big or small, simply do not “fit.” As demonstrated with the “Phat Example” data, MLP works well within small data settings.

For more information about this article, call Bruce Ratner at 516.791.3544 or 1 800 DM STAT-1; or e-mail at br@dmstat1.com.