Data defines the model by dint of genetic programming, producing the best decile table.

Dummy Variables: The Problem and Its Solution
Bruce Ratner, Ph.D.

The classic approach to including a categorical variable in the modeling process involves dummy variable coding. A categorical variable with k classes of qualitative (non-numerical) information is replaced by a set of k-1 quantitative dummy variables. Each dummy variable is defined by the presence (a value of 1) or absence (a value of 0) of its class value. The class left out is called the reference class, to which the other classes are compared when interpreting the effects of the dummy variables on the response. The classic approach instructs that the complete set of k-1 dummy variables be included in the model, regardless of how many of them are declared non-significant.

This approach is problematic when the number of classes is large, which is typically the case in big data applications. By chance alone, as the number of classes increases, the probability that one or more dummy variables are declared non-significant increases. Putting all the dummy variables in the model effectively adds “noise,” or unreliability, to the model, as non-significant variables are known to be “noisy.” Intuitively, a large set of inseparable dummy variables poses a difficulty in model building: it quickly “fills up” the model, leaving no room for other variables.
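To make the coding scheme concrete, here is a minimal Python sketch of k-1 dummy variable coding with a reference class, along with a rough illustration of why the chance of at least one non-significant dummy grows with k. The function names, the `is_` column prefix, and the sample data are hypothetical, chosen only for illustration. The probability helper assumes each dummy is non-significant independently with the same probability p, a simplification that is not claimed in the text above.

```python
def dummy_code(values, reference=None):
    """Replace a categorical variable with k-1 dummy (0/1) variables.

    `reference` names the class left out. If omitted, the first class in
    sorted order is used (an arbitrary choice made for this sketch).
    """
    classes = sorted(set(values))
    if reference is None:
        reference = classes[0]
    if reference not in classes:
        raise ValueError(f"reference class {reference!r} not found in data")
    # Every class except the reference gets its own dummy variable.
    kept = [c for c in classes if c != reference]
    # 1 marks the presence of the class value, 0 its absence.
    rows = [{f"is_{c}": int(v == c) for c in kept} for v in values]
    return kept, rows


def prob_any_nonsignificant(k, p):
    """P(at least one of the k-1 dummies is non-significant).

    Assumes each dummy is non-significant independently with probability p,
    which is a simplifying assumption for illustration only.
    """
    return 1 - (1 - p) ** (k - 1)


regions = ["north", "south", "east", "west", "south"]  # hypothetical data
kept, rows = dummy_code(regions, reference="west")
print(kept)     # the k-1 = 3 classes that receive a dummy variable
print(rows[1])  # a "south" record: only is_south is 1
print(rows[3])  # a "west" record (reference class): every dummy is 0

# Under the independence assumption, the chance grows with the class count.
for k in (3, 6, 21):
    print(k, round(prob_any_nonsignificant(k, 0.3), 3))
```

A record in the reference class is identified by all k-1 dummies being 0, which is why no separate dummy is needed for it.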