Data defines the model by dint of genetic programming, producing the best decile table.

Data Mining Paradigm: Historical Perspective

Bruce Ratner, Ph.D.

The term data mining emerged from the database marketing community sometime between the late 1970s and early 1980s. Statisticians did not understand the excitement and activity surrounding this new technique, since discovering patterns and relationships (structure) in data was not new to them. They had known about data mining for a long time, albeit under various names such as data fishing, snooping, and dredging, and, most disparagingly, “ransacking” the data. Because any discovery process inherently exploits the data and so produces spurious findings, statisticians did not view data mining in a positive light.

Simply looking for something increases the odds that it will be found; therefore, looking for structure typically results in finding structure. All data have spurious structures, which are formed by the “forces” that make things come together, such as chance. The bigger the data, the greater the odds that spurious structures abound. Thus, an expectation of data mining is that it produces structures, both real and spurious, without distinguishing between them.
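The point about chance can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions, not a method from the text: it reads "bigger data" as "more variables," fills every column with independent Gaussian noise (so no real structure exists), and counts the pairs of columns whose sample correlation exceeds a cutoff. The function names, the 30-row sample size, and the 0.36 cutoff (roughly the two-sided 5% critical value for a correlation on 30 observations) are all hypothetical choices.

```python
import random
from itertools import combinations


def pearson(x, y):
    """Sample Pearson correlation of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def count_spurious(n_vars, n_rows=30, threshold=0.36, seed=0):
    """Count pairs of pure-noise columns with |r| above the threshold.

    Every column is independent noise, so any pair counted here is
    spurious by construction (assumption of this sketch).
    """
    rng = random.Random(seed)
    cols = [[rng.gauss(0, 1) for _ in range(n_rows)] for _ in range(n_vars)]
    return sum(
        1 for a, b in combinations(cols, 2) if abs(pearson(a, b)) > threshold
    )


# More variables means more pairs to search, hence more chance "findings".
for n_vars in (5, 10, 20, 40):
    n_pairs = n_vars * (n_vars - 1) // 2
    print(n_vars, "variables,", n_pairs, "pairs,",
          count_spurious(n_vars), "spurious correlations")
```

With a fixed seed, the smaller data sets are prefixes of the larger ones, so the count can only grow as variables are added. None of the correlations found reflects a real relationship.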

Today, statisticians accept data mining only if it embodies the EDA paradigm. They define data mining as any process that finds unexpected structures in data and uses the EDA framework to ensure that the process explores the data, not exploits it. See Figure 1.1. Note the word “unexpected,” which suggests that the process is exploratory, rather than a confirmation that an expected structure has been found. When one finds what one expects to find, there is no longer uncertainty as to the existence of the structure. Statisticians are mindful of the inherent nature of data mining and try to make adjustments to minimize the number of spurious structures identified. In classical statistical analysis, statisticians have explicitly modified most analyses that search for interesting structure, for example by adjusting the overall alpha level (type I error rate) or by inflating the degrees of freedom. In data mining, the statistician has no explicit analytical adjustments available, only the implicit adjustments effected by using the EDA paradigm itself.
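As one concrete instance of an explicit adjustment to the overall alpha level, the sketch below uses the Bonferroni correction. The text does not name a specific adjustment, so the choice of Bonferroni, the function names, and the assumption that the tests are independent are all illustrative rather than the author's own method.

```python
def familywise_error_rate(alpha, n_tests):
    """Chance of at least one false positive across n_tests independent
    tests, each run at significance level alpha (independence is an
    assumption of this sketch)."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_alpha(alpha, n_tests):
    """Per-test level that keeps the overall level at or below alpha."""
    return alpha / n_tests


alpha, n_tests = 0.05, 20
unadjusted = familywise_error_rate(alpha, n_tests)
per_test = bonferroni_alpha(alpha, n_tests)
adjusted = familywise_error_rate(per_test, n_tests)

print("overall error rate, unadjusted:", round(unadjusted, 4))
print("per-test alpha after Bonferroni:", round(per_test, 4))
print("overall error rate, adjusted:", round(adjusted, 4))
```

Searching 20 candidate structures at the usual 0.05 level makes at least one false discovery more likely than not. Dividing alpha by the number of tests brings the overall rate back below 0.05.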