Data Mining: An Ill-defined Concept

Bruce Ratner, Ph.D.

The term Data Mining is an ill-defined concept in statistics and related disciplines. Long before the late 1970s and early 1980s, statisticians knew about data mining, albeit under various names such as data fishing, snooping, and dredging, and, most disparagingly, “ransacking” the data. Because any discovery process inherently exploits the data, producing spurious findings, statisticians did not view data mining in a positive light. A concept is well-defined when its definition specifies the concept – an idea that includes all that is characteristically associated with or suggested by the concept – in an unambiguous way along with its unique value. I performed a world-wide-web search for “definition of data mining.” Based on the search results, discussed below, I declare that data mining is not a well-defined concept. Data mining is in a state of helter skelter [1]: entry-to-mid-level data miners are in a quandary as to which definition to use, and practiced data miners produce research for which there are no meta-analyses of data-mining-based studies. I conclude that either a single definition spanning multiple disciplines or a set of discipline-specific definitions of data mining is long overdue.

Today’s data mining is a high concept: it has elements of fast action in its development, glamour as it stirs the imagination for the unconventional and unexpected, and a mystique that appeals to a wide audience that knows curiosity feeds human thought. I googled “definition of data mining” and received a gross (vis-à-vis net) number of 232,000 definitions! (Curiously, one of the entries was “Data mining is derogatory … ”) To have a sound working assumption for the task at hand, I netted the “gross” Google number to 2,320. (This netting in and of itself coincidentally reflects that Google’s search engine optimization is also ill-defined.) Suffice it to say that data mining is an ill-defined concept, as 2,320 definitions are clearly not needed to unambiguously explain the concept. Unprecedentedly, the data mining concept early on (circa 1970s/early 1980s) did not have, and currently does not have, the scholarly cause to take form. I conclude that data mining is an ill-defined concept. And I declare that the net number of definitions suggests there are discipline-specific data mining definitions; but how many are there: 18, 36, 54, … ? [2] Regardless of the agreed number of disciplines, 2,320 divided by the “agreed number” presents data mining, whether proper or discipline-specific, as an ill-defined concept.
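The division argument above can be made concrete with a small sketch. It takes the netted count of 2,320 and the candidate discipline counts 18, 36, and 54 from the text; the function name is hypothetical, and the even split across disciplines is an illustrative assumption rather than a claim about how the definitions are actually distributed.

```python
# Net count of definitions stated in the text (232,000 gross, netted to 2,320).
NET_DEFINITIONS = 2_320


def definitions_per_discipline(n_disciplines: int) -> float:
    """Average number of definitions per discipline, assuming an even split.

    The even split is an illustrative assumption, not a claim from the text.
    """
    if n_disciplines <= 0:
        raise ValueError("n_disciplines must be positive")
    return NET_DEFINITIONS / n_disciplines


# Candidate discipline counts floated in the text: 18, 36, 54.
for n in (18, 36, 54):
    share = definitions_per_discipline(n)
    print(f"{n} disciplines -> about {share:.0f} definitions each")
```

Under this assumption, even the largest candidate count leaves each discipline with dozens of competing definitions, which is the point of the argument: no plausible "agreed number" brings the count near one.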

To the seemingly intractable problem of developing a well-defined definition of data mining, I would like to add entry #2,321, the Statistics Definition of Data Mining:

Today, statisticians accept data mining only if it embodies Tukey’s EDA Paradigm. [3, 4] They define data mining as any process that finds unexpected structures in data and uses the EDA framework to ensure that the process explores the data, not exploits it. See Figure 1.1. Note the word “unexpected,” which suggests that the process is exploratory, rather than a confirmation that an expected structure has been found. By finding what one expects to find, there is no longer uncertainty as to the existence of the structure. Statisticians are mindful of the inherent nature of data mining and try to make adjustments to minimize the number of spurious structures identified. In data mining the statistician has no explicit analytical adjustments available, only the implicit adjustments effected by using the EDA paradigm itself.

Reference: 1. “Helter skelter,” from the Beatles’ song “Helter Skelter” on their 1968 album The Beatles, is the “right word.” [1a]

1a. “The difference between the ‘right word’ and the ‘almost right word’ is the difference between lightning and the lightning-bug.” Samuel Langhorne Clemens (1835 – 1910), better known as Mark Twain.