Saturday, July 28, 2007

It is not known precisely why searching the databases, or data mining, raised such a furious legal debate. But such databases contain records of the phone calls and e-mail messages of millions of Americans, and their examination by the government would raise privacy issues.

While I recognize that the NYT is not a technical body, and reporters often get the gist of technology wrong, this particular kind of definition has swept the media to such a degree that the term "data mining" may never recover.

The definition itself has problems, such as:

1) searching databases per se is, I'm sure, not what they mean by data mining; almost certainly they see the problem as programs that automatically search the databases to find interesting patterns (and presumably horribly overfit in the process, registering many false positives). After all, a Nexus search searches a database and no one raises an eyebrow at that.

2) the problem with the searching is not the searching (or the data mining, in their terminology), but the data that is being searched. Therefore the headline of the story, "Mining of Data Prompted Fight Over Spying," should probably more accurately read something like "Data Allowed to Be Mined Prompted Fight Over Spying."

It is this second point that I have argued over with others who are concerned about privacy, and therefore have become anti-data-mining. It is the data that is the problem, not the mining (regardless of the definition of mining). But I think the term "data mining" resonates well and generates a clear mental image of what is going on, which is why it gained popularity in the first place.

So I predict that within 5 years, few data miners (and I consider myself one of them) will refer to themselves as data miners, nor will we describe what we do as data mining. Predictive Analytics, anyone?

3 comments:

I go back and forth on this. I think the only thing wrong with the term "data mining" is the number of people who misuse it. Marketers want anything on the computer (especially database querying and OLAP) to be "data mining", because (to them) it sounds cool. Privacy advocates want anything which gathers data to be "data mining" because (to them) it sounds ominous.

I suppose that if enough people continue this abuse of the term they will have effectively established a new definition, and the rest of us will have to abandon ship. Where do we go? I'm not sure. Too many people think of modeling as something done by underfed women on a walkway. Anything with "analytics" at the end seems (to me) like something a consultant uses to dress up the mundane. Perhaps we'll have to settle for "statistics"?

I definitely agree with Will that data mining has its own definition in various domains. However, I don't think we should avoid using it.

If it is used without any application interest (i.e. developing an algorithm and showing its capabilities on benchmark data sets) then it can be used as is. If data mining is used in a particular application domain (finance, biology, criminology, etc.), then the author should clearly state what he or she means by data mining.

I feel the issue is not just semantics but a much wider problem. Data mining is becoming the victim of its own success at doing both ethical and unethical work on data. People in many countries are becoming more aware that their privacy is being invaded much more easily and more thoroughly thanks to our digital records and data mining techniques. While this sentiment takes hold, there is no reaction from the data mining industry or academia on how we should be more ethical as a whole. We certainly should not wait for lawmakers to regulate the industry more.

Italo