Wednesday, October 08, 2008

The Sun rises in the East and Data Mining doesn't work

Remember the No Fly list and all the other data-mined lists that our Beloved Leader and his Righteous Deputies have been using to Protect us from Evil the past 8 years? God told them it would work. They had no need of reason or logic, they Knew It Was Good.

Shockingly, and I'm not joking now, someone in power actually asked the National Research Council to see if God was telling the truth to His One True Disciple.

Even more shockingly, after years of what must have been very strenuous blocking efforts, the report is being released.

Only mildly astoundingly, it's not a forgery.

As certainly as the Sun rises in the East, the data mining efforts produce too many false positive results. They do not work for the purpose for which they were created.

The most extensive government report to date on whether terrorists can be identified through data mining has yielded an important conclusion: It doesn't really work.

A National Research Council report, years in the making and scheduled to be released Tuesday, concludes that automated identification of terrorists through data mining or any other mechanism "is neither feasible as an objective nor desirable as a goal of technology development efforts." Inevitable false positives will result in "ordinary, law-abiding citizens and businesses" being incorrectly flagged as suspects.

The whopping 352-page report, called "Protecting Individual Privacy in the Struggle Against Terrorists," amounts to at least a partial repudiation of the Defense Department's controversial data-mining program called Total Information Awareness, which was limited by Congress in 2003.

But the ambition of the report's authors is far broader than just revisiting the problems of the TIA program and its successors. Instead, they aim to produce a scholarly evaluation of the current technologies that exist for data mining, their effectiveness, and how government agencies should use them to limit false positives--of the sort that can result in situations like heavily-armed SWAT teams raiding someone's home and shooting their dogs based on the false belief that they were part of a drug ring.

The report was written by a committee whose members include William Perry, a professor at Stanford University; Charles Vest, the former president of MIT; W. Earl Boebert, a retired senior scientist at Sandia National Laboratories; Cynthia Dwork of Microsoft Research; R. Gil Kerlikowske, Seattle's police chief; and Daryl Pregibon, a research scientist at Google.

They admit that far more Americans live their lives online, using everything from VoIP phones to Facebook to RFID tags in automobiles, than a decade ago, and the databases created by those activities are tempting targets for federal agencies. And they draw a distinction between subject-based data mining (starting with one individual and looking for connections) and pattern-based data mining (looking for anomalous activities that could show illegal activities).

But the authors conclude the type of data mining that government bureaucrats would like to do [jf: What do you mean "would like to do"?! They've been doing this for 8 years] --perhaps inspired by watching too many episodes of the Fox series 24--can't work. "If it were possible to automatically find the digital tracks of terrorists and automatically monitor only the communications of terrorists, public policy choices in this domain would be much simpler. But it is not possible to do so."

A summary of the recommendations:

* U.S. government agencies should be required to follow a systematic process to evaluate the effectiveness, lawfulness, and consistency with U.S. values of every information-based program, whether classified or unclassified, for detecting and countering terrorists before it can be deployed, and periodically thereafter.

* Periodically after a program has been operationally deployed, and in particular before a program enters a new phase in its life cycle, policy makers should (carefully review) the program before allowing it to continue operations or to proceed to the next phase.

* To protect the privacy of innocent people, the research and development of any information-based counterterrorism program should be conducted with synthetic population data... At all stages of a phased deployment, data about individuals should be rigorously subjected to the full safeguards of the framework.

* Any information-based counterterrorism program of the U.S. government should be subjected to robust, independent oversight of the operations of that program, a part of which would entail a practice of using the same data mining technologies to "mine the miners and track the trackers."

* Counterterrorism programs should provide meaningful redress to any individuals inappropriately harmed by their operation.

* The U.S. government should periodically review the nation's laws, policies, and procedures that protect individuals' private information for relevance and effectiveness in light of changing technologies and circumstances. In particular, Congress should re-examine existing law to consider how privacy should be protected in the context of information-based programs (e.g., data mining) for counterterrorism.

By itself, of course, this is merely a report with non-binding recommendations that Congress and the executive branch could ignore. But NRC reports are not radical treatises written by an advocacy group; they tend to represent a working consensus of technologists and lawyers.

The great encryption debate of the 1990s was one example. The NRC's so-called CRISIS report on encryption in 1996 concluded export controls--that treated software like Web browsers and PGP as munitions--were a failure and should be relaxed. That eventually happened two years later.

Asking this NRC group to write this report is like using a neutron bomb to kill a fly. I guess that's what it takes to stand up to a television show and God's One True Disciple.

As noted above, this is not to say that the same data mining techniques could not "work" for other purposes. A medical test, for example, that's worse than useless for identifying disease in an individual patient may still be useful for estimating the prevalence of a disease in a larger population (the equivalent of "pattern-based data mining").
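For the curious, here is a rough sketch of why that can be true. It uses the standard correction for an imperfect test (often called the Rogan-Gladen estimator), which recovers true prevalence from the raw positive rate when sensitivity and specificity are known. The function name and every number below are made up for illustration; nothing here comes from the NRC report.

```python
def corrected_prevalence(apparent_rate, sensitivity, specificity):
    """Estimate true prevalence from the fraction of a population testing positive.

    Standard correction for an imperfect test:
        (apparent_rate + specificity - 1) / (sensitivity + specificity - 1)
    """
    discrimination = sensitivity + specificity - 1.0
    if discrimination <= 0:
        # A test no better than a coin flip carries no information to correct.
        raise ValueError("sensitivity + specificity must exceed 1")
    estimate = (apparent_rate + specificity - 1.0) / discrimination
    # Sampling noise can push the raw estimate outside [0, 1]; clamp it.
    return min(1.0, max(0.0, estimate))


# Hypothetical test: 90% sensitive, 95% specific, disease in 2% of people.
SENSITIVITY = 0.90
SPECIFICITY = 0.95
TRUE_PREVALENCE = 0.02

# Fraction of the whole population that tests positive (true plus false positives).
apparent = (SENSITIVITY * TRUE_PREVALENCE
            + (1.0 - SPECIFICITY) * (1.0 - TRUE_PREVALENCE))

# Chance that one particular positive result is a real case.
individual_ppv = SENSITIVITY * TRUE_PREVALENCE / apparent

population_estimate = corrected_prevalence(apparent, SENSITIVITY, SPECIFICITY)
```

With these invented numbers, only about a quarter of the people who test positive are actually sick, so the test is a poor basis for acting on any one person. The corrected population estimate, though, lands back on the 2% we started with.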

The key to deciding what works is to apply trivial statistical measures that have been well understood for decades, using math that was old when Newton was young. Measures like sensitivity, specificity, positive predictive value, receiver operating characteristic (ROC) curves and so on.

Tests are good or bad based on their properties in context.
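To make "in context" concrete, here is a minimal sketch of the arithmetic behind sensitivity, specificity, and positive predictive value. The population size, the number of actual terrorists, and the 99% accuracy figures are all hypothetical numbers I picked to be generous to the screen; they are not from the report.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Probability that a flagged person is a true positive (Bayes' rule)."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


def expected_flags(population, sensitivity, specificity, prevalence):
    """Return (expected true flags, expected false flags) for a screened population."""
    actual_cases = population * prevalence
    true_flags = sensitivity * actual_cases
    false_flags = (1.0 - specificity) * (population - actual_cases)
    return true_flags, false_flags


# Hypothetical scenario: a screen that is right 99% of the time in both
# directions, applied to everyone in a country of 300 million people,
# 3,000 of whom are assumed to be actual terrorists.
POPULATION = 300_000_000
ASSUMED_TERRORISTS = 3_000
prevalence = ASSUMED_TERRORISTS / POPULATION

ppv = positive_predictive_value(0.99, 0.99, prevalence)
true_flags, false_flags = expected_flags(POPULATION, 0.99, 0.99, prevalence)

# The same test in a different context: half the screened group is positive.
ppv_high_prevalence = positive_predictive_value(0.99, 0.99, 0.5)
```

Under these assumptions the screen flags roughly three million innocent people to catch about three thousand real targets, so a given flag is right only about one time in a thousand. Run the identical test on a group where half the members are positive and a flag is right 99% of the time. Same test, different context, opposite verdict.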

This is excruciatingly basic science.

It is a measure of the despair of our nation that it took a report from the National Research Council, for the Love of Reason, to tell us what every medical student gets beaten into their head (ok, so it doesn't always take).

Next thing you know someone will point out that the Orange Alert sign I see every time I drive by the airport is, you know ... a complete waste of electrons.

Reason. Logic.

I could cry.

Does this mean Obama stands a chance? That we're emerging from the Long Dark?

No. I am sure McCain/Palin will win. I am sure. Sure. Must not allow hope to emerge, down hope, keep back ...