To Data Mine or Not to Data Mine in the Fight Against Terrorism

Every nation is well within its rights to use whatever legitimate tools it has at hand, including data mining, to protect its citizens from terrorism. That, at least, was the premise with which I approached and accepted the invitation to participate in a two-day conference titled Data Mining and Human Rights in the Fight Against Terrorism. It was held in connection with DETECTER, a research project undertaken by the European Union focusing on the “ethical and legal ramifications of the use of various detection and surveillance technologies in counter-terrorist efforts.” The project is being carried out by a consortium of seven European academic institutions and coordinated by the Centre for the Study of Global Ethics at the University of Birmingham, UK. I found the conference to be a fascinating and intellectually challenging exercise.

There was an interesting mix of participants at the event, including a handful of academics and other attendees from the U.S. The bulk of the attendees were Europeans representing the legal, law enforcement, and intelligence communities, along with academic representatives from the universities and institutes involved. The academic contingent, interestingly enough, covered several disciplines, including philosophy, ethics, public policy, IT and the law.

The conference, organized by the research team from one of the participating partners, the University of Zurich, was held June 10-11 in Zurich, Switzerland. The organizers had previously conducted a study on data mining in the context of counter-terrorism, and this became the basis for the conference.

So back to the question the Europeans were debating: Should we data mine or not in the fight against terrorism?

The debate started with the basics, of course, as it almost always does. What exactly is data mining? So on to definitions. Here are some of the most popular:

Process that uses a variety of data analysis tools to discover patterns and relationships in data that may be used to make valid predictions. [Two Crows Corp.]

Analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner. [Hand, Mannila, Smyth]

The U.S. Government has actually defined data mining for us in the U.S. Code. The Federal Agency Data Mining Reporting Act (FADMRA), 42 U.S.C. § 2000ee-3(b)(1), defines it as follows:

A program involving pattern-based queries, searches, or other analyses of one or more electronic databases, where—

(A) a department or agency of the Federal Government, or a non-Federal entity acting on behalf of the Federal Government, is conducting the queries, searches, or other analyses to discover or locate a predictive pattern or anomaly indicative of terrorist or criminal activity on the part of any individual or individuals;

(B) the queries, searches, or other analyses are not subject-based and do not use personal identifiers of a specific individual, or inputs associated with a specific individual or group of individuals, to retrieve information from the database or databases; and

(C) the purpose of the queries, searches, or other analyses is not solely—

(i) the detection of fraud, waste, or abuse in a Government agency or program; or

(ii) the security of a Government computer system.

So is it a program, a process, an analysis, an extraction, or simply a certain use of information, as DETECTER apparently defines it? (“Use of information technology to attempt to derive useful knowledge from (usually) very large data sets.” DETECTER, Work Package #6)

The different definitions can lead you in very different directions. For example, the Department of Homeland Security, in its 2009 report to Congress for FADMRA compliance (2009 Data Mining Report to Congress, Department of Homeland Security, December 2009), identifies three data mining systems within DHS that must be reported:

Automated Targeting System (ATS)

Data Analysis and Research for Trade Transparency System (DARTTS)

Freight Assessment System (FAS)

On the other hand, in the study carried out by DETECTER in its first attempt to inventory data mining systems in the U.S. and Europe, it identified over a dozen in DHS alone. Granted, DETECTER's purpose was to develop a list of possible systems to track, explore and study, but a definition this broad makes even an Excel spreadsheet a potential data mining tool.

I happen to be a strong believer in the use of technology to improve the human condition. The question, of course, is how best to harness the power of a specific technology to do good things. Leaving aside for the moment its use as a counter-terrorism tool, we have already seen many proven applications of data mining. From its early beginnings in “database marketing,” we saw the potential in applications such as cross-marketing, credit assessment and fraud detection. We have also seen several reported stories about successful data mining applications in law enforcement. The Houston Police Department has been doing some interesting things in this area, and the New York Times reported how the Richmond Police were using it to better manage their resources, resulting in a decrease in crime in certain neighborhoods (“Reaping Results: Data-Mining Goes Mainstream” by Steve Lohr, May 20, 2007).

Data mining clearly holds promise in dealing with the explosion of data in the 21st century. Applications in healthcare, education, and logistics, among others, will probably provide significant benefits to society. The predicament that we confront is the classic dual-use dilemma: How can we perform data mining and protect ourselves from the potential of doing harm?

This arises because most data mining exercises in counter-terrorism almost by definition must “mine” personally identifiable information, or PII for short. We are ultimately trying to find the names, addresses, birth dates, passport numbers and driver's license numbers of terrorists within very large databases that store the PII of millions of individuals. And, by the way, the overwhelming majority of these individuals are not terrorists but perfectly innocent people. We are trying to find a needle in a haystack while simultaneously trying to determine what the needle looks like. This effort is obviously fraught with both technical problems and ethical dilemmas, since it exposes mostly innocent people to significant violations of their privacy, with potentially damaging consequences.

The potential harm from so-called false positives has been very much on the minds of most privacy activists and defenders of human rights. Data mining in some way implies the identification of patterns from which to create a “signature” or “profile” to apply against a database or list. A false positive, when dealing with people, is the erroneous identification of a person as fitting a profile (say, of a known terrorist) when the match is in fact wrong. We have often been told of persons being detained at airports for appearing on the TSA “no fly list” when they have been misidentified as someone else who shares the same name and/or other relevant attributes.
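The no-fly-list scenario can be reduced to a toy example (the names and records below are entirely made up): when matching is keyed on a name alone, every innocent person who happens to share a listed name is flagged.

```python
# Toy watchlist match on name alone (made-up data): a structural
# source of false positives, since names are not unique identifiers.

watchlist = {"john smith"}

travelers = [
    {"name": "John Smith", "dob": "1948-03-02"},   # the listed person
    {"name": "John Smith", "dob": "1990-11-17"},   # an unrelated traveler
    {"name": "Maria Lopez", "dob": "1975-06-30"},
]

flagged = [t for t in travelers if t["name"].lower() in watchlist]
print(len(flagged))  # 2: both John Smiths, though only one is listed
```

Real screening systems use more attributes than a name, but the underlying issue is the same: any attribute set that is not unique to one person will sweep in others who share it.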

So applying data mining to the task of catching terrorists confronts us up front with two problems. First, the technical problem: because acts of terrorism are extremely rare, we have very little data from which to compose a highly reliable signature. Without this truly trustworthy template for pattern matching, the likelihood of false positives is very high. This leads us to the second major problem, the ethical one: someone who is falsely identified as a terrorist is at severe risk of undeservedly having bad things happen to him or her. Hence, the potential for harming the innocent is very high.
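The base-rate arithmetic behind this first problem is easy to make concrete. The following sketch uses entirely hypothetical numbers (the population size, base rate, and accuracy figures are assumptions, not data from any real screening system) to show how even a seemingly accurate profile drowns true hits in false positives when the target class is rare:

```python
# Back-of-the-envelope base-rate calculation with hypothetical numbers:
# an accurate profile still yields mostly false positives when the
# condition being screened for is extremely rare.

population = 300_000_000       # people screened (assumed)
base_rate = 1e-6               # fraction who are actual terrorists (assumed)
sensitivity = 0.99             # P(flagged | terrorist), assumed
false_positive_rate = 0.01     # P(flagged | innocent), assumed

terrorists = population * base_rate            # 300 people
innocents = population - terrorists

true_hits = terrorists * sensitivity           # terrorists correctly flagged
false_hits = innocents * false_positive_rate   # innocents wrongly flagged

# Precision: of everyone flagged, what fraction is actually a terrorist?
precision = true_hits / (true_hits + false_hits)
print(f"true hits:  {true_hits:,.0f}")
print(f"false hits: {false_hits:,.0f}")
print(f"precision:  {precision:.5%}")
```

With these assumed numbers, nearly three million innocent people are flagged in order to catch 297 true positives, so a flagged individual has roughly a one-in-ten-thousand chance of actually being a terrorist.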

Because of these issues, a substantial amount of debate has moved to exploring whether there are ways to do data mining while preserving the privacy of the individuals whose sensitive information is being mined. A starting point might involve an examination of what people consider to be sensitive information. There have been surveys and studies done to determine the “degree of touchiness” that people assign to types of data about themselves. The Ponemon Institute did a study that appeared in the July 15, 2006, issue of CIO Magazine and showed that 83% of respondents were touchy about their health records and 74% about their banking or home mortgage records. Touchiness, of course, also brings up the question of who “owns” the information, where that information is generated and who holds or stores it.

The Fair Information Practice Principles (FIPPs) provide an initial framework for the privacy discussion, but beyond that we have seen the emergence of advances in so-called privacy-preserving data mining (PPDM). To simplify what is truly a very complex subject, PPDM techniques follow two broad approaches: a randomization approach or a cryptographic approach. The first injects random “noise” to hide or disguise individual records; because the statistical properties of the noise are known, it can be accounted for during analysis so that aggregate patterns still emerge. The second typically relies on cryptographic protocols, such as secure multi-party computation, that allow two different parties to analyze their databases jointly without either disclosing personally identifiable information to the other.
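A minimal sketch of the randomization approach, using made-up income figures: each record is disguised with zero-mean random noise before release, so no single disguised value reveals the original, yet the noise averages out of aggregate statistics.

```python
import random

random.seed(42)  # reproducible illustration

# Hypothetical sensitive attribute: individual incomes (made-up data).
true_incomes = [random.uniform(20_000, 120_000) for _ in range(100_000)]

# Randomization step: add zero-mean noise wide enough to mask any
# single person's true value.
noise_scale = 50_000
disguised = [x + random.uniform(-noise_scale, noise_scale) for x in true_incomes]

# The analyst sees only the disguised values, but the population mean
# survives because the injected noise averages to zero.
true_mean = sum(true_incomes) / len(true_incomes)
disguised_mean = sum(disguised) / len(disguised)
print(f"true mean:      {true_mean:,.0f}")
print(f"disguised mean: {disguised_mean:,.0f}")
```

Real PPDM schemes are considerably more careful about how much noise to add and how to reconstruct full distributions rather than just means; this only illustrates the core idea that aggregate utility can survive record-level disguise.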

But much of the problem around the use of data mining also arises from the role of trust in privacy concerns. Specifically, in many instances there seems to be a crisis of trust vis-à-vis the government. The debate at least allowed us to introduce some of the basic concepts around a “calculus of trust” that we might be able to apply to the problem as we attempt to develop solutions.

I also pointed to the importance of ethics, and to both a set of principles developed by the European Group on Ethics in Science and New Technologies and the Ten Commandments of Computer Ethics developed by the Computer Ethics Institute. These can provide frameworks for behavior that should be useful in the protection of personal data.

Given that data mining is ultimately a business intelligence exercise, it is important to understand as many viewpoints as possible with respect to the limitations of its potential, especially in this very important application in the fight against terrorism. As usual, it is key that we identify the second-order consequences of data mining, especially those that are unintended, unanticipated, and undesirable. Once we have done this and found ways to mitigate them, we will be in a better position to accomplish our objectives.

Dr. Barquin has been the President of Barquin International, a consulting firm, since 1994. He specializes in developing information systems strategies, particularly data warehousing, customer relationship management, business intelligence and knowledge management, for public and private sector enterprises. He has consulted for the U.S. military, many government agencies, and international governments and corporations.

He had a long career at IBM, spanning more than 20 years of both technical assignments and corporate management, including overseas postings and responsibilities. Afterwards he served as president of the Washington Consulting Group, where he had direct oversight of major U.S. Federal Government contracts.

Dr. Barquin was elected a National Academy of Public Administration (NAPA) Fellow in 2012. He serves on the Cybersecurity Subcommittee of the Department of Homeland Security’s Data Privacy and Integrity Advisory Committee; is a Board Member of the Center for Internet Security and a member of the Steering Committee for the American Council for Technology-Industry Advisory Council’s (ACT-IAC) Quadrennial Government Technology Review Committee. He was also the co-founder and first president of The Data Warehousing Institute, and president of the Computer Ethics Institute. His PhD is from MIT.