as components of a data-supported counterterrorist system, helping to perform specific functions that intelligence agents find useful, such as helping to detect aliases, or combining all records concerning a given individual and his or her network of associates, or clustering events by certain patterns of interest, or logging all investigations into an individual’s activity history. Data mining could even help with such tasks as screening baggage or containers. Such tools may not specifically rank people as being of interest or not of interest, but they could contribute to those assessments as part of a human-computer system. This appendix considers these possible roles in an examination of what is currently known about data mining and its potential for contributing to the counterterrorism effort.

An important related issue is how to evaluate candidate techniques to judge their effectiveness before use. Evaluation is essential, first, because it can help to identify which of several contending methods should be implemented and whether they are sufficiently accurate to warrant deployment. Second, it is useful to continue assessing methods after they have been fielded, both to reflect external dynamics and to enable the methods to be tuned to optimize performance. In addition, assuming that these new techniques can provide important benefits in counterterrorist applications, it is important to ask to what extent their application might have negative effects on privacy and civil liberties and how such effects might be ameliorated. This topic is the focus of Appendix L.

H.2 PREPARING THE DATA TO BE MINED

Those who implement data mining methods know well that a large fraction of the effort goes into the initial treatment of the various input data files so that the data are in a form consistent with the intended use (data correction and cleaning, as described in Section C.1.2). The goal here is not to provide a comprehensive list of the issues that arise in these efforts, but simply to mention some of the common hurdles encountered before data mining techniques can be used, so that the entire process is better understood.

The following discussion focuses on databases containing personal information (information about many specific individuals), but much of it also applies to more general databases.

Several common data deficiencies need prior treatment:

Reliable linkages. Often several databases can be used to provide information on overlapping sets of individuals, and in these cases it is extremely useful to identify which data entries are for the same individuals across the various databases. This is a surprisingly difficult and