Comments for "Terrorists, Data Mining, and the Base Rate Fallacy" (www.schneier.com) - a blog covering security and security technology.

Comment from CVi on 2013-08-02:
The computers look for terrorists and leave the dirty work to the police...
Resulting in the police doing the dirty work for the computer instead of hunting credit card and cell phone thieves.

They could assign the computers to catch the cell phone and credit card thieves, and let the police do their work. They'd probably catch as many terrorists.

@tom: "the majority of people who test HIV positive on their first test DO NOT have HIV, but no one says the test is useless, because all of those people then take a second test"
It depends. Let's not forget the miss rate: imagine if HALF of the people with HIV tested negative, in addition to the false positives.
Also, let's imagine that the second test involves a biopsy.
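The effect of a large miss rate can be made concrete with a small Bayes calculation (the prevalence and error rates below are illustrative assumptions, not real HIV statistics):

```python
# Bayes update for a screening test with a large miss rate.
# All numbers are illustrative assumptions, not real HIV statistics.
prevalence = 0.004          # assumed fraction of the tested population infected
sensitivity = 0.5           # hit rate: HALF of infected people test negative
false_positive_rate = 0.01  # assumed fraction of healthy people testing positive

p_pos = sensitivity * prevalence + false_positive_rate * (1 - prevalence)
p_infected_given_pos = sensitivity * prevalence / p_pos
p_missed = (1 - sensitivity) * prevalence  # infected people the test clears entirely

print(f"P(infected | positive) = {p_infected_given_pos:.3f}")
print(f"fraction of population infected but cleared = {p_missed:.4f}")
```

With these assumed numbers, most positives are still false, and half the infected population is never flagged at all, which is the point about the miss rate.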

One more thing that needs to be said, all the terrorists that get caught and get media coverage, are discovered by police "stumbling" upon a clue, that is nested up using regular police work, the old fashioned way.
The ones that doesn't get caught that way, might as well turn out be the ones that NSA can't catch either.
And as I said earlier, if they used the computers to catch the credit card and cell phone thiefs *instead*, they'd free up those resources to do other police work. The nett result would be a lot of saved time, money, and more "non terrorist" criminals caught.

Comment from hope on 2011-09-03:
@Tom, "Oh, one more thing: Floyd Rudmin, your professor. He is a professor of psychology. Which means, right, he is not an expert on Bayesian analysis"
He's not a statistician, but the study of groups within groups would have some bearing on this.
If you have 50 groups (character/ethnicity/lifestyle, etc.), and one of the groups is the likely type for a terrorist, then an 80% success rate could just reflect which people happened to be in the neighborhood sampled, with the other 49 groups not present in high numbers.
Think of a million-dollar house in a very bad neighborhood: the people who own the million-dollar house might not show up in the results.
Comment from Tom on 2011-09-03:
Well, there are lies, damned lies, and statistics, and this fits into the statistics category. The assumptions here are fundamentally flawed, sorry. You are absolutely correct that, given the overwhelming majority of non-terrorists relative to terrorists, an INITIAL positive "hit" as a terrorist is far more likely to identify a non-terrorist than a terrorist. But with each successive round of testing, the ability to identify a terrorist increases dramatically (and "further investigation" would not mean interrogation; it would mean reading a second email, or more likely a first email, since the initial positive hit would come from a computer identifying some anomaly like a keyword or a strange internet purchase).

As an example: the majority of people who test HIV positive on their first test DO NOT have HIV, but no one says the test is useless, because all of those people then take a second test, and the vast majority of people who test positive multiple times DO have HIV. Once you've gone through one or two rounds of selection, the odds of separating true positives from false positives become very favorable.
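This point about successive rounds of testing can be sketched with repeated Bayes updates; the rates below are made-up assumptions, and the calculation assumes the tests are statistically independent, which real repeated tests often are not:

```python
# Repeated Bayes updates: each independent positive test sharpens the posterior.
# All rates are made-up assumptions for illustration.
prior = 1e-4        # assumed base rate of the condition
sensitivity = 0.99  # P(positive | condition)
fpr = 0.02          # P(positive | no condition)

p = prior
for test_round in (1, 2, 3):
    # Bayes update after one more positive result
    p = sensitivity * p / (sensitivity * p + fpr * (1 - p))
    print(f"after positive test {test_round}: P(condition) = {p:.4f}")
```

Starting from a tiny prior, the posterior climbs from well under 1% to over 90% after three independent positives, which is the mechanism being described.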

I'm ambivalent about the use of data mining to capture terrorists, but I hate to see the credibility of statistics diminished in the eyes of the public because people without the ability or desire to use it properly try to abuse it to sway public opinion.

Oh, one more thing: Floyd Rudmin, your professor. He is a professor of psychology, which means he is not an expert on Bayesian analysis. He's just some guy as far as statistics are concerned. I feel it was dishonest not to state that he is a professor in a field unrelated to statistics, because it leads readers to assume his is an expert opinion. It isn't; it's just propaganda.

Comment from Tank on 2006-07-18:
>> @Tank: "Where but in the least informed discussions is it suggested that the NSA calls database is used to
>> identify terrorists rather than providing an unrivalled and infinitely useful investigative tool to aid existing
>> investigations by providing an outline of a suspect's personal contact networks?"

> Probably in the FISA court room, I guess. Isn't that the exact type of scenario where a warrant is granted?

Yep. The only thing that should generate a question mark here is how you went from sounding like you had a clue in one sentence....

....to sounding like you're puzzled by what you just said yourself in the following sentence.

BTW, who gives a shit what FISA is doing or not?
It doesn't factor into the conclusions of this article or my statements about the usefulness of phone contact data for mapping human networks.

Comment from chunkada on 2006-07-17:
"As terrible as the war in Iraq is, it has not managed to effectively kill 500,000 children unlike the trade sanctions earlier."

Comparing ten years with two years? And what about the mutagenic effects of the chemicals, heavy metals and radiological elements sprayed around, which will have the same effects as the ones sprayed about ten years ago did, i.e. birth defects and odd syndromes.

Comment from chunkada on 2006-07-16:
Oh and the best use for datamining?

Comment from chunkada on 2006-07-16:
@Tank: "Where but in the least informed discussions is it suggested that the NSA calls database is used to identify terrorists rather than providing an unrivalled and infinitely useful investigative tool to aid existing investigations by providing an outline of a suspect's personal contact networks?"

Probably in the FISA court room, I guess. Isn't that the exact type of scenario where a warrant is granted? So tell us, in your infinite expertise, what is the FISA-abortive NSA thing for?

... and many more responses like this .. I do not have the time. Good luck with it, hope you are over it soon America.

Comment from chunkada on 2006-07-16:
oh and if you have 1/300,000 terrorists (@Nigel Sedgwick), you have a problem that no dragnet is gonna cure. In a population of 300,000,000 -- 1,000 *terrorists*? Are you kidding me?! I don't see embattled militia fighting street-to-street over there yet. I don't see internal faction wars. Or do I?

I think P(T) is much, much lower. For actual terrorists, compare the 9/11 incident, which allegedly took 20 personnel within the US.

And given that this has not happened again, it probably means that fewer than 100 people in the US would commit a major act of terrorism even if you did nothing (I'm guessing).

And of those 100, how many are truly competent? And are they likely to have the same success as before? If so, why? Because your people are all busy snooping on their neighbours, instead of trying to make the country a nicer place and make its installations less usable for harm?

I think you'd be lucky to find 1,000 actual terrorists worldwide in any given year. What's that .. 1/6,000,000. Put that figure into your NSA dragnet probability calculator and watch the smoke come out.
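Plugging that 1/6,000,000 prior into Bayes' rule shows what happens even with a very accurate detector (the 99% hit rate and 0.1% false-positive rate below are generous assumptions, not figures from any real system):

```python
# A prior of 1/6,000,000 fed through Bayes' rule, with a generously
# accurate (assumed) detector.
prior = 1 / 6_000_000
hit_rate = 0.99              # assumed P(flagged | terrorist)
false_positive_rate = 0.001  # assumed P(flagged | innocent)

precision = hit_rate * prior / (hit_rate * prior + false_positive_rate * (1 - prior))
print(f"P(terrorist | flagged) = {precision:.8f}")
```

Even with these generous rates, the probability that a flagged person is an actual terrorist comes out at a tiny fraction of one percent: the smoke referred to above.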

Comment from chunkada on 2006-07-16:
above is me

Comment from Anonymous on 2006-07-16:
And one more thing:

Don't bring up the "the probability can be further refined by additional research" argument.

The probability assigned is defined as the final probability outcome of the SYSTEM. If you think it can be refined, then assign a better probability in the first place. Doesn't matter. The sums still say you're wrong.

Comment from chunkada on 2006-07-16:
oops: N(!t) and N(i) should both be either N(!t) or N(i) .. take your pick which symbol I should have used.

Comment from chunkada on 2006-07-16:
further on the coin flipping thing,

The point of the analogy is that even *if* the NSA dragnet is good enough to make the probability that a dragnetted identity is a terrorist P = 0.5, the problem is that you have still got a big group of people to whom that P applies. Go back and look at the numbers used in the examples given, and question which of these hit/miss rates you think are realistic for an automated system to achieve.

Make up an example for yourself using the hit and miss rates you think are real.

Then add it up like this:

* N(I) number of innocents dragged in = population of US (a very large number)

* N(!t) number of innocents believed by NSA to be terrorists = population of US * misidentification rate (still a very large number)

* N(T) number of terrorists dragged in = population of US * terrorism rate (a very small number)

* N(t) number of terrorists actually identified as terrorists by NSA (an even smaller number)

* P(T) probability that a person *identified by NSA as a terrorist* is *actually* a terrorist = N(t) / [N(t) + N(i)]

Remember that N(i) is much larger than N(t) -- a very small number divided by a very large number, i.e. approximately zero, as explained by the good Professor.

Recall also that N(i) is very large, so sorting through the [N(i) + N(t)] group by hand is not likely to be feasible. And because P(T) is almost zero, any correlation between appearing on the NSA list and actually being a terrorist is specious.
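With concrete numbers plugged into those quantities, the sum works out as follows (the misidentification and hit rates are assumptions for illustration; the 1/300,000 terrorism rate is the figure attributed to Nigel Sedgwick earlier in the thread):

```python
# The quantities above with concrete (assumed) numbers plugged in.
population = 300_000_000
terrorism_rate = 1 / 300_000    # the 1/300,000 figure, for the sake of argument
misidentification_rate = 0.001  # assumed dragnet false-positive rate
hit_rate = 0.99                 # assumed chance a real terrorist is flagged

n_t_real = population * terrorism_rate                          # N(T)
n_t_flagged = n_t_real * hit_rate                               # N(t)
n_i_flagged = (population - n_t_real) * misidentification_rate  # N(i)

p_terrorist_given_flag = n_t_flagged / (n_t_flagged + n_i_flagged)  # P(T)
print(f"N(t) = {n_t_flagged:,.0f}, N(i) = {n_i_flagged:,.0f}")
print(f"P(T) = {p_terrorist_given_flag:.4f}")
```

Roughly a thousand flagged terrorists drowned in hundreds of thousands of flagged innocents, so P(T) lands well under one percent, matching the "approximately zero" conclusion above.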

The argument given above by some readers that the NSA are nice to people who appear on the list kinda reinforces this argument, rather than weakening it. The NSA *know* that the list is meaningless.

So what is the purpose of the drag-net?

Don't ask simply, what is the purpose of the list? -- that is not necessarily the purpose of the drag-net. In fact, I hope the list is not the purpose of the drag-net, since as pointed out also by the Professor or Bruce, the list does have correlations to activities other than terrorism -- unless the identities are chosen *completely at random*.

So what we end up with is a mass of publicity, a mass of fear toward the state, a mass of fear of terrorists, and a list of people who fit some set of criteria which has not been made public.

But what we don't end up with is a useful list of people who have any usable probability of being associated with terrorism.

And what was the cost in financial terms of the technical implementation, let alone the social and personal costs and the future political and societal implications? This technology is not run-of-the-mill. It has been purpose-designed and implemented at great cost at multiple points in the system, i.e. multiply the cost by the number of installations (it's not deployed widely enough to become cheaper with scale ... unless it is being deployed globally ...).

I think... think about this.

Comment from chunkada on 2006-07-16:
If you thought the argument was wrong, you are incorrect. NSA dragnetting is not effective at finding terrorists. The probability argument is quite correct. The NSA dragnet pulls in and misidentifies many, many innocents while locating only a few "baddies", and the problem of separating those groups still remains. Probably not very easy, given that all target suspects fall, by definition, into the category of what the NSA call "dodgy".

The coin-flipping is referring to looking at people who have been selected through NSA, not at looking at random members of the population. It is a confusing presentation to use.

The rest of the argument must be that the cost of invading the privacy and unjustly accusing 30,000 or however-many innocents to find 90 or however-many *potential* criminals is too high.

I think this is a discussion about gaining intelligence concerning people who *may* at some point commit a crime, more than it is about locating people who evidence indicates have committed crimes. If it were the latter, then I think a more directed approach would be taken. Would the NSA, FBI, etc. even consider this level of atrocity if they had the option of following hard-evidence leads? I don't think so.

Either they have a secondary motive, or they have simply misjudged the appropriateness of this response.

Comment from Clive Robinson on 2006-07-13:
Having read through the postings, the argument appears to boil down to the probability of finding a lone terrorist before he has committed the act, based on his communications and contacts.

In practice I doubt very much that that is the main aim of most anti-terrorist activities.

The professor is probably correct: you will not find an intelligent lone terrorist by data mining or any other mass surveillance technique; it is just too easy to stay below the noise level. Also, the history of their communications and contacts is not likely to throw up any other terrorists.

Also, the lone terrorist, due to supply difficulties, is not likely to have access to sufficient materials to be "Random Target" active. They are more likely to pick a target such as an aircraft or train, where a small explosion will produce a "high value" return. Due to this, the normal surveillance systems are considerably more likely to pick them up.

However, if you think instead about terrorist organisations, you are not dealing with lone individuals. This gives rise to recruitment issues, where a history of communications and contacts will have a high probability of identifying other members of the terrorist organisation.

With a terrorist organisation, the most desirable person to remove is the "Directing Mind", followed by the "Financing Hand", then either the "Supporting Network" or "Recruiting Agents". If these people are removed, the terrorist organisation will at best become dysfunctional or cease to exist.

The terrorists who commit the actual acts are, as has been seen recently, "expendable bio-mass/DNA" and will have been kept as an isolated group for a significant period of time by the organisation for security. This means that there may well not be sufficient history in the NSA DB for their communications and contacts to be seen.

However, if you can identify even one recruiter and work your way back up the command chain to the directing mind, you can then work your way back down the individual paths to quite a large part of the organisation.

The problem is that in an established terrorist organisation the recruiter is likely to know they are a marked person and will use non-conventional communications (say, cut-outs) and contacts back to the rest of the organisation.

This also suggests that the Professor might be correct, in that you cannot mine data you don't have.

However, the next line of attack the security services can take is to follow the financing and purchasing chains. Even terrorists need to eat, sleep and relax, all of which requires the expenditure of money. Unless they are out at a job, they will need to receive the money from somewhere.

Data mining for people with odd financial profiles is going to prove very, very fruitful, not just for finding terrorists but for drug dealers, people traffickers and other criminals.

We do not know if the NSA has access to everybody's financial information, but it would seem unlikely that they do not at some level (tax returns etc.) or could not easily obtain it in bulk (after all, large chunks are for sale as a commercial activity, and the DHS does have the power to get the information if it so desires).

Also, to commit a serious attack terrorists need transportation and other materials, most of which can be traced back to a financial transaction, the recording of which is usually beyond their control.

Also, some materials are just not that easy to get hold of in the quantities required, so looking at abnormalities in purchases (or thefts) of certain materials and other items might well give an indication of an event becoming likely. Likewise with importation and transportation information. Again, we do not know if the NSA has access to these types of records, but again it would seem unlikely that they do not at some level.

So if the NSA can get access to financial, purchase and transportation records as well, then the odds of finding terrorists go up a lot.

As the credit agencies do a lot of financial modelling of US citizens already, a scan through their DBs cross-correlated with even a very large list of possible terrorists will produce significant dividends.

I think that with additional data, over and above communications and contacts, an automated system could be quite effective at finding large numbers of "undesirables", not just terrorists...

Comment from winsnomore on 2006-07-12:
While the good professor doesn't know exactly what criteria the NSA uses, he is surely brilliant for proving it can't work!!

Comment from Vulturetx on 2006-07-12:
Wow, all the wrong assumptions, from the original article onward to the commenters.

1. Data mining, when using a seed of "known terrorist(s)", significantly increases the detection rate. Yes, there are more false positives than true positives. It turns out many decision trees have this fault; it does not stop beneficial results from occurring.
2. Data mining is a group of programs run by the NSA. When a hit is corroborated by multiple programs, the possibility of a false positive is lessened.
3. Contrary to extremists like Roy, being tagged as a "terrorist suspect" by the NSA does not even mean investigation, much less the death and imprisonment he claims, since the FBI and other agencies subject these lists to human review.
4. Yes, the system has worked, and the NYT has talked about it. They just did not understand the methodology.

5. Congrats, you are already a victim of data mining - usually a multiple-incident victim - but you keep reading your email and going to websites.
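Point 2, corroboration across multiple programs, can be sketched as follows (the per-program rates are assumptions, and the calculation treats the programs as independent, which real ones may well not be):

```python
# Corroboration sketch: requiring hits from several independent programs
# drives the combined false-positive rate down fast.  The per-program
# rates are assumptions, and real programs are rarely fully independent.
per_program_fpr = 0.01   # each program wrongly flags 1% of innocents
per_program_hit = 0.9    # each program flags 90% of real targets

for n_programs in (1, 2, 3):
    combined_fpr = per_program_fpr ** n_programs
    combined_hit = per_program_hit ** n_programs
    print(f"{n_programs} program(s): FPR = {combined_fpr:.6f}, hit rate = {combined_hit:.3f}")
```

The trade-off is visible in the output: each extra corroborating program cuts the false-positive rate by a factor of 100 while costing only a modest fraction of the hit rate.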

Me - someone who has built the data mining collection clusters.

Comment from johnb on 2006-07-12:
The comparison to flipping a coin is specious - flipping a coin on 300 million people in the US would misidentify 150 million as terrorists. A detector that was only 23% accurate, but only misidentified 3000 people, as in the example, would be quite useful.

Do you ever respond to these comments? I usually love your site, but this post is extremely bad. My first thought upon reading it was that the 50-50 coin toss analogy was terrible, and in fact a system with one false positive for every true positive would be an excellent system indeed. And sure enough, klassobanieras and others have been hammering this point. Do you have a response?

I guess I find this worrying because I normally respect your judgement. Not to get too personal, but are you letting your feelings against the program cloud your analysis?

Comment from Nigel Sedgwick (www.camalg.co.uk) on 2006-07-12:
bob wrote: "... not ostracized by their neighbors, put in jail or had their homes and possessions seized"

That is why I like such a measure as Detection Gain, Watchlist.

It is fairly easy to understand, for example, that a legitimate suspect is thought to be approximately 10,000 times more likely to be a terrorist than the average US citizen (i.e. around a 97% chance that he is innocent); he needs to be investigated further, prior to any consideration of arrest or search warrants. That indicates how the suspect should be treated, much better than "he's a suspected terrorist, bring him in (dead or alive)".
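Those figures check out as an odds-form Bayes update (the one-in-300,000 base rate below is an assumption, consistent with the figure quoted earlier in the thread):

```python
# Odds-form Bayes check of the "10,000 times more likely" figure.
base_rate = 1 / 300_000   # assumed prior: one terrorist per 300,000 people
detection_gain = 10_000   # likelihood ratio implied by the watchlist hit

prior_odds = base_rate / (1 - base_rate)
posterior_odds = prior_odds * detection_gain
p_terrorist = posterior_odds / (1 + posterior_odds)
print(f"P(terrorist | flagged) = {p_terrorist:.3f}")  # roughly 3%, i.e. ~97% innocent
```

A 10,000-fold detection gain on that prior leaves the flagged person about 97% likely to be innocent, exactly the treatment-guiding number given above.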

And remember that the current fuss is about traffic analysis of telephone call logs: it's nowhere near evidence in the sense normally considered in a criminal prosecution.

Best regards

Comment from bob on 2006-07-12:
@tony: your lottery analogy fails because the people who did NOT win the lottery were only out ~$1, not ostracized by their neighbors, put in jail, or had their homes and possessions seized.
Comment from Anonymous on 2006-07-12:
Just to summarize:

All of you that use "maths" are stupid and ludicrous, but I can't prove any of this because I work on some super secret stuff in the UK. But if you don't believe me, you're stupid and ludicrous too, because you don't use Google.

And, bloody hell, that stupid explorer cleared my name out again when I wasn't looking! Someone's attacking me! Perhaps I can find out who is really doing this using my sophisticated data mining techniques, the brilliance of which Bruce and everyone else can't seem to comprehend.

Comment from Tank on 2006-07-12:
> And your statement that there are no well-described terrorist profiles is plain
> wrong, There are hundreds of them. I use them every day.
> Please credit your readers' intelligence by doing a bit of research on this topic
> before you write about it again. -- " " @ July 11, 2006 12:04 PM

This is a recurring problem.
Given that reporting on suspect and evidence captures in terrorism cases is now worldwide mainstream news, and that there are at least a dozen published works dealing only with analysis of terrorists' motivations, personal accounts and their lives, at some point you have to conclude the ignorance is willful and purposeful.

Hell, places like SITE are now included alongside the NYT in Google News. Exactly how much research could you do on the topic of terrorism and still believe that the best characteristics the NSA has for terrorist profiles are their 7/11 purchases and which phone numbers they dialled? My guess is none.

Comment from Tank on 2006-07-12:
> @Tank You claim the article is flawed but offer no mathematics to refute it.
> Posted by: Ralph at July 11, 2006 01:19 AM

What math?
I said the assumption that this data is used to identify persons as terrorists rather than identify the human networks associated with an identified subject is flawed.
Square that or add 7 if you like but the point you missed was that adding math to a flawed assumption is pointless.

> You suggest it might be rigged but also offer nothing to support the accusation.

Yeah... I did. The problem here is apparently that you didn't read or understand anything I posted before you replied to it.

BTW, did you fail to provide math to refute my assertion that DNA is completely useless for identifying criminals because vials of DNA all look the same in a line-up (false positive rate), or did you get the point I was making about a rigged argument against the usefulness of data?

> Data mining for MV crime after it has been commited is not the same as
>looking for someone you think might commit a crime at an unknown future date.

Yeah that was my point.
My other one was that you'd need to be ignorant or intentionally misleading if, upon learning that there was an MV registry which is used frequently by all law enforcement agencies in investigations, you assumed that it was being used for predicting future crimes or identifying potential criminals.

Supporting such a ridiculous assumption with maths, however competently calculated, in no way improves upon the ridiculousness of your assumption.

> Please don't use the word we because you don't speak for me. If you represent
> more than yourself plse could you disclose this to other readers.

Yeah, actually I do speak for you, since you're not gonna be willing to say you disagree with what I've written.
In fact, since I can't imagine anyone will, I think I'll stick with the all-encompassing "we" as entirely appropriate.

Comment from Jon on 2006-07-12:
I am the author of the post at "Posted by: Anonymous at July 11, 2006 12:04 PM". I didn't actually mean to post it anonymously: explorer helpfully cleared that form field for me when I wasn't looking. I've no idea who the subsequent "Anonymous" was, but I fear he/she may be taking the mickey.

Of course I cannot point to evidence of these successes and I cannot tell you where to find the technology without revealing who I work for. What I can say is that I do not work for the US government; I work in the UK.

But you don't need to believe *me*; if Bruce and others actually bothered to do some research on Google, they'd find lots and lots of people all successfully doing data mining in this way. I'll give you a clue: social network analysis. And while they were there, they could look up data mining and find out what it is. This would hopefully stop them making gob-smackingly stupid assumptions about how you would use it to look for terrorists and other criminals. The 'maths' presented assumes that data mining techniques just look at every 'indicator' one by one to see whether it indicates a terrorist or not. This is clearly ludicrous and ignorant. The whole point of data mining is to work with relationships amongst multiple entities and statistical relationships amongst multiple indicators, thus rendering all of this 'maths' about data mining utterly meaningless.

If my calculations are correct, it means that the system is correct 99.9898% of the time.

The statistic compared to the flipping of the coin was: if someone is identified as a terrorist, are they really a terrorist? If the system gave 50% for that statistic, wouldn't it mean that only one innocent person would be detected per terrorist detected? Surely narrowing their search from 300 million to 30,400 to find their 400 terrorists would be a worthwhile step.
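A rough version of that arithmetic, assuming all 400 terrorists are flagged along with 30,000 innocents:

```python
# Rough check of the narrowing argument.  Assumes all 400 real terrorists
# are flagged, along with 30,000 innocent false positives.
population = 300_000_000
terrorists = 400
false_positives = 30_000

flagged = terrorists + false_positives       # people left to investigate
accuracy = 1 - false_positives / population  # fraction of verdicts that are correct
precision = terrorists / flagged             # P(terrorist | flagged)

print(f"flagged = {flagged:,}")
print(f"overall accuracy = {accuracy:.4%}")
print(f"P(terrorist | flagged) = {precision:.4f}")
```

The overall accuracy looks superb precisely because almost everyone is a true negative; the contested number in this thread is the precision, which stays near 1%.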

Comment from tony on 2006-07-11:
I think this article is absurd. Actually, his approach to explaining his theory does not make sense to me. It is like saying that nobody is going to win the lotto because the chances are slim:

I'd rather not say anything more, but let's just say that my work is very unique and specialized and is unencumbered by stuff like math or proofs or anything like that.

Comment from bob on 2006-07-11:
This seems to be the thread that won't die.

You guys also seem to be overlooking what they do AFTER they've decided that (a given person) is not a threat (begging the question that they ever actually decide someone will be excluded from further 'processing'), what do they do with the information pertaining to him/her? Keep it until something he/she has done IS illegal? Let office workers take it home on a laptop and leave the hard drive sitting on the roof of their car overnight? Sell it?

Comment from Davino on 2006-07-11:
Kevin Davidson: I agree. And from the original post, the very highest of scores, 1-(3,900/300,000,000), under the most fantastic conditions would give maybe a 23% chance of being right, or a 77% chance of shooting an innocent person.

Terrorist's moms and their mom's friends might score surprisingly high by this program.

I think you missed the point with scoring. Scoring eventually results in a binary choice, you either investigate further or you don't. If you investigate hundreds of thousands of people, then the resources you could have applied in other areas are mostly wasted.

Could you please provide pointers on how to find more information about these successes? It would almost be comforting to me to see evidence that data-mining can work against terrorist networks, that we are not throwing privacy out the window for dubious gains. But it's hard for me to imagine the US government not widely publicizing any such successes to justify their efforts.

Your blue sedan counterexample is seriously rigged. 30,000 or 300,000,000 suspects is functionally equivalent if you have 10 cops. Investigating blue sedans makes sense if there are 100 blue sedans and you have 10 police officers, it makes no sense if there are 1,000,000 blue sedans and you have one cop.

Using the numbers from the example in the link (remembering they're pretty generous): if you have 1,000 cops, that's 30 suspects per cop to investigate. Again, remember that all of the "usual suspect" questions have already been asked (that's the point of the data mining in the first place). You already know there's "something suspicious" about these suspects, so you'd have to figure that investigating them is going to require some serious work. If it takes 2 weeks to clear a suspect, that's 1,000 cops working full time for well over a year to identify the 400 terrorists.

Now, admittedly, that looks like a pretty good deal. But if you have 1,000 cops working full time for that long to catch 4 terrorists, it's not so much of a good deal if you can catch 8 of them by having one cop log into a chat room, pretend to be an Islamic militant, and do humint. If it takes a month of bugging their phones and following them (something that may require a team of investigators), the tradeoffs start looking really bad quickly...
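The investigator-workload arithmetic above, sketched from the stated assumptions (1,000 cops, 30,000 flagged suspects, 2 weeks to clear each):

```python
# Investigator-workload arithmetic from the assumptions above.
suspects = 30_000
investigators = 1_000
weeks_per_clearance = 2

suspects_each = suspects // investigators           # suspects per investigator
weeks_needed = suspects_each * weeks_per_clearance  # full-time weeks per investigator
print(f"{suspects_each} suspects each -> {weeks_needed} weeks of full-time work")
```

That is 60 full-time weeks per investigator before the list is cleared, which is the crux of the resource argument.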

Comment from Anonymous on 2006-07-11:
Bruce - I think much of your work is great but on this issue I have to inform you that you're missing a few tricks.

I can't tell you much about what I do, but in my everyday work I use data mining techniques (admittedly quite specialised ones, but data mining nevertheless) to track down fraudsters, terrorists and other 'organised' criminals. And it works. In fact, it works really well. I don't need to appeal to Bayes' theorem or to any speculation based on completely unrealistic made-up scenarios. I can simply point to the fact that I do it for real, day in, day out, on real live data, and it works. It doesn't catch everyone, but it does catch many. And the impact on everyone else is minuscule. We immediately throw away data that doesn't relate to or contain anything suspicious, so we don't waste more time and money working on it.

One of the many things you've either missed or chosen to ignore is that it is not only information about actual bombers and terrorist cell members that gives useful leads to identify a terrorist plot. There are all sorts of individuals and activities that play a part in enabling acts of terror against innocent citizens. Who sells the materials to these guys on the black market? Answer: crooks, greedy people. How do they get money to fund their terrorist acts, or those training camps in Afghanistan and elsewhere? Answer: drugs, fraud, serious organised crime. Follow the money. Who runs the websites that host manuals on how to build bombs to maximise casualties? Bad guys. None of these people might be classed as 'terrorists' by your simplistic assumptions, but I reckon many people would count these illegal activities as 'fair game' in the fight against terrorism. Certainly these activities are not included in any of the numbers you've used. If you tot up the number of people involved in these activities and the number of relationships amongst them, you suddenly find a lot more needles for the same amount of hay.

Also, your implication that data mining only works with known profiles is wrong; unsupervised clustering analysis can detect anomalous behaviours without ever being told what they look like. And your statement that there are no well-described terrorist profiles is plain wrong; there are hundreds of them. I use them every day.
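[The "anomalous behaviours without labels" idea above can be illustrated with a minimal sketch: score each record by its distance from the data's mean, so outliers rank highest. The two-dimensional toy data and the distance-to-mean model are illustrative only; real systems use far richer features and methods.]

```python
# Minimal unsupervised anomaly-detection sketch: no labelled profiles,
# just a score for how far each row sits from the bulk of the data.

def anomaly_scores(rows):
    """Euclidean distance of each row from the per-dimension mean."""
    dims = len(rows[0])
    means = [sum(r[d] for r in rows) / len(rows) for d in range(dims)]
    return [sum((r[d] - means[d]) ** 2 for d in range(dims)) ** 0.5
            for r in rows]

# Nine "normal" behaviour vectors and one anomalous one.
data = [(1.0, 2.0)] * 9 + [(9.0, 9.0)]
scores = anomaly_scores(data)
print(scores.index(max(scores)))  # the anomalous row ranks highest -> 9
```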

Please credit your readers' intelligence by doing a bit of research on this topic before you write about it again.

]]>
2006-07-11T17:04:32Z2006-07-11T17:04:32Ztag:www.schneier.com,2006:/blog//2.973-comment:85889Comment from Nigel Sedgwick on 2006-07-11Nigel Sedgwickhttp://www.camalg.co.uk
@Roy, who wrote: "In 1776 the troublemakers in the colonies declared independence, insisting they would not tolerate shabby treatment from somebody named George. What was that line about not learning from history?"

But they chose someone called George to lead them, and got the French to help (on the sound basis that they, the French, would think 'my enemy's enemy is my friend').

Which just goes to show that arbitrary facts are no help, as well as arbitrary numbers being no help.

Best regards

]]>
2006-07-11T15:47:14Z2006-07-11T15:47:14Ztag:www.schneier.com,2006:/blog//2.973-comment:85877Comment from roy on 2006-07-11roy
1. The most optimistic analyses I've seen of wholesale data mining all ignore the obvious: the enemy, not being fools themselves, can have opted out of the universe being mined by the simple expedient of using communication channels that the NSA cannot examine.

Couriers can travel without their presence being recorded, as passengers in cars or mass transit, or as unregistered passengers on aircraft, trains, or ships. There is no electronic communication here, so monitoring is physically impossible.

Handwritten messages, or electronically recorded messages, can travel by courier, or through the mails, undetected.

If the terrorists are keeping their terrorist communication outside the sphere, then we are building castles in the air and having discussions about engineering and architectural concerns that simply don't matter.

2. If wholesale data mining is done diligently, it will result in complete failure, for a reason not evident in statistical analyses.

Suppose you were in charge of investigating 30,000 positives a day, and leaving cases open indefinitely was unacceptable. Even if your staff were huge, clearing 30,000 cases a day would put you all in the business of clearing cases, and only that. After the first several cases, they would all start looking alike, and your ability to make distinctions would quickly be extinguished. So, even if there were the rare occasional terrorist among your positives, you would routinely clear his case, because routinely clearing cases is all you know how to do.

3. If wholesale data mining is done dishonestly, while it will never turn up a terrorist, it will generate bogus terrorists, keeping up with government demand in their publicity scams.

If the agency investigates 30,000 positives a day, the unofficial standing order would be to pick out the few who would most easily be framed. (With 30,000 random people to pick from, finding the idiots should be no trouble.) Run the picks through kangaroo courts and make sure the press sticks to the party line. Keep reminding the public what a great job the government war on 'terrism' is doing. Meanwhile remember to occasionally put out nonspecific warnings to take no specific actions at no specific time in no specific place.

4. The NSA people involved here are not stupid or innumerate. But they do know where their money comes from and they are willing to play along. It's that, or leave. Those who have left can claim honor. Those who have stayed are criminally responsible.

5. The cost to a filthy-rich government of a single false positive is negligible. The cost to that single false positive can be maximal: it can be the ruination of his life, even his execution without trial.

6. In 1776 the troublemakers in the colonies declared independence, insisting they would not tolerate shabby treatment from somebody named George. What was that line about not learning from history?

Your point about relative scores assisting targeting is a good one. There is some discussion of the NSA providing rankings to the FBI (scores of 1, 2, and 3). However, those rankings don't really help if the false positive rate is very high for even your highest ranked targets.

The Bush administration has been extremely interested in publicizing any successes in the war on terror. None of those successes have been attributable to the NSA's program. The NSA's program has been going on for years, but it hasn't contributed to the capture of a single terrorist.

From this I conclude that the NSA program has been and continues to be a waste of money and a massive violation of the law without making anybody safer. If the program has in fact been successful, the NSA needs to prove it, both to Congress and the people.

@Neighborcat

"The efficacy of the method is irrelevant!"

In a court of law, you're probably right. In the court of public opinion, it makes a big difference.

However, I don't rate your argument on the handshaking; that is unless it is hyperbole. If the latter, that is (I judge) too subtle for many who read here.

The first important part is that the Detection Gain (W) should be very high, to compensate for the fact that the a priori probability is very low. Thus, if the product of them is less than say 1% (my not very informed judgement) then that person is not worth further investigation. This is on the basis that the absolute value of any of the assumed figures is rather poor.
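[The "product less than say 1%" point above implies a concrete requirement on the Detection Gain W = P(T|e)/P(T). A rough sketch, using the thread's 1-in-300,000 prior; the 1% threshold is the commenter's own illustrative figure, and the approximation posterior ≈ prior × gain holds only while the posterior stays small:]

```python
# How large must the detection gain be for prior * gain to reach a given
# posterior threshold? (Approximation: posterior ~= prior * gain, valid
# while the posterior remains small.)
prior = 1 / 300_000          # a priori P(T) used throughout the thread
threshold = 0.01             # the "say 1%" cut-off for further investigation

required_gain = threshold / prior
print(f"required detection gain: {required_gain:,.0f}x")  # 3,000x
```

So a person is "worth further investigation" under this rule only if the mining evidence makes them roughly three thousand times more likely than the base rate to be a terrorist.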

The next important point is that those making the more detailed resource-allocation and tasking decisions must have some grasp of the numbers and what they mean. If, as reported just above from the NY Times, numerate judgement has gone (hopefully temporarily), then the money, effort and commitment will be wasted.

Going back to Bruce's original posting, and the referenced articles, they are too pessimistic. They are also wrong to the extent that they do not consider the ranking of targets (as I describe) as an aid to resource allocation.

Finally, the 1 in 300,000 is not a particularly good starting point. Add in some Detection Gain (not absolute) concerning sex, age, ethnic background, religion, nationality, education. Then add in another set of Detection Gains, concerning good-guy attributes. One can only do these things where they are known (which is by no means common, and which itself costs). There are problems and dangers. But it's still likely to be worth doing to an appropriate extent, rather than not doing through following inadequate reasoning. Putting numbers in makes it grey. It's not black and white and it never has been, except for the simple-minded.

Best regards

]]>
2006-07-11T13:47:13Z2006-07-11T13:47:13Ztag:www.schneier.com,2006:/blog//2.973-comment:85864Comment from Davino on 2006-07-11Davino
Nigel, setting the threshold from a ranked list makes sense. However, since you're looking at such a small fraction of the population (1,000 out of 300,000,000), even obscenely dramatic improvements in the P(T|e)/P(T) gain are still insignificant. A gain of 2x or 10x (which is an amazing level of success in commercial data mining applications) would net you only 0.007 or 0.03 terrorists, with a false positive rate of 99.9999% or 99.9967% of the 1,000 people investigated.
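[The arithmetic above can be checked directly. A minimal sketch using the thread's figures: 1,000 terrorists in a population of 300 million, 1,000 people investigated, and the illustrative 2x and 10x gains:]

```python
# Expected true positives among the top-k investigated, given a base rate
# and an assumed multiplicative "gain" from data mining.
base_rate = 1000 / 300_000_000   # 1,000 terrorists among 300M people

def expected_hits(gain, k=1000):
    """Expected terrorists among k investigations if mining multiplies
    the prior probability by `gain` for the people it selects."""
    return k * gain * base_rate

for gain in (2, 10):
    hits = expected_hits(gain)
    fp_rate = 1 - gain * base_rate
    print(f"gain {gain:>2}x: {hits:.3f} expected hits, "
          f"false-positive rate {fp_rate:.4%}")
```

Even a 10x gain leaves the expected yield at about 0.03 terrorists per 1,000 investigations, which is the point being made.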

The only way this program could be of any use is in producing a list of persons associated with a specific person already under investigation. If we want to devote 1000 more investigations into the associates of Mohammed Atta, we'd run this program, take the top 1000 most associated with him, strike off the ones we already know about, (like Atta supposedly met Saddam, Saddam shook hands with Rumsfeld, and then Bush hired Rumsfeld) and then take a look at the ones that remain.

"In the anxious months after the Sept. 11 attacks, the National Security Agency began sending a steady stream of telephone numbers, e-mail addresses and names to the F.B.I. in search of terrorists. The stream soon became a flood, requiring hundreds of agents to check out thousands of tips a month.

But virtually all of them, current and former officials say, led to dead ends or innocent Americans.

F.B.I. officials repeatedly complained to the spy agency, which was collecting much of the data by eavesdropping on some Americans' international communications and conducting computer searches of foreign-related phone and Internet traffic, that the unfiltered information was swamping investigators. Some F.B.I. officials and prosecutors also thought the checks, which sometimes involved interviews by agents, were pointless intrusions on Americans' privacy."

]]>
2006-07-11T12:24:46Z2006-07-11T12:24:46Ztag:www.schneier.com,2006:/blog//2.973-comment:85852Comment from Nigel Sedgwick on 2006-07-11Nigel Sedgwickhttp://www.camalg.co.uk
First, the text file layout was not very good. I've now put up a PDF file, which is a bit better. It's at URL: http://www.camalg.co.uk/sundry_2006/schneier_060711a.pdf

@Bernhard

Assuming the data mining algorithms provide any discrimination in favour of the target subset, every entry in the top-scoring few will have a higher probability of being a target than those not in the top-scoring few. Furthermore, the higher in the list, the more likely that person will be a legitimate suspect (even if the actual probability of them really being a terrorist is still low). This is on the basis of the "gain" in discrimination obtained from the data mining.

Consider, for example, a person who has telephoned a number abroad belonging to a known terrorist organisation; this is against the 290+ million persons in the USA who have not phoned this organisation. Do you not think that caller is somewhat more likely to be a legitimate terrorist suspect than everyone else?

Now, there is, of course, the case that any prudent terrorist would not do something so obvious. However, he may have contact by telephone with a less clever acolyte who has, or who did somewhat earlier and did not take the excellent advice to change phone number, address, mobile phone, etc.

Each tiny bit of such evidence helps a tiny bit. If enough tiny bits are put together, cost-effectively by automatic processing, it is of some help.
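[Putting "tiny bits" of evidence together corresponds to accumulating log-likelihood ratios, as in a naive Bayes score. A minimal sketch with the thread's 1-in-300,000 prior; the three likelihood ratios are made-up illustrative weights, and summing them assumes the pieces of evidence are conditionally independent:]

```python
import math

# Each piece of evidence contributes log(P(e|T)/P(e|~T)) to a suspect's
# score; the sum is the posterior log-odds under independence.
prior = 1 / 300_000
log_odds = math.log(prior / (1 - prior))

# Hypothetical likelihood ratios for three weak indicators.
evidence_lrs = [5.0, 3.0, 2.0]   # each alone is only a "tiny bit"
for lr in evidence_lrs:
    log_odds += math.log(lr)

posterior = 1 / (1 + math.exp(-log_odds))
print(f"posterior P(T|e) = {posterior:.6f}")  # still only ~0.0001
```

Three pieces of evidence with a combined likelihood ratio of 30 lift the posterior from about 1 in 300,000 to about 1 in 10,000: real help for ranking, but nowhere near grounds for arrest.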

Aren't you ignoring the fact that the sorted list of top-scoring suspects will be full of false positives?
I cannot see a reason why real terrorists would on average have a higher score than false positives.
Otherwise, the probability of detecting a terrorist would be very close to 1, which is not a realistic assumption.

]]>
2006-07-11T11:11:04Z2006-07-11T11:11:04Ztag:www.schneier.com,2006:/blog//2.973-comment:85841Comment from Nigel Sedgwick on 2006-07-11Nigel Sedgwickhttp://www.camalg.co.uk
The original article by Professor Rudmin looks too narrowly at the issue.

The individual score is very important, and is the aspect that Prof Rudmin has not considered sufficiently. One does not have to consider every legitimate terrorist suspect for arrest and interrogation. A much more likely policy, for such well-informed organisations as the NSA and the FBI, is that a sorted list of the higher-scoring legitimate suspects would be produced, with their scores. Valuable and expensive covert (or in some cases overt) investigatory resources could then be allocated to the very highest-ranking suspects, as judged cost-beneficial and according to resource availability.

Now, of course some politicians and managers, of the statistically uninformed sort worried about terrorism (and the need to be seen to 'do something') might introduce the odd and serious glitch into this well-understood process. This may well cause investigatory teams to be tasked with futile investigation of the (very likely) innocent. Likewise some law enforcement 'foot soldiers', improperly tasked or insufficiently well trained in the real importance of their work, might find some of the investigatory legwork seemingly pointless.

Now for some very approximate numbers (or perhaps not).

If P(T) is 1/300,000 and investigatory resources are available for 1,000 investigations (of a particular sort and cost), we have no idea (prior to looking at the actual scores from data mining) where the threshold on 'e' should be set. However, we do know that we should consider no more than the top 1,000 candidates.

Then we should consider the scores based on the data mining evidence 'e' (that is the approximate Detection Gain, Watchlist) and also the assumed a priori probability P(T) (which is only known approximately). This is to determine whether the investigation of the least likely individuals on the hot list should actually go ahead. This decision should take into account the cost of the investigation (including the adverse motivational effect of pointless tasking on investigatory staff), together with the level of invasion of privacy and possible infringement on civil liberties (justified through the circumstances and P(T | e) in the least likely case pursued).

Now, of course there are several unknowns in all of this. The a priori probability P(T) is only approximate. Likewise, the Probability Density Functions (PDFs) arising from the target (P(e|T)) and non-target (P(e|~T)) data subsets are only known approximately. [Though note that the PDF of the non-target set is known much more accurately than the PDF of the target set, and this itself is useful in avoiding bad investigatory targeting.] However, it should be quite obvious that targeting the top-ranking of a sorted list (derived according to evidence of some merit) is far better than forgetting the ranking and setting some arbitrary threshold based on very approximate assumptions.
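[The ranked-list policy described above — score everyone, sort, and investigate only the affordable top slice rather than applying a fixed threshold — can be sketched as follows. The scores here are random stand-ins for real data-mining output, and the budget of 1,000 is the thread's illustrative figure:]

```python
import heapq
import random

random.seed(0)

# Hypothetical suspicion scores (log-likelihood-ratio style: higher means
# more suspicious) for a population of 300,000.
population = 300_000
scores = [(random.gauss(0, 1), f"person_{i}") for i in range(population)]

budget = 1000  # investigations we can afford, per the thread's example
# Take the top-scoring `budget` candidates instead of thresholding everyone.
hot_list = heapq.nlargest(budget, scores)

print(hot_list[0], hot_list[-1])  # highest- and lowest-ranked candidates
```

The point of the design is that the cut-off falls out of the resource budget and the observed score distribution, not out of an arbitrary a priori threshold.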

Best regards

]]>
2006-07-11T10:07:03Z2006-07-11T10:07:03Ztag:www.schneier.com,2006:/blog//2.973-comment:85838Comment from quincunx on 2006-07-11quincunx
Good point, Matthai; not much attention is given to political empire building, or the general workings of the Iron Law of Oligarchy in a monopoly framework.

The way to get ahead in gov is to build an empire of employees beneath you. If you can just figure out some excuse for doing it, you will. It is also important to waste as much of your budget as possible so that you can claim that more is necessary. Of course a higher budget is necessary anyway, since the previous fiscal period was entirely spent on misallocating the market economy and generally creating more problems in society.

In business, success is success.
In gov, failure is success.

(I need not go into the fact that 'cooking the books' & GNP calculation are nearly the same thing upon close inspection)

Now don't get me wrong, gov can be very successful in a narrow sense, especially when they outlaw competition, but 'catching terrorists' is not something they do as well as 'creating terrorists' (just like they are worse at 'performing useful services, economically' than at 'creating fiat money'). Of course, having people believe they can is a great excuse for perpetual conflict for perpetual peace.

If some willing people can just take the time to examine some history their teachers glossed over (somewhat having to do with being threatened with being forced out of the teachers' union), they would realize that this time period we're in sure seems A LOT like other periods, almost to a tee. And if one sees how those periods play out (and will continue to play out if people continue to believe that society's biggest parasite [look up the etymology of 'politics'] is actually its greatest benefactor), they should certainly be skeptical of the optimists and those in denial.

I advise some reading of Man, Economy, & State by Murray Rothbard and Robert Higgs' Crisis & Leviathan & Against Leviathan to any scholar that would like to approach this topic in any socially scientific manner.

]]>
2006-07-11T09:48:14Z2006-07-11T09:48:14Ztag:www.schneier.com,2006:/blog//2.973-comment:85834Comment from Matthai on 2006-07-11Matthai
The first possibility is that they are paranoid. The second is that they do not target terrorists, but political opponents.

But there is also a third possibility: they are just wasting the money, or using the money for something else. Look, they have a great job. They can be incompetent and inefficient, and they can always hide under "national interest". They won't tell you their success rate or the amount of money spent, because that could "endanger national security".