It's worse than that. The author assumes no false negatives, i.e. that out of the 301 positives, the terrorist must be in there. In fact there's a 10% chance that the terrorist is not among the roughly 300 people you nabbed, and you let him go. He also assumes that the population of 3,000 contains exactly one terrorist.

Even assuming that, and even assuming that the 90% rate is correct for both positive and negative assertions: if the device picks 301 people, there's only a 90% probability that it correctly identified the terrorist and that he's in that group of 301. So it's not 1/301; it's 0.9/301, which is 0.299%, not the 0.33% that 1/301 gives you.
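As a sanity check on that arithmetic, here is a short sketch under the article's assumptions (3,000 people, exactly one terrorist, 10% error rate in both directions):

```python
# Sanity check of the figures above, under the article's assumptions:
# 3,000 people, exactly 1 terrorist, 10% error rate in both directions.
population = 3000
terrorists = 1
innocents = population - terrorists
tpr = 0.9  # P(positive | terrorist)
fpr = 0.1  # P(positive | innocent)

expected_positives = terrorists * tpr + innocents * fpr  # 300.8
# P(terrorist | positive), by Bayes' theorem:
posterior = (terrorists * tpr) / expected_positives
print(f"expected positives: {expected_positives:.1f}")
print(f"P(terrorist | positive): {posterior:.3%}")
```

This gives roughly 0.299%, matching the 0.9/301 figure above (the exact denominator is 300.8 expected positives rather than exactly 301).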

It would likely be a bit different from this too, because it's a rare test that has the same rate of false positives as false negatives.

Also of interest is that it deals only with the 90% figure, which is only one side of the story...

If your machine is 90% accurate, how did you arrive at that figure?

And therefore what does it really mean?

In his case of 3,000 people you would get 300 positives, but you need to go on and ask: if there are 3,000 people, are there actually any terrorists amongst them? And if so, do they ALL actually get picked up, or only some of them?

But importantly, how on earth do you actually test it?

For instance, if you used test subjects, how do you know that they are "fully independent" with respect to each other? That is, will the test pick up the PIRA, the RAF, and al-Qaeda with the same accuracy or not, and if not, why not?

Also you need to know whether the 90% comes from a single test (has an RPG on their back), from a chain of tests (has a beard, olive skin, brown eyes, and an RPG on their back), or a tree of tests (if eyes blue then X, if brown then Y, else Z).

Oh, and I liked the comment of one of the posters to the page saying 90% is OK if you expect 10% in the population. Unfortunately that is not right (there's no single right answer for it, but it's nearer correct if you expect 20% in the population). Oh, for that level of confidence ;)

In industrial Quality Control, the best test or inspection processes are considered to be effective 85% of the time. This is the reason that true quality assurance comes from controlling process, not from inspections and tests. Could we expect this test to be better? The 90% assumption must be questioned.

As JRR writes, giving an "accuracy" figure, and identifying that figure with the false-positive rate is an incomplete, even misleading characterization of this sort of system.

There are two variables that matter: the false positive rate FP, and the false negative rate FN. These sorts of systems generally have a "response curve", in which FP is plotted as a function of FN. This response curve completely characterizes the behavior of the system.

The response curve generally slopes downwards --- a lower FN corresponds to a higher FP. This is easy to understand: a system that identifies everyone as a terrorist (FP=1) never misidentifies a terrorist as a good guy (FN=0), whereas if the machine gives everyone a pass (FP=0) it will never ID a terrorist (FN=1).

There is generally a user-settable threshold that can be dialed up or down, depending on the desired "sensitivity". Depending on the setting, the system moves up or down its response curve. The name of the game is to claim that your system has a sweet spot: a portion of the curve that is sensitive (low FN) but has an acceptably low FP rate. That is, the response curve starts at FN=0, FP=1, as it must, but then immediately plummets down to a low FP (say FN=0.01, FP=0.01), levels out, and then gradually extends to FN=1, FP=0. The sweet spot is where the curve levels off.
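The threshold tradeoff can be sketched with a toy model. The Gaussian score distributions and the N(0,1)/N(2,1) parameters below are purely illustrative assumptions, not a description of any real detector:

```python
from math import erf, sqrt

def gauss_cdf(x, mu, sigma):
    """P(score <= x) for a normal distribution with mean mu, s.d. sigma."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

# Toy model (pure assumption): innocents score ~ N(0, 1), terrorists
# score ~ N(2, 1); anyone above the threshold is flagged.
def rates(threshold):
    fp = 1 - gauss_cdf(threshold, 0, 1)  # innocent flagged
    fn = gauss_cdf(threshold, 2, 1)      # terrorist waved through
    return fp, fn

for t in (-3, 0, 1, 2, 5):
    fp, fn = rates(t)
    print(f"threshold {t:+d}: FP = {fp:.3f}, FN = {fn:.3f}")
```

Sliding the threshold walks the system along its response curve from (FP=1, FN=0) to (FP=0, FN=1); with this much overlap between the two score distributions there is no setting where both rates are low --- i.e. no sweet spot.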

It is absolutely crucial that such a sweet spot should exist for any system that is to be used for mass loyalty/criminal intent screening (as opposed to investigation of suspects), because any appreciable FP will manifest itself as tens, hundreds, or even thousands of false bad-guy IDs, depending on the number of people screened and on the sensitivity setting. Turning the sensitivity down to turn down the FP noise will result in a higher FN, which is to say a lower chance of catching a bad guy.

Insofar as I am aware, no system of any kind --- magic terrorist detectors, polygraphs, or anything else --- actually has this kind of response curve. The Congressionally-commissioned National Academies study of polygraphing found that the actual available calibration data was comically inadequate to support the efficacy claims made by securocrats and vendors, but that to the extent that data is available, no sweet spot exists. The only reason polygraph loyalty screening doesn't result in hundreds of US security employees losing their jobs every year is that the sensitivity is turned way down --- the tests couldn't detect a spy even if he'd just come back from meeting his Chinese controller (say). Ouija boards would be equally effective. But the securocracy finds polygraphs so familiar and reassuring that they reject any such criticism out of hand, preferring magical thinking to scientific uncertainty.

Misunderstanding tests in this and other ways has much greater implications in the medical context. Everyone wants to do screening tests, but most screening tests aren't very good. There are very few screening tests that have decent analytic and clinical validity and clinical utility. Just watch how medical tests are dealt with in the health reform discussions.

Accuracy here is a function of three independent factors:
1. Sensitivity is how good the test is at detecting the quality when it is present. This is the complement of beta, the false-negative rate.
2. Specificity is how good the test is at ruling the quality out when it is indeed absent. This is the complement of alpha, the false-positive rate.
3. Prevalence is the frequency of occurrence of the quality in the population under test.

Given a test with alpha = .1 and beta = .1, overall accuracy is .9 at every prevalence; what changes with p is the composition of the errors. At p = .5, false positives and false negatives are expected in equal numbers.

When the quality is more likely than not, false positives make up a smaller share of the errors and false negatives a larger one.

When the quality is less likely than even money, it is the other way around: false negatives become rare and false positives dominate.

When the quality is exceedingly rare, virtually every positive will be a false positive, and it won't matter if the test is 99% accurate if the follow-up routinely identifies the negatives as false negatives.
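The three factors above can be combined in a few lines. Sensitivity and specificity of 90% are the article's assumption; the prevalence values are arbitrary illustrations:

```python
def ppv(prevalence, sensitivity=0.9, specificity=0.9):
    """P(quality present | test positive), by Bayes' theorem."""
    tp = prevalence * sensitivity          # true-positive mass
    fp = (1 - prevalence) * (1 - specificity)  # false-positive mass
    return tp / (tp + fp)

for p in (0.5, 0.1, 0.01, 1 / 3000):
    print(f"prevalence {p:.4f}: PPV = {ppv(p):.4f}")
```

At 50% prevalence a positive means 90%; at the article's 1-in-3,000 prevalence a positive means about 0.3% --- virtually every positive is a false positive, exactly as stated above.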

Our company created a tiger detector made of wood, and this discussion of statistics has convinced me we can also sell it as an elephant detector, rhinoceros detector, and several other species. Thanks to Bruce and to other commenters for bringing this new marketing opportunity to our attention.

When trying to explain the base rate fallacy to some of my cohorts, I run into the problem of saying that such a detection machine is 90% accurate vs. saying that the machine is right 90% of the time.

There IS a difference, right?

It seems that the confusion the general public has when something is 90% accurate, with the whole forgetting-the-false-positives thing, is that they're (inadvertently) assuming they've gotten past that. In their minds, the statement reads that 90% accurate means that when the machine beeps that it's a terrorist, nine times out of ten it's correct, and that really was a terrorist.

Is this a fair statement? Can someone add something to this to make it clearer to the general schlomo?

Another mental trick I use for explaining the base rate fallacy is to suggest a screening system which always gives the more likely answer. Then I point out just how much higher the accuracy of the "always guess the more likely outcome" screening actually is.
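A minimal sketch of that comparison, using the article's 1-in-3,000 figure:

```python
# The degenerate screen: always give the more likely answer.  With the
# article's 1 terrorist in 3,000, saying "not a terrorist" every time
# is right 2,999 times out of 3,000 -- comfortably beating the 90% machine.
population = 3000
terrorists = 1

always_no_accuracy = (population - terrorists) / population
print(f"'always no' accuracy: {always_no_accuracy:.2%}")
```

That a zero-information guesser scores about 99.97% while the machine scores 90% is usually what makes the fallacy click.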

Several others have mentioned the question of "90% accurate" vs. other methods of measuring effectiveness. The article jumps dramatically from a single percentage to failure rates without questioning what the 90% means. If it's 0% false positives and 10% false negatives, that's a "90%" where anyone who tests positive really is a terrorist, though each terrorist has only a 90% chance of being caught. On the other side, a 10% false-positive rate and a 0% false-negative rate also yields "90%", but then the chance of a positive test subject being a terrorist depends heavily on the terrorist population at large.

I think the article assumes 10% false negatives and 10% false positives. I don't know of a scanning machine out there that does that. Usually you pick your null hypothesis and adjust the tweaking factors until your most damaging case has a lower rate.

From the article
"If 3,000 people are tested, and the test is 90% accurate, it is also 10% wrong. So it will probably identify 301 terrorists - about 300 by mistake and 1 correctly."

The story assumes there is certainly 1 terrorist in the population of 3,000. OK.
This ignores the case where the terrorist did not get a positive signal.

Basically it's saying the false positive rate is 10%. You might ask, what would the false negative rate have to be for the 1 terrorist to "probably" be among the group testing positive? Exercise left to the reader.

I remember teaching this kind of thing re Aids testing in an intro stat class back in the early 90's.

Roy's "accuracy" (see above) is in fact "the probability that the machine is right", irrespective of whether the subject is benign or malignant. As he points out, it is a function not only of system parameters (FP and FN rates), but also of the proportion of malignants to benigns in the tested population (Roy's "prevalence").

You have no control over that proportion, except to the extent that you pre-screen. What you do control is the FP and FN rates, which you can trade off according to the system response function. So you have some limited control over the "accuracy", assuming the population proportion of malefactors is fixed. In existing systems that I am aware of, this control is insufficient to make a satisfactory bad-guy screening detector.

However, as you say, the term "accuracy" is often, misleadingly, used to characterize the FP rate (as in the cited article) or the FN rate alone. This sort of usage is worse than useless, although it makes for great marketing copy and lazy journalist bait.

Yes and no: 90% accuracy means that the system is right nine times out of ten. But as terrorists are incredibly rare in the general population, the real issue is not false negatives (a terrorist is falsely assumed to be clean) but false positives (a clean person is labeled terrorist).

If you have 3,000 clean people and 1 suspected terrorist, and your detector works with 90% accuracy, it will name about 300 "terrorists". Of them, only 0.9 (in expectation) is actually dangerous; all the others are false positives.

@Dylan: Yes, an accuracy of 90% means the machine is right 90% of the time and wrong 10% of the time.

The problem comes when you apply this machine to ten million people per year at airports. When the machine is wrong, either it will miss a guilty person, or it will wrongly flag an innocent person. Suppose there are 100 actual terrorists out of those ten million people (which probably vastly exaggerates the number of terrorists, but never mind). Then the machine is likely to miss at least 10 of the terrorists--oh well--but far more importantly, it's going to flag about *one million innocent people*.

So what you have is a test that flags 90 terrorists and about one million innocent people. Which is utterly useless!

Notice that even if the machine were 99.9% accurate, you'd still have the same problem, only slightly less severe. The machine flags one out of every thousand people incorrectly. If you're lucky, it will flag all 100 of the terrorists as bad guys, and that part is great. The not-so-great part is that it also flags around 10,000 innocent people out of the ten million innocents! So approximately 1 in 100 of the people flagged by the machine would be terrorists, and the rest would have to be processed (and harassed, and investigated, and held without bail, and have their rights trampled on in dozens of other ways, all at the taxpayers' expense).
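The expected counts for both accuracy levels can be checked in a few lines; the ten-million and 100-terrorist figures are the assumptions stated above:

```python
# Expected counts for the airport example: ten million travellers per
# year, 100 of them terrorists (an assumed, almost certainly inflated,
# figure), with the error rate the same in both directions.
population = 10_000_000
terrorists = 100
innocents = population - terrorists

for accuracy in (0.90, 0.999):
    caught = terrorists * accuracy
    flagged_innocent = innocents * (1 - accuracy)
    print(f"accuracy {accuracy:.1%}: ~{caught:.0f} terrorists flagged, "
          f"~{flagged_innocent:,.0f} innocents flagged")
```

At 90% accuracy the flagged pool is about a million people; even at 99.9% it is still about 10,000, swamping the handful of real hits.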

Of course there will never, ever be a test that is 99.9% accurate at detecting "terrorists". Even 90% accuracy sounds wildly optimistic to me, and is already so inaccurate as to be downright useless.

In the UK, current (in jail) ordinary criminals hover around 0.1% of the population.

Even when prevalence is very low (terrorists in the general population) you need to be careful with your test group size and the likelihood of a terrorist being present in the test group.

You need to think through what effect each variable in the test you are doing has on the outcome.

To start with you have a box that indicates one thing (approximately) 90% of the time.

But what does that mean?

With the box you get a true or false output (not a terrorist 90%, terrorist 10%), which in turn may be correct or incorrect.

So there are four, not two, possible outcomes from each use of the box:

A) Terrorist : who is (correct) [TP].
B) Non Terrorist : who is not (correct) [TN].
C) Terrorist : who is not (incorrect) [FP].
D) Non Terrorist : who is (incorrect) [FN].

The first incorrect outcome (C) is known as "an error of the first type", or False Positive, usually due to the test sensitivity being set too high. You could say the box (if it were human) was skeptical and saw fault where there was none.

The second incorrect outcome (D) is known as "an error of the second type", or False Negative, due to the test specificity being set too high. That is, you could say the box (if it were human) was complacent and had committed an oversight.

Next you really need to consider not just one test but a number of tests taken on a subset or group of the general population. Each member of the "test group" is (supposedly) selected at random from the general population, and therefore each is "independent" of the others (in reality this is almost never the case).
The acid question you need to ask is: what is the likelihood of my very, very rare target (a terrorist) being in my test group, and importantly, to what extent?

Which means that of the test group of 3,000 people, one or more may or may not be terrorists (after all, intel can be wrong).

Which means you can have four cases:

In case 1 (no terrorists) you have,

1.A) 2700 non terrorists : who are not (correct),
300 terrorists : who are not (incorrect).

In case 2 (1 terrorist) you have (effectively) the same output from the device but two possibilities,

2.A) 2699 non terrorists : who are not (correct),
300 terrorists : who are not (incorrect),
1 non terrorist : who is (incorrect).

Or,

2.B) 2700 non terrorists : who are not (correct),
299 terrorists : who are not (incorrect),
1 terrorist : who is (correct).

In case 3 where you have 2 (or more) terrorists,

3.A) 2698 non terrorists : who are not (correct),
300 terrorists : who are not (incorrect),
2 non terrorists : who are (incorrect).

Or,

3.B) 2700 non terrorists : who are not (correct),
298 terrorists : who are not (incorrect),
2 terrorists : who are (correct).

Or,

3.C) 2699 non terrorists : who are not (correct),
299 terrorists : who are not (incorrect),
1 non terrorist : who is (incorrect),
1 terrorist : who is (correct).

Finally there is case 4, where your group is all terrorists (I'm leaving the numbers the same, unlikely as it is).
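The expected outcomes for all four cases can be tabulated in one sketch, assuming (as above) that the box is right 90% of the time for terrorists and non-terrorists alike:

```python
# Expected outcomes from the box for a group of 3,000, for each of the
# four cases above (0, 1, 2, or 3,000 actual terrorists), assuming the
# box is right 90% of the time for both kinds of subject.
def expected_counts(group_size, n_terrorists, accuracy=0.9):
    n_clean = group_size - n_terrorists
    return {
        "true positives": n_terrorists * accuracy,
        "false negatives": n_terrorists * (1 - accuracy),
        "true negatives": n_clean * accuracy,
        "false positives": n_clean * (1 - accuracy),
    }

for n in (0, 1, 2, 3000):  # cases 1 through 4
    print(f"{n} terrorist(s): {expected_counts(3000, n)}")
```

In case 4 the box would catch an expected 2,700 of the 3,000 terrorists and wave 300 through --- the mirror image of case 1.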

@ Mat: I have that book, "How to Lie With Statistics". I too am reminded of it frequently. It's a good read.

As for the article, I like how the author reframes the question: Rather than looking at how 90% of bad guys will be caught, instead look at how 10% of good guys will be falsely suspected. (Assuming 90% successful identification.) Namely, take the complement and apply it to the opposite group.

> I've got the perfect terrorist detector
> for the scenario in the article.
>
> It has a zero false positive rate, and a
> false negative rate of 0.03%: Just scan
> everyone with it, and it says "not a
> terrorist", unfailingly, every time.

Actually, that has a false negative rate of 100%. However, it has an accuracy of about 99.97%.
You have to be careful about how you present your information in your marketing strategy: don't mention the false negative rate, and tout the high accuracy.
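Scoring that "perfect detector" explicitly, under the same 3,000-person, one-terrorist scenario:

```python
# Scoring the quoted "perfect detector" honestly: it says "not a
# terrorist" for all 3,000 people, including the 1 real terrorist.
population, terrorists = 3000, 1

false_negatives = terrorists  # every actual terrorist is waved through
fn_rate = false_negatives / terrorists             # 100%, not 0.03%
accuracy = (population - terrorists) / population  # about 99.97%
print(f"false negative rate: {fn_rate:.0%}, accuracy: {accuracy:.2%}")
```

The quoted 0.03% is the overall error rate (1 mistake in 3,000 verdicts), which is a different quantity from the false negative rate (the fraction of terrorists it misses).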