Tuesday, May 25, 2010

How to Test for Syphilis

Yesterday on NPR's "All Things Considered" I heard this segment about "sparsity", which tried to link several different issues about data compression in one story. I don't think it was very successful, and the chosen term "sparsity" wasn't really representative of the content.

Nevertheless, the piece opened up with an interesting puzzle. The Army wants to do comprehensive blood tests for syphilis, a relatively rare disease, but each individual test is expensive. How can they test everyone more cheaply?

The idea is simple: mix the blood of g individuals together, and test that. A positive outcome indicates that at least one person in the group has syphilis, and a negative outcome indicates that no person in the group has it. In the event of a positive outcome, test all the members of the group.

Now let p be the probability that a randomly-chosen person has syphilis, and let N be the population size. What is the optimal size for the group? Choose g too small, and you end up doing lots of tests, because N/g (the number of groups) is large. Choose g too large, and it becomes very likely that testing the group will be positive, so you end up doing lots of tests again. Somewhere in between is the optimal choice of g.

How do we find it? There are N/g groups, and we have to test each of them. A randomly-chosen person fails to have syphilis with probability 1-p, so the probability that everyone in the group fails to have it is (1-p)g. Hence the probability that a group tests positive is 1-(1-p)g, and the expected number of positive-testing groups is (N/g)(1-(1-p)g). We have to test each person in a positive group, so this means g(N/g)(1-(1-p)g) = N(1-(1-p)g) additional tests. If the cost of a test is C, then the total cost is CN/g (the cost to test each group), plus CN(1-(1-p)g) (the cost to test everyone in the positive-testing groups). Factoring out CN, we need to minimize

1/g + 1 - (1-p)g. (1)

Notice that this expression only depends on p, the probability.

We can, in fact, find a closed form for this minimum (or at least Maple can), but it is somewhat complicated, and depends on the little-known Lambert W-function. It's easier to compute the minimum through bisection or some other method for any particular p. A recent estimate is that syphilis occurs in about 15.4 per 100,000 people, which corresponds to p = .000154. For this value of p, the optimal g is g = 81, which gives (1) the value of .0247. Thus, for this p, we are able to test everyone by combining tests -- for less than 3% of the cost of testing everyone individually.

By the way, a simple heuristic argument suggests that the g that minimizes (1) will be about p-½: to minimize (1), choose g so that the two terms in the sum (1), 1/g and 1-(1-p)g, are equal. Set g = p-½; then 1/g = p½. On the other hand, 1-(1-p)g = 1 - (1-g-2)g. But (1-1/g)g is about 1/e for large g, so using the Taylor series for exp(x), we see that 1 - (1-g-2)g is about 1/g, too. So this choice of g will be very close to the one minimizing (1), and the value of (1) is therefore about 2/g = 2 p½.

Added: with more complicated tests (see the comments), one can do even better. But some of the proposed methods are so complicated that it seems to me the possibility of human error in carrying them out would rule them out.

Syphilis in the US is more common among lower income brackets, and most people who join the army today are not from the higher income brackets so this suggests that they should assume p is slightly higher. On the other hand, people in the army are generally slightly younger so they've had less time to actually pick it up. So this may be a wash. Using .000154 is probably a close enough approximation that it won't matter.

(Incidentally, I've never seen this technique before. It seems like a very clever trick).

Luckily the value of (1) grows quite slowly near its minimum. So even if our estimate of p is somewhat off the mark, we'll still do reasonably well. For example, if p is twice as large as we predicted, choosing g = 81 only gives a cost of about .0370.

If you can use each of the n samples several times (like using only a bit of it each time), and if there is only one case of syphilis, then you can solve the problem in lg n tests.

You write the n numbers in binary, and compose lg n mixes so that mix i has a bit of all samples with bit i set to 1. The lg n binary results of the tests write the number of the sample with siphilis.

I give this problem in third and fourth year as an advanced view of (parallel) binary search (usually with the goal to detect a sample of contaminate water rather than siphilis, though). There are similar solutions when more than one sample is contaminated but, as in Jeff's analysis, I think that the estimate of the number of sample is important for the efficiency.

As dete says, the problem is recursive; since p is so small, for moderate-sized groups the optimal strategy actually begins rather similarly to jyby's, testing a substantial fraction of the group at once.

If my hacked-up code is right, for instance, then with a group of size 1000 and p=0.000154 you should begin by testing 333 of them. If that group tests positive, you should continue by testing 41 of them; in any case, when you've finished sorting out your group of 333 you should then test another 333 of the remaining 667, treating them the same way. Finally, and perhaps most surprisingly, with your final group of 334 you should begin by testing ... 333 of them!

If you proceed optimally then your expected number of tests is about 10.6, as compared with 24.7 using the 81-way split Jeffrey recommends.

Ahem. There was a bug in my code, so two corrections to my last comment. The correct thing to do with the last 334 people is to begin by testing *all* of them, not to test all but one. And the expected number of tests for 1000 people is about 9.6, not about 10.6.

Ahem squared. I had another bug. The best thing with n=1000 and p=0.000154 is to begin by testing everyone. In the not-terribly-likely event that you get a positive result, you should then test 377 people. The expected number of tests is about 3.2, a pretty big improvement on 24.7.

377. Hmm, Fibonacci. This appears not to be entirely coincidental. I don't understand why, but I haven't really tried and it's time I went to bed.

Some further comments, and my not-very-polished code for this, are at my blog.

We had three batches of transformer oil to get rid of. I was fairly sure there were no PCB's, so I tried to get Safety to test a mixture of all three. At some point you have to worry about sensitivity of the test, or interfering contaminants, when combining a large number of samples.This reminds me of the "which object is of odd weight problem".

Pete D: as for any biological agent (and as opposed to poisons), you can deal with the dilution at the cost of time. Given a soup ("Bouillon de Culture") and a bit of time, you can generate any quantity of the agent you want.