Friday, July 18, 2014

I am working on the second edition of Think Stats, adding chapters on some topics that didn't make it into the first edition, including survival analysis. Here is the draft of the new chapter; this post contains some of the highlights.

Survival analysis

Survival analysis is a way to describe how long things last. It is often used to study human lifetimes, but it also applies to “survival” of mechanical and electronic components, or more generally to intervals in time before an event.

If someone you know has been diagnosed with a life-threatening disease, you might have seen a “5-year survival rate,” which is the probability of surviving five years after diagnosis. That estimate and related statistics are the result of survival analysis.

As a more cheerful example, I will use data from the National Survey of Family Growth (NSFG) to quantify how long respondents “survive” until they get married for the first time. The range of respondents’ ages is 14 to 44 years, so the dataset provides a snapshot of women at different stages in their lives, in the same way that a medical cohort might include patients at difference stages of disease.

For women who have been married, the dataset includes the date of their first marriage and their age at the time. For women who have not been married, we know their age when interviewed, but have no way of knowing when or if they will get married.

Since we know the age at first marriage for some women, it might be tempting to exclude the rest and compute the distribution of the known data. That is a bad idea. The result would be doubly misleading: (1) older women would be overrepresented, because they are more likely to be married when interviewed, and (2) married women would be overrepresented! In fact, this analysis would lead to the conclusion that all women get married, which is obviously incorrect.

Kaplan-Meier estimation

In this example it is not only desirable but necessary to include observations of unmarried women, which brings us to one of the central algorithms in survival analysis, Kaplan-Meier estimation.
The general idea is that we can use the data to estimate the hazard function, then convert the hazard function to a survival function. To estimate the hazard function, we consider, for each age, (1) the number of women who got married at that age and (2) the number of women “at risk” of getting married, which includes all women who were not married at an earlier age.

The details of the algorithm are in the book; we'll skip to the results:

The top graph shows the estimated hazard function; it is low in the teens, higher in the 20s, and declining in the 30s. It increases again in the 40s, but that is an artifact of the estimation process; as the number of respondents “at risk” decreases, a small number of women getting married yields a large estimated hazard. The survival function will smooth out this noise.

The bottom graph shows the survival function, which shows for each age the fraction of people who are still unmarried at that age. The curve is steepest between 25 and 35, when most women get married. Between 35 and 45, the curve is nearly flat, indicating that women who do not marry before age 35 are unlikely to get married before age 45.

A curve like this was the basis of a famous magazine article in 1986; Newsweek reported that a 40-year old unmarried woman was “more likely to be killed by a terrorist” than get married. These statistics were widely reported and became part of popular culture, but they were wrong then (because they were based on faulty analysis) and turned out to be even more wrong (because of cultural changes that were already in progress and continued). In 2006, Newsweek ran an another article admitting that they were wrong.

I encourage you to read more about this article, the statistics it was based on, and the reaction. It should remind you of the ethical obligation to perform statistical analysis with care, interpret the results with appropriate skepticism, and present them to the public accurately and honestly.

Cohort effects

One of the challenges of survival analysis, and one of the reasons Newsweek was wrong, is that different parts of the estimated curve are based on different groups of respondents. The part of the curve at time t is based on respondents whose age was at least t when they were interviewed. So the leftmost part of the curve includes data from all respondents, but the rightmost part includes only the oldest respondents.

If the relevant characteristics of the respondents are not changing over time, that’s fine, but in this case it seems likely that marriage patterns are different for women born in different generations. We can investigate this effect by grouping respondents according to their decade of birth. Groups like this, defined by date of birth or similar events, are called cohorts, and differences between the groups are called cohort effects.

To investigate cohort effects in the NSFG marriage data, I gathered Cycle 5 data from 1995, Cycle 6 data from 2002, the Cycle 7 data from 2006–2010. In total these datasets include 30,769 respondents.
I divided respondents into cohorts by decade of birth and estimated the survival curve for each cohort:

Several patterns are visible:

Women born in the 50s married earliest, with successive cohorts marrying later and later, at least until age 30 or so.

Women born in the 60s follow a surprising pattern. Prior to age 25, they were marrying at slower rates than their predecessors. After age 25, they were marrying faster. By age 32 they had overtaken the 50s cohort, and at age 44 they are substantially more likely to have married. Women born in the 60s turned 25 between 1985 and 1995. Remembering that the Newsweek article was published in 1986, it is tempting to imagine that the article triggered a marriage boom. That explanation would be too pat, but it is possible that the article and the reaction to it were indicative of a mood that affected the behavior of this cohort.

The pattern of the 70s cohort is similar. They are less likely than their predecessors to be married before age 25, but at age 35 they have caught up with both of the previous cohorts.

Women born in the 80s are even less likely to marry before age 25. What happens after that is not clear; for more data, we have to wait for the next cycle of the NSFG, coming in late fall 2014.

Expected remaining lifetime

Given a survival curve, we can compute the expected remaining lifetime as a function of current age. For example, given the survival function of pregnancy length, we can compute the expected time until delivery.

For example, the following figure (left) shows the expecting remaining pregnancy length as a function of the current duration. During Week 0, the expected remaining duration is about 34 weeks. That’s less than full term (39 weeks) because terminations of pregnancy in the first trimester bring the average down.

The curve drops slowly during the first trimester: after 13 weeks, the expected remaining lifetime has dropped by only 9 weeks, to 25. After that the curve drops faster, by about a week per week.

Between Week 37 and 42, the curve levels off between 1 and 2 weeks. At any time during this period, the expected remaining lifetime is the same; with each week that passes, the destination gets no closer. Processes with this property are called “memoryless,” because the past has no effect on the predictions. This behavior is the mathematical basis of the infuriating mantra of obstetrics nurses: “any day now.”

The figure also shows the median remaining time until first marriage, as a function of age. For an 11 year-old girl, the median time until first marriage is about 14 years. The curve decreases until age 22 when the median remaining time is about 7 years. After that it increases again: by age 30 it is back where it started, at 14 years.

For the marriage data I used median rather than mean because the dataset includes women who are unmarried at age 44. The survival curve is cut off at about 20%, so we can't compute a mean. But the median is well defined as long as more than 50% of the remaining values are known.

Based on this data, young women have decreasing remaining "lifetimes". Mechanical components with this property are called NBUE for "new better than used in expectation," meaning that a new part is expected to last longer.

Women older than 22 have increasing remaining time until first marriage. Components with this property are UBNE for "used better than new in expectation." That is, the older the part, the longer it is expected to last. Newborns and cancer patients are also UBNE; their life expectancy increases the longer they live. Also, people learning to ride a motorcycle.

Thursday, July 10, 2014

Last year my occasional correspondent John D. Cook wrote an excellent blog post about the Lincoln index, which is a way to estimate the number of errors in a document (or program) by comparing results from two independent testers. Here's his presentation of the problem:

Suppose you have a tester who finds 20 bugs in your program. You want to estimate how many bugs are really in the program. You know there are at least 20 bugs, and if you have supreme confidence in your tester, you may suppose there are around 20 bugs. But maybe your tester isn't very good. Maybe there are hundreds of bugs. How can you have any idea how many bugs there are? There’s no way to know with one tester. But if you have two testers, you can get a good idea, even if you don’t know how skilled the testers are.

Then he presents the Lincoln index, an estimator "described by Frederick Charles Lincoln in 1930," where Wikpedia's use of "described" is a hint that the index is another example of Stigler's law of eponymy.

Suppose two testers independently search for bugs. Let k1 be the number of errors the first tester finds and k2 the number of errors the second tester finds. Let c be the number of errors both testers find. The Lincoln Index estimates the total number of errors as k1 k2 / c [I changed his notation to be consistent with mine].

So if the first tester finds 20 bugs, the second finds 15, and they find 3 in common, we estimate that there are about 100 bugs.

Of course, whenever I see something like this, the idea that pops into my head is that there must be a (better) Bayesian solution! And there is. You can read and download my solution here.

I represent the data using the tuple (k1, k2, c), where k1 is the number of bugs found by the first tester, k2 is the number of bugs found by the second, and c is the number they find in common.

The hypotheses are a set of tuples (n, p1, p2), where n is the actual number of errors, p1 is the probability that the first tester finds any given bug, and p2 is the probability for the second tester.

Where choose evaluates the binomial coefficient, , and binom evaluates the rest of the binomial pmf:

def binom(k, n, p): return p**k * (1-p)**(n-k)

And that's pretty much it. Here's the code that builds and updates the suite of hypotheses:

data = 20, 15, 3

probs = numpy.linspace(0, 1, 101)

hypos = []

for n in range(32, 350):

for p1 in probs:

for p2 in probs:

hypos.append((n, p1, p2))

suite = Lincoln(hypos)

suite.Update(data)

The suite contains the joint posterior distribution for (n, p1, p2), but p1 and p2 are nuisance parameters; we only care about the marginal distribution of n. Lincoln inherits Marginal from Joint, so we can get the marginal distribution like this:

n_marginal = suite.Marginal(0)

Where 0 is the index of n in the tuple. And here's what the distribution looks like:

The lower bound is 32, which is the total number of bugs found by the two testers. I set the upper bound at 350, which chops off a little bit of the tail.

The maximum likelihood estimate in this distribution is 72; the mean is 106. So those are consistent with the Lincoln index, which is 100. But as usual, it is more useful to have the whole posterior distribution, not just a point estimate. For example, this posterior distribution can be used as part of a risk-benefit analysis to guide decisions about how much effort to allocate to finding additional bugs.

This solution generalizes to more than 2 testers, but figuring out the likelihood function, and evaluating it quickly, becomes increasingly difficult. Also, as the number of testers increases, so does the number of dimensions in the space of hypotheses. With two testers there are about 350 * 100 * 100 hypotheses. On my non-very-fast desktop computer, that takes about a minute. I could speed it up by evaluating the likelihood function more efficiently, but each new tester would multiply the run time by 100.

The library I used in my solution is thinkbayes.py, which is described in my book, Think Bayes. Electronic copies are available under a Creative Commons licence from Green Tea Press. Hard copies are published by O'Reilly Media and available from Amazon.com and other booksellers.

I believe that the approach to Bayesian statistics I present in Think Bayes is a good way to solve problems like this. I cite as evidence that this example, from the time I read John's article, to the time I pressed "Publish" on this post, took me about 3 hours.