Friday, January 6, 2012

Complete example of right censoring in JAGS (with rjags)

In this post I provide a complete and simple example of how to do estimation of right-censored data in JAGS (with rjags). I could not find a complete, simple example anywhere (on the web or in print), and it took me lots of trial and error to figure out the tacit assumptions, which I try to make explicit here. (Of course, there still might be errors or infelicities. Let me know.)

Before launching into programming details, let me illustrate why this is important. Right-censored data can occur in many situations. For example, when measuring response times, the subject is given a limited temporal window in which to respond, and if no response has occurred by the end of the window, then the trial is aborted and the response is recorded as censored. If we assume that the subject was "on task" but merely didn't have enough time to respond before the experimenter got bored, then it would be inappropriate to omit this trial from the modeled data. In other words, the trial is providing genuine information about the response time, namely, that it takes longer than the experimenter's censoring limit. If we omit this trial from the modeled data, then our estimates will be too small!

Here is an example. The data are 500 values generated from a normal distribution, rescaled so that the sample mean is 100.0 and the sample standard deviation is 15.0. Then I marked any value above +1SD as NA. In other words, I censored about 15% of the data in the right tail. Suppose we model the data with a normal likelihood and estimate its μ and σ parameters. We would like the estimates to recover the true generating values of μ=100.0 and σ=15.0. But if we simply omit the censored values (as if they weren't part of the data), the estimates look like this:
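For concreteness, a data set like that can be generated along these lines (a minimal sketch; the seed and variable names are my own, not necessarily those of the linked program):

```r
set.seed(47405)                          # arbitrary seed, for reproducibility only
N <- 500
y <- rnorm( N )
y <- ( y - mean(y) ) / sd(y) * 15 + 100  # rescale so sample mean = 100, sample SD = 15
censorLimit <- mean(y) + sd(y)           # censor at +1 SD, i.e., 115
isCensored <- ( y > censorLimit )        # about 15% of values, in the right tail
y[ isCensored ] <- NA                    # censored values are recorded as NA, not as the limit
```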

Posterior estimates when censored data are omitted. Estimates are too low.
(ROPE is shown only for visual landmarks; it's not particularly meaningful in this case.)

Notice that the estimated parameter values are too low. This makes sense, because the model is trying merely to describe the uncensored data shown in grey, without any attempt to take into account the censored data.

On the other hand, if we tell JAGS that there are censored values that it should take into account, the estimates look like this:

Posterior estimates when censored data are modeled. Estimates are accurate.
(ROPE is shown only for visual landmarks; it's not particularly meaningful in this case.)

Notice that the estimated parameter values are very accurate. This improvement in accuracy (and meaningfulness!) is why we want to be able to model censored data.

The basic idea for implementing this in JAGS is presented briefly (one paragraph) in the JAGS user manual. The idea starts with defining an indicator vector that encodes, for every data value, whether that value was censored or not. The indicator, here named isCensored, is 1 for censored values and 0 for non-censored values. That indicator is then modeled thus:

isCensored ~ dinterval( y , censorLimit )

It took me a while to infer from that syntax the key idea: when isCensored is 1 and y is missing, all JAGS knows is that y is somewhere above censorLimit, so JAGS effectively imputes a random value for y from the likelihood function (defined separately, in the usual way). Here are the tacit assumptions that I eventually figured out:
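Putting the indicator together with the likelihood, the core of the model looks something like this (a sketch; the priors are my own vague choices and the names may differ from those in the linked program):

```
model {
  for ( i in 1:N ) {
    isCensored[i] ~ dinterval( y[i] , censorLimit )  # 1 if y[i] > censorLimit, else 0
    y[i] ~ dnorm( mu , tau )                         # likelihood; imputes y[i] when NA
  }
  mu ~ dnorm( 0 , 1.0E-6 )       # vague prior on the mean
  tau <- pow( sigma , -2 )       # JAGS parameterizes dnorm by precision
  sigma ~ dunif( 0 , 100 )       # vague prior on the SD
}
```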

Censored data must be recorded as NA, not as the value of the censoring limit.

When explicitly initializing the chains, the censored values of the data must be explicitly initialized (to values above the censoring limits)!
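In rjags, those two assumptions translate into data and initial values along these lines (a sketch with illustrative names, assuming y, isCensored, and censorLimit already exist in R and the model is saved in a file named here model.txt):

```r
library(rjags)

dataList <- list(
  y = y ,                                 # censored entries are NA, not the limit
  isCensored = as.numeric( isCensored ) , # dinterval wants 0/1, not TRUE/FALSE
  censorLimit = censorLimit ,
  N = length( y )
)

# Censored y values must be explicitly initialized above the censoring limit.
# NA in the init vector means "observed value; do not initialize".
yInit <- rep( NA , length( y ) )
yInit[ isCensored ] <- censorLimit + 1    # any value above the limit works
initsList <- list( mu = 100 , sigma = 15 , y = yInit )

jagsModel <- jags.model( "model.txt" , data = dataList ,
                         inits = initsList , n.chains = 3 )
```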

The full program for generating the figures above can be found here. For details, read the comments in the top of the program. Please let me know of errors, improvements, etc. And please let me know if this was helpful.

12 comments:

This post is fortuitous, because I was struggling with right-censored data this week. I've expanded this model into one which is slightly more complicated, in which the data have associated measurement uncertainties, so the body of the model is:
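A model of the kind described, in which each observation carries its own measurement uncertainty, might look like this (my own sketch, not necessarily the commenter's model; obsPrec[i] is an assumed known per-datum measurement precision):

```
model {
  for ( i in 1:N ) {
    isCensored[i] ~ dinterval( yObs[i] , censorLimit )
    yObs[i] ~ dnorm( yTrue[i] , obsPrec[i] )  # measurement error around the latent value
    yTrue[i] ~ dnorm( mu , tau )              # latent values drawn from the population
  }
  mu ~ dnorm( 0 , 1.0E-6 )
  tau <- pow( sigma , -2 )
  sigma ~ dunif( 0 , 100 )
}
```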

The model assumes that you know when you've conducted a measurement trial but the measurement was censored. In other words, the model assumes that you know when you have started and stopped an observation window that defines a measurement trial. I think this is a ubiquitous, if tacit, assumption in all conceptions of data measurements.

Do you mean that you have missing data, and some of the data might be censored trials? Then you're getting into models of missing values, and into the issues of missing-at-random, or missing systematically, etc.

I'm trying to understand how to construct the censorLimitVec with a standard set of survival data. I have a vector of event times t and a vector of censored status c. I see that I must set t[which(c)] = NA, and is.censored = c. Can I put the actual censor time in censorLimitVec for censored events? What do I put in censorLimitVec for uncensored events? Or does it matter? If it doesn't matter, then can I just do is.censored[i] ~ dinterval(t, t)?

The key to understanding what JAGS is doing is that JAGS automatically imputes a random value for any variable that is not specified as a constant in the data. Thus, when y[i] is NA (i.e., a missing value, not a constant), then JAGS imputes a random value for it.

But what value should it generate?

The second line of the model, above, says that y[i] should be randomly generated from a normal distribution with mean mu and precision tau.

But the first line of the model, above, puts another constraint on the randomly generated value of y[i]. That line says that whatever value of y[i] is randomly generated, it must fall on the side of censorLimitVec[i] dictated by the value of isCensored[i].

To understand this part, let's unpack the dinterval() distribution. Suppose that censorLimitVec has 3 values in it, not just 1:

censorLimitVec = c(10,20,30)

Then randomly generated values from dinterval(y,c(10,20,30)) will be either 0, 1, 2, or 3, depending on whether y<10, 10<y<20, 20<y<30, or 30<y. So, if y=15, then dinterval(y,c(10,20,30)) has an output of 1 with 100% probability. The trick is this: we instead specify the output of dinterval, and JAGS imputes a random value of y that could produce it. Thus, if we say

1 ~ dinterval(y,c(10,20,30))

then y is imputed as a random value between 10 and 20.
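The deterministic direction of dinterval() can be mimicked in R by counting how many cutoffs fall below y (a toy illustration of the idea, not part of rjags; behavior at exact cutoff values is ignored here):

```r
dintervalValue <- function( y , cutoffs ) { sum( y > cutoffs ) }

dintervalValue(  5 , c(10,20,30) )   # 0 : y is below all cutoffs
dintervalValue( 15 , c(10,20,30) )   # 1 : y is between 10 and 20
dintervalValue( 35 , c(10,20,30) )   # 3 : y is above all cutoffs
```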

Yes, I was getting at missing-at-random, missing-below-threshold kinds of situations.

-- Jan ("hypergeometric")

P.S. Thank you so much for putting up such an excellent compendium of resources on Bayesian calculation, as well as your book. (I personally own 2 copies, and convinced my department to purchase a third. We have JAGS and runjags widely installed, etc.)