As expected, Derek Lowe has a thoughtful post (with a very interesting discussion going on in the comments) about the latest “Expression of Concern” from the New England Journal of Medicine about the VIGOR Vioxx trial.

To catch you up if you’ve been watching curling rather than following the case: A clinical study of Vioxx was performed, resulting in a manuscript submitted to NEJM by 13 authors (11 academics and two scientists employed by Merck). The study was looking at whether adverse gastrointestinal events were correlated with taking Vioxx. During the course of the study, other events in participants (including cardiovascular events) were also tracked. The point of contention is that there were three heart attacks among study participants that happened before the official ending date for the study, and were known to the authors before the paper was published in NEJM, but that were left out of the data presented in the paper that was published. NEJM has identified this as a problem. While NEJM has not come out and said, “Looky here, scumbag pharmaceutical company trimming the data to sell more Vioxx, patient safety be damned!” that’s a conclusion people might be tempted to draw here.

But, as Derek points out, it’s not that simple.
For one thing, the data that was kept in — and the data that was left out — was decided on the basis of a pre-decided experimental protocol. As noted in a reply to the NEJM by academic authors of the study:

The VIGOR study was a double-blind, randomized outcomes study of upper gastrointestinal clinical events. We, as members of the steering committee, approved the study termination date of February 10, 2000, and the cutoff date of March 9, 2000, for reporting of gastrointestinal events to be included in the final analysis. Comparison of cardiovascular events was not a prespecified analysis for the VIGOR study. . . the independent committee charged with overseeing any potential safety concerns recommended to Merck that a data analysis plan be developed for serious cardiovascular events . . . As a result, a cardiovascular data analysis plan was developed by Merck. Merck indicated that they chose the study termination date of February 10, 2000, as the cutoff date . . . to allow sufficient time to adjudicate these events . . . (The three events) were neither in the locked database used in the analysis for the VIGOR paper nor known to us during the review process. However, changing the analysis post hoc and after unblinding would not have been appropriate.

(Bold emphasis added.)

The authors point out that they included all the data that was supposed to be included, as per the protocol. Perhaps you could raise an eyebrow that the cutoff date for reporting the cardiovascular events fell a month sooner than the cutoff date for reporting gastrointestinal events. (After all, Merck set that earlier cutoff — did they know something?) But, you need to set some deadline — not just so you know when you can stop the study, but also so you can analyze the data and get the results out in a timely fashion. And, the need “to allow sufficient time to adjudicate these events” makes it look (to my untrained eye) like maybe it takes longer to figure out whether there has even been a cardiovascular event than it does to recognize a gastrointestinal event. Sure, if we turned up a tape of a Merck board meeting where the Merck guys were cackling about how the early cut-off would protect their evil secret, we’d call shenanigans here. But on the face of things, there’s nothing about establishing a reporting cut-off and sticking to the protocol that’s out of line.

Indeed, as the authors point out, violating the protocol by including the extra (post-cut-off) data might be more of a problem.

To the average person, this seems counterintuitive. We do scientific research to get data from the world. More data ought to give a better picture of how things are, right?

This works to a point. Especially when the scientific question we’re trying to answer bears on people’s health and lives in a direct way (e.g., drug testing), we don’t have the luxury of waiting around till we have all the data we might get. (Getting all the data takes forever.) So, we want to make sure we have a good sample of the relevant data — enough that we can draw reasonable conclusions, and gathered in a way that we think gives an accurate sample of the whole set of data that we don’t have forever to get.

Also, we want to make sure that the conclusions we draw from the data we get are as unbiased as possible. Looking at data can sometimes feel like looking at clouds (“Look! A ducky!”), but scientists want to figure out what the data tells us about the phenomenon — not about the ways we’re predisposed to see that phenomenon. In order to ensure that the researchers (and patients) are not too influenced by their hunches, you make the clinical trial double-blind: while the study is underway and the data is being collected, neither study participants nor researchers know which participants are in the control group and which are in the treatment group. And, at the end of it all, rather than just giving an impression of what the data means, the researchers turn to statistical analyses to work up the data. These analyses, when properly applied, give some context to the result — what’s the chance that the effect we saw (or the effect we didn’t see) can be attributed to random chance or sampling error rather than its being an accurate reflection of the phenomenon under study?
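To make "what's the chance this is just chance?" a little more concrete, here is a minimal sketch of one such calculation: a one-sided exact test in the style of Fisher's test. The event counts are invented for illustration (they are not the VIGOR numbers), and the function name is my own.

```python
from math import comb


def chance_of_split(events_a, n_a, events_b, n_b):
    """Probability that group A ends up with at least `events_a` of the
    total events if the events were scattered across participants with
    no regard to group (a one-sided hypergeometric tail)."""
    total_events = events_a + events_b
    total_n = n_a + n_b
    all_ways = comb(total_n, total_events)
    most_a_can_have = min(total_events, n_a)
    ways = sum(
        comb(n_a, k) * comb(n_b, total_events - k)
        for k in range(events_a, most_a_can_have + 1)
    )
    return ways / all_ways


# Hypothetical trial: 20 events among 4000 treated, 8 among 4000 controls.
p_value = chance_of_split(20, 4000, 8, 4000)
print(f"chance of a split at least this lopsided: {p_value:.4f}")
```

A small number here says the lopsided split would be surprising if the drug made no difference. It does not say anything about whether the events were counted fairly in the first place, which is where blinding and protocols come in.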

The statistical analyses you intend to use point to the sample size you need to examine to achieve the desired confidence in your result. It’s also likely that statistical considerations play a part in deciding the proper duration of the study (which, of course, will have some effect on setting cut-off dates for data collection). For the purposes of clean statistical analyses, you have to specify your hypothesis (and the protocol you will use to explore it) up front, and you can’t use the data you’ve collected to support the post hoc hypotheses that may occur to you as you look at the data — to examine these hypotheses, you have to set up brand new studies.
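As a rough illustration of how the planned analysis drives the sample size, here is a sketch using the standard normal-approximation formula for comparing two proportions. The event rates, the significance level, and the power are all assumptions I picked for the example; a real trial's statisticians would use something more careful.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(p_control, p_treatment, alpha=0.05, power=0.80):
    """Approximate participants needed in each arm to detect the
    difference between two event rates (two-sided test)."""
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)
    z_beta = standard_normal.inv_cdf(power)
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_control - p_treatment
    return ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


# Hypothetical rates: 4% of controls have the event, 2% of treated do.
print(n_per_group(0.04, 0.02))
# A smaller difference (4% vs. 3%) needs many more participants.
print(n_per_group(0.04, 0.03))
```

The thing to notice is that the numbers come out of decisions made before any data exists. Change the hypothesis after peeking at the data and the calculation no longer describes the study you actually ran.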

I suspect that the requirements for clean statistical analyses may not be persuasive to people who haven’t wallowed in such analyses. Here’s another problem that’s easier to understand: after the conclusion of the study, the researchers and participants find out who was in the treatment group and who was in the control group. This knowledge could sway people’s expectations, perhaps biasing their perception of subsequent events. If I find out I was in the treatment group and the next day I feel chest pains, maybe I’m more likely to call it in to the researchers. If I find out I was taking a sugar pill and the next day I feel chest pains, I may call my doctor, but will I also call the researchers? It’s not like the drug they’re studying caused my chest pains (since it turns out I was taking the sugar pill instead); why would they need to know about my chest pains?

Given not only the placebo effect but also our tendency to get attached to our hunches, it seems worth being careful about the “data” that presents itself once a study is unblinded.
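The chest-pain worry can be turned into a toy simulation. In this sketch both groups have exactly the same true rate of events; the only difference is how likely a participant is to phone the researchers once they know which pill they were taking. Every number here (group size, event rate, reporting probabilities) is made up for illustration.

```python
import random


def reported_events(n, true_rate, report_prob, rng):
    """Count events that actually reach the researchers."""
    count = 0
    for _ in range(n):
        has_event = rng.random() < true_rate
        if has_event and rng.random() < report_prob:
            count += 1
    return count


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 20000
true_rate = 0.02  # identical in both groups: the drug does nothing here

# After unblinding, assume treated participants report 90% of events
# and placebo participants report only 50% of theirs.
treatment_reports = reported_events(n, true_rate, 0.9, rng)
placebo_reports = reported_events(n, true_rate, 0.5, rng)

print("reported in treatment group:", treatment_reports)
print("reported in placebo group:  ", placebo_reports)
```

The reported counts come out lopsided even though nothing differs between the groups except who bothers to call.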

One could ask, of course, whether the folks at the pharmaceutical companies might be clever enough to set study durations that were less likely to identify potentially bad outcomes of the drugs they want to sell. Or, even if you’re not in the “big pharma is eeevil” camp, you might imagine there are situations in which unforeseen bad effects only become apparent long after the study has concluded. Do researchers (and drug companies, and regulatory agencies) have to ignore these just because including them might undermine the statistical street cred of the results? If, three years down the road, every participant who was in our treatment group sprouts horns (while no one from the control group does), shouldn’t this information be disseminated?

Of course it should — but perhaps it should come clearly labeled as to what kind of statistical assurance there is (or isn’t) that the finding is representative. And perhaps a new study of the horn sprouting — properly double-blinded — is also in order.

Being unbiased is really hard, even for scientists. Clinical trials are set up in ways that are supposed to minimize the biases as much as possible. Scientists who cling to the protocols — including the cut-off dates they specify — ought not to be presumed to be doing so to cover their own butts or those of their corporate masters. At the same time, of course, it’s good for scientists to keep doing a gut-check to make sure that they are using the statistical tools appropriately — not intentionally to tweak the data, but rather to avoid unintentional bias.

Comments

I’ve heard the endless debates about “German style” and “American style” of doing science. The former is supposed to be: Write the detailed protocol in your grant proposal then stick to it no matter what. The latter promotes flexibility – if your early returns suggest that your protocol can be improved on the fly, then you improve it on the fly and save the taxpayers money, as well as your own time and effort, from being wasted on an imperfect protocol (and you could not have known it was imperfect until the early data started coming in).

What do you think?

As for sample sizes, i.e. adding extra data points over and over, I hate doing that. Instead, I wait until the next colony of quail is old/mature enough and do a complete new repeat of the experiment on fresh animals – sometimes three times (i.e., on three generations of animals) – until I trust my data enough to publish.

The point that making your results more correct or more complete can make your conclusions wrong can be a difficult one to grasp. An example that I sometimes give in talking to medical students about why double blinding is important, even when the researchers are absolutely honest, is the following one:

Let’s say you are doing a clinical study, and you know which is the real treatment and which is the placebo. So you notice that some patients in the treatment group are not showing the anticipated response. So you pull their charts, and what do you know? Somebody made an error, and some of these guys did not meet the inclusion criteria for the study! So you discard their data. And then at the end, you apply powerful statistics and they tell you that the treatment and placebo groups are different. Indeed they are! The ones in the treatment group got extra scrutiny, and the mistakes were corrected. The ones in the placebo group did not get the same scrutiny, and the mistakes slipped through.
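That scenario is easy to sketch as a toy simulation. Here the treatment does nothing at all: outcomes in both arms are drawn from the same distribution, and the same fraction of patients in each arm were enrolled by mistake. The only difference is that poor responders in the treatment arm get their charts pulled, and the ineligible ones among them are discarded. The arm size, the 20% ineligibility rate, and the "poor response" threshold of zero are all invented for the example.

```python
import random
from statistics import mean


def simulate_arm(n, ineligible_rate, scrutinize, rng):
    """Return the outcomes that survive to the final analysis."""
    kept = []
    for _ in range(n):
        outcome = rng.gauss(0.0, 1.0)  # same distribution in both arms
        ineligible = rng.random() < ineligible_rate
        # Only disappointing results prompt a chart review, and only
        # in the arm where the researcher knows to be disappointed.
        if scrutinize and outcome < 0 and ineligible:
            continue
        kept.append(outcome)
    return kept


rng = random.Random(1)  # fixed seed so the run is repeatable
treatment = simulate_arm(20000, 0.2, scrutinize=True, rng=rng)
placebo = simulate_arm(20000, 0.2, scrutinize=False, rng=rng)

gap = mean(treatment) - mean(placebo)
print("patients kept:", len(treatment), "vs", len(placebo))
print(f"apparent treatment effect: {gap:.3f}")
```

Each individual exclusion is a legitimate correction, yet the treatment arm ends up looking better than the placebo arm purely because the corrections were applied to one side only.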