About this Author

[Author photos: "College chemistry, 1983" · "The 2002 Model" · "After 10 years of blogging. . ."]

Derek Lowe, an Arkansan by birth, got his BA from Hendrix College and his PhD in organic chemistry from Duke before spending time in Germany on a Humboldt Fellowship during his post-doc. He's worked for several major pharmaceutical companies since 1989 on drug discovery projects against schizophrenia, Alzheimer's, diabetes, osteoporosis, and other diseases.
To contact Derek email him directly: derekb.lowe@gmail.com
Twitter: Dereklowe

May 22, 2006

Merck and the Numbers

Posted by Derek

The New York Times has a good article today on the Vioxx data that I was talking about here last week. Check the graphic of the Kaplan-Meier charts especially; it's a good illustration of the problem. Merck is technically correct that the latest data still don't show a statistically meaningful difference between the Vioxx group and placebo until at least 18 months. As the article makes clear, they're hitting that theme very hard.

But Merck is also living in a dream world if they think that's going to help them much at this point. The problem is, the data look as if they're trending worse from a much earlier stage, and finally reach significance at the later time points. No lawyer in the world is going to walk away from that without driving it into the jury's heads that the danger is plain to see, yes, right there from the beginning, and don't talk to me about p-values when anyone can just look at this chart - your chart! - and see what's really going on. . .etc. We live by statistical arguments in the drug industry, but the people who are being called to jury duty sure don't. If I were one of the plaintiff's attorneys, I'd use the voir dire to make sure that anyone who knew anything about statistics never saw the inside of the jury box.

What's worse, to nonscientists, making statistics the centerpiece of your defense sounds shifty. People don't trust them; it's not for nothing that there are all those variously attributed quotations about "Lies, damned lies, and statistics". Now, if someone asks "Why are you so sure?" about something where I work, the answer "p less than point-oh-oh-five" will stop the questioner in their tracks. Not so in most workplaces, where that answer would make you sound as if you're dodging the question. And let's face it, the only p-values that strong that Merck can show are the ones that work against them.

The other problem is that a statistical approach is valid for large samples, the larger the better. But the jury isn't looking at a large sample. They're not there to decide how much Vioxx might have raised aggregate cardiovascular risk in certain subgroups, they're there to decide if it caused a heart attack in that guy sitting over there. The attorneys are going to keep things as personal as possible.

If you try to start fanning out profundities - like the oh-oh-five p-stuff - around here, you will reap some funny looks (which sometimes progress into a derisive laugh and/or an outright put-down).

Merck had this enormous hubris about how great Vioxx was - "Celebrex is just an overpriced Aspirin", etc. - they were getting high on their own dope for quite some time. One usually deceives himself first, before starting to deceive others.

Now take Lilly as the opposite example: for a lot of money, Lilly bought back the (dubious) Sepracor rights to develop enantiopure Prozac. This drug had better PK and a more favorable activity profile than the old racemic stuff - but quite late in the trials, hints of cardiotoxicity were noticed and Lilly pulled out. Now imagine what would have happened if they had chosen to look the other way or covered up the disagreeable data. They would have introduced the product (as the new & improved Prozac) - fooling themselves that they had a new money machine like Prilosec - and within a year or two (after heavy promotion bordering on outright bribery), 1 out of every 10,000 patients would start dropping with cardiac arrest...

1) Statistical hardasses would say that tests don't REACH significance, they are or are not significant. (Where's Yoda when you need him?) This is especially true when the issue at hand is sliding a cutoff point back and forth on a KM/survival analysis. But we all do it.
2) That said, Merck may gain some points claiming that the "suppressed events" wouldn't have made a difference. I realize given your NEJM post earlier that this may already be moot, but if those events are to the left of the sliding cutoff mentioned above they may at least be able to put forth enough smoke to counter the "you hid the data and PEOPLE DIED" attack.
3) ANY time someone comes at you with "p less than 0.05" restate the conclusion you believe results from the test, or ask them to do it. Amazing how you can stop some people dead in their tracks with this. And I don't mean catching them on statistical fine points like "You can't PROVE the alternative hypothesis..." I mean are the statistical test and its implications relevant to the discussion?
4) You are right about how laypeople won't buy into stuff specialists no longer question. One of my little nightmare scenarios involves the fact that it is unbelievably hard to run large-scale multi-multi-multi-center clinical trials without encountering something that really appears to cross the "scientific fraud" line. These events rarely invalidate the overall conclusions of the trial, but they do cause us trouble if for no other reason than we have to re-prove that conclusion to ourselves again and again. However, would you be comfortable risking your professional and/or financial future on a jury making a distinction between scientific fraud and criminal/civil fraud?
I thought not.

You guys are right about statistics not playing to Vioxx juries, but I think you're a little off about why. It's not that juries don't understand or don't care. It's that the Vioxx liability is only marginally about safety or even science. The liability is about disclosure of risk. Merck's real problem is that from VIGOR onwards they knew and admitted an increased risk. It then took two years to negotiate labeling language with the FDA before they finally sent a Dear Dr. letter. During those two years, however, they continued their aggressive marketing and DTC campaign as if nothing had changed. That is the issue.

Big tobacco won lawsuits for years despite the fact they were producing a scientifically proven unsafe product. Why? Because juries ruled that smokers took on assumed risk due to the clear warning labels on every box. It was only after it became public that the tobacco companies knew and hid the health risks that they got into trouble.

To paraphrase a comment on Nixon, the real question is not "Is Vioxx safe?" but "What did Merck know and when did they know it?" So while Merck argues statistical technicalities, they're not addressing the issue that matters to juries: full and honest disclosure. And trying to continue to hide behind these technicalities doesn't help them, but rather continues to demonstrate to juries a disregard for public safety for the sake of fiduciary expediency. If I were a juror, I would not view Merck favorably when they try to claim that there's no evidence of increased risk 12 months after use was stopped, when the increased incidence rate was at an 89% confidence level. We all know that that doesn't mean there's no increased risk. In fact, it's good evidence there is increased risk, but that more studies are needed to "prove" it. That's not being honest, and that's what Vioxx liability is all about.

There's one thing I don't really understand about statistics as they're usually used in this sort of case. As I understand it, the data is considered to be "statistically insignificant" because there is less than a 95% certainty of a correlation between the medication and heart attacks. Fine, ok, I understand that part.

What I don't understand is why this is taken as the appropriate way to look at things. For an issue of drug safety, I would think the goal is not a lack of proof of a correlation, but a proof of a lack of a correlation -- which would be a level of correlation smaller than what would normally be called the "5% confidence level", not merely something smaller than the 95% confidence level.

At the very least, a correlation with a confidence level greater than 50% means "It's more likely than not," which I would think really ought be interpreted as "This is a cause for concern until we do more testing" -- not as "We don't have to say anything about this until it's proven to be a risk."

Dr Gurkirpal Singh, MD, is Adjunct Clinical Professor of Medicine at Stanford University School of Medicine.

Dr Singh is a rheumatologist by training with research expertise in drug safety and epidemiology.

Dr Singh was asked to review internal company documents and emails between Merck scientists and executives that had been subpoenaed by Congress.

Dr Singh pointed out that as far back as 1996, Merck was already considering the possibility that a clinical trial of Vioxx versus a non-selective NSAID would find that patients treated with Vioxx had an increased risk of cardiovascular complications.

"We now know that by November of 1996," Dr Singh told the panel, "Merck scientists were seriously discussing a potential risk of Vioxx - association with heart attacks."

At that time, he said, it was not known that Vioxx could cause heart attacks, but the discussion focused on the issue that by inhibiting platelets, other painkillers may protect against heart attacks. Vioxx has no effect on platelets, and thus may seem to increase the risk of heart attacks in studies comparing it to other painkillers, he said.

"This was a serious concern because the entire reason for the development of Vioxx was safety," he explained. "If the improved stomach safety of the drug was negated by a risk of heart attacks," Dr Singh said, "patients may not be willing to make this trade-off."

"Merck scientists," he told the committee, "were among the first to recognize this."

"At this point in time," he said, "scientists should have started a public discussion about this potential trade-off, and designed studies that would more carefully evaluate the risk-benefit ratio of the drug."

"It appears from the internal Merck e-mails provided to me," he advised, "that in early 1997, Merck scientists were exploring study designs that would exclude people who may have a weak heart so that the heart attack problem would NOT be evident."

Brooks,
Your questions have both statistical and drug-development aspects that need to be addressed. To take the easier side first, you would not expect a new drug to have all risk eliminated, but rather to have it reliably characterized so the MD and patient can make informed decisions. Remember that Vioxx was supposed to have the advantage of lower GI complication rates, and even with all that is known about it, it would probably still be on the market if it had been promoted to people with greater risk of stomach problems than heart problems - but that's a smaller market. Also, you can have drugs with 100% rates for certain complications, but if the complications are tolerable and reversible then you've still got a drug.
The statistical issues are a little dicier to explain because there is no consensus about right and wrong in the field itself. However, the textbook way of rephrasing your issue would be that there is greater than a 5% chance of a false positive, and that rate is too high. A false positive in this case is declaring a relationship to exist when it in fact does not (also called a Type I error). As stated, this implies that the relationship was intended to be tested in the design of the experiment, which is obviously not the case in the Vioxx cardio risk analysis anyway. The thinking is that if you are going to perform an experiment to test a hypothesis it would be a huge mistake to declare the hypothesis valid if it in fact is not.
This leads to the issue that when discussing safety data in clinical trials the issue of stat significance isn't really relevant to begin with, precisely for the reasons I infer motivated your original post. This is trickiest when dealing with rare but catastrophic events, where you might need five events in one group to "reach" significance when none are observed in the other, but you in fact saw two.
Bad luck or real risk? You make the call.
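That five-against-zero arithmetic is easy to check. Here is a minimal sketch (the 1,000-patient arm size is an assumption for illustration, not data from any real trial) using a one-sided Fisher exact test built from the hypergeometric tail:

```python
from math import comb

def fisher_one_sided(a, b, c, d):
    """One-sided Fisher exact test on the 2x2 table [[a, b], [c, d]]
    (rows = treatment arms, column 1 = events): the probability of
    seeing a or more events in arm 1 with all margins fixed, i.e.
    the hypergeometric upper tail."""
    n, row1, col1 = a + b + c + d, a + b, a + c
    return sum(comb(row1, x) * comb(n - row1, col1 - x)
               for x in range(a, min(row1, col1) + 1)) / comb(n, col1)

n = 1000  # hypothetical patients per arm
print(f"2 vs 0 events: p = {fisher_one_sided(2, n - 2, 0, n):.3f}")  # ~0.25
print(f"5 vs 0 events: p = {fisher_one_sided(5, n - 5, 0, n):.3f}")  # ~0.031
```

With two events against zero, the one-sided p is about 0.25 - nowhere near significance. It takes five against zero to slip under 0.05, which is exactly the bind described above.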

The lack of significance only indicates that any effect seen could be by chance - it doesn't indicate that there is no effect. The data as stated don't indicate that Vioxx's heart effects are comparable to those of a placebo - more data would be needed to show that - but that any effects seen could be chance occurrences not caused by Vioxx.
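A quick way to see the difference is to look at the confidence interval rather than the significant/not-significant verdict. A minimal sketch with invented counts (8 vs. 4 events per 1,000 patients - not figures from any Vioxx study) and a normal-approximation Wald interval:

```python
from math import sqrt

# Invented counts for illustration only:
e1, n1 = 8, 1000   # events / patients, drug arm
e2, n2 = 4, 1000   # events / patients, placebo arm
p1, p2 = e1 / n1, e2 / n2
diff = p1 - p2
se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)  # Wald standard error
lo, hi = diff - 1.96 * se, diff + 1.96 * se          # 95% CI
print(f"risk difference = {diff:.4f}")
print(f"95% CI: ({lo:.4f}, {hi:.4f})")
```

The interval crosses zero, so the comparison is "not significant" - yet its upper end is consistent with roughly ten extra events per thousand patients. The data rule out neither no effect nor substantial harm: absence of evidence, not evidence of absence.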

It seems to me that classical inference is the wrong approach here. Bayesian decision theory makes much more sense. That means a) explaining what the implied prior beliefs would be for any conclusion given the data and b) applying a realistic loss function to the various errors one could make. Merck may well be correct in its assessment of the true causality, but its use of statistical significance to bolster its claims doesn't make sense in decision-theoretic terms.
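A toy version of that Bayesian calculation, with invented counts and an invented loss ratio (nothing here comes from the actual Vioxx data), might look like this:

```python
import random

random.seed(1)

# Invented counts: 20 events / 1000 patients on drug, 10 / 1000 on
# placebo - a gap that a conventional two-sided test at alpha = .05
# would not call significant.
ev_d, n_d, ev_p, n_p = 20, 1000, 10, 1000

# Flat Beta(1,1) priors on each arm's event rate; the posterior for
# each arm is Beta(1 + events, 1 + non-events). Monte Carlo estimate
# of the probability that the drug arm's true rate is higher.
draws = 50_000
excess = sum(
    random.betavariate(1 + ev_d, 1 + n_d - ev_d)
    > random.betavariate(1 + ev_p, 1 + n_p - ev_p)
    for _ in range(draws)
)
p_harm = excess / draws
print(f"posterior P(drug riskier than placebo) ~ {p_harm:.2f}")

# The loss function makes the decision: here missing real harm is
# assumed to be 100x costlier than a false alarm (an illustrative
# ratio, not a claim about the right number).
decision = "warn" if p_harm * 100 > (1 - p_harm) * 1 else "stay quiet"
print(decision)
```

Even though the frequentist test shrugs, the posterior probability of increased risk comes out above 0.9, and with any realistic asymmetry in the losses, the decision-theoretic answer is to warn.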

Thanks for the comments, Still Scared. I particularly appreciate the observations about the drug-development aspects of this; the reminder that characterization of side effects and their frequency is the goal rather than a binary "do they exist or not" is an important part of the issue that I wasn't thinking about.

I think what's bugging me, really, is encapsulated in your statement that "if you are going to perform an experiment to test a hypothesis it would be a huge mistake to declare the hypothesis valid if it in fact is not." There's a fairly large difference between testing the hypothesis that "Vioxx is correlated with heart attacks" and testing the hypothesis that "Vioxx is not correlated with heart attacks," and failing to prove one is most certainly not a proof of the other. I find it deeply discouraging when people like Jeremiah completely miss that point and defend claims that are counter to common sense in the name of "statistics".

I don't think Merck would have a problem if Vioxx was merely failing to protect against heart attacks, as Dr. Singh predicted. This was a well-understood concern, and forthrightly acknowledged in Merck's product literature.

And in fact, the absence of a protective effect is essentially what Merck attributed the outcome of the VIGOR study to. Nobody was particularly upset about this. Such tradeoffs are routine in therapy, and the sort of thing that doctors should reasonably be able to deal with, either by prescribing a nonselective COX inhibitor or supplementing Vioxx with low-dose aspirin. Whether this latter approach would have sacrificed the GI advantages of Vioxx is a question that Merck apparently did not attempt to address directly, and one reasonable criticism of Merck is that they did not do so. If they had allowed patients taking low-dose aspirin into the VIGOR study, the problems with Vioxx might have been recognized earlier.

What ultimately led Merck to withdraw Vioxx, and what triggered the storm of lawsuits, was the discovery in the APPROVe study that patients on Vioxx were experiencing more heart attacks than those on placebo, so Vioxx wasn't merely failing to protect against heart attacks, it was causing them. Worse, there was no indication of a protective effect for the patients who were taking aspirin with Vioxx.

I think the "absence of evidence" vs. "evidence of absence" issue is a real concern, and goes beyond statistically unsophisticated juries. "No significant effect" is not the same thing as "no effect." I agree that the new data hurt Merck more than help them.

And just to complicate things further, there is the recent CMAJ study that found that Vioxx increased the risk of heart attacks right at the outset, but then the risk seems to go away. If patients tolerated the first few doses, it seems that they were fine thereafter. Which is, of course, exactly the opposite of what Merck seemed to see in the APPROVe study. The Canadian study is a retrospective study, which means that there is a possible bias in that the patients taking Vioxx might not be exactly the same as the patients taking other anti-inflammatory drugs (although the authors try to control for known factors). But in this case, that may make it more relevant, in that it examines the risk in the population of people who actually took Vioxx (at least in Canada).

I cannot claim to be very savvy about the finer points of statistics, but I have tried to train myself such that should I see, say, two bars on a histogram that appear quite different and yet fail appropriate comparative statistical tests (P>0.05), well, then they are in fact the same. That is, I can't be sure that what I'm looking at is not the result of random chance (which is the null hypothesis being tested). Designers of clinical trials live (and sometimes die) by rigorously testing the null hypothesis, and rightly so. There is a reason why Phase III trials typically involve thousands of people. It appears to me that many physicians (who should know better - I'm looking at you Steve Nissen) and pundits (who alas probably should not) have failed to remember this. The data from APPROVE only reached statistical significance vs placebo at 18 months, not before nor after. Therefore, the only valid conclusion is that at other times the effect was the same as placebo, despite the appearance of trends to the contrary. I doubt most of you buy that, so forget about convincing a jury...
On another note, is there any drug program anywhere where the scientists involved aren't worried about possible mechanism-based adverse effects? If you dug up e-mails, say, from early statin programs, what nasty possibilities would you find being discussed there? If drug companies folded up their tents every time a scientist voiced a concern, nothing would ever get done. Sometimes, preclinical species will only tell you so much, and you have to test your worries in humans. And sometimes even then you can be wrong.

My sense is that the industry has been acting as if randomized clinical trials (RCTs) are the only way to ascertain both positive effects and negative effects of drugs, and has set p < .05 as the bar in both cases.

One must consider downsides in both situations of drug testing, however. The point I believe Graham is making is that while a successful RCT must show a therapeutic drug effect at p < .05, dismissing a safety analysis merely because it comes in above p = .05 is probably cavalier, as the downside is potentially quite serious. This is especially true in the setting of "blockbuster drugs" with millions of users, where the absolute number of affected patients could be relatively large (e.g., 100,000 MIs and CVAs).

In other words, Graham is pointing out that there is what I call an asymmetry between evaluating a drug for positive effects vs. evaluating a drug's risk. One shoe does not fit all such drug testing.

What Gilmartin is saying, on the other hand, requires deeper consideration with respect to p values. By saying that "you can't take a study like this (i.e., retrospective review of HMO data) and take a patient population and extrapolate those kinds of numbers," he is indicating that he believes p>.05 in this scenario. Here's the problem: what is the p value, exactly, under such circumstances? Is p>.10 (one in ten chance that findings are unreliable)? Is p>.50 (50-50 chance or greater the results are not to be believed)? Is p>.75 (three out of four chances we should ignore the findings?) Truth is, the p value of such studies has not really been carefully considered and may be difficult to derive with certainty. Gilmartin is thus making a statistical value judgment.

It is also important to note, as in the Aug. 2005 article "Why Most Published Research Findings Are False" by Ioannidis in the journal PLoS Medicine, that:

... the high rate of nonreplication (lack of confirmation) of research discoveries is a consequence of the convenient, yet ill-founded strategy of claiming conclusive research findings solely on the basis of a single study assessed by formal statistical significance, typically for a p-value less than 0.05. Research is not most appropriately represented and summarized by p-values, but, unfortunately, there is a widespread notion that medical research articles should be interpreted based only on p-values. Research findings are defined here as any relationship reaching formal statistical significance, e.g., effective interventions, informative predictors, risk factors, or associations.

Ioannidis suggests that the positive predictive value (PPV) of a study is the critical issue. He reminds us that PPV depends on many factors, including pre-study probability of the hypothesis being true; study size; effect size; number and preselection of tested relationships; flexibility in designs, definitions, outcomes, and analytical modes; financial and other interest and prejudice; and the number of teams involved in a scientific field in chase of statistical significance. This article is well worth reading.
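The pre-study odds do most of the work here. Using Ioannidis's expression PPV = (1 − β)R / (R − βR + α), where R is the pre-study odds that the tested relationship is true, a short sketch (the two R values below are illustrative choices, not figures from the paper):

```python
def ppv(alpha, power, R):
    """Positive predictive value of a formally 'significant' finding
    (Ioannidis 2005): R = pre-study odds the relationship is true,
    alpha = Type I error rate, power = 1 - beta."""
    beta = 1.0 - power
    return power * R / (R - beta * R + alpha)

# Well-motivated hypothesis, decently powered trial:
print(f"R = 0.25: PPV = {ppv(0.05, 0.80, 0.25):.2f}")  # 0.80
# Long-shot association fished out of many candidates:
print(f"R = 0.01: PPV = {ppv(0.05, 0.80, 0.01):.2f}")  # 0.14
```

Same test, same p-value cutoff - but the prior plausibility of the hypothesis moves the chance that a "positive" finding is actually true from four in five down to about one in seven.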

In many clinical studies of adverse events, I would remark that while p may not be less than .05, or the positive predictive value not as high as one might like, they are in a range where ignoring the study results in today's pharma-hostile, litigious environment is likely a Clint Eastwood "do we feel lucky today?" exercise. Vioxx may be the poster-child case for this.

The Gartner Group picked up on some of the points I'd raised, spoke with me and published some of the points in its 2006 "Industry Predicts" advice for pharma.

In the section entitled "Underutilization of analytical tools to review clinical study data will obscure the risks of approved drugs" the Gartner report states:

- The swift and severe judgment in favor of the plaintiff in the first Merck Vioxx trial sent a shock wave through the biopharma industry. It shows that biopharma manufacturers must do more to ensure that healthcare providers and the public have an accurate, ongoing assessment of medication risks. Biopharmas must also ensure that information on these risks is communicated promptly in an open, understandable manner. Posting clinical trial information on a web site is one step toward greater transparency, but does not provide information in a way that enables ... comparisons of benefits and risks.

- ... It is still well recognized that all the possible side effects of a medication cannot be uncovered using a randomized sample of study subjects. The true test of safety and efficacy can only be determined when trial data is combined with other sources of information such as clinical encounters, adverse events (MedWatch) or observational studies (National Registry of Myocardial Infarction).

- In the future, it is hoped that the EMR system will capture point-of-care information in a standardized format that can be used for drug surveillance. Today, biopharmas must be content with these other available, if imperfect, information stores.

- Biopharmas that ignore the opportunity to use analytical tools to proactively review contradictory sources of study information (for example, pre- and post-approval clinical data sets, as well as registries) will miss essential signals regarding product safety. Yet today, only a small percentage of biopharmas routinely utilize personnel with medical informatics backgrounds to search for adverse events in approved drugs.

- Biopharmas ... should look at risk from multiple perspectives ... they must also get actively involved in defining the electronic health and medical record so that it will contain the type of information required to make better safety assessments in the future.

The recognition of a gap in formally trained medical informatics personnel in the pharmaceutical industry was welcome.