“You don’t want to have on the books a conviction for a practice that many scientists do, and in fact think is critical to medical research,” says Steven Goodman, an epidemiologist at Stanford University in California who has filed a brief in support of Harkonen…

Goodman, who was paid by Harkonen to consult on the case, contends that the government’s case is based on faulty reasoning, incorrectly equating an arbitrary threshold of statistical significance with truth. “How high does probability have to be before you’re thrown in jail?” he asks. “This would be a lot like throwing weathermen in jail if they predicted a 40% chance of rain, and it rained.”

I don’t think the case at hand is akin to the exploratory research that Goodman likely has in mind, and the rain analogy seems very far-fetched. (There’s much more to the context, but the links should suffice.) Lawyer Nathan Schachtman also has an update on his blog today. He and I usually concur, but we largely disagree on this one. I see no new information that would lead me to shift my earlier arguments on the evidential issues. From a Dec. 17, 2012 post on Schachtman (“multiplicity and duplicity”):

So what’s the allegation that the prosecutors are being duplicitous about statistical evidence in the case discussed in my two previous (‘Bad Statistics’) posts? As a non-lawyer, I will ponder only the evidential (and not the criminal) issues involved.

“After the conviction, Dr. Harkonen’s counsel moved for a new trial on grounds of newly discovered evidence. Dr. Harkonen’s counsel hoisted the prosecutors with their own petards, by quoting the government’s amicus brief to the United States Supreme Court in Matrixx Initiatives Inc. v. Siracusano, 131 S. Ct. 1309 (2011). In Matrixx, the securities fraud plaintiffs contended that they need not plead ‘statistically significant’ evidence for adverse drug effects.” (Schachtman’s part 2, ‘The Duplicity Problem – The Matrixx Motion’)

The Matrixx case is another philstat/law/stock example taken up in this blog here, here, and here. Why are the Harkonen prosecutors “hoisted with their own petards” (a great expression, by the way)?

The reasoning seems to go like this: Matrixx could still be held liable for securities fraud for failing to report adverse effects related to its over-the-counter drug Zicam, even if those effects were non-statistically significant. If non-statistically significant effects should have been reported in the Matrixx case, then Harkonen’s having reported the non-statistically significant subgroup is in sync with the government’s requirement. To claim that Matrixx should report non-statistically significant risks, and then to turn around and claim that Harkonen should not report non-statistically significant benefits, is apparently inconsistent.

Really?

The two cases are importantly disanalogous on a number of grounds. Specifics can be found in the Matrixx posts cited above. In fact, one might argue that the Matrixx case actually strengthens the case against Harkonen. Giving an overly rosy picture of, or downplaying, information about potential regulatory problems with a drug is likely to be deceptive for investor assessment. Even moving away from the fact that Matrixx concerns securities fraud, and granting that the ruling (by the Supreme Court) mentions, as an aside (or obiter dicta), that:

(1) The absence of statistical significance does not preclude there being a warranted ground for inferring (or claiming to have evidence that) a drug caused an adverse side effect.

This is still very different from claiming

(2) The absence of statistical significance (in the case of a post-data subgroup) provides a warranted ground for inferring (or claiming to have evidence that) this drug–with its own serious side effects– has a survival benefit.

But there is a lesson: When it comes to evidence that is relevant to regulation and policy, alterations of methodological standards initially made in the interest of strengthening precautionary standpoints may be (and often are) used later to weaken precautions. Tampering with standards of evidence intended to increase the probability of revealing risks to the public tends to backfire. Admittedly, the obiter dicta** in the Matrixx case, at least, are open to lawyerly exploitation to undermine the government’s position in the current case.

*Harkonen himself claimed the report was intended for investors.

**Here the “dicta” are throwaway remarks by the Supreme Court on (lack of) statistical significance and causal inference. See earlier post here.

I would also disagree with the alleged ‘transposition fallacy’ remarked on in Schachtman’s new blog, but I don’t have time to do more than record this here.


I think I have made some inroads into your thinking about these cases, based upon some of your comments above. I realize that I will not persuade you on some of my views, but I can perhaps expand on my previous comments with respect to the Matrixx case and its bearing on the Harkonen prosecution.

In Matrixx, the Supreme Court held that causation need not be alleged, and therefore whether or not statistical significance was needed in one or more datasets, and whether or not some parsing of the Bradford Hill factors was needed, became irrelevant. Why? Because the FDA can require regulatory action — a revamped warning label, a withdrawal from the market, a new drug application submission — without a showing that the drug causes harm. The reasons for this regulatory power lie in the balancing of risks (real or potential) against benefit. There are cases in which there is so little benefit, or in which there are alternative drugs with better safety profiles, that a hunch or a suspicion of harm can tip the balance in favor of regulatory action.

Now the Matrixx case was a civil securities fraud case. Plaintiffs claimed to have lost money by purchasing shares and then suffering an economic loss when the FDA took action that hurt (and then stopped) sales. What triggered liability was not any statement that Matrixx made, but its failure to disclose case reports and other pieces of evidence when it chose to make bullish marketing projections for the product, Zicam. If the company had not made these projections, its duty to speak about the potential negatives never would have arisen. In the Harkonen case, the company could not remain silent; it had to speak to the market about the clinical trial data under the SEC regs, because the company had disclosed the data to medical analysts who were not within the company. (Still, Harkonen was not prosecuted for criminal securities fraud. The civil securities fraud class action against him and the company was dismissed. He was prosecuted for misbranding and for wire fraud. The complaining agency was not the SEC or the FDA; it was the Veterans Administration.)

So Matrixx, properly understood, has little or nothing to do with the Harkonen case except that the government in the Matrixx case took the plaintiffs’ side on an issue that was raised by the company when it contended that plaintiffs had to allege causation and a properly supporting set of facts to support that allegation. (Remember there was no actual evidence involved in Matrixx; it was all about whether the plaintiffs properly pleaded their claims. No one testified. No studies were before the court, etc.) With Matrixx having taken this position about what plaintiffs must allege, the government responded, agreeing with plaintiffs, that IF CAUSATION WERE TO BE REQUIRED TO BE PLEADED, THEN STATISTICAL SIGNIFICANCE IN SOME DATA SET SHOULD NOT BE ALSO REQUIRED. The Supreme Court rejected the antecedent; the consequent falls out of the case.

This is why the Supreme Court’s discussion of statistical significance as unnecessary is “dicta”; that is, the discussion is unnecessary to the holding, and not binding on any courts below, or on the Supreme Court itself. As I have pointed out in several places, the Supreme Court’s scholarship was seriously flawed, but that’s a digression for another day. Matrixx’s own legal and statistical scholarship was flawed as well. It does not make sense to point to the lack of statistically significant data when the alleged facts are case reports that do not admit of any statistical analysis. Given the dubious utility of the product, the FDA could still yank the product from the market on mere case reports, and that’s enough to support the 9-0 vote in the Matrixx case.

In the Harkonen case, the antecedent of the conditional is in full play, and the government’s position is that statistical significance is not needed to make the claim for causation. I commend a reading of the actual brief, which I have quoted liberally in some posts on my blog, and I believe I have linked to the document so it can be downloaded. The government did not suggest any limitations to its no statistical significance position; and the cases it cited demonstrate a profound misunderstanding of causal inference, in science and in the law. (And the Court adopted much of the verbiage of the government’s brief without any apparent independent thought.)

So Matrixx and Harkonen are, as you note, very different cases, and in more ways than one. One is civil and the other criminal. In one, the misrepresentation is the alleged failure to disclose case reports and other information when making bullish marketing projections. In the other, the misrepresentation claimed is the affirmative assertion of a causal inference (of therapeutic efficacy) from two clinical trials, with all the details that you have alluded to and others that I have discussed on my previous posts. Despite the differences in these cases, the contradiction between the government’s positions taken in Matrixx and in Harkonen could not be more extreme. And the government, in its Matrixx amicus brief, emphasized that its position applied to causation both in terms of efficacy of therapies and production of adverse effects.

I would be interested in hearing a defense of Calloway’s statement, in Nature, that only “slightly fewer” patients had died on interferon γ-1b than on placebo, but that the difference was not statistically significant “because the probability that it was not due to the drug was greater than 5%, a widely accepted statistical threshold.” The probability that it was not due to the drug would seem to be (1 − probability that it was due to the drug).

NAS: I appreciate your many clear articulations of the ins and outs of the Matrixx case, and I’ve learned a lot from them. Perhaps I can say to readers lost in this morass, that if they read one thing, it should be the Supreme Court Matrixx decision, which includes a stray remark about significance tests. I wrote about it here: http://errorstatistics.com/2012/02/08/distortions-in-the-court-philstock-feb-8/

The main connection to this long, dragged-out Harkonen appeal, to put it very crudely (but please see my post), is this: if the Supreme Court says, in a side remark, that one drug company should report a fairly obvious side effect, even though no statistical test was run, does it mean another drug company (in a completely different kind of case) needn’t be held accountable if they report p-values incorrectly when statistical tests ARE run?

It’s really just the logic that interests me.

On the transposition point, Nate, I’ll have to come back to it (the answer practically jumps right out of your sentence!)

I’m not so interested in this particular case so I’ll sidestep the legal questions and just say that I like the following statement of yours: “But there is a lesson: When it comes to evidence that is relevant to regulation and policy, alterations of methodological standards initially made in the interest of strengthening precautionary standpoints may be (and often are) used later to weaken precautions.” It reminds me of what seems to have happened with “statistical significance” in much applied research in psychology and medicine: originally (I assume), the rule of statistical significance was intended to protect people from jumping to conclusions based on noise. By holding to the “p less than .05” rule, you make it more difficult to report noisy results, with the idea being that what is actually reported is more likely to be correct. That is, you’re moving along the ROC curve.

Nowadays, though, statistical significance is often used the other way, to put the stamp of approval on noise patterns and give people confidence to report them as fact. I think that it’s great to have an understanding of variation and uncertainty, but sometimes I wonder whether we’d be better off if there were no threshold at all to what could be reported. Then researchers would have to judge their conclusions based on direct evidence and they could not simply hide behind a p-value in order to claim that what they’ve found is correct.

(Just to be clear: the above problem is not just with p-values, it would also hold with confidence intervals, posterior probabilities under flat priors, etc.)
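The worry above, that a significance threshold can end up putting a stamp of approval on noise, can be illustrated with a small simulation of the "significance filter": if only estimates reaching p < .05 are reported, the reported effects systematically exaggerate a small true effect. This is only a sketch; the effect size, sample size, and number of replications are made-up illustrative choices, not numbers from any study discussed here.

```python
# Sketch: selecting on statistical significance exaggerates small effects.
# All parameters below are hypothetical, for illustration only.
import math
import random

random.seed(1)
TRUE_EFFECT = 0.1          # small true mean difference, in sd units (assumed)
N = 50                     # per-group sample size (assumed)
SE = math.sqrt(2.0 / N)    # standard error of the difference in means

def two_sided_p(z):
    """Two-sided p-value for a standard-normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))

all_estimates, significant_estimates = [], []
for _ in range(20000):
    estimate = random.gauss(TRUE_EFFECT, SE)   # one simulated trial result
    if two_sided_p(estimate / SE) < 0.05:
        significant_estimates.append(estimate)
    all_estimates.append(estimate)

print("mean of all estimates:         %.3f" %
      (sum(all_estimates) / len(all_estimates)))
print("mean of significant estimates: %.3f" %
      (sum(significant_estimates) / len(significant_estimates)))
```

Under these assumptions, the unconditional mean of the estimates stays near the true 0.1, while the mean of the estimates that clear the threshold comes out several times larger: one concrete way the threshold can certify noise as fact.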

Totally agree that using p-values alone provides protection against noise, and not more than this.

But what do you mean by “direct evidence”? Some people might interpret this as confidence intervals – and if these end up being used only to the extent they include the null, or not, then we’re back to square one.

Andrew: I appreciate your reflections on this, and will reread them later, but the fact is that the underlying argument about p-values, and the ability to criticize his invalid p-values, are what enable holding Harkonen accountable. That is, the criticism depends upon having a platform for recognizing that things like hunting for significance, trying different subgroups and endpoints, etc., result in a higher probability of erroneously inferring evidence of risk or benefit. This is true even if you think free speech allows him to send out effusive memos on how great the trial results were. It is a classic case (but perhaps one of the first to hold an individual, instead of just the company, accountable). The company itself, InterMune, expressed horror, admitted blame, and paid a penalty. So as much as I detest the pseudoscientific dichotomous NHST (a methodology that exists only as an abuse), statistical analysis in the context of controlled trials (generally performed by drug companies) is pretty sophisticated. Quite frankly, there’s too much money to lose to have it discovered later that your drug doesn’t work or has toxic side effects. Removing an error probability analysis would be a disaster for the companies, regulators, and consumers.

The interesting issue concerns what the Supreme Court said about a completely different case, and Harkonen’s attempt to defend himself as a result.

The Supreme Court did not say that the drug company should report a fairly obvious side effect, even though no statistical test was done. There was no evidence in front of the court; there was no expert testimony. The Matrixx case was all about pleading requirements. The Court held that the plaintiffs pleaded an adequate complaint, without having to allege causation, and thus without having to allege statistical significance. Remember how the case got started. Plaintiffs filed a complaint; defendant moved to dismiss because the complaint was on its face inadequate.

You might think this is just lawyer talk, but it has real-world consequences. To be sure, there was a tussle between plaintiffs (with the government on their side) and Matrixx Initiatives over whether causation was relevant to the case at all. Plaintiffs properly won that battle, but they did not prove anything about “side effects,” etc. Side effects imply that Zicam caused them; and causation was ruled unnecessary.

As for your comment:

“But there is a lesson: When it comes to evidence that is relevant to regulation and policy, alterations of methodological standards initially made in the interest of strengthening precautionary standpoints may be (and often are) used later to weaken precautions.”

Keep in mind, staying on the FDA/Zicam example, that the agency has much stricter standards for drug approval than for drug removal. Zicam was an “over-the-counter” preparation, which was regarded as sufficiently safe to not require a New Drug Application. There were no studies showing safety or efficacy submitted to the FDA for marketing approval. There were certainly no double-blind randomized clinical trials of Zicam. This situation is very different from licensed, prescription medications. Given the status of Zicam, the FDA had much greater latitude in deciding to recall it, or to impose major warning label requirements.

Bringing a drug to market on a New Drug Application does require a much stronger showing of both efficacy and safety. I won’t bore you with the regulatory details, but the phase III clinical trials are generally RCTs. In the Harkonen case, independent Austrian researchers had published a small RCT with very strong results in the NEJM, back in 1999. The company, InterMune, then sponsored a follow-up on the Austrian study, and then another RCT from scratch. The press release at issue in the Harkonen case referenced the Austrian study and follow-up, as well as the company-sponsored RCT. As I noted in my recent post, the company’s RCT was published in the NEJM (2004) as well, with the hazard ratio for mortality at 0.3, statistically significant to boot. (You somehow lost track of that in your characterization of the evidence.)

Despite the very low hazard ratio for the entire trial population (not a subgroup), with p = 0.02, I believe, I am comforted by, and share, your detestation of “the pseudoscientific dichotomous NHST.” That is, after all, the crux of the amicus brief that Ken Rothman, Tim Lash, and I filed with the Supreme Court. Nor do I think rejecting the dichotomous test means that the analysis and interpretation of the p-value, or C.I., is meaningless.

As for the transposition fallacy:

You suggested my last sentence made your point. My last sentence, however, simply stated the complement of the probability described by Mr. Calloway as giving the meaning of the attained significance probability:

“The probability that it was not due to the drug would seem to be (1- probability that it was due to the drug).”

But it seems clear that both probabilities I reference in the sentence, immediately above, are posterior probabilities, and not probabilities of observing data at least as extreme as that observed, assuming that there is no association between therapy and mortality.

I might note that NHST (an invented animal) is not the same as statistical significance testing, whether “pure” or Neyman-Pearsonian. I’m sorry I don’t have time to write in more detail on the points Nathan raises, but I think the key issues are all in the links in this post.

“Logically, one cannot assume something to be true as part of a calculation, and then use that calculation to measure whether the incorporated assumption is true. For this reason, and because the p-value also depends on study size, it does not measure the probability that the data are meaningful.”

“The p-value does not give the probability that the null hypothesis is correct (i.e., that the data would occur by chance), because it cannot measure the “correctness” of a hypothesis assumed to be true in the course of its calculation; nor does the p-value provide a measure of the probability that an observed result is correct or meaningful.”

I don’t get these passages. In the second, I don’t get the meaning of everything after “because”.

Your “invented animal” led me to think of hybrids such as the mule, and I ratcheted up your metaphor a bit.

I assume you are ok with:

1. p-values do not measure the probability that the data are meaningful

2. p-values depend upon sample size

3. The p-value does not give the probability that the null hypothesis is correct

The government’s Matrixx brief (and I believe its briefs in Harkonen as well) asserts that the p-value gives the probability that the null hx is correct.

The “correctness” of the hypothesis (either the null or the alternative) would be a posterior probability, the probability that there is an association given the data. The p-value gives the probability of the data observed or more extreme given the null, on the assumption of no bias or confounding, and the correctness of the probability distribution assumed. I suppose Prof. Goodman would argue that there are Bayesian methods to move from p-values to posterior probabilities, but that’s not an argument I find persuasive.
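The distinction drawn above, between the p-value (the probability of data at least as extreme as observed, given the null) and the posterior probability that the null is correct, can be made concrete with a toy simulation. Everything in it is an illustrative assumption: a world where half of all tested hypotheses are true nulls, a fixed effect size when an effect exists, and a simple z-test.

```python
# Toy sketch: Pr(null | p < 0.05) is not 5%, even though each test
# uses a 0.05 cutoff. All parameters are hypothetical assumptions.
import math
import random

random.seed(2)
SE = 1.0        # standard error of each estimate (assumed)
EFFECT = 1.0    # size of the real effects, when present (assumed)

def two_sided_p(z):
    """Two-sided p-value for a standard-normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))

sig_count = sig_and_null = 0
for _ in range(100000):
    is_null = random.random() < 0.5          # half the hypotheses are true nulls
    mean = 0.0 if is_null else EFFECT
    p = two_sided_p(random.gauss(mean, SE) / SE)
    if p < 0.05:
        sig_count += 1
        sig_and_null += is_null              # count significant true nulls

# Among "statistically significant" results, what fraction are true nulls?
print("Pr(null | p < 0.05) is approximately %.3f" % (sig_and_null / sig_count))
```

Under these assumptions the share of true nulls among the significant results lands well above 5%, so reading "p < 0.05" as "less than a 5% probability that the null is correct" (the transposition at issue) is simply wrong.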

Nate: I take it your brief was intended to state uncontroversial, plain Jane, points and definitions of the statistical notions, to show how wrong the other side’s conceptions are*. But I’m not at all sure it does (and this is the first time I’ve read it through**). For example, one need not agree that a posterior probability is a proper “measure of correctness” of statistical hypotheses. But my main puzzle has to do with the strange, or at least unfamiliar, assertions made above*** (quoted in my last Oct. 11 comment). They may not be strange if I could decipher them.

*I may be wrong, not really knowing the status of amicus briefs in an adversarial legal proceeding.
**You made me feel guilty for my previous short take.
***e.g., “Logically, one cannot assume something to be true as part of a calculation, and then use that calculation to measure whether the incorporated assumption is true.”

I don’t mean to distract you from your manuscript, but if it helps you retain your hair, then here goes:

I think you may be reading more into our amicus brief than is there. When we say that a p-value does not give a measure of probability about the correctness of an hypothesis, we are not thereby saying that the posterior probability is (therefore) the proper measure. We did not say or imply this. All we are trying to say is that you should not confuse a p-value with a posterior probability, and that a p-value doesn’t give the probability that the null hx is correct, or the probability that the data are meaningful, etc., etc.

We did not advance the propriety of calculating or intuiting a posterior probability as a (or the) proper “measure of correctness” of a causal hypothesis.

Nate: You are saving my hair, if only because your brief reminds me of some things that demand severe clarification (which is what this book is trying to do).

You haven’t explained the central parts of the passages I’ve asked about, what’s going on there?

I find it odd to be giving a list of what p-values are not, if nobody especially needs, or claims them to be, those things. Plus what matters here would be how p-values can be USED to appraise evidence, and how, if misused, a very misleading indication of the evidence can result.

As for the allegations that the other side misinterprets p-values, I’ll have to find where they are stated, or you can point me to them.

Why cover what p-values are not? Because the government misstated what they are, and the government claimed they were necessary in Harkonen, but not necessary in Matrixx. The point of the amicus brief is in part rhetorical, in part didactic. If the government can’t correctly say what p-values are, then it strikes us (my fellow amici and me) as peculiar that they want the imposition of a criminal sanction for someone who they claim has misused them. (Note that Fleming testified at trial that a Bonferroni correction could be done — not quite accurate for the particular trial at issue — but he never testified that the p = 0.004 would inflate to larger than 0.05 with an appropriate correction; note also that the time-to-event measure in the published paper on the prespecified mortality endpoint was statistically significant, p = 0.02, with a HR = 0.30. This makes my point about Calloway’s misleading journalism.)

In our amicus brief, we stated what p-values are. They are a continuous measure of the strength of the evidence against the null hypothesis. We criticized Fleming for his insistence upon a dichotomous assessment of p-values with respect to the null. I would have to go back to his testimony, but I believe that there were places he came close to saying that if p > 0.05, we must accept the null, which is bizarre in failing to distinguish between “failing to reject” and “accepting.” But he needed to take this extreme position in order to say that Harkonen’s statement about “demonstrating” was objectively false.
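As a minimal sketch of the Bonferroni correction mentioned above: the adjustment multiplies a p-value by the number of comparisons, capped at 1. The comparison counts below are hypothetical; how many analyses should count against the subgroup p = 0.004 was precisely what was in dispute.

```python
# Sketch of the Bonferroni adjustment; the values of m tried below
# are illustrative assumptions, not the trial's actual multiplicity.
def bonferroni(p, m):
    """Bonferroni-adjusted p-value for one of m comparisons."""
    return min(1.0, p * m)

p_subgroup = 0.004   # the subgroup p-value from the press release
for m in (2, 5, 9, 10):
    print("m = %2d  adjusted p = %.3f" % (m, bonferroni(p_subgroup, m)))
```

Notice that even at m = 10 the adjusted value stays below 0.05 under this simple correction, which is in the spirit of the observation above that Fleming never testified the 0.004 would inflate past 0.05.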

Nate: Let’s forget the case itself, or cases. We both agree that SC shouldn’t have allowed “experts” like Ziliac and McCloskey (who misinterpret tests in their Matrixx brief, even interpreting p-values as posteriors) to influence them in any way (assuming they did, and perhaps they did not) to include the “throwaway” claim in the Matrixx case. I’ve no doubt that, in legal cases, it is fair and legitimate to fight fire with fire (i.e., exploit a misinterpretation in one case just as far as possible for another case). I’m totally out of my depth on that score; and if I need a lawyer in your area, I’ll want to hire you.

Nate: Here’s the separate comment on the logic of p-values. My question concerned remarks like: “Logically, one cannot assume something to be true as part of a calculation, and then use that calculation to measure whether the incorporated assumption is true.”

But of course one can, in just the way tests do: one may hypothetically assume the null or set of nulls (asserting no benefit as regards, say, pulmonary function), and use error statistical calculations (including p-values) to measure just how readily the observed results could have been generated even if there’s no actual benefit, and then appraise the null and alternatives. As you wrote: “data dredging, grasping for the right non-prespecified end point … implicates the problem of multiple comparisons or tests, with the result of increasing the risk of a false-positive”. This is one way to USE p-value reasoning to critically appraise the warrant of any rejections of the null, and perhaps assess magnitudes and so on.
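The hunting point in the passage above can be checked with a short simulation: if a researcher is free to try k subgroup analyses and report the best, the chance of at least one nominally significant result under a true null is 1 − 0.95^k. The number of subgroups here is arbitrary, and the simulation assumes independent tests, which real, overlapping subgroups are not.

```python
# Sketch: hunting across k analyses inflates the false-positive rate.
# k and the number of simulated trials are illustrative assumptions.
import math
import random

random.seed(3)

def two_sided_p(z):
    """Two-sided p-value for a standard-normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))

K = 10          # post-hoc subgroup analyses tried per trial (assumed)
TRIALS = 20000  # simulated null trials

false_positive_trials = 0
for _ in range(TRIALS):
    # every subgroup estimate is pure noise: no benefit anywhere
    ps = [two_sided_p(random.gauss(0.0, 1.0)) for _ in range(K)]
    if min(ps) < 0.05:
        false_positive_trials += 1

print("nominal per-test rate:      0.050")
print("simulated family-wise rate: %.3f" % (false_positive_trials / TRIALS))
print("theoretical 1 - 0.95**%d:   %.3f" % (K, 1 - 0.95 ** K))
```

With ten independent looks the family-wise rate lands near 40%, which is the error-statistical basis for demanding either prespecification or a multiplicity adjustment before treating a subgroup p-value as evidence.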

Schachtman sent me the govt. brief in the Harkonen case. Here are a few snippets, decide for yourself (I don’t have a link):

“The release stated that data from the … trial demonstrated a statistically significant survival benefit in patients with mild to moderate IPF with a p-value of 0.004, without stating that this was not a primary, secondary, or even a pre-specified exploratory endpoint for the trial.

…The p-value as portrayed in the press release was rendered false by the complete omission of any mention that the only results with a p-value less than 0.05 – the subgroup analysis of patients with mild to moderate IPF – were observed only after InterMune engaged in multiple retrospective analyses. Moreover, the press release also omitted the fact that the clinical trial protocol had nine secondary endpoints – of which survival was ranked only seventh most

….. using the press release’s FVC cutoff meant that in patients with severe IPF, there was a higher death rate among those on Actimmune than among those on the placebo, not what one would expect to see if Actimmune truly helped IPF patients live longer.
….
…..At the time it was issued, the press release was the only source of information available to the public about the trial results.”

Yes; that was the government’s position, which doesn’t make it the truth. The way this works is that the verdict winner can argue anything in support of the judgment whether it was actually decided or actually found by the court or jury below. Similarly, the government may characterize, in argument, that Harkonen “intended” this or that; that he had such-and-such “motive,” or that he “knew” something because another person (say Thomas Fleming) told him not to do it.

All that matters is that it was the government’s position and argument, and the government won the verdict. Don’t get carried away by the harrumphing of an appellee’s brief. I would be happy to share Fleming’s actual testimony with anyone who wants to look more closely and carefully at what actually was presented.

Mortality is always the most important endpoint, but clinical trial planning rarely allows trialists to prespecify mortality as a primary or a secondary endpoint. It usually is built in, as with Actimmune, as part of a composite primary. The Actimmune trial did make survival one of nine secondary endpoints. Why so many secondaries? Well, there are many ways to assess pulmonary function, not just FVC. You can look to diffusing capacity, or to arterial oxygenation, or to ventilatory flow rates, in large airways, in medium airways, in small airways, or to lung volumes. If you have ever seen a comprehensive pulmonary function test, you realize that it often has dozens of measurements, and there was no a priori sense in this trial, or others like it, of where the benefit might appear. That’s why the FDA signed off on nine secondaries, and why it was entirely reasonable under the circumstances. Note also that these were not statistically or biologically independent tests.

Finally, although you might think that the demarcation between advanced and mild-to-moderate pulmonary fibrosis is clear (and therefore Harkonen et al. were simply data dredging), you would be wrong. There are many competing guidelines on how to draw these lines, with none that dominates. Indeed, in the field of pulmonary function testing, there are competing predicted normals, competing protocols for assessment, etc. The use of any set requires some clinical reasoning, but using predicted normals from Mormons living in Utah may not be reasonable for industrial workers living in Detroit or Baltimore.

I know you would like to squeeze Dr. Harkonen into the dominant paradigmatic preconception of an industrialist who has juked the stats, but he won’t fit there. Just look at the published article, Raghu et al. N. Engl. J. Med. 350, 125–133; 2004, and its reported hazard ratio for survival on the entire trial population.