Fragility index is too fragile

Here is an issue that has had a lot of publicity and Twittering in the clinical trials world recently. Many people are promoting the use of the “fragility index” (paper attached) to help with the interpretation of “significant” results from clinical trials. The idea is that it gives a measure of how robust the results are: how many patients would have to have had a different outcome to render the result “non-significant”.
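For concreteness, here is a minimal sketch of how such an index could be computed for a two-arm trial with binary outcomes: flip outcomes in one arm, one patient at a time, until a two-sided Fisher exact test crosses 0.05. (The function names, the flipping rule, and the threshold are my illustration of the general idea, not necessarily the paper’s exact algorithm.)

```python
from math import comb

def fisher_p(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]:
    sum of hypergeometric probabilities no larger than the observed one."""
    r1, r2, c1 = a + b, c + d, a + c
    denom = comb(r1 + r2, c1)
    def prob(x):  # P(x events in row 1 | fixed margins)
        return comb(r1, x) * comb(r2, c1 - x) / denom
    p_obs = prob(a)
    lo, hi = max(0, c1 - r2), min(r1, c1)
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_obs * (1 + 1e-9))

def fragility_index(e1, n1, e2, n2, alpha=0.05):
    """Count how many non-events in arm 1 must be flipped to events
    before the result is no longer 'significant' (illustrative helper)."""
    flips = 0
    while e1 < n1:
        if fisher_p(e1, n1 - e1, e2, n2 - e2) >= alpha:
            return flips
        e1 += 1  # convert one non-event in arm 1 to an event
        flips += 1
    return flips
```

A small index (one or two flips) is the paper’s warning sign that a “significant” result hangs on the outcomes of just a few patients.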

Lots of well-known people seem to be recommending this at the moment; there’s a website too (http://fragilityindex.com/ , which calculates p-values to 15 decimal places!). I’m less enthusiastic. It’s good that problems of “statistical significance” are being more widely appreciated, but the fragility index is still all about “significance”, and we really need to be getting away from p-values and “significance” entirely, not trying to find better ways to use them (or shore them up).

Thought you might be interested / have some thoughts, as it’s relevant to many of the issues frequently discussed on your blog.

My response: I agree, it seems like a clever idea but built on a foundation of sand.

36 Comments

I am unfamiliar with the ‘fragility index’, but I understand the need to clarify results of clinical trials.

I am having a very, very hard time explaining a Bayesian analysis of a clinical trial. I estimate a posterior probability of 97% that the experimental therapy generates better outcomes than the control therapy, given the model and data generating conditions, etc.
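A posterior probability of superiority like this can be illustrated with a simple conjugate model for two arms with binary outcomes (the counts below are made up for illustration, not from the actual trial):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical successes / patients per arm (illustrative numbers only).
succ_exp, n_exp = 45, 60
succ_ctl, n_ctl = 32, 60

# Beta(1, 1) priors; conjugacy gives Beta posteriors for each success rate.
draws_exp = rng.beta(1 + succ_exp, 1 + n_exp - succ_exp, 100_000)
draws_ctl = rng.beta(1 + succ_ctl, 1 + n_ctl - succ_ctl, 100_000)

# Posterior probability that the experimental arm has the higher success rate.
p_better = (draws_exp > draws_ctl).mean()
```

The real analysis would of course involve a richer model, but the summary statement has the same form: a probability that one rate exceeds the other, conditional on the model.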

Everyone, from PIs to research staff to clinicians, without exception asks me, “Is that significant?” That is the sort of statement they hear in all the pharmaceutical TV commercials, and it seems to be the only pertinent piece of information for consumers.

I’m suspicious of many of these Bayesian posterior probabilities for the same reason that I’m suspicious of p-values. A posterior probability, like any statistical inference, is conditional on a model. And if the posterior probability is based on a flat prior with noisy data, I don’t believe it. See here for further discussion of this point: http://www.stat.columbia.edu/~gelman/research/published/pvalues3.pdf

I always advocate analyses using hierarchical models and informative priors, and everyone likes the idea. Another thing that seems to get their ears is predictive probabilities of particular effects under different treatment regimens. At least with that method I can include prior uncertainty about the generalizability of the clinical trial results.

BUT, they still want to have that ‘bottom line’ statement to pass on to colleagues, patients, and customers.

I think it makes sense to have a bottom-line statement, but then I think the bottom line should be more direct. For example: If this treatment were performed nationally on X people, we estimate it would cost Y dollars and save Z lives. The posterior probability that the new treatment is better than the old treatment is not a bottom-line statement!

A clinician meets with a patient and has to decide what therapy to apply and how to justify it to the patient.
I don’t see a problem with the clinician stating that, given all the information at hand, this new therapy X has (say) 80% chance of providing greater relief than therapy Y.

I actually strongly disagree that final cost/benefits should be the real bottom line. Final costs are very dynamic! It may cost Y dollars now and save Z people. But technology marches on! After it’s used by the public, maybe costs reduce to Y – epsilon and saves Z + delta people and so the conclusion changes. Trying to model all that is simply impossible.

Is saying “that’s impossible, I give up!” a good idea? I think not. But it’s very reasonable to ask the question “does this really have a more positive effect than current alternatives?” and then ask the question of costs later. If the answer is “yes”, we’ll eventually get answers about impact on society and hopefully it’s positive. If the answer is “no”, we should be wary about putting our eggs in the basket of this new product. (Note: I do recognize that significance testing doesn’t formally answer this question.)

This is somewhat similar to the problem of futility analyses in clinical trials: the idea being that we stop if the probability of a significant result (i.e. a payout for the company) looks very low. But this leaves the company in an awkward spot: suppose we stop for futility, yet the credible interval still contains a clinically significant value (which should almost always happen if we stop for futility). Then what? Restart the whole trial?

Sometimes gaining a small piece of knowledge is a very reasonable goal.

I think it would be fine to perform inferences on intermediate quantities such as costs and benefits under a static regime. That still makes more sense to me than reporting the posterior probability that the average benefit is zero, or the p-value, or various other standard summaries that are also based on static assumptions.

Glad to hear! I am shying away from reporting 95% credible intervals, due to the tendency to focus only on the exclusion of 0, rather than on capturing the uncertainty or summarizing the parameter estimate under the model’s assumptions. My general approach, in a field dominated by NHST, is to use a prior of N(0, 1) or even narrower depending on the question, a multilevel model almost always, report 80% intervals, and then posterior probabilities of a positive or negative effect. In addition, I use pp-checks, not to select a model, but to determine where my model may not describe the observed data very well (if it is that “bad”, I of course change things up). With this approach, I think the maximum and minimum values can be really informative.

This results in more conservative (or maybe a better term is more reasonable) estimates than those obtained using lme4 or nlme. For example, the effects are smaller (which is more realistic), less noisy, and pulled towards zero.

I unfortunately do not have a bottom-line statement based solely on the results. I try to relate things to the literature at large, in combination with my results. However, I am still somewhat unsure of what to say, other than that the posterior captures our belief. Does a posterior probability of a positive effect of 97.5% (95% CI excluding 0) carry substantially more evidence than, say, 95.7%? I do not think so, as I would also be interested in other information, for example the precision of the estimate.

Still, no bottom line statement, which might be good because then it might force people to actually read the paper, rather than scanning for significance.
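The reporting scheme sketched above (80% intervals plus the posterior probability of a positive effect) reduces to a few lines once you have posterior draws. Here the draws are simulated as a stand-in for actual MCMC output from a fitted multilevel model:

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for posterior draws of a treatment effect; in practice these
# would come from the fitted model, not from a normal simulation.
effect_draws = rng.normal(0.3, 0.2, 10_000)

lo, hi = np.percentile(effect_draws, [10, 90])  # 80% credible interval
p_positive = (effect_draws > 0).mean()          # P(effect > 0 | model, data)
```

The point is that both summaries are cheap functions of the same draws, so reporting the interval does not preclude also reporting the directional probability.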

I re-read this paper yesterday, and one thing I realized was that basically what you were saying about p-values was: just use them sensibly, and if you do, then it’s fine. The same would then hold for posterior probabilities. The problems start when people start using them as strength of evidence.

>The problems start when people start using them as strength of evidence.

But that’s one of the key things we want from any empirical analysis. An indication of the strength of evidence there is for a claim or a set of claims. This strength of evidence assessment can be conditional on clearly stated and discussed assumptions, of course.

If the p-value or the likelihood ratio or the posterior distribution doesn’t give us strength of evidence, what does?

>”estimate a posterior probability of 97% that the experimental therapy generates better outcomes than the control therapy”

This is still NHST; changing to that approach won’t solve anything. If that is the alternative, it would be better to just leave the “is it significant” people alone so that it can be used as a shibboleth.

The scientists (not the statistician) are supposed to come up with models that make predictions which can be compared to the data. Then the statistician helps them find a plausible range for some summary statistics or model parameters based on the data. These are then compared to the predictions.

‘The scientists (not the statistician) are supposed to come up with models that make predictions which can be compared to the data. Then the statistician helps them find a plausible range for some summary statistics or model parameters based on the data. These are then compared to the predictions.’

This is an important point, but I certainly don’t work on clinical trials like this. Frankly, I haven’t heard of clinical trials conducted that way. Not to say that it isn’t done.

I’ve never heard of that being done either. I guess it’s really hard to come up with a sensible and moderately realistic model of a therapy’s effects because it’s very complex and our understanding of what treatments are doing to us is so poor.

“The courts have also remarked that…rates are often very sensitive to small changes…. The reluctance to rely on small samples in such cases seems to be based on an intuitive appraisal of statistical significance….To the extent that pure randomness is taken to be the appropriate standard, this plausible-sounding concern is…without merit — a formal significance test is more effective in judging departures from randomness than is the court’s intuition” Meier, Sacks and Zabell, “What Happened in Hazlewood,” 1984.

Of course, that “to the extent” is doing a lot of work there, as is the notion that a bright line 5% p-value standard ought to have any relevance to a court. But the response to “only 5 more failures would have reversed your significance result” is “and with five more successes the effect would be even stronger.”

That said, this is all in a context with no forking paths. Where the analysis standard is malleable (through exclusion of problematic cases, or choices of adjustments, or choices of subgroupings), fragility analysis is useful for putting some sort of meat on the question of how little or how much p-hacking one has to do to achieve the requisite 5%. And that might be useful to know, so long as NHST is the metric we’re using.

What surprises me the most about this paper (I know many of the authors and was a co-author with one of them) is that it’s done in the context of a single study – a single study should never be the basis for determining belief in whether a treatment has an effect.

(I only scanned the paper, but I searched for “meta-analysis” and “systematic review”; meta-analysis was mentioned only once, and only as backing up a fragility index of 1 as an indication not to believe the original study. Maybe they should have just said the fragility index is a rough and vague way to predict what a later meta-analysis might more thoroughly discern. We had floated the same idea in the context of meta-analysis – something like how many negative studies of average size would need to exist to overturn the overall result – http://annals.org/aim/article/702054/meta-analysis-clinical-research Definitely not rocket science, but it was all we could think of doing at the time.)

So if we junk NHST entirely how do we succinctly convey the complex results of a study analysed by alternative (Bayesian?) methods to a busy clinician for actionable intelligence in his daily practice?

As I wrote in comment above, I think it makes sense to have a bottom-line statement, but then I think the bottom line should be more direct. For example: If this treatment were performed nationally on X people, we estimate it would cost Y dollars and save Z lives.

Sure, predictions for particular patients would be good too. I guess I was thinking of an earlier stage in the process, where decisions have to be made about getting a treatment approved, recommended, and reimbursed.

The process of treatment approval, recommendation, and reimbursement has until now (or until just recently) been static and based on RCTs with small numbers of unrepresentative patients (those allowed into the trial who volunteered), treated in ways different from usual practice. Usually those RCTs are pre-approved and audited down to the level of individual data, while merely published RCTs (e.g. with no access to raw data) are mostly ignored.

You write, “What’s the size of the “patients like your patient” cohort? If you are too selective you get no patient or just throw away good data.”

No: you use multilevel regression and poststratification! The fitted model will partially pool from other patients like yours. No way you’d just partition the data and separately analyze subgroups. You’d have to analyze all the data and then draw specific inferences, just as we do when we estimate state-level opinions from national polls.
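The poststratification step is simple once the multilevel model has produced partially pooled estimates for each patient subgroup: weight the cell-level estimates by population cell counts. A toy sketch with made-up numbers (the model fit itself is assumed and not shown):

```python
import numpy as np

# Hypothetical model-based estimates of treatment benefit for each
# patient subgroup (cell), partially pooled by a multilevel model.
cell_estimate = np.array([0.12, 0.08, 0.05, 0.10])

# Population counts for those same cells, e.g. from a registry or census.
cell_count = np.array([4000, 2500, 1500, 2000])

# Population-level summary: count-weighted average of cell estimates.
overall = np.average(cell_estimate, weights=cell_count)
```

To target “patients like your patient,” you would reweight using the counts for the relevant subpopulation rather than the whole population; the model, and hence the partial pooling, stays the same.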

This paper by Liu and Meng deals with exactly the topic of how closely to match the reference set to the patient whose treatment is being assessed: There is Individualized Treatment. Why Not Individualized Inference? https://arxiv.org/abs/1510.08539

They argue that frequentist, Bayesian and likelihood based statistical inferential methods lie on a continuum with the frequentists at the unindividualised reference set end and the likelihoodists at the most individualised reference set end. Very interesting indeed.