The “Big Think” on Statistical Inference

Remember the ASA statement on p-values from last year? The profession is getting together today and tomorrow (Oct. 11/12, 2017) for a “big think” on the problem. John Ioannidis, who has published widely on the reproducibility crisis in research, said this morning that “we are drowning in a sea of statistical significance” and “p-values have become a boring nuisance.” Too many researchers, under career pressure to produce publishable results, are chasing too much data with too much analysis in pursuit of significant results. The p-value has become a standard that can be gamed (“p-hacking”), opening the door to publication. P-hacking is quite common: the increasing availability of datasets, including big data, means the number of potentially “significant” relationships that can be hunted is growing exponentially. And researchers rarely report how many relationships they examined before finding something that rises to the level of (supposed) statistical significance. So more and more artifacts of random chance are being passed off as something real.
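To see how easily those chance artifacts arise, here is a quick simulation in plain Python. The group sizes, the 1,000 candidate “relationships,” and the 0.05 threshold are all illustrative choices: both groups are drawn from the same distribution, so every significant result is a false positive — and roughly 5% of the comparisons come up significant anyway.

```python
import math
import random

random.seed(0)

def two_sided_p(z):
    """Two-sided p-value for a standard-normal test statistic."""
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

n_per_group, n_tests, alpha = 50, 1000, 0.05
false_positives = 0
for _ in range(n_tests):
    # Both groups come from the SAME distribution: any "effect" is pure noise.
    a = [random.gauss(0, 1) for _ in range(n_per_group)]
    b = [random.gauss(0, 1) for _ in range(n_per_group)]
    diff = sum(a) / n_per_group - sum(b) / n_per_group
    se = math.sqrt(2 / n_per_group)  # known sigma = 1 for this toy example
    if two_sided_p(diff / se) < alpha:
        false_positives += 1

print(f"{false_positives} of {n_tests} null 'relationships' were significant")
```

Hunt through enough data and the “discoveries” arrive on schedule, at a rate of about alpha per look.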

Sometimes p-hacking is not deliberately deceptive; a classic example is the examination of multiple subgroups in an overall analysis. Subgroup effects can be of legitimate interest, but you also need to report how many subgroups you examined for possible effects and apply an appropriate correction for multiplicity. However, standard multiplicity corrections are not necessarily flexible enough to account for every look you have taken at the data. So, when you see a study reporting that statistical significance was lacking for the main endpoint of the study, but that statistically significant effects were found in this or that subgroup, beware.
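For readers who want the mechanics, here is a sketch of one standard multiplicity correction, the Holm–Bonferroni step-down procedure. The subgroup names and p-values below are invented; the point is that a subgroup result that looks significant at p = 0.04 in isolation no longer clears 0.05 once four looks at the data are accounted for.

```python
def holm_adjust(pvals):
    """Return Holm step-down adjusted p-values, in the original order."""
    order = sorted(range(len(pvals)), key=lambda i: pvals[i])
    m = len(pvals)
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # Multiply the k-th smallest p-value by (m - k + 1), enforce monotonicity.
        running_max = max(running_max, (m - rank) * pvals[i])
        adjusted[i] = min(1.0, running_max)
    return adjusted

# Invented subgroup p-values from a hypothetical trial.
subgroup_p = {"men": 0.04, "women": 0.30, "over-65": 0.01, "smokers": 0.20}
adjusted = dict(zip(subgroup_p, holm_adjust(list(subgroup_p.values()))))
for name, p in adjusted.items():
    print(f"{name}: raw p={subgroup_p[name]:.2f}, adjusted p={p:.2f}")
```

Only the over-65 subgroup survives the correction; the nominally significant result in men (raw p = 0.04, adjusted p = 0.12) turns out to be exactly the kind of finding the paragraph above warns about.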

This may sound heretical, but the basic problem is that there is too much research happening, and too many researchers. Bruce Alberts et al. alluded to this problem in their article discussing systemic flaws in medical research. As they put it,

“…most successful biomedical scientists train far more scientists than are needed to replace him- or herself; in the aggregate, the training pipeline produces more scientists than relevant positions in academia, government, and the private sector are capable of absorbing.”

The p-value is indeed a low bar to publication for the researcher who has lots of opportunities to hunt through available data for something interesting. However, merely replacing it with some different metric will not solve the supply-side problem of too many scientists chasing too few nuggets of real breakthroughs and innovations.

@Paul Bremner, I think you put it very well. As a graduate student in healthcare analytics at National University, I can say we are being steered toward more robust methods. We do still use p-values, but in tandem with confidence intervals and closer analysis of the underlying univariate tests. And while a strong sample will generally have n > 30, sample size alone is no guarantee.
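For concreteness, here is what reporting the two together can look like, using a plain normal approximation for a difference in means (all the summary numbers below are invented):

```python
import math

# Invented summary statistics for two treatment arms.
n1, mean1, sd1 = 40, 5.2, 1.1
n2, mean2, sd2 = 40, 4.7, 1.3

diff = mean1 - mean2
se = math.sqrt(sd1**2 / n1 + sd2**2 / n2)
z = diff / se
# Two-sided p-value from the standard normal CDF.
p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
ci = (diff - 1.96 * se, diff + 1.96 * se)
print(f"difference = {diff:.2f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f}), p = {p:.3f}")
```

The interval does the work the bare p-value can’t: it shows the whole range of effect sizes compatible with the data, including, in this invented example, zero.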

I have no medical training but I’ve read a fair number of medical research articles in recent years, trying to get smart on various things. The problem I see is similar to what you’ve described and unfortunately bleeds over into actual medical practice and diagnosis of patients. I would describe the issue as general unfamiliarity with, and/or misuse of, statistics. The problems seem to fall into three areas:

First is the issue of p-values. The assumption seems to be that as long as you have data on at least 30 patients, you can assume a normal distribution and therefore say something valid. Perhaps someone with a statistical background can put this more eloquently, but I don’t think that addresses the question of whether the research design is solid (i.e., have we controlled for all the other things about patients so that the sample of 30 actually represents the larger population?). It doesn’t appear that there’s much funding for medical research, so every doctor’s office is trying to do underfunded research by corralling 30 or so of its patients. The result is a blizzard of studies using minimal numbers of people, published in reputable journals, reviewed by peers, and judged to be statistically valid regardless of research design issues that leave them fatally flawed and contradicting one another.
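The low-power side of this is easy to simulate (the effect size and study counts below are invented): give 200 identical studies of 30 patients per arm a real but modest effect, and only a minority of them reach significance, so the literature fills up with apparently conflicting results about an effect that is actually there.

```python
import math
import random

random.seed(1)

def two_sided_p(z):
    """Two-sided p-value for a standard-normal test statistic."""
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

true_effect, n, studies = 0.4, 30, 200  # a real but modest effect, small arms
significant = 0
for _ in range(studies):
    control = [random.gauss(0, 1) for _ in range(n)]
    treated = [random.gauss(true_effect, 1) for _ in range(n)]
    diff = sum(treated) / n - sum(control) / n
    se = math.sqrt(2 / n)  # known sigma = 1 for this toy example
    if two_sided_p(diff / se) < 0.05:
        significant += 1

print(f"{significant}/{studies} underpowered studies detected the real effect")
```

With 30 per arm, most of these identically designed studies miss a genuine effect, and a reader comparing any two of them at random will often see them “contradict” each other.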

Second, there seems to be misunderstanding in the medical community about what a 95% reference interval (often loosely called a “confidence interval”) represents and how to use it. For example, when reading blood tests the assumption seems to be that any reading within this 95% interval is normal, regardless of whether it’s normal for a given individual, with the result that no action is taken unless the number falls outside the interval. This may mean that a person who is typically in the middle of this range for a given item could experience a 60-70% drop in a reading and still be in the “normal range.” For certain types of measurements (e.g., vitamins), a 60-70% drop can signal that something is seriously wrong. (Statistics is great, but it doesn’t replace the need for actual medical diagnosis of an individual and the peculiarities of a specific body.)
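A toy illustration of the point. The reference range below is a commonly cited one for serum vitamin B12 (individual labs vary), and the patient values are invented: a 65% drop from this patient’s own baseline still lands comfortably inside the population range, so a range-only check never flags it.

```python
# Commonly cited reference range for serum vitamin B12, roughly 200-900 pg/mL
# (labs vary); the patient values below are invented for illustration.
low, high = 200.0, 900.0

baseline, current = 700.0, 245.0   # a 65% drop for this individual
in_range = low <= current <= high
drop = (baseline - current) / baseline
print(f"reading of {current} flagged by a range-only check? {not in_range}")
print(f"drop from this patient's own baseline: {drop:.0%}")
```

The population range answers “is this value unusual for people in general?”; only the comparison against the patient’s own history answers “is this value unusual for this person?”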

The third issue is somewhat related to the last point. There seems to be a lack of knowledge about the validity of models, model building generally, and how models should be applied. As people in data science/statistics, we all know that the first thing you do when looking at datasets is plot the data, identify statistical outliers and deal with them. You avoid including outliers in your model building, or you modify the values, because these contaminate the results. And you wouldn’t take a model built on the general population and try to apply it to a statistical outlier. We also know that a good model may be accurate in only 90% of the cases. So 10% of the time we expect that the predicted results won’t be accurate and we either develop a new model or try something else to deal with these data points.
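For instance, a minimal version of the outlier screen described above, using the common 1.5 × IQR fence (the dataset is invented, with one obvious outlier planted in it):

```python
import statistics

# Illustrative dataset with one obvious outlier.
data = [4.1, 3.9, 4.3, 4.0, 4.2, 3.8, 4.4, 4.1, 12.5, 4.0]

q1, _, q3 = statistics.quantiles(data, n=4)   # quartiles of the sample
iqr = q3 - q1
fence_low, fence_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < fence_low or x > fence_high]
clean = [x for x in data if fence_low <= x <= fence_high]
print("outliers:", outliers)
print("model-building set:", clean)
```

The 12.5 reading gets set aside before any model is fit, which is exactly the discipline the paragraph above says is missing when a population-level model gets applied to an outlying individual.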

Unfortunately, many in the medical community don’t seem to fully appreciate that diagnoses and drug-dosing recommendations are basically models derived from observations of the general population (or of a population with condition XYZ). Various things can affect individual results and outcomes. Medications, for example, can have differing levels of effect depending on a range of conditions, especially body type (e.g., obese and very fit people may react quite differently to the same medication, because fat, particularly abdominal fat, plays a key part in metabolism and can affect how medications are processed by the body). Failing to recognize differences in individuals is akin to applying a generalized model to statistical outliers: you get unexpected, and perhaps very undesirable, results.

Many of these issues can be avoided simply by looking more closely at individuals who you suspect may be statistical outliers (in data science/statistical terms, using “time series analysis” to see how blood level readings have changed over time, something that is quite easy and quick to do with current visualization software.) Unfortunately, most medical practices and hospitals are struggling financially and the way they typically try to cope with this problem is running people through appointments in minimum time. So doctors make decisions based on how patients compare to the norm rather than making an effort to look at personal history.
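Even without fancy visualization software, the kind of personal-history check described here is a few lines of code (all values below are invented): flag a large percentage drop from the patient’s own baseline even when every reading sits inside the population range.

```python
# Hypothetical series of one patient's readings over successive visits,
# plus an invented population reference range. All numbers are illustrative.
readings = [55.0, 54.0, 48.0, 40.0, 28.0]
ref_low, ref_high = 20.0, 80.0

baseline, latest = readings[0], readings[-1]
drop = (baseline - latest) / baseline
in_population_range = ref_low <= latest <= ref_high
print(f"latest reading inside the population range: {in_population_range}")
print(f"drop vs. this patient's own baseline: {drop:.0%}")
if drop > 0.4:   # illustrative threshold for "review this chart"
    print("flag for clinical review despite a 'normal' population reading")
```

A comparison against the population norm says everything is fine; a two-line comparison against the patient’s own history says the opposite.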

At any rate, the net effect of financial woes in the healthcare community seems to be lots of research that doesn’t really tell you much and, too often, diagnoses and use of medications that may be way off the mark for a given individual. Compounding the problem is the increasing tendency of the medical community to deem more and more tests as "unnecessary" as costs increase, meaning that it takes longer and longer to detect developing medical issues. This problem is only going to increase as finances tighten in the medical world. Can’t really blame people for trying to address financial challenges. It would be nice, however, if the medical community could get a bit more educated about statistics and how to use them.