The Methods Man: Lies, Damned Lies and Statistics. Really.

Statistical ethics, and how to "spot the fish."

Imagine that your job depends on one key metric. Number of sales, for example, or signatures on a petition, or -- I don't know -- religious converts. If you hit your (arbitrary) target, you get to keep your job. Otherwise, you're out. Sound familiar?

But what if you could tweak your numbers a bit? What if you could call something a sale when, really, no money changed hands? What if you could get the same person to sign your petition multiple times with multiple different names? Would you do it? It's your job, after all.

This dilemma is faced by academic researchers every single day.

Publish or Perish?

My promotion (from vaunted Assistant Professor to hallowed Associate Professor) is dependent on, more or less, one metric. Publications. This isn't a Yale thing, by the way, this is just how it works for scientists. That metric is how the higher-ups judge the impact we're making on the world. Full disclosure: you also get points for teaching, community service, patient care, etc. (But not science blogging.)

Now, my publications are a matter of public record -- I can't lie about those. But the data that lead to those publications? The analyses that show whether a hypothesis I had was good or bad? That stuff is entirely within my control.

The conflict of interest is obvious. Scientists want to advance humanity through empiric knowledge. At the same time, when our hypotheses fall flat and our experiments fail, we have more difficulty getting those crucial publications. This, understandably, leads some down the dark path.

And for your edification, loyal readers, I'm going to teach you how it's done.

The Set-Up

No real data are being harmed or manipulated in the writing of this blog. The data are made up completely randomly. That is to say, I created a dataset where each data element is completely independent of every other data element. There is nothing here. You're welcome to play with it, by the way. I've posted a copy of this (useless) dataset on my blog here. Briefly, it's a dataset of 5,000 imaginary individuals. I've randomly generated lots of information about them -- ranging from their political affiliations to whether they like Star Trek or Star Wars.
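For the code-inclined, here's a minimal sketch (in Python) of how a dataset like this can be conjured. The column names, prevalences, and distributions below are my own illustrative stand-ins, not the actual posted file:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 5000

# Every column is drawn independently of every other column,
# so any "association" found later is noise by construction.
fake = pd.DataFrame({
    "party": rng.choice(["Democrat", "Republican", "Independent"], size=n),
    "fandom": rng.choice(["Star Trek", "Star Wars"], size=n),
    "height_in": rng.normal(67, 4, size=n).round(),
    "mojo": rng.uniform(0, 10, size=n).round(1),
    "drinks": rng.random(n) < 0.5,
    "kidney_disease": rng.random(n) < 0.05,
    "belieber": rng.random(n) < 0.10,
})
```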

The Twist

My challenge is to use some statistical wizardry to make it seem like something relevant is happening. But unlike when this happens in the medical literature, in this article I'll take you behind the curtain. Note that I'm presenting just a smattering of techniques -- but hopefully enough to give you a bit more skepticism when you're reading that article about the next great thing in medicine.

Trick #1: The Fish

This is the most basic, and, dare I say, most common trick in the statistics book. We've got a big dataset here, let's explore!

Basically, I'm going to compare variables in this dataset until I find two that seem related, and write my paper about that. Remember, none of the variables are actually linked. Let's start the hype train.

Did you know that, compared with other political affiliations, independents are 24% more likely to be Star Wars fans than Star Trek fans (P=0.001)?! Telemarketers are 5.5 times more likely to have kidney disease than other professions (P=0.02)! They are also 13.1 times more likely to describe themselves as a "Belieber" (P=0.017)! I was able to find all these statistically significant associations in my random dataset. Remember -- statistically significant doesn't mean true -- just unlikely to happen by chance. I cooked the books by only reporting the rolls of the dice that worked for us.
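To see just how easy this is, here's a hedged sketch of an automated fishing expedition, reusing the illustrative `fake` dataframe from the Set-Up sketch. The chi-squared test is my stand-in for whatever comparison a fisher might actually run:

```python
from itertools import combinations
import pandas as pd
from scipy.stats import chi2_contingency

def go_fishing(df, alpha=0.05):
    """Test every pair of non-numeric columns; report only the 'winners'."""
    cats = df.select_dtypes(exclude="number").columns
    hits = []
    for a, b in combinations(cats, 2):
        _, p, _, _ = chi2_contingency(pd.crosstab(df[a], df[b]))
        if p < alpha:  # only the lucky rolls of the dice get reported
            hits.append((a, b, p))
    return hits

# With k independent tests at alpha = 0.05, the chance of at least one
# false positive is 1 - 0.95**k: roughly 64% after just 20 comparisons.
print(go_fishing(fake))
```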

How to Spot the Fish:

Sometimes, the authors are nice enough to have stated their hypotheses before they did the analysis (like in the design paper for a randomized, controlled trial), but often not.

Some hints that a fish might be happening:

The observed results are biologically implausible

The observed results contradict all existing literature

The authors have published multiple papers using the same dataset, with no sense of thematic purpose

The authors present multiple analyses, and hang their hat on the one "significant" finding. They note that this could be the result of multiple testing, but call the results "hypothesis generating"

Trick #2: The Post Hoc Shuffle

OK, what if you (the researcher) have an actual hypothesis? Let's say you think that people with more mojo (it's a fake dataset, I can make up fake variables) will be less likely to drink alcohol. You go into the data with the best intentions, to test that single hypothesis. Here's how the data look:

Figure 1: The hypothesis that drinkers have less mojo than non-drinkers doesn't seem to be playing out.

Your options now, as a researcher, are to try to get your (gasp -- negative!) study published, or to do a bit of post hoc shuffling. That means you start excluding people from your dataset. Maybe we exclude people whose mojo is less than 1 for some reason, or whose mojo is greater than 5. Maybe we kick out diabetics, or those with heart failure, or Republicans. Bottom line: Is there any subgroup where our hypothesis is borne out?! Seek and ye shall find.

It turns out that, among people of height less than 56 inches, drinkers have significantly less mojo than non-drinkers (a mean of 2.8 versus 3.1, P=0.01). Hypothesis saved. By the way, if we set this cutoff at 55 inches or 58 inches, this would no longer work.
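Here's what that shuffle can look like in code: a sketch that scans arbitrary height cutoffs until some subgroup "confirms" the hypothesis. The `drinks` and `mojo` columns and the t-test are my illustrative choices, not the actual analysis:

```python
from scipy.stats import ttest_ind

def shuffle_until_significant(df, alpha=0.05):
    """Scan height cutoffs until some subgroup 'confirms' the hypothesis."""
    for cutoff in range(50, 80):                       # try every cutoff
        sub = df[df["height_in"] < cutoff]
        drinkers = sub.loc[sub["drinks"], "mojo"]
        abstainers = sub.loc[~sub["drinks"], "mojo"]
        if min(len(drinkers), len(abstainers)) < 10:   # skip tiny subgroups
            continue
        _, p = ttest_ind(drinkers, abstainers)
        if p < alpha and drinkers.mean() < abstainers.mean():
            return cutoff, p                           # hypothesis "saved"
    return None                                        # honesty prevails
```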

Hints that the post hoc shuffle is happening:

The subgroup of interest seems unrelated to the exposure and outcome of interest

Arbitrary cut-points of continuous variables are being used

Prior research in this area, or from this lab, has not focused on this particular subgroup.

Trick #3: By the Power of Math, I Compel You!

There are certain statistical techniques, often aided by sophisticated computer software, that are overpowered. What this means is that they can find patterns in the data even when no patterns exist. It's the silicon equivalent of staring at the clouds floating by. These tools exist for a reason -- often they are a nice way of visualizing trends in data that might not be immediately obvious -- but they should almost never be used to test hypotheses. But whatever, let's try one!

In this situation, I'm interested in predicting death. I built this dataset so that its members would die at a rate of roughly 25% over 10 years.

Figure 2: Fake deaths in the fake cohort. RIP.

What factors (presumably measured at baseline) tell us who is at risk of dying? Well -- I could, you know, have a hypothesis. But I could also use some advanced computational techniques to sift through the mountains of data and find factors that appear to make a difference. There are lots of ways to do this -- neural networks, machine learning, brute force, etc. In this case, I'm using something called regression tree analysis to basically let the computer find specific groups of people with particularly high risk.

After the computer churned through the data, I found that for individuals of height 74-82 inches who had heart failure, but not liver disease, not kidney disease, and were not "Beliebers," the rate of death was more than three times higher than those who didn't have that profile. Looks like we should be piping in the Biebs to the nursing home, huh?
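As a flavor of how little effort this takes, here's a sketch that grows a decision tree on pure noise, with scikit-learn's DecisionTreeClassifier standing in for whatever regression tree software was actually used. Grow the tree deep enough and some leaf will look like a high-risk subgroup, even though nothing real is going on:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 6))      # six predictors of pure noise
y = rng.random(5000) < 0.25         # ~25% "die," independent of X

# Fit a tree and print its splits; deep leaves will show apparent
# "high-risk" pockets by chance alone.
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)
print(export_text(tree, feature_names=[f"var{i}" for i in range(6)]))
```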

Hints that a computer got way too involved with the dataset:

Results come from "data mining"

Results are not internally or externally validated

The methods section describes the use of clustered computing or a supercomputer

Trick #4: Wrong Test, Wrong Result

A big part of learning how to do statistics involves understanding which statistical test to use given the data you are looking at. It depends on whether the variable of interest is continuous or categorical, how the data are "shaped," what the structure of the study was that generated the data, etc. But just because you should use a certain test doesn't mean you can't use another test. Sometimes different tests give you different P-values. Let's shop around:

I'm looking at the association between height and physical function. Physical function, in this case, is an integer ranging from 1 to 9. It is not a nice, normal, bell-shaped curve. The appropriate test of correlation between these two variables is the Spearman coefficient (don't worry about the details), which gives us a P-value of 0.06. Not statistically significant.

But, hey, the inappropriate Pearson correlation coefficient gives us a P-value of 0.04. That meets our threshold of statistical significance. So -- do you sink your paper, or do you use the wrong test?
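The shopping trip itself is all of two lines. A sketch, using scipy's implementations of both tests:

```python
from scipy.stats import pearsonr, spearmanr

def shop_around(height, physical_function):
    """Run both correlation tests and see which P-value is friendlier."""
    _, p_spearman = spearmanr(height, physical_function)  # appropriate here
    _, p_pearson = pearsonr(height, physical_function)    # assumes normality
    return {"spearman": p_spearman, "pearson": p_pearson}
```

The honest move, of course, is to prespecify the test that fits the data (here, Spearman for a skewed ordinal scale) and report it, whatever Pearson happens to say.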

Hints the wrong test might have been used for the data:

Results can't be reproduced (assuming you have access to the raw data)

Trick #5: The Repeated Look

This one only really works if you're looking at your data while your study is ongoing, but it's worth keeping an eye out for. The idea is that you keep running your primary analysis just to see if the result is significant. When it is -- stop the study! This requires a bit more forethought and a real lack of ethics, so (I think) it's pretty rare. In our dataset, we keep checking our hypothesis (how about more mojo = less death?) as each new patient is enrolled. Here's a graph of the P-value over time, as we enroll each new patient:

Figure 3: The P-value jumps around at first (small numbers of patients = big shifts with each new patient), and then stabilizes, but there are clearly times when (by chance alone) the P-value falls below that magical 0.05 threshold (red dashed line). Stop your study there and you're golden.
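Here's a sketch of the repeated look in action: re-run the primary analysis after every new "patient" and stop the moment the P-value dips below 0.05, even though the two arms are drawn from the very same distribution. (The t-test and sample sizes are my illustrative choices.)

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
arm = rng.random(2000) < 0.5       # random "treatment" assignment
outcome = rng.normal(size=2000)    # no true effect anywhere

# Peek at the P-value after every enrollment; stop at the first "win."
for n in range(20, 2001):
    a = outcome[:n][arm[:n]]
    b = outcome[:n][~arm[:n]]
    _, p = ttest_ind(a, b)
    if p < 0.05:
        print(f"'Significant' at patient {n} (P={p:.3f}). Stop the study!")
        break
else:
    print("Never crossed 0.05; bad luck for the unscrupulous.")
```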

Hints that frequent data checking might be happening:

The duration of the study was not prespecified

The study was "stopped early" for efficacy, but the P-value was just under 0.05

Trick #6: Outright Fraud

Let's not forget that a researcher can simply, well, lie. If all that we, as readers, get to see is the results of a study, how can we verify that the data themselves are plausible? Increasing numbers of journals are requiring scientists to submit their datasets as part of the publication process, and this really is laudable. But the process only works if individuals then download the datasets and try to reproduce the results. Even then, it takes a very astute eye (or some interesting statistical techniques that are beyond the scope of this article) to detect when data have been manipulated after the fact.

Solutions

I don't mean to fear monger with this article. I don't mean to imply that scientific misconduct like this is rampant. But it does happen. And it can happen without malevolence: "I know this hypothesis is true, so it's OK if I take some liberties with the statistics," "we need this data to support our next study, which is really important for the world to see," "I have a grant to apply for, and this paper needs to be published first." The road to hell and all that.

I propose a few solutions:

1) Encourage training in statistical ethics. Take young researchers and give them not only an understanding of the tools of statistics, but a sense of responsibility and stewardship. These are truly powerful tools, and should only be used for the greater good.

2) Limit the positive study = publication paradigm. There are a variety of ways to do this. Trial registries, like clinicaltrials.gov, are a good start. Negative trials may still not be published, but at least we know they happened. Journals that ignore the "impact" of results in favor of good science are also hugely beneficial (I'm looking at you, PLOS ONE).

3) Researchers should be pressured to prespecify the statistical tests they are going to use on their data, and those data (once collected) should be made available to the public for replication.

4) You, I, the public, and especially the press need to take positive studies with a grain of salt. Sometimes, it really is the next big thing. Sometimes, it's smoke and mirrors.

That's it for this one, folks. Stay skeptical.

F. Perry Wilson, MD, is an assistant professor of medicine at the Yale School of Medicine. He earned his BA from Harvard University, graduating with honors with a degree in biochemistry. He then attended Columbia College of Physicians and Surgeons in New York City. From there he moved to Philadelphia to complete his internal medicine residency and nephrology fellowship at the Hospital of the University of Pennsylvania. During his research time, Dr. Wilson also obtained a Master of Science in Clinical Epidemiology from the University of Pennsylvania. He is an accomplished author of many scientific articles and holds several NIH grants. He is a MedPage Today reviewer, and in addition to his video analyses he authors a blog, The Methods Man. You can follow @methodsmanmd on Twitter.
