How can you evaluate a research paper?

You ended a post from last month [i.e., Feb.] with the injunction to not take the fact of a paper’s publication or citation status as meaning anything, and instead that we should “read each paper on its own.” Unfortunately, while I can usually follow e.g. the criticisms of a paper you might post, I’m not confident in my ability to independently assess arbitrary new papers I find. Assuming, say, a semester of a biological sciences-focused undergrad stats course and a general willingness and ability to pick up any additional stats theory or practice, what should someone in the relevant fields do to get to the point where they can meaningfully evaluate each paper they come across?

My reply: That’s a tough one. My own view of research papers has become much more skeptical over the years. For example, I devoted several posts to the Dennis-the-Dentist paper without expressing any skepticism at all—and then Uri Simonsohn comes along and shoots it down. So it’s hard to know what to say. I mean, even as of 2007, I think I had a pretty good understanding of statistics and social science. And look at all the savvy people who got sucked into that Bem ESP thing—not that they thought Bem had proved ESP, but many people didn’t realize how bad that paper was, just on statistical grounds.

So what to do to independently assess new papers?

I think you have to go Bayesian. And by that I don’t mean you should be assessing your prior probability that the null hypothesis is true. I mean that you have to think about effect sizes, on one side, and about measurement, on the other.

It’s not always easy. For example, I found the claimed effect sizes for the Dennis/Dentist paper to be reasonable (indeed, I posted specifically on the topic). For that paper, the problem was in the measurement, or one might say the likelihood: the mapping from underlying quantity of interest to data.

Other times we get external information, such as the failed replications in ovulation-and-clothing, or power pose, or embodied cognition. But we should be able to do better, as all these papers had major problems which were apparent, even before the failed reps.

One cue which we’ve discussed a lot: if a paper’s claim relies on p-values, and they have lots of forking paths, you might just have to set the whole paper aside.
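To make the forking-paths point concrete, here is a minimal simulation sketch. It is a crude stand-in rather than the full idea: it assumes the analyst has several independent candidate comparisons available and reports whichever one reaches p < 0.05, with no true effect anywhere. The function names, the known-variance z-test, and the sample sizes are all assumptions chosen for illustration.

```python
import random
from statistics import NormalDist, mean

def two_sided_p(x, y):
    """Two-sided p-value for a difference in means (z-test, sd assumed known and equal to 1)."""
    n = len(x)  # assumes both groups have the same size
    z = (mean(x) - mean(y)) / (2.0 / n) ** 0.5
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

def false_positive_rate(n_paths, n_per_group=20, n_sims=2000, seed=1):
    """Share of pure-noise studies in which at least one of n_paths comparisons has p < 0.05."""
    rng = random.Random(seed)
    hits = 0
    for _sim in range(n_sims):
        for _path in range(n_paths):
            x = [rng.gauss(0, 1) for _ in range(n_per_group)]
            y = [rng.gauss(0, 1) for _ in range(n_per_group)]
            if two_sided_p(x, y) < 0.05:
                hits += 1
                break  # the analyst stops at the first "finding"
    return hits / n_sims

if __name__ == "__main__":
    for k in (1, 5, 10):
        # simulated rate next to the independent-tests formula 1 - 0.95**k
        print(k, round(false_positive_rate(k), 3), round(1 - 0.95 ** k, 3))
```

Real forking paths are usually subtler than this, since the choices depend on the data and the comparisons are correlated, but the sketch shows why a lone p-value near 0.05 carries little information once such choices are in play.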

Medical research: I’ve heard there’s lots of cheating, lots of excluding patients who are doing well under the control condition, lots of ways to get people out of the study, lots of playing around with endpoints.

The trouble is, this is all just a guide to skepticism. But I’m not skeptical about everything.

And the solution can’t be to ask Gelman. There’s only one of me to go around! (Or two, if you count my sister.) And I make mistakes too!

So I’m not sure. I’ll throw the question to the commentariat. What do you say?

45 Comments

Be very skeptical of any study with small Ns, unless the effect size is very large. If the effect is clear from a scatterplot, I’m more likely to take it seriously. I’ve become a big fan of overlaying model predictions on scatterplots.

I find skimming the statistical methods section informative. Are they appropriate? Do they seem to have been written by someone who knew what they were doing? Was a professional statistician involved?

Most importantly, did they correctly control for biases? I’ve come to believe that controlling bias is just about the most important thing we statisticians do. And bias comes in many forms, including multiple comparisons, differences between groups in prognostic demographics or other variables, and evolving standard of care (beware of studies comparing time intervals before and after some event). I have seen considerable influence from evolving standard of care in clinical studies spanning more than a few years. Missing data is an often-ignored bias: patients may drop out of a trial for reasons associated with the treatment, the severity of their condition (or lack of it), or age (we may see good follow-up through adolescence, followed by a drop-off as patients enter adulthood and have jobs and families). Then of course there are biases associated with which patients are consented into or excluded from a trial.

I take results from a randomized trial more seriously than observational/retrospective studies, assuming the Ns are reasonable.

Over time I’ve come to associate certain authors’ names with junk papers. This is a matter of experience.

A lot of clinical research is done by over-worked medical students, residents, or Fellows, with little oversight of the data collection. Take it with a grain of salt.

I suggest being very skeptical, even if “the [estimated] effect size is very large.” Reported estimated effect sizes are overestimates because of the statistical significance filter. If N is small and data are noisy, estimated effect sizes will be huge no matter what.
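A small simulation sketch of the significance filter may help here. It assumes a two-group comparison with a modest true difference (0.2 standard deviations, a number picked only for illustration), draws each study's estimate directly from its sampling distribution, and averages only the estimates that clear |z| > 1.96. The function name and all parameter values are hypothetical.

```python
import random
from statistics import mean

def mean_significant_estimate(n_per_group, true_diff=0.2, sd=1.0, n_sims=4000, seed=2):
    """Average |estimated difference| among simulated studies that reach |z| > 1.96."""
    rng = random.Random(seed)
    # standard error of a difference in means, treating sd as known
    se = sd * (2.0 / n_per_group) ** 0.5
    kept = []
    for _ in range(n_sims):
        est = rng.gauss(true_diff, se)  # one study's estimate
        if abs(est) > 1.96 * se:        # only "significant" results survive
            kept.append(abs(est))
    return mean(kept)

if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, round(mean_significant_estimate(n), 3))
```

By construction, any estimate that passes the filter is at least 1.96 standard errors from zero, so with small N and noisy data the surviving estimates have to be large whatever the true effect is. With large N the filter barely bites and the average sits close to the assumed true difference.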

Also when you talk about controlling for bias: yes, I agree, but I’d point you more strongly toward issues of measurement. I agree with you that statistical issues such as missing data and selection bias are important, and these can be enablers of all sorts of other problems—but more fundamental than all of this, I think, is that the measurements being analyzed are tied closely to the questions under study.

I agree. My intention was to clarify the bit about large effect sizes with overlays of the effect on a scatterplot, because often the scatterplot will make it vividly obvious that a large effect size is due to a few outliers, while otherwise there is little evidence of change (not that I’m even vaguely advocating throwing out outliers, merely being aware of their influence). Sadly, many investigators (and reviewers) ask that graphs be shown without the scatterplot because it “distracts” from the interpretation.
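The influence of a single point can also be checked numerically when no scatterplot is shown. The sketch below uses a small made-up dataset (hypothetical values, not from any study) in which ten points show essentially no relationship and one extreme point is added. It computes the Pearson correlation with all points and again with each point left out in turn. The helper names are illustrative.

```python
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def leave_one_out_r(xs, ys):
    """Correlation recomputed with each point dropped in turn."""
    return [
        pearson_r(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        for i in range(len(xs))
    ]

# Hypothetical data: ten unremarkable points plus one outlier at (30, 30).
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 30]
ys = [5, 3, 6, 4, 5, 4, 6, 3, 5, 4, 30]

r_all = pearson_r(xs, ys)
loo = leave_one_out_r(xs, ys)

if __name__ == "__main__":
    print("all points:", round(r_all, 2))
    print("outlier dropped:", round(loo[-1], 2))
```

This is a diagnostic for awareness, not an argument for deleting the point: the aim is to see how much of the headline number rests on one observation.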

Your comment about measurements being analyzed being tied to the questions under study would seem to tie in nicely with your earlier blog post about exploratory studies. Certainly I’ve seen many published studies done on measurements taken incidentally with regard to the primary study, especially when the results for the main outcomes didn’t pan out.

Yes, to me as a statistics researcher and textbook writer, the challenge is to formalize ideas of good practice and fold them into statistical theory and education.

There can be a disconnect, where researchers know that ideas such as model checking, exploratory data analysis, careful measurement, etc., are important—but then they don’t know how to integrate these practices into their work. And in that case the logic of the statistical methods can, unfortunately, drive the research into the ground.

Consider topics such as power pose. Of course the Amy Cuddys and Andy Yaps and Susan Fiskes of the world know that good measurement is important and that p-hacking is bad. But with the primitive statistical methods they are using, their research and publication schedule becomes determined by what is or is not statistically significant, which pushes issues of bias and variance of measurement to the side because they do not appear in any of the formulas they’re using.

Much of my research during the past few decades is an attempt to (a) formulate the “good practice” intuitions so they can fit into statistical analyses, and (b) stretch the boundaries of statistical analyses to account for good practice.

It took me a long time and hard work to get to the point that I understood the basic ideas. For some reason many don’t want to put in the work but want to use statistics anyway.

Why only a semester of undergrad stats? Why not a year, or two years? I understand that people don’t have time. But there is no short cut to understanding. If significant amounts of statistics were a requirement to using it, would that be so bad? If you don’t have time to learn enough about statistics to use it, maybe don’t use it?

Part of the problem is that even several years of statistics classes could lead you to the point where you’re an “expert” in essentially p-hacking and choosing from among many tests to get the result you want/need. The kind of stuff taught in “standard” stats classes isn’t necessarily helpful to getting towards “less wrong.”

I think it’s a combination of people believing stats is so difficult you can’t learn it with a regular high school education (so there’s no point in trying) and tools that make it so easy to get “results” that there’s no need to. I’ve had people give me reviewer reports pointing them to online calculators “just to make sure the results are significant.” I guess there’s an underlying assumption stats is what makes a claim “scientific,” without thinking what that means (and I’m mostly thinking of my field here).

The stats are just a tiny part of whether people can “meaningfully evaluate each paper they come across”. In fact, thinking stats are of utmost importance is another attitude that has led to this mess.

How the data was collected and how much effort was put into developing the *research hypothesis* is much more important.

If you
1) collect crap data (unblinded, questionable proxy for what you care about, etc)
or
2) fail to get your model to make a prediction more precise than “A should be positively correlated with B”

Then the paper will not be useful.

Now, it would be fine if people skipped number 2 and just collected good data and published descriptions of it. Then the models/theories can be developed by people with those skills (this is much closer to how eg particle physics functions).

However, that is not really allowed under the current culture. So people attempt testing these vague speculations (by instead testing a precise null hypothesis known to be false…) and come to all sorts of incorrect conclusions. Why? Because such vague hypotheses are impossible to distinguish from a whole slew of alternative trivial explanations for the results. For biomed, the reality is that 99.99% of claims being published are not worth paying attention to; rather, you are better off having never heard about them.

In addition, there is always the famous “follow the money,” which supposedly appeared first in the film (but not in the book) “All the President’s Men.” This admonition is especially true for the incoming Trump Administration.

See if they failed to blind themselves and used dynamite plots. You can also search for the term significance and often find blatant misunderstandings quickly that way. Do they claim A leads to B but fail to show a scatter plot of A vs B? Does it look like all the analysis was done with (essentially impossible to debug and reproduce) spreadsheet manipulations? Are there estimates of uncertainty present on the plots? Do they tell you real sample sizes or just say “at least n animals in each group”? Do sample sizes seem to change from figure to figure for unexplained reasons? All these are signs telling you the report should be treated as unreliable, and you can go through that list in a few minutes.

What if the above doesn’t tell you to move on? Next you need to figure out what they actually measured, where did these numbers come from? That may take a bit more time. During this process you will likely need to follow some refs, use that as a spot check to make sure the papers say what is claimed. Anyway, think about whether a mouse sniffing a corner of a plus maze really has much to do with memory rather than other factors, or whatever it is they measured. You should *always* be able to come up with a few alternative explanations. Now check the paper, did they ignore alternative explanations, just handwave them, or take that issue seriously?

Using the above, within a day or so someone reasonably familiar with the field should be able to tell if a paper would be a waste of their time to study further (but not if it is good). That makes me wonder how so much crap keeps getting published though…

I’m working on this problem right now. First, we must ask ‘evaluate why?’ Many people are discussing how to evaluate papers for the purpose of tenure or promotion, etc. That question – how to evaluate the importance of the paper in the context of careers and organizations – is a problem for another day (month, decade).

Then there are people who want to know whether a claim made in a paper is likely to be accurate. For example, they have a practical question – how infectious is disease X? Or what drug targets exist for …? A paper makes a relevant claim (observation, deduction) and the reader needs to know how likely it is to be correct. This is the reproducibility crisis and generalizability crisis problem.

For this, my approach does involve priors – how surprising is the finding? Combine with how reliable the field (and specific authors) have been… And then the smell of the paper (like code smells). Were the statistical methods modern and appropriate? What of silly errors and outrageous claims? etc etc.

Somewhat inadvertently this semester I structured an intro research methods course around the replicability crisis in science. It’s a bit scary to stare into the abyss this way, but somehow it can also be affirming. Confronting the failures brings into sharp relief the principles and procedures that support more robust research. And it’s also a good reminder that while there is a lot of noise out there, there is obviously a lot of scientific work that is pretty damn good, in all fields. I think it ended up affirming the message that science is a useful enterprise that demands hard work and clear thinking.

Reading papers together with other people — i.e. a journal club — can be very useful, and enjoyable, and it distributes the “work” of figuring out background information, methods, etc. A negative: an excellent journal club I’m part of comes out of nearly every meeting having concluded (correctly) that the paper we’ve looked at is awful, and this is perhaps even more dispiriting in a group than individually. A positive: I met my wife at a journal club. Of course, your mileage may vary.

The group I belonged to, which evaluated randomized trials for meta-analysis in the 1980s, randomly assigned one member to be each paper’s advocate and another its critic. So there was usually a “they did this right” aspect.

Not a rule that’s going to solve everything, but be sceptical of any study using non-randomly sampled data (this includes social media data) that doesn’t discuss the problems this might cause for inference.

I think you’d have to consider some awareness of your own biases as well. We tend to seek out information that confirms our views of the world, so you’re more likely to set aside skepticism if the conclusions of the paper match what you believe currently. Deliberate behaviors to compensate for those things are good.

I agree that Raghu’s suggestion of working in groups is helpful, and I’d add that we try to engage in an exercise that starts with “this result could be nothing if [X] is true.” Then we look to the materials we have to see if the author addressed that possibility. We don’t do it exhaustively, but we want to arrive at a bunch of alternative explanations for the observed effect and see if the author thought about them as well. It sort of reminds me of those lame graphs you see on Facebook: “this one graph explains why [X] is [Y].” There is no “one graph” that completely explains anything, in my mind.

Any chance that a group could try to get full(er) coverage on the few/most common flaws?

Seems most groups try to look at studies one by one and give them a fair/full evaluation… which is nice but seems time consuming. I was thinking it might be a lot fairer/faster if a group tries to only estimate the Type S, Type M errors for typical p-value studies, then publishes a big reference… that way, when considering whether to bother understanding a new paper in an unfamiliar discipline, one could just look up the reference and not bother with obvious noise.
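For a sense of what one entry in such a reference might look like, here is a sketch of the calculation under simple assumptions: the estimate is normally distributed around an assumed positive true effect with a known standard error, and "significant" means a two-sided test at the given alpha. The function name `design_errors` is hypothetical, and the true effect fed in is always a judgment call drawn from outside the study, not something the study itself supplies.

```python
from statistics import NormalDist

def design_errors(true_effect, se, alpha=0.05):
    """Power, Type S rate, and exaggeration ratio (Type M) for estimate ~ Normal(true_effect, se).

    Assumes true_effect > 0, so a Type S error is a significant negative estimate.
    """
    std = NormalDist()
    z = std.inv_cdf(1 - alpha / 2)
    lam = true_effect / se            # true effect in standard-error units
    p_hi = 1 - std.cdf(z - lam)       # P(significant with the correct sign)
    p_lo = std.cdf(-z - lam)          # P(significant with the wrong sign)
    power = p_hi + p_lo
    type_s = p_lo / power
    # E[|estimate| and significant], in standard-error units, from truncated-normal means
    e_abs = lam * (p_hi - p_lo) + std.pdf(z - lam) + std.pdf(-z - lam)
    exaggeration = (e_abs / power) / lam
    return power, type_s, exaggeration

if __name__ == "__main__":
    # Hypothetical design: assumed true effect 0.1, standard error 0.5
    power, type_s, exaggeration = design_errors(0.1, 0.5)
    print(round(power, 3), round(type_s, 3), round(exaggeration, 1))
```

A lookup table would simply tabulate these three numbers over a grid of plausible true effects and typical standard errors for a field.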

I think there would be a market among non-statisticians for a book called “Common Statistical Mistakes in the Social Sciences and How to Avoid Them.” As an economics student who is laboriously trying to teach himself techniques of statistical inference, it would be very useful to me.

(The notes were developed for a “continuing education” course I taught for several years. I’ve decided to retire from it, but someone else will be teaching it in May 2017 — see https://stat.utexas.edu/training/ssi for information.)

For me, what I look at first (and think about) are alternative mechanisms for whatever the result is that are either not cited, or mentioned and then shrugged off by citing a single paper that may (or may not… I’m not going to track it down) rebut the alternative. I want what appear to me to be *serious* tests of the alternative mechanisms, and at the very least a grounding of this result quantitatively alongside the rest of the results on whatever the topic is.

Second, I look for p values anywhere near 0.05 as evidence, combined with forking paths. We all know where that 0.0492 came from, and it’s never the straight path.

Third, I look for *simple* tests of the result, with increasing skepticism as more advanced (and convoluted) techniques yield very different results. Sure, IV methods are needed when there is endogeneity, but if the IV results and the OLS results don’t point in the same direction, then you’d better convince yourself that the instrument wasn’t selected to get you there. This isn’t an indictment of IV methods… just a suggestion to look at the magnitude of the change from OLS before accepting the result credulously.

Fourth, look for (where applicable) predictions of the implications of the central result in something not directly related and, where possible, actual tests of those predictions. But at least respect the researcher who pins herself down as to how applicable she thinks these results are outside of this particular estimation procedure.

Fifth, look for modesty. I like papers which aren’t all about “look what I found” but contain a healthy dose of “but this is why this might not be right.” (This is partly related to the first criterion.) Most critically, if there’s some possible mechanism that’s not tested for or even mentioned in the paper itself, and if it’s an objection that I came up with without really being immersed in the particular area myself, that’s a big red flag.

“Our study reveals that there are significant gender differences in how people react to different ways this information is presented to them: Women reach better/more accurate decisions when information is presented multimodally, i.e. using natural language text and graphics, whereas men are happy with just the graphics.”

From the linked paper
“We found that females score significantly higher at the decision task when exposed to either of the NLG output presentations, when compared to the graphics-only presentation (p < 0.05, effect = +53.03).

We found that males obtained similar game scores with all the types of representation. This suggests that the overall improved scores (for All Adults) presented above, are largely due to the beneficial effects of NLG for women.”

There are some basic things that catch my eye right away. I am not a statistician, but I can recognize certain kinds of errors.

1. An obvious discrepancy between the study’s actual findings and its conclusions (in the abstract or press release). For example, if a study purports to show that students learn math better when they gesture, but really only found that students who were told to gesture while solving a math problem did somewhat better on a subsequent math quiz than students told explicitly *not* to gesture, I know something’s up. The issue may be with the reporting, but often it’s in the language of the study itself.

2. Unclear or inconsistent measurement, or measurement of the wrong thing. If a study purports to show that students who read stories of scientists’ *struggles* make more progress in science than those who read of scientists’ *successes,* and the progress is measured in terms of homework grades (across assignments, with no consistent criteria), then the findings are iffy at best.

3. Failure to think a problem through. The other day I read about a study of the emotional “shapes” of stories. From Drake Baer’s article: “With that corpus [of 1,327 stories from Project Gutenberg] in place, the researchers analyzed the happiness levels of the words themselves, ratings that were found via crowdsourcing the hedonic value of individual words (“love,” “laughter,” and “happiness” are at the top).” The researchers do not seem to have considered the many problems with this approach, problems that become clear if you look closely at a work of literature.

You can’t help but be Bayesian, because your priors inform what you tend to believe or not, and even what you notice or not. So the best papers are those which challenge your priors, and that means “best” includes papers which suck, if they illuminate your priors in some way. The most important or best papers to you are those which transform your priors. That doesn’t need to be earthshaking; it can be the realization that a tool exists by which you can think about your priors, something that enables better analysis, rather than a particularly analytical result. The most important papers to all are those which do the same for a large number of people, which we idealize to “a large number of smart people” having their minds adjusted by that work.

There are relatively clear tools for examining whether a paper says anything worthwhile or not: noise, sign of effect relative to its size, etc. But the clear tools are often really hard to apply, particularly because papers are written to obscure the weak points in their reasoning or to gloss over the weak points in the data or in the analysis. And the tools are in no way complete enough that you can rely on a testing algorithm that lets you pump in a study for evaluation and get an answer quickly without a ton of fiddling. So you have to rely on fairly simple measures like: did this paper make me think about diabetes in a different way? That way doesn’t have to be correct (the paper might be complete crap), but if it changed your mind in some way then your mind is now capable of figuring out whether that answer is half decent or really good or utter garbage. It is this which I think explains our fascination with shitty work: you have to recognize crap where it is, and it’s often not as obvious as a giant horse apple pile in the path. Sometimes it’s as slippery as goose or turkey shit whose color nearly matches your deck or the grass or the leaves (because some shit wants to blend in, doesn’t it?). It’s kind of like what Wednesday says at the end of Addams Family when asked why she isn’t in costume for Halloween: “I’m going as a serial killer. They look like everyone.” Now that’s shit blending in.

The cool thing about the question is that it’s asked, because that means you’re open to challenging your conceptions, which means you are able to change your priors and perhaps to understand how these priors affect your perception and comprehension of your posteriors, meaning your results. That’s where Bayes connects to life in general: your life is a series of posterior results, achieved in a generalized probability model in which priors influence, to some degree depending on the circumstance, not only the posteriors but also how the posteriors are noticed, categorized and acted on. That’s a fun thing in modeling game theoretical behaviors and coin flips: you can quickly and obviously define circumstances in which there are clear negative and positive outcomes for the entire process of priors affecting posteriors. Now if only people could understand that …

Three basic things that raise suspicion (already mentioned in the comment thread, but consider this as extra votes)
1. Small N (too many papers in social psych and behavioral sciences with N=25 per cell, rather than 50 or 100)
2. Too many rejections of the null with p close to 0.05 (p-hacking)
3. Diana’s point 1, very specific manipulations to prove a general claim.
If these things are present, start looking for issues and you’ll find them.
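To put rough numbers on the small-N point, here is a sketch of approximate power for a two-group comparison at 25, 50, and 100 per cell. It uses a normal approximation and assumes a standardized effect of d = 0.4, which is my own illustrative choice and not a figure from the comment. The function name is hypothetical.

```python
from statistics import NormalDist

def approx_power(d, n_per_cell, alpha=0.05):
    """Normal-approximation power of a two-sided, two-group comparison for standardized effect d."""
    std = NormalDist()
    z = std.inv_cdf(1 - alpha / 2)
    lam = d * (n_per_cell / 2.0) ** 0.5  # expected z-statistic under the assumed effect
    return (1 - std.cdf(z - lam)) + std.cdf(-z - lam)

if __name__ == "__main__":
    for n in (25, 50, 100):
        print(n, round(approx_power(0.4, n), 2))
```

Under these assumptions the design with 25 per cell detects the effect well under half the time, which is one reason a run of significant results from studies of that size should raise suspicion instead of confidence.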

Two things predicted non-replication in the Reproducibility Project Psychology: high p values and surprisingness. Subdiscipline = social psych was a predictor too, but that’s very confounded with surprisingness.
I bet against studies like that in the various prediction markets and always made good money while having to do little actual reading of bad studies.

But replicability is, in a way, boring. Many studies are of course invalid even if they replicate, and detecting that requires real reading and thinking.
The others all mention journal clubs, and that has taught me a lot too (we always have a positive and a negative round, but it often leaves us feeling dejected too).
But you don’t always have access to a journal club of bright minds with the specific expertise who want to discuss the paper. I like the Altmetric bookmarklet https://www.altmetric.com/products/free-tools/bookmarklet/ and the Pubpeer browser extension. Altmetric includes post-publication peer review on Pubpeer, Publons, Twitter and blogs. If there is some criticism there, it can be helpful. This has helped me with stuff outside my immediate discipline. In some disciplines (e.g. genetics) this helps a lot. Unfortunately blogs aren’t well-tagged in Altmetric.

Also, the absence of robustness and sensitivity analyses is a red flag to me. Unfortunately such analyses are infrequent in psychology. I mean they all did these analyses to see whether they could squeeze out a bigger effect, but they don’t report them. Especially in correlational work, I want to see what the model assumptions do; I won’t be convinced that the model you happened to choose was best.

I’m just working on my master’s degree, but reading this blog has been the single most useful tool in my statistical education so far. I took a pile of stats classes in undergrad that taught me how to do basic procedures but little to nothing about statistical thinking… so thanks for everything you do here, it’s incredibly important!

This is slightly aside from the question, but one step that a lot of people I respect seem to do, before even getting to the statistical methods or anything to do with the paper, is to look at the claim and ask, ‘why do I care?’ If the answer is you don’t, just move on and maintain your pristine ignorance. But otherwise, you can map out steps to assess the claim that are proportional to the benefit or cost you expect conditional on the claim having merit, up to and including asking Andrew Gelman to help :).

In fact this is not entirely tongue in cheek: I’ve used the same procedure as a lead on software development teams, evaluating the marketing claims of software products; most of the time you just don’t even bother.

I guess the point is that I think the process of science implicitly has some measure of utility in it, as there are many more questions we choose *not* to study than we chose to study, simply because the answers are largely irrelevant either way and resources are constrained. So perhaps one ought to just do that explicitly even in one’s own evaluation of research.

“I think you have to go Bayesian. And by that I don’t mean you should be assessing your prior probability that the null hypothesis is true. I mean that you have to think about effect sizes, on one side, and about measurement, on the other.”

That’s smart, but how’s that Bayesian? Every frequentist, EDA fan or whatever can do that, too.
Certainly Bayes himself didn’t mention a thing about these things in his famous paper.

In classical statistics, prior information on effect sizes and measurement is used in designing the data collection and in choosing what class of model to use and what variables to include in the model, but that’s it. Bayesian statistics encourages the use of prior information in the model fitting stage as well.

And, for sure, “Bayesian statistics” is not just what’s in Bayes’s paper. He used a uniform prior, at least that’s what I remember. But Bayesian statistics as we define it today does not require uniform priors.

Bayes’ original paper gave the basic principle. True, he used a uniform prior but it’s not a far stretch to use others.
The only Bayesian way to incorporate information about effect sizes and measurements is through priors but that’s of little help when a non-expert reader tries to evaluate a research paper. What is required here is critical thinking about the given information on measurement and effect sizes, not Bayesian computation. Critical thinking is by no means an exclusively Bayesian domain.

Maybe require p less than 0.00000001 just to be safe. But in the meantime we still have to make decisions! I think the problem is in the desire for certainty where none exists. Rather than setting thresholds, I prefer to summarize what knowledge we have and accept our uncertainty. That is, we should be Nate Silvers, and not let our desire for certainty make us into Sam Wangs.

I think it depends in large part on what you’re talking about! I’m currently progressing through a PhD program, and as I’ve completed the Master’s degree, I’m allowed to teach in order to get my funding. The course the Department gave me to teach was…Psych Stats! So I’ve been thinking about this stuff for a while now!

The general thrust of my reporting guidelines on slides 12-15 is that most psychologists generally don’t want to move away from NHST procedures. So the best advice I can offer is to report your test statistic & p-value, PLUS the appropriate effect size measure, PLUS confidence interval, PLUS means & SDs so the reader can see for him/herself how much variability is in the data and the magnitude of the difference in means. Such an approach helps to present a more complete picture of what’s going on in the data, rather than just “p < .05.” Please feel free to check it out and correct me on any misapprehensions that may exist!