Tuesday, November 29, 2016

I followed my partner to a workshop in plasma physics. The
workshop was held in a mountain resort in Poland – getting there was an
adventure worthy, perhaps, of a separate blog post.

“I’m probably the only non-physicist in the room”, I say, apologetically,
at the welcome reception, when professors come up to me to introduce
themselves, upon seeing an unfamiliar face in the close-knit fusion
community.

Remembering this XKCD comic,
I ask my partner: “How long do you think I could pretend to be a physicist
for?” I clarify: “Let’s say, you are pretending to be a psychological
scientist. I ask you what you’re working on. What would you say?”

“I’d say: ‘I’m working on orthographic depth and how it affects
reading processes, and also on statistical learning and its relationship to
reading’.” Pretty good, that’s what I would say as well.

“So, if you’re pretending to be a physicist, what would you
say if I ask you what you’re working on?”, he asks me.

The look on my partner’s face tells me that I would not do
very well as an undercover agent in the physics community.

The attendees are around 50 plasma physicists, mostly
greying, about three women among the senior scientists, perhaps five female
post-docs or PhD students. Halfway through the reception dinner, I am asked
about my work. In ten sentences, I try to describe what a cognitive
scientist/psycholinguist does, trying to make it sound as scientific and
non-trivial as possible. Several heads turn, curious to listen to my
explanation. I’m asked if I use neuroimaging techniques. No, I don’t, but a lot
of my colleagues and friends do. For the questions I’m interested in, anyway, I
think we know too little about the relationship between brain and mind to make
meaningful conclusions.

“It’s interesting”, says one physicist, “that you could
explain to us what you are doing in ten sentences. For us, it’s much more
difficult.” More people join in, admitting that they have given up trying to
explain to their families what it is they are doing.

“Ondra gave me a pretty good explanation of what he is
doing”, I tell them, pointing at my partner. I sense some scepticism.

Physics envy is a term coined by psychologists (who else?),
describing the inferiority complex associated with striving to be taken seriously
as a field in science. Physics is the prototypical hard science: they have long
formulae, exact measurements where even the fifth decimal place matters, shiny multi-billion-dollar
machines, and stereotypical crazy geniuses who would probably forget their own
head if it wasn’t attached to them. Physicists don’t
always make it easy for their scientific siblings (or distant cousins)* but, admittedly, they do have a right to be smug towards psychological scientists, given the
replication crisis that we’re going through. The average physicist,
unsurprisingly, finds it easier to grasp concepts associated with mathematics
than the average psychologist. This means that physicists have, in general, a better
understanding of probability. When I tell physicists about some of the absurd
statements that some psychologists have made (“Including unpublished studies in
the meta-analysis erroneously biases an effect size estimate towards zero.”;
“Those replicators were just too incompetent to replicate our results. It’s
very difficult to create the exact conditions under which we get the effect:
even we had to try it ten times before we got this significant result!”),
physicists start literally rolling on the floor with laughter. “Why do you even
want to stay in this area of research?” I was asked once, after the physicist I
was talking to had wiped off the tears of laughter. The question sounded
neither rhetorical nor snarky, so I gave a genuine answer: “Because there are a
lot of interesting questions that can be answered, if we improve the
methodology and statistics we use.”

In physics, I am told, no experiment is taken seriously
until it has been replicated by an independent lab. (Unless it requires some unique equipment, in which case it can't be replicated by an independent lab.) Negative results are still considered informative, unless they are due to experimental errors. Physicists still have issues with
researchers who make their results look better than they actually are by
cherry-picking the experimental results that fit best within one’s hypothesis
and with post-hoc parameter adjustments – after all, the publish-or-perish
system looms over all of academia. However, the importance of replicating
results is a lesson that physicists have learnt from their own replication crisis:
in the late 1980s, there was a
shitstorm about cold
fusion, set off by experimental results that were of immense
public interest, but theoretically implausible, difficult to replicate, and
later turned out to be due to sloppy research and/or scientific
misconduct. (Sounds familiar?)

Physicists take their research very seriously, probably to a
large extent because it is often of great financial interest. There are those
physicists who work closely with industry. Even for those who don’t, their work
often involves very expensive experiments. In plasma physics, a shot on the
machine of the Max Planck Institute for Plasma Physics, ASDEX-Upgrade, costs several
thousand dollars. The number of shots required for an experiment depends on the research aims, and whether there is other data available, but can go up to 50 or more. This gives very strong
motivation to make sure that one’s experiment is based on accurate calculations
and sound theories which are supported by replicable studies. Furthermore, as
there is only one machine – and only a handful of similar machines all over
Europe – it needs to be shared with all other internal and external projects.
In order to ensure that shots (and experimental time) are not wasted, any team
wishing to perform an experiment needs to submit an application; the call for
proposals opens only once a year. A representative of the team will also need
to give a talk in front of the committee, which consists of the world’s leading
experts in the area. The committee will decide whether the experiment is likely
to yield informative and important results. In short, it is not possible – as
in psychology – to spend one’s research career testing ideas one has on a whim,
with twenty participants, and publish only if it actually ‘works’. One would be
booed off the stage pretty quickly.

It’s easy to get into an us-and-them mentality and feelings
of superiority and inferiority. No doubt all sciences have something of
importance and of interest to offer to society in general. But it is also
important to understand how we can maximise the utility of the research that we
produce, and in this sense we can take a leaf out of physicists’ books. The
importance of replication should be adopted also into the psychological
literature: arguably, we should simply forget all theories that are based on
non-replicable experiments. Perhaps more importantly, though, we should start
taking our experiments more seriously. We need to increase our sample sizes;
this conclusion seems to be gradually coming through as a consensus in
psychological science. This means that our experiments will also become more
expensive, both in terms of money and time. By conducting sloppy studies, we
may still not lose thousands of dollars of taxpayers’ (or, even worse,
investors’) money for each botched experiment, but we will waste the time of
our participants, the time, nerves and resources of researchers who try to make
sense of or replicate our experiments, and we stall progress in our area of
research, which has strong implications for policy makers in areas ranging from
education through improving social equality, prisoners’ rehabilitation, and political/financial
decision making, to mental health care.

--------------------------------------

* Seriously, though, I haven’t met a physicist who is as bad
as the linked comic suggests.

Acknowledgement: I'd like to thank Ondřej Kudláček, not only for his input into this blogpost and discussions about good science, but also for his unconditional support in my quest to learn about statistics.

Thursday, November 24, 2016

On a website called flexiblemeasures.com, Malte Elson
lists 156 dependent measures that have been used in the literature to quantify
the performance on the Competitive Reaction Time Task. A task which has this
many possible ways of calculating the outcome measure is, in a way, convenient
for researchers: without correcting for multiple comparisons, the probability
that the effect of interest will be significant in at least one of the measures
skyrockets.
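
To put a rough number on "skyrockets": a minimal sketch, assuming the outcome measures are statistically independent and each is tested at alpha = 0.05. Real measures from the same task are correlated, so the true inflation is smaller than this, but the direction is the same. The function name is mine, not from any package.

```python
def familywise_error_rate(n_measures, alpha=0.05):
    """Probability of at least one significant result when no effect exists.

    Assumes independent tests, which overstates the problem for
    correlated outcome measures.
    """
    return 1.0 - (1.0 - alpha) ** n_measures


for k in (1, 8, 156):
    print(f"{k:3d} measures: P(at least one p < .05) = {familywise_error_rate(k):.3f}")
```

Under these assumptions, the chance of at least one false positive is already about one in three with eight measures, and it is close to certainty with 156.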

So does, of course, the probability that a significant
result is a Type-I error (false positive). Such testing of multiple variables
and reporting only the one which gives a significant result is an instance of p-hacking. It becomes problematic when another researcher tries to establish whether there is good evidence for an effect: if one
performs a meta-analysis of the published analyses (using standardised effect
sizes to be able to compare the different outcome measures across tasks), one
can get a significant effect, even if each study reports only random noise and
one creatively calculated outcome variable that ‘worked’.
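
The meta-analysis problem can be illustrated with a small simulation. This is a sketch under assumptions of my own: every study compares two groups of 20 on eight independent outcome measures, the true effect is exactly zero on all of them, and the "meta-analysis" is a simple unweighted average of the reported standardised effects (reasonable here only because all simulated studies have the same size).

```python
import random
import statistics


def simulate_study(rng, n_per_group=20, n_measures=8):
    """Return Cohen's d for every outcome measure; the true effect is zero."""
    effects = []
    for _ in range(n_measures):
        group_a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        group_b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        pooled_sd = ((statistics.variance(group_a) + statistics.variance(group_b)) / 2) ** 0.5
        effects.append((statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd)
    return effects


def pooled_effect(n_studies=200, selective=True, seed=1):
    """Average the effect each study reports across a simulated literature."""
    rng = random.Random(seed)
    reported = []
    for _ in range(n_studies):
        effects = simulate_study(rng)
        # Selective reporting: publish whichever measure 'worked' best.
        # Honest reporting: always publish the first (pre-specified) measure.
        reported.append(max(effects) if selective else effects[0])
    return statistics.mean(reported)


print("pooled d, pre-specified measure:", round(pooled_effect(selective=False), 2))
print("pooled d, best-looking measure: ", round(pooled_effect(selective=True), 2))
```

With honest reporting the pooled effect hovers around zero; with selective reporting it is clearly positive, even though every single study measured nothing but noise.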

Similarly, it becomes difficult for a researcher to
establish how reliable a task is. Take, for example, statistical learning.
Statistical learning, the cognitive ability to derive regularities from the
environment and apply them to future events, has been linked to everything from
language learning to autism. The concept of statistical learning ties to many
theoretically interesting and practically important questions, for example,
about how we learn, and what enables us to be able to use an abstract, complex
system such as language before we even learn to tie a shoelace.

Unsurprisingly, many tasks have been developed that are
supposed to measure this cognitive ability of ours, and to correlate
performance on these tasks to various everyday skills. Let us set aside the
theoretical issues with the proposition that a statistical learning mechanism
underlies the learning of statistical regularities in the environment, and
concentrate on the way statistical learning is measured. This is an important
question for someone who wants to study this statistical learning process:
before running an experiment, one would like to be sure that the experimental
task ‘works’.

As it turns out, statistical learning tasks don’t have
particularly good psychometric properties: when the same individuals perform
different tasks, the correlations between performance on different tasks are
rather low; the test-retest reliability varies across tasks, but ranges from
pretty good to pretty bad (Siegelman & Frost, 2015). For some tasks, performance
is not above chance for the majority of the
participants, meaning that they cannot be used as valid indicators of
individual differences in the statistical learning skill. This raises questions
about why such a large proportion of published studies find that individual
differences in statistical learning are correlated with various life-skills,
and explains anecdotal evidence from myself and colleagues of conducting
statistical learning experiments that just don’t work, in the sense that there
is no evidence of statistical learning.* Relying on flexible outcome measures
increases the researcher’s chances of finding a significant effect or
correlation, which can be especially handy when the task has sub-optimal
psychometric properties (low reliability and validity reduce the statistical
power to find an effect if it exists). Rather than trying to improve the
validity or reliability of the task, it is easier to continue analysing different variables until
something becomes significant.
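
Why low reliability hurts power can be sketched with Spearman's attenuation formula: the correlation we can expect to observe is the true correlation multiplied by the square root of the product of the two reliabilities. The sample-size function below uses the standard Fisher z approximation for a two-sided test; the specific numbers (a true correlation of 0.3, reliabilities of 0.5) are illustrative assumptions, not estimates from the literature.

```python
import math
from statistics import NormalDist


def attenuated_correlation(true_r, reliability_x, reliability_y):
    """Expected observed correlation given imperfectly reliable measures."""
    return true_r * math.sqrt(reliability_x * reliability_y)


def approx_sample_size(r, alpha=0.05, power=0.80):
    """Approximate N needed to detect correlation r (Fisher z approximation)."""
    z = NormalDist().inv_cdf
    return math.ceil(((z(1 - alpha / 2) + z(power)) / math.atanh(r)) ** 2 + 3)


true_r = 0.3  # hypothetical true correlation
observed_r = attenuated_correlation(true_r, reliability_x=0.5, reliability_y=0.5)
print("N needed with perfect measures:   ", approx_sample_size(true_r))
print("N needed with reliabilities of .5:", approx_sample_size(observed_r))
```

In this example, halving the expected correlation roughly quadruples the required sample size, which is the honest alternative to trying more outcome variables.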

The first example of a statistical learning task is the
Serial Reaction Time Task. Here, the participants respond to a series of
stimuli, which appear on different positions on a screen. The participant
presses buttons which correspond to the location of the stimulus. Unbeknown to
the participant, the sequence of the locations repeats; as participants learn
it, their error rates and reaction times decrease. Towards the end of the experiment,
normally in the penultimate block, the order of the locations is scrambled,
meaning that the learned sequence is disrupted. Participants perform worse in
this scrambled block compared to the sequential one. Possible outcome variables
(which can all be found in the literature) are:

- Comparison of accuracy in the scrambled block to the
preceding block

- Comparison of accuracy in the scrambled block to the
succeeding (final) block

- Comparison of accuracy in the scrambled block to
an average of the preceding and succeeding blocks

- The increase in accuracy across the sequential blocks

- Comparison of reaction times in the scrambled block to the
preceding block

- Comparison of reaction times in the scrambled block to the
succeeding (final) block

- Comparison of reaction times in the scrambled block
to an average of the preceding and succeeding blocks

- The decrease in reaction times across the sequential
blocks.
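
To show how cheaply these eight measures fall out of the same data, here is a minimal sketch that computes all of them from per-block means. The function name, the dictionary keys and the example numbers are all hypothetical; it assumes the scrambled block has at least one sequential block before and after it.

```python
def srt_outcome_measures(rt, accuracy, scrambled):
    """Compute the eight outcome measures listed above from per-block means.

    rt, accuracy: lists of block means; scrambled: index of the scrambled block.
    """
    def contrasts(values):
        before, during, after = values[scrambled - 1], values[scrambled], values[scrambled + 1]
        return {
            "scrambled_vs_preceding": during - before,
            "scrambled_vs_succeeding": during - after,
            "scrambled_vs_average": during - (before + after) / 2,
            # Change from the first to the last sequential block before scrambling.
            "change_across_sequential": values[scrambled - 1] - values[0],
        }

    measures = {}
    for name, values in (("rt", rt), ("accuracy", accuracy)):
        for label, value in contrasts(values).items():
            measures[f"{name}_{label}"] = value
    return measures


# Hypothetical participant: six blocks, the fifth (index 4) is scrambled.
example = srt_outcome_measures(
    rt=[500, 480, 460, 440, 520, 450],
    accuracy=[0.90, 0.92, 0.94, 0.95, 0.88, 0.94],
    scrambled=4,
)
for label, value in example.items():
    print(f"{label}: {value:.2f}")
```

Each of the eight values is a defensible index of learning, which is exactly why pre-specifying one of them matters.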

This can hardly compare to the 156 dependent variables from
the Competitive Reaction Time Task, but it already gives the researcher
increased flexibility in selectively reporting only the outcome measures that
‘worked’. As an example of how this can lead to conflicting conclusions about
the presence or absence of an effect: in a recent review, we discussed the
evidence for a statistical learning deficit in developmental dyslexia (Schmalz, Altoè, & Mulatti, in press). With regard to the Serial
Reaction Time Task, we concluded that there was insufficient evidence to decide
whether or not there are differences in performance on this task across
dyslexic participants and controls. Partly, this is because researchers tend to
report different variables (presumably the one that ‘worked’): as it is rare
for researchers to report the average reaction times and accuracy per block (or
to respond to requests for raw data), it was impossible to pick the same
dependent measure from all studies (say, the difference between the scrambled
block and the one that preceded it) and perform a meta-analysis on it. Today, I
stumbled across a meta-analysis on the same question: without taking into
account differences between experiments in the dependent variable, Lum, Ullman, and Conti-Ramsden (2013) conclude that there is
evidence for a statistical learning deficit in developmental dyslexia.

As a second example: in many statistical learning tasks,
participants are exposed to a stream of stimuli which contain regularities. In
a subsequent test phase, the participants then need to make decisions about
stimuli which either follow the same patterns or not. This task can take many
shapes, from a set of letter strings generated by a so-called artificial
grammar (Reber, 1967)
to strings of syllables with varying transitional probabilities (Saffran, Aslin, & Newport, 1996). It should be noted that both
the overall accuracy rates (i.e., the observed rates of learning) and the psychometric
properties vary across different variants of this task (see, e.g., Siegelman, Bogaerts, & Frost, 2016, who specifically aimed
to create a statistical learning task with good psychometric properties). In these tasks, accuracy is
normally too low to allow an analysis of reaction times; nevertheless, different
dependent variables can be used: overall accuracy, the accuracy of grammatical
items only, or the sensitivity index (d’). And, if there is imaging data, one can apparently interpret brain patterns in the complete absence of any evidence of learning on the behavioural level.
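
The three behavioural measures can disagree sharply for the same participant. The sketch below computes them from counts of "grammatical" responses; d' is the z-transformed hit rate minus the z-transformed false alarm rate. The log-linear correction (adding 0.5 to each count and 1 to each total) is my own choice for keeping rates of exactly 0 or 1 finite, and the function names and example counts are hypothetical.

```python
from statistics import NormalDist


def d_prime(hits, n_grammatical, false_alarms, n_ungrammatical):
    """Sensitivity index: z(hit rate) - z(false alarm rate)."""
    # Log-linear correction (an assumption of this sketch) avoids infinite z-scores.
    hit_rate = (hits + 0.5) / (n_grammatical + 1)
    fa_rate = (false_alarms + 0.5) / (n_ungrammatical + 1)
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(fa_rate)


def accuracy_measures(hits, n_grammatical, false_alarms, n_ungrammatical):
    """Return the three dependent variables for one participant."""
    correct_rejections = n_ungrammatical - false_alarms
    return {
        "overall_accuracy": (hits + correct_rejections) / (n_grammatical + n_ungrammatical),
        "grammatical_accuracy": hits / n_grammatical,
        "d_prime": d_prime(hits, n_grammatical, false_alarms, n_ungrammatical),
    }


# A participant who answers "grammatical" on every one of 40 trials.
yes_sayer = accuracy_measures(hits=20, n_grammatical=20, false_alarms=20, n_ungrammatical=20)
print(yes_sayer)
```

A participant who simply says "grammatical" to everything looks perfect on the accuracy of grammatical items, sits at chance on overall accuracy, and shows no sensitivity at all on d'.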

In summary, flexible measures could be an issue for
evaluating the statistical learning literature: both in finding out which tasks
are more likely to ‘work’, and in determining to what extent individual
differences in statistical learning may be related to everyday skills such as
language or reading. This does not mean that statistical learning does not
exist, or that all existing work on this topic is flawed. However, it creates
cause for healthy scepticism about the published results, and many interesting
questions and challenges for future research. Above all, the field would
benefit from increased awareness of issues such as flexible measures, which
would lead to the pressure to increase the probability of getting a significant
result by maximising the statistical power, i.e., decreasing the Type-II error
rate (through larger sample sizes and more reliable and valid measures), rather
than using tricks that affect the Type-I error rate.

Wednesday, September 21, 2016

Yesterday, I woke up to a shitstorm on Twitter, caused by an
editorial-in-press by social psychologist Susan Fiske (who wrote my
undergraduate Social Psych course textbook). The full text of the editorial,
along with a superb commentary from Andrew Gelman, can be found here. This editorial, which launches an
attack against so-called methodological terrorists who have the audacity to
criticise their colleagues in public, has already inspired blog posts such as
this one
by Sam Schwarzkopf and this
one which broke the time-space continuum by Dorothy Bishop.

However, I would like to write about one aspect of
Susan Fiske’s commentary, which also emerged in a subsequent discussion with
her at the congress of the German Society for Psychology (which, alas, I
followed only on twitter). In the editorial Fiske states that psychological
scientists at all stages of their career are being bullied; she seems especially worried about graduate
students who are leaving academia. In the subsequent discussion, as cited by Malte Elson,
she specifies that >30 graduate students wrote to her, in fear of cyberbullies.*

Being an early career researcher myself, I can try to
imagine myself in a position where I would be scared of “methodological
terrorists”. I can’t speak for all ECRs, but for what it’s worth, I don’t see
any reason to stifle public debate. Of course, there is internet harassment
which is completely inexcusable and should be punished (as covered by John
Oliver in this video).
But I have never seen, nor heard of, a scientific debate which dropped to the
level of violence, rape or death threats.

So, what is the worst thing that can happen in academia?
Someone finds a mistake in your work (or thinks they have found a mistake), and
makes it public, either through the internet (twitter, blog), a peer-reviewed
paper, or by screaming it out at an international conference after your talk.
Of course, on a personal level, it is preferable that before or instead of
making it public, the critic approaches you privately. On the other hand, the
critic is not obliged to do this – as others build on your work, it is only
fair that the public should be informed about a potential mistake. It is
therefore, in practice, up to the critic to decide whether they will approach
you first, or whether they think that a public approach would be more effective
in getting an error fixed. Similarly, it would be nice of the critic to adopt a
kind, constructive tone. It would probably make the experience more pleasant
(or less unpleasant) for both parties, and be more effective in convincing the
person who is criticised to think about the criticiser’s point and to decide
rationally whether or not this is a valid point. But again, the critic is not
obliged to be nice – someone who stands up at a conference to publicly destroy
an early career researcher’s work is an a-hole, but not a criminal. (Though I
can even imagine scenarios where such behaviour would be justified, for
example, if the criticised researcher has been unresponsive to private
expressions of concern about this work.)

As an early career researcher, it can be very daunting to
face an audience of potential critics. It is even worse if someone accuses you
of having done something wrong (whether it’s a methodological shortcoming of
your experiment, or a possibly intentional error in your analysis script). I
have received some criticism throughout my five-year academic career; some of
it was not fair, though most of it was (even though I would sometimes deny it,
in the initial stages). Furthermore, there are cultural differences in how
researchers express their concern with some aspect of somebody’s work: in
English-speaking countries (Australia, UK, US), much softer words seem to be
used for criticising than in many mainland European countries (Italy, Germany).
When I spent six months during my PhD in Germany, I was shocked at some of the
conversations I had overheard between other PhD students and their supervisors
– being used to the Australian style of conversation it seemed to me that
German supervisors could be straight-out mean. Someone who is used to being
told about a mistake with the phrase: “This is good, but you might want to
consider…” is likely to be shocked and offended if they go to an international
conference and someone tells them straight out: “This is wrong.” This could
lead to some people feeling personally attacked due to what is more or less a
cultural misunderstanding.

In any event, it is inevitable that one makes mistakes from
time to time, and that someone finds something to criticise about your work.
Indeed, this is how science progresses. We make mistakes, and we learn from
them. We learn from others’ mistakes. Learning is what science is all about.
Someone who doesn’t want to learn cannot be a scientist. And if nobody ever
tells you that you made a mistake, you cannot learn from it. Yes, criticism
stings, and some people are more sensitive than others. However, responding to
criticism in a constructive way, and being aware of potential cultural
differences in how criticism is conveyed, is part of the job description of an
academic. Somebody who reacts explosively or defensively to criticism cannot be
a scientist just like someone who is afraid of water cannot be an Olympic
swimmer.

---------------------------

* In response to this, Daniël Lakens wrote, in a series of tweets
(I can’t phrase it better): “100+ students told me they think of quitting
because science is no longer about science. [… They are the] ones you want to
stay in science, because they are not afraid, they know what to do, they just
doubt if a career in science is worth it.”

Monday, June 27, 2016

Anyone who has talked to me in the last year would have heard me complain about my 8-times-failure-to-replicate which nobody wants to publish. The preprint, raw data and analysis scripts are available here, so anyone can judge for themselves if they think the rejections to date are justified. In fact, if anyone can show me that my conclusions are wrong – that the data are either inconclusive, or that they actually support an opposite view – I will buy them a bottle of drink of their choice*. So far, this has not happened.

I promise to stop complaining about this after I publish this blog post. I think it is important to be aware of the current situation, but I am, by now, just getting tired of debates which go in circles (and I’m sure many others feel the same way). Therefore, I pledge that from now on I will stop writing whining blog posts, and I will only write happy ones – which have at least one constructive comment or suggestion about how we could improve things.

So, here goes my last ever complaining post. I should stress that the sentiments and opinions I describe here are entirely my own; although I’ve had lots of input from my wonderful co-authors in preparing the manuscript of my unfortunate paper, they would probably not agree with many of the things I am writing here.

Why is it important to publish failures to replicate?

People who haven’t been convinced by the arguments put forward to date will not be convinced by a puny little blogpost. In fact, they will probably not even read this. Therefore, I will not go into details about why it is important to publish failures to replicate. Suffice it to say that this is not my opinion – it’s a truism. If we combine a low average experimental power with selective publishing of positive results, we – to use Daniel Lakens’ words – get “a literature that is about as representative of real science as porn movies are representative of real sex”. We get over-inflated effect sizes across experiments, even if an effect is non-existent; or, in the words of Michael Inzlicht, “meta-analyses are fucked”.
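
For readers who do want to see the mechanism, here is a minimal simulation sketch. The assumptions are mine: a small true effect (d = 0.2), 20 participants per group (so low power), and a literature in which only significant results in the expected direction get published. Significance is decided with a normal approximation to the critical effect size rather than an exact t-test.

```python
import random
import statistics


def observed_d(rng, true_d, n_per_group):
    """Simulate one two-group study and return its observed Cohen's d."""
    treatment = [rng.gauss(true_d, 1) for _ in range(n_per_group)]
    control = [rng.gauss(0, 1) for _ in range(n_per_group)]
    pooled_sd = ((statistics.variance(treatment) + statistics.variance(control)) / 2) ** 0.5
    return (statistics.mean(treatment) - statistics.mean(control)) / pooled_sd


def simulate_literature(true_d=0.2, n_per_group=20, n_studies=2000, seed=7):
    """Return (mean d of all studies, mean d of published studies, number published)."""
    rng = random.Random(seed)
    # Smallest d that reaches p < .05 two-sided (normal approximation).
    critical_d = 1.96 * (2 / n_per_group) ** 0.5
    all_effects = [observed_d(rng, true_d, n_per_group) for _ in range(n_studies)]
    published = [d for d in all_effects if d > critical_d]
    return statistics.mean(all_effects), statistics.mean(published), len(published)


mean_all, mean_published, n_published = simulate_literature()
print(f"mean d across all studies:       {mean_all:.2f}")
print(f"mean d across published studies: {mean_published:.2f} ({n_published} published)")
```

Under these assumptions the full set of studies recovers the small true effect, while the published subset, by construction, can only contain effects several times larger.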

Our study

The interested reader can look up further details of our study in the OSF folder I linked above (https://osf.io/myfk3/). The study is about the Psycholinguistic Grain Size Theory (Ziegler & Goswami, 2005)**. If you type the name of this theory into google – or some other popular search terms, such as “dyslexia theory”, “reading across languages”, or “reading development theory” – you will see this paper on the first page. It has 1650 citations, at the time of writing of this blogpost. In other words, this theory is huge. People rely on it to interpret their data, and to guide their experimental designs and theories in diverse topics of reading and dyslexia.

The evidence for the Psycholinguistic Grain Size Theory is summarised in the preprint linked above; the reader can decide for themselves if they find it convincing. During my PhD, I decided to do some follow-up experiments on the body-N effect (Ziegler & Perry, 1998; Ziegler et al., 2001; Ziegler et al., 2003). Why? Not because I wanted to build my career on the ruins of someone else’s work (which is apparently what some people think of replicators), but because I found the theory genuinely interesting, and I wanted to do further work to specify the locus of this effect. So I did study after study after study – blaming myself for the messy results – until I realised: I had conducted eight experiments, and the effect just isn’t there. So I conducted a meta-analysis on all of our data, plus an unpublished study by a colleague with whom I’d talked about this effect, wrote it up and submitted it.

Surely, in our day and age, journals should welcome null-results as much as positive results? And any rejections would be based on flaws in the study?

Well, here is what happened:

Submission 1: Relatively high-impact journal for cognitive psychology

Here is a section directly copied-and-pasted from a review:

“Although the paper is well-written and the analyses are quite substantial, I find the whole approach rather irritating for the following reasons:

1. Typically meta-analyses are done one [sic] published data that meet the standards for publishing in international peer-reviewed journals. In the present analyses, the only two published studies that reported significant effects of body-N and were published in Cognition and Psychological Science were excluded (because the trial-by-trial data were no longer available) and the authors focus on a bunch of unpublished studies from a dissertation and a colleague who is not even an author of the present paper. There is no way of knowing whether these unpublished experiments meet the standards to be published in high-quality journals.”

Of course, I picked the most extreme statement. Other reviewers had some cogent points – however, nothing that would compromise the conclusions. The paper was rejected because “the manuscript is probably too far from what we are looking for”.

Submission 2: Very high-impact psychology journal

As a very ambitious second plan, we submitted the paper to one of the top journals in psychology. It’s a journal which “publishes evaluative and integrative research reviews and interpretations of issues in scientific psychology. Both qualitative (narrative) and quantitative (meta-analytic) reviews will be considered, depending on the nature of the database under consideration for review” (from their website). They have even announced a special issue on Replicability and Reproducibility, because their “primary mission […] is to contribute a cohesive, authoritative, theory-based, and complete synthesis of scientific evidence in the field of psychology” (again, from their website). In fact, they published the original theoretical paper, so surely they would at least consider a paper which argues against this theory? As in, send it out for review? And reject it based on flaws, rather than the standard explanation of it being uninteresting to a broad audience? Given that they published the original theoretical article, and all? Right?

Wrong, on all points.

Submission 3: A well-respected, but not huge impact factor journal in cognitive psychology

I agreed to submit this paper to a non-open-access journal again, but only under the condition that at least one of my co-authors would have a bet with me: if it got rejected, I would get a bottle of good whiskey. Spoiler alert: I am now the proud owner of a 10-year aged bottle of Bushmills.

To be fair, this round of reviews brought some cogent and interesting comments. The first reviewer provided some insightful remarks, but their main concern was that “The main message here seems to be a negative one.” Furthermore, the reviewer “found the theoretical rationale [for the choice of paradigm] to be rather simplistic”. Your words, not mine! However, for a failure to replicate, this is irrelevant. As many researchers rely on what may or may not be a simplistic theoretical framework which is based on the original studies, we need to know whether the evidence put forward by the original studies is reliable.

I could not quite make sense of all of the second reviewer’s comments, but somehow they argued that the paper was “overkill”. (It is very long and dense, to be fair, but I do have a lot of data to analyse. I suspect most readers will skip from the introduction to the discussion, anyway – but anyone who wants the juicy details of the analyses should have easy access to them.)

Next step: Open-access journal

I like the idea of open-access journals. However, when I submitted previous versions of the manuscript I was somewhat swayed by the argument that going open access would decrease the visibility and credibility of the paper. This is probably true, but without any doubt, the next step will be to submit the paper to an open-access journal. Preferably one with open review. I would like to see a reviewer calling a paper “irritating” in a public forum.

At least in this case, traditional journals have shown – well, let’s just say that we still have a long way to go in improving replicability in psychological sciences. For now, I have uploaded a pre-print of the paper on OSF and on researchgate. On researchgate, the article has over 200 views, suggesting that there is some interest in this theory; the finding that the key study is not replicable seems relevant to researchers. Nevertheless, I wonder if the failure to provide support for this theory will ever gain as much visibility as the original study – how many researchers will put their trust into a theory that they might be more sceptical about if they knew the key study is not as robust as it may seem?

In the meantime, my offer of a bottle of beverage for anyone who can show that the analyses or data are fundamentally flawed, still stands.

-------------------------------------------------------

* Beer, wine, whiskey, brandy: You name it. Limited only by my post-doc budget.

** The full references of all papers cited throughout the blogpost can be found in the preprint of our paper.

-----------------------------------------

Edit 30/6: Thanks all for the comments so far, I'll have a closer look at how I can implement your helpful suggestions when I get the chance!

Please note that I will delete comments from spammers and trolls. If you feel the urge to threaten physical violence, please see your local counsellor or psychologist.

Thursday, June 16, 2016

You are working on a
theoretical paper about the proposed relationship between X and Y. A
two-experiment study has previously shown that X and Y are correlated, and you
are trying to explain the cognitive mechanisms that drive this correlation.
This previous study makes conclusions based on partial correlations which take
into account a moderator that has not been postulated a priori; raw
correlations are not reported. The p-values for each of the two partial
correlations are < 0.05, but > 0.04. In a theoretical paper, you stress
that although it makes theoretical sense that there would be a correlation
between these variables, we cannot be sure about this link.

In a different
paradigm, several studies have found a group difference in a certain task. In
most studies, this group difference has a Cohen’s d of around 0.2. However, three
studies which all come from the same lab report Cohen’s ds ranging between 0.8 and 1.1. You
calculate that it is very unlikely to obtain three huge effects such as these
by chance alone (probability < 1%).
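
One rough way to run a calculation like this is sketched below. It is an illustration under assumptions that are mine and not taken from any study: a true effect of d = 0.2, 20 participants per group (a hypothetical number), independent studies, and a large-sample normal approximation to the sampling distribution of Cohen's d. It also ignores publication bias, which would make large published effects less surprising.

```python
from math import sqrt
from statistics import NormalDist


def prob_d_at_least(observed_d, true_d, n_per_group):
    """Approximate P(sample Cohen's d >= observed_d) for two equal-sized groups."""
    # Large-sample approximation to the standard error of Cohen's d
    # when both groups have n_per_group participants.
    se = sqrt(2 / n_per_group + true_d ** 2 / (4 * n_per_group))
    return 1 - NormalDist(mu=true_d, sigma=se).cdf(observed_d)


TRUE_D = 0.2       # the effect size most studies report
N_PER_GROUP = 20   # hypothetical sample size, assumed for illustration only

# Chance that a single study shows d >= 0.8 if the true effect is 0.2
p_single = prob_d_at_least(0.8, TRUE_D, N_PER_GROUP)

# Chance that three independent studies all do so
p_three = p_single ** 3

print(f"one study:     {p_single:.4f}")
print(f"three studies: {p_three:.6f}")
```

With larger samples per study, the probability shrinks further, so the choice of N_PER_GROUP matters a lot for the conclusion.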

For a different
project, you fail to find an effect which has been reported by a previously
published experiment. The authors of this previous study have published their
raw data a few years after the original paper came out. You take a close look
at this raw data, and find some discrepancies with the means reported in the
paper. When you analyse the raw data, the effect disappears.

What would you do in each of the scenarios above? I would be
very happy to hear about it in the comments!

From each of these scenarios, I would draw two conclusions:
(1) The evidence reported by these studies is not strong, to say the least, and
(2) it is likely that the authors used what we now call questionable research
practices to obtain significant results. The question is what we can conclude
in our hypothetical paper, where the presence or absence of the effect
is critical. Throwing around accusations of p-hacking
can turn ugly. First, we cannot be absolutely sure that there is something fishy. Even if you calculate that the likelihood of obtaining a certain
result is minimal, it is still greater than zero – you can never be completely
sure that there really is something questionable going on. Second, criticising
someone else’s work is always a hairy issue. Feelings may get hurt, and the
desire for revenge may arise; careers can get destroyed. Especially as an
early-career researcher, one wants to stay clear of close-range combat.

Yet, if your work rests on these results, you need to make something of them. One could just ignore
them – not cite these papers, pretend they don’t exist. It is difficult to draw
conclusions from studies with questionable research practices, so they may as
well not be there. But ignoring relevant published work would be childish and
unscientific. Any reader of your paper who is interested in the topic will
notice this omission. Therefore, one needs to at least explain why one thinks
the results of these studies may not be reliable.

One can’t explain why one doesn’t trust a study without
citing it – a general phrase such as: “Previous work has shown this effect, but
future research is needed to confirm its stability” will not do. We could
remain general in our accusations: “Previous work has shown this effect (Lemmon & Matthau, 2000), but future research is needed to confirm its stability”. This,
again, does not sound very convincing.

There are therefore two possibilities: either we drop the
topic altogether, or we write down exactly why the results of the published
studies would need to be replicated before we would trust them, kind of like what I
did in the examples at the top of the page. This, of course, could be
misconstrued as a personal attack. Describing
such studies in my own papers is an exercise involving very careful phrasing
and proofreading for diplomacy by very nice colleagues. Unfortunately, this
often leads to the watering down of arguments, and tip-toeing around the real
issue, which is the believability of a specific result. And when we think about
it, this is what we are criticising – not the original researchers. Knowledge
about questionable research practices is spreading gradually; many researchers
are still in the process of realising that they can really damage a research
area. Therefore, judging researchers for what they have done in the past would
be neither productive, nor wise.

Should we judge a scientist for having used questionable research
practices? In general, I don’t think so. I am convinced that the majority of
researchers don’t intend to cheat, but they are convinced that they have
legitimately maximised their chance of finding a very small and subtle effect. It is, of course, the responsibility of a criticiser to make it clear that the problem is with the study, not with the researcher who conducted it. But the researchers whose work is being criticised should also consider whether the criticism is fair, and respond accordingly. If they are prepared to correct any mistakes – publishing file-drawer studies, releasing untrimmed data, conducting a replication, or in more extreme cases publishing a correction or even retracting a paper – it is unlikely that they will be judged negatively by the scientific community, quite the contrary.

But
there are a few hypothetical scenarios where my opinion of the researcher would
decrease: (1) If the questionable research practice was data fabrication rather
than something more benign such as creative outlier removal, (2) if the
researchers use any means possible to suppress studies which criticise or fail
to replicate their work, or (3) if the researchers continue to engage in
questionable research practices, even after they learn that it increases their
false-positive rate. This last point bears further consideration, because
pleading ignorance is becoming less and less defensible. By now, a researcher
would need to live under a rock if they have not even heard about the
replication crisis. And a good, curious researcher should follow up on hearing
such rumours, to check whether issues in replicability could also apply to
them.

In summary, criticising existing studies is essential for
scientific progress. Identifying potential issues with experiments will save
time as researchers won’t go off on a wild-goose-chase for an effect that
doesn’t exist; it will help us to narrow down on studies which need to be
replicated before we consider that they are backed up by evidence. The
criticism of a study, however, should not be conflated with criticism of the
researcher – either by the criticiser or by the person being criticised. A strong distinction between criticism of a study and criticism of a researcher would result in a climate where discussions about the reproducibility of specific
studies lead to scientific progress rather than to a battlefield.

Saturday, May 21, 2016

Recently, I was asked: “What made you interested in research
methods?” I’m afraid I didn’t give a good answer, but instead started
complaining about my eight-times
failure to replicate that nobody wants to publish. I have been thinking
about this question some more, and realised that my interest in research
methods and good science is driven by predominantly selfish reasons. This gave
me the idea to write a blog post: I think it is important to realise that
striving towards good science is, in the long run, beneficial to a researcher.
So let’s ignore the “how” for the time being (there are already many articles
and blog posts on this issue; see, for example, entries for an essay
contest by The Winnower) – let’s focus on the “why”.

The world as it
should be

Let’s imagine the research world
as it should (or could) be. Presumably, we all went into research because we
wanted to learn more about the world – and we wanted to actively contribute to
discovering new knowledge. Imagine that we live in a world where we can trust the
existing literature. Theories are based on experiments that are sound and
replicable. The job of a researcher is to keep up to date on this literature,
find gaps, and design experiments that can fill these gaps, thus providing a
more complete picture of the phenomenon they are studying.

The world as it is

The research world as it is
provides two sources of frustration (at least for me): (1) playing Russian
Roulette when it comes to conducting experiments, and (2) sifting through a
literature which consists of an unknown ratio of manure to pearls, and trying
to find the pearls.

Russian Roulette

I have conducted numerous experiments during my PhD and
post-doc so far, and a majority of them “didn’t work”. By “didn’t work”, I mean
they showed non-significant p-values
when I expected an effect, showed different results from published experiments
(again, my eight-times
failure to replicate), and occasionally, they were just not designed very
well and I would get floor/ceiling effects. I attributed this to my own lack of
experience and competence. I looked at my colleagues, who had many published
experiments, and considered alternative career paths. In the last year of my
PhD, I came to a realisation: even
professors have the same problem.

In the research world as it is, a
researcher may come up with an idea for an experiment. It can be a great idea,
based on a careful evaluation of theories and models. The experiment can be
well-designed and neat, providing a pertinent test of the researcher’s
hypothesis. Then the data is collected and analysed – and it is discovered that
the experiment “didn’t work”. Shoulders are shrugged – the researcher moves on.
Occasionally, one experiment will “work” and can be published.

How is it possible, I asked
myself, that so much good research goes to waste, just because an experiment
“didn’t work”? Is it really necessary to completely discard a promising
question or theory, just because a first attempt at getting an answer “didn’t
work”? How many labs conduct experiments that “don’t work”, not knowing that
other labs have already tried and failed with the same approach? These are, as
of now, rhetorical questions, but I firmly believe that learning more about
research methods and how these can be used to produce sound and efficient
experiments can answer them.

Sifting through manure

Some theories are intuitively appealing, apparently elegant,
and elicit a lot of enthusiasm in a lot of people. New PhD students want to
“do something with this theory”, and try to do follow-up studies, only to find
that their follow-up experiments “don’t work”, replications of the experiments
that support the theory “don’t work”, and the theory doesn’t even make sense
when you really think about it. *

Scientists stand on the shoulders
of giants. Science cannot be done without relying on existing knowledge at
least to some extent. In an ideal world, our experiments and theories should
build on previous work. However, I often get the feeling that I am building on
manure instead of a sound foundation.

So, in order to try and
understand whether I can trust an effect, I sift through the papers on it. I
look for evidence of publication bias, dodgy-sounding post-hoc moderators or
trimming decisions, statistical and logical errors (such as concluding that the
difference between two groups is significant because one is significantly above
chance while the other is not), and check whether studies with larger sample sizes
tend to give negative results, while positive results are predominantly
supported by studies with small samples. It’s a thankless job. I criticise and
question the work of colleagues, who are often in senior positions and may well
one day make decisions that affect my livelihood.
At the same time, I lack the time to conduct experiments to test and develop my
own ideas. But what else should I do? Close my eyes to these issues and just
work on my own line of research? Spending less or no time scrutinising the
existing literature would mean that I don’t know whether I am building
my research agenda on pearls or manure. This would mean that I could waste
months or years on a question that I should have known to be a dead end from
the very beginning.
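
To make the logical error mentioned above concrete, here is a small sketch with made-up numbers (the accuracies and standard errors are hypothetical, chosen only for illustration, and the tests are simple z-tests). Group A is significantly above chance and group B is not, yet the direct comparison between the two groups is nowhere near significant.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(estimate, null_value, se):
    """Two-sided p-value of a z-test of estimate against null_value."""
    z = (estimate - null_value) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


CHANCE = 0.5
# Hypothetical group results: mean accuracy and its standard error
mean_a, se_a = 0.60, 0.04
mean_b, se_b = 0.55, 0.04

p_a = two_sided_p(mean_a, CHANCE, se_a)   # group A against chance
p_b = two_sided_p(mean_b, CHANCE, se_b)   # group B against chance

# The comparison that actually matters: group A against group B.
# For independent groups, the standard error of the difference
# combines both standard errors.
se_diff = sqrt(se_a ** 2 + se_b ** 2)
p_diff = two_sided_p(mean_a - mean_b, 0.0, se_diff)

print(f"A vs chance: p = {p_a:.3f}")
print(f"B vs chance: p = {p_b:.3f}")
print(f"A vs B:      p = {p_diff:.3f}")
```

A claim that the groups differ needs the third test, not the first two.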

Conclusion

So, why am I interested in research methods? Because it will
make research more efficient, for me personally. It is difficult to conduct a
good study, but in the long run, it should be no more difficult than running a
number of crappy studies and publishing the one that “worked”. It should also be
much less frustrating, much more rewarding, and in the end, we will do what we
(presumably) love: contribute to discovering new knowledge about how the world
works.

-----------------------------------------------------------------

* This example is fictional. Any resemblance to real persons
or events is purely coincidental.