Thoughts on preregistering my research

Last week, I submitted the methods for the project I’ve recently started to the Center for Open Science’s Preregistration Challenge. Briefly, the goal of the challenge is to get more scientists to preregister their research, and it’s got a monetary incentive. The goals of preregistration itself are to increase transparency and reproducibility in scientific research.

I’d never done a preregistration before, but it seemed like a Good Thing to Do in the name of Open Science. And the monetary incentive pushed me over the learning-curve barrier and past the fact that it involves a bit more work than usual. I consider my preregistration a bit of an experiment. Having written one now, I have some opinions on the pluses and minuses.

Let’s start with the drawbacks. I found three significant drawbacks, the first of which is simply that preregistration is a foreign concept to most ecologists, and so I had to explain what I was doing (and justify it) a number of times to other people. That was only a slight annoyance in and of itself, but it made the other two drawbacks harder.

The second drawback is that it took me a few months to put together the preregistration plan. This is due to the nature of the project. I am using data produced by NEON and doing a series of complex statistical analyses on them. To do a preregistration means thinking about all the parts of the analyses in depth: what variables am I going to use, how am I going to transform them, what will be the structure of my equations, and how am I going to do inference from model results to scientific meaning. In addition, I had to think about all the “what ifs”: What if I found that some variable was far from normally distributed? What if the data didn’t have good coverage or the response variables didn’t vary in the way I thought they would? What follow-on tests or modeling was I going to do if I got result A versus result B? Note that I didn’t look at the data while I was doing any of this, as required by the conditions of the Preregistration Challenge.
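
As a hedged illustration of what writing down a “what if” in advance can look like, here is a minimal Python sketch of a contingency rule for a skewed variable. The skewness cutoff of 1.0, the choice of a log transform, and the function names are all assumptions made up for this example; they are not rules from the preregistration described here.

```python
import math
from statistics import mean, pstdev


def skewness(values):
    """Population skewness: the mean of the cubed standardized deviations."""
    m = mean(values)
    s = pstdev(values)
    if s == 0:
        return 0.0  # a constant variable has no skew
    return sum(((v - m) / s) ** 3 for v in values) / len(values)


def planned_transform(values, skew_cutoff=1.0):
    """Apply a rule fixed before seeing the data (hypothetical rule).

    If the variable is strictly positive and its skewness exceeds
    skew_cutoff, log-transform it; otherwise leave it unchanged.
    Returns the name of the transform and the transformed values.
    """
    if min(values) > 0 and skewness(values) > skew_cutoff:
        return "log", [math.log(v) for v in values]
    return "none", list(values)


# Example: a strongly right-skewed, strictly positive variable.
name, transformed = planned_transform([1, 1, 1, 2, 100])
print(name)
```

The point of the sketch is only that the decision is written down as a rule before the data are examined, so the same rule is applied whichever way the data turn out.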

These are all very important things to think about, but like most everyone else in ecology, I am accustomed to figuring out many of the answers to these questions when (and if) the situation arises. This classical approach may lead to “researcher degrees of freedom,” however, and I understand why it might be a good idea to preregister. On the other hand, having to figure out so many different contingencies might be a waste of time. If I have to figure out a bunch of contingencies that never happen, that’s time I could have spent moving forward with analyses. I haven’t yet done the analyses, so we’ll see how much this drawback matters.

The final and probably biggest drawback was that I didn’t have any progress to report for three months. No doubt about it: I was making progress, but I didn’t have anything to show for it. I didn’t have any preliminary analyses or graphs or numbers or anything to show that I was doing something. My lab does weekly progress updates and many of mine were feeble sounding: “I worked on some more mathematical modeling.” Blah. Because the NEON staff know I am working with their data, I was also asked by NEON for my opinions about some of the data for their annual review. But because I hadn’t performed any analyses yet, I couldn’t provide any useful feedback, other than “ask me next year! I’ll have all the answers.” Pushing all the results to the end of the project can be a real detriment to projects focused on an analysis of existing data and/or applied projects.

Now the advantages of doing a preregistration plan.

Working through the full scope of my analysis without playing with the real data made me think very hard and carefully about the questions I wanted to ask and the kind of results I expected to get. Instead of just plugging data in, I had to ask, “What if the data are like this? What if the data are like that? What would that mean?” It made me figure out my assumptions in a way that I don’t think I usually do when I figure out analyses as I go along. It made me clarify my qualitative thoughts into quantitative predictions. I think the process made me a better scientist.

I think that having scoped out all my analyses in detail at the start will mean that doing the analyses themselves will go really quickly. In fact, if they do, I think figuring out analyses ahead of time will have saved me time in the long run. I remember playing with a big data set as a grad student and trying to figure out all the various questions I could ask of it. Instead of thinking about what questions were important to ask, I tried to ask as many questions as possible. It took a lot of time and left me with many loose threads that were hard to tie together into a coherent story (for a paper). Being super clear about my questions means, I hope, that writing the paper will be fairly straightforward, which would be yet another time-saver. But all of this depends on the analyses working out okay. That is, I hope that I have enough data with enough variation and that at least some of my predictors actually contribute to predicting the response.

The preregistration queries on the Center for Open Science’s website were super useful in helping me think through my research. I’d recommend using them even if you don’t plan to file an official plan. In particular, when I got to the question about drawing scientific inference from analytical results, I realized I didn’t have a concrete plan. While a p-value of 0.05 is a pretty standard cutoff for a lot of traditional ecology research, I am using Bayesian statistics and am not a fan of arbitrary cutoffs generally. I didn’t have a good answer off the top of my head, so I emailed some colleagues and that turned into an interesting discussion about good/normal/accepted ways to report Bayesian posterior distributions. I don’t think I’d ever have made a conscious effort to figure out how to interpret results otherwise.

Finally, if you do want to take the Preregistration Challenge, I have a couple more notes to recommend it. First, David Mellor has been super responsive and helpful as I waded through my preregistration. Any questions? Ask him. And while the Preregistration Challenge website states that it can take up to two weeks to have your preregistration approved — and that you shouldn’t start your analyses until it is — mine was approved within 24 hours. I’m looking forward to actually putting the data through my models now!

Paul

Thanks for writing this post! I had some similar reservations about the process, but thought it was incredibly helpful for really thinking about the questions at hand. The time sunk into the pre-registration was difficult at the time, but, like you predicted, the real benefits came when the experiment was finished. Analysis and writing up were super speedy. I plan on using pre-registrations for all of my future experiments.

Good luck with the project, and I hope you have a similar positive experience with preregistration.

This is really interesting, Margaret. I’m _almost_ sure that my doubtful reaction to the idea of preregistration is entirely a function of it being something different than I’ve always done – as you say, ecologists just don’t. On the other hand, wow – I’ve _never_ spent three months, or even one, planning out an analysis. Great post!

A very minor point – in your 2nd-last paragraph you talk about p = 0.05 and not liking “arbitrary cutoffs”. As you probably know, there is nothing intrinsically “cutoff-y” about p values, and nothing arbitrarily non-cutoff-y about Bayesian methods. You can do Bayesian methods with preset cutoffs, and you can use p-values without them. In particular, p=0.05 as an arbitrary cutoff is one reasonable philosophy of significance testing, but it’s not the only one, and there’s some interesting history about Fisher and Neyman and the development of the philosophy of the p-value. More here: https://scientistseessquirrel.wordpress.com/2015/11/16/is-nearly-significant-ridiculous/

Oh, I agree. But in a preregistration you have to say how you’re going to interpret your numbers. If I do a regression and calculate a p-value, then I have to specify how I’m going to report that result. I might say that “If the p-value is less than 0.05, I will infer that the experimental results support my hypothesis, and if it is less than 0.001, I will say that there is strong evidence for my hypothesis.” In any case, pre-registration makes you be explicit, and it’s hard to be explicit about things like p-values without using cutoffs of some sort. (Though if you can think of a way you can explicitly say how you would interpret p-values without using cut-offs, I’d love to hear it.) And agreed that you can do cutoffs with Bayesian methods, but I feel like there’s more flexibility in how Bayesian results can be reported than traditional statistical tests (maybe just because of historical precedent).
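
To make the idea of an explicit, pre-stated interpretation rule concrete, here is a minimal Python sketch of the example rule quoted above. The 0.05 and 0.001 cutoffs come from that example; the function name and the wording used when the p-value is 0.05 or higher are assumptions added for illustration.

```python
def interpret_p_value(p):
    """Map a p-value to an inference statement fixed in advance.

    The cutoffs follow the example rule in the text. The statement
    returned when p >= 0.05 is an assumption for this sketch.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("a p-value must lie between 0 and 1")
    if p < 0.001:
        return "strong evidence for the hypothesis"
    if p < 0.05:
        return "results support the hypothesis"
    return "results do not support the hypothesis"


# Example: apply the same pre-stated rule to several results.
for p in (0.0004, 0.03, 0.2):
    print(p, "->", interpret_p_value(p))
```

Written this way, the cutoffs are visible as choices made before the analysis, which is what a preregistration asks for, whatever one thinks of the particular values.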

Thanks for taking the time to reflect on your preregistration! I hope that the upfront cost will be rewarded at the end and that the overall result will be an improvement. One recent trend I have seen to account for some of the uncertainty that can bog one down is the use of written “standard operating procedures.” These provide a backup plan for when things don’t go as expected and provide data-independent justification for decisions made that were not in a preregistration (or a “pre-analysis plan”). Here is one SOP on GitHub, created by Don Green’s lab in experimental political science. There are links on that page to the SOP documents and to a paper explaining their utility: https://github.com/acoppock/Green-Lab-SOP

That is a super idea, and I can imagine it’s very useful for labs that tend to collect the same sorts of data over and over. Thanks for the link; it looks like it would take quite a lot of effort to come up with an SOP from scratch, but modifying someone else’s would be a lot easier.

Ken

Preregistration is something I’ve only been vaguely aware of. It sounds a lot like what I had to go through in a PhD program, but without the absolute commitment to the methods.

I could see preregistering a phylogenetic study or a controlled experiment, but data from NEON… that would be really tough I think and I am not surprised that you had to put in so much effort for it.

So, you cannot analyze any data beforehand. Does that extend to data summaries, or an ordination or two, just to see if your data is worth analyzing/spending time on for your question in the first place?

But I do buy into the whole idea of thinking through a project to the last detail as you outline so well in your post. I agree with David that a high level of preparation will likely result in an improved outcome. And I will certainly check out the Center for Open Science’s website for help with this.

But I also look forward to a follow up post in the future for your assessment. Surely there are diminishing returns in the level of effort expended on the front end. And maybe official preregistration goes too far, even though the basic idea is entirely valid?
