Tag: peer review

Jack Welch got a little conspiracy-theory crazy with the job numbers. Thomas Lumley over at StatsChat makes a pretty good case for debunking the theory. I think the real take home message of Thomas’ post and one worth celebrating/highlighting is that agencies that produce the jobs report do so based on a fixed and well-defined study design. Careful efforts by government statistics agencies make it hard to fudge/change the numbers. This is an underrated and hugely important component of a well-run democracy.

On a similar note, Dan Gardner at the Ottawa Citizen points out that evidence-based policy making is actually not enough. He points out the critical problem with evidence: in the era of data what is a fact? “Facts” can come from flawed or biased studies just as easily as from strong studies. He suggests that a true “evidence based” administration would invest more money in research/statistical agencies. I think this is a great idea.

An interesting article by Ben Bernanke suggests that an optimal approach (in baseball and in policy) is one based on statistical analysis, coupled with careful thinking about long-term versus short-term strategy. I think one of his arguments about allowing players to play even when they are struggling short term is actually a case for letting the weak law of large numbers play out. If you have a player with skill/talent, their performance will eventually converge to their “true” numbers. It’s also good for their confidence... (via David Santiago).
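To make the law-of-large-numbers point concrete, here is a minimal Python sketch. It assumes every at-bat is an independent draw with a fixed hit probability (a hypothetical "true" average of .300), which real baseball certainly is not. The function name, seed, and numbers are mine, purely for illustration.

```python
import random

def running_average(true_avg, at_bats, seed=0):
    """Simulate independent at-bats and return the batting average after each one."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    hits = 0
    averages = []
    for n in range(1, at_bats + 1):
        hits += rng.random() < true_avg  # a hit counts as 1, an out as 0
        averages.append(hits / n)
    return averages

# Hypothetical player whose "true" average is .300
averages = running_average(true_avg=0.300, at_bats=100_000)
for n in (10, 100, 1_000, 10_000, 100_000):
    print(f"after {n:>6} at-bats: {averages[n - 1]:.3f}")
```

Early in the run the average can sit well away from .300, which is the "struggling short term" part. The later values should settle close to it, though how quickly depends on the independence assumption above.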

Here is another interesting peer review dust-up. It explains why some journals “reject” papers when they really mean major/minor revision: doing so pushes down their reported review times. I think this highlights yet another problem with pre-publication peer review. The evidence is mounting, but I hear we may get a defense of the current system from one of the editors of this blog, so stay tuned…

Several people (Sherri R., Alex N., many folks on Twitter) have pointed me to this article about gender bias in science. I initially was a bit skeptical of such a strong effect across a broad range of demographic variables. After reading the supplemental material carefully, it is clear I was wrong. It is a very well designed/executed study and suggests that there is still a strong gender bias in science, across ages and disciplines. Interestingly, both men and women were biased against the female candidates. This is clearly a non-trivial problem to solve and needs a lot more work. Maybe one step is to make recruitment packages more flexible (see the comment by Allison T. especially).

Nate Silver, everyone’s favorite statistician made good, just gave an interview where he said he thinks many journal articles should be blog posts. I have been thinking about this same issue for a while now, and I’m not the only one. This is a really interesting post suggesting that although scientific journals once facilitated dissemination of ideas, they now impede the flow of information and make it more expensive.

Two recent examples really drove this message home for me. In the first example, I posted a quick idea called the Leekasso, which led to some discussion on the blog, has nearly 2,000 page views (a pretty decent number of downloads for a paper), and has been implemented in software by someone other than me. If this were one of my papers, it would be one of the more reasonably high impact papers. The second example is a post I put up about a recent Nature paper. The authors (who are really good sports) ended up writing to me to get my critiques. I wrote them out, and they responded. All of this happened after peer review and informally. All of the interaction also occurred in email, where no one can see but us.

It wouldn’t take much to go to a blog-based system. What if everyone who was publishing scientific results started a blog (free), and then there was a site, run by pubmed, that aggregated the feeds (this would be super cheap to set up/maintain)? Then people could comment on blog posts and vote for ones they liked if they had verified accounts. We would skip peer review in favor of just producing results and discussing them. The results that were interesting would be shared by email, Twitter, etc.

Why would we do this? Well, the current journal system: (1) significantly slows the publication of research, (2) costs thousands of dollars, and (3) costs significant labor that is not scientifically productive (such as resubmitting).

Almost every paper I have had published has been rejected in at least one place, including the “good” ones. This means that the results of even the good papers have been delayed by months. Or, in the case of one paper, by a full year and a half. Any time I publish open access, it costs me at minimum around $1,500. I like open access because I think science funded by taxpayers should be free. But it is a significant drain on the resources of my group. Finally, most of the resubmission process is wasted labor. It generally doesn’t produce new science or improve the quality of the science. The effort is just in reformatting and re-inputting information about authors.

So why not have everyone just post results on their blog/figshare? They’d have a DOI that could be cited. We’d reduce everyone’s labor in reviewing/editing/resubmitting by an order of magnitude or two and save the taxpayers a few thousand dollars each a year in publication fees. We’d also increase the speed of updating/reacting to new ideas by an order of magnitude.

I still maintain we should be evaluating people based on reading their actual work, not highly subjective and error-prone indices. But if the powers that be insisted, it would be easy to evaluate people based on likes/downloads/citations/discussion of papers rather than on the basis of journal titles and the arbitrary decisions of editors.

So should we stop publishing peer-reviewed papers?

Edit: Titus points to a couple of good posts with interesting ideas about the peer review process that are worth reading, here and here. Also, Joe Pickrell et al. are already on this for population genetics, having set up the aggregator Haldane’s Sieve. It would be nice if this expanded to other areas (and people got credit for the papers published there, like they do for papers in journals).

This is part of the ongoing series of pro tips for graduate students; check out parts one and two for the original installments.

Learn how to write papers in a very clear and simple style. Whenever you can, write in plain English, skip jargon as much as possible, and make the approach you are using understandable and clear. This can (sometimes) make it harder to get your papers into journals. But simple, clear language leads to much higher use/citation of your work. Examples of great writers are: John Storey, Rob Tibshirani, Robert May, Martin Nowak, etc.

It is a great idea to start reviewing papers as a graduate student. Don’t do too many, since you should focus on your research, but doing a few will give you a lot of insight into how the peer-review system works. Ask your advisor/research mentor; they will generally have a review or two they could use help with. When doing reviews, keep in mind a person spent a large chunk of time working on the paper you are reviewing. Also, don’t forget to use Google.

Try to write your first paper as soon as you possibly can and try to do as much of it on your own as you can. You don’t have to wait for faculty to give you ideas: read papers and think about what would have been better (you might check with a faculty member first so you don’t repeat what’s done, etc.). You will learn more writing your first paper than in almost any/all classes.

Genome Biology: Submitted 11/1/10, rejected 1/5/11. 2/3 referees thought the paper was interesting, few specific concerns raised. I felt they could be addressed so appealed on 1/10/11, appeal accepted 1/20/11, paper resubmitted 1/21/11. Paper rejected 2/25/11. 2/3 referees were happy with the revisions. One still didn’t like it.

Bioinformatics: Submitted 3/3/11, rejected 3/13/11 without review. I appealed again, and it turns out “I have checked with the editors about this for you and their opinion was that there was already substantial work in validating gene lists based on random sampling.” If anyone knows about one of those papers let me know :-).

Nucleic Acids Research: Submitted 3/18/11, rejected with invitation for revision 3/22/11. Resubmitted 12/15/11 (got delayed by a few projects here) rejected 1/25/12. Reason for rejection seemed to be one referee had major “philosophical issues” with the paper.

An interesting side note is the really brief reviews from the Genome Biology submission inspired me to do this paper. I had time to conceive the study, get IRB approval, build a web game for peer review, recruit subjects, collect the data, analyze the data, write the paper, submit the paper to 3 journals and have it come out 6 months before the paper that inspired it was published!

A really interesting read on randomized controlled trials (RCTs) and public policy. The examples in the boxes are fantastic. This seems to be one of the cases where the public policy folks are borrowing ideas from Biostatistics, which has been involved in randomized controlled trials for a long time. It’s a cool example of adapting good ideas in one discipline to the specific challenges of another.

Roger points to this link in the NY Times about the “Consumer Genome”, which basically is a collection of information about your purchases and consumer history. On Twitter, Leonid K. asks: ‘Since when has “genome” become a generic term for “a bunch of information”?’. I completely understand the reaction against the “genome of x”, which is an over-used analogy. That said, I actually think the analogy isn’t that unreasonable here; like a genome, the information contained in your purchase/consumer history says something about you, but doesn’t tell the whole story. I wonder how this information could be used for public health, since it is already being used for advertising…

Elon Musk is one of my favorite entrepreneurs. He tackles what I consider to be some of the most awe-inspiring and important problems around. This article about the Tesla S got me all fired up about how a person with vision can literally change the fuel we run on. Nothing to do with statistics, other than I think now is a similarly revolutionary time for our discipline.

There was some interesting discussion on Twitter of the usefulness of the Yelp dataset I posted for academic research. Not sure if this ever got resolved, but I think more and more as data sets from companies/startups become available, the terms of use for these data will be critical.

My libertarian views are qualified because I do think things worked better in the 1950s and 60s, but it’s an interesting question as to what went wrong with DARPA. It’s not like it has been defunded, so why has DARPA been doing so much less for the economy than it did forty or fifty years ago? Parts of it have become politicized. You can’t just write checks to the thirty smartest scientists in the United States. Instead there are bureaucratic processes, and I think the politicization of science - where a lot of scientists have to write grant applications, be subject to peer review, and have to get all these people to buy in - all this has been toxic, because the skills that make a great scientist and the skills that make a great politician are radically different. There are very few people who are both great scientists and great politicians. So a conservative account of what happened with science in the 20th century is that we had a decentralized, non-governmental approach all the way through the 1930s and early 1940s. At that point, the government could accelerate and push things tremendously, but only at the price of politicizing it over a series of decades. Today we have a hundred times more scientists than we did in 1920, but their productivity per capita is less than it used to be.

Thiel has a history of making controversial comments, and I don’t always agree with him, but I think that his point about the politicization of the grant process is interesting.

Héctor Corrada Bravo is an assistant professor in the Department of Computer Science and the Center for Bioinformatics and Computational Biology at the University of Maryland, College Park. He moved to College Park after finishing his Ph.D. in computer science at the University of Wisconsin and a postdoc in biostatistics at the Johns Hopkins Bloomberg School of Public Health. He has done outstanding work at the intersection of molecular biology, computer science, and statistics. For more info check out his webpage.

Which term applies to you: statistician/data scientist/computer scientist/machine learner?

I want to understand interesting phenomena (in my case mostly in biology and medicine) and I believe that our ability to collect a large number of relevant measurements and infer characteristics of these phenomena can drive scientific discovery and commercial innovation in the near future. Perhaps that makes me a data scientist and means that depending on the task at hand one or more of the other terms apply.

A lot of the distinctions many people make between these terms are vacuous and unnecessary, but some are nonetheless useful to think about. For example, both statisticians and machine learners [sic] know how to create statistical algorithms that compute interesting and informative objects using measurements (perhaps) obtained through some stochastic or partially observed process. These objects could be genomic tools for cancer screening, or statistics that better reflect the relative impact of baseball players on team success.

Both fields also give us ways to evaluate and characterize these objects. However, there are times when these objects are tools that fulfill an immediately utilitarian purpose and thinking like an engineer might (as many people in Machine Learning do) is the right approach. Other times, these objects are there to help us get insights about our world and thinking in ways that many statisticians do is the right approach. You need both of these ways of thinking to do interesting science and dogmatically avoiding either of them is a terrible idea.

How did you get into statistics/data science (i.e. your history)?

I got interested in Artificial Intelligence at one point, and found that my mathematics background was nicely suited to work on this. Once I got into it, thinking about statistics and how to analyze and interpret data was natural and necessary. I started working with two wonderful advisors at Wisconsin, Raghu Ramakrishnan (CS) and Grace Wahba (Statistics) that helped shape the way I approach problems from different angles and with different goals. The last piece was discovering that computational biology is a fantastic setting in which to apply and devise these methods to answer really interesting questions.

What is the problem currently driving you?

I’ve been working on cancer epigenetics to find specific genomic measurements for which increased stochasticity appears to be general across multiple cancer types. Right now, I’m really wondering how far into the clinic can these discoveries be taken, if at all. For example, can we build tools that use these genomic measurements to improve cancer screening?

How do you see CS/statistics merging in the future?

I think that future got here some time ago, but is about to get much more interesting.

Here is one example: Computer Science is about creating and analyzing algorithms and building the systems that can implement them. Some of what many computer scientists have done looks at problems concerning how to keep, find and ship around information (Operating Systems, Networks, Databases, etc.). Many times these have been driven by very specific needs, e.g., commercial transactions in databases. In some ways, companies have moved from asking how do I use data to keep track of my activities to how do I use data to decide which activities to do and how to do them. Statistical tools should be used to answer these questions, and systems built by computer scientists have statistical algorithms at their core.

Beyond R, what are some really useful computational tools for statisticians to know about?

I think a computational tool that everyone can benefit a lot from understanding better is algorithm design and analysis. This doesn’t have to be at a particularly deep level, but just getting a sense of how long a particular process might take, and how to devise a different way of doing it that might make it more efficient is really useful. I’ve been toying with the idea of creating a CS course called (something like) “Highlights of continuous mathematics for computer science” that reminds everyone of the cool stuff that one learns in math now that we can appreciate its usefulness. Similarly, I think statistics students can benefit from “Highlights of discrete mathematics for statisticians”.
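As a small illustration of what "getting a sense of how long a process might take" can look like, here is a Python sketch (not something from the interview). It checks a list for duplicate values in two ways: comparing every pair, which takes on the order of n² steps, and remembering values in a set, which takes on the order of n steps. The function names and input sizes are made up for the example, and the timings will vary by machine.

```python
import random
import time

def has_duplicates_quadratic(values):
    """Compare every pair of entries: roughly n*(n-1)/2 comparisons."""
    n = len(values)
    for i in range(n):
        for j in range(i + 1, n):
            if values[i] == values[j]:
                return True
    return False

def has_duplicates_linear(values):
    """Remember values already seen in a set: about n set lookups."""
    seen = set()
    for v in values:
        if v in seen:
            return True
        seen.add(v)
    return False

def time_call(func, values):
    """Return the function's result and the elapsed wall-clock seconds."""
    start = time.perf_counter()
    result = func(values)
    return result, time.perf_counter() - start

for n in (500, 1_000, 2_000):
    # n distinct values, so neither function can stop early (the worst case)
    data = random.Random(0).sample(range(10 * n), n)
    for func in (has_duplicates_quadratic, has_duplicates_linear):
        result, seconds = time_call(func, data)
        print(f"n={n:>5} {func.__name__:<26} {seconds:.4f}s duplicates={result}")
```

Doubling n should roughly quadruple the time for the pairwise version and roughly double it for the set version. That is the kind of back-of-the-envelope reasoning that helps before running something on a genome-sized dataset.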

Now a request for comments below from you and readers: (5a) Beyond R, what are some really useful statistical tools for computer scientists to know about?

Review times in statistics journals are long, should statisticians move to conference papers?

I don’t think so. Long review times (anything more than 3 weeks) are really not necessary. We tend to publish in journals with fairly quick review times that produce (for the most part) really useful and insightful reviews.

I was recently talking to senior members in my field who were telling me stories about the “old times” when CS was moving from mainly publishing in journals to now mainly publishing in conferences. But now, people working in collaborative projects (like computational biology) work in fields that primarily publish in journals, so the field needs to be able to properly evaluate their impact and productivity. There is no perfect system.

For instance, review requests in fields where conferences are the main publication venue come in waves (dictated by conference schedule). Reviewers have a lot of papers to go over in a relatively short time which makes their job of providing really helpful and fair reviews not so easy. So, in that respect, the journal system can be better. The one thing that is universally true is that you don’t need long review times.

Under the open system, it was possible for authors to see who was reviewing their work. They found that under the open system authors and reviewers tended to cooperate by reviewing each other’s work. Interestingly, they say:

It was not immediately clear that cooperation between referees and authors would increase reviewing accuracy. Intuitively, one might expect that players who cooperate would always accept each others solutions - regardless of whether they were correct. However, we observed that when a submitter and reviewer acted cooperatively, reviewing accuracy actually increased by 11%.

I am a huge fan of open access journals. I think open access is good both for moral reasons (science should be freely available) and for more selfish ones (I want people to be able to read my work). If given the choice, I would publish all of my work in journals that distribute results freely.

But it turns out that for most open/free access systems, the publishing charges are paid by the scientists publishing in the journals. I did a quick scan and compiled this little table of how much it costs to publish a paper in different journals (here is a bigger table):

PLoS One: $1,350.00

PLoS Biology: $2,900.00

BMJ Open: $1,937.28

Bioinformatics (Open Access Option): $3,000.00

Genome Biology (Open Access Option): $2,500.00

Biostatistics (Open Access Option): $3,000.00

The first thing I noticed is that it costs a minimum of about $1,500 to get a paper published open access. That may not seem like a lot of money, and most journals offer discounts to people who can’t pay. But it still adds up: this last year my group has published 7 papers. If I paid for all of them to be published open access, that would be at minimum $10,500! That is half the salary of a graduate student researcher for an entire year. For a senior scientist that may be no problem, but for early career scientists, or scientists with limited access to resources, it is a big challenge.

Publishers that are solely dedicated to open access (PLoS, BMJ Open, etc.) seem to have on average lower publication charges than journals that only offer open access as an option. I think part of this is that the journals that aren’t open access in general have to make up some of the profits they lose by making the articles free. I certainly don’t begrudge the journals the costs. They have to maintain the websites, format the articles, and run the peer review process. That all costs money.
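The arithmetic behind these numbers is simple enough to check in a few lines of Python. This is just a sketch using the fees from the table above. The grouping into "open access only" and "open access option" follows the labels in that table, and the variable names are mine.

```python
# Fees in US dollars, copied from the table above.
fees = {
    "PLoS One": 1350.00,
    "PLoS Biology": 2900.00,
    "BMJ Open": 1937.28,
    "Bioinformatics": 3000.00,
    "Genome Biology": 2500.00,
    "Biostatistics": 3000.00,
}

# Grouping follows the table: journals marked "(Open Access Option)" versus the rest.
open_access_only = ["PLoS One", "PLoS Biology", "BMJ Open"]
open_access_option = ["Bioinformatics", "Genome Biology", "Biostatistics"]

def mean_fee(journals):
    """Average publication fee across the named journals."""
    return sum(fees[j] for j in journals) / len(journals)

papers_per_year = 7        # papers my group published this last year
rough_minimum_fee = 1500   # the "about $1,500" floor used in the text

annual_cost = papers_per_year * rough_minimum_fee
print(f"cheapest fee in the table: ${min(fees.values()):,.2f}")
print(f"rough annual cost for {papers_per_year} papers: ${annual_cost:,}")
print(f"mean fee, open-access-only journals: ${mean_fee(open_access_only):,.2f}")
print(f"mean fee, open-access-option journals: ${mean_fee(open_access_option):,.2f}")
```

With only three journals in each group this is a rough comparison rather than evidence of a general pattern, but it is consistent with the impression that the dedicated open access publishers charge less on average.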

A modest proposal

What I wonder is whether there is a better place for that money to come from. Here is one proposal (hat tip to Rafa): academic and other libraries pay a ton of money for subscriptions to journals like Nature and Science. They also are required to pay for journals in a large range of disciplines. What if, instead of investing this money in subscriptions for their university, academic libraries pitched in and subsidized the publication costs of open/free access?

If all university libraries pitched in, the cost for any individual library would be relatively small. It would probably be less than paying for subscriptions to hundreds of journals. At the same time, it would be an investment that would benefit not only the researchers at their school, but also the broader scientific community by keeping research open. Then neither the people publishing the work, nor the people reading it would be on the hook for the bill.

This approach is the route taken by ArXiv, a free database of unpublished papers. These papers haven’t been peer reviewed, so they don’t always carry the same weight as papers published in peer-reviewed journals. But there are a lot of really good and important papers in the database - it is an almost universally accepted pre-print server.

The other nice thing about ArXiv is that you don’t pay for article processing; the papers are published as is. The papers don’t look quite as pretty as they do in Nature/Science or even PLoS, but it is also much cheaper. The only costs associated with making this a full-fledged peer-reviewed journal would be refereeing (which scientists do for free anyway) and editorial responsibilities (again mostly volunteered by scientists).

All statisticians in academia are constantly confronted with the question of where to publish their papers. Sometimes it’s obvious: A theoretical paper might go to the Annals of Statistics or JASA Theory & Methods or Biometrika. A more “methods-y” paper might go to JASA or JRSS-B or Biometrics or maybe even Biostatistics (where all three of us are or have been associate editors).

But where should the applied papers go? I think this is an increasingly large category of papers being produced by statisticians. These are papers that do not necessarily develop a brand new method or uncover any new theory, but apply statistical methods to an interesting dataset in a not-so-obvious way. Some papers might combine a set of existing methods that have never been combined before in order to solve an important scientific problem.

Well, there are some official applied statistics journals: JASA Applications & Case Studies or JRSS-C or Annals of Applied Statistics. At least they have the word “application” or “applied” in their title. But the question we should be asking is if a paper is published in one of those journals, will it reach the right audience?

What is the audience for an applied stat paper? Perhaps it depends on the subject matter. If the application is biology, then maybe biologists. If it’s an air pollution and health application, maybe environmental epidemiologists. My point is that the key audience is probably not a bunch of other statisticians.

The fundamental conundrum of applied stat papers comes down to this question: If your application of statistical methods is truly addressing an important scientific question, then shouldn’t the scientists in the relevant field want to hear about it? If the answer is yes, then we have two options: Force other scientists to read our applied stat journals, or publish our papers in their journals. There doesn’t seem to be much momentum for the former, but the latter is already being done rather frequently.

Across a variety of fields we see statisticians making direct contributions to science by publishing in non-statistics journals. Some examples are this recent paper in Nature Genetics or a paper I published a few years ago in the Journal of the American Medical Association. I think there are two key features that these papers (and many others like them) have in common:

There was an important scientific question addressed. The first paper investigates variability of methylated regions of the genome and its relation to cancer tissue and the second paper addresses the problem of whether ambient coarse particles have an acute health effect. In both cases, scientists in the respective substantive areas were interested in the problem and so it was natural to publish the “answer” in their journals.

The problem was well-suited to be addressed by statisticians. Both papers involved large and complex datasets for which training in data analysis and statistics was important. In the analysis of coarse particles and hospitalizations, we used a national database of air pollution concentrations and obtained health status data from Medicare. Linking these two databases together and conducting the analysis required enormous computational effort and statistical sophistication. While I doubt we were the only people who could have done that analysis, we were very well-positioned to do so.

So when statisticians are confronted by scientific problems that are both (1) important and (2) well-suited for statisticians, what should we do? My feeling is we should skip the applied statistics journals and bring the message straight to the people who want/need to hear it.

There are two problems that come to mind immediately. First, sometimes the paper ends up being so statistically technical that a scientific journal won’t accept it. Second, in academia, there is the sticky problem of how you get promoted in a statistics department when your CV is filled with papers in non-statistics journals. This entry is already long enough so I’ll address these issues in a future post.