False Positives

Some readers may have noticed a Dutch scandal in the academic psychology industry. See here (h/t Pielke Jr).

The previously undisclosed whistleblower is said to be Uri Simonsohn, co-author of the article: “False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant.” The authors set out the following sensible solution to the problem of false positive publications:

Table 2. Simple Solution to the Problem of False-Positive Publications

Requirements for authors
1. Authors must decide the rule for terminating data collection before data collection begins and report this rule in the article.
2. Authors must collect at least 20 observations per cell or else provide a compelling cost-of-data-collection justification.
3. Authors must list all variables collected in a study.
4. Authors must report all experimental conditions, including failed manipulations.
5. If observations are eliminated, authors must also report what the statistical results are if those observations are included.
6. If an analysis includes a covariate, authors must report the statistical results of the analysis without the covariate.

Guidelines for reviewers
1. Reviewers should ensure that authors follow the requirements.
2. Reviewers should be more tolerant of imperfections in results.
3. Reviewers should require authors to demonstrate that their results do not hinge on arbitrary analytic decisions.
4. If justifications of data collection or analysis are not compelling, reviewers should require the authors to conduct an exact replication.

If these rules were applied by real_climate_scientists, most of the criticisms at Climate Audit would be eliminated.

However, there are no signs that real_climate_scientists have any intention of adopting these rules, as evidenced by Gavin Schmidt’s bilious outrage at the idea that Briffa should have reported the Yamal-Urals regional chronology considered and discarded in favor of the known HS of the small Yamal chronology.

The language of false positives was also used by the Texas sharpshooters, Wahl and Ammann, in connection with the failed verification statistics from MBH98.

I have the impression that in some areas of endeavor, not being required to provide full documentation is a perk of seniority and prestige in one’s field. It protects you from an “audit” that could be embarrassing. So when someone outside the field pushes for the data, it is perceived as an insult along the lines of “doesn’t he know who I am?” This might explain how someone like Boulton can first participate in the proverbial circling of the wagons to protect the perks of prestige (my view of Muir Russell and the other sham investigations), then author an opinion that basically advocates the exact opposite.

I’ve long thought that a lot of problems would be solved if researchers were required to publicly state their experimental design and analysis plans BEFORE they did their research. This would prevent all the ad hoc shuffling that goes on with data, trying to generate something that’s publishable.

Slightly related:
“The Wellcome Trust plans to withhold a portion of grant money from scientists who do not make the results of their work freely available to the public, in a move that will embolden supporters of the growing open access movement in science. In addition, any research papers that are not freely available will not be counted as part of a scientist’s track record when Wellcome assesses any future applications for research funding. … ”

How would they know? Most real problems the reviewer wouldn’t see unless there are hints in the text that something is missing, and any such hint would likely be buried in supplementary material that probably wouldn’t get reviewed even if it were available at the time.

> 2. Reviewers should be more tolerant of imperfections in results.

I agree with this. As Feynman said, a non-result is as worth publishing as an actual result.

Unfortunately journal editors disagree, as if scientific journals would disappear if they published plain facts rather than the incremental but dodgy positives that are more sensational.

> 3. Reviewers should require authors to demonstrate that their results do not hinge on arbitrary analytic decisions.

Again, probably impossible given the “equivocation”, “smoke and mirrors” and “move the pea” type language that a lot of papers use. You really need a lawyer (as well as a statistician and computer professional) to review the paper and read between the lines, lol

> 4. If justifications of data collection or analysis are not compelling, reviewers should require the authors to conduct an exact replication.

The case mentioned above regarding Dirk Smeesters cherry-picking data series is discussed here with contributions from some co-authors. Also mentioned is the case of Diederik Stapel, who went a step further: he invented convenient data.

One comment from the link provided above: “What I know, however, is that the publication pressure in social psychology is immense. The worst thing is that journal editors favour “sexy results” rather than sound methodology.”
To illustrate this: the Smeesters case was brought to light by a scientific integrity committee of the Erasmus university, not by the reviewers or editors of his publications.

Another comment from the above blog site retractionwatch:
“Today, Dutch web news sites report the findings of at least 2 papers could not be supported. In total, there are 3 suspect papers out of nearly 30 papers. Two were retracted, one was not yet published.
What does not help: it seems all the other papers, results, finds, data – whatever was of importance to his research… can no longer be found. Said professor claims this is because his pc crashed.”

Here is a link to the unedited report from the Erasmus university in Dutch. It mentions that they got into action after Uri Simonsohn contacted one of their professors. It also says that all raw data for all papers (electronic or on paper) are missing, and they condemn this in their conclusions.
The culprit claims that leaving out data to reach significance is common practice in his field, and that he therefore does not feel guilty about it.


————————–

In a world where archiving data safely has never been easier or cheaper, it is amazing how often we hear this ‘the dog ate my data’ excuse. IMO, it should result in instant nullification of any paper(s) that claim to be based on it.

I found this point less obvious though very sensible, “Reviewers should be more tolerant of imperfections in results”. In other words don’t promote definitive media-headline friendly conclusions e.g. “Warming since 1950 unprecedented”.

Am I right in thinking that many scientific papers would not be published without meeting a 95% significance test?

If so, and my results arrived at less than this, then I would not hesitate to fudge, adjust, cherry-pick, or use alternative or new statistical techniques until I get to the magic 95%. This makes everything good: my publication record, department, the journal, and, if I’m lucky, politicians. That is human nature.

By dropping the rule of thumb completely we would get more honest results which is what we all (should) care about. Many of the most interesting experiments are when the results do not meet expectations and sometimes “I’m not sure” is the best answer.
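The inflation the commenters describe is easy to demonstrate. The sketch below (my illustration, standard library only, hypothetical peeking schedule) simulates null studies in which the analyst checks for significance every ten observations and stops as soon as p < .05 — one of the researcher degrees of freedom Simmons et al. warn about in their requirement 1. The nominal 5% false-positive rate roughly doubles or triples.

```python
import math
import random

random.seed(1)

def peeking_experiment(max_n=100, peeks=range(20, 101, 10)):
    """Simulate one null study (true effect = 0, known sd = 1) where the
    analyst tests at several interim sample sizes and stops as soon as
    the z test is 'significant' (|z| > 1.96)."""
    data = [random.gauss(0, 1) for _ in range(max_n)]
    for n in peeks:
        z = (sum(data[:n]) / n) * math.sqrt(n)
        if abs(z) > 1.96:
            return True  # a "significant" result found by peeking
    return False

trials = 10_000
false_positives = sum(peeking_experiment() for _ in range(trials))
rate = false_positives / trials
print(f"false-positive rate with optional stopping: {rate:.3f}")
# Nominal alpha is .05, but repeated peeking inflates it well above that.
```

Deciding the stopping rule in advance (requirement 1 of Table 2) removes exactly this degree of freedom.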

These “rules” illustrate beautifully the issues with frequentist statistics William Briggs points out on his blog (point 1 especially). He may well have addressed this paper on his site, but I will drop him a suggestion to comment, or re-visit if he has done so already.

Yet closer examination showed that the trouble ran deeper. Science’s internal controls on bias were failing, and bias and error were trending in the same direction — towards the pervasive over-selection and over-reporting of false positive results.
…
How can we explain such pervasive bias? Like a magnetic field that pulls iron filings into alignment, a powerful cultural belief is aligning multiple sources of scientific bias in the same direction. The belief is that progress in science means the continual production of positive findings. All involved benefit from positive results, and from the appearance of progress. Scientists are rewarded both intellectually and professionally, science administrators are empowered and the public desire for a better world is answered. The lack of incentives to report negative results, replicate experiments or recognize inconsistencies, ambiguities and uncertainties is widely appreciated — but the necessary cultural change is incredibly difficult to achieve.

This is a big problem in many fields. Many people just will not publish results that show minimal correlation of experimental results with their hypothesis; I cannot be wrong! But shining the light on failed hypotheses is instrumental to building sound science. Those who will only publish the “my ideas are always right” papers are setting back progress in their field (as their failed ideas need multiple refutation papers before the BS is identified). But, my experience is that it is very infrequent that these “perfectionists” pay much of a penalty for publishing lousy papers; they just move on to another neighborhood in their field as they can declare their old stomping grounds as “settled”.

I do blame the reviewers somewhat but often they don’t have a significant interest in the veracity of a paper; what harm does a little cherry picking do? And besides, experts in a particular field usually know who the BSers are and they are often tolerated as harmless, almost comical characters.

What makes the climate field fascinating is that the imperfections are so glaring, the mafia supporting these marginal works are so cocky, and the societal implications of a changing climate are so massive, and yet so few in the field do anything to point out these quite obvious shortcomings. Quite the soap opera…..

The tone of comments above suggests that many readers consider the false positive problem to be widespread. One might even think that no science is conducted without this complication causing unmanageable problems.

Many times now, I have used the example of drilling out a mineral deposit to determine if it holds economic grades. There is a whole set of lessons beginning with the math of surveying, to the scientific design of chemical analysis procedures and continuing through to vexed questions like how many drill holes is enough; and how to treat the nugget effect.

Whereas Simmons et al. caution against stopping or continuing to gather data on the basis of interim results, there is no practical option when drilling very expensive holes. Indeed, a stage is approached where more drill holes change the grade estimate hardly at all, but do improve confidence that the grade is correct.

This points to a different modus operandi in different branches of science. To determine the grade of an ore deposit, it is usually pointless to cheat, fabricate or use poor experimental design, because you will be found out quite promptly in the vast majority of cases. However, in softer sciences like psychology, the experimenter knows that the main answers will be stated in subjective language by the test candidate, whereas in mining the assay is rather less prone to equivocation.

As usual, I close with a plea not to let the horrible experiences of climate workers give readers the false impression that all science has gone sour. It has not. I loved that NASA letter comment “Can you imagine one of your predecessors, Dr. Thomas Paine, declaring, “Our Apollo 11 Lunar Lander’s target is the Sea of Tranquility, but we may make final descent within a range that includes Crater Clavius”?”

Imagine a country where there is only one party. … we don’t have to imagine, because it happens all the time, and we all know the media and bureaucracy become corrupted to “toe the line” and you get complete nonsense.

The maddening thing for such a country is that the very thing which they so hate, opposition, is the thing that ensures that other countries have a thorough and robust debate, which prevents them going into the cul-de-sac group-think of single-party states.

But then look at all those failed one-party states. They had rules and regulations; they had bureaucracy, auditing and police states. And did any of this stop them going down the group-think path to ruin and becoming backward countries?

Yes, rules are useful, but rules without an “official opposition”, without public money being made available to create a group whose job it is to scrutinise the “government” is almost useless.

The problem with modern science is not that it doesn’t have enough rules … it is that it no longer tolerates dissent … particularly in areas like climatology. (It doesn’t count as science.)

One of the things I find is that contradictions do not bother lots of people. They prefer points of view or schools of thought as a framework and are not really interested in experiments that contradict it. The response to a negative experiment/analysis (ie, finding no effect) is to simply claim you aren’t a very good scientist. And of course being sloppy will give a negative result, so they could be right. To conduct a truly powerful refutation of a school of thought is very difficult and may still be ignored. Freudian analysis was contradicted for 50 yrs by lots of studies, but only gradually declined (sorry, not dead, just rarer than before).

Being a quality manager in a high tech company, much of this sounds to me like basic knowledge in quality management:
1. Requirements for products and test should be decided on and documented up front.
2. Sample sizes for any data collection need to be determined up front and must be large enough to yield statistically significant results.
3. Parameters to be measured are also determined up front, e.g. through a process control plan.
4. All measurement results are to be documented (raw data) and reported (summary data).
5. So-called “outliers” are to be highlighted. A justification is to be provided if any outlier is removed from the data set. Ideally, the impact of such removal on the results is shown.
6. Any analysis method, if not standard, is to be described.
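Point 2 above — sample sizes determined up front — connects to Simmons et al.’s requirement of at least 20 observations per cell, and can be made concrete with the standard normal-approximation sample-size formula. This is a sketch: the function name and the two-sample z-approximation are my choices, not anything from the post.

```python
import math
from statistics import NormalDist  # Python >= 3.8

def n_per_cell(d, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided, two-sample
    test, given standardized effect size d (Cohen's d), using the
    normal-approximation formula n = 2 * ((z_{a/2} + z_b) / d)^2."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for alpha = .05
    z_beta = NormalDist().inv_cdf(power)           # 0.84 for 80% power
    return math.ceil(2 * ((z_alpha + z_beta) / d) ** 2)

print(n_per_cell(0.5))  # → 63 per group for a "medium" effect
print(n_per_cell(0.8))  # → 25 per group even for a "large" effect
```

Note that 20 observations per cell — the floor in Simmons et al.’s requirement 2 — delivers 80% power only for effects around d ≈ 0.9, i.e. very large ones; for anything smaller it is an underpowered design.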

In a business setting, I find that the concept of risk assessment plays a more dominant role than in an academic environment. This may be recognized by readers who work in a business.

I remain surprised that what is considered to be standard rigor in business and engineering is the subject of current literature in academia and climate science.

Nature published an interview with Uri Simonsohn called The data detective. Some quotes: “I was working on another project on false positives and had become pretty good at picking up on the tricks that people pull to get a positive result. With the Smeesters paper, I couldn’t find any red flags, but there were really far-fetched predictions. The basic idea is to see if the data are too close to the theoretical prediction, or if multiple estimates are too similar to each other. I looked at several papers by Smeesters and asked him for the raw data, which he sent. I did some additional analyses on those and the results looked less likely.”
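The “too similar to each other” idea can be sketched with a small Monte Carlo. This is my illustration of the concept in the quote, not Simonsohn’s actual procedure, and all numbers are hypothetical: given reported condition means, ask how often honest sampling would produce means at least that tightly clustered.

```python
import random
import statistics

random.seed(7)

def similarity_pvalue(reported_means, sd, n, sims=5000):
    """How often would honest sampling produce condition means at least
    as tightly clustered as the reported ones? A very small probability
    is a red flag. Assumes equal true means (the case most favorable to
    the reported similarity) and a common, known SD."""
    reported_spread = statistics.pstdev(reported_means)
    grand_mean = statistics.mean(reported_means)
    hits = 0
    for _ in range(sims):
        sim_means = [statistics.mean(random.gauss(grand_mean, sd)
                                     for _ in range(n))
                     for _ in reported_means]
        if statistics.pstdev(sim_means) <= reported_spread:
            hits += 1
    return hits / sims

# Three condition means agreeing to the second decimal, despite
# sd = 1.0 and only n = 15 per cell (hypothetical numbers):
p = similarity_pvalue([5.01, 5.02, 5.01], sd=1.0, n=15)
print(f"probability of means this similar by chance: {p:.4f}")
```

With a standard error of about 0.26 per mean, three honest means almost never land within 0.01 of each other, so the simulated probability comes out near zero — the kind of result that prompts a request for the raw data.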

“Journals should be embarrassed when they publish fake data, but there’s no stigma. They’re portrayed as the victims, but they’re more like the facilitators”

“Simply that it is wrong to look the other way. If there’s a tool to detect fake data, I’d like people to know about it so we can take findings that aren’t true out of our journals. And if it becomes clear that fabrication is not an unusual event, it will be easier for journals to require authors to publish all their raw data. It’s extremely hard for fabrication to go undetected if people can look at your data.”