Psychology is evolving faster than ever. For decades now, many areas in psychology have relied on what academics call “questionable research practices” – a comfortable euphemism for types of malpractice that distort science but which fall short of the blackest of frauds, fabricating data.

But now a new generation of psychologists is fed up with this game. Questionable research practices aren’t just being seen as questionable – they are being increasingly recognised for what they are: soft fraud. In fact, “soft” may be an understatement. What would your neighbours say if you told them you got published in a prestigious academic journal because you cherry-picked your results to tell a neat story? How would they feel if you admitted that you refused to share your data with other researchers out of fear they might use it to undermine your conclusions? Would your neighbours still see you as an honest scientist – a person whose research and salary deserve to be funded by their taxes?

For the first time in history, we are seeing a co-ordinated effort to make psychology more robust, repeatable, and transparent.

“Soft fraud”? (Is this like “white collar” fraud?) Is it possible that holding social psych up as a genuine replicable science is, ironically, creating soft frauds too readily?

Or would it be all to the good if the result is to so label large portions of the (non-trivial) results of social psychology?

The sentiment in the Guardian article is that the replication program in psych is just doing what is taken for granted in other sciences; it shows psych is maturing, it’s getting better and better all the time …so long as the replication movement continues. Yes? [0]

^^^^^^^^

It’s hard to entirely dismiss the concerns of the pushback, dubbed in some quarters as “Repligate”. Even in this contrarian mode, you might sympathize with “those who fear that psychology’s growing replication movement, which aims to challenge what some critics see as a tsunami of suspicious science, is more destructive than corrective” (e.g., Professor Wilson, at U Va) while at the same time rejecting their dismissal of the seriousness of the problem of false positives in psych. The problem is serious, but there may be built-in obstacles to fixing things by the current route. From the Chronicle:

Still, Mr. Wilson was polite. Daniel Gilbert, less so. Mr. Gilbert, a professor of psychology at Harvard University, … wrote that certain so-called replicators are “shameless little bullies” and “second stringers” who engage in tactics “out of Senator Joe McCarthy’s playbook” (he later took back the word “little,” writing that he didn’t know the size of the researchers involved).

Wow. Let’s read a bit more:

Scrutiny From the Replicators

What got Mr. Gilbert so incensed was the treatment of Simone Schnall, a senior lecturer at the University of Cambridge, whose 2008 paper on cleanliness and morality was selected for replication in a special issue of the journal Social Psychology.

….In one experiment, Ms. Schnall had 40 undergraduates unscramble some words. One group unscrambled words that suggested cleanliness (pure, immaculate, pristine), while the other group unscrambled neutral words. They were then presented with a number of moral dilemmas, like whether it’s cool to eat your dog after it gets run over by a car. Ms. Schnall wanted to discover whether prompting—or priming, in psych parlance—people with the concept of cleanliness would make them less judgmental…..These studies fit into a relatively new field known as embodied cognition, which examines how one’s environment and body affect one’s feelings and thoughts. …

For instance, political extremists might literally be less capable of discerning shades of grey than political moderates—or so Matt Motyl thought until his results disappeared. Now he works actively in the replication movement.[1]

Aside: Nosek, Spies and Motyl wrote an interesting article: “Scientific Utopia: II. Restructuring Incentives and Practices to Promote Truth Over Publishability.” From a quick read: I agree with their goal of promoting “truth over publishability” and some of their strategies might well help, if followed. My main gripe is that they felt the need for footnote 1 to soften their notion of truth: “We endorse a perspectivist approach…––the idea that all claims may be true given the appropriate conditions…” Well, if a statement S is not a self-contradiction, then it has a model, and so conditions under which it comes out true, but that’s not at all helpful in an article urging truth over publishability. The rest of note 1 gets squishier, their galoshes sinking further into murky swamplands. I could have helped if they’d asked! There’s no need to backtrack on “truth”, especially if it’s already in your title.

7/1: By the way, since Schnall’s research was testing “embodied cognition” why wouldn’t they have subjects involved in actual cleansing activities rather than have them unscramble words about cleanliness?

^^^^^^^^^^

Another irony enters: some of the people working on the replication project in social psych are the same people who hypothesize that a large part of the blame for lack of replication may be traced to the reward structure, to incentives to publish surprising and sexy studies, and to an overly flexible methodology opening the door to promiscuous QRPs (you know: Questionable Research Practices.) Call this the “rewards and flexibility” hypothesis. If the rewards/flex hypothesis is correct, as is quite plausible, then wouldn’t it follow that the same incentives are operative in the new psych replication movement? [2]

A skeptic of the movement in psychology could well ask how a replication can be judged sounder than the original study. When RCTs fail to replicate observational studies, the presumption is that the RCTs would have found the effect, were it genuine. That’s why such a failure is taken as an indictment of the observational study. But here, one could argue, a replication is just another study, not obviously one that corrects the earlier. The question some have asked, “Who will replicate the replicators?”, is not entirely without merit. Triangulation for purposes of correction, I say, is what’s really needed. [3]

Daniel Kahneman, who first called for the “daisy chain” (after the Stapel scandal), likely hadn’t anticipated the tsunami he was about to unleash.[4]

Daniel Kahneman, a Nobel Prize winner who has tried to serve as a sort of a peace broker, recently offered some rules of the road for replications, including keeping a record of the correspondence between the original researcher and the replicator, as was done in the Schnall case. Mr. Kahneman argues that such a procedure is important because there is “a lot of passion and a lot of ego in scientists’ lives, reputations matter, and feelings are easily bruised.”

That’s undoubtedly true, and taking glee in someone else’s apparent misstep is unseemly. Yet no amount of politeness is going to soften the revelation that a published, publicized finding is bogus. Feelings may very well get bruised, reputations tarnished, careers trashed. That’s a shame, but while being nice is important, so is being right.

Is the replication movement getting psych closer to “being right”? That is the question. What if inferences from priming studies and “embodied cognition” really are questionable? What if the hypothesized effects are incapable of being turned into replicable science?

^^^^^^^^^

The sentiment voiced in the Guardian bristles at the thought; there is pushback even to Kahneman’s apparently civil “rules of the road”:

For many psychologists, the reputational damage [from a failed replication]… is grave – so grave that they believe we should limit the freedom of researchers to pursue replications. In a recent open letter, Nobel laureate Daniel Kahneman called for a new rule in which replication attempts should be “prohibited” unless the researchers conducting the replication consult beforehand with the authors of the original work. Kahneman says, “Authors, whose work and reputation are at stake, should have the right to participate as advisers in the replication of their research.” Why? Because method sections published by psychology journals are generally too vague to provide a recipe that can be repeated by others. Kahneman argues that successfully reproducing original effects could depend on seemingly irrelevant factors – hidden secrets that only the original authors would know. “For example, experimental instructions are commonly paraphrased in the methods section, although their wording and even the font in which they are printed are known to be significant.”

“Hidden secrets”? This was a remark sure to enrage those who take psych measurements as (at least potentially) akin to measuring the Hubble constant:

If this doesn’t sound very scientific to you, you’re not alone. For many psychologists, Kahneman’s cure is worse than the disease. Dr Andrew Wilson from Leeds Metropolitan University points out that if the problem with replication in psychology is vague method sections then the logical solution – not surprisingly – is to publish detailed method sections. In a lively response to Kahneman, Wilson rejects the suggestion of new regulations: “If you can’t stand the replication heat, get out of the empirical kitchen because publishing your work means you think it’s ready for prime time, and if other people can’t make it work based on your published methods then that’s your problem and not theirs.”

Prime time for priming research in social psych?

Read the rest of the Guardian article. Second installment later on…maybe….

What do readers think?
^^^^^^^^^^^^^^

2nd Installment 7/1/14

Naturally the issues that interest me the most are statistical-methodological. Some of the methodology and meta-methodology of the replication effort is apparently being developed hand-in-hand with the effort itself—that makes it all the more interesting, while also potentially risky.

The replicationist’s question of methodology, as I understand it, is alleged to be what we might call “purely statistical”. It is not: would the initial positive results warrant the psychological hypothesis, were the statistics unproblematic? The presumption from the start was that the answer to this question is yes. In the case of the controversial Schnall study, the question wasn’t: can the hypotheses about cleanliness and morality be well-tested or well probed by finding statistical associations between unscrambling cleanliness words and “being less judgmental” about things like eating your dog if he’s run over? At least not directly. In other words, the statistical-substantive link was not at issue. The question is limited to: do we get the statistically significant effect in a replication of the initial study, presumably one with high power to detect the effects at issue? So, for the moment, I too will retain that as the sole issue around which the replication attempts revolve.
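To give a rough feel for what “high power to detect the effects at issue” demands, here is a back-of-the-envelope sketch. It uses a simple two-sided two-sample z-approximation and a made-up standardized effect size (d = 0.6); these are illustrative assumptions, not the actual Schnall numbers:

```python
import math

def normal_cdf(x: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def power_two_sample(d: float, n_per_group: int) -> float:
    """Approximate power of a two-sided two-sample z-test (alpha = 0.05)
    against a standardized effect size d, with n subjects per group."""
    z_crit = 1.959964  # two-sided 0.05 critical value
    return normal_cdf(d * math.sqrt(n_per_group / 2.0) - z_crit)

def n_for_power(d: float, target: float = 0.95) -> int:
    """Smallest per-group n giving at least the target power."""
    n = 2
    while power_two_sample(d, n) < target:
        n += 1
    return n

# The original study split 40 undergraduates into two groups of 20.
# On these assumptions, that design has power under one-half even
# against a fairly large effect:
print(power_two_sample(0.6, 20))
# A replicator aiming for 0.95 power against the same effect needs
# considerably more subjects per group:
print(n_for_power(0.6))
```

The point of the sketch is only that a replication billed as “high powered” is typically a much larger study than the original, which is part of why a failure in it is presumed to carry more weight.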

Checking statistical assumptions is, of course, a part of the pure statistics question, since the P-value and other measures depend on assumptions being met at least approximately.

The replication team assigned to Schnall (U of Cambridge) reported results apparently inconsistent with the positive ones she had obtained. Schnall shares her experiences in “Further Thoughts on Replications, Ceiling Effects and Bullying” and “The Replication Authors’ Rejoinder”: http://www.psychol.cam.ac.uk/cece/blog

The replication authors responded to my commentary in a rejoinder. It is entitled “Hunting for Artifacts: The Perils of Dismissing Inconsistent Replication Results.” In it, they accuse me of “criticizing after the results are known,” or CARKing, as Nosek and Lakens (2014) call it in their editorial. In the interest of “increasing the credibility of published results” interpretation of data evidently needs to be discouraged at all costs, which is why the special issue editors decided to omit any independent peer review of the results of all replication papers. (Schnall)

Perhaps her criticisms are off the mark, and in no way discount the failed replication (I haven’t read them), but CARKing? Data and model checking are intended to take place post-data. So the post-data aspect of a critique scarcely renders it illicit. The statistical fraud-busting of a Smeesters or a Jens Forster was based on post-data criticisms. So it would be ironic if, in the midst of defending efforts to promote scientific credentials, they inadvertently labeled post-data criticisms as questionable.

Simonsohn does not reject out of hand Schnall’s allegation that the lack of replication can be explained away (e.g., by a “ceiling effect”). (In fact, he has elsewhere discussed a case that was rightfully absolved thereby [6].) Simonsohn provides statistical grounds for denying that a ceiling effect is to blame in Schnall’s case. However, he also agrees with Schnall in discounting the replicators’ reaction to the ceiling-effect charge: simply lopping off the most extreme results.

In their rejoinder (.pdf), the replicators counter by dropping all observations at the ceiling and showing the results are still not significant.
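For readers unfamiliar with the ceiling-effect charge, a minimal simulation shows the worry. The numbers here are invented for illustration (a true one-point group difference near the top of a 7-point scale), not Schnall’s or the replicators’ data:

```python
import random
import statistics

random.seed(1)  # fixed seed so the illustration is reproducible

def sample_ratings(mu, sigma, n, ceiling):
    """Draw latent judgments and censor them at the scale's top point."""
    latent = [random.gauss(mu, sigma) for _ in range(n)]
    observed = [min(x, ceiling) for x in latent]
    return latent, observed

# Two groups whose latent means differ by a full point, both near the
# ceiling of a hypothetical 7-point judgment scale.
lat_a, obs_a = sample_ratings(5.5, 1.5, 200, ceiling=7.0)
lat_b, obs_b = sample_ratings(6.5, 1.5, 200, ceiling=7.0)

true_diff = statistics.mean(lat_b) - statistics.mean(lat_a)
seen_diff = statistics.mean(obs_b) - statistics.mean(obs_a)
# Censoring clips the higher group more, compressing the observed difference:
print(true_diff, seen_diff)
```

Censoring at the ceiling shrinks the observed difference and so can mask a real effect. Notice too why the rejoinder’s remedy is itself contested: the observations at the ceiling come disproportionately from one group, so simply dropping them leaves a trimmed sample that is no longer comparable, which is part of why Simonsohn and Schnall both discount that move.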

What follows from this? What follows is that the analysis of the evidential import of failed replications in this field is an unsettled business. Despite the best of intentions of the new replicationists, there are grounds for questioning whether the meta-methodology is ready for the heavy burden being placed on it. I’m not saying that facets of the necessary methodology aren’t out there, but the pieces haven’t been fully assembled ahead of time. Until they are, the basis for scrutinizing failed (and successful) replications will remain in flux.

^^^^^^^^^^

Final irony. If the replication researchers claim they haven’t caught on to any of the problems or paradoxes I have intimated for their enterprise, let me end with one more… No, I’ve saved it for installment 4.

^^^^^^^^^^

4th installment 7/5/14

Statistical significance testers in psychology (and other areas) often maintain there is no information, or no proper inference, to be obtained from statistically insignificant (negative) results. This, despite power analyst Jacob Cohen toiling amongst them for years. Maybe they’ve been misled by their own constructed animal, the so-called NHST (no need to look it up, if you don’t already know).

The irony is that much replication analysis turns on interpreting statistically insignificant results!
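To make the point concrete: a statistically insignificant result can be informative about which discrepancies are ruled out, provided the test had high capability of detecting them. Here is a small sketch of that reasoning with hypothetical numbers (an observed standardized effect of 0.10 with standard error 0.10):

```python
import math

def normal_cdf(x: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Hypothetical replication outcome: small observed effect, non-significant.
d_hat, se = 0.10, 0.10
z = d_hat / se
print(z < 1.96)  # True: not significant at the 0.05 level

def severity_for_bound(delta: float) -> float:
    """How well the insignificant result warrants 'the true effect is
    less than delta': the probability of a result this small or smaller
    were delta the true effect."""
    return normal_cdf((delta - d_hat) / se)

for delta in (0.1, 0.2, 0.3, 0.4):
    print(delta, severity_for_bound(delta))
```

On these assumptions the insignificant result says little about effects near 0.1 (probability 0.5), but rules out effects of 0.3 or more with high capability (probability around 0.98): the negative result is evidence, not a mere absence of evidence, once the test’s detection capability is taken into account.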

One of my first blogposts talks about interpreting negative results and I’ve been publishing on this for donkey’s years[7]. Here are some posts for your Saturday night reading:

[0] Unsurprisingly, replicationistas in psych are finding well-known results from experimental psych to be replicable. Interestingly, similar results are found in experimental economics, dubbed “experimental exhibits”. Expereconomists recognize that rival interpretations of the exhibits are still open to debate.

[1] In Nuzzo’s article: “For a brief moment in 2010, Matt Motyl was on the brink of scientific glory: he had discovered that extremists quite literally see the world in black and white”.

(Glory, I tell you!)

[2] Some of the results are now published in Social Psychology. Perhaps it was not such an exaggeration to suggest, in an earlier post, that “non-significant results are the new significant results”. At the time I didn’t know the details of the replication project; I was just reacting to graduate students presenting this as the basis for a philosophical position, when philosophers should have been performing a stringent methodological critique.

[3] By contrast, statistical fraudbusting and statistical forensics have some rigorous standards that are hard to evade, e.g., recently Jens Forster.

[4] In Kahneman’s initial call (Oct, 2012) “He suggested setting up a ‘daisy chain’ of replication, in which each lab would propose a priming study that another lab would attempt to replicate. Moreover, he wanted labs to select work they considered to be robust, and to have the lab that performed the original study help the replicating lab vet its procedure.”

[5] Simonsohn is always churning out the most intriguing and important statistical analyses in social psychology. The field needs more like him.


59 thoughts on “Some ironies in the ‘replication crisis’ in social psychology (4th and final installment)”

Some might be thinking, a pox on both their houses? The thing is, the replicationist’s question is not whether the initial positive results would warrant the psychological hypothesis, if only the statistics were unproblematic. They assume it would! And the statistics being unproblematic just boils down to being able to replicate the result. Thus, at issue is really just the “pure” statistics question of whether the effect can be found by means of a statistical replication of the initial study, presumably one with good power to detect the effects at issue. Even so, a mismatch just shouts “inconsistent” rather than falsifying or questioning the unreplicated effect. I forget if that was Lakatos…
Yet, in a sense, it’s touching that they have such faith in their statistics.
I’ll consider this in the second installment.

Some of the key players are saying on twitter that they don’t understand this post in the least! That’s worrisome. I (a) described the background of competing sides of an ongoing issue in their field, and (b) brought out some ironies in the current approach. One is that the “rewards and flexibility” hypothesis, if true, would have to apply to the new replicationists as well. A second was identified by the question: “Is it possible that holding social psych up as a genuine replicable science is, ironically, creating soft frauds too readily?” and so on. I don’t know if it’s a matter of not comprehending irony, or being disinclined to ponder reflectively the assumptions and corollaries of the research program. I admit to having mixed views at the moment. I hope to learn from thoughtful readers.

Thank you very much for your clarifications. Part of the reason some people on twitter, including me, honestly did not understand the point(s) of the original post may have been the use of suggestive questions, and your use of the concept of “irony”, which may or may not have been used to implicitly accuse the so-called “new replicationists” of double standards. I suspect many readers did not want to read too much into these suggestive questions and identifications of irony, because they weren’t sure how they were intended. So I’m going to stick my neck out and try to guess how they were intended in this reply. If I’m wrong, please just let me know and tell me how they were intended instead.

As for your point that the “rewards and flexibility” hypothesis would have to apply to the new replicationists as well, I have two comments.

First: What do you mean by “new replicationists”? Replication has always been an essential part of science, which includes psychology. Your formulation ‘new replicationists’ seems to suggest that a few rogue psychologists recently had a swell new idea, and started to replicate things. This suggestion, intended or not, is false: there have always been replication studies in psychology, successful ones and failed ones, and they have always been taken seriously in the field. What is new, in my view, is that a number of high-impact results have been shown to be at best “hard to replicate”, and that this has caused some psychologists to act as if replication attempts are somehow not polite (Kahneman), or even amount to “bullying” (Gilbert, Schnall). That’s the really new development regarding replication. And that’s not ironic, as far as I can tell, but it is surprising, and also misguided, as I also argued in http://osc.centerforopenscience.org/2014/05/28/train-wreck-prevention/

Second: the conscientious and principled psychologists who are now engaging in systematic, collaborative, pre-reviewed, pre-agreed replication attempts of high-impact findings in social psychology are doing so precisely in order to *reduce* the flexibility part of the “rewards and flexibility” equation. So the correct answer to your question, assuming it was not rhetorical, is: yes, this applies to the replicationists as well, and this is the very reason they are doing their replication studies, and it is why they are doing them so cautiously and carefully. If your point also was that the replicationists are reaping rewards from this work, then I’d agree with that point. But why wouldn’t they? Aren’t we all seeking rewards from our work? Aren’t the original authors of these high-impact findings doing so too?

Finally, I would like to ask a clarification question: Are you suggesting, by using the phrase “holding social psych up as a genuine replicable science” that social psychology is, in your view, not a genuine replicable science? If not, I sincerely apologize for having been tempted to interpret it that way. If that is indeed what you wanted to suggest, however, I think it would really be worth discussing this assumption explicitly in a separate thread.

JP de Ruiter: I appreciate your points, and yes a separate discussion is called for. However, a number of issues regarding statistical inference and significance tests (including those you mention on your interesting link) have been discussed a great deal on this blog, so you might search at some point.

Rather than viewing Kahneman as having “changed his mind” as you suggest in your link, consider that his initial call referred specifically to priming studies in which subtle cues are essential. The irony here is that Kahneman’s endorsing this area of research (despite the cases that made him speak of a “train wreck”), if taken seriously, could well entail the concern he is now raising about the subtle issues of valid replication. By contrast (to Kahneman’s position), holding these areas of research to an overly strict standard of replication can lead to illicit failures of replication (and even illicit suggestions of “soft fraud”).

I certainly didn’t allude to any rogue psychologists, I’m actually very impressed with the scope and seriousness of the program. Your point about trying to reduce flexibility is all to the good, and it should be tackled head-on. All the good intentions and resources, however, will not protect this effort from the possible self defeating logical consequences that I am on about. It would require a truly self-reflective standpoint to prevent that, and I don’t see it so far.
(And this is distinct from some unhappy frictions that might arise in the field.)

The methodological paradoxes revolving around this episode are fascinating in their own right, for a logician-philosopher. But only psych researchers can save the effort from an ironic degeneration. Still, as I granted, merely diminishing QRPs has some positive payoff, though not obviously equal to the effort being expended (and that was already being done in new journal guidelines). I repeat that I have mixed feelings, and am deliberately posting to get clearer about this.

Related statistical issues that I haven’t discussed in the first installment are also of great interest to me, e.g., what statistical properties should a replication attempt have in order to warrant various conclusions.

Some people forget Kahneman’s position all along: “It’s that he thinks the field has been, in some cases, unfairly maligned. A failure to replicate doesn’t mean, Kahneman said, that the original study was flawed. Priming research is often subtle, and those who replicate studies may not follow the same procedures—which is why he emphasizes communication between labs. ‘Some of these failures to replicate are not utterly convincing,’ Kahneman said.” The ‘daisy chain’ was to include checking replicators. http://chronicle.com/blogs/percolator/daniel-kahneman-sees-train-wreck-looming-for-social-psychology/31338

JP de Ruiter: the point isn’t whether researchers ought to be rewarded–it’s that the “reward and flexibility” hypothesis (put forward by leaders in the psych replication program) alleges that the reward structure + being an area with flexible methodology leads/encourages exploiting that very flexibility. It’s the irony again.

Thanks for your constructive reply to my comments. I would really be looking forward to a further post in which you elaborate on the ‘self defeating logical consequences’ you mentioned, as — and I want to point out explicitly here that this is not a rhetorical device or implicit critique — I still do not understand what you refer to. I studied philosophy of science as part of my cognitive science education, so I’m aware of some of the well documented and discussed limits of knowledge acquisition in the scientific enterprise, but here I’m just not following. I think it would be great if you could make your argument and the assumptions underlying it very explicit, so it could be appreciated by a large(r) audience.

replications in the social psychology issue are most of the time not “one shot replications”: they are all multiple-attempt replications. the fact that so many so-called “classic” social psychology results repeatedly fail to replicate is scary, telling and really problematic. it can only mean that people have been beefing up their data, collectively. if a one-shot replication attempt “fails”, then it is hard to tell whether we should believe the original result or the replication result. however, if several attempts at replication fail, then the only valid conclusion is that the original result is not robust and probably non-existent without excessive data tampering.

the comments of kahneman, gilbert, and wilson are ridiculous. what are they trying to prove? science can only prosper and grow if there is total transparency. it is a ridiculous idea to suggest that replicators should always contact the manufacturers of the studies they are trying to replicate. science should be impersonal and methods sections should be crystal clear. when a result of a study does not replicate, the original contributor has a problem, not the replicator. when the method section of a study is unclear because there are “hidden procedures” (really? jeesz!), then that study should not have been published in the first place. kahneman, wilson, and gilbert are too concerned about defending a science that is in need of a big change. this type of defensive reasoning is not the way forward. bullying people, irony, sarcasm, and elitist verbosity are not the way forward. what does it mean when several groups of scientists at multiple labs in different countries cannot replicate well-known, “classic” social psychology effects?

why defend a field that is (partly / mainly) built on questionable empirical practices? why not end it completely and start anew and afresh?

Pierre: Thanks for your comment.
It is interesting that you say: “if a one-shot-replication attempt “fails”, then it is hard to tell whether we should believe the original result or the replication-result. however, if several attempts at replication fail, then the only valid conclusion is that the original result is not robust and probably non-existent without excessive data tampering.” That makes sense. But in the sciences I am familiar with, a result does not become ‘high impact’ unless it’s been replicated, both directly and through interconnected checks, ideally with a growing body of theory. If the “several attempts” all allude to these replication studies, then it’s not obvious that they have greater weight than the positive results that rendered them high impact—especially if they do not show the statistical assumptions underlying their studies are met.

As I read the literature, the bullying accusations arose because Schnall’s criticisms of the assumptions of the data in the replication attempt were dismissed, and she was not given the opportunity to publish her specific objections. I gather that Kahneman is recognizing that priming effects are sufficiently subtle that legitimate questions can arise as to whether the “treatment” was actually applied. For instance, I read that Schnall objected to on-line applications of “cleanliness” because you wouldn’t know if the subjects were sitting at a messy desk or whatever.

agree, but i am not sure that all (or most) of social psychology’s high impact studies are as well and as successfully replicated as you suggest. on the contrary, my impression and experience is that many (or all) of social psychology’s classic findings are very hard to replicate. perhaps this means that all (or most) of social psychology’s findings are “subtle” (as you paraphrase kahneman), but what does that mean? that you can only replicate the finding when you do exactly what they did in the original study? that conceptual, “irrelevant” changes immediately wash the effect away? that the effect is much more specific than the paper in which it is published suggests? that is terrible! if findings can only be replicated when you do exactly what was in the original paper (+ the “hidden procedures” that you need to find out by contacting the original contributors –kahneman’s suggestion), then what do these results mean? ultimate specificity. impossible generalization.

Pierre: True, that would be a parody of science. The issue arises in much more serious settings as with Anil Potti who claimed Bagley and Coombes simply failed to follow his method. http://errorstatistics.com/2014/05/31/what-have-we-learned-from-the-anil-potti-training-and-test-data-fireworks-part-1/
His results, later retracted, were taken as grounds for clinical trials! My point really is that critiques of results also need to pass scientific muster. There needs to be a reliable way to pinpoint the flaw. In the Potti case, one of the key criticisms, as I understand it, had to do with the way the prediction model (for targeted cancer treatments) was validated.

To refer to my second installment, I don’t think the critique in social psych should be limited to the “purely” statistical.

What would you recommend as a way the field can be genuinely reformed?

“But in the sciences I am familiar with, a result does not become ‘high impact’ unless it’s been replicated, both directly and through interconnected checks, ideally with a growing body of theory.”

This was perhaps true of physics research from 60 years ago, but it’s absolutely not true of most sciences or social sciences today.

I normally enjoy Kahneman’s writings, but I found his suggestions here embarrassing.

Perhaps if the press releases that went along with these papers had emphasized how subtle and difficult to replicate the effects are, the field wouldn’t be in so much trouble… but then people probably wouldn’t have cared about most of the results to begin with if they were presented truthfully.

vl: That’s the thing, the social psych people are thoroughly immersed in a program wherein these practices and measurements and modes of analysis are taught as legitimate. Having been taught as kosher practices, one could argue that it’s not the researchers’ faults.

vl: I would be wary of giving 1950s-era physics too much credit when it comes to questioning detection claims. The “5-sigma rule” in particle physics stems from a series of new discoveries of notable statistical significance that would disappear upon the collection of more data.

Mayo: I think the standards-and-practices issue you mention also helps explain the defensiveness of researchers to the replication effort. And I can’t help but be sympathetic to that response, at least from younger researchers. But non-replicable detection claims have consequences.

Weber’s claims of having detected gravitational waves in the 1960s spurred a furious effort to replicate his instrument. When others failed to find signals, the sub-field became embroiled in arguments concerning calibration methods, source astrophysics, and blind data analysis. When problems with his work were pointed out, sometimes brutally, in print and in person, Weber never admitted that his results were in error. By the 80s, he was irrelevant to the cutting-edge work of the field.

West: I’m very familiar with the whole gravity waves search, and also with Collins (whom I generally radically disagree with, but in his recent book–something like, Are we all experts now? he backtracks a fair amount from his previous social constructivism).

“I can’t help but be sympathetic to that response, at least from younger researchers.” I agree; I mean, who can deny there are conflicts of interest? The “rewards and flexibility hypothesis” they champion says as much. That’s one of the ironies. But the social psych people I hear from don’t seem to get it. It would be interesting if they let their data be analyzed blindly, or by someone outside the field.

I happen to come from a field where the main practice is criticism, criticism, criticism of everyone else. Things can and do get unfair in philosophy too, but at least we’re used to it.

Knowledge comes in many forms, ranging from universal tautological truths to tacit knowledge intimately tied to the individual. A key feature of scientific knowledge is that it is disentangled from the scientist who produced it. It remains both true and useful after the scientist is gone. This is what lets us publish, transmit, and build upon scientific knowledge. It is not tied to its origin. This is not to say that scientific results must hold everywhere and at all times, but rather that a key part of scientific knowledge is the conditions under which the result manifests.

I would not expect the original researcher to uncover all the conditions for the result. One service replication performs is that it can help uncover additional conditions for when the result holds true. But these “hidden secrets” you mention seem to suggest there are conditions for the result that the original researcher knows of but left out of the publication. This seems to me a somewhat questionable publishing practice.

Regarding Kahneman’s idea of involving the original researcher in the replication attempt: one could take a step back and regard the original result as well as the replications as data points in and of themselves. A meta-analysis of sorts. In that case a good argument could be made that involving the original researcher is a bad idea. It would create a dependence between the data points, which in turn would undermine the ability to apply standard statistical tools to the meta-analysis, since statistical models usually assume independence.
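To make the independence point concrete, here is a toy fixed-effect pooling sketch (the function name and all numbers are mine, purely illustrative, not from any actual replication effort):

```python
import math

def pool_fixed_effect(estimates, std_errors):
    """Inverse-variance weighted pooling of *independent* effect estimates."""
    weights = [1.0 / se**2 for se in std_errors]
    pooled = sum(w, est in zip(weights, estimates)) if False else \
        sum(w * est for w, est in zip(weights, estimates)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return pooled, pooled_se

# Hypothetical original study plus two replications (made-up numbers):
est, se = pool_fixed_effect([0.40, 0.10, 0.05], [0.20, 0.15, 0.15])
```

The pooled standard error shrinks as 1/sqrt(total weight), but that arithmetic is only licensed if the studies are independent; if the original researcher shapes the replications, the data points carry less information than the weights claim.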

Johan: Thanks for your comment. Here’s my favorite quote from Fisher as regards what is required to show a “phenomenon is experimentally demonstrable”:
“[W]e need, not an isolated record, but a reliable method of procedure. In relation to the test of significance, we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result.” (Fisher 1947, 14).

What counts as “knowing how” depends on the context; but we can discern its absence when a purported effect is irreproducible–provided that the failed attempt is due to the nonexistence of the effect. Kahneman’s advice (about checking with those who purport to have successfully replicated) could well be his way of saying that he has doubts about some of the failed replications.
As E. Berk reminds us, in his initial call for the daisy chain, Kahneman said: “Some of these failures to replicate are not utterly convincing.”

My definition of questionable or pseudo- inquiry is when successes/failures cannot be reliably taken as either crediting or blaming claims under test.

Johan: as I said in my reply to John Byrd, it’s likely impossible to squeeze all of the relevant information necessary for replication into a paper given the various constraints of academic publishing. That’s why making available supplemental material online is so important.

In the “Olden Days,” if you really wanted to know how something was done you had to go to that person’s lab or sit down with them for coffee at a conference. And despite the array of additional tools, that personal interaction with other researchers and their methods remains important. Nothing ever works like the manual says it should. And sometimes even the original researcher isn’t aware that certain decisions are crucial to making an experiment work. Over time, that tacit knowledge known only to a few pioneers becomes more broadly known and is later codified in textbooks.

The best way to replicate a study is to try to get as close to the original methodology as possible. If replications fail to detect the same signal, then it could be either that certain crucial conditions were not met or that the previous claim was just statistical noise. And that is why having at least an amiable relationship with the original researchers is important. You can’t have this debate without them.

West:
Can you give an example of the sort of thing you have in mind?
The psych people are, by and large, rejecting any such thing with a vehemence that might make sense for some straightforward cases but is slightly absurd for social psych.

The sociologist of science Harry Collins has made much of the idea of tacit knowledge amongst scientists, and the role that it plays within a given community. There are many things that must be learned from mentors, not manuals. Most of it is part of the culture of the particular field.

In cutting-edge research, the mentorship has no opportunity to occur at the time of first publication (as far as the reader is concerned). The challenge to the author is to fully explain the methods used in sufficient detail to support replication by the reader. My experience as an editorial board member for many years has been that I seldom see a draft with an adequate methods description to support replication. For some studies, this could be an onerous task. My opinion is that the methods section must be written to that standard. If it is not, then it is no surprise that replications fail.

Perhaps the high failure rates discussed are in part due to poor standards for journal articles. In any event, one explanation for failed replication is that the original paper was poorly written. Critical methodological details were left out. Difficulties in acquiring the data are not mentioned. Can members of the original team observe the same items or phenomena and record the same data? Can a third party? If not, then all subsequent statistics are less interesting and we can expect replication to fail.

John: Is it even possible for a standard journal article to contain all the relevant information to aid in a proper replication effort, given the stylistic and length restrictions?

It makes sense for papers to come with online supplements. Electronic logbooks are easy to come by. Analysis software and data products can be hosted on sites like GitHub. While this no doubt requires extra effort on the part of researchers, it is already SOP in many research projects. So it can be done without too much hassle.

Should these supplementary materials be made a threshold requirement for peer-reviewed publication?

West: I answer that the supplemental material should be required in some cases. Particularly so when it is unclear from the paper how to make the primary observations that are the basis for all subsequent analysis. If the basic data is cryptic, then how can others really evaluate it?

John: I’m not too big on Collins and “tacit knowledge”. It’s real, obviously, but the point of scientific methodology is to control it. Here there’s a danger it’s being used to essentially retain one’s interpretation no matter what–pseudo inquiry, no stringency. (But notice that Potti’s defense of himself was just as bad, and there we’re talking cancer research! At least here, people aren’t dying from errant social psych theories.)

But back to social psych, I think that the main problem is not even being addressed here. The “rewards and flexibility” hypothesis can explain a lot, sure, but not fix it.

Mayo: I certainly do not see tacit knowledge as an excuse for cryptic writing. I am simply noting that sometimes we presume others just know what we did when we give a cursory explanation of methods. In some cases, this is reasonable, if such procedures are common knowledge in the given field. However, new methods are typically not common knowledge and should get a more extensive treatment precisely so that others can successfully replicate the procedure.

@vl: I am curious if you have any specific examples in mind, at least in physics. As in discoveries that were heralded as ground-breaking only upon successful replication. For I am at a loss to think of one.

Mayo: As can I, but that is not the core of my contention with vl. I was warning against the temptation to romanticize (for lack of a better term) the past in light of the problems of the present. The question was whether any results deemed major discoveries were only considered as such after successful replication and I could not think of one example.

Demands for replication (in various guises), along with quantification of uncertainties/errors, act as checks against the researcher’s self-confidence in pattern recognition in his or her data. And as methods and subjects change, so too must the ways the researcher keeps herself from finding something that isn’t there. What we need is innovation, not nostalgia.

Some good points, but the central one is incorrect, and it instantiates the final irony on this blogpost. It’s not true that “you can’t ‘prove’ a negative”. You can do so in just as good a manner as you can ‘prove’ a positive. That is, they will both have to be determined on a case-by-case basis. (Qualification: I don’t mean that any experiment is an equally good test of H and not-H—this is definitely NOT the case.)

You comment “By the way, since Schnall’s research was testing ‘embodied cognition’ why wouldn’t they have subjects involved in actual cleansing activities rather than have them unscramble words about cleanliness?” is interesting in that, to me, it points to a big problem with a lot of social and behavioral science research, which is a vagueness of research hypotheses and an attitude that anything that rejects the null hypothesis is evidence in favor of the researcher’s preferred theory.

Just to clarify, I’m not saying that this is a particular problem with classical statistical methods; the same problem would occur if, for example, researchers were to declare victory when a 95% posterior interval excludes zero. The problem that I see here, and that I’ve seen in other cases too, is that there is little or no concern with issues of measurement. Scientific measurement can be analogized to links on a chain, and each link–each place where there is a gap between the object of study and what is actually being measured–is cause for concern.

Andrew: Thanks much for your comment. This is exactly where I think they should be focusing, i.e., on the links between the “object of study and what is actually being measured”. That was my point about getting beyond what I called the “pure statistics” problem. A few points:

(1) statistical tests, even of the simple (I had written “pure” but that will confuse with a different use of that word in the previous sentence) sig test variety, do not properly allow moving from statistical significance to substantive theory (the mistakes in inferences to the latter differ quite a bit from the former, so the latter are poorly probed by the stat test alone). The so-called NHST in psych is an invention, a fallacious one.

(2) I grant that psych, like other fields, has its own theories about connecting things like word primes with various “treatments” –an outsider (like me) shouldn’t assume their intuitions about relevant measurements are correct (though I think the researchers should defend theirs).

(3) It is not “distance” alone (between what is studied and what is measured) that matters. The wire monkey or whatever can be a powerful analogue to find out the importance of affection versus food, say. What matters, I maintain, is being able to test or probe a claim of interest via the experiment. The argument should be what we call “convergent” (rather than “linked”), so as to create the testable connection. (Point (3) is to defend against those who might protest against requiring “external validity” for purposes of learning.)

Yes, all of this is a line of reasoning that is crucial to science but is often ignored (in my own field of political science as well, where we often just accept survey responses as data without thinking about what they correspond to in the real world). One area where measurement is taken very seriously is psychometrics, but it seems that the social psychologists don’t think so much about reliability and validity. One reason, perhaps, is that psychometrics is about quantitative measurement, whereas questions in social psychology are often framed in a binary way (Is the effect there or not?). And once you frame your question in a binary way, there’s a temptation for a researcher, once he or she has found a statistically significant comparison, to just declare victory and go home.

Andrew: The measurements themselves in psych are quantitative, whatever dichotomy they introduce later, e.g., how much do you disapprove of eating your dog if he’s run over? I’m not saying that’s quite the question asked, only that they are measuring “how judgmental” someone is, so I assume there is this type of scale used. Actually, a student of mine posted something yesterday about Schnall which suggests she only asked people how they felt after the word unscrambling, lest the subjects figure out the purpose of the study. I don’t know, and won’t be investigating (though my student might).

I find almost all questionnaires annoying/frustrating in the extreme.

I doubt the issues in survey sampling are quite as problematic, even though there are usually ways to ask leading questions.

Yes, the measures in social psychology are often quantitative; what I’m talking about here is that the research hypotheses are framed in a binary way (really, a unary way in that the researchers just about always seem to think their hypotheses are actually true). This motivates the “I’ve got statistical significance and I’m outta here” attitude. And, if you’ve got statistical significance already and that’s your goal, then who cares about reliability and validity, right? At least, that’s the attitude, that once you have significance (and publication), it doesn’t really matter exactly what you’re measuring, because you’ve proved your theory.

Andrew: I don’t think researchers in social psych are well-described in this superficial a manner. Granted “the researchers just about always seem to think their hypotheses are actually true” which is one big reason that combining the data with their prior beliefs is the wrong way to go.

You write, “I don’t think researchers in social psych are well-described in this superficial a manner.” Indeed, I don’t think it’s possible to describe well any person, let alone a group of people, using a single paragraph! That said, I do think that the framing of hypotheses in a binary or even unary way is standard in many many areas of research (not just social psychology). And I definitely think the “I’ve got statistical significance and I’m outta here” attitude is standard. My above paragraph is not intended to be cynical or to imply that I think these researchers are trying to do bad science. I just think that the combination of binary or unary hypotheses along with a data-based decision rule leads to serious problems.

Andrew: The psych people all report effect sizes (never mind how meaningful), and most I know are confidence interval pumpers; some want to ban significance tests altogether. G. Cumming’s “Understanding the New Statistics” couldn’t be more damning of binary tests, or any tests frankly. The bottom line is they (i.e., the ‘CIs only’ people) share your critical view of tests. I do not. (Addition: That said, I too like CIs–a great deal. But even they require supplements with SEV assessments. A reader might look up “reforming the reformers”.)

Remember: this mini-discussion in comments arose because I noted your observation that the measurements used in the paper in question were not closely related to the researcher’s scientific hypotheses. I thought this was a good observation on your part, and I conjecture that one reason that researchers can be so sanguine about using measurements that are far from the object of study is that these sorts of research studies typically seem to have the goal of confirmation, which they happen to achieve by rejecting a null hypothesis. To me, the important part of this discussion is not whether researchers are using hypothesis tests or confidence intervals or whatever. Rather, what is important is that the research projects are framed as quests for confirmation of a theory. And once confirmation (in whatever form) is achieved, there is a tendency to declare victory and not think too hard about issues of reliability and validity of measurements.

Andrew: I agreed that “the measurements used in the paper in question were not” obviously adequately probing the substantive hypothesis. I don’t know that the projects are framed as quests “for confirmation of a theory”, rather than quests for evidence of a statistical effect (in the midst of the statistical falsification argument at the bottom of this comment). Getting evidence of a genuine, repeatable effect is at most a necessary but not a sufficient condition for evidence of a substantive theory that might be thought to (statistically) entail the effect (e.g., a cleanliness prime causes less judgmental assessments of immoral behavior—or something like that). I’m not sure that they think about general theories–maybe “embodied cognition” could count as a general theory here. Of course the distinction between statistical and substantive inference is well known. I noted, too, that the so-called NHST is purported to allow such fallacious moves from statistical to substantive and, as such, is a fallacious animal not permissible by Fisherian or NP tests.

I agree that issues about the validity and relevance of measurements are given short shrift and that the emphasis–even in the critical replication program–is on (what I called) the “pure” statistical question (of getting the statistical effect).

I’m not sure I’m getting to your concern Andrew, but I think that they see themselves as following a falsificationist pattern of reasoning (rather than a confirmationist one). They assume it goes something like this:

If the theory T (clean prime causes less judgmental toward immoral actions) were false, then they wouldn’t get statistically significant results in these experiments, so getting stat sig results is evidence for T.

I think these researchers are following a confirmationist rather than falsificationist approach. Why do I say this? Because when they set up a nice juicy hypothesis and other people fail to replicate it, they don’t say: “Hey, we’ve been falsified! Cool!” Instead they give reasons why they haven’t been falsified. Meanwhile, when they falsify things themselves, they falsify the so-called straw-man null hypotheses that they don’t believe.

The pattern is as follows: Researcher has hypothesis A (for example, that the menstrual cycle is linked to sexual display), then as a way of confirming hypothesis A, the researcher comes up with null hypothesis B (for example, that there is a zero correlation between date during cycle and choice of clothing in some population). Data are found which reject B, and this is taken as evidence in support of A. I don’t see this as falsificationist reasoning, because the researchers’ actual hypothesis (that is, hypothesis A) is never put to the test. It is only B that is put to the test. To me, testing B in order to provide evidence in favor of A is confirmationist reasoning.

Again, I don’t see this as having anything to do with Bayes vs non-Bayes, and all the same behavior could happen if every p-value were replaced by a confidence interval.

Andrew: “when they set up a nice juicy hypothesis and other people fail to replicate it, they don’t say: “Hey, we’ve been falsified! Cool!” Instead they give reasons why they haven’t been falsified.”

They are correct to do so, and there is no reason that they’d say THEY had been falsified when they mean only to claim that the failed replication doesn’t count against their successful stat sig result. In other words, what you’ve written above is equivocal, but if I give subscripts it will freak people out. It would go something like this: In objecting to a failed replication, ~stat sig(2), of their stat sig result (call it stat sig(1)), they can be denying that ~stat sig(2) counts against stat sig(1), or they can be denying that ~stat sig(2) counts against H*: the causal or other claim that stat sig(1) had been taken as evidence for. Either way, their primary inference can still be seen as taking stat sig(1) as statistically falsifying ~H*.

As for the rest of your remarks, as we say in philosophy: one man’s modus ponens is another’s modus tollens. That said, to get to the most important point:
if they were playing the confirmation game of Bayes boosting, then taking stat sig(1) as evidence for H* would be legit (since the posterior is boosted), and not open to the severe tester’s criticism.

“if they were playing the confirmation game of Bayes boosting, then taking stat sig(1) as evidence for H* would be legit (since the posterior is boosted), and not open to the severe tester’s criticism.”

I don’t think this is always true.

It may be true in simple limiting cases where there are only two hypotheses, but few problems in a domain as complex as social psych should be modeled in this way. The space of hypotheses is probably better captured as a continuous latent parameter or some latent categorical mixture parameter. For example there may be mixture components modeling various confounding factors as a hypothesis.

In these cases, stat significance against a point null doesn’t have to correspond to increasing the posterior probability of the specific hypothesis they have in mind.
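A toy sketch of what I mean (all numbers, means, and hypothesis labels are invented for illustration): a point null, the researcher’s hypothesis H*, and a confound hypothesis, with a single normal observation.

```python
from statistics import NormalDist

def posteriors(x, priors, means):
    """Posterior over point hypotheses given one observation x ~ N(mean, 1)."""
    likes = {h: NormalDist(mu, 1).pdf(x) for h, mu in means.items()}
    joint = {h: priors[h] * likes[h] for h in priors}
    total = sum(joint.values())
    return {h: p / total for h, p in joint.items()}

priors = {"H0": 0.50, "H*": 0.25, "confound": 0.25}
means = {"H0": 0.0, "H*": 1.0, "confound": 3.0}
post = posteriors(3.0, priors, means)  # x = 3.0 is "significant" against H0
```

Here the posterior for H* falls from 0.25 to roughly 0.12 even though the point null is soundly rejected, because the confound component explains the observation better. Statistical significance against H0 and confirmation of H* come apart.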

vl: I’m not sure about your example (would the posterior stay the same?) but I was specifically referring to Gelman’s suggestion that the hypothesis of interest H* is thought to entail or explain the stat sig result, and the stat sig result is claimed to “confirm” H*.

@mayo the posterior wouldn’t stay the same, although it could given just the right pathological observation.

In a complex system (e.g., “social psychology”) I think you have to conceptually incorporate the multitude of explanations that arise from the study design. You could have a component which models a common confounding process or a known background signal. Thus both the prior and the posterior would be mixture distributions.

Whether your particular hypothesis goes up or down and by how much depends on the details of the data and the structure of competing models.

You might say that having to conceptualize multiple competing explanations for the data is reflective of a study (or perhaps field) that shouldn’t exist in the first place and in some cases I might agree.

However, conditional on the analysis being done on such data, it would be wrong to ignore the complexity that’s there. In the bayesian approach this would be regarded as the sin of “throwing away information”.

I understand falsificationism to be that you take the hypothesis you love, try to understand its implications as deeply as possible, and use these implications to test your model, to make falsifiable predictions. The key is that you’re setting up your own favorite model to be falsified.

In contrast, the standard research paradigm in social psychology (and elsewhere) seems to be that the researcher has a favorite hypothesis A. But, rather than trying to set up hypothesis A for falsification, the researcher picks a null hypothesis B to falsify and thus represent as evidence in favor of A.

As I said above, this has little to do with p-values or Bayes; rather, it’s about the attitude of trying to falsify the null hypothesis B rather than trying to falsify the researcher’s hypothesis A.

Take Daryl Bem, for example. His hypothesis A is that ESP exists. But does he try to make falsifiable predictions, predictions for which, if they happen, his hypothesis A is falsified? No, he gathers data in order to falsify hypothesis B, which is someone else’s hypothesis. To me, a research program is confirmationist, not falsificationist, if the researchers are never trying to set up their own hypotheses for falsification.

That might be ok–maybe a confirmationist approach is fine; I’m sure that lots of important things have been learned in this way. But I think we should label it for what it is.

Andrew: Now I see what you’re alluding to, not a falsificationist logic* but a stringent, self-critical, testing account. Indeed, I consider any inquiry questionable if it hasn’t taken responsibility for error-probing and pointing up weak spots in the analysis, from the data collection to the modeling and inference. That’s my major kvetch against these studies. At times they show a patina of self-criticism (we wondered if we were only picking up on y rather than x), after which a story about a distinct analysis is given, supposedly quashing the concern so honestly raised.
*Your criticism is essentially pointing up the unsoundness of the falsification argument I laid out for them, in one of my comments above.

the blog started December 1, 2014. Observed power is also called post-hoc power. It is estimated by converting p-values into z-scores and using z-scores as estimates of the non-centrality parameter for a power analysis, typically with p < .05 (two-tailed), z = 1.96 as criterion value.

Does it use the observed effect size as the hypothesized effect size in computing power? If so, it’s what I call “shower” in my blog (search for it). A generally invalid measure. I’m not saying their critique is otherwise flawed.

I am well aware of the problems with, and critiques of, using observed power to evaluate a single study. However, observed power, which is really based on the observed effect size, has a sampling distribution, and a set of observed power values can estimate true power. It can also be used to examine whether the percentage of significant results is higher than power.
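For what it’s worth, the post-hoc power computation described in the comments above can be sketched as follows (the function name is mine; a two-sided test at the z = 1.96 criterion):

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal

def observed_power(p_two_sided, z_crit=1.96):
    """Post-hoc power: treat the observed z as the true noncentrality parameter."""
    z_obs = Z.inv_cdf(1 - p_two_sided / 2)  # convert two-sided p-value to z-score
    # P(|z| > z_crit) when z ~ N(z_obs, 1):
    return (1 - Z.cdf(z_crit - z_obs)) + Z.cdf(-z_crit - z_obs)
```

Note that a result sitting exactly at p = .05 maps to an observed power of about 50%, which is one way to see why the measure is uninformative for a single study, even if a collection of such values can say something about true power.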