Why the Journal of Personality and Social Psychology Should Retract Article DOI: 10.1037/a0021524 “Feeling the Future: Experimental evidence for anomalous retroactive influences on cognition and affect” by Daryl J. Bem

Added January 30, 2018: A formal letter to the editor of JPSP, calling for a retraction of the article (Letter).

“I’m all for rigor, but I prefer other people do it. I see its importance—it’s fun for some people—but I don’t have the patience for it. If you looked at all my past experiments, they were always rhetorical devices. I gathered data to show how my point would be made. I used data as a point of persuasion, and I never really worried about, ‘Will this replicate or will this not?’” (Daryl J. Bem, in Engber, 2017)

In 2011, the Journal of Personality and Social Psychology published a highly controversial article that claimed to provide evidence for time-reversed causality. Time-reversed causality implies that future events have a causal effect on past events. These effects are considered anomalous and fall outside current scientific explanations of human behavior because they contradict fundamental principles of our understanding of reality.

The article reports 9 experiments with 10 tests of time-reversed causal influences on human behavior with stunning results. “The mean effect size (d) in psi performance across all 9 experiments was 0.22, and all but one of the experiments yielded statistically significant results” (Bem, 2011, p. 407).

The publication of this article rocked psychology and triggered a credibility crisis in psychological science. Unforeseen by Bem, the article did not sway psychologists to believe in time-reversed causality. Rather, it made them doubt other published findings in psychology.

In response to the credibility crisis, psychologists started to take replications more seriously, including replications of Bem’s studies. If Bem’s findings were real, other scientists should be able to replicate them using the same methodology in their labs. After all, independent verification by other scientists is the ultimate test of all empirical sciences.

The first replication studies were published by Ritchie, Wiseman, and French (2012). They conducted three studies with a total sample size of N = 150 and did not obtain a significant effect. Although this finding casts doubt on Bem’s reported results, the sample size is too small to challenge the evidence reported by Bem, which was based on over 1,000 participants. A more informative replication attempt was made by Galek et al. (2012). A set of seven studies with a total of N = 3,289 participants produced an average effect size of d = 0.04, which was not significantly different from zero. This massive replication failure raised questions about potential moderators (i.e., variables that can explain inconsistent findings). The authors found “the only moderator that yields significantly different results is whether the experiment was conducted by Bem or not” (p. 941).

Galek et al. (2012) also speculate about the nature of the moderating factor that explains Bem’s high success rate. One possible explanation is that Bem’s published results do not represent reality. Published results can only be interpreted at face value if the reported data and analyses were not influenced by the result. If, however, data or analyses were selected because they produced evidence for time-reversed causality, and data and analyses that failed to provide evidence for it were not reported, the results cannot be considered empirical evidence for an effect. After all, random numbers can provide evidence for any hypothesis, if they are selected for significance (Rosenthal, 1979; Sterling, 1959). It is irrelevant whether this selection occurred involuntarily (self-deception) or voluntarily (other-deception). Both self-deception and other-deception introduce bias in the scientific record.

Replication studies cannot provide evidence about bias in original studies. Replication studies only tell us that other scientists were unable to replicate the original findings; they do not explain how the scientist who conducted the original studies obtained significant results. Seven years after Bem’s stunning results were published, it remains unknown how he obtained significant results in 9 out of 10 studies.

I obtained Bem’s original data (email on February 25, 2015) to examine this question more closely. Before I present the results of my analysis, I consider several possible explanations for Bem’s surprisingly high success rate.

1. Luck

The simplest and most parsimonious explanation for a stunning original result that cannot be replicated is luck. The outcome of empirical studies is partially determined by factors outside an experimenter’s control. Sometimes these random factors will produce a statistically significant result by chance alone. The probability of this outcome is determined by the criterion for statistical significance. Bem used the standard criterion of 5%. If time-reversed causality does not exist, 1 out of 20 attempts to demonstrate the phenomenon would provide positive evidence for it.

If Bem or other scientists encountered one successful attempt and 19 unsuccessful attempts, they would not consider the one significant result evidence for the effect. Rather, the evidence would strongly suggest that the phenomenon does not exist. However, if the significant result emerged in the first attempt, Bem could not know (unless he can see into the future) that the next 19 studies would not replicate the effect.

Attributing Bem’s results to luck would be possible, if Bem had reported a significant result in a single study. However, the probability of getting lucky decreases with the number of attempts. Nobody gets lucky every time they try. The luck hypothesis assumes that Bem got lucky 9 out of 10 times with a probability of 5% on each attempt.
The probability of this event is very small. To be exact, it is 0.000000000019 or 1 out of 53,612,565,445.
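The binomial arithmetic behind this number can be checked with a few lines of Python (a sketch; it assumes 10 independent studies, each with a 5% chance of a significant result under the null hypothesis):

```python
from math import comb

alpha = 0.05      # per-study false positive rate
n_studies = 10    # number of reported tests
n_hits = 9        # significant results obtained

# Probability of at least 9 significant results in 10 independent
# tests when the null hypothesis is true in every study.
p_lucky = sum(
    comb(n_studies, k) * alpha**k * (1 - alpha)**(n_studies - k)
    for k in range(n_hits, n_studies + 1)
)

print(p_lucky)             # ~1.9e-11
print(round(1 / p_lucky))  # roughly 1 out of 54 billion
```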

Given this small probability, it is safe to reject the hypothesis that Bem’s results were merely the outcome of pure chance. If we assume that time-reversed causality does not exist, we are forced to believe that Bem’s published results are biased by involuntarily or voluntarily presenting misleading evidence; that is, evidence that strengthens beliefs in a phenomenon that actually does not exist.

2. Questionable Research Practices

The most plausible explanation for Bem’s incredible results is the use of questionable research practices (John et al., 2012). Questionable research practices increase the probability of presenting only supportive evidence for a phenomenon at the risk of providing evidence for a phenomenon that does not exist. Francis (2012) and Schimmack (2012) independently found that Bem reported more significant results than one would expect based on the statistical power of the studies. This finding suggests that questionable research practices were used, but it does not reveal which practices were actually used. John et al. listed a number of questionable research practices that might explain Bem’s findings.

2.1. Multiple Dependent Variables

One practice is to collect multiple dependent variables and to report only dependent variables that produced a significant result. The nature of Bem’s studies reduces the opportunity to collect many dependent variables. Thus, the inclusion of multiple dependent variables cannot explain Bem’s results.

2.2. Failure to report all conditions

This practice applies to studies with multiple conditions. Only Study 1 examined precognition for multiple types of stimuli and found a significant result for only one of them. However, Bem reported the results for all conditions and it was transparent that the significant result was only obtained in one condition, namely with erotic pictures. This weakens the evidence in Study 1, but it does not explain significant results in the other studies that had only one condition or two conditions that both produced significant results.

2.3 Generous Rounding

Sometimes a study may produce a p-value that is close to the threshold value of .05. Strictly speaking, a p-value of .054 is not significant. However, researchers may report the p-value rounded to the second digit and claim significance. It is easy to spot this questionable research practice by computing exact p-values from the reported test statistics or by redoing the statistical analysis with the original data. Bem reported his p-values with three digits. Moreover, it is very unlikely that a p-value falls into the narrow range between .05 and .055, let alone that this would happen in 9 out of 10 studies. Thus, this practice also does not explain Bem’s results.
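This kind of audit is easy to automate. The sketch below (function names are my own; it uses a normal approximation to the t distribution, which is adequate for the df ≈ 100 in Bem’s studies) recomputes a one-tailed p-value from a reported t statistic and flags results that are only “significant” after generous rounding:

```python
from statistics import NormalDist

def one_tailed_p(t_value: float) -> float:
    """Approximate one-tailed p-value for a t statistic with large df,
    using the standard normal distribution as an approximation."""
    return 1 - NormalDist().cdf(t_value)

def rounded_to_significance(t_value: float, alpha: float = 0.05) -> bool:
    """Flag results that only cross the threshold after rounding,
    i.e., an exact p-value in the range [.05, .055)."""
    p = one_tailed_p(t_value)
    return alpha <= p < 0.055

# Example: the statistic reported for Bem's Study 3, t(96) = 2.55
print(one_tailed_p(2.55))             # ~.005, clearly below .05
print(rounded_to_significance(2.55))  # False: no generous rounding here
```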

2.4 HARKing

Hypothesizing after the results are known (Kerr, 1998) can be used to make significant results more credible. The reason is that it is easy to find significant results in a series of exploratory analyses. A priori predictions limit the number of tests that are carried out and the risk of capitalizing on chance. Bem’s studies did not leave much room for HARKing, except in Study 1. The studies build on a meta-analysis of prior studies, and nobody has questioned the paradigms used by Bem to test time-reversed causality. Bem did include an individual difference measure and found that it moderated the effect, but even if this moderator effect was HARKed, the main effect remains to be explained. Thus, HARKing cannot explain Bem’s findings either.

2.5 Exclusion of Data

Sometimes non-significant results are caused by an inconvenient outlier in the control group. Selective exclusion of such outliers based on p-values is another questionable research practice. There are some exclusions in Bem’s studies. The method section of Study 3 states that 100 participants were tested and three participants were excluded due to a high error rate in their responses. The inclusion of these three participants is unlikely to turn a significant result, t(96) = 2.55, p = .006 (one-tailed), into a non-significant one. In Study 4, one participant out of 100 was excluded. The exclusion of a single participant is unlikely to change a significant result, t(98) = 2.03, p = .023, into a non-significant one. Across all studies, only 4 out of 1,075 participants were excluded. Thus, exclusion of data cannot explain Bem’s robust evidence for time-reversed causality that other researchers cannot replicate.

2.6 Stopping Data Collection Early

Bem aimed for a minimum sample size of N = 100 to achieve 80% power in each study. All studies except Study 9 met this criterion before excluding participants (Ns = 100, 150, 97, 99, 100, 150, 200, 125, 50). Bem does not provide a justification for the use of a smaller sample size in Study 9 that reduced power from 80% to 54%. The article mentions that Study 9 was a modified replication of Study 8 and yielded a larger observed effect size, but the results of Studies 8 and 9 are not significantly different. Thus, the smaller sample size is not justified by an expectation of a larger effect size to maintain 80% power.

In a personal communication, Bem also mentioned that the study was terminated early because it was the end of the semester and the time stamp in the data file shows that the last participant was run on December 6, 2009. Thus, it seems that Study 9 was terminated early, but Bem simply got lucky that results were significant at the end of the semester. Even if Study 9 is excluded for this reason, it remains unclear how the other 8 studies could have produced significant results without a real effect.

2.7 Optional Stopping/Snooping

Collecting more data when the data already show a significant effect can be wasteful. Therefore, researchers may conduct statistical significance tests throughout a study and terminate data collection once a significant result is obtained. The problem with this approach is that repeated checking (snooping) increases the risk of a false positive result (Strube, 2006). The size of this increase depends on how often researchers check their results. Optional stopping leaves several traces in the data. First, sample sizes are expected to vary because sampling error will sometimes produce a significant result quickly and sometimes only after a long time. Second, sample size will be negatively correlated with observed effect sizes, because larger samples are needed to achieve significance with smaller observed effect sizes. If chance produces large effect sizes early on, significance is achieved quickly and the study is terminated with a small sample size and a large effect size. Finally, optional stopping produces p-values close to the significance criterion because data collection is terminated as soon as the p-value reaches the criterion.

The reported statistics in Bem’s article are consistent with optional stopping. First, sample sizes vary from N = 50 to N = 200. Second, sample sizes are strongly correlated with effect sizes, r = -.91 (Alcock, 2011). Third, p-values are bunched up close to the criterion value, which suggests studies may have been stopped as soon as significance was achieved (Schimmack, 2015).

Despite these warning signs, optional stopping cannot explain Bem’s results if time-reversed causality does not exist. The reason is that the sample sizes are too small for a set of 9 studies to all produce significant results. In a simulation study with a minimum of 50 participants and a maximum of 200 participants, only 30% of attempts produced a significant result. Even 1,000 participants are not enough to guarantee a significant result by simply collecting more data.
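A simulation along these lines can be sketched in a few lines of Python. This is not the original simulation: it assumes a one-tailed z-test on a true null effect (d = 0), a first look at N = 50, and a new look after every 10 additional participants up to N = 200. The exact success rate depends on these peeking assumptions, but it stays well below 100%:

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

def optional_stopping_trial(rng, n_min=50, n_max=200, step=10, alpha=0.05):
    """Simulate one study under the null (d = 0) with repeated peeking.
    Returns True if any interim one-tailed z-test reaches p < alpha."""
    data = [rng.gauss(0.0, 1.0) for _ in range(n_max)]
    for n in range(n_min, n_max + 1, step):
        z = fmean(data[:n]) * sqrt(n)  # z-test with known SD = 1
        if 1 - NormalDist().cdf(z) < alpha:
            return True
    return False

rng = random.Random(1)
n_sims = 2000
hits = sum(optional_stopping_trial(rng) for _ in range(n_sims))
# Success rate is inflated well above the nominal .05,
# but a single optional-stopping attempt still fails most of the time.
print(hits / n_sims)
```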

2.8 Selective Reporting

The last questionable practice is to report only successful studies that produce a significant result. This practice is widespread and contributes to the presence of publication bias in scientific journals (Franco et al., 2014).

Selective reporting assumes that researchers conduct a series of studies and report only studies that produced a significant result. This may be a viable strategy for sets of studies with a real effect, but it does not seem to be a viable strategy, if there is no effect. Without a real effect, a significant result with p < .05 emerges in 1 out of 20 attempts. To obtain 9 significant results, Bem would have had to conduct approximately 9*20 = 180 studies. With a modal sample size of N = 100, this would imply a total sample size of 18,000 participants.

Engber (2017) reports that Bem conducted his studies over a period of 10 years. This may be enough time to collect data from 18,000 participants. However, Bem also paid participants $5 out of his own pocket because (fortunately) this research was not supported by research grants. This would imply that Bem paid $90,000 out of pocket.

As a strong believer in ESP, Bem may have paid $90,000 dollars to fund his studies, but any researcher of Bem’s status should realize that obtaining 9 significant results in 180 attempts does not provide evidence for time-reversed causality. Not disclosing that there were over 100 failed studies would be a breach of scientific standards. Indeed, Bem (2010) warned graduate students in social psychology:

“The integrity of the scientific enterprise requires the reporting of disconfirming results.”

2.9 Conclusion

In conclusion, none of the questionable research practices that have been identified by John et al. seem to be plausible explanations for Bem’s results.

3. The Decline Effect and a New Questionable Research Practice

When I examined Bem’s original data, I discovered an interesting pattern. Most studies seemed to produce strong effect sizes at the beginning of a study, but then effect sizes decreased. This pattern is similar to the decline effect that has been observed across replication studies of paranormal phenomena (Schooler, 2011).

Figure 1 provides a visual representation of the decline effect in Bem’s studies. The x-axis is the sample size and the y-axis is the cumulative effect size. As sample sizes increase, the cumulative effect size approaches the population effect size. The grey area represents the results of simulation studies with a population effect size of d = .20. As sampling error is random, the grey area is a symmetrical funnel around the population effect size. The blue dotted lines show the cumulative effect sizes for Bem’s studies. The solid blue line shows the average cumulative effect size. The figure shows how the cumulative effect size decreases by more than 50% from the first 5 participants to a sample size of 100 participants.
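The grey funnel in Figure 1 can be reproduced in outline with a simple simulation (a sketch, assuming a true effect of d = .20 and standard normal scores; the variable names are my own):

```python
import random
from statistics import fmean

def cumulative_d(rng, d=0.20, n_max=100):
    """One simulated study: the cumulative mean effect size
    (in SD units) after each added participant."""
    scores, path = [], []
    for _ in range(n_max):
        scores.append(rng.gauss(d, 1.0))
        path.append(fmean(scores))
    return path

rng = random.Random(42)
paths = [cumulative_d(rng) for _ in range(1000)]

# Spread of cumulative effect size estimates across simulated studies
spread_at_5 = max(p[4] for p in paths) - min(p[4] for p in paths)
spread_at_100 = max(p[99] for p in paths) - min(p[99] for p in paths)

# The funnel narrows: estimates at n = 5 scatter widely around .20,
# while estimates at n = 100 cluster much more tightly around it.
print(round(spread_at_5, 2), round(spread_at_100, 2))
```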

The selection effect is so strong that Bem could have stopped 9 of the 10 studies after collecting a maximum of 15 participants with a significant result. The average sample size for these 9 studies would have been only 7.75 participants.

Table 1 shows the one-sided p-values for Bem’s datasets separately for the first 50 participants and for participants 51 to 100. For the first 50 participants, 8 out of 10 tests are statistically significant. For the following 50 participants none of the 10 tests is statistically significant. A meta-analysis across the 10 studies does show a significant effect for participants 51 to 100, but the Test of Insufficient Variance also shows insufficient variance, Var(z) = 0.22, p = .013, suggesting that even these trials are biased by selection for significance (Schimmack, 2015).
Table 1. P-values for Bem’s 10 datasets based on analyses of the first group of 50 participants and the second group of 50 participants.

EXPERIMENT   S 1-50     S 51-100
EXP1         p = .004   p = .194
EXP2         p = .096   p = .170
EXP3         p = .039   p = .100
EXP4         p = .033   p = .067
EXP5         p = .013   p = .069
EXP6a        p = .412   p = .126
EXP6b        p = .023   p = .410
EXP7         p = .020   p = .338
EXP8         p = .010   p = .318
EXP9         p = .003   NA
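The Test of Insufficient Variance applied to the second column can be sketched as follows (assuming one-tailed p-values are converted to z-scores; under honest reporting, the variance of z-scores across studies should be close to 1, because each z-score is the true signal plus standard normal sampling error):

```python
from statistics import NormalDist, variance

# One-tailed p-values for participants 51-100 (Table 1; EXP9 has no data)
p_values = [.194, .170, .100, .067, .069, .126, .410, .338, .318]

# Convert each p-value to a z-score
z_scores = [NormalDist().inv_cdf(1 - p) for p in p_values]

# Under unbiased sampling, Var(z) across studies is expected to be ~1.
# A much smaller value indicates selection for a narrow band of results.
var_z = variance(z_scores)
print(round(var_z, 2))  # ~0.22, far less variance than chance predicts
```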

There are two interpretations of the decrease in effect sizes over the course of an experiment. One explanation is that we are seeing a subset of attempts that showed promising results after peeking at the data. Unlike optional stopping, however, the researcher continues to collect more data to see whether the effect is real. Although the effect size decreases, the strong effect during the initial trials that motivated the researcher to collect more data is sufficient to maintain statistical significance, because sampling error also decreases as more participants are added. These results cannot be replicated because they capitalized on chance during the first trials, but this remains unnoticed because the next study does not replicate the first study exactly. Instead, the researcher makes a small change to the experimental procedure, and when he or she peeks at the data of the next study, the study is abandoned and the failure is attributed to the change in procedure (without checking that the earlier successful finding can be replicated).

In this scenario, researchers deceive themselves that slight experimental manipulations have huge effects on their dependent variable, when in fact sampling error in small samples is very large. Observed effect sizes in small samples can range from -1 to 1 (see grey area in Figure 1), giving the illusion that each experiment is different, but a random number generator would produce the same stunning differences in effect sizes. Bem (2011), and reviewers of his article, seem to share the belief that “the success of replications in psychological research often depends on subtle and unknown factors” (p. 422). How could Bem reconcile this belief with the reporting of 9 out of 10 successes? The most plausible explanation is that the successes are a selected set of findings out of many attempts that were not reported.

There are other hints that Bem peeked at the data to decide whether to collect more data or terminate data collection. In his 2011 article, he addressed concerns about a file drawer stuffed with failed studies.

“Like most social-psychological experiments, the experiments reported here required extensive pilot testing. As all research psychologists know, many procedures are tried and discarded during this process. This raises the question of how much of this pilot exploration should be reported to avoid the file-drawer problem, the selective suppression of negative or null results.”

Bem does not answer his own question, but the correct answer is clear: all of the so-called pilot studies need to be included if promising pilot studies were included in the actual studies. If Bem had clearly distinguished between promising pilot studies and actual studies, actual studies would be unbiased. However, it appears that he continued collecting data after peeking at the results after a few trials and that the significant results are largely driven by inflated effect sizes in promising pilot studies. This biased the results and can explain how Bem obtained evidence for time-reversed causality that others could not replicate when they did not peek at the data and terminated studies when the results were not promising.

Additional hints come from an interview with Engber (2017).

“I would start one [experiment], and if it just wasn’t going anywhere, I would abandon it and restart it with changes,” Bem told me recently. Some of these changes were reported in the article; others weren’t. “I didn’t keep very close track of which ones I had discarded and which ones I hadn’t,” he said. Given that the studies spanned a decade, Bem can’t remember all the details of the early work. “I was probably very sloppy at the beginning,” he said.

In sum, a plausible explanation of Bem’s successes that others could not replicate is that he stopped studies early when they did not show a promising result and then changed the procedure slightly. He also continued data collection when results looked promising after a few trials. As this research practice capitalizes on chance to produce large effect sizes at the beginning of a study, the results are not replicable.

Although this may appear to be the only hypothesis that is consistent with all of the evidence (evidence of selection bias in Bem’s studies, decline effect over the course of Bem’s studies, failed replications), it may not be the only one. Schooler (2011) proposed that something more intriguing may cause decline effects.

“Less likely, but not inconceivable, is an effect stemming from some unconventional process. Perhaps, just as the act of observation has been suggested to affect quantum measurements, scientific observation could subtly change some scientific effects. Although the laws of reality are usually understood to be immutable, some physicists, including Paul Davies, director of the BEYOND: Center for Fundamental Concepts in Science at Arizona State University in Tempe, have observed that this should be considered an assumption, not a foregone conclusion.”

Researchers who are willing to believe in time-reversed causality are probably also open to the idea that the process of detecting these phenomena is subject to quantum effects that lead to a decline in the effect size after attempts to measure it. They may consider the present findings of decline effects within Bem’s experiments a plausible explanation for replication failures. If a researcher collects too much data, the weak effects in the later trials wash out the strong effects during the initial trials. Moreover, quantum effects may not be observable all the time. Thus, sometimes initial trials will also not show the effect.

I have little hope that my analyses of Bem’s data will convince Bem or other parapsychologists to doubt supernatural phenomena. However, the analysis provides skeptics with rational and scientific arguments to dismiss Bem’s findings as empirical evidence that requires a supernatural explanation. Bad research practices are sufficient to explain why Bem obtained statistically significant results that could not be replicated in honest and unbiased replication attempts.

Discussion

Bem’s 2011 article “Feeling the Future” has had a profound effect on social psychology. Rather than revealing a supernatural phenomenon, the article demonstrated fundamental flaws in the way social psychologists conducted and reported empirical studies. Seven years later, awareness of bad research practices is widespread and new journal editors are implementing reforms in the evaluation of manuscripts. New statistical tools have been developed to detect practices that produce significant results by capitalizing on chance. It is unlikely that Bem’s article would be accepted for publication these days.

The past seven years have also revealed that Bem’s article is not an exception. The only difference is that the results contradicted researchers’ a priori beliefs, whereas other studies with even more questionable evidence were not scrutinized because the claims were consistent with researchers’ a priori beliefs (e.g., the glucose theory of will-power; cf. Schimmack, 2012).

The ability to analyze the original data of Bem’s studies offered a unique opportunity to examine how social psychologists deceived themselves and others into believing that they tested theories of human behavior when they were merely confirming their own beliefs, even if these beliefs defied basic principles of causality. The main problem appears to be the practice of peeking at results in small samples with different procedures and attributing differences in results to the experimental procedures, while ignoring the influence of sampling error.

Conceptual Replications and Hidden Moderators

In response to the crisis of confidence about social psychology, social psychologists have introduced the distinction between conceptual and exact replications and the hidden moderator hypothesis. The distinction between conceptual and exact replications is important because exact replications make a clear prediction about the outcome. If a theory is correct and an original study produced a result that is predicted by the theory, then an exact replication of the original study should also produce a significant result. At least, exact replications should succeed more often than they fail (Tversky and Kahneman, 1971).

Social psychologists also realize that not reporting the outcome of failed exact replications distorts the evidence and that this practice violates research ethics (Bem, 2000).

The concept of a conceptual replication provides the opportunity to dismiss studies that fail to support a prediction by attributing the failure to a change in the experimental procedure, even if it is not clear why a small change in the procedure would produce a different result. These unexplained factors that seemingly produced a success in one study and failures in others are called hidden moderators.

Social psychologists have convinced themselves that many of the phenomena that they study are sensitive to minute changes in experimental protocols (Bem, 2011). This belief sustains beliefs in a theory despite many failures to obtain evidence for a predicted effect and justifies not reporting disconfirming evidence.

The sensitivity of social psychological effects to small changes in experimental procedures also makes it seem necessary to conduct many studies that are expected to fail, just as medieval alchemists expected many failures in their attempts to make gold. These failures are not considered important. They are simply needed to find the conditions that produce the desired outcome: a significant result that supports researchers’ predictions.

The attribution of failures to hidden moderators is the ultimate attribution error of social psychologists. It makes them conduct study after study in the search for a predicted outcome without realizing that a few successes among many failures are expected simply due to chance alone. To avoid realizing the fragility of these successes, they never repeat the same study twice. The ultimate attribution error has enabled social psychologists to deceive themselves and others for decades.

Since Bem’s 2011 article was published, it has become apparent that many social psychological articles report results that fail to provide credible evidence for theoretical claims because they do not report results from an unknown number of failed attempts. The consequences of this inconvenient realization are difficult to exaggerate. Entire textbooks covering decades of research will have to be rewritten.

P-Hacking

Another important article for the replication crisis in psychology examined the probability that questionable research practices produce false positive results (Simmons, Nelson, & Simonsohn, 2011). The article presents simulation studies that examine the actual risk of a type-I error when questionable research practices are used. The authors find that a single questionable practice can increase the chance of obtaining a false positive result from the nominal 5% to 12.6%. A combination of four questionable research practices increased the risk to 60.7%. The massive use of questionable research practices is called p-hacking. P-hacking may work for a single study, if a researcher is lucky. But it is very unlikely that a researcher can p-hack a series of 9 studies to produce 9 false positive results (.6^9 ≈ 1%).
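The compounding arithmetic is easy to verify (a sketch; the per-study false positive rates are the ones reported by Simmons et al., and the 9 studies are assumed to be independent):

```python
# Per-study probability of a false positive result
nominal_alpha = 0.05  # honest testing at the nominal criterion
one_qrp = 0.126       # one questionable practice (Simmons et al., 2011)
four_qrps = 0.607     # four questionable practices combined

n_studies = 9

for label, p in [("honest", nominal_alpha),
                 ("one QRP", one_qrp),
                 ("four QRPs", four_qrps)]:
    # Probability that all 9 independent studies are false positives
    print(label, p ** n_studies)
```

Even with four questionable practices in every study, a run of 9 false positives has only about a 1% chance, which is why selective reporting of additional failed attempts is still needed to explain the published record.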

The analysis of Bem’s data suggests that a perfect multiple-study article requires omitting failed studies from the record, and hiding disconfirming evidence violates basic standards of research ethics. If there is a known moderator, the non-significant results provide important information about boundary conditions (time-reversed causality works with erotic pictures, but not with pictures of puppies). If the moderator is not known, it is still important to report this finding to plan future studies. There is simply no justification for excluding non-significant results from a series of studies that are reported in a single article.

To reduce bias and increase credibility, pilot studies or other failed studies could be included in a meta-analysis at the end of a multiple-study article. The meta-analysis could show that the effect is significant across an unbiased sample of studies that produced significant and nonsignificant results. This overall effect is functionally equivalent to the test of the hypothesis in a single study with high power. Importantly, the meta-analysis is only credible if it includes nonsignificant results (Schimmack, 2012, p. 563).
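An unbiased meta-analysis of this kind can be as simple as Stouffer's method, which combines z-scores across all studies, significant or not (a sketch; the function name and the example p-values are hypothetical):

```python
from math import sqrt
from statistics import NormalDist

def stouffer(p_values):
    """Combine one-tailed p-values across studies using Stouffer's
    method: z_combined = sum(z_i) / sqrt(k)."""
    z = [NormalDist().inv_cdf(1 - p) for p in p_values]
    z_combined = sum(z) / sqrt(len(z))
    return z_combined, 1 - NormalDist().cdf(z_combined)

# Hypothetical set of studies mixing significant and nonsignificant results
z_comb, p_comb = stouffer([.004, .320, .061, .550, .023])
print(round(z_comb, 2), round(p_comb, 3))
```

Because every study enters the combination, this overall test cannot be inflated by leaving nonsignificant results in the file drawer, which is exactly the property a credible multiple-study article needs.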

Thus, a simple way to improve the credibility of psychological science is to demand that researchers submit all studies that tested relevant hypotheses for publication and to consider selection of significant results scientific misconduct. Ironically, publishing failed studies will provide stronger evidence than seemingly flawless results that were obtained by omitting nonsignificant results. Moreover, allowing for the publication of non-significant results reduces the pressure to use p-hacking, which only serves the goal to obtain significant results in all studies.

Should the Journal of Personality and Social Psychology Retract Bem’s Article?

Journals have a high threshold for retractions. Typically, articles are retracted only if there are doubts about the integrity of the published data. If data were manipulated by fabricating them entirely or by swapping participants from one condition to another to exaggerate mean differences, articles are retracted. In contrast, if researchers collected data and selectively reported only successful studies, articles are not retracted. The selective publishing of significant results is so widespread that it seems inconceivable to retract every article that used this questionable research practice. Francis (2014) estimated that at least 80% of articles published in the flagship journal Psychological Science would have to be retracted. This seems excessive.

However, Bem’s article is unique in many ways, and the new analyses of the original data presented here suggest that bad research practices, inadvertently or not, produced Bem’s results. Moreover, the results could not be replicated in other studies. Retracting the article would send a clear signal to the scientific community and other stakeholders in psychological science that psychologists are serious about learning from mistakes by flagging the results reported in Bem (2011) as erroneous. Unless the article is retracted, uninformed researchers will continue to cite it as evidence for supernatural phenomena like time-reversed causality.

“Experimentally, such precognitive effects have manifested themselves in a variety of ways. … as well as precognitive priming, where behaviour can be influenced by primes that are shown after the target stimulus has been seen (e.g. Bem, 2011; Vernon, 2015).” (Vernon, 2017, p. 217).

Vernon (2017) does cite failed replication studies, but interprets these failures as evidence for some hidden moderator, treating the literature as a set of inconsistent findings that require further investigation. A retraction would make it clear that there are no inconsistent findings, because Bem’s results do not provide credible evidence for the effect. Thus, it is unnecessary, and perhaps unethical, to recruit human participants for further replication studies of Bem’s paradigms.

This does not mean that future research on paranormal phenomena should be banned. However, Bem’s paradigms and effect sizes cannot be used to plan future studies. For example, Vernon (2017) studied a small sample of 107 participants, which would be sufficient based on Bem’s effect sizes, but these effect sizes are not trustworthy.

A main objection to retraction is that Bem’s study made an inadvertent but important contribution to the history of social psychology: it triggered a methodological revolution and changes in the way social psychologists conduct research. Such an important article needs to remain part of the scientific record and needs to be cited in meta-psychological articles that reflect on research practices. However, a retraction does not eradicate a published article. Retracted articles remain available and can be cited (RetractionWatch, 2018). Thus, it is possible to retract an article without removing it from the scientific record. A retraction would signal clearly that the article should not be cited as evidence for time-reversed causality and that the studies should not be included in meta-analyses, because the bias in Bem’s studies also biases all meta-analytic findings that include them (Bem, Tressoldi, Rabeyron, & Duggan, 2015).

[edited January 8, 2018]
It is not clear how Bem (2011) thinks about his article these days, but one quote in Engber’s article suggests that Bem now realizes that he provided false evidence for a phenomenon that does not exist.

When Bem started investigating ESP, he realized the details of his research methods would be scrutinized with far more care than they had been before. In the years since his work was published, those higher standards have increasingly applied to a broad range of research, not just studies of the paranormal. “I get more credit for having started the revolution in questioning mainstream psychological methods than I deserve,” Bem told me. “I was in the right place at the right time. The groundwork was already pre-prepared, and I just made it all startlingly clear.”

If Bem wants credit for making it startlingly clear that his evidence was obtained with questionable research practices that can mislead researchers and readers, he should make it startlingly clear that this was the case by retracting the article.


62 thoughts on “Why the Journal of Personality and Social Psychology Should Retract Article DOI: 10.1037/a0021524 “Feeling the Future: Experimental evidence for anomalous retroactive influences on cognition and affect” by Daryl J. Bem”

Hi Uli,
following your reasoning, do you think the same proposal (to retract the paper, but to leave it available as a bad example of how to carry out a scientific investigation) should be applied to all papers whose findings were not confirmed by the multiple replication projects (see http://curatescience.org) or that used statistics that inflated false-positive rates? (e.g., Eklund, A., Nichols, T. E., & Knutsson, H. (2016). Cluster failure: why fMRI inferences for spatial extent have inflated false-positive rates. Proceedings of the National Academy of Sciences, 201602413.)

I think the decision would have to be made on a case-by-case basis. I am making the case for one paper here, but I would be willing and able to make a case for other papers as well. The examination of Jens Foerster’s articles, some of which have been retracted and some of which have not, seems relevant here.

My comment was based on the idea that the population effect size is d = .2, which is what I simulated in Figure 1.

If you mean, an observed effect size of .22, the risk of a false positive result depends on the amount of sampling error. The effect size itself is irrelevant for decisions about probabilities of false positives.

The risk of a false positive result in a series of studies is set by alpha. It is 5% (1 out of 20 attempts).

You may mean something like the ratio of false to true positive results when you state that the false positive risk can be over 90%, but that is not really relevant to the point of the paper, that it is not at all easy to get 9 out of 10 results with p < .05.

By false positive risk (FPR), I mean the probability that you are wrong if you declare that an observed p value implies a real effect. This is different from, and always bigger than, the type 1 error. The type 1 error is conditional on the null hypothesis being true, but to get the FPR you need the total number of “positive” tests, both true and false. See, for example, Figure 2 in http://rsos.royalsocietypublishing.org/content/1/3/140216
(in that paper I referred to the FPR as the “false discovery rate”, which I later regretted because that term is used by people who are interested in the problem of multiple comparisons).
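Colquhoun's distinction between alpha and the FPR can be made concrete with a short calculation (the prior, power, and alpha below are illustrative assumptions, not numbers from his paper):

```python
# False positive risk (FPR) vs. the type 1 error rate (alpha).
# Suppose 1,000 hypotheses are tested, only 10% describe real effects,
# alpha = .05 and power = .80 (all three numbers are illustrative).
n_tests = 1000
prior_true = 0.10
alpha, power = 0.05, 0.80

true_effects = n_tests * prior_true                  # 100 real effects
false_positives = (n_tests - true_effects) * alpha   # 45 false alarms
true_positives = true_effects * power                # 80 genuine hits
fpr = false_positives / (false_positives + true_positives)
print(round(fpr, 2))  # 0.36 -- far larger than alpha = .05
```

Even with conventional alpha and decent power, more than a third of "positive" results are false in this scenario, which is the gap between the type 1 error rate and the FPR that the comment describes.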

This fascinating article serves only to increase my concerns about Bem’s work.
You may like to see my recent article in ‘Skeptical Inquirer’ describing Bem’s dubious practices and suppression of data over ESP in the ganzfeld back in the 1980s and 90s. Perhaps, at last, this is all coming out.
Blackmore, S. (2018). Daryl Bem and psi in the ganzfeld. Skeptical Inquirer, 42:1, 44-45. https://www.susanblackmore.co.uk/daryl-bem-psi-ganzfeld/

I found this article very interesting and much more reasonable than most of the criticisms of Daryl Bem’s paper, but I’m puzzled by several things.

(1) You mention 10 attempted replications as collectively unsuccessful, but the meta-analysis you cite – Bem et al. (2015) – analysed 80 attempted replications, and claimed that as a whole they were successful even when Bem’s own studies were excluded. Perhaps there’s a reason you don’t accept this claim, but shouldn’t it be discussed?

I did not comment on the Bem et al. (2015) meta-analysis in detail. I merely mention that Bem’s studies should not be included. I am glad to hear that they also report results excluding Bem’s studies, but excluding Bem’s biased studies does not ensure that the other studies are unbiased. I don’t know how the authors examined or controlled for bias in the remaining studies.

(2) In your discussion of selective reporting as a potential explanation for Bem’s results, you estimate that this would require 18,000 participants in total (95% of whose results would have to be discarded). This does seem implausible. But have you attempted to calculate corresponding figures for your alternative scenario of running large numbers of pilot studies and continuing only the most promising ones? Obviously this wouldn’t require as many participants as running large numbers of full studies, but it seems to me the requirement would be of the same order. Particularly to get an end result of 9 significant studies out of 10, because the promising results of the pilot studies would be diluted by the chance results of the continuations, so a more stringent criterion than 5% significance would need to be applied to the pilots. I wonder whether this explanation is really much more feasible than the other possibilities you reject.

I have not attempted to compute the number and there are other problems with Bem’s studies. Here is a quick calculation. You run “pilot studies” with N = 10 and follow up only on studies that are significant with N = 10.

Assuming the nil-hypothesis is true (no effect), you need about 20 pilot studies to get p < .05 (one-tailed).

That is 200 participants for each significant pilot study. Then you run the remaining 90 participants to get significance with N = 100. That is another 90.

So, in total you need about 290 participants per significant study (200 in pilots plus 90 in the continuation) and a total N of 2,900 for 10 studies.

This is still a large number, but much more doable than N = 20,000 (for 10 studies).
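A quick Monte Carlo check of this back-of-envelope calculation, under the simplifying assumption that with the null true each pilot is significant with probability .05 (so about 20 pilots, i.e., 200 pilot participants, are spent per significant one):

```python
import random

random.seed(1)
ALPHA = 0.05            # one-tailed significance criterion for a pilot
PILOT_N, FULL_N = 10, 100

def pilots_until_significant():
    """Under the null, each pilot is significant with probability ALPHA."""
    k = 1
    while random.random() > ALPHA:
        k += 1
    return k

runs = [pilots_until_significant() for _ in range(100_000)]
mean_pilots = sum(runs) / len(runs)
participants_per_study = mean_pilots * PILOT_N + (FULL_N - PILOT_N)
print(round(mean_pilots, 1))           # about 20 pilots per success
print(round(participants_per_study))   # about 290 participants per "study"
```

This reproduces the arithmetic above: roughly 20 pilots of N = 10 per significant pilot, plus 90 continuation participants, or about 290 participants per published study.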

Thank you for replying to my questions. I think this is really the crucial one – how many trials/participants would be needed to produce the published results, in your scenario?

My problem with your illustrative calculation above is that it seems to assume implicitly that a statistically significant pilot experiment of 10 trials will remain significant when 90 continuation trials are added to it. In fact, most of the time the significance will be lost, and there would have to be a further stage of selection of the complete 100-trial experiments – which would mean a significantly larger number of pilot experiments would be required. (As you pointed out to me, you do envisage selection even in the trial range 51-100.)

You are simulating getting p < .05 (one-tailed / two-tailed?) and then continuing data collection. We do not need simulations to figure out that the effect size will often be diluted so much that the effect is no longer significant after 100 trials.

However, many of Bem's N = 100 studies are actually cobbled together from independent studies with 40/60 or 50/50 participants.

In addition, there is no reason to aim for p < .05 after 10 trials. I could use a higher criterion or effect size, e.g., an average d = .8 over 20 trials, to consider a pilot study promising.

Whatever Bem did to get his results, it doesn't mean we need to assume that there is a real effect.

(3) It’s not really true to say that Bem didn’t answer the question about what pilot explorations should be reported. Immediately after the paragraph you quote, he listed the three pilot explorations which he thought should be mentioned in the 2011 paper. Two concerned psychological variables that had been analysed in relation to some of the trials reported in the paper, but were later dropped. Apparently only one concerned trials which weren’t reported in the 2011 paper – “a small retroactive habituation experiment that used supraliminal rather than subliminal exposures”, which had been reported in 2003.

I was concerned to see the comments attributed to Bem in the 2017 online article by Engber, and find them difficult to reconcile with what he wrote in 2011. As you have been in contact with Bem, have you checked whether he acknowledges the accuracy of those quoted comments, and if so why he didn’t feel it was relevant to mention the unsuccessful abandoned experiments? Surely we have to be clear that if he carried out and discarded many more unsuccessful trials than he published – as you envisage – then his presentation of the work in 2011 was not an honest one?

First, I am in contact with Daryl Bem and he said he would comment on the blog when he has time.

Second, here is the full paragraph that addresses the question about pilot studies.

“This problem arose most acutely in our two earliest experiments, the retroactive habituation studies, because they required the most extensive pilot testing and served to set the basic parameters and procedures for all the subsequent experiments. ”
[None of these problematic pilot studies were included or mentioned in more detail. So, an acute problem is acknowledged but not addressed.]

I can identify three sets of findings omitted from this report so far that should be
mentioned lest they continue to languish in the file drawer.
[This does not mean that there are not a lot more. It merely means that according to Bem these three should be mentioned, but it is not clear why the others should not be mentioned.]

First, several individual-difference variables that had been reported in the psi literature to predict psi performance were pilot tested in these two experiments, including openness to experience; belief in psi; belief that one has had some psi experiences in everyday life; and practicing a mental discipline such as meditation,
yoga, self-hypnosis, or biofeedback. None of them reliably predicted psi performance, even before application of a Bonferroni correction for multiple tests.
[Not sure why they should be mentioned. They are totally irrelevant to the file-drawer of studies that tested the main effect of ESP and did not find the effect]

Second, an individual-difference variable (negative reactivity) that I reported as a correlate of psi in my convention presentation of these experiments (Bem, 2003) failed to emerge as significant in the final overall database.
[Once more irrelevant to the file-drawer of pilot studies with non-significant results.]

Finally, as also reported in Bem (2003), I ran a small retroactive habituation experiment that used supraliminal rather than subliminal exposures. It was conducted as a matter of curiosity after the regular (subliminal) experiment and its replication had been successfully completed. It yielded chance findings for both negative and erotic trials. As I warned in the convention presentation,
supraliminal exposures fundamentally change the phenomenology of the experiment for participants.
[This brings up another issue with Bem’s studies. He really obtained 9 significant results out of 9 tests with subliminal stimuli. He reported a non-significant result for one study with supraliminal stimuli and invoked presentation mode as a hidden moderator. This is also BS because the p-value for the supraliminal stimuli was close to significant, and it was significant after the first 50 trials. So, what happened here is that he was expecting another significant result, but then chance made the effect disappear after 100 trials and did not return a favorable result after 150 or 200 trials, which is when he gave up. Mentioning another failed replication with supraliminal stimuli and attributing it to presentation mode does not address the file-drawer problem for the studies with subliminal stimuli.]

In conclusion, Bem played a familiar game here that everybody (including me) has played before. You raise only concerns that you can address in the limitations section and do not mention real concerns that would offer editors a reason to reject the paper on a silver platter.

Very interesting Uli! I’m confident you’re familiar with Meehl, but if you haven’t listened to the recordings of his Philosophy of Psychology lectures all the way through, you might find the start of his 8th especially interesting. In the first 10 minutes or so he describes the problem with optional discarding of “pilot” studies, in particular, and suggests that the field would greatly benefit from journals granting space for very brief reports of failed pilot studies (seems to touch on one of your goals at Meta-Psychology).

As I mentioned, an outline of my calculation is on the page I linked to, so that the details can be seen if they are of interest, and checked if necessary. It wasn’t really a simulation, but an exact calculation based on the binomial probabilities. I calculated the number of pilot experiments that would be needed to make the expected number of significant full experiments (at p<0.05) equal to 9. I then calculated the expected total number of trials corresponding to that number of pilot experiments (not conditioned on the end result). The significance criteria were one-tailed, as you had suggested.

The only change I made to your prescription – in an attempt to be as fair as possible to your hypothesis – was a slight relaxation of the significance criterion for the pilot experiments, to p<0.055. That meant that a result of 8 or more successes out of 10 was counted as significant. (Applying p<0.05 strictly would have meant that 9 or more successes were required, which would actually represent p<0.01, and would result in the total number of trials required being more than 50,000.)

I agree that we don't need calculations to tell us that the significance of pilot experiments will usually be diluted by 90 further trials. The point was that your illustrative calculation didn't take that effect into account, and therefore underestimated the number of trials required. I did the calculation to find out how serious the underestimate was.

In addition to considering the parameters you had suggested, I did also consider pilot studies with 20 trials (similarly relaxing the criterion for pilot studies to p<0.058), but that calculation indicated that even more trials – about 20,400 – would be required than for the 10-trial pilots.

The figure can be reduced somewhat by relaxing the significance criterion for the pilot studies further. For 10-trial pilot experiments, the optimal p value is 0.17. That implies that about 13,300 trials would be required. But I don't think such a large p value would be consistent with your finding that, considering the first 15 trials, 9 out of the 10 experiments had reached significance (presumably meaning p<0.05) by that point, after 7.75 trials on average.

As to whether we need to assume there is a real effect, when I think about these matters I try to steer clear of unnecessary assumptions, either pro or con. The purpose of the calculations was to gauge whether the scenario you described in your article provided a plausible explanation for Bem's results. On the numbers, I don't think it does. No doubt other explanations will be suggested in the future, as they have been in the past, but my feeling is that the strong decline effect you have shown makes explanations in non-paranormal terms more difficult, not easier.

However, I do feel the decline effect is interesting and important, so I congratulate you on discovering it, and thank you for making it known.

Sorry that I misunderstood what you did, but substantively it doesn’t matter whether we use simulations or math. We both agree that just getting p < .05 (one-tailed) with N = 20 and then collecting another 80 participants under the null will typically wash out the initial effect. However, it is not difficult to find the amount of bias that is needed to get significance with a high probability in the long run.

Maybe you can simulate / compute this: continue the study only if the first 20 participants all have a positive effect; after that, get one more participant at a time and terminate the study whenever a negative sign appears.

I think requiring a very high success rate in the pilot experiment is probably going in the wrong direction. Although 20 out of 20 gives a highish probability (0.63) that the score will remain significant when 80 further trials are added, 20 out of 20 would be achieved in only about one pilot experiment in a million, so the total number of trials required would be very large indeed.

In principle, doing smaller experiments and then selecting the best ones and combining them into larger ones would be an efficient way of getting significant results, but I don’t see how that would produce the strong decline effect you have found.

Looking again at Bem’s results, of course the number of binary trials isn’t equal to the number of participants, but is some multiple of it. Therefore it wasn’t appropriate to use the exact binomial probabilities in my calculations. It would have been better to use the normal distribution as an approximation, in working out the probability of statistical significance being maintained in going from a pilot experiment to a completed experiment. I believe the results of the calculations I posted should be roughly correct, but they shouldn’t be taken as exact.
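The dilution effect under discussion can be quantified with exact binomial tail probabilities. The sketch below treats an experiment as 100 independent binary trials with a .5 hit rate under the null (a simplification of the commenter's calculation; as noted above, Bem's actual analyses were t-tests on participants' hit rates):

```python
from math import comb

def binom_sf(n, k):
    """P(X >= k) for X ~ Binomial(n, 0.5), i.e., under the null."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n

# Smallest number of hits that is significant (one-tailed p < .05)
# in a completed experiment of 100 trials.
k_full = next(k for k in range(50, 101) if binom_sf(100, k) < 0.05)
print(k_full)  # 59 hits needed out of 100

def survival_prob(pilot_hits):
    """Chance that a 10-trial pilot with this many hits still reaches
    k_full total hits after 90 further null trials."""
    return binom_sf(90, k_full - pilot_hits)

for hits in (8, 9, 10):
    print(hits, round(survival_prob(hits), 2))
```

Even a perfect 10-out-of-10 pilot retains significance less than a quarter of the time once 90 null trials are added, which is why the selection scenario requires more pilot experiments than the simple calculation suggests.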

Interesting finding. I wonder at the idea that you would need to run 20 pilot studies to get one with a significant result, though. I suspect that with some flexibility in what is measured and what might serve as an outcome, that this number can be substantially reduced.

You mention that you don’t find much scope for QRPs to produce an excess of findings. I have spent a lot of time on Bem’s experiments since they first came to our attention. This is a list of some of the opportunities I have found to increase the chance of finding a positive result.

Experiment 1
Pictures are rated on arousal (low to high) and valence (positive to neutral to negative) which allows for a variety of eminently justifiable ways of forming groups in which an effect is ‘expected’ or ‘not expected’. Plus, Bem mentions that a large number of ‘non-arousing’ trials were run along with the 36 trials he selected out to report on, which offers an opportunity to use a selected sample from a larger pool.
Note that he forms different groups in this study than he does using the same categories in experiments 5 and 6.

Experiment 2
Allowed for 3 different outcomes to serve as the main outcome – first 100 subjects, second 50 subjects, or all 150 subjects. A failure in any of those groups can be explained away.

Experiment 3 and 4
No explanation is offered for why the timing differs, between the forward and backward conditions, in the length of time before the prime is presented and the length of time the prime is presented. Once there are no restrictions on this, it allows for the possibility of testing multiple variations in timing. Priming experiments in the literature differ in the length of time the prime is presented (from subliminal to explicit) and in the length of time between prime and picture presentation, with the finding that there is a window where priming is most effective, and the effect is lost as the time increases. The forward priming trials fall within this window, while the retroactive trials are too long to do so. This raises the question of why.
Ratcliff’s recommendations for dealing with the right skew of the data are to either use cutoffs or transformations, not to transform data on which cutoffs have already been applied, as Bem did. The choice of cutoff or method of transformation has substantial effects on the power of the study, which then makes the false-positive risk, mentioned by Colquhoun, relevant.
Also, more results were excluded than the 4 subjects who had more than 16 errors. Trials in which errors were made were excluded across all subjects which resulted in the exclusion of about 9% of the trials, in addition to those excluded by the choice of cutoff.

Experiment 5 and 6
This experiment was previously written up, so we can compare the original report with this new report. The original report describes presenting 6 categories of pictures (as per Experiment 1). There were multiple hypotheses available for use, depending upon which category or combinations of categories were found to have a finding which differed from chance, in either direction. For example, the idea which this experiment was based on, Mere Exposure, would predict target preference in any category. Bem’s idea, Retroactive Habituation, predicts target preference or avoidance, depending upon the category. A failure to uphold the Mere Exposure and part of the Retroactive Habituation hypothesis is explained away, in this case.
There are trials in this report which were not included in the original report (at least 50). And there are sets of trials in the original report (at least 60), which have not been included in this report. In addition, trials which were originally reported as separate series are now combined and treated as though they were a single preplanned experiment in this report.

Experiment 7
The description of this experiment is different from the initial report, which included strongly negative and erotic pictures. Either Bem neglected to include the results from 146 of the subjects, or neglected to include all the trials from each subject.

Experiment 8 and 9
The DR % is a novel outcome measure. Without the constraint of using an established outcome measure, this allows for flexibility in outcome measures.

All of these points help to produce significant results. Your comparisons with the initial reports are particularly informative. Can you share these reports? Based on your own investigation of Bem’s article, would you support the call for a retraction?

As for retraction…I’m a semi-retired physician. Medicine began taking a serious look at these issues starting back in the early 90’s with evidence-based medicine. And I think the fallout from Bem’s paper informs those practices now, as well. Yes, I think his paper is misleading, if taken at face value. But the problems in his paper are transparent, once you learn to be skeptical of self-serving reports, and go looking for QRPs. So I don’t think it needs to be retracted on that basis.

If we want to retract it to show that psychologists are serious about learning from their mistakes, then it might be better to target more traditional psychology research first. Otherwise, it looks like we’re just picking on parapsychology (which, as far as I can tell, isn’t any better or worse than the rest of psychology in this regard). And it gives mainstream psychologists an out, a way to avoid identifying themselves as part of the problem, if it is associated with parapsychology only.

Of course, many people have suggested that Bem was lying when he said his hypotheses were fixed in advance, and that in fact they were chosen retrospectively after analysis had been performed for multiple hypotheses.

The problem is that while this scenario would tend to increase the number of statistically significant results, it wouldn’t tend to produce a strong decline effect of the kind reported in the article above. That’s why I said I thought the decline effect made it more difficult to find non-paranormal explanations, not easier.

I suspect this would produce a strong decline effect. While these opportunities might help to produce significant results in the initial sets of trials, it will become more difficult as the hypotheses become more firmly fixed (based on those initial sets) in subsequent sets of trials.

Hi Linda,
is there a way for you to share the original report? Failure to disclose these discrepancies strengthens the case for retraction.
I understand your concern about targeting ESP, but if you read my 2012 article, you will see that I am aware that similar problems exist in other articles. Unfortunately, these authors often do not share the raw data and without raw data it is difficult to make a strong case for retraction.

I’d also be curious to know what is the initial report of Experiment 7, which Linda refers to. (Evidently it’s not the report she linked to, as that was a presentation given 18 months before Experiment 7 was done.)

There’s no doubt that the scenario described in the article above, in which unsuccessful pilot experiments were suppressed and successful ones were retained and completed as full experiments, could – in qualitative terms – produce a strong decline effect.

But the post hoc selection of hypotheses applied to complete experiments, as suggested previously, obviously doesn’t tend to produce a decline effect.

Of course, one could speculate about combining multiple hypotheses and selection of pilot experiments, and one could see whether on that basis the observed decline could be reproduced quantitatively.

I think whichever way one looks at it, the decline must be telling us something important.

Also, I want to make it clear that I am not suggesting that Bem was necessarily lying. As Gelman writes in the article, “The Garden of Forking Paths”, it doesn’t need to be a deliberate fishing expedition when the details of data analysis are highly contingent on the data.

That being said, for experiment 5, Bem does explicitly state in his earlier report that he was looking for a significant psi effect on any of the six kinds of targets (high or low arousal, positive or neutral or negative valence). But the write-up in Feeling the Future for experiment 5 specifies that he was only looking for a significant psi effect for high-arousal negative targets and the remaining 5 kinds of targets (including positive high-arousal pictures) were called “neutral controls”.

Yes. Thanks to you and to Daryl Bem for providing the data files and the informative exchange of emails. I shall look forward to studying them further when I have a chance, though that may not be for a while.

If what Prof. Bem says in his emails is accepted, I can’t at the moment see much scope for explaining the decline effect in non-paranormal terms. But perhaps something will emerge from an examination of the data.

I agree. On the flip side, this implies that we cannot accept what Bem says if paranormal effects do not exist. In this regard, it is disconcerting that the effects and the decline effect have only been observed in his data.

I was not able to figure out how to leave a comment on your blog post at the website. (I kept being asked to register a site of my own.) So, I thought I would simply write you a note. You are free to publish it as my response to your most recent post if you wish.

In reading your posts on my precognitive experiments, I kept puzzling over why you weren’t mentioning the published meta-analysis of 90 “Feeling the Future” studies that I published in 2015 with Tressoldi, Rabeyron, & Duggan. After all, the first question we typically ask when controversial results are presented is “Can independent researchers replicate the effect(s)?” I finally spotted a fleeting reference to our meta-analysis in one of your posts, in which you simply dismissed it as irrelevant because it included my own experiments, thereby “contaminating” it.

But in the very first Table of our analysis, we presented the results for both the full sample of 90 studies and, separately, for the 69 replications conducted by independent researchers (from 33 laboratories in 14 countries on 10,000 participants).

These 69 (non-Bem-contaminated) independent replications yielded a z score of 4.16, p = 1.2 × 10^-5. The Bayes Factor was 3.85, generally considered large enough to provide “Substantial Evidence” for the experimental hypothesis.

Of these 69 studies, 31 were exact replications in that the investigators used my computer programs for conducting the experiments, thereby controlling the stimuli, the number of trials, all event timings, and automatic data recording. The data were also encrypted to ensure that no post-experiment manipulations were made on them by the experimenters or their assistants. (My own data were similarly encrypted to prevent my own assistants from altering them.) The remaining 38 “modified” independent replications variously used investigator-designed computer programs, different stimuli, or even automated sessions conducted online.

Both exact and modified replications were statistically significant and did not differ from one another. Both peer reviewed and non-peer reviewed replications were statistically significant and did not differ from one another. Replications conducted prior to the publication of my own experiments and those conducted after their publication were each statistically significant and did not differ from one another.

We also used the recently introduced p-curve analysis to rule out several kinds of selection bias (file drawer problems), p-hacking, and to estimate “true” effect sizes.
There was no evidence of p-hacking in the database, and the effect size for the non-Bem replications was 0.24, somewhat higher than the average effect size of my 11 original experiments (0.22). (This is also higher than the mean effect size of 0.21 achieved by presentiment experiments, in which indices of participants’ physiological arousal “precognitively” anticipate the random presentation of an arousing stimulus.)

For various reasons, you may not find our meta-analysis any more persuasive than my original publication, but your website followers might.
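[Editorial note: taking the letter's figures at face value, the one-sided tail probability for a z of 4.16 is easy to recheck with nothing but the Python standard library. This is a sanity check on the arithmetic only, not on the meta-analysis itself.]

```python
import math

def normal_sf(z: float) -> float:
    """One-sided tail P(Z > z) of the standard normal,
    via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2))

p = normal_sf(4.16)
print(f"{p:.2e}")  # ≈ 1.6e-05; the letter reports 1.2e-05, the same order of magnitude
```
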

While it is useful to know that the raw data are encrypted, the data are still subject to additional manipulation before they are analyzed and presented, so encryption doesn’t prevent additional researcher degrees of freedom from creeping in, even when everyone uses the same software. For example, Wiseman noticed, when using Bem’s software for experiments 8 and 9, that ‘unknown’ words were flagged by the program so the experimenter could go through and correct (if misspelled) or ignore (if not on the list) those words. The process was not blind, since the columns of target and control words were next to the column of words recalled by the subject. https://richardwiseman.wordpress.com/2010/11/18/bems-esp-research/

I’m not suggesting that this specifically made a difference to the results, but just offering it as an illustrative example.

You seem to be very familiar with Dr. Bem’s work. In an earlier comment you wrote that Experiment 5 used various emotional pictures.
I checked the article and found this description.

The retroactive version of this protocol simply reverses Steps 1 and 2: On each trial, the participant is first shown a pair of matched pictures on the computer screen and asked to indicate which picture he or she prefers. The computer then randomly selects one of the two pictures to serve as the habituation target and displays it subliminally several times. This first retroactive habituation experiment comprised trials using either strongly arousing negative picture pairs or neutral control picture pairs; positively arousing (i.e., erotic) picture pairs were not introduced until Experiment 6, reported below.

There is no mention of other emotional pictures. It is clearly implied that there were only two types of picture pairs.

If we perform the same analysis on the data from the first 50 subjects of Experiment 5 in “Feeling the Future,” we get:
34 women
16 men
negative/high arousal hit rate = 55.8%
t-test(49) = 2.41
p = 0.01 one-tailed*
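
(As a check on the arithmetic just above: the one-tailed p for t(49) = 2.41 can be recovered by a short numerical integration of the Student's t density, no statistics library required.)

```python
import math

def t_sf(t: float, df: int, upper: float = 60.0, steps: int = 20000) -> float:
    """One-sided tail P(T > t) for Student's t, by Simpson's rule on [t, upper].
    The tail beyond `upper` is negligible for these degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    pdf = lambda x: c * (1.0 + x * x / df) ** (-(df + 1) / 2)
    h = (upper - t) / steps
    s = pdf(t) + pdf(upper)
    for i in range(1, steps):
        s += pdf(t + i * h) * (4 if i % 2 else 2)
    return s * h / 3

print(round(t_sf(2.41, 49), 3))  # rounds to 0.01, matching the reported one-tailed p
```
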

It’s pretty clear that both reports are talking about the same data (minor error in the report of a p-value aside). The description of this experiment from 2003 states:

“For the PH studies, the pictures were divided into six categories defined by crossing 3 levels of valence (negative, neutral, positive) with 2 levels of arousal (low, high)…

The first, Experiment 101, was designed to see if the PH procedure would yield a significant psi effect on any kind of target. Accordingly, the 6 kinds of picture pairs composed by crossing 3 levels of valence (negative, neutral, positive) with 2 levels of arousal (low, high) were equally represented across the 48 trials of the session, 8 of each kind…

The results were clear cut: Only the negative/high arousal pictures produced a significant psi effect…

After the fact, then, this experiment can be conceptualized as comprising 8 negative trials and 40 low-affect (“control”) trials.”

But the description of this experiment, eight years later, in “Feeling the Future” (quoted above) makes no mention of the fact that Bem started by looking for an effect for any kind of target, not just negative/high arousal, nor that further experiments were planned on the basis of those results. There is also no mention that the “neutral controls” were a post-hoc compilation of pictures with a variety of valence and arousal levels, some of which were not “neutral” or not “low arousal”.

Even in your recent email exchange with Bem he states, “Nor did I discard failed experiments or make decisions on the basis of the results obtained.” This is clearly false in this case.

I think this may be sufficient evidence to consider calling for a retraction.

Clearly the statement you quote from “Feeling the Future” about Experiment 5 – “This first retroactive habituation experiment comprised trials using either strongly arousing negative picture pairs or neutral control picture pairs;” – is wrong.

Even if we didn’t have the 2003 paper, the wrongness of that statement would be obvious from the instructions to participants in Experiment 5, quoted in “Feeling the Future”, which say “Most of the pictures range from very pleasant to mildly unpleasant, but in order to investigate a wide range of emotional content, some of the pictures contain very unpleasant images (e.g., snakes and bodily injuries).” That makes it clear that there was a range of images, including strongly arousing positive images and weakly arousing negative ones.

[This is extremely misleading, because positive arousing (but not erotic) pictures WERE included in Experiment 5. Bem also tested (for the first 50 participants) whether positive arousing pictures produced a significant effect and didn’t find one. Only after the fact did he lump positive arousing stimuli into a set of control stimuli.]

Chris asked me to add this comment because he had problems posting it.

Chris
……………………………………………………………………………………………………………

“The relationship of the remaining 50 participants of Experiment 5 with the 60 of 102 is more problematical, as the numbers don’t match.”

On that point, the percentages for participants 51-100 in Experiment 5 appear to indicate there were 15 “negative” images and 45 “control” images per participant. That matches the design of Experiment 102 described in the 2003 paper. So I suspect participants 51-100 of Experiment 5 represent 50 of the 60 participants of Experiment 102.
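
The denominator reasoning here, reading trial counts off the values the per-participant percentages can take, can be sketched with a small script (a hypothetical helper, not anything from the papers):

```python
def candidate_trial_counts(pct: float, max_n: int = 60, tol: float = 0.05) -> list[int]:
    """Trial counts n (up to max_n) for which some integer hit count k
    yields a percentage 100*k/n within tol of the reported value."""
    hits = []
    for n in range(1, max_n + 1):
        if any(abs(100 * k / n - pct) <= tol for k in range(n + 1)):
            hits.append(n)
    return hits

# For example, a reported per-participant hit rate of 53.3% fits 8/15 trials
# (and multiples of 15), but no integer hit count out of 8 trials:
print(candidate_trial_counts(53.3))  # → [15, 30, 45, 60]
```
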

I’m not sure why 10 of the 60 should have been excluded, but there is a note in Table 1 of the 2003 paper that emotional reactivity scores weren’t available for 7 participants in the 100 series (=101+102+103).
This might have been a reason for later excluding these 7, as the 2011 paper says the overall database had been tested for correlations with “negative reactivity” (presumably meaning the same measure), but it had been found to be non-significant. Even if that were the case, and the 7 were all in Experiment 102, it would still leave a discrepancy of 3 participants between Experiment 5 and Experiments 101+102.

Actually, looking back at the old discussion, I see that I did have some doubts then about whether 101 was included in Experiment 5. Recently I have just been taking that as read (though that was partly because I saw the values taken by the percentages in the file were consistent with 8 trials). At any rate, it’s helpful that Linda has compared the statistics and confirmed it.

The relationship of the remaining 50 participants of Experiment 5 with the 60 of 102 is more problematical, as the numbers don’t match.