I have been trying to reproduce several studies and have noticed that the reporting of results often conveys a much stronger impression than I get from investigating the data myself. I plan to report some of these reproduction attempts, so I have been reading the literature on researcher degrees of freedom and the file drawer problem. Below I’ll post and comment on some interesting passages that I have happened upon.

—

To put it another way: without modern statistics, we find it unlikely that people would take seriously a claim about the general population of women, based on two survey questions asked to 100 volunteers on the internet and 24 college students. But with the p-value, a result can be declared significant and deemed worth publishing in a leading journal in psychology. (Gelman and Loken, 2013, 14-15, emphasis in the original)

I wonder how many people in the general population take seriously general claims based on only small mTurk and college student samples, provided that these people are informed that these general claims are based only on small unrepresentative samples; I suspect that some of the “taking seriously” that leads to publication in leading psychology journals reflects professional courtesy among peer researchers whose work is also largely based on small unrepresentative samples.

We focused on a powerful objection to affirmative action – that affirmative action harms its intended beneficiaries by undermining their self-esteem. We tested whether White Americans would raise the harm to beneficiaries objection particularly when it is in their group interest. When led to believe that affirmative action harmed Whites, participants endorsed the harm to beneficiaries objection more than when led to believe that affirmative action did not harm Whites. Endorsement of a merit-based objection to affirmative action did not differ as a function of the policy’s impact on Whites. White Americans used a concern for the intended beneficiaries of affirmative action in a way that seems to further the interest of their own group.

So who were these white Americans?

Sixty White American students (37% female, mean age = 19.6) at the University of Kansas participated in exchange for partial course credit. One participant did not complete the dependent measure, leaving 59 participants in the final sample. (p. 898)

I won’t argue that this sort of research should not be done, but I’d like to see this sort of exploratory research replicated with a more representative sample. One of the four co-authors listed her institutional affiliation as California State University San Bernardino, and two other co-authors listed their institutional affiliation as Tulane University, so I would have liked to see a second study conducted among a different sample of students. At the very least, I’d like to see a description of the restricted nature of the sample in the abstract, to let me and other readers make a more informed judgment about the value of investing time in the article.

—

The Gelman and Loken (2013) passage cited above reminded me of a recent controversy regarding a replication attempt of Schnall et al. (2008). I read about the controversy in a Nicole Janz post at Political Science Replication. The result of the replication (a perceived failure to replicate) was not shocking because Schnall et al. (2008) had reported only two experiments based on data from 40 and 43 University of Plymouth undergraduates.

My graduate students are worried about publishing their work out of fear that data detectives might come after them and try to find something wrong in their work. Doing research now involves anticipating a potential ethics or even criminal investigation.

I like the term “data detectives” a bit better than “replication police” (h/t Nicole Janz), so I think that I might adopt the label “data detective” for myself.

I can sympathize with the graduate students’ fear that someone might target their work and try to find an error in it, but that’s a necessary occupational hazard for a scientist.

The best way to protect research from data detectives is to produce reproducible and perceived replicable research; one of the worst ways to protect research from data detectives is to publish low-powered studies in a high-profile journal, because the high profile draws attention and the low power increases suspicions that the finding was due to the non-reporting of failed experiments.

Researchers who try to serve the interests of science are going to find themselves out-competed by those who elect to “play the game,” because the ethical researcher will conduct a number of studies that will prove unpublishable because they lack statistically significant findings, whereas the careerist will find ways to achieve significance far more frequently. (p. 77)

This reflects part of the benefit produced by data detectives and the replication police: a more even playing field for researchers reluctant to take advantage of researcher degrees of freedom.

—

This Francis (2012) article is an example of a data detective targeting an article to detect non-reporting of experiments. Balcetis and Dunning (2010) reported five experiments rejecting the null hypothesis; the experiments had Ns, effect sizes, and powers as listed below in a table drawn from Francis (2012) p. 176.

Francis summed the powers to get 3.11, the number of times that we should expect the null hypothesis to be rejected given the observed effect sizes and powers of the five experiments; Francis multiplied the powers to get 0.076, the probability that the null hypothesis would be rejected in all five experiments.
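Francis’s two calculations are simple enough to sketch in a few lines of Python; the power values below are illustrative placeholders, not the actual values from the Francis (2012) table:

```python
import math

# Illustrative post-hoc power values for five experiments.
# These are placeholders, NOT the actual values from Francis (2012).
powers = [0.55, 0.60, 0.62, 0.65, 0.70]

# Sum of powers: expected number of null-hypothesis rejections
# across the five experiments.
expected_rejections = sum(powers)

# Product of powers: probability that all five experiments reject
# the null (assuming the experiments are independent).
prob_all_reject = math.prod(powers)

print(f"expected rejections: {expected_rejections:.2f}")
print(f"P(all five reject): {prob_all_reject:.3f}")
```

A low product, as in the 0.076 that Francis reported, suggests that an all-successes set of published experiments is unlikely without unreported failures.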

—

Here is Francis again detecting more improbable results. And again. Here’s a back-and-forth between Simonsohn and Francis on Francis’ publication bias studies.

—

Here’s the Galak and Meyvis (2012) reply to another Francis article, which claimed to have detected non-reporting of experiments in Galak and Meyvis (2011). Galak and Meyvis admit to the non-reporting:

We reported eight successful demonstrations of this phenomenon in our paper, but we also conducted five additional studies whose results either did not reach conventional levels of significance or did reach significance but ended up being rhetorically redundant. (p. 595)

…but argue that it’s not a problem because they weren’t interested in effect sizes:

However, as is the case for many papers in experimental psychology, the goal was never to assess the exact size of the effect, but rather to test between competing theoretical predictions. (p. 595)

Even if it is true that the authors were unconcerned with effect size, I do not understand how that justifies not reporting results that fail to reach conventional levels of statistical significance.

So what about readers who *are* interested in effect sizes? Galak and Meyvis write:

If a researcher is interested in estimating the size of an effect reported in a published paper, we recommend asking the authors for their file drawer and conducting a meta-analysis. (p. 595-596)

That’s an interesting solution: if you are reading an article and wonder about the effect size, put down the article, email the researchers, hope that the researchers respond, hope that the researchers send the data, and then — if you receive the data — conduct your own meta-analysis.

I discussed here some weird things that SPSS does with regard to weighting. Here’s another weird thing, this time in Stata:

The variable Q1 has a minimum of 0 and a maximum of 99,999. For this particular survey question, 99,999 is not a believable response; so, instead of letting 99,999 and other unbelievable responses influence the results, I truncated Q1 at 100, so that all responses above 100 equaled 100. There are other ways of handling unbelievable responses, but this can work as a first pass to assess whether the unbelievable responses influenced results.

The command replace Q1trunc = 100 if Q1 > 100 tells Stata to replace all responses over 100 with a response of 100; but notice that this replacement increased the number of observations from 2008 to 2065; that’s because Stata treated the 57 missing values as positive infinity and replaced these 57 missing values with 100.

Stata has a reason for treating missing values as positive infinity, as explained here. But — unless users are told of this — it is not obvious that Stata treats missing values as positive infinity, so this appears to be a source of potential error for code with a > sign and missing values.
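The pitfall is easy to mimic outside of Stata; here is a minimal Python sketch that uses float('inf') as a stand-in for Stata’s internal representation of missing values (the data are hypothetical):

```python
# Mimic Stata's behavior: a missing value (.) sorts above every number,
# so in comparisons it effectively behaves like positive infinity.
MISSING = float("inf")

q1 = [50, 250, MISSING, 80]  # hypothetical responses; one is missing

# Naive recode, the analog of: replace Q1trunc = 100 if Q1 > 100
naive = [100 if x > 100 else x for x in q1]
# The missing value was silently recoded to 100:
assert naive == [50, 100, 100, 80]

# Safe recode, the analog of: replace Q1trunc = 100 if Q1 > 100 & !missing(Q1)
safe = [100 if (x > 100 and x != MISSING) else x for x in q1]
# The missing value stays missing:
assert safe == [50, 100, MISSING, 80]
```

The second form, which explicitly excludes missing values from the comparison, is the standard defensive idiom in Stata.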

This post presents selected excerpts from Jesper W. Schneider’s 2014 Scientometrics article, “Null hypothesis significance tests. A mix-up of two different theories: the basis for widespread confusion and numerous misinterpretations” [ungated version here]. For the following excerpts, most citations have been removed, and page number references have not been included because my copy of the article lacked page numbers.

The first excerpt notes that the common procedure followed in most social science research is a mishmash of two separate procedures:

What is generally misunderstood is that what today is known, taught and practiced as NHST [null hypothesis significance testing] is actually an anonymous hybrid or mix-up of two divergent classical statistical theories, R. A. Fisher’s ‘significance test’ and Neyman’s and Pearson’s ‘hypothesis test’. Even though NHST is presented somewhat differently in statistical textbooks, most of them do present p values, null hypotheses (H0), alternative hypotheses (HA), Type I (α) and II (β) error rates as well as statistical power, as if these concepts belong to one coherent theory of statistical inference, but this is not the case. Only null hypotheses and p values are present in Fisher’s model. In Neyman–Pearson’s model, p values are absent, but contrary to Fisher, two hypotheses are present, as well as Type I and II error rates and statistical power.

The next two excerpts contrast the two procedures:

In Fisher’s view, the p value is an epistemic measure of evidence from a single experiment and not a long-run error probability, and he also stressed that ‘significance’ depends strongly on the context of the experiment and whether prior knowledge about the phenomenon under study is available. To Fisher, a ‘significant’ result provides evidence against H0, whereas a non-significant result simply suspends judgment—nothing can be said about H0.

They [Neyman and Pearson] specifically rejected Fisher’s quasi-Bayesian interpretation of the ‘evidential’ p value, stressing that if we want to use only objective probability, we cannot infer from a single experiment anything about the truth of a hypothesis.

The next excerpt reports evidence that p-values overstate the evidence against the null hypothesis. I have retained the reference citations here:

Using both likelihood and Bayesian methods, more recent research have demonstrated that p values overstate the evidence against H0, especially in the interval between significance levels 0.01 and 0.05, and therefore can be highly misleading measures of evidence (e.g., Berger and Sellke 1987; Berger and Berry 1988; Goodman 1999a; Sellke et al. 2001; Hubbard and Lindsay 2008; Wetzels et al. 2011). What these studies show is that p values and true evidential measures only converge at very low p values. Goodman (1999a, p. 1008) suggests that only p values less than 0.001 represent strong to very strong evidence against H0.
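One way to see the overstatement is the calibration from Sellke et al. (2001), which bounds the Bayes factor in favor of the null from below by −e·p·ln(p) for p < 1/e; a quick sketch:

```python
import math

def min_bayes_factor(p):
    """Sellke-Bayarri-Berger lower bound on the Bayes factor
    in favor of H0; valid for p < 1/e."""
    assert 0 < p < 1 / math.e
    return -math.e * p * math.log(p)

def min_posterior_h0(p):
    """Lower bound on P(H0 | data) assuming 50/50 prior odds."""
    b = min_bayes_factor(p)
    return b / (1 + b)

for p in (0.05, 0.01, 0.001):
    print(f"p = {p}: BF >= {min_bayes_factor(p):.3f}, "
          f"P(H0 | data) >= {min_posterior_h0(p):.3f}")
```

Under this bound, p = 0.05 corresponds to at best roughly 2.5-to-1 evidence against the null, and p must fall well below 0.01 before the bound indicates strong evidence, consistent with Goodman’s suggestion quoted above.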

This next excerpt emphasizes the difference between p and alpha:

Hubbard (2004) has referred to p < α as an ‘alphabet soup’, that blurs the distinctions between evidence (p) and error (α), but the distinction is crucial as it reveals the basic differences underlying Fisher’s ideas on ‘significance testing’ and ‘inductive inference’, and Neyman–Pearson views on ‘hypothesis testing’ and ‘inductive behavior’.

The next excerpt contains a caution against use of p-values in observational research:

In reality therefore, inferences from observational studies are very often based on single non-replicable results which at the same time no doubt also contain other biases besides potential sampling bias. In this respect, frequentist analyses of observational data seems to depend on unlikely assumptions that too often turn out to be so wrong as to deliver unreliable inferences, and hairsplitting interpretations of p values becomes even more problematic.

The next excerpt cautions against incorrect interpretation of p-values:

Many regard p values as a statement about the probability of a null hypothesis being true or conversely, 1 − p as the probability of the alternative hypothesis being true. But a p value cannot be a statement about the probability of the truth or falsity of any hypothesis because the calculation of p is based on the assumption that the null hypothesis is true in the population.

The final excerpt is a hopeful note that the importance attached to p-values will wane:

Once researchers recognize that most of their research questions are really ones of parameter estimation, the appeal of NHST will wane. It is argued that researchers will find it much more important to report estimates of effect sizes with CIs [confidence intervals] and to discuss in greater detail the sampling process and perhaps even other possible biases such as measurement errors.

The Schneider article is worthwhile for background and information on p-values. I’d also recommend this article on p-value misconceptions.

Jeremy Freese recently linked to a Jason Mitchell essay that discussed perceived problems with replications. Mitchell discussed many facets of replication, but I will restrict this post to Mitchell’s claim that “[r]ecent hand-wringing over failed replications in social psychology is largely pointless, because unsuccessful experiments have no meaningful scientific value.”

Mitchell’s claim appears to be based on a perceived asymmetry between positive and negative findings: “When an experiment succeeds, we can celebrate that the phenomenon survived these all-too-frequent shortcomings. But when an experiment fails, we can only wallow in uncertainty about whether a phenomenon simply does not exist or, rather, whether we were just a bit too human that time around.”

Mitchell is correct that a null finding can be caused by experimental error, but Mitchell appears to overlook the fact that positive findings can also be caused by experimental error.

—

Mitchell also appears to confront only the possible “ex post” value of replications, but there is a possible “ex ante” value to replications.

Ward Farnsworth discussed ex post and ex ante thinking using the example of a person who accidentally builds a house that extends onto a neighbor’s property: ex post thinking concerns how to best resolve the situation at hand, but ex ante thinking concerns how to make this problem less likely to occur in the future; tearing down the house is a wasteful decision through the perspective of ex post thinking, but it is a good decision from the ex ante perspective because it incentivizes more careful construction in the future.

In a similar way, the threat of replication incentivizes more careful social science. Rational replicators should gravitate toward research for which the evidence appears to be relatively fragile: all else equal, the value of a replication is higher for a study based on 83 undergraduates at one particular college than for a study based on a nationally representative sample of 1,000 persons; and, all else equal, a replicator should pass on a stereotype threat study in which the dependent variable is percent correct, in favor of a study in which the stereotype threat effect was detected only with the more unusual measure of percent accuracy: the percent correct among only the problems that the respondent attempted.

Mitchell is correct that there is a real possibility that a researcher’s positive finding will not be replicated because of error on the part of the replicator, but, as a silver lining, this negative possibility incentivizes researchers concerned about failed replications to produce higher-quality research that reduces the chance that a replicator targets their research in the first place.

Comments to this scatterplot post contained a discussion about when one-tailed statistical significance tests are appropriate. I’d say that one-tailed tests are appropriate only for a certain type of applied research. Let me explain…

Statistical significance tests attempt to assess the probability that we mistake noise for signal. The conventional 0.05 level of statistical significance in social science represents a willingness to mistake noise for signal 5% of the time.

Two-tailed tests presume that these errors can occur because we mistake noise for signal in the positive direction or because we mistake noise for signal in the negative direction: therefore, for two-tailed tests we typically allocate half of the acceptable error to the left tail and half of the acceptable error to the right tail.

One-tailed tests presume either that: (1) we will never mistake noise for signal in one of the directions because it is impossible to have a signal in that direction, so that permits us to place all of the acceptable error in the other direction’s tail; or (2) we are interested only in whether there is an effect in a particular direction, so that permits us to place all of the acceptable error in that direction’s tail.

Notice that it is easier to mistake noise for signal in a one-tailed test than in a two-tailed test because one-tailed tests have more acceptable error in the tail that we are interested in.

So let’s say that we want to test the hypothesis that X has a particular directional effect on Y. Use of a one-tailed test would mean either that: (1) it is impossible that the true direction is the opposite of the direction predicted by the hypothesis or (2) we don’t care whether the true direction is the opposite of the direction predicted by the hypothesis.

I’m not sure that we can ever declare things impossible in social science research, so (1) is not justified. The problem with (2) is that — for social science conducted to understand the world — we should always want to differentiate between “no evidence of an effect at a statistically significant level” and “evidence of an effect at a statistically significant level, but in the direction opposite to what we expected.”

To illustrate a problem with (2), let’s say that we commit before the study to a one-tailed test for whether X has a positive effect on Y, but the results of the study indicate that the effect of X on Y is negative at a statistically significant level, at least if we had used a two-tailed test. Now we are in a bind: if we report only that there is no evidence that X has a positive effect on Y at a statistically significant level, then we have omitted important information about the results; but if we report that the effect of X on Y is negative at a statistically significant level with a two-tailed test, then we have abandoned our original commitment to a one-tailed test in the hypothesized direction.

—

Now, when is a one-tailed test justified? The best justification that I have encountered for a one-tailed test is the scenario in which the same decision will be made if X has no effect on Y and if X has a particular directional effect on Y, such as “we will switch to a new program if the new program is equal to or better than our current program”; but that’s for applied science, and not for social science conducted to understand the world: social scientists interested in understanding the world should care whether the new program is equal to or better than the current program.

—

In cases of strong theory or a clear prediction from the literature supporting a directional hypothesis, it might be acceptable — before the study — to allocate 1% of the acceptable error to the opposite direction and 4% of the acceptable error to the predicted direction, or some other unequal allocation of acceptable error. That unequal allocation of acceptable error would provide a degree of protection against unexpected effects that is lacking in a one-tailed test.
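To make the three allocations concrete, here is a sketch of their critical values for a z-test at an overall 0.05 level, using Python’s standard library:

```python
from statistics import NormalDist

z = NormalDist()  # standard normal distribution

# Two-tailed test: 2.5% of the acceptable error in each tail.
two_tailed = z.inv_cdf(1 - 0.025)

# One-tailed test: all 5% of the acceptable error in the predicted tail.
one_tailed = z.inv_cdf(1 - 0.05)

# Unequal split: 4% in the predicted tail, 1% in the opposite tail.
predicted_tail = z.inv_cdf(1 - 0.04)
opposite_tail = z.inv_cdf(1 - 0.01)

print(f"two-tailed critical value:  +/-{two_tailed:.3f}")
print(f"one-tailed critical value:     {one_tailed:.3f}")
print(f"4%/1% split: {predicted_tail:.3f} (predicted), "
      f"{opposite_tail:.3f} (opposite)")
```

Note that the 4%/1% split can still reject for a sufficiently large effect in the unexpected direction; it just demands a higher threshold there (about 2.33 instead of 1.96), whereas the one-tailed test ignores that direction entirely.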

List experiments are a commonly used social scientific tool for measuring the prevalence of illegal or undesirable attributes in a population. In the context of electoral fraud, list experiments have been successfully used in locations as diverse as Lebanon, Russia and Nicaragua. They present our best tool for detecting fraudulent voting in the United States.*

I’m not sure that list experiments are the best tool for detecting fraudulent voting in the United States. But, first, let’s introduce the list experiment.

The list experiment goes back at least to Judith Droitcour Miller’s 1984 dissertation, but she called the procedure the item count method (see page 188 of this 1991 book). Ahlquist, Mayer, and Jackman (2013) reported results from list experiments that split a sample into two groups: members of the first group received a list of 4 items and were instructed to indicate how many of the 4 items applied to themselves; members of the second group received a list of 5 items — the same 4 items that the first group received, plus an additional item — and were instructed to indicate how many of the 5 items applied to themselves. The difference in the mean number of items selected by the groups was then used to estimate the percent of the sample and — for weighted data — the percent of the population to which the fifth item applied.
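The difference-in-means estimator can be sketched with simulated data; the sample size, item probabilities, and true prevalence below are arbitrary assumptions, not values from Ahlquist, Mayer, and Jackman (2013):

```python
import random

random.seed(1)

TRUE_PREVALENCE = 0.10   # assumed share holding the sensitive attribute
N_PER_GROUP = 5000       # hypothetical group size

def count_response(n_items, p_each=0.5, sensitive=False):
    """Number of items a respondent says apply to them: n_items
    non-sensitive items, plus the sensitive item if it applies."""
    base = sum(random.random() < p_each for _ in range(n_items))
    return base + (1 if sensitive else 0)

# Control group: 4 items. Treatment group: the same 4 items plus the
# sensitive item, which applies with probability TRUE_PREVALENCE.
control = [count_response(4) for _ in range(N_PER_GROUP)]
treatment = [count_response(4, sensitive=random.random() < TRUE_PREVALENCE)
             for _ in range(N_PER_GROUP)]

# The difference in mean counts estimates the prevalence of the
# sensitive attribute.
estimate = sum(treatment) / N_PER_GROUP - sum(control) / N_PER_GROUP
print(f"estimated prevalence: {estimate:.3f}")
```

Even with 5,000 respondents per group, the standard error of this difference is about two percentage points, which is why low-frequency attributes are hard to pin down with list experiments of realistic size.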

Ahlquist, Mayer, and Jackman (2013) reported four list experiments from September 2013, with these statements as the fifth item:

“I cast a ballot under a name that was not my own.”

“Political candidates or activists offered you money or a gift for your vote.”

“I read or wrote a text (SMS) message while driving.”

“I was abducted by extraterrestrials (aliens from another planet).”

Figure 4 of Ahlquist, Mayer, and Jackman (2013) displayed results from three of these list experiments:

My presumption is that vote buying and voter impersonation are low-frequency events in the United States: I’d guess somewhere between 0 and 1 percent, and closer to 0 percent than to 1 percent. If that’s the case, then a list experiment with 3,000 respondents is not going to detect such low-frequency events. The 95 percent confidence intervals for the weighted estimates in Figure 4 appear to span 20 percentage points or more: the weighted 95 percent confidence interval for vote buying appears to range from -7 percent to 17 percent. Moreover, notice how much estimates varied between the December 2012 and September 2013 waves of the list experiment: the point estimate for voter impersonation was 0 percent in December 2012 and -10 percent in September 2013, a ten-point swing.

So, back to the original point, list experiments are not the best tool for detecting vote fraud in the United States because vote fraud in the United States is a low frequency event that list experiments cannot detect without an improbably large sample size: the article indicates that at least 260,000 observations would be necessary to detect a 1% difference.
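The order of magnitude of that sample-size claim can be checked with a standard two-sample difference-in-means power calculation; the standard deviation of the item-count response used below is an assumption (roughly what four independent 50/50 items would produce), not a figure from the article:

```python
from statistics import NormalDist

z = NormalDist().inv_cdf

def total_n(delta, sigma, alpha=0.05, power=0.80):
    """Total N for a two-sample difference-in-means test with equal
    group sizes, using the normal approximation."""
    n_per_group = (2 * sigma ** 2
                   * (z(1 - alpha / 2) + z(power)) ** 2 / delta ** 2)
    return 2 * n_per_group

# delta = 0.01: a 1-percentage-point difference in mean item counts.
# sigma = 1.0: assumed SD of the count response (four 50/50 items).
print(f"total N required: {total_n(delta=0.01, sigma=1.0):,.0f}")
```

Under these assumptions, the answer is on the order of 300,000 total observations for 80 percent power, the same order of magnitude as the article’s figure of at least 260,000.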

If that’s the case, then what’s the purpose of a list experiment to detect vote fraud with only 3,000 observations? Ahlquist, Mayer, and Jackman (2013, p. 31) wrote that:

From a policy perspective, our findings are broadly consistent with the claims made by opponents of stricter voter ID laws: voter impersonation was not a serious problem in the 2012 election.

The implication appears to be that vote fraud is a serious problem only if the fraud is common. But there are many problems that are serious without being common.

So, if list experiments are not the best tool for detecting vote fraud in the United States, then what is a better way? I think that — if the goal is detecting the presence of vote fraud and not estimating its prevalence — then this is one of those instances in which journalism is better than social science.

—

* This post was based on the October 30, 2013, version of the Ahlquist, Mayer, and Jackman manuscript, which was located here. A more recent version is located here and has replaced the “best tool” claim about list experiments:

List experiments are a commonly used social scientific tool for measuring the prevalence of illegal or undesirable attributes in a population. In the context of electoral fraud, list experiments have been successfully used in locations as diverse as Lebanon, Russia, and Nicaragua. They present a powerful but unused tool for detecting fraudulent voting in the United States.

It seems that “unused” is applicable, but I’m not sure that a “powerful” tool for detecting vote fraud in the United States would produce 95 percent confidence intervals that span 20 percentage points.

P.S. The figure posted above has also been modified in the revised manuscript. I have a pdf of the October 30, 2013, version, in case you are interested in verifying the quotes and figure.