The authors conclude that, across experimental, cross-sectional, and longitudinal research designs, the recovered literature indicates significant effects of violent games on aggressive thoughts, feelings, and behaviors. Effects are moderate in size (r ≈ .2).

Our paper challenges some of the conclusions from that paper. Namely,

The original authors reported that there was "little evidence of selection (publication) bias." We found, among some sets of experiments, considerable evidence of selection bias.

The original authors reported that better experiments found larger effects. We found instead that selection bias may be stronger among the "best" experiments.

The original authors reported short-term effects on behavior of r = .21, a highly significant result of medium size. We estimated that effect as being r = .15 at the most and possibly as small as r = .02.

We do not challenge the results from cross-sectional or longitudinal research. The cross-sectional evidence is clear: there is a correlation between violent videogames and aggressive outcomes, although this research cannot demonstrate causality. There is not enough longitudinal research to try to estimate the degree of publication bias, so we are willing to take that research at its word for now. (Besides, an effect of hundreds of hours of games over a year is more plausible than an effect of a single 15-minute game session.)

Signs of selection bias in aggressive behavior experiments

With regard to short-term effects on aggressive behavior, the funnel plot shows some worrying signs. Effect sizes seem to get smaller as the sample size gets larger. There is a cluster of studies that fall with unusual accuracy in the .01 < p < .05 region. And when filtering for the "best practices" experiments, nearly all the nonsignificant results are discarded, leaving a starkly asymmetrical funnel plot. See these funnel plots from experiments on aggressive behavior:

When filtering for what the original authors deemed "best-practices" experiments, most null results are discarded. Effect sizes are reported in Fisher's Z, with larger effects on the right side of the x-axis. The average effect size increases, but so does funnel plot asymmetry, indicating selection bias. Studies fall with unusual regularity in the .01 < p < .05 region, shaded in dark grey.

The p-curve doesn't look so hot either:

P-curve of experiments of aggressive behavior coded as "best-practices". The curve is generally flat. This suggests either (1) the null is true or (2) the null is false but there is p-hacking.
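The intuition behind that diagnosis can be sketched in a few lines. When the null is true, p-values are uniformly distributed, so the significant results that survive publication land evenly across the .00–.05 range: a flat p-curve. This is my own toy simulation of that logic, not Simonsohn and colleagues' p-curve software; the sample sizes and simulation counts are arbitrary.

```python
import random
from statistics import NormalDist

random.seed(1)
nd = NormalDist()

def simulated_p(n=50):
    # Two groups drawn from the SAME distribution: the null is true.
    g1 = [random.gauss(0, 1) for _ in range(n)]
    g2 = [random.gauss(0, 1) for _ in range(n)]
    se = (2 / n) ** 0.5                      # SE of the mean difference
    z = (sum(g1) / n - sum(g2) / n) / se
    return 2 * (1 - nd.cdf(abs(z)))          # two-tailed p-value

ps = [simulated_p() for _ in range(20000)]
sig = [p for p in ps if p < 0.05]
# Count significant p-values in the bins .00-.01, .01-.02, ..., .04-.05.
bins = [sum(0.01 * i <= p < 0.01 * (i + 1) for p in sig) for i in range(5)]
print(bins)  # each bin holds roughly a fifth of the significant results
```

A genuine effect would instead pile p-values up near zero (a right-skewed curve); the flat shape is what makes the observed p-curve worrying.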

Where naive analysis suggests r = .21 and trim-and-fill suggests r = .18, p-curve estimates the effect as r = .08. Let's put that in practical terms. If Anderson and colleagues are right, a good experiment needs 140 participants for 80% power in a one-tailed test. If p-curve is right, you need 960 participants.
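Those sample sizes can be checked with the standard Fisher-z power approximation for a correlation (the SE of Fisher's z is roughly 1/sqrt(n − 3)). This is my own back-of-the-envelope sketch, not the authors' calculation, and it reproduces the figures above to within rounding.

```python
from math import atanh, ceil
from statistics import NormalDist

def n_for_correlation(r, alpha=0.05, power=0.80):
    """Approximate n for a one-tailed test of a correlation r,
    using the Fisher z transformation (SE of z is ~1/sqrt(n - 3))."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha)   # one-tailed critical value
    z_beta = nd.inv_cdf(power)        # quantile for the desired power
    z_r = atanh(r)                    # Fisher's z of the effect size
    return ceil(((z_alpha + z_beta) / z_r) ** 2 + 3)

print(n_for_correlation(0.21))  # → 140
print(n_for_correlation(0.08))  # → 965, i.e. roughly the 960 quoted above
```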

Given that 4 out of 5 "best-practices" studies have fewer than 140 participants, I suspect that we know very little about short-term causal effects of violent games on behavior.

Reply from Kepes, Bushman, and Anderson

You can find a reply by Kepes, Bushman, and Anderson here. They provide sensitivity analyses by identifying and removing outliers and by applying a number of other adjustments to the data: random-effects trim-and-fill, averaging the five most precise studies, and a form of selection modeling that assumes certain publication probabilities for null results.

They admit that "selective publishing seems to have adversely affected our cumulative knowledge regarding the effects of violent video games." However, they conclude that, because many of their adjustments are not far from the naive estimate, the true effects are probably only modestly overstated. In their view, the lab effect remains theoretically informative.

They do a fine job of it, but I must point out that several of their adjustments are unlikely to fully account for selection bias. We know that trim-and-fill doesn't get the job done. An average of the five most precise studies is also unlikely to fully eliminate bias. (In our preprint, we looked at an average of the ten most precise studies and later dropped it as uninteresting. You shed only a little bias but lose a lot of efficiency.)
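A toy simulation shows why a most-precise-studies average only partly sheds bias. Suppose the true effect is zero but only significant results get published: every published effect must clear its significance threshold, so even the most precise published studies sit above zero. This is my own illustrative sketch with made-up simulation parameters, not the actual dataset or Kepes et al.'s analysis.

```python
import random

random.seed(2)

# Simulate a literature with NO true effect but significance-based
# publication: only studies with one-tailed p < .05 get "published".
published = []                            # (observed effect in Fisher's z, SE)
while len(published) < 40:
    n = random.randint(20, 200)           # per-study sample size (arbitrary)
    se = 1 / (n - 3) ** 0.5               # SE of Fisher's z
    z_obs = random.gauss(0, se)           # true effect is exactly zero
    if z_obs / se > 1.645:                # selection filter: significant only
        published.append((z_obs, se))

def ivw_mean(studies):
    """Inverse-variance-weighted mean effect."""
    w = [1 / se ** 2 for _, se in studies]
    return sum(z * wi for (z, _), wi in zip(studies, w)) / sum(w)

naive = ivw_mean(published)
top5 = ivw_mean(sorted(published, key=lambda s: s[1])[:5])
print(round(naive, 3), round(top5, 3))  # both well above the true effect of 0
```

Both estimates stay clearly positive even though the true effect is zero, because the selection filter guarantees every published effect exceeds its own significance threshold; restricting to the most precise studies shrinks the bias but cannot remove it.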

I know less about the Vevea and Woods selection model they use. Still, because it uses a priori weights instead of estimating them from the data, I am concerned it may yet overestimate the true effect size if there is p-hacking or if the selection bias is very strong. But that's just my guess.

Summary

I am deeply grateful to Psychological Bulletin for publishing my criticism. It is my hope that this is the first of many similar re-analyses increasing the transparency, openness, and robustness of meta-analysis. Transparency opens the black box of meta-analysis and makes it easier to tell whether literature search, inclusion/exclusion, and analysis were performed correctly. Data sharing and archival also allow us to apply new tests as theory or methods are developed.

I am glad to see that we have made some progress as a field. Where once we might have debated whether or not there is publication bias, we can now agree that there is some publication bias. We can debate whether there is only a little bias and a medium effect, or whether there is a lot of bias and no effect. Your answer will depend somewhat on your choice of adjustment model, as Kepes et al. make clear.

To that end, I hope that we can start collecting and reporting data that does not require such adjustment. Iowa State's Douglas Gentile and I are preparing a Registered Replication Report together. If we find an effect, I'll have a lot to think about and a lot of crow to eat. If we don't find an effect, we will need to reevaluate what we know about violent-game effects on the basis of brief laboratory experiments.