Replication Updates

11/08/2018

First, in Supplemental Chapter S1, there's a comparison of psychological to medical effect sizes (page 477 of the 3rd Edition). Figure S1.12 reproduces a chart by Bushman and Anderson (2001) comparing the correlation between violent media and aggression to effect sizes from various medical studies, such as the association between smoking and lung cancer. The chart makes psychology look good--our effect sizes appear larger than those of well-known medical findings!

I've learned that this figure is problematic. One issue with the figure concerns its medical study comparators. The examples suggest that effect sizes for medical treatments and risks are rather small (such as the link between asbestos and cancer or lead exposure and IQ). But over the years, critics have argued that the examples in Bushman and Anderson's figure are artificially low or inaccurately computed. (Notably, Bushman and Anderson have defended their figure, more than once.)

Another issue is that the effect sizes for violent media and aggression are based on meta-analytic estimates that vary depending on the techniques used. Some researchers report medium-sized effects, but others argue that media effects on violence are negligible. (This debate led me to remove a media violence meta-analysis from Chapter 14 when I prepared the 3rd edition.)

Ultimately, I am persuaded by Ferguson's (2009) explanation of how the effect sizes of medical treatment studies vary dramatically depending on whether they include "hypothesis-irrelevant" cases. In particular, when effect sizes are based on population-wide epidemiological frequency counts, they can drastically underestimate the impact of a vaccine, drug, or risk factor.

Chapter 8: Aspirin and heart attack

Chapter 8 uses the classic example of "aspirin prevents heart attack" as an illustration of how a very small effect size (r = .03) can, in fact, have a large life-or-death impact. But if the aspirin-heart attack effect size came from a study that included hypothesis-irrelevant cases, it probably underestimated the true effect size. Indeed, the Physicians' Health Study estimated that aspirin reduces the risk of a first heart attack by 44%--a much larger effect.
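If you want to show students where those two numbers come from, here is a minimal Python sketch using the commonly cited 2x2 counts from the Physicians' Health Study (104 heart attacks among 11,037 physicians taking aspirin; 189 among 11,034 taking placebo). The phi coefficient is simply r computed on a 2x2 table:

```python
import math

# Published 2x2 counts from the Physicians' Health Study (1988):
a, b = 104, 11037 - 104   # aspirin group: heart attack, no heart attack
c, d = 189, 11034 - 189   # placebo group: heart attack, no heart attack

# The phi coefficient is the correlation r computed on a 2x2 table.
phi = (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))

# Relative risk reduction: how much aspirin lowered the rate of a first heart attack.
rrr = 1 - (a / (a + b)) / (c / (c + d))

print(f"r (phi) = {phi:.3f}")         # about .03 in magnitude (the sign reflects row coding)
print(f"risk reduction = {rrr:.0%}")  # about 45%, close to the reported 44%
```

The same data look trivial when expressed as a correlation and dramatic when expressed as a relative risk reduction--which is exactly the point of the Chapter 8 example.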

As always, I am grateful to colleagues who read my text carefully and alert me to problematic examples. Thanks!

07/10/2018

Her school achievement later in life can be predicted from her ability to wait for a treat (or by her family's SES). Photo: Manley099/Getty Images

There's a new replication study of the famous "marshmallow study," and it's all over the popular press. You've probably heard of the original research: Kids are asked to sit alone in a room with a single marshmallow (or some other treat they like, such as pretzels). If a child can wait up to 15 minutes until the experimenter comes back, they receive two marshmallows; if they eat the first one early, they don't get the second. As part of the original study, kids were tracked over several years. One of the key findings was that the longer children were able to wait at age 4, the better they were doing in school as teenagers. Psychologists have often used this study as an illustration of how self-control is related to important life outcomes.

The press coverage of this year's replication study illustrates at least two things. First, it's a nice example of multiple regression. Second, it's an example of how different media outlets assign catchy--but sometimes erroneous--headlines to the same study.

First, let's talk about the multiple regression piece. Regression analyses often try to understand a core bivariate relationship more fully. In this case, the core relationship they start with is between the two variables, "length of time kids waited at age 4" and "test performance at age 15." Here's how it was described by Payne and Sheeran in the online magazine Behavioral Scientist:

The result? Kids who resisted temptation longer on the marshmallow test had higher achievement later in life. The correlation was in the same direction as in Mischel’s early study. It was statistically significant, like the original study. The correlation was somewhat smaller, and this smaller association is probably the more accurate estimate, because the sample size in the new study was larger than the original. Still, this finding says that observing a child for seven minutes with candy can tell you something remarkable about how well the child is likely to do in high school.

a) Sketch a well-labeled scatterplot of the relationship described above. What direction will the dots slope? Will they be fairly tight to a straight line, or spread out?

b) The writers (Payne and Sheeran) suggest that a larger sample size leads to a more accurate estimate of a correlation. Can you explain why a large sample size might give a more accurate statistical estimate? (Hint: Chapter 8 talks about outliers and sample size--see Figures 8.10 and 8.11.)
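If you want a concrete demonstration to pair with question (b), here is a short simulation sketch. The "true" correlation of .20 and the sample sizes are arbitrary choices for illustration, not values from the marshmallow studies:

```python
import numpy as np

rng = np.random.default_rng(0)
true_r = 0.20  # an arbitrary "true" population correlation
cov = [[1, true_r], [true_r, 1]]

def sample_rs(n, reps=2000):
    """Draw `reps` samples of size n; return each sample's correlation."""
    rs = []
    for _ in range(reps):
        xy = rng.multivariate_normal([0, 0], cov, size=n)
        rs.append(np.corrcoef(xy[:, 0], xy[:, 1])[0, 1])
    return np.array(rs)

for n in (30, 900):
    rs = sample_rs(n)
    print(f"n = {n:3d}: mean r = {rs.mean():.2f}, spread (SD) = {rs.std():.2f}")

# Small samples scatter widely around the true r (so any one estimate may be
# far off); large samples cluster tightly around it.
```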

Now here's more about the study:

The researchers next added a series of “control variables” using regression analysis. This statistical technique removes whatever factors the control variables and the marshmallow test have in common. These controls included measures of the child’s socioeconomic status, intelligence, personality, and behavior problems. As more and more factors were controlled for, the association between marshmallow waiting and academic achievement as a teenager became nonsignificant.

c) What's proposed above is that social class is a third variable ("C") that might be associated with both waiting time ("A") and school achievement ("B"). Using Figure 8.15 as a model, draw this proposal. Think about it, too: Why does it make sense that lower SES might go with shorter waiting time (A)? And why might lower SES go with lower school achievement (B)?

d) Now create a mock regression table that might fit the pattern of results described above. Put the DV at the top (what is the DV?), then list the predictor variables underneath, starting with Waiting Time at Age 4 and including things like Child's Socioeconomic Status and Intelligence. Which betas should be significant? Which should not?

Basically, here we have a core bivariate relationship (between wait time and later achievement), and then critics suggest a possible third variable (SES). The researchers used regression to see whether the core relationship held when the third variable was controlled for. The core relationship went away, suggesting that SES is a third variable that can help explain why kids who wait longer do better in school later on.
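To make that logic concrete, here is a hypothetical simulation in Python (using statsmodels). The coefficients are invented, not taken from the replication study's actual models; SES is built in as a common cause of both waiting and achievement, so controlling for it should make the waiting coefficient collapse:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 900  # invented sample size

# Invented data-generating process: SES drives BOTH waiting and achievement;
# waiting has no direct effect of its own.
ses = rng.normal(size=n)
waiting = 0.5 * ses + rng.normal(size=n)      # wait time at age 4
achievement = 0.6 * ses + rng.normal(size=n)  # test performance at age 15

# Model 1: the core bivariate relationship (waiting alone).
m1 = sm.OLS(achievement, sm.add_constant(waiting)).fit()

# Model 2: the same relationship with SES added as a control variable.
m2 = sm.OLS(achievement, sm.add_constant(np.column_stack([waiting, ses]))).fit()

print(f"waiting alone:  b = {m1.params[1]:.2f}, p = {m1.pvalues[1]:.4f}")
print(f"waiting w/ SES: b = {m2.params[1]:.2f}, p = {m2.pvalues[1]:.4f}")
# The waiting coefficient is significant in Model 1 but shrinks toward zero
# (and loses significance) once the variance shared with SES is removed.
```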

Next let's talk about some of the hype around this replication study. The Behavioral Scientist piece (quoted above) is one of the more balanced descriptions. Its headline was "Try to Resist Misinterpreting the Marshmallow Test." It emphasized that the core relationship was replicated. It also explained in some detail why SES is related to self-control, and how the two probably cannot be meaningfully separated--it's a nuanced report. But other press coverage had a doomsday feel:

One person on Twitter even wrote, "The marshmallow/delayed gratification study always felt 'wrong' to me - this year it was reported to be hopelessly flawed."

Are these headlines and comments fair? Probably not. As Payne and Sheeran write in Behavioral Scientist,

The problem is that scholars have known for decades that affluence and poverty shape the ability to delay gratification. Writing in 1974, Mischel observed that waiting for the larger reward was not only a trait of the individual but also depended on people’s expectancies and experience. If researchers were unreliable in their promise to return with two marshmallows, anyone would soon learn to seize the moment and eat the treat. He illustrated this with an example of lower-class black residents in Trinidad who fared poorly on the test when it was administered by white people, who had a history of breaking their promises. Following this logic, multiple studies over the years have confirmed that people living in poverty or who experience chaotic futures tend to prefer the sure thing now over waiting for a larger reward that might never come. But if this has been known for years, where is the replication crisis?

04/04/2018

In Chapter 8, one of the examples features a study that found that the more "deep talk" people engage in (as measured by the EAR, the Electronically Activated Recorder), the happier they reported being (Mehl et al., 2010) (see Figure 8.1).

The same 2010 study also reported that the amount of engagement in small talk was associated with lower well-being (this result is presented in Figure 8.9).

Now a team of researchers (including many of the original authors) has published a second study with similar methodology (Milek et al., in press). The team collected new data from a larger, more heterogeneous sample of U.S. adults. (The original study included only college students.) The authors used Bayesian analytic techniques, including pooling the new samples with the sample from the 2010 study. You can view a preprint of the report here. It's in press at Psychological Science.

The new paper found confirming evidence for the "deep talk" effect. That is, substantive conversations were linked with greater well-being, with a moderate effect size. But the team did not find evidence for the complementary small talk effect. That is, in the new analysis, the estimate of the small talk effect was not different from zero.
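As a rough intuition for what pooling samples buys you, here is a classical, precision-weighted sketch in Python. (The authors' actual analysis was Bayesian, and the correlations and sample sizes below are invented for illustration.)

```python
import math

def pool_correlations(samples):
    """Fixed-effect pooling of correlations via Fisher's z transformation.
    `samples` is a list of (r, n) pairs; returns the pooled r and a 95% CI."""
    zw = [(math.atanh(r), n - 3) for r, n in samples]  # weight = 1/variance = n - 3
    total_w = sum(w for _, w in zw)
    z_bar = sum(z * w for z, w in zw) / total_w
    se = 1 / math.sqrt(total_w)
    return (math.tanh(z_bar),
            (math.tanh(z_bar - 1.96 * se), math.tanh(z_bar + 1.96 * se)))

# Invented numbers (NOT the paper's): a small original sample and a larger new one.
r_pooled, (lo, hi) = pool_correlations([(0.25, 80), (0.05, 400)])
print(f"pooled r = {r_pooled:.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")
# If the pooled interval includes zero, the data don't support a nonzero effect.
```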

If you teach this example, it's worth updating students: The "deep talk" result in Figure 8.1 has been replicated, but one of the effects in Figure 8.9 (the small talk effect) has not.

08/30/2017

My textbook describes several studies from Dr. Brian Wansink's lab. They make excellent teaching examples because students are able to understand the theory and hypotheses almost immediately and therefore focus on the methodological details. For example, in Chapter 10 of the 2nd and 3rd editions, I feature studies in which pasta was served from either large or small serving bowls (van Kleef, Shimizu, & Wansink, 2012). And in the Supplemental Chapters on statistics, I feature a study involving stale and fresh popcorn serving sizes (Wansink & Kim, 2005).

Instructors and students should know that Wansink's lab has come under intense scrutiny over the past year. First, he was attacked for publishing a (seemingly innocent) blog post admitting to questionable research practices (including HARKing--see Chapter 14 of the 3rd edition). Second, some researchers have alleged that data tables in his papers contain impossible values, which suggests sloppy statistical reporting. Third, he has admitted to using the same wording in more than one publication (sometimes called self-plagiarism) and to publishing some of his data in two places. According to reports, Wansink appears open to checking all past work and publishing corrections as needed. This story in The Chronicle summarizes the issues in a fairly balanced report (from March, 2017), and this story explains the results of an inquiry by Wansink's institution, Cornell University (from April, 2017).

In two of my own classes, I conducted demonstration versions of portion size studies, and I have obtained the predicted pattern, with large effect sizes, both times. In my opinion, the portion size effect is real. However, it's definitely worth telling students about the alleged problems with Wansink's work.
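If you try a demonstration like this in your own class, the effect size is easy to compute. Here is a short Python sketch of Cohen's d, using invented serving amounts (not data from my classes or from Wansink's studies):

```python
import math

def cohens_d(group1, group2):
    """Standardized mean difference using the pooled standard deviation."""
    n1, n2 = len(group1), len(group2)
    m1, m2 = sum(group1) / n1, sum(group2) / n2
    v1 = sum((x - m1) ** 2 for x in group1) / (n1 - 1)
    v2 = sum((x - m2) ** 2 for x in group2) / (n2 - 1)
    pooled_sd = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    return (m1 - m2) / pooled_sd

# Invented grams-served data for a large-bowl vs. small-bowl class demo.
large_bowl = [310, 285, 340, 300, 295, 330]
small_bowl = [260, 245, 300, 265, 280, 250]
print(f"d = {cohens_d(large_bowl, small_bowl):.2f}")  # "large" is d of 0.8 or more
```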

So far, the study in Chapter 10 (van Kleef et al., 2012) has not been identified as problematic. However, the popcorn study (Wansink & Kim, 2005) was alleged to have reported impossible values in a key table; the problems were described as "relatively minor" (Source). I changed that table for Figure S1.7 of the 3rd Edition, deleting the ANOVA values that were found to be problematic. The entire table and discussion will be omitted from the 4th edition. Despite my changes to the table, I think instructors should use Figure S1.7 only as an example of how to read data tables, and not endorse it as a replicable scientific finding.

Update: Dr. Wansink has agreed to resign from Cornell University after six of his articles were retracted by JAMA journals and after his university concluded that he had engaged in academic misconduct. Here's a CNN story on the situation.

08/09/2017

This post isn't a replication update per se. However, this blog post by Daniel Lakens (2017, July 3) challenges an interrupted time series design featured in Chapter 13 (Figure 13.2). The original study, by Danziger et al. (2011), found that judges were more likely to grant parole at the beginning of the day or after a snack break.

Science progresses one study at a time. As scientists conduct research and make the results public, we enable others to build upon, replicate, and critique our work, improving the field and building a body of knowledge. Even the studies in the textbook are not certain "truths," but rather steps on a scientific path, selected at one moment in time.

As the third edition of the textbook explains (Chapter 14), psychologists are investing new energy in improving our field. More than ever, psychologists are conducting replication studies, improving data analysis techniques, and making science open and transparent.

In the same spirit, we've added a new section of the blog called Replication Updates. When I learn that a study featured in the textbook has failed to replicate or has had its conclusions questioned, I will devote a post to the issue so that instructors can stay informed. To read these posts, click the appropriate filter on the blog menu.

I hope that as teachers, we will find ways to include students in thinking about issues of replication, scientific openness, and progress. As we do, let's keep in mind that some (but not all) of the critiques appear in non-peer-reviewed outlets, and discuss this aspect with students as well.

Going forward, I hope readers will let me know when they hear updates on studies featured in the book!

If you’re a research methods instructor or student and would like us to consider your guest post for everydayresearchmethods.com, please contact Dr. Morling. If, as an instructor, you write your own critical thinking questions to accompany the entry, we will credit you as a guest blogger.