Monday, January 22, 2018

In Study 1 of this article, we
introduced N = 54 men to both a White
and a Black female confederate in two separate face-to-face interactions. These
two confederates – we’ll call them “Hannah” and “Kiara” (not their real names)
– played their roles superbly and never forgot their lines. The study was a model
of experimental control.

But the inferences I drew from these data were incorrect
because of a statistical issue I did not appreciate at the time.

How would you label this pair of "conditions"?

What we found was this: The men in our study (all of them
White) tended to like the White confederate to the extent that they were
politically conservative, but the men liked the Black confederate to the extent
that they were liberal. I drew the inference that political orientation was
associated with whether the men were attracted to members of their racial ingroup
(i.e., the White partner) or outgroup (i.e., the Black partner).

But a logically equivalent description of these results
reveals my inferential overreach: The men in our study liked Hannah more to the
extent that they were politically conservative, but they liked Kiara more to
the extent that they were liberal. The results might have been attributable to
the women’s race…or to any of the other myriad differences between these two particular
women.[1]

This is why you sample stimuli as well as participants. Arguably, my sample size was not N = 54 (the number of participants), but
N = 2 (the number of stimuli).

===============

The above example may seem pretty straightforward to you,
but the same issue frequently turns up in subtler—but equally problematic—forms.
Let’s say I hypothesize that attractiveness inspires romantic desire more for
men than for women in a face-to-face, heterosexual interaction. This makes
intuitive sense…anecdotally, men seem to talk more about how hot women are than
vice versa. Perhaps surprisingly, then, this sex difference does not emerge in
speed-dating contexts where people meet a slew of opposite-sex partners who
naturally vary in attractiveness (see here and direct
replication here). But maybe it would emerge with a manipulation of attractiveness: If men
and women each met an attractive and an unattractive partner, maybe this within-subjects
attractiveness manipulation would inspire romantic desire more for men than for
women?

From Li et al. (2013). Each bar was
generated by 42-51 raters but only 2 targets.

Here’s
a study that used exactly this approach to test the hypothesis that attractiveness
will matter more for inspiring romantic desire in men than in women. It seems to find—and is frequently cited as
showing—evidence for the hypothesized sex difference: In the figure on the right,
one can clearly see that men differentiated the attractive and unattractive
confederates much more strongly than women did.

But notice that this study has
the same serious flaw that I described above with my confederate study. To see
why, let’s once again use (fake) names: The men desired Rachel and Sally much
more than Amanda and Liz, whereas women desired Brian and Karl just a bit more
than James and Dan. The results certainly tell us something about the
desirability of these particular confederates. But with such a small N (only 2
confederates per condition), we cannot generalize these findings to say
anything meaningful about attractive and
unattractive targets in general.

What is the N of this design: 93 or 8?

The problem here is that stimuli (in this case, confederates)
are nested
within condition, just like participants are nested within condition in a
between-subjects design. In order to generalize our results beyond the specific
people who happen to be in our sample, we have to treat participant as a random
factor in our designs. The same logic applies to stimuli: When they are nested
within condition, we need to treat stimuli (e.g., confederates) as random
factors because we want to generalize beyond the 2 or 4 or 8 confederates who
happened to be part of our study.
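
To make the consequence concrete, here is a minimal simulation sketch (not taken from any of the studies discussed here). It assumes a hypothetical design in which every participant rates every stimulus, the true effect of condition is exactly zero, and each stimulus has its own random baseline appeal. The function name, the parameter values, and the approximate critical value of 2.01 (two-tailed, 49 degrees of freedom) are illustrative assumptions. The analysis is a paired t-test across participants that ignores stimulus variability.

```python
import math
import random
import statistics


def simulate_false_positive_rate(n_participants=50, stimuli_per_condition=2,
                                 stimulus_sd=0.5, noise_sd=1.0,
                                 n_sims=500, t_crit=2.01, seed=1):
    """Share of null simulations in which a by-participant paired t-test
    declares a 'condition effect' even though none exists.

    t_crit=2.01 is only appropriate for about 50 participants (df = 49).
    """
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(n_sims):
        # Hypothetical stimuli: each has its own baseline appeal, and the
        # two conditions do NOT differ on average.
        cond_a = [rng.gauss(0, stimulus_sd) for _ in range(stimuli_per_condition)]
        cond_b = [rng.gauss(0, stimulus_sd) for _ in range(stimuli_per_condition)]
        diffs = []
        for _ in range(n_participants):
            # Each participant rates every stimulus with independent noise.
            mean_a = statistics.fmean([s + rng.gauss(0, noise_sd) for s in cond_a])
            mean_b = statistics.fmean([s + rng.gauss(0, noise_sd) for s in cond_b])
            diffs.append(mean_a - mean_b)
        # Paired t-test across participants; stimuli are treated as fixed.
        se = statistics.stdev(diffs) / math.sqrt(n_participants)
        if abs(statistics.fmean(diffs) / se) > t_crit:
            false_positives += 1
    return false_positives / n_sims


if __name__ == "__main__":
    print("stimuli vary:   ", simulate_false_positive_rate(stimulus_sd=0.5))
    print("stimuli do not: ", simulate_false_positive_rate(stimulus_sd=0.0))
```

Under these assumed values, the by-participant test with 2 stimuli per condition rejects the null far more often than the nominal 5%. Setting `stimulus_sd=0.0` (stimuli that are interchangeable) brings the rate back to roughly 5%.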

What happens if you regularly equate confederate with
condition and use small samples of stimuli? Your effect size estimates will tend
to be extremely unstable. Consider this
study, which used N = 389 participants but only 10 male and 11 female
confederates. They found an enormous sex difference in the opposite direction from the study
described above: Confederate attractiveness affected women’s romantic
desire much more strongly than men’s. If you were including this study in a meta-analysis, it
would be more appropriate to assign it an N of 21 rather than 389 to reflect the
imprecision of this particular sex-difference estimate.
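
A rough way to see where the instability comes from is to work out the standard error of a condition difference under a deliberately simplified model. The model is an assumption, not the analysis used in any of the studies above: every participant rates every stimulus, stimuli are random draws with standard deviation `stimulus_sd`, rating noise has standard deviation `noise_sd`, and there are no participant-by-condition differences. The function name and default values are hypothetical.

```python
import math


def se_of_condition_difference(n_participants, stimuli_per_condition,
                               stimulus_sd=0.5, noise_sd=1.0):
    """Standard error of (mean of condition A) - (mean of condition B)
    when stimuli are sampled at random within each condition."""
    k = stimuli_per_condition
    # Variance contributed by which stimuli happened to be sampled.
    # Adding participants does nothing to shrink this term.
    stimulus_var = 2 * stimulus_sd ** 2 / k
    # Variance contributed by rating noise, which shrinks with both
    # more participants and more stimuli.
    noise_var = 2 * noise_sd ** 2 / (k * n_participants)
    return math.sqrt(stimulus_var + noise_var)


if __name__ == "__main__":
    print(f"{se_of_condition_difference(50, 2):.2f}")    # about 0.52
    print(f"{se_of_condition_difference(5000, 2):.2f}")  # about 0.50
    print(f"{se_of_condition_difference(50, 50):.2f}")   # about 0.10
```

With these assumed values, going from 50 to 5,000 participants barely moves the standard error when there are only 2 stimuli per condition, whereas going from 2 to 50 stimuli per condition cuts it to about a fifth.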

So what to do? Power calculations with these designs are complex, but
a good start would be to use at least N = 40 or 50 stimuli per condition and
treat stimuli as a random factor. Then, any incidental differences between the experimental
stimuli would likely wash out, and we could be reasonably confident that any effects
of the “manipulation” were truly due to attractiveness. Yes, that’s probably
too many stimuli for a study involving live confederates, so you may need to
get creative—for example, many speed-dating studies provide this kind of
statistical power. [2]
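
One simple way to let the number of stimuli drive the inference is a by-stimulus analysis: average the ratings for each stimulus, then compare conditions using one data point per stimulus. The sketch below is only an illustration of that idea, not a substitute for a mixed model with crossed random effects for participants and stimuli. The function name and the toy ratings are hypothetical.

```python
import math
import statistics


def by_stimulus_t(ratings_by_stimulus_a, ratings_by_stimulus_b):
    """Pooled-variance t-test that uses one mean per stimulus.

    Each argument is a list of lists: one inner list of ratings per
    stimulus. Returns (t, degrees_of_freedom), where the degrees of
    freedom depend on the number of stimuli, not of participants.
    """
    means_a = [statistics.fmean(r) for r in ratings_by_stimulus_a]
    means_b = [statistics.fmean(r) for r in ratings_by_stimulus_b]
    ka, kb = len(means_a), len(means_b)
    df = ka + kb - 2
    pooled_var = ((ka - 1) * statistics.variance(means_a)
                  + (kb - 1) * statistics.variance(means_b)) / df
    se = math.sqrt(pooled_var * (1 / ka + 1 / kb))
    t = (statistics.fmean(means_a) - statistics.fmean(means_b)) / se
    return t, df


if __name__ == "__main__":
    # Toy data: 2 stimuli per condition, 2 ratings per stimulus.
    attractive = [[5, 7], [6, 8]]
    unattractive = [[2, 4], [3, 5]]
    t, df = by_stimulus_t(attractive, unattractive)
    print(f"t({df}) = {t:.2f}")
```

In the toy data, the condition means differ by 3 scale points, yet with only 2 stimuli per condition the test has 2 degrees of freedom, and t of about 4.24 falls just short of the two-tailed .05 critical value of about 4.30.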

It’s easy to get tripped up by this issue, especially when
you have confederates that you’ve carefully selected to differ in an obvious
way. But don’t make the mistake. If a confederate is nested within condition in
your design, you likely need to reconsider your design.

[1]
Study 2 of the same paper replicated this interaction using N = 2,781 White participants
and N = 24,124 White and Black targets, which allows us to have more confidence
in the inference that this interaction is about race rather than peculiarities
of particular stimuli. Nevertheless, I assure you that at the time, I would
have tried to publish the two-confederate study on its own had I not had access
to this larger Study 2 sample.

[2]
Alternatively, you could manipulate the attractiveness of a single confederate
(e.g., using makeup and clothing); at least one study has successfully done so
(see Figure 1 here),
although we have found executing such a manipulation to be challenging in our
lab.

Tuesday, January 9, 2018

Long ago and far away, in Chicago, in 2006, I submitted one of my first papers as a graduate student. The topic was controversial, and so we were not particularly surprised, when the reviews came back, to see that the reviewers were skeptical of the conclusions we drew from our findings. They wanted more (as JPSP reviewers often do). They thought maybe we had overlooked a moderator or two…in fact, they could think of a whole laundry list of moderators that might produce the effect they thought we should have found in our data. So we ran 1,497 additional tests.

No, seriously. We counted. 1,497 post-hoc analyses to make sure that we hadn’t somehow overlooked the tests that would support Perspective X. We conducted them all and described them in the article (but there was still no systematic evidence for Perspective X).

If your work involves controversy, you’ve probably experienced something like this. It’s been standard operating procedure, at least in some areas of psychology.

Now, fast forward to 2017. My student Leigh Smith and I are about to launch a new study in the same controversial topic area, and it’s likely that we’ll get results that someone doesn’t like, one way or another. But this time, before we start conducting the study, we write up an analysis plan and submit it to Comprehensive Results in Social Psychology (CRSP), which specializes in registered reports. The analysis plan goes out for review, and reviewers—who have the luxury of not knowing whether the data will support Perspective X or Y or Z—thoughtfully recommend a small handful of additional analyses that could shed better light on the research question.

The analysis plan that emerges is one that everyone agrees should offer the best test of the hypotheses; importantly, the tests will be meaningful however they turn out. We run the study and report the tests. We submit the paper.

And then, instead of getting a decision letter back asking for 1,497 additional analyses that someone thought would surely show support for Perspective X…the paper is simply published. The data get to stand as they are, with no poking and prodding to try to make them say something else.

There’s a lot to like about this brave new world.

Our new paper in CRSP addresses whether attractiveness (as depicted in photographs of opposite-sex partners) is more appealing to men than to women. I, like most other evolutionary psychologists, had always assumed that the answer to this question was “yes.”

But you know what? Those prior studies finding that sex difference in photograph contexts? Most of them were badly underpowered by today’s standards. Our CRSP paper used a sample that was powered to detect whether the sex difference was q = .10 (i.e., a small effect) or larger (using a sample of N = ~1,200 participants and ~600 photographs). These photographs came from the Chicago Face Database, and we used the ratings in the database of the attractiveness of each face (based on a sample of independent raters).

The paper has two take-home lessons that are relevant to the broader discussion of best practices:

Is attractiveness more appealing to men than to women when people look at photographs? Yes, although the effect is quite small, and there's little evidence of hidden moderators.

1. Even though prior studies of this sex difference were underpowered, the sex difference was there in our new study: r(Men) = .41, r(Women) = .28, q = .13, 95% CI (.08, .18). There is no chance that the prior studies were powered to find a sex difference as small as what we found. But it was hiding in there, nevertheless.[1]

Lesson #1: Perhaps weakly powered studies in the published literature can still manage to converge on truth. At least, perhaps this happens in cases where the presence or absence of p < .05 is/was not a hard criterion for publication. Sex differences might be one such example. (Still no substitute for a high-powered, direct test, of course.)

2. In this literature, scholars have posited many moderators in an attempt to explain why some studies show sex differences and some do not. For example, sex differences in the appeal of attractiveness are supposed to be bigger when people imagine a serious relationship, or when people evaluate potential partners in the low-to-moderate range of attractiveness. Sometimes, sex differences are only supposed to emerge when 2 or 3 or 4 moderators combine, like the Moderator Avengers or something. That wasn’t the case here: These purported moderators did not alter the size of the sex difference in the predicted manner, whether alone or in Avenger-mode combination.

Lesson #2: Perhaps we should be extremely skeptical of moderators that are hypothesized, frequently post hoc, to explain why Study X shows a significant finding but Study Y does not. Moderators within study? I’m on board. Moderators across studies? I’ll believe it when I see it meta-analytically.

For every single research question I dream up going forward, I will consider whether it could be a good candidate for a registered report. When I think about an idealized, all-caps form of SCIENCE that stays untethered from prior perspectives or ideology, that CRSP experience pretty much captures it. [2]

Notes:

[1] This statement may shock some who think of me as some sort of sex-differences naysayer. Rather, my perspective is that this sex difference is larger in photograph contexts than live face-to-face contexts. Indeed, q = .13 is about 2-4 times larger than meta-analytic estimates of the same sex difference in initial attraction contexts or established close relationships (which are q = .05 or smaller). (Does it make me a naysayer to suggest that the sex differences here are extremely small, and that prior single studies are unlikely to have been powered to detect them?)

[2] And did I mention fast? This project went from “vague idea” to “in press” in less than 11 months. My prior best time for an empirical piece was probably twice as long.