January 2018

01/20/2018

Do people with wider faces (left) show more antisocial tendencies than those with narrow faces (right)? Photo: istockphoto

According to several previous studies in psychological science, men with wider faces--a greater ratio of width to height (like in the photo on the left, compared to the right)--tend to show antisocial tendencies such as racial bias, exploitation, and even aggression. Researchers attributed this link to exposure to testosterone during development, which, they say, causes both wider facial structure and antisocial behavior.

Kosinski found that previous studies often had methodological shortcomings, such as small sample sizes. Half of the previous studies he identified involved fewer than 25 participants, and the average sample size was 40. Moreover, in seven out of ten of the studies, the results only just crossed the conventional significance threshold of p = .05.

These factors led Kosinski to conduct a large-scale study of face measurements and behavioral tendencies. His research, published in Psychological Science, finds no relationship between facial width-to-height ratios (fWHR) and behavioral tendencies in a large sample of over 135,000 participants.

Questions

a) Review the material in Chapters 11 and 14, and explain why studies based on small samples can lead to results that are difficult to replicate. (You might also want to review the "kindergarten height" example in this recent blog post).

b) Why is it a problem that, in 7 out of 10 studies, the results "only just crossed the conventional threshold for significance?"

Now read a bit more about the "big data" methods that Kosinski employed in his research:

Kosinski turned to a very large dataset collected via a Facebook app called MyPersonality.org. The app comprised a collection of psychometric tests and surveys that Facebook users could take and then see how they scored — they could also volunteer their scores and Facebook profile data to be used in research projects. Using this bank of over 800,000 users’ surveys and over 2 million profile pictures, Kosinski tested his research question: Do broad faces indicate antisocial tendencies? [...]

After a preliminary experiment with 1,692 users showed that a computer could measure width-to-height ratios with the same accuracy that humans could, Kosinski analyzed 173,241 photos from 137,163 male and female participants (some users had multiple profile pictures and their measurements were averaged before analysis).

The results showed that facial broadness didn’t substantially correlate with any of the 55 personality measures tested. [...] “For example, broader-faced people reported themselves to be more prosocial, sympathetic, trusting, and cooperative,” says Kosinski. “Also, broader-faced people reported less interest in drug use, weapons, piercing, and tattoos. Moreover, broader-faced people did not score significantly higher on any of the traits positively related to antisocial and aggressive behavioral tendencies, including the personality facets of excitement-seeking and anger, impulsiveness, and militarism (i.e., interest in paramilitary groups, the armed forces, bodybuilding, martial arts, and survivalism).”

c) According to this description, Kosinski is basically running a series of bivariate correlations. Each one was between a self-reported trait and _________?

d) Pick one of the personality variables tested in the study. Now sketch a scatterplot of the result, labelling your axes carefully.

e) Kosinski's sample included more than a hundred thousand users. Why might this lead to a more stable estimate of the true relationship between facial broadness and personality? (This is the complement to question a, above.)

f) Kosinski's study is an example of a "failure to replicate." Review the concepts in Table 14.1 and indicate which elements might apply in this case.

g) What questions might you ask about the construct validity of the personality measures used in Kosinski's study?

Suggested answers

a) and e) Small samples are more likely to be affected by one or two extreme scores, whereas in very large samples, the extreme scores are much more likely to be balanced out by other scores. The gifs in this blog post show the principle dynamically.

b) Some researchers have proposed that when a manuscript reports p-values very close to the conventional cutoff of .05 (p-values of .04 or .03), it's a sign that the researcher might have "p-hacked" the study. P-hacking is when a researcher tries a series of options when analyzing the data, such as eliminating outliers, adding covariates, or testing multiple dependent measures, and stops only when p falls just under the .05 threshold. Therefore, when most of the p-values in a body of literature sit just below .05, we might suspect that the underlying finding is a fluke, not a real result.
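A quick back-of-the-envelope sketch in Python shows why trying many analyses inflates the chance of a fluke. (This simplification assumes the candidate analyses are statistically independent, which real analytic options usually are not, so treat the numbers as a rough intuition rather than a precise estimate.)

```python
# Chance of at least one "significant" (p < .05) result when a researcher
# tries k analyses on data with no true effect. Under the null hypothesis,
# each independent analysis has a .05 chance of crossing the threshold,
# so the chance that at least one of k analyses does is 1 - .95^k.
false_positive_rate = {k: 1 - 0.95 ** k for k in (1, 3, 5, 10)}

for k, p_any in false_positive_rate.items():
    print(f"{k:2d} analyses -> {p_any:.0%} chance of a fluke 'finding'")
```

With ten analytic options, the researcher has roughly a 40% chance of stumbling onto at least one "significant" result even when nothing real is there.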

c) Facial broadness, as measured by width-to-height ratio.

d) One axis should be labelled "facial broadness" and the other might be labelled "interest in drug use." The cloud of points should be extremely spread out, showing no pattern or discernible slope.

e) See the answer to a), above.

f) The concepts in Table 14.1 that seem to apply best are the third (the original study's sample was very small) and perhaps the fourth (the original study may have tried multiple statistical analyses). (We cannot be sure without more investigation into the original studies, but these are the two issues raised in the APS summary of Kosinski's work.)

g) Indeed, we don't know much about the personality measures used in the study. The full manuscript might report more about whether data collected with these personality measures shows that they are reliable and valid.

01/10/2018

When the general public critiques research, I often hear them say that the samples are "too small." It's true that sample sizes (N) in psychology research should be large. One outcome of the so-called "replication crisis" is that large samples have become more and more important in psychology. But why?

A common misconception--held by both students and the general public--is that large samples are important because they ensure external validity. That's incorrect. External validity (that is, the ability to generalize from a sample to a population of interest) depends on how a sample has been recruited, not how many people are in it (see Chapters 7 and 14). For example, say you recruited a sample of 1,000 fans attending the national championship college football game. You'd have a pretty large sample, but you couldn't generalize from that sample to, say, college students in the U.S. In fact, unless the 1,000 fans were selected at random from the 70,000 fans at the game, you couldn't even generalize from this sample to "people attending the national championship football game."

If not external validity, why are large samples important? It's about accuracy of our statistical estimates. When estimating values in the population such as means or differences between means, large samples are less likely to be influenced by chance variability. For example, imagine you're estimating the mean height of kindergarteners in your local school. Now imagine that you select 5 kindergarteners at random, one of whom, by chance, turns out to be extremely tall for her age. That tall kindergartener is going to "pull" the mean estimate upwards when combined with only 4 other kids. But what if you select 25 kindergarteners instead? Now the tall kindergartener is going to be balanced out by 24 other scores, and her height will have less influence on the mean estimate.

Below is a pair of animations that illustrate this principle. They come from the data science blog R Explorations, whose author used the program R to run a simulation study over and over. First, they created a very large population of scores whose mean was known to be 10.0 and whose standard deviation was known to be 1.0. Then they asked the computer to repeatedly draw a random sample (of size 10 in the top animation and 1,000 in the bottom one), compute the mean of the sample's scores, and plot them. You can watch the samples appear in real time in the animations below. Here, xbar is the sample's mean and s is the sample's standard deviation. The red line represents the mean for each sample as it is drawn:
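The same principle can be sketched numerically. The short Python simulation below (a stand-in for the blog's R code; the population mean of 10.0 and standard deviation of 1.0 match the simulation described above) draws many samples of size 10 and many of size 1,000 and compares how much the sample means bounce around:

```python
import random
import statistics

random.seed(42)  # fixed seed so the simulation is reproducible

def sample_means(n, draws=1000, mu=10.0, sigma=1.0):
    """Draw `draws` random samples of size n from a normal population
    with mean mu and standard deviation sigma, and return the list of
    sample means (the xbar values)."""
    return [statistics.mean(random.gauss(mu, sigma) for _ in range(n))
            for _ in range(draws)]

small = sample_means(10)    # like the top animation (N = 10)
large = sample_means(1000)  # like the bottom animation (N = 1000)

# The means of small samples vary far more from draw to draw than the
# means of large samples do:
print(statistics.stdev(small))  # roughly sigma / sqrt(10), about 0.32
print(statistics.stdev(large))  # roughly sigma / sqrt(1000), about 0.03
```

The spread of the xbar values across draws is the standard error of the mean; with N = 1,000 it is about one-tenth the size it is with N = 10, which is why the red line barely moves in the bottom animation.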

Questions

a) First, watch the top animation, where N = 10. What do you notice about the movement of the vertical red line representing the mean in the top animation? What is it doing, and what does that represent?

b) Now watch the bottom animation, where N = 1000. What do you notice about the movement of the vertical red line representing the mean in this second animation? What is it doing, and what does that represent?

c) What do you notice about the s values of the two animations? Which animation has a steadier estimate of s?

d) Answer this one only if you've had a statistics course: Which of the two animations will have a smaller standard error? How is the standard error represented in the two animations?

e) Given the behavior of the two animations, explain why a large sample is important for research.

f) Which validity does sample size best address, if not external validity?

g) Let's tie this concept back to the "replication crisis" (or, as some are now calling it, "credibility revolution"*). When a finding in psychology has not replicated in a direct replication study, one reason might be that the original study used a small sample. Another reason might be that the replication study used a small sample. Why might the sample size of a study be linked to its replicability? Explain in your own words.

If you’re a research methods instructor or student and would like us to consider your guest post for everydayresearchmethods.com, please contact Dr. Morling. If, as an instructor, you write your own critical thinking questions to accompany the entry, we will credit you as a guest blogger.