Chapter 07: Sampling

05/20/2018

Collecting accurate data in a poll is a difficult business. Many of us focus on the sampling strategies of polling organizations, and rightly so: External validity depends on whether we include "cell phone only" numbers in our samples, how we account for different rates of responding across certain groups, and so on. However, this post reminds us that question wording also matters. Construct validity is just as important for polls.

In this report, Pew Research Center describes how people's opinions change depending on how a question is posed. The poll asked Americans whether we should increase the size of the House of Representatives. People in the poll were randomly assigned to hear either the original question, or the same question with additional context. Notice how people's support changes when additional context is added:

e) Why might Democrats be more influenced by the contextual information provided?

f) This could be viewed as a 2 x 2, IV x PV factorial design, in which the IV is "question wording" (original or added context) and the PV is "political affiliation" (Democrat or Republican). If you've studied Chapter 12, perhaps you can identify whether there are main effects and interactions in this pattern of data.

The same Pew report contains additional examples of how question wording alters responses to a question about the Senate, and shows how different subgroups respond to questions about the Electoral College.

01/10/2018

When the general public critiques research, I often hear them say that the samples are "too small." It's true that sample sizes (N) in psychology research should be large. Indeed, one outcome of the so-called "replication crisis" is that large samples have become increasingly important in psychology. But why?

A common misconception--held by students and the general public alike--is that large samples are important because they ensure external validity. They don't. External validity (that is, the ability to generalize from a sample to a population of interest) depends on how a sample was recruited, not on how many people are in it (see Chapters 7 and 14). For example, say you recruited a sample of 1,000 fans attending the national championship college football game. You'd have a pretty large sample, but you couldn't generalize from that sample to, say, college students in the U.S. In fact, unless the 1,000 fans were selected at random from the 70,000 fans at the game, you couldn't even generalize from this sample to "people attending the national championship football game."

If not external validity, why are large samples important? It's about the accuracy of our statistical estimates. When we estimate population values such as means or differences between means, estimates from large samples are less influenced by chance variability. For example, imagine you're estimating the mean height of kindergarteners in your local school. Now imagine that you select 5 kindergarteners at random, one of whom, by chance, turns out to be extremely tall for her age. That tall kindergartener is going to "pull" the mean estimate upward when combined with only 4 other kids. But what if you select 25 kindergarteners instead? Now the tall kindergartener is balanced out by 24 other scores, and her height will have less influence on the mean estimate.
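To make the arithmetic concrete, here's a minimal sketch in Python (the heights are hypothetical, chosen only for illustration):

```python
# Hypothetical heights in inches; 50 is unusually tall for a kindergartener.
heights_small = [42, 43, 44, 43, 50]              # n = 5, one tall child
heights_large = [42, 43, 44, 43, 50] + [43] * 20  # n = 25, same tall child

mean_small = sum(heights_small) / len(heights_small)
mean_large = sum(heights_large) / len(heights_large)

print(mean_small)  # 44.4  -- the tall child pulls the mean up noticeably
print(mean_large)  # 43.28 -- her influence is diluted by 24 other scores
```

The single tall child moves the 5-person mean well above the typical 43 inches, but barely budges the 25-person mean.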

Below is a pair of animations that illustrates this principle. They come from the data science blog R Explorations, which used the program R to run a simulation study over and over. First, the blogger created a very large population of scores whose mean was known to be 10.0 and whose standard deviation was known to be 1.0. Then they asked the computer to repeatedly draw a random sample (of size 10 in the top animation and size 1,000 in the bottom one), compute the mean of the scores in each sample, and plot it. You can watch the samples appear in real time in the animations below. Here, xbar is the sample's mean and s is the sample's standard deviation. The red line represents the mean of each sample as it is drawn:
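If you'd like to recreate the idea yourself, here is a rough Python translation of the simulation described above (the original was written in R; the population size and number of draws here are my own assumptions, not the blogger's):

```python
import random
import statistics

random.seed(42)  # for reproducibility

# A large simulated population with mean ~10.0 and standard deviation ~1.0
population = [random.gauss(10.0, 1.0) for _ in range(100_000)]

def sample_means(n, draws=1000):
    """Draw `draws` random samples of size n; return each sample's mean."""
    return [statistics.mean(random.sample(population, n)) for _ in range(draws)]

means_small = sample_means(10)    # like the top animation (N = 10)
means_large = sample_means(1000)  # like the bottom animation (N = 1000)

# Means from small samples bounce around far more than means from large ones:
# their spread is roughly 1.0 / sqrt(N) (the standard error).
print(round(statistics.stdev(means_small), 2))  # roughly 0.32
print(round(statistics.stdev(means_large), 2))  # roughly 0.03
```

The red line in the top animation jumps around because each N = 10 mean can land far from 10.0; the bottom line barely moves because each N = 1,000 mean lands very close to 10.0.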

Questions

a) First, watch the top animation, where N = 10. What do you notice about the movement of the vertical red line representing the mean in the top animation? What is it doing, and what does that represent?

b) Now watch the bottom animation, where N = 1000. What do you notice about the movement of the vertical red line representing the mean in this second animation? What is it doing, and what does that represent?

c) What do you notice about the s values of the two animations? Which animation has a steadier estimate of s?

d) Answer this one only if you've had a statistics course: Which of the two animations will have a smaller standard error? How is the standard error represented in the two animations?

e) Given the behavior of the two animations, explain why a large sample is important for research.

f) Which validity does sample size best address, if not external validity?

g) Let's tie this concept back to the "replication crisis" (or, as some are now calling it, "credibility revolution"*). When a finding in psychology has not replicated in a direct replication study, one reason might be that the original study used a small sample. Another reason might be that the replication study used a small sample. Why might the sample size of a study be linked to its replicability? Explain in your own words.

08/15/2017

Pew Research is my favorite polling resource, partly because they ask such interesting questions, and partly because they are so transparent about sharing their methodology. (For examples, see their Methods page or click on the full Report Materials for a study they did on gun ownership in America.) They make their sampling techniques and question wording easily available.

Now Pew has shared a video that explains how a sample of 1,000 can be used to draw inferences about the population. Instructors: Save this one for when you teach Chapter 7!

12/10/2016

The researchers asked students whether this element was probably linked to real news or fake news. What's the clue that tells you this story is probably not "real news"?

Fake news is in the (real) news lately. Whether you're looking at Facebook, Buzzfeed, or your online newspaper, companies may try to clickbait you into reading a story that's false. Companies may want you to read the story so that you'll be exposed to their advertising. Or a political group may want to persuade you of an extreme opinion. In some recent cases, people have read fake news stories, believed them, and then acted according to what they thought was true (here's an example).

How often do people mistake fake news for real news?

A team at Stanford University recently attempted to measure the problem in a large sample of high school students. The results of their study were summarized by the Wall Street Journal. The journalist from the WSJ reported the following:

...82% of middle-schoolers couldn’t distinguish between an ad labeled “sponsored content” and a real news story on a website, according to a Stanford University study of 7,804 students from middle school through college. The study, set for release Tuesday, is the biggest so far on how teens evaluate information they find online.

The study apparently showed students several examples, asking them for each one if it was a real story or fake news. You'll see an example of one of their study's stimuli in the photo to the left. You can see the other samples in the full report from Stanford's website (scroll to p. 9).

Here are some more results, reported by the WSJ:

More than two out of three middle-schoolers couldn’t see any valid reason to mistrust a post written by a bank executive arguing that young adults need more financial-planning help. And nearly four in 10 high-school students believed, based on the headline, that a photo of deformed daisies on a photo-sharing site provided strong evidence of toxic conditions near the Fukushima Daiichi nuclear plant in Japan, even though no source or location was given for the photo.

a) What kind of claim is it to say that "82% of middle-schoolers couldn’t distinguish between an ad labeled “sponsored content” and a real news story"? (Frequency, association, or cause?) What is (are) the variable(s) in the claim?

b) In order to claim that "82% of middle schoolers" do something, you'd probably need to be sure that the study included a generalizable sample of middle schoolers. What are some ways the researchers could have obtained an externally valid sample?

c) For a frequency claim like this one, construct validity is also important. The construct validity of the Stanford study seems excellent, because the researchers asked students questions about realistic-looking mockups of online content. Reading back through the green quotes above, you'll see three different ways they measured the variable, "knowing when news is fake." What are the three ways?

I can't help but point out that in your research methods class, you will learn several media literacy skills. You're learning that journalists might not always get the details of a scientific study right--they might not even read the original article! Journalists might slap a causal claim on a correlational study. Or they might write a sensational story about a single study without reviewing the entire literature on a topic. Being a good consumer of information means you'll be able to critically evaluate media stories about science (and other topics, too).

02/20/2015

What construct and external validity considerations affect Yelp ratings, like this 4-star review of a restaurant near my campus? Screenshot from Yelp website (by author)

In this interesting piece, Slate writer Will Oremus asks why the top-rated restaurants on Yelp are places that "nobody has ever heard of." He explains that the top 10 rated restaurants on Yelp change from year to year, and furthermore, they tend to be places like Copper Top BBQ of CA and Art of Flavors of Las Vegas. The top 100 are not big, famous restaurants--in fact, they tend to be simple, local, even touristy places with "styrofoam and paper plates."

Oremus points out that Yelp ratings are subject to biases that "are quite different from the ones we’re used to" in other ratings websites. Are these biases of the external validity variety? Or the construct validity variety? I'll quote some passages from Oremus's Slate article for your consideration.

Here's Oremus's first observation:

I have not eaten at Copper Top, but I have little doubt that these restaurants also share a consistently high quality of food for the money. On Yelp, that’s usually a recipe for a four-star rating. Compared to professional critics, Yelp reviewers skew young and budget-conscious, which is part of the site’s appeal. By and large, they’re happier paying $8 for a very good burrito than $23 for a fancy one, and the ratings reflect that.

a. What kind of validity is this conclusion directed at--external or construct? Can you say anything specific about this type of validity, as used by this journalist?

Here's another observation from Oremus's article.

Part of the explanation lies in the distribution of ratings on the site’s five-star scale. Only a handful of restaurants in the world rate three Michelin stars. But more than 40 percent of all Yelp reviews are perfect scores, suggesting that five stars on Yelp entails satisfaction rather than perfection. Average hundreds of reviews of the same establishment, and you’ll find that its overall rating is influenced far more by the number of dissatisfied customers than by how much the five-star reviewers loved it. The best-rated restaurants on Yelp, then, are not so much the most loved as the least hated.
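Oremus's point about averaging can be illustrated with a quick numerical sketch (the two rating distributions below are invented for illustration):

```python
import statistics

# Invented rating distributions on Yelp's 1-5 star scale. Because most
# reviews cluster at 5 stars, the mean is driven mainly by how many
# *low* ratings a restaurant collects, not by how much its fans loved it.
loved_but_polarizing = [5] * 80 + [1] * 20  # adored by most, hated by some
merely_consistent = [5] * 50 + [4] * 50     # never amazing, never bad

print(statistics.mean(loved_but_polarizing))  # 4.2
print(statistics.mean(merely_consistent))     # 4.5
```

The consistently pleasant place beats the widely adored but polarizing one: "least hated" wins the average.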

b. What kind of validity is this conclusion directed at--external or construct? Can you say anything specific about this type of validity, as used by this journalist?

Oremus also points out the power of incidental influences, such as neighborhood and weather, on Yelp reviews. To wit:

...researchers at Georgia Tech and Yahoo Labs found that online restaurant reviews are significantly influenced by at least three factors that have nothing to do with the operation of the business:

Neighborhood demographics: Restaurants in neighborhoods with high education levels don’t get better reviews, but they do get more reviews. That matters, because Yelp’s top-100 rankings are based not only on average ratings ... so a place with 100 five-star reviews will rank higher than one with 50.

Time of year: Restaurants get more reviews in July and August than they do in the winter, but the average ratings in the summer months are lower.

Weather: One of the strongest exogenous effects on restaurant ratings, according to the study, is the weather at the time of the review. As you might guess, warm temperatures and sunshine mean higher reviews. Cold temperatures or extreme heat mean lower reviews, as does precipitation of any kind. The researchers attribute this to weather’s well-documented effects on mood and memory.

c. What kind of validity is the above research concerned with?

d. How might this information affect your own use of Yelp in the future? Think of two possible ways you can use this information.

Suggested Answers

a. When Oremus writes that Yelp's users tend to be younger and budget-conscious, he's describing how the sample of people who choose to post on Yelp is biased. This is an external validity point. One could say that the sample of Yelp reviews is self-selected, and is biased toward restaurants that are cheaper. Cheaper restaurants may get more reviews on Yelp than more expensive ones; in addition, cheaper restaurants may be rated more positively, all because of this sampling bias. Therefore, ratings on Yelp might not generalize to how older people would evaluate the same restaurants. The situation seems similar to conducting an opinion poll and including (or not) cell-phone only households.

b. This is a construct validity point. People tend to use the 5-star end of the Yelp scale the most, he says. This means that people have a "yea-saying" bias in Yelp reviews (to use a Chapter 6 term). As a result, it might be hard to decide if a positive review on Yelp is truly good, or if people just tend to like everything!

Another point Oremus made is that a high Yelp rating means the restaurant is more consistent--not necessarily more delicious. That again is a construct validity point.

c. The weather bias suggests problems with construct validity. We might not know if a positive rating reflects the quality of the restaurant (the construct in question) or the type of weather outside! However, the "neighborhood bias" seems to be an external validity issue--restaurants in highly educated neighborhoods get more reviews, so this is a sampling bias.

d. Answers will vary. All in all, it seems that these construct and external validity issues mean that Yelp reviews are unlikely to parallel what a professional restaurant critic would say. Does that affect your restaurant behavior, or not?

Thanks again to Carrie Smith of Ole Miss, who, as usual, is a fount of bloggable Slate pieces!

a) This figure provides a good opportunity to interpret a graph. Study the figure carefully. The figure seems to be illustrating two main points--one about time, and one about age. What are these two points?

b) This figure, as well, seems to be illustrating two main points. What are the two main takeaway messages from this figure? (Hint: One takeaway message is the same here as in Question a.)

Given the data presented in this figure, Pew Research reports that it will call a higher proportion of cell-phone only households this year. Here is a quote from their website:

To keep pace with this rapid trend, the Pew Research Center will increase the percentage of respondents interviewed on cellphones in its typical national telephone surveys to 65%; 35% of interviews will be conducted by landline. Last year, we increased the ratio to 60% cellphone, with 40% conducted on landline. Back in 2008, when we first started routinely including cellphones in our phone surveys, just one-fourth (25%) of all interviews were done by cellphone.

c) Remind yourself that external validity (through generalizable sampling techniques) is especially important for frequency claims. Give two or three examples of research questions that fit this kind of claim.

d) Explain why it is important for polling organizations to know which households are cell-phone only and which are not. What specific kinds of political and social questions might be especially affected by cell-phone only proportions, given the data presented above?

e) Here's an interesting question: How might an organization like Pew Research obtain an accurate estimate of the number of cell-phone only households in the first place? What kind of sample would you need to get this estimate? How would you contact this sample?

Finally, here's an interesting description from Pew Research, discussing whether it would be smart to conduct polls only on cell phones, not land lines:

The question naturally arises: Why not interview everyone on a cellphone? In fact, at least one major national survey is going to do just that. The Surveys of Consumers, conducted by the University of Michigan, will begin calling only cellphones this month.

But we are not ready to make that change just yet, for at least two reasons. One is that there remains a small share of the public that is not reachable by cellphone. In the newly released data from the National Center for Health Statistics, 7% of adults live in households with a landline phone but no wireless phone. In addition, some people with landlines and cellphones may turn their cellphones on only to make calls or when they are expecting to be called. If these kinds of people are demographically different from those who are more easily reached on a cellphone, then the resulting sample will be less representative of the full population.

Despite eating fast food fairly often, Americans do not think it is good for you.

76% in the U.S. think the food served in fast-food restaurants is "not too good" or "not good at all for you," the same percentage who said so in 2003.

Gallup's analysis of their polling data shows the percentage of different ethnic and gender groups that report eating fast food regularly. For example, they broke the results down by income, reporting that:

But fast food is hardly the province solely of those with lower incomes; in fact, wealthier Americans -- those earning $75,000 a year or more -- are more likely to eat it at least weekly (51%) than are lower-income groups. Those earning the least actually are the least likely to eat fast food weekly -- 39% of Americans earning less than $20,000 a year do so.

When you analyze the methodological strengths of a set of data like these, you should focus mainly on the construct and external validity of the study.

a) What questions could you ask about the construct validity of this poll? What do you think about the construct validity of the different questions?

b) Do you think the people they called all think about "fast food" in the same way? Think about what you'd call fast food (Does the local pizza place count? Does Subway? Does Burger King?) Then ask a couple of friends what they think fast food is. Do you all agree? What might that mean about the results of Gallup's survey?

c) What questions could you ask about the external validity of the poll? Gallup includes a section on Survey Methods on the bottom of the story. Read this section--you may not understand all of it, but see what you can. What do you think about the sampling? What kind of sampling did they use? Can they use their sample to generalize to the population of Americans? Why or why not?

06/10/2013

This recent piece in USNews describes the results of a poll about how young parents feel about their kids' use of smartphones and tablets.

Let's use this story to practice what you are learning about surveys and polls.

Here's one excerpt from the journalist's story.

Surveying more than 2,300 parents of children up to age 8, researchers from Northwestern University found that the vast majority -- 78 percent -- report that their children's media use is not a source of family conflict, and 59 percent said they aren't concerned their kids will become addicted to new media.

a. Will this sample of 2,300 parents be able to generalize to American parents overall? What do you need to know? (and, what kind of validity are we asking about here?)

b. Based on the report above, what kind of question do you think the researchers asked--a forced-choice question? An open-ended one?

Here's another excerpt:

"We asked parents what their challenges were as the parents of young children . . . and sometimes media was never mentioned," said study author Ellen Wartella, director of Northwestern's Center on Media and Human Development. "Parents of children this age are concerned about their health, safety, nutrition and exercise, and media concerns are much lower down the list. That was a surprise."

c. Based on this information, do you think the "challenges" question was closed-ended or open-ended? Do you think they used a construct-valid way to ask this kind of question?

Finally, here's one more part of the story:

The notion that parents are apt to shush their kids by handing them a smartphone or tablet also appears to be false, according to results. To keep their children quietly occupied, moms and dads said they were more apt to turn to toys or activities (88 percent), books (79 percent) or TV (78 percent). Of parents with smartphones or iPads, only 37 percent reported being somewhat or very likely to turn to those devices.

A reader of this news article submitted the following comment:

Parents never tell the truth in situations like this. Either they hate to admit that they shut little junior up by putting him in front of a phone, or they don't even realize how much they do it. Look around you in restaurants and just watch how many parents with small children let them sit there playing with mom's phone or a Nintendo DS.

d. Which validity is the reader criticizing? What do you think of the reader's comment?

Suggested answers:

a. 2,300 is probably plenty--as long as it is a sample that was obtained via some probability sampling method (such as random sampling or cluster sampling), 2,300 is an adequate sample size. Remember--when it comes to external validity, it's not how many are in your sample, it's how you got it.

b. We can't be sure without checking, but I suspect this was a closed-ended question, something like, "Is your child's media use a source of family conflict?" or "To what extent is your child's media use a source of family conflict?"

c. In this case, it seems clear that the question was open-ended, such as, "What are your challenges as a parent?" or, "Tell us about some challenges you face as a parent." If you are interested in measuring what parents think about first when it comes to parenting challenges, an open-ended question might be best. If you asked them, "Is media use your biggest challenge?" you might get more parents to say "yes" than if you simply asked them, "What are your biggest challenges?"

d. This reader is criticizing the study's construct validity. While it is probably very easy and efficient to ask parents to self-report their ways of calming their kids, the reader has a good point, in my opinion. Parents probably do not want to admit that they quiet their children with a smartphone or tablet. A better way to find out what parents do to keep kids quiet is by visiting a few restaurants and observing what children are doing. Can you design a study that would do this? How can you be sure your study has good external and construct validity?

01/21/2013

This Wall Street Journal article describes how a team of U.S. and European professors traveled to North Korea to teach students there about quantitative methods for studying their population.

North Korea, the story says, has "some of the least reliable statistics in the world."

Please take a look at the pop-out map of North Korea in this story. The map shows the estimated rates of child malnutrition in each of North Korea's provinces. For example, it shows that the estimated rate of child malnutrition in the province of Jagang was 9.8%. That is a frequency claim: a claim about a single variable--the rate of malnutrition--in each province.

The caption on the map is worth attending to from a research methods perspective. The caption reads,

Researchers in North Korea often face challenges. In this 2012 study of child malnutrition, conducted with the support of three U.N. agencies, local village leaders chose the children from which researchers drew their subjects, so they could have excluded the most malnourished kids.

The first part of the caption describes a potential problem with the external validity of this claim. If officials did in fact select only the healthiest kids for the sampling frame, then the estimate of the true rate of child malnutrition would be too low.

Here's the second part of the map's caption:

And the data were collected at the end of the harvest, possibly producing a temporary uptick in nutrition levels.

You could say that this part of the caption is describing a potential problem with the construct validity of this claim. By measuring child malnutrition during a time of plenty, officials did not get a valid measure of the true health of the kids in the sample.

What lessons do you think the instructors are teaching to the students in North Korea?

06/20/2012

An anonymous national survey conducted last year found that 58 percent of high school seniors said they had texted or emailed while driving during the previous month. About 43 percent of high school juniors acknowledged they did the same thing.

Let's interrogate this claim.

a) What kind of claim is the headline making, and which big validities do we care about for a claim like this?

b) The msnbc story gives some detail about the original study's methodology. Does it give you enough detail to assess external validity? If not, what else do you need to know?

c) Does it give you enough detail to assess construct validity? What else might you need to know?

One part of the msnbc article mentions a teen who got into a minor accident while texting behind the wheel:

"I felt like an idiot," said her 18-year-old son, Dylan Young.

"It caused me to be a lot more cautious," said the high school senior, although he conceded that he still texts behind the wheel.

d) Is this story (about Dylan Young) an example of empirical evidence? Why do you think the journalist includes this story?

If you’re a research methods instructor or student and would like us to consider your guest post for everydayresearchmethods.com, please contact Dr. Morling. If, as an instructor, you write your own critical thinking questions to accompany the entry, we will credit you as a guest blogger.