Chapter 06: Surveys and Observations

05/20/2018

Collecting accurate data in a poll is a difficult business. Many of us focus on the sampling strategies of polling organizations, and rightly so: External validity depends on whether we include "cell phone only" numbers in our samples, how we account for different rates of responding across certain groups, and so on. However, this post reminds us that question wording also matters. Construct validity is just as important for polls.

In this report, Pew Research Center describes how people's opinions change depending on how a question is posed. The poll asked Americans whether we should increase the size of the House of Representatives. People in the poll were randomly assigned to hear either the original question or the same question with additional context. Notice how people's support changes when additional context is added.

e) Why might Democrats be more influenced by the contextual information provided?

f) This could be viewed as a 2 x 2, IV x PV factorial design, in which the IV is "question wording" (original or added context) and the PV is "political affiliation" (Democrat or Republican). If you've studied Chapter 12, perhaps you can identify whether there are main effects and interactions in this pattern of data.
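If you'd like to see that logic spelled out, here's a minimal Python sketch of how main effects and an interaction fall out of a 2 x 2 table of cell means. The percentages are hypothetical placeholders--Pew's actual numbers are in the report--but the arithmetic is the same.

```python
# A minimal sketch of main effects and an interaction in a 2 x 2
# (wording x party) design. Cell values are hypothetical placeholders,
# not Pew's actual results.
support = {
    ("Democrat", "original"): 40,    # % supporting a larger House
    ("Democrat", "context"): 55,
    ("Republican", "original"): 35,
    ("Republican", "context"): 38,
}

# Main effect of question wording: average across party within each
# wording condition, then compare.
original = (support[("Democrat", "original")] + support[("Republican", "original")]) / 2
context = (support[("Democrat", "context")] + support[("Republican", "context")]) / 2
print("Main effect of wording:", context - original)

# Main effect of party: average across wording within each party.
dem = (support[("Democrat", "original")] + support[("Democrat", "context")]) / 2
rep = (support[("Republican", "original")] + support[("Republican", "context")]) / 2
print("Main effect of party:", dem - rep)

# Interaction: does added context shift Democrats more than Republicans?
dem_shift = support[("Democrat", "context")] - support[("Democrat", "original")]
rep_shift = support[("Republican", "context")] - support[("Republican", "original")]
print("Interaction (difference in shifts):", dem_shift - rep_shift)
```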

The same Pew report contains additional examples of how question wording alters responses to a question about the Senate, and shows how different subgroups respond to questions about the Electoral College.

03/21/2018

Here's a second video in a series by Pew Research. This 5-minute clip describes some of the issues in writing good questions for an opinion poll. It basically summarizes the first part of Chapter 6 and provides some new, concrete examples.

02/20/2016

How fast do people talk? How wordy are they? Do they swear a lot? Photo credit: Eugenio Marongiu/Shutterstock

There's a new piece in the Atlantic summarizing some research on phone calls. The journalist's piece is tantalizing, but it doesn't provide some of the information you need to evaluate the research behind it.

According to the article, a data analytics firm analyzed a bunch of recorded phone calls--about 4 million of them--from customer service interactions. According to the journalist, the firm counted how fast people talked, how many words they used, and even how long they would wait on hold before hanging up! The journalist reported:

In some sense, [the] findings hew to cultural stereotypes. The fast-talkers are concentrated in the North; the slow-talkers are concentrated in the South.

Specifically, the fastest talkers were found in Oregon, Minnesota, Massachusetts, Kansas, and Iowa. The slowest were in North Carolina, Alabama, South Carolina, Louisiana, and Mississippi.

The Atlantic piece includes some colorful maps of the USA, indicating the fast, slow, and medium-talking states. Check them out!

a) While you're there, see if you can figure out how they defined "speed of talking" in this research. Also see if you can find out how much faster the Oregonians talk compared to the Mississippians (how big is the effect? Is it statistically significant?). That seems like important stuff to know before playing up these regional differences in a magazine article, right?

The Atlantic piece goes on to discuss how "wordy" the phone calls were. Here the results were different. The journalist explains:

...speedy speech doesn’t necessarily equate to dense speech. [The analytics firm] also used its dataset to analyze the wordiest speakers, state by state—the callers who, regardless of their tempo, used the most words during their interactions with customer service agents.

and also:

Some of the slower-talking states (Texas, New Mexico, Virginia, etc.) are also some of the wordier, suggesting a premium on connection over efficiency. Some of the fastest-talking states (Idaho, Wyoming, New Hampshire) are also some of the least talkative, suggesting the get-down-to-business mentality commonly associated with those states.

b) Can you find out, from the article, how much wordier the "wordy" states are compared to the less wordy states? How big are the effects?

c) And can you figure out from the Atlantic article exactly how the researchers operationalized "being wordy" and "talking fast?" How are these two variables different, exactly? (I couldn't find it from the article--let me know if you can!)

I'm being critical of the journalist's coverage here, but I'm also keeping in mind that it might not be her fault. The source of the data is not a peer-reviewed journal article (where reporting this kind of information would have been required). Instead, the source was a report by the for-profit analytics firm that had collected the data. Perhaps this firm did not report enough methodological and statistical details to allow careful reporting by the journalist.

From the firm's report, you do learn that talking speed was operationalized as "words per minute," and the differences between states were described this way:

For every 5 words a slow talking state utters, a fast talking state will utter 6.

You also learn that wordiness was operationalized as "total words in a phone conversation," and:

How big is the difference? A New Yorker will use 62% more words than someone from Iowa to have the same conversation with a business,

d) What do you think of these operationalizations of talking speed and wordiness? And what do you think of the magnitudes of the effects? Are these large differences--worthy of a news story on differences between states? If you were the journalist, would you have incorporated this information, or not?
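Before moving on, it may help to put the two quoted figures on a common scale. Here's a quick back-of-the-envelope computation--just the arithmetic implied by the quotes above, nothing more:

```python
# Putting the two quoted effect sizes on a common scale.

# Talking speed: "For every 5 words a slow talking state utters,
# a fast talking state will utter 6" -- a ratio of 6 to 5.
speed_ratio = 6 / 5
print(f"Fast states talk about {(speed_ratio - 1):.0%} faster")   # 20%

# Wordiness: "A New Yorker will use 62% more words than someone
# from Iowa to have the same conversation" -- a ratio of 1.62.
wordiness_ratio = 1.62
print(f"Wordy states use about {(wordiness_ratio - 1):.0%} more words")  # 62%

# Expressed as ratios, the wordiness difference (1.62x) is much
# larger than the speed difference (1.20x).
```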

You might also be interested in a second Atlantic story about which states are the "sweariest." Compared to the story on talking fast (and being wordy), this story is more informative. For one, it describes how swearing was operationalized. They

...examin[ed] more than 600,000 phone calls from the past 12 months—calls placed by consumers to businesses across 30 different industries. It then used call mining technology to isolate the curses therein, cross-referencing them against the state the calls were placed from.

The journalist also opined (probably correctly):

...cursing is also conveniently specific as a data set; you've got your f-bombs and your double hockey sticks and your bodily functions, and, factoring in their permutations, you're good to go. Plus, you don't need much sophisticated sentiment analysis to ensure that your data are accurate: An f-bomb is pretty much an f-bomb, regardless of the contextual subtleties.

In the swearing story, the journalist also described the data better, explaining how the rates of swearing in the highest-swearing state and the lowest state compare:

People in Ohio cursed the most as compared to every other state in the Union: They swore in one out of about every 150 phone conversations. Ohio was followed, respectively, by Maryland, New Jersey, Louisiana, and Illinois.

And who swore the least? Washingtonians. They cursed, on average, during one out of every 300 conversations. (Yes, this means that Ohioans swear at more than twice the rate of Washingtonians....)

e) Now you have a better picture of the data. What do you think of the sizes of these effects? Big or small? Important, or not?

f) The headline of this story announces that Ohio "is the sweariest state in the Union." What do you think of that headline? Specifically, does that headline overstate the results? Does it overgeneralize from the measures they used?

Let's take a look at the study behind another headline: "viewing cat videos boosts energy and mood." The journalist's story covers a recent publication in the journal Computers in Human Behavior. As the study's author told the journalist,

"We all have watched a cat video online, but there is really little empirical work done on why so many of us do this, or what effects it might have on us," added Myrick, who owns a pug but no cats. "As a media researcher and online cat video viewer, I felt compelled to gather some data about this pop culture phenomenon."

So far, so good. Empiricism is a great way to answer questions about pop culture phenomena. So what did they do?

The study, by assistant professor Jessica Gall Myrick, surveyed almost 7,000 people about their viewing of cat videos and how it affects their moods. It was published in the latest issue of Computers in Human Behavior. Lil Bub's owner, Mike Bridavsky, who lives in Bloomington, helped distribute the survey via social media.

A survey, then. Here are some of the results:

Participants in Myrick's study reported that:

They were more energetic and felt more positive after watching cat-related online media than before

The pleasure they got from watching cat videos outweighed any guilt they felt about procrastinating

About 25 percent of the cat videos they watched were sought out; the rest were ones they happened upon

Time to analyze this causal claim.

a) What are the variables in the journalist's causal headline, "viewing cat videos boosts energy and mood?"

b) What kind of a study does it take to establish this causality? How might you have designed such a study?

c) What kind of study seems to have been conducted here? Do the results support the causal claim? If not, what would a more appropriate headline be for a story on this study?

Finally, the journalist mentions that Lil Bub's owner advertised the study on his website, and that for each person who took the survey, he donated money to the ASPCA, which supports animal welfare. Over 7,000 people completed it.

d) One result from the study is that "about 36 percent described themselves as a cat person, while about 60 percent said they like both cats and dogs." Can we assume from the results and the methodology that 96% of people are animal (either cat or dog) lovers?

07/10/2015

There's a fun interactive data graphic on Gallup's website. It's called "State of the States." You can select a polling variable, such as "overall well-being," "support for Obama," or "religiosity," and it will show you how each U.S. state scores on that variable.

Feel free to take a minute to play with the interactive right now. (I'll wait.)

I've pasted a screen shot from the "well-being" results below. Take a look at it, and consider the questions that follow.

a) In the figure above, the variable I selected was "Well being." The thermometer below indicates that darker states are higher in well-being than lighter states. Using that rule, which states are the highest in well-being? Which are the lowest?

b) You might notice that South Dakota is higher in well-being than North Dakota--their shades of green are noticeably different. In fact, you might even imagine a news story in which a reporter suggests that South Dakotans are "happier." But I want you to consider the effect size of the difference. About how much happier are South Dakotans, according to the scale?

Now consider the next screen map (below). This one shows religiosity, indicating the percentage of state residents who consider themselves "Very religious":

c) As before, the thermometer below indicates that darker states are higher in saying they are "very religious" compared to lighter states. Using that rule, what states are the highest in religiosity? Which are the lowest?

d) Take a look at the scale for this variable--what do you notice about the range for religiosity compared to the range for well-being?

e) On the map, the difference in shading between Utah and Idaho is about the same as the difference between South and North Dakota on the well-being variable: the shades of green for Utah and Idaho are noticeably different. In fact, you might now imagine a news story in which a reporter suggests that Utahns are "more religious." Once again, I want you to consider the effect size of the difference. How much more religious are Utahns, according to the scale?

f) What do you think? How is Gallup using these shades of green in this interactive data map? Is their use misleading? If so, what might be better?

02/20/2015

What construct and external validity considerations affect Yelp ratings, like this 4-star review of a restaurant near my campus? Screenshot from Yelp website (by author)

In this interesting piece, Slate writer Will Oremus asks why the top-rated restaurants on Yelp are places that "nobody has ever heard of." He explains that the top 10 rated restaurants on Yelp change from year to year, and furthermore, they tend to be places like Copper Top BBQ in California and Art of Flavors in Las Vegas. The top 100 are not big, famous restaurants--in fact, they tend to be simple, local, even touristy places with "styrofoam and paper plates."

Oremus points out that Yelp ratings are subject to "biases [that] are quite different from the ones we’re used to" in other ratings websites. Are these biases of the external validity variety? Or the construct validity variety? I'll quote some passages from Oremus's Slate article for your consideration.

Here's Oremus's first observation:

I have not eaten at Copper Top, but I have little doubt that these restaurants also share a consistently high quality of food for the money. On Yelp, that’s usually a recipe for a four-star rating. Compared to professional critics, Yelp reviewers skew young and budget-conscious, which is part of the site’s appeal. By and large, they’re happier paying $8 for a very good burrito than $23 for a fancy one, and the ratings reflect that.

a. What kind of validity is this conclusion directed at--external or construct? Can you say anything specific about this type of validity, as used by this journalist?

Here's another observation from Oremus's article.

Part of the explanation lies in the distribution of ratings on the site’s five-star scale. Only a handful of restaurants in the world rate three Michelin stars. But more than 40 percent of all Yelp reviews are perfect scores, suggesting that five stars on Yelp entails satisfaction rather than perfection. Average hundreds of reviews of the same establishment, and you’ll find that its overall rating is influenced far more by the number of dissatisfied customers than by how much the five-star reviewers loved it. The best-rated restaurants on Yelp, then, are not so much the most loved as the least hated.

b. What kind of validity is this conclusion directed at--external or construct? Can you say anything specific about this type of validity, as used by this journalist?
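To see Oremus's point about dissatisfied customers in action, here's a toy Python illustration with made-up ratings. On a scale capped at five stars, enthusiasts can't pull an average up any further, so the handful of angry reviewers do most of the work:

```python
# A toy illustration (made-up numbers) of why an average on a capped
# 5-star scale is driven more by dissatisfied reviewers than by how
# much the 5-star reviewers loved a place.

# Restaurant A: universally satisfied reviewers, no haters.
ratings_a = [5] * 90 + [4] * 10

# Restaurant B: mostly rave reviews, plus a handful of 1-star reviews.
ratings_b = [5] * 90 + [1] * 10

mean_a = sum(ratings_a) / len(ratings_a)
mean_b = sum(ratings_b) / len(ratings_b)
print(f"Restaurant A: {mean_a:.2f} stars")  # 4.90
print(f"Restaurant B: {mean_b:.2f} stars")  # 4.60

# The 5-star ceiling means enthusiasts can't pull the mean up any
# further; only the dissatisfied tail can move it.
```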

Oremus also points out the power of incidental influences, such as neighborhood and weather, on Yelp reviews. To wit:

...researchers at Georgia Tech and Yahoo Labs found that online restaurant reviews are significantly influenced by at least three factors that have nothing to do with the operation of the business:

Neighborhood demographics: Restaurants in neighborhoods with high education levels don’t get better reviews, but they do get more reviews. That matters, because Yelp’s top-100 rankings are based not only on average ratings but also on the number of reviews, so a place with 100 five-star reviews will rank higher than one with 50.

Time of year: Restaurants get more reviews in July and August than they do in the winter, but the average ratings in the summer months are lower.

Weather: One of the strongest exogenous effects on restaurant ratings, according to the study, is the weather at the time of the review. As you might guess, warm temperatures and sunshine mean higher reviews. Cold temperatures or extreme heat mean lower reviews, as does precipitation of any kind. The researchers attribute this to weather’s well-documented effects on mood and memory.

c. What kind of validity is the above research concerned with?

d. How might this information affect your own use of Yelp in the future? Think of two possible ways you can use this information.

Suggested Answers

a. When Oremus writes that Yelp's users tend to be younger and budget-conscious, he's describing how the sample of people who choose to post on Yelp is biased. This is an external validity point. One could say that the sample of Yelp reviews is self-selected, and is biased toward restaurants that are cheaper. Cheaper restaurants may get more reviews on Yelp than more expensive ones; in addition, cheaper restaurants may be rated more positively, all because of this sampling bias. Therefore, ratings on Yelp might not generalize to how older people would evaluate the same restaurants. The situation seems similar to conducting an opinion poll and including (or not) cell-phone only households.

b. This is a construct validity point. People tend to use the 5-star end of the Yelp scale the most, he says. This means that people have a "yea-saying" bias on Yelp reviews (to use a Chapter 6 term). As a result, it might be hard to decide if a positive review on Yelp is truly good, or if people just tend to like everything!

Another point Oremus made is that a high Yelp rating means the restaurant is more consistent--not necessarily more delicious. That again is a construct validity point.

c. The weather bias suggests problems with construct validity. We might not know if a positive rating reflects the quality of the restaurant (the construct in question) or the type of weather outside! However, the "neighborhood bias" seems to be an external validity issue--restaurants in highly educated neighborhoods get more reviews, so this is a sampling bias.

d. Answers will vary. All in all, it seems that these construct and external validity issues mean that Yelp reviews are unlikely to parallel what a professional restaurant critic would say. Does that affect your restaurant behavior, or not?

Thanks again to Carrie Smith of Ole Miss, who, as usual, is a fount of bloggable Slate pieces!

a) This figure provides a good opportunity to interpret a graph. Study the figure carefully. The figure seems to be illustrating two main points--one about time, and one about age. What are these two points?

b) This figure, as well, seems to be illustrating two main points. What are the two main takeaway messages from this figure? (Hint: One takeaway message is the same here as in Question a.)

Given the data presented in this figure, Pew Research reports that it will call a higher proportion of cell-phone only households this year. Here is a quote from their website:

To keep pace with this rapid trend, the Pew Research Center will increase the percentage of respondents interviewed on cellphones in its typical national telephone surveys to 65%; 35% of interviews will be conducted by landline. Last year, we increased the ratio to 60% cellphone, with 40% conducted on landline. Back in 2008, when we first started routinely including cellphones in our phone surveys, just one-fourth (25%) of all interviews were done by cellphone.

c) Remind yourself that external validity (through generalizable sampling techniques) is especially important for frequency claims. Give two or three examples of research questions that fit this kind of claim.

d) Explain why it is important for polling organizations to know which households are cell-phone only and which are not. What specific kinds of political and social questions might be especially affected by cell-phone only proportions, given the data presented above?

e) Here's an interesting question: How might an organization like Pew Research obtain an accurate estimate of the number of cell-phone only households in the first place? What kind of sample would you need to get this estimate? How would you contact this sample?

Finally, here's an interesting description from Pew Research, discussing whether it would be smart to conduct polls only on cell phones, not land lines:

The question naturally arises: Why not interview everyone on a cellphone? In fact, at least one major national survey is going to do just that. The Surveys of Consumers, conducted by the University of Michigan, will begin calling only cellphones this month.

But we are not ready to make that change just yet, for at least two reasons. One is that there remains a small share of the public that is not reachable by cellphone. In the newly released data from the National Center for Health Statistics, 7% of adults live in households with a landline phone but no wireless phone. In addition, some people with landlines and cellphones may turn their cellphones on only to make calls or when they are expecting to be called. If these kinds of people are demographically different from those who are more easily reached on a cellphone, then the resulting sample will be less representative of the full population.
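Here's a simplified sketch of why the landline/cellphone mix matters and how a pollster might correct for it by weighting. All of the numbers are hypothetical--they are not Pew's benchmarks or results--and real survey weighting involves many more adjustments:

```python
# A simplified sketch of demographic weighting. All numbers are
# hypothetical, not Pew's actual benchmarks or results.

# Suppose the population breaks down like this...
population_share = {"cell_only": 0.50, "dual": 0.43, "landline_only": 0.07}

# ...but the achieved sample over-represents landline households.
sample_share = {"cell_only": 0.35, "dual": 0.50, "landline_only": 0.15}

# Hypothetical % approving of some policy within each group.
approval = {"cell_only": 60, "dual": 50, "landline_only": 40}

# Unweighted estimate: each respondent counts equally.
unweighted = sum(sample_share[g] * approval[g] for g in approval)

# Weighted estimate: reweight each group to its population share.
weighted = sum(population_share[g] * approval[g] for g in approval)

print(f"Unweighted: {unweighted:.1f}% approve")  # 52.0%
print(f"Weighted:   {weighted:.1f}% approve")    # 54.3%
```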

The researchers visited a sample of fast food restaurants in the Boston area, found tables in which parents were sitting with kids, and made notes of the parents' and children's behavior. One of the major findings was the degree to which the parents were engaged by their mobile devices:

Parents in 40 of the 55 families observed were absorbed in their mobile devices, according to the study.

a) What kind of claim is the journalist apparently making above? What is/are the variable(s) in the claim?

b) How would you ask about the external validity of the study?

The researchers observed both parents and children:

... almost a third of the parents used their devices continuously throughout their meal....Some children appeared unaffected and ate their meals in silence. Other children were more provocative, with one set of siblings singing “Jingle bells, Batman smells” to get their dad’s attention.

c) How would you ask about the construct validity of these measures of parental and child behavior?

There is an inconsistency in the journalist's story. Although the headline of the story makes an association claim, "Parents on smartphones ignore their kids, study finds," the body of the story reports that:

The degree to which the device was used, however, did not necessarily directly relate to the way in which the child reacted, according to the study.

Indeed, if you read the original journal article in Pediatrics, you'll see that the real study did not report much evidence of associations between parental behavior and child behavior.

Suggested Answers

a) This line about "40 out of 55 families" is a frequency claim--it seems to simply be documenting "what parents do" when eating in restaurants with children.
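As a side note, you could also ask how precise that "40 of 55" estimate is. Here's a rough sketch of a 95% confidence interval using the normal approximation--a back-of-the-envelope calculation, not a figure from the study:

```python
import math

# How precise is the "40 of 55 families" estimate? A rough 95%
# confidence interval using the normal approximation.
successes, n = 40, 55
p = successes / n
se = math.sqrt(p * (1 - p) / n)
lo, hi = p - 1.96 * se, p + 1.96 * se
print(f"Estimate: {p:.1%}, 95% CI: [{lo:.1%}, {hi:.1%}]")
# ~72.7%, roughly [60.9%, 84.5%] -- and that's before asking whether
# the 55 families were sampled in a way that generalizes at all.
```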

b) You'd want to ask "how did they choose these 55 families to observe? Did they use a random sampling/probability sampling technique?" If they sampled families based on convenience, it would be difficult to infer that the same proportion of all American families show these cell phone habits.

You might also ask about the degree to which the setting generalizes. Does parental behavior in restaurants generalize to their behavior at home, or to other settings where they are with the kids?

c) To ask about construct validity, you'd want to ask about how well the coders were trained. Did the coders have good inter-rater reliability? Did they follow a codebook? What behaviors did they decide to code and how well did they code them?

10/10/2013

Chapter 6 discusses the importance of well-worded polling questions. When we look carefully at the way questions are worded, we interrogate the construct validity of a poll: Is the question measuring what it intends to measure?

My home paper, the Philadelphia Inquirer, ran a story on a recent poll last Saturday (right). It shows the results of a poll that asked people about the national health care law. Half of the sample was asked a question about "Obamacare;" the other half was asked the same question, but about the law's official name, "The Affordable Care Act." You can see how the wording of the question changed the responses.

My read of this poll's results suggests that the name "Obamacare" is not necessarily a pejorative term. While calling the law "Obamacare" increased the negative responses people gave, it also increased the positive responses they gave.

The poll also suggests that most people think of the law under the name "Obamacare." Calling the law the "Affordable Care Act" caused the number of "Don't know" responses to more than double. Construct validity is relevant again--how can people give their valid opinion on a law that they don't know about?
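If you had the raw response counts from both halves of the sample, you could even test formally whether the response distribution depends on the wording. Here's a sketch using a chi-square test of independence; the counts below are hypothetical, not the Inquirer poll's actual data:

```python
from scipy.stats import chi2_contingency

# Hypothetical response counts from a split-ballot poll, one row per
# question wording:     favorable, unfavorable, don't know
obamacare       = [140, 180, 80]
affordable_care = [110, 150, 140]

chi2, p_value, dof, expected = chi2_contingency([obamacare, affordable_care])
print(f"chi-square = {chi2:.1f}, df = {dof}, p = {p_value:.4f}")
# A small p-value would indicate that the distribution of responses
# depends on which name the question used.
```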

In light of these results, how would you word the question in order to get the most accurate picture of people's true feelings toward America's health care law?

01/21/2013

This Wall Street Journal article describes how a team of U.S. and European professors traveled to North Korea to teach students there about quantitative methods for studying their population.

North Korea, the story says, has "some of the least reliable statistics in the world."

Please take a look at the pop-out map of North Korea in this story. The map shows the estimated rates of child malnutrition in each of North Korea's provinces. For example, it shows that the estimated rate of child malnutrition in the province of Jagang was 9.8%. That is a frequency claim: a claim about a single variable--the rate of malnutrition--in each province.

The caption on the map is worth attending to from a research methods perspective. The caption reads,

Researchers in North Korea often face challenges. In this 2012 study of child malnutrition, conducted with the support of three U.N. agencies, local village leaders chose the children from which researchers drew their subjects, so they could have excluded the most malnourished kids.

This first part of the caption describes a potential problem with the external validity of this claim. If village leaders did in fact exclude the most malnourished kids from the sampling frame, then the estimate of the true rate of child malnutrition would be too low.

Here's the second part of the map's caption:

And the data were collected at the end of the harvest, possibly producing a temporary uptick in nutrition levels.

You could say that this part of the caption is describing a potential problem with the construct validity of this claim. By measuring child malnutrition during a time of plenty, officials did not get a valid measure of the true health of the kids in the sample.

What lessons do you think the instructors are teaching to the students in North Korea?

If you’re a research methods instructor or student and would like us to consider your guest post for everydayresearchmethods.com, please contact Dr. Morling. If, as an instructor, you write your own critical thinking questions to accompany the entry, we will credit you as a guest blogger.