Chapter 05: Identifying Good Measurement

09/10/2018

College graduates were more likely than those who had not attended college to report being "smarter than average." Is their perception overconfident, or not? Photo: PeopleImages/Getty Images

It seems to be conventional wisdom that people are overconfident in their own abilities. People tend to think they are nicer, smarter, and better looking than most other people. But what's the evidence? The scientist-authors of this Wall Street Journal summary explain,

The claim that "most people think they are smarter than average" is a cliche of popular psychology, but the scientific evidence for it is surprisingly thin. Most research in this area has been conducted using small samples of individuals or only with high school or college students. The most recent study that polled a representative sample of American adults on the topic was published way back in 1965.

The authors, Patrick Heck and Christopher Chabris, worked with a third colleague.

...[W]e conducted two surveys: one using traditional telephone-polling methods, the other using internet research volunteers. Altogether we asked a combined representative sample of 2,821 Americans whether they agreed or disagreed with the simple statement "I am more intelligent than the average person."

Here are some of the results:

We found that more than 50% of every subgroup of people -- young and old, white and nonwhite, male and female -- agreed that they are smarter than average. Perhaps unsurprisingly, more men exhibited overconfidence (71% said they were smarter than average) than women (only 59% agreed).

Perhaps "overconfidence" is really accuracy? Consider this pattern of results:

In our study, confidence increased with education: 73% of people with a graduate degree agreed that they are smarter than average, compared with 71% of college graduates, 62% of people with "some college" experience and just 52% of people who never attended college.

The accessible Wall Street Journal summary is paywalled, but the original empirical publication is open-access in PLOS One.

Questions

a) What kind of study was this? Survey/poll? Correlational? Experimental? What are its key variables?

b) The authors found that more than 50% of every subgroup of people considered themselves smarter than average. Why is this result a sign of overconfidence?

c) The authors of this piece state that their combined sample was "representative." Re-read the section on how they got their sample and then make your own assessment--is the sample representative? (That is, how is its external validity?) What population of interest do they intend to represent?

d) Sketch a graph of this result:

73% of people with a graduate degree agreed that they are smarter than average, compared with 71% of college graduates, 62% of people with "some college" experience and just 52% of people who never attended college.

e) In concluding their article, the authors wrote, "Our study shows that many people think they are smarter than they really are, but they may not be stupid to think so." What do you think? To what extent do this study's results support this conclusion?

f) Ask a question about this study's construct, internal, external, and statistical validity.

03/20/2018

Maybe you shouldn't get into the car with this guy. Credit: Shutterstock.com

You probably know drivers who honk, tailgate, and shake their fists; you know others who give drivers space and respect. Now researchers have identified a trait the aggressive drivers might share: Narcissism.

The participants answered questions from the Narcissistic Personality Inventory, a set of questions used since 1988 to measure narcissism. This questionnaire had participants rate how strongly they agreed with items such as: “I like to be the center of attention,” or “I am an extraordinary person” on a 1 to 5 scale. They then addressed similar items about aggressive driving behavior: “I often swear when driving a car,” or “When driving my car, I easily get angry about other drivers.” ...The researchers report that the more narcissistic drivers are, the more angry and aggressive they reported becoming on the road.

a) Let's talk measurement first. How was narcissism measured--did they use a self-report? an observational measure? or a physiological measure?

b) Now for the second variable, aggressive driving: How was this measured--was it self-report? an observational measure? or a physiological measure?

c) Was this a correlational or experimental study? How do you know?

d) Sketch a graph (with well-labeled axes) of the results of the study.

Next, the researchers conducted a lab-based study with university students. They measured narcissism just as they'd done before, but they measured aggressive driving differently. Here's how they measured aggressive driving in the lab:

...participants sat in the driver’s seat of a 2010 Honda Accord, surrounded on three sides by a curved projection screen. In a 15- to 25-minute driving exercise, the participants saw other computer-generated cars and were told that some of them were being operated by other study participants. (In fact, the experimenters were controlling the other vehicles.)

During the exercise, the participants encountered:

a car pulling suddenly in front of them;

a traffic jam with two 10-second full traffic stops, one after another;

a construction zone with one lane closed and the other slowed down;

a second car mimicking the human driver’s behavior; and

a traffic light that was red for 60 seconds and green for just 5 seconds.

The researchers found that the participants who scored high on narcissism measures were more likely to tailgate, speed, drive off-road, cross the center line into oncoming traffic, drive on the shoulder, honk their horn, or use “verbal aggression” or “aggressive gestures,” in the experimenters’ chaste wording.

e) In this study, how was aggressive driving measured--did they use a self-report? an observational measure? or a physiological measure?

f) Was this a correlational or experimental study? How do you know?

g) Sketch a graph of the results of the study. Label your axes mindfully.

h) Can you think of moderators of this basic relationship? For example, might there be situations or settings for which narcissism is especially strongly linked to aggressive driving? (As you answer, consider this: Past work on narcissism has established that narcissists aren't always aggressive; they are mainly aggressive when others reject them or when they are provoked.) Create a moderator table like those seen in Chapter 8 (e.g., Figure 8.19 or Table 8.5).

i) Can you think of a mediator that explains the relationship between narcissism and aggressive driving? If so, sketch a mediator diagram like those seen in Chapter 9 (e.g., Figure 9.11 or Figure 9.13).

11/10/2017

How do you feel about the photo to the left? As for me, I hate looking at it, so I made it as small as possible! I am not snake-phobic, but like many other humans, I'd much rather pick up a rabbit or even touch a bear than pick up a snake. But where'd I get that fear? To what extent is humans' fear of certain creatures--like snakes or spiders--present when we are born?

Several groups of researchers have been exploring this question in humans and other primates. The results of one study have been covered by National Geographic's news site. Read this description of the study by the journalist:

Forty-eight six-month-old infants were tested at the institute to analyze how they reacted to images the researchers predicted might be frightening. While sitting on their parents' laps, infants were shown images of spiders and snakes on white backgrounds for five seconds. To prevent parents from inadvertently influencing their infants' reactions, they were given opaque sunglasses during the experiment that prevented them from viewing whatever image was shown.

...When the babies saw pictures of the snakes and spiders, they consistently reacted with larger pupils than when they were shown control images of flowers and fish.

a) What seem to be the independent and dependent variables in this experiment? (By the way, why was this an experiment rather than a correlational study?)

b) Based on the description of the study, was this experiment independent groups or within groups? How do you know?

c) Which of the four basic types of experiments was this: Repeated measures? Concurrent measures? Posttest only? Pretest/posttest?

d) Sketch a graph of the results of the study.

e) The parents were given opaque glasses to wear while holding their babies. Which of the four big validities will this step help to improve?

Now, let's focus a bit on what it means when a baby's pupils dilate. Pupil dilation was the operationalization (the operational definition) of the study's dependent variable. But what construct does pupil dilation supposedly represent? Given the story's headline, you might think the construct is "fear." But it's probably more complicated than that. Read on:

...dilated pupils are associated with activity in the noradrenergic system in the brain, the same system that processes stress. Closely measuring changes in pupil size has been used in previous studies to determine a variety of mental and emotional stress in adults.

But the journalist also noted:

...it's difficult to characterize the exact nature of the type of stress infants experienced, but dilated pupils show heightened states of arousal and mental processing. Rather than indicating fear in particular, the study says this shows an intense focus.

f) Based on this description, what are some of the candidates for the construct measured by pupil dilation? (You can also read about even more correlates of pupil dilation here).

06/20/2017

The study found an association between screen time and language delay in a sample of 900 18-month-olds. Photo: Maria Sbytova/Shutterstock

This CNN story reports on research presented at an academic pediatrics conference. According to the conference presentation,

[A] study found that the more time children between the ages of six months and two years spent using handheld screens such as smartphones, tablets and electronic games, the more likely they were to experience speech delays.

According to this description, the two main conceptual variables in the study are "time spent using a handheld screen" and "speech delay." Read on to find out how each variable was operationalized:

In the study, which involved nearly 900 children, parents reported the amount of time their children spent using screens in minutes per day at age 18 months. Researchers then used an infant toddler checklist, a validated screening tool, to assess the children's language development also at 18 months. They looked at a range of things, including whether the child uses sounds or words to get attention or help and puts words together, and how many words the child uses.

a) According to the text, how did the study operationalize the variable "time spent using a handheld screen"? Do you think this was a valid measure? Why or why not?

b) According to the text, how did the study operationalize the variable "speech delay"?

c) The passage describes the infant toddler checklist as "a validated screening tool"--how do you think this tool was validated? What data might they have collected to validate this measure? (apply concepts from Chapter 5)

d) Sketch a scatterplot of the association they reported in the first quoted section.

e) Does this association allow us to conclude that "exposure to handheld screens causes children to experience speech delays?" Why or why not?

Read the following description; identify the mediator that the speaker is proposing.

"We do know that young kids learn language best through interaction and engagement with other people, and we also know that children who hear less language in their homes have lower vocabularies." It may be the case that the more young children are engaged in screen time, then the less time they have to engage with caretakers, parents and siblings, said [an expert who commented on the findings].

f) Sketch the mediator pattern that is hypothesized above.

Suggested answers

a) This variable appears to have been operationalized via parents' reports of their children's screen time.

b) Speech delay may have been operationalized with multiple measures. It is not clear from the reporting whether the Infant Toddler Checklist is one measure, separate from whether the "child uses sounds or words to get attention or help and puts words together, and how many words the child uses," or whether these three components comprise the Infant Toddler Checklist.

c) One way to validate a measure such as the Infant Toddler Checklist would be through a known-groups paradigm. Professional speech-language pathologists might identify groups of children who either do have speech delay or do not. Then all children in the two groups would be administered the Checklist. If the Checklist is valid, then the group who have been diagnosed with speech delay should score lower on it than the group who have been diagnosed as developing normally.
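If you want to see what that known-groups comparison might look like in practice, here is a minimal sketch in Python. All of the numbers are made up for illustration; the point is simply that a valid checklist should separate the two already-diagnosed groups.

```python
# A minimal known-groups sketch with hypothetical data: children diagnosed
# as typically developing vs. speech-delayed both take the checklist, and
# we check whether the measure separates the two known groups.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical checklist scores (higher = stronger language development)
typical = rng.normal(loc=30, scale=5, size=40)
delayed = rng.normal(loc=22, scale=5, size=40)

t, p = stats.ttest_ind(typical, delayed)
print(f"typical mean = {typical.mean():.1f}, delayed mean = {delayed.mean():.1f}")
print(f"t = {t:.2f}, p = {p:.4f}")
# If the checklist is valid, the diagnosed-delay group should score
# reliably lower than the typically developing group.
```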

d) Your scatterplot should have "Speech delay" (or, alternatively, "Infant Toddler Checklist") on one axis and "Time spent on screens" on the other axis. The dots should slope upwards from left to right.

e) The results show covariance: Speech delay is associated with time on screens. The method does not allow us to establish temporal precedence, since both variables were apparently measured at the same time. We therefore do not know whether the screen time came first (inhibiting speech development) or whether children with speech delays are more likely to be drawn to screens (perhaps to cope with the frustration of not communicating easily). Internal validity also is not established. A reasonable third-variable explanation might be time in day care: children in day care may be both less likely to be on screens (because most day cares have many other toys) and more likely to be exposed to language (because there are more people around).

f) You'd draw this mediator pattern:

exposure to screens ---> less social engagement with caretakers ---> language delay

03/10/2017

Is counting nests of sea turtle eggs a valid way to measure the population of sea turtles in the ocean? Photo: Kimberley Croy/Shutterstock

How do scientists estimate the number of sea turtles living in the world's oceans? Counting creatures seems pretty easy--they can be over one meter long, after all--until you remember they live in the ocean, travel hundreds of miles, and are nearly impossible to track.

A single female will lay her eggs at several places within the same nesting ground—a reproductive spread-bet that prevents her from losing an entire generation to, say, a storm or an industrious predator. Scientists have assumed that green turtles lay an average of 3.5 clutches each, and counting these clutches helps scientists estimate the global turtle population.

Is it valid to use the number of clutches to estimate the overall number of turtles?

Scientists can canvass these nesting beaches and count the tracks of the females. If you divide that by the number of nests that each female makes, you get the total female population. For example, if you get 300 tracks, and you assume three nests per turtle, you get a total of 100 females. “But if you think the number of nests per individual is six, it’s a very different story,” says Esteban.
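To see how much the population estimate hinges on that assumption, here is a quick sketch in Python. The 300-track figure comes from the example above; the function name is ours.

```python
# How the female-population estimate depends on the assumed number of
# nests (clutches) each female lays.

def estimate_females(track_count: float, nests_per_female: float) -> float:
    # Each nesting visit leaves one set of tracks, so dividing total
    # tracks by nests-per-female estimates the number of females.
    return track_count / nests_per_female

tracks = 300  # tracks counted on a nesting beach (example from the story)

for nests in (3, 3.5, 6):
    print(f"{nests} nests per female -> {estimate_females(tracks, nests):.0f} females")
# 3 nests per female -> 100 females
# 3.5 nests per female -> 86 females
# 6 nests per female -> 50 females
```

Doubling the assumed number of nests per female cuts the population estimate in half--exactly the stakes described in the story.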

Yong's story describes an example of construct validity. Specifically, you could say that the operational definition being used to estimate the number of sea turtles has been found to have poor construct validity. You can apply some concepts from Chapter 5 to this example:

a) What conceptual variable is the focus of this story? How is the variable being operationally defined?

b) What is the scale of measurement of the "number of egg clutches" measure? Categorical? Ordinal? Interval? or Ratio?

c) What kind of reliability would probably be most relevant for the "number of egg clutches" measure: Internal? Inter-rater? or Test-retest?

According to the data collected by Dr. Esteban, we should update the operational definition currently used, and estimate the number of sea turtles using the factor of six nests per turtle instead. But keep in mind that we need to establish the construct validity of the new data, too. Therefore we should also ask, How did Esteban find out that each female lays 6 nests (rather than 3)?

She arrived at that answer not by counting tracks, but by following actual turtles. In October 2012, she and her colleagues patrolled the beaches of Diego Garcia Island, waited for the turtles to finish laying their eggs, and then accosted them. They carefully cleaned the shell and then stuck on a state-of-the-art satellite tag—a flattened, waterproof, Tupperware-like box, which they painted with black antifouling paint to stop marine microbes and larvae from growing...

...After tagging eight turtles, Esteban realized that they were laying far more nests than anyone had expected. So her team returned to Diego Garcia in July 2015, to tag ten more animals at the very start of the breeding season. And they confirmed that the females were laying an average of six clutches each, with a range of two to nine.

d) In Esteban's study, what is the conceptual variable of interest (Hint: it's not "the number of sea turtles in the ocean")? How was her conceptual variable operationally defined?

As a side note, this research provides an example of how scientists use pilot data (the first eight turtles) and then follow up by collecting more data. Yong also mentions that Esteban's finding, in which female sea turtles lay about six nests each, concurs with two other independent estimates of the same variable.

08/20/2015

What does your vacation say about your personality? Business Insider reported on a series of studies that demonstrated an association between people's personalities and their preferred places to relax and to live. The journalist sums up the studies: "the landscapes around us match the landscapes within us."

The creative set of studies by researchers Shigehiro Oishi, Thomas Talhelm, and Minha Lee used a variety of methods and measures to document links between introversion-extroversion and the physical environment. The theory they tested can be summarized this way:

Studies have shown that extroverts seek out opportunities for socializing and attention, whereas introverts look for quiet, more solitary situations. ...[but] there’s been little exploration into the role our personalities play in determining the geographical settings we love most. “We argue that beaches are typically noisier, with more people to watch, talk to, and hang out with than mountains,” they write. “In contrast, mountains offer many secluded places, which facilitate isolation.” Extroverts should be happiest in an open area, then, Oishi hypothesized, whereas introverts should thrive in secluded spots.

Below are short descriptions of three of the studies, as explained by the journalist. For each study,

i) indicate the variables in the study

ii) indicate whether each variable is manipulated or measured

iii) identify the study as correlational or experimental.

Some studies have follow-up questions, too.

a) Study 1:

Oishi and his team asked 921 undergraduates to rate their personality, using a standard questionnaire. The students were then asked whether they prefer the ocean or mountains. Comparing the results, the researchers found that introversion was linked to a preference for mountains, while extroversion was linked to the beach. Mountain-lovers and ocean-lovers had no other significant personality differences, nor did age, gender, or socioeconomic status factor into their preference.

b) Study 2:

These findings were confirmed by a visual test. Oishi and his colleagues showed a smaller group of participants six pairs of pictures of oceans and mountains (one such pair is shown above), asking where they’d prefer to visit, cost and time investment being equal. These participants also took the standard personality questionnaire. Controlling for age, race, gender, and socioeconomic status, extroversion was found to be a significant predictor of ocean preference.

iv) in addition to the first three questions, what does it mean to say "controlling for age, race, gender, and SES?" How might a table of these results look?
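If you're not sure where to start on question iv, here is one minimal sketch, in Python, of what "controlling for" typically means in practice: the covariates are entered into a regression alongside the predictor of interest. All data and variable names here are hypothetical, not the researchers' actual analysis.

```python
# "Controlling for" sketched as multiple regression with hypothetical data:
# extraversion predicts ocean preference while age and SES are held constant.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 300
df = pd.DataFrame({
    "extraversion": rng.normal(0, 1, n),
    "age": rng.integers(18, 70, n),
    "ses": rng.normal(0, 1, n),
})
# Hypothetical outcome: higher extraversion -> stronger ocean preference
df["ocean_pref"] = 0.5 * df["extraversion"] + rng.normal(0, 1, n)

model = smf.ols("ocean_pref ~ extraversion + age + ses", data=df).fit()
print(model.params)  # the extraversion coefficient is its unique association
                     # with ocean preference, holding age and SES constant
```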

c) Study 3:

First, to see whether extroverts and introverts cluster geographically, Oishi and his team sought to find out whether residents of more mountainous states become introverted as a result of their surroundings. The researchers compared answers from a nationwide, multi-year personality survey of hundreds of thousands of respondents, with the relative mountainous-ness of each U.S. state.

Controlling for size of the population, they found a strong correlation between elevation and introversion on a state level: the more mountainous a state, the more introverted its population tended to be.

iv) in addition to answering the first three questions, sketch a scatterplot of the main bivariate result. What would a dot represent on your scatterplot?

d) Study 4:

Then, to see whether geography actually brings out certain personality traits, the researchers had another group of students take the personality survey. They placed participants in one of two spots on the UVA campus: a quiet, wooded hill, or a flat, open lawn. Based on how participants engaged in conversations with the researchers before and after placement, the researchers watched to see if the quiet hill made people more quiet and introverted, and if the flat, open area made them more talkative and social.

While neither environment made participants more extroverted or introverted than they already were, extroverts were found to be happier in the open area, while introverts were happier in the secluded spot.

iv) Here's a challenge question about Study 4: This study demonstrated a moderator. Given the findings, could you fill in the blanks in this sentence: "_____ moderated the relationship between ______ and ______."

Additional questions:

e) The researchers did a large number of studies and found the same patterns in all of them. Were these studies exact replications, conceptual replications, or replications-and-extensions? Why was it important to run so many studies?

f) One of the studies they did was described this way:

The researchers asked another sample of students where they’d go for fun, social opportunities versus quiet solitude, between the beach and the mountains. Most respondents said they’d go to the beach if they wanted play-time with friends. Most also said that the mountains were best for alone time, affirming the researchers’ assumptions about how people perceive these geographical spaces.

Why was the above study an important one in the series?

Answers to selected questions

Study 1

a) i. One variable was level of introversion/extroversion, another was liking for the beach, and another was liking for the mountains.

ii. All three variables were measured, so...

iii. This was a correlational study.

Study 3

c) i. One variable was level of introversion or extroversion. Another variable was relative mountainous-ness. There were other measured variables, too, such as each state's population.

ii. All variables were measured.

iii. This was a correlational study.

iv. You can see a copy of the actual scatterplot in the story here (scroll down). Each dot represents a state. Does your scatterplot look like this one?

e) These studies are conceptual replications of the core hypothesized relationship--that introverts would prefer mountains and extroverts would prefer the beach. By studying this relationship using questionnaires, photos, state-by-state data, and actual campus settings, they are replicating their core finding at least four different times.

f) This study is testing the researchers' theoretical assumption that people see beach areas as more conducive to socializing and mountainous areas more conducive to alone time. This study boosts the construct validity of the other vacation studies, by showing that a choice for the mountains represents a choice to spend time alone.

Psychopathy is a rare and serious personality disorder, which is primarily diagnosed in criminal justice settings. Individuals with psychopathy lack empathy and remorse, do not emotionally connect with other people, are manipulative, use other people to their own ends and are often aggressive or violent. Psychopaths are estimated to make up approximately 1% of the population, but comprise up to 20% of the prison population.

You may have seen some online personality tests (or "quizzes" as they are sometimes called), that purport to tell you whether you have the qualities of a psychopath. I was surprised that such a test is embedded in the dating site OKCupid. And you can find others online, like this one.

The authors of The Guardian piece raise two points about such online tests. First, do the online personality tests work--do they accurately detect psychopathy? In the terms of Chapter 5, on Identifying Good Measurement, do the online quizzes have good construct validity?

In order to have construct validity, these tests should be both reliable and valid.

a) What kind(s) of reliability do you think is (are) important to establish for these online tests--inter-rater reliability? internal reliability? test-retest reliability? How would you decide if the tests have each kind of reliability?
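As one way into question a, here is a minimal sketch, in Python with made-up data, of how two of those reliability checks are often computed: test-retest reliability as a correlation between two administrations, and internal reliability as Cronbach's alpha.

```python
# Two common reliability checks, computed on hypothetical quiz data.
import numpy as np

rng = np.random.default_rng(0)

# Test-retest: the same 50 people take the quiz twice, two weeks apart.
time1 = rng.normal(20, 5, size=50)
time2 = time1 + rng.normal(0, 2, size=50)  # scores stay mostly stable
print(f"test-retest r = {np.corrcoef(time1, time2)[0, 1]:.2f}")

# Internal reliability: do the quiz's items hang together?
def cronbach_alpha(items: np.ndarray) -> float:
    # items: respondents x items matrix of item scores
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars / total_var)

# Placeholder item responses; real items from a coherent scale would be
# positively correlated, pushing alpha toward 1.
items = rng.normal(3, 1, size=(50, 10))
print(f"Cronbach's alpha = {cronbach_alpha(items):.2f}")
```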

Even if you learned that an online test has reliability, you would also want to establish the test's validity.

b) What kinds of validity might be relevant here? What would you do to establish criterion validity of one of the online tests, for example?

As the authors of The Guardian piece write:

...self-rating instruments are never perfect and there is a great deal of room for error – particularly when the instruments have not been subjected to rigorous empirical study assessing their reliability, validity and ability to capture individual differences in the population. We see no evidence that the online quizzes have undergone these procedures and as such what constitutes a high score is likely to just represent someone’s subjective opinion.

The following sections of The Guardian piece discuss content validity, too. (What's the definition of content validity?)

To get a diagnosis of psychopathy, an individual has to score a minimum of 30/40 on a standard diagnostic instrument that relies on recorded, independently verified information from institutional files, as well as an in-depth interview administered by a trained professional.

People may endorse that they have some psychopathic traits without actually having a full-blown psychopathic personality disorder. But scoring relatively high on some of the features of psychopathy does not make a person a psychopath. Consequently, there is a concern that psychopathy quizzes may suggest to people with psychopathic traits that they in fact are bona fide psychopaths.

c) Explain why this section (quoted above) is about content validity of a psychopathy test. There are several points you could make.

In sum, the first major argument in The Guardian piece is that these online psychopathy tests may not be construct valid.

The second major argument in the article is that these quizzes are also not responsible. After the authors took the online tests and endorsed a few psychopathic traits on purpose, they got this feedback from OKCupid's test:

Wow, you are a genuine psychopath. You lack empathy, are highly manipulative, disregard the law, and don’t even have any delusions to blame for your behaviour. Therapy is unlikely to help you and would in fact just make you better at manipulating others. Chances are that most people don’t even realize just how sick you are.

As the authors point out, this kind of feedback may upset certain people who are actually not psychopathic! And in a way, the feedback is almost congratulatory. The authors write:

In particular, those who encounter these quizzes on dating websites might be especially concerned about how such feedback reflects on their social abilities. At worst, the feedback was irresponsibly congratulatory and even appeared to exhort people to capitalize on their “psychopathic personality” to use others for personal gain. We were also concerned about some of the feedback “diagnosing” the respondent as a psychopath and telling them that they cannot change and that no therapy will work for them. Such feedback is misinformed...

Overall, I thought this was a thoughtful scientific analysis of these online "tests".

d) What other online personality or mental health tests have you encountered? What kind of data (inspired by Chapter 5) would convince you that these tests are reliable and valid?

02/20/2015

What construct and external validity considerations affect Yelp ratings, like this 4-star review of a restaurant near my campus? Screenshot from Yelp website (by author)

In this interesting piece, Slate writer Will Oremus asks why the top-rated restaurants on Yelp are places that "nobody has ever heard of." He explains that the top 10 rated restaurants on Yelp change from year to year, and furthermore, they tend to be places like Copper Top BBQ of CA and Art of Flavors of Las Vegas. The top 100 are not big, famous restaurants--in fact, they tend to be simple, local, even touristy places with "styrofoam and paper plates."

Oremus points out that Yelp ratings are subject to biases that "are quite different from the ones we’re used to" in other ratings websites. Are these biases of the external validity variety? Or the construct validity variety? I'll quote some passages from Oremus' Slate article for your consideration.

Here's Oremus's first observation:

I have not eaten at Copper Top, but I have little doubt that these restaurants also share a consistently high quality of food for the money. On Yelp, that’s usually a recipe for a four-star rating. Compared to professional critics, Yelp reviewers skew young and budget-conscious, which is part of the site’s appeal. By and large, they’re happier paying $8 for a very good burrito than $23 for a fancy one, and the ratings reflect that.

a. What kind of validity is this conclusion directed at--external or construct? Can you say anything specific about this type of validity, as used by this journalist?

Here's another observation from Oremus' article.

Part of the explanation lies in the distribution of ratings on the site’s five-star scale. Only a handful of restaurants in the world rate three Michelin stars. But more than 40 percent of all Yelp reviews are perfect scores, suggesting that five stars on Yelp entails satisfaction rather than perfection. Average hundreds of reviews of the same establishment, and you’ll find that its overall rating is influenced far more by the number of dissatisfied customers than by how much the five-star reviewers loved it. The best-rated restaurants on Yelp, then, are not so much the most loved as the least hated.
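Before you answer the next question, it may help to see that arithmetic in action. Below is a toy simulation with hypothetical rating distributions (the function and parameters are ours, not Yelp's actual data):

```python
# A toy simulation of the point above: when most reviews are 5 stars,
# a restaurant's average rating is driven mainly by how many dissatisfied
# reviewers it attracts, not by how much its fans love it.
import numpy as np

rng = np.random.default_rng(1)

def average_rating(n_reviews: int, p_dissatisfied: float) -> float:
    # Reviewers give 5 stars unless dissatisfied, in which case 1-2 stars.
    dissatisfied = rng.random(n_reviews) < p_dissatisfied
    stars = np.where(dissatisfied, rng.integers(1, 3, n_reviews), 5)
    return stars.mean()

for p in (0.05, 0.15, 0.30):
    print(f"{p:.0%} dissatisfied -> average {average_rating(500, p):.2f} stars")
# The "least hated" restaurant, not the most adored one, tops the list.
```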

b. What kind of validity is this conclusion directed at--external or construct? Can you say anything specific about this type of validity, as used by this journalist?

Oremus also points out the power of incidental influences, such as neighborhood and weather, on Yelp reviews. To wit:

...researchers at Georgia Tech and Yahoo Labs found that online restaurant reviews are significantly influenced by at least three factors that have nothing to do with the operation of the business:

Neighborhood demographics: Restaurants in neighborhoods with high education levels don’t get better reviews, but they do get more reviews. That matters, because Yelp’s top-100 rankings are based not only on average ratings, ...so a place with 100 five-star reviews will rank higher than one with 50.

Time of year: Restaurants get more reviews in July and August than they do in the winter, but the average ratings in the summer months are lower.

Weather: One of the strongest exogenous effects on restaurant ratings, according to the study, is the weather at the time of the review. As you might guess, warm temperatures and sunshine mean higher reviews. Cold temperatures or extreme heat mean lower reviews, as does precipitation of any kind. The researchers attribute this to weather’s well-documented effects on mood and memory.

c. What kind of validity is the above research concerned with?

d. How might this information affect your own use of Yelp in the future? Think of two possible ways you can use this information.

Suggested Answers

a. When Oremus writes that Yelp's users tend to be younger and budget-conscious, he's describing how the sample of people who choose to post on Yelp is biased. This is an external validity point. One could say that the sample of Yelp reviews is self-selected and biased toward cheaper restaurants. Cheaper restaurants may get more reviews on Yelp than more expensive ones; in addition, cheaper restaurants may be rated more positively, all because of this sampling bias. Therefore, ratings on Yelp might not generalize to how older people would evaluate the same restaurants. The situation seems similar to conducting an opinion poll and including (or not) cell-phone-only households.

b. This is a construct validity point. People tend to use the 5-star end of the Yelp scale the most, he says. This means that people have a "yea-saying" bias on Yelp reviews (to use a Chapter 6 term). As a result, it might be hard to decide if a positive review on Yelp is truly good, or if people just tend to like everything!

Another point Oremus made is that a high Yelp rating means the restaurant is more consistent--not necessarily more delicious. That again is a construct validity point.

c. The weather bias suggests problems with construct validity. We might not know if a positive rating reflects the quality of the restaurant (the construct in question) or the type of weather outside! However, the "neighborhood bias" seems to be an external validity issue--restaurants in highly educated neighborhoods get more reviews, so this is a sampling bias.

d. Answers will vary. All in all, it seems that these construct and external validity issues mean that Yelp reviews are unlikely to parallel what a professional restaurant critic would say. Does that affect your restaurant behavior, or not?

Thanks again to Carrie Smith of Ole Miss, who, as usual, is a fount of bloggable Slate pieces!

04/10/2014

The Washington Post covered a criminal justice story recently that tracked what happened to crime rates after communities changed the way welfare benefits are distributed.

Previously, people received checks for public assistance. After they cashed these checks, they could have become targets of crime. Since the 1990s, however, public assistance benefits have gradually changed, and are now distributed via electronic debit cards (EBT cards).

Did the shift to EBT cards cause a change in crime rate? Let's check it out.

According to a team of researchers at the University of Missouri, the overall crime rate dropped 10 percent after the introduction of EBT cards for welfare.

The graphic [in the story] looks at crime trends in Missouri before and after the switch from cash to debit cards, for all crimes and broken down by individual crimes. In most crime categories the change before and after the switch is striking--upward trends in assault, burglary, car theft and robbery are completely reversed.

a) What kind of quasi-experimental study is this? (Hint--looking at the graphs in the story will help you answer this question!) Options:

non-equivalent control groups pretest-posttest design

non-equivalent control groups posttest only design

interrupted time series design

non-equivalent control groups interrupted time series design

The journalist explains the following aspect of the results:

To put these results in perspective, the overall 10 percent decrease in crime corresponded to 47 fewer crimes per 100,000 people per county per month as a direct result of switching welfare benefits from cash to credit. This finding is fairly astonishing....

b) which big validity is the author addressing in the quote above?

Here's another detail presented in the story. Rates of assault, burglary, car theft, larceny, and robbery all dropped after EBT cards, but rates of rape did not change. According to the authors, this result fits the theory that the availability of cash drives crime, because rapes are not typically about getting cash.

c) Can the authors support the causal claim implied in the article's title: "To fight crime in your community, stop using cash"?

Suggested answers

a) This is an interrupted time-series design--it measured crime rates repeatedly up until some key event (the introduction of EBT cards), and repeatedly afterwards.

b) By discussing the magnitude of the 10% decrease in practical terms (47 fewer crimes per 100,000 people), the author is discussing one aspect of statistical validity (effect size).
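As a quick back-of-the-envelope check, here is a sketch using only the two figures quoted in the story:

```python
# If a 10% drop corresponds to 47 fewer crimes per 100,000 people per
# county per month, the implied baseline crime rate is:
drop_fraction = 0.10
absolute_drop = 47  # crimes per 100,000 people per county per month
baseline = absolute_drop / drop_fraction
print(baseline)  # 470.0 -- roughly 470 crimes per 100,000 per county per month
```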

c) Let's apply the three causal criteria to this claim. There is definitely covariance--periods of time with EBT cards go with a lower crime rate, and periods of time with cash-based benefits go with a higher crime rate.

There is also temporal precedence. EBT cards came first, and crime rates were recorded afterwards.

What about internal validity? We could work through the 12 major threats to internal validity here. The design and results of this study allow us to rule out selection, regression, maturation, and a variety of other internal validity threats. However, one possible culprit is a history threat--what else might have happened in the 1990s--at the same time that EBT cards were introduced--that might have also explained this drop in all crimes except for rape? Perhaps a change in policing strategy, political structure, or other crime policy occurred. By reading the scientists' original article, you might be able to check out how well the researchers addressed such history threats.

10/20/2013

This piece in Slate online showcased a study that analyzed the words people use on Facebook. The researchers received brief personality surveys from 75,000 volunteers via a Facebook app. They divided the sample into males and females, as well as into different age groupings. In addition, using people's responses to an online personality questionnaire, they divided the sample into introverts and extroverts, as well as people who were neurotic vs. emotionally stable, and so on. Using "big data" analytic techniques, the researchers were able to find sets of words that best distinguished these groups from one another.

I've copied a closeup of one of the word cloud figures that the authors used. This one shows the differences between the extroverts and introverts. Certain words were much more likely to be used by introverts--these are in larger font in the word cloud on the top. Other words were much more likely to be used by extroverts--these are in larger font in the word cloud on the bottom.

As you can see, reading the word clouds gives a unique sense of what extroverts and introverts are like.

It's fascinating to see the word clouds comparing men and women, young and old, and agreeable and disagreeable people, too--all are available through the original article at PLOS ONE, here.

Questions:

a) What are the words that most strongly separate introverts from extroverts?

b) The word clouds are a great way to represent the study's results. But they're not the only way: How might you represent the data in this word cloud in a scatterplot (or scatterplots)?

c) These data provide strong concurrent validity evidence for the measure of extroversion/introversion that they used. Can you explain why?

If you’re a research methods instructor or student and would like us to consider your guest post for everydayresearchmethods.com, please contact Dr. Morling. If, as an instructor, you write your own critical thinking questions to accompany the entry, we will credit you as a guest blogger.