June 2019

My review of Jim Albert's book Visualizing Baseball is on the sister blog. As I mentioned in the review, a more appropriate title would have been "A Fast Intro to Baseball Analytics". Thus, the book may be of interest to some of the readers here.

***

Over the past year, I've been experimenting with the video medium. Here are some clips on various topics:

Just posted a short video that explains one of the techniques used to work with observational (or found) data. This type of data is extremely common in the Big Data world. The data are collected by some operational process and, in the indelible words of the late Hans Rosling, they are a bag of numerators without denominators. In this case, the database covers car crash fatalities. You only have the reported crashes linked to deaths; the database contains no information about crashes without fatalities, or about safe driving.

As in most scientific studies, the original researchers made a claim of statistical significance, i.e. they found something out of the ordinary. (There was an excess of fatalities on April 20.) However, a second research group took a different look at the data and demonstrated that what happened was more common than first thought.

How do statisticians measure how common something is? One takeaway is how to define the reference (control) group. Another is replication: repeating the same style of analysis over different slices of the data.

Click here to see the video. Don't forget to subscribe to the channel to see future videos.

For a long-form discussion of what is covered in the video, see this blog post.

Recently, there has been a load of criticism of the College Board initiative officially known as the "Environmental Context Dashboard" and dubbed "adversity scores" by its opponents. I wrote about a similar issue in Numbers Rule Your World (link), in a chapter subtitled "The Dilemma of Being Together", in which I explain how the College Board tries to eliminate test questions that may be biased against certain demographic groups.

The underlying problem is the interpretation of test scores: if Bob scores 1200 and Cindy scores 1280, what does the score difference of 80 points mean?

One can simply say Cindy is superior to Bob since Cindy has the higher score.

The score difference contains information on not just the direction but also the magnitude of the comparison. Cindy is 80 points better than Bob.

But the number 80 carries no meaning unless one knows the scale. It’s not enough to know that the valid scores range from 400 to 1600. Scores are not evenly spread out inside that interval; by design, few test-takers get the extreme scores.

One way the College Board helps us interpret the score differential is by converting the scores into a “percentile” scale. Cindy’s score of 1280 is the 89th percentile: she did better than 89% of the test-takers. Bob’s is at the 81st percentile. How much better is Cindy? One answer: 8 percent of test-takers score higher than Bob but lower than Cindy. (This PDF from the College Board has a table that converts between test scores and percentiles.)

***

A third student, Angela, scores 1000. How much better is Cindy compared to Angela? Since a score of 1000 is at the 48th percentile, we know that 41 percent of test-takers “sit” between Cindy and Angela. Cindy appears to be much better than Angela.
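The percentile arithmetic above can be sketched in a few lines of Python. This is purely illustrative: the score distribution below is simulated (a clipped normal is my assumption), whereas real conversions come from the College Board's published tables, so the exact percentiles will differ from the figures quoted above.

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated SAT-like scores; hypothetical stand-in for the real distribution
scores = np.clip(rng.normal(1050, 200, 100_000), 400, 1600)

def percentile(score, population):
    """Share of test-takers scoring below the given score, in percent."""
    return 100.0 * np.mean(population < score)

bob, cindy = 1200, 1280
gap = percentile(cindy, scores) - percentile(bob, scores)
print(f"Bob:   {percentile(bob, scores):.0f}th percentile")
print(f"Cindy: {percentile(cindy, scores):.0f}th percentile")
# The "how much better" answer: share of test-takers between the two scores
print(f"Between them: {gap:.0f}% of test-takers")
```

The same `percentile` function answers the Cindy-versus-Angela question: evaluate it at both scores and subtract.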

The percentile scale contains an implicit comparison of each student against all test-takers. Our next concern is which group a student should be compared against. This is what statisticians call the “control group”.

Let’s dig a bit deeper. If we know further that Angela went to a public school in a low-income neighborhood while Cindy went to an elite private school in New York, one may choose to interpret their respective test scores differently. Angela’s score of 1000 puts her in the top 10% of her school and other similar schools, while Cindy’s score of 1280 puts her in the bottom 25% of her school and other similar schools. To measure this type of comparison, one can compute a different set of percentiles: instead of computing percentiles against all test-takers, one uses test-takers from similar schools and backgrounds.

In this new percentile scale, Angela’s score of 1000 might translate into the 90th percentile while Cindy’s might become the 28th percentile. The difference in interpretation is due to different definitions of “all else being equal”.
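The within-group version of the computation only changes the reference population. Here is a minimal sketch; the two peer-group distributions are made up for illustration and are not the College Board's data, so the resulting percentiles are indicative only.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical peer groups with different score distributions
peers = {
    "public":  np.clip(rng.normal(900, 150, 5000), 400, 1600),
    "private": np.clip(rng.normal(1350, 120, 5000), 400, 1600),
}

def within_group_percentile(score, group):
    """Percentile of a score relative to one's own peer group."""
    return 100.0 * np.mean(peers[group] < score)

# Angela's 1000 ranks high among public-school peers...
print(f"Angela: {within_group_percentile(1000, 'public'):.0f}th percentile")
# ...while Cindy's 1280 ranks low among private-school peers
print(f"Cindy:  {within_group_percentile(1280, 'private'):.0f}th percentile")
```

The only design decision is which rows go into the denominator: all test-takers, or test-takers from similar schools and backgrounds.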

It’s important to realize that when we interpret differences, we implicitly make an assumption of “all else being equal”. How “all else being equal” is defined matters a lot.

***

The College Board initiative is an attempt to define “all else being equal” in a more rigorous manner. The new “scores” allow admissions officers to establish control groups and look at relative rather than absolute comparisons of scores. It’s not that one comparison is correct and the other is wrong.

Angela scores worse than Cindy on an absolute basis, but better on a relative basis, when each is compared to her peers. Both statements are true at the same time, and do not contradict each other.

***

The Environmental Context Dashboard or adversity scoring is an attempt to look at relative comparisons of subgroups of test-takers. Vox has a nice round-up of recent coverage. For a more positive story, see this US News report. Slate's take is a criticism of "black-box" models, in which users are not told how scores are constructed.

The contents of Chapter 3 of Numbers Rule Your World (link) play this out at the level of individual test questions, instead of aggregate test scores. What does differential performance on a specific test question say about the different abilities of test-takers? What if certain groups of test-takers systematically score lower on a particular question? Read the chapter to learn more.

It's about the broken contract between advertisers and consumers, with "adtech" guilty of causing the split. Here's the key paragraph:

Advertising had ceased to be about connecting with consumers—it was now about finding novel ways of extracting evermore personal information from computers, phones, and smart homes. To many of the most powerful and profitable companies in the world, we are the products, and the services we all use are just afterthoughts they put out to keep us hooked. And the rest of the ad industry, which depends on their data to compete, has no choice but to go along with whatever whims and changes come their way.

It's sad but true that "the services we all use are just afterthoughts". Google, Facebook, Yelp, etc. make it impossible for users to contact them through any channel. They justify these policies because users aren't paying them; only advertisers do.

The latest fracas around hate speech and bullying on Youtube underlines this point: even as management accepted that the Youtuber had violated community standards, they deemed the appropriate penalty to be "de-monetization," i.e. disallowing the Youtuber from making money off advertisements in those videos, instead of removing the offensive clips. The almighty advertising dollar: that's their paramount concern.

Richard proposes several steps to rebuild trust:

Obtain consent before collecting and sharing data

Stop refusing service if a user does not consent

Provide full disclosure of where the data go

Never collect certain types of data, such as health-related and location-related data

This list has some overlaps with my list of 7 Principles of Responsible Data Collection (link).

***

For those in New York, Principal Analytics Prep is hosting an interactive workshop next Tuesday on digital ad fraud "hunting". Augustine Fou, an independent researcher on ad fraud, will lead the session and we'll use analytics to find fraud in advertising data.

A couple of researchers recently claimed to have found evidence of “a battle of the thermostat”: specifically, they argued that women perform significantly worse in a math test when it is administered at lower room temperatures while men’s performance does not decline. They did not find a similar gender gap in a word test.

The analysis strategy is similar to one employed by business analysts, so readers here might be interested in how it went off track.

First, the researchers looked at the effect of temperature on test scores in aggregate. They found none. See the chart below, which replicates similar charts in the research paper.

Note that in my chart, the scores have been standardized (to be explained below). For both math and word scores, the difference in scores between the extreme temperatures of 15 to 32 Celsius (roughly 60 to 90 Fahrenheit) is smaller than 0.25 standard units, clearly not statistically significant by any measure.

The next step in the analysis strategy is to rotate through other factors that might influence test scores, such as gender, language spoken, and college major. In the business world, this is often called a deep-dive analysis. The question being asked is whether the effect of temperature on test scores differed by [factor X]. The researchers struck gold when splitting the data by gender. Here are the relevant charts:

In the top chart displaying math scores, it appears that the trend line for men is almost flat while women’s scores progressively increase as temperature increases. According to the usual convention of statistical significance, the effect is strong enough to be published. It is even possible to explain this observation with the "battle of the thermostat" story.
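The deep-dive step amounts to re-estimating the temperature effect separately within each subgroup. Here is a minimal sketch of that rotation on simulated data; the sample sizes, noise level, and the gender-specific slope are all assumptions built into the simulation, not the paper's actual data.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 250
# Simulated sessions: room temperature in Celsius, and test scores
temp = rng.uniform(15, 32, 2 * n)
gender = np.array(["M"] * n + ["F"] * n)
# Assumed data-generating process: a temperature effect only for women
slope = np.where(gender == "M", 0.0, 0.3)
score = 50 + slope * (temp - 15) + rng.normal(0, 5, 2 * n)

# Deep-dive: fit a trend line within each subgroup and compare slopes
for g in ["M", "F"]:
    mask = gender == g
    b, a = np.polyfit(temp[mask], score[mask], 1)  # slope, intercept
    print(f"{g}: slope = {b:+.2f} score points per degree C")
```

The same loop can rotate through any candidate factor (language spoken, college major, and so on), which is exactly why one should expect some splits to look "significant" by chance.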

***

Many other researchers have expressed skepticism that the reported effect can be replicated in other settings. See, for example, here.

I explored the data a bit more. Here are the raw data, shown as side-by-side boxplots. I have arranged the sessions by average temperature, ordered from lowest to highest.

The following features caught my attention:

The math scores of women (top chart, red boxplots) show unusually low variability. The dispersion of these scores is well below that of men’s math scores, and also well below that of women’s word scores (bottom chart, red boxplots).

Focusing only on the math scores (top chart), and inspecting the data session by session from right to left, I found that the gender difference is hard to see (often obscured by the high variability of the male scores), except in a few sessions with the lowest temperatures (boxed area).

In terms of math scores, the shape of the distribution for men is clearly different from that for women. The scores for men are generally higher, and contain more extreme values on the high side. If a male and a female student score the same in absolute terms, the scores do not mean the same thing in relative terms, i.e. relative to other students of the same gender.

***

One way to deal with this difference in variability is to standardize the scores by gender. This leads to the following chart, in which the scores are expressed relative to each gender’s distribution.

Notice how the math and word charts look almost identical. Also compare to the chart above, which uses the raw data. This shows the fragility of statistical significance: the top chart shows a marginally significant effect while the bottom chart shows none.
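Standardizing by gender is a one-liner with a pandas group-by. The sketch below uses simulated scores chosen to mimic the pattern in the boxplots (men more dispersed with high extremes, women in a narrow range); the specific distributions are my assumptions, not the study's data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
# Hypothetical raw math scores mimicking the observed dispersion gap
df = pd.DataFrame({
    "gender": ["M"] * 300 + ["F"] * 300,
    "math": np.concatenate([
        rng.gamma(2.0, 6.0, 300),    # men: right-skewed, high variance
        rng.normal(10.0, 2.0, 300),  # women: narrow range
    ]),
})

# Standardize within each gender: (score - group mean) / group sd
df["math_std"] = df.groupby("gender")["math"].transform(
    lambda x: (x - x.mean()) / x.std()
)

# Both groups now sit on a common scale: mean 0, sd 1
print(df.groupby("gender")["math_std"].agg(["mean", "std"]).round(2))
```

After this transformation, a between-group comparison of trend lines is no longer driven by the difference in raw variability.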

***

[Technical note: Here is a scatter plot of the standardized math scores against the raw scores, grouped by gender. You can see that the raw women's scores fall in a narrower range, and that men's scores are more dispersed and contain some high extremes.]

***

Deep-dive analysis sometimes leads to unexpected discoveries. Aggregate data sometimes mask subgroup differences. But at other times, the subgroup difference is a mirage: at the disaggregated level, the sample size is smaller, and there may also be a difference in variability.

For more discussion of subgroup analysis, see Chapter 3 of Numbers Rule Your World (link), in which I discuss how the insurance industry and the educational testing community tackle this problem.