Posts categorized "Current Affairs"

In the sister blog, I featured a data graphic showing the difference in pay levels between white male workers in the U.S. and women workers of various races. That post discusses purely visual matters, in effect accepting the analysis as given. In this post, I take a deeper look at the underlying data analysis.

Let us review the analysis strategy behind the chart, and then discuss why this simple strategy is not particularly insightful.

The starting point of the analyst is income data collected by the Census Bureau. Even here, income is measured in several ways. Using personal income (as opposed to household or family income) appears most appropriate here. There is an additional complication of how to handle "missing values", which may arise in this context because someone is not employed and thus earns zero income. When one says "median income", does one include or exclude the zero earners? What about those who only worked part of the year?

After those questions are addressed, one can work with median incomes, computed at the right level of aggregation. The report linked to by Business Insider only contains aggregation by race, and aggregation by gender, each calculated separately. What is needed is a cross-tabulation. It is not possible to obtain the median income of Asian women from the median income of Asians and the median income of women - unless the analyst makes unwarranted assumptions.
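To see why the cross-tabulation is needed, here is a minimal sketch with entirely made-up incomes: the marginal medians (by race alone, and by gender alone) do not pin down the median for Asian women.

```python
import statistics

# Hypothetical personal incomes (synthetic, for illustration only)
incomes = {
    ("asian", "man"):   [52000, 61000, 70000],
    ("asian", "woman"): [24000, 27000, 30000],
    ("white", "man"):   [40000, 46000, 55000],
    ("white", "woman"): [30000, 36000, 42000],
}

# Marginal medians: by race and by gender, each computed separately
asians = incomes[("asian", "man")] + incomes[("asian", "woman")]
women = incomes[("asian", "woman")] + incomes[("white", "woman")]
print(statistics.median(asians))  # 41000.0 -- median income of Asians
print(statistics.median(women))   # 30000.0 -- median income of women

# The cross-tabulated cell is not derivable from those two marginals
print(statistics.median(incomes[("asian", "woman")]))  # 27000
```

Incomes within the Asian-women cell can shift without moving either marginal median, which is why the analyst needs the cross-tabulated data directly.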

So the analyst learns that white men made $46,000 in 2017 while Asian women made $27,000 and Hispanic women made $21,000. This data can be plotted directly, or after computing the race-gender discount (off the white male wage level).

***

The analyst wants to make this data come alive by using a different unit, the number of days worked.

One way to achieve this is by converting the annual salary to daily salary, then computing how many days the median Asian woman must work in order to earn $46,000, the median for the white man. This is roughly 611 days, which suggests that the Asian woman must work 246 extra days.
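The arithmetic can be sketched as follows. Note that plugging in the rounded medians quoted above yields roughly 622 days rather than 611; the original analysis presumably used unrounded medians, so treat these figures as illustrative.

```python
# Rounded 2017 median personal incomes quoted above
WHITE_MAN = 46000
ASIAN_WOMAN = 27000

def days_to_match(own_annual, target_annual, days_per_year=365):
    """Days of work, at one's own daily rate, needed to earn the target.

    Assumes everyone works all 365 days -- the same simplification
    discussed in the text.
    """
    daily_rate = own_annual / days_per_year
    return target_annual / daily_rate

total_days = round(days_to_match(ASIAN_WOMAN, WHITE_MAN))
print(total_days)        # about 622 with these rounded inputs
print(total_days - 365)  # extra days beyond one year
```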

In the above description, I have seamlessly papered over several annoying details. I have assumed - without checking against reality - that everyone works 365 days a year. In fact, hardly anyone does. And even if I obtained the correct value of the average number of days worked, call it X, X would also vary by race and gender. Thus, I made a further implicit assumption that such variance is not large enough to bother about.

I justify this lack of care because I rounded the median salaries to the nearest thousand. The questions I raised above ought to concern those analysts who insist on printing estimates in decimals. Is it possible to attain that level of precision while making simplifying assumptions?

***

Also notice that the analysis strategy is counterfactual in nature - it requires conjuring up a hypothetical scenario. It's a comparison of a white man who works exactly one year, and a woman (of any race) who works till she earns $46,000 or whatever is the current median wage for white men.

The notion of "extra days" is an invention since there are only 365 days in a year, and the women can never catch up unless the white guys stop working.

***

When comparing white men and Asian women, both gender and race are shifted. The Business Insider story leads readers to attribute the wage gap primarily to gender, but unless we see the median salaries for Asian men, Hispanic men and black men, one can't be sure.

Most likely, there is a gender "effect" as well as a race "effect". The gender effect may even be differently sized for each race. This is known as an "interaction" effect.
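A made-up illustration of an interaction effect: the within-race gender gaps below are deliberately different sizes, so no single "gender effect" describes all three races. (All wage figures are hypothetical.)

```python
# Hypothetical median wages; the gender gap varies by race,
# which is exactly what an interaction effect means
wage = {
    ("white", "man"): 46000,    ("white", "woman"): 38000,
    ("asian", "man"): 50000,    ("asian", "woman"): 27000,
    ("hispanic", "man"): 30000, ("hispanic", "woman"): 21000,
}

for race in ("white", "asian", "hispanic"):
    gap = wage[(race, "man")] - wage[(race, "woman")]
    print(race, gap)  # gaps of 8000, 23000, 9000 -- not one constant
```

If there were no interaction, the three gaps would be roughly equal, and a single additive gender effect would suffice.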

***

Finally, there are even more factors to be considered. It is well known that at least some of the wage gap is explained by the difference in the mix of jobs and industries that men and women tend to be employed in. So one can't conclude discrimination without further investigation.

Unequal pay for equal work is discrimination but unequal pay for unequal work need not be.

***

Now, check out my comments about the calendar visualization of the wage gaps by Business Insider.

In the last year, news kept coming out about how the social media giant mishandled user data, including taking phone numbers collected for two-factor authentication and using them for marketing, saving passwords unencrypted, and asking users for their passwords to other services. Despite the $5 billion fine, the settlement has been widely panned by the technology press as a slap on the wrist.

The size of the fine is not meaningful in the context of Facebook’s massive advertising business, which generated $16 billion in revenue in the second quarter of 2019 alone.

Even more frustrating are the so-called new regulations, which fail to address any underlying issues.

Consider the requirement that Facebook ask users to “opt-in” to facial recognition. Requiring opt-in has long lost any meaning because of the forced consent tactic. Facebook can simply put up a notice saying that you will not be served unless you consent to the following activities. This kind of notice is prevalent with mobile apps - to use these apps, you must agree to a list of demands.

Consider the requirement to encrypt passwords. This is standard IT best practice branded as regulatory success.

Consider the requirement to bar third-party apps that “fail to certify that they are in compliance with Facebook’s platform policies”. This is a reiteration of existing policy. The FTC does not stipulate any restriction on types of use. The issue is intractable since once the data moves to third-party servers, Facebook has no ability to monitor them.

In my new 3-minute video, I discuss the controversy about the revelation that Grubhub has been playing a dirty trick on its restaurant customers.

As I explain in the video, what Grubhub did is similar to what Google has been doing with its search engine.

Most of these online businesses make money off "lead generation" - they bring prospective customers to brands. Restaurant owners hope Grubhub will bring them additional diners. These have to be incremental diners to justify the expense, which is a sizable 20-percent fee Grubhub assesses for each lead.

Grubhub has every incentive to count each order as an incremental lead. The dirty trick involves counting orders that should not be charged the lead-generation fee. Grubhub sets up shadow restaurant websites with similar URLs that pretend to be official websites. The diner usually does not know s/he is ordering from Grubhub's website rather than the official restaurant website. This bit of deception qualifies the order for the lead-generation fee - if the same diner had ordered from the restaurant's own website, Grubhub would have collected a much lower fee.

Similarly, lots of us search for brand names on Google, instead of going directly to the brands' websites. This leads some brand managers to buy their own brand names from Google. Google over the years has made it harder and harder to see which result is a paid ad. If we click on those ads, the brands pay Google a lead-generation fee. But of course, if we are searching for a brand's name, we would have visited that brand's website anyway, whether or not we passed through the Google toll booth!

In Numbersense (link), I talk about the importance of counterfactual thinking, and measuring incremental rather than absolute metrics.
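The incremental-versus-absolute distinction can be sketched in a few lines. The flag `would_have_ordered_anyway` is the unobservable counterfactual - in real data no such column exists, which is precisely why lead-generation billing is so easy to game. All numbers are made up.

```python
FEE_RATE = 0.20  # the roughly 20-percent per-lead fee mentioned above

orders = [
    {"value": 40.0, "would_have_ordered_anyway": True},   # a regular customer
    {"value": 25.0, "would_have_ordered_anyway": False},  # a genuinely new diner
    {"value": 30.0, "would_have_ordered_anyway": True},
]

# Billing on every order vs. billing only on incremental orders
billed_on_all = FEE_RATE * sum(o["value"] for o in orders)
billed_on_incremental = FEE_RATE * sum(
    o["value"] for o in orders if not o["would_have_ordered_anyway"]
)
print(billed_on_all, billed_on_incremental)  # nearly a fourfold difference
```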

Just posted a short video that explains one of the techniques used to work with observational data (or found data). This type of data is extremely common in the Big Data world. The data are collected by some operational process, and in the indelible words of the late Hans Rosling, they are a bag of numerators without denominators. In this case, the database covers car crash fatalities. You only have the reported crashes linked to deaths; the database does not contain any information about crashes without fatalities, or about safe driving in general.

Just like most scientific studies, the original researchers made a claim of statistical significance, i.e. they found something out of the normal. (There was an excess of fatalities on April 20.) However, a second research group took a different look at the data and demonstrated that what happened was more common than first thought.

How do statisticians measure how common something is? One takeaway is how to define the reference (control) group. Another takeaway is replications, repeating a style of analysis over different slices of the data.

Click here to see the video. Don't forget to subscribe to the channel to see future videos.

For a long-form discussion of what is covered in the video, see this blog post.

Recently, there has been a load of criticism of the College Board initiative officially known as the “Environmental Context Dashboard” and dubbed “adversity scores” by its opponents. I wrote about a similar issue in Numbers Rule Your World (link), in a chapter subtitled "The Dilemma of Being Together", in which I explain how the College Board tries to eliminate test questions that may be biased against certain demographic groups.

The underlying problem is the interpretation of test scores: if Bob scores 1200 and Cindy scores 1280, what does the score difference of 80 points mean?

One can simply say Cindy is superior to Bob since Cindy has the higher score.

The score difference contains information on not just the direction but also the magnitude of the comparison. Cindy is 80 points better than Bob.

But the number 80 carries no meaning unless one knows the scale. It’s not enough to know that the valid scores range from 400 to 1600. Scores are not evenly spread out inside that interval; by design, few test-takers get the extreme scores.

One way the College Board helps us interpret the score differential is by converting the scores into a “percentile” scale. Cindy’s score of 1280 is the 89th percentile: she did better than 89% of the test-takers. Bob’s is at the 81st percentile. How much better is Cindy? One answer is: there are 8 percent of test-takers who score higher than Bob but lower than Cindy. (This PDF from the College Board has a table that converts between test scores and percentiles.)
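This percentile comparison can be coded directly. The lookup table below is a hypothetical stand-in for the College Board's published conversion table, with values chosen to match the figures quoted above.

```python
# Hypothetical score-to-percentile table (stand-in for the real one)
score_to_pct = {1000: 48, 1200: 81, 1280: 89}

def percentile_gap(score_a, score_b):
    """Share of test-takers sitting between two scores."""
    return abs(score_to_pct[score_a] - score_to_pct[score_b])

print(percentile_gap(1280, 1200))  # Cindy vs. Bob: 8
print(percentile_gap(1280, 1000))  # Cindy vs. Angela: 41
```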

***

A third student, Angela, scores 1000. How much better is Cindy compared to Angela? Since a score of 1000 is 48th percentile, we know that 41 percent of test-takers “sit” between Cindy and Angela. Cindy appears to be much better than Angela.

The percentile scale contains an implicit comparison of each student against all test-takers. Our next concern is what group should a student be compared against. This is what statisticians call the “control group”.

Let’s dig a bit deeper. If we know further that Angela went to a public school in a low-income neighborhood while Cindy went to an elite private school in New York, one may choose to interpret their respective test scores differently. Angela’s score of 1000 puts her in the top 10% of her school and other similar schools, while Cindy’s score of 1280 puts her in the bottom 25% of her school and other similar schools. To measure this type of comparison, one can compute a different set of percentiles – instead of computing percentiles against all test-takers, one uses test-takers from similar schools and backgrounds.

In this new percentile scale, Angela’s score of 1000 might translate into 90th percentile while Cindy’s might become 28th percentile. The difference in interpretation is due to different definitions of “all else being equal”.

It’s important to realize that when we interpret differences, we implicitly make an assumption of “all else being equal”. How “all else being equal” is defined matters a lot.

***

The initiative by the College Board is an attempt to define “all else being equal” in a more rigorous manner. The new “scores” allow admissions officers to establish control groups and look at relative rather than absolute comparisons of scores. It’s not that one comparison or the other is correct.

Angela is - on an absolute basis - worse than Cindy but she is relatively better than Cindy, when each is compared to her peers. Both statements are true at the same time, and do not contradict each other.

***

The Environmental Context Dashboard or adversity scoring is an attempt to look at relative comparisons of subgroups of test-takers. Vox has a nice round-up of recent coverage. For a more positive story, see this US News report. Slate's take is a criticism of "black-box" models, in which users are not told how scores are constructed.

The contents of Chapter 3 of Numbers Rule Your World (link) play this out at the level of individual test questions, instead of aggregate test scores. What does differential performance on a specific test question say about the different abilities of test-takers? What if one finds that certain demographic groups systematically score lower on a particular test question? Read the chapter to learn more.

It's about the broken contract between advertisers and consumers - with "adtech" guilty of causing the split. Here's the key paragraph:

Advertising had ceased to be about connecting with consumers—it was now about finding novel ways of extracting evermore personal information from computers, phones, and smart homes. To many of the most powerful and profitable companies in the world, we are the products, and the services we all use are just afterthoughts they put out to keep us hooked. And the rest of the ad industry, which depends on their data to compete, has no choice but to go along with whatever whims and changes come their way.

It's sad but true that "the services we all use are just afterthoughts". Google, Facebook, Yelp, etc. make it impossible for users to contact them through any channel. They justify these policies because users aren't paying them; only advertisers are.

The latest fracas around hate speech and bullying on Youtube underlines this point: even as management accepted that the Youtuber violated community standards, they deemed the appropriate penalty to be "de-monetize," i.e. disallow the Youtuber from making money off advertisements in those videos -- instead of removing the offensive clips. The almighty advertising dollar: that's their paramount concern.

Richard proposes several steps to rebuild trust:

Obtain consent before collecting and sharing data

Stop refusing service if user does not consent

Provide full disclosure of where the data go

Never collect certain types of data, such as health-related, location-related

This list has some overlaps with my list of 7 Principles of Responsible Data Collection (link).

***

For those in New York, Principal Analytics Prep is hosting an interactive workshop next Tuesday on digital ad fraud "hunting". Augustine Fou, an independent researcher on ad fraud, will lead the session and we'll use analytics to find fraud in advertising data.

Another day, another story about Facebook data. Ars Technica reports that Facebook is suing a South Korean app developer, Rankwave, claiming the company misused data it received from Facebook. Rankwave created mobile apps through which it obtained Facebook user data over 10 years.

All we know about this situation comes from the Facebook press release, and it's not clear what the offense is. The article cited the violation as using the Facebook data to "create and sell advertising and marketing analytics and models". But that's exactly how Facebook itself uses the user data, and it's why Facebook's partners want access to the data in the first place.

One part of the press release rings very true: Facebook admits that it does not control the data once shared with third parties. Facebook lawyers demanded Rankwave do the following:

Provide a full accounting of Facebook user data in its possession;

Identify all individuals, organizations, and governmental entities to which it had sold, or otherwise distributed, Facebook user data;

Provide a full record of the access logs and permissions it had granted third parties to access the data;

Delete and destroy all Facebook user data after returning it to Facebook;

Provide Facebook with full access to all storage and related devices so that Facebook could confirm deletion and destruction of the data through an audit.

These all sound great but would any company, even Facebook, be able to deliver the above logs, reports, devices, etc.? Given how data are spread out in big networks of servers ("data clouds", "data lakes", etc.), this wishlist sounds like a fantasy.

In this new video, I talk about the data sharing ecosystem, why it is so hard to delete anything, and how companies lose control of the data. It's the flip side of speed and convenience. What is the price we're willing to pay?

The authors of these articles express genuine shock and awe. They apparently believed that “machine learning” means no humans involved. The tech industry allows this misconception to fester by being opaque about how machine learning works.

(The reporters are also dismayed by the privacy invasion. The Echo speakers are constantly recording in users’ households. Facebook did not have explicit permission from users to send their data out for labeling.)

***

Humans have always been a part of the machine learning workflow, and will continue to be. Let’s use one of the examples in the Facebook report to illustrate this point.

Computers work fast, and can make tons of predictions quickly. The question is whether these predictions are accurate. It’s one thing to create these models in the laboratory; it’s another thing when they are unleashed to the world, and affecting Facebook users, e.g. by deleting videos that are predicted to contain profanity.

Why should Facebook and its data scientists care about predictive accuracy?

User complaints. When users find their videos deleted due to profanity, they complain if said clips do not in fact contain profane language. Other users are upset to unsuspectingly encounter videos with profanity (that the machine fails to identify and delete).

***

It’s not easy to measure if the machine-learning model is correctly predicting profanity. The machine can’t be both decider and judge at the same time. The judge typically is a human who views the video to determine if it contains profanity. These human judges are the “annotators” described in the news articles. They are hired to look through videos and apply a profanity “label” if they find profanity.

As disclosed in the articles, companies typically hire two or three judges for each item because profanity is a somewhat subjective opinion. They might order more detailed labeling, e.g. label types of profanity as opposed to one overall label of profanity.
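One simple way to combine two or three judges' opinions is a majority vote, with ties flagged for further review. This is a sketch of one common convention, not necessarily what Facebook's vendors actually do.

```python
from collections import Counter

def majority_label(judgments):
    """Resolve several human judges' labels for one item by majority vote."""
    label, votes = Counter(judgments).most_common(1)[0]
    # Require a strict majority; otherwise escalate the item
    return label if votes > len(judgments) / 2 else "needs_review"

print(majority_label(["profane", "profane", "clean"]))  # profane
print(majority_label(["profane", "clean"]))             # needs_review
```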

***

Now let’s remove the assumption that we already have a machine learning model. Where does this model come from? Such a machine has to know what features of the video are correlated with presence of profanity. To discover this correlation, the machine needs to be told which videos have profane language, and which do not.

This is a chicken-and-egg problem, and it is solved by having humans label a big batch of videos at the start, building the “training data”. In the Facebook example, they hired over 200 people to create data labels, a workforce later reduced to 30. The first team was building a large training dataset; after the predictive model was produced, the ongoing labeling by the reduced workforce was used to monitor the accuracy of the predictions.
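The monitoring role of the smaller team can be sketched as follows: fresh human labels on a sample of items are compared against the model's predictions to estimate ongoing accuracy. All labels below are made up.

```python
# Model predictions on a sample, and fresh human labels on the same sample
model_predictions = {"vid1": "profane", "vid2": "clean", "vid3": "profane"}
human_labels = {"vid1": "profane", "vid2": "clean", "vid3": "clean"}

# Accuracy = share of sampled items where model and human agree
agree = sum(model_predictions[v] == human_labels[v] for v in human_labels)
accuracy = agree / len(human_labels)
print(accuracy)  # 2 of the 3 labels agree
```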

Any company that claims to use our data to predict our behavior must create training data, i.e. labeled data. In most cases, humans must create the labels – by reading our emails, listening to our conversations, viewing our videos, reviewing our calendars, scanning our receipts, and so on.

How far companies should go and what methods they should use in collecting such data are ethics questions that should be discussed.

And if you're wondering about the acronym, it's Driving Under the Influence of Weed on 420 Day, which I learned from Andrew Gelman's blog is a day of celebration of cannabis.

Andrew's blog post is about the exemplary work done by Sam Harper and Adam Palayew, debunking a highly-publicized JAMA study that claimed that 420 Day is responsible for a 12 percent increase in fatal car crashes.

The discussion provides great fodder for examining how to investigate observational data, which is what most of Big Data is about. It is a cautionary tale for what not to do.

***

The blog begins with Harper/Palayew channeling Staples/Redelmeier, the authors of the study: "fatal motor vehicle crashes increase by 12% after 4:20 pm on April 20th (an annual cannabis celebration)."

This short sentence captures the gist of the original study but it omits an important detail: to what is the increase relative?

If we ran an experiment, we would recruit a group of drivers, and select half of them at random to smoke weed on April 20. Then, we would count what proportion of drivers suffered fatal car crashes after 4:20 pm. The analysis would be straightforward: what's the difference in proportions between the two groups? With such an experiment, it is possible to draw a causal conclusion.

Alternatively, we could conduct a case-control study. The cases are the drivers who suffered fatal car crashes on April 20. We collect demographic data on these drivers. Then, we define a set of "controls", drivers who did not suffer car crashes on April 20 but on average, have the same demographic characteristics as the cases. Next, we need data on cannabis consumption, preferably on April 20. We want to show that the level of cannabis consumption is significantly higher for cases than for controls.

(For further discussion of these analysis designs, see Chapter 2 of Numbers Rule Your World (link).)

The actual study was neither experiment nor case-control. It was a piece of pure data analysis, based on "found data". I like to call this "adapted data," the "A" in my OCCAM framework for Big Data - data collected for other purposes that the researcher has adapted for his/her own objectives. In this study, the adapted data come from a database of fatal car crashes.

So how was the adapted data analyzed? Harper/Palayew answer this question in their second description of the research:

Over 25 years from 1992-2016, excess cannabis consumption after 4:20 pm on 4/20 increased fatal traffic crashes by 12% relative to fatal crashes that occurred one week before and one week after.

The cases are the fatal car crashes that occurred after 4:20 pm on 420 Day. The comparison isn't to the drivers who did not suffer crashes on the same day. The reference group consisted of fatal car crashes that occurred after 4:20 pm on 4/13 and 4/27. The difference in the average number of crashes is taken to result from "excess cannabis consumption".

Notice that such a conclusion requires a strong assumption. We must believe that absent 420 Day, 4/13, 4/20 and 4/27 ought to have the same fatal crash frequencies.

***

You hopefully recognize that the analysis design for adapted data is on much shakier ground than either an experiment or a case-control study.

Harper/Palayew's initial debunking focused on one issue: what's so special about April 20? To answer that, they repeated the same analysis on every day of the year. The following pretty chart summarizes their finding:

The red line is the line of no difference (between the analyzed day and the two reference days from the week before/after). Each vertical line is the range of estimate of the difference for a specific day of the year. The range for 4/20 is highlighted, and several other days with elevated fatal crash counts are labeled.

The chart was originally published here, with the following commentary: "There is quite a lot of noise in these daily crash rate ratios, and few that appear reliably above or below the rates +/- one week." Andrew adds: "Nothing so exciting is happening on 20 Apr, which makes sense given that total accident rates are affected by so many things, with cannabis consumption being a very small part."

While the chart looks cool and sophisticated, the following histogram of the same data helps the reader digest the information.

I took the daily estimates of the fatal crash ratios from Harper/Palayew's published data. Each ratio presents the crashes on the analysis day relative to the crashes on the two reference days. The histogram shows the day-to-day variability of the crash ratios, which is what we need to answer the question: how special is 4/20?

The histogram is roughly centered at 1.0, meaning no observed difference. The black vertical line shows the ratio for 4/20. It leans right - in fact, it sits at the 94th percentile. In classical terms, this is a p-value of 0.06, which falls just short of conventional statistical significance.
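The replication logic behind the histogram can be sketched with simulated ratios (the real analysis uses Harper/Palayew's published crash data): compute the same ratio for all 365 days, then locate 4/20 within that distribution.

```python
import random

# Simulated daily crash ratios centered at 1.0 (i.e. no difference);
# these are stand-ins for the real Harper/Palayew estimates
random.seed(420)
daily_ratios = {day: random.gauss(1.0, 0.05) for day in range(1, 366)}

target = daily_ratios[110]  # day-of-year 110 is April 20 in a non-leap year
rank = sum(r <= target for r in daily_ratios.values())
percentile = 100 * rank / len(daily_ratios)
print(percentile)  # where "4/20" falls among all days
```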

Will JAMA editors accept one research paper for each of these days? The work is already done - the rest is story time.

P.S. [4/27/2019] Replaced the first chart with a newer version from Harper's site. This version contains the point estimates that the other version did not. Those point estimates are used to generate the histogram.

A report came out from Stanford School of Medicine about a study of Apple Watch's health monitoring features. Some headline writers are proclaiming that "finally, there is proof that these watches benefit our health!" For example, Apple Watch Stanford Study Shows How It Can Save Lives (link).

When you read the official story, you will learn the following facts about the study:

The research is funded by Apple

It was a purely observational study that followed about 400,000 people who wear Apple Watches

Participants must own both an Apple Watch and an iPhone to be eligible (plus meeting other criteria)

There was no "control" group - they did not follow anyone who did not use an Apple Watch or any other health-monitoring wearable

Every participant is self-selected

The device issued warnings to only 0.5 percent of the participants (~ 2,160)

Those who received a warning were directed to a video consultation; and the doctor decided whether or not to send the participant an ECG patch, which is used to establish the "ground truth". Only about 30 percent were sent patches, and of those, 70 percent (450) returned the patches for analysis.

Only those who had ECG data were analyzed. One third of these were shown to have experienced "atrial fibrillation" (irregular heartbeat). This means that two-thirds got false alarms. But if we include the 70% who were not sent patches after the video consultation as false alarms as well, then out of every 100 warnings, only 7 were validated.

There is no discussion of false negatives: did any of the 99.5 percent who did not receive warnings experience irregular heartbeats?

We do know that if there were significant false negatives, then more warnings would have to be sent, which pushes up false alarms.

Despite the headlines, any lives saved were extrapolated.
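The validation funnel implied by the numbers above can be reproduced in a few lines (counts are approximate, rounded at each step):

```python
warned = 2160                        # ~0.5 percent of participants
sent_patch = round(warned * 0.30)    # ~30 percent sent an ECG patch
returned = round(sent_patch * 0.70)  # ~70 percent returned it
confirmed = round(returned / 3)      # one third showed atrial fibrillation

validated_per_100 = 100 * confirmed / warned
print(sent_patch, returned, confirmed)  # 648 454 151
print(round(validated_per_100))         # about 7 validated per 100 warnings
```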

There are some major methodological limitations about this study.

Firstly, the study design prevents drawing conclusions of the type "People wearing Apple Watches .... compared to those who did not wear Apple Watches." It does not include anyone not wearing Apple Watches.

Secondly, it's difficult to interpret the accuracy metrics. Is two-thirds false alarms a good or bad number? Is 0.5 percent receiving warnings a reasonable proportion given the demographic and health characteristics of the study population?

Hopefully, this study is just the beginning, and more rigorous studies are being planned.