Posts categorized "Fairness"

In the sister blog, I featured a data graphic showing the difference in pay levels between white male workers in the U.S. and women workers of various races. That post discusses purely visual matters, in effect accepting the analysis as given. In this post, I take a deeper look at the underlying data analysis.

Let us review the analysis strategy behind the chart, and then discuss why this simple strategy is not particularly insightful.

The starting point of the analyst is income data collected by the Census Bureau. Even here, income is measured in several ways. It appears that using personal income (as opposed to household or family income) is the most appropriate here. There is an additional complication of how to handle "missing values", which may arise in this context because someone is not employed and thus earns zero income. When one says "median income", does one include or exclude the zero earners? What about those who only worked part of the year?

After those questions are addressed, one can work with median incomes, computed at the right level of aggregation. The report linked to by Business Insider only contains aggregation by race, and aggregation by gender, each calculated separately. What is needed is a cross-tabulation. It is not possible to obtain the median income of Asian women from the median income of Asians and the median income of women - unless the analyst makes unwarranted assumptions.

So the analyst learns that white men made $46,000 in 2017 while Asian women made $27,000 and Hispanic women made $21,000. This data can be plotted directly, or after computing the race-gender discount (off the white male wage level).

***

The analyst wants to make this data come alive by using a different unit, the number of days worked.

One way to achieve this is by converting the annual salary to daily salary, then computing how many days the median Asian woman must work in order to earn $46,000, the median for the white man. This is roughly 611 days, which suggests that the Asian woman must work 246 extra days.
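Here is a minimal sketch of that conversion in Python, using the rounded medians quoted above. (The 611-day figure presumably comes from the unrounded Census medians, so the rounded inputs below land on a slightly different number.)

```python
# Extra-days arithmetic, using the rounded medians quoted in the text.
white_men_median = 46_000    # annual median income, rounded
asian_women_median = 27_000  # annual median income, rounded

daily_wage = asian_women_median / 365          # assumes 365 paid days per year
days_to_match = white_men_median / daily_wage  # days needed to earn $46,000
extra_days = days_to_match - 365

print(round(days_to_match), round(extra_days))  # ~622 and ~257 with rounded inputs
```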

In the above description, I have seamlessly papered over several annoying details. I have assumed - without checking against reality - that everyone works 365 days a year. In fact, no one works 365 days a year. And even if I obtained the correct value of the average number of days worked, call it X, X would also vary across race and gender. Thus, I made a further implicit assumption that such variation is not large enough to bother about.

I justify this lack of care by noting that I rounded the median salaries to the nearest thousand. The questions I raised above ought to concern those analysts who insist on printing estimates with decimals. Is it possible to attain that level of precision while making such simplifying assumptions?

***

Also notice that the analysis strategy is counterfactual in nature - it requires conjuring up a hypothetical scenario. It's a comparison of a white man who works exactly one year, and a woman (of any race) who works until she earns $46,000, or whatever the current median wage for white men happens to be.

The notion of "extra days" is an invention since there are only 365 days in a year, and the women can never catch up unless the white guys stop working.

***

When comparing white men and Asian women, both gender and race are shifted. The Business Insider story leads readers to attribute the wage gap primarily to gender, but unless we see the median salaries for Asian men, Hispanic men and black men, one can't be sure.

Most likely, there is a gender "effect" as well as a race "effect". The gender effect may even be differently sized for each race. This is known as an "interaction" effect.
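To make the distinction concrete, here is a toy sketch with hypothetical cell medians - remember, the report does not supply the cross-tabulated numbers. If the gender gap turns out to be a different size within each race, that difference is the interaction; in a regression model, it would be captured by a gender-by-race interaction term.

```python
# Hypothetical medians for illustration only (not from the Census report).
medians = {
    ("white", "man"): 46_000, ("white", "woman"): 37_000,
    ("asian", "man"): 50_000, ("asian", "woman"): 27_000,
    ("hispanic", "man"): 30_000, ("hispanic", "woman"): 21_000,
}

for race in ["white", "asian", "hispanic"]:
    gap = medians[(race, "man")] - medians[(race, "woman")]
    print(race, gap)  # unequal gaps across races = a gender-race interaction
```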

***

Finally, there are even more factors to be considered. It is well known that at least some of the wage gap is explained by the difference in the mix of jobs and industries that men and women tend to be employed in. So one can't conclude discrimination without further investigation.

Unequal pay for equal work is discrimination but unequal pay for unequal work need not be.

***

Now, check out my comments about the calendar visualization of the wage gaps by Business Insider.

On my (new) YouTube channel called "Fung with Data", I am using short clips to explain how data, software and algorithms work behind the scenes to influence our daily decision-making.

The third episode just launched today, and it addresses the question of whether Google Maps or GPS navigators can really find you the "fastest route" to your destination. Lots of people I know swear by the software; how does it work? Click here to see the video.

This video is the second in a series about Google Maps. Episode 2 presented the basics of how route optimization works. Click here to see the short clip.

Subscribe to the channel to get notified when the next episode shows up.

Andrew Gelman nails it again with this post titled "combining apparently contradictory evidence." He uses the example of repeated tests given to the same student, such as the scores from multiple assignments within the same course. One student might get 80, 80, 80 on three equally-weighted assignments while another student might get 80, 100, 60. The issue is that a sample size of three is too small to judge the average score, let alone the variability of the scores.
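A quick simulation - with illustrative parameters, not Gelman's - shows just how unstable both statistics are at a sample size of three.

```python
# Simulate many "students" who all truly have mean 80 and sd 12,
# each observed on only three assignments.
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(loc=80, scale=12, size=(10_000, 3))

means = samples.mean(axis=1)
sds = samples.std(axis=1, ddof=1)

print(means.std())           # the 3-score average swings by about 12/sqrt(3) ~ 7 points
print(sds.min(), sds.max())  # the 3-score sd ranges from near 0 to roughly triple the true 12
```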

I made a comment about exactly the same problem I encountered when reading applications for the MS program at Columbia. Most of the applicants have good STEM undergrad degrees and no meaningful work experience. At first, I thought the three reference letters would be useful to differentiate the applicants.

It turns out that most applicants get three good references, almost always from professors who taught them. Occasionally, an applicant would get a poor reference, i.e. one in which the professor does not recommend the student. However, in all such cases, the one poor reference is contradicted by two good references. So whom do I believe? I typically don't know the authors of these references, and therefore have no external information about their reliability.

I am very aware that the sample of three is too small. One is tempted to treat the fact that this applicant got inconsistent references, while most other applicants did not, as a "signal" that this applicant must be worse than average; but drawing that conclusion ignores the small sample size - and the small-sample problem is even worse when drawing conclusions from the observed inconsistency of the references than from their average!

***

Thinking back to the grad school admissions process also makes me more sympathetic to the practical rationale for Princeton's decision to walk back its grade-deflation policy (see critical post here).

You might think undergrad GPAs are useful for making admissions decisions. Decades of grade inflation have all but vanquished this metric, as almost every recent graduate has a GPA in, say, the 3.5 to 3.8 range.

In fact, when a metric is gamed, it is usually not just uninformative - it can be anti-informative. Such a metric can lead to very bad conclusions.

Differences in GPA no longer reflect differences in ability between students. They are more likely driven by other factors, such as (a) when the student graduated and (b) whether the school or department uses grade deflation (or a grading curve).

Take date of graduation. If someone has a GPA lower than 3.0, it is almost always the case that this student graduated in the 1990s or before. But GPA numbers are typically not presented together with the year of graduation - so if the analyst does not recognize and adjust for this long-term grade-inflation trend, then older candidates face systematic discrimination.

This line of thinking takes me back to Princeton's decision to end grade deflation. Same problem here - when the admissions officer reads a GPA, it's not typically presented next to the college that granted it. There are hundreds of colleges the officer might come across during the admissions process, so it's impossible to hold the grading policies of so many colleges in one's head. In fact, even though I know a lot about Princeton's grading policies over the years, it still takes unreasonable effort to bring this contextual information into the decision-making. For one thing, I'd have to be aware of the different periods of grade inflation, then deflation, then inflation, and so on. Therefore, I believe that the grade-deflation policy put Princeton graduates at a disadvantage when competing for scholarships, grad-school spots, etc.

***

If schools were required to release data on grades, it would be possible to overcome these interpretation problems with GPAs. Knowing the grade distributions by major, by school, and by year would be a good start.
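As a sketch of what such data would enable - the file and column names here are hypothetical - one could convert each raw GPA into a percentile within its own school and graduation year, which neutralizes both the long-term inflation trend and school-specific grading policies.

```python
# Rank each GPA within its (school, graduation year) cohort.
# "gpas_by_school_year.csv" and its columns are hypothetical.
import pandas as pd

grades = pd.read_csv("gpas_by_school_year.csv")  # columns: school, grad_year, gpa

grades["gpa_percentile"] = (
    grades.groupby(["school", "grad_year"])["gpa"]
          .rank(pct=True)
)

# A 3.6 now reads as "top x% of that school in that year" rather than a raw number.
```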

Gillian is someone who totally buys into the tech industry's "big data" pitch - that the more you share, the more you gain. She's writing tags that cue algorithms to send her relevant ads. Presumably, when she was pregnant, she was satisfied with the ads that at the time were selling her relevant products.

She's mad that the algorithm is not all-knowing, personalized and omnipotent. She expects Facebook, Instagram, Amazon, etc. to track her every move and optimize her experience just for her. She's angry when they make mistakes.

And, if one reads between the lines, her proposed solution is for the tech industry to be even more creepy: gather even more personal data, and be even more personalized. She wants ads, just not the ones she doesn't like.

This solution is not radical at all. In fact, it is exactly what tech firms have been doing for the past ten years. The "theory" is: data make ads more relevant, and if ads are not relevant enough, it is because the firms do not have enough personal data. In this sense, Gillian's column is a love letter to the tech industry.

***

The overlooked solution is to have less relevant ads or no ads at all.

In the Charles Duhigg story about Target's pregnancy prediction model (see Numbersense), one of the curious nuggets we learned is that the data scientists deliberately mixed random products in among the pregnancy goods being marketed to the women predicted to be pregnant. The official explanation was that this made the brochures appear less creepy.

In the book, I suggested a different explanation for that decision. In a predictive model like that, there are likely to be many times more false positives (i.e. women wrongly predicted to be pregnant and thus sent irrelevant materials) than true positives (i.e. women correctly predicted to be pregnant). I also speculated that many true positives would act like Gillian did - appreciating the pregnancy product ads as relevant rather than creepy. However, I believe that the false positives would complain that the pregnancy product ads are irrelevant, maybe even somewhat offensive.
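A back-of-the-envelope calculation - with made-up numbers, not Target's - shows why the false positives are likely to dominate: when the base rate of pregnancy among shoppers is low, even a model that looks accurate flags more non-pregnant than pregnant customers.

```python
# Illustrative assumptions only; none of these figures come from Target.
base_rate = 0.02             # share of customers actually pregnant
sensitivity = 0.80           # share of pregnant customers the model flags
false_positive_rate = 0.05   # share of non-pregnant customers wrongly flagged

customers = 100_000
true_positives = customers * base_rate * sensitivity                  # 1,600
false_positives = customers * (1 - base_rate) * false_positive_rate   # 4,900

print(true_positives, false_positives)  # roughly three false positives per true positive
```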

Mixing in other products softens the harm of the wrong predictions - but it simultaneously dilutes the impact of the correct predictions. What hangs in the balance is consumer interests versus the advertiser's business goals.