January 2019

Slate has this very interesting little essay about the "wind chill factor." If you're not in the U.S. (or not living in the cold parts of the U.S.), you may not know about our obsession with this number.

Typically, the weather report says the temperature is 25F but it "feels like" 10F (32 Fahrenheit is 0 Celsius). The "feels like" number is the temperature adjusted by the so-called "wind chill factor." It conveys the idea that, holding temperature constant, it feels colder when there is wind.

The Slate article covers a bunch of general issues related to inventing metrics:

People love large numbers; in this case, because we are measuring cold temperatures, they like really small numbers.

The name of the metric may have little or nothing to do with what is being measured

In seeking to make numbers more palatable to the public, people may choose less precise language that sometimes completely loses the original meaning. For example, "feels like" does not indicate that wind is at issue. Other factors like humidity also affect how cold one feels, at constant temperature.

That said, the public is hungry for statistical concepts or metrics that can be explained easily and understood instinctively. There is nothing wrong with this desire.

After a metric is established, it's not easy to dislodge it. Changing the metric renders the entire history useless. I also made this point in the chapter on obesity metrics in Numbersense (link).

How cold one feels is affected by a system of multiple factors, including temperature, wind, humidity, etc.

Any definition of the perceived temperature involves the notion of statistical adjustments.
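For the curious, the "feels like" number does come from a concrete model. To the best of my knowledge, the current U.S. wind chill index is a regression-style formula in temperature and wind speed; the sketch below is for illustration only, not the exact code any weather service runs.

```python
# Sketch of the "feels like" (wind chill) calculation.
# Formula: the 2001 North American wind chill index used by the
# U.S. National Weather Service (to the best of my knowledge);
# valid roughly for temperatures <= 50F and wind speeds >= 3 mph.

def wind_chill_f(temp_f: float, wind_mph: float) -> float:
    """Perceived temperature (F) given air temperature and wind speed."""
    if temp_f > 50 or wind_mph < 3:
        return temp_f  # outside the formula's range; report the raw temperature
    v = wind_mph ** 0.16
    return 35.74 + 0.6215 * temp_f - 35.75 * v + 0.4275 * temp_f * v

# Example: 25F with a 25 mph wind "feels like" roughly 9F,
# close to the 25F-feels-like-10F example above.
print(round(wind_chill_f(25, 25)))
```

Notice that wind is the only adjustment in this formula; humidity and sunshine, which also affect how cold one feels, are not in the model - another way in which "feels like" promises more than what is actually measured.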

Racial and cultural harmony is hard work. The news about an episode at Duke's biostats program is a good example of the intricacies involved.

According to what is known at the moment (reported also by the Duke Chronicle), the program director of the Master of Biostatistics program, who is also an assistant professor, wrote students a note asking that they always speak English in the department's building and in professional settings. This email was aimed at Chinese graduate students, who form the bulk of the enrollment of many graduate programs at U.S. colleges (although I'm not sure about this particular one).

Further, the email dispatch was motivated by two unnamed Duke professors who went to the program director to complain about Chinese students speaking Chinese "very loudly" in a "lounge or study area".

The two professors demanded photos of the Chinese students to identify the offenders "in case the students ever applied for an internship or were interviewed by them." The program director warned students that speaking their mother tongue would have "unintended consequences" that could affect their careers and recommendations.

***

According to the articles, the school has thus far taken the following steps:

The program director was immediately replaced.

The Master's program has been placed "under review".

The Dean of the Medical School issued an apology, affirming that there is no restriction of languages spoken outside the classroom, that speaking other languages outside the classroom would not affect careers or recommendations, and that student privacy will be protected.

***

It looks like the program director is being positioned as the scapegoat. There is no mention at all of the two professors who violated the students' privacy and threatened their careers and recommendations.

The outgoing program director gave the students great advice. The behavior of the two professors validates the point of view that it is in the students' interest to speak English. The Duke Chronicle did not reprint the Dean's email but saying that speaking other languages would not affect careers does not make it so - in light of evidence to the contrary!

Speaking one's mother tongue to someone else with the same mother tongue is completely natural to every human being; and doing so when living in a foreign country is to make a connection to one's heritage.

Now, practicing English while studying in the U.S. is also good advice, but that ought to be a choice.

On my (new) Youtube channel called "Fung with Data", I am using short clips to explain how data, software, and algorithms work behind the scenes to influence your daily decision-making.

The third episode just launched today, and it addresses the question of whether Google Maps or GPS navigators can really find you the "fastest route" to your destination. Lots of people I know swear by the software; how does it work? Click here to see the video.

This video is the second in a series about Google Maps; Episode 2 presented the basics of how route optimization works. Click here to see that short clip.
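The videos stay away from code, but the core idea behind route optimization is shortest-path search on a weighted graph: intersections are nodes, and road segments are edges weighted by estimated travel time. Here is a minimal Dijkstra sketch on a made-up toy network; real navigation systems layer live traffic data, road hierarchies, and heuristics on top of this basic idea.

```python
import heapq

def shortest_time(graph, start, end):
    """Dijkstra's algorithm: minimum total travel time from start to end.
    graph maps a node to a list of (neighbor, travel_time_minutes) pairs."""
    best = {start: 0}
    queue = [(0, start)]
    while queue:
        time_so_far, node = heapq.heappop(queue)
        if node == end:
            return time_so_far
        if time_so_far > best.get(node, float("inf")):
            continue  # stale queue entry
        for neighbor, minutes in graph.get(node, []):
            candidate = time_so_far + minutes
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                heapq.heappush(queue, (candidate, neighbor))
    return float("inf")

# Toy road network with made-up travel times (minutes)
toy_map = {
    "home":    [("highway", 5), ("local", 2)],
    "highway": [("office", 10)],
    "local":   [("highway", 6), ("office", 16)],
}
print(shortest_time(toy_map, "home", "office"))  # -> 15, via the highway
```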

Subscribe to the channel to get notified when the next episode shows up.

Last week, I pointed out that the U.S. payroll survey uses the following accounting rule:

The Labor Department... determines the number of people on nonfarm payrolls by asking employers for a head count during the pay period including the 12th of the month. For many federal employees, that would be the pay period from Jan. 6 to Jan. 19.

Further, one only needs to have worked for one hour during that period to be counted as employed.

This weekend, we heard that the government shutdown has no end in sight.

On Sunday (Jan 13), we heard more news - that the TSA has decided to give each TSA worker a one-time payment of one day's pay plus a $500 bonus. This means the TSA can now report all its workers as having been paid during the pay period that includes the 12th of January.

Yesterday, on the sister blog, I posted a pair of charts that dig into how the unemployment rate is computed. This number has been in the news a lot because we supposedly have a super-tight labor market, with the lowest unemployment rate since the peak of the tech boom in 2000! But the story is much more complex once one understands how unemployment levels are counted.

Another reason counting unemployment is in the news is the government shutdown. The Wall Street Journal has a nice (but rather convoluted) article about what the shutdown means for the unemployment number. This is where you can see how politics enters the statistics - via the rules that determine what gets counted and what doesn't.

***

As I pointed out in Numbersense (link), the Labor Department uses a specific definition of employment, and here is how the WSJ describes it:

The Labor Department... determines the number of people on nonfarm payrolls by asking employers for a head count during the pay period including the 12th of the month. For many federal employees, that would be the pay period from Jan. 6 to Jan. 19.

In other words, someone just needs to have worked at least one hour during that pay period to be considered "employed".

Based on that definition, if the government shutdown extends beyond Jan 19, then there should be a big negative impact on the unemployment rate.

***

But wait - politicians of neither party would want that to happen (especially not those who happen to be in power at the moment).

So what do we learn from the WSJ reporter? This:

even if the shutdown extends beyond the pay period, workers would be counted if legislation is enacted that requires agencies to pay employees for what they would have earned had there been no shutdown, a spokesman for the Labor Department’s Bureau of Labor Statistics said. Such practice would be consistent with how the department accounted for workers during previously lengthy shutdowns.

The reason given for this is precedent. But that actually means that the unemployment rate in that scenario would not accurately portray the state of employment in the U.S.

An economist cited in the article further explained that the following could also happen:

some federal employees initially furloughed have been called back to work and some are being paid with supplemental funds.

Since it only takes one hour of work during that specific pay period to be counted as employed, it's not hard to come up with ways to turn these non-working people into employed people.
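To make the accounting rule concrete, here is a toy sketch of the classification. The one-paid-hour threshold comes from the rule quoted above; the worker records are invented for illustration and are certainly not the BLS's actual code.

```python
# Toy illustration of the counting rule: anyone paid for at least one hour
# during the pay period containing the 12th is counted as employed.
# The records below are made up for illustration.

workers = [
    {"name": "regular employee",               "paid_hours_in_ref_period": 80},
    {"name": "furloughed, no back pay yet",     "paid_hours_in_ref_period": 0},
    {"name": "furloughed, given one day's pay", "paid_hours_in_ref_period": 8},
]

def is_counted_as_employed(worker):
    return worker["paid_hours_in_ref_period"] >= 1

employed = [w["name"] for w in workers if is_counted_as_employed(w)]
print(employed)
# Only the furloughed worker with zero paid hours drops out of the employed count;
# a single paid day (or a back-pay law) flips the classification.
```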

As a data analyst, there is no getting around learning all these nuances if you want to interpret data properly. It's the nature of our work.

In the second half of 2018, in the aftermath of the Cambridge Analytica-Facebook story, the news media have broken out of a stupor, and started realizing that the world of the Web and mobile is filled with fake data. I've written about this problem quite a bit. For example, my 2017 piece on the birth of the "fake news business" is still worth your time.

I also urged data analysts to recognize the amount of fake data that we are analyzing while looking the other way. This post describes some real examples of fake data discovered by journalists. Even earlier, in 2015, I authored a piece for HBR discussing Augustine Fou's work on uncovering ad fraud - driven by automated bots.

New York magazine just noticed that a lot of Web data are fake. This is a nice piece discussing the current state - it has only gotten worse. Among the trove of lies and damned lies mentioned in this article are fake web traffic, fake websites, fake clicks, fake mouse movements, fake social network accounts, fake cookies, fake video views, fake time on site, fake subscribers, fake video viewers, fake AI assistants, fake Instagram influencers, fake sponsored content (i.e. ads that pretend to be news), fake people, ... One inexplicable omission is the fake product review, which is probably the most commonly encountered species.

The tech industry is driving a lot of this and has not taken proper action to contain the problem before it gets out of hand. The fake-data problem will evolve into a trust problem; in the last half of 2018, we started to hear rumblings of discontent.

A new video is up on our channel. I explain why Google Maps - and specifically the embedded navigation software - is one of the insanely great products from the Big Data era. (Similar products are available from GPS providers such as Garmin and TomTom, but Google has superior marketing and "pricing".)

How do Big Data power this product behind the scenes? Where do the data come from? Why do the same Big Data that bring you insanely great products also threaten your privacy? To learn more, click here.

If you like what we are doing with the videos, let us know by subscribing to our Youtube channel and liking our videos. Youtube plays this game where it withholds goodies - such as the ability to customize a video's URL into something you can remember rather than a string of random characters - until I reach some minimum number of subscribers and video plays. This is frustrating for anyone who is not a big brand and does not want to participate in the fake-data economy.

Andrew Gelman nails it again with this post titled "combining apparently contradictory evidence." He uses the example of repeated tests given to the same student, such as the scores from multiple assignments within the same course. One student might get 80, 80, 80 on three equally-weighted assignments while another student might get 80, 100, 60. The issue is that a sample size of three is too small to judge reliably not only the average score but especially the variability in scores.

I made a comment about exactly the same problem I encountered when reading applications for the MS program at Columbia. Most of the applicants have good STEM undergrad degrees and no meaningful work experience. At first, I thought the three reference letters would be useful to differentiate the applicants.

It turns out that most applicants get three good references, almost always from professors who taught them. Occasionally, an applicant gets a poor reference, i.e. the professor does not recommend the student. However, in all such cases, the one poor reference is contradicted by two good references. So whom do I believe? I typically don't know the authors of these references, and therefore have no external information about their reliability.

I am very aware that a sample of three is too small. One is tempted to treat the fact that this applicant got inconsistent references while most other students did not as a "signal" that this applicant must be worse than average, but drawing that conclusion ignores the small sample size - and, as in Gelman's grades example, the small-sample problem is even worse when the conclusion rests on observed inconsistency!
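To see how little three scores tell us about variability, here is a quick simulation sketch (the true mean and standard deviation are hypothetical): even when students have identical "true" consistency, a sample of three assignments will regularly make one of them look far more erratic than another.

```python
import random

random.seed(1)

def simulate_spreads(true_mean=80.0, true_sd=10.0, n_assignments=3, n_students=10000):
    """Draw n_assignments scores per student from the SAME distribution and
    record each student's observed range (max - min)."""
    spreads = []
    for _ in range(n_students):
        scores = [random.gauss(true_mean, true_sd) for _ in range(n_assignments)]
        spreads.append(max(scores) - min(scores))
    return spreads

spreads = simulate_spreads()
# Fraction of identical students whose three scores spread 30+ points apart,
# i.e. who look like the "80, 100, 60" student purely by chance.
print(sum(s >= 30 for s in spreads) / len(spreads))
```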

***

Thinking back to the grad-school admissions process also makes me more sympathetic to the practical rationale for Princeton's decision to walk back its grade-deflation policy (see critical post here).

You might think undergrad GPAs are useful for making admissions decisions. Decades of grade inflation have vanquished this metric, as almost every recent graduate has a GPA in, say, the 3.5 to 3.8 range.

In fact, when a metric is gamed, it is usually not just uninformative - it can be anti-informative. Such metrics can lead to very bad conclusions.

Differences in GPAs no longer reflect differences in ability between students. They are more likely to be driven by other factors, such as (a) when the student graduated and (b) whether the school or department uses grade deflation (or a grading curve).

Take date of graduation. If someone has a GPA lower than 3.0, it is almost always the case that this student graduated in the 1990s or before. But GPA numbers are typically not presented together with the year of graduation - so if the analyst does not recognize and adjust for this long-term grade-inflation trend, then older candidates face systematic discrimination.

This line of thinking takes me back to Princeton's decision to end grade deflation. Same problem here - when the admissions officer reads a GPA, it is typically not presented alongside the grading norms of the college that granted it. There are hundreds of colleges the officer might come across during the admissions process, so it's impossible to hold the grading policies of so many colleges in one's head. In fact, even though I know a lot about Princeton's grading policies over the years, it still takes unreasonable effort to bring this contextual information into the decision-making. For one thing, I'd have to be aware of the different periods of grade inflation, then deflation, then inflation, and so on. Therefore, I believe the grade-deflation policy put Princeton graduates at a disadvantage when competing for scholarships, grad-school spots, etc.

***

If schools were required to release data on grades, it would be possible to overcome these interpretation problems with GPAs. Knowing the grade distributions by major, by school, and by year would be a good start.
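With such data in hand, one could report where a GPA sits within its own school-and-year cohort instead of the raw number. Here is a minimal sketch of that idea, assuming (hypothetically) that cohort means and standard deviations were published; the figures below are invented for illustration.

```python
# Sketch: convert a raw GPA into a within-cohort percentile, assuming we had
# access to grade distributions by school and graduation year (the numbers
# below are made up for illustration).

from statistics import NormalDist

# (school, graduation year) -> (cohort mean GPA, cohort std dev)
cohort_stats = {
    ("College A", 1995): (3.0, 0.40),
    ("College A", 2018): (3.5, 0.25),
    ("College B", 2018): (3.2, 0.35),
}

def cohort_percentile(gpa, school, year):
    """Percentile of a GPA within its school-and-year cohort (normal approximation)."""
    mean, sd = cohort_stats[(school, year)]
    return 100 * NormalDist(mean, sd).cdf(gpa)

# The same 3.4 GPA reads very differently once the cohort context is added.
print(round(cohort_percentile(3.4, "College A", 1995)))  # well above that cohort's average
print(round(cohort_percentile(3.4, "College A", 2018)))  # below that cohort's average
```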