
Just posted a short video that explains one of the techniques used to work with observational data (or found data). This type of data is extremely common in the Big Data world. The data are collected by some operational process, and in the memorable words of the late Hans Rosling, they are a bag of numerators without denominators. In this case, the database covers car crash fatalities. You only have the reported crashes linked to deaths; the database contains no information about crashes without fatalities, let alone about safe driving.

As in most scientific studies, the original researchers made a claim of statistical significance, i.e. they found something out of the ordinary. (There was an excess of fatalities on April 20.) However, a second research group took a different look at the data and demonstrated that what happened was more common than first thought.

How do statisticians measure how common something is? One takeaway is how to define the reference (control) group. Another takeaway is replication: repeating the same style of analysis over different slices of the data.
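To make these takeaways concrete, here is a minimal sketch (my own illustration, not the researchers' code) of comparing April 20 against nearby control days, and then replicating the comparison across slices of the data, in this case years. The daily fatality counts are simulated; in practice they would come from the crash database.

```python
import numpy as np
import pandas as pd

# Simulated daily fatality counts standing in for the real crash database.
rng = np.random.default_rng(0)
dates = pd.date_range("1992-01-01", "2016-12-31", freq="D")
crashes = pd.DataFrame({"date": dates,
                        "fatalities": rng.poisson(100, size=len(dates))})
crashes["year"] = crashes["date"].dt.year
crashes["month_day"] = crashes["date"].dt.strftime("%m-%d")

# Takeaway 1: define the reference (control) group.
# Here, April 20 is compared with the same date one week earlier and one week later.
april20 = crashes[crashes["month_day"] == "04-20"]
controls = crashes[crashes["month_day"].isin(["04-13", "04-27"])]

# Takeaway 2: replicate the comparison over slices of the data (by year).
by_year = pd.DataFrame({
    "april20": april20.groupby("year")["fatalities"].mean(),
    "control": controls.groupby("year")["fatalities"].mean(),
})
by_year["excess_pct"] = 100 * (by_year["april20"] / by_year["control"] - 1)
print(by_year["excess_pct"].describe())  # how often does April 20 exceed its controls?
```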

Click here to see the video. Don't forget to subscribe to the channel to see future videos.

For a long-form discussion of what is covered in the video, see this blog post.

A couple of researchers recently claimed to have found evidence of “a battle of the thermostat”: specifically, they argued that women perform significantly worse in a math test when it is administered at lower room temperatures while men’s performance does not decline. They did not find a similar gender gap in a word test.

The analysis strategy is similar to one that is employed by business analysts so readers here might be interested in how it went off track.

First, the researchers looked at the effect of temperature on test scores in aggregate. They found none. See the chart below, which reproduces similar charts in the research paper.

Note that in my chart, the scores have been standardized (to be explained below). For both math and word scores, the difference in scores between the extreme temperatures of 15 and 32 degrees Celsius (roughly 60 and 90 Fahrenheit) is smaller than 0.25 standard units, clearly not statistically significant by any measure.

The next step in the analysis strategy is to rotate through other factors that might influence test scores, such as gender, language spoken, and college major. In the business world, this is often called a deep-dive analysis. The question being asked is whether the effect of temperature on test scores differed by [factor X]. The researchers struck gold when splitting the data by gender. Here are the relevant charts:

In the top chart displaying math scores, it appears that the trend line for men is almost flat while women's scores progressively increase as temperature increases. According to the usual convention of statistical significance, the effect is strong enough to be published. It is even possible to explain this observation with the "battle of the thermostat" theory.
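For readers who want to see the mechanics, the deep-dive question - does the effect of temperature differ by gender? - can be expressed as an interaction term in a regression. Below is a minimal sketch using simulated data with hypothetical column names; it is not the researchers' code.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data: one row per student, with a built-in temperature effect for women only.
rng = np.random.default_rng(1)
n = 500
df = pd.DataFrame({"temp": rng.uniform(15, 32, n),          # room temperature, Celsius
                   "gender": rng.choice(["F", "M"], n)})
df["math_score"] = 50 + 0.4 * df["temp"] * (df["gender"] == "F") + rng.normal(0, 10, n)

# Step 1: aggregate model with temperature alone.
print(smf.ols("math_score ~ temp", data=df).fit().params)

# Step 2: deep dive - the temp:gender interaction term captures whether
# the temperature slope differs between men and women.
print(smf.ols("math_score ~ temp * C(gender)", data=df).fit().params)
```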

***

Many other researchers have expressed skepticism that the reported effect can be replicated in other settings. See, for example, here.

I explored the data a bit more. Here are the raw data, shown as side-by-side boxplots. I have arranged the sessions by average temperature, ordered from lowest to highest.

The following features caught my attention:

The math scores of women (top chart, red boxplots) show unusually low variability. Their dispersion is much smaller than that of men's math scores, and also much smaller than that of women's word scores (bottom chart, red boxplots).

Focusing only on the math scores (top chart), and inspecting the data session by session, from the right to the left, I found that the gender difference is hard to see (often obscured by the high variability of the male scores), except in a few sessions with the lowest temperatures (boxed area).

In terms of math scores, the shape of the distribution for men is clearly different from that for women. The scores for men were generally higher, and contain more extreme values on the high side. If a male and a female student score the same in absolute terms, the scores do not mean the same thing in relative terms – each relative to other students of the same gender.

***

One way to deal with this difference in variability is to standardize the scores by gender. This leads to the following chart, in which the scores are expressed relative to each gender’s distribution.

Notice how the math and word charts look almost identical. Also compare them to the chart above, which uses the raw data. This shows the fragility of statistical significance: the top chart shows a marginally significant effect while the bottom chart shows none.
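The standardization step itself is simple. Here is a sketch in pandas, using a tiny made-up dataset: each score is re-expressed as the number of standard deviations above or below the mean of the student's own gender group.

```python
import pandas as pd

def standardize_within(df, value_col, group_col):
    """Convert raw scores into z-scores computed within each group."""
    grouped = df.groupby(group_col)[value_col]
    return (df[value_col] - grouped.transform("mean")) / grouped.transform("std")

# Tiny made-up example: same relative position, different absolute scores.
df = pd.DataFrame({"gender": ["F", "F", "F", "M", "M", "M"],
                   "math_score": [48, 50, 52, 40, 50, 60]})
df["math_std"] = standardize_within(df, "math_score", "gender")
print(df)  # a 52 for a woman and a 60 for a man are both one standard deviation above their group mean
```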

***

[Technical note: Here is a scatter plot of the standardized math scores against the raw scores, grouped by gender. You can see that the raw women's scores fall in a narrower range, while men's scores are more dispersed and contain some high extremes.]

***

Deep-dive analysis sometimes leads to unexpected discoveries. Aggregate data sometimes mask subgroup differences. But at other times, the subgroup difference is a mirage. At the disaggregated level, the sample size is smaller, and there may also be a difference in variability.

For more discussion of subgroup analysis, see Chapter 3 of Numbers Rule Your World (link), in which I discuss how the insurance industry and the educational testing community tackle this problem.

The Atlantic reports on the dynamics of yet another group of scientists coming to grips with having wasted time and resources chasing down a dead end. (link)

It's a good read but long. Here is the gist of it:

Almost 20 years ago, some researchers made a huge splash by claiming to have discovered the "depression gene". That one gene eventually engendered 450 publications, and, counting related genes, over 1,000 publications. A recent large-scale "validation" study is likely to bring down the entire cottage industry - the depression gene turns out to have little explanatory power for depression after all.

Gene data is an example of a type of Big Data. Big Data can be big in terms of the number of individuals in the dataset, or the number of measurements per individual. Two decades ago, the scale was attained by virtue of more measurements, not more individuals. The original study looked at about 300 or so individuals but each person's genome is vast.

The basic analysis is to compare the average depressed individual versus the average not-depressed individual in the sample. The data analyst sifts through large numbers of genes to find one or a few that are highly correlated with having depression. This is a classic fishing expedition, because of the large number of candidate genes, and also because of the large number of ways to define depression.
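The danger of such a fishing expedition is easy to simulate. In the sketch below (entirely made-up data), thousands of genes have no real relationship with depression status, yet a conventional 5-percent significance threshold lets hundreds of them "pop" purely by chance.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_people, n_genes = 300, 5000                   # few individuals, many measurements
depressed = rng.choice([0, 1], size=n_people)   # made-up depression labels
genes = rng.normal(size=(n_people, n_genes))    # gene measurements with NO true effect

# Compare the average depressed vs. not-depressed person, one gene at a time.
_, pvalues = stats.ttest_ind(genes[depressed == 1], genes[depressed == 0], axis=0)

hits = int(np.sum(pvalues < 0.05))
print(f"{hits} of {n_genes} genes look 'significant' despite zero real signal")
# Expect roughly 5% of 5,000, i.e. about 250 false positives.
```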

Such an analysis rides on top of a "model" of the world in which a single gene is responsible for depression. Over the years, the scientific community has discovered that this model is wrong. The new model assumes that depression is influenced by a large set of genes, each contributing a weak effect.

This type of structure is very hard to elicit from the typical datasets of the past - those with numerous measurements on few individuals. Nowadays, we have data on lots of individuals, but the sourcing of the data and other problems pose formidable challenges. It's also not clear how a model that spreads the blame thinly across a large number of genes can guide treatment.

Science is proceeding as it should - weak theories are overturned with more research. The article laments that it took 20 years to turn the tide: earlier warnings were ignored, the publish-or-perish culture in academia creates perverse incentives, retractions of scientific studies are fraught, and so on.

***

I recently wrote about the challenge of Big Data expanding the variety of measurements here. Also, in writing Numbersense (link), I was concerned that the explosion of data collection causes an avalanche of false-positive science.

The authors of these articles express genuine shock and awe. They apparently believed that “machine learning” means no humans involved. The tech industry allows this misconception to fester by being opaque about how machine learning works.

(The reporters are also dismayed by the privacy invasion. The Echo speakers are constantly recording in users’ households. Facebook did not have explicit permission from users to send their data out for labeling.)

***

Humans have always been a part of the machine learning workflow, and will continue to be. Let’s use one of the examples in the Facebook report to illustrate this point.

Computers work fast, and can make tons of predictions quickly. The question is whether these predictions are accurate. It’s one thing to create these models in the laboratory; it’s another thing when they are unleashed on the world, affecting Facebook users, e.g. by deleting videos that are predicted to contain profanity.

Why should Facebook and its data scientists care about predictive accuracy?

User complaints. When users find their videos deleted due to profanity, they complain if said clips do not in fact contain profane language. Other users are upset to unsuspectingly encounter videos with profanity (that the machine fails to identify and delete).

***

It’s not easy to measure whether the machine-learning model is correctly predicting profanity. The machine can’t be both decider and judge at the same time. The judge typically is a human who views the video to determine whether it contains profanity. These human judges are the “annotators” described in the news articles. They are hired to look through videos and apply a profanity “label” if they find profanity.

As disclosed in the articles, companies typically hire two or three judges for each item because judging profanity is somewhat subjective. They might also order more detailed labeling, e.g. labeling types of profanity as opposed to applying one overall profanity label.

***

Now let’s remove the assumption that we already have a machine learning model. Where does this model come from? Such a machine has to know what features of the video are correlated with presence of profanity. To discover this correlation, the machine needs to be told which videos have profane language, and which do not.

This is a chicken-and-egg problem, and it is solved by having humans label a big batch of videos at the start, building the “training data”. In the Facebook example, they hired over 200 people to create data labels, later reduced to 30. The first team was building a large training dataset; once the predictive model has been produced, the labels generated by the reduced workforce are used to monitor the accuracy of its predictions.
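Here is a schematic sketch of those two roles for human labels - a big labeled batch to train the model, then a steady trickle of fresh labels to monitor it after deployment. This is my own toy illustration with simulated features, not Facebook's actual pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

def labeled_batch(n):
    """Stand-in for videos: feature vectors plus a human-supplied profanity label."""
    X = rng.normal(size=(n, 20))
    y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n) > 0).astype(int)
    return X, y

# Phase 1: the large annotation team builds the training data.
X_train, y_train = labeled_batch(50_000)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Phase 2: a much smaller team keeps labeling a sample of new items,
# used only to monitor the deployed model's accuracy.
X_monitor, y_monitor = labeled_batch(1_000)
print("monitored accuracy:", accuracy_score(y_monitor, model.predict(X_monitor)))
```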

Any company that claims to use our data to predict our behavior must create training data, i.e. labeled data. In most cases, humans must create the labels – by reading our emails, listening to our conversations, viewing our videos, reviewing our calendars, scanning our receipts, and so on.

How far companies should go and what methods they should use in collecting such data are ethics questions that should be discussed.

If you are a frequent flier, you already know the gist of this nice article by the BBC: that airlines are allowed to sandbag the flight durations. A flight that takes 60 minutes will be portrayed to fliers as taking twice as long, if not longer.

The airlines are even allowed to lie about this practice. When your flight is delayed taking off, the captain claims that s/he will “make up for the delay,” as if the plane could be driven faster on command. (Were they deliberately going slower before?) The truth is that the schedule is padded, so that it can absorb a limited amount of delay.

This quote sums the situation up: “By padding, airlines are gaming the system to fool you.”

At the very bottom of the article, you’d find the potential motivation – to avoid compensating travelers for long delays, as required by law in some countries.

***

The situation here is similar to the road congestion problem discussed in Chapter 1 of Numbers Rule Your World (link). Managing perceived time is as important as managing actual time experienced by the traveler. Of course, reducing actual wasted time is preferred, especially to scientists working on the problem, but when the road/sky capacity is fixed and over-subscribed, it’s almost impossible to attain. The second half of the article addresses “why don’t the airlines work on efficiency instead of lengthening flight times?”

***

Another quote reveals: “Over 30% of all flights arrive more than 15 minutes late every day despite padding.”

The “on-time arrival rate” blessed by the Department of Transportation (DoT) is not what you think it is.

Let’s take a random flight that takes 60 minutes. This flight schedule might be padded and advertised as departing at 2 pm and arriving at 4 pm.

If the flight departs at 2 pm and takes 60 minutes, then you’d think on-time arrival is defined as arriving at 3 pm. You might agree to allow for some slack, say, 15 minutes. In this case, on-time arrival is arriving before 3:15 pm.

Given the discussion, you now know that on-time arrival is actually arriving before 4 pm since the schedule is padded not by 15 minutes but by 60 minutes.

And then you’d be wrong! Because there is padding upon padding. Airlines are allowed to claim “on-time arrival” if the flights arrive within 15 minutes of the scheduled arrival time, which in our example, has been padded by 60 minutes. So any flight arriving before 4:15 pm is counted as “on-time.”
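A few lines of code spell out the definition. The numbers below are illustrative only.

```python
from datetime import datetime, timedelta

def dot_on_time(scheduled_arrival, actual_arrival, slack_minutes=15):
    """On-time means arriving within 15 minutes of the *scheduled* (padded) arrival."""
    return actual_arrival <= scheduled_arrival + timedelta(minutes=slack_minutes)

departure = datetime(2019, 1, 1, 14, 0)              # 2 pm departure
scheduled_arrival = datetime(2019, 1, 1, 16, 0)      # padded schedule promises 4 pm
actual_arrival = departure + timedelta(minutes=130)  # the 60-minute flight took 130 minutes

print(dot_on_time(scheduled_arrival, actual_arrival))  # True: 4:10 pm beats the 4:15 pm cutoff
# Measured against the unpadded 60-minute flight time, the same flight is 70 minutes late.
```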

***

Padding is not purely a bad thing. A certain amount of padding is necessary because lots of flights are vying for a limited amount of airport and air space. A padded schedule is a more accurate schedule. It acknowledges other factors that cause delays in arrival.

The gaming of the padded metric is what gets people worked up. Gaming is possible because padding inserts subjectivity into the measurement. So long as subjectivity cannot be avoided, gaming is here to stay.

***

The reporter said airlines have spent billions on technologies to improve efficiency, i.e., managing actual experienced time. But, “this has not moved the needle on delays, which are stubbornly stuck at 30%.”

Now square that statement with this one: “Billions of dollars in investment [in modernising air traffic control] have in fact halved air traffic control-caused delays since 2007 while airline-caused delays have soared.”

Does this sound like the new technologies have successfully reclassified delays from air-traffic-control-caused to airline-caused? Passing the buck?

The next time your flight is delayed, the airlines will likely tell you, “it’s not us, it’s the weather.”

Just finished reading The Undoing Project by Michael Lewis, his bio of the Kahneman and Tversky duo who made many of the seminal discoveries in behavioral economics.

In Chapter 7, Lewis recounts one of their most celebrated experiments which demonstrated the “base rate fallacy.”

Here is one version of the experiment. The test subjects are asked to make judgments based on a vignette.

Psychologists have administered tests to 100 people, 70 of whom are lawyers and 30 are engineers.

(A) If one person is selected at random from this group, what is the chance that the selected person is a lawyer?

(B) Dick is selected at random from this group. Here is a description of him: “Dick is a 30 year old man. He is married with no children. A man of high ability and high motivation, he promises to be quite successful in his field. He is well liked by his colleagues.” What is the chance that Dick is a lawyer?

Those subjects who answered (A) made the right judgment, in accordance with the base rate of 70 percent.

The answer to (B) should be the same, since it shouldn't matter whether the random person is named Dick or not, and the generic description provides no useful information to determine Dick’s occupation. However, those subjects who answered (B) revised the chance down to about 50-50. The experiment showed that access to Dick’s description led people astray – to ignore the base rate. Note that the base rate here is the prior probability.
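In Bayesian terms, an uninformative description should leave the prior untouched. Here is a small sketch of the update (my illustration, not from the book):

```python
def posterior_lawyer(prior, likelihood_ratio):
    """P(lawyer | evidence); likelihood_ratio = P(evidence | lawyer) / P(evidence | engineer)."""
    prior_odds = prior / (1 - prior)
    post_odds = prior_odds * likelihood_ratio
    return post_odds / (1 + post_odds)

# Dick's description fits lawyers and engineers equally well, so the likelihood ratio is 1.
print(posterior_lawyer(prior=0.7, likelihood_ratio=1.0))  # 0.7 -- the base rate should stand
# Subjects instead answered roughly 0.5, as if the base rate carried no weight.
```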

***

What are the practical applications of the KT experiment for business data analysts?

tl;dr

Before throwing the kitchen sink of variables (features) into your statistical (machine learning) models, review the literature on the base rate fallacy starting with Kahneman-Tversky experiments.

1. Adding more variables can make your predictions worse

Let's start with what kind of additional information is provided by Dick’s description. The sample size has not changed – it’s still one. The data expanded only in the number of variables (or features). Specifically, these eight additional variables:

X1 = age

X2 = gender

X3 = marital status

X4 = number of children

X5 = ability level

X6 = motivation level

X7 = expected level of success in field

X8 = popularity among colleagues

In today’s age of surveillance data, it is all too easy for any analyst to assemble more variables. The KT experiment shows that having more variables does not imply you have more useful information. Worse, those extra variables may distract you from the base rate, leading to worse predictions.

2. Machines are even more susceptible than humans

If humans are prone to such mistakes, should we use machines instead? Sadly, machines will perform worse.

Machines allow us to process even more variables at even greater efficiency. Instead of eight useless variables, you can now add 800 or even 8,000 useless variables about Dick. The machines will then inform you which subset of these variables “pop.” The more useless data you add in, the higher the chance you will encounter an accidental correlation.
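This last point is easy to demonstrate by simulation. In the sketch below (made-up data), the outcome has no real relationship with any variable, yet the strongest accidental correlation keeps climbing as useless variables are added.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100                       # the sample size stays fixed
y = rng.normal(size=n)        # an outcome unrelated to everything

def max_abs_corr(X, y):
    """Largest absolute correlation between y and any column of X."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corrs = Xc.T @ yc / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return np.abs(corrs).max()

for n_vars in (8, 80, 800, 8000):
    X = rng.normal(size=(n, n_vars))   # useless candidate variables about "Dick"
    print(f"{n_vars:5d} variables -> strongest accidental correlation: {max_abs_corr(X, y):.2f}")
```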

A report came out from the Stanford School of Medicine about a study of the Apple Watch's health monitoring features. Some headline writers are proclaiming that "finally, there is proof that these watches benefit our health!" For example, Apple Watch Stanford Study Shows How It Can Save Lives (link).

When you read the official story, you will learn the following facts about the study:

The research is funded by Apple

It was a purely observational study in which they followed about 400,000 people who wear Apple Watches

Participants must own both an Apple Watch and an iPhone to be eligible (plus meeting other criteria)

There was no "control" group - they did not follow anyone who did not use Apple Watch or use any other health monitoring wearables

Every participant is self-selected

The device issued warnings to only 0.5 percent of the participants (~ 2,160)

Those who received a warning were directed to a video consultation; and the doctor decided whether or not to send the participant an ECG patch, which is used to establish the "ground truth". Only about 30 percent were sent patches, and of those, 70 percent (450) returned the patches for analysis.

Only those who had ECG data were analyzed. One third of these were shown to have experienced "atrial fibrillation" (irregular heartbeat). This means that two-thirds got false alarms. But if we also count as false alarms the 70% who were not sent patches after the video consultation, then out of every 100 warnings, only about 7 were validated (see the arithmetic sketch after this list).

There is no discussion of false negatives: did any of the 99.5 percent who did not receive warnings experience irregular heartbeats?

We do know that if there were significant false negatives, then more warnings would have to be sent, which pushes up false alarms.

Despite the headlines, any lives saved were extrapolated.
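To see where the 7-in-100 figure comes from, here is the arithmetic spelled out, using the approximate counts reported above.

```python
warnings = 2_160                    # ~0.5% of ~400,000 participants
sent_patch = 0.30 * warnings        # ~30% were sent an ECG patch after the video consultation
returned = 0.70 * sent_patch        # ~70% of those returned the patch (~450 people)
confirmed = returned / 3            # about one third showed atrial fibrillation

print(round(sent_patch), round(returned), round(confirmed))              # ~648, ~454, ~151
# Treating every warning without a confirmed diagnosis as a false alarm:
print(f"validated warnings: {100 * confirmed / warnings:.0f} per 100")   # ~7 per 100
```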

There are some major methodological limitations to this study.

Firstly, the study design prevents drawing conclusions of the type "People wearing Apple Watches .... compared to those who did not wear Apple Watches." It does not include anyone not wearing Apple Watches.

Secondly, it's difficult to interpret the accuracy metrics. Is 20 percent false alarms a good or bad number? Is 0.5 percent receiving warnings a reasonable proportion given the demographic and health characteristics of the study population?

Hopefully, this study is just the beginning, and more rigorous studies are being planned.

Reader AR pointed me to this Fast Company article that examines the ethics of A/B testing.

The only way to comprehend this point of view is to think of A/B testing not as a scientific experiment but as a decision-making process that involves running an experiment. The researchers are unhappy that A/B tests could lend support to decisions that have undesirable impact on society.

Two such examples are described:

Two images are tested for a job ad. During the test, site visitors were shown one of the two images, selected at random. The winner of the test is an image that disproportionately drives male applicants.

Separate pricing tests are run in different zip codes. The "winning" prices at the conclusion of these tests are different for different zip codes. Because racial profiles differ by zip code, prices are in effect different for different races. Therefore, the test result leads to race-based discrimination.

There are two important questions to discuss here. First, what is the alternative to A/B testing? Is that method of decision-making better? Second, is the harm produced by the experiment itself, or by the decision made as a result?

Alternatives to A/B Testing

Consider the image test described above. Presumably, the test is run because someone believes that one of those two images might perform better at driving applicants. At most companies, a test sees the light of day after teams of people debate and prioritize testing ideas. If a test including a sexist image is run, then the team in charge of testing approved it for some (possibly bad) reason.

If they didn't have A/B testing, how would they have decided which image to run? And if the image is not explicitly sexist - in other words, if the analyst had to analyze the data to learn that one image drove more male applicants - how would that insight be surfaced without running the other image? The alternative decision process may be even worse.

It is certainly true that automated A/B testing is risky - because no human beings are involved in turning test results into actions. The absence of humans is usually touted as a benefit by vendors of such testing tools. In this example, a human analyst reports on the test result, and includes the analysis by gender showing that while total applications increased, the winning image disproportionately attracted male applicants. The decision-makers can and should decide not to adopt the winning image based on that analysis. The A/B test revealed the bias but did not cause it.
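Here is a toy version of the report such an analyst might produce, with entirely made-up numbers: the "winning" image lifts total applications, but a breakdown by gender shows where the lift comes from.

```python
import pandas as pd

# Made-up test results: applications by image variant and applicant gender.
results = pd.DataFrame({"variant": ["A", "A", "B", "B"],
                        "gender": ["F", "M", "F", "M"],
                        "applications": [480, 520, 430, 690]})

print(results.groupby("variant")["applications"].sum())   # B "wins" on totals: 1,120 vs 1,000

mix = results.pivot(index="variant", columns="gender", values="applications")
print(mix.div(mix.sum(axis=1), axis=0))                    # but B's applicant pool skews heavily male
```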

Even without the gender issue, such analysis and discussion of results is necessary. For example, ad clicks can be generated by placing ads near scroll bars to stimulate accidental clicking. Human analysts can report that clicks increased but only through accidental clicking. The decision-makers can and should decide not to implement the winning design.

From where does the harm come?

The other example is more far-fetched. I am reverse-engineering the pricing test as described. Given that the test led to different prices for different zip codes, they must have been running separate A/B tests stratified by zip code. Given the law of supply and demand, it might be the case that the winning price would be lower in poorer zip codes and higher in richer zip codes. This definitely results in price discrimination by zip code. If the design team did not want price discrimination by zip code, then such a test design would not have been approved, so the test itself isn't creating the harm.

Further, race-based price discrimination is alleged because zip codes are correlated with race. But almost all variables are correlated with race. Age is correlated with race; so are income, education, what websites one visits, etc. This standard would therefore ban all segmentation and targeting policies. The only possible pricing policy would be one price for all.

***

In short, human supervision of A/B testing from design to interpretation is definitely needed. A/B tests provide a wealth of data to support decision-making. The biases highlighted by the Fast Company article are merely revealed by the testing - they are not caused by it.

I don’t agree with Daniel’s conclusions in his article in Slate about the measles “crisis”, but he did his research and there is a lot to chew on. You don’t have to agree with him to find this article thought-provoking.

There is one paragraph which everyone should read. It’s a celebration of science, and how it saved lives. (Daniel used this story for a different purpose: he argued that we never “eradicated” measles, and therefore, the anti-vaxxers could never have reversed some mythical victory.)

During the most recent, major wave of measles infection in the U.S., between 1989 and 1991, close to 56,000 people fell ill and more than 100 people died...The 1989–91 epidemic was large enough and deadly enough to cast light on two pressing problems: First, that a single vaccine dose was not sufficient to protect children, and second, that black and Latino children, especially those living in urban areas, were less likely to be vaccinated, and thus more vulnerable to the disease.

Efforts were made to address both issues in the years that followed. A second measles shot was recommended for all children, while the federal government ramped up efforts to provide free vaccination to high-risk groups. The plan worked. By 1994, vaccine coverage for measles was closing in on 90 percent. The number of cases reported every year soon dropped from the thousands into the hundreds, and then into the tens. It was in response to this decline that experts from the CDC announced that measles had been “eliminated” from the U.S.

It's a great example of finding the drivers behind the data, and executing actions that successfully changed the numbers.

***

From the rest of the article, here are some useful tidbits:

Vaccines work because of a phenomenon called “herd immunity,” which is a type of wisdom of the crowd. Diseases spread when people interact with each other. If both sides are vaccinated, then the risk of spreading is much, much lower. Thus, the higher the proportion of the vaccinated, the lower the risk for everybody. The threshold desired by health authorities is 93 to 95 percent (see the sketch after these tidbits).

In the past several decades, at a national level, the vaccination rate has stayed around 91 percent. So it is below the desired level but seems close enough not to cause alarm.

Discredited research by a Dr. Wakefield ignited the anti-vaxxer movement. The publication of such research allows people to confirm their prior beliefs, and it often is hard to dislodge such beliefs, even when the research has been invalidated.

In some localities, e.g. the Somali community in Minnesota, the rate of vaccination has dropped drastically to below 50%. Those are isolated communities, and in aggregate, the level of vaccination has not changed.

The number of measles cases, while small, has shown signs of increase. There have been triple-digit case counts in 7 of the last 10 years, but in only two of the previous decade.

The fatality rate of measles is extremely low, 11 deaths in 18 years, which is said to be similar to the risk of being killed by scorpions. Those who read Chapter 2 or 4 of Numbers Rule Your World (link) will recognize the need to think about the cost of errors.
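As an aside promised above, the 93-to-95-percent threshold is consistent with the textbook herd-immunity formula, 1 - 1/R0, where R0 is the number of people one infected person would infect in a fully susceptible population. The R0 values below are commonly cited estimates for measles, not figures from the article.

```python
def herd_immunity_threshold(r0):
    """Fraction immune needed so that each case infects, on average, fewer than one other person."""
    return 1 - 1 / r0

for r0 in (12, 15, 18):   # commonly cited range for measles (my assumption, not from the article)
    print(f"R0 = {r0}: threshold = {herd_immunity_threshold(r0):.1%}")
# Roughly 92% to 94%, in line with the 93-95% target quoted above.
```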

Happy Lunar New Year! And greetings to Orlando people who are coming to my dataviz seminar this morning.

***

What’s going on with digit recognition, one of the signature applications of machine learning?

Before self-driving cars, before image recognition, before machine translation, there was digit recognition: computers are trained to read and recognize hand-written numbers. This problem shares several of the key components of problems tailor-made for machine learning methods (a toy sketch follows the list):

The correct answer is unambiguous for each item (i.e. image of a digit). The author of the digit has a particular number in mind.

The range of possible answers across all items is finite. In a decimal system, each image can only be one of 0, 1, 2, ... , 9.

The end-user only cares about how accurately the digit can be predicted. Causality is not of interest here.

A massive dataset of labeled images - i.e. images that have been correctly recognized - is easily obtained to train computers.

Live application generates more data, which feeds back into the system in a positively-reinforcing manner.
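As promised above, here is a toy sketch of the workflow using scikit-learn's small built-in digits dataset - nothing like a production check-reading system, but it shows labeled images training a classifier and accuracy being measured on held-out images.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Labeled images: 8x8 grayscale digits, each tagged with its correct label 0-9.
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
print("held-out accuracy:", round(accuracy_score(y_test, model.predict(X_test)), 3))
# Even a simple model does well on this tidy dataset; real checks (messier handwriting,
# real money at stake) are a much harder problem, which is why the ATM punts to the human.
```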

This digit-recognition technology is widely deployed. Check deposit at the ATM is one obvious example. In 2016, about 16 billion checks were deposited in the U.S. (source). So what’s wrong with the current state of the art?

This snapshot I took at an ATM illustrates the problem:

Recently, I noticed that the ATM has refused to recognize the digits on several checks, asking me to enter the amounts manually instead.

From this evidence, I infer the following:

Even after all these years, the error rate is higher than these banks can absorb. Assuming 10 billion checks read each year at ATMs, even a 0.01% error rate amounts to 1 million errors per year, or about 2,800 errors per day!

Banks would rather err on the side of caution – when in doubt, ask users to enter the amount. This behavior implies humans make fewer errors than machines, even after including mischief as a source of human error.

What would a teller do if s/he couldn't make out the scripted digits? The human would look at the handwritten words "six thousand," solving the problem. Apparently, the ATM does not have handwriting recognition technology, or perhaps its accuracy rate is not high enough. It's a harder problem, though of a similar nature.

***

Why are the banks risk averse? As a victim of one of these errors, I think I understand. Last year, I spent four to six weeks chasing after $20. In this case, the machine read the 2 as a 0. I didn't catch the mistake while at the ATM, but later noticed it on the bank statement.

I learned that convenience comes at a price. The bank's process to verify the amount and correct the mistake is convoluted. It's like missing that exit on the highway, and you now have to go five miles before the next exit. It's a pain for the bank as well as for the client.