Posts categorized "Behavior"

Business Insider reports on the bottom 20 college majors, with the warning "what not to study". The list looks extremely random, including everything from visual and performing arts to international relations to math and computer science (!).

The author explains these majors have the highest unemployment rates.

Here are a few reasons why you can and should ignore this study.

Presumably the journalist is advising college students, who will soon become new college graduates. But the analysis covers all college graduates, meaning anyone with a college degree, from a new graduate to someone who has been working for 40 years. New college graduates have a much higher unemployment rate than college graduates as a whole. (See this recent Huffington Post article.)

Students do not randomly select a college major. Thus, students who choose to major in international relations are not the same type of people as those who decide to major in engineering. If an international relations student were to switch majors to engineering, it would not follow that this person's employment prospects had suddenly brightened. They might even worsen because (a) the student could be relatively less competitive against other engineering majors; and (b) the student and the major might now be mismatched.

If everyone followed this journalist's advice, then each college would need only a handful of majors. Unless the economy suddenly produced a bonanza of jobs, where would the unemployed go? You got it... the unemployment would accrue to the few remaining majors, instead of being spread out over hundreds of majors. The unemployment rate of these "good" majors would rise to the level of the average unemployment rate. Oops.

Not everyone needs or wants a job. Changing majors doesn't change that reality.

***

There is an even bigger howler in the article. The analyst at Bankrate equated "job stability" with a low unemployment rate, using language such as "Having a high-paying job doesn't necessarily mean you'll have job stability, and vice versa." So, majors with high unemployment rates are bad because people job-hop.

But the unemployment rate is not a measure of job tenure, and thus not an indicator of job stability. In fact, if people job-hop a lot, the unemployment rate will be relatively low, because the same job can be held by multiple people in the course of a year.

Let's imagine a country of 10 people with 5 jobs. The unemployment rate is 50%. If no one changes jobs, the same five people are always employed, and the rate is stable at 50%. Now the government stipulates that no person can hold the same job for longer than 6 months. We put the 10 people in a circle, and every other person is given a chair and seated. Every 6 months, each person moves clockwise by one slot: the five previously seated no longer have a seat, and the five seats are now occupied by the five who were standing. Everyone is now employed 6 months out of the year. The employment rate is now 100% (counting part-timers).

Thus, an unemployment rate of zero can coincide with high job instability.
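Here is a minimal simulation of this musical-chairs economy, using the numbers from the example above:

```python
# Toy "musical chairs" economy: 10 people, 5 jobs, forced rotation every 6 months.
people = set(range(10))

# First half of the year: persons 0-4 hold the jobs; second half: persons 5-9.
employed_by_half_year = [set(range(0, 5)), set(range(5, 10))]

# Point-in-time employment rate: at any moment, only half are working.
print(len(employed_by_half_year[0]) / len(people))  # 0.5

# Annual measure: everyone held a job at some point during the year,
# so unemployment measured this way is zero despite maximal job instability.
worked_during_year = employed_by_half_year[0] | employed_by_half_year[1]
print(len(worked_during_year) / len(people))  # 1.0
```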

I've talked about "fake data" before. A lot of fake data come from people trying to game algorithms or skew metrics, and oftentimes, automated bots are involved. Attempts to obscure these tactics typically involve creating layers of complexity so it is not easy to connect the dots.

I come across suspicious data all the time, and it's not always clear what's going on or why. So I thought I'd feature some of them here and see if anyone can figure it out.

***

This account on Yelp caught my attention because the user apparently uploaded a photo of her desktop to one of the restaurant pages. She has uploaded a total of four photos, all of them unrelated to food. The four photos were uploaded to two New York City restaurants, even though she indicates that she lives in San Francisco. She did not review either of those NYC restaurants, but she has written one review, for a cafe in Long Island City. The review seems genuine (although it's hard to tell unless you've been to that cafe).

She has five friends. While she lives in San Francisco, these friends live in Manhattan, Brooklyn, Scottsdale and Oceanside. She has no friends in the Bay Area. None of these friends has ever written a single review, and none has any likes. However, each of these friends has 100-400 friends. It's not clear why one would be friends with someone on Yelp who has no reviews or likes.

***

Is this account fake? If so, why was it created? How did those photos get uploaded? How did they get placed in those particular restaurants? Who are these friends? Are they fake as well? If the account is fake, was that review also fake? Is it possible to predict that the review is fake?

So many questions, and so hard to get answers. What do you think is going on?

We've all been stumped by those "Captcha" puzzles. They started with single words, then pairs of words, then single images, and now, panels of images. Sometimes, it's really hard to know what the answer is. The square that contains one tiny pixel of a traffic light: is that part of the traffic light or not?

It's human versus machine. Does the machine think that pixel is part of the traffic light or not? If you get it wrong, you will be punished with another set of nine images. If you fail multiple times, you will simply be locked out of the website you wanted to visit. Woe betide you!

Imagine this. We are trying to prove to a machine that we are not machines! These Captcha puzzles are explicitly designed to keep bots away. The machine will decide whether you are a human being or not.

I explain this, and a lot more, in my new Fung with Data video. This journey will take you through data science, big data, machine learning, deep learning, self-driving cars, and more. These things are all interconnected with Captchas.

Just posted a short video that explains one of the techniques used to work with observational data (or found data). This type of data is extremely common in the Big Data world. The data are collected by some operational process, and in the indelible words of the late Hans Rosling, they are a bag of numerators without denominators. In this case, the database covers car crash fatalities. You only have the reported crashes linked to deaths; the database contains no information about crashes without fatalities, or about safe driving.

Just like most scientific studies, the original researchers made a claim of statistical significance, i.e. they found something out of the ordinary. (There was an excess of fatalities on April 20.) However, a second research group took a different look at the data and demonstrated that what happened was more common than first thought.

How do statisticians measure how common something is? One takeaway is how to define the reference (control) group. Another is replication: repeating the same style of analysis over different slices of the data.
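To make both takeaways concrete, here is a minimal sketch with made-up daily fatality counts (not the real crash database). Nearby days serve as the reference group, and the same calculation is replicated over every day of the year to see how common a given "excess" is:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical daily fatality counts for one year (not the real data).
daily_deaths = rng.poisson(lam=100, size=365)

def excess(day_index, counts, window=7):
    """Excess of a day's count over the mean of nearby reference days."""
    lo, hi = max(0, day_index - window), min(len(counts), day_index + window + 1)
    reference = np.delete(counts[lo:hi], day_index - lo)  # the control group
    return counts[day_index] - reference.mean()

# Replication: repeat the identical analysis over every day of the year.
excesses = np.array([excess(d, daily_deaths) for d in range(len(daily_deaths))])

april20 = 109  # 0-based day-of-year index for April 20 (non-leap year)
# How common is an excess at least as large as April 20's?
print((excesses >= excesses[april20]).mean())
```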

Click here to see the video. Don't forget to subscribe to the channel to see future videos.

For a long-form discussion of what is covered in the video, see this blog post.

The authors of these articles express genuine shock and awe. They apparently believed that “machine learning” means no humans involved. The tech industry allows this misconception to fester by being opaque about how machine learning works.

(The reporters are also dismayed by the privacy invasion. The Echo speakers are constantly recording in users’ households. Facebook did not have explicit permission from users to send their data out for labeling.)

***

Humans have always been a part of the machine learning workflow, and will continue to be. Let’s use one of the examples in the Facebook report to illustrate this point.

Computers work fast, and can make tons of predictions quickly. The question is whether these predictions are accurate. It’s one thing to create these models in the laboratory; it’s another thing when they are unleashed on the world and start affecting Facebook users, e.g. by deleting videos that are predicted to contain profanity.

Why should Facebook and its data scientists care about predictive accuracy?

User complaints. When users find their videos deleted due to profanity, they complain if said clips do not in fact contain profane language. Other users are upset to unsuspectingly encounter videos with profanity (that the machine fails to identify and delete).

***

It’s not easy to measure whether the machine-learning model is correctly predicting profanity. The machine can’t be both decider and judge at the same time. The judge typically is a human who views the video to determine if it contains profanity. These human judges are the “annotators” described in the news articles. They are hired to look through videos and apply a profanity “label” if they find profanity.

As disclosed in the articles, companies typically hire two or three judges for each item because profanity is a somewhat subjective judgment. They might also order more detailed labeling, e.g. labeling types of profanity as opposed to applying one overall profanity label.
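A sketch of how multiple judges’ labels might be combined (the three-judge setup and the data below are my assumptions, not Facebook’s actual pipeline):

```python
from collections import Counter

# Each video gets labels from several human judges (hypothetical data).
judgments = {
    "video_1": ["profanity", "profanity", "clean"],
    "video_2": ["clean", "clean", "clean"],
    "video_3": ["profanity", "clean", "clean"],
}

def majority_label(labels):
    """Resolve disagreement among judges by majority vote."""
    label, _ = Counter(labels).most_common(1)[0]
    return label

gold_labels = {vid: majority_label(labs) for vid, labs in judgments.items()}
print(gold_labels)  # {'video_1': 'profanity', 'video_2': 'clean', 'video_3': 'clean'}
```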

***

Now let’s remove the assumption that we already have a machine learning model. Where does this model come from? Such a machine has to know what features of the video are correlated with the presence of profanity. To discover this correlation, the machine needs to be told which videos have profane language, and which do not.

This is a chicken-and-egg problem, and it is solved by having humans label a big batch of videos at the start, building the “training data”. In the Facebook example, they hired over 200 people to create data labels, later reduced to 30. The first, larger team built the training dataset; after the predictive model was produced, labeling by the reduced workforce was used to monitor the accuracy of its predictions.
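And a sketch of that monitoring step: compare the live model’s predictions against fresh human labels on a sample (the data and the function name are illustrative assumptions):

```python
def monitoring_accuracy(model_predictions, human_labels):
    """Share of items where the model agrees with the human judges."""
    agree = sum(p == h for p, h in zip(model_predictions, human_labels))
    return agree / len(human_labels)

# Hypothetical sample of recent predictions, re-labeled by the human team.
preds  = ["profanity", "clean", "clean", "profanity", "clean"]
humans = ["profanity", "clean", "profanity", "profanity", "clean"]
print(monitoring_accuracy(preds, humans))  # 0.8
```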

Any company that claims to use our data to predict our behavior must create training data, i.e. labeled data. In most cases, humans must create the labels – by reading our emails, listening to our conversations, viewing our videos, reviewing our calendars, scanning our receipts, and so on.

How far companies should go, and what methods they should use, in collecting such data are ethics questions that should be discussed.

If you are a frequent flier, you already know the gist of this nice article by the BBC: that airlines are allowed to sandbag the flight durations. A flight that takes 60 minutes will be portrayed to fliers as taking twice as long, if not longer.

The airlines are even allowed to lie about this practice. When your flight is delayed taking off, the captain claims that s/he will “make up for the delay,” as if the plane could be driven faster on command. (Were they deliberately going slower before?) The truth is that the schedule is padded, so that it can absorb a limited amount of delay.

This quote sums the situation up: “By padding, airlines are gaming the system to fool you.”

At the very bottom of the article, you’d find the potential motivation – to avoid compensating travelers for long delays, as required by law in some countries.

***

The situation here is similar to the road congestion problem discussed in Chapter 1 of Numbers Rule Your World (link). Managing perceived time is as important as managing actual time experienced by the traveler. Of course, reducing actual wasted time is preferred, especially to scientists working on the problem, but when the road/sky capacity is fixed and over-subscribed, it’s almost impossible to attain. The second half of the article addresses “why don’t the airlines work on efficiency instead of lengthening flight times?”

***

Another quote reveals: “Over 30% of all flights arrive more than 15 minutes late every day despite padding.”

The “on-time arrival rate” blessed by the Department of Transportation (DoT) is not what you think it is.

Let’s take a random flight that takes 60 minutes. This flight schedule might be padded and advertised as departing at 2 pm and arriving at 4 pm.

If the flight departs at 2 pm and takes 60 minutes, then you’d think on-time arrival is defined as arriving at 3 pm. You might agree to allow for some slack, say, 15 minutes. In this case, on-time arrival is arriving before 3:15 pm.

Given the discussion, you now know that on-time arrival actually means arriving before 4 pm, since the schedule is padded not by 15 minutes but by 60 minutes.

And then you’d be wrong! Because there is padding upon padding. Airlines are allowed to claim “on-time arrival” if the flight arrives within 15 minutes of the scheduled arrival time, which in our example has been padded by 60 minutes. So any flight arriving before 4:15 pm is counted as “on-time.”
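A small sketch of this arithmetic, using the padded schedule from the example:

```python
from datetime import datetime, timedelta

actual_flight_time = timedelta(minutes=60)
departure          = datetime(2019, 1, 1, 14, 0)   # 2 pm
scheduled_arrival  = datetime(2019, 1, 1, 16, 0)   # 4 pm (padded by 60 minutes)

# DoT rule: "on-time" means arriving within 15 minutes of the *scheduled* arrival.
on_time_cutoff = scheduled_arrival + timedelta(minutes=15)

# Suppose the flight is delayed by 70 minutes beyond its actual flight time.
actual_arrival = departure + actual_flight_time + timedelta(minutes=70)
print(actual_arrival)                    # 2019-01-01 16:10:00, i.e. 4:10 pm
print(actual_arrival <= on_time_cutoff)  # True: a 70-minute delay still counts as "on-time"
```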

***

Padding is not purely a bad thing. A certain amount of padding is necessary because lots of flights are vying for a limited amount of airport and air space. A padded schedule is a more accurate schedule. It acknowledges other factors that cause delays in arrival.

The gaming of the padded metric is what gets people worked up. Gaming is possible because padding inserts subjectivity into the measurement. So long as subjectivity cannot be avoided, gaming is here to stay.

***

The reporter said airlines have spent billions on technologies to improve efficiency, i.e., managing actual experienced time. But, “this has not moved the needle on delays, which are stubbornly stuck at 30%.”

Now square that statement with this one: “Billions of dollars in investment [in modernising air traffic control] have in fact halved air traffic control-caused delays since 2007 while airline-caused delays have soared.”

Does this sound like the new technologies have successfully reclassified delays from air traffic control caused to airline caused? Passing the buck?

The next time your flight is delayed, the airlines will likely tell you, “it’s not us, it’s the weather.”

Just finished reading The Undoing Project by Michael Lewis, his bio of the Kahneman and Tversky duo who made many of the seminal discoveries in behavioral economics.

In Chapter 7, Lewis recounts one of their most celebrated experiments which demonstrated the “base rate fallacy.”

Here is one version of the experiment. The test subjects are asked to make judgments based on a vignette.

Psychologists have administered tests to 100 people, 70 of whom are lawyers and 30 of whom are engineers.

(A) If one person is selected at random from this group, what is the chance that the selected person is a lawyer?

(B) Dick is selected at random from this group. Here is a description of him: “Dick is a 30 year old man. He is married with no children. A man of high ability and high motivation, he promises to be quite successful in his field. He is well liked by his colleagues.” What is the chance that Dick is a lawyer?

Those subjects who answered (A) made the right judgment, in accordance with the base rate of 70 percent.

The answer to (B) should be the same, since it shouldn’t matter whether the random person is named Dick, and the generic description provides no useful information for determining Dick’s occupation. However, those subjects who answered (B) revised the chance down to about 50-50. The experiment showed that access to Dick’s description led people astray – to ignore the base rate. Note that the base rate here is the prior probability.
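In Bayesian terms, the base rate is the prior, and an uninformative description should leave it unchanged. A quick sketch of the calculation:

```python
def posterior_lawyer(prior_lawyer, p_desc_given_lawyer, p_desc_given_engineer):
    """Bayes' rule for P(lawyer | description)."""
    prior_engineer = 1 - prior_lawyer
    numerator = p_desc_given_lawyer * prior_lawyer
    denominator = numerator + p_desc_given_engineer * prior_engineer
    return numerator / denominator

# The description fits lawyers and engineers equally well (uninformative),
# so the likelihoods cancel and the posterior equals the prior.
print(posterior_lawyer(0.7, 0.5, 0.5))  # 0.7, not 0.5
```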

***

What are the practical applications of the KT experiment for business data analysts?

tl;dr

Before throwing the kitchen sink of variables (features) into your statistical (machine learning) models, review the literature on the base rate fallacy, starting with the Kahneman-Tversky experiments.

1. Adding more variables can make your predictions worse

Let's start with what kind of additional information is provided by Dick’s description. The sample size has not changed – it’s still one. The data expanded only in the number of variables (or features). Specifically, these eight additional variables:

X1 = age

X2 = gender

X3 = marital status

X4 = number of children

X5 = ability level

X6 = motivation level

X7 = expected level of success in field

X8 = popularity among colleagues

In today’s age of surveillance data, it is all too easy for any analyst to assemble more variables. The KT experiment shows that having more variables does not imply you have more useful information. Worse, those extra variables may distract you from the base rate, leading to worse predictions.

2. Machines are even more susceptible than humans

If humans are prone to such mistakes, should we use machines instead? Sadly, machines will perform worse.

Machines allow us to process even more variables at even greater efficiency. Instead of eight useless variables, you can now add 800 or even 8,000 useless variables about Dick. The machines will then inform you which subset of these variables “pop.” The more useless data you add in, the higher the chance you will encounter an accidental correlation.
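A minimal simulation of this effect: generate hundreds of pure-noise variables, correlate each with an equally random outcome, and watch dozens of them “pop” by chance. (The sample size and threshold below are illustrative.)

```python
import numpy as np

rng = np.random.default_rng(42)
n_obs, n_vars = 100, 800

# 800 useless variables: pure noise, unrelated to the outcome.
X = rng.normal(size=(n_obs, n_vars))
y = rng.normal(size=n_obs)

# Correlation of each useless variable with the outcome.
correlations = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(n_vars)])

# Rough threshold for "significance" at the 5% level with n = 100.
threshold = 2 / np.sqrt(n_obs)
print((np.abs(correlations) > threshold).sum())  # dozens of variables "pop" by chance
```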

Happy Lunar New Year! And greetings to Orlando people who are coming to my dataviz seminar this morning.

***

What’s going on with digit recognition, one of the signature applications of machine learning?

Before self-driving cars, before image recognition, before machine translation, there was digit recognition: computers are trained to read and recognize hand-written numbers. This problem shares several of the key components of problems tailor-made for machine learning methods (a small code sketch follows the list):

The correct answer is unambiguous for each item (i.e. image of a digit). The author of the digit has a particular number in mind.

The range of possible answers across all items is finite. In a decimal system, each image can only be one of 0, 1, 2, ... , 9.

The end-user only cares about how accurately the digit can be predicted. Causality is not of interest here.

A massive dataset of labeled images, i.e. images whose digits have been correctly identified, is easily obtained to train computers.

Live application generates more data, which feeds back into the system in a positively-reinforcing manner.
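Here is that sketch: a minimal digit classifier built on scikit-learn’s bundled 8x8 digit images, a small stand-in for datasets like MNIST. (The model choice is illustrative, not what banks actually deploy.)

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

digits = load_digits()  # labeled images of hand-written digits: the training data
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=2000)  # a simple classifier for illustration
model.fit(X_train, y_train)

# Accuracy on held-out images: how often the predicted digit is correct.
print(model.score(X_test, y_test))
```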

This digit-recognition technology is widely deployed. Check deposit at the ATM is one obvious example. In 2016, about 16 billion checks were deposited in the U.S. (source). So what’s wrong with the current state of the art?

A snapshot I took at an ATM illustrates the problem.

Recently, I noticed that the ATM has refused to recognize the digits on several checks, asking me to enter the amounts manually instead.

From this evidence, I infer the following:

Still, after all these years, the error rate is higher than these banks can absorb. Assuming 10 billion checks read each year at ATMs, even a 0.01% error rate amounts to 1 million errors per year, or about 2,700 errors per day! (A quick check of this arithmetic follows below.)

Banks would rather err on the side of caution – when in doubt, ask users to enter the amount. This behavior implies humans make fewer errors than machines, even after accounting for mischief as a source of human error.

What would a teller do if s/he can't make out the scripted digits? The human would look at the handwritten words "six thousand," solving the problem. Apparently, the ATM does not have handwriting recognition technology, or perhaps its accuracy rate is not high enough. It's a harder problem, though of a similar nature.
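A quick check of the arithmetic in the first inference above:

```python
checks_per_year = 10_000_000_000   # assumed 10 billion checks read at ATMs
error_rate = 0.0001                # 0.01%

errors_per_year = checks_per_year * error_rate
print(f"{errors_per_year:,.0f}")     # 1,000,000 errors per year
print(round(errors_per_year / 365))  # 2740, i.e. roughly 2,700 errors per day
```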

***

Why are the banks risk averse? As a victim of one of these errors, I think I understand. Last year, I spent four to six weeks chasing after $20. In that case, the machine read a 2 as a 0. I didn’t catch the mistake while at the ATM, but later noticed it on the bank statement.

I learned that convenience comes at a price. The bank's process to verify the amount and correct the mistake is convoluted. It's like missing that exit on the highway, and you now have to go five miles before the next exit. It's a pain for the bank as well as for the client.

Racial and cultural harmony is hard work. The news about an episode at Duke's biostats program is a good example of the intricacies involved.

According to what is known at the moment (reported also by the Duke Chronicle), the program director of the Master of Biostatistics program, who is also an assistant professor, wrote students a note asking that they always speak English in the department's building and in professional settings. This email was aimed at Chinese graduate students, who form the bulk of enrollment in many graduate programs at U.S. colleges (although I'm not sure about this particular one).

Further, the email was prompted by two unnamed Duke professors who went to the program director to complain about Chinese students speaking Chinese "very loudly" in a "lounge or study area".

The two professors demanded photos of the Chinese students to identify the offenders "in case the students ever applied for an internship or were interviewed by them." The program director warned students that speaking their mother tongue would have "unintended consequences" that could affect their careers and recommendations.

***

According to the articles, the school has thus far taken the following steps:

The program director has immediately been replaced.

The Master's program has been placed "under review".

The Dean of the Medical School issued an apology, affirming that there is no restriction of languages spoken outside the classroom, that speaking other languages outside the classroom would not affect careers or recommendations, and that student privacy will be protected.

***

It looks like the program director is being positioned as the scapegoat. There is no mention at all of the two professors who violated the students' privacy and threatened their careers and recommendations.

The outgoing program director gave the students great advice. The behavior of the two professors validates the view that it is in the students' interest to speak English. The Duke Chronicle did not reprint the Dean's email, but saying that speaking other languages would not affect careers does not make it so, in light of evidence to the contrary!

Speaking one's mother tongue with someone who shares it is completely natural to every human being; and doing so when living in a foreign country is a way to connect with one's heritage.

Now, practicing English while studying in the U.S. is also good advice, but that ought to be a choice.

On my (new) YouTube channel, "Fung with Data", I am using short clips to explain how data, software and algorithms work behind the scenes to influence our daily decision making.

The third episode just launched today, and it addresses the question of whether Google Maps or GPS navigators can really find you the "fastest route" to your destination. Lots of people I know swear by the software; how does it work? Click here to see the video.

This video is the second in a series about Google Maps. Episode 2 presents the basics of how route optimization works. Click here to see the short clip.
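For the curious, the core idea behind route optimization is shortest-path search over a weighted road network. Here is a minimal sketch using Dijkstra's algorithm; the toy road graph and travel times are made up, and Google's actual system is far more elaborate (live traffic data, many more factors):

```python
import heapq

def dijkstra(graph, start, goal):
    """Shortest travel time from start to goal over a weighted road graph."""
    queue = [(0, start, [start])]
    seen = set()
    while queue:
        time, node, path = heapq.heappop(queue)
        if node == goal:
            return time, path
        if node in seen:
            continue
        seen.add(node)
        for neighbor, minutes in graph.get(node, {}).items():
            if neighbor not in seen:
                heapq.heappush(queue, (time + minutes, neighbor, path + [neighbor]))
    return float("inf"), []

# Toy road network: travel times in minutes between intersections.
roads = {
    "home":         {"avenue": 5, "highway_ramp": 2},
    "avenue":       {"office": 10},
    "highway_ramp": {"highway_exit": 6},
    "highway_exit": {"office": 3},
}
print(dijkstra(roads, "home", "office"))
# (11, ['home', 'highway_ramp', 'highway_exit', 'office'])
```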

Subscribe to the channel to get notified when the next episode shows up.