Big Data, Plainly Spoken (aka Numbers Rule Your World)

Kaiser Fung, author of "Numbers Rule Your World: the hidden influence of probability and statistics in everything you do" and "Numbersense: How to use Big Data to your advantage". Comments on how data science, algorithms, software shape current events.

The list of worst college majors, and why you should ignore it

2019-09-11. Kaiser Fung (Numbersense, Principal Analytics Prep) discusses a report of the worst college majors in terms of employment.

Business Insider reports on the bottom 20 college majors, with the warning "what not to study". The list looks extremely random, including everything from visual and performing arts to international relations to math and computer science (!).

The author explains these majors have the highest unemployment rates.

Here are a few reasons why you can and should ignore this study.

Presumably the journalist is advising college students, who will soon become new college graduates. But the analysis references all college graduates, a group that includes anyone with a college degree, from a new college graduate to someone who has worked for 40 years. New college graduates have a much higher unemployment rate than all college graduates. (See this recent Huffington Post article.)

Students do not randomly select a college major. Thus, students who choose to major in international relations are not the same type of people who decide to major in engineering. If an international relations student were to switch majors to engineering, it would not follow that this person's employment prospects had suddenly brightened. They might even worsen because (a) the student could be relatively less competitive against other engineering majors; and (b) the student and the major might now be mismatched.

If everyone followed this journalist's advice, then each college would need a very small number of majors. Unless the economy suddenly produced a bonanza of jobs, where would the unemployed go? You got it... the unemployment rate would accrue to the several remaining majors, instead of being spread out over hundreds of majors. The unemployment rate of these "good" majors would rise to the level of the average unemployment rate. Oops.
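The pooling argument can be sketched with a few lines of arithmetic. All figures below are hypothetical and chosen only for illustration; the sketch also assumes that the total number of jobs stays fixed and that any graduate can fill any job.

```python
# Hypothetical figures for illustration only; none come from the report.
# Each entry: major -> (graduates in the labor force, jobs held by them)
majors = {
    "engineering": (1000, 980),
    "nursing": (1000, 985),
    "performing arts": (1000, 940),
    "international relations": (1000, 945),
}


def unemployment_rate(people, jobs):
    """Share of people in the labor force who do not hold a job."""
    return 1 - jobs / people


def pooled_rate(majors):
    """Rate across the surviving majors once everyone has crowded into them.

    Assumes the economy creates no new jobs, so the same number of people
    remain unemployed regardless of what their diploma says.
    """
    total_people = sum(people for people, _ in majors.values())
    total_jobs = sum(jobs for _, jobs in majors.values())
    return unemployment_rate(total_people, total_jobs)


# Rates of the "good" majors before anyone switches
before = {m: unemployment_rate(*majors[m]) for m in ("engineering", "nursing")}
# Rate of the "good" majors after everyone has switched into them
after = pooled_rate(majors)

print(before)
print(round(after, 4))
```

Under these made-up numbers, the "good" majors start at about 2% and 1.5% and end up at the overall average of 3.75%.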

Not everyone needs or wants a job. Changing majors doesn't change that reality.

***

There is an even bigger howler in the article. The analyst at Bankrate equated "job stability" with low unemployment rate, using language such as "Having a high-paying job doesn't necessarily mean you'll have job stability, and vice versa." So, majors with high unemployment rates are bad because people job-hop.

But the unemployment rate is not a measure of job tenure, and thus not an indicator of job stability. In fact, if people job-hop a lot, the unemployment rate will be relatively low, because the same job can be held by multiple people in the course of a year.

Let's imagine a country of 10 people with 5 jobs. The unemployment rate is 50%. If no one changes jobs, the same five people are always employed, and the rate is stable at 50%. Now suppose the government stipulates that no person can hold the same job for longer than 6 months. We put the 10 people in a circle, and every other person is given a chair and seated. Every 6 months, each person moves clockwise by one slot: the five previously seated no longer have a seat, and the five seats are now occupied by the five who were previously standing. Everyone is now employed 6 months out of the year. The employment rate is now 100% (counting part-timers).

Thus, an unemployment rate of zero can coincide with high job instability.
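The musical-chairs country can be simulated in a few lines. This is a minimal sketch of the thought experiment, with monthly time steps as an added assumption. It reports two measures side by side: the share of people without a job in any given month, and the number of people who held a job at some point in the year (the "counting part-timers" view).

```python
N_PEOPLE = 10
N_JOBS = 5
MONTHS = 12
ROTATE_EVERY = 6  # the hypothetical six-month limit on holding a job


def employed_in_month(month):
    """Return the set of people holding a job in the given month (0-based)."""
    shift = month // ROTATE_EVERY  # number of rotations that have happened
    # Jobs start with every other person (0, 2, 4, ...); at each rotation the
    # set of job holders shifts by one slot around the circle.
    return {(2 * seat + shift) % N_PEOPLE for seat in range(N_JOBS)}


# Point-in-time unemployment rate, month by month
monthly_unemployment = [
    1 - len(employed_in_month(m)) / N_PEOPLE for m in range(MONTHS)
]

# Everyone who held a job at any time during the year
worked_at_some_point = set().union(
    *(employed_in_month(m) for m in range(MONTHS))
)

# Months of employment per person, a crude measure of job tenure
months_worked = {
    p: sum(p in employed_in_month(m) for m in range(MONTHS))
    for p in range(N_PEOPLE)
}

print(monthly_unemployment)
print(len(worked_at_some_point))
print(months_worked)
```

In this sketch, all ten people work during the year while nobody holds a job for more than six months, so the count of people who worked says nothing about how stable those jobs are.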

In a new series of videos, I plan to cover the "missing pieces" of data science. These are the little things that instructors assume students already know. These little things block further learning if not cleared up.

The first video is about functions. Anyone learning to code must know about functions. What's covered?

You will be unambiguously judged by the accuracy of your figures... this means you must unconditionally adopt a routine inclusive of quality assurance.

The swiftest way to lose your credibility is to overlook an obviously noticeable data error... if your senior vice president is the one to spot that the number of customers in your analysis only added up to 50% of the actual, do not expect him/her to accept your conclusion so readily and resolutely. [I'd add: s/he may distrust not just this one analysis but every analysis you'll put out from that point on. There is no "forget" button you can press.]

Do not detail your journey when you are at the table... the more background you present, the more opportunity there will be for individuals to take the discussion down a rabbit hole.

Listen... write down what you hear.

[Don't] be the individual who starts solving the problem in his/her head in the midst of the meeting.

the sooner you recognize the tools are a means to an end, the sooner you will focus more on the "end" and less on the "tool".

Keep learning.

Unfortunately, some of the things you learned in college are precisely the things that will get you in trouble. Explaining things from first principles, describing the journey (the wrong turns, the obstacles, etc.), holding the conclusion as cliff-hanger, focusing on technical outcomes only, believing that arriving at the wrong number is okay so long as you used the right methodology, being obsessed with edge cases rather than the most probable case, etc. etc.

I am not saying your teachers were wrong. But to have success in an industry career, you need to adapt.

Wage inequality and its sources

2019-08-26. Kaiser Fung (Numbersense, Principal Analytics Prep) explains the data analysis behind Business Insider's take on wage inequality in the U.S.

In the sister blog, I featured a data graphic showing the difference in pay levels between white male workers in the U.S. and women workers of various races. That post discusses purely visual matters, in effect accepting the analysis as given. In this post, I take a deeper look at the underlying data analysis.

Let us review the analysis strategy behind the chart, and then discuss why this simple strategy is not particularly insightful.

The starting point of the analyst is income data collected by the Census Bureau. Even here, income is measured in several ways. It appears that using personal income (as opposed to household or family income) is the most appropriate here. There is an additional complication of how to handle "missing values", which may arise in this context because someone is not employed and thus earns zero income. When one says "median income", does one include or exclude the zero earners? What about those who only worked part of the year?
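To see how much the treatment of zero earners can matter, here is a small sketch using made-up incomes (these are not Census figures). It computes the median twice, once with and once without the people who earned nothing.

```python
from statistics import median

# Hypothetical personal incomes in dollars; zeros are people with no earnings.
incomes = [0, 0, 0, 12_000, 25_000, 31_000, 40_000, 52_000, 75_000]

# Median over everyone, zero earners included
median_all = median(incomes)

# Median over people with positive earnings only
median_earners = median([x for x in incomes if x > 0])

print(median_all)
print(median_earners)
```

With these illustrative values, the two definitions differ by more than $10,000, so the choice has to be stated before any group comparison is made.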

After those questions are addressed, one can work with median incomes, computed at the right level of aggregation. The report linked to by Business Insider only contains aggregation by race, and aggregation by gender, each calculated separately. What is needed is a cross-tabulation. It is not possible to obtain the median income of Asian women from the median income of Asians and the median income of women - unless the analyst makes unwarranted assumptions.
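A small counterexample shows why the cross-tabulation cannot be recovered from the separate aggregations. The two "worlds" below are entirely hypothetical (incomes in arbitrary units, groups labeled A/B and M/F); they share identical medians by race and by gender, yet the median for the A-and-F cell differs.

```python
from statistics import median


def marginal_medians(cells):
    """Medians by race and by gender, each computed by pooling the cells."""
    by_race, by_gender = {}, {}
    for (race, gender), values in cells.items():
        by_race.setdefault(race, []).extend(values)
        by_gender.setdefault(gender, []).extend(values)
    return (
        {k: median(v) for k, v in by_race.items()},
        {k: median(v) for k, v in by_gender.items()},
    )


# Two hypothetical populations, keyed by (race, gender)
world_1 = {
    ("A", "M"): [1, 5, 9], ("A", "F"): [2, 5, 8],
    ("B", "M"): [2, 5, 8], ("B", "F"): [1, 5, 9],
}
world_2 = {
    ("A", "M"): [5, 8, 9], ("A", "F"): [1, 2, 5],
    ("B", "M"): [1, 2, 5], ("B", "F"): [5, 8, 9],
}

# Same medians by race and by gender in both worlds ...
print(marginal_medians(world_1) == marginal_medians(world_2))

# ... but different medians for the (A, F) cell
print(median(world_1[("A", "F")]), median(world_2[("A", "F")]))
```

Since both worlds produce the same one-way summaries, no formula applied to those summaries alone can tell them apart.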

So the analyst learns that white men made $46,000 in 2017 while Asian women made $27,000 and Hispanic women made $21,000. This data can be plotted directly, or after computing the race-gender discount (off the white male wage level).

***

The analyst wants to make this data come alive by using a different unit, the number of days worked.

One way to achieve this is by converting the annual salary to daily salary, then computing how many days the median Asian woman must work in order to earn $46,000, the median for the white man. This is roughly 611 days, which suggests that the Asian woman must work 246 extra days.
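The conversion can be written out as a short sketch, using the rounded medians quoted earlier and a 365-day working year. Note that with these rounded inputs the result comes out closer to 622 days than to 611; the figure in the text presumably reflects unrounded medians, which are not listed here, so treat that explanation as an assumption.

```python
DAYS_PER_YEAR = 365  # simplifying assumption: everyone works every day


def days_to_match(own_median, reference_median, days_per_year=DAYS_PER_YEAR):
    """Days needed to earn the reference group's annual median income."""
    daily_pay = own_median / days_per_year
    return reference_median / daily_pay


def extra_days(own_median, reference_median, days_per_year=DAYS_PER_YEAR):
    """Days beyond one full year needed to reach the reference median."""
    return days_to_match(own_median, reference_median, days_per_year) - days_per_year


WHITE_MEN = 46_000  # rounded 2017 median quoted in the post

asian_women_days = days_to_match(27_000, WHITE_MEN)
hispanic_women_days = days_to_match(21_000, WHITE_MEN)

print(round(asian_women_days), round(extra_days(27_000, WHITE_MEN)))
print(round(hispanic_women_days), round(extra_days(21_000, WHITE_MEN)))
```

Changing `days_per_year` to a more realistic count of working days rescales both totals but leaves the ratio between the two groups unchanged, as long as the same count is applied to everyone.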

In the above description, I have seamlessly papered over several annoying details. I have assumed - without checking with reality - that everyone works 365 days a year. In fact, no one works 365 days a year. And even if I obtained the correct value of the average number of days worked, call it X, X would still vary by race and gender. Thus, I made a further implicit assumption that such variance is not large enough to bother about.

I justify this lack of care because I rounded the median salaries to the nearest thousand. The questions I raised above ought to concern those analysts who insist on printing estimates in decimals. Is it possible to attain that level of precision while making simplifying assumptions?

***

Also notice that the analysis strategy is counterfactual in nature - it requires conjuring up a hypothetical scenario. It's a comparison of a white man who works exactly one year, and a woman (of any race) who works till she earns $46,000 or whatever is the current median wage for white men.

The notion of "extra days" is an invention since there are only 365 days in a year, and the women can never catch up unless the white guys stop working.

***

When comparing white men and Asian women, both gender and race are shifted. The Business Insider story leads readers to attribute the wage gap primarily to gender, but unless we see the median salaries for Asian men, Hispanic men and black men, we can't be sure.

Most likely, there is a gender "effect" as well as a race "effect". The gender effect may even be differently sized for each race. This is known as an "interaction" effect.
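One way to picture an interaction is with a two-by-two table of medians. The numbers below are invented for two unnamed groups (in $1,000s) and are not Census data; the sketch only shows how the gender gap can be computed within each group and then compared.

```python
# Hypothetical median wages in $1,000s for two unnamed groups; not real data.
medians = {
    ("group_1", "men"): 46, ("group_1", "women"): 36,
    ("group_2", "men"): 34, ("group_2", "women"): 30,
}


def gender_gap(group):
    """Difference between men's and women's medians within one group."""
    return medians[(group, "men")] - medians[(group, "women")]


gap_1 = gender_gap("group_1")
gap_2 = gender_gap("group_2")

# If the gender gap were the same size in both groups, this would be zero.
interaction = gap_1 - gap_2

print(gap_1, gap_2, interaction)
```

A nonzero difference between the two gaps is what "interaction" refers to: the size of the gender effect depends on which group is being examined.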

***

Finally, there are even more factors to be considered. It is well known that at least some of the wage gap is explained by the difference in the mix of jobs and industries that men and women tend to be employed in. So one can't conclude discrimination without further investigation.

Unequal pay for equal work is discrimination but unequal pay for unequal work need not be.
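The role of job mix can be illustrated with a sketch in which pay is identical for men and women within each job, yet the overall averages differ. The pay levels and headcounts are hypothetical, and the sketch uses averages rather than medians to keep the arithmetic simple.

```python
# Hypothetical: within each job, men and women are paid exactly the same.
pay = {"job_a": 80_000, "job_b": 40_000}

# Hypothetical headcounts: the two groups are spread differently across jobs.
headcount = {
    "men": {"job_a": 70, "job_b": 30},
    "women": {"job_a": 30, "job_b": 70},
}


def average_pay(group):
    """Average pay for a group, weighting each job by its headcount."""
    counts = headcount[group]
    total = sum(pay[job] * n for job, n in counts.items())
    return total / sum(counts.values())


gap = average_pay("men") - average_pay("women")

print(average_pay("men"), average_pay("women"), gap)
```

Here the entire gap comes from the difference in job mix, which is why an aggregate gap alone cannot establish unequal pay for equal work. Whether the job mix itself reflects discrimination is a separate question that this sketch does not address.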

***

Now, check out my comments about the calendar visualization of the wage gaps by Business Insider.