January 2014

Last week, Andrew Gelman (link) and I were kindred spirits: we both did a "numbersensing" exercise on two different data analyses. I was reading the MailChimp study on the effect of Google siphoning off "marketing" emails into a separate tab, and alarm bells went off when I saw that the aggregate click-to-open rate was reported at an inconceivable 85%. (See Part 1 of my reaction here.)

In the meantime, Andrew was investigating a tidbit that appeared in a chapter of the "Doing Data Science" book by Rachel Schutt and Cathy O'Neil, in which it was claimed that the slowness in matching certain data records led to "one or two patients per week" dying, which Gelman estimated to imply that up to one-quarter of the deaths were caused by poor record-keeping (!)

What Gelman called his "Spidey-sense" is what I've been calling "numbersense". In the age of Big Data, when everyone has data to make arguments, developing numbersense is really important to disentangle ambiguous or conflicting studies. Hey, I wrote an entire book about this.

***

In the same post, Gelman made a brilliant observation. We have come across this type of data story before. Just open a Gladwell book, or some of the Freakonomics stuff, and you'll find plenty of other examples. Gelman calls these "parables"; hence the rise of the "statistical parable".

The point is that the people who write these stories do not really care whether the numbers are accurate. Put differently, they are invested in the direction of a relationship but not its magnitude. In the above story, the writer is interested in the fact that poor record-keeping can lead to some unnecessary deaths, but how many is "some" is of no concern. The data is really a side show; the message is the main attraction!

This goes a long way, I think, in explaining the popularity of the genre as well as the revulsion many statisticians feel toward this type of story.

The reason statisticians dislike statistical parables is that these stories are false unless we can verify two conditions: one, that the signal is strong enough; two, that there is not too much noise.

MailChimp, a major vendor that companies use to send marketing emails to customers, published an analysis of the effect of Gmail marketing tabs (link). How should you read such a study?

I'd begin by clarifying what problem the analyst is solving. In May, Google rolled out to all Gmail users a tabbed interface, in which the inbox is split into three parts: the regular inbox, a "promotional" email box, and a "social" email box.

[Image: Gmail's tabbed inbox, from socialmediatoday.com]

Immediately, everyone assumed that this change would hurt email marketers (we are talking about legitimate companies here, not spammers). The MailChimp analyst is using data to validate this hypothesis, which is a wonderful endeavor in the spirit of this blog.

***

Next, I'd identify the analysis strategy used to arrive at the answer. This analyst is using a pre-post analysis while controlling (ex-post) for a single factor. In layman's terms, that means the analyst compares the open and click rates before the tabs rollout with those rates after the tabs rollout. But that difference can be misleading because the pre-post analysis by itself does not prove that the tabs rollout was the cause of any observed difference. For example, there may be a seasonal change in open rates regardless of the tabs rollout.

Recognizing this, the analyst used other email providers as a natural "control" for this single factor (seasonality). The idea is that if seasonality were the cause of the change in open rates, then the other email providers should exhibit the same seasonal change over the same time window. This is a reasonable supposition but you might already be questioning... why must the seasonal effect be identical across email providers?

Good question! It doesn't have to be, which tells you that the outcome of this analysis is valid only under the assumption that the seasonal effect is identical across email providers. (See post #2 for other strong assumptions needed due to controlling for only one factor.)
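To make the strategy concrete, here is a minimal sketch of the pre-post comparison with an outside control. This is my reading of the approach, not MailChimp's actual code, and all the rates are invented for illustration.

```python
# Minimal sketch of a pre-post comparison netted against a control group.
# All rates below are hypothetical, not MailChimp's actual figures.

# Average open rates before and after the Gmail tabs rollout
gmail_pre, gmail_post = 0.130, 0.122   # hypothetical Gmail open rates
other_pre, other_post = 0.128, 0.127   # hypothetical non-Gmail ("control") open rates

# Naive pre-post difference: mixes the tabs effect with seasonality
naive_effect = gmail_post - gmail_pre

# Net out seasonality using the other providers. This isolates the tabs
# effect only if the seasonal swing is identical across email providers.
seasonal_estimate = other_post - other_pre
adjusted_effect = naive_effect - seasonal_estimate

print(f"naive effect:    {naive_effect:+.3f}")
print(f"adjusted effect: {adjusted_effect:+.3f}")
```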

Once I am satisfied with the analysis strategy, I look at the quality of the data. I did notice one red flag here. Looking at the click rate chart (please imagine that this is a line chart, not a column chart with an axis not at zero!), I am shocked that the average click-to-open rate was in the 85% range. This is saying that almost all of the people who open emails click on something inside the email. Since I have seen email clickthrough data before at various companies, I am skeptical that these rates are correct.

***

I did leave a comment at the blog asking them to check their data but as of today, it looks like it got lost in cyberspace - or censored. My friend who originally shared the blog post left a comment and it went through.

The analyst seems to have little sense of what real-world clickthrough rates look like! He convinced himself that the rate must be correct since it is what the data say, and further threw in a distraction -- that there are two ways to measure click rates, one based on the number of emails sent and the other based on the number of emails opened. Not surprisingly, the latter is much higher than the former.

By his count, the ratio of clicks to emails sent is in the 10 to 20 percent range. That too is way too high. If you tell me there are a few email campaigns that achieve such a high rate, I'd believe it. But given that his study is "BIG DATA", with 29 billion emails, 4.9 billion opens, 4.2 billion clicks, and 43.5 million unsubscribes, presumably across a large number of clients and many different industries, it is hard to fathom what it means to say that one out of every five to ten emails gets clicked on.
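For what it's worth, both figures follow directly from the aggregate counts he cites, so the division itself is not in question; the issue is what went into the counts.

```python
# Back-of-the-envelope check using the aggregate figures cited in the study:
# 29 billion emails sent, 4.9 billion opens, 4.2 billion clicks.
sent, opens, clicks = 29.0e9, 4.9e9, 4.2e9

click_to_open = clicks / opens   # the ~85% figure shown in the chart
click_to_sent = clicks / sent    # his "clicks to emails sent" figure

print(f"click-to-open rate: {click_to_open:.1%}")   # 85.7%
print(f"click-to-sent rate: {click_to_sent:.1%}")   # 14.5%
```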

I'm not bashing the analyst here. Every data analyst will encounter this type of situation over and over. You are convinced that your number must be correct - because you know the data, you know the steps you took, you know the care you took to compute the rates.

When someone else points out the rates don't sound right, you're scratching your head. You know it's just a simple formula, the sum of clicks divided by the sum of opens, so you think there are only a few ways it could go wrong. Further, the person raising the doubt has no data so what could he/she know?

In reality, there are many ways to skin the cat of a simple formula. Have the data been cleansed of bots and suspicious clicks? What are the time windows for counting each item? How are multiple opens or clicks by the same entity treated? And so on.
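To see how much those choices matter, here is a toy sketch. The event records, field names, and bot flag are all hypothetical; real email logs are far messier.

```python
# Two analysts, the same "simple formula", different answers depending on
# the cleaning choices. All data structures here are hypothetical.

def click_to_open_rate(opens, clicks, dedupe=True, drop_bots=True):
    """opens/clicks are lists of (recipient_id, campaign_id, is_bot) events."""
    def count(events):
        if drop_bots:
            events = [e for e in events if not e[2]]
        keys = [(rid, cid) for rid, cid, _ in events]
        return len(set(keys)) if dedupe else len(keys)
    return count(clicks) / count(opens)

# Toy data: recipient "a" opens twice and clicks three times, once via a bot.
opens  = [("a", 1, False), ("a", 1, False), ("b", 1, False)]
clicks = [("a", 1, False), ("a", 1, False), ("a", 1, True)]

print(click_to_open_rate(opens, clicks, dedupe=False, drop_bots=False))  # 1.0
print(click_to_open_rate(opens, clicks, dedupe=True,  drop_bots=True))   # 0.5
```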

This is the test of how good an analyst someone is. This is when the analyst demonstrates numbersense. How much time does it take to figure out what is driving these crazy numbers?

***

The reason I'm not bashing the analyst is this: if you tally up each time a person with no data raises doubts about analytics data, probably 80 percent of the time the data is fine, and possibly 5 percent of the time the data has serious errors (defined as: the conclusion changes after the fix).

Of course, if you are a manager of a data team, you want to manage to those ratios. If your analysts are wrong much more often, some remedial action should be taken to improve the performance.

In my next post, I'll look at the MailChimp study from the perspective of Big Data.

I have been busy working on syllabuses for my Spring 2014 courses at NYU, and that's why posting has been more haphazard than usual. I don't think I have said much about my teaching here on the blog, so let me take this opportunity to introduce the classes that I teach.

This is an introductory statistics course with a business/management emphasis. Many students take this course as a bridge from undergraduate to graduate schools (MBA, policy school, social sciences, etc.). There are also working people who just need to use statistics in their work. While the curriculum stays close to the usual Stats 101, I use a very different style of teaching. Much more time is spent on conceptual understanding and interpretation of data than usual; I use a form of "case method" made famous by Harvard Business School (see this post).

I never assign problems such as "what is the confidence interval of...?"; instead, the question will look like "tell me how the decision-maker should interpret the test data, and recommend a course of action. Explain how you arrive at the decision." I also tailor the schedule to emphasize materials that are widely used in practice, like linear/logistic regression, which often get short-changed.

This is a first-ever offering and the first course of this type that I know of. The entire course is dedicated to working and reworking a data visualization project of the student's choice, and it will be run along the lines of a creative writing workshop. If you follow Junk Charts, you know what I'm talking about. It's a course on the craft of visualizing data, and visualization criticism. I have put the syllabus on Junk Charts (link) for you to look at.

This course is part of the M.S. in Integrated Marketing program. Business analytics (aka Big Data) is a rapidly evolving field that is facing a dearth of skilled talent. The students who take this course will find a world of opportunities opening up for them when they graduate. We will introduce the hot trends in business analytics and Big Data, as well as teach analytics skills using SAS software (JMP and Enterprise Guide), such as cleaning up data, data mining techniques, and graphing. I will also emphasize the interpretation of data analyses (i.e., numbersense) and awareness of data quality issues.

In theory, the availability of data should improve our ability to measure performance. In reality, the measurement revolution has not taken place. It turns out that measuring performance requires careful design and deliberate collection of the right types of data-- while Big Data is the processing and analysis of whatever data drops onto our laps. Ergo, we are far from fulfilling the promise.

This is such an important point that I'm repeating it at the top of this post.

***

The title of this post is taken from Sean J. Taylor's post on the same topic. Highlights:

Making your own data means you are creating new facts about the world which gives you privileged access to scientific findings.

If you are the creator of your data set, then you are likely to have a great understanding of the data-generating process. Blindly downloading someone’s CSV file means you are much more likely to make assumptions which do not hold in the data.

The last point is a major takeaway from Numbersense (link), and in particular, read the chapters on economic indicators.

In my last post, I pointed you to Avinash's post about Reporting Squirrels versus Analysis Ninjas. My focus in that post is the underlying concern of "return on analytics" or lack thereof. This post takes up Avinash's argument directly. Think of this as part 2 of Avinash's post if he had kept on writing.

Avinash's summary of his post is as follows:

Reporting Squirrel-type work has a minor incremental impact on a company's bottom line, hence your career progression will be on a slower track. Analysis Ninja work, on the other hand, has a major impact (even though its output is not numbers but words in English!), so you want to be on that track. Seek those opportunities.

In short, Reporting Squirrels are the people who work on data processing, loading, transformation, and reporting while Analysis Ninjas are creative problem solvers who work with the cleaned up data.

***

One approach to this issue is to understand the analytics process. Raw data is never the right answer (see Chapter 6 of Numbersense for more (link)). The analytics process starts with data processing, loading, transformation, and reporting; on top of a foundation of good data, one can then do slicing, dicing, modeling, etc.

In any practical analytics process, 70 to 80 percent of the effort is spent on processing and cleaning data! Those are "low" value-add activities when considered alone, but they are not dispensable tasks either, and somebody has to do them. Too bad textbooks and academic courses tend to spend less than 10 percent of their time on data processing and cleaning, thus contributing to a skewed view of analytics work.

If you are managing an analytics team, you have a decision to make. Do you create a Department of Squirrels and a Department of Ninjas? Or do you create one Department of half-Squirrel-half-Ninjas? I have usually preferred the latter which avoids the caste system but others have set up separate departments. The reason why I like to combine the two roles is that the best Ninjas must have a complete grasp of the data, including every assumption and every adjustment that was made or attempted (and failed).

***

Avinash has advice for people getting into this field - get on the Analysis Ninja track.

That is good advice, with a caveat. The two roles have different requirements, and not everyone is suited for both. Reporting Squirrels need to be disciplined, care about order, structure, proper process and documentation, pay attention to the smallest details, and not be easily irritated by anomalies. Analysis Ninjas have to be creative, adaptable, able to think outside the box, and have great communication skills and business sense. The overlap between the two sets of requirements is surprisingly small.

For job-seekers, you should assess your own temperament and strengths. When you apply for a job, figure out from the hiring manager what types of people he's looking for. Understand how the manager organizes the Ninjas and Squirrels, and figure out where you fit in.

If you think along these lines, you might notice that my previous advice to create unified roles also comes with a problem: it is hard to find one person who is an A in both roles; most likely, you'll find someone who's an A in one role and a B in the other, and you'd have to work around that.

Avinash (Web Analytics 2.0) is fond of animal metaphors. I think he's the one who coined HIPPOs (Highest Paid Person's Opinion). Now he's come up with Reporting Squirrels and Analysis Ninjas. See his recent post here.

In short, he is shouting about "return on analytics", a really, really important thing. What is being lost in the hype of Big Data is that all the investment in analytics has to generate value for businesses. All too often, the "data scientists" have no idea what tangible value they are creating.

For example, I saw presentations from two different tech firms that have data scientists working on the following problem: when you "check in" to a location on your phone, and start typing the name of the location, the app will use GPS and other data to guess where you are, and pop up a list of guesses so that you can save a few keystrokes.

This is an interesting research question that can showcase how to use GPS and other data. It is an impressive feat to process such data in near real time. How does it create value for the respective businesses? When pressed, I expect the data scientist to make the following claim: the check-in location predictor will reduce the amount of time it takes for users to check in, which means they will check in more frequently, which means they will present more opportunities for our advertisers to present them with offers.

If you come to me with this answer, I will press you some more. How will these offers benefit the business? So you continue the argument, and now tell me that when there are more offers, there will be more sales, and when the advertisers sell more, we will earn more (that is, if we are paid per action; if we are paid per impression, then showing more offers directly generates more revenue).

You see the problem here? The path from the check-in prediction to incremental revenues has many steps. Each of these steps is being argued logically -- there is not a shred of data to support any of them. Many factors affect revenues other than the ones mentioned here, so what is crucial to understand is the magnitude of the impact that the predictive technology has generated. You also want to understand incremental revenues: counting revenues that would have been earned whether or not the check-in predictor exists is a sleight of hand that produces no value.
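A minimal sketch of what "incremental" means in practice, assuming a holdout group that never sees the check-in predictor; every number below is invented for illustration.

```python
# Measuring incremental revenue against a holdout group.
# All figures are invented; the point is the subtraction, not the numbers.
exposed_users, exposed_revenue = 1_000_000, 210_000.0   # saw the check-in predictor
holdout_users, holdout_revenue = 1_000_000, 200_000.0   # never saw it

rev_per_user_exposed = exposed_revenue / exposed_users
rev_per_user_holdout = holdout_revenue / holdout_users

# Only the lift over the holdout counts as value created by the feature;
# the rest would have been earned anyway.
incremental = (rev_per_user_exposed - rev_per_user_holdout) * exposed_users
print(f"incremental revenue attributable to the feature: ${incremental:,.0f}")
```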

***

In theory, the availability of data should improve our ability to measure performance. In reality, the measurement revolution has not taken place. It turns out that measuring performance requires careful design and deliberate collection of the right types of data -- while Big Data is the processing and analysis of whatever data drops onto our laps. Ergo, we are far from fulfilling the promise.

Just a little while ago, I showed an example of imprecise algorithms and how they cause incorrect historical facts to be promulgated. The point is not that algorithms are scary things but that we should not confuse efficiency with accuracy (or truth).

This past week, I had another encounter with imprecise machines, and this time, it's personal.

***

If you go to Amazon right now, and search for my name in quotes "Kaiser Fung", you will get several versions of my 2010 book Numbers Rule Your World, including the recently published Chinese translation, but you will not see my 2013 book Numbersense at all.

If you instead search for Kaiser Fung without quotes, the first match is Numbersense, followed by the older book.

To me, this is a clear mistake.

However, Amazon doesn't think so. This is what the customer service rep wrote me:

I understand your recent book "Numbersense" is not appearing in the search results when you search with your name "kaiser fung", including the quotes.

When you use our search engine to look for books, our system attempts to find the products you're most likely to be looking for based on the words you entered. Our search methods go beyond simple keyword matching and may also be using information not visible on the search results page, including attributes provided by the publisher.

So apparently, people who search for my name in 2013-4 are looking for my 2010 book instead of my 2013 book. In addition, my publisher has given them attributes to suppress my recent book from the search results.

Search results for books may be based on the text of each book, not just its title. That's why you may sometimes see results you weren't expecting.

I don't even understand what this sentence could mean in the context of searching for an author's name.

I regret that we haven't been able to address your concerns to your satisfaction.

We won't be able to provide further insight or assistance for your request.

Thank you for contacting us.

This is rather shocking coming from the gatekeeper of most book sales.

***

Needless to say, this type of error costs authors as some people won't find the book. Yet, Amazon is unwilling/unable to fix the issue. Are there any readers here who have insight into why this is happening and how I might be able to correct this error?

My friend Kate alerted me to the notable New York Times story on academic fraud at the University of North Carolina (Chapel Hill). Phantom courses have been created to provide students with A grades, in some cases, for the benefit of athletes. This story fits the larger pattern of fraudulent practices across the education sector, which is the subject of Chapter 1 of Numbersense (link).

The story about law school admissions in Chapter 1 has many parallels with this story. Falsifying grades to make them look better is a common theme. Law schools do it for their US News ranking; universities with big-time sports programs do it for their NCAA eligibility and on-the-field competitiveness. For those who don't follow US college sports, it is helpful to know that the NCAA (governing body) sets a minimum average GPA for "scholar"-athletes (see this story for example). In addition, the NCAA created the "Academic Progress Rate" (Wikipedia), which tracks graduation rates and uses this metric to control the number of scholarships that can be given to athletes. Scholarships are a crucial tool for recruiting the best athletes out of high school. In other words, the NCAA practices a form of performance-based management.

Another parallel with the law schools story is that the college administrators do not see the problem of fraud as cultural or ethical, instead turning to their PR staff for damage control. It is laughable to think that the entire fraud is known only to two already-retired employees, as the reporter writes with amusement, just as it was laughable to think that the head of admissions single-handedly faked all kinds of admissions statistics at those law schools.

Those who read Chapter 1 (and 2) of Numbersense (link) know that all metrics have subjective components and that one should never manage by metrics alone. Once rules are set to compute a metric, the metric will be immediately gamed, and the gap between the measured and real outcomes will continue to grow. The mistake isn't in using metrics but in treating them as truth, and not being vigilant about gaming. If severe punishment were to meet the cheaters, one would suspect the amount of cheating would be reduced.

One of the great laments in the education sector in the last several decades is the adoption of the "corporation" model by many great universities in the U.S., and at the school level, through the ill-conceived No Child Left Behind legislation. This means rule by quantitative metrics, "performance"-based management, the acceptance of systemic inequity, plausible deniability by the executive class, treating education as an "investment" and students as "customers", etc.

A fundamental problem with this policy is the difficulty of measuring "performance" in education. When you have a half-accurate metric, and you treat it as truth, you are in a world of trouble.

***

Let's talk about accountability and punishment. In each of these cheating scandals, administrators and/or professors were at fault for manipulating grades and other statistics. Eventually, one or two "bad apples" were identified and dismissed while the "executives" expressed shock at being duped by previously loyal staff members. According to the NYT reporter, for example, the school believes that the administrators and other professors in the same department were not aware that their department offered courses that never met even once.

Even if I grant them their naivete, there is a deafening silence in the official story. What about the students? What about the beneficiaries of this fraud? Here, the plausible deniability vanishes. It is simply impossible that the students were duped. If you signed up for a course, and you couldn't figure out where or when the class meets, do you just go through a whole semester and not ask a question? When at the end of the term, you are given an A in the course for which you did not do any work, does no student ever ask a question?

What I really want to know is what UNC will do about those fraudulent grades. In my opinion, all these grades should be invalidated, and graduates who benefited from such courses should be given a choice: forfeit their degrees or earn those credits properly. As my friend pointed out to me, this course of action is lenient--because there is nothing punitive about it.

This is a moment in which we can tell if UNC is a business or an educational institution. If it is a business first, it will cater to its customers (and investors). If it is an institution that cares about academic integrity, it must act bravely.

***

The record on such matters unfortunately is dismal, as Kate reminded me. The students who copied final exams at Harvard last year were "forced to withdraw" for a year (not sure whether this will be noted on the transcripts). In spite of the investigation, many students and alumni remain convinced that the students were not wrong to copy their exams.

I also found this interesting article on how the withdrawal of athletes associated with the cheating scandal affected Harvard's APR (link). The administrators were talking openly about how this or that student's leaving affects the APR. The APR is probably the most gamed statistic yet to make the news.