Posts categorized "Statisticians"

Facebook data scientists are being blasted for a social psychology experiment they ran in 2012, in which they varied the amount of positive/negative content shown to users in their news feeds and measured whether this affected the positive/negative content posted by those users. (link to WSJ report; link to paper)

I'm perplexed by the reaction. Boing Boing's Cory Doctorow, who links to law professor James Grimmelmann, calls it "likely illegal". Slate slams it as "unethical". NPR proclaims that "we" are "lab rats, one and all".

***

About Consent

The biggest gripe is that users who were randomly selected to be part of the test were not informed. Facebook argues that consent is given in the overarching terms and conditions which all users must agree to.

I am against all forms of forced consent practiced by Internet companies like Facebook and Google, but that's a much larger issue of which scientific experiments like this one are but a small part. The same critics don't seem to mind when the experiments are conducted for financial gain (like generating advertising revenues), but they grumble about an academic exercise designed to verify a theory of social psychology.

Critics are emotionally charging the conversation by claiming that "Facebook is manipulating users' emotions". The truth is every form of marketing and advertising is a form of manipulating users' emotions. If you are against this particular experiment, you should be against all the other experiments (i.e. so-called A/B tests) that are conducted every day by the same companies on the same users who are not informed and have not agreed to be "lab rats".

In fact, the same critics should be more incensed by those advertising experiments, for which the businesses are looking for "actionable" insights, meaning that they are looking for ways to manipulate not just your emotions but your actions and behavior, such as what you click on, what you view, and what you spend money on.

Scientific Progress

Playing the devil's advocate for the moment, I'd suggest that this type of large-scale randomized experiment has the potential to revolutionize psychology research.

I have never been a fan of the typical psychological experiment that we are forced to accept as legitimate "science": you know, those experiments in which a professor recruits 10 or 50 students via campus posters offering $20 for participation. The severe limitation of the sample, in both size and composition, does not stop researchers from generalizing the results to all humans. The standard claim is that the observed effect is so large as to obviate the need for a representative sample. Sorry - the bad news is that a huge effect for a tiny non-random segment of a large population can coexist with no effect for the entire population.
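To make that last point concrete, here is a minimal numerical sketch (with made-up numbers, not taken from any actual study) of how a huge effect in a tiny, self-selected subgroup can coexist with essentially no effect in the population being generalized to:

```python
# A back-of-the-envelope sketch (hypothetical numbers): a huge effect among a
# tiny self-selected subgroup can coexist with essentially no effect in the
# population the researchers generalize to.
import numpy as np

N = 1_000_000                 # the population of interest
subgroup = 500                # students recruited via campus posters (hypothetical)

effects = np.zeros(N)         # true individual treatment effects
effects[:subgroup] = 2.0      # a "huge" effect (in SD units) for the recruits only

print(f"Effect among the recruits:     {effects[:subgroup].mean():.2f}")   # 2.00
print(f"Effect in the full population: {effects.mean():.4f}")              # 0.0010
```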

I believe scientists should be actively addressing the privacy and ethical concerns of such experiments but not dismissing them categorically.

The Opt Out Solution

There are at least two components of a possible solution. One is an organized review board, akin to institutional review boards. This should be created across the industry, not within a company. Perhaps a list of ongoing experiments should be made available for those who care enough to review it.

More importantly, users of Facebook or other websites should be allowed to opt out of any experiments (again, not just science experiments but also advertising-motivated experiments).

I think many critics are being incredibly facetious in asking for prior consent. For many psychology experiments, we need double blinding. The experiment at the center of this controversy would have been neutered if the participants knew what was being changed.

The Issue of Harm

One of the weakest arguments raised against Facebook is the allegation of harm. James Grimmelmann, who has been cited by various articles, wrote definitively: "The study harmed participants." (link) How so? He explains:

The unwitting participants in the Facebook study were told (seemingly by their friends) for a week either that the world was a dark and cheerless place or that it was a saccharine paradise. That’s psychological manipulation, even when it’s carried out automatically.

That's it? And he condones advertisers and politicians who manipulate our emotions because somehow we should accept lower standards in those arenas.

The Fallacy of Paying for Free Service

The WSJ commits the same fallacy as a lot of other journalists when it comes to explaining why users should submit to experimentation. It says:

Companies like Facebook, Google Inc. and Twitter Inc. rely almost solely on data-driven advertising dollars. As a result, the companies collect and store massive amounts of personal information.

The same argument has been used to support massive invasion of privacy and indiscriminate data collection.

There are several problems with this argument:

Firstly, most companies that are collecting massive amounts of data on their users today are not solely advertisers. Amazon is not an advertiser. Netflix is not an advertiser. Cable and phone companies are not advertisers. Banks are not advertisers. Your doctor is not an advertiser. (I mean, not primarily advertisers; some of these actually do earn advertising dollars.)

Secondly, advertising dollars are there whether or not there is data. I have yet to see a proper study showing that digital marketing dollars are incremental spending, rather than spend shifted from offline channels. Indeed, the frequent claim that digital advertising solves the half-the-traditional-advertising-spend-is-wasted problem is evidence that the digital marketing play is mostly spend shifting, not spend creation. The alternative world in which these digital advertisers and data collectors do not exist is not one in which the advertising market is half the size of what it is today.

Thirdly, advertisers do not need personal information. If advertising is the purpose of the data collection, anonymous data is just as good. This is because brands have an identity. Brands do not want to be different things to different people. Brand messaging, just like other forms of messaging, benefits from simplicity. Nike wants everyone to use "just do it"; Nike is never going to want a million slogans for a million people. Thus, the idea that collecting massive amounts of personal data is a "result" of data-driven advertising is bogus.

Fourthly, the business model of Facebook, Google, etc. is a choice. No one but themselves is forcing them to rely on advertising dollars. Google for instance makes Android, Chrome, etc., all of which are products that they can charge money for.

The journalists who keep printing the argument that we have to accept mass surveillance in order to support the business model of "free" are merely repeating a marketing message without thinking about it. They have fallen victim to emotional manipulation.

As others binge watch Netflix TV, I binge read Gelman posts, while riding a train with no wifi and a dying laptop battery. (This entry was written two weeks ago.)

Andrew Gelman is statistics’ most prolific blogger. Gelman-binging has become a necessity since I have not managed to keep up with his accelerated posting schedule. Earlier this year, he began publishing previews of future posts, one week in advance, and one month in advance.

Also, I have been stubbornly waiting for the developers of my former favorite RSS reader to work out an endless parade of the most elementary bugs, after they launched a new site in response to Google Reader shutting down. Not having settled on a new RSS tool has definitely shrunk the volume of my reading.

I only managed to go through about a week’s worth of posts because the recent pieces interest me a lot.

Gelman links to Lior Pachter's review of what he calls "quite possibly the worst paper I've read all year".

This bit deserves further mocking: when the researchers fail to achieve conventional 5% significance, they draw conclusions based on a "trend towards significance". This sleight of hand happens frequently in practice as well, where the phrase "directional result" is used.

When an observed effect, as in this case, is not statistically significant, the implication is that the signal is not large enough to be distinguished from background noise. When the researcher then says "but I still see a signal", said researcher is ignoring the uncertainty around the point estimate, pretending that the noise doesn't exist. The researcher is in effect making a decision using the point estimate alone. Anyone who has taken Stats 101 should know better than to rely on a point estimate alone.

One great tenet of statistical thinking is the recognition that the observed data sample is merely one of many possible things that could have happened. The confidence interval is an attempt to capture the range of possibilities, and the much-maligned tests of significance represent an attempt to reduce such analysis to one statistic. It achieves simplicity at the expense of nuance.
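Here is a small simulation sketch (hypothetical effect size and noise level, not the cannabis data) of why reading a non-significant point estimate as a "directional result" is dangerous: when the true effect is zero, the sign of the point estimate flips roughly half the time on replication, and the width of the interval dwarfs the estimate itself.

```python
# A small simulation (hypothetical effect size and noise) of why a non-significant
# "trend" should not be read as a signal: re-running the same study often flips
# the sign of the point estimate.
import numpy as np

rng = np.random.default_rng(1)
true_effect, sd, n, n_studies = 0.0, 1.0, 30, 1000   # hypothetical study settings

estimates = []
for _ in range(n_studies):
    treated = rng.normal(true_effect, sd, n)
    control = rng.normal(0.0, sd, n)
    estimates.append(treated.mean() - control.mean())

estimates = np.array(estimates)
print("Share of replications with a positive point estimate:",
      (estimates > 0).mean())                          # about 0.5: pure noise
print("Approximate width of a 95% interval:",
      round(2 * 1.96 * sd * np.sqrt(2 / n), 2))        # ~1.0, dwarfing the estimates
```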

This cannabis study is also a great example of what I've been calling "causation creep". The authors are well aware that they have merely found an instance of correlation (not even that, but let's grant it for the sake of argument), but when they start narrating their finding, they cannot help but use causal language.

The title of the paper is "Cannabis use is quantitatively associated with...", and yet the lead author told USA Today: "Just casual use appears to create changes in the brain in areas you don't want to change."

Causation creep is actually endemic in academic publishing of observational studies, and I don't want to single these authors out.

Gelman has been on this one for a while. The offensive paper looked at the correlation between hurricane damage and the gender of the names we give these hurricanes. I didn’t find it worth spending my time studying this line of research but I’m assuming that the problem is considered interesting because they claim to have found a “natural experiment” in that the gender is effectively “randomly assigned” to the hurricanes as they appear.

I have been quite irritated over the years by this type of research, encouraged by the fad of Freakonomics. Even if they did find a natural experiment, what is that experiment about? Instead of spending research hours on correlating damage with naming conventions, why not spend the precious time looking for real causes of hurricane damage? You know, like weather patterns, currents, physical phenomena, human-induced climate changes, human decisions to live in high-risk areas, etc.?

I should note that much of Steven Levitt's original work that launched this field deals with real problems, like crime rates. It's just that many of his followers have gone astray.

Matt Novak debunks an article in Vox which repeats the assertion by the tech industry that new technologies have been adopted much more quickly in recent years than in the past. Vox is not the only place where you see this assertion. We have all seen variations of the chart shown on the right.

Novak puts on a statistician's hat and asks how the data came about. This type of chart is particularly prone to errors since many different studies across different eras are needed.

What Novak found: the invention dates of older technologies (like TV and radio) were defined by their invention in the laboratory, while those of recent technologies (such as the Internet and mobile) were defined by their dates of commercialization. Needless to say, adoption is expected to be slow when the technologies were not yet available to consumers!

Needless to say, anyone who cites this chart or its conclusion from here on out should be publicly shamed.

Gelman nicely distills one of the central messages in my Numbersense book (Get it here). All data analyses require assumptions; assumptions are subjective; making assumptions is not a sin; clarifying one's assumptions and vigorously testing them is what makes good analyses. Go read this post.

Gelman was surprised by a recent paper in which the researchers found that 42% of their sample purchased detergent on their most recent trip to the store. This reminds me of the section of Numbersense (Get it here) in which I described a study in which some marketing professors had mystery shoppers track people in a supermarket and, within seconds of their placing groceries in their trolleys, ask them how much the items cost. The error rate was quite shocking.

There is another big problem with this research design. People's memory of what they purchased depends on how long ago that "most recent" trip was. I also wonder how online purchasing affects this sort of study as I typically don't count going to a website as "a trip to the supermarket". It seems like some sort of prequalification is needed but prequalification always restricts the generalizability of any finding.

Andrew gently mocks both of these commonly used procedures. The discussion of outlier detection is buried in the comments section, so if you are interested, you should scroll below the fold. Gelman's annoyance with outlier detection is semantic: but these are important semantics, which align with my own practice. Like Gelman, I don't consider every extreme value an outlier.

Stepwise regression is a suboptimal procedure, and Gelman prefers modern techniques like the lasso. But lots of practitioners use stepwise because the procedure is "intuitive", that is to say, one can explain it to a non-technical person without their eyes glazing over. The discussion below the post is worth reading.

Andrew Gelman discusses a paper and blog post by Ian Ayres on the Freakonomics blog. Their main result is summarized as:

We find that a ten percentage-point increase in state-level female sports participation generates a five to six percentage-point rise in the rate of female secularism, a five percentage-point increase in the proportion of women who are mothers, and a six percentage-point rise in the proportion of mothers who, at the time that they are interviewed, are single mothers.

Andrew finds these claims implausible, and so do I.

Ayres uses the econometrics methodology called instrumental variables regression to support these claims. Since the data is observational, and, as Andrew pointed out, there wasn't even a period of time in which one could find exposed and unexposed populations (since the Title IX regulation was federal), one must treat such regression results with a heavy dose of skepticism.

It is useful to understand that causal claims are possible here only if we accept all the assumptions of the instrumental variables method.

Besides, plausibility is assisted by the ability to outline the causal pathways. It should be obvious that more females competing in college sports does not directly cause more females to become secular. The data on sports competition and on secularism come from different sources, and this presents a hairy problem. The analysis would have been more convincing if it found that, among the women who participated in college sports, more became secular; what the analysis linked was a higher participation rate with higher secularism among all women in the state.

What is it about sports participation that would cause people to become secular? (The visual evidence from professional American sports would lead me to hypothesize the opposite--that sports participation may be associated with higher religiosity!) Is this specific to women? Did male secularism increase as sports participation by men went up?

As Andrew pointed out, the magnitude of the estimated effect seems too large to believe. I'd prefer to see these effects reported at more realistic increments. A jump of 10 percentage points in participation is drastic. For example, according to the chart here (the one titled "a dramatic, 40-year rise"), the percentage of women participating in high school sports moved just 2 percentage points from 1995 to 2011.

***

Andrew is right that this is an instance of "story time". And we are not saying that statisticians should not tell stories. Story-telling is one of our responsibilities. What we want to see is a clear delineation of what is data-driven and what is theory (i.e., assumptions). The plausibility of a claim depends on the strength of the data, plus whether we believe the parts of the theory that are assumed.

Posting will be light this week, as I prepare for a number of meetings. Please come find me if you are in the neighborhood.

On May 20 (this Tuesday), I am the Banquet Speaker at the Midwest Biopharmaceutical Statistics Workshop (MBSW), to be held in Muncie, Indiana on the Ball State campus. More information on this event here. I will be talking about how Big Data is affecting us, and the need for Numbersense. The key lessons will come from Google Flu Trends and digital marketing attribution models.

Next week, on May 28, I will be in Toronto at the Statistical Society of Canada's Annual Meeting. I am a participant in an invited session on "Business Analytics: Where does the statistician fit?" My article in Significance is a good starting point. More information about SSC here.

On June 18, I am giving a Keynote Speech at Predictive Analytics World, in Chicago. You can register here. The agenda looks great, and you will learn a lot about all kinds of applications across many industries at this conference.

In the popular science genre, one often comes across "published in a peer-reviewed journal" as a certificate of authenticity. Given that the authors of such reports or books typically do not have the technical chops to understand the materials deeply, it's not a surprise that they require third-party validation. However, "published in a peer-reviewed journal" is pretty weak.

I just read a paper published in a peer-reviewed journal that made me cringe. This is not just any journal but the California Management Review, which, according to Wikipedia, is "along with other publications such as Harvard Business Review and MIT Sloan Management Review, among the most influential and viable sources of contemporary business research". It has an impact factor of 1.667.

The paper is called "Organizational Blueprints for Success in High-Tech Startups: Lessons from the Stanford Project on Emerging Companies" by James Baron and Michael Hannan (pdf here). This paper was referenced in the New York Times Upshot article titled "Yes, Silicon Valley, Sometimes You Need More Bureaucracy", and my friend Xan G. was unhappy with the way the reporter presented the findings of the paper. I am not convinced by the original paper either.

***

First, let's talk about Xan's complaint. This sentence neatly summarizes the reporter's point of view:

Yet a human resource department is essential. The two [researchers] found that companies with bureaucratic personnel departments were nearly 40 percent less likely to fail than the norm, and nearly 40 percent more likely to go public — data that would strike many Silicon Valley entrepreneurs as heresy.

Going back to the source, we find the following (type of) chart from which the reporter extracted the evidence:

This chart shows the six "organizational blueprints", of which the Engineering blueprint is treated as the reference level. Only the Autocratic blueprint has a higher likelihood of failure than Engineering. Four out of five non-Engineering blueprints did better than Engineering, and among those four, Bureaucratic was the worst! So how can the reporter conclude that this chart supports more bureaucracy (which is defined as having a human resource department) for Silicon Valley? Why not go for the Commitment blueprint and have a 100% lower rate of failure instead of 40%?

Despite the name "Bureaucratic", the researchers never said that having an HR department is a defining characteristic of this blueprint. Having HR departments is compatible with several of the other blueprints. In fact, on page 14, the researchers stated: "Commitment and Star firms tended to be the fastest to bring in HR expertise."

What's more, the reporter described the Engineering blueprint as "the norm", which is not a term that the researchers used. The researchers used the word "modal" (page 11), meaning the most frequent, but this usage is contradicted by Figure 3, in which the largest slice of the pie chart is the "Aberrant" type (which probably maps to "Non-type") at 33 percent, against 31 percent for Engineering.

***

Next, the original paper in CMR is an instance of "story time". The data they have collected is nice but very limited; they really stretched their stories way beyond what their analysis could support. Most of the conclusions I'd consider based on theory rather than evidence. Besides, in simplifying the technical content to suit CMR's target audience, so much is lost that what remains is impossible to interpret.

Just look at Figure 6 above. The reader might guess that there is some kind of regression model being run with likelihood of failure as the response variable and the organizational blueprint as a predictor. The Engineering blueprint is selected as the reference level and given a value of 0%. What does it mean for the Commitment blueprint to be 100% less likely to fail? Does this mean no companies using Commitment have ever failed?

If you'd like to know what the actual failure rates are, you will not find them within the 26-page article. If Engineering failed at a one-percent rate, then Bureaucratic failed at 0.6 percent, hardly a concern. You'd note that there is no indication of standard errors, error bars, or sample size. (They referenced a more technical paper in the footnote but I didn't see actual failure rates there either.)

***

There are other problems with the CMR article. The authors said they have a sample of "nearly 200 technology start-ups" but made no mention of how these start-ups were selected. In the entire paper, not a single company's name was mentioned.

Knowing how the sample is selected matters a lot here. There may be survivorship bias, for example, in that companies which failed fast are not in the sample.

Much is made of firms that shifted their organizational blueprint as they aged. A Founder's blueprint is contrasted with a CEO's blueprint. On page 15, we are told that "only 18 of the 165 firms in Figure 4 changed from one pure model type to another; of these, 14 moved between Engineering and Bureaucracy, the two closest pure type models". So somewhere along the way, we lost 35 firms from the sample without explanation.

Then on page 16, the researchers said: "One obvious question to ask is: Do changes in HR blueprints accompany changes in senior management within startups? The answer is yes." Later, on page 21, they asserted "we found compelling evidence that changing the HR model is destabilizing to high-tech start-ups".

The level of confidence in those statements is at odds with the sample of 18 firms that changed the blueprints, of which 14 moved between the two closest types.

Let's do a quick calculation. With 6 basic types, there are 30 possible A->B shifts, where B->A is counted as different from A->B. There were only 18 observed shifts, 14 of which involved the same A->B pair, so there can be at most five unique shifts in the data. And yet they are able to draw conclusions about changes in HR blueprints in general?
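For those who want the counting spelled out, here it is in a few lines (the blueprint labels are shorthand based on the names used in this post; the figures are the ones quoted from the paper above):

```python
from itertools import permutations

# Blueprint labels as described in the post; shorthand, not the paper's exact wording.
blueprints = ["Engineering", "Commitment", "Star", "Bureaucracy", "Autocracy", "Non-type"]

ordered_shifts = list(permutations(blueprints, 2))   # A->B, with direction counted separately
print(len(ordered_shifts))                           # 30 possible shifts

observed_shifts = 18     # firms that changed from one pure type to another (per the paper)
same_pair = 14           # shifts concentrated in the Engineering/Bureaucracy pair
# If those 14 all share one ordered pair, as the text suggests, then at most:
print(1 + (observed_shifts - same_pair))             # 5 distinct shifts, out of 30 possible
```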

***

The subject of the research is worth investigating. I can't help but think that the research would have been better without the pretense of being data-driven. Most, if not all, of the conclusions are supported by the interviews, and not so much by the data anyway.

I am annoyed with the presentation of a series of charts that have no meaning--all relative values without disclosing the reference level; the presentation of statistical results without mentioning sample sizes or error bars; and the presentation of results from manually gathered data without even naming one company and describing the sampling methodology.

In case you are not subscribed to my dataviz feed, I put up a post yesterday that is highly relevant to readers here interested in statistical topics. The post discusses a graphic in a New York Times article that interprets the official inflation rate (known as the CPI). I devoted an entire chapter of Numbersense (link) to the question of why the official inflation rate diverges from our everyday experience.

In a larger context, the inflation rate is an invented metric, invented to measure some quantity that has no objective reality. This is true of a lot of statistics. Revenues and profits are also invented concepts, for example, and only attain meaning through generally accepted accounting rules. Obesity, which is discussed in Chapter 2 of Numbersense (link), is another example of a quantity that has meaning only because of a convention of measuring.

The article in NYT brings up one of the points I raised in the book, which is that price increases are magnified in our imagination while price decreases are taken for granted.

The other larger point of the chapter on inflation is that anyone wishing to comment on whether the CPI reflects real experience ought to understand how the CPI is constructed. A superficial understanding, such as knowing that it is the average price of a basket of goods, is useless because there are so many little details that affect the statistic. Because inflation has no objective basis, it is pointless to argue whether it reflects reality: all we are left with is discussing the rules, and you can't discuss the rules without knowing them well.

Details matter a lot in statistics. This is one of the reasons why I keep asking my Big Data colleagues to talk specifics. A statistician who only talks in generality is like the Manhattan realtor who can't tell you the size of the listed apartment.

I saw Joe N.'s tweet asking me about a study of how professors spend their time, reported by Lisa Wade at Sociological Images. This is an anthropological study, something that I am not at all familiar with although the people in the field seem to believe that they can make statistically valid observations.

I'm glad the author of the study, John Ziker, wrote a (really) long article describing what he was trying to accomplish. The key point is that the study is a preliminary exploration, with important limitations; a follow-up study is planned which may give generalizable conclusions.

Here are some issues with the first study that make a statistician nervous:

- the sample was tiny, between 14 and 30 professors: Wade reported it as 16; Ziker definitely started with 30

- the selection was non-random, based on the first 30 people who responded to a school-wide announcement

- about half the initial respondents did not complete the study, and provided only partial data (one to six days)

- despite the tiny sample, some analysis required slicing the data further into four segments by grade level! I wonder how many department chairs were in that sample. (See chart on right)

- each professor was followed for a two-week period but interviewed only every other day, thus each professor contributed at most one observation per day of the week

- the interviews were every other day "so the time taken for the interview did not appear on the previous day’s report." This is a horrible problem to deal with! Because time allocation is the subject of the study, the measurement method (in-depth interviewing) interferes with the measured outcome. I find it impossible to believe that the time spent answering questions every other day did not affect time allocation on the non-interview days.

- Ziker reasoned: "While we cannot make a claim that all faculty have the same work patterns as our initial subject pool — they do not comprise a random sample — the results are highly suggestive because of the consistency across our subjects who did represent." In order not to fall prey to the law of small numbers, a better way to say this is: we assume that the small sample is representative in both mean value and dispersion, which then leads to the assumption that all faculty have consistent work patterns similar to those observed.

- "With our initial 30 Homo academicus subjects, we ended up with a 166-day sample with each day of the week well represented." I am assuming that Ziker did not drop the 16 professors with partial data and made charts like the one on the right by ignoring the identity of the professor and aggregating over days of the week. Let's review what lies behind this chart. Each respondent contributed at most one observation per day of week; about half of the respondents did not even contribute data for all seven days. So the time allocation on any particular day is averaged over anywhere from 14 to 30 professors. These professors span a variety of ranks, departments, tenure, backgrounds, etc. and were not randomly selected. It's hard for me to trust this chart at all.

***

In general, I am a big fan of shoe-leather research in which the researcher goes out there and gathers the relevant data needed to address the specific research question, rather than picking up whatever data can be found and then tailoring the research question to avoid the imperfections in the data. So I don't want to sound too negative. It's a difficult research problem they are dealing with. What they learned from this first study is useful to inform future explorations, but drawing conclusions at this stage is premature.

At the end of his article, Ziker described the "experience sampling" method that will form the next phase of this study. I am very excited about this methodology.

Roughly speaking, they will ask participants to install a mobile app, which pops up questions from time to time asking what they are doing at that moment. Instead of exhaustively tracking a small number of participants over the course of time, they will get little bits of data, incomplete schedules, for a large number of professors. If the sample is big enough and randomized appropriately, they can analyze the data ignoring the professors' identities, and report results for the "average professor". This method also retains the other benefit of the original design, which is that the respondents report their activities close to the time at which they occurred.
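Here is a stylized sketch (with made-up activities and shares, not Ziker's actual protocol) of why experience sampling can work: each professor contributes only a handful of momentary responses, yet the aggregate across many professors recovers the average time allocation.

```python
# A stylized sketch (hypothetical numbers, not the actual study design) of
# experience sampling: each professor answers only a few random pings, yet the
# average across many professors recovers the population's time allocation.
import numpy as np

rng = np.random.default_rng(7)
activities = ["teaching", "research", "meetings", "email", "other"]
true_shares = np.array([0.25, 0.30, 0.15, 0.20, 0.10])   # assumed "true" average allocation

n_professors, pings_per_person = 400, 5                   # very sparse data per person
responses = rng.choice(activities, size=(n_professors, pings_per_person), p=true_shares)

estimated = {a: round(float((responses == a).mean()), 3) for a in activities}
print(estimated)   # close to the assumed shares, despite incomplete individual schedules
```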

Data scientists, pay attention! You don't have to collect complete data at the user level to do proper research. Designs like this "experience sampling" approach produce statistically valid findings without the need for complete data. In fact, trying to collect complete data is counterproductive, leading to shaky conclusions as shown above.

My article on whether we can trust airfare prediction models is published today at FiveThirtyEight, the new data journalism venture launched by Nate Silver after he moved to ESPN.

This topic was originally conceived as a chapter of Numbersense (link) but I dropped it. As I have noted in my review of Nate Silver's book, he has a keen interest in evaluating predictions, and not surprisingly, he encouraged me to get this piece done.

Putting Big Data to the Test

Just like Google Flu Trends (link), Oren Etzioni's Farecast has been held up as a Big Data success story. I have been a Farecast user for years, and though I use the tool, I've always wondered how accurate those predictions are. If you're a user, you've probably wondered as well. I have also complained that Big Data practitioners are too lax in offering quantitative evidence for their Big Data projects--it's a bit ironic when we tell others to use data and throw away their gut feelings.

One of the reasons for this oversight is that it is hard work to evaluate predictions properly. In this post, I will cover how I designed the evaluation strategy.

Humility First

The first rule of evaluation is to check your ego at the door. The goal of evaluating a predictive model is to measure how well it performs. It is tempting for the evaluator to reinvent the wheel, devise a new way of predicting, and prove its superiority--but that is not evaluation. The evaluator is like a quality-control analyst, or a code reviewer.

Assumptions, Assumptions, Assumptions

One of the core messages of Numbersense (link) is that every analysis has assumptions, often called "theory". People who think their analyses contain no assumptions are usually the ones who haven't thought carefully about their models. Making "no assumptions" is itself an assumption. In the same way, evaluating models requires assumptions, and lots of them! Bear this in mind as you keep reading.

What to Compare to

In my article, I explained why the right comparable is the most realistic alternative strategy for purchasing air tickets if one were not to use Kayak/Farecast. This is one of my most important assumptions, and it took me a while to figure it out.

At first glance, you might think a "natural" comparable would be the actual price trajectory for a given route for a given travel period. In other words, consider when the algorithm recommended buying and when it recommended waiting and judge it based on whether the algorithm led you to the lowest price during your search.

Using such a metric commits a form of hindsight bias. You have to remember that any algorithm (or human) must make the decision to buy or wait when future prices are not yet known. In addition, we expect that there will be substantial price volatility in the future. If we were able to re-run history many times, the price paths would be different and the algorithm's performance would also vary. When we are staring at the realized price path (ex post), it is easy to forget about the underlying volatility.

A worse problem with judging the algorithm against the lowest possible prices is that there may be no way to get to that lowest price! Remember to check your ego: you are the evaluator, not the modeler. The question is whether you can find an existing alternative strategy that would lead you to those lowest prices without cheating by using future price data.

Don't Compare to Imaginary Toys

Instead of the theoretical maximum (i.e. the model that always finds the best possible price), let's consider the existence of a Best Realizable Model (BRM). Let's suppose its performance will be 70% of the theoretical maximum. Then, in theory, we can compare Kayak/Farecast to this BRM.

The catch is we don't know anything about BRM. In particular, we don't know if it performs at 70% or 30% of the theoretical maximum. If Kayak/Farecast gets to 25% of ideal and BRM 70%, then Kayak/Farecast is pretty poor. But at the same 25%, if BRM performs at 30%, then Kayak/Farecast is impressive.

Neither the theoretical maximum nor the Best Realizable Model should be used in this evaluation, simply because they are not real strategies, just imaginary toys.

Don't Bow to Random

At the other extreme, modelers like to compare their algorithms to the "random" strategy. In the case of airfare prediction, one such strategy might be picking a random number of days before departure and taking the lowest fare on that day. The random strategy amounts to throwing a die, using no skill at all. This is unsatisfactory because it sets too low a bar.

Compare to Next Best Alternative

In my view, a far better approach is to figure out what you'd have done in the absence of the predictive model. In my case, I typically wait till two weeks before departure, and so that is the comparable.

Some readers have commented that they tend to buy three or four weeks before departure, and one pointed to an analysis claiming that 54 days is exactly the right moment to get the cheapest fare. This takes us back to the point raised earlier, that every evaluation strategy makes assumptions. If we do an analysis starting 30 days out, a different reader will object, saying he or she typically purchases 21 days out. (Remember, though, the earlier you start this exercise, the longer you have to track day-by-day wait or buy recommendations.)
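To make the evaluation logic concrete, here is a schematic sketch with made-up price paths and a made-up recommendation rule (Farecast's actual model is proprietary, and this is not the analysis from my FiveThirtyEight article). The point is the structure: score the model's buy/wait calls against the price you would have paid under your next best alternative, here the 14-day rule.

```python
# A schematic sketch of the evaluation logic (made-up price paths and a made-up
# recommendation rule; not the actual FiveThirtyEight analysis). The model's
# buy/wait calls are scored against the traveler's default: buy 14 days out.
import numpy as np

rng = np.random.default_rng(3)

def simulate_price_path(days=60, start=400.0):
    """Made-up daily fares for one itinerary, from 60 days out to departure."""
    steps = rng.normal(1.5, 20, size=days)          # upward drift plus volatility
    return np.maximum(120, start + np.cumsum(steps))

def model_says_buy(prices_so_far):
    """Placeholder recommendation rule (the real model is proprietary)."""
    return prices_so_far[-1] < np.mean(prices_so_far)   # buy when below the running average

savings = []
for _ in range(2000):
    path = simulate_price_path()
    baseline_price = path[-14]              # the traveler's default: buy 14 days out
    model_price = path[-1]                  # if the model never says buy, pay the walk-up fare
    for day in range(len(path) - 14):       # follow the model from 60 down to 15 days out
        if model_says_buy(path[: day + 1]):
            model_price = path[day]
            break
    savings.append(baseline_price - model_price)

print(f"Average saving vs. the 14-day rule: ${np.mean(savings):.0f}")
```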

***

Another "obvious" (until you think about it) evaluation criterion is to focus on the probability estimates themselves, rather than the outcomes they produce. The probability estimates are given in the form "there is 79% chance that the fare would go up by $20 or more in the next seven days." These forecasts are, of course, given for every route, for every departure date, and for every date of search.

It may take more than a moment's thought, but such probability estimates are essentially impossible to verify. The forecast statement basically asserts that if the same forecasting situation arose many, many times, the forecast would be right about 79% of the time. But in real life we can't replicate the same forecasting situation many, many times.

One way around this is to forget about real life, and check the probability estimates by simulating many future worlds. A major problem of this approach is that even if you can show that the probability estimates are good relative to those simulations, the travelers who use these forecasts still may not save money. Again, it lacks the "what would you otherwise have done" dimension.
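For completeness, here is what a bare-bones calibration check looks like, run on simulated forecasts (since, as argued above, the same real-world forecasting situation cannot be replayed many times):

```python
# A bare-bones calibration check on simulated forecasts: group predictions by
# stated probability and compare with how often the event actually occurs.
import numpy as np

rng = np.random.default_rng(11)
n = 50_000
stated_p = rng.uniform(0.05, 0.95, n)     # e.g. "79% chance the fare rises $20+ in 7 days"
event = rng.random(n) < stated_p          # simulated worlds in which the forecasts are honest

bins = np.linspace(0, 1, 11)
for lo, hi in zip(bins[:-1], bins[1:]):
    mask = (stated_p >= lo) & (stated_p < hi)
    if mask.any():
        print(f"stated {lo:.1f}-{hi:.1f}: observed {event[mask].mean():.2f}")
# A well-calibrated forecaster's observed frequencies track the stated probabilities.
```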

As you dig deeper, you'll find more tricky issues. One is the continuous time (24/7) nature of online travel search. If you need to measure whether "the fare went up $20 or more in the next seven days," you'd have to be monitoring fares continuously over those seven days. To make things even more complicated, in each of those seven days, for the same itinerary, Kayak/Farecast is updating its prediction in a rolling 7-day window.

***

There is one other consideration I'd like to cover. Nate Silver is famous for predicting all 50 states correctly. Remember, though, that anyone with the slightest knowledge of US politics can predict 40 out of 50 states--where he demonstrated his skill was in the swing states.

Now in the context of airfare prediction, if it were true that prices are much cheaper two or three months prior to departure, then people who would be purchasing in those time frames do not really need an algorithm to help them. It is important to test predictive models under situations in which they are most likely to demonstrate their skills. I would therefore recommend that you evaluate airfare predictions closer to the departure date when you'd need it most.

***

As you can see, evaluating predictive analytics is filled with challenges. But as I demonstrate here, it can be done. It should be done.

For those who weren't able to attend my recent talks, recordings of a few have surfaced online.

***

JMP put up the video of the webcast from last Friday with Alberto Cairo, a data visualization expert and author of The Functional Art. You can access it from here. This event is part of their Analytically Speaking series with recent guests such as David Hand and Michael Schrage. I also appear on this recording of the panel celebrating the International Year of Statistics.

***

Agilone, an emerging vendor of self-service marketing analytics software, hosted me at their recent user conference, as well as a webcast. Here is a clip, in which I explain the structure of analytics teams that I have assembled.

***

Last year, I gave a fun, lightning talk at the Leaders in Software & Art conference. The recording is here.

***

Joe Dager did several long interviews with me that are well worth listening to. Here's Part 1, and then there's Part 2.

There is now some serious soul-searching in the mainstream media about their (previously) breathless coverage of the Big Data revolution. I am collecting some useful links here for those interested in learning more.

Here's my Harvard Business Review article in which I discussed the Science paper disclosing that Google Flu Trends, that key exhibit of the Big Data lobby, has systematically over-estimated flu activity in 100 of the last 108 weeks. I also wrote about the OCCAM framework, which I find useful for thinking about the "Big Data" datasets we analyze today versus more traditional datasets from the past.

Slate was probably the earliest to react, and noticed a post on this blog that was the precursor to the HBR article.

Readers who are specifically interested in GFT should read the source materials themselves, which are quite accessible. Start with the Science paper. After that, you can read the original research article by the Google team, hosted at google.org (click on the PDF link in the blue box at the bottom of the page). There are some bold claims in this paper, as well as caveats. They seemed to be concerned about "false alerts" at the time, such as news events rather than illness driving certain searches. (For those statistically inclined, the underlying model involves only 1,152 data points--128 weekly aggregates in each of nine regions--but a search through 450 million simple logistic models, not only to decide which search terms are important but also to determine how many search terms to include in the final regression.)
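For readers who want a feel for the recipe, here is a toy sketch of the general approach as I understand it from the paper: score each candidate search term with a simple univariate model against the CDC series, then decide how many top-ranked terms to carry into the final regression. This is my own stylized version with simulated data, not Google's code or their actual model form.

```python
# A toy sketch of the general recipe (stylized, simulated data; not Google's code).
import numpy as np

rng = np.random.default_rng(0)
weeks, n_terms = 128, 500                        # 500 stands in for millions of candidate terms
cdc_ili = np.abs(np.sin(np.arange(weeks) / 8)) + rng.normal(0, 0.05, weeks)   # fake CDC series

# Candidate query shares: a handful genuinely track flu, the rest are noise.
queries = rng.normal(0.5, 0.3, (n_terms, weeks))
queries[:20] = cdc_ili + rng.normal(0, 0.2, (20, weeks))

# Step 1: score each term with a simple univariate fit (here, squared correlation).
scores = np.array([np.corrcoef(q, cdc_ili)[0, 1] ** 2 for q in queries])
ranked = np.argsort(-scores)

# Step 2: pick how many top-ranked terms to combine, judged on held-out weeks.
train, test = slice(0, 100), slice(100, weeks)
best_k, best_err = None, np.inf
for k in (1, 5, 10, 25, 50):
    x = queries[ranked[:k]].mean(axis=0)         # crude aggregate of the top-k terms
    coefs = np.polyfit(x[train], cdc_ili[train], 1)
    err = np.mean((np.polyval(coefs, x[test]) - cdc_ili[test]) ** 2)
    if err < best_err:
        best_k, best_err = k, err
print(best_k, round(best_err, 4))
```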

Then read the article by Cook et al., which covers an update to the model made after the 2009 season, when GFT totally missed the pH1N1 swine flu epidemic. Notice that this Big Miss is the opposite error to the "false alert" problem. (See Chapter 4 of Numbers Rule Your World for a thorough discussion of different types of prediction errors, and how to think about them.) From the charts in the Cook article, you can see that in the run-up to the Big Miss, GFT systematically under-estimated flu activity for as many weeks as you can count.

The overhaul was drastic. The search-term topics that accounted for 70% of the original model were reduced in importance to 6%, while two other topics that originally accounted for 8% were inflated to 69% in the updated model. This dramatically improved the "fit" statistic (RMSE) for the "first phase" of the Big Miss from 0.008 to 0.001.

Next, there is Butler's article for Nature (Feb 2013), which precedes the Science article but first pointed out the over-estimation problem for the 2012 flu season. One possibility is that the model update described above over-compensated for the Big Miss, making it more susceptible to the False Alert.

Other media coverage of Google Flu Trends include Guardian (which focuses on the need to understand causality), CRN, and the Economist (which talks mostly about Twitter data which is much more problematic than Google search data).

***

Tim Harford has probably the most educational of the pieces revisiting Big Data for the Financial Times. (When I wrote this line, the FT link wasn't working. The title of the article is "Big Data: Are we making a big mistake?" if you need to find a different link.) His is the longest and covers a lot of ground, and has great examples, including one of my own. Highly recommended.

***

One of the slogans of the Big Data industry (of which I'm a part) is the push toward "evidence-based" decision-making in place of "gut feelings" or "instincts". Until now, I'm afraid, there has been precious little "evidence" presented to support the assertions of universal, revolutionary goodness of Big Data (try searching for quantitative assessments of Big Data projects). I hope we are witnessing the birth of evidence-based decision-making inside the industry of dishing out evidence-based advice.

The reason I put out the OCCAM framework is to steer our community toward a more constructive approach to tackling "Big Data" problems. It requires a fundamental shift in how we define the problem. I have a moderately more technical take on some of the statistical challenges in an article published in Significance earlier this year. This article discusses six technical challenges where we need substantial progress.

Statisticians sometimes dismiss these as "old news," claiming that the same problems exist in smaller datasets and are well known. A recent example is Jon Schwabish's tweet saying that this discussion induces a "yawn". This reaction feels a bit like Fermat writing in the margin claiming he has a proof. The rest of the world doesn't want to wait 358 years to find out what goods these statisticians are hiding.

In my view, there has been some interesting work but nothing to settle the debates. If we had great solutions, we wouldn't be discussing these same problems today.

***

Back to flu prediction. It's really something that is well worth pursuing!

It's a nicely defined, self-contained problem that has social benefits and whose results can be easily measured. We should be grateful that the Googlers spent time working on it. It's a problem I'd love to work on if I had the time and resources on my hands.

The researchers have also pioneered this type of research using search-term data. This is highly significant, and the data represent a perfect example of what I call OCCAM data: the data is purely Observational (related to what Harford calls "found data" or what Dan Eckles calls "data exhaust"); it has no Controls; it is seemingly Complete; it wasn't collected for the purpose of predicting flu trends, that is, it is Adapted from other uses; and the search data was Merged with the CDC data (the matching of states to regions, and of weeks, was not exact, as you can tell from the original research article).

The several published versions of the predictive models are clearly failures, but anyone in this business knows model building is an iterative process. One can learn from these mistakes. I happen to think they need to wipe the slate clean and use an entirely different approach. It's a small price to pay if there is a reward down the road.

I sincerely hope that this coverage will lead to improved modeling and analytical techniques rather than a retrenchment.