Posts categorized "Aggregation"

Stephen Stigler, the preeminent historian of statistics, gave a great talk at JSM, the annual gathering of statisticians, on Monday afternoon in Boston. He outlined seven core ideas ("pillars of wisdom") in statistical research that set the field apart; these are ideas developed by statisticians that represent significant advances to science and to human knowledge.

As he remarked, each of these advances overturned then-established science, but even today, many people outside statistics are not aware of these lessons.

I will briefly recount three of these discredited beliefs about data and uncertainty:

I have written a few posts about this topic, especially on the Junk Charts blog. Take a look at this post from Monday (link) or this post about "loss aversion".

2. Fallacy: Information increases linearly and proportionately with the number of samples, i.e. small data, a few insights; big data, a lot of insights. What Statisticians Learned: Information increases only at the rate of the square root of the sample size, meaning there are diminishing returns to increasing sample size.

A corollary of this is that it will take a lot more effort to squeeze out ever-decreasing amounts of marginal information in big data. (A quick simulation after point 3 below illustrates the square-root law.)

3. Fallacy: The only correct way to run an experiment is to alter one factor at a time while keeping everything else unchanged. What Statisticians Learned: The one-factor-at-a-time dogma is wrong; we should ask Nature many questions at the same time.
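To make the square-root law in point 2 concrete, here is a minimal simulation of my own (not from Stigler's talk): the standard error of a sample mean shrinks like 1/sqrt(n), so quadrupling the sample size only halves the uncertainty.

```python
# Minimal sketch (not from the talk): the uncertainty in a sample mean
# shrinks like 1/sqrt(n), so quadrupling n only halves the standard error.
import numpy as np

rng = np.random.default_rng(0)

for n in [100, 400, 1600, 6400]:
    # simulate many samples of size n and see how much the sample mean varies
    means = rng.normal(loc=0, scale=1, size=(10_000, n)).mean(axis=1)
    print(f"n = {n:5d}   SE of mean ~ {means.std():.4f}   (theory: {1 / np.sqrt(n):.4f})")
```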

***

I loved Stigler's talk. I do wonder if we also need to look at ourselves in the mirror. If we were to test our students on the seven pillars of wisdom after they took Intro Stats, I suspect we would learn a very unfortunate result: most of them would not have appreciated these points. The intro curriculum is much too focused on mechanics.

***

Here are some of the talks I have attended so far:

David Banks showed an application of LDA (topic models) to a corpus of posts from political blogs. The topic distribution is allowed to vary with time. They attempted a semi-successful automated naming of topics by matching words to Wikipedia articles.

Madeleine Cule described how Google uses a mixture of experiments and statistical adjustments to compute the effect of display advertising on brand interest. Brand interest, since this is Google, is defined as searching for branded terms on Google. How times have changed... digital advertisers have returned to the old world of measuring indirect brand metrics, abandoning clicks and direct responses.

Phillip Yelland showed how Google created a machine that generates sales pitches for its sales team, whose objective is to increase advertisers' spending with Google. Like Cule's talk, this methodology is heavily influenced by end-user input: in this case, it is explicitly represented in a Bayes Belief Network. For those paying attention, the Google researchers discussed how they do not have all of the data, how true "controls" are almost impossible on the Web, how they are restricted by the data collection practices of third parties (i.e. adapted data), and how they write "lots of SQL".

Phillip Yu gave an overview of statistical models of ranking data.

Cynthia Rudin described a novel predictive model to find clusters (in space and time) of crimes in Cambridge, MA. Good questions from the floor.

Someone from Nielsen (I only heard the second half of this talk) mentioned a lot of practical problems with set-top box data. In this setting, you are supposed to have "all the data". In reality, you don't, and what you have is problematic. Just one example: lots of people turn off their TVs but not their set-top boxes, so it's hard to know whether they are still watching the same channel or have gone to bed. There is also the really tricky business of adjusting such data: you need different models at the user level and at the aggregate level, but those models are then inconsistent with each other.

R. Mazumder offered a reformulation of Factor Analysis as an optimization problem.

Pouliot tried to predict which restaurants in San Francisco may have health problems using text analysis of Yelp reviews. This is another application of LDA, although in my mind not a successful one. It was first attempted as an unsupervised problem, then as a supervised problem using past inspections as the training data. The problem with using past inspections is that the model is then conditioned by past rules. I'm not sure Yelp reviews are the way to go, but this is an interesting open problem.

Twitter was supposed to give a talk but cancelled.

Facebook is, as far as I know, absent, perhaps because of the weird controversy about its scientific research. If so, it's tragic.

UPDATE (8/8/2014):

Here are other sessions I attended, including one on statistical graphics I somehow missed in the first roundup:

Grace Wahba gave the Fisher Lecture. Her research from decades ago on kernels and splines has made a great impact on the field. It's one of those ideas that took off once we reached a certain level of computational power. The support vector machine has been found to be a specific implementation of her more general ideas. She presented a paper on using smoothing-spline ANOVA to look at mortality among family members.

There was an important session on reproducibility of statistical research, and I heard talks by Phillip Stark and Y. Benjamini. Stark lays out the issues nicely, and calls a spade a spade: a lot of today's research, especially but not limited to computational statistics, amounts to "hearsay" or "advertising" for unpublished content. This is because referees or readers do not have the ability to replicate such findings. He introduces the Berkeley Common Environment (BCE), a set of tools that his and other research teams use to keep track of code, environments, etc. to make research replicable. Sounds interesting to me.

Benjamini's talk is primarily about what he calls "selective inference". One way to describe this is that one should not use the same data both to parametrize one's model and to generate estimates from that model. If this is ignored, the estimates will very likely be too optimistic (i.e. over-fitted to the observed data). Perhaps this is better understood in the context of modeling, but it is also widely practised in statistical testing, where we first select significant effects based on p-values, and then issue interval estimates on those selected effects. As his provocative subtitle asserts, "it's not the p-value's fault". He then ran out of time but was getting into another issue: the need to estimate interaction effects due to clusters, e.g. medical centers in a clinical trial.
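To see why selecting on p-values biases the follow-up estimates, here is a small simulation of my own (not Benjamini's): among many effects whose true value is zero, the ones that clear the significance bar are precisely the ones whose estimates landed far from zero by chance.

```python
# A small simulation of the selective-inference problem: test many true-null
# effects, keep only the "significant" ones, and note how inflated their
# estimates are relative to the truth (zero).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_effects, n_obs = 1000, 25

samples = rng.normal(0.0, 1.0, size=(n_effects, n_obs))   # every true effect is zero
estimates = samples.mean(axis=1)
se = samples.std(axis=1, ddof=1) / np.sqrt(n_obs)
pvals = 2 * stats.t.sf(np.abs(estimates / se), df=n_obs - 1)

selected = pvals < 0.05
print(f"selected {selected.sum()} of {n_effects} null effects")
print(f"mean |estimate| among selected: {np.abs(estimates[selected]).mean():.2f}")
print(f"mean |estimate| among all:      {np.abs(estimates).mean():.2f}")
```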

Another session brings together the business, legal and ethical dimensions of data privacy. I wish this session was structured more like a conversation than a series of presentations with a few questions at the end. The lawyer, Paul Ohm, contends that anonymization is impossible, and the notion of PII (personally identifiable information) is useless. I don't think I heard what the solution is. Dick Deveaux, the statistician, reviews recent controversies, such as the Google Maps, Facebook research, and OK Cupid experiments. Clearly our community is ill-prepared to make constructive contributions right now to this debate.

Leland Wilkinson discussed "scagnostics", which apparently originated with John Tukey. Scagnostics are summary statistics of scatter plots. Once you reduce those plots to numbers, you can classify and cluster the plots, and Wilkinson showed how this is done and then how to explore a corpus of scatter plots in this way. Useful if you have that type of data.

Max Ghenis presented Glassbox, an R package created by Google's HR analytics group for visualizing the response surface of models. I have used this sort of tool for years. Definitely a worthy project.

Andrea Kaplan demonstrated Gravicom, a tool to manually generate clusters of nodes from network data. This is still preliminary work. Would like to see it again once she incorporates algorithmic approaches to complement the manual approach.

GitHub and Shiny win the Supporting Actor awards for the Statistical Graphics sessions I attended.

Carson Sievert presented a way to explore topics and words in LDA topic models. I liked what I saw and think it would be very useful to people building these models. I hope he will expand this project from purely exploratory to allowing users to take actions based on what they see.

M. Majumder's talk was entitled Human Factors Influencing Visual Statistical Inference. It's really about "line-up plots", a brilliant concept by A. Buja. You generate multiple sets of simulated data from the null distribution and insert the observed data into this set at a random position. Then the analyst tries to pick out the observed data from the noise. If you can't do it, then we don't have a signal. This research used Mechanical Turk workers to test which covariates affect performance on this task. I was a bit disturbed by the extreme variance of the "percent accurate" metric: the boxes in many of the boxplots went from almost 0 percent to almost 100 percent. I believe Majumder's conclusion is that the variance is almost entirely explained by the difficulty level of the tasks at hand. I would like to see this work repeated with a choice of tasks that do not range so much in difficulty level.
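For readers unfamiliar with line-up plots, here is a rough sketch of the protocol (my own toy example, not Majumder's materials): hide the real scatter plot among panels generated under the null, and see whether an analyst can pick it out.

```python
# A rough sketch of a line-up plot: 19 null panels plus the real data hidden at
# a random position. If you can't spot the real panel, the "signal" you thought
# you saw is probably noise. The data here are invented for illustration.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
n = 50
x = rng.uniform(0, 1, n)
y_observed = 0.4 * x + rng.normal(0, 0.3, n)      # the "real" data, with a weak signal

slot = rng.integers(20)                            # where the real panel is hidden
fig, axes = plt.subplots(4, 5, figsize=(10, 8), sharex=True, sharey=True)
for i, ax in enumerate(axes.flat):
    y = y_observed if i == slot else rng.permutation(y_observed)  # null: shuffle y
    ax.scatter(x, y, s=8)
    ax.set_title(str(i + 1), fontsize=8)
plt.tight_layout()
plt.show()
# The honest analyst records a guess before looking up `slot`.
```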

The other day, we were told that if we walk anywhere in New York, we will bump into a few millionaires (link). This week, we are told that wherever we go in the US, "Roughly, every third person you pass on the street is going to have debt in collections" (link). The woman who said this has a PhD in Economics from Cornell.

Oh please. The claim comes from yet another misinterpretation of a statistical average. That statistic appears to be "35.1 percent of people with credit records had been reported to collections for debt that averaged $5,178, based on September 2013 records."

Firstly, not every American has debt; indeed, later in the same article, the reporter told us "people increasingly pay off balances each month" and, even more directly, "only about 20 percent of Americans with credit records have any debt at all". Secondly, not everyone has a credit record: for example, kids and students typically don't have credit records, but you will pass by them on the street.

Thirdly, people who have debt trouble are not evenly distributed across the States, nor within a state or even a county. Again, the reporter helps us by telling us "the delinquent debt is overwhelmingly concentrated in Southern and Western states."

Besides, the statistic did not indicate a time frame. Is it 35 percent who have ever been reported to collections, or 35 percent who are currently in collections, or some mixture of the two? If the statistic includes anyone who is current today but was in collections in the past, then again, when you bump into such a person, he or she does not currently have debt in collections.

It seems that the reporter has more numbersense than the PhD economist.

***

The underlying problem is that the statistical average is computed for a specific subgroup of Americans and the statement about walking on the street is for a different subgroup of Americans. In the prior example (link), the statistic about all New Yorkers is liberally applied to New Yorkers you meet on the street.

If a number is to be used to describe a different subpopulation, we must first adjust it to that group. If we aren't comfortable making such an adjustment, then we shouldn't extrapolate.
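Here is the kind of back-of-the-envelope adjustment I have in mind. The inputs below are hypothetical (made up purely to show the mechanics); only the 35.1 percent figure comes from the article.

```python
# Back-of-the-envelope adjustment with made-up inputs: the 35.1% applies to
# people WITH credit records, not to everyone you pass on the street.
in_collections_given_record = 0.351   # the reported statistic
share_with_credit_record = 0.80       # hypothetical: share of adults with a credit record
share_adults_among_passersby = 0.75   # hypothetical: share of passers-by who are adults

share_of_passersby = (in_collections_given_record
                      * share_with_credit_record
                      * share_adults_among_passersby)
print(f"~{share_of_passersby:.0%} of passers-by, not 35% -- and that still ignores geography")
```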

To add to my prior post, having now read the published paper on the effect of DST on heart attacks, I can confirm that I disagree with the way the publicist hired by the journal messaged the research conclusion. And some of the fault lies with the researchers themselves, who appear to have encouraged the exaggerated claim.

Here is the summary of the research as written up by the researchers themselves. First I note the following conclusion:

and right before, they write this explanation of the "timing" effect:

So indeed, if I were to believe the research, someone may have a heart attack on Monday instead of Tuesday "as a result of" daylight savings time in the spring. And wait a minute, by reversing this change in the fall, we seemingly postpone some heart attacks by two days. Hence my assertion that even if true, the phenomenon is not interesting.

In fact, I think this study provides negative evidence toward the idea that DST causes heart attacks. Here is how the authors describe their hypothesis:

The new data show no statistical difference in overall heart attack (admissions) for either period. That is their main result.

***

In this post, I want to discuss the challenges of this type of research. The underlying data is OCCAM (see definition here). It is observational in nature, it has no controls, it is seemingly complete (for "non-federal hospitals in Michigan"), and it is adapted and merged (as explained in the prior post).

Start with the raw data, in which there is a blip observed the Monday after Spring Forward. This problem is one of reverse causation: we see a blip, now we want to explain it.

Spring Forward is put forward as a hypothetical "cause" of this blip. But, we should realize that there is an infinity of alternative causes.

Seasonality is clearly something that needs to be considered. Is it normal to see an increase in admissions from Sunday (weekend) to Monday? To establish how unusual that blip is, we need to manufacture a "control," because none exists in the data.

In the poster presentation, the researchers use a simple control: what happened the week before? (This is known as a pre-post analysis.) The red line shown on the chart would suggest that a jump on Monday is unusual. This chart is a reproduction of the two charts from the poster but superimposed.

One can complain that the pre-1-week control is too simplistic. What if the week before was anomalous? A natural way forward is to use more weeks of data in the control. In the published paper, the researchers abandon the pre-1-week control, and basically use several years of data to establish a trend.

But this effort is complicated by the substantial variability in the data over time:

(I can't explain why the counts here are so much lower than the counts given in the post-DST week line in the first chart. In the paper, they describe the range of daily counts as 14 to 53.)

So expanding the window of analysis is double-edged. On the one hand, we guard against the one week prior to Spring Forward being an anomaly; on the other hand, we include other weeks of the year that are potentially not representative of the period immediately prior to Spring Forward.

The researchers do not simply average the prior weeks--they actually produce a statistical adjustment on the raw data, and call that the "trend model prediction". This is a very appealing concept. What we really want to know (but can't) is the "counterfactual": the number of cases if there were no DST time change.

In the next chart (reproduced from their paper), the "trend" line is what the authors claim the counterfactual counts would have been. They then compare the red line to the blue line (actual counts) and make claims about excess cases.
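To make the idea of a counterfactual "trend" concrete, here is a toy version of the exercise (my own sketch, not the authors' actual model): fit day-of-week and trend effects to the weeks before the time change, then use that fit to predict what the DST week would have looked like without the change. All counts below are invented.

```python
# Toy counterfactual: fit day-of-week + trend effects on pre-DST weeks, then
# predict the DST week from that fit. The invented data stand in for admissions.
import numpy as np

rng = np.random.default_rng(3)
n_weeks = 10                               # pre-DST weeks used to build the "control"
days = np.tile(np.arange(7), n_weeks)      # day of week, 0 = Sunday
week = np.repeat(np.arange(n_weeks), 7)    # week index, to capture a seasonal trend
counts = rng.poisson(30 + 2 * (days == 1) + 0.3 * week)   # fake daily admissions

# design matrix: intercept, trend, and day-of-week dummies (Sunday as baseline)
X = np.column_stack([np.ones(days.size), week] +
                    [(days == d).astype(float) for d in range(1, 7)])
beta, *_ = np.linalg.lstsq(X, counts, rcond=None)

# predicted (counterfactual) counts for the week after the time change
X_new = np.column_stack([np.ones(7), np.full(7, n_weeks)] +
                        [(np.arange(7) == d).astype(float) for d in range(1, 7)])
print("counterfactual prediction by day:", np.round(X_new @ beta, 1))
```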

***

Of course, the devil is in the details. If you're going to make predictions about the counterfactual, the reader has to gain confidence in the assumptions you use to create those predictions.

One way to understand the statistical adjustment is to plot the raw data and the adjusted data side by side. Unfortunately we don't have the raw data. We do have the one week of pre-DST data from the poster. So I compare that to the "trend".

This chart raises two issues. First, the predicted counts in the paper are about 30% higher than the counts in the pre-DST week from the poster. Second, the pre-week distribution of counts by day matches the "trend" poorly.

While the pre-count is not expected to match the predicted "trend" perfectly, I'd expect that the post-counts should match since both the poster and the paper address what happens the week after the DST time change.

Strangely enough, the counts in the paper are 35% higher than those in the poster for the post-DST week! I'm not sure what to make of this: maybe they have expanded the definition of what counts as "hospital admissions for AMI requiring PCI".

The attempt to establish a control by predicting the counterfactual is a good idea. Given the subjectivity of such adjustments, researchers should be rigorous in explaining the effect of the adjustments. Stating the methodology or the equations involved is not sufficient. The easiest way to explain the adjustments is to visualize the unadjusted versus the adjusted data. The direction and magnitude of the adjustments should make sense.

***

Going back to the problem of reverse causation: seasonality, trend and DST are only three possible causes of the Monday blip. Analysts must make an effort to rule out all other plausible explanations, such as bad data (e.g. every time the time changes, some people forget to move their clocks).

As I am testing your patience again with the length of this post, I will put my remaining comments in a third post.

Andrew Gelman discusses a paper and blog post by Ian Ayres on the Freakonomics blog. Their main result is summarized as:

We find that a ten percentage-point increase in state-level female sports participation generates a five to six percentage-point rise in the rate of female secularism, a five percentage-point increase in the proportion of women who are mothers, and a six percentage-point rise in the proportion of mothers who, at the time that they are interviewed, are single mothers.

Andrew finds these claims implausible; so do I.

Ayres uses the econometrics methodology called instrumental variables regression to support these claims. Since the data is observational, and, as Andrew pointed out, there wasn't even a period of time in which one could find exposed and unexposed populations (since the Title IX regulation was federal), one must treat such regression results with a heavy dose of skepticism.

It is useful to understand that causal claims are possible here only if we accept all the assumptions of the instrumental variables method.

Besides, plausibility is aided by the ability to outline the causal pathways. It should be obvious that more females competing in college sports does not directly cause more females to become secular. The data on sports competition and on secularism come from different sources, and this presents a hairy problem. The analysis would have been more convincing if it had found that, among the women who participated in college sports, more became secular; what the analysis linked was a higher participation rate with higher secularism among all women in the state.

What is it about sports participation that would cause people to become secular? (The visual evidence from professional American sports would lead me to hypothesize the opposite--that sports participation may be associated with higher religiosity!) Is this specific to women? Did male secularism increase as sports participation by men went up?

As Andrew pointed out, the magnitude of the estimated effect seems too large to believe. I'd prefer to see these effects reported at more realistic increments. A jump of 10 percentage points in participation is very drastic. For example, according to the chart here (the one titled "a dramatic, 40-year rise"), the percentage of women participating in high school sports moved just 2 percentage points from 1995 to 2011.

***

Andrew is right that this is an instance of "story time". And we are not saying that statisticians should not tell stories. Story-telling is one of our responsibilities. What we want to see is a clear delineation of what is data-driven and what is theory (i.e., assumptions). The plausibility of a claim depends on the strength of the data, plus whether we believe the parts of the theory that are assumed.

In case you are not subscribed to my dataviz feed, I put up a post yesterday that is highly relevant to readers here interested in statistical topics. The post discusses a graphic from a New York Times article that interprets the official inflation rate (known as the CPI). I devoted an entire chapter of Numbersense (link) to the question of why the official inflation rate diverges from our everyday experience.

In a larger context, the inflation rate is an invented metric, designed to measure a quantity that has no objective reality. This is true of a lot of statistics. Revenues and profits are also invented concepts, for example, and only attain meaning through generally accepted accounting rules. Obesity, which is discussed in Chapter 2 of Numbersense (link), is another example of a quantity that has meaning only because of a measurement convention.

The article in NYT brings up one of the points I raised in the book, which is that price increases are magnified in our imagination while price decreases are taken for granted.

The other larger point of the chapter on inflation is that anyone wishing to comment on whether the CPI reflects real experience ought to understand how the CPI is constructed. A superficial understanding, such as that it is the average price of a basket of goods, is useless because so many little details affect the statistic. Because inflation has no objective basis, it is pointless to argue whether it reflects reality: all we are left with is discussing the rules, and you can't discuss the rules without knowing them well.

Details matter a lot in statistics. This is one of the reasons why I keep asking my Big Data colleagues to talk specifics. A statistician who only talks in generality is like the Manhattan realtor who can't tell you the size of the listed apartment.

I saw Joe N.'s tweet asking me about a study of how professors spend their time, reported by Lisa Wade at Sociological Images. This is an anthropological study, something that I am not at all familiar with, although the people in the field seem to believe that they can make statistically valid observations.

I'm glad the author of the study, John Ziker, wrote a (really) long article describing what he was trying to accomplish. The key point is that the study is a preliminary exploration, with important limitations; a follow-up study is planned which may give generalizable conclusions.

Here are some issues with the first study that make a statistician nervous:

- the sample was between 14 and 30 professors (tiny): Wade reported it to be 16. Ziker definitely started with 30.

- the selection was non-random, based on the first 30 people who responded to a school-wide announcement

- about half the initial respondents did not complete the study, and provided only partial data (one to six days)

- despite the tiny sample, some analysis required slicing the data further into four segments by grade level! I wonder how many department chairs were in that sample. (See chart on right)

- each professor is followed for a two-week period but only every other day, thus each professor at most contributed one observation per day of week

- the interviews were every other day "so the time taken for the interview did not appear on the previous day’s report." This is a horrible problem to deal with! Because time allocation is the subject of the study, the measurement method (in-depth interviewing) interferes with the measured outcome. It seems to me impossible to believe that the time spent answering questions every other day did not affect time allocation on the non-interview days.

- Ziker reasoned: "While we cannot make a claim that all faculty have the same work patterns as our initial subject pool — they do not comprise a random sample — the results are highly suggestive because of the consistency across our subjects who did represent." In order not to fall prey to the law of small numbers, a better way to say this is: we make the assumption that the small sample is representative in both mean and dispersion, which then leads to the assumption that all faculty have consistent work patterns similar to those observed.

- "With our initial 30 Homo academicus subjects, we ended up with a 166-day sample with each day of the week well represented." I am assuming that Ziker did not drop the 16 professors with partial data and made charts like the one on the right by ignoring the identity of the professor and aggregating over days of the week. Let's review what lies behind this chart. Each respondent contributed at most one observation per day of week; about half of the respondents did not even contribute data for all seven days. So the time allocation on any particular day is averaged over anywhere from 14 to 30 professors. These professors span a variety of ranks, departments, tenure, backgrounds, etc. and were not randomly selected. It's hard for me to trust this chart at all.

***

In general, I am a big fan of shoe-leather research, in which the researcher goes out and gathers the relevant data needed to address a specific research question, rather than picking up whatever data can be found and then tailoring the research question to avoid the imperfections in the data. So I don't want to sound too negative. It's a difficult research problem they are dealing with. What they learned from this first study is useful for informing future explorations, but drawing conclusions at this stage is premature.

At the end of his article, Ziker described the "experience sampling" method that will form the next phase of this study. I am very excited about this methodology.

Roughly speaking, they will ask participants to install a mobile app, which pops questions from time to time asking them what they are doing at that moment. Instead of exhaustively tracking a small number of participants over the course of time, they will get little bits of data, incomplete schedules, for a large number of professors. If the sample is big enough and randomized appropriately, they can analyze the data ignoring the professor identity, and report results for the "average professor". This method also retains the other benefit of the original design, which is that the respondents report their activities close to the time in which they occurred.
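A quick sketch of why this design works (my own illustration with invented numbers): each randomly timed prompt is a draw from the respondent's day, so the share of prompts that land on an activity estimates the share of time spent on it, without anyone keeping a complete diary.

```python
# Experience sampling in miniature: random prompts are draws from the day, so
# the fraction of prompts per activity estimates the fraction of time spent.
# The "true" time allocation below is invented for illustration.
import numpy as np

rng = np.random.default_rng(4)
activities = ["teaching", "research", "email", "meetings", "other"]
true_share = np.array([0.20, 0.30, 0.15, 0.15, 0.20])   # hypothetical truth

n_professors, prompts_each = 200, 10
responses = rng.choice(len(activities), size=n_professors * prompts_each, p=true_share)

estimated = np.bincount(responses, minlength=len(activities)) / responses.size
for name, est, truth in zip(activities, estimated, true_share):
    print(f"{name:9s} estimated {est:.2f}  (true {truth:.2f})")
```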

Data scientists pay attention! You don't have to collect complete data at the user level to do proper research. Designs like this "experience sampling" approach produce statistically valid findings without the need for complete data. In fact, trying to collect complete data is counterproductive, leading to shaky conclusions as shown above.

There is now some serious soul-searching in the mainstream media about their (previously) breathless coverage of the Big Data revolution. I am collecting some useful links here for those interested in learning more.

Here's my Harvard Business Review article in which I discussed the Science paper disclosing that Google Flu Trends, that key exhibit of the Big Data lobby, has systematically over-estimated flu activity for 100 out of the last 108 weeks. I also wrote about the OCCAM framework, which I find useful for thinking about the "Big Data" datasets we analyze today versus more traditional datasets from the past.

Slate was probably the earliest to react, and noticed a post on this blog that was the precursor to the HBR article.

Readers who are specifically interested in GFT should read the source materials themselves, which are quite accessible. Start with the Science paper. After that, you can read the original research article by the Google team, hosted at google.org (click on the PDF link in the blue box at the bottom of the page). There are some bold claims in this paper, as well as caveats. They seemed to be concerned about "false alerts" at the time, such as news events rather than illness driving certain searches. (For those statistically inclined, the underlying model involves only 1,152 data points--128 weekly aggregates in nine regions--but a search through 450 million simple logistic models, not only to define which search terms are important but also to determine how many search terms to include in the final regression.)
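The flavor of that search, in a heavily simplified sketch of my own (this is not Google's actual pipeline): score each candidate query series against the CDC series, rank the queries, and decide how many of the top queries to combine by checking the fit on held-out weeks. All series below are simulated.

```python
# Not Google's pipeline -- just the flavor: score candidate queries against the
# CDC series, rank them, and pick how many top queries to aggregate by held-out fit.
import numpy as np

rng = np.random.default_rng(5)
n_weeks, n_queries = 128, 500            # toy scale; the real search spanned millions
cdc = np.abs(np.sin(np.linspace(0, 8, n_weeks))) + rng.normal(0, 0.05, n_weeks)
queries = rng.normal(0, 1, (n_queries, n_weeks))
queries[:40] += cdc                       # a handful of queries genuinely track flu

train, test = slice(0, 100), slice(100, n_weeks)
scores = [np.corrcoef(q[train], cdc[train])[0, 1] for q in queries]
ranked = np.argsort(scores)[::-1]

best_k, best_err = None, np.inf
for k in [5, 10, 25, 50, 100]:            # candidate numbers of terms to include
    signal = queries[ranked[:k]].mean(axis=0)
    slope, intercept = np.polyfit(signal[train], cdc[train], 1)
    err = np.sqrt(np.mean((slope * signal[test] + intercept - cdc[test]) ** 2))
    if err < best_err:
        best_k, best_err = k, err
print(f"best number of query terms: {best_k}, held-out RMSE: {best_err:.3f}")
```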

Then read the article by Cook et al., which covers an update to the model made after the 2009 season, when GFT totally missed the pH1N1 swine flu epidemic. Notice that this Big Miss is the opposite error to the "false alert" problem. (See Chapter 4 of Numbers Rule Your World for a thorough discussion of different types of prediction errors, and how to think about them.) From the charts in the Cook article, you can see that in the runup to the Big Miss, GFT systematically under-estimated flu activity for as many weeks as you can count.

The overhaul was drastic. The search-term topics that accounted for 70% of the original model were reduced in importance to 6%, while two other topics that originally accounted for 8% were inflated to 69% in the updated model. This dramatically improved the "fit" statistic (RMSE) for the "first phase" of the Big Miss from 0.008 to 0.001.

Next, there is Butler's article for Nature (Feb 2013), which precedes the Science article but first pointed out the over-estimation problem for the 2012 flu season. One possibility is that the model update described above over-compensated for the Big Miss, making it more susceptible to the False Alert.

Other media coverage of Google Flu Trends includes the Guardian (which focuses on the need to understand causality), CRN, and the Economist (which talks mostly about Twitter data, which is much more problematic than Google search data).

***

Tim Harford has probably the most educational of the pieces revisiting Big Data for the Financial Times. (When I wrote this line, the FT link wasn't working. The title of the article is "Big Data: Are we making a big mistake?" if you need to find a different link.) His is the longest and covers a lot of ground, and has great examples, including one of my own. Highly recommended.

***

One of the slogans of the Big Data industry (of which I'm a part) is the push toward "evidence-based" decision-making in place of "gut feelings" or "instincts". Until now, I'm afraid, precious little "evidence" has been presented to support the assertions of universal, revolutionary goodness of Big Data (try searching for quantitative assessments of Big Data projects). I hope we are witnessing the birth of evidence-based decision-making inside the industry of dishing out evidence-based advice.

The reason I put out the OCCAM framework is to steer our community toward a more constructive approach to tackling "Big Data" problems. It requires a fundamental shift in how we define the problem. I have a moderately more technical take on some of the statistical challenges in an article published in Significance earlier this year. That article discusses six technical challenges where we need substantial progress.

Statisticians sometimes dismiss these as "old news," claiming that the same problems exist in smaller datasets and that the problems are well known. A recent example is Jon Schwabish's tweet saying that this discussion induces a "yawn". This reaction feels a bit like Fermat writing in the margin claiming he had a proof. The rest of the world doesn't want to wait 358 years to find out what goods these statisticians are hiding.

In my view, there has been some interesting work but nothing that settles the debates. If we had great solutions, we wouldn't be discussing these same problems today.

***

Back to flu prediction. It's really something that is well worth pursuing!

It's a nicely defined, self-contained problem that has social benefits and whose results can be easily measured. We should be grateful that the Googlers spent time working on it. It's a problem I'd love to work on if I had time and resources on my hands.

The researchers also pioneered this type of research using search-term data. This is highly significant, and the data represent a perfect example of what I call OCCAM data: the data is purely Observational (related to what Harford calls "found data" or what Dan Eckles calls "data exhaust"); it has no Controls; it is seemingly Complete; it wasn't collected for the purpose of predicting flu trends, that is, it is Adapted from other uses; and the search data was Merged with the CDC data (the matching of states and regions, and of weeks, was not exact, as you can tell from the original research article).

The several published versions of the predictive models are clearly failures, but anyone in this business knows that model building is an iterative process. One can learn from these mistakes. I happen to think they need to wipe the slate clean and use an entirely different approach. It's a small price to pay if there is a reward down the road.

I sincerely hope that this coverage will lead to improved modeling and analytical techniques rather than a retrenchment.

This article published by VentureBeat is too much. The title claims "The Internet is killing off marketing surveys & it's for the best". The article is tagged as "Big Data". Big delusion is what it is.

This is a great example of the kind of revisionist history that is practised in the name of Big Data. You'd also notice that there is no data or evidence presented to support any of its many far-reaching claims.

First comes the howler:

about eight years ago, people started raising concerns about respondent quality [of traditional marketing surveys]; and as social media took off, some dared to wonder aloud whether online ratings and reviews were eliminating the need for surveys altogether.

Eight years ago would be 2006-2008. What happened in 2006-2008 that made people question instruments that have been used for decades? What kinds of concerns? How do "social media" and online ratings and reviews solve those problems? We will not learn.

Apparently, those eight-year-old concerns were not sufficient to sink the sorry enterprise until "recently". We're told -- again with no data -- that "we’re witnessing the demise of the lengthy, grid-question littered, rating-scale driven survey as we know it."

Raise your hand if you have done an online survey for a market research company. Is the survey "lengthy"? Does it contain a "grid"? Does it ask you to use a "rating scale"? Well, I thought so. Just in the last week, I reviewed an online survey design submitted by an outside agency which contains thirty questions, replete with multiple grids and multiple rating scales.

***

Later on, the author attacks blinded designs. I'm not kidding. This is the charge: '[Consumers] ask themselves, “Why should I invest time and candor in responding to questions posed by some person or entity that won’t even reveal their identity, let alone respond?”'

Apparently, tweets and online reviews are the new way. Don't worry that the tweets are unrelated to your research question, or that most of those online reviews were purchased by your social media marketing agency and written by people who have never used your product.

And may we ask how they propose to measure "unaided brand awareness"? It is well known that the following two questions lead to vastly different responses:

(A) Name three services which can be used to create an online survey

(B) Have you heard of the following companies that provide online survey tools? SurveyMonkey, Zoomerang,...

***

Next comes the obligatory Big Data moment: "Big Data is also reducing the need for quantitatively rigorous, predictive surveys." What could that even mean? We now prefer quantitatively weak, unpredictive surveys?

The author explains that today we can just "harvest and analyze the masses of behavioral data already available." From where? Do log data or tweets inform us about attitudes, motivation, and psychology? As usual, we are asked to assume that Big Data solves some problems, and therefore Big Data solves those problems.

***

Now I have no doubt that "Big Data" will impact the market research field. This does not excuse poor arguments presented without evidence.

Making Big Data add value is not as simple as "harvesting". Tweets and reviews have all the characteristics of the OCCAM framework: they are observational in nature, lack controls, are adapted from their original purpose, are merged with other datasets, and, to the deluded, are complete (N=All).

Some time ago, there was a lot of hype about how new tech would demolish the superstar effect in entertainment sales because all the little titles in the long tail would be exposed to consumers. I recall Amazon being labeled the shining example of a company that made profits off the long tail (as opposed to the boring top of the distribution). I still remember this graphic from Wired (link):

A reader, Patrick S., pointed me to a study of music services that pronounces "the death of the long tail" (warning: they want your email address in order to read the full report; the gist of the report was written up in this other blog). Reading these pieces, one wonders whether this long-tail miracle even existed in the first place. The main thrust of the argument is that the new digital subscription/music services have not changed the allocation of spoils amongst artists. The little guys out in the long tail are still earning much less of a (shrinking) pie.

The long tail is an example of those intuitive, elegant scientific concepts that are much less impactful in the real world than claimed. Here is what I think caught some smart people on the wrong foot:

The distribution of profits has always been much more extreme than ballpark graphics (like the Wired chart above) show. The new study, for example, suggests that the top 1 percent earned 77 percent of all the money. This is much more extreme than the 80/20 rule. From a graphical perspective, you can think of the distribution as one very tall spike and a very flat, very long tail.

The cumulative weight of the very flat, very long tail is still not that heavy compared to the one spike. Even if you manage to increase the size of the tail by 10 percent, it still amounts to a small number.
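The arithmetic, using the figure quoted above, makes the point plain (the 10 percent lift is the hypothetical from the previous paragraph):

```python
# If the top 1% takes 77% of the money, the entire tail shares the remaining 23%.
head = 0.77
tail = 1 - head                    # 0.23
boosted_tail = tail * 1.10         # grow the whole tail's revenue by 10%
new_tail_share = boosted_tail / (head + boosted_tail)
print(f"tail share moves from {tail:.0%} to {new_tail_share:.1%}")
# A 10% lift spread across millions of titles moves the needle by about 2 points.
```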

The above assumes you can increase the size of the tail. But it is quite hard to do. One reason is that the tail consists of millions of little pieces, which don't necessarily move in sync.

The second, and more important reason, is that titles or artists don't randomly end up in the tail. If a title is in the tail, it's an indicator that the artist or title is not appealing to the mass audience.

We fall prey to the romantic notion that there are some unjustly neglected artists, and rejoice in the idea that the long-tail effect may allow a few of these to reverse their fortunes. But a few outliers do not change the overall distribution.

***

The report's authors also make this observation:

Ultimately it is the relatively niche group of engaged music aficionados that have most interest in discovering as diverse a range of music as possible. Most mainstream consumers want leading by the hand to the very top slither of music catalogue. This is why radio has held its own for so long and why curated and programmed music services are so important for engaging the masses with digital.

While I believe this story, I should note that there is no quantitative evidence provided (at least not in the summary). If this is true, it has important implications for anyone in the business of "personalizing" marketing to consumers.

At the start of the year, The Atlantic published a very nice, long article about Netflix's movie recommendation algorithm. You may remember that this algorithm (internally known as Cinematch) received a $1 million makeover several years ago (the Netflix Prize), only for the prize-winning entry to be deemed too complex--and to generate too little incremental value--to be put into production.

The reporter, Alexis Madrigal, noticed that Netflix has shifted attention from the queue of recommended movies to providing (micro-)genres of movies you might be interested in. His article is a great example of powerful data journalism: he reverse-engineered the internal structure of Netflix's new algorithm by extracting all of the keywords ("About Horses", "Critically Acclaimed", "Visually Striking", to name a few), and then creating all sensible combinations of these keywords (e.g. "Critically Acclaimed, Visually Striking Movies About Horses"), producing the roughly 80,000 possible microgenres used by Netflix. (It's clear that Netflix management endorsed this exercise and article but it's not clear how much proactive support they provided.)

One of my favorite columnists, Felix Salmon, reacted negatively to the change in algorithms, titling his post "Netflix's Dumbed-Down Algorithm". He interpreted the change as foreshadowing the day when Netflix no longer could offer any movie any user places in his/her queue because the third-party content providers have ratcheted the costs too high. It's a longstanding weakness in Netflix's streaming business model.

Felix lamented that the genre-driven recommendations would be far inferior to the original recommendations:

The original Netflix prediction algorithm — the one which guessed how much you’d like a movie based on your ratings of other movies — was an amazing piece of computer technology, precisely because it managed to find things you didn’t know that you’d love. More than once I would order a movie based on a high predicted rating...

The next generation of Netflix personalization, by contrast, ratchets the sophistication down a few dozen notches: at this point, it’s just saying “well, you watched one of these Period Pieces About Royalty Based on Real Life, here’s a bunch more”.

***

Felix is right on the business model but misses the mark on the analytics. As someone who builds predictive models, I had the opposite reaction when reading The Atlantic's piece. I thought Netflix's data engineers learned something from the Netflix Prize "fiasco".

The major change to the analytical approach is the shift from predicting whether you'd like a movie to predicting whether you'd watch a movie. This shift makes a lot of sense for Netflix as a business. It is sensible even from the user's perspective: it's not as if we never watch bad movies. (Even the movies we place in the queue ourselves could turn out to be bad.)

One big problem with the Netflix Prize was its singular focus on the RMSE metric, which roughly speaking measures the average error of the predicted ratings against actual ratings. The ratings data, though, is extremely skewed, making an average error criterion worse than misleading. By skew, I mean (a) a very small number of popular movies receives the majority of the ratings and (b) a small number of highly active users contribute the majority of movie ratings. Put differently, missing data is far and away the most important feature of the data.

Because of missing data, it is next to impossible to get good predictions for niche movies (with few ratings) or for users who do not actively feed signals into the algorithm. Improving RMSE by 10 percent does not mean every user's predictions improved by 10 percent. The improvement is likely concentrated in user-movie pairings for which there is sufficient data to work with. It would be enlightening if someone did an analysis of the performance of the winning algorithms by segments of users (based on the amount of prior data available).
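Here is a sketch of the segment analysis I have in mind, assuming a table of predictions with columns for the user's rating history, the actual rating, and the predicted rating (the column names and bucket boundaries are mine, purely for illustration):

```python
# Sketch: RMSE by user segment, where segments are defined by how much rating
# history each user has. Column names (n_prior_ratings, actual, predicted) are
# hypothetical placeholders for whatever the evaluation table actually contains.
import numpy as np
import pandas as pd

def rmse_by_segment(df: pd.DataFrame) -> pd.Series:
    """RMSE of predicted ratings, split by the amount of prior data per user."""
    bins = [0, 10, 50, 200, np.inf]
    labels = ["<10 ratings", "10-50", "50-200", "200+"]
    segment = pd.cut(df["n_prior_ratings"], bins=bins, labels=labels)
    sq_err = (df["actual"] - df["predicted"]) ** 2
    return sq_err.groupby(segment, observed=True).mean().pow(0.5)

# My guess: the overall RMSE gain is driven by the heavy raters, while light
# users (where missing data dominates) see little or no improvement.
```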

Now, consider predicting what you'd watch next based on the viewing behavior of you and other users. For every user and movie combination, the user either has or has not watched the movie. Just like that, the missing-data issue vanishes. The result of what Felix sees as "dumbing down" may in fact be a smartening up.

***

As I pointed out in Chapter 5 of Numbersense (in talking about Groupon's bid to personalize offers; link), every business faces a set of conflicting objectives when trying to "personalize" marketing to customers. I believe this Netflix shift shows they have found a good balanced solution.