Big Data needs Big Model

Gary Marcus and Ernest Davis wrote this useful news article on the promise and limitations of “big data.”

And let me add this related point:

Big data are typically not random samples, hence the need for “big model” to map from sample to population. Here’s an example (with Wei Wang, David Rothschild, and Sharad Goel):

Election forecasts have traditionally been based on representative polls, in which randomly sampled individuals are asked for whom they intend to vote. While representative polling has historically proven to be quite effective, it comes at considerable financial and time costs. Moreover, as response rates have declined over the past several decades, the statistical benefits of representative sampling have diminished. In this paper, we show that with proper statistical adjustment, non-representative polls can be used to generate accurate election forecasts, and often faster and at less expense than traditional survey methods. We demonstrate this approach by creating forecasts from a novel and highly non-representative survey dataset: a series of daily voter intention polls for the 2012 presidential election conducted on the Xbox gaming platform. After adjusting the Xbox responses via multilevel regression and poststratification, we obtain estimates in line with forecasts from leading poll analysts, which were based on aggregating hundreds of traditional polls conducted during the election cycle. We conclude by arguing that non-representative polling shows promise not only for election forecasting, but also for measuring public opinion on a broad range of social, economic and cultural issues.


You are arguing that non-representative polling can be a useful alternative because representative polling comes with “considerable financial and time costs”. I think your approach (together with Wei Wang, David Rothschild, and Sharad Goel) to making non-representative polling data useful for election prediction is highly interesting, but if we discuss cost-efficiency we should also consider that using big and complex models also costs a considerable amount of work time from people who need a fairly sophisticated background in statistics and sampling theory. I’d say that complex modelling of non-representative polling data might not really be a great time and money saver.

Good point. But there is a learning curve. It took a lot of effort and trial and error to come up with MRP and get it working in so many different examples. But once it works, it works, and it will just become easier and easier to use. We have an mrp package in R that’s still pretty clunky but it’s a first step. I don’t think it will be long before we have mrp software that is general and easy to use.

Also I should emphasize that statistical sampling and statistical analysis are complementary tools. Whatever sort of sampling is done, I think it makes sense to do MRP where possible to get better estimates. Conversely, conditional on doing MRP, it still makes sense to get as close to a random sample as possible to reduce nonresponse bias.
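To make the poststratification half of MRP concrete: once you have an estimate within each demographic cell (that’s where the multilevel regression earns its keep, by stabilizing sparse cells), the population estimate is just a weighted average of the cell estimates, with weights equal to each cell’s share of the target population. A minimal sketch in Python, with all numbers invented for illustration:

```python
# Poststratification as a weighted average of cell estimates.
# All numbers below are invented for illustration.

# Estimated candidate support within each age-by-sex cell, e.g. from
# a multilevel regression fit to a non-representative survey.
cell_estimates = {
    ("18-29", "M"): 0.70,   # young men overrepresented in the sample
    ("18-29", "F"): 0.60,
    ("30+",   "M"): 0.45,
    ("30+",   "F"): 0.40,
}

# Each cell's share of the target population, e.g. from Census data.
population_shares = {
    ("18-29", "M"): 0.10,
    ("18-29", "F"): 0.10,
    ("30+",   "M"): 0.38,
    ("30+",   "F"): 0.42,
}

def poststratify(estimates, shares):
    """Population estimate = share-weighted average of cell estimates."""
    return sum(estimates[cell] * shares[cell] for cell in estimates)

print(f"{poststratify(cell_estimates, population_shares):.3f}")  # 0.469
```

The adjustment matters exactly when the sample composition differs from the population composition: here the sample leans heavily on the high-support young-male cell, but that cell gets only 10% of the weight in the final answer.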

Finally, recall the original point of this post which is the discussion of Big Data. My key point is that we can often make the most of Big Data by using Big Model. Setting up Big Model can take effort but that effort can be worth it if we care about the answer.

“the only designs I know of that can be mass produced with relative success rely on random assignment. Rigorous observational studies are important and needed. But I do not know how to mass produce them.”

The generally agreed-upon ways to analyse data, the ones that can be safely delegated to any _qualified_ statistician, would seem to be limited to random assignment.

Of course real surveys have so much non-response bias that maybe they are already beyond the _generally agreed-upon ways to analyse_ …

I fear that complex and successful modeling of non-random samples might spread the idea that there is no need for careful sampling anymore. People will read, or more likely just hear, about successes in using non-random samples for prediction, but ignore and forget the complex modeling behind them; the result might be non-random samples analysed with standard assumptions and models. It might also work against recent efforts in psychology to take the problem of sampling more seriously. These are of course not technical arguments against Andrew’s or others’ work on MRP, but I fear the ever-widening gap between methodological developments and the actual statistical knowledge of most researchers in the social sciences is a serious problem. On the other hand, we won’t see increased use of better-suited models and methods if “we” don’t apply them ourselves and show how useful they can be. Borsboom (2006) made the same argument in Psychometrika, discussing the big gap between psychometrics and actual psychological research (“The attack of the psychometricians”: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2779444/).

Nonetheless, accessibility is a (very!) important issue, and easy-to-use mrp software would be a good thing to have!

The high non-response rates in “traditional” random sampling are also an important point, though. I often wonder whether it is at all reasonable to assume “true” random sampling when the response rate is at best 20% … In sociology we have somewhat the opposite problem compared to psychology: there is a naive belief in “true” random sampling methods, and studies claiming to have data based on “random sampling” are considered far more reliable than others, even if the response rate does not exceed ten percent and representativeness is thus highly questionable.

Rahul:

Big Data does not introduce a new problem; rather, it amplifies old ones. Of course non-representative samples do not necessarily need to be big, but Big Data are *typically* non-random samples, and the Flu example shows that ignoring sampling issues with Big Data can lead to biased predictions. Therefore it makes sense to discuss these issues in the context of Big Data.

I’m not convinced Big Data amplifies the sampling problem. I think often Big Data eliminates the sampling problem because it means not having to sample any more or at least a reduced population-to-sample-size ratio.

If you have all the data, that’s fine. But usually we’re interested in generalization. For example we have all the data for 2013 but we want to make predictions about people’s behavior in 2014. In addition, available data tend to be messy, and work needs to be done to move from the data we have to the questions we want to answer.

The Xbox data were not big in the sense of choking the computer. (We give the exact sample size in the linked paper.) My point was just that the Xbox analysis illustrates the kind of thing that can be done to match available data to the inferential or predictive questions of interest. If we were doing it with 10^12 data points rather than 10^5, we’d certainly have to put a lot more effort into computational efficiency, computing shortcuts, etc.

Unfortunately, in polling there are no random samples. As you point out, when nonresponse is 80%, the fact that the original sampling was done via random digit dialing doesn’t get you much, since all you have are the respondents.

This came up in a blog discussion a few months ago, when Russ Lyons was objecting to the use of standard errors from non-random samples, and I replied that in the real world, it’s extremely rare that we get random samples, and I don’t think Gallup Poll, the U.S. Census, and other survey organizations would or should be happy to stop giving out standard errors!

Yes, I think I’ve read that discussion, but my memory of it is a little hazy. I’m not sure what standard errors actually *mean* outside of statistical theory if we have only 20 or 10% response rates, though. Would you consider them a crude way to give at least some idea of the uncertainty?

Yes, of course, but usually we do not explicitly model (unit) non-response in the social sciences, except maybe in econometrics. So if we have a random sample but our response rates are quite low, my guess would be that our standard errors will underestimate the real uncertainty. This is probably a good reason to use post-stratification or selection models, especially in the case of low response rates. This probably resonates with your point that “not poststratifying is also model-based (even if the model in that case is implicit) and thus is risky too”.

I just wonder how bad the standard practice of ignoring actual response rates is, and how much we usually underestimate our uncertainty. There’s no general answer to that, though, as it will depend heavily on how sensitive our variables are to non-response …
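One rough way to quantify part of this is Kish’s design-effect approximation: if the nonresponse adjustment leaves you with unequal weights, the variance of a weighted mean inflates by roughly deff = 1 + cv(w)², so the effective sample size is n / deff. A toy sketch in Python, with invented weights:

```python
# Kish's design-effect approximation: unequal weights (e.g. from
# upweighting a hard-to-reach group) inflate the variance of a
# weighted mean by roughly deff = 1 + cv(w)^2, giving an
# "effective" sample size of n / deff. The weights are invented.
weights = [1.0] * 400 + [3.0] * 100   # 100 respondents upweighted 3x

n = len(weights)
mean_w = sum(weights) / n
var_w = sum((w - mean_w) ** 2 for w in weights) / n
deff = 1 + var_w / mean_w ** 2        # design effect
n_eff = n / deff                      # effective sample size

print(f"n = {n}, design effect = {deff:.3f}, effective n = {n_eff:.1f}")
```

Note that this only captures the variance inflation caused by the weighting itself; it says nothing about the bias that remains if nonrespondents differ in ways the weights don’t capture, which is the harder part of the question.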

e.g. I might have gigabytes of logs from my 10,000-server datacenter. That’s a pure Big Data problem. Or figuring out the tastes of my Netflix customers. The challenge here is the bigness per se, not the non-representativeness.

The connection is that big data typically come from available sources. High quality random sample surveys are expensive and tend to be relatively small, whereas it’s not hard to scrape a gigabyte of data from somewhere but that source might not be representative of the target of interest.

Maybe it depends on our sectors. For me the gigabyte data sources are especially representative of the target, because the ability to collect & analyse the Big Data means that I can do the job with a direct, full analysis, obviating the need for a sample.

e.g. Say I’m Walmart 30 years ago. I can only try to figure out whether people buy milk & bananas together or milk & bread (for, say, a shelf-stocking decision) by surveying a few stores etc. Big Data means being able to crunch numbers from *all* stores & producing an answer. No sampling needed. (Or each state’s stores, or whatever.)

Thus to me Big Data eases the sampling problem rather than exacerbating it.

First of all, you are assuming every number in your dataset is accurate – that is almost never the case, especially with new data warehouses in which they explicitly devalue data consistency. We had a warehouse meltdown and spent weeks trying to get our metrics back to where they were, and failed.
Secondly, even if you think you have all the data (that’s your opinion, not mine), you have all of one sample path and now you assume you know everything.
Finally, just think about this: we have all of the data about every publicly traded stock, forever, but how well are we able to predict stock prices?

I don’t understand your objections. No, I’m not saying BigData is magically accurate but that’s hardly a problem unique to BigData. Garbage In Garbage Out. You had a crappy data warehouse. Bad luck.

No, I don’t have “all” the data. But I have more data than a small sample, & if so, must I regret it? And where did I mention any omniscient predictive power?

All I’m saying is BigData (if I can analyse it) reduces (not eliminates) my worries about representative samples because I no longer have the complex, critical decision of how to draw a good, representative sample.

e.g. I can either (a) run an average-purchase-value query on my huge transaction database, or (b) figure out which 10 stores to manually sample & during what hours, & then try to statistically model the population average on the basis of this data. Now why would I not prefer (a) if I have the technology to execute it?

What’s so wrong about my position? You seem to be attacking a strawman.

I think we can all agree that (1) for a given level of data quality, more data is better than less data, and (2) the usual purpose of statistical analysis is to take data from source A and use it to learn about question B, where there is some relation between A and B (for example, A is last year’s data and B is the forecast for tomorrow, or A is data on a sample and B is inference for a population, or A is data on some people who have had medical treatment and B is the effect of some drug on future patients).

What Kaiser and I are saying is that big data sets are typically not simply larger versions of small data sets. In particular, many small data sets that we work with have been carefully constructed (for example, via random sampling and somewhat expensive measurement), whereas huge data sets often have issues with quality and representativeness (and representativeness is almost always an issue, even if you have what appear to be complete data, they typically won’t be complete for the questions you’re asking).

I completely agree that small data can be crap; indeed that’s been an ongoing theme on this blog lately, regarding the “Psychological Science”-type papers that attempt to draw sweeping conclusions from questionnaires given to 100 people on Mechanical Turk.

And, again, the point of my post is that Big Model can be a useful way of getting us from point A to point B with Big Data.

I am really not replying to myself but to @Rahul @4:46 am. I tried two different browsers, neither of which gives me an option to reply directly to his comment, so I have to place it here.

On your specific applications
(a) you are assuming that the average purchase value is a time-invariant constant. If you try doing this, you will find that your answer is wrong almost every day. Products come in and out of favor. Brands run campaigns to boost sales. Unpredictable events like an overturned truck can cause a temporary shelf shortage. I can come up with 100 reasons why your prediction is still wrong. Of course, you should attach a prediction interval to your average purchase value. This is where Andrew’s comment comes in: current methodologies for computing interval estimates (and pretty much everything else) rely on iid random samples. Big Data is not an iid random sample. Therefore, the interval estimates are way, way too tiny.
(b) you don’t need Big Data to figure out which 10 stores to sample
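The “intervals way too tiny” point can be made with simple arithmetic: if the chance of ending up in the dataset depends on the outcome itself, the observed pool is biased by a fixed amount, while the naive standard error shrinks toward zero as n grows. A hypothetical sketch in Python, with all rates invented:

```python
import math

# Invented numbers: true support is 50%, but supporters end up in
# the data at a 25% rate while opponents do so at 15%. The observed
# pool is then biased no matter how many records you collect.
p_true = 0.50
rate_support, rate_oppose = 0.25, 0.15

# Share of supporters among those who appear in the data:
p_observed = (p_true * rate_support) / (
    p_true * rate_support + (1 - p_true) * rate_oppose)

for n in [1_000, 100_000, 10_000_000]:
    naive_se = math.sqrt(p_observed * (1 - p_observed) / n)
    print(f"n = {n:>10,}   naive SE = {naive_se:.5f}   "
          f"bias = {p_observed - p_true:+.3f}")
```

The naive interval collapses to a hair’s width at ten million records while the estimate sits a fixed 12.5 points from the truth: a very precise answer to the wrong question.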

The only way that your view of the world (which is the prevailing view among some Big Data proponents) will prove out is if you add a pile of assumptions e.g. no changes in the external environment, the causal mechanisms are non varying, the causal mechanism is accurate, there is no garbage data, etc. etc.

The fact that there is garbage data everywhere is true but misleading. A lot of random samples are carefully controlled and designed; Big Data almost always is not. (I call this OCCAM data.) Both have bad data, but Big Data has a lot more of it. In fact, going from small samples to Big Data typically means adding a small amount of signal and an overwhelming amount of noise. That is one of several fundamental challenges of Big Data.

Finally, none of this purchase data captures intent or psychology or anything that can be used in a model that has resemblance to the underlying causal processes.

That’s a bait & switch. There’s (a) the historical data-acquisition problem & then there’s (b) the prediction problem. Just because I’m using Big Data for (a) doesn’t mean I have to be a moron & assume no prediction logic (b) is needed. No one’s saying Big Data has supplanted the prediction model. That’s again attacking a straw man.

Who said anything about purchase value being a time-invariant constant? Do you assume my database is so coarse-grained that I must be blind to any temporal trends? Why? Whatever sophisticated prediction logic you want to apply, why does layering it on top of Big Data somehow make it worse than attaching it to your old-school, small-data representative sample?

Of course, “none of this purchase data captures intent or psychology”. But neither would it if I sent survey monkeys with clipboards to a few wisely chosen physical stores. That’s a *different* problem.

I think those other definitions just add fuzziness. I can totally understand people arguing about how big is big exactly. But size has to be a necessary feature of BigData.

Obviously a vendor like IBM would love to broaden the term as much as they can. Most salesmen embrace fuzziness.

I turn to Wikipedia for the description I like:

Big data is a blanket term for any collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. Big data sizes are a constantly moving target, as of 2012 ranging from a few dozen terabytes to many petabytes of data in a single data set.

Stretching the “BigData” term to include fundamentally non-size related terms is not a useful extension, I feel. The origin & intent of the word is based on hitting some technology limit: be it storage, indexing, calculation, data-transfer, execution times, scaling etc. I think it is useful to maintain that distinction.

What is dangerous is the assumption of having all the data (which even in your world, is a rare case not an everywhere case). What is even more dangerous is to then claim the irrelevance of thinking about sampling errors, data cleanliness, causality, etc. because of the assumption that you have all the data.

Thank you for your answer. I’m not really arguing against using Big Models for Big Data; I just wanted to raise my concerns about the practicability of your approach. I’ve been somewhat critical of post-stratification so far, though this might be an artifact of my thoroughly frequentist upbringing and the widespread and unquestioned belief in random sampling in sociology. I’m no expert at all in this topic, so my questions might be very basic and naive.

Post-stratification relies on the assumption that correcting for key demographics improves the representativeness of the data for the question at hand, if I’ve understood it correctly. Therefore post-stratification only helps if the key demographics actually influence the relationship we are analyzing, right? If there are other important variables affecting both response and the variables we’re interested in, post-stratification can’t help us unless they happen to be among the “typical” key response indicators? I fear post-stratification may make us more confident in our inference than we should be. In what way is it advantageous compared to including the demographics directly as predictors in our models? Can’t post-stratification even increase the bias in our estimators?
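The worry in the question above can be made concrete with a toy example: poststratifying on age fixes the age composition, but it cannot fix selection on a variable the adjustment cells don’t capture. A Python illustration with invented numbers, where respondents within each age group differ from nonrespondents on an unmeasured trait (say, political engagement):

```python
# Invented numbers illustrating the limit of poststratification:
# adjusting on age fixes the age mix, but if response also depends
# on an unmeasured trait that is linked to the outcome, the
# age-adjusted estimate stays biased.

true_support = {"young": 0.55, "old": 0.45}   # in the full population
resp_support = {"young": 0.65, "old": 0.55}   # among respondents only
pop_share    = {"young": 0.30, "old": 0.70}   # known age distribution

truth    = sum(true_support[g] * pop_share[g] for g in pop_share)
adjusted = sum(resp_support[g] * pop_share[g] for g in pop_share)

print(f"truth = {truth:.2f}, age-poststratified estimate = {adjusted:.2f}")
# The 0.10 gap survives the adjustment, because the cells used for
# poststratification do not capture what actually drives response.
```

So poststratification removes exactly the bias that is explained by the adjustment variables, and no more; the choice of cells carries the same burden that the choice of predictors carries in a regression.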

The international PISA school achievement tests follow a (mostly unspoken) philosophy: We’ll fix it in post-production. They do a lot of manipulation of the data after they give the test to adjust for problems like misleading translations. It seems to work reasonably well.

Really, people have been doing Big Data for a long time, although the name has changed multiple times.
I first saw it at Bell Labs in the 1970s, especially Murray Hill Bldg 5.
In the 1990s, anyone doing it was using Teradata, IBM, or scalable supercomputers like SGI’s, but these were nowhere near as cheap as the current architectures.

Between current data center design, networks, cloud, Hadoop, etc., the market has greatly expanded, with the usual results:
Lots of people entering without the long experience of folks who had been doing it for a long time. Hence there will be lots of hype, noise, silliness and failures … while there will be serious successes.
All this is standard behavior in technology waves.

Another given is that when cheap commodity technology usurps the role of expensive, specialized boxes, the experts of bygone years pooh-pooh the utility of giving the unwashed masses a tool too powerful for their own good. :)

Right, we had all supermarket purchases for 10,000 identifiable households in 1984 and all purchases from 2,700 supermarkets by 1987. That was pretty Big Data. It made a difference in the consumer packaged goods industry, but it didn’t much Change the World.

That’s precisely the point. Diminishing returns of increasing sample size. You can have all the data you want but what is the incremental benefit that is obtained from the incremental data. I have not seen any serious measurement of the value.

@John Mashey. I think it’s more than just “…lots of hype, noise, silliness and failures.” It may well be that spurious correlations are not merely expected but mathematically guaranteed as dimensionality increases. I’ve posted some thoughts about how the Hales–Jewett theorem may apply to big data. I haven’t seen anyone else make this connection, which probably means there is something flawed with the conceptualization, but it bears thinking about.

I agree. If data can be considered nodes in a hypercube, subject to combinatorial “painting”, then beyond a certain threshold truly random data is impossible. Today’s algorithms are so powerful that they will find these structures, even though they are artifacts of combinatorics and not meaningful in and of themselves. Techniques like regularization, which also serve to reduce dimensionality, should help prevent triggering Hales–Jewett.
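The flavor of this (a probabilistic analogue, not the Hales–Jewett combinatorics itself) shows up in a quick simulation: the largest correlation between an outcome and a big pile of pure-noise predictors reliably lands well away from zero, even though every predictor is noise by construction. A sketch in Python:

```python
import math
import random

def corr(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

random.seed(1)
n_obs, n_predictors = 50, 1000

# Outcome and predictors are all independent Gaussian noise.
y = [random.gauss(0, 1) for _ in range(n_obs)]
best = max(
    abs(corr([random.gauss(0, 1) for _ in range(n_obs)], y))
    for _ in range(n_predictors)
)
print(f"largest |correlation| found among pure noise: {best:.2f}")
```

With one outcome and a thousand noise predictors you reliably “find” a correlation an analyst might be tempted to report; screening many outcomes against many predictors, as big-data pipelines do, only makes it worse.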

Avraham:
“Hence there will be lots of hype, noise, silliness and failures … While there be serious successes.”

Let me try again:
1) There’s lots of hype.
2) Others push back against the hype.
3) But there will be many real successes, and they should not get lost in the shuffle.

There are numerous examples in the Computer History Museum alone where a noticeable capability/cost improvement led to a big expansion, in which many well-hoopla’d efforts failed but which over time generated huge successes. The expansion is generally good, although sometimes ignoring the past leads to flaws that cause trouble later.

Perhaps I am just reading too much into “Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it…”

But there is (always has been) teenage sex; not everyone talks about it when they have been _successful_, and some really do know how to do it but are quiet about that.

John, I do not disagree that big data techniques are useful. I think they can provide a lot of insight we could not obtain until now. I just think that people need to be aware of the inherent limitations of big data techniques. One of the key ones in my opinion, is that unlike small/medium data, a good big data algorithm is pretty much guaranteed to find something, even though that something is solely an artifact of the inability of highly dimensional data to be truly random. Big data statisticians need to approach their findings with somewhat more skepticism now.

Avraham: Yes, we agree strongly and this has been true for as long as I’ve had any intersection with this: neither 100% panacea nor 100% hype. Again, the main difference today is the large expansion beyond projects that normally used $Ms of investment and thus more justification.

As in some other cases I used to run into, *companies* that create really successful applications, especially the marketing-related subset, are naturally reluctant to say much about what they did. We often had this problem in the 1990s at SGI for that class of applications.

In my experience (natural language, speech, search, primarily), data is onion-like, and the bigger it is, the less directly useful it tends to be and hence the more clever you need to be to extract information out of it.

2. There’s a reasonable (millions of words) amount of labeled data (with things like nouns and verbs and names indicated), usually not in the domain you care about. For instance, there’s grammatical function tagging for newswire data focused on finance from the 1980s, but nothing focused on sports from more recently.

3. Then there’s usually a pile of unlabeled data in the domain you care about, but not always, because sometimes you’re doing things like launching a spoken dialog system in a new domain or writing a search engine for school homework or user shopping lists and you have no relevant data.

4. Then there’s labeled data in the domain you care about. This is usually customer or research-project defined through painstaking coding.

I believe there’s a similar distinction in web log analytics and customer databases.

A. Very little data per user. For instance, Netflix might have between a handful and a few hundred (or maybe even a few thousand) ratings from me and Amazon has a relatively small search record, Google has the searches tied to my IP address and when I forget to log out, and so on.

B. Tons of background data for gazillions of users. The internet giants (spy agencies) have millions or even billions of users (surveillance targets).

The challenge in all these areas that I’ve seen is to somehow combine the specific and general data sets. It usually requires some clever modeling, because it’s not just a bigger sample from what you care about, which is modeling a particular user.