Big Data: A Revolution That Will Transform How We Live, Work, and Think

Big Data is a new book from Viktor Mayer-Schonberger, a respected Internet governance theorist, and Kenneth Cukier, a long-time technology journalist who's been at the Economist for many years. As the title and pedigree imply, this is a business-oriented book about "Big Data," a computational approach to business, regulation, science and entertainment that uses data-mining applied to massive, Internet-connected data-sets to learn things that previous generations weren't able to see because their data was too thin and diffuse.

Big Data is an eminently practical and sensible book, but it's also an exciting and excitable text, one that conveys enormous enthusiasm for the field and its fruits. The authors use well-chosen examples to show how everything from shipping logistics to video-game design to healthcare stands to benefit from studying the whole data-set, rather than random samples. They even pose this as a simple way of thinking of big data versus "small data." Small data relies on statistical sampling, and emphasises the reliability and accuracy of each measurement. With big data, you sample the entire pool of activities -- all the books sold, all the operations performed -- and worry less about inaccuracies and anomalies in individual measurements, because these are drowned out by the huge numbers of observations performed.
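A minimal sketch of that trade-off, using made-up numbers: a small random sample measured carefully versus the whole pool measured sloppily. The population, the noise levels, and the sample size are all assumptions chosen for illustration, not figures from the book.

```python
import random
import statistics

random.seed(42)

# Hypothetical population: 200,000 "true" values (say, order sizes).
population = [random.gauss(50.0, 10.0) for _ in range(200_000)]
true_mean = statistics.fmean(population)


def measure(value, noise_sd):
    """Return a measurement of `value` with zero-mean Gaussian error."""
    return value + random.gauss(0.0, noise_sd)


# "Small data": 100 randomly sampled items, each measured carefully.
small_sample = random.sample(population, 100)
small_estimate = statistics.fmean(measure(v, 0.1) for v in small_sample)

# "Big data": every item in the pool, each measured sloppily.
big_estimate = statistics.fmean(measure(v, 20.0) for v in population)

small_error = abs(small_estimate - true_mean)
big_error = abs(big_estimate - true_mean)

print(f"small-sample error: {small_error:.3f}")
print(f"whole-pool error:   {big_error:.3f}")
```

The sloppy per-item errors wash out here only because the sketch assumes they are random and centred on zero. A systematic error in how the data was collected would not average away, however many observations were added.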

As you'd expect, Big Data is particularly fascinating when it explores the business implications of all this: the changing leverage between firms that own data versus the firms that know how to make sense of it, and why sometimes data is best processed by unaffiliated third parties who can examine data from rival firms and find out things from which all parties stand to benefit, but which none of them could have discovered on their own. They also cover some of the bigger Big Data business blunders through history -- companies whose culture blinkered them to the opportunities in their data, which were exploited by clever rivals.

The last fifth of the book is dedicated to issues of governance, regulation, and public policy. This is some of the most interesting material in the book and probably needs to be expanded into its own volume. As it is, there's a real sense that the authors are just scratching the surface. For example, many of the stories told in the book have deep privacy implications, and the authors make a point of touching on these, cabining them with phrases like "so long as the data is anonymized" or "adhering to privacy policy, of course." But in the final third, the authors examine the transcendental difficulty of real-world anonymization, and the titanic business blunders committed by firms that believed they'd stripped out the personal information from the data, only to have the data "de-anonymized" and their customers' privacy invaded in small and large ways. These two facts -- that many of the opportunities require effective anonymization and that no one knows how to do anonymization -- are a pretty big stumbling block in the world of Big Data, but the authors don't explicitly acknowledge the conundrum.

While Big Data is an excellent primer on the opportunities of the field, it's thin on the risks, overall. For example, Big Data is rightly fascinated with stories about how we can look at data sets and find predictors of consequential things: for example, when Google mined its query-history and compared it with CDC data on flu outbreaks, it found that it could predict flu outbreaks ahead of the CDC, which is amazingly useful. However, all those search-strings were entered by people who didn't expect to have them mined for subsequent action. If searching for "scratchy throat" and "runny nose" gets your neighborhood quarantined (or gets it extra healthcare dollars), you might get all your friends to search on those terms over and over -- or not at all. Google knows this -- or it should -- because when it started measuring the number of links between sites to define the latent authority of different parts of the Internet, it got great results, but immediately triggered a whole scummy ecosystem of linkfarms and other SEO tricks that create links whose purpose is to produce more of the indicators Google is searching for.

Another important subject is looking at algorithmic prediction in domains where the outcome is punishment, instead of reward. British Airways may get great results from using an algorithm to pick out passengers for upgrades, trying to find potential frequent fliers. But we should be very cautious about applying the same algorithm to building the TSA's No-Fly list. If BA's algorithm fails 20% of the time, it just means that a few lucky people get to ride at the front of the plane. If the TSA has a 20% failure rate, it means that one in five "potential terrorists" is an innocent whose fundamental travel rights have been compromised by a secretive and unaccountable algorithm.
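The asymmetry can be put in rough numbers. This is a back-of-the-envelope sketch: it reads "failure rate" as the share of selected people who were picked wrongly, and the list sizes are invented purely to make the arithmetic concrete.

```python
def wrongly_selected(selected: int, failure_rate: float) -> int:
    """Number of people the algorithm picked who should not have been picked."""
    return round(selected * failure_rate)


FAILURE_RATE = 0.20

# Hypothetical sizes (assumptions, not real figures).
upgrades = 1_000        # passengers an airline upgrades
no_fly_entries = 1_000  # people placed on a no-fly list

lucky = wrongly_selected(upgrades, FAILURE_RATE)
harmed = wrongly_selected(no_fly_entries, FAILURE_RATE)

print(f"{lucky} passengers get an upgrade they 'shouldn't' have had")
print(f"{harmed} innocent people lose the ability to fly")
```

The error count is identical in both cases. What differs is the cost of each error, which the failure rate alone says nothing about.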

Secrecy and accountability are the third important area for examination in a Big Data world. Cukier and Mayer-Schonberger propose a kind of inspector-general for algorithms who'll make sure they're not corrupted to punish the undeserving or line someone's pockets unjustly. But they also talk about the fact that these algorithms are likely to be illegible -- the product of a continuously evolving machine-learning system -- and that no one will be able to tell you why a certain person was denied credit, refused insurance, kept out of a university, or blackballed for a choice job. And when you get into a world where you can't distinguish between an algorithm that gets it wrong because the math is unreliable (a "fair" wrong outcome) and an algorithm that gets it wrong because its creators set out to punish the innocent or enrich the undeserving, then we can't and won't have justice. We know that computers make mistakes, but when we combine the understandable enthusiasm for Big Data's remarkable, counterintuitive recommendations with the mysterious and oracular nature of the algorithms that produce those conclusions, then we're taking on a huge risk when we put these algorithms in charge of anything that matters.

4 Responses to “Big Data: A Revolution That Will Transform How We Live, Work, and Think”

Cory, this is a very interesting comment on the book. I don’t read it as a review, as it tells me less about what the book is about than what it’s not about.

It seems it’s too long for the majority of bb’s surfers. But whether or not there’s a wave of comments to ride on, I wanted to add some short musings.

Disclosure: I am working on moderately sized datasets, which have given me terrible headaches during the last couple of years.

Now, if you just search “big data” on any news aggregator (take Google News, which gives you your “local” share of the stuff), something seems odd.

1st, BigData is sometimes discussed by applied statisticians and researchers as the new “DataMining”. It’s a buzzword for working with massive amounts of data, and coming up with some applied methods to detect and predict patterns. The argument goes that data mining never really kept its promise to the stats community: creating some new, exciting opportunities – and jobs both for people especially qualified in data analysis, and new kinds of jobs as a result of these analyses.
As far as I can see, speedy reaction to some (not so complicated) models makes money (HFT, anyone?). But maybe not much else. Or am I completely missing some important things here? The “other customers who bought ‘The Settlers of Catan’ also bought titles by Wil Wheaton, Robert A. Heinlein and Cory Doctorow” does NOT count. This is making money, not creating anything interesting, and especially not many jobs. Now, why the buzz? What’s the news? Cui bono? Who profits, really? And don’t come up with Facebook, last time I looked the ones getting rich weren’t the “shareholders” (I savour this term in this context, or should I say “Like”?)

2nd, I am (and some others are, too) perceiving a dichotomy between the hype and the outcome of the analyses. Thesis: BigData is suffocating science instead of producing new insights. Antithesis: BigData is bringing our understanding of processes and patterns to new heights. Now, do figure: how many really insightful studies are coming out of data pools right now?

The argument goes something like this: scientists used to collect data to test hypotheses (no matter if you were a Bayesian or a frequentist). Nowadays, some of us are fishing in datapools for a meaning, but things like collinearity and the addition of noise by adding lots of variables make it hard to come up with a good explanation of something. We are already at our limits to understand the data we’ve got, but e.g. next-gen sequencing produces data at a speed we probably can’t keep up with. The rate of error in the integrative approaches is expected to be really, really high.
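The "fishing" problem can be illustrated with a small simulation. This is only a sketch under stated assumptions: every variable below is pure random noise, and the number of observations and candidate variables are arbitrary choices made for the example.

```python
import random
import statistics

random.seed(0)

N_OBSERVATIONS = 30    # assumption: few observations per variable
N_VARIABLES = 2_000    # assumption: many candidate explanatory variables


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# The "outcome" and every candidate variable are independent noise,
# so any correlation found between them is spurious by construction.
outcome = [random.gauss(0, 1) for _ in range(N_OBSERVATIONS)]
candidates = [
    [random.gauss(0, 1) for _ in range(N_OBSERVATIONS)]
    for _ in range(N_VARIABLES)
]

strongest = max(abs(pearson(c, outcome)) for c in candidates)
print(f"strongest |correlation| among {N_VARIABLES} noise variables: {strongest:.2f}")
```

With enough candidate variables, at least one of them will line up with the outcome by chance alone, and nothing in the data itself marks that pattern as meaningless.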

Compare the TSA example: if you just keep integrating data into your database, your prediction of whether someone is a terrorist is not gaining accuracy. It’s just gaining precision. And if the interpretation of this precision is flawed – go figure. The same is true for gene functions, metabolic pathways, individual traits, patterns of distribution of individuals, (meta)communities, populations, species etc. (pro parte). Just because we can detect patterns doesn’t make these patterns interpretable, or even predictable. You can insert your favourite analogy here. One that I found funny is the positions of celestial bodies: you’ve got astronomy there. And you’ve got other interpretations of heavenly stuff, also using the zodiac, and coming to interesting conclusions.
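The difference between precision and accuracy can be shown with a toy simulation. It is a sketch only: the true value, the systematic bias, and the noise level are invented, and the bias stands in for any flaw in how the data is collected or interpreted.

```python
import random
import statistics

random.seed(1)

TRUE_VALUE = 10.0
BIAS = 2.0       # hypothetical systematic error in every reading
NOISE_SD = 5.0   # hypothetical random error per reading


def estimate(n):
    """Return (mean, standard error) of n biased, noisy readings."""
    readings = [TRUE_VALUE + BIAS + random.gauss(0, NOISE_SD) for _ in range(n)]
    mean = statistics.fmean(readings)
    std_error = statistics.stdev(readings) / n ** 0.5
    return mean, std_error


results = {n: estimate(n) for n in (100, 10_000, 200_000)}

for n, (mean, std_error) in results.items():
    print(f"n={n:>7}: estimate={mean:.2f} +/- {std_error:.3f} (truth: {TRUE_VALUE})")
```

As the number of readings grows, the uncertainty band shrinks (more precision) while the estimate stays roughly BIAS away from the truth (no gain in accuracy).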

And this doesn’t even TOUCH the question of how the data was measured (specifications, and with which error margins…), and the question of who controls access to the raw data, and who has the right to use it in which way, and how it is stored and conserved for future use.

Most importantly to science, it doesn’t touch the question of how to keep modern research reproducible.

This was also quite a sermon. I’m not expecting anybody to answer, but this really is an important issue to me. I could have posted this somewhere else in a blog, but at least here it will be read by some tech-savvy, nerdy people, and not just go unread by the usual innocent passing surfer.

Oh. Just BTW: I’m planning to have a cursory look at the book as soon as my library of choice gets it. You made me curious, but not curious enough to buy it. ;)

It was interesting to read this on the same day as a fascinating post by Will Davies on the trend towards developing public policy, which includes:

“The very character of Big Data is that it is collected with no particular purpose or theory in mind; it arises as a side-effect of other transactions and activities. It is, supposedly, ‘theory neutral’, like RCTs, indeed the techno-utopians have already argued that it can ultimately replace scientific theory altogether. Hence the facts that arise from big data are somehow self-validating; they simply emerge, thanks to mathematical analysis of patterns, meaning that there is no limit to the number or type of facts that can emerge. There’s almost no need for human cognition at all!

“The problem with all of this, politically, is that causal theories and methodologies aren’t simply clumsy ways of interfering with empirical data, but also provide (wittingly or otherwise) standards of valuation.”

Thanks, marek! That certainly was interesting, adding a very different perspective. Top-down, from the policies possibly inferred from data analysis. This is very thought-provoking indeed. I’m coming from the opposite direction, from the side of data analysis. My opinion is that analysts might ask the wrong questions about what causes the patterns, and therefore get the whole explanation wrong. They might deduce based on their prior assumptions fed with a lot of (noisy?) data overfitting their models, and/or induce because of spurious correlations and collinearities.

Sidenote, there’s one comment on the page about what I usually call the “perfect map analogy” – and it’s getting the key message of Davies’s text wrong. These analyses and models do work, and you can infer or induce generalised hypotheses from the data. Davies’s question is not whether it’s workable; the question is whether it’s sensible. I admit he writes something in the direction of this:

Without the extreme simplifications of rationalist theories, society would appear too complex to be governed at all. The empiricist response to the government’s paper title, ‘What Works’, might end up being ‘very little’, unless government becomes frighteningly ‘smart’.

I think ‘smart’ is misleading here. Government/companies/scientists already are ‘smart’ in the sense that they are collecting a lot of data, and we have started to integrate this into BigData. Now, the answer is no longer ‘very little’, but ‘with a p < 0.001, this works’. Without being reproducible (or, maybe in the case of politicians: understandable), and probably still without strong (real-world) predictive value in a very complex system. It might be, I muse, that the perfect map analogy is more in my line of argument. Add so much complexity, and you find patterns everywhere, rendering predictions useless. But I have to think about this some more, I admit.