As the Obama Administration’s new “Big Data Research and Development Initiative” has made clear, the “big data” era is officially upon us. The term “big data” has been used in multiple ways, but most generally it refers to the avalanche of “raw data” generated by the internet and other new kinds of data-capturing sensor and digital technologies. Or, as one big data guru more pithily put it, it is “all the stuff we do online” – and more. With the “big data revolution” comes unflagging optimism regarding more comprehensive methods for collecting vast new stores of technologically produced data, enabling the pursuit of previously unanswerable questions and carrying the promise of breakthroughs in how we access and understand the information composing our world. Time will tell.

The turn to “big data” represents a potentially exciting set of developments along multiple frontiers of advanced supercomputing, new software tools, other information collection technologies such as GIS, database management systems, and massive data sets, such as the exponentially expanding corpus of information generated by Web 2.0 social media. Government funding has followed a corporate lead: in recent years the likes of Google, Facebook, Apple, and Amazon have turned the pursuit of “big data” into a major business proposition, gathering increasingly nuanced information about consumer behavior to better serve and target customers. Making sense of the implications of all this will preoccupy us for some time.

Techno-Optimism

As the press release from the White House Office of Science and Technology Policy explains, “big data” projects hold great promise for “scientific discovery, environmental and biomedical research, education, and national security.” The very early returns on “big data”-derived research are already turning heads, from predicting political upheavals like the Arab Spring, market volatility, or new epidemic outbreaks, to mapping emerging cultural trends or the evolution of languages.

And the attraction of “big data” hits a number of sweet spots. Most generally, “big data” is now carrying the torch for the whiz-bang potential of the next Silicon Valley-derived infotech revolution for enhancing “innovation” – whatever that might specifically mean. For universities, it is a readily available advert for a more technologically enabled higher education, which also happily relieves budgetary pressures to expand the physical holdings of campus libraries and other facilities.

“Big data” also has mass appeal: leveraging big medical data promises to help fix our broken healthcare system by making it less expensive; it has been presented as the newest super tool to combat global poverty; it also helps to power the imagination of urban planners hoping to incentivize new creative economies; for the security community, it beckons by offering “crystal ball”-like certainties of greater information dominance and more precise prediction; and in the spirit of C. P. Snow, it confers legitimacy on the so-called digital humanities in a cost-conscious era, as an apparent collaborative bridge for the “hard” scientists to bring more rigor to their colleagues in the humanities and “soft” (or social) sciences. Among other frontiers.

The “big data” train has left the station, with all the concomitant hyperbole and hoopla that so often accompanies promising new developments in science heralding paradigm shifts in research. However, from my perspective, what is missing from the enthusiastic rush to adoption is a critically grounded accountability regarding what big data advocates claim as opposed to what they actually do: attention not only to the benefits but also to the costs, to the potential but also the limits. Unrelenting techno-futurist optimism does not nurture this.

Trained as a sociocultural anthropologist, I have been most interested in how “big data” has intersected with efforts to better leverage sociocultural information to different ends. Most notably, this includes the Google-powered development of the new “field” of culturomics, elliptically defined by some of its founding practitioners as “the application of high-throughput data collection and analysis to the study of human culture.”

This sounds promising, if not altogether clear. The novelty of culturomics is its potential “to investigate cultural trends quantitatively” by generating previously hidden “suitable data” from hitherto unavailable massive databases. Despite this potential, breathless claims about the unprecedented access offered by culturomics to our own cultural history or for the Isaac Asimov-style prediction of future cultural events have derailed more grounded attention to what the “culture” of culturomics actually corresponds to and what kind of knowledge it provides. More on culturomics presently.

Critically Engaging Data

In this era of teraflops, terabytes, and cloud computing, big data represents the future. But the field has so far also displayed a notable lack of interest in addressing what the term fundamentally references, what its relationships might be to other sorts of disciplinary and scientific pursuits, what these related developments might helpfully enable, and – perhaps more importantly and most neglected – what “big data” either obscures or cannot meaningfully address.

The biggest problem with our conversation so far about the potential of “big data” efforts is that we are spending too much time enamored of the “big” – the prospect of unprecedented volume and scale in the collection, organization, and processing of mostly digital information, primarily through new data mining applications that rapidly amass unique digital data sets – and virtually no time thinking about what the “data” part might consist of – what the data essentially are. In many ways, with its often naïve digital positivism vis-à-vis “data,” the turn to “big data” is more like a return to the past. But we need to be much more scrupulous about what we mean by “data” here. What, in short, are the data of “big data,” and what, basically, is their value?

What we mean by “data” for emerging “big data” fields like culturomics is an important question for a number of reasons. Big data projects are notably cross- or interdisciplinary. For example, the affiliated researchers at Harvard’s Cultural Observatory, where culturomics has been pioneered, include several computer scientists and Google software engineers, mathematicians, evolutionary biologists, and one doctoral student in history.

Absent from the team is balance on the cultural end, or a range of disciplinary expertise likely to sustain fruitfully interdisciplinary back-and-forth, say, that might usefully problematize specific, perhaps directly competing, frameworks, perspectives, and characteristic forms of producing and evaluating knowledge, across different communities of computational and cultural research. Understandably, most computer scientists are at best only passingly aware of the characteristic methods and relationships to data among colleagues from the social sciences or humanities.

Its apparent “interdisciplinarity” is a big part of the enthusiasm the turn to “big data” has generated. Big data projects using computational techniques often involve carrying methods over from one disciplinary environment (e.g., the computer sciences) and applying them to often long-standing problems in other disciplines such as economics, hydrology, or the applied humanities. Sometimes this is a good fit. But sometimes it is not. And it is often hard to tell, since big data researchers tend to treat data questions as straightforward, presenting data as unproblematically available to collect and to manipulate.

However, when a computer scientist develops a new data mining tool to systematically harvest often vast quantities of online digital information, s/he is not simply collecting data. S/he is also carrying over specific assumptions about what “data” are, how they are identified and recognized, where they sit in a larger context or field of endeavor, how they are determined by an encompassing information ecology of concern to computer scientists, how they can be made legibly available for analysis, and what sorts of conclusions can be derived from them. We might say that these data carry a particular signature identifying them with their disciplinary source – a signature with technical, methodological, and meaningful consequences.

When asked about this, the Harvard team’s response was, “It’s irrelevant. What matters is the quality of the data…” But “data” are not all of a piece, varying simply in quality and quantity. Particular disciplines understand their knowledge production and their relationship to data in often starkly different – even incompatible – ways. And culturomics relies upon a conception of data that makes particular sense for computer scientists but is not necessarily consistent with the ways different social sciences deal with the cultural data with which they work.

Different disciplines have historically specific relationships to data, relationships that significantly express each discipline’s unique development and characteristic pursuit of problems. And “data” are not self-evident, universally fungible, or straightforwardly equivalent and comparable across these pursuits – not, say, in the way we might think of the circulation of currency in the global economy. But this is exactly how the NSF is talking about the “big data revolution.”

The data of “big data” are in fact a particular kind of data: largely digital in nature. And this has definite consequences. Early adopters of the techniques of culturomics are so far spending little time with the implications of this, instead opting to promote the seemingly limitless potential of such techniques. In part, this is because for them questions about data are more often than not technical problems to be solved (e.g., building the platform architecture, writing computer code and algorithms, or ensuring compatibility with one or another digital database) rather than more fundamental questions about the identity of “data,” the sources of knowledge, and – for culturomics – the relationship of culture to meaning.

Simply “plugging in” data collected and understood for use by one community of practitioners might, from another’s point of view, simply add up to: “garbage in, garbage out.” This problem can quickly lead to fundamental misunderstandings about what is being done with such work and about the potential it offers for better understandings of cultural questions.

Culturomics and Data

As the “big data” trend gains momentum, the concerns that have been raised have primarily revolved around two issues: privacy and transparency. On the one hand, primarily in the U.S., legal debates have focused on the potential negative implications of the increased vulnerability of personal information as a result of the tremendous improvements in online data mining and technological surveillance. On the other hand, researchers have pointed to the lack of public availability of these massive data sets, often because they are corporately owned, which makes restudies or assessments of results based on these data almost impossible.

These are legitimate and important concerns, deserving attention. But, in themselves, they do not add up to a sufficiently robust discussion of these data. Culturomics is not the only “big data” front applying comparable techniques to making sense of sociocultural knowledge. We can also point to the rapid growth of attention to computational sociocultural modeling and simulation on the part of the security sector, which uses similar techniques. Given this incredible enthusiasm, much more critical scrutiny of these tools is required so that users can better determine their appropriate niche.

For the universe of culturomics, if we were briefly to characterize its “data” – to identify its particular disciplinary signature – we might point to a variety of factors. First, culturomics pursues quantitative content analysis but on a colossal scale, using automated forms of collection derived from algorithms – computer code – designed to look for, and to sort through, particular properties of information already identified as a relevant data set, like Google Books, financial market indicators, Twitter feeds, or country surveys. Its goal, in other words, is to record the frequencies or associations of key words and phrases over time and across these already structured sets.
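The kind of frequency counting described here can be sketched minimally. Everything below is an invented toy stand-in: the corpus, years, and phrases are hypothetical, and real culturomics work runs against massive already-structured sets like Google Books rather than a dictionary in memory.

```python
# Toy stand-in for an already-structured corpus: year -> list of texts.
corpus = {
    1990: ["the data revolution begins", "data and culture"],
    2000: ["big data arrives", "culture of data mining"],
    2010: ["big data everywhere", "big data and big claims"],
}

def phrase_frequency(corpus, phrase):
    """Relative frequency of a phrase per year across the structured set.

    Uses naive substring counting; real pipelines tokenize into n-grams.
    """
    trend = {}
    for year, texts in corpus.items():
        hits = sum(text.count(phrase) for text in texts)
        words = sum(len(text.split()) for text in texts)
        trend[year] = hits / words if words else 0.0
    return trend

trend = phrase_frequency(corpus, "big data")
```

Note that the “finding” such a plot yields is entirely conditioned by what was admitted into `corpus` in the first place – which is precisely the point about prefigured data made below.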

A “culturome” (yes, arrived at via analogy to the “genome”) has, therefore, been described as “the mass of structured data that characterizes a culture.” Like a “gene” or a “meme,” it seems to be largely taken for granted that the data of culturomics are standard, and comparable, bits of information. This claim is controversial for a contemporary sociocultural anthropology engaged with a diversity of forms of cultural expression, and for which cultural meanings are not generated in just one way.

Digitally, the data of culturomics largely are standard bits of information: they are frequency counts of 0’s and 1’s, that is, variables processed according to particular search and classification criteria that are themselves written into the search algorithm of the data mining phase of work. And yet, in the results stage, these variables are re-presented as “data,” but with an empirical and even positivist sensibility. They are presented as if preexistent “stuff” out there in the world waiting to be extracted, processed, and explained. This is a sleight-of-hand. They are in fact “variables.”

For the case of culturomics we might point to a close, even closed, relationship between a specific data mining and processing tool and the data it generates. Any work with Google Books, including Google’s N-gram viewer – created to allow researchers to generate frequency counts and distribution curves of words or phrases from the Google Books archive – of course ignores non-written, non-published words, and all non-linguistic expressions of culture. It is also limited to those books which have been scanned and digitized (approximately 4% of all published books), and works only where a book has been digitized with adequately extractable metadata tags (e.g., indicating publishing date, author, genre, etc.). The Google Books project has also been limited by other prevailing factors, such as legal limitations upon public dissemination presented by intellectual property restrictions.

Why, then, would we even suppose that any results from a culturomics study using Google Books could “roughly represent the larger culture that produced it”? Or, more ridiculously, why are we hearing talk about the promise of culturomics to help identify “power laws for culture”? Books are particular kinds of cultural artifacts, not simply ciphers for them. But experts seem willing to suspend disbelief. Part of this suspension includes a lack of attention to the ways that culturomics data are notably prefigured – even determined – by the technical choices made, the platforms used, the algorithmic codes written to mine the data, as well as the digital availability and legibility of the already-formatted data in the first place.

Another way to say this is that, even as researchers treat culturomics data as interchangeable, we might suggest that the data of culturomics more accurately express the world view of culturomics. Culturomics researchers have acknowledged that their work is not intended to replace existing varieties of cultural analysis. But they refer only to the “close reading of texts,” presumably the activity of historians, literary critics, some semioticians or cultural studies scholars. This is a kind of interpretive work also conversant with the largely digital textual landscape with which culturomics is concerned, but in no way exhaustive of other cultural research methods and kinds of interpretive attention. Minimally, we need more regular reminders of the partiality of such projects.

Culturomics: Market Trend

One of the techniques culturomics researchers are using is “tone analysis” or “tone mining.” The object is to establish whether a particular word, phrase, or text possesses a positive, negative, or neutral tone. Terms like tone, mood, style, or texture have long been mainstays of the lexicon of literary criticism, in particular for the “new critics” inspired by the work of I. A. Richards. Tone has also come to inform other interpretive approaches, including contemporary attention to “voice.” Often associated with the work of Mikhail Bakhtin, such work is distinguished by attention to the dialogic interactions between a speaker in a text and multiple other points of view, for which any particular utterance is always multi-voiced. In other words, tone has been a doorway for appreciating the ways that texts are variously embedded in and animate different social and cultural contexts.

But culturomics treats tone as a “metric,” which can be turned into computable numeric data. A recent project funded in part by NSF’s Extreme Science and Engineering Discovery Environment program used a database from the Open Source Center and Summary of World Broadcasts of approximately 100 million news articles between 1979 and 2011 to measure shifts in the “global news tone,” which retroactively appears to forecast the recent Arab Spring. Such forecasting tricks are impressive.

But it is exactly at this juncture that much more scrutiny of what is involved in “tone mining” (also called “sentiment” or “opinion mining”) is needed, if we hope to come to terms with what such forecasting or trend data in fact mean in cultural terms. Here it is important to understand where this computational attention to tone comes from – what the genealogy of this kind of data is.

Amazon, among others, pioneered the proliferation of digital apps which transmit an increasing variety and volume of consumer preference data back to retailers. And for several years now many Fortune 500 companies have utilized tone mining to monitor news coverage and social media activity associated with their products. These companies, of course, have an interest in learning as much as possible about what consumers are saying about their products and in identifying new demographics. Most often they would like to be able to map or to anticipate consumer responses to particular products.

The work of data mining for tone, sentiment, or opinion – incorporated into so-called culturomics 2.0 – basically works like this:

1. Identify precompiled dictionaries of “positive” and “negative” words against which other digital texts can be compared and scored.

2. Develop an algorithm as the basis for an automated computational method for mining tone data.

3. Record frequencies of these properties across so-called “opinionated texts,” as comparable items composing an already “structured” online database or archive.

4. Assign a “value” to each so that it can be used as a variable to plot trend data.

5. For culturomics, take a leap of faith by treating these plots as meaningful indicators of cultural trends of one sort or another, often spanning decades or centuries.
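The steps above can be sketched in miniature. The word lists, articles, and scoring rule below are all hypothetical illustrations – real systems use large precompiled lexicons and far more elaborate scoring – but the sketch shows how quickly a “tone value” gets manufactured from a dictionary lookup:

```python
# Step 1: hypothetical mini-dictionaries standing in for precompiled lexicons.
POSITIVE = {"good", "hope", "stable", "progress"}
NEGATIVE = {"crisis", "unrest", "fear", "collapse"}

def tone_score(text):
    """Steps 2 and 4: score a text as (positive - negative) / total words."""
    words = text.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return (pos - neg) / len(words) if words else 0.0

# Step 3: record the property across an already-structured archive,
# here a toy list of (year, article) pairs.
archive = [
    (2009, "signs of progress and hope"),
    (2010, "growing unrest and fear of crisis"),
]

# Step 5: the per-year values are then presented as a cultural "trend".
trend = [(year, tone_score(text)) for year, text in archive]
```

The numbers that come out are artifacts of the dictionaries that went in; nothing in the procedure itself licenses reading them as “culture.”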

However, in the enthusiasm for culturomics we have been too quick to shake off the origins or history of these data. They are certainly not “raw data” of some sort. They are, instead, specific artifacts of digital business practice. Attention to “tone” or “sentiment” – as data – works well if you are invested in trying to figure out people’s preferences. But its meaningful or representative relationship to culture, or as any sort of expression of culture, requires much more unpacking and qualification than we are getting so far.

In interdisciplinary terms, this kind of quantitative knowledge about culture (read: products) might not be usefully complementary to other forms of cultural research, data, or analysis. It might simply be an entirely different sort of information, for which use of the word “culture” – or the field name “culturomics” – is in fact misleading and unconstructive.

I have briefly emphasized some of the ways that tone mining generates not “data” but a very particular kind of data, significantly prefigured by the technological architecture of the tools used, the organization of existing digital databases, and the computer code supporting such tools. These are preconditions that queer the game, as it were, as doorways encouraging certain kinds of attention to information while rendering other kinds illegible or marginal. In their very form, we might say, culturomics data already determine the questions it is possible to ask.

But there’s more. Culturomics relies on an alarmingly consumerist, or neoliberal, theory of meaning, for which tone or sentiment is the product of choices by cultural agents (originally, consumers), only insofar as they take the form: pro/con, either/or, positive/negative, or similar variant. This makes perfect sense if you want to know what people think of a toaster or if you want to record distributions of “thumbs up” among Facebook or Twitter users – after all, the impetus for collecting such information in the first place.

Contesting Culture, Data, Meaning

The “culture” of culturomics expresses the organization of available, countable, compilable information, which can be systematically extracted from digitizable texts like books, newspapers, maps, and Twitter feeds. In this way culturomics is itself an often very creative exercise in selective choice-making. But it is not in any way describing the shapes of previously indescribable macro-cultural landscapes.

Whatever “culture” is, to proceed as if it can be assembled from discrete and comparable units derived from algorithmically-assigned “values” of machine-processed digital information is to emphasize very particular structured properties available for a technically and commercially specific prior purpose. And it equates culture with consumer choice. But to reduce the meaning of cultural trends to the prodigious mass of opinion data generated online by consumers is to grossly reduce what “culture” is to a narrow market calculus. We are better off leaving the question of the sources for cultural meaning open-ended.

Despite frequent assertions to the contrary, “more – and better – data” do not automatically lead to “more robust results.” We need to temper our techno-futurist optimism with basic questions: What is meant by cultural data in the first place? What is significant about frequency counts of cultural “stuff”? How do we attribute meaning to cultural data? And what is their relationship to real-world referents? Among other relevant questions. Such a constructively skeptical approach should inform “big data”-type projects of all sorts.

Some early critiques of culturomics have complained that it cannot address the humanist “search for meaning.” But I have suggested that, with their focus on the interpretation of texts, such concerns are still located well within the culturomics world view. They represent a latter day revival of C. P. Snow’s “two cultures” debate about science and the humanities, which sets up a goal of interdisciplinarity that assumes a pride of place for the technologically-enabled “sciences” (specifically, computer science) to make sense of the world.

Developments like culturomics have intriguing potential. But the claims associated with them – in this case about “culture” – can obfuscate and confuse. Sociocultural anthropologists also aspire to make sense of cultures. They typically do this ethnographically, in settings where cultural meanings are not simply latent and extractable but instead emergently negotiated with counterparts (people we encounter “in the field,” whom we used to call “informants”). The data are usually multivocal, polysemic, and perspectival, not simply reducible to a pro/con or either/or-type choice.

The often serendipitous open-endedness of ethnography also contrasts with the technological and other prefigurements of the method of culturomics. More proximate to different specific contexts of meaning-making, ethnography is likely better located to apprehend emergent ground truths, other cultural points of view, and the diverse ways difference travels through the world. It is not clear at all that culturomics is even compatible with, let alone complementary to, ethnographic apprehensions of culture. And this raises serious questions about the celebratory interdisciplinarity with which big data projects continue to be met.