Big data can only capture the past; without theory, it cannot predict a changing future, says Martin Hilbert.

Recently, much has been written, said and done about the usefulness of big data for development. The UN Economic and Social Council recognises that “big data have the potential to produce more relevant and more timely statistics than traditional sources of official statistics, such as survey and administrative data sources”, while the OECD is convinced that “big data now represents a core economic asset that can create significant competitive advantage”. [1,2]

At the same time, obstacles and perils have been noted — mostly well-known challenges previously discussed in the context of the digital divide, including shortages in skills and infrastructure, and privacy concerns.

But there is one fundamental, theoretical limitation on what big data can do, and it is particularly relevant for development work. This limitation is inherent to big data, and it should make advocates cautious about working with it and trusting it.

Data from the past

The gist behind this limitation is known as the ‘Lucas critique’ in economics, as ‘Goodhart’s law’ in finance and as ‘Campbell’s law’ in education. All date back to the mid-1970s. In 1976, US economist Robert Lucas criticised colleagues who used sophisticated statistics (econometrics) to make economic predictions. He argued that no useful information can emerge from such predictions, because any policy change will also change the behaviour that the econometric models describe. [3]

The reasoning is that all kinds of data, including econometric or ‘big’, are from the past — or, at best, the ‘real-time present’. So any analysis that uses them can only tell us about what has already happened. Where the past, present and future follow the same logic, this is useful. However, if significant changes occur in the dynamic of the system being described, empirical statistics are, at best, limited.

Development work aims to create such changes. Its explicit goal is to create a future that significantly differs from the past. Given the complexity and uniqueness of each social, economic, cultural and natural system that is subject to development interventions, the result of such interventions is almost always novel and unique for each case. It is, in essence, a reality that has never been — different from the past and different from other cases. So what could the past possibly tell us about the future in this case?

Predicting a changing future

To predict a future that has never been, theory-driven models are necessary. These allow variables to be adjusted with values — based on theory — that have never existed in statistically observable reality. New variables can even be included, such as those introduced by a development intervention.

This is especially important in social systems. Complex social dynamics are notoriously full of non-linear properties, and these defy most methods of statistical extrapolation, which tend to be linear. A developed Africa will not simply be an extrapolated version of Europe’s past development trajectory.
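The contrast between extrapolating from past data and adjusting a theory-driven model can be made concrete with a small sketch. It is purely illustrative: the logistic growth rule stands in for "a theory", and every number (growth rate, capacity, the hypothetical intervention that doubles capacity) is an assumption chosen for the example, not drawn from any real dataset.

```python
# Illustrative sketch only: the logistic rule and all parameter values are assumptions.

def logistic_step(x, growth, capacity):
    """One period of logistic growth: a simple non-linear 'theory' of the system."""
    return x + growth * x * (1 - x / capacity)

def simulate(x0, growth, capacity, periods):
    """Run the theory-driven model forward and return the whole path."""
    path = [x0]
    for _ in range(periods):
        path.append(logistic_step(path[-1], growth, capacity))
    return path

def fit_line(ys):
    """Least-squares fit of ys against time 0..n-1; returns (intercept, slope)."""
    n = len(ys)
    t_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    cov = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(ys))
    var = sum((t - t_mean) ** 2 for t in range(n))
    slope = cov / var
    return y_mean - slope * t_mean, slope

# "Past" data: ten periods observed under the old regime (capacity 100).
past = simulate(x0=5.0, growth=0.5, capacity=100.0, periods=10)
horizon = 10

# Data-driven forecast: extend the straight line fitted to the past.
intercept, slope = fit_line(past)
linear_forecast = intercept + slope * (len(past) - 1 + horizon)

# Theory-driven forecasts: same model, with and without a hypothetical
# intervention that sets capacity to a value never seen in the past data.
baseline_forecast = simulate(past[-1], 0.5, 100.0, horizon)[-1]
intervention_forecast = simulate(past[-1], 0.5, 200.0, horizon)[-1]

print("linear trend:        ", round(linear_forecast, 1))
print("model, no change:    ", round(baseline_forecast, 1))
print("model, intervention: ", round(intervention_forecast, 1))
```

The fitted trend line returns a single number whatever happens to the system, whereas the model can be re-run with a parameter value that has never been observed. Whether the model's answer is any good depends, of course, on whether the assumed theory is right.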

Think about big data this way. Facebook, Google and Amazon can predict your future behaviour better than any psychologist — but only if your future behaviour follows the same logic as your past behaviour. No theory is required. Big data is sufficient. This is often referred to as the ‘end of theory’ due to big data. [4]

But if you fall in love, or change your job, or change the country where you live, predictions from past data will be limited, if not misleading. In that case, a psychologist or an economist who has a theory-driven model of you will still be able to make predictions, by changing the model’s variables according to the changed environment.

For example, if you worked as a bartender in Brazil last year and then became a data analyst in Germany, ‘big data’ on your behaviour last year will be of limited use in predicting your future behaviour, while a more comprehensive model of your preferences might still be able to give insights into your shifting interests, from caipirinhas to Hadoop analytics.

The same holds for development. For example, when people’s Google search habits changed, the ability of Google Flu Trends to predict epidemics became limited. [5] The model can, of course, be constantly adjusted (after the fact), but without some theory, one cannot predict into a changing future.

Millions of variables

Thankfully, the digital revolution is not limited to producing big data; it also helps with modelling. While in the past, theory-driven models had only a few variables, today’s computational power allows for thousands or even millions of variables.

Computer simulations (such as agent-based models) do not have any conceptual limitations regarding the achievable level of detail and precision. The behaviour of individuals and organisations can be adjusted to — and in response to — an ever-changing reality.

Beyond accuracy, the biggest advantage of computer simulation models for development is their modular flexibility. Each line of their code defines some kind of behaviour or characteristic. Adding them together and letting things interact recreates a social complexity often similar to the one we see in reality. Reusing the code allows us to create tailor-made models for concrete problems in specific, local and context-dependent settings.

Much like the creation of different versions of the city-building video game SimCity, the computer simulation of a unique community in Africa can reuse existing software modules while evaluating a context-dependent future that is different from the past. The image of SimCity highlights an additional benefit: multimedia visualisation can be used to engage and convince policymakers who lack sophisticated statistical or economic training.

This is the ultimate goal of computational social science: providing understandable and scalable solutions to embrace both the complexity of development and the uniqueness of its ever-changing paths.
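The modular idea described above can be sketched in a few lines of code. This is a toy illustration under stated assumptions, not a real development model: the behaviour rules (`imitate_peers`, `make_subsidy_rule`), the community sizes and the probabilities are all hypothetical names and values invented for the example.

```python
import random

# Illustrative sketch only: rules, names and parameters are hypothetical.

def imitate_peers(agent, community, rng):
    """Behaviour module: adopt with probability equal to the current share of adopters."""
    share = sum(a["adopted"] for a in community) / len(community)
    return rng.random() < share

def make_subsidy_rule(uptake_probability):
    """Module factory: an intervention that has no counterpart in past data."""
    def respond_to_subsidy(agent, community, rng):
        return rng.random() < uptake_probability
    return respond_to_subsidy

def run(n_agents, n_seed_adopters, rules, steps, seed=0):
    """Compose reusable rules into one community model; return the adoption share per step."""
    rng = random.Random(seed)  # fixed seed so that runs are reproducible
    community = [{"adopted": i < n_seed_adopters} for i in range(n_agents)]
    history = []
    for _ in range(steps):
        for agent in community:
            if not agent["adopted"]:
                # An agent adopts as soon as any of its behaviour modules says so.
                agent["adopted"] = any(rule(agent, community, rng) for rule in rules)
        history.append(sum(a["adopted"] for a in community) / n_agents)
    return history

# The same modules, reused to build two context-specific models.
baseline = run(200, 2, [imitate_peers], steps=15)
with_policy = run(200, 2, [imitate_peers, make_subsidy_rule(0.05)], steps=15)

print("baseline:   ", [round(s, 2) for s in baseline])
print("with policy:", [round(s, 2) for s in with_policy])
```

Each function is a small, reusable piece of behaviour, and a new setting is modelled by recombining the pieces or changing their parameters. The printed numbers carry no empirical meaning; they only show how adding one module changes the simulated path.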

Social science future

But does this sound like a scary future, in which the behaviour of entire societies, communities and each of their members is replicated in real-time computer simulations that are constantly adjusted with the big data incessantly collected about each one of us? Scary or not, it is certainly the future of social science. Some branches of science, industry and government are working on it at high speed.

For example, the UN Environment Programme has teamed up with Microsoft Research for the past three years to create a computer model that simulates all ecological life on Earth. [6] And the US city of Portland has simulated the daily behaviour of its 1.6 million residents over 180,000 locations in order to optimise the roll-out of light-rail infrastructure and to simulate epidemics. [7]

We should make these powerful tools work for development policy. While we have spent significant effort focusing on big data, we are far from having models that can be used in an ever-changing reality of ongoing challenges in health, education, economic growth, poverty or social cohesion. Much more effort has to be put into such theory-driven models.

If not, we run the risk of falling into the same traps as Lucas’s colleagues did in the 1970s, some four decades before the big data revolution.