Stop Hyping Big Data and Start Paying Attention to ‘Long Data’

Our species can’t seem to escape big data. We have more data inputs, storage, and computing resources than ever, so Homo sapiens naturally does what it has always done when given new tools: It goes even bigger, higher, and bolder.

We did it in buildings and now we’re doing it in data. Sure, big data is a powerful lens — some would even argue a liberating one — for looking at our world. Despite its limitations and requirements, crunching big numbers can help us learn a lot about ourselves.

But no matter how big that data is or what insights we glean from it, it is still just a snapshot: a moment in time. That’s why I think we need to stop getting stuck only on big data and start thinking about long data.

Samuel Arbesman

Samuel Arbesman is an applied mathematician and network scientist. He is currently a senior scholar at the Ewing Marion Kauffman Foundation and a fellow at the Institute for Quantitative Social Science at Harvard University. Arbesman is the author of The Half-Life of Facts and blogs for Wired Science on Social Dimension.

By “long” data, I mean datasets that have massive historical sweep — taking you from the dawn of civilization to the present day. The kinds of datasets you see in Michael Kremer’s “Population growth and technological change: one million BC to 1990,” which provides an economic model tied to the world’s population data for a million years; or in Tertius Chandler’s Four Thousand Years of Urban Growth, which contains an exhaustive dataset of city populations over millennia. These datasets can humble us and inspire wonder, but they also hold tremendous potential for learning about ourselves.

Because as beautiful as a snapshot is, how much richer is a moving picture, one that allows us to see how processes and interactions unfold over time?

We’re a species that evolves over ages, not just over short hype cycles, so we can’t ignore long-timescale datasets. They offer far more information than the typical big-data collections that span only a few years, or even shorter periods.

Why does the time dimension matter if we’re only interested in current or future phenomena? Because many of the things that affect us today and will affect us tomorrow have changed slowly over time: sometimes over the course of a single lifetime, and sometimes over generations or even eons.

Datasets of long timescales help us understand not only how the world is changing, but also how we, as humans, are changing it. Without this awareness, we fall victim to shifting baseline syndrome: the tendency to shift our “baseline,” or what is considered “normal,” blinding us to changes that occur across generations (since the generation we are born into is taken to be the norm).

Shifting baselines have been cited, for example, as the reason why cod vanished off the coast of Newfoundland: fishermen failed to see the slow, multi-generational loss of cod because the population decline was too gradual to notice within any single generation. “It is blindness, stupidity, intergeneration data obliviousness,” Paul Kedrosky argued, writing for Edge, further noting that our “data inadequacy … provides dangerous cover for missing important longer-term changes in the world around us.”

So we need to add long data to our big data toolkit. But don’t assume that long data is useful only for analyzing “slow” changes. Fast changes should be seen through this lens too, because long data provides context. Of course, big datasets provide some context as well: we can tell whether something is an aberration or is to be expected only after we understand its frequency distribution, and doing that analysis well requires massive numbers of data points.
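To make that last point concrete, here is a minimal sketch of aberration detection against an empirical distribution. The synthetic data, the three-standard-deviation threshold, and the function name are my own illustrative assumptions, not anything from the article; the point is simply that calling an observation “aberrant” only makes sense relative to a distribution estimated from many data points.

```python
import random
import statistics

def is_aberration(history, observation, threshold=3.0):
    """Flag an observation as aberrant if it lies more than
    `threshold` standard deviations from the historical mean.
    A trustworthy estimate of mean and spread needs many points."""
    mu = statistics.mean(history)
    sigma = statistics.stdev(history)
    return abs(observation - mu) > threshold * sigma

# Hypothetical history: 10,000 draws from a bell curve around 100.
random.seed(42)
history = [random.gauss(100, 10) for _ in range(10_000)]

print(is_aberration(history, 103))  # ordinary variation: False
print(is_aberration(history, 180))  # far out in the tail: True
```

With only a handful of historical points, `stdev` would be a noisy estimate and the same test would misfire, which is exactly why the distributional context matters.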

Big data puts slices of knowledge in context. But to really understand the big picture, we need to place a phenomenon in its longer, more historical context.

Want to understand how the population of cities has changed? Use city population ranks over history along with some long datasets. Want to understand the costs of carbon-centric energy such as coal? Go much further back than data collected over previous decades. Want to see more clearly how knowledge is preserved? Use copies of a text created over a thousand years.

The general idea of long data is not really new. Fields such as geology, astronomy, and evolutionary biology, where data spans millions of years, rely on long timescales to explain the world today. History itself is getting the long data treatment: scientists are attempting to understand social processes through a quantitative framework known as cliodynamics, part of the broader field of digital history. Examples range from understanding the lifespans of empires (does the U.S. as an “empire” have a time limit that policy makers should be aware of?) to mathematical models of how religions spread (not so different from how non-religious ideas spread today).

In a related intellectual approach, the Long Now Foundation focuses on long-term thinking, including projects like building a clock designed to last 10,000 years. This involves taking into account everything from the nature of erosion to the 26,000-year precession of the equinoxes.

We are so focused on change that projects like these force us to notice the things that don’t change. Only then can we know which constants we can rely on over longer stretches of time, and which efforts are worth investing in if we care about our future.

If we’re going to move beyond long data as a mindset, however, and treat it as a serious practice, we need to connect these intellectual approaches across fields, linking professional and academic disciplines: data scientists and researchers, business leaders and policy makers.

We also need to build better tools. Just as big data scientists require tools like Hadoop, long data scientists will need special skillsets. Statistics are essential, but so are subtle, even seemingly arbitrary pieces of knowledge, such as how our calendar has changed over time. Depending on the dataset, one might need to know when different countries switched from the older Julian calendar to the Gregorian calendar. England, for example, adopted the Gregorian calendar nearly two hundred years after much of the rest of Europe did.
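Calendar drift of this sort is mechanical enough to compute. A minimal sketch, assuming the standard rule that the Julian calendar falls an extra day behind the Gregorian in every century year not divisible by 400 (the function name is my own; the adoption years are well documented):

```python
def julian_gregorian_offset(year):
    """Number of days the Julian calendar lags the Gregorian
    in a given year (valid for dates from March onward)."""
    century = year // 100
    # The Gregorian reform drops 3 leap days every 400 years;
    # the constant 2 calibrates the count to the 1582 reform.
    return century - century // 4 - 2

# 1582: Catholic Europe skipped 10 days (Oct 4 -> Oct 15).
print(julian_gregorian_offset(1582))  # 10
# 1752: England skipped 11 days (Sep 2 -> Sep 14).
print(julian_gregorian_offset(1752))  # 11
```

A long-data pipeline mixing pre- and post-adoption records from different countries would need exactly this kind of normalization before dates could be compared.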

Long data shows us how our species has changed, revealing in particular how young and recent much of our world is. Want data on the number of countries every half-century since the fall of the Roman Empire? That’s only about thirty data points. But insights from long data can also be brought to bear today, on everything from how markets change to how our current policies can affect the world over the really long term.
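The back-of-the-envelope count is easy to check. The start and end years below are my own illustrative choices (the conventional date for the fall of the Western Roman Empire, and roughly the present day):

```python
FALL_OF_ROME = 476  # conventional date for the Western Empire
PRESENT = 2013      # roughly the present day

# One observation every half-century, inclusive of both ends.
points = (PRESENT - FALL_OF_ROME) // 50 + 1
print(points)  # 31 -- only "about thirty" data points
```

A millennium and a half of history, sampled at a generational cadence, really does fit in a single short list.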

Big data may tell us what we need to know for hype cycles today. But long data can reach into our past … and help us lay a path to our future.