Google's BigQuery Brings GIS Into The Petascale Era

Perhaps the greatest challenge confronting today’s “big data” era lies not in the acquisition of data, but rather in how companies can uncover insights from the chaos of petabytes. Today’s companies have assembled breathtakingly large archives of incredibly rich data but struggle to perform more than basic analyses at scale. This is especially true when it comes to reasoning in terms of the more complex dimensions like time and space. As the world’s data is increasingly geospatially enriched and as companies increasingly need to understand their data in terms of its physical location, we need tools that can query, analyze and reason about immense geospatial data at population scale. Enter Google’s new BigQuery GIS.

Companies today manage an almost unthinkable amount of data from which they must extract meaning in order to run their businesses. At Google’s Cloud Next conference this past July, one of the underlying themes was the sheer size of the datasets companies must make sense of in 2018. Session after session mentioned petabyte-sized warehouses, from large companies managing tens and even hundreds of petabytes to small startups now grappling with hundreds of terabytes to petabytes. Twitter announced it had moved more than 300 petabytes over to Google Cloud. At the same time, the hours, days or even weeks-long batch computing analyses of the past no longer suffice in a world in which business decisions must be made in realtime. In short, the datasets of today require a fundamental rethinking in how we use them, harnessing computing and storage capacity that only the cloud can offer.

At the same time, the ways in which we can access information computationally are rapidly changing. Deep learning and data processing advances offer tools that can recognize images, speech, video and text, translate from one language to another with astonishing accuracy and identify correlative and temporal patterns at almost limitless scale. Meanwhile, the lenses through which we see our data are rapidly expanding, from the basic statistical summaries of yesteryear to new complex understandings incorporating latent undercurrents like emotional tenor and narrative framing. Even the way in which we organize and visualize information is changing, with geography in particular playing an ever more central role in understanding our large datasets.

As I noted in 2012, “The concepts of space and time are perhaps the most fundamental organizing dimensions of large archives, forming the root structure around which all other categories are situated. Space, in particular, is an integral part of human communication: every piece of information is created in a location, intended for an audience in the same or other locations, and may discuss yet other locations. Daily communication about the world revolves around space: the global news media, for example, mentions a location every 200-300 words, more than any other information type. Even access to information is mediated heavily by space, with over a quarter of web searches containing geographic terms and 13% of all web searches being primarily geographic in nature.”

I myself have spent more than a decade of my career exploring the geography of unexpected mediums, especially text, from news to Twitter to Wikipedia to academic literature to television to books to images. Along the way I’ve generated immense geographic datasets spanning many billions of points, with the greatest challenge being how to analyze and map them all.

Indeed, the single greatest challenge I hear from those working with any of my open GDELT Project datasets is how to tractably work with a combined 3.2 trillion datapoints.

Mapping large textual datasets requires first processing the documents with textual geocoding algorithms that identify mentions of location and use context and world knowledge to disambiguate a mention of “Paris, France” from “Paris, Illinois” and put each on a map in the form of a centroid latitude and longitude coordinate. The result is a geographic annotation in the form of an array of identified locations and the position each was found in the document.
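While GDELT’s actual output format differs, the shape of such an annotation can be sketched as a nested BigQuery record: a hypothetical table in which each article row carries an array of resolved locations along with the word offset at which each was mentioned. The table and column names below are illustrative only.

    -- Hypothetical schema sketch: each article carries an ARRAY of the locations
    -- the geocoder resolved, with the word offset of each mention.
    -- Names are illustrative, not GDELT's actual columns.
    SELECT
      article_url,
      loc.resolved_name,   -- e.g. "Paris, Illinois, United States"
      loc.latitude,        -- centroid coordinates chosen after disambiguation
      loc.longitude,
      loc.word_offset      -- position of the mention within the document
    FROM `my_project.my_dataset.geocoded_articles`,
      UNNEST(locations) AS loc
    LIMIT 10;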

At first glance it might seem simple to take a pile of coordinates and place them onto a map. After all, there are countless JavaScript libraries that can do this dynamically in a browser running on a smartphone. The problem comes in when you have many billions of points across hundreds of millions or even billions of records that need to be collapsed into a single map or combined with other data to filter and aggregate.

Historically, creating large maps in my own work typically involved manually running large batch jobs that might wait for hours in a queue and then take hours more to actually work their way through the data in the heavily IO-limited world of academic computing. One set of scripts was used to filter the raw data, then a succession of additional scripts aggregated, shaped and merged the data into the final file that could be mapped. The considerable lag between initial idea and final map meant that the kind of “what if” exploratory cartography that is at the heart of discovery was simply beyond reach.

This all changed when I first began to use Google’s BigQuery platform in which a single line of SQL could harness thousands of processors to interactively transform billions of points into a final map in the span of tens of seconds. Suddenly mapping the geography of two centuries of books was as simple as a single line of SQL.
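As a rough illustration of what such a query can look like (using hypothetical table and column names rather than the actual GDELT schema), a single aggregation statement can bin billions of coordinates into grid cells whose counts become the intensity values of a map:

    -- Bin every location mention into a 0.1-degree grid cell and count the
    -- mentions per cell, yielding a density surface ready for mapping.
    -- Table and column names are illustrative.
    SELECT
      ROUND(latitude, 1)  AS lat_bin,
      ROUND(longitude, 1) AS lon_bin,
      COUNT(*)            AS mentions
    FROM `my_project.my_dataset.location_mentions`
    GROUP BY lat_bin, lon_bin
    ORDER BY mentions DESC;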

Mapping the top five news outlets for every city on earth was now just a query away, processing 6.2 billion geographic mentions across 756 million news articles in 65 languages spanning three years and 343GB of coordinates. The analysis took just 33 seconds from button click to final map. Using the same data to map the top locations covered by each news outlet took just 15 seconds.
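One way such a “top five outlets per city” ranking could be expressed is with a simple window function; the sketch below uses hypothetical table and column names rather than the exact query I ran:

    -- Count mentions per (city, outlet) pair, then keep the five most
    -- prolific outlets for each city. Names are illustrative.
    WITH counts AS (
      SELECT city, outlet, COUNT(*) AS mentions
      FROM `my_project.my_dataset.location_mentions`
      GROUP BY city, outlet
    )
    SELECT city, outlet, mentions
    FROM (
      SELECT city, outlet, mentions,
             ROW_NUMBER() OVER (PARTITION BY city ORDER BY mentions DESC) AS rnk
      FROM counts
    )
    WHERE rnk <= 5;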

Yet, the power of BigQuery really shines when it comes to more complex analyses that require merging multiple dimensions together. In 2015, creating thematic maps using GDELT still required a final step to merge the location data with the topical structure of each article. For each article, the geocoder would output a list of all of the locations in a news article and their word offsets in the document, while a separate thematic coder would identify all of the themes in the document and their word offsets. Mapping a particular topic required taking the list of themes in an article and merging it with the list of locations to find the location mentioned closest in the text to each thematic mention. While a primitive approach, textual proximity is a useful indicator of semantic relatedness at scale, one able to sidestep the grammatical complexities of working across the 65 languages of the world’s presses.

The problem is that merging the full list of locations and topics for each article and identifying the closest topic to each location mention requires brute forcing through every possible permutation and quickly becomes computationally intractable at larger scales.

Creating my map of global wildlife crime in 2015 represented a fundamental leap forward, leveraging BigQuery for much of the analysis but still requiring a computationally expensive post-processing step. In November of that year BigQuery’s Jordan Tigani showed how BigQuery’s then-nascent support for User-Defined Functions could be used to write a JavaScript function that performed the topic-geography blending entirely inside of BigQuery, making it possible to quite literally create “one click” maps. Finally, it was possible to go from “I wonder” to a finished answer in under a minute.
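The 2015 version used BigQuery’s earlier UDF API, but the idea translates directly into today’s JavaScript UDF syntax. The sketch below, with illustrative table, column and function names, pairs each theme mention with the location whose word offset falls nearest to it:

    -- For each theme mention, pick the location whose word offset is nearest.
    -- Positions are FLOAT64 because JavaScript UDFs do not accept INT64 inputs.
    CREATE TEMP FUNCTION NearestLocation(
      locations ARRAY<STRUCT<name STRING, lat FLOAT64, lon FLOAT64, pos FLOAT64>>,
      theme_pos FLOAT64)
    RETURNS STRUCT<name STRING, lat FLOAT64, lon FLOAT64>
    LANGUAGE js AS """
      var best = null, bestDist = Infinity;
      for (var i = 0; i < locations.length; i++) {
        var d = Math.abs(locations[i].pos - theme_pos);
        if (d < bestDist) { bestDist = d; best = locations[i]; }
      }
      return best ? {name: best.name, lat: best.lat, lon: best.lon} : null;
    """;

    SELECT
      article_url,
      theme.name AS theme,
      NearestLocation(locations, theme.pos) AS nearest_location
    FROM `my_project.my_dataset.annotated_articles`,
      UNNEST(themes) AS theme;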

In turn, this opened up the possibility of conducting terascale and even petascale cartography in BigQuery, transforming breathtakingly massive spatial datasets into beautiful and informative maps that offered profoundly new insights into the planet we call home.

The year after creating that global poaching map I was able to use these advances to explore the question of what it would look like to literally “map global happiness” through the eyes of the news media, transforming a quarter billion articles, 1.4 million photographs, 89 million events, 1.48 billion location mentions and 860 billion emotional assessments into a series of maps charting the state of a world in motion. The following year it took just two SQL queries, one block of CSS and 30 seconds to transform 2.2 billion location mentions into a map of global happiness in 2016.

The power of BigQuery to brute force its way through enormous datasets in near realtime opens up the possibility of asking even more fundamental questions about human nature and the underlying geographic patterns of language. This past May I took a year’s worth of global news coverage and asked what it would look like to create a map for each word in the English language. In other words, if one took all the world’s news coverage for a year, translated it all into English and made a map of all of the places on earth mentioned together with the word “love,” what would we see?

Once again, taking 1.5 billion mentions of 740,000 distinct locations on earth and their mentions across 126 billion words of news coverage totaling more than a terabyte of text and transforming it all into a final geographic histogram of the most common locations associated with each word took just one line of SQL. Despite processing hundreds of billions of intermediate rows, it took BigQuery just five minutes to create the final geographic dataset.
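A rough sketch of the underlying idea is an article-level co-occurrence between words and locations; the actual query differed, and the names below are illustrative:

    -- Pair every word in an article with every location resolved from that
    -- article, then count how often each (word, location) pair co-occurs.
    SELECT
      word,
      loc.resolved_name AS location,
      COUNT(*) AS co_mentions
    FROM `my_project.my_dataset.geocoded_articles`,
      UNNEST(SPLIT(LOWER(article_text), ' ')) AS word,
      UNNEST(locations) AS loc
    GROUP BY word, location
    ORDER BY word, co_mentions DESC;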

Underlying all of these analyses was BigQuery’s ability to take a massive database of nearly a billion news articles stored as deeply nested delimited CSV files and subset, parse and process them into maps in realtime. Even though BigQuery did not itself natively support geospatial analyses when I created these maps, I was able to use its various building blocks to create an incredibly powerful custom-built terascale cartographic system right out of the box to explore the geography of the global news media.

Since its public debut in 2010, BigQuery has grown into one of the centerpieces of Google’s cloud platform. Today you can brute force examine an entire petabyte in just 3.3 minutes, down from 3.7 minutes just a year ago. Customers have performed single queries that have analyzed 5.5 petabytes and 29 trillion rows at once. This is the world of “big data” as it exists in the cloud.

Over the years BigQuery has expanded from its roots as a massively scalable search and aggregation system into a turnkey analytics platform in its own right, adding a wealth of new capabilities for turning data into insights. Among these new capabilities, Google announced earlier this summer the public debut of BigQuery Geographic Information Systems (GIS), a rapidly growing suite of query operators and features for BigQuery that bring it rich geographic capabilities.

In place of the complex and cumbersome regular expressions and deeply nested string operators I used over the years to translate my delimited CSV files into coordinates and perform basic operations on them, BigQuery’s new GIS features allow it to treat geographic information as a first-class data citizen, operating on it natively. More to the point, while my explorations were limited exclusively to aggregations and non-spatial analyses, BigQuery can now perform true spatial operations at BigQuery scale.
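For example, with GEOGRAPHY values and functions such as ST_GEOGPOINT and ST_DWITHIN, a proximity filter that once demanded hand-rolled string math becomes a native expression. A minimal sketch, assuming a hypothetical table of location mentions:

    -- Count news mentions that fall within 50 km of downtown Washington, DC.
    -- Table and column names are illustrative.
    SELECT COUNT(*) AS nearby_mentions
    FROM `my_project.my_dataset.location_mentions`
    WHERE ST_DWITHIN(
            ST_GEOGPOINT(longitude, latitude),   -- note: longitude comes first
            ST_GEOGPOINT(-77.0369, 38.9072),     -- Washington, DC
            50000);                              -- distance in meters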

Over time it is not hard to imagine BigQuery GIS evolving into a sort of “petascale PostGIS” environment for performing spatial queries and operations on petabytes of data or tens of trillions of objects. Yet, perhaps the greatest impact will come as BigQuery leverages its raw power to move beyond the traditional limitations of geographic databases. Imagine harnessing thousands or tens of thousands of cores to run spatial clustering, spatial regression, kernel density estimates (KDEs) and the myriad other geographic analytic functions that today cannot begin to scale to the sizes of data that companies need to explore through a cartographic lens. With BigQuery's recent addition of BigQuery ML, similar geographic capabilities cannot be far behind.

Indeed, performing geographic analysis at extreme scale, from simple spatial unions through massive analytic modeling, is an area where BigQuery has immense potential to upend the limitations of spatial analysis and bring the dimension of space into the cloud era. BigQuery’s unique ability to couple petascale query infrastructure with thousands or even tens of thousands of cores and apply it on demand is something we’ve not really been able to contemplate before. This kind of computational scale may allow us to fundamentally rethink both the scale at which we incorporate space into our analyses and the kinds of questions that are now within our reach.

Putting this all together, over the last several years I’ve leveraged Google BigQuery’s raw power to construct my own terascale cartographic system for exploring the geography of the global news media, merging terabytes of text with trillions of annotations and billions of geographic coordinates to create maps that peer into the soul of global society and explore what makes us human. BigQuery’s new GIS initiative will finally allow researchers like myself to move from mere spatial aggregations to true spatial analysis at BigQuery scale, finally bringing the GIS world into the petascale era.
