Big Spatial Data – Your day has come

Despite all the buzz that surrounds the term “big data,” we hear surprisingly little about “big spatial data.” Of the top ten results from an Incognito Google search for “Big Spatial Data,” only two are from 2015, and only one – “GIS Tools for Hadoop” by Esri – attempts to present a solution. Despite apparently being the only tool available, it saw only 25 commits on GitHub in 2015. A Google Scholar search restricted to 2015 returns only 43 results. From a demand perspective, Google Trends shows that Big Spatial Data is not what people are looking for, either.

The Value of Big Spatial Data

Are we to conclude that “Big Spatial Data” is an irrelevant topic? Is it perhaps no different from other big data? I would argue no on both counts.

Spatial data and spatio-temporal data are tremendously valuable. They can help ships, planes, and UAVs move more safely. They help organizations operate more efficiently and companies market more effectively, and they make mobile applications more functional. In short, spatio-temporal data have the potential to make our lives better, if we can build the analytics to harness them. Before we can build these analytics, however, we need a platform to store and access Big Spatial Data. Everything from mobile apps to sophisticated analytics needs fast, efficient access to select portions of data for computation or visualization.

Having spent the last two decades digitizing almost everything, we now have mounds of data scattered across a multitude of storage systems. Simply figuring out how to handle all of this data is a tremendous challenge. If there is any question, refer back to the first chart to see how much interest there is in “Big Data.” Perhaps tackling the complexities of Big Spatial Data, and in particular big spatio-temporal data, has been kicked down the road. The time to address these challenges is now.

Examples of Big Spatial Data

Mobile apps, cars, GPS receivers, UAVs, ships, airplanes, and IoT devices are generating more spatial data by the day. Consider a single example: an aircraft reporting its location every second – as is typically done using Automatic Dependent Surveillance – Broadcast (ADS-B) – generates 9,000 points over a 2.5-hour flight, and the National Air Traffic Controllers Association estimates there are 87,000 flights per day in the US. Roughly figured, that’s about 285 billion points per year (and all US flights will be required to use ADS-B by 2020). Furthermore, that assumes each point is collected only once; in reality, many sensors have overlapping coverage, likely making the volume much higher.
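As a sanity check on those figures, the arithmetic is a few lines (the 365-day year is my own simplifying assumption):

```python
# Back-of-envelope estimate of annual ADS-B point volume,
# using the figures quoted above (one report per second,
# 2.5-hour flights, 87,000 US flights per day).
points_per_flight = 2.5 * 60 * 60          # seconds in a 2.5-hour flight
flights_per_day = 87_000                   # NATCA estimate for the US
points_per_year = points_per_flight * flights_per_day * 365

print(f"{points_per_flight:,.0f} points per flight")  # 9,000 points per flight
print(f"{points_per_year:,.0f} points per year")      # 285,795,000,000 points per year
```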

Hundreds of millions of tweets are sent every day, and many of them are geolocated. Now think about Instagram, Facebook, Google Maps, Uber, and the dozens of other applications that capture your location using your mobile phone. All of these tools are generating huge streams of spatial data. Few would argue that these data streams lack value, but if we cannot store them in such a way that they can be accessed and manipulated quickly, what is the point of collecting them at all?

Challenges with Big Spatial Data

Storing and querying spatio-temporal data is different from handling most other types of data. To begin with, there are several different types of spatial data: vector data consist of points, lines, and polygons, while raster data consist of gridded values such as imagery. Different applications also require different kinds of queries. Some may only need to extract data within a specific time window and area. But what if the data are track data, and a track crosses through the search window without any of its reported points falling inside that window? Should that track be included in the query results? Finally, most spatial data exhibit spatial correlation. Does that affect your modeling assumptions when building an analytic?
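To make the track question concrete, here is a minimal sketch – the function and variable names are illustrative, not from any particular library – that uses Liang–Barsky clipping to show a track passing through a query window even though none of its reported points fall inside it:

```python
def segment_intersects_box(p0, p1, xmin, ymin, xmax, ymax):
    """Liang-Barsky clipping: True if the segment p0->p1 passes through
    the axis-aligned box, even when neither endpoint lies inside it."""
    (x0, y0), (x1, y1) = p0, p1
    dx, dy = x1 - x0, y1 - y0
    t0, t1 = 0.0, 1.0
    for p, q in ((-dx, x0 - xmin), (dx, xmax - x0),
                 (-dy, y0 - ymin), (dy, ymax - y0)):
        if p == 0:
            if q < 0:
                return False          # parallel to and outside this edge
        else:
            t = q / p
            t0, t1 = (max(t0, t), t1) if p < 0 else (t0, min(t1, t))
            if t0 > t1:
                return False
    return True

window = (0.0, 0.0, 1.0, 1.0)          # xmin, ymin, xmax, ymax
track = [(-0.5, 0.5), (1.5, 0.5)]      # both reported points lie outside

points_inside = [(x, y) for x, y in track
                 if 0.0 <= x <= 1.0 and 0.0 <= y <= 1.0]
crosses = any(segment_intersects_box(a, b, *window)
              for a, b in zip(track, track[1:]))
print(points_inside, crosses)          # [] True
```

A point-only query returns nothing for this track, while a segment-aware query reports the crossing – which answer is “correct” depends entirely on the application.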

The challenges of spatial data have been enumerated many times before, and honestly, there are many excellent solutions for handling these data. However, these solutions all have their limitations. For example, PostGIS is one of my favorite tools for storing spatial data, but it eventually reaches a capacity limit. The point here is that spatio-temporal data need special treatment, and many of the solutions offered to address big data do not sufficiently support big spatial data. A solution for spatio-temporal data in a big data environment is sorely needed.

The most logical starting point is to leverage a distributed database to create an analog of the spatial versions of relational databases – an “HBaseGIS” or “AccumuloGIS,” so to speak. The inherent structure of these systems, however, presents a new challenge. Unlike relational databases, where you can simply add indexes to as many columns as you like, NoSQL databases have a single primary index that has to support most of your queries. Naturally, spatio-temporal data are most frequently queried by a space-time window. Thus, to quickly query spatio-temporal data, we have to collapse three dimensions of data down into a single index.
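One common way to collapse those three dimensions is a space-filling curve: normalize each dimension to a fixed number of bits and interleave the bits into one sortable key (a Z-order curve). GeoMesa’s production index is considerably more elaborate than this, so treat the bounds, bit width, and week-based time normalization below as illustrative assumptions only:

```python
def z3_key(lon, lat, t, bits=21):
    """Interleave (longitude, latitude, time) into one sortable integer key."""
    def normalize(value, lo, hi):
        # Map a value in [lo, hi] onto an unsigned `bits`-bit integer.
        return int((value - lo) / (hi - lo) * ((1 << bits) - 1))

    dims = (normalize(lon, -180.0, 180.0),
            normalize(lat, -90.0, 90.0),
            normalize(t, 0.0, 604_800.0))    # seconds within a week
    key = 0
    for bit in range(bits - 1, -1, -1):      # most significant bits first
        for d in dims:
            key = (key << 1) | ((d >> bit) & 1)
    return key

a = z3_key(-77.03, 38.88, 1_000.0)   # near Washington, DC
b = z3_key(-77.02, 38.89, 1_010.0)   # a nearby point, moments later
c = z3_key(151.21, -33.87, 1_000.0)  # Sydney at the same moment
print(abs(a - b) < abs(a - c))       # True: close in space-time, close in key space
```

Because nearby records end up with similar keys, a space-time query window can be rewritten as a small set of contiguous key-range scans – exactly the access pattern that single-index stores like Accumulo and HBase handle well.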

Conclusion

It is safe to say that Big Spatial Data is an important topic for discussion, and it certainly warrants more attention than a few outdated links on a Google Search. To do my part, I will write a series of posts about Big Spatial Data, GeoMesa, Spatial Analytics, and open-source technology. I’ll conclude by asking: What are your Big Spatial Data challenges?

To keep up-to-date with the latest work going on at CCRi, follow us on Twitter at @ccr_inc.

4 Responses

Hi Jamie, look forward to your continued discussion on this topic. The Open Geospatial Consortium has several domain and standards groups working on Big Spatial-Temporal data challenges – http://www.opengeospatial.org/projects/groups/dggsswg. One of these is a new standard Earth reference formally called a Discrete Global Grid System (DGGS). I would be pleased to share some aspects of the DGGS as a “digital” spatial reference designed specifically for aggregating, decomposing, and integrating big geospatial data. You can try out our DGGS WorldView Studio at https://worldview.gallery/.

Jamie Conklin

Perry – Thanks for your comment. I am planning a couple of posts that will discuss GeoMesa’s indexing strategy. The first will be an overview of GeoMesa, and another will be dedicated to the index itself. Please stay tuned! All of the posts can be found at GeoMesa Posts

Carl Reed

As with any Google search, results can vary based on the keywords used. Much better results if one searches for:

geospatial big data

Lots of research, projects, etc.

I have done a number of presentations on geospatial big data and standards as have other OGC Members. There is also a book:

Big Data: Techniques and Technologies in Geoinformatics (CRC Press).

As to geo big data, as I told a US Gov CTO led discussion on big data, geo big data has been around for a loooong time. Early Landsat, seismic studies, NRO sources and so forth. This was news to all the non-geo folks in that discussion!

Also, Big Data is all relative. Big Data in 1980 is small data today 🙂

Regards

Carl

Jamie Conklin

Carl,
Thank you for your thoughtful comment. I tried the “geospatial big data” search vs. “big spatial data” and, as you said, the results were more relevant. As you point out, Big Spatial Data, where Big is relative to the time, has been around for a long time. Given that, you might expect there to be more in the way of solutions for managing these data. Another metric, albeit somewhat less useful, is the number of search results: “Geospatial Big Data” returns about 1.2M hits, while “Big Data” returns 760M.

This post articulates CCRi’s motivation for building GeoMesa. I believe that there is a general capability gap when it comes to scalable spatial storage and analysis tools – and in particular open-source analysis tools.

It is my hope to raise awareness of the existence and value of spatial data. I also hope that this awareness will help lead to further investment in great open-source tools and communities.
Regards,
Jamie