NoSQL Databases: What Geospatial Users Need to Know

By Adena Schutzberg

NoSQL databases are bubbling up into mainstream technology companies, especially location-based services applications. But they've not been discussed much in the geospatial media. After several dozen active, engaged GIS practitioners (including my graduate students at Penn State) confirmed they'd not heard of the term "NoSQL," I thought it was time for an introduction.

What is NoSQL?
As Paul Ramsey eloquently details on his blog, NoSQL databases are not "anti" SQL. Instead, they are perhaps more correctly described as "not SQL." That is, they do things differently than the SQL (Structured Query Language)/relational databases many readers use, and have used for some time.

What is different? Wikipedia states: NoSQL databases "may not require fixed table schemas, and usually avoid join operations and typically scale horizontally." That means some do not have strict structures (schemas), some avoid combining records from multiple tables based on common values in one field (something we do a lot in GIS!), and some scale simply by adding more hardware. A key point here is that different NoSQL databases have different strengths based on their architecture. There are no universal properties that every NoSQL database will have.
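To make the schema and join points a bit more concrete, here is a minimal sketch in plain Python, with no database involved. The parcel and owner records, the field names, and the helper join_parcels_to_owners are all hypothetical and made up for illustration; real relational and document stores differ in many details.

```python
# Relational style: two "tables" with fixed columns, linked by a common field.
# All records and field names here are invented for illustration.
parcels = [
    {"parcel_id": 1, "owner_id": 10, "area_m2": 520.0},
    {"parcel_id": 2, "owner_id": 11, "area_m2": 1310.5},
]
owners = [
    {"owner_id": 10, "name": "A. Smith"},
    {"owner_id": 11, "name": "B. Jones"},
]


def join_parcels_to_owners(parcels, owners):
    """Combine records from both tables on the shared owner_id value."""
    owners_by_id = {o["owner_id"]: o for o in owners}
    return [
        {**p, "owner_name": owners_by_id[p["owner_id"]]["name"]}
        for p in parcels
    ]


# Document style: each record carries what it needs, and records
# do not have to share the same set of fields (no fixed schema).
parcel_docs = [
    {"parcel_id": 1, "area_m2": 520.0, "owner": {"name": "A. Smith"}},
    {"parcel_id": 2, "area_m2": 1310.5, "owner": {"name": "B. Jones"},
     "zoning": "residential"},  # extra field present only on this record
]

joined = join_parcels_to_owners(parcels, owners)
names_from_join = [row["owner_name"] for row in joined]
names_from_docs = [doc["owner"]["name"] for doc in parcel_docs]

# Both approaches yield the same owner names; only the storage layout differs.
print(names_from_join == names_from_docs)
```

The trade-off in this sketch is the usual one: the document layout avoids the join at read time, but the owner's name is now repeated wherever that owner appears.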

There's an academic school of thought that sees SQL/relational databases as a subset of "structured storage," its name for NoSQL offerings. One other point: NoSQL databases are often categorized by "data store type" such as "document store," "key/value store," "tabular," etc. I will not be addressing those distinctions here, but GigaOm has a nice table as a starting point to understand the differences.

Why would you want a database like that?
In short, because SQL/relational databases are not that good at some tasks like "indexing a large number of documents, serving pages on high-traffic websites, and delivering streaming media" (Wikipedia).

NoSQL solutions are particularly good at dealing with lots and lots of reading/writing tasks coming in at once, something that tends to slow down SQL/relational databases. I think naming just three companies that use NoSQL solutions helps define "lots and lots of reading and writing tasks": Facebook, eBay, Google.

Joe Stump of SimpleGeo concedes it's possible to use SQL/relational databases to do what NoSQL databases can. For him, the choice to use NoSQL is about money:

I guess what I'm saying is that my decision to use NoSQL, and I'm guessing others' decisions to do so, has less to do with the fact that we can't squeeze a few thousand writes a second out of MySQL and more to do with management and cost overhead. NoSQL solutions allow us to serve absurd amounts of data for a really, really low price.

What about geospatial data?
Some NoSQL solutions include support for geospatial data either natively or with an extension. Others are not designed for geospatial applications but have been implemented to support geospatial data. Some NoSQL databases currently being used to manage geospatial data include: MongoDB (open source), BigTable (developed by Google, proprietary, used in Google Earth), Cassandra (developed by Facebook, now open source and maintained by Apache), CouchDB (open source, Apache).

How do I know if I should use NoSQL?
This is not a decision for marketing or management to make without consulting a sophisticated development team. I defer to the experts who say, "It depends!"

Paul Ramsey suggests NoSQL databases are valuable in "the high-volume, high-availability use case which has emerged in the age of consumer web services." But he concedes, "There's no free lunch though. In exchange for the high-throughput/high-availability you lose the expressiveness [richness] and power of SQL."

Jurg van Vliet offers this observation and advice when thinking about a large point of interest (POI) store:

The traditional solution to storing information is a relational database. But relational databases were designed for a different era, for a different infrastructural paradigm. Scaling Oracle on a mainframe is fine, you can get great performance like this. But building something big with commodity infrastructure components is much more difficult. This is why NoSQL is starting to get so much attention lately. NoSQL makes concessions to the relational part of the information so that it can scale by distributing information around.

So, unless you are willing to buy some mainframes for your POI store, you will have to jump the RDBMs bandwagon. NoSQL, however, is not a one-size-fits-all container of datastores. There are the web service approaches like Amazon SimpleDB. Great for storing many key/value pairs, good for retrieving simple selections, but not so good for complex searches. On top of that you have the latency because it is a web service.
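As an illustration of the key/value trade-off described in the passage above (this sketch is not taken from van Vliet, and it is not how Amazon SimpleDB or any particular product works), consider a toy POI store held in a plain Python dictionary. The keys, the POI records, and the helpers get_poi and pois_in_bbox are hypothetical. Retrieving one record by its key is a single lookup, while a spatial search with no index has to inspect every stored value.

```python
# A toy key/value store: POI key -> value (a dict with name, lon, lat).
# The records below are invented for illustration.
poi_store = {
    "poi:1": {"name": "Cafe", "lon": -75.16, "lat": 39.95},
    "poi:2": {"name": "Museum", "lon": -75.18, "lat": 39.97},
    "poi:3": {"name": "Lighthouse", "lon": -70.20, "lat": 43.62},
}


def get_poi(store, key):
    """Simple selection: one lookup by key (None if the key is absent)."""
    return store.get(key)


def pois_in_bbox(store, min_lon, min_lat, max_lon, max_lat):
    """Spatial search without an index: every value must be inspected."""
    return sorted(
        key
        for key, poi in store.items()
        if min_lon <= poi["lon"] <= max_lon
        and min_lat <= poi["lat"] <= max_lat
    )


print(get_poi(poi_store, "poi:2")["name"])
print(pois_in_bbox(poi_store, -75.5, 39.5, -75.0, 40.5))
```

With three records the full scan is trivial. With many millions of POIs spread across machines, it is the kind of "complex search" a plain key/value store handles poorly unless a spatial index or a purpose-built extension is added.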

One other point about NoSQL use, per Paul Ramsey, is that it's likely many developers will use NoSQL for spatial functions without really using a database. Instead, they will use a local or hosted service. He offers these examples: storing data in Google Fusion tables, using the SimpleGeo API, and using the spatial types in the Google App Engine. "These are all instances of using them [NoSQL databases] without using them. You see the limitations immediately in the kinds of queries you are able to do (not so many) but you reap the benefit of pushing the infrastructure responsibility off to someone else."

What's the future?
Database guru Michael Stonebraker, as paraphrased by Paul Ramsey, said in 2005(!), we should expect "to see increasing fragmentation of database technology based on use case." So, expect more options in the database realm, not fewer. And, I'll add, expect more options (built in and extensions) to store and manage geospatial data in those new offerings.

The author thanks Paul Ramsey and students of Penn State's Geog897g for helpful input to this article.