Don’t Believe the Hype, It’s a SQL – A Brief Guide to the NoSQL Landscape

SQL. It’s so last century, amirite?

Although it was created around ten years earlier, Structured Query Language (SQL) really began to catch hold in the commercial world during the mid-eighties, and brought with it some great improvements over what had come before.

Probably the most important feature was transactionality, which enabled concurrency and rollback capabilities. SQL also lent itself to ad-hoc business reporting, since queries could be easily written by sales and other departments to gain fresh insights into their company’s data without application programming.

As testament to its usefulness, SQL reigned supreme for over 20 years before it met any serious opposition, mostly owing to how deeply integrated it had become in major business applications.

But I’ve come here to bury Caesar, not to praise him.

Let’s talk NoSQL.

Google Trends says the term has been bandied about on the web since around 2009. But that doesn’t mean everyone was happy with SQL up until then. Google’s Ngram Viewer (which searches a large corpus of books published prior to the web’s existence) says that the phrase “SQL alternative” began appearing in books around 1984 — a full 2 years before it’s first standards release!

Interestingly the trendline for this term takes a nose-dive around the point when “NoSQL” began trending on the web. It turns out that was all due to Twitter.

Back in 2009, a fellow named Johan Oskarsson wanted to set up a meetup to discuss alternatives to SQL. To make sure all the right people showed up, he figured he needed a catchy hashtag. He contacted his pal Eric Evans, who came up with the term. They had no idea they’d be starting a movement!

What’s Wrong With SQL?

Generally speaking, the syntax isn’t that onerous. Complex SQL queries can get pretty ugly, but that’s why God made DBAs; to cope with this kind of stuff. Regardless how you feel about slinging SQL, there are other, more impactful reasons for enterprise to seek an alternative.

The two main issues are…

1. Data Isn’t Stored in the Same Shape as It Is Used

In short, the way we represent and reason about data in our user interfaces is considerably different from the way we store that data in a relational database. With modern UIs, we typically manipulate hierarchies of “objects”, often without a defined schema. But in a relational database, we deal with tables of data related to each other by keys. Schemas are not optional.

This disparity in data organization gives rise to object-relational mapping middleware such as Hibernate. Along with that comes extra management overhead, since changes to the data model require changes not only to the UI and DB, but also to the ORM layer that allows them to communicate. It can also add a certain amount of processing time to each request/response cycle.

2. Clustering is Difficult at Web Scale

Roughly speaking, clustering is an attempt to pull together data from all the rows of all the tables that share a particular database column, so that it will be handy when people ask for it. Until the web super-sized the data of very large enterprises, clustering mostly worked fine.

But, as their databases grew, Google and Amazon both reached a critical mass at which the insurmountable challenges of clustering led them to develop their own SQL alternatives — Bigtable and DynamoDB, respectively.

As they sharded and replicated the data to more and more servers around the world, failures or delays between network nodes caused problems with database availability and consistency.

ACID

BASE

Basically Available (data is never ‘locked’), Soft state (system can change over time even without input, because of…), Eventually consistent (once input stops).

Of the two, BASE definitely scores higher on the contrive-o-meter. And it doesn’t really engender a lot of excitement about what’s on offer. For real insight we need to compare SQL to NoSQL with something called CAP theorem.

CAP Theorem

Like many of the “Iron Triangles” described in various industries, CAP describes a set of three important attributes, of which you can only optimize for two. In this case it is Consistency, Availability, and Partition tolerance.

Unsurprisingly the CA side of the triangle is where SQL lives, because the P vertex is one of the big pressures that started the NoSQL movement to start with — namely that clustering at scale is a pain with relational databases.

Column family databases are tabular and the closest in data model to RDBMSs. It is really the only NoSQL solution that’s managed to stake out territory on the CA edge.

The others are split between AP and CP, based on the priorities chosen by their implementers.

Choosing a database is ultimately about which two vertices of the CAP triangle are most important for your application.

Resources

I hope this article frames the subject well enough get you on your way to exploring the wild and wooly world of NoSQL. The accompanying slide deck mentions several well-known implementations of each NoSQL database type, but is by no means exhaustive.