Blog

I came to databases a little late in life, as a physical scientist I didn’t have much call for them. Then a few years ago I discovered the wonders of relational databases and the power of SQL. The ScraperWiki platform strongly encourages you to save data to SQLite databases to integrate with its tools.

There is life beyond SQL databases much of it evolved in the last few years. I wanted to learn more and a plea on twitter quickly brought me a recommendation for Seven databases in Seven Weeks by Eric Redmond and Jim R. Wilson.

The book covers the key classes of database starting with relational databases in the form of PostgreSQL. It then goes on to look at six further databases in the so-called NoSQL family – all relatively new compared to venerable relational databases. The six other databases fall into several classes: Riak and Redis are key-value stores, CouchDB and MongoDB are document databases, HBase is a columnar database and Neo4J is a graph database.

Relational databases are characterised by storage schemas involving multiple interlinked tables containing rows and columns, this layout is designed to minimise the repetition of data and to provide maximum query-ability. Key-value stores only store a key and a value in the manner of a dictionary but the “value” may be of a complex type. A value can be returned very fast given a key – this is the core strength of the key-value stores. The document stores MongoDB and CouchDB store JSON “documents” rather than rows. These documents can store information in nested hierarchies which don’t necessarily need to all have the same structure this allows maximum flexibility in the type of data to be stored but at the cost of ease of query.

HBase fits into the Hadoop ecosystem, the language used to describe it looks superficially like that used to describe tables in a relational database but this is a bit misleading. HBase is designed to work with massive quantities of data but not necessarily give the full querying flexibility of SQL. Neo4J is designed to store graph data – collections of nodes and edges and comes with a query language particularly suited to querying (or walking) data so arranged. This seems very similar to triplestores and the SPARQL – used in semantic web technologies.

Relational databases are designed to give you ACID (Atomicity, Consistency, Isolation, Durability), essentially you shouldn’t be able to introduce inconsistent changes to the database and it should always give you the same answer to the same query. The NoSQL databases described here have a subtly different core goal. Most of them are designed to work on the web and address CAP (Consistency, Availability, Partition), indeed several of them offer native REST interfaces over HTTP which means they are very straightforward to integrate into web applications. CAP refers to the ability to return a consistent answer, from any instance of the database, in the face of network (or partition) problems. This assumes that these databases may be stored in multiple locations on the web. A famous theorem contends that you can have any two of Consistency, Availability and Partition resistance at any one time but not all three together.

NoSQL databases are variously designed to scale horizontally and vertically. Horizontal scaling means replicating the same database in multiple places to provide greater capacity to serve requests even with network connectivity problems. Vertically scaling by “sharding” provides the ability to store more data by fragmenting the data such that some items are stored on one server and some on another.

I’m not a SQL expert by any means but it’s telling that I learnt a huge amount about PostgreSQL in the forty or so pages on the database. I think this is because the focus was not on the SQL query language but rather on the infrastructure that PostgreSQL provides. For example, it discusses triggers, rules, plugins and specialised indexing for text search. I assume this style of coverage applies to the other databases. This book is not about the nitty-gritty of querying particular database types but rather about the different database systems.

The NoSQL databases generally support MapReduce style queries this is a scheme most closely associated with Big Data and the Hadoop ecosystem but in this instance it is more a framework for doing queries which maybe executed across a cluster of computers.

I’m on a bit of a graph theory binge at the moment so Neo4J was the most interesting to me.

As an older data scientist I have a certain fondness for things that have been around for a while, like FORTRAN and SQL databases, I’ve looked with some disdain at these newfangled NoSQL things. To a degree this book has converted me, at least to the point where I look at ScraperWiki projects and think – “It might be better to use a * database for this piece of work”.

This is an excellent book which was pitched at just the right level for my purposes, I’ll be looking for more Pragmatic Programmers books in future.