An introduction to the NoSQL world

In this blog post the NoSQL world is introduced in the simplest words possible. All main types of NoSQL databases are described, with examples for each, to give a good understanding of the subject.

The world has changed recently, and the NoSQL trend is getting stronger and stronger. Old-fashioned applications that work only with relational databases are rare nowadays. The way applications are developed has changed too: a single-server architecture with heavy clients is no longer the trend. Until 2010 the number of Internet users grew almost linearly, which means more and more server requests. In addition, unstructured data is growing exponentially while structured data grows only linearly.

Why NoSQL? NoSQL databases are built for scalability and availability, offer a flexible data model, and run well on clusters. All these features make them very popular. Architects are moving away from using databases as integration points and towards including a database inside each application. For integration, very light REST web services could be used, for example. These are easier to test and easier to scale, and they make it easier to divide big problems into smaller ones.

Of course, every architect should start by thinking about what kind of database the application really needs and what data model it must support. A relational data model divides information into relations. A relation is a set of records with the same structure, linked to other relations in the system through data stored in the records themselves. Aggregate orientation, on the other hand, takes into account how you want to use the data. NoSQL databases usually store records that are more complex and more denormalized than those in the relational model. With the NoSQL approach you need to know how your application works: the business domain has to be defined so that you can set the boundaries between aggregates, which is not always an easy task.

Let's start with a very simple comparison of NoSQL database types and then go into detail, with examples, for each type mentioned.
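
To make the difference between the relational model and aggregate orientation more concrete, here is a minimal Python sketch. The order data, the table names and both helper functions are hypothetical and exist only for this illustration; no real database is involved.

```python
# Hypothetical order data, used only to contrast the two models.

# Relational view: information is split into relations (tables) whose
# records share one structure and point at each other through key values.
customers = [{"customer_id": 1, "name": "Ann"}]
orders = [{"order_id": 10, "customer_id": 1}]
order_lines = [
    {"order_id": 10, "product": "keyboard", "quantity": 1},
    {"order_id": 10, "product": "mouse", "quantity": 2},
]


def load_order_relational(order_id):
    """Reassemble one order by joining the three relations."""
    order = next(o for o in orders if o["order_id"] == order_id)
    customer = next(
        c for c in customers if c["customer_id"] == order["customer_id"]
    )
    lines = [line for line in order_lines if line["order_id"] == order_id]
    return {
        "order_id": order_id,
        "customer": customer["name"],
        "lines": [
            {"product": line["product"], "quantity": line["quantity"]}
            for line in lines
        ],
    }


# Aggregate view: the order is stored as one denormalized record,
# shaped the way the application reads it.
order_aggregates = {
    10: {
        "order_id": 10,
        "customer": "Ann",
        "lines": [
            {"product": "keyboard", "quantity": 1},
            {"product": "mouse", "quantity": 2},
        ],
    }
}


def load_order_aggregate(order_id):
    """Fetch the whole order with a single lookup."""
    return order_aggregates[order_id]
```

Both functions return the same order, but the aggregate version only works well if the boundary of the aggregate (here, one whole order) matches how the application actually reads the data.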

Key-Value Store – this is the most basic NoSQL type; almost every type below can take its place. But if the application does not need more, developers should consider the simplest solution. This database is easy to use from an API perspective: the client can read, save, or delete data based on a key. The application has to know how to generate this key and how the data is stored in the value part.

Document Databases – databases that manage documents such as XML, JSON and more. These documents can be self-describing tree structures, which can consist of maps or collections. A nice feature is that the documents stored in a document database don’t have to share exactly the same structure, which is just the opposite of a relational database. Every document can be different, with its own set of attributes describing the record.

Column-Family Data Store – in this type of NoSQL database, data is stored in column families that are accessed through row keys or indexes. Column-family data stores can be looked at in two ways. In the row-oriented view, each row is an aggregate whose column families represent different parts of the data. In the column-oriented view, each column family holds one kind of record, and a row is the join of the records in all column families. A column family is a group of related data that is often accessed together, such as user information or order details.

Graph Databases allow you to store entities and the relationships between those entities. Of the types covered in this article, graph databases are the ones that typically support ACID transactions. In graph databases, nodes (entities) have properties, and the relationships between entities are called edges. A well-known graph database is Neo4j.
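
The four types above can be pictured, very roughly, as plain Python data structures. This is only a sketch to show the shape of the data in each model. All keys, field names and values are made up, and real products add indexing, persistence and distribution on top.

```python
# All names and values below are made up for illustration.

# Key-value store: an opaque value looked up by a key the application builds.
key_value = {"user:42": b'{"name": "Ann"}'}

# Document database: self-describing tree structures; two documents in the
# same collection do not need to have the same fields.
documents = [
    {"id": 1, "name": "Ann", "emails": ["ann@example.com"]},
    {"id": 2, "name": "Bob", "address": {"city": "Berlin"}},
]

# Column-family store: row key -> column family -> columns.
column_families = {
    "user:42": {
        "profile": {"name": "Ann", "city": "Berlin"},
        "orders": {"order:10": "keyboard", "order:11": "mouse"},
    }
}

# Graph database: nodes with properties, plus labelled edges between them.
nodes = {1: {"name": "Ann"}, 2: {"name": "Bob"}}
edges = [(1, "FRIEND_OF", 2)]


def neighbours(node_id):
    """Return the names of nodes reachable over one outgoing edge."""
    return [nodes[dst]["name"] for src, _label, dst in edges if src == node_id]
```

Note how the key-value store knows nothing about the inside of its value, while the document and column-family models expose some structure that the database can use when reading only part of a record.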

Key-Value Stores

Let’s start with the most basic of the types above. There are many implementations of key-value stores, such as Redis, Riak, Berkeley DB or Amazon’s Dynamo (we will come back to Dynamo when talking about Cassandra). These databases can be represented as a simple map that allows us to read and write values based on a key. In this section the Redis database is described in a bit more detail.

So how does Redis scale? The first term is partitioning. Partitioning means spreading data across several machines, which allows databases to be much bigger and still fast. Partitioning in Redis can be done in a few ways, and different algorithms bring different benefits. Range partitioning is the simplest type: a range of keys is assigned to each machine. For example, machine1 has the key range from 0 to 1000, machine2 has the key range from 1001 to 2000, and so on. The second type is hash partitioning: a number is calculated for every key, and a modulo operation on that number gives the exact node where the data belongs. There are many types of hashing, and some fit a given solution better than others, so the user needs to be aware of how the application will be used and which solution fits best. In the next sections, when talking about Cassandra, the Consistent Hashing algorithm will be described. In short, Consistent Hashing gives faster recovery when one of the nodes fails.

For node failures Redis has a simple solution: master-slave replication. Slave Redis servers are exact copies of the master. The replication of data is asynchronous, and one master can have multiple slaves. Slaves can also accept connections from other slaves. Replication in Redis is non-blocking on both the master and the slaves, which means that a node keeps serving queries even after it has started the replication process. There are more configuration options in Redis; if you are interested, please have a look inside redis.conf.

Now that we have an idea of how Redis scales and replicates data, and since it is a distributed computer system, let’s introduce a term related to this class of applications: the CAP (Consistency, Availability, Partition tolerance) theorem. It states that a distributed computer system cannot simultaneously provide all three of the following guarantees:
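
The two partitioning schemes described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how Redis itself is implemented: the machine names, the third key range and the helper functions are hypothetical, and CRC32 is used only because it is a deterministic hash available in the standard library.

```python
import zlib

# Assumed cluster layout: three hypothetical machines.
NODES = ["machine1", "machine2", "machine3"]

# Range partitioning: each machine owns a fixed, inclusive range of keys.
RANGES = [
    (0, 1000, "machine1"),
    (1001, 2000, "machine2"),
    (2001, 3000, "machine3"),
]


def range_partition(key):
    """Return the machine whose range contains the numeric key."""
    for low, high, node in RANGES:
        if low <= key <= high:
            return node
    raise KeyError(f"no machine owns key {key}")


def hash_partition(key, nodes=NODES):
    """Hash the key to a number, then take it modulo the node count."""
    # CRC32 is an assumption made for this sketch; it is not a claim
    # about the hash function Redis uses.
    number = zlib.crc32(key.encode("utf-8"))
    return nodes[number % len(nodes)]


def moved_keys(keys, old_nodes, new_nodes):
    """Count keys that land on a different node after the node list changes."""
    return sum(
        hash_partition(k, old_nodes) != hash_partition(k, new_nodes)
        for k in keys
    )
```

Calling `moved_keys` with a list of keys, `NODES`, and `NODES + ["machine4"]` shows the weak spot of plain modulo hashing: when the number of nodes changes, many keys map to a different machine. Reducing that reshuffling is what consistent hashing is designed for.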

Consistency – all data on every node has the same state at the same time

Availability – every request always gets an answer

Partition tolerance – can survive communication breakages in the cluster that separate the cluster into multiple partitions unable to communicate with each other

The next chapters will cover Eventual Consistency and other features of distributed systems, such as the low-level technical details of the protocols and algorithms used for ensuring consistency, saving data on disk, etc.