Friday, February 8, 2013

NoSQL Databases: An Introduction

The explosion of cloud computing in the last decade has shifted data storage away from large, redundant, specialized hardware and toward inexpensive, failure-prone commodity hardware and virtual machines. Coupled with the rise of social networks featuring content-rich internet media, this shift has left the traditional relational database management system (RDBMS) struggling to respond. Whereas data normalization and ACID properties were once considered important to every well-designed system, new web applications may give up ACID in exchange for low latency over massive data sets, leading to new eventually consistent systems of denormalized data. The new breed of "NoSQL" databases used by these applications offers a small subset of the features found in a traditional relational database in exchange for low latency and simple, massive linear scaling. In this series, I am going to explore the basics of NoSQL by examining the architectures typically employed by various NoSQL databases. This first article provides a high-level view of the three major NoSQL architectures. I will explore each architecture in greater detail in future articles.

What is NoSQL?

The term NoSQL originally derives from the limitation of these databases only being accessible through some non-standardized API rather than SQL ("No SQL" systems). However, these systems are better categorized as non-relational: the data they store is not automatically kept referentially consistent, and some systems abandon referential integrity completely. In fact, some NoSQL systems do support a small subset of SQL, leading some in the industry to reinterpret the term as "Not Only SQL".

The significantly smaller subset of analysis and algebraic abilities of a NoSQL system compared to a traditional database management system has also shifted the terminology of these databases from "database" to "data store". Data store is a more accurate description of NoSQL databases as data is often schema-less and not significantly evaluated by most NoSQL systems. In other words, NoSQL systems are dumb in their ability to interpret and join data, relying on the application itself to enforce type restrictions or combine tabular data.
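To illustrate what "relying on the application" can mean in practice, here is a minimal sketch of a join performed in application code. The `users` and `orders` dictionaries are hypothetical stand-ins for two collections in a schema-less store, and the field names are assumptions made up for this example.

```python
# Two hypothetical collections held in a schema-less store. The store only
# maps keys to values and knows nothing about how the records relate.
users = {
    "u1": {"name": "Ada"},
    "u2": {"name": "Grace"},
}
orders = {
    "o1": {"user_id": "u1", "total": "30"},  # total arrives as text
    "o2": {"user_id": "u1", "total": "12"},
    "o3": {"user_id": "u9", "total": "5"},   # dangling reference: no such user
}


def orders_with_user_names(users, orders):
    """Join orders to users in application code, enforcing types ourselves."""
    joined = []
    for order_id, order in sorted(orders.items()):
        user = users.get(order["user_id"])
        if user is None:
            # Nothing enforced referential integrity, so the application
            # decides what to do with orphaned records. Here we skip them.
            continue
        joined.append((order_id, user["name"], int(order["total"])))
    return joined


result = orders_with_user_names(users, orders)
```

Both the type conversion and the handling of the orphaned order are work that a relational database would normally do through column types and foreign keys.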

Architecture

NoSQL data stores are distinguished by several features that generally stand in contrast to traditional relational databases.

Dynamic Data Models: The data models are often very flexible or schema-less in design. Referential integrity is abandoned in favor of restructured or de-normalized data. Furthermore, the attributes or fields available for storing data are typically dynamic.

Replicated and Redundant: Since many NoSQL database systems target the cloud environment, the systems allow for failure to be the norm instead of the exception. In order to accommodate this assumption, data stores typically provide built-in replication and failover abilities with fast fault recovery.

Simple API: The complexity of SQL is abandoned in favor of simple APIs. Some computation and analysis of data is moved outside of the database and onto the client or map-reduce machines which can more easily scale in a cloud environment. The expensive joins that are encouraged by SQL are largely abandoned.

High Throughput: Most NoSQL systems feature simultaneous high throughput reading and writing across their entire dataset. Many systems boast lockless designs that allow for consistent reading and writing speeds regardless of the number of users reading or updating the data.

The linear scaling capabilities of most NoSQL databases are achieved through horizontally scaled "shared nothing" systems, which tend to abandon the idea of ACID transactions. ACID transactions feature the following properties:

Atomicity: The operations that make up a transaction are "all or nothing". If any operation in the transaction fails, then all of the operations must fail. There cannot be a partial transaction.

Consistency: Transactions performed by the database will always leave the data in a consistent state. An operation cannot leave the database in an inconsistent state, but rather the database will move from one consistent state to another.

Isolation: Each transaction will be isolated from other transactions, even if they are operating on the same data. The intermediate state of one transaction is not visible to, and does not affect, any other transaction.

Durability: A committed transaction will remain committed even in the event of a system failure. Thus, if a transaction is committed, it is guaranteed that the data was successfully stored in the database.
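As a concrete illustration of atomicity and consistency, the sketch below uses Python's built-in `sqlite3` module with an in-memory database. The `accounts` table, the `transfer` function, and the rule that balances may not go negative are assumptions invented for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money between two accounts as one all-or-nothing transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected a negative balance.
        return False


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts"))


# Alice only has 100, so the debit fails and the earlier credit is undone.
ok = transfer(conn, "alice", "bob", 500)
```

The credit to bob succeeds on its own, but because the debit violates the constraint, the whole transaction is rolled back and neither balance changes.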

CAP Theorem

Instead of adhering to ACID transactions, NoSQL data stores recognize the CAP theorem, first explored by Professor Eric Brewer of UC Berkeley, which states that it is impossible for a distributed computer system to simultaneously guarantee all three of the properties of consistency, availability, and partition tolerance.

With this idea in mind, a new set of BASE properties has been proposed and recognized by most NoSQL database systems:

Basically Available: The database is basically available such that if some part of the database becomes unavailable, other parts of the database continue to function as expected.

Soft State: Data may depend on ongoing user interaction and can expire after a period of time. The data must be updated or accessed to remain relevant in the system.

Eventually Consistent: Updated data may not be immediately consistent across the entire system, but it will become consistent over time. The data is therefore said to be eventually consistent.
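The eventual consistency property can be sketched with a toy simulation. Everything here is an assumption made for illustration: the `Replica` and `EventuallyConsistentStore` classes, the single version counter, and the last-writer-wins rule. Real systems use more elaborate mechanisms to order and propagate updates.

```python
class Replica:
    """One copy of the data; stores (version, value) per key."""

    def __init__(self):
        self.data = {}

    def apply(self, key, version, value):
        # Last-writer-wins: keep the update only if it is newer than ours.
        current = self.data.get(key)
        if current is None or version > current[0]:
            self.data[key] = (version, value)

    def read(self, key):
        entry = self.data.get(key)
        return None if entry is None else entry[1]


class EventuallyConsistentStore:
    def __init__(self, replica_count=3):
        self.replicas = [Replica() for _ in range(replica_count)]
        self.pending = []  # updates not yet delivered to every replica
        self.clock = 0

    def write(self, key, value):
        # Acknowledge after updating one replica; the others catch up later.
        self.clock += 1
        self.replicas[0].apply(key, self.clock, value)
        self.pending.append((key, self.clock, value))

    def read(self, key, replica_index):
        return self.replicas[replica_index].read(key)

    def sync(self):
        # Deliver all outstanding updates to every replica.
        for key, version, value in self.pending:
            for replica in self.replicas:
                replica.apply(key, version, value)
        self.pending.clear()


store = EventuallyConsistentStore()
store.write("status", "online")
stale = store.read("status", replica_index=2)  # replica 2 has not seen the write
store.sync()
fresh = store.read("status", replica_index=2)  # now it has
```

Between the write and the call to `sync`, a reader may see old data depending on which replica answers. After propagation, every replica returns the same value.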

NoSQL Data Store Categories

The NoSQL database systems are typically scalable, distributed, persistent systems. However, the systems can still be categorized by their data model and the level of data-awareness in the database as well as their robustness. NoSQL data stores can be separated into three distinct data model categories:

Key-Value Stores

The simplest type of NoSQL data store is the key-value data store. A key-value store saves a value under a key, and the value can later be retrieved using that key. The values are uninterpreted, arbitrary data that the application can interpret as it sees fit. This simple and completely schemaless data model allows for easy scaling and very simple APIs that can easily enhance the functionality of existing systems.
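A minimal in-memory sketch of this model follows. The `KeyValueStore` class and its `put`, `get`, and `delete` methods are hypothetical names chosen for illustration, not the API of any particular product.

```python
import json


class KeyValueStore:
    """Toy in-memory key-value store; values are opaque bytes."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)


store = KeyValueStore()
# The application chooses the encoding; the store never looks inside the value.
payload = json.dumps({"user": "ada", "cart": [3, 7]}).encode("utf-8")
store.put("session:42", payload)
session = json.loads(store.get("session:42").decode("utf-8"))
```

Because the store treats each value as an opaque blob, the whole API is three operations, and any interpretation of the bytes happens in the application.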

Document Stores

Document data stores feature the ability to store documents into a database. Documents consist of one or more named fields that are self-contained in each document. Most document stores offer different types of fields as well as special nested fields such as lists or arrays. The structure of documents (their fields) is dynamic and can be freely modified by the client with the ability to add or remove fields of existing documents. Since the documents are self-contained, their fields are usually stored and distributed together as individual documents in the backend storage system.
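The document model described above can be sketched as follows, using plain dictionaries as documents. The `DocumentStore` class, its method names, and the sample fields are assumptions for illustration only.

```python
import copy


class DocumentStore:
    """Toy document store: each document is a self-contained dict of fields."""

    def __init__(self):
        self._docs = {}

    def insert(self, doc_id, document):
        self._docs[doc_id] = copy.deepcopy(document)

    def get(self, doc_id):
        return copy.deepcopy(self._docs[doc_id])

    def set_field(self, doc_id, field, value):
        # Fields can be added to an existing document at any time.
        self._docs[doc_id][field] = value

    def remove_field(self, doc_id, field):
        self._docs[doc_id].pop(field, None)

    def find(self, field, value):
        # The store is aware of fields, so it can match on their contents.
        return sorted(
            doc_id for doc_id, doc in self._docs.items() if doc.get(field) == value
        )


store = DocumentStore()
store.insert("post:1", {"title": "Intro", "tags": ["nosql", "cloud"], "author": "ada"})
store.insert("post:2", {"title": "Follow-up", "author": "ada", "draft": True})
store.set_field("post:1", "views", 10)    # add a field to one document only
store.remove_field("post:2", "draft")     # drop a field from another
by_ada = store.find("author", "ada")
```

The two documents never share a declared schema: one has a nested list of tags and a view count, the other has neither.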

Extensible Record Stores

The most powerful of the three categories is the extensible record store, also known as the wide-column data store. These data stores feature data models that resemble the tables found in a traditional database but with several important differences. Unlike a traditional database, the columns in an extensible record store are dynamic and can be added or modified during operation. Furthermore, different rows that are part of the same table may use different columns. Column names may also be arbitrary and can themselves be treated as values. Lastly, some NoSQL data stores feature the ability to have columns that contain other columns.

The storage of extensible record stores typically groups together a row key with one or more columns. Thus, the same row key may be repeated and stored in different locations with different columns but still available as a single table in the data store. The storage is typically column-based instead of row-based, meaning that columns of data are stored together rather than rows of data as in a traditional database. Thus, the rows of a given table could be scattered across many servers.
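A toy sketch of this layout is shown below. The `WideColumnTable` class is hypothetical, and storing each column as its own dictionary is a simplification of the column-based grouping described above.

```python
from collections import defaultdict


class WideColumnTable:
    """Toy extensible record store, stored column by column."""

    def __init__(self):
        # column name -> {row key -> value}. The row key is repeated in
        # every column it uses, and each column could live on its own server.
        self._columns = defaultdict(dict)

    def put(self, row_key, column, value):
        self._columns[column][row_key] = value

    def get_row(self, row_key):
        # Reassemble a logical row by collecting its cells from every column.
        return {
            column: cells[row_key]
            for column, cells in self._columns.items()
            if row_key in cells
        }

    def get_column(self, column):
        return dict(self._columns.get(column, {}))


table = WideColumnTable()
table.put("user:1", "name", "Ada")
table.put("user:1", "email", "ada@example.com")
table.put("user:2", "name", "Grace")
# A column name can carry data itself, here the id of a follower.
table.put("user:2", "follower:user:1", "2013-02-08")
```

Reading a whole column touches a single dictionary, while reading a whole row has to gather cells from several, which mirrors the trade-off of column-based storage.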