Sorry

"The more alternatives, the more difficult the choice." -- Abbe' D'Allanival

In 2010, when the world became enchanted by the capabilities of cloud systems and new databases designed to serve them, a group of researchers from Yahoo decided to look into NoSQL. They developed the YCSB framework to assess the performance of new tools and find the best cases for their use. The results were published in the paper, "Benchmarking Cloud Serving Systems with YCSB."

The Yahoo guys did a great job, but like any paper, it could not include everything:

● The research did not provide all the information we needed for our own analysis. Cassandra, HBase, Yahoo's PNUTS, and a simple sharded MySQL implementation were analyzed, some of the databases we often work with were not covered.

● Though

● Yahoo used high-performance hardware, while it would be more useful for most companies to see how these databases perform on average hardware.

As R&D engineers at Altoros Systems Inc., a big data specialist, we were inspired by Yahoo's endeavors and decided to add some effort of our own. This article is our vendor-independent analysis of NoSQL databases, based on performance measured under different system workloads.

What makes this research unique?

Often referred to as NoSQL, non-relational databases feature elasticity and scalability in combination with a capability to store big data and work with cloud computing systems, all of which make them extremely popular. NoSQL data management systems are inherently schema-free (with no obsessive complexity and a flexible data model) and eventually consistent (complying with BASE rather than ACID). They have a simple API, serve huge amounts of data and provide high throughput.

In 2012, the number of NoSQL products reached 120-plus and the figure is still growing. That variety makes it difficult to select the best tool for a particular case. Database vendors usually measure productivity of their products with custom hardware and software settings designed to demonstrate the advantages of their solutions. We wanted to do independent and unbiased research to complement the work done by the folks at Yahoo.

Using Amazon virtual machines to ensure verifiable results and research transparency (which also helped minimize errors due to hardware differences), we have analyzed and evaluated the following NoSQL solutions:

● Cassandra, a column family store MongoDB, a document-oriented database

● HBase (column-oriented, too)

●

● Riak, a key-value store

We also tested MySQL Cluster and sharded MySQL, taking them as benchmarks.

After some of the results had been presented to the public, some observers said MongoDB should not be compared to other NoSQL databases because it is more targeted at working with memory directly. We certainly understand this, but the aim of this investigation is to determine the best use cases for different NoSQL products. Therefore, the databases were tested under the same conditions, regardless of their specifics.

Tools, libraries and methods

For benchmarking, we used Yahoo Cloud Serving Benchmark, which consists of the following components:

● a framework with a workload generator

● a set of workload scenarios

We have measured database performance under certain types of workloads. A workload was defined by different distributions assigned to the two main choices:

• which operation to perform

• which record to read or write

Operations against a data store were randomly selected and could be of the following types:

• Insert: Inserts a new record. Update: Updates a record by replacing the value of one field. Read: Reads a record, either one randomly selected field, or all fields. Scan: Scans records in order, starting at a randomly selected record key. The number of records to scan is also selected randomly from the range between 1 and 100.

•

•

•

Each workload was targeted at a table of 100,000,000 records; each record was 1,000 bytes in size and contained 10 fields. A primary key identified each record, which was a string, such as "user234123." Each field was named field0, field1, and so on. The values in each field were random strings of ASCII characters, 100 bytes each.

Database performance was defined by the speed at which a database computed basic operations. A basic operation is an action performed by the workload executor, which drives multiple client threads. Each thread executes a sequential series of operations by making calls to the database interface layer both to load the database (the load phase) and to execute the workload (the transaction phase). The threads throttle the rate at which they generate requests, so that we may directly control the offered load against the database. In addition, the threads measure the latency and achieved throughput of their operations and report these measurements to the statistics module.

The performance of the system was evaluated under different workloads:

Workload A: Update heavily

Workload B: Read mostly

Workload C: Read only

Workload D: Read latest

Workload E: Scan short ranges

Workload F: Read-modify-write

Workload G: Write heavily

Each workload was defined by:

1) The number of records manipulated (read or written)

2) The number of columns per each record

3) The total size of a record or the size of each column

4) The number of threads used to load the system

This research also specifies configuration settings for each type of the workloads. We used the following default settings:

1) 100,000,000 records manipulated

2) The total size of a record equal to 1Kb

3) 10 fields of 100 bytes each per record

4) Multithreaded communications with the system (100 threads)

Testing environment

To provide verifiable results, benchmarking was performed on Amazon Elastic Compute Cloud instances. Yahoo Cloud Serving Benchmark Client was deployed on one Amazon Large Instance:

Amazon is often blamed for its high I/O wait time and comparatively slow EBS performance. To mitigate these drawbacks, EBS disks had been assembled in a RAID0 array with stripping and after that they were able to provide up to two times higher performance.

The results

When we started our research into NoSQL databases, we wanted to get unbiased results that would show which solution is best suitable for each particular task. That is why we decided to test performance of each database under different types of loads and let the users decide what product better suits their needs.

We started with measuring the load phase, during which 100 million records, each containing 10 fields of 100 randomly generated bytes, were imported to a four-node cluster.

HBase demonstrated by far the best writing speed. With pre-created regions and deferred log flush enabled, it reached 40K ops/sec. Cassandra also showed great performance during the loading phase with around 15K ops/sec. The data is first saved to the commit log, using the append method, which is a fast operation. Then it is written to a per-column family memory store called a Memtable. Once the Memtable becomes full, the data is saved to disk as an SSTable. In the "just in-memory" mode, MySQL Cluster could show much better results, by the way.

* Workload A: Update-heavily mode. Workload A is an update-heavily scenario that simulates the database work, during which typical actions of an e-commerce solution user are recorded. Settings for the workload:

1) Read/update ratio: 50/50

2) Zipfian request distribution

During updates, HBase and Cassandra went far ahead from the main group with the average response latency time not exceeding two milliseconds. HBase was even faster. HBase client was configured with AutoFlush turned off. The updates aggregated in the client buffer and pending writes flushed asynchronously, as soon as the buffer became full. To accelerate updates processing on the server, the deferred log flush was enabled and WAL edits were kept in memory during the flush period.

Cassandra wrote the mutation to the commit log for the transaction purposes and then an in-memory Memtable was updated. This is a slower, but safer scenario if compared to the HBase deferred log flushing.

* Workload A: Read. During reads, per-column family compression provides HBase and Cassandra with faster data access. HBase was configured with native LZO and Cassandra with Google's Snappy compression codecs. Although the computation ran longer, the compression reduces the number of bytes read from the disk.

* Workload B: Read-heavy mode. Workload B consisted of 95% of reads and 5% of writes. Content tagging can serve as an example task matching this workload; adding a tag is an update, but most operations imply reading tags. Settings for the "read-mostly" workload:

1) Read/update ratio: 95/5

2) Zipfian request distribution

Sharded MySQL showed the best performance in reads. MongoDB -- accelerated by the "memory mapped files" type of cache -- was close to that result. Memory-mapped files were used for all disk I/O in MongoDB. Cassandra's key and row caching enabled very fast access to frequently requested data. With the off-heap row caching feature added in Version 0.8, it showed excellent read performance while using less per-row memory. The key cache held locations of keys in memory on a per-column family basis and defined the offset for the SSTable where the rows were stored. With a key cache, there was no need to look for the position of the row in the SSTable index file. Thanks to the row cache, we did not have to read rows from the SSTable data file. In other words, each key cache hit saved us one disk seek and each row cache hit saved two disk seeks. In HBase, random read performance was slower. However, Cassandra and HBase can provide faster data access with per-column-family compression.

* Workload B: Update. Thanks to deferred log flushing, HBase showed very high throughput with extremely small latency under heavy writes. With deferred log flush turned on, the edits were first committed to the memstore. Then the aggregated edits were flushed to HLog asynchronously. On the client side, HBase write buffer cached writes with the autoFlush option set to true, which also improved performance greatly. For security purposes, HBase confirms every write after its write-ahead log reaches a particular number of in-memory HDFS replicas. HBase's write latency with memory commit was roughly equal to the latency of data transmission over the network. Cassandra demonstrated great write throughput, since it first writes to the commit log -- using the append method, which is a pretty fast operation -- and then to a per-column-family memory store called Memtable.

* Workload C: Read-only. Settings for the workload:

1) Read/update ratio: 100/0

2) Zipfian request distribution

This read-only workload simulated a data caching system. The data was stored outside the system, while the application was only reading it. Thanks to B-tree indexes, sharded MySQL became the winner in this competition.

* Workload E: Scanning short ranges. Settings for the workload:

1) Read/update/insert ratio: 95/0/5

2) Latest request distribution

3) Max scan length: 100 records

4) Scan length distribution: uniform

HBase performed better than Cassandra in range scans. HBase scanning is a form of hierarchical fast merge-sort operation performed by HRegionScanner. It merges the results received from HStoreScanners (one per family), which, in their turn, merge the results received from HStoreFileScanners (one for each file in the family). If caching is turned on, the server will simply provide the number of specified records instead of bouncing back to the HRegionServer to process every record.

Cassandra's scan performance with a random partitioner has improved considerably compared to Version 0.6, where this feature was initially introduced.

Cassandra's SSTable is a sorted strings table that can be described as a file of key-value string pairs sorted by keys and key-value pairs written in a particular order. To achieve maximum performance during range scans, we had to use an order-preserving partitioner. Scanning over ranges of order-preserving rows is super-fast. It is similar to moving a cursor through a continuous index. However, the database cannot distribute individual keys and corresponding rows over the cluster evenly, thus a random partitioner is used to ensure even data distribution. This is the default partitioning strategy in Cassandra. Random partitioning ensures good load balancing and provides some additional speed in range scans with an order preserving partitioner.

In MongoDB 2.5, the table scan triggered by the { "_id":{"$gte": startKey}} query showed a maximum throughput of 20 ops/sec with a latency of ≈ 1 sec.