This is the second article on SSDs and the performance implications of switching to them. In the first article, Will the use of SSD increase the speed of DBMS?, we reviewed how much performance gain we would see by switching from HDD to SSD for a DBMS like CUBRID, and what factors affect that performance.

In this article, I will compare the SSD structure to that of HDD and discuss how software architecture is changing.

Overview

Last July Amazon released a new cloud product, High I/O Quadruple Extra Large, which ships with 2 TB of Solid-State Drive (SSD) storage. Even cloud products have now started using SSDs, which means SSD prices have become low enough for cloud adoption, and the operational know-how for running them in cloud services is established. Can we assume, then, that the era of the SSD has arrived?

A Solid-State Drive uses NAND flash memory, and its operation differs from that of a Hard Disk Drive. To develop a high-performance storage system, including a database, we needed to fully understand the HDD's structure. Do we need an equally deep understanding of the SSD's structure?

The answer is yes. As SSDs become popular, software architecture is changing: designs built around HDD characteristics are being rethought around SSD characteristics.

Introduction to the Experiences of EC2 Users

Below I will summarize some of the user reports on Amazon's High I/O instances posted on the High Scalability blog.

Case #1: Conversocial

Conversocial, a social CRM firm, was running MongoDB when the High I/O product was released and migrated it to the new instances. Compared to the previous HDD-based setup, the average response time fell by 43% (from 392 ms to 274 ms), the response time for large random-I/O requests fell by 74% (from 877 ms to 504 ms), and iowait dropped from 90% to 3%.

Case #2: Netflix

Netflix has also posted a long article about SSDs. In its previous configuration, performance was 100K IOPS or 1 GB/sec. After moving to hi1.4xlarge, the average read response time dropped from 10 ms to 2.2 ms, and the 99th-percentile response time dropped from 65 ms to 10 ms.

Case #3: MySQL on EC2

SSD adoption does not guarantee enhanced performance all the time, however; the High Scalability page also carries a report of MySQL on EC2 making this point.

Implications

If you use HDDs, you should provision enough memory to hold the working set. If the working set exceeds available memory, you must add machines for horizontal partitioning.

SSDs, on the other hand, can reduce the required memory and delay the introduction of horizontal partitioning. For example, Netflix removed the memcached layer from a 48-instance deployment and replaced it with 15 instances that use no cache at all; in the end, the DBMS needed no additional cache to reduce I/O time.

Combined with the flexibility of EC2, SSDs can also cut costs. Consider a large genome-mapping index: the index computation runs intensively for only a few hours per week and is idle the rest of the time. Without EC2, you must purchase expensive equipment, use it occasionally, and leave it unused most of the time. With EC2 you skip the purchase and pay only for what you use.

The popularization of SSDs both cuts costs and improves performance, and the cost-cutting effect is maximized when SSDs are combined with cloud products.

Characteristic Comparisons between SSD and HDD IO

An SSD is non-volatile memory that operates electronically. Compared to HDDs, SSDs provide lower seek times, fewer mechanical delays, and lower fault rates, so random reads are very fast. Random writes, however, carry an overhead, because an SSD cannot overwrite data in place. An SSD is organized hierarchically into blocks, pages, rows, and cells. Data is written by flipping some cells to 0 after all cells in the block have been initialized to 1. Restoring a 0 back to 1 requires a high voltage, and that voltage disturbs neighboring cells, so erasing can only be done a whole block at a time. Writing data therefore takes two stages: erase, then program.

An SSD writes by page, and the number of writes a page can endure is limited (write endurance). When the endurance is exceeded, the block fails. To manage this, SSD manufacturers use the following techniques in their firmware:

Wear Leveling

Uses blocks evenly to extend lifetime: the firmware counts the writes per block, and selects the block with the lowest count for the next write.
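The block-selection policy can be sketched as follows (a simplification for illustration; real firmware also remaps logical addresses to physical blocks and persists erase counts):

```python
# Simplified wear-leveling sketch: always pick the block with the fewest
# writes so wear spreads evenly across the device. This only models the
# selection policy described above, not a full flash translation layer.

class WearLeveler:
    def __init__(self, num_blocks):
        self.erase_counts = [0] * num_blocks  # writes/erases per block

    def pick_block(self):
        # Choose the least-worn block for the next write.
        block = min(range(len(self.erase_counts)),
                    key=self.erase_counts.__getitem__)
        self.erase_counts[block] += 1
        return block

wl = WearLeveler(4)
writes = [wl.pick_block() for _ in range(8)]
print(writes)            # [0, 1, 2, 3, 0, 1, 2, 3]
print(wl.erase_counts)   # [2, 2, 2, 2] -- wear spreads evenly
```

Because the least-worn block is always chosen, no block accumulates writes faster than the others, which is exactly the lifetime-extension goal described above.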

Write Gathering

If an erased empty block is available, pages still in use are gathered into it so that their original blocks can be emptied. In addition, metadata must be updated every time data is written; when write positions are scattered, metadata updates take longer, sometimes almost as long as an HDD seek. Therefore, to reduce write overhead, several write requests are gathered and executed at once.

Garbage Collection

The write unit is the page, but the erase unit is the block. Among all blocks, the one with the largest number of unused pages is initialized (erased). It is a kind of defragmentation: first back up the valid pages of the block, initialize the block into an empty block, then push the valid pages backed up from another block into the empty one.
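The collection step can be sketched as follows (a toy model of the policy described above, not any vendor's firmware; page contents are plain strings):

```python
# Simplified SSD garbage-collection sketch: pick the block with the most
# invalid (unused) pages, copy its still-valid pages into an empty block,
# then erase the victim so it becomes writable again.

def collect(blocks, free_block):
    # blocks: dict block_id -> list of (data, valid) pages
    victim = max(blocks, key=lambda b: sum(not valid for _, valid in blocks[b]))
    # Back up the valid pages into the free block...
    survivors = [(data, True) for data, valid in blocks[victim] if valid]
    blocks[free_block] = survivors
    # ...then erase the victim block.
    blocks[victim] = []
    return victim

blocks = {
    0: [("a", True), ("b", False), ("c", False), ("d", False)],
    1: [("e", True), ("f", True), ("g", False), ("h", True)],
}
erased = collect(blocks, free_block=2)
print(erased)      # 0 (block 0 had the most invalid pages)
print(blocks[2])   # [('a', True)]
```

The victim choice (most invalid pages) minimizes the number of valid pages that must be copied, which is why firmware prefers it.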

Firmware behavior differs by manufacturer; it is the competitive element of an SSD, so manufacturers keep the details secret and segment their products into two grades: the lower-end Multi-Level Cell (MLC) and the higher-end Single-Level Cell (SLC). The two differ significantly in reliability and speed.

In terms of IOPS, an SSD generally delivers about 120,000 random reads and 10,000~85,000 random writes, whereas a 15,000 RPM HDD generally delivers about 175~210 IOPS. An SSD is therefore roughly 50~600 times faster than an HDD.

An SSD has no seek time, so it provides faster I/O than an HDD; however, random writes carry the overhead discussed earlier. Each manufacturer's write-optimization method is proprietary, so the exact I/O pattern can only be determined by testing each product. Still, the general behavior can be anticipated to some extent. The next section shows how to optimize I/O under these SSD constraints.

Attempts to Optimize IO on SSD

An HDD is significantly slower than memory: a 10,000 RPM HDD differs from DDR3-2500 in performance by roughly a factor of 800. Over the decades of HDD use, many techniques have been developed to improve storage-system performance. For example, when the OS and device driver receive I/O requests, they schedule the requests to minimize seek time. With an SSD, they skip such scheduling because the SSD firmware performs it itself.

Systems such as a DBMS implement a buffer (cache) to minimize I/O against HDDs. As a result, when the hit rate is already high, replacing an HDD with an SSD may improve performance only slightly. With an SSD, however, the buffer no longer needs to be large, so you may not need to provision as much memory.

In addition, a variety of techniques write data compactly to minimize movement of the HDD head. These become unnecessary because SSD random reads are fast.

The inner workings of SSD firmware are not public, so it is difficult to model the I/O pattern at the OS or device-driver level.

At the application level, however, architectures are being redesigned around the fact that SSD random reads and sequential writes are fast while random writes are slow. Let's see how the architecture is changing.

Log Structured File System

As server memory capacity increases, random reads occur less often than before thanks to the memory cache, so random writes become the important performance issue. An HDD performs in-place writes: as random I/O grows, the head must move more, increasing seek time and degrading performance. One way to compensate is to log only the write history and apply the actual data writes later; this is called a journaling (or logging) file system. Going further, you can avoid in-place writes entirely by logging the data itself instead of just the history. This is called a log-structured file system (LFS).

When writing data, the LFS reads the previous version, creates the latest version, and appends it to the end of a file.


Figure 2: Architecture of Log Structured File System (source).

At this point the previous version is marked as empty space. Since the new version is always appended to the end of the file, only sequential writes are needed, and performance is good at first. However, as empty space accumulates, data becomes fragmented, and gathering the fragments incurs overhead. LFS writes files in sequential chunks called segments; to reduce the metadata-update overhead incurred on every write, several writes are collected and written at once. Since most segments end up only partially live, a garbage-collection task (GC) runs to create empty segments. This closely mirrors the efficient write path of an SSD, where data is gathered and written to sequential space, with defragmentation creating empty space.
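The append-only update rule can be sketched as a toy key-value log (a simplification; a real LFS stores inodes and segments, and the "disk" here is just a Python list):

```python
# Toy log-structured store: every update appends a new version to the end
# of the log; an in-memory index maps each key to its latest log position.
# Old versions remain in the log as garbage until a cleaner reclaims them.

class LogStore:
    def __init__(self):
        self.log = []     # append-only "disk"
        self.index = {}   # key -> offset of the latest version

    def write(self, key, value):
        self.index[key] = len(self.log)
        self.log.append((key, value))   # sequential write only

    def read(self, key):
        return self.log[self.index[key]][1]

    def live_fraction(self):
        # Fraction of log records still referenced; the rest is garbage
        # that a segment cleaner would eventually reclaim.
        return len(self.index) / len(self.log)

s = LogStore()
s.write("inode1", "v1")
s.write("inode1", "v2")   # new version appended; old one becomes garbage
print(s.read("inode1"))   # v2
print(s.live_fraction())  # 0.5
```

Note that the overwrite never touches the old record: the cost of pure sequential writes is the garbage left behind, which is exactly why the segment cleaner exists.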

On an SSD whose firmware does not itself perform wear leveling and write gathering, such a file system is even more effective. The Journalling Flash File System 2 (JFFS2), frequently used on embedded devices, is one example.

On the performance side, LFS can in some cases reduce disk usage by 40% compared to conventional file systems. However, LFS spends much of its time gathering segments, and performance drops while segments are periodically being cleaned.

B-Tree Performance and Copy on-write B-Tree

The B-Tree is a data structure frequently used by file systems and databases (e.g., CUBRID). Does the B-Tree work well on an SSD? The following graph shows B-Tree insert performance measured on an Intel X25-M SSD under three workloads.


Figure 3: Performance of B-Tree Insert on SSD (source).

The red line is random writes of 4 KB, the green line is sequential writes of 4 KB, and the blue line is random writes of 512 KB. At 4 KB it is natural that sequential writes perform best. What is interesting is that write performance drops sharply at 512 KB once the volume written exceeds the device capacity (160 GB). Why does performance degrade once the device has been filled? The reason is that most SSDs are log-structured internally, for wear leveling and error correction. An erase block is nominally 512 KB, but the latest MLC devices erase at even larger granularity, so a single write can cause the firmware to erase hundreds of megabytes.

In any case, an SSD performs sequential writes better than random writes. Is there a B-Tree variant that writes only sequentially? A copy-on-write (CoW) B-Tree copies the path through the tree: whenever nodes change, the traversed nodes are copied and then modified to create a new tree. Consider the two steps:

(1) Traverse the tree to find the key.

(2) Rewrite the path.

Step (1) incurs random I/O (random reads), although with a cache no I/O may occur at all. Step (2), on the other hand, consists entirely of sequential writes: because existing nodes are immutable, only appends occur (append only). This method is already used in file systems such as ZFS, WAFL, and Btrfs.
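The path-rewrite behavior can be sketched with immutable nodes in an append-only log (a toy model, not the actual ZFS/WAFL/Btrfs implementation; the node layout and offsets are invented for illustration):

```python
# Copy-on-write tree sketch: nodes are immutable and live in an append-only
# log. Changing a leaf appends a new leaf, then new copies of every node on
# the path up to the root. All writes are sequential appends.

log = []  # append-only storage; a node is (payload, child_offsets)

def append(node):
    log.append(node)
    return len(log) - 1  # offset of the new node

# Build a tiny two-level tree: root -> [leaf_a, leaf_b]
leaf_a = append(("A", []))
leaf_b = append(("B", []))
root = append(("root", [leaf_a, leaf_b]))

def update_leaf(root_off, child_idx, new_payload):
    payload, children = log[root_off]
    new_leaf = append((new_payload, []))      # copy + modify the leaf
    new_children = list(children)
    new_children[child_idx] = new_leaf
    return append((payload, new_children))    # rewrite the path (new root)

new_root = update_leaf(root, 0, "A2")
print(log[new_root])   # ('root', [3, 1]) -- new root shares old leaf_b
print(log[root])       # ('root', [0, 1]) -- the old tree is still intact
```

Two properties fall out of this: every write is an append (SSD-friendly), and old versions of the tree remain readable, which is also why CoW trees suffer the space blowup discussed below.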


Figure 4: CoW B-Tree Structure (source).

Does the CoW B-Tree, then, work well with the log-structured storage of an SSD? Basically yes, since it is append only. However, there are two issues: space blowup and garbage collection.

A CoW B-Tree can create a large space blowup. If a 16-byte key/value pair is saved in an index of depth 3 with a block size of 256 KB, the write must cover 3 * 256 KB = 768 KB: compared to the small logical change, far more data is written. And the greater the space blowup, the more work GC must do. Since GC runs when I/O is light, sustained load can degrade performance.
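The arithmetic above can be checked directly; the figures (16-byte entry, depth-3 index, 256 KB blocks) are the ones from the example:

```python
# Space blowup of a CoW B-Tree update: every node on the root-to-leaf path
# is rewritten, so a tiny logical change costs depth * block_size on disk.

entry_size = 16            # bytes actually changed
depth = 3                  # levels rewritten per update
block_size = 256 * 1024    # 256 KB nodes

bytes_written = depth * block_size
print(bytes_written // 1024)         # 768 (KB written per update)
print(bytes_written // entry_size)   # 49152x write amplification
```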

Fractal Tree

In a B-Tree, sequential inserts are fast because locality is optimal: only specific pages change. Random inserts, however, create high entropy. At first most pages are in memory, but over time the chance that a page touched during a search resides on disk grows (aging). Performance is especially poor when leaf pages are scattered across the disk.

In the DAM (Disk Access Machine) model, I/O between memory and storage is performed in block-size units, and systems are generally tuned by finding the optimal block size. A B-Tree lookup, however, may use only one key while loading a whole block from disk, consuming a lot of I/O bandwidth. To optimize I/O, the cache-oblivious model is sometimes used instead: since the optimal block size cannot be estimated, the I/O size is determined by the algorithm itself. In addition, as with LFS, reducing random I/O calls for append-style writes.

The Fractal Tree builds on these ideas. Data is kept in several arrays whose sizes grow exponentially. An insert first lands in the smallest array; filled arrays are merged into a larger array, and during the merge entries are sorted by key. This structure is called a Doubling Array.

Figure 5: Data Insert Process in Fractal Tree (source).
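The Doubling Array insert works like binary addition over levels, and can be sketched as follows (a simplification; real implementations buffer and merge lazily):

```python
# Doubling-array sketch: level i holds a sorted array of size 2**i.
# An insert goes to level 0; when two arrays collide at a level, they are
# merged (keeping keys sorted) and carried to the next level, exactly like
# a carry in binary addition.

import heapq

def insert(levels, key):
    carry = [key]
    i = 0
    while True:
        if len(levels) <= i:
            levels.append(None)
        if levels[i] is None:      # free slot: place the array here
            levels[i] = carry
            return
        # Merge the two sorted arrays and carry the result upward.
        carry = list(heapq.merge(levels[i], carry))
        levels[i] = None
        i += 1

levels = []
for k in [5, 3, 8, 1]:
    insert(levels, k)
print(levels)   # [None, None, [1, 3, 5, 8]] -- four keys merged to level 2
```

Each merge is a sequential scan of two sorted arrays, which is why inserts translate into sequential writes on disk.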

Each time an array grows, the merge is a sequential write, so inserts are fast. Searching for a key, however, requires a binary search per array, which is slower than a B-Tree. To improve this, forward pointers are added and the tree is organized into levels, a technique called Fractional Cascading.

Figure 6: Fractional Cascading.

This method is used by TokuDB, one of the MySQL storage engines. TokuDB is a storage engine aimed at write-intensive workloads.

Stratified B-Tree

The weak points of the CoW B-Tree are space blowup and the performance degradation caused by GC. The Stratified B-Tree compensates for them. It is a versioned dictionary and, like the Fractal Tree, follows the cache-oblivious model: a hierarchical tree built on Doubling Arrays with forward pointers. The difference is that each key carries a version.

When a key is changed, the existing version is kept, and entries take the form {key, version, value_or_pointer}. Incoming inserts are buffered in memory, sorted, and flushed into the lowest-level array. As arrays grow, arrays at the same level may come to hold duplicate key versions; such arrays are promoted (merged) and then demoted (split into disjoint sets) to avoid version duplication. Demotion is driven by a notion called density: the fraction of keys in an array that are live for a given version v. When density is low, a range query on version v must scan many unnecessary keys; when density is high, most keys satisfy the query, but raising density requires duplicating data, which enlarges the space blowup. In short, the Stratified B-Tree promotes arrays so that versions do not diverge across arrays at a level, then demotes high-density arrays to improve range-query performance.
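Density can be illustrated with a toy versioned array (a sketch based on the description above; the entry layout {key, version, value} follows the text, everything else is invented for illustration):

```python
# Density sketch for a versioned array: the fraction of entries that are
# "live" at version v, i.e. the newest version <= v for their key. Low
# density means a range query at version v must skip many stale entries.

def density(entries, v):
    # entries: list of (key, version, value); several versions per key.
    visible = {}
    for key, version, _ in entries:
        if version <= v and version > visible.get(key, -1):
            visible[key] = version
    live = sum(1 for key, version, _ in entries
               if visible.get(key) == version)
    return live / len(entries)

entries = [("a", 1, "x"), ("a", 2, "y"), ("b", 1, "z"), ("b", 3, "w")]
print(density(entries, v=3))   # 0.5 -- half the entries are stale at v3
print(density(entries, v=1))   # 0.5 -- only the version-1 entries are live
```

In this sketch, demoting would mean copying only the live-at-v entries into their own array, raising its density to 1.0 at the cost of duplicated data.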

This tree gives slower point-query performance but is well suited to analytic and range queries over big data. Acunu is a storage platform implemented with this method; it is used with NoSQL systems including Cassandra.

Changes in Software Architecture Caused by SSD

So far, I have shown how the software architecture of storage systems is changing in the era of SSDs. Once SSDs become commonplace, the products leading the market will be those whose architecture suits SSDs. No one knows whether the winners of the HDD era will keep their dominant positions or a new hero will appear; methods that succeeded on HDDs are not necessarily suitable for SSDs. It is fascinating to imagine the changes this ongoing evolution will bring.

By Hyejeong Lee, Senior Software Engineer at Storage System Development Team, NHN Corporation.