Coding for SSDs – Part 2: Architecture of an SSD and Benchmarking

This is Part 2 of 6 of “Coding for SSDs”, covering Sections 1 and 2. For other parts and sections, you can refer to the Table of Contents. This is a series of articles that I wrote to share what I learned while researching SSDs and how to make code perform well on them. If you’re in a rush, you can also go directly to Part 6, which summarizes the content of all the other parts.

In this part, I explain the basics of NAND-flash memory, cell types, and basic SSD internal architecture. I also cover SSD benchmarking and how to interpret those benchmarks.

To receive a notification email every time a new article is posted on Code Capsule, you can subscribe to the newsletter by filling out the form at the top right corner of the blog. As usual, comments are open at the bottom of this post, and I am always happy to welcome questions, corrections and contributions!

1. Structure of an SSD

1.1 NAND-flash memory cells

A solid-state drive (SSD) is a flash-memory based data storage device. Bits are stored in cells, which are made of floating-gate transistors. SSDs are made entirely of electronic components: there are no moving or mechanical parts like in hard drives.

Voltages are applied to the floating-gate transistors, which is how bits are read, written, and erased. Two solutions exist for wiring transistors: NOR flash memory and NAND flash memory. I will not go into more detail regarding the difference between NOR and NAND flash memory. This article only covers NAND flash memory, which is the solution chosen by the majority of manufacturers. For more information on the difference between NOR and NAND, you can refer to this article by Lee Hutchinson [31].

An important property of NAND-flash modules is that their cells wear out, and therefore have a limited lifespan. Indeed, the transistors forming the cells store bits by holding electrons. At each P/E cycle (i.e. Program/Erase, where “Program” means write), electrons might get trapped in the transistor by mistake, and after some time, too many electrons will have been trapped and the cells will become unusable.

Limited lifespan

Each cell has a maximum number of P/E cycles (Program/Erase), after which the cell is considered defective. NAND-flash memory wears out and has a limited lifespan. The different types of NAND-flash memory have different lifespans [31].

Recent research has shown that by applying very high temperatures to NAND chips, trapped electrons can be cleared out [14, 51]. The lifespan of SSDs could be tremendously increased, though this is still research and there is no certainty that this will one day reach the consumer market.

The types of cells currently present in the industry are:

Single-level cell (SLC), in which transistors can store only 1 bit but have a long lifespan

Multi-level cell (MLC), in which transistors can store 2 bits, at the cost of a higher latency and reduced lifespan compared to SLC

Triple-level cell (TLC), in which transistors can store 3 bits, but at an even higher latency and reduced lifespan
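To make the difference between the three cell types a bit more concrete, here is a minimal sketch in Python. It relies on one fact that is not spelled out above: storing n bits in a cell requires telling 2^n charge levels apart, which is one commonly cited reason why cells with more bits are slower and wear out sooner. The function names are my own, and the cell count ignores spare area and error-correction bits, so treat it as an illustration only.

```python
# Bits stored per cell for each NAND cell type listed above.
BITS_PER_CELL = {"SLC": 1, "MLC": 2, "TLC": 3}


def charge_levels(bits_per_cell: int) -> int:
    """Number of distinct charge levels a cell must be able to distinguish."""
    return 2 ** bits_per_cell


def cells_needed(num_bytes: int, bits_per_cell: int) -> int:
    """Cells required to hold num_bytes of data.

    Simplification: spare area and ECC bits are ignored.
    """
    total_bits = num_bytes * 8
    # Ceiling division: a partially used cell still counts as a whole cell.
    return -(-total_bits // bits_per_cell)


if __name__ == "__main__":
    for name, bits in BITS_PER_CELL.items():
        print(f"{name}: {charge_levels(bits)} levels per cell, "
              f"{cells_needed(4096, bits)} cells for 4 KB of data")
```

Going from SLC to TLC divides the number of cells needed for the same data by three, but each cell has to distinguish eight levels instead of two.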

Storing more bits in the same number of transistors reduces manufacturing costs. SLC-based SSDs are known to be more reliable and to have a longer life expectancy than MLC-based SSDs, but at a higher manufacturing cost. Therefore, most consumer SSDs are MLC- or TLC-based, and only professional SSDs are SLC-based. Choosing the right memory type depends on the workload the drive will be used for, and on how often the data is likely to be updated. For high-update workloads, SLC is the best choice, whereas for high-read and low-write workloads (e.g. video storage and streaming), TLC will be perfectly fine. Moreover, benchmarks of TLC drives under a realistic workload show that the lifespan of TLC-based SSDs is not a concern in practice [36].

NAND-flash pages and blocks

Cells are grouped into a grid, called a block, and blocks are grouped into planes. The smallest unit through which a block can be read or written is a page. Pages cannot be erased individually; only whole blocks can be erased. The size of a NAND-flash page varies, and most drives have pages of 2 KB, 4 KB, 8 KB or 16 KB. Most SSDs have blocks of 128 or 256 pages, which means that the size of a block can vary between 256 KB and 4 MB. For example, the Samsung SSD 840 EVO has blocks of size 2048 KB, and each block contains 256 pages of 8 KB each. The way pages and blocks can be accessed is covered in detail in Section 3.1.
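The block sizes quoted above follow directly from multiplying the number of pages per block by the page size. The short Python sketch below reproduces that arithmetic, and also shows how many pages a byte range touches when the page is the smallest readable unit. The function names are hypothetical helpers of mine, not part of any SSD API.

```python
def block_size_kb(pages_per_block: int, page_size_kb: int) -> int:
    """Size of a block in KB, given its page count and page size."""
    return pages_per_block * page_size_kb


def pages_spanned(offset: int, length: int, page_size: int) -> int:
    """Number of pages touched by a request of `length` bytes at `offset`.

    Because a page is the smallest unit that can be read or written,
    even a 1-byte request touches a whole page.
    """
    if length <= 0:
        return 0
    first_page = offset // page_size
    last_page = (offset + length - 1) // page_size
    return last_page - first_page + 1


if __name__ == "__main__":
    print(block_size_kb(128, 2))    # smallest combination quoted above
    print(block_size_kb(256, 16))   # largest combination quoted above
    print(block_size_kb(256, 8))    # Samsung SSD 840 EVO
    print(pages_spanned(8191, 2, 8192))  # 2 bytes straddling a page boundary
```

The last call illustrates that a tiny request that straddles a page boundary still costs two full pages.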

1.2 Organization of an SSD

Figure 1 below represents an SSD and its main components. I have simply reproduced the basic schematic already presented in various papers [2, 3, 6].

Figure 1: Architecture of a solid-state drive

Commands come from the user through the host interface. At the time of writing, the two most common interfaces for newly released SSDs are Serial ATA (SATA) and PCI Express (PCIe). The processor in the SSD controller takes the commands and passes them to the flash controller. SSDs also have embedded RAM, generally for caching purposes and to store mapping information. Section 4 covers mapping policies in more detail. The packages of NAND flash memory are organized in gangs, over multiple channels, which is covered in Section 6.

Figures 2 and 3 below, reproduced from StorageReview.com [26, 27], show what SSDs look like in real life. Figure 2 shows the 512 GB version of the Samsung 840 Pro SSD, released in August 2013. As can be seen on the circuit board, the main components are:

1.3 Manufacturing process

Many SSD manufacturers use surface-mount technology (SMT) to produce SSDs, a production method in which electronic components are placed directly on top of printed circuit boards (PCBs). SMT lines are composed of a chain of machines, each machine being plugged into the next and having a specific task to perform in the process, such as placing components or melting the solder. Multiple quality checks are also performed throughout the entire process. Photos and videos of SMT lines can be seen in two articles by Steve Burke [67, 68], in which he visited the production facilities of Kingston Technologies in Fountain Valley, California, and in an article by Cameron Wilmot about the Kingston installations in Taiwan [69].

Other interesting resources are two videos, the first one about the Crucial SSDs by Micron [70] and the second one about Kingston [71]. In the latter, which is part of Steve Burke’s articles and which I have also embedded below, Mark Tekunoff from Kingston gives a tour of one of their SMT lines. An interesting detail: everyone in the video is wearing cute antistatic pyjamas and seems to be having a lot of fun!

2. Benchmarking and performance metrics

2.1 Basic benchmarks

Table 2 below shows the throughput for sequential and random workloads on different solid-state drives. For the sake of comparison, I have included SSDs released in 2008 and 2013, along with one hard drive and one RAM chip.

| | Samsung 64 GB | Intel X25-M | Samsung 840 EVO | Micron P420m | HDD | RAM |
|---|---|---|---|---|---|---|
| Brand/Model | Samsung (MCCDE64G5MPP-OVA) | Intel X25-M (SSDSA2MH080G1GC) | Samsung (SSD 840 EVO mSATA) | Micron P420m | Western Digital Black 7200 rpm | Corsair Vengeance DDR3 |
| Memory cell type | MLC | MLC | TLC | MLC | * | * |
| Release year | 2008 | 2008 | 2013 | 2013 | 2013 | 2012 |
| Interface | SATA 2.0 | SATA 2.0 | SATA 3.0 | PCIe 2.0 | SATA 3.0 | * |
| Total capacity | 64 GB | 80 GB | 1 TB | 1.4 TB | 4 TB | 4 x 4 GB |
| Pages per block | 128 | 128 | 256 | 512 | * | * |
| Page size | 4 KB | 4 KB | 8 KB | 16 KB | * | * |
| Block size | 512 KB | 512 KB | 2048 KB | 8192 KB | * | * |
| Sequential reads (MB/s) | 100 | 254 | 540 | 3300 | 185 | 7233 |
| Sequential writes (MB/s) | 92 | 78 | 520 | 630 | 185 | 5872 |
| 4KB random reads (MB/s) | 17 | 23.6 | 383 | 2292 | 0.54 | 5319 ** |
| 4KB random writes (MB/s) | 5.5 | 11.2 | 352 | 390 | 0.85 | 5729 ** |
| 4KB random reads (KIOPS) | 4 | 6 | 98 | 587 | 0.14 | 105 |
| 4KB random writes (KIOPS) | 1.5 | 2.8 | 90 | 100 | 0.22 | 102 |

Notes

* metric is not applicable for that storage solution
** measured with 2 MB chunks, not 4 KB

Metrics

MB/s: Megabytes per Second
KIOPS: Kilo IOPS, i.e. 1000 Input/Output Operations Per Second

Table 2: Characteristics and throughput of solid-state drives compared to other storage solutions

An important factor for performance is the host interface. The most common interfaces for newly released SSDs are SATA 3.0 and PCI Express 3.0. On a SATA 3.0 interface, data can be transferred at up to 6 Gbit/s, which in practice gives around 550 MB/s, and on a PCIe 3.0 interface, data can be transferred at up to 8 GT/s per lane, which in practice is roughly 1 GB/s (GT/s stands for gigatransfers per second). SSDs on the PCIe 3.0 interface use more than a single lane. With four lanes, PCIe 3.0 can offer a maximum bandwidth of 4 GB/s, which is roughly seven times the 550 MB/s of SATA 3.0. Some enterprise SSDs also offer a Serial Attached SCSI (SAS) interface, which in its latest version can offer up to 12 Gbit/s, although at the moment SAS is only a tiny fraction of the market.
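For readers wondering where the "in practice" figures come from, here is a small Python sketch of the arithmetic. It assumes the line encodings used by these interfaces, which are not mentioned above: 8b/10b for SATA 3.0 (8 payload bits for every 10 bits on the wire) and 128b/130b for PCIe 3.0. The function name is mine, and the results are upper bounds before any protocol overhead, which is why the 550 MB/s quoted for SATA sits a little below the bound computed here.

```python
def usable_mb_per_s(line_rate_gbit: float, payload_bits: int,
                    encoded_bits: int, lanes: int = 1) -> float:
    """Upper bound on payload bandwidth in MB/s (1 MB = 10**6 bytes).

    line_rate_gbit: raw signalling rate per lane, in Gbit/s (or GT/s).
    payload_bits / encoded_bits: efficiency of the line encoding.
    """
    payload_bits_per_s = line_rate_gbit * 1e9 * payload_bits / encoded_bits
    return payload_bits_per_s / 8 / 1e6 * lanes


if __name__ == "__main__":
    sata3 = usable_mb_per_s(6, 8, 10)                 # 8b/10b encoding
    pcie3_x1 = usable_mb_per_s(8, 128, 130)           # 128b/130b encoding
    pcie3_x4 = usable_mb_per_s(8, 128, 130, lanes=4)
    print(f"SATA 3.0:    {sata3:.0f} MB/s")
    print(f"PCIe 3.0 x1: {pcie3_x1:.0f} MB/s")
    print(f"PCIe 3.0 x4: {pcie3_x4:.0f} MB/s")
    print(f"x4 vs SATA:  {pcie3_x4 / sata3:.1f}x")
```

The encoding overhead is the reason 6 Gbit/s does not translate into 750 MB/s, and why a PCIe 3.0 lane lands just under 1 GB/s.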

Most recent SSDs are fast enough internally to easily reach the 550 MB/s limit of SATA 3.0, so the interface is the bottleneck for them. SSDs using PCI Express 3.0 or SAS offer tremendous performance increases [15].

PCI Express and SAS are faster than SATA

The two main host interfaces offered by manufacturers are SATA 3.0 (550 MB/s) and PCI Express 3.0 (1 GB/s per lane, using multiple lanes). Serial Attached SCSI (SAS) is also available for enterprise SSDs. In their latest versions, PCI Express and SAS are faster than SATA, but they are also more expensive.

2.2 Pre-conditioning

If you torture the data long enough, it will confess.
— Ronald Coase

The data sheets provided by SSD manufacturers are filled with amazing performance values. And indeed, by hammering a drive with whatever random operations for long enough, manufacturers always seem to find a way to show shiny numbers in their marketing flyers. Whether or not those numbers really mean anything and can be used to predict the performance of a production system is a different problem.

In his articles about common flaws in SSD benchmarking [66], Marc Bevand mentioned that, for instance, it is common for the IOPS of random write workloads to be reported without any mention of the span of logical block addresses (LBA) covered, and that many IOPS figures are also reported for a queue depth of 1 instead of the maximum value for the drive being tested. There are also many cases of bugs and misuses of the benchmarking tools.

Correctly assessing the performance of SSDs is not an easy task. Many articles from hardware review blogs run ten minutes of random writes on a drive and claim that the drive is ready to be tested, and that the results can be trusted. However, the performance of SSDs only drops under a sustained workload of random writes, which, depending on the total size of the SSD, can take just 30 minutes or up to three hours. This is why the more serious benchmarks start by applying such a sustained workload of random writes, also called “pre-conditioning” [50]. Figure 7 below, reproduced from an article on StorageReview.com [26], shows the effect of pre-conditioning on multiple SSDs. A clear drop in performance can be observed after around 30 minutes, where the throughput decreases and the latency increases for all drives. It then takes another four hours for the performance to slowly decay to a constant minimum.

What is happening in Figure 7 essentially is that, as explained in Section 5.2, the amount of random writes is so large and applied in such a sustained way that the garbage collection process is unable to keep up in the background. The garbage collection must erase blocks as write commands arrive, therefore competing with the foreground operations from the host. People using pre-conditioning claim that the benchmarks it produces accurately represent how a drive will behave in its worst possible state. Whether or not this is a good model for how a drive will behave under all workloads is arguable.
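When running your own pre-conditioning, one practical question is how to tell that the drive has reached its constant minimum and that measurements can start. Below is a minimal Python sketch of one possible heuristic: take the mean of the most recent samples and walk backwards for as long as the samples stay within a tolerance band around it. The window size, the 10% tolerance and the sample values are all arbitrary assumptions of mine; they are not taken from Figure 7 or from any standard.

```python
from statistics import mean


def steady_state_start(samples, window=5, tolerance=0.10):
    """Return the index where the final stable stretch of `samples` begins.

    The last `window` samples define a reference mean. The stretch is
    extended backwards while samples stay within `tolerance` (a fraction)
    of that mean. Returns None if there are too few samples or if the
    last `window` samples are themselves not stable yet.
    """
    if len(samples) < window:
        return None
    reference = mean(samples[-window:])
    if reference <= 0:
        return None
    low = reference * (1 - tolerance)
    high = reference * (1 + tolerance)
    if not all(low <= s <= high for s in samples[-window:]):
        return None  # still decaying: keep pre-conditioning
    start = len(samples) - window
    while start > 0 and low <= samples[start - 1] <= high:
        start -= 1
    return start


if __name__ == "__main__":
    # Hypothetical random-write throughput in MB/s, one sample per interval.
    throughput = [400, 395, 180, 120, 90, 70, 62, 60, 61, 59, 60]
    print(steady_state_start(throughput))
```

Scanning from the end matters here: a fresh drive is also stable at the very beginning of the run, so looking for the first stable window would wrongly pick the initial high plateau.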

In order to compare various models coming from different manufacturers, a common ground must be found, and the worst possible state is a valid one. But picking the drive that performs best under the worst possible workload does not always guarantee that it will perform best under the workload of a production environment. Indeed, in most production environments, an SSD will serve one and only one system. That system has a specific workload due to its internal characteristics, and therefore a better and more accurate way to compare different drives would be to replay this same workload on those drives, and then compare their respective performance. This is why, even though pre-conditioning with a sustained workload of random writes allows for a fair comparison of different SSDs, one has to be careful and should, whenever possible, run in-house benchmarks based on the target workload. Benchmarking in-house also helps avoid over-allocating resources: there is no need to buy the “best” SSD model when a cheaper one would be enough, which can save a lot of money.

Benchmarking is hard

Testers are human, so not all benchmarks are free of errors. Be careful when reading the benchmarks from manufacturers or third parties, and use multiple sources before trusting any numbers. Whenever possible, run your own in-house benchmarking using the specific workload of your system, along with the specific SSD model that you want to use. Finally, make sure you look at the performance metrics that matter most for the system at hand.

2.3 Workloads and metrics

Performance benchmarks all share the same varying parameters and provide results using the same metrics. In this section, I wish to give some insights as to how to interpret those parameters and metrics.

The parameters used are generally the following:

The type of workload: can be a specific benchmark based on data collected from users, or only sequential or random accesses of the same type (e.g. only random writes)

The percentages of reads and writes performed concurrently (e.g. 30% reads and 70% writes)

The queue length: this is the number of concurrent execution threads running commands on a drive

The size of the data chunks being accessed (4 KB, 8 KB, etc.)

Benchmark results are presented using different metrics. The most common are:

Throughput: the speed of transfer, generally in KB/s or MB/s (kilobytes per second and megabytes per second, respectively). This is the metric chosen for sequential benchmarks.

IOPS: the number of Input/Output Operations Per Second, each operation being of the same data chunk size (generally 4 KB). This is the metric chosen for random benchmarks.

Latency: the response time of a device after a command is emitted, generally in μs or ms (microseconds and milliseconds, respectively).
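The queue length parameter and the IOPS and latency metrics are tied together by Little's law: the average number of commands in flight equals the completion rate multiplied by the average latency. This relation is not part of the benchmarks discussed here, so take the Python sketch below as an aid for sanity-checking reported numbers. The function names and the example latencies are hypothetical.

```python
def iops_from_latency(queue_depth: int, mean_latency_s: float) -> float:
    """IOPS sustained when `queue_depth` commands are kept in flight
    and each takes `mean_latency_s` seconds on average (Little's law)."""
    return queue_depth / mean_latency_s


def latency_from_iops(queue_depth: int, iops: float) -> float:
    """Mean latency in seconds implied by an IOPS figure at a queue depth."""
    return queue_depth / iops


if __name__ == "__main__":
    # Hypothetical drive: 100 microseconds per command at queue depth 1.
    print(iops_from_latency(1, 100e-6))
    # Same IOPS figure reported at queue depth 32 implies a higher latency.
    print(latency_from_iops(32, 10_000) * 1e3, "ms")
```

This is why an IOPS figure means little without the queue depth it was measured at: the same number can hide very different per-command latencies.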

While throughput is easy to understand and relate to, IOPS is more difficult to grasp. For example, if a disk shows a performance for random writes of 1000 IOPS for 4 KB chunks, this means that the throughput is 1000 x 4096 bytes = 4,096,000 bytes/s, or about 4 MB/s. Consequently, a high IOPS will translate into a high throughput only if the chunks are large enough. A high IOPS at a low average chunk size will translate into a low throughput.
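The conversion between IOPS and throughput is a single multiplication, sketched below in Python. The function name is mine, and I use decimal megabytes (1 MB = 10^6 bytes) as an assumption; with binary units the figures are slightly lower.

```python
def throughput_mb_per_s(iops: float, chunk_bytes: int) -> float:
    """Throughput in MB/s (1 MB = 10**6 bytes) for a given IOPS and chunk size."""
    return iops * chunk_bytes / 1e6


if __name__ == "__main__":
    # Same IOPS, different chunk sizes: throughput scales with the chunk size.
    for chunk in (512, 4096, 128 * 1024):
        print(f"1000 IOPS at {chunk:>6} bytes: "
              f"{throughput_mb_per_s(1000, chunk):.3f} MB/s")
```

At a constant 1000 IOPS, moving from 4 KB to 128 KB chunks multiplies the throughput by 32, which is the point made in the paragraph above.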

To illustrate this point, let’s imagine that we have a logging system performing tiny updates over thousands of different files per minute, giving a performance of 10k IOPS. Because the updates are spread over so many different files, the throughput could be close to something like 20 MB/s, whereas writing sequentially to only one file with the same system could lead to an increased throughput of 200 MB/s, which is a tenfold improvement. I am making up those numbers for the sake of this example, although they are close to production systems I have encountered.

Another concept to grasp is that a high throughput does not necessarily mean a fast system. Indeed, if the latency is high, no matter how good the throughput is, the overall system will be slow. Let’s take the example of a hypothetical single-threaded process that requires connections to 25 databases, each connection having a latency of 20 ms. Because the connection latencies are cumulative, obtaining the 25 connections will require 25 x 20 ms = 500 ms. Therefore, even if the machines running the database queries have fast network cards, let’s say 5 Gbit/s of bandwidth, the process will still be slow due to the latency.
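Here is the same example as a few lines of Python, comparing the time lost to latency with the time spent actually moving data. The 1 MB fetched from each database is a number I made up for the comparison; only the 25 connections, the 20 ms latency and the 5 Gbit/s bandwidth come from the text.

```python
def sequential_latency_ms(connections: int, latency_ms: float) -> float:
    """Total wait when connections are opened one after the other."""
    return connections * latency_ms


def transfer_time_ms(payload_bytes: int, bandwidth_gbit_s: float) -> float:
    """Time to push payload_bytes through a link of the given bandwidth."""
    return payload_bytes * 8 / (bandwidth_gbit_s * 1e9) * 1000


if __name__ == "__main__":
    connections = 25
    wait = sequential_latency_ms(connections, 20)
    # Assumption: each database returns 1 MB (10**6 bytes) of data.
    transfer = transfer_time_ms(connections * 1_000_000, 5)
    print(f"latency:  {wait:.0f} ms")
    print(f"transfer: {transfer:.0f} ms")
```

Under that assumption, the process spends more than ten times longer waiting for connections than transferring data, so a faster network card would barely help.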

The takeaway from this section is that it is important to keep an eye on all the metrics, as they show different aspects of the system and help identify bottlenecks when they come up. When looking at SSD benchmarks and deciding which model to pick, a good rule of thumb is to keep in mind which metric is the most critical to the system in which those SSDs will be used. Then of course, nothing replaces proper in-house benchmarking, as explained in Section 2.2.

An interesting follow-up on the topic is the article “IOPS are a scam” by Jeremiah Peschka [46].

What’s next

Part 3 is available here. You can also go to the Table of Contents for this series of articles, and if you’re in a rush, you can also go directly to Part 6, which summarizes the content of all the other parts.



In section 1.2, you have pictures of an 840 pro SSD, and describe that one of the components on it is 8 TLC modules. I thought that while the regular 840 used TLC, then 840 pro used MLC. Could be mistaken about that, though.

Fascinating article, thank you for your work. But you wrote that “At each P/E cycle (i.e. Program/Erase, “Program” here means write), electrons might get trapped in the transistor by mistake”. That seems unlikely: as far as I know, a floating-gate transistor has a limited number of write cycles because electrons lose energy during injection by damaging the oxide layer, so the electrical characteristics of the transistor change and, as a result, the transistor becomes unusable. Of course, that is a very minor difference, and I may be mistaken. Also, it would probably be good to mention that, in general, NAND flash memory has more capacity but also higher latency, while NOR flash memory has a lower capacity but also lower latency.