This is Part 4 over 6 of “Coding for SSDs”, covering Sections 5 and 6. For other parts and sections, you can refer to the Table to Contents. This is a series of articles that I wrote to share what I learned while documenting myself on SSDs, and on how to make code perform well on SSDs. If you’re in a rush, you can also go directly to Part 6, which is summarizing the content from all the other parts.

In this part, I cover briefly some of the main SSD functionalities such as TRIM and over-provisioning. I am also presenting the different levels of internal parallelism in an SSD, and the concept of clustered block.

To receive a notification email every time a new article is posted on Code Capsule, you can subscribe to the newsletter by filling up the form at the top right corner of the blog.As usual, comments are open at the bottom of this post, and I am always happy to welcome questions, corrections and contributions!

5. Advanced functionalities

5.1 TRIM

Let’s imagine that a program write files to all the logical block addresses of an SSD: this SSD would be considered full. Now let’s assume that all those files gets deleted. The filesystem would report 100% free space, although the drive would still be full, because an SSD controller has no way to know when logical data is deleted by the host. The SSD controller will see the free space only when the logical block addresses that used to be holding the files get overwritten. At that moment, the garbage collection process will erase the blocks associated with the deleted files, providing free pages for incoming writes. As a consequence, instead of erasing the blocks as soon as they are known to be holding stale data, the erasing is being delayed, which hurts performance badly.

Another concern is that, since the pages holding deleted files are unknown to the SSD controller, the garbage collection mechanism will continue move them around to ensure wear leveling. This increases write amplification, and interferes with the foreground workload of the host for no good reason.

A solution to the problem of delayed erasing is the TRIM command, which can be sent by the operating system to notify the SSD controller that pages are no longer in use in the logical space. With that information, the garbage collection process knows that it doesn’t need to move those pages around, and also that it can erase them whenever needed. The TRIM command will only work if the SSD controller, the operating system, and the filesystem are supporting it.

The Wikipedia page for the TRIM command is listing the operating systems and filesystems that support TRIM [16]. Under Linux, support for the ATA TRIM was added in version 2.6.33. Although the ext2 and ext3 filesystems do not support TRIM, ext4 and XFS, among others, do support it. Under Mac OS 10.6.8, HFS+ supports the TRIM operation. As for Windows 7, it only supports TRIM for SSD using a SATA interface and not PCI-Express.

The majority of recent drives support TRIM, and indeed, allowing the garbage collection to work as early as possible significantly improves future performance. Therefore, it is strongly preferable to use SSDs that support TRIM, and to make sure that support is enabled both at the operating system and filesystem levels.

5.2 Over-provisioning

Over-provisioning is simply having more physical blocks than logical blocks, by keeping a ratio of the physical blocks reserved for the controller and not visible to the user. Most manufacturers of professional SSDs already include some over-provisioning, generally in the order of 7 to 25% [13]. Users can create more over-provisioning simply by partitioning a disk to a lower logical capacity than its maximum physical capacity. For example, one could create a 90 GB partition in a 100 GB drive, and leave the remaining 10 GB for over-provisioning. Even if the over-provisioned space is not visible at the level of the operation system, the SSD controller can still see it. The main reason for which manufacturers are offering over-provisioning is to cope with the inherent limited lifespan of NAND-flash cells. The invisible over-provisioned blocks are here to seamlessly replace the blocks wearing off in the visible space.

AnandTech has an interesting article showing the impact of over-provisioning on the life-span and performance of SSDs [34]. For the disk they studied, the conclusion was that performance increased dramatically simply by making sure that 25% of the space was reserved for over-provisioning — summing up all levels of over-provisioning. Another interesting result was presented in an article by Percona, in which they tested an Intel 320 SSD and showed that the write throughput decreased as the disk was getting filled up [38].

Here is my explanation on what is happening. The garbage collection is using idle time to erase stale pages in the background. But since the erase operation has a higher latency than the write operation, i.e. erasing takes more time than writing, an SSD under a heavy workload of continuous random writes would use up all of its free blocks before the garbage collection would have time to erase the stale pages. At that point, the FTL would be unable to catch up with the foreground workload of random writes, and the garbage collection process would have to erase blocks at the same time as write commands are coming in. This is when performance drops and the SSD appears to be performing badly in benchmarks, as shown in Figure 7 below. Therefore, over-provisioning can act as a buffer to absorb high throughput write workloads, leaving enough time to the garbage collection to catch up and erase blocks again. How much over-provisioning is needed depends mostly on the workload in which the SSD will be used and how much writes it will need to absorb. As a rule of thumb, somewhere around 25% of over-provisioning is recommended for sustained workload of random writes [34]. If the workload is not so heavy, somewhere around 10-15% may be largely enough.

Over-provisioning is useful for wear leveling and performance

A drive can be over-provisioned simply by formatting it to a logical partition capacity smaller than the maximum physical capacity. The remaining space, invisible to the user, will still be visible and used by the SSD controller. Over-provisioning helps the wear leveling mechanisms to cope with the inherent limited lifespan of NAND-flash cells. For workloads in which writes are not so heavy, 10% to 15% of over-provisioning is enough. For workloads of sustained random writes, keeping up to 25% of over-provisioning will improve performance. The over-provisioning will act as a buffer of NAND-flash blocks, helping the garbage collection process to absorb peaks of writes.

From there, it can also be deduced that over-provisioning offers even greater improvements for setups in which the TRIM command is not supported — note that I am just making an assumption here, and that I have yet to find a reference to support this idea. Let’s imagine that only 75% of the drive is used by the operating system and the remaining 25% is reserved for over-provisioning. Because the SSD controller can see the whole drive, 100% of the blocks are rotating and alternating between being used, stale, and erased, although only 75% of the physical NAND-flash memory is actually used at any single moment in time. This means that the remaining 25% of physical memory should be safely assumed not to be holding any data, since it is not mapped to any logical block addresses. Therefore, the garbage collection process should be able to erase blocks from the over-provisioned space in advance, and this even in the absence of TRIM support.

5.3 Secure Erase

Some SSD controllers offer the ATA Secure Erase functionality, the goal being to restore the performance of the drive back to its fresh out-of-box state. This command erases all data written by the user and resets the FTL mapping tables, but obviously cannot overcome the physical limitations of the limited P/E cycles. Even though this functionality looks very promising in the specs, it’s up to each manufacturer to implement it correctly. In their review of the secure erase command, Wei et al., 2011, have shown that over the 12 models of SSDs studied, only eight offered the ATA Secure Erase functionality, and over those eight drives, three had buggy implementations [11].

The implications for performance are important, and they are all the more so important for security, but it is not my intent to cover this topic here. There are a couple of discussions on Stack Overflow which explain in more details how to reliably erase data from an SSD [48, 49].

5.4 Native Command Queueing (NCQ)

Native Command Queueing (NCQ) is a feature of Serial ATA that allows for an SSD to accept multiple commands from the host in order to complete them concurrently using the internal parallelism [3]. In addition to reducing latency due to the drive, some newer drives also use NCQ to cope with latency from the host. For example, NCQ can prioritize incoming commands to ensure that the drive always has commands to process while the host CPU is busy [39].

5.5 Power-loss protection

Whether it is at home or in datacenter, power loss will happen. Some manufacturers include a supercapacitor in their SSD architecture, which is supposed to hold enough power in order to commit the I/O requests in the bus in case of a power outage, and leave the drive in a consistent state. The problem is that not all SSD manufacturer include a supercapacitor or some sort of power-fault data protection for their drives, and that those who include it do not always mention it in their specifications. Then, like for the secure erase command, it is not clear whether or not the power-fault mechanisms are correctly implemented and will indeed protect the drive from data corruption when a power outage occurs.

A study by Zheng et al., 2013, tested 15 SSDs, without revealing their brands [72]. They stressed the drives with various power faults, and found that 13 out of the 15 tested SSDs ended up losing some data or being massively corrupted. Another article about power fault by Luke Kenneth Casson Leighton showed that three out of four tested drives were left in a corrupted state, and that the fourth was fine (an Intel drive) [73].

SSDs are still a very young technology and I am convinced that their resistance to data corruption under power fault will improve over the next generations. Nevertheless for the time being, it is probably worth it to invest in an uninterruptible power supply (UPS) in datacenter setups. And as with any other storage solution, backup sensitive data regularly.

6. Internal Parallelism in SSDs

6.1 Limited I/O bus bandwidth

Due to physical limitations, an asynchronous NAND-flash I/O bus cannot provide more than 32-40 MB/s of bandwidth [5]. The only way for SSD manufacturers to increase performance is to design their drives in such a way that multiple packages can be parallelized or interleaved. A good explanation of interleaving can be found in Section 2.2 of [2].

By combining all the levels of internal parallelism inside an SSD, multiple blocks can be accessed simultaneously across separate chips, as a unit called a clustered block. Explaining all the details regarding the inner parallelism of an SSD is not my intent here, therefore I am just covering briefly the levels of parallelism and the clustered block. To learn more about these topics and more generally about parallelism inside SSDs, two great starting points are the papers [2, 3]. In addition, the advanced commands such as copyback and inter-plane transfer are presented in [5].

Internal parallelism

Internally, several levels of parallelism allow to write to several blocks at once into different NAND-flash chips, to what is called a “clustered block”.

6.2 Multiple levels of parallelism

Figure 6 below shows the internals of a NAND-flash package, which is organized as a hierarchical structure. The levels are channel, package, chip, plane, block, and page. As exposed in [3], those different levels offer parallelism as follows:

Channel-level parallelism. The flash controller communicates with the flash packages through multiple channels. Those channels can be accessed independently and simultaneously. Each individual channel is shared by multiple packages.

Package-level parallelism. The packages on a channel can be accessed independently. Interleaving can be used to run commands simultaneously on the packages shared by the same channel.

Chip-level parallelism. A package contains two or more chips, which can be accessed independently in parallel. Note: chips are also called “dies”.

Plane-level parallelism. A chip contains two or more planes. The same operation (read, write or erase) can be run simultaneously on multiple planes inside a chip. Planes contain blocks, which themselves contains pages. The plane also contains registers (small RAM buffers), which are used for plane-level operations.

Figure 6: NAND flash package

6.3 Clustered blocks

Multiple blocks accessed across multiple chips are called a clustered block[2]. The idea is similar to the concept of striping encountered in RAID systems [1, 5].

Logical block addresses accessed at once are striped over different SSD chips in distinct flash packages. This is done thanks to the mapping algorithm of the FTL, and it is independent of whether or not those addresses are sequential. Striping blocks allows to use multiple channels simultaneously and combine their bandwidths, and also to perform multiple read, write and erase operations in parallel. This means that I/O operations that are both aligned and multiple of the clustered block size guarantee an optimal use of all the performance offered by the various levels of internal parallelism in an SSD. See Sections 8.2 and 8.3 for more information about clustered blocks.

What’s next

Part 5 is available here. You can also go to the Table of Content for this series of articles, and if you’re in a rush, you can also directly go to Part 6, which is summarizing the content from all the other parts.

To receive a notification email every time a new article is posted on Code Capsule, you can subscribe to the newsletter by filling up the form at the top right corner of the blog.As usual, comments are open at the bottom of this post, and I am always happy to welcome questions, corrections and contributions!

Looking for a job?

Do you have experience in infrastructure, and are you interested in building and scaling large distributed systems? My employer, Booking.com, is recruiting Software Engineers and Site Reliability Engineers (SREs) in Amsterdam, Netherlands. If you think you have what it takes, send me your CV at emmanuel [at] codecapsule [dot] com.