This is Part 3 over 6 of “Coding for SSDs”, covering Sections 3 and 4. For other parts and sections, you can refer to the Table to Contents. This is a series of articles that I wrote to share what I learned while documenting myself on SSDs, and on how to make code perform well on SSDs. If you’re in a rush, you can also go directly to Part 6, which is summarizing the content from all the other parts.

In this part, I am explaining how writes are handled at the page and block level, and I talk about the fundamental concepts of write amplification and wear leveling. Moreover, I describe what is a Flash Translation Layer (FTL), and I cover its two main purposes, logical block mapping and garbage collection. More particularly, I explain how write operations work in the context of a hybrid log-block mapping.

To receive a notification email every time a new article is posted on Code Capsule, you can subscribe to the newsletter by filling up the form at the top right corner of the blog.As usual, comments are open at the bottom of this post, and I am always happy to welcome questions, corrections and contributions!

3. Basic operations

3.1 Read, write, erase

Due to the organization of NAND-flash cells, it is not possible to read or write single cells individually. Memory is grouped and is accessed with very specific properties. Knowing those properties is crucial for optimizing data structures for solid-state drives and for understanding their behavior. I am describing below the basic properties of SSDs regarding the read, write and erase operations.

Reads are aligned on page size

It is not possible to read less than one page at once. One can of course only request just one byte from the operating system, but a full page will be retrieved in the SSD, forcing a lot more data to be read than necessary.

Writes are aligned on page size

When writing to an SSD, writes happen by increments of the page size. So even if a write operation affects only one byte, a whole page will be written anyway. Writing more data than necessary is known as write amplification, a concept that is covered in Section 3.3. In addition, writing data to a page is sometimes referred to as “to program” a page, therefore the terms “write” and “program” are used interchangeably in most publications and articles related to SSDs.

Pages cannot be overwritten

A NAND-flash page can be written to only if it is in the “free” state. When data is changed, the content of the page is copied into an internal register, the data is updated, and the new version is stored in a “free” page, an operation called “read-modify-write”. The data is not updated in-place, as the “free” page is a different page than the page that originally contained the data. Once the data is persisted to the drive, the original page is marked as being “stale”, and will remain as such until it is erased.

Erases are aligned on block size

Pages cannot be overwritten, and once they become stale, the only way to make them free again is to erase them. However, it is not possible to erase individual pages, and it is only possible to erase whole blocks at once. From a user perspective, only read and write commands can be emitted when data is accessed. The erase command is triggered automatically by the garbage collection process in the SSD controller when it needs to reclaim stale pages to make free space.

3.2 Example of a write

Let’s illustrate the concepts from Section 3.1. Figure 4 below shows an example of data being written to an SSD. Only two blocks are shown, and those blocks contain only four pages each. This is obviously a simplified representation of a NAND-flash package, created for the sake of the reduced examples I am presenting here. At each step in the figure, bullet points on the right of the schematics explain what is happening.

Figure 4: Writing data to a solid-state drive

3.3 Write amplification

Because writes are aligned on the page size, any write operation that is not both aligned on the page size and a multiple of the page size will require more data to be written than necessary, a concept called write amplification[13]. Writing one byte will end up writing a page, which can amount up to 16 KB for some models of SSD and be extremely inefficient.

But this is not the only problem. In addition to writing more data than necessary, those writes also trigger more internal operations than necessary. Indeed, writing data in an unaligned way causes the pages to be read into cache before being modified and written back to the drive, which is slower than directly writing pages to the disk. This operations is known as read-modify-write, and should be avoided whenever possible [2, 5].

Never write less than a page

Avoid writing chunks of data that are below the size of a NAND-flash page to minimize write amplification and prevent read-modify-write operations. The largest size for a page at the moment is 16 KB, therefore it is the value that should be used by default. This size depends on the SSD models and you may need to increase it in the future as SSDs improve.

Align writes

Align writes on the page size, and write chunks of data that are multiple of the page size.

Buffer small writes

To maximize throughput, whenever possible keep small writes into a buffer in RAM and when the buffer is full, perform a single large write to batch all the small writes.

3.4 Wear leveling

As discussed in Section 1.1, NAND-flash cells have a limited lifespan due to their limited number of P/E cycles. Let’s imagine that we had an hypothetical SSD in which data was always read and written from the same exact block. This block would quickly exceed its P/E cycle limit, wear off, and the SSD controller would mark it as being unusable. The overall capacity of the disk would then decrease. Imagine buying a 500 GB drive and being left with at 250 GB a couple of years later, that would be outrageous!

For that reason, one of the main goals of an SSD controller is to implement wear leveling, which distributes P/E cycles as evenly as possible among the blocks. Ideally, all blocks would reach their P/E cycle limits and wear off at the same time [12, 14].

In order to achieve the best overall wear leveling, the SSD controller will need to choose blocks judiciously when writing, and may have to move around some blocks, a process which in itself incurs an increase of the write amplification. Therefore, block management is a trade-off between maximizing wear leveling and minimizing write amplification.

Manufacturers have come up with various functionalities to achieve wear leveling, such as garbage collection, which is covered in the next section.

Wear leveling

Because NAND-flash cells are wearing off, one of the main goals of the FTL is to distribute the work among cells as evenly as possible so that blocks will reach their P/E cycle limit and wear off at the same time.

4. Flash Translation Layer (FTL)

4.1 On the necessity of having an FTL

The main factor that made adoption of SSDs so easy is that they use the same host interfaces as HDDs. Although presenting an array of Logical Block Addresses (LBA) makes sense for HDDs as their sectors can be overwritten, it is not fully suited to the way flash memory works. For this reason, an additional component is required to hide the inner characteristics of NAND flash memory and expose only an array of LBAs to the host. This component is called the Flash Translation Layer (FTL), and resides in the SSD controller. The FTL is critical and has two main purposes: logical block mapping and garbage collection.

4.2 Logical block mapping

The logical block mapping translates logical block addresses (LBAs) from the host space into physical block addresses (PBAs) in the physical NAND-flash memory space. This mapping takes the form of a table, which for any LBA gives the corresponding PBA. This mapping table is stored in the RAM of the SSD for speed of access, and is persisted in flash memory in case of power failure. When the SSD powers up, the table is read from the persisted version and reconstructed into the RAM of the SSD [1, 5].

The naive approach is to use a page-level mapping to map any logical page from the host to a physical page. This mapping policy offers a lot of flexibility, but the major drawback is that the mapping table requires a lot of RAM, which can significantly increase the manufacturing costs. A solution to that would be to map blocks instead of pages, using a block-level mapping. Let’s assume that an SSD drive has 256 pages per block. This means that block-level mapping requires 256 times less memory than page-level mapping, which is a huge improvement for space utilization. However, the mapping still needs to be persisted on disk in case of power failure, and in case of workloads with a lot of small updates, full blocks of flash memory will be written whereas pages would have been enough. This increases the write amplification and makes block-level mapping widely inefficient [1, 2].

The tradeoff between page-level mapping and block-level mapping is the one of performance versus space. Some researcher have tried to get the best of both worlds, giving birth to the so-called “hybrid” approaches [10]. The most common is the log-block mapping, which uses an approach similar to log-structured file systems. Incoming write operations are written sequentially to log blocks. When a log block is full, it is merged with the data block associated to the same logical block number (LBN) into a free block. Only a few log blocks need to be maintained, which allows to maintain them with a page granularity. Data blocks on the contrary are maintained with a block granularity [9, 10].

Figure 5 below shows a simplified representation of a hybrid log-block FTL, in which each block only has four pages. Four write operations are handled by the FTL, all having the size of a full page. The logical page numbers of 5 and 9 both resolve to LBN=1, which is associated to the physical block #1000. Initially, all the physical page offsets are null at the entry where LBN=1 is in the log-block mapping table, and the log block #1000 is entirely empty as well. The first write, b’ at LPN=5, is resolving to LBN=1 by the log-block mapping table, which is associated to PBN=1000 (log block #1000). The page b’ is therefore written at the physical offset 0 in block #1000. The metadata for the mapping now needs to be updated, and for this, the physical offset associated to the logical offset of 1 (arbitrary value for this example) is updated from null to 0.

The write operations go on and the mapping metadata is updated accordingly. When the log block #1000 is entirely filled, it is merged with the data block associated to the same logical block, which is block #3000 in this case. This information can be retrieved from the data-block mapping table, which maps logical block numbers to physical block numbers. The data resulting from the merge operation is written to a free block, #9000 in this case. When this is done, both blocks #1000 and #3000 can be erased and become free blocks, and block #9000 becomes a data block. The metadata for LBN=1 in the data-block mapping table is then updated from the initial data block #3000 to the new data block #9000.

An important thing to notice here is that the four write operations were concentrated on only two LPNs. The log-block approach enabled to hide the b’ and d’ operations during the merge, and directly use the more up-to-date b” and d” versions, allowing to achieve better write amplification.

Finally, if a read command is requesting a page that was recently updated and for which the merge step on the blocks has not occurred yet, then the page will be in a log block. Otherwise, the page will be found a data block. This is why read requests need to check both the log-block mapping table and the data-block mapping table, as shown in Figure 5.

Figure 5: Hybrid log-block FTL

The log-block FTL allows for optimizations, the most notable being the switch-merge, sometimes referred to as “swap-merge”. Let’s imagine that all the addresses in a logical block were written at once. This would mean that all the new data for those addresses would be written to the same log block. Since this log block contains the data for a whole logical block, it would be useless to merge this log block with a data block into a free block, because the resulting free block would contain exactly the data as the log block. It would be faster to only update the metadata in the data block mapping table, and switch the the data block in the data block mapping table for the log block, this is a switch-merge.

The log-block mapping scheme has been the topic of many papers, which has lead to a series of improvements, such as FAST (Fully Associative Sector Translation), superblock mapping, and flexible group mapping [10]. There are also other mapping schemes, such as the Mitsubishi algorithm, and SSR [9]. Two great starting points to learn more about the FTL and mapping schemes are the two following papers:

Flash Translation Layer

The Flash Translation Layer (FTL) is a component of the SSD controller which maps Logical Block Addresses (LBA) from the host to Physical Block Addresses (PBA) on the drive. Most recent drives implement an approach called “hybrid log-block mapping” or one of its derivatives, which works in a way that is similar to log-structured file systems. This allows random writes to be handled like sequential writes.

4.3 Notes on the state of the industry

As of February 2, 2014, there are 70 manufacturers of SSDs listed on Wikipedia [64], and interestingly, there are only 11 manufacturers of controllers [65]. Out of the 11 controller manufacturers, only four are “captive”, i.e. they use their controllers only for their own products (this is the case of Intel and Samsung), and the remaining seven are “independent”, i.e. they sell their controllers to other drive manufacturers. What those numbers are saying is that seven companies are providing controllers for 90% of the solid-state drive market.

I have no data on which controller manufacturers is selling to which drive manufacturer from these 90%, but following the Pareto principle, I would bet that only two or three controller manufacturers are sharing most of the cake. The direct consequence is that SSDs from non-captive drive manufacturers are extremely likely to behave similarly, since they are essentially running the same controllers, or at least controllers using the same general design and underlying ideas.

Mapping schemes, which are part of the controller, are critical components of SSDs because they will often entirely define the performance a drive. This explains why, in an industry with so much competition, SSD controller manufacturers do not share the details of their FTL implementations. Therefore, even though there is a lot of publicly available research regarding FTL algorithm, it is always unclear how much of that research is being used by controller manufacturers, and what are the exact implementations for specific brands and models.

The authors of [3] claim that by analyzing workloads, they can reverse engineer the mapping policies of a drive. I would argue that unless the binary code itself is being reversed engineered from the chip, there is no way to be completely sure what the mapping policy is really doing inside a specific drive. And it is even harder to predict how the mapping would behave under a specific workload.

There is a great deal of different mapping policies and it would take a considerable amount of time to reverse engineer all of the firmwares available in the wild. Then, even after getting the source code of all possible mapping policies, what would be the benefit? The system requirements of new projects are often to produce good overall results, using generic and inter-changeable commodity hardware. Therefore, it would be worthless to optimize for only one mapping policy, because the solution would be likely to perform poorly on all other mapping policies. The only reason why one would want to optimize for only one type of policy is when developing for an embedded system that is guaranteed to have consistent hardware.

For the reasons exposed above, I would argue that knowing the exact mapping policy of an SSD does not matter. The only important thing to know is that mapping schemes are the translation layer between LBAs and PBAs, and that it is very likely that the approach being used is the hybrid log-block or one of its derivatives. Consequently, writing chunks of data of at least the size of the NAND-flash block is more efficient, because for the FTL, it minimizes the overhead of updating the mapping and its metadata.

4.4 Garbage collection

As explained in Sections 4.1 and 4.2, pages cannot be overwritten. If the data in a page has to be updated, the new version is written to a free page, and the page containing the previous version is marked as stale. When blocks contain stale pages, they need to be erased before they can be written to.

Garbage collection

The garbage collection process in the SSD controller ensures that “stale” pages are erased and restored into a “free” state so that the incoming write commands can be processed.

Because of the high latency required by the erase command compared to the write command — which are respectively 1500-3500 μs and 250-1500 μs as described in Section 1 — this extra erase step incurs a delay which makes the writes slower. Therefore, some controllers implement a background garbage collection process, also called idle garbage collection, which takes advantage of idle time and runs regularly in the background to reclaim stale pages and ensure that future foreground operations will have enough free pages available to achieve the highest performance [1]. Other implementations use a parallel garbage collection approach, which performs garbage collection operations in parallel with write operations from the host [13].

It is not uncommon to encounter workloads in which the writes are so heavy that the garbage collection needs to be run on-the-fly, at the same time as commands from the host. In that case, the garbage collection process supposed to run in background could be interfering with the foreground commands [1]. The TRIM command and over-provisioning are two great ways to reduce this effect, and are covered in more details in Sections 6.1 and 6.2.

Background operations can affect foreground operations

Background operations such as garbage collection can impact negatively on foreground operations from the host, especially in the case of a sustained workload of small random writes.

A less important reason for blocks to be moved is the read disturb. Reading can change the state of nearby cells, thus blocks need to be moved around after a certain number of reads have been reached [14].

The rate at which data is changing is an important factor. Some data changes rarely, and is called cold or static data, while some other data is updated frequently, which is called hot or dynamic data. If a page stores partly cold and partly hot data, then the cold data will be copied along with the hot data during garbage collection for wear leveling, increasing write amplification due to the presence of cold data. This can be avoided by splitting cold data from hot data, simply by storing them in separate pages. The drawback is then that the pages containing the cold data are less frequently erased, and therefore the blocks storing cold and hot data have to be swapped regularly to ensure wear leveling.

Since the hotness of data is defined at the application level, the FTL has no way of knowing how much of cold and hot data is contained within a single page. A way to improve performance in SSDs is to split cold and hot data as much as possible into separate pages, which will make the job of the garbage collector easier [8].

Split cold and hot data

Hot data is data that changes frequently, and cold data is data that changes infrequently. If some hot data is stored in the same page as some cold data, the cold data will be copied along every time the hot data is updated in a read-modify-write operation, and will be moved along during garbage collection for wear leveling. Splitting cold and hot data as much as possible into separate pages will make the job of the garbage collector easier.

Buffer hot data

Extremely hot data should be buffered as much as possible and written to the drive as infrequently as possible.

Invalidate obsolete data in large batches

When some data is no longer needed or need to be deleted, it is better to wait and invalidate it in a large batches in a single operation. This will allow the garbage collector process to handle larger areas at once and will help minimizing internal fragmentation.

What’s next

Part 4 is available here. You can also go to the Table of Content for this series of articles, and if you’re in a rush, you can also directly go to Part 6, which is summarizing the content from all the other parts.

To receive a notification email every time a new article is posted on Code Capsule, you can subscribe to the newsletter by filling up the form at the top right corner of the blog. As usual, comments are open at the bottom of this post, and I am always happy to welcome questions, corrections and contributions!

Looking for a job?

Do you have experience in infrastructure, and are you interested in building and scaling large distributed systems? My employer, Booking.com, is recruiting Software Engineers and Site Reliability Engineers (SREs) in Amsterdam, Netherlands. If you think you have what it takes, send me your CV at emmanuel [at] codecapsule [dot] com.

Write amplification is defined as the amount of data written to the flash memory divided by the amount of data written by the host. This is independent from the garbage collection process.

It is true, however, that while moving pages around the garbage collection can also incur a lot of write amplification if pages contain a mixture of cold and hot data, or if pages that are no longer used in the host space were not marked as such in the FTL (with TRIM for example). You can read more about all this: http://en.wikipedia.org/wiki/Write_amplification

Remember write amplification was introduced due to garbage collection in flash/SSD. The definition you mentioned already assumes the write granularity is a page — when we come to I/O operation, the granularity is a page.

While what you are saying are true, they are not specific to flash/SSD, and they are not the things that “write amplification” tries to describe.

Hybrid log-block FTL implementations will vary depending on the manufacturer, model and possibly version of the firmware. The hybrid log-block FTL that I presented here is a generic, simplified example. In order to make my point, I needed to have the logical page numbers resolve to the same LBN so they would end up in the same log block. Therefore the numbers 5 and 9 are arbitrary, I chose to set them like that for the purpose of this simplified example. In other words, those numbers are random and have no particular meaning.

At the end of Section 4.2, I provided references to publications that give more details regarding FTLs. You may have read them already, but if not, those papers would be a good starting point in case you had an interest in learning more about FTLs.

I have a question, in the section of 4.2 you mentioned about the log-block mapping,in the figure the 4 writes is translated in logical block 1,so how to translate?when a read is coming ,
how can this read find his logical block in log-block page mapping table or in data-block mapping table?

That would depend on the SSD model and the FTL, but what you want for that is any dictionary data structure that has fast random reads and fast random writes. Typically, you would use either some sort of tree, preferably self-balanced, or a hash table.

Lets say we use a block mapping table scheme, to reduce memory usage.
Suppose a block has 128 pages, So we will have a mapping table reduced by 128 times ..correct ?

But Then page write request will come in any order for a block. We will need to store for each block , which page contains data , which is empty …
Hence we will have to store 128 bits, thats 16 bytes per block .. its still more .. right ?