Write Caching in Flash: A Dubious Distinction

January 12, 2012

Flash memory shines on reads: it reads 100 times faster than a disk. But its performance advantage is much weaker on writes, and its write endurance is much lower than disk’s. Therefore, Nimble OS uses flash only for accelerating reads, aka “read caching”. It uses NVRAM (a DRAM-based device) for accelerating writes, aka “write caching”.

On the other hand, a few storage systems use flash memory for write caching. Here I describe what compels these systems to use flash in this manner and the cost-benefit tradeoff it entails.

In general, storage systems implement write caching using a non-volatile “write buffer.” On a write request, the system stores the data in the write buffer and acknowledges the request. In the background, as the buffer fills up, the system drains the buffer to the underlying storage. The speed at which the write buffer can be drained to underlying storage constrains the sustainable write throughput.

The write buffer helps in the following ways:

It enables the storage system to acknowledge a write request with very low latency.

It can absorb a high-throughput burst of writes, while it drains less speedily to disk-based storage over a longer period of time.

It absorbs overwrites (multiple writes to the same blocks), thereby reducing the amount of drainage, which may support a higher write throughput.

It allows the data being drained to be sorted by logical addresses, thereby improving the sequentiality of drainage, which may improve the speed of draining and support a higher write throughput.
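The buffer behavior described above can be sketched in a few lines of Python. This is a toy model, not any vendor's implementation: the class name `WriteBuffer`, the `capacity_blocks` parameter, and the in-memory `drained` list standing in for underlying storage are all assumptions made for illustration.

```python
class WriteBuffer:
    """Toy write buffer keyed by logical block address (LBA). Hypothetical sketch."""

    def __init__(self, capacity_blocks):
        self.capacity_blocks = capacity_blocks
        self.pending = {}   # LBA -> latest data buffered for that block
        self.drained = []   # stands in for underlying storage; records drain order

    def write(self, lba, data):
        # Make room first if a new block would overflow the buffer.
        if lba not in self.pending and len(self.pending) >= self.capacity_blocks:
            self.drain()
        # An overwrite replaces the buffered copy, so only the latest
        # version of a block is ever drained (overwrite absorption).
        self.pending[lba] = data
        return "ack"        # acknowledged as soon as the data is buffered

    def drain(self):
        # Drain in logical-address order to make the writes more sequential.
        for lba in sorted(self.pending):
            self.drained.append((lba, self.pending[lba]))
        self.pending.clear()


buf = WriteBuffer(capacity_blocks=4)
for lba, data in [(7, "a"), (3, "b"), (7, "c"), (1, "d")]:
    buf.write(lba, data)
buf.drain()
print(buf.drained)  # [(1, 'd'), (3, 'b'), (7, 'c')]
```

Four writes arrive, but only three blocks are drained, and they are drained in address order. A real buffer would also need to persist its contents and drain in the background rather than on overflow.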

The latency advantage depends on the buffering medium. NVRAM (DRAM made non-volatile with battery backup or flash backup) provides latency of a few tens of microseconds, flash a few hundred microseconds, and disk a few milliseconds. Most storage systems use NVRAM for write buffering. However, file systems that are not tied to a hardware platform cannot assume the availability of NVRAM, and may buffer writes on flash or even on disk. E.g., the write buffer in ZFS, called the ZFS Intent Log (ZIL), is generally stored on flash or disk.

A few storage systems now use flash as a secondary write buffer in addition to using NVRAM. E.g., EMC “FAST cache” uses flash as both a read cache and a write buffer. In such systems, written data is staged through the NVRAM-based buffer, the flash-based buffer, and finally to disk. The flash-based buffer is much bigger than the NVRAM-based buffer, and therefore provides higher levels of burst absorption, overwrite absorption, and sequentiality improvement, which in turn may support a higher write throughput. These advantages are based on the assumption that the NVRAM-based buffer cannot be drained directly to disk-based storage at high throughput.
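To see why buffer size matters for burst absorption, and why that depends on the drain rate, here is a back-of-the-envelope calculation. The function name `burst_seconds` and every number below (buffer sizes, ingest and drain rates) are hypothetical values chosen for illustration, not measurements of any product.

```python
def burst_seconds(buffer_bytes, ingest_bps, drain_bps):
    """Seconds a buffer can absorb a burst before it fills up."""
    if ingest_bps <= drain_bps:
        return float("inf")   # the drain keeps up, so the buffer never fills
    return buffer_bytes / (ingest_bps - drain_bps)


GB = 10**9
MB = 10**6

# Hypothetical: a 500 MB/s write burst while draining at 100 MB/s.
nvram = burst_seconds(4 * GB, 500 * MB, 100 * MB)     # small NVRAM-sized buffer
flash = burst_seconds(400 * GB, 500 * MB, 100 * MB)   # 100x larger flash-sized buffer
print(nvram, flash)  # 10.0 1000.0

# If the buffer could drain as fast as writes arrive, its size would stop mattering.
print(burst_seconds(4 * GB, 500 * MB, 500 * MB))  # inf
```

Under these assumed numbers the larger buffer absorbs the burst 100 times longer, but only because the drain is slow. The last line illustrates the assumption noted above: a buffer that drains at full speed gains nothing from being bigger.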

Most storage systems employ a simplistic disk layout such that draining the write buffer results in random writes on disk. Furthermore, these systems amplify the IO load in order to support parity RAID and copy-on-write snapshots. The resulting load cripples the speed at which data can be drained to disk. (NetApp’s WAFL performs better by concatenating random data blocks and writing them into free space, but it too degenerates gradually as the free space becomes fragmented.) Because these systems cannot drain to disk at high speed, they stand to benefit from adding a larger write buffer. Even so, this benefit is limited because it does not eliminate random writes to disk—it only reduces them by some modest amount.

Furthermore, many of these storage systems could instead use a disk-based write buffer, which would be similar to the write-ahead log used in database systems. The log is written sequentially, and disks handle sequential writes just as well as flash drives do (about 100MB/s per drive). One advantage of a flash-based buffer over a disk-based buffer is that it also serves as a read cache for newly written data. However, as described later, there are cheaper ways of building a read cache. Another advantage is that the draining process can read the flash-based buffer in random order, so it supports a more thorough sorting of the data, thereby extracting more sequentiality.

Now consider the cost of write buffering. A flash-based buffer is expensive. First, because it holds the only copy of newly written data, it must employ the more expensive forms of flash and controllers, and also some RAID-like redundancy in the form of parity or mirroring. (In fact, a flash-based buffer needs to be even more reliable than an NVRAM-based buffer, because it is larger and the overwrite-absorption and re-sorting might make it difficult to recover the system to a consistent state upon loss.) On the other hand, a read cache does not ever store the only copy of any data, so it can be constructed inexpensively without sacrificing reliability: add a checksum to every block, verify the checksum on every read, and toss the cached block if the checksum does not match. Second, pushing the writes through flash burns through its limited write endurance, again requiring expensive, high-endurance flash. Third, to obtain a significant edge over an NVRAM-based log, the flash-based log must be much bigger. E.g., it may need to be large enough to absorb all writes during a busy period lasting hours.
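The checksum scheme for an inexpensive read cache is simple enough to sketch. The class `ChecksummedReadCache`, the use of CRC32 from Python's `zlib` module, and the dictionaries standing in for flash and disk are assumptions for illustration; a real system would likely use a stronger checksum and a real eviction policy.

```python
import zlib


class ChecksummedReadCache:
    """Toy read cache that verifies a checksum on every hit. Hypothetical sketch."""

    def __init__(self, backing_store):
        self.backing_store = backing_store   # dict LBA -> bytes; the authoritative copy
        self.cache = {}                      # dict LBA -> (crc32, bytes); may be unreliable

    def read(self, lba):
        entry = self.cache.get(lba)
        if entry is not None:
            crc, data = entry
            if zlib.crc32(data) == crc:
                return data                  # verified cache hit
            del self.cache[lba]              # mismatch: toss the cached block
        # Miss or corrupt entry: fetch from the authoritative copy and re-cache.
        data = self.backing_store[lba]
        self.cache[lba] = (zlib.crc32(data), data)
        return data


store = {5: b"hello"}
rc = ChecksummedReadCache(store)
rc.read(5)                                   # populates the cache
good_crc = rc.cache[5][0]
rc.cache[5] = (good_crc, b"jello")           # simulate corruption on cheap flash
print(rc.read(5))                            # b'hello'
```

Because the backing store always holds a good copy, a corrupt cache entry costs one slow read rather than lost data. A write buffer cannot use this trick, since it holds the only copy.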

Consider how the authors of one research paper describe their system:

“We present Griffin, a hybrid storage device that uses a hard disk drive (HDD) as a write cache for a Solid State Device (SSD).”

In other words, the authors are proposing just the opposite of using a flash-based write cache for disk! These authors are reputable researchers from academia and Microsoft Research, and they exhibit a deep understanding of flash characteristics as a storage medium. There are practical issues with following their proposal, but its mere existence calls into question the wisdom of using flash for write caching.

Nimble’s CASL™ filesystem uses the entire disk storage as a log, and always writes data to disk in large sequential chunks. This enables it to drain data from the NVRAM buffer to disk storage at high throughput, which avoids the need for a secondary write buffer. It is as if the entire disk subsystem is at once a write buffer and the end point of storage.
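The general idea of a log-structured layout can be illustrated with a toy model. This is a generic sketch and not CASL's actual design: the class `LogLayout`, the `blocks_per_chunk` parameter, and the in-memory index are assumptions, and the sketch omits reclaiming space from superseded blocks, which any real log-structured system must handle.

```python
class LogLayout:
    """Toy log-structured layout: all writes are appended as chunks. Hypothetical sketch."""

    def __init__(self, blocks_per_chunk):
        self.blocks_per_chunk = blocks_per_chunk
        self.log = []     # list of chunks; each chunk is a list of (lba, data), appended in order
        self.index = {}   # lba -> (chunk number, offset) of the latest copy

    def drain(self, buffered_writes):
        """Write buffered blocks (dict lba -> data) as sequential chunks at the log tail."""
        items = sorted(buffered_writes.items())
        for start in range(0, len(items), self.blocks_per_chunk):
            chunk = items[start:start + self.blocks_per_chunk]
            chunk_no = len(self.log)
            self.log.append(chunk)            # one sequential append, wherever the LBAs fall
            for offset, (lba, _data) in enumerate(chunk):
                self.index[lba] = (chunk_no, offset)

    def read(self, lba):
        chunk_no, offset = self.index[lba]
        return self.log[chunk_no][offset][1]


layout = LogLayout(blocks_per_chunk=2)
layout.drain({900: "x", 4: "y", 57: "z"})     # scattered addresses, appended together
layout.drain({57: "z2"})                      # an overwrite also goes to the log tail
print(len(layout.log), layout.read(57))       # 3 z2
```

However scattered the logical addresses are, every drain becomes an append at the tail of the log, so the disk sees sequential writes. In this sketch the final chunk of a drain may be smaller than `blocks_per_chunk`; a real system would aim to keep chunks large.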

In summary, flash-based write caching addresses burst throughput but only partially improves sustained throughput, while a write-optimized disk layout addresses both at little cost. However, systems with legacy disk layouts are forced to cache writes in flash, a costly fix that only partially improves their write performance.