Storage deduplication has received recent interest in the research community.
In scenarios where the backup process has to complete within short time windows,
inline deduplication can help to achieve higher backup throughput. In such
systems, the method of identifying duplicate data, using disk-based indexes on
chunk hashes, can create throughput bottlenecks due to disk I/Os involved in
index lookups. RAM prefetching and Bloom-filter-based techniques used by Zhu et
al. (2008) can avoid disk I/Os on close to 99% of the index lookups. Even at this
reduced rate, an index lookup going to disk contributes about 0.1 msec to the
average lookup time, which is about 1000 times slower than a lookup hitting in
RAM. We propose to reduce the penalty of index lookup misses in RAM by orders of
magnitude by serving such lookups from a flash-based index, thereby increasing
inline deduplication throughput. Flash memory can reduce the huge gap between RAM
and hard disk in terms of both cost and access times and is a suitable choice for
this application.
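
The arithmetic behind the 0.1 msec figure can be reproduced with a small
calculation. This is only a sketch: the 10 msec disk access time is inferred
from the numbers above (1% of lookups contributing 0.1 msec), and the 0.1 msec
flash read time is an assumed, illustrative value rather than a measurement.

```python
def miss_contribution_ms(miss_rate, miss_latency_ms):
    """Average time added to every lookup by the lookups that miss in RAM."""
    return miss_rate * miss_latency_ms

MISS_RATE = 0.01          # about 1% of lookups miss in RAM (from the text)
DISK_LATENCY_MS = 10.0    # assumed disk access time, implied by 0.1 msec / 1%
FLASH_LATENCY_MS = 0.1    # assumed flash read time, illustrative only

disk_ms = miss_contribution_ms(MISS_RATE, DISK_LATENCY_MS)
flash_ms = miss_contribution_ms(MISS_RATE, FLASH_LATENCY_MS)

print(f"disk-backed index:  {disk_ms:.4f} msec added per lookup")
print(f"flash-backed index: {flash_ms:.4f} msec added per lookup")
print(f"reduction factor:   {disk_ms / flash_ms:.0f}x")
```

Under these assumed latencies, the miss penalty drops by about two orders of
magnitude, which is the effect the flash-based index is meant to deliver.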

To this end, we design a flash-assisted inline deduplication system using
ChunkStash, a chunk metadata store on flash. ChunkStash uses one flash read per
chunk lookup and works in concert with RAM prefetching strategies. It organizes
chunk metadata in a log structure on flash to exploit fast sequential writes. It
uses an in-memory hash table to index it, with hash collisions resolved by a
variant of cuckoo hashing. The in-memory hash table stores (2-byte) compact key
signatures instead of full chunk-ids (20-byte SHA-1 hashes) so as to strike
tradeoffs between RAM usage and false flash reads. Further, by indexing a small
fraction of chunks per container, ChunkStash can reduce RAM usage significantly
with negligible loss in deduplication quality. Evaluations using real-world
enterprise backup datasets show that ChunkStash outperforms an inline
deduplication system based on a hard disk index by 7x-60x in backup throughput
(MB/sec).
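
The following toy model illustrates the general idea of an append-only metadata
log indexed by an in-memory table of compact signatures. It is a minimal sketch,
not the ChunkStash implementation: the class name `ChunkIndexSketch`, the use of
two candidate slots per key, the byte ranges used for slot positions and
signatures, and the re-reading of a key from the log during relocation are all
assumptions made for illustration. A Python list stands in for flash.

```python
import hashlib

class ChunkIndexSketch:
    """Hypothetical toy: a list models the on-flash log; RAM holds only
    (2-byte signature, log offset) pairs in a cuckoo-style hash table."""

    def __init__(self, num_slots=1024, max_kicks=32):
        self.log = []                     # stand-in for the log on flash
        self.slots = [None] * num_slots   # each slot: (signature, log_offset)
        self.max_kicks = max_kicks
        self.flash_reads = 0              # counts simulated flash reads

    def _positions(self, chunk_id):
        # Two candidate slots taken from disjoint bytes of the 20-byte id.
        n = len(self.slots)
        return (int.from_bytes(chunk_id[0:4], "big") % n,
                int.from_bytes(chunk_id[4:8], "big") % n)

    @staticmethod
    def _signature(chunk_id):
        # 2-byte compact signature kept in RAM instead of the full id.
        return chunk_id[8:10]

    def insert(self, chunk_id, metadata):
        self.log.append((chunk_id, metadata))          # sequential append
        key = chunk_id
        entry = (self._signature(chunk_id), len(self.log) - 1)
        came_from = None
        for _ in range(self.max_kicks):
            p1, p2 = self._positions(key)
            for p in (p1, p2):
                if self.slots[p] is None:
                    self.slots[p] = entry
                    return True
            # Both slots are full: evict the occupant of the slot this entry
            # was not just displaced from, then try to re-place the victim.
            victim_pos = p2 if came_from == p1 else p1
            victim = self.slots[victim_pos]
            self.slots[victim_pos] = entry
            entry, came_from = victim, victim_pos
            key = self.log[victim[1]][0]   # simplification: re-read full key
        return False  # one entry is left unindexed; no overflow handling here

    def lookup(self, chunk_id):
        sig = self._signature(chunk_id)
        for pos in self._positions(chunk_id):
            slot = self.slots[pos]
            if slot is not None and slot[0] == sig:
                self.flash_reads += 1                  # one simulated read
                stored_id, metadata = self.log[slot[1]]
                if stored_id == chunk_id:
                    return metadata
                # Signature matched a different chunk: a false flash read.
        return None

index = ChunkIndexSketch()
ids = [hashlib.sha1(f"chunk-{i}".encode()).digest() for i in range(20)]
for i, cid in enumerate(ids):
    index.insert(cid, {"container": i // 4})
print(index.lookup(ids[5]), index.lookup(hashlib.sha1(b"absent").digest()))
```

The sketch makes the tradeoff visible: shorter signatures save RAM but raise the
chance that a signature matches the wrong chunk, which costs a wasted flash read
before the full chunk-id comparison rejects it.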