Centralized cache management in HDFS is an explicit caching mechanism that lets
you specify the directories or files that HDFS should cache. The NameNode
communicates with the DataNodes that hold the desired blocks on disk and
instructs them to cache those blocks in off-heap caches.

Explicit pinning prevents frequently used data from being evicted from
memory. This is particularly important when the size of the working set exceeds
the size of main memory, which is common for many HDFS workloads.

Because DataNode caches are managed by the NameNode, applications can
query the set of cached block locations when making task placement decisions.
Co-locating a task with a cached block replica improves read performance.
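The placement decision described above can be sketched as follows. This is a minimal illustration, not Hadoop code: the class `CacheAwarePlacement` and its method `chooseHost` are hypothetical names, and the two host arrays stand in for what a client would obtain from the NameNode (in the Hadoop client API, `BlockLocation.getHosts()` and `BlockLocation.getCachedHosts()` return lists of this kind). The fallback order (cached replica first, then on-disk replica) is an assumption about what a scheduler would reasonably prefer.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;

/** Hypothetical helper: picks a host for a task that reads a single block. */
public class CacheAwarePlacement {

    /**
     * Returns the first candidate host holding a cached replica of the block.
     * If none does, falls back to the first candidate holding an on-disk
     * replica. Returns empty when no candidate holds any replica.
     */
    public static Optional<String> chooseHost(List<String> candidates,
                                              String[] replicaHosts,
                                              String[] cachedHosts) {
        Set<String> cached = new HashSet<>(Arrays.asList(cachedHosts));
        Set<String> onDisk = new HashSet<>(Arrays.asList(replicaHosts));

        // First pass: prefer a host that already has the block in memory.
        for (String host : candidates) {
            if (cached.contains(host)) {
                return Optional.of(host);
            }
        }
        // Second pass: settle for ordinary disk locality.
        for (String host : candidates) {
            if (onDisk.contains(host)) {
                return Optional.of(host);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        List<String> free = Arrays.asList("dn1", "dn2", "dn3"); // hosts with free task slots
        String[] replicas = {"dn1", "dn2", "dn3"};              // on-disk replica locations
        String[] cached = {"dn2"};                              // cached replica locations
        System.out.println(chooseHost(free, replicas, cached).orElse("any")); // prints dn2
    }
}
```

Keeping the cached-host check as a separate first pass means that a cached replica always wins over a merely local one, regardless of the order in which candidate hosts are listed.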

When a block has been cached by a DataNode, clients can use a new, more
efficient, zero-copy read API. Because the DataNode verifies the checksums of
cached data once, clients do not need to verify them again and incur
essentially no overhead when using this API.
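To give a feel for what "zero-copy" means here, the sketch below reads a local file through a read-only memory mapping using only `java.nio`. It is an analogy rather than the HDFS API: the class `MappedReadSketch` and the method `sumBytes` are hypothetical, the file is an ordinary local file standing in for a cached block, and the assumption that HDFS zero-copy reads work along similar lines (mapping locally available block data instead of copying it into a heap buffer) is not confirmed by the text above.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical stand-in for a zero-copy read: data is mapped, not copied. */
public class MappedReadSketch {

    /** Sums the unsigned byte values of a local file via a read-only mapping. */
    public static long sumBytes(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            // Map the whole file; bytes are read straight from the mapping,
            // without first copying them into a heap-allocated byte[].
            MappedByteBuffer buffer =
                    channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            long sum = 0;
            while (buffer.hasRemaining()) {
                sum += buffer.get() & 0xFF;
            }
            return sum;
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("block", ".bin"); // stand-in for a cached block
        try {
            Files.write(tmp, new byte[] {1, 2, 3, 4});
            System.out.println(sumBytes(tmp)); // prints 10
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

Note that this sketch performs no checksum verification at all, which mirrors the point made above: a client can skip that step only because the DataNode has already verified the cached data.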

Centralized caching can improve overall cluster memory utilization. When
relying on the operating system buffer cache on each DataNode, repeated reads of
a block will result in all n replicas of the block being pulled into the
buffer cache. With centralized cache management, you can explicitly pin only
m of the n replicas, thereby saving the memory that the other n - m replicas
would otherwise occupy.
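The saving can be made concrete with a small calculation. The sketch below is illustrative only: the class `CacheMemorySavings` and the method `savedBytes` are hypothetical, and the 128 MiB block size and replication factor of 3 are assumed example values, not figures taken from the text above.

```java
/** Hypothetical helper illustrating the n - m saving for a single block. */
public class CacheMemorySavings {

    /**
     * Memory saved across the cluster for one block when only m of its
     * n replicas are held in memory instead of all n.
     */
    public static long savedBytes(long blockSizeBytes, int n, int m) {
        if (m < 0 || m > n) {
            throw new IllegalArgumentException("require 0 <= m <= n");
        }
        return (long) (n - m) * blockSizeBytes;
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024; // assumed block size: 128 MiB
        long saved = savedBytes(blockSize, 3, 1); // n = 3 replicas, m = 1 pinned
        System.out.println(saved / (1024 * 1024) + " MiB"); // prints 256 MiB
    }
}
```

Under these assumptions, pinning one replica instead of letting all three drift into the buffer cache frees 256 MiB per block, and the saving scales linearly with the number of hot blocks.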