HDFS Block Concepts - Hadoop Tutorial

Filesystem Blocks: A block is the smallest unit of data that can be written to or read from a disk. A filesystem stores its data in blocks, which are normally a few kilobytes in size. Blocks are transparent to the user performing filesystem operations such as read and write.

HDFS Block

The Hadoop Distributed File System (HDFS) also stores data in blocks. However, an HDFS block is much larger than a disk filesystem block. The default HDFS block size is 64MB in Hadoop 1.x (128MB in Hadoop 2.x and later), and it is configurable. With the 64MB default, files are split into 64MB blocks and then stored in the Hadoop filesystem. HDFS is responsible for distributing the data blocks across multiple nodes.
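The splitting described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not HDFS code: the function name split_into_blocks is hypothetical, and the 64MB default is the block size used in this tutorial.

```python
MB = 1024 * 1024  # bytes in one megabyte


def split_into_blocks(file_size, block_size=64 * MB):
    """Return the sizes (in bytes) of the blocks a file would be split into.

    Hypothetical helper for illustration; it only models the arithmetic.
    """
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        # The last block holds only the leftover bytes, not a full block.
        sizes.append(remainder)
    return sizes


blocks = split_into_blocks(200 * MB)
print([size // MB for size in blocks])  # [64, 64, 64, 8]
```

A 200MB file becomes three full 64MB blocks plus one final block that holds only the remaining 8MB.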

Advantages of HDFS Block

The benefits of HDFS blocks are:

The blocks are of fixed size, so it is very easy to calculate the number of blocks that can be stored on a disk.

The HDFS block concept simplifies storage on the datanodes. The datanodes do not need to deal with metadata such as file permissions. The namenode maintains the metadata for all the blocks.

If a file is smaller than the HDFS block size, it does not occupy a full block of storage. It uses only as much disk space as its actual size.

Because a file is chunked into blocks, it is easy to store a file that is larger than any single disk: the data blocks are distributed and stored on multiple nodes in the Hadoop cluster.

Blocks are easy to replicate between the datanodes, which provides fault tolerance and high availability. The Hadoop framework replicates each block across multiple nodes (the default replication factor is 3). If a node fails or a block is corrupted, the same block can be read from another node.
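The replication and failover idea can be sketched as follows. This is a simplified model, not the placement policy HDFS actually uses (which also takes racks into account): the functions place_replicas and pick_replica, the round-robin placement, and the datanode names are all assumptions made for illustration.

```python
def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Hypothetical, simplified placement; real HDFS placement is rack-aware.
    """
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + i) % len(nodes)] for i in range(replication)
        ]
    return placement


def pick_replica(block_id, placement, failed_nodes):
    """Return the first node holding the block that has not failed."""
    for node in placement[block_id]:
        if node not in failed_nodes:
            return node
    raise IOError("all replicas of block %d are unavailable" % block_id)


nodes = ["dn1", "dn2", "dn3", "dn4"]  # hypothetical datanode names
placement = place_replicas(4, nodes)
print(placement[0])                           # ['dn1', 'dn2', 'dn3']
print(pick_replica(0, placement, {"dn1"}))    # dn2
```

With three replicas per block, a read still succeeds after one node fails because another node holds a copy of the same block.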

Why HDFS Blocks are Large in Size

The main reason for making HDFS blocks large is to reduce the cost of seeks relative to the time spent transferring data. Typically, the seek time is about 10ms and the disk transfer rate is about 100MB/s. To make the seek time 1% of the transfer time, a block must take about 1 second to transfer, so the block size should be about 100MB. The default HDFS block size is 64MB.
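The calculation above can be checked with a short Python sketch. The function names are hypothetical, and the 10ms seek time and 100MB/s transfer rate are the rough figures quoted in the text, not measurements.

```python
def block_size_for_seek_fraction(seek_time_s, transfer_rate_mb_s, seek_fraction):
    """Block size (MB) at which seek time is `seek_fraction` of transfer time."""
    transfer_time_s = seek_time_s / seek_fraction
    return transfer_rate_mb_s * transfer_time_s


def seek_fraction_for_block(seek_time_s, transfer_rate_mb_s, block_size_mb):
    """Seek time as a fraction of the time needed to transfer one block."""
    transfer_time_s = block_size_mb / transfer_rate_mb_s
    return seek_time_s / transfer_time_s


# 10ms seek, 100MB/s transfer, seek time limited to 1% of transfer time
print(block_size_for_seek_fraction(0.010, 100, 0.01))  # 100.0

# Seek overhead for a 64MB block under the same assumptions
print(seek_fraction_for_block(0.010, 100, 64))
```

Under these assumptions a 64MB block gives a seek overhead of roughly 1.6% of the transfer time, close to the 1% target.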