Detecting slow DataNodes

Slow DataNodes in an HDFS cluster can degrade overall cluster performance.
HDFS therefore provides a mechanism to detect and report them.

HDFS is designed to detect and recover from complete failure of DataNodes:

There is no single point of failure.

Automatic NameNode failover takes only a few seconds.

Because data replication can be massively parallelized in large clusters,
recovery from DataNode loss occurs within minutes.

Most jobs are not affected by DataNode failures.

However, partial failures can negatively affect the performance of running DataNodes:

Slow network connection due to a failing or misconfigured adapter.

Bad OS or JVM settings that affect service performance.

Slow hard disk.

Bad disk controller.

Slow DataNodes can have a significant impact on cluster performance. A slow DataNode
may continue to send heartbeats successfully, so the NameNode keeps directing clients
to it. HDFS DataNode monitoring detects and reports slow DataNodes that negatively
affect cluster performance.

Enable disk IO statistics

Disk IO statistics are disabled by default. To enable disk IO statistics, you must set the file IO sampling percentage to a non-zero value in the hdfs-site.xml file.
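
As a minimal sketch of that setting, the fragment below shows what the hdfs-site.xml entry might look like. The text above does not name the property. In Apache Hadoop the file IO sampling percentage is controlled by dfs.datanode.fileio.profiling.sampling.percentage, so that name is used here on the assumption that your distribution follows upstream; verify it against the hdfs-default.xml shipped with your release. The value 25 is only an illustrative choice, not a recommendation from this document.

```xml
<!-- hdfs-site.xml fragment (sketch). Place inside the existing <configuration> element. -->
<!-- Assumption: property name taken from Apache Hadoop's hdfs-default.xml. -->
<property>
  <name>dfs.datanode.fileio.profiling.sampling.percentage</name>
  <!-- Illustrative value: sample 25% of file IO events. The default, 0, leaves disk IO statistics disabled. -->
  <value>25</value>
</property>
```

A higher percentage samples more file IO events and so yields more complete statistics at the cost of more profiling overhead on each DataNode. DataNodes typically need to be restarted to pick up a change to hdfs-site.xml.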