{ cloud ☁️ architect }

Hadoop Distributed File System (HDFS) Tutorial

HDFS, the Hadoop Distributed File System, is a distributed file system designed to hold very large amounts of data (terabytes or even petabytes) and to provide high-throughput access to this information. Files are stored redundantly across multiple machines to ensure durability in the face of failures and high availability to highly parallel applications.

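HDFS is not transparent to clients in the way an older distributed file system such as NFS is. With NFS, clients do not need to be particularly aware that they are working on files stored remotely: the existing standard library methods like open(), close(), fread(), etc. work on files hosted over NFS. HDFS is not mounted as an ordinary local file system by default, so clients instead read and write its files through the Hadoop FileSystem API or the hadoop command-line tools.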

Configuring HDFS

The HDFS configuration is found in the conf/ directory of your Hadoop installation. The conf/hadoop-defaults.xml file contains default values for every parameter in Hadoop; treat this file as read-only. You override these defaults by setting new values in conf/hadoop-site.xml. This file should be replicated consistently across all machines in the cluster.

The following settings are necessary to configure HDFS:

fs.default.name : This is the URI (protocol specifier, hostname, and port) that describes the NameNode for the cluster. e.g. hdfs://thys.michels.com:9000

dfs.data.dir : This is the path on the local file system in which the DataNode instance should store its data. e.g. /home/username/hdfs/data

dfs.name.dir : This is the path on the local file system of the NameNode instance where the NameNode metadata is stored. e.g. /home/username/hdfs/name
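
As a minimal sketch, the three settings above could be written in conf/hadoop-site.xml as follows. The values are the example values given in the text, not recommendations; substitute your own NameNode host, port, and local paths. The configuration/property/name/value layout is the standard Hadoop configuration file structure.

```xml
<?xml version="1.0"?>
<!-- conf/hadoop-site.xml: site-specific overrides of the Hadoop defaults -->
<configuration>
  <!-- URI of the NameNode (example host and port taken from the text) -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://thys.michels.com:9000</value>
  </property>
  <!-- Local directory where each DataNode stores its blocks (example path) -->
  <property>
    <name>dfs.data.dir</name>
    <value>/home/username/hdfs/data</value>
  </property>
  <!-- Local directory where the NameNode stores its metadata (example path) -->
  <property>
    <name>dfs.name.dir</name>
    <value>/home/username/hdfs/name</value>
  </property>
</configuration>
```

Because the same file is copied to every machine, the dfs.data.dir path should be valid on every DataNode and the dfs.name.dir path on the NameNode.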

The master node needs to know the addresses of all the machines to use as DataNodes; the startup scripts depend on this. Also in the conf/ directory, edit the file slaves so that it contains a list of fully-qualified hostnames for the slave instances, one host per line. On a multi-node setup, the master node (e.g., localhost) is not usually present in this file.