4.
Hadoop Approach (1/4)
● Data Distribution
○ Data is distributed across all the nodes in the cluster
○ Each block is replicated to several nodes, so the loss of a single node does not lose data

5.
Hadoop Approach (2/4)
● Move computation to the data
○ Whenever possible, computation is moved to the node that contains the data, rather than moving the data to the computation
○ Most data is read from the local disk straight into the CPU, easing the strain on network bandwidth and avoiding unnecessary network transfers
○ This data locality is a key reason for Hadoop's high performance (see the sketch below)
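As a minimal sketch of how locality information is exposed, the HDFS client API lets a program (as the JobTracker's scheduler does) ask the NameNode which hosts hold each block of a file; the file path here is hypothetical:

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocality {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/data/input.txt");  // hypothetical file
    FileStatus status = fs.getFileStatus(file);
    // Ask the NameNode where each block of the file is stored
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      // Hosts holding replicas of this block; Hadoop prefers to
      // schedule a map task for the block on one of these nodes
      System.out.println(Arrays.toString(block.getHosts()));
    }
  }
}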

7.
Hadoop Approach (4/4)
● Isolated execution
○ Communication between nodes is limited and handled implicitly by the framework
○ Individual node failures can be worked around by restarting tasks on other nodes
■ User tasks do not need to exchange messages
■ No rollback to pre-arranged checkpoints is needed to partially restart the computation
■ Other workers continue to operate as though nothing went wrong

10.
HDFS (1/2)
● Storage component of Hadoop
● Distributed file system modeled after GFS
● Optimized for high throughput
● Works best when reading and writing large files
(gigabytes and larger)
● To support this throughput, HDFS uses unusually large (for a filesystem) block sizes and data locality optimizations to reduce network input/output (I/O)

11.
HDFS (2/2)
● Scalability and availability are also key traits of HDFS, achieved in part through data replication and fault tolerance
● HDFS replicates each file a configurable number of times, tolerates both software and hardware failures, and automatically re-replicates data blocks from failed nodes (see the sketch below)
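The replication factor and block size are visible per file through the HDFS client API; a minimal sketch (the path is hypothetical) that also asks for one extra replica of an important file:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationInfo {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/data/important.dat");  // hypothetical file
    FileStatus status = fs.getFileStatus(file);
    System.out.println("replication = " + status.getReplication());
    System.out.println("block size  = " + status.getBlockSize());
    // Request one more replica; the NameNode re-replicates the
    // file's blocks in the background
    fs.setReplication(file, (short) (status.getReplication() + 1));
  }
}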

16.
Hadoop Installation
● Local mode
○ No need to communicate with other nodes, so it uses neither HDFS nor any of the Hadoop daemons
○ Used for developing and debugging the application logic of a MapReduce program
● Pseudo-distributed mode
○ All daemons run on a single machine
○ Helps to examine memory usage, HDFS input/output issues, and other daemon interactions (a minimal configuration is sketched below)
● Fully distributed mode
○ Daemons run across a cluster of machines; this is the mode used for production clusters
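For the pseudo-distributed mode, a minimal configuration in the same property format used later in this deck (the host/port values are the conventional ones from the Hadoop 1.x documentation):

conf/core-site.xml:
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:9000</value>
</property>

conf/hdfs-site.xml:
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>

conf/mapred-site.xml:
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:9001</value>
</property>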

22.
Hadoop Configuration
● masters
○ The name is misleading; it should have been called secondary-masters
○ When you start Hadoop, it launches the NameNode and JobTracker on the local host from which you issued the start command, and then SSHes to each host in this file to launch the SecondaryNameNode
● slaves
○ Contains the list of hosts that are Hadoop slaves
○ When you start Hadoop, it SSHes to each host in this file and launches the DataNode and TaskTracker daemons (example file contents below)
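For illustration (the hostnames are hypothetical), both files are plain lists of hosts, one per line:

conf/masters:
snn.hadoop.example.com

conf/slaves:
worker1.hadoop.example.com
worker2.hadoop.example.com
worker3.hadoop.example.com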

26.
Recipes:
HDFS Block Size (1/3)
● HDFS stores files across the cluster by breaking them into coarse-grained, fixed-size blocks
● The default HDFS block size is 64 MB
● The block size affects the performance of
○ filesystem operations, where larger block sizes are more effective if you are storing and processing very large files
○ MapReduce computations, as the default behavior of Hadoop is to create one map task for each data block of the input files (see the worked example below)
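For example, a 1 GB (1,024 MB) input file stored with the default 64 MB blocks is split into 16 blocks, so by default a MapReduce job over that file launches 16 map tasks; raising the block size to 128 MB halves that to 8 larger map tasks.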

27.
Recipes:
HDFS Block Size (2/3)
● Option 1: NameNode configuration
○ Add or modify the dfs.block.size parameter in conf/hdfs-site.xml
○ The value is the block size in bytes (134217728 bytes = 128 MB)
○ Only files copied in after the change will get the new block size
○ Existing files in HDFS are not affected
<property>
  <name>dfs.block.size</name>
  <value>134217728</value>
</property>
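Because the block size is applied when a file is written (which is also why existing files keep their old block size), it can be chosen per file as well; a minimal sketch using the FileSystem.create overload that takes a block size, with a hypothetical path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CreateWithBlockSize {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // create(path, overwrite, bufferSize, replication, blockSize)
    FSDataOutputStream out = fs.create(
        new Path("/data/bigfile.dat"),  // hypothetical file
        true,                           // overwrite if it exists
        4096,                           // I/O buffer size in bytes
        (short) 3,                      // replication factor
        134217728L);                    // 128 MB block size
    out.writeBytes("written with a 128 MB block size\n");
    out.close();
  }
}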