Apache Hadoop is a framework for running applications on large clusters built of commodity hardware. The Hadoop framework transparently provides applications with both reliability and data motion. Hadoop implements a computational paradigm named MapReduce, in which the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. In addition, it provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both MapReduce and HDFS are designed so that node failures are handled automatically by the framework.

Installation
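
Install the hadoop package, which is available in the AUR.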

Configuration

By default, Hadoop is already configured for pseudo-distributed operation. Some environment variables are set in /etc/profile.d/hadoop.sh with values that differ from a traditional Hadoop installation:

ENV              Value               Description                             Permission
HADOOP_CONF_DIR  /etc/hadoop         Where configuration files are stored.   Read
HADOOP_LOG_DIR   /tmp/hadoop/log     Where log files are stored.             Read and Write
HADOOP_SLAVES    /etc/hadoop/slaves  File naming remote slave hosts.         Read
HADOOP_PID_DIR   /tmp/hadoop/run     Where pid files are stored.             Read and Write
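
Conceptually, /etc/profile.d/hadoop.sh exports these values with lines along the following; this is a sketch based on the table above, and the packaged file may differ in detail:

export HADOOP_CONF_DIR=/etc/hadoop
export HADOOP_LOG_DIR=/tmp/hadoop/log
export HADOOP_SLAVES=/etc/hadoop/slaves
export HADOOP_PID_DIR=/tmp/hadoop/run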

You should also make sure the following files are set up correctly:

/etc/hosts
/etc/hostname
/etc/locale.conf
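
For a single-node (pseudo-distributed) setup, a minimal /etc/hosts along the following lines is usually sufficient; myhostname is a placeholder for the name you put in /etc/hostname:

/etc/hosts

127.0.0.1   localhost
::1         localhost
127.0.1.1   myhostname.localdomain  myhostname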

You need to tell Hadoop your JAVA_HOME in /etc/hadoop/hadoop-env.sh, because it does not detect the location where Java is installed on Arch Linux by itself:

/etc/hadoop/hadoop-env.sh

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk/
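
If you are unsure which Java environments are installed, archlinux-java can list them, which helps in choosing the right JAVA_HOME path:

archlinux-java status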

Check installation with:

hadoop version

If you get the warning message "WARNING: HADOOP_SLAVES has been replaced by HADOOP_WORKERS. Using value of HADOOP_SLAVES.", replace export HADOOP_SLAVES=/etc/hadoop/slaves in /etc/profile.d/hadoop.sh with:
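
export HADOOP_WORKERS=/etc/hadoop/workers

The path /etc/hadoop/workers assumes you follow Hadoop 3's rename of the slaves file to workers; if your installation still ships /etc/hadoop/slaves, rename that file accordingly or point HADOOP_WORKERS at the existing file.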