Thursday, March 14, 2013

Apache Hadoop HttpFS : A service that provides HTTP access to HDFS.

HttpFS : Introduction

Apache Hadoop HttpFS is a service that provides HTTP access to HDFS.

HttpFS provides a REST HTTP gateway that supports HDFS operations such as read and write. It can be used to transfer data between clusters running different versions of Hadoop, and to access data in HDFS using standard HTTP utilities.

HttpFS was inspired by Hadoop HDFS proxy and can be seen as a full rewrite of it.

Hadoop HDFS proxy provides only a subset of file system operations (read only), whereas HttpFS supports all file system operations.

HttpFS uses a clean HTTP REST API making its use with HTTP tools more intuitive.
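As a sketch of what that looks like in practice, the following builds a WebHDFS-style request URL for a directory listing. The host name and user here are hypothetical placeholders; 14000 is the default HttpFS port, and `/webhdfs/v1` is the REST API root HttpFS shares with WebHDFS.

```shell
# Hypothetical host and user -- substitute your own values.
HTTPFS_HOST="httpfs-host.example.com"
HTTPFS_PORT=14000        # HttpFS listens on port 14000 by default
USER_NAME="hdfs"

# Build the URL for a directory listing (op=LISTSTATUS):
URL="http://${HTTPFS_HOST}:${HTTPFS_PORT}/webhdfs/v1/user/${USER_NAME}?op=LISTSTATUS&user.name=${USER_NAME}"
echo "$URL"

# Against a live HttpFS server, the request would simply be:
# curl "$URL"
```

The response from a running server is a JSON document describing the directory's contents, so ordinary tools like curl or wget are enough to browse HDFS.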

Prerequisites for installing HttpFS are a working Java installation and access to a running HDFS cluster.

Installing HttpFS

HttpFS is distributed in the hadoop-httpfs package. To install
it, use your preferred package manager application. Install the package
on the system that will run the HttpFS server.

$ sudo yum install hadoop-httpfs      # on a Red Hat-compatible system

$ sudo zypper install hadoop-httpfs   # on a SLES system

$ sudo apt-get install hadoop-httpfs  # on an Ubuntu or Debian system

Alternatively, if you have an HttpFS tarball, you can simply untar it:

$ tar xzf httpfs-2.0.3-alpha.tar.gz

Now you are ready to configure HttpFS.

Configure HttpFS

HttpFS reads the HDFS configuration from the core-site.xml and hdfs-site.xml files in /etc/hadoop/conf/. If necessary, edit those files to configure the HDFS instance that HttpFS will use. By default, HttpFS assumes these Hadoop configuration files (core-site.xml and hdfs-site.xml) are in the HttpFS configuration directory.
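For illustration, a minimal core-site.xml fragment pointing HttpFS at an HDFS NameNode might look like the following; the NameNode host name is a hypothetical placeholder.

```xml
<!-- /etc/hadoop/conf/core-site.xml (illustrative values) -->
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://namenode.example.com:8020</value>
</property>
```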

Configure Hadoop

Edit Hadoop's core-site.xml and define the Unix user that will run the HttpFS server as a proxyuser. For example:
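A sketch of the proxyuser properties, assuming the HttpFS server runs as the Unix user `httpfs` on host `httpfs-host.example.com` (both values are placeholders; substitute the user and host for your own deployment):

```xml
<!-- core-site.xml: allow the 'httpfs' user to impersonate other users -->
<property>
  <name>hadoop.proxyuser.httpfs.hosts</name>
  <value>httpfs-host.example.com</value>
</property>
<property>
  <name>hadoop.proxyuser.httpfs.groups</name>
  <value>*</value>
</property>
```

You need to restart Hadoop for the proxyuser configuration to take effect.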