How To Enable NFS Access to HDFS in Hortonworks Sandbox

In this blog we’ll set up NFS access to HDFS on the Hortonworks Sandbox 1.3. This lets desktop users read and write files in Hadoop with the tools they already know, and the Sandbox is a great way to explore this type of access.

First we’ll enable Ambari so that we can edit the HDFS configuration for NFS. Open an SSH session to the Sandbox as root; the ‘root’ account password is ‘hadoop’.

Install the NFS Server bits for the Linux OS.

yum install nfs* -y

You may have to enable an externally facing network adapter to allow the yum command to resolve the correct repository. If this is not possible, install the package called nfs-utils for CentOS 6.

Start the Ambari Server

Navigate to the IP address shown when the sandbox boots; the page there includes documentation on starting Ambari. The steps are summarized below. Please be sure to reboot the virtual machine after you have run the start_ambari script.

Open Ambari UI in the browser

Sign in with username: admin, password: admin.

We will now update the HDFS configs to enable NFS. To do that, we’ll need to stop the HDFS and MapReduce services, update the configs, and restart HDFS and MapReduce. MapReduce must be stopped first, followed by HDFS.

Go to Services tab on top, then select MapReduce, and click Stop.

Go to Services tab on top, then select HDFS on the left, and choose Configs sub tab.

Click the Stop button to stop the HDFS services.

Ambari will show all of the HDFS services as stopped once the operation completes successfully.

In the Configs tab, open the Advanced section and change the value of dfs.access.time.precision to 3600000. (If editing from the command line instead, this property belongs in hdfs-site.xml.)

In the same section, change the value for dfs.datanode.max.xcievers to 1024.
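If you prefer editing the configuration file by hand rather than through Ambari, the two settings above correspond to the following hdfs-site.xml entries (values taken from the steps above):

```xml
<!-- hdfs-site.xml equivalents of the two Ambari edits above -->
<property>
  <name>dfs.access.time.precision</name>
  <value>3600000</value>
</property>
<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>1024</value>
</property>
```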

In the Custom hdfs-site.xml section, add the following property:
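The exact property depends on your HDP release; in Hadoop NFS-gateway setups of this vintage it is typically the gateway’s temporary dump directory. Treat the name and value below as an assumption and verify them against the HDP documentation:

```xml
<!-- Assumed property for the NFS gateway's temporary dump directory;
     verify the exact name and value for your HDP release. -->
<property>
  <name>dfs.nfs3.dump.dir</name>
  <value>/tmp/.hdfs-nfs</value>
</property>
```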

Then click the Save button.

Start the HDFS services, and then the MapReduce services.

You need to stop the native Linux NFS services (nfs and rpcbind) and then start the Hadoop-enabled versions:

service nfs stop
service rpcbind stop
hadoop portmap
hadoop nfs3

To start these each time the sandbox boots, you can add a few lines to your rc.local startup script:

hadoop-daemon.sh start portmap
hadoop-daemon.sh start nfs3

This will place logs for each service in /var/log/hadoop.

Verify the NFS server is up and running on the sandbox with the rpcinfo command. You may also run the showmount command, both on the sandbox and on the client machine; the export list should show “/” available to everyone.

Create a user on your client machine that matches a user in the Sandbox HDP VM.

For example, hdfs is a user on the Sandbox VM. The UID for hdfs is 497.

On my client machine, which happens to run Mac OS X, I’ll create a user hdfs with the same UID (on OS X this can be done with the dscl command or through System Preferences).

If on another operating system, create a user hdfs with the UID 497 to match the user on the sandbox VM. This is easily accomplished in Linux using the -u option to the adduser command. On Windows you will likely need a third-party NFS client; on Server and premium editions of Windows, you can instead add the Subsystem for UNIX-based Applications.
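On Linux, for example, creating the matching user is a single command. This sketch assumes root privileges and that UID 497 is not already in use:

```shell
# Create an hdfs user whose UID matches the sandbox's hdfs user (497).
# Requires root; most distributions accept -u on useradd/adduser.
useradd -u 497 hdfs

# Confirm the UID matches the sandbox.
id -u hdfs
```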

Mount HDFS as a file system on local client machine

mount -t nfs -o vers=3,proto=tcp,nolock HOSTIP:/ /PATH/TO/MOUNTPOINT

Now browse HDFS as if it were part of the local filesystem.

Load data off HDFS onto the local file system with an ordinary file copy.

Delete data in HDFS with an ordinary file delete.

Load data into HDFS the same way: take a file from the local disk, mahout.zip, and copy it into the hdfs user directory on the HDFS file system. On this local machine, HDFS is mounted at /Users/hdfs/mnt/
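Once HDFS is mounted, all three of these operations are plain file commands. A sketch using a scratch directory as a stand-in for the mount point (on the machine above, the real paths would be under /Users/hdfs/mnt/):

```shell
# Stand-in for the HDFS mount point; substitute your real mount point.
MNT=$(mktemp -d)
mkdir -p "$MNT/user/hdfs"

echo "demo" > mahout.zip                      # stand-in for the real archive
cp mahout.zip "$MNT/user/hdfs/"               # load data into HDFS
cp "$MNT/user/hdfs/mahout.zip" ./fetched.zip  # load data off HDFS
rm "$MNT/user/hdfs/mahout.zip"                # delete data in HDFS
```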

Additionally, you can verify your files are in HDFS via the file browser in the Hue interface provided with the sandbox. Alternatively, back on the command line, switch to the hdfs user (su - hdfs) and use the standard hadoop command-line tools to query for your files.

Conclusion

Using this interface allows users of a Hadoop cluster to rapidly push data to HDFS in a way they are already familiar with from their desktops. It also opens up possibilities for scripting data pushes from networked machines into Hadoop, including upstream preprocessing of data from other systems.