
Pydoop: Writing Hadoop Programs in Python

Installed as a layer above Hadoop, the open-source Pydoop package enables Python scripts to do big data work easily.

Earlier this year, Dr. Dobb's presented a three-part tutorial on handling so-called "big data" using Hadoop. In this article, I explore Pydoop, which provides a simple Python API for Hadoop.

If you are new to Hadoop or need updates about its latest version, I suggest you read two excellent articles written by Tom White in the Dr. Dobb's tutorial series: "Hadoop: The Lay of the Land" and "Hadoop: Writing and Running Your First Project." I will assume that you understand the basics of Hadoop and MapReduce (those tutorials provide all the necessary information to continue reading this article).

Pydoop Script enables you to write simple MapReduce programs for Hadoop with mapper and reducer functions in just a few lines of code. When Pydoop Script isn't enough, you can switch to the more complete Pydoop API, which provides the ability to implement a Python Partitioner, RecordReader, and RecordWriter. Pydoop might not be the best API for all Hadoop use cases, but its unique features make it suitable for specific scenarios and it is being actively improved. In this article, I'll explain how to install a single-node cluster with Pydoop and execute simple Pydoop Script MapReduce programs.
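To give a sense of how compact this can be, here is a sketch of the classic word count written against Pydoop Script's mapper/reducer interface (the writer object is the emitter the framework passes in; treat the exact signatures as illustrative if your Pydoop version differs):

```python
# wordcount.py -- a minimal Pydoop Script job (sketch).
# Pydoop Script calls mapper() once per input record and reducer()
# once per key; writer is an emitter object supplied by the framework.

def mapper(_, text, writer):
    # Emit (word, "1") for every whitespace-separated token in the line.
    for word in text.split():
        writer.emit(word, "1")

def reducer(word, counts, writer):
    # counts is an iterable of the values emitted for this word.
    writer.emit(word, sum(map(int, counts)))
```

Assuming the file is saved as wordcount.py, a job like this would be launched with something along the lines of `pydoop script wordcount.py hdfs_input hdfs_output`.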

Working with a Python MapReduce and HDFS API

The researchers at the CRS4 research center recently released Pydoop 0.9.1. The product offers a programming model similar to Java's for MapReduce: you define the classes that the MapReduce framework instantiates, and the framework calls their methods. Because Pydoop is a CPython package, you can access all standard Python libraries and third-party modules. Pydoop Script makes it even easier to write MapReduce modules by simply coding the necessary mapper and reducer functions.

Pydoop wraps Hadoop pipes and allows you to access the most important MapReduce components, such as Partitioner, RecordReader, and RecordWriter. In addition, Pydoop makes it easy to interact with HDFS (Hadoop Distributed File System) through a Pydoop HDFS API (pydoop.hdfs), which allows you to retrieve information about directories, files, and several file system properties. The Pydoop HDFS API makes it possible to easily read and write files within HDFS by writing Python code. In addition, the lower-level API provides features similar to the Hadoop C HDFS API, and so you can use it to build statistics of HDFS usage.
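As a hedged sketch of the HDFS API described above (it assumes Pydoop is installed and a running HDFS instance is reachable at the default address; the paths are purely illustrative):

```python
# Sketch of the pydoop.hdfs API; requires a running HDFS instance.
import pydoop.hdfs as hdfs

# Write a small file into HDFS and read it back.
hdfs.dump("hello from pydoop\n", "example.txt")
print(hdfs.load("example.txt"))

# List a directory (path is illustrative).
for entry in hdfs.ls("/user"):
    print(entry)

# The lower-level API exposes file system properties, useful for
# building usage statistics.
fs = hdfs.hdfs()                  # connect to the default HDFS instance
print(fs.capacity(), fs.used())
fs.close()
```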

Creating a Single-Node Hadoop Cluster Ready for Pydoop

The installation instructions for Pydoop are a bit outdated, so I'll provide complete step-by-step instructions for building the latest Pydoop version from source on the most recent Ubuntu 12.10 (also known as Quantal Quetzal), 64-bit desktop version. This way, you will be able to work with Pydoop MapReduce jobs in a single-node Hadoop cluster configuration. I'll start with a basic Ubuntu 12.10 64-bit installation with just the default OpenJDK installed, and I'll provide details on all the prerequisites that Pydoop 0.9.1 requires to run on top of the most recent stable Hadoop version, 1.1.2. The installation consists of:

installing Java 7

creating a Hadoop user

disabling IPv6

installing Hadoop

installing Pydoop dependencies and Pydoop itself.

I want to use Oracle Java 7 JDK instead of OpenJDK in Ubuntu. Oracle Java 7 JDK isn't included in the official Ubuntu repositories, and therefore, it is necessary to add a PPA with this JDK. Run the following command in a terminal window:

sudo add-apt-repository ppa:webupd8team/java

The terminal will display a confirmation. Press Enter to continue and the system will add the PPA (see Figure 1).

Figure 1: Oracle Java JDK PPA added to your system.

Now, run the following commands and press Y when the system asks you whether you want to continue:

sudo apt-get update
sudo apt-get install oracle-java7-installer

The Oracle Java 7 JDK package configuration will display a dialog box asking you to accept the license; you must agree if you want to use the JDK. Press Enter, select the Yes button in the new dialog box that appears, then press Enter again.

Once the installation has finished, you should see the following lines in the terminal:

Oracle JDK 7 installed
Oracle JRE 7 browser plugin installed

Check the Java version by running the following command:

java -version

The output should report Java 1.7; the minor version and build numbers might differ because there are frequent updates to the Oracle Java JDK and its related PPA. If the output doesn't display 1.7 as the first numbers of the build, you can run the following command to make Oracle Java 7 JDK the default Java:

sudo update-java-alternatives -s java-7-oracle

Run the following commands to create a Hadoop group named hadoop and a user named hduser.

sudo addgroup hadoop
sudo adduser --ingroup hadoop hduser

Enter and confirm a password for the new hduser user, provide the full name (Hadoop User), and press Y when the system asks you to confirm that the information is correct (see Figure 2).

Figure 2: The results of the creation of a new hduser added to the hadoop group.

Now, it is necessary to configure SSH. Run the following commands to generate a public/private RSA key pair. You must enter the password defined for the hduser user when su prompts for it.

su - hduser
ssh-keygen -t rsa -P ""
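One step that commonly follows key generation, and doesn't appear in the text, is authorizing the new key for local logins; without it, ssh localhost keeps prompting for a password. Treat this as an assumed step (it isn't shown in the original article), run as hduser:

```shell
# Ensure the .ssh directory and key pair exist (the key was generated
# in the previous step; this only regenerates it if it is missing).
mkdir -p "$HOME/.ssh"
[ -f "$HOME/.ssh/id_rsa.pub" ] || ssh-keygen -t rsa -P "" -f "$HOME/.ssh/id_rsa"

# Authorize hduser's own public key so "ssh localhost" works without a
# password (Hadoop's start/stop scripts rely on this).
cat "$HOME/.ssh/id_rsa.pub" >> "$HOME/.ssh/authorized_keys"
chmod 600 "$HOME/.ssh/authorized_keys"
```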

Run the following command to check the SSH connection to localhost.

ssh localhost

If you receive a Connection refused error (see Figure 3), run the following command to install openssh-server. Press Y when the system asks you whether you want to continue with the package installation.

sudo apt-get install openssh-server

Figure 3: The results of creating a public/private RSA key pair.

Open a new terminal window and run the following commands to add localhost to the list of known hosts. You must enter the password defined for the hduser user. Then, enter yes when the system asks you whether you want to continue connecting.

su - hduser
ssh localhost

Edit the sudoers file by using GNU nano. You must be extremely careful with the following steps because an error in the sudoers file will make it impossible for you to use sudo and you will have to use recovery mode to edit the file again. Open a new terminal window and run the following command:

pkexec visudo

GNU nano will show the editor for the /etc/sudoers.tmp file. Add the following line at the end of the file to add hduser into sudoers (see Figure 4):

hduser ALL=(ALL) ALL

Figure 4: Editing /etc/sudoers.tmp with GNU nano.

Then, press Ctrl + X to exit. Press Y and then Enter because you want to overwrite /etc/sudoers.tmp with the changes. Then, you will see a What now? question in the terminal. Press Q because you want to quit and save changes to the /etc/sudoers.tmp file. You will notice the word DANGER in the Quit and Save Changes to sudoers file option. As I mentioned before, this can be a dangerous configuration change, and therefore, you must be careful while editing the file.

Next, disable IPv6 to avoid problems with Hadoop and Pydoop in your single-node configuration. Run the following command to launch GEdit and open the /etc/sysctl.conf file:
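The command and the lines to add appear to have been lost from the text; a sketch based on the conventional single-node setup follows (the three sysctl entries are the commonly used ones, an assumption since the original listing is missing):

```shell
# Open /etc/sysctl.conf for editing:
sudo gedit /etc/sysctl.conf

# Append the following lines at the end of the file, save it, and
# reboot (or run "sudo sysctl -p") for the change to take effect:
#
#   net.ipv6.conf.all.disable_ipv6 = 1
#   net.ipv6.conf.default.disable_ipv6 = 1
#   net.ipv6.conf.lo.disable_ipv6 = 1
```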

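The step that downloads and unpacks Hadoop 1.1.2 isn't shown in the text; assuming the layout of the Apache archive (the URL is an assumption), it would look roughly like this, run as hduser:

```shell
# As hduser, fetch and unpack Hadoop 1.1.2 into the home directory.
cd /home/hduser
wget http://archive.apache.org/dist/hadoop/core/hadoop-1.1.2/hadoop-1.1.2.tar.gz
tar -xzf hadoop-1.1.2.tar.gz
```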
There is going to be a new hadoop-1.1.2 folder within /home/hduser; therefore, the Hadoop home directory is /home/hduser/hadoop-1.1.2. Run the following commands to rename the /home/hduser/hadoop-1.1.2 directory to /home/hduser/hadoop, make it the new Hadoop home, and give hduser ownership of it:

sudo mv hadoop-1.1.2 hadoop
sudo chown -R hduser:hadoop hadoop

By default, Ubuntu uses Bash as the default shell. It is necessary to update $HOME/.bashrc for hduser to set Hadoop-related environment variables (Pydoop requires HADOOP_HOME). Run the following command to open /home/hduser/.bashrc with GEdit.

sudo gedit /home/hduser/.bashrc

Add the following lines to the aforementioned file. Then, select File | Save and close the GEdit window. This way, you will set HADOOP_HOME, JAVA_HOME, and add the Hadoop bin directory to PATH:
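The listing itself was lost from the article; given the paths used throughout it (the java-7-oracle location is the PPA's default install path and is an assumption), the lines would be roughly:

```shell
# Hadoop-related environment variables for hduser (paths per this
# article; the JAVA_HOME location is the Oracle PPA's default).
export HADOOP_HOME=/home/hduser/hadoop
export JAVA_HOME=/usr/lib/jvm/java-7-oracle
export PATH=$PATH:$HADOOP_HOME/bin
```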

Run the following command to open the /home/hduser/hadoop/conf/core-site.xml configuration file with GEdit to set the value of hadoop.tmp.dir to the previously created temporary directory (/home/hduser/tmp) and the value of fs.default.name.

sudo gedit /home/hduser/hadoop/conf/core-site.xml

Add the following lines between the <configuration> and </configuration> tags:
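The XML listing is missing here; based on the property names and the temporary directory given above (the port 54310 for fs.default.name is the conventional choice in single-node tutorials and is an assumption), it would be along these lines:

```xml
<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/hduser/tmp</value>
  <description>Base for Hadoop's temporary directories.</description>
</property>
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>URI of the default file system (the HDFS NameNode).</description>
</property>
```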

Then, run the following command to open the /home/hduser/hadoop/conf/mapred-site.xml configuration file with GEdit to set the value of mapred.job.tracker to the host and port where the MapReduce job tracker runs (localhost:54311):

sudo gedit /home/hduser/hadoop/conf/mapred-site.xml

Add the following lines between the <configuration> and </configuration> tags (see Figure 6):
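The listing referenced here is missing from the text; from the property name and value given above, it would be:

```xml
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>Host and port of the MapReduce JobTracker.</description>
</property>
```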

Save the file. Now, run the following commands to format the HDFS filesystem via the NameNode. NameNode will initialize the filesystem in /home/hduser/hadoop/tmp/dfs/name. Press Y when NameNode asks you whether you want to reformat the filesystem.
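The formatting commands themselves are missing from the text; assuming the standard Hadoop 1.x command layout used throughout this article, they would be along these lines:

```shell
# As hduser, format the HDFS filesystem (needed once, before first use).
su - hduser
cd /home/hduser/hadoop
bin/hadoop namenode -format
```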

