Pydoop: Writing Hadoop Programs in Python

Installed as a layer above Hadoop, the open-source Pydoop package enables Python scripts to do big data work easily.

It is time to start the single-node Hadoop cluster. Run the following command to execute the script that will start NameNode, DataNode, JobTracker, and TaskTracker on the system:

./start-all.sh

It is easy to check whether the Hadoop processes are running by opening another terminal window and running jps. You should see the following names listed as a result of running jps (see Figure 8):

DataNode

SecondaryNameNode

TaskTracker

JobTracker

NameNode

Jps

Figure 8: Checking whether the Hadoop processes are running with jps.

It is also important to check whether the Hadoop processes are listening on the previously configured ports: 54310 and 54311. Open a terminal window and run the following command. If the results show one java process listening on 54310 and another listening on 54311, the Hadoop processes are running according to the configuration you set earlier.

sudo netstat -plten | grep 543

You can use your favorite Web browser to monitor the different Hadoop daemons through their Web interfaces at the following addresses (the standard Hadoop 1.x defaults):

NameNode: localhost:50070

JobTracker: localhost:50030

TaskTracker: localhost:50060

Installing Pydoop

Now that you have a single-node Hadoop 1.1.2 cluster up and running, you can install Pydoop. There are two options: install the prebuilt Debian package, which requires a handful of specific dependencies (some of them outdated versions) and makes the installation a bit complicated, or build Pydoop from source, which avoids most of those dependencies. If you decide to build from source, skip the dependency installation steps that follow.

Because Ubuntu includes libboost-python1.50.0 and libboost-python1.49.0, you need an older version to satisfy Pydoop's dependency on libboost-python1.46.1. If you are installing the Debian package rather than building from source, download it from http://packages.ubuntu.com/precise/libboost-python1.46.1 and click Install in Ubuntu Software Center.

Another problematic dependency is hadoop-client, because it requires CDH4 (Cloudera's Distribution Including Apache Hadoop). If you want to install CDH4, execute the following commands in a new terminal window; they add the CDH4 repositories that allow you to install hadoop-client.
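One way to add the repositories on 64-bit Ubuntu Precise is Cloudera's one-click-install package (the URL below assumes the amd64 architecture and the Precise release; adjust it for your system):

wget http://archive.cloudera.com/cdh4/one-click-install/precise/amd64/cdh4-repository_1.0_all.deb
sudo dpkg -i cdh4-repository_1.0_all.deb
sudo apt-get update
sudo apt-get install hadoop-client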

You will see a list of missing dependencies. Run the following command and answer Y to each confirmation question:

sudo apt-get install -f

The next step, if you don't want to build from source, is to download the Pydoop Debian package and install it. You can download the latest version of Pydoop packaged as a Debian file from the Pydoop project site. Once you download the file python-pydoop_0.9.0-1_amd64.deb, you can install the package by running the following command (replace /home/gaston/Downloads/python-pydoop_0.9.0-1_amd64.deb with the full path to your downloaded python-pydoop_0.9.0-1_amd64.deb):

sudo dpkg -i /home/gaston/Downloads/python-pydoop_0.9.0-1_amd64.deb

If you want to build from source, the first step is to install the prerequisites to perform the build. Run the following command to install build-essential, python-all-dev, libboost-python-dev, and libssl-dev:
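sudo apt-get install build-essential python-all-dev libboost-python-dev libssl-dev

With the prerequisites in place, decompress pydoop-0.9.1.tar.gz, then build and install the package. The sequence below follows the standard distutils flow described in the Pydoop documentation; the JAVA_HOME path is an assumption for OpenJDK 6 on 64-bit Ubuntu, and HADOOP_HOME matches the Hadoop installation used earlier in this article:

export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64
export HADOOP_HOME=/home/hduser/hadoop
tar xzf pydoop-0.9.1.tar.gz
cd pydoop-0.9.1
python setup.py build
sudo python setup.py install --skip-build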

Go to the directory in which you decompressed pydoop-0.9.1.tar.gz and then go to the test subdirectory. For example, in my case, I executed cd /home/gaston/Downloads/pydoop-0.9.1/test. Run the following commands to execute the basic Pydoop tests that don't require HDFS:
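The Pydoop source tree ships a test driver in the test directory; assuming the 0.9.x layout with an all_tests.py driver (check your extracted copy for the exact name), the basic suite runs with:

python all_tests.py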

When you installed Hadoop, you defined the value for the fs.default.name property in /home/hduser/hadoop/conf/core-site.xml. Use the port number configured in that property to set the value for HDFS_PORT. Because the value is hdfs://localhost:54310, I set HDFS_PORT to 54310. Run the following commands to execute the 134 Pydoop tests that include HDFS tests (see Figure 10):
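Assuming the same all_tests.py driver, export HDFS_PORT first so the HDFS tests can reach your NameNode, and then run the suite again:

export HDFS_PORT=54310
python all_tests.py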

Running Pydoop MapReduce Scripts

Once you have checked that all the Pydoop tests execute without problems, you can run some of the sample Pydoop MapReduce scripts before creating your own. This way, you will learn how to interact with HDFS and how to check the execution status of Pydoop MapReduce scripts.

Go to the directory in which you decompressed pydoop-0.9.1.tar.gz and then go to the examples/pydoop_script subdirectory. In my case, I executed cd /home/gaston/Downloads/pydoop-0.9.1/examples/pydoop_script.

The transpose.py script transposes a tab-separated text matrix. The code is very easy to understand and defines the mapper and the reducer functions. The calls to writer.emit generate the results for both the mapper and the reducer functions.
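If you open transpose.py, you will see code along the following lines; this is a sketch of the technique rather than the exact shipped source. The mapper emits each cell keyed by its column index and tagged with the byte offset of its row, and the reducer sorts each column's cells by that offset and joins them into one output row:

def mapper(key, value, writer):
    # key is the byte offset of the input line; use it as the row tag.
    for col, cell in enumerate(value.split()):
        writer.emit(col, "%s\t%s" % (key, cell))

def reducer(key, ivalue, writer):
    # Sort this column's cells by their original row offset, then
    # join them into one tab-separated row of the transposed matrix.
    cells = [v.split("\t", 1) for v in ivalue]
    cells.sort(key=lambda pair: int(pair[0]))
    writer.emit(key, "\t".join(pair[1] for pair in cells))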

The pydoop_script folder includes a sample text file, matrix.txt, with a valid input for the transpose.py Pydoop MapReduce script.

00 01 02
10 11 12
20 21 22
30 31 32
40 41 42

It is necessary to upload your input data to HDFS. Run the following commands to upload matrix.txt. The ls command will display the directory listing for /user/hduser in HDFS and the new matrix.txt file should be displayed.
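For example, using the HDFS shell bundled with Hadoop (run these as hduser so the file lands in /user/hduser):

hadoop fs -put matrix.txt matrix.txt
hadoop fs -ls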

You can also go to localhost:50075/browseDirectory.jsp?dir=%2Fuser%2Fhduser&namenodeInfoPort=50070 by following these steps:

Go to localhost:50070.

Click on Browse the filesystem.

Click user under Name.

Click hduser under Name. The Web browser will display the files for /user/hduser in HDFS. In this case, you will see matrix.txt.

Now that you have the HDFS input (matrix.txt), you can run the Pydoop script to perform the MapReduce job. Run the following command that specifies matrix.txt as the HDFS input and t_matrix as the HDFS output:

pydoop script transpose.py matrix.txt t_matrix

Go to localhost:50030 to use the Web interface for the Hadoop JobTracker daemon and check the status for the new job. You will see a new entry for transpose.py with details about the progress of both the Map and Reduce processes (see Figure 11).

Figure 11: Checking the details about the new transpose.py Pydoop MapReduce script job with the Web interface for the Hadoop JobTracker daemon.

Once transpose.py finishes the MapReduce job, you can retrieve the results from HDFS. Run the following commands to retrieve the output (t_matrix):
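For example (part-00000 is the default output file name for a job with a single reducer):

hadoop fs -ls t_matrix
hadoop fs -cat t_matrix/part-00000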

Conclusion

By writing just a few lines of code, you can easily create simple Pydoop Script MapReduce programs and execute them in your single-node Hadoop cluster. When Pydoop scripts aren't enough, you can start working with the more complete object-oriented Pydoop API and take full advantage of its features.

In this article, I've focused on installing, running, and exploring Pydoop in a single-node Hadoop cluster on the latest available Ubuntu version. If you have basic Python skills, you will be able to take full advantage of Pydoop by diving deeper into its features and creating more complex MapReduce jobs.
