4. Install HIPI

Clone the HIPI GitHub Repository

The best way to get the latest version of HIPI is by cloning the official GitHub repository and building it, along with all of the tools and example programs, yourself. This only takes a few minutes and verifies that your system is properly set up and ready to begin developing your own HIPI applications:

$> git clone git@github.com:uvagfx/hipi.git

Build the HIPI Library and Example Programs

From the HIPI root directory, simply run gradle to build the HIPI library along with all of the tools and example programs:
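Assuming Gradle is installed and on your PATH, the default build is invoked with no arguments from the HIPI root directory:

$> gradle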

If the build fails, first carefully review the steps above. If you are convinced that you are doing everything correctly and that you've found an issue with the HIPI distribution or documentation, please post a question to the HIPI Users Group or file a bug report.

After the build finishes, you may want to inspect the settings.gradle file in the root directory and the build.gradle files in each directory to familiarize yourself with the various build targets. If you're new to Gradle, we recommend reviewing this tutorial. For example, to build only the hibImport tool from scratch:
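A likely invocation for that (the exact task path is an assumption based on typical Gradle multi-project layouts; check settings.gradle for the actual subproject names) would be:

$> gradle clean :tools:hibImport:jar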

HIPI is now installed on your system. To learn about future updates to the HIPI distribution you should join the HIPI Users Group and watch the HIPI GitHub repository. You can always obtain the latest version of HIPI on the release branch with the following git pull command:
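Assuming your clone tracks the official repository as origin, that is:

$> git pull origin release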

Also, you can experiment with the development branch, which contains the latest features that have not yet been integrated into the main release branch. Note that the development branch is generally less stable than the release branch.

5. Setup Eclipse (Optional)

Next, we will walk you through the process of writing your first HIPI program. Be sure to also check out the tools and example programs to learn more about HIPI.

Your First HIPI Program

This section will walk you through the process of creating a very simple HIPI program that computes the average pixel color over a set of images. First, we need a set of images to work with. Recall that the primary input type to a HIPI program is a HipiImageBundle (HIB), which stores a collection of images on the Hadoop Distributed File System (HDFS). Use the hibImport tool to create a HIB from a collection of images on your local file system located in the directory ~/SampleImages by executing the following command from the HIPI root directory:
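A plausible invocation, assuming the tool ships with a wrapper script named hibImport.sh in the same style as the hibInfo.sh script mentioned below (check the tools directory for the exact script name and argument order):

$> tools/hibImport.sh ~/SampleImages sampleimages.hib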

If this command fails, double-check that you successfully built the HIPI library and tools by following the directions above.

Note that hibImport actually creates two files in the current working directory of the HDFS: sampleimages.hib and sampleimages.hib.dat. You can verify that this is the case with the command: hadoop fs -ls. (You can learn how the hibImport tool works here after finishing this tutorial.)

You can use the handy hibInfo tool that comes with HIPI to inspect the contents of this newly created HIB file:
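For example (the argument order here is an assumption; run the script without arguments, as noted below, for its actual usage):

$> tools/hibInfo.sh sampleimages.hib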

Note that your specific output may vary from what is shown above since you will be working with different images and different paths. Run hibInfo.sh without any arguments to see a description of its usage.

Next, following the conventions of Gradle, create a source directory hierarchy for your program by executing the following command in the root directory:

$> mkdir -p examples/helloWorld/src/main/java/org/hipi/examples

Next, let's add a Gradle build task for our new program by creating the file examples/helloWorld/build.gradle with the following contents:
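A minimal sketch of such a build file, assuming the root build.gradle already applies the Java plugin to its subprojects and that the driver class will be named org.hipi.examples.HelloWorld (both are assumptions; adjust to your actual class name):

```groovy
jar {
  manifest {
    // Lets `hadoop jar helloWorld.jar` locate the driver without naming it explicitly
    attributes 'Main-Class': 'org.hipi.examples.HelloWorld'
  }
}
```

You will likely also need to register the new subproject in settings.gradle so that Gradle picks it up.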

The entry point of every Java program is the public static void main() method. As in most MapReduce applications, the main() method in our program uses Hadoop's ToolRunner class to call the run() method of this driver class.
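A minimal sketch of such a driver, assuming the package org.hipi.examples from the directory hierarchy created above (this follows the standard Hadoop Tool/ToolRunner pattern rather than reproducing any verbatim HIPI source):

```java
package org.hipi.examples;

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class HelloWorld extends Configured implements Tool {

  // ToolRunner parses generic Hadoop options and then invokes run()
  public int run(String[] args) throws Exception {
    System.out.println("Hello HIPI!");
    return 0;
  }

  public static void main(String[] args) throws Exception {
    ToolRunner.run(new HelloWorld(), args);
  }
}
```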

Build this very simple program by running the command gradle jar in the examples/helloWorld directory.

If the build is successful, it will produce the JAR file examples/helloWorld/build/libs/helloWorld.jar. Run this program using the following command from within the examples/helloWorld directory:

$> hadoop jar build/libs/helloWorld.jar
Hello HIPI!

Congratulations! You just created a very simple MapReduce program. Now let's make our program do some image processing with HIPI.

MapReduce

Hadoop's MapReduce parallel programming framework is a powerful tool for large-scale distributed computing. If this is your first experience with MapReduce, we recommend reading the official Apache MapReduce tutorial, which gives a nice introduction to this programming model. Another great read is the seminal paper written by Jeffrey Dean and Sanjay Ghemawat at Google titled MapReduce: Simplified Data Processing on Large Clusters.

First, let's extend the run() method in HelloWorld.java to initialize and execute a MapReduce job and create stubs for our Mapper and Reducer classes:
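A sketch of what that extension might look like. The stub class names and the choice of output types are assumptions, but HibInputFormat is HIPI's input format for HIBs:

```java
public int run(String[] args) throws Exception {

  // Validate program arguments: an input HIB and an output directory
  if (args.length != 2) {
    System.out.println("Usage: helloWorld <input HIB> <output directory>");
    System.exit(0);
  }

  // Initialize and configure the MapReduce job
  Job job = Job.getInstance();
  job.setInputFormatClass(HibInputFormat.class);
  job.setJarByClass(HelloWorld.class);
  job.setMapperClass(HelloWorldMapper.class);
  job.setReducerClass(HelloWorldReducer.class);

  // Types of the key/value records passed between the map and reduce stages
  job.setMapOutputKeyClass(IntWritable.class);
  job.setMapOutputValueClass(FloatImage.class);
  job.setOutputKeyClass(IntWritable.class);
  job.setOutputValueClass(Text.class);

  // Set the input HIB path and the output directory, then launch the job
  FileInputFormat.setInputPaths(job, new Path(args[0]));
  FileOutputFormat.setOutputPath(job, new Path(args[1]));

  return job.waitForCompletion(true) ? 0 : 1;
}

// Stubs to be filled in below
public static class HelloWorldMapper
    extends Mapper<HipiImageHeader, FloatImage, IntWritable, FloatImage> {
}

public static class HelloWorldReducer
    extends Reducer<IntWritable, FloatImage, IntWritable, Text> {
}
```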

Most of this code imports the necessary Hadoop and HIPI classes and then configures and launches the MapReduce Job. Code like this will become somewhat boilerplate across the MapReduce/HIPI programs you develop, but it's still important to understand what is going on.

The first lines of code in the run() method validate the arguments passed to the program, create the Hadoop Job object, and call setter methods on this object to specify the classes that implement the map and reduce tasks along with the types of objects that are passed between these processing stages. The remaining lines set up the path to the input file and the output directory and launch the program. The code descriptions on the tools and examples page give much more detail about these parts of a HIPI program, so we will skip them for now and instead focus on the higher-level algorithm that goes in the map() and reduce() methods.

Before proceeding further, test that your code still compiles and runs by repeating the steps above, but don't expect it to do anything yet.

Computing The Average Pixel Color

Now let's add some actual HIPI image processing code to our program. For this example, we will be computing the average RGB value of the pixels in the images in our input HIB. Our mapper will compute the average pixel color over a single image and the reducer will add these averages together and divide by their count to compute the total average pixel color. Because the map tasks are executed in parallel, if our Hadoop cluster has more than one compute node we will perform this entire operation faster than if we were using a single machine. This is the key idea behind parallel computing in MapReduce.
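Since each map record carries a count of 1, the reducer's work amounts to a component-wise mean of the per-image means. The arithmetic can be sketched in a self-contained way (a hypothetical helper class, independent of Hadoop and HIPI):

```java
public class AverageColor {

  // Each row is one image's mean (r, g, b) color; the overall result is the
  // component-wise mean of these per-image means, weighting every image equally.
  public static float[] average(float[][] perImageMeans) {
    float[] avg = new float[3];
    for (float[] mean : perImageMeans) {
      for (int c = 0; c < 3; c++) {
        avg[c] += mean[c];
      }
    }
    for (int c = 0; c < 3; c++) {
      avg[c] /= perImageMeans.length;
    }
    return avg;
  }
}
```

Note that this weights each image equally regardless of its pixel count, which is exactly the behavior described above.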

The first two arguments of the map() method are a key/value pair (often called a "record" in MapReduce terminology) that are constructed by the HibInputFormat and HibRecordReader classes. In this case, these two arguments are a HipiImageHeader (the "key") and a FloatImage (the "value"), respectively. In HIPI, the first argument of the map() method must always be a HipiImageHeader, but the second argument can be any type that extends the abstract base class HipiImage. This gives you, the developer, control over how images are decoded into memory.

Note that this map() method produces one record for each image in the HIB, which is sent to the reduce processing stage using the context.write() method. Each record consists of an IntWritable (always equal to 1) and a HIPI FloatImage object that contains the image's computed average pixel value. These records are collected by the MapReduce framework and become the inputs to the reduce() method as an Iterable of FloatImage objects, where they are added together and normalized to obtain the final result:
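A sketch of these two methods, assuming three-band RGB images. The FloatImage accessors (getWidth(), getHeight(), getData()) and mutators (add(), scale()) are taken from the HIPI FloatImage API, but verify their exact signatures against the current documentation:

```java
public static class HelloWorldMapper
    extends Mapper<HipiImageHeader, FloatImage, IntWritable, FloatImage> {

  public void map(HipiImageHeader key, FloatImage value, Context context)
      throws IOException, InterruptedException {

    // Verify that the image was decoded successfully and is non-trivial
    if (value != null && value.getWidth() > 1 && value.getHeight() > 1) {

      int w = value.getWidth();
      int h = value.getHeight();
      float[] vals = value.getData();

      // Accumulate the sum of all pixel values, channel by channel
      float r = 0, g = 0, b = 0;
      for (int i = 0; i < w * h; i++) {
        r += vals[i * 3 + 0];
        g += vals[i * 3 + 1];
        b += vals[i * 3 + 2];
      }

      // Emit this image's average color as a 1x1 FloatImage, keyed by 1
      int n = w * h;
      FloatImage avg = new FloatImage(1, 1, 3, new float[] { r / n, g / n, b / n });
      context.write(new IntWritable(1), avg);
    }
  }
}

public static class HelloWorldReducer
    extends Reducer<IntWritable, FloatImage, IntWritable, Text> {

  public void reduce(IntWritable key, Iterable<FloatImage> values, Context context)
      throws IOException, InterruptedException {

    // Sum the per-image averages...
    FloatImage avg = new FloatImage(1, 1, 3);
    int total = 0;
    for (FloatImage val : values) {
      avg.add(val);
      total++;
    }

    if (total > 0) {
      // ...and normalize by the number of images to get the overall average
      avg.scale(1.0f / total);
      float[] d = avg.getData();
      String result = String.format("Average pixel value: %f %f %f", d[0], d[1], d[2]);
      context.write(key, new Text(result));
    }
  }
}
```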

Whenever a MapReduce program successfully finishes, it creates the file _SUCCESS in the output directory along with a part-r-XXXXX file for each reduce task. The average pixel value can be retrieved using the cat command:
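Assuming a single reduce task (so the one output file is part-r-00000) and the output directory sampleimages_average used below:

$> hadoop fs -cat sampleimages_average/part-r-00000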

Feel free to play around with different image sets and see how it affects the average pixel color. (Note: you will need to remove the output directory before running the program a second time with the command: hadoop fs -rm -R sampleimages_average.)