TF-IDF in Hadoop Part 1: Word Frequency in Doc

My interest in parallel computing dates back to my undergraduate years, just one or two years after Google published its paper on efficient large-scale data processing. From then on, I kept wondering how they managed to index “the web”. This write-up came together as I learned the MapReduce API and HDFS, and as I explored the implementation of the TF-IDF algorithm as explained in the Cloudera training. I started this implementation after writing the InvertedIndex example with both the Hadoop 0.18 and 0.20.1 APIs. My experience is written up in several parts, of which this is the first.

Seven years passed, and while writing my thesis project I found myself dealing with the same questions about large datasets… How do you process them at the database level? I mean, how do you process them efficiently with the computational resources you’ve got? Interestingly enough, my first contact with MapReduce processing was with mongoDB’s MapReduce API, used to access data in parallel across the different shards of a database cluster, where the data is distributed among shards according to chosen properties of the data. And of course, one of the tools for processing that distributed data is a MapReduce API. I learned how to use that API thanks to Cloudera’s Basic Training on MapReduce and HDFS. This first documentation was produced after studying and completing the first exercises of Cloudera’s InvertedIndex example using Hadoop, for which I downloaded the VMware Player image and worked through the initial examples, guided by the PDF describing the exercises. Although the source code works without a problem, it uses the Hadoop 0.18 API; if you get bugged by the deprecation warnings in Eclipse, I have updated and documented the changes needed to remove them in a refactored version of InvertedIndex using the Hadoop 0.20.1 API.

I finally found the Cloudera basic introduction training on MapReduce and Hadoop… and let me tell you, they made the nicest introduction to MapReduce I’ve ever seen :) The slides and documentation are very well structured and easy to follow (at least coming from the academic world)… They actually worked closely with Google and the University of Washington to get them to that level… I was very pleased to read and understand the concepts… My only need at that time was to apply that knowledge to mongoDB’s MapReduce engine… I wrote a simple application and it proved to be interesting…

So, I’ve been studying the Cloudera basic training on Hadoop, and that was the only way I could really learn MapReduce! If you have a good background in Java 5/6, Linux, operating systems, the shell, etc., you can definitely move on… If you don’t have experience with Hadoop, I definitely suggest following the basic training sessions 1–5, including the InvertedIndex exercise. You will find the exercises describing the TF-IDF algorithm in one of the PDFs.

The first thing I implemented with Hadoop was the indexing of words over the complete Shakespeare collection. However, I was intrigued, could not resist, and downloaded more e-books from Project Gutenberg (all of the Da Vinci books and The Outline of Science, Vol. 1). The input directory already includes the Shakespeare collection, but I had to put the new books into the filesystem myself. You can add the downloaded files to the Hadoop file system with the “copyFromLocal” command:
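For example, assuming the new books were downloaded into a local directory called gutenberg (the local file names below are illustrative, not the exact ones I used):

hadoop fs -copyFromLocal ~/gutenberg/davinci-notebooks.txt input
hadoop fs -copyFromLocal ~/gutenberg/outline-of-science-vol1.txt input
hadoop fs -ls input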

Note that the command “hadoop fs” proxies the usual unix file commands to the Hadoop filesystem: “-ls”, “-cat”, among others. Following the suggestion in the documentation, the approach I took to understand the concepts more easily was divide-and-conquer: each job is executed separately, as an exercise, saving the reduced values it generates into HDFS.

Job 1: Word Frequency in Doc

As mentioned before, the word-frequency phase is a Job whose task is to count how many times each word occurs in each of the documents in the input directory. In this case, the specification of the Map and Reduce functions is as follows:

Map:

Input: (document, each line of its contents)

Output: ((word@document), 1)

Reduce:

n = sum of the values for each key “word@document”

Output: ((word@document), n)
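A minimal sketch of what this Mapper and Reducer pair could look like with the Hadoop 0.20.1 API is shown below. WordFrequenceInDocMapper is the class name used in this project; the other names are illustrative, each class goes in its own source file, and the token cleanup discussed next is omitted for brevity:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class WordFrequenceInDocMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text wordAtDocument = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // The document name comes from the input split being processed by this map task.
        String documentName = ((FileSplit) context.getInputSplit()).getPath().getName();
        for (String token : value.toString().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            // Emit (word@document, 1) for every word occurrence on this line.
            wordAtDocument.set(token + "@" + documentName);
            context.write(wordAtDocument, ONE);
        }
    }
}

public class WordFrequenceInDocReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // n = sum of all the 1s emitted for this "word@document" key.
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}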

In order to decrease the payload received by the reducers, I filter out very-high-frequency words such as “the”, using Google’s stopwords list. Also, the result of each job, i.e. the intermediate values needed by the next jobs, is saved to regular files in HDFS before the next MapReduce pass runs. In general, the strategy is:

Reduce the map output by lower-casing the words, so that equal words can be aggregated before the reduce phase;

Skip unnecessary words by checking them against the stopwords dictionary (the Google search stopwords);

Use a RegEx to select only words, removing punctuation and other data anomalies (see the sketch right after this list).
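As a rough illustration of that cleanup inside the map method, something along these lines could be applied to each token before emitting the key; the regular expression and the way the stopword set is loaded are my assumptions here, not necessarily the exact code of this project:

// Illustrative token cleanup: lower-case the word, strip anything that is
// not a letter, and drop it entirely if it is a (Google search) stopword.
private static final java.util.regex.Pattern NON_LETTERS =
        java.util.regex.Pattern.compile("[^a-z]+");
private java.util.Set<String> googleStopwords; // assumed to be loaded once in setup()

private String cleanToken(String rawToken) {
    String word = NON_LETTERS.matcher(rawToken.toLowerCase()).replaceAll("");
    if (word.isEmpty() || googleStopwords.contains(word)) {
        return null; // the caller simply skips this token
    }
    return word;
}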

Note that the unit tests use the JUnit 4 API. The MRUnit tests were also updated to use the Hadoop 0.20.1 API for the Mapper and the respective MapDriver, and generics are used to mirror the actual implementation.
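The project’s tests exercise the Mapper through an MRUnit MapDriver, as mentioned above. As a simpler, hedged illustration of the same style, here is how a JUnit 4 test for the Reducer could look with MRUnit’s ReduceDriver; the test class name and values are illustrative:

import java.util.Arrays;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.ReduceDriver;
import org.junit.Before;
import org.junit.Test;

public class WordFrequenceInDocReducerTest {

    private ReduceDriver<Text, IntWritable, Text, IntWritable> reduceDriver;

    @Before
    public void setUp() {
        reduceDriver = new ReduceDriver<Text, IntWritable, Text, IntWritable>();
        reduceDriver.setReducer(new WordFrequenceInDocReducer());
    }

    @Test
    public void sumsAllOnesEmittedForTheSameWordAndDocument() {
        reduceDriver
                .withInput(new Text("therefore@hamlet.txt"),
                        Arrays.asList(new IntWritable(1), new IntWritable(1), new IntWritable(1)))
                .withOutput(new Text("therefore@hamlet.txt"), new IntWritable(3))
                .runTest();
    }
}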

Before executing the Hadoop application, make sure that the Mapper and Reducer classes pass their unit tests. Test-Driven Development helps during the development of Mappers and Reducers by catching problems related to incorrectly inherited methods (generics in particular), where a wrong “map” or “reduce” method signature can silently skip a phase you designed. Running the test cases before the actual execution of the driver classes is therefore safer.

Then the execution of the Driver can proceed. It defines the mapper and reducer classes, and it also sets the combiner class to be the same as the reducer class. Also, note that the outputKeyClass and outputValueClass must be the same as the ones emitted by the Reducer class!!! If not, Hadoop will complain! :)
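A hedged sketch of that Driver with the Hadoop 0.20.1 API could look like this; the driver class name and the way the paths are passed in are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordFrequenceInDocDriver {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "Word Frequency In Doc");
        job.setJarByClass(WordFrequenceInDocDriver.class);

        job.setMapperClass(WordFrequenceInDocMapper.class);
        // Summing counts is associative, so the Reducer can safely double as the Combiner.
        job.setCombinerClass(WordFrequenceInDocReducer.class);
        job.setReducerClass(WordFrequenceInDocReducer.class);

        // These must match the key/value types written by the Reducer, or Hadoop will complain.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. the "input" directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. "1-word-freq"

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}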

As specified by the Driver class, the data is read from the books listed in the input directory on HDFS, and the output goes to the directory of this first step, “1-word-freq”. The training virtual machine contains the necessary build scripts to compile the code and generate the jars for executing the MapReduce application, as well as for running the unit tests of the Mapper and Reducer classes.
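If you prefer to launch the job by hand instead of through the build scripts, the invocation would look roughly like this (the jar name is illustrative):

hadoop jar tf-idf.jar WordFrequenceInDocDriver input 1-word-freq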

The execution generates the output shown in the following listing (note that I piped the cat process into less so you can navigate the stream). Searching for the word “therefore” shows its use in the different documents.
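Something like the following can be used to inspect the result; the part file name depends on the reducer that produced it, so it is an assumption here:

hadoop fs -cat 1-word-freq/part-r-00000 | less

Inside less, typing /therefore@ jumps to the entries for that word across the documents.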

Hi,
Have you ever encountered this problem:
When I run the test with mrunit-0.5.0-incubating.jar, the testcase runs fine. But when I changed it to mrunit-0.8.0-incubating.jar, context.getInputSplit() returned null and Line 66 in WordFrequenceInDocMapper.java threw a NullPointerException. What’s the problem? What changes does the 0.8.0 version have?