
Currently I am following the Coursera course ‘Mining Massive Datasets’. I have been interested in MapReduce and Apache Hadoop for some time, and with this course I hope to gain more insight into when and how MapReduce can help solve real-world business problems (another way to do so I described here). This Coursera course focuses mainly on the theory of the covered algorithms and less on the coding itself. The first week is about PageRank and how Google uses it to rank pages. Luckily there is a lot to find about this topic in combination with Hadoop. I ended up here and decided to have a closer look at that code.

What I did was take this code (forked it) and rewrite it a little. I created unit tests for the mappers and reducers, as I described here. As a test case I used the example from the course. We have three web pages linking to each other and/or to themselves:

This linking scheme should resolve to the following page ranks:

Y 7/33

A 5/33

M 21/33
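These fractions can be verified with a quick power-iteration sketch. I am assuming the link structure of the course example (y links to y and a, a links to y and m, m links only to itself) and the taxation parameter β = 0.8 used in the lectures; with those assumptions the iteration converges to exactly these values:

```python
# Power iteration with teleport: r' = beta * M * r + (1 - beta) / n
# The link structure and beta = 0.8 are assumptions taken from the course example.
beta = 0.8
links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["m"]}
rank = {page: 1.0 / len(links) for page in links}

for _ in range(100):  # more than enough iterations for this tiny graph
    new_rank = {page: (1.0 - beta) / len(links) for page in links}
    for source, targets in links.items():
        share = beta * rank[source] / len(targets)  # rank split evenly over outlinks
        for target in targets:
            new_rank[target] += share
    rank = new_rank

print(rank)  # y ≈ 7/33 ≈ 0.2121, a ≈ 5/33 ≈ 0.1515, m ≈ 21/33 ≈ 0.6364
```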

Since the MapReduce example code expects ‘Wiki page’ XML as input, I created the following test set:

The overall way it works is already explained very nicely on the original page itself. I will only describe the unit tests I created. With the original explanation and my unit tests, you should be able to work through the material and understand what happens.

As described, the total job is divided into three parts:

parsing

calculating

ordering

In the parsing part the raw XML is taken, split into pages, and mapped so that the output has the page as key and, as value, the pages it has outgoing links to. So the input for the unit test will be the three ‘Wiki’ page XML documents shown above. The expected output is the ‘title’ of each page together with the pages it links to. The unit test looks like:

As the unit test shows, we expect the reducer to reduce the input to a value consisting of an ‘initial’ page rank of 1.0 concatenated with all pages the (key) page has outgoing links to. That is the output of this phase and will be used as input for the ‘calculate’ phase.
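The actual tests live in the forked Java project; as an illustration of the logic they exercise, here is a minimal Python sketch of the parse phase. The XML snippet, the [[...]] link syntax, and the tab/comma separators are my assumptions, not necessarily the exact format used by the original code:

```python
import re

def parse_map(page_xml):
    # Map one 'Wiki page' XML fragment to (title, list of outgoing links).
    title = re.search(r"<title>(.*?)</title>", page_xml).group(1)
    out_links = re.findall(r"\[\[(.*?)\]\]", page_xml)
    return title, out_links

def parse_reduce(title, out_links):
    # Prefix the outgoing link list with the initial page rank of 1.0.
    return title, "1.0\t" + ",".join(out_links)

page = "<page><title>y</title><text>[[y]] [[a]]</text></page>"
title, links = parse_map(page)
print(parse_reduce(title, links))  # ('y', '1.0\ty,a')
```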

In the calculate part, the incoming page ranks are recalculated, implementing the ‘power iteration’ method. This step is performed multiple times to obtain an acceptable page rank for the given page set. As said before, the output of the previous part is the input of this step, as we see in the unit test for this mapper:

As shown, the output of the parse phase is recreated as input, and we check that the output of the reducer matches the first iteration of the page rank calculation. Each iteration leads to the same output format but with possibly different page rank values.
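For reference, the per-iteration logic that such a test verifies can be sketched in Python as follows. The damping factor of 0.85 and the record layout (rank, tab, comma-separated outlinks, with the link list passed through the shuffle behind a marker character) are assumptions on my part; the original Java code may differ in these details:

```python
DAMPING = 0.85  # common default; an assumption, the original code may use another value

def calc_map(page, value):
    # value: "<rank>\t<comma separated outlinks>"
    rank_str, _, links_str = value.partition("\t")
    rank = float(rank_str)
    out_links = links_str.split(",") if links_str else []
    pairs = [(page, "|" + links_str)]  # pass the link list through to the reducer
    for link in out_links:
        pairs.append((link, str(rank / len(out_links))))  # equal share per outlink
    return pairs

def calc_reduce(page, values):
    links_str, total = "", 0.0
    for value in values:
        if value.startswith("|"):
            links_str = value[1:]      # the preserved link list
        else:
            total += float(value)      # a rank contribution from an incoming link
    new_rank = (1.0 - DAMPING) + DAMPING * total
    return page, f"{new_rank}\t{links_str}"
```

For example, a page ‘y’ with rank 1.0 and outlinks ‘y,a’ maps to a link-list record plus a contribution of 0.5 for each of ‘y’ and ‘a’; the reducer then sums the contributions a page receives and applies the damping formula.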

The final step is the ‘ordering’ part. This is quite straightforward, and so is the unit test. This part contains only a mapper, which takes the output of the previous step and ‘reformats’ it into the wanted format: page rank + page, ordered by page rank. The sorting by key is done by Hadoop when the mapper result is supplied to the reducer step, so this ordering isn’t reflected in the mapper’s unit test. The code for this unit test is:

So here we just check that the mapper takes the input and formats the output correctly.
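In the same sketch style as above, the ordering mapper essentially swaps the pair so that the rank becomes the key, letting Hadoop’s sort on the mapper output do the actual ordering (again assuming the tab-separated record layout):

```python
def order_map(page, value):
    # value: "<rank>\t<outlinks>"; emit the rank as the key so the
    # framework's shuffle/sort phase orders the pages by rank for us
    rank = float(value.split("\t", 1)[0])
    return rank, page

print(order_map("m", "0.6364\tm"))  # (0.6364, 'm')
```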

This concludes the examples of the unit tests. With this project you should be able to experiment with the code yourself and gain more insight into how the original code works. It sure helped me to understand it! The complete version of the code, including unit tests, can be found here.