Analysis of Wikimedia Logs for Traffic Load and Popularity using Apache Hadoop

I took a class in Data Center systems this past spring as part of my Masters curriculum. My final project for that class involved analyzing public Wikimedia logs to determine traffic load and popularity patterns using Apache Hadoop. I had a lot of fun doing this project and it was a great learning experience, so I figured I’d blog about it.

The goals of this project were:
1.) Perform temporal analysis on the total number of requests per hour.
2.) Find the most popular Wikimedia project based on total views per hour per project.
3.) Find the top 10 most popular pages during a given day.
4.) Find the top 10 pages that returned the most content during a given day.
5.) Determine whether this data obeys Zipf’s law in terms of popularity.

The dataset for this project comprised three days’ worth (January 1st, 2012 to January 3rd, 2012) of public Wikimedia log entries, which translated to about 5.6GB of compressed data and about 20GB after decompression. Each request for a page, whether for editing or reading, and whether for a “special page” such as a log of actions generated on the fly or for an article from Wikipedia or one of the other projects, reaches one of Wikimedia’s Squid caching hosts. The request is then sent via UDP to a filter that discards requests from internal hosts, as well as requests for wikis that aren’t among the general projects. This filter writes out the project name, the size of the page requested, and the title of the page requested.

The first column of each record is the project name; “fr.b”, for example, is the French Wikibooks project. Projects without a period and a following character are Wikipedia projects. The following abbreviations are used:

wikibooks: “.b”
wiktionary: “.d”
wikimedia: “.m”
wikipedia mobile: “.mw”
wikinews: “.n”
wikiquote: “.q”
wikisource: “.s”
wikiversity: “.v”
mediawiki: “.w”

The second column is the title of the page retrieved, the third column is the number of requests, the fourth column is the size of the content returned, and the fifth column is the date and time the record was logged. There is a separate log file for each hour. These are hourly statistics, so in the line:

en Main_Page 242332 4737756101 20120101-0000

we see that the Main Page of the English-language Wikipedia was requested over 240,000 times and 4,737,756,101 bytes of data were transmitted for this page between midnight and 1am on January 1st, 2012. These are not unique visits.

The original logs downloaded from the public Wikimedia website did not contain date and time information in a fifth column, but the file names did. This information was added to each record in each of the files using a shell script.
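A minimal sketch of such a script is shown below; it assumes the hourly files follow the standard pagecounts-YYYYMMDD-HHMMSS naming convention and may differ from the exact script used for the project.

for f in pagecounts-*; do
    # Derive a "YYYYMMDD-HHMM" stamp from the file name,
    # e.g. pagecounts-20120101-000000 -> 20120101-0000 (naming convention assumed)
    stamp=$(basename "$f" | sed -E 's/^pagecounts-([0-9]{8})-([0-9]{4})[0-9]{2}.*$/\1-\2/')
    # Append the stamp as a fifth column to every record in the file
    awk -v ts="$stamp" '{ print $0, ts }' "$f" > "$f.stamped"
done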

Adding this fifth column to every record made it easier to program the Hadoop jobs, but it also increased the size of the total dataset from about 20GB to about 27GB.

Depending on the type of output required, the Hadoop mapper routines were programmed using a combination of project name, page name, and date-time as the key, and either the number of requests or the content size as the value. The reducers iterate through and sum up all the values with a common key. Results obtained from Hadoop were sorted and truncated for presentation using the Linux sort and head commands.
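As an illustration of this structure, a mapper and reducer for the hourly request totals (goal 1) could look roughly like the sketch below; the class names and field handling here are illustrative assumptions rather than the project’s actual code.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical sketch: total number of requests per hour (goal 1).
// Input records look like: "en Main_Page 242332 4737756101 20120101-0000"
public class HourlyRequests {

    public static class HourMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split("\\s+");
            if (fields.length < 5) {
                return; // skip malformed records
            }
            // key = date-hour (fifth column), value = request count (third column)
            context.write(new Text(fields[4]), new LongWritable(Long.parseLong(fields[2])));
        }
    }

    public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            long total = 0;
            for (LongWritable v : values) {
                total += v.get(); // sum all values that share this key
            }
            context.write(key, new LongWritable(total));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "hourly request totals");
        job.setJarByClass(HourlyRequests.class);
        job.setMapperClass(HourMapper.class);
        job.setCombinerClass(SumReducer.class); // summing is associative, so the reducer doubles as a combiner
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The same pattern covers the popularity goals: fold the project name and page name into the key, and emit the content size instead of the request count as the value for goal 4.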

Click here for the GitHub repository of the code for this project. Click here for the results I found after implementing the code above.
Here are the commands I used for running this code:
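In outline, the invocation and post-processing look like the following; the jar name, class name, and HDFS paths here are placeholders rather than the exact values used:

# Run the job (wikistats.jar and HourlyRequests are placeholder names)
hadoop jar wikistats.jar HourlyRequests /data/pagecounts /out/hourly-requests

# Pull the results back, sort by the summed value, and keep the top 10
hadoop fs -cat /out/hourly-requests/part-r-* | sort -k2,2nr | head -n 10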