Harnessing Hadoop to Visualize Big Data Log Files

One of the demonstration projects on the radar for the Big Data Practice here at Avalon Consulting, LLC was processing a large set of log data and visualizing it in an interesting way. A few days before the papal conclave, we realized that the traffic patterns at Patheos (the world’s leading website for religion and spirituality) surrounding the event might make for interesting data. The log files from the day of and the day after the pope’s selection were captured, and I went to work processing the logs and creating a visualization.

Patheos Log Data Visualization

I processed the logs in a couple of steps, which I’ll outline here and then describe in detail. The log files were copied to Amazon Web Services (AWS) S3 so that I could easily analyze them with AWS Elastic MapReduce. I then created a Hadoop job that processed the logs and output the timestamp, latitude, and longitude for each request. Finally, I created the visualization using a combination of JavaScript and the Google Maps JavaScript API.

For the period surrounding the pope’s selection, the log data totaled around 4GB in two separate files. The most important part of copying this data to S3 is choosing a bucket name that follows AWS recommendations; in particular, the bucket name should contain no underscores. With the data in S3, I started work on the Hadoop job.

Since the data already existed on AWS S3, it made sense to process it using AWS EC2 or AWS Elastic MapReduce. Because the data was rather large (4GB) and the process should remain usable over longer time periods, I chose AWS Elastic MapReduce. I wrote a Hadoop job that parses the log, converts each IP address to a latitude and longitude, batches the data into 15-minute sections, and removes duplicates within each section.

The log data was from HAProxy and was in the format detailed here. Using Hadoop’s TextInputFormat, I was able to read the log files line by line. A simple regular expression pulled out the accept_date and client_ip fields. The accept_date field became the timestamp, which I converted to milliseconds since the epoch. Since I wanted to batch the log data into 15-minute chunks, I divided the millisecond timestamp by 1000*60*15 and truncated the result to an integer, yielding one bucket index per 15-minute window. I then focused on converting the client_ip field to a latitude and longitude.
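The parsing and bucketing logic can be sketched in Python (the actual job was written in Java). The sample log line and regular expression below are illustrative, assuming HAProxy’s default HTTP log format with the accept_date in UTC:

```python
import re
from datetime import datetime, timezone

# Illustrative regex for HAProxy's default HTTP log format: grabs the
# client IP (before the source port) and the accept_date in brackets.
LOG_RE = re.compile(r"(\d{1,3}(?:\.\d{1,3}){3}):\d+ \[([^\]]+)\]")

FIFTEEN_MIN_MS = 1000 * 60 * 15  # bucket size in milliseconds

def parse_line(line):
    """Return (client_ip, 15-minute bucket index) or None if unparseable."""
    m = LOG_RE.search(line)
    if m is None:
        return None
    ip, accept_date = m.group(1), m.group(2)
    # accept_date looks like 06/Feb/2009:12:14:14.655; assume UTC here.
    dt = datetime.strptime(accept_date, "%d/%b/%Y:%H:%M:%S.%f")
    dt = dt.replace(tzinfo=timezone.utc)
    ts_ms = int(dt.timestamp()) * 1000 + dt.microsecond // 1000
    return ip, ts_ms // FIFTEEN_MIN_MS

# Sample line adapted from HAProxy's documented HTTP log format.
sample = ('haproxy[14389]: 10.0.1.2:33317 [06/Feb/2009:12:14:14.655] '
          'http-in static/srv1 10/0/30/69/109 200 2750 - - ---- '
          '1/1/1/1/0 0/0 "GET /index.html HTTP/1.1"')
print(parse_line(sample))  # ('10.0.1.2', 1371024)
```

Every request within the same 15-minute window maps to the same bucket index, which becomes the mapper’s output key.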

I chose MaxMind’s free GeoIP database and its corresponding Java API to convert the IP addresses to latitude and longitude. This required a few extra steps to get working on AWS Elastic MapReduce. The GeoIP database file is about 18MB, and it needs to be accessible to every worker on the Hadoop cluster, so I shared it across the cluster with the Hadoop distributed cache. On AWS Elastic MapReduce, this means first uploading the file to S3 and then adding it to the distributed cache from within the job. In the mapper’s setup step, I then read the file from the distributed cache as if it were local. From the mapper, I emitted the timestamp along with the latitude and longitude from the GeoIP database.
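A minimal sketch of the mapper’s emit step, with a hypothetical in-memory dict standing in for the MaxMind GeoIP lookup (the real job loaded the database file from the distributed cache and queried it through MaxMind’s Java API):

```python
# Hypothetical stand-in for the MaxMind GeoIP database lookup; the
# coordinates below are made up purely for illustration.
GEO_DB = {
    "10.0.1.2": (32.78, -96.8),
    "10.0.1.3": (41.9, 12.45),
}

def map_record(ip, bucket):
    """Emit (key, value) as the mapper would: bucket index -> 'lat|long'."""
    loc = GEO_DB.get(ip)
    if loc is None:
        return None  # skip IPs the database cannot resolve
    lat, lon = loc
    return bucket, "%s|%s" % (lat, lon)

print(map_record("10.0.1.2", 1371024))  # (1371024, '32.78|-96.8')
```

Grouping by the bucket index means Hadoop delivers all locations for a given 15-minute window to the same reducer call.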

The Hadoop job then used the reducer to remove duplicates from the values and wrote the results using TextOutputFormat. The output format looks like timestamp \t lat|long,lat|long,…, which is easy to process into a JavaScript file once the job completes. A Python script then wraps the output in a JSON object to produce the JavaScript file. At this point the data is ready for the visualization.
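A sketch of that dedup-and-format step, plus the JSON-wrapping post-processing, again in Python; the logData variable name is a hypothetical choice, not taken from the original script:

```python
import json
from collections import defaultdict

def reduce_and_format(pairs):
    """Deduplicate 'lat|long' values per bucket and render each group as a
    TextOutputFormat-style line: timestamp \t lat|long,lat|long,..."""
    groups = defaultdict(list)
    for bucket, latlong in pairs:
        if latlong not in groups[bucket]:  # drop duplicates, keep order
            groups[bucket].append(latlong)
    return ["%d\t%s" % (b, ",".join(vals)) for b, vals in sorted(groups.items())]

def to_javascript(lines):
    """Wrap the job output in a JSON object assigned to a JS variable
    that the visualization page can load ('logData' is hypothetical)."""
    data = {}
    for line in lines:
        bucket, vals = line.split("\t")
        data[bucket] = vals.split(",")
    return "var logData = %s;" % json.dumps(data, sort_keys=True)

pairs = [(1371024, "32.78|-96.8"), (1371024, "32.78|-96.8"),
         (1371024, "41.9|12.45"), (1371025, "32.78|-96.8")]
lines = reduce_and_format(pairs)
print(lines[0])          # 1371024	32.78|-96.8,41.9|12.45
print(to_javascript(lines))
```

Note how the duplicate (1371024, "32.78|-96.8") pair collapses to a single value in the first output line.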

I decided to use Google Maps JS API V3 to create the map visualization. The Maps API provides a heatmap layer to which you can add data points as latitude/longitude pairs. I animated the data by looping through the available data points and feeding each batch of locations into the heatmap. The visualization can therefore play back the Patheos log data on the map over the period around the pope’s selection.

In addition to the visualization, a few other ideas were brainstormed that didn’t make it into this demonstration. These included faceting the log data and drilling down into specific URLs, for example looking at specific Patheos channels and viewing their traffic overlaid together. Being able to query different time segments would also have added flexibility to the animation. Since this was just a demonstration of a technique, I decided those features would have been excessive, but they could easily be added for a client.

One of the impressive aspects of this project was the low cost achieved by using Amazon Web Services for the processing. I was able to process the logs in about 15 minutes on AWS Elastic MapReduce using 7 machines, 5 of which were spot instances to reduce costs even further. The total cost came to $0.98: $0.38 for storing the 4GB of log data on S3 for a month, $0.15 for the 2 small instances, and $0.45 for the 5 large spot instances on AWS Elastic MapReduce. Since instances are charged per hour, I could have processed about 4 times the data for the same price. Amazon Web Services let me process the data quickly and cost effectively.

I’d be happy to post code snippets in a subsequent blog post if enough people are interested. Just comment if you’d like me to do that.

The results:

The video is best viewed fullscreen or on YouTube.

Kevin Risden, an Apache Lucene/Solr committer, has been consulting on search and Hadoop for over 3 years at Avalon Consulting, LLC. He has helped organizations successfully transform their big data into business results.

About Avalon Consulting, LLC

Avalon Consulting, LLC transforms data investments into actionable business results through the visioning and implementation of Big Data, Web Presence, Content Publishing, and Enterprise Search solutions. We are the trusted partner to over one hundred clients, primarily Global 2000 companies, public agencies, and institutions of higher learning.