Data distillation with Hadoop and R

We're definitely in the age of Big Data: today, there are many more sources of data readily available to us to analyze than there were even a couple of years ago. But what about extracting useful information from novel data streams that are often noisy and minutely transactional … aye, there's the rub.

One of the great things about Hadoop is that it offers a reliable, inexpensive and relatively simple framework for capturing and storing data streams that just a few years ago we would have let slip through our grasp. It doesn't matter what format the data comes in: without having to worry about schemas or tables, you can just dump unformatted text (chat logs, tweets, email), device "exhaust" (binary, text or XML packets), flat data files, network traffic packets … all can be stored in HDFS pretty easily. The tricky bit is making sense of all this unstructured data: the downside to not having a schema is that you can't simply make an SQL-style query to extract a ready-to-analyze table. That's where Map-Reduce comes in.

Think of unstructured data in Hadoop as being a bit like crude oil: it's a valuable raw material, but before you can extract useful gasoline from Brent Sweet Light Crude or Dubai Sour Crude you have to put it through a distillation process in a refinery to remove impurities, and extract the useful hydrocarbons.

Likewise, I think "data distillation" is a useful metaphor for getting useful information out of unstructured data in Hadoop (but with less environmental impact!). The oil tankers are our various data streams: log files, event streams, unstructured text (perhaps scraped from an app or the Web), all of which are loaded into Hadoop's HDFS file system. Hadoop is our "data refinery", where we use the Map-Reduce process to remove the "impurities" (duplicate information, redundant fields, obvious errors, syntactic fillers) and to convert the useful data into a structured format for analysis.

The distillation process, in its own right, can be fairly complex. Sometimes it's as simple as pulling the appropriate field from a compound record like the one below (this is an iPhone diagnostic packet):

But it's just as likely that some deeper processing will be required in the data distillation process. If we're interested in how often someone uses the Weather app on the iPhone, we might want to calculate a difference in timestamps of just those packets related to the Weather app, rather than simply extracting a field. (Map-Reduce is great at operations like this.) We might also want to infer an action (or task, or behaviour) from a long sequence of transactions, and just export the action and associated variables (time, user, location, …) as structured data. Or we might want to convert unstructured text to a quantitative measure (number of instances of key words or phrases, or the "sentiment" expressed in the text: happy, angry, upset, etc.).
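To make the timestamp idea concrete, here is a minimal sketch of the map and reduce steps in plain Python (not rmr, and not running on Hadoop). The packet fields `user`, `app` and `ts`, the sample values, and the small `run_mapreduce` helper are all assumptions invented for illustration; a real diagnostic packet will look different.

```python
from collections import defaultdict

# Hypothetical packets: field names and values are invented for illustration.
packets = [
    {"user": "u1", "app": "Weather", "ts": 100},
    {"user": "u1", "app": "Mail", "ts": 130},
    {"user": "u1", "app": "Weather", "ts": 400},
    {"user": "u2", "app": "Weather", "ts": 50},
    {"user": "u1", "app": "Weather", "ts": 1000},
]

def map_packet(packet):
    """Map step: keep only Weather packets and key them by user."""
    if packet["app"] == "Weather":
        yield packet["user"], packet["ts"]

def reduce_gaps(user, timestamps):
    """Reduce step: differences between consecutive timestamps for one user."""
    ordered = sorted(timestamps)
    return user, [later - earlier for earlier, later in zip(ordered, ordered[1:])]

def run_mapreduce(records, mapper, reducer):
    """In-memory stand-in for the grouping Hadoop does between map and reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in groups.items())

weather_gaps = run_mapreduce(packets, map_packet, reduce_gaps)
print(weather_gaps)
```

The map step throws away everything that isn't a Weather packet, and the reduce step sees all of one user's timestamps together, which is what makes the difference calculation straightforward.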

That's why the R language is so great for writing map-reduce applications. (You can use the 'rmr' package from RHadoop to write map-reduce tasks in R.) It's not just that R has string-manipulation tools that are great for extracting fields from packets. You can also use many of the deeper data analysis tools in R during the distillation process: to extract signal from noise, to quantify unstructured text, to extract measures from social network graphs, and much, much more. Here are a few examples where R and Hadoop are used for data distillation:

Orbitz uses R and Hadoop to extract flights and hotels that will be presented during a travel search, based on previous transactions.

Using k-means clustering to extract similar "groups" of transactions, which are then aggregated and used as the record level for structured analysis.

Another recent example comes from marketing analytics company UpStream Software, who used map-reduce to convert transactions from Omniture logs (web visits, emails clicked on, ads displayed) into customer behaviors: response to an offer, research into a product, purchases. (Even lack of data was part of the analysis: periods without online activity, coupled with store loyalty data, can be interpreted as bricks-and-mortar research or purchases.)
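As a rough illustration of turning a sequence of log events into behaviors, here is a sketch in plain Python. It is not UpStream's method: the event names, the behavior labels, the 14-day `QUIET_DAYS` threshold and the sample rows are all hypothetical, chosen only to show the shape of the transformation, including how a gap in activity can itself become a distilled record.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical log rows: (customer, day, event). All values are invented.
events = [
    ("c1", 1, "email_click"),
    ("c1", 2, "web_visit"),
    ("c1", 3, "web_visit"),
    ("c1", 20, "web_visit"),
    ("c2", 5, "ad_displayed"),
    ("c2", 6, "purchase"),
]

QUIET_DAYS = 14  # hypothetical threshold for a "quiet" period

# Hypothetical mapping from raw events to behaviors; unlisted events are dropped.
EVENT_TO_BEHAVIOR = {
    "email_click": "offer_response",
    "web_visit": "product_research",
    "purchase": "purchase",
}

def infer_behaviors(customer_events):
    """Turn one customer's time-ordered events into (day, behavior) records."""
    behaviors = []
    previous_day = None
    for _, day, event in customer_events:
        # A long silence is recorded as a behavior in its own right.
        if previous_day is not None and day - previous_day >= QUIET_DAYS:
            behaviors.append((previous_day, "offline_gap"))
        if event in EVENT_TO_BEHAVIOR:
            behaviors.append((day, EVENT_TO_BEHAVIOR[event]))
        previous_day = day
    return behaviors

# Sorting and grouping by customer plays the role of the map-reduce shuffle.
distilled = {
    customer: infer_behaviors(list(rows))
    for customer, rows in groupby(sorted(events), key=itemgetter(0))
}
print(distilled)
```

In a real pipeline the offline gap would be combined with other sources (such as the store loyalty data mentioned above) before drawing any conclusion from it.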

As with all data distillation processes, UpStream's Hadoop output is a structured data file, neatly organized by rows (records) and columns (the distilled data fields associated with each record). The next step is to use that structured data file for traditional predictive modeling: in UpStream's case, Generalized Additive Models are used to associate specific marketing campaigns with ultimate product sales. The structured output file is likely to be large (even though the distillation process means it'll be smaller than the source unstructured data, we're still talking Big Data here), so a high-performance architecture for the statistical modeling is important. UpStream uses a separate multi-core analytics server running Revolution R Enterprise for high-performance modeling of the data exported from Hadoop.
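To show what "rows and columns" means in practice, here is a small plain-Python sketch that distills free text into keyword counts and writes one row per record as CSV. The keyword list, the sample messages and the column names are assumptions made up for this example, and counting keywords is only a crude stand-in for real sentiment analysis.

```python
import csv
import io
import re

KEYWORDS = ["happy", "angry", "upset"]  # hypothetical keyword list

# Hypothetical unstructured records, keyed by an invented record id.
messages = {
    "m1": "So happy with the update, really happy!",
    "m2": "Angry and upset about the crash.",
}

def distill(record_id, text):
    """Convert one free-text record into a fixed-width row of keyword counts."""
    words = re.findall(r"[a-z']+", text.lower())
    return [record_id] + [words.count(keyword) for keyword in KEYWORDS]

# Write the structured output: a header row, then one row per record.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["record_id"] + KEYWORDS)
for record_id, text in sorted(messages.items()):
    writer.writerow(distill(record_id, text))

structured_csv = buffer.getvalue()
print(structured_csv)
```

Every record ends up with the same columns regardless of how messy the source text was, which is what lets a modeling tool read the file directly.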

To summarize: if you have potentially useful data streams that are being generated but not stored today, consider setting up a process to store and archive them in Hadoop. Next, use the rmr package to write map-reduce tasks in the R language to generate a structured data set from the "crude oil" of unstructured data in Hadoop. Use the structured data to build predictive models in a high-performance analytics environment. Lastly, automate the entire process, to ensure your models are being kept up-to-date with the latest data stored in the Hadoop cluster. That's what many data-driven companies are doing today, and you can learn more about UpStream's revenue attribution models in the replay and slides from the recent webinar hosted by Revolution Analytics linked below.