A place to read about topics of interest to data miners, ask questions of the data mining experts at Data Miners, Inc., and discuss the books of Gordon Linoff and Michael Berry.

Friday, December 18, 2009

Hadoop and MapReduce: Method for Reading and Writing General Record Structures

I'm finally getting more comfortable with Hadoop and java, and I've decided to write a program that will characterize data in parallel files.

To be honest, I find that I am spending a lot of time writing new Writable and InputFormat classes, every time I want to do something. Every time I introduce a new data structure used by the Hadoop framework, I have to define two classes. Yucch!

So, I put together a simple class called GenericRecord that can store a set of column names (as string) and a corresponding set of column values (as strings). These are stored in delimited files, and the various classes understand how to parse these files. In particular, the code can read any tab delimited file that has column names on the first row (and changing the delimitor should be easy). One nice aspect is the ability to use the GenericRecord as the output of a reduce function, which means that the number and names of the output can be specified in the code -- rather than in additional files with additional classes.

I wouldn't be surprised if similar code already exists with more functionality than the code I have here. This effort is also about my learning Hadoop.

This code is analogous to the word count code, that must be familiar to anyone starting to learn MapReduce (since it seems to be the first example in all the documentation I've seen). Instead of counting words, this code counts the occurrence of values in the columns.

The code reads input files and produces output records with three columns:

A column name in the original data.

A value in the column.

The number of times the value appears.

Do note that for data with many unique values in many columns, the number of output records is likely to far exceed the number of input records. So, the output file can be bigger than the input file.

The input records are assumed to be in a text file with one record per row. The first row contains the names of the columns, delimited by a tab (although this could easily be changed to another delimiter). The rest of the rows contain values. Note that this assumes that the input files are all read from the beginning; that is, that a single input file is not split among multiple map tasks.

One irony of this code and the Hadoop framework is that the input files do not have to be in the same format. So, I could upload a bunch of different files, with different numbers of columns, and different column names, and run them all in parallel. I would have to be careful that the column names are all different, for this to work well.

Examples of such files are available on the companion page for my book Data Analysis Using SQL and Excel. These are small files by the standards of Hadoop (measures in megabytes) but quite sufficient for testing and demonstrating code.

Overview of Approach

There are four classes defined for this code:

GenericRecordMetadata stores the metadata (column names) for a record.

GenericRecord stores the values for a particular record.

GenericRecordInputFormat provides the interface for reading the data into Hadoop.

GenericRecordTester provides the functions for the MapReduce framework.

The metadata consists of the names of the columns, which can be accessed either by a column index or by a column name. The metadata has functions to translate a column name into a column index. Because it uses a HashMap, the functions should run quite fast, although they are not optimal in memory space. This is okay, because the metadata is stored only once, rather than once per row.

The generic record itself stores the data as an array of strings. It also contains a pointer to the metadata object, in order to fetch the names. The array of strings minimizes both memory overhead and time, but does require access using an integer. The other two classes are needed for the Hadoop framework.

One small challenge is getting this to work without repeating the metadata information for each row of data. This is handled by including the column names as the first row in any file created by the Hadoop framework, and not by putting the column names in the output for each row.

Setting Up The Metadata When Reading

The class GenericRecordInputFormat basically does all of its work in a private class called GenericRecordRecordReader. This function has two important functions: initialize() and nextKeyValue().

The initialize() function sets up the metadata, either by reading environment variables in the context object or by parsing the first line of the input file (depending on whether or not the environment variable genericrecord.numcolumns is defined). I haven't tested passing in the metadata using environment variables, because setting up the environment variables poses a challenge. These variables have to be set in the master routine in the configuration before the map function is called.

The nextKeyValue() function reads a line of the text file, parses it using the function split(), and sets the values in the line. The verification on the number of items read matching the number of expected items is handled in the function lineValue.set(), which raises an exception (currently unhandled) when there is a mismatch.

Setting Up The Metadata When Writing

Perhaps more interesting is the ability to set up the metadata dynamically when writing. This is handled mostly in the setup() function of the SplitReduce class, which sets up the metadata using various function calls.

Writing the column names out at the beginning of the results file uses a couple of tricks. First, this does not happen in the setup() function but rather in the reduce() function itself, for the simple reason that the latter handles IOException.

The second trick is that the metadata is written out by putting it into the values of a GenericRecord. This works because the values are all strings, and the record itself does not care if these are actually for the column names.

The third trick is to be very careful with the function GenericRecord.toString(). Each column is separated by a tab character, because the tab is used to separate the key from the value in the Hadoop framework. In the reduce output files, the key appears first (the name of the column in the original data), followed by a tab -- as put there by the Hadoop framework. Then, toString() adds the values separated by tabs. The result is a tab-delimited file that looks like column names and values, although the particular pieces are put there through different mechanisms. I imagine that there is a way to tell Hadoop to use a different character to separate the key and value, but I haven't researched this point.

The final trick is to be careful about the ordering of the columns. The code iterates through the values of the GenericRecord table manually using an index rather than a for-in loop. This is quite intentional, because it allows the code to control the order in which the columns appear -- which is presumably the original ordered in which they were defined. Using the for-in is also perfectly valid, but the columns may appear in a different order (which is fine, because the column names also appear in the same order).

The result of all this machinery is that the reduce function can now return values in a GenericRecord. And, I can specify these in the reduce function itself, without having to mess around with other classes. This is likely to be a big benefit as I attempt to develop more code using Hadoop.

1 Comments:

Anonymous said...

Hey Gordon,

During your explorations, you may want to check out Avro (http://hadoop.apache.org/avro/), which is a serialization and RPC framework that will be replacing the Writable system for data serialization in Hadoop.