Building an Inverted Index with Hadoop and Pig

Note: For some reason, this post appears to be pretty popular. Here's the thing. This was the first thing I wrote when learning Pig. Literally -- I wrote it down the evening I sat down to play with Pig's syntax. You wouldn't really ever construct an inverted index this way. The point was that you can, not that you should. It is, however, kind of neat.

Pig is a system for processing very large datasets, developed mostly at Yahoo and now an Apache Hadoop sub-project. Pig aims to provide massive scalability by translating code written in a new data processing language called Pig Latin into Hadoop (map/reduce) plans.

In this post, I present a (very) brief description of the Pig project and demonstrate how one can construct an inverted index from a collection of text files using just a few lines of Pig Latin.

Pig offers SQL-like data processing instructions (select, project, filter, group), while being both more flexible, by allowing simple integration of user-defined functions, and more straightforward, by allowing users to issue commands procedurally rather than declaratively, as in SQL. Now, I am a big fan of declarative languages, but experience does show that expressing complex rules in SQL is cumbersome.

There are several other projects with similar goals in the Hadoop universe — HBase and Hive, both also Hadoop subprojects, being the more famous ones; both support variants of SQL.

The Pig feature that makes it stand out is its easy native support for nested elements — meaning a tuple can have other tuples nested inside it; Pig also supports maps and a few other constructs. The SIGMOD 2008 paper presents the language and gives examples of how the system is used at Yahoo.

Without further ado — a quick example of the kind of processing that would be awkward, if not impossible, to write in regular SQL, and long and tedious to express in Java (even using Hadoop).

Let’s say we have a (very large) collection of (very long) text files, and we want to index it so that we can quickly find the documents that contain certain words. Generally, this kind of problem is solved by making an inverted index — a structure that lists all words in a collection, and for each word, all the documents it occurs in.

There is kind of an odd thing going on with loading — in order to know which file each line comes from, I load the files one by one and inject their names into the tuples as they are read. Pig does know how to read whole directories natively, but unfortunately it does not expose the name of the file being read to the Loader function (a programmer can build their own loading function — as well as filtering functions, ordering functions, etc.). There is an interface called “Slicer” that a Loader can implement to get this kind of access, but that’s a bit messy too. Anyway, that’s not the fun part. The fun part is what happens later.
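For reference, the loading step described above might look something like this. This is only a sketch — the file names (`docs/a.txt`, `docs/b.txt`) are hypothetical, and a real collection would repeat the pattern (or generate it) for every file:

```pig
-- Load each file separately, injecting its name as a constant field.
-- TextLoader reads each line as a single chararray.
a = LOAD 'docs/a.txt' USING TextLoader() AS (string: chararray);
a = FOREACH a GENERATE 'a.txt' AS fname, string;

b = LOAD 'docs/b.txt' USING TextLoader() AS (string: chararray);
b = FOREACH b GENERATE 'b.txt' AS fname, string;

-- Combine everything into a single relation of (fname, string) tuples.
text = UNION a, b;
```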

Let’s step through it.

words = FOREACH text GENERATE fname, FLATTEN( TOKENIZE(string) );
For each line of text, we have a “data bag” of two fields — fname and string. The built-in TOKENIZE function splits the string (we can also provide our own tokenizer/stemmer/what-have-you). Just calling “GENERATE fname, TOKENIZE(string)” would give us the same rows, with the second field now being another “DataBag”, this one containing a tuple for each word. The “FLATTEN” operator flattens this nested structure — it generates a new row for every element of the data bag. So { (‘foo.txt’, (‘bar’, ‘baz’, ‘bam’)) } becomes { (‘foo.txt’, ‘bar’), (‘foo.txt’, ‘baz’), (‘foo.txt’, ‘bam’) }.

word_groups = GROUP words BY $1;
This is the inversion part. We group this set of words by the second column — the word.
So if we had { (‘foo.txt’, ‘apple’), (‘foo.txt’, ‘pear’), (‘bar.txt’, ‘apple’) }, we now get { (‘apple’, { (‘foo.txt’, ‘apple’), (‘bar.txt’, ‘apple’) }), (‘pear’, { (‘foo.txt’, ‘pear’) }) }.
In other words, we get a set of tuples where the first column is the value of the column we grouped by, and the second column is a ‘DataBag’ of the relevant tuples.

Now we generate the index — for every word group, we find the distinct filenames, and write out the word followed by the number of files in which it is found, and the filenames themselves.
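In Pig Latin, that step might look like the following sketch, using a nested FOREACH block; the relation and field names follow the ones used above:

```pig
-- For each word group: deduplicate the filenames, then emit
-- (word, number of files it appears in, bag of those filenames).
index = FOREACH word_groups {
        files = DISTINCT words.fname;
        GENERATE group AS word, COUNT(files) AS file_count, files;
};
```

The nested block runs DISTINCT per group rather than over the whole relation, which is what lets us count each file only once per word.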

STORE index INTO '/data/inverted_index';
This is what makes the code actually run. The index is computed (partitioning the data across Hadoop servers and all that jazz is taken care of automatically), and the index is written out.

Note that the index is written out in lexically sorted order, as are the files inside each posting list.