bashreduce: A Bare-Bones MapReduce

In late 2004, Google surprised the world of computing with the release of the paper MapReduce: Simplified Data Processing on Large Clusters. That paper ushered in a new model for data processing across clusters of machines that had the benefit of being simple to understand and incredibly flexible. Once you adopt a MapReduce way of thinking, dozens of previously difficult or long-running tasks suddenly start to seem approachable–if you have sufficient hardware.

If you’ve managed to somehow miss most of the MapReduce revolution, Wikipedia describes it pretty well:

MapReduce is a framework for computing certain kinds of distributable problems using a large number of computers (nodes), collectively referred to as a cluster. Computational processing can occur on data stored either in a filesystem (unstructured) or within a database (structured).

“Map” step: The master node takes the input, chops it up into smaller sub-problems, and distributes those to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure.

The worker node processes that smaller problem, and passes the answer back to its master node.

“Reduce” step: The master node then takes the answers to all the sub-problems and combines them in a way to get the output – the answer to the problem it was originally trying to solve.
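To make those two steps concrete, here is a minimal single-machine sketch in shell. It is not how bashreduce or Hadoop is implemented; the function names and the two sample chunks are invented for illustration. Each "worker" counts the words in its own chunk, and the "master" then adds up the partial counts.

```shell
#!/usr/bin/env bash
# Single-machine sketch of the map and reduce steps (illustrative only).

# "Map": a worker turns its chunk of text into per-word partial counts.
map_count_words() {
  tr -s '[:space:]' '\n' | grep -v '^$' | LC_ALL=C sort | uniq -c | awk '{print $2, $1}'
}

# "Reduce": the master adds up the partial counts from every worker.
reduce_sum_counts() {
  awk '{total[$1] += $2} END {for (w in total) print w, total[w]}' | LC_ALL=C sort
}

# The master splits the input into two sub-problems, one per hypothetical worker.
chunk1='apple pear apple'
chunk2='pear fig pear'

{
  printf '%s\n' "$chunk1" | map_count_words
  printf '%s\n' "$chunk2" | map_count_words
} | reduce_sum_counts
```

On a real cluster the two map calls would run on different machines; here they simply run one after the other so the data flow is easy to follow.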

In fact, the MapReduce model has proven so useful that the Apache Hadoop project (an Open Source implementation of the infrastructure described in the Google paper) has become very popular in the last few years. Yahoo, which employs numerous Hadoop committers, recently hosted its annual Hadoop Summit, which attracted over 500 users and developers.

More than just a toy project, bashreduce lets us address a common scenario around these parts: we have a few analysis machines lying around, and we have data from various systems that are not in Hadoop. Rather than go through the rigmarole of sending it to our Hadoop cluster and writing yet another one-off Java or Dumbo program, we instead fire off a one-liner bashreduce using tools we already know in our reducer: sort, awk, grep, join, and so on.

It sounds almost comical, but it makes a lot of sense. Like most of the Unix shell tools, bash is nearly everywhere. So why not build up enough of a bash script to facilitate basic MapReduce-style processing for periodic or one-off jobs? It’s really quite handy.

bashreduce is new enough that it’s not packaged up for popular distributions yet, but you can pull a copy from GitHub easily enough:

As you can see, br needs a few arguments and possibly a config file before it’s useful. First, you need to specify the list of hosts (nodes) that br will distribute data to and run on. You can either pass them as a single quoted -m argument, like "host1 host2 host3", or list them in an /etc/br.hosts file, one host per line. If you have multi-core hosts, you can list them more than once to take advantage of the additional CPU cores.
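As a rough sketch of the two forms, the snippet below builds a one-host-per-line list and then flattens it into the whitespace-delimited string that -m expects. The hostnames node1 and node2 are hypothetical, and a temporary file stands in for /etc/br.hosts so that no root access is needed.

```shell
#!/usr/bin/env bash
# Sketch: a hosts list in the one-host-per-line layout described above.
# A temp file stands in for /etc/br.hosts; node1 and node2 are made-up names.
hosts_file="$(mktemp)"

# node1 appears twice so that two worker slots land on its two cores.
printf '%s\n' node1 node1 node2 > "$hosts_file"

# Number of worker slots is simply the number of lines.
slots="$(wc -l < "$hosts_file" | tr -d ' ')"

# The same list as one whitespace-delimited string, the form passed to -m.
hosts_arg="$(paste -s -d ' ' "$hosts_file")"

echo "worker slots: $slots"
echo "-m \"$hosts_arg\""
rm -f "$hosts_file"
```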

That will take your /etc/passwd file, chop it into two pieces, sort them, and then merge the sorted results. Nobody needs a sorted /etc/passwd file, but if you had a much larger file in need of sorting or, preferably, a more CPU-intensive bit of processing, this approach would make some sense. The point is that you’ve just distributed this work among both CPU cores on your machine without having to do a lot of extra work.
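If you want to see that chop, sort, and merge flow without bashreduce in the picture, the following local sketch does the same thing with standard tools. It is an approximation under simple assumptions (the file is cut into a first and second half, and the helper name sort_in_two_pieces is invented here); bashreduce’s own partitioning may work differently.

```shell
#!/usr/bin/env bash
# Local sketch of "chop into two pieces, sort each, merge" with standard tools.
# This mimics the data flow described above; it is not bashreduce's own code.
sort_in_two_pieces() {
  local input="$1" workdir total half
  workdir="$(mktemp -d)"
  total="$(wc -l < "$input")"
  half=$(( (total + 1) / 2 ))

  # Chop: first half and second half of the file.
  head -n "$half" "$input" > "$workdir/piece1"
  tail -n +"$(( half + 1 ))" "$input" > "$workdir/piece2"

  # Sort each piece in its own background process (one per core).
  LC_ALL=C sort "$workdir/piece1" > "$workdir/sorted1" &
  LC_ALL=C sort "$workdir/piece2" > "$workdir/sorted2" &
  wait

  # Merge: sort -m combines already-sorted inputs without re-sorting them.
  LC_ALL=C sort -m "$workdir/sorted1" "$workdir/sorted2"
  rm -rf "$workdir"
}
```

Calling sort_in_two_pieces /etc/passwd prints the sorted file on stdout. The merge step is cheap because each piece is already sorted.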

Suppose you have a multi-field, whitespace-delimited log file and want to extract a single column, count the occurrences of each value in that column, and see the results.

The choice of /var/log/messages here is primarily motivated by the fact that you’re likely to have it on your system. A multi-gigabyte Apache log or application server log would lend itself to this type of processing.
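For reference, the computation itself looks like this on a single machine with no distribution at all. This is a sketch, not a bashreduce invocation, and the helper name count_column is made up for the example; it prints each distinct value in the chosen column followed by how many times it occurs.

```shell
#!/usr/bin/env bash
# Single-machine sketch: count occurrences of each value in one column
# of a whitespace-delimited file. count_column is a hypothetical helper.
count_column() {
  local file="$1" column="$2"
  # Keep only lines that actually have the requested column, print that field,
  # then group identical values together and count them.
  awk -v col="$column" 'NF >= col {print $col}' "$file" \
    | LC_ALL=C sort \
    | uniq -c \
    | awk '{print $2, $1}'
}
```

You would call it as count_column followed by a file name and a column number. The sort before uniq -c matters, because uniq only collapses adjacent duplicate lines.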

These examples are fairly trivial but serve to show you how to get started. The real power comes when you’re using your own code instead of a uniq command.

bashreduce Enhancements

Since the release of bashreduce, developer Richard Crowley has extended bashreduce, adding several useful features:

the ability to pass a filename (rather than the actual file data via an nc pipe) to each process. This assumes that each machine has a local copy of the data or access to a shared filesystem. This will greatly reduce the network bandwidth required.

support for processing a directory full of files rather than a single file. Any of the files may be compressed with gzip; bashreduce will detect that and handle decompression transparently.

the -M option, which allows you to specify your own merge program instead of the default (sort -m)
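To give a feel for the directory and gzip handling, here is one way such transparent decompression could be written. This is an assumption-laden sketch, not code from Richard’s fork: the function names are invented, and detection here relies on gzip -t, which may not be the method the fork uses.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of reading a directory of mixed plain and gzip files.
# Detection uses `gzip -t`; the fork's own detection method may differ.
cat_maybe_gzip() {
  local f="$1"
  if gzip -t "$f" 2>/dev/null; then
    gzip -dc "$f"   # valid gzip data: decompress to stdout
  else
    cat "$f"        # anything else: pass through unchanged
  fi
}

# Stream every regular file in a directory, decompressing where needed.
cat_directory() {
  local dir="$1" f
  for f in "$dir"/*; do
    if [ -f "$f" ]; then
      cat_maybe_gzip "$f"
    fi
  done
}
```

The output of cat_directory can be piped into any of the usual tools (sort, awk, uniq) without the reader needing to know which inputs were compressed.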

Whether you use Erik’s original bashreduce or Richard’s fork, you end up with the ability to extend the basic Unix philosophy of standard tools speaking to each other on stdin/stdout to many hosts all doing work in parallel. Not bad for a little bash scripting, huh?

Comments on "bashreduce: A Bare-Bones MapReduce"

stevenworr

There’s a huge amount of cleanup that can be done here. I question whether the -m option works. It is supposed to allow multiple hosts, but it looks like it only gets one. No? If you really want it to work you have to either allow multiple -m options or you have to have the multiple hosts separated by commas.
e.g., -m h1,h2,h3
and then break the list up.

Instead of using basename all over, I just set prog globally using

prog="${0##*/}"

-z $hosts will be a syntax error if it’s null and there are no quotes around it.

There was a comment at the end about killing a negative. That works because the signal is sent to the process group.

The multi-host capability is simply a documentation issue. As you can see in the example utilizing two local cores, the -m switch is passed a single parameter containing a whitespace-delimited list of target hosts.
