Sunday, 30 March 2008

One of the things that struck me about this week's new Amazon EC2 features was the pricing model for Elastic IP addresses:

$0.01 per hour when not mapped to a running instance

The idea is to encourage people to stop hogging public IP addresses, which are a limited resource, when they don't need them.

I think one way of viewing EC2 - and the other Amazon utility services - is as a way of putting very fine-grained costs on various computing operations. So will such a pricing model drive us to minimise the computing resources we use to solve a particular problem? My hope is that making computing costs more transparent will at least make us think more about what we're using, in the way metered electricity makes (some of) us think twice about leaving the lights on. Perhaps we'll even start talking about optimizing for monetary cost or energy usage rather than purely for raw speed?

Saturday, 22 March 2008

Update: I've posted my answers to the exercises. Let me know if you find any mistakes. Also: Tamara Petroff has posted a write-up of the session.

On Wednesday [19 March], I ran a session at SPA 2008 entitled "Understanding MapReduce with Hadoop". SPA is a very hands-on conference, with many sessions having a methodological slant, so I wanted to get people who had never encountered MapReduce before actually writing MapReduce programs. I only had 75 minutes, so I decided against getting people coding on their laptops. (In hindsight this was a good decision, as I went to several other sessions where we struggled to get the software installed.) Instead, we wrote MapReduce programs on paper, using a simplified notation.

It seemed to work. For the first half hour, I gave as minimal an introduction to MapReduce as I could, then the whole group spent the next half hour working in pairs to express the solutions to a number of exercises as MapReduce programs. We spent the last 15 minutes comparing notes and discussing some of the solutions to the problems.

There were six exercises, presented in rough order of difficulty, and I'm pleased to say that every pair managed to solve at least one. Here's some of the feedback I got:

Some struggled to know what input data formats to use. Perhaps I glossed over this too much - I didn't want people to worry about precisely how the data was encoded - but I could have emphasised more that you can have the data presented to your map function in any way that's convenient.

While most people understood the notation I used for writing the map and reduce functions, it did cause some confusion. For example, someone wanted to see the example code again so they could understand what was going on. And another person said it took a while to realise that they could do arbitrary processing as a part of the map and reduce functions. It would be interesting to do the session again but using Java notation.

It was quite common for people to try to do complex things in their map and reduce functions - they felt bad if they just used an identity function, because it was somehow a waste. And on a related note, chaining MapReduce jobs together wasn't obvious to many. But once it was pointed out, folks had an "aha!" moment and were quick to exploit it.

The fact that you typically get multiple reduce outputs prompted questions from some - "but how do you combine them into a single answer?". Talking about chained MapReduce helped here again.

Everyone agreed that it wasn't much like functional programming.

You can find the slides on the Hadoop wiki. They include the six exercises, which I've reproduced below, in rough order of difficulty. (I'll post my answers next week.)

Find the [number of] hits by 5 minute timeslot for a website given its access logs.

Find the pages with over 1 million hits in a day for a website given its access logs.

Find the pages that link to each page in a collection of webpages.

Calculate the proportion of lines that match a given regular expression for a collection of documents.

Sort tabular data by a primary and secondary column.

Find the most popular pages for a website given its access logs.
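To give a flavour of how the paper exercises translate into running code, here is a minimal sketch of the first and last exercises in Python. It is not the notation used in the session, nor the posted answers: `run_job` is a hypothetical in-memory stand-in for a MapReduce job, and the log format ("timestamp page") is an assumption made for illustration. The last exercise is written as two chained jobs, which is one way to turn multiple reduce outputs into a single answer.

```python
from collections import defaultdict
from datetime import datetime


def run_job(records, map_fn, reduce_fn):
    """Hypothetical in-memory stand-in for one MapReduce job: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):  # reducers see their keys in sorted order
        output.extend(reduce_fn(key, groups[key]))
    return output


def parse(line):
    # Assumed log format: "<timestamp> <page>", e.g. "2008-03-19T10:01:10 /index.html"
    stamp, page = line.split()
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S"), page


def map_to_timeslot(line):
    when, _page = parse(line)
    # Round the timestamp down to the start of its 5 minute slot.
    slot = when.replace(minute=when.minute - when.minute % 5, second=0)
    yield slot.strftime("%Y-%m-%d %H:%M"), 1


def map_to_page(line):
    _when, page = parse(line)
    yield page, 1


def reduce_sum(key, values):
    yield key, sum(values)


def map_to_single_key(page_and_count):
    # Send every (page, count) pair to the same reducer.
    page, count = page_and_count
    yield "all", (count, page)


def top_n_reducer(n):
    def reduce_top(_key, values):
        for count, page in sorted(values, reverse=True)[:n]:
            yield page, count
    return reduce_top


def hits_by_timeslot(log_lines):
    return run_job(log_lines, map_to_timeslot, reduce_sum)


def most_popular_pages(log_lines, n):
    counts = run_job(log_lines, map_to_page, reduce_sum)          # job 1: hits per page
    return run_job(counts, map_to_single_key, top_n_reducer(n))   # job 2: pick the top n
```

Note that the second job funnels everything through a single key, which is fine for a sketch but would be a bottleneck for a large number of pages.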

Is this a good list of exercises? Do you have any exercises that you've found useful for learning MapReduce?

Finally, thanks to Robert Chatley for being a guinea pig for the exercises, and for helping out on the day with participants' questions during the session.

Tuesday, 18 March 2008

MapReduce is a programming model for processing vast amounts of data. One of the reasons that it works so well is that it exploits a sweet spot of modern disk drive technology trends. In essence MapReduce works by repeatedly sorting and merging data that is streamed to and from disk at the transfer rate of the disk. Contrast this with accessing data from a relational database that operates at the seek rate of the disk (seeking is the process of moving the disk's head to a particular place on the disk to read or write data).

So why is this interesting? Well, look at the trends in seek time and transfer rate. Seek time has improved at about 5% a year, whereas transfer rate has improved at about 20% [1]. Seek time is improving more slowly than transfer rate - so it pays to use a model that operates at the transfer rate. Which is what MapReduce does. I first saw this observation in Doug Cutting's talk, with Eric Baldeschwieler, at OSCON last year, where he worked through the numbers for updating a 1 terabyte database using the two paradigms: B-Tree (seek-limited) and Sort/Merge (transfer-limited). (See the slides and video for more detail.)
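To get a feel for the gap between the two paradigms, here is a rough back-of-the-envelope sketch in Python. Only the 1 terabyte figure comes from the talk as described above. The seek time, transfer rate, record size, update fraction and seeks per update are all illustrative assumptions, not the numbers from the slides.

```python
# All figures except the dataset size are assumptions chosen for illustration.
DATASET_BYTES = 1e12        # 1 terabyte database
SEEK_TIME_S = 0.010         # assumed: 10 ms per seek
TRANSFER_RATE = 100e6       # assumed: 100 MB/s sustained transfer
RECORD_BYTES = 100          # assumed: 100 bytes per record
UPDATE_FRACTION = 0.01      # assumed: 1% of records are updated
SEEKS_PER_UPDATE = 2        # assumed: one seek to read, one to write back


def seek_limited_seconds():
    """Update records in place, paying for seeks on every update."""
    updates = DATASET_BYTES / RECORD_BYTES * UPDATE_FRACTION
    return updates * SEEKS_PER_UPDATE * SEEK_TIME_S


def transfer_limited_seconds():
    """Stream the whole dataset off disk and write a merged copy back."""
    return 2 * DATASET_BYTES / TRANSFER_RATE


if __name__ == "__main__":
    print(f"seek-limited:     {seek_limited_seconds() / 86400:.1f} days")
    print(f"transfer-limited: {transfer_limited_seconds() / 3600:.1f} hours")
```

Under these assumptions the in-place approach takes roughly 23 days, while rewriting the entire terabyte takes under 6 hours, even though only 1% of the records changed.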

The general point was well summed up by Jim Gray in an interview in ACM Queue from 2003:

... programmers have to start thinking of the disk as a sequential device rather than a random access device.

Or the more pithy: "Disks have become tapes." (Quoted by David DeWitt.)

But even the growth of transfer rate is dwarfed by another measure of disk drives - capacity, which is growing at about 50% a year. David DeWitt argues that since the effective transfer rate of drives is falling we need database systems that work with this trend - such as column-store databases and wider use of compression (since this effectively increases the transfer rate of a disk). Of existing databases he says:

But this applies to transfer rate too (or if it doesn't yet, it will). Replace "seeks" with "transfers" and "transaction processing" with "MapReduce" and I think over time we'll start seeing Hadoop installations that choose to use large numbers of smaller capacity disks to maximize their processing rates.
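The arithmetic behind this can be sketched in a few lines of Python. The growth rates are the ones quoted above. The starting disk size, transfer rate and dataset size are hypothetical values, and the cluster model (data spread evenly over just enough disks, all read in parallel) is a simplifying assumption.

```python
import math

CAPACITY_GROWTH = 1.50   # capacity grows about 50% a year (from the text)
TRANSFER_GROWTH = 1.20   # transfer rate grows about 20% a year (from the text)


def full_scan_hours(capacity_bytes, transfer_bytes_per_s, years=0):
    """Hours to read a whole disk after `years` of compound growth."""
    capacity = capacity_bytes * CAPACITY_GROWTH ** years
    rate = transfer_bytes_per_s * TRANSFER_GROWTH ** years
    return capacity / rate / 3600


def cluster_scan_hours(total_bytes, disk_capacity_bytes, disk_rate):
    """Hours to scan a dataset spread evenly over just enough disks, read in parallel."""
    disks = math.ceil(total_bytes / disk_capacity_bytes)
    return total_bytes / (disks * disk_rate) / 3600


if __name__ == "__main__":
    # Hypothetical starting point: a 500 GB disk that streams at 70 MB/s.
    for years in (0, 5):
        print(years, "years:", round(full_scan_hours(500e9, 70e6, years), 1), "hours")
    # Hypothetical 100 TB dataset on 1 TB disks versus 250 GB disks.
    print(round(cluster_scan_hours(100e12, 1e12, 70e6), 2), "hours on 1 TB disks")
    print(round(cluster_scan_hours(100e12, 250e9, 70e6), 2), "hours on 250 GB disks")
```

The time to read a full disk grows by a factor of 1.5 / 1.2 = 1.25 each year, so it roughly triples in five years. In this simple model, storing the same data on disks a quarter of the size gives four times as many spindles and so cuts the scan time to a quarter.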

[1] See Trends in Disk Technology by Michael D. Dahlin for changes between 1987 and 1994. These figures still hold for the period since then, as is relatively easy to check using manufacturers' data sheets, although with seek time it's harder to tell since the definitions seem to change from year to year and from manufacturer to manufacturer. Still, 5% is generous.

Sunday, 2 March 2008

There's a class of MapReduce applications that use Hadoop just for its distributed processing capabilities. Telltale signs are:

1. Little or no input data of note. (Certainly not large files stored in HDFS.)
2. Map tasks are therefore not limited by their ability to consume input, but by their ability to run the task, which depending on the application may be CPU-bound or IO-bound.
3. Little or no map output.
4. No reducers (set by conf.setNumReduceTasks(0)).

This seems to work well - indeed the CopyFiles program in Hadoop (aka distcp) follows this pattern to efficiently copy files between distributed filesystems:

1. The input to each map task is a source file and a destination.
2. The map task is limited by its ability to copy the source to the destination (IO-bound).
3. The map output is used as a convenience to record files that were skipped.
4. There are no reducers.

Combined with Streaming this is a neat way to distribute your processing in any language. You do need a Hadoop cluster, it is true, but CPU-intensive jobs would happily co-exist with more traditional MapReduce jobs, which are typically fairly light on CPU usage.
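As an illustration of the Streaming variant, here is a minimal sketch of a map-only mapper in Python. Streaming feeds input records to the mapper on standard input and reads tab-separated key/value lines from standard output. Everything else here is an assumption made for the example: each input line is taken to hold one integer describing a unit of work, and `count_primes_below` is a deliberately CPU-heavy placeholder for whatever the real task would be.

```python
import sys


def count_primes_below(n):
    """Placeholder CPU-bound workload: count primes below n by trial division."""
    count = 0
    for candidate in range(2, n):
        if all(candidate % d for d in range(2, int(candidate ** 0.5) + 1)):
            count += 1
    return count


def map_lines(lines):
    """Turn each input line (one integer per line, assumed) into a 'key<TAB>value' record."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        n = int(line)
        yield f"{n}\t{count_primes_below(n)}"


def main(stdin=sys.stdin, stdout=sys.stdout):
    # Under Streaming, stdin carries the task descriptions and stdout the (small) map output.
    for record in map_lines(stdin):
        stdout.write(record + "\n")
```

To use it as a mapper, the script would end with a call to `main()`, and the job would be run with the number of reducers set to zero, as in the list above. The input file then acts as little more than a work list, with one map output line per completed unit of work.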