MapReduce on App Engine made easy


The other day, I wanted to iterate over all entities in our App Engine datastore. Naturally, MapReduce came to mind, so I googled “app engine python mapreduce”. And I got the same things you did: the project source code on GitHub with convoluted examples and links to an outdated, gazillion-page-long tutorial.

I started following the tutorial, but the first link in it pointed to a non-existent SVN repository :facepalm: This shouldn’t be so agonizing, right? I mean, someone took the time to write the library and already did all the heavy lifting: this should be easier. And indeed, easier it is. I am here to tell you all about it.

This post deals with the mapper aspect of MapReduce. I might write a follow-up dealing with filters, callbacks and reduce if there’s demand 🙂 I assume you’re familiar with Python, App Engine and some basic MapReduce concepts. It should take around 15-25 mins to complete from scratch. Let’s get to it.

Step 1 – The Handlers

First things first: let’s look at app.yaml. There are three entries you need to add in order to make the example work. In your production environment, it could be as few as one.

Items 1 & 2 are custom (i.e. app-dependent). The first populates the datastore (which you won’t need if the entities are already in your datastore). The second kicks off the MapReduce job, which is also custom. You could do that with cron or trigger it via some business logic. The third is critical and should be copy-pasted as is, because it handles all the inner calls that MapReduce generates while running a job.
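Here is a sketch of what those three app.yaml entries might look like. This assumes the Python 2.7 runtime, that both custom handlers live in a `tasks.py` that exposes a webapp2 application named `app`, and that the MapReduce library is vendored into your project under `mapreduce/`; the exact script path for the third entry depends on the library version you have.

```yaml
handlers:
# 1. Populates the datastore with Dummy entities (you won't need this
#    in production if your entities already exist).
- url: /populate_db
  script: tasks.app
# 2. Kicks off the MapReduce job (could also be triggered by cron or
#    business logic).
- url: /start_job
  script: tasks.app
# 3. Routes the library's internal calls -- copy as is.
- url: /mapreduce(/.*)?
  script: mapreduce.main.APP
```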

Step 2 – The Data

So that MapReduce will have data to map (i.e. iterate over), we’re going to populate the datastore with entities of the Dummy model. The Dummy model is very simple: it contains only one field (counter). We set it to 0 by default, and the mapper will increase it every time an entity is processed. More on that later. This is the Dummy model and it resides in dummy.py:
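A minimal sketch of what dummy.py might contain, assuming the ndb datastore API (the original could equally use the older db API):

```python
from google.appengine.ext import ndb


class Dummy(ndb.Model):
    # Starts at 0; the mapper increments it each time the entity
    # is processed by the MapReduce job.
    counter = ndb.IntegerProperty(default=0)
```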

Simple, elegant, dumb. Exactly what we need.

Now for the PopulateHandler (in tasks.py), which populates the datastore with entities we can map (i.e. iterate over):
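A sketch of what PopulateHandler in tasks.py might look like, assuming webapp2 and an optional `count` query parameter (the parameter name and response text are illustrative):

```python
import webapp2

from dummy import Dummy


class PopulateHandler(webapp2.RequestHandler):
    def get(self):
        # Create as many Dummy entities as requested (default: 100).
        count = int(self.request.get('count', 100))
        for _ in xrange(count):
            Dummy().put()
        self.response.write('Created %d Dummy entities' % count)


app = webapp2.WSGIApplication([
    ('/populate_db', PopulateHandler),
], debug=True)
```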

No Nobel Prize for mathematics here either. It simply takes a number and creates that many Dummy entities in the datastore. Note: if you provide a very large number, you will get a request timeout, as App Engine limits request handling to 60 seconds.

You’ll notice the url->handler configuration at the bottom of the file, specifically ‘/populate_db’. So, to populate your datastore, run the App Engine app (locally or in the cloud) and navigate to: http://your_example_app_url/populate_db

Each time you open this URL, 100 Dummy entities will be added to your datastore. After you’ve populated your datastore using “populate_db”, take a look at the Dummy entities. Note that their counter is zero.

Step 3 – The Mapper Function

The mapper function gets called for every datastore entity that the MapReduce job finds. In our case the mapper is very simple: it increments the Dummy entity’s counter by 1, saves it, and logs the new counter value.

Mapper (found in tasks.py):
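A sketch of the mapper, a standalone function in tasks.py (the function name is illustrative):

```python
import logging


def dummy_mapper(entity):
    # Called once per Dummy entity the job finds: bump the counter,
    # save the entity, and log the new value.
    entity.counter += 1
    entity.put()
    logging.info('Counter is now %d', entity.counter)
```

Note that the function takes the entity itself as its only argument, which is what makes it easy to test in isolation.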

Again, very simple. Note that it is not part of a class. It could be, but for the sake of simplicity it’s a standalone function.

Step 4 – The Glue

Now all that’s left is to glue our puzzle pieces together and actually run the MapReduce job (which in our case only maps and doesn’t reduce). Let’s review the function that configs and starts a job. You can find it in tasks.py under StartHandler:
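A sketch of StartHandler in tasks.py, assuming webapp2 and the mapper and model names used above (the job name, dotted paths, and numeric values are illustrative):

```python
import webapp2

from mapreduce import control


class StartHandler(webapp2.RequestHandler):
    def get(self):
        # Configure and kick off the mapper-only MapReduce job.
        job_id = control.start_map(
            name='Increment Dummy counters',
            handler_spec='tasks.dummy_mapper',
            reader_spec='mapreduce.input_readers.DatastoreInputReader',
            mapper_parameters={
                'entity_kind': 'dummy.Dummy',
                'processing_rate': 100,
            },
            shard_count=4,
            queue_name='default',
        )
        self.response.write('Started job: %s' % job_id)
```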

control.start_map() kicks off the MapReduce job (don’t forget its import: from mapreduce import control). This is the beefiest function so far and it’s also the last. It really is much simpler than it looks. Let’s review the arguments:

name – An arbitrary string that describes the job to the coder.

handler_spec – This is important: it defines the mapper function (see above) that processes the entities.

reader_spec – The type of the input reader. If your entities live in the datastore, keep using the DatastoreInputReader.

mapper_parameters – A dictionary with more params for the mapping process:

entity_kind – This is important: the path to the entity kind you’re mapping.

processing_rate – An upper bound on how many entities the job will process per second, across all shards.

shard_count – How many shards will run simultaneously.

queue_name – You can define a queue for any of your mapreduce jobs. For simplicity’s sake, I use the default.

To run the mapreduce job, go to http://your_example_app_url/start_job

You can monitor the progress of the mapreduce job in the logs (if you’re on a local machine) or in the Task Queues tab of your App Engine project dashboard in the dev console.

Was this tutorial easy to follow? Did you deploy your first mapper job successfully? I love feedback and especially the kind that I can act on 🙂

Written by Yuri Shmorgun

Yuri is WiseStamp’s head of development. He loves jogs on the beach and server-side coding.
