Monthly Archives: January 2010

There are some defensible reasons for not allowing developers to look up users by email addresses, but claiming that spammers will use that facility to validate email addresses is pretty weak. I was reminded of this today when I added MySpace to the services supported by FindByEmail, and came across LinkedIn using the same old justification for not opening up their API. Twitter made the same claims when they pulled their existing API.

On the surface it sounds completely reasonable, but that horse is not only out of the barn, it's been galloping so long it's over the horizon. For years, Yahoo, Amazon, MySpace and AIM have all let developers look up their users by email address, so any spammer who wanted to go that route has had plenty of opportunity.

The real reason is that companies benefit from having their users inside walled gardens, and anything that makes it easier to integrate across sites is a threat to their business model. You might notice the more open companies are those in second place, who have less to lose. This leads to ridiculous situations, like Google refusing to open up a proper Gmail API so that migration to other services is harder, and then paying TrueSwitch to enable migration from other ISPs. TrueSwitch is the de facto proprietary API that all the big ISPs use to help users switch, a market opportunity that wouldn't even exist if they just opened up access to each other, and a situation that favors big-pocketed incumbents who can afford to hire them.

As you can probably tell, I've never met a data silo I liked. I'm just an external trouble-maker who doesn't have responsibility for protecting sensitive user information, but I'm going to scream if I hear another developer relations guy claim that their business decision to keep their users in a walled garden is all about keeping them safe!

Amazon's Elastic MapReduce service is a god-send for anyone running big data-processing jobs. It takes the pain and suffering out of configuring Hadoop, and lets you run hundreds of machines in parallel when needed, but without having to pay for them while they're idle. Unfortunately it does still have a few quirks, so here's a brain dump of lessons I've learnt while using the service.

Don't put underscores in bucket names. The rest of S3 is quite happy with names like mailana_data_2010_1_25, but EMR really doesn't like those underscores and will fail to run any job that references them. You also can't rename buckets, and moving the data to a new bucket involves a copy that maxes out at about 20 MB/s, so fixing this can take a while.
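If you're stuck with an underscored bucket, the workaround is a copy to a fresh bucket. A sketch with a recent version of s3cmd (the bucket names here are hypothetical):

```shell
# Create a new bucket with hyphens instead of underscores, then do an
# S3-to-S3 recursive copy. This runs at S3's copy speed, so budget
# plenty of time for large buckets.
s3cmd mb s3://mailana-data-2010-1-25
s3cmd cp --recursive s3://mailana_data_2010_1_25/ s3://mailana-data-2010-1-25/
```

Remember to update any job configs that reference the old bucket path once the copy completes.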

Invest in some good S3 tools. All your data and code has to live in S3, so you'll be spending a lot of time dealing with buckets. S3cmd is a great command-line tool for working with S3, but I'd also recommend Bucket Explorer for a GUI view.

Start off small. You're charged per-machine, rounded up to the nearest hour. This means if you fire up 100 machines and the job fails in 30 seconds, you'll still be charged 100 machine hours. If you have a job you're not sure will work, start off with a single machine instead. You'll also have a lot fewer log files to sort through to figure out what went wrong!

Use the log files. It's a bit hidden, but on the third screen of the job setup process there's an 'advanced' section that you can reveal. In there, add a bucket path and you'll get your jobs' logs copied to that S3 location. These are life-savers when it comes to figuring out what went wrong. I'm mostly doing streaming work with PHP, so I often end up drilling down into the task_attempts folder. In there, each run on each machine will have a numbered sub-folder, and you'll be able to grab the stderr output from each of them. If a reduce step has gone wrong, I'll usually see a missing number in the output file sequence, and you can use that number to find the job attempt that failed and look at the errors. You can also see jobs that were repeated multiple times because they failed by looking at the final number in the folder name.
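To make that drill-down concrete, here's a local sketch. The directory layout mimics what EMR writes under task_attempts (after you've synced the log bucket down with an S3 tool), and the attempt IDs and failure message are fabricated for illustration:

```shell
# Stand-in for logs synced down from the S3 log bucket; the layout
# mirrors EMR's task_attempts folder (attempt names are made up).
mkdir -p emr-logs/task_attempts/attempt_0001_r_000003_0
mkdir -p emr-logs/task_attempts/attempt_0001_r_000003_1
echo "PHP Fatal error: Allowed memory size exhausted" \
  > emr-logs/task_attempts/attempt_0001_r_000003_0/stderr
touch emr-logs/task_attempts/attempt_0001_r_000003_1/stderr

# Find attempts with non-empty stderr -- these are the failures.
find emr-logs/task_attempts -name stderr -size +0c \
  -exec grep -H . {} \;
```

The trailing number on each attempt folder is the retry count, which is how you can spot tasks that failed and were run again.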

GZipped input. A lot of my input data had already been gzipped, but luckily if you pass -jobconf stream.recordreader.compression=gzip in the extra arguments section Hadoop will decompress them on the fly before passing the data to your mapper.

Multiple input folders. My source data was also scattered across a lot of different folders in S3, but happily you can specify multiple input locations by adding -input s3://<your data location> to the extra args section.
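Putting the last two tips together, the extra args section for a streaming job over several folders of gzipped data might look something like this (the bucket paths are hypothetical):

```shell
-input s3://mailana-data/crawl-2010-01/ \
-input s3://mailana-data/crawl-2010-02/ \
-jobconf stream.recordreader.compression=gzip
```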

Make sure PHP has enough memory. By default PHP scripts will fail if they use more than 32MB of RAM, since the language is designed for the web-server world. If your job might be memory-intensive, especially on the reduce side, use something like ini_set('memory_limit', '1024M'); to ensure you have enough headroom.

With help from Sid Anand, Kevin Marshall (buy his book) and David Kavanagh, along with Brett Taylor, Siva Raghupathy and the rest of the SimpleDB team, I've managed to improve my loading performance by an order of magnitude. I've also added in support for loading from arbitrary CSV or JSON files, so you can use the simpledb_loader tool to do fast uploads of your own data too.

If you just want to dive in, grab the source, make sure you've got Java installed, cd into the directory, and run

./sdbloader help

to bring up the options and a mini-tutorial. You'll be able to set up a cluster of domains, and then either run a synthetic benchmark, or load data from a file.
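For example, a session might look like the following. The exact subcommand names and CSV path are illustrative guesses, so check `./sdbloader help` for the real syntax; the flags match the ones described in this post:

```shell
# Create the domains, then load a CSV using a throttle ramp-up.
./sdbloader setup -domaincount 20
./sdbloader load -domaincount 20 -minrps 1 -maxrps 3 -ramptime 120 test.csv
```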

The biggest performance improvement came from fixing a problem in my original code that caused my requests to get serialized rather than running in parallel. With that out of the way, I started hitting the throttling that Amazon starts applying if you send too many requests too soon. They're trying to penalize 'bursty' writers, so you need to start off with a comparatively low number of requests per-domain, per-second and ramp to your full rate over a few minutes. After some advice from the SimpleDB team followed by experimentation, I started off at 1 request per-second, and over the course of two minutes I ramp that up to 3 requests per-second, per-domain. Since each request can have 24 items inside it, that works out to a theoretical maximum of 72 items per-second for each domain. You can tune these values yourself by setting -minrps, -maxrps and -ramptime on the command line.
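The arithmetic behind that ramp is worth making explicit. Here's a small sketch; the loader's real schedule may differ, this just illustrates a linear ramp and the per-domain throughput ceiling:

```shell
# Linear ramp from minrps to maxrps over ramptime seconds, per domain.
minrps=1; maxrps=3; ramptime=120; items_per_request=24
for t in 0 60 120; do
  rps=$(awk -v lo=$minrps -v hi=$maxrps -v T=$ramptime -v t=$t \
        'BEGIN { print lo + (hi - lo) * (t / T) }')
  echo "t=${t}s: ${rps} requests/sec per domain"
done

# Steady-state ceiling: maxrps requests/sec * 24 items per batch.
awk -v hi=$maxrps -v ipr=$items_per_request \
  'BEGIN { print "ceiling:", hi * ipr, "items/sec per domain" }'
```

At 100 domains that theoretical 72 items/sec per domain is well above the roughly 1,000 items/sec observed overall, so in practice the throttling, retries and network overhead dominate.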

That led to the next change: tweaking the number of domains being used. The SimpleDB team recommended around 20 or 30 as a maximum, I'm guessing because that roughly corresponds to the actual number of machines the domains are hosted on. I actually see a performance increase with higher numbers than that; my 1,000 item/second maximum was achieved with 100 domains. However, I think this is likely to be a loophole in their throttling code, so I wouldn't recommend going that far. You can alter the number of domains used with the -domaincount argument; make sure you specify the same number for both setup and loading.

The final important performance tip is to ensure that you're running from within Amazon's network, by running your data upload from an EC2 server. This makes a massive difference; I get half the speed when I'm running over my broadband connection at home.

The setup and load steps will create the domains you need, and then try to upload 20,000 items from the test CSV file, each with multiple attributes, which is a pretty typical representation of my workload. I see this taking around 19 seconds to complete, or just over 1,000 items a second.

I know from Sid's work at Netflix that this isn't the end of the road (he's getting over 10,000 items/second), but it's starting to become usable for the 210m item data set I need to upload. The main hurdles I'm hitting with the full data set are failed loads, either because of repeated 503 errors that exhaust the retries, or socket timeouts. If you want to dig deeper, the code is all available on GitHub with no strings attached; just fork and go, and let me know if you make any improvements!

I recently discovered a new startup in the contacts world, Flowtown, and I'm very impressed! Their starting point is a little like Gist: you upload your contact information and they match up those email addresses with people's Facebook, Twitter and other social network accounts. Incidentally, I believe they're using Rapleaf for this matching process; it's a great demonstration of the possibilities of their API.

Once that data's been matched, Flowtown's goal is to help marketers create much better targeted email campaigns for their existing mailing lists. Sadly the old tagline "Give those emails some pants and a shirt" has vanished from their home-page, but I think that idea of dressing up and personalizing your marketing emails is very valuable. You've already built up relationships with these customers, you have permission to contact them, and everybody wins if those emails are better targeted. The example in the demo video ensures that an email asking customers to follow you on Twitter only goes to people who actually have Twitter accounts. You can imagine this getting much more detailed, maybe identifying influential Twitter users who are already your customers, or using the geographic information to target only Twitter-using customers in a particular area.

I like their approach because they have a very clear value proposition and target market; if you're an email marketer who wants to improve her click-through rates, it's an obvious win. They're also up-front about asking for money; you'll only get 50 contacts imported for free, the rest are 5 cents each, and you'll need to upgrade from the free plan to run proper campaigns. It may sound perverse to applaud them for charging early and often, but it's refreshing to see someone with enough belief in the value they're offering to do that.

Great work by Ethan and the team, I foresee a lot of success in their future!

I'll admit it, I was intimidated by MapReduce. I'd tried to read explanations of it, but even the wonderful Joel Spolsky left me scratching my head. So I plowed ahead trying to build decent pipelines to process massive amounts of data without it. Finally my friend Andraz staged an intervention after I proudly described my latest setup: "Pete, that's Map Reduce".

Sure enough, when I looked at MR again, it was almost exactly the same as the process I'd ended up with. Using Amazon's Elastic MapReduce implementation of Hadoop, I was literally able to change just the separator character I use on each line between the keys and the data (they use a tab, I used ':'), and run my existing PHP code as-is.

The first thing to understand is that MapReduce is just a way of taking fragments of information about an object that are scattered through a big input file, and collecting them so they're next to each other in the output. For example, imagine you had a massive set of files containing the results of a web crawl, and you need to understand which words are used in the links to each URL.

How do you do it? If the data set is small enough, you loop through it all and total up the results in an associative array. Once it's too large to fit in memory, you have to try something different.

Instead, the Map function loops through the file, and for every piece of information it finds about an object, it writes a line to the output. This line starts with a key identifying the object, followed by the information. For example, for the line <a href="http://foo.com">Bananas</a> it would write

foo.com Bananas

How does this help? The crucial thing I missed in every other explanation is that this collection of all the output lines is sorted, so that all the entries starting with foo.com are next to each other. This was exactly what I was doing with my sort-based pipeline that Andraz commented on. You end up with something like this:

…
foo.com Bananas
foo.com Bananas
foo.com Mangoes
…

The Reduce step happens immediately after the sort, and since all the information about an object is in adjacent lines, it's obviously pretty easy to gather it into the output we're after, no matter how large the file gets.
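To make the adjacency trick concrete, here's a minimal local sketch of the same pipeline using sed and awk in place of the PHP scripts. The crawl lines are made up, but the structure matches the URL/link-text example above:

```shell
# A toy crawl file standing in for the big input (data is made up).
cat > crawl.txt <<'EOF'
see <a href="http://foo.com">Bananas</a> here
also <a href="http://bar.com">Cherries</a>
and <a href="http://foo.com">Bananas</a> again
plus <a href="http://foo.com">Mangoes</a>
EOF

# Map: emit "host anchor-text" for every link found on a line.
map() {
  sed -n 's|.*href="http://\([^"/]*\)[^"]*">\([^<]*\)</a>.*|\1 \2|p'
}

# Reduce: after the sort, identical keys sit on adjacent lines, so a
# running count only needs the previous line -- no in-memory table.
reduce() {
  awk '{ key = $1 " " $2
         if (key != prev) { if (NR > 1) print prev, count; prev = key; count = 0 }
         count++ }
       END { if (NR > 0) print prev, count }'
}

map < crawl.txt | sort | reduce > output.txt
cat output.txt
```

This prints one line per (host, word) pair with its count, e.g. "foo.com Bananas 2", and the reduce stage stays constant-memory no matter how large the sorted input grows.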

None of this requires any complex infrastructure. If you download the project you'll see a couple of one-page PHP files, one implementing a Map step, the other Reduce, which you can run from the command line simply using:

./mapper.php < input.txt | sort | ./reducer.php > output.txt

To prove I'm not over-simplifying, you can take the exact same PHP files, load them into Amazon's Elastic Map Reduce service as-is and run them to get the same results! I'll describe the exact Job Flow settings at the bottom so you can try this yourself.

The project itself takes 1200 Twitter messages either written by me, or mentioning me, and produces statistics on every user showing how often and when we exchanged public messages. It's basically a small-scale version of the algorithm that powers the twitter.mailana.com social graph visualization. One feature of note is the reducer. It tries to merge adjacent lines containing partial data in JSON format into a final accumulated result, and I've been using this across a lot of my projects.

As you go through the creation panel, copy the settings shown below. Make sure you put in the path to your own output bucket, but I've made both the input data and code buckets public, so you can leave those paths as-is:

Run the job, give it a few minutes to complete, and you should see a file called part-00000 in your output bucket. Congratulations, you've just run your first Hadoop MapReduce data analysis!

Now for the bad news. Google's just been awarded a patent on this technique, casting a shadow over Hadoop and pretty much every company doing serious data analysis. I personally think if a knucklehead like me can independently invent the process, it should be considered so obvious no patent should be possible!

I'm really keen to use Amazon's SimpleDB service to store my data, but the upload process is just too damn slow. A naive implementation of a loader lets me upload about 20 rows a second, and since I've got over 200 million rows, that would take around 6 months! Sid kindly shared his experiences with Netflix's massive data transfer to SimpleDB over at practicalcloudcomputing.com, and he achieved rates of over 10,000 items a second. He's been very generous with advice, but obviously can't share any proprietary code, so I've set out to build an open-source data loader in Java that implements his suggestions.

It uploads 10,000 generated rows using these optimizations:
– Calling BatchPutAttributes() to upload 20 rows at a time
– Multiple threads to run requests in parallel
– Leaving Replace as false for the overwrite behavior

Despite that, I'm still only seeing around 140 items a second, which is a long way off Sid's results. I'm going to be doing some more work on this, but I'd love it if anyone from Amazon could jump in and help put together an example that implements all their best practices. Judging from the forums there are a lot of people stuck on exactly this problem, and it would make porting over existing services a lot easier.

I've been trying to upload around 210 million items to Amazon's SimpleDB service, which has been quite an adventure! Sid Anand's advice has been invaluable (he's done an even larger migration of data for Netflix), and I'll be blogging in more depth on the details, but one of the early problems I hit was the lack of any easy way to interact with the store. With MySQL you at least get a console you can use to sanity check your results, but SimpleDB was a black box.

Eventually I discovered a handy solution, SimpleDB Explorer. It's a commercial product, but comes with a free 30-day trial and only costs $35. I loathe Java for GUIs, and it does have some quirks, like over-enthusiastic dialogs that pop up willy-nilly, but it does run on Windows, Linux and OS X. It's got the functionality you'd expect: you can edit the overall structure of the store, run queries, or just browse the data to make sure it looks reasonable. It's saved me a lot of time; if you're doing any serious work with SimpleDB, I'd highly recommend buying it.