I had the privilege of attending Hadoop World 2009 on Friday. It was amazing to meet, listen to, and pick the brains of so many smart people. The quantity of good work being done on this project is simply stunning, but it is equally stunning how much farther there remains to go. Some interesting points for me include:

Yahoo’s Enormous Clusters

Eric Baldeschwieler from Yahoo gave an impressive talk about what they’re doing with Hadoop. Yahoo is running clusters at a simply amazing scale. They have several different clusters, totally some 86 PB of disk space, but their largest is a 4000-node cluster with 16 PB of disk, 64 TB of RAM, and 32,000 CPU cores. One of the most compelling points they made was that Yahoo’s experiences prove that Hadoop really does scale as designed. If you start with a small grid now, you can be sure that it will scale up – way up.

Eric made it clear that Yahoo uses Hadoop because it so vastly improves the productivity of their engineers. He noted that, though the hardware is commodity, the grid isn’t necessarily a cheaper solution; however, it easily pays for itself through the increased turnaround on problems. In the old days, it was difficult for engineers to try out new ideas, but now you can try out a Big Data idea in a few hours, and see how it goes.

A great example is the search suggestion on the front page. Using Hadoop, they cut the time to generate the search suggestions on the front page from 26 days to 20 minutes. Wow! For the icing on the cake, the code was converted from C++ to Python, and development time went from 2-3 weeks to 2-3 days.

HDFS For Archiving

HDFS hasn’t been used much as an archival system yet, especially not with the time horizons of someplace like my employer. When I asked him about it, Eric told me that the oldest data on Yahoo’s clusters is not much more than a year old. Ironically, they tend to be concerned more about removing data from the servers due to legal mandates and privacy requirements, rather than keeping it around for a Very Long Time. But he sees the need to hold some data for longer periods coming soon, and has promised he’ll be thinking about it.

Facebook, though, is already making moves in this area. They currently “back up” their production HDFS grid using Hive replication to a secondary grid, but they are working on (or already have – it wasn’t quite clear how far along this all was) an “archival cluster” solution. A daemon would scan for least-recently used files and opportunistically move them to a cluster built with more storage-heavy nodes, leaving a symlink stub in place of the file. When a request for that stub file comes in, the daemon intercepts it and begins pulling the data back off the archive grid. This is quite similar to how SAM-QFS works today. I had a chance to speak with with Dhruba Borthakur for a bit afterwards, and he had some interesting ideas about modifying the HDFS block scheduler to make it friendly for something like MAID.

Jaesun Han from NexR gave a talk on Terapot, a system for long-term storage and discovery of emails due to legal requirements and litigation. I asked him about whether they were relying on HDFS as their primary storage mechanism, or if they “backed up” to some external solution. He laughed, and said that they weren’t using one now, but would probably get some sort of tape solution in the near future. He also said that he believed HDFS was quite capable of being the sole archival solution, and I believe he was implying that it was fear from the legal and/or management folks that was driving a “back up” solution. At this point, the Cloudera CTO noted that both Yahoo and Facebook had no “back up” solution for HDFS, except for other HDFS clusters. It certainly seems like at least a couple multi-million dollar companies are willing to put their data where their mouth is on the reliability of HDFS.

What’s Coming

There is a tremendous sense that Hadoop has really matured in the last year or so. But it’s also been noted that the APIs are still thrashing a bit, and it’s still awfully Java-centric. Now that the underlying core is pretty solid, it seems like a lot of the work is moving towards making your Hadoop grid accessible to the rest of the company – not just the uber-geek Java coders.

Doug Cutting talked about how they’re working on building some solid, future-proof APIs for 0.21. Included in this is switching the RPC format to Avro, which is intended to solve some of the underlying issues with Thrift and Protocol Buffers while opening up the RPC and data format to a broader class of languages. It’s worth noting that Avro and JSON are pretty easily transcoded to one another. Also, they’ll finally be putting some serious thought into a real authentication and authorization scheme. Yahoo (I think) mentioned Kerberos – let’s hope we get some OpenID up in that joint, too.

There is a sudden push towards making Hadoop accessible via various UIs. Cloudera introduced their Hadoop Desktop, Karmasphere gave a whirlwind tour of their Netbeans-based IDE, and IBM was showing off a spreadsheet metaphor on top of Hadoop called M2 (I can’t find any good links for it). I hadn’t thought about that before, and it seemed so simple it was brilliant; Doug Cutting mentioned the idea, too, so it has some cachet.