Year: 2016

Apache Hadoop turned ten this year. To celebrate, Karthik and I gave a talk at USENIX ATC ’16 about open problems to solve in Hadoop’s second decade. This was an opportunity to revisit our academic roots and get a new crop of graduate students interested in the real distributed systems problems we’re trying to solve in industry.

This is a huge topic and we only had a 25 minute talk slot, so we were pitching problems rather than solutions. However, we did have some ideas in our back pocket, and the hallway track and birds-of-a-feather we hosted afterwards led to a lot of good discussion.

I gave a presentation titled Happier Developers and Happier Software through Distributed Testing at Apache Big Data 2016, which detailed how our distributed unit testing framework has decreased the runtime of Apache Hadoop’s unit test suite by 60x from 8.5 hours to about 8 minutes, and the substantial productivity improvements that are possible when developers can easily run and interact with the test suite.

The infrastructure is general enough to accommodate any software project. We wrote frontends for both C++/gtest and Java/Maven.

This effort started as a Cloudera hackathon project that Todd Lipcon and I worked on two years ago, and I’m very glad we got it across the line. Furthermore, it’s also open-source, and we’d love to see it rolled out to more projects.

What makes this paper special is that it is one of the only published papers about a production cloud blobstore. The 800-pound gorilla in this space is Amazon S3, but I find Windows Azure Storage (WAS) the more interesting system since it provides strong consistency, additional features like append, and serves as the backend for not just WAS Blobs, but also WAS Tables (structured data access) and WAS Queues (message delivery). It also occupies a different design point than hash-partitioned blobstores like Swift and Rados.