Big Data DC #3

At the third Big Data DC meetup, two enterprise big data consulting companies presented the architectures they use for processing and storing data. Much like the first and second meetups, the common thread seemed to be the trade-offs the engineers made to optimize certain aspects over others.

First up was Joey Echeverria of Cloudera, who talked about using HBase in the real world. Joey’s presentation covered the basics of Hadoop and then dove into HBase, the database for Hadoop. He talked about the benefits of HBase, including a schema that can vary from record to record and operations that are atomic per row. He then gave a few examples of real-life applications: Lily, an open source content repository; OpenTSDB, a distributed, scalable time series database from StumbleUpon; and Socorro, the crash report database used by Mozilla. Peruse Joey’s slides for more information on HBase.

Next up, Ted Dunning from MapR spoke about the Hadoop distribution his company sells. Ted described the bottlenecks in Hadoop that MapR tries to solve with the implementation it built: read-only files, many copies in the I/O path, a shuffle based on HTTP, and spills that go to local file space. Ted spent a large part of his talk on maprfs, the file system MapR built to solve these bottlenecks.

This meetup had the largest turnout of all the Big Data DC meetups so far. I can’t wait for the 4th meetup.