Dec 6, 2011

Down the rabbit hole

"This is your last chance. After this, there is no turning back. You take the blue pill, the story ends, you wake up in your bed and believe whatever you want to believe. You take the red pill, you stay in Wonderland, and I show you how deep the rabbit hole goes." - Morpheus, The Matrix

Image Credit: CC - timsnell (Flickr)

I personally define Big Data as the challenges, solutions and opportunities around the storage, processing and delivery of data at scale. Clearly it's a broad and hot topic, and vendors from startups to the enterprise are actively providing solutions throughout the space. As a community and industry, we're also doing a good job of moving the ball further up the field and making this easier for everyone. However, I see one aspect that does not appear to be progressing at the same rate: the general understanding required to effectively apply Apache Hadoop.

I get this. I started working on Apache Hadoop a few years ago, somewhere around the 0.16.4 release. It took me a while to wrap my head around exactly how to use it and what use cases it supported. Eventually there was an "Aha!" moment when it sank in, and then I started to get excited. Very excited. It was as if all data was the Matrix, Hadoop was the red pill that made it accessible, and now I could fully explore this new world of data and all the opportunity that came with it. Hadoop is an amazingly flexible architecture for analyzing arbitrarily structured data. Because the Apache Nutch web crawler runs on Hadoop, the potential for analyzing the enormous amount of data on the web is there as well. One can run all kinds of interesting analyses on private and public web repositories (like Wikipedia) alike.

So, back to the issue of struggling to apply Hadoop. Most people I talk to, both those working in the enterprise and those working with it, appear to be dealing with this problem. The story I both see and hear is that they can wrap their heads around getting the platform up and running, and might even be able to use it for an obvious use case such as Extract-Transform-Load (ETL). But as the cluster gets turned into a service, the business and the technical sides of the house struggle to match Hadoop with the opportunities they have in private and public data. The people responsible for information discovery (whether they be engineers or business intelligence analysts) are standing there looking at the Matrix, unable to properly digest the red pill.

So we have a problem here. For those of us who find this exciting and long to see people crawling WAY down the rabbit hole and coming up with all kinds of awesome discoveries, how can we deliver a red pill to those who don't see the Matrix? I think this goes beyond companies making the platform easier to run and companies providing tooling that makes the queries easier to express. What do you do when your audience struggles to envision the query (or job) itself?

Formally training Data Scientists might be one approach. I think that, in its most canonical description, a Data Scientist is someone who can (among other things) bridge that gap in applying Hadoop. Most folks I know whom I would classify as Data Scientists sort of stumbled into the field. They have non-traditional backgrounds like linguistics, physics and psychology, and their skill set is largely self-taught. Because Hadoop is so new, the definition of the term "Data Scientist" is still a little nebulous. James Kobielus has written an article series on what it means to be a Data Scientist that explores various aspects of the role. Another approach might be to create and socialize a wide array of use cases for Hadoop, along with some easy on-ramps to exploring them. Either way, this is something that we, as a community, need to fix.