The room was filled with very smart engineers, scientists (one guy had over 50 patents to his name) and an abundance of “expert witnesses”. So I’m sure they had no problem keeping up with Chris’ presentation, which I found very enlightening. However, to get the most out of the presentation I had to lean on my Computer Science degree (where I majored in Artificial Intelligence, writing systems in Lisp and Prolog), my years of working with Oracle RDBMS, and a healthy dose of Java and UNIX/Linux terminology. If you’d rather dig technically deeper into Hadoop, MapReduce and Cascading, I’d suggest you take a look at Chris’ excellent site www.cascading.org.

For those of you still with me here, let me try to sum up in non-technical terms what I think Hadoop and MapReduce are all about.

Hadoop implements MapReduce, which is a software framework introduced by Google for distributed computing. As Chris described in his presentation, MapReduce is actually Map, Group, then Reduce. Without getting too technical, think of the Map phase as the feeding in of the data together with the keys and values. For example, assume you have two records. The first record has a key of first name ‘Ramon’, with a value of ‘1’. The second record looks exactly the same, with another key of ‘Ramon’ and a value also of ‘1’. The Group phase would then reorganize the information into ‘Ramon’ with two 1’s. Then the Reduce phase would perform aggregation (count) and the stored result would be ‘Ramon’ and ‘2’. This is of course a trivial example, and the one most commonly shown in MapReduce introductions. But the basic premise is that the data can be efficiently distributed, stored and retrieved at a high rate of performance within the cluster. (Update: For a great way to explain MapReduce using a worker/visual analogy see http://ksat.me/map-reduce-a-really-simple-introduction-kloudo/ and also visit Kristina Chodorow’s (of MongoDB fame) blog for an entertaining Star Trek analogy.)
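To make the ‘Ramon’ example concrete, here is a minimal single-machine sketch of the three phases in plain Python. It is only an illustration of the idea, not how Hadoop or Cascading is actually programmed: the function names map_phase, group_phase and reduce_phase are made up for this post, and nothing here is distributed across a cluster.

```python
from collections import defaultdict

def map_phase(records):
    """Map: emit a (key, value) pair of (first name, 1) for each record."""
    for name in records:
        yield (name, 1)

def group_phase(pairs):
    """Group: collect all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)

def reduce_phase(grouped):
    """Reduce: aggregate (here, sum) the values for each key."""
    return {key: sum(values) for key, values in grouped.items()}

# Two records, both with the first name 'Ramon' (hypothetical input).
records = ["Ramon", "Ramon"]

grouped = group_phase(map_phase(records))  # 'Ramon' with two 1's
counts = reduce_phase(grouped)             # 'Ramon' and 2

print(grouped)
print(counts)
```

In a real cluster the Map and Reduce steps would run in parallel on many machines, with the framework handling the Group step (often called the shuffle) in between; the sketch only shows how the data changes shape from one phase to the next.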

All sounds good, right? Some of you as old as me might feel that this smacks a little of pre-RDBMS file systems back in the early days of computing, where you had to write and manage all of the routines to retrieve and store data through low-level programming. For example, there is no “schema” in Hadoop/MapReduce, nor any transactional boundaries (commit processing). Chris’ Cascading project aims to make using MapReduce easier by allowing you to focus on the fields you want to store and retrieve, and not the heavy lifting of having to visualize concepts in MapReduce. However, there are still major opportunities for improving ease of use, judging by the number of times Chris made mention of how to “game” or “hack” Hadoop to do what you want.

Since Hadoop is open source, there are a number of initiatives to improve the system and overcome the issues that prevent Hadoop from being used in more traditional RDBMS-style processing scenarios. But is this the wise thing to do? IT and computing, like most things in life, are cyclical in nature. Mainframes were outdated and on their way out in the 90s with the rise of client/server computing; then a few years later mainframe or server-based computing (with thin clients) came back into vogue, and now cloud computing is the rage. Similarly, it appears to me that primitive low-level file storage with programmatic manipulation was succeeded by RDBMS systems with SQL, and now lower-level file system storage through Hadoop with “roll your own SQL queries” is back.

It will be interesting to see how everything plays out. Certainly there is no disputing that apps such as Facebook and ShareThis (another Aster Data customer) could only exist today with Hadoop and/or MapReduce-style implementations. Oracle RDBMS simply isn’t up to the task and is more suited to its current business usage. Which brings me back to the context of the original title of my post: will Oracle just sit around and watch Hadoop and MapReduce take off? On the high-performance analytics side they do offer Oracle Exadata, but will they make a move to acquire one of the many startups out there? Sybase already has a columnar DB offering through their IQ database; in fact they have sued Vertica, claiming patent infringement. Certainly RDBMSs are better at real-time processing while Hadoop is more batch oriented, but will Oracle or any of the big RDBMS vendors offer anything themselves in the area of MapReduce? Will they get in the game? Or will they stay “Irelephant” in this area and allow Greenplum, Aster Data or any of the columnar database vendors to become the next Oracle for non-RDBMS big data?

Hopefully you got a sense of Hadoop and MapReduce from this post. Like the plush elephant that Hadoop is named after, I’m sure you’ll never forget

Nice post, Ramon. Just a note that Aster Data is a relational database that has created tight coupling of SQL with MapReduce to provide the expressive flexibility and performance of MapReduce, while making it “callable” through standard SQL. This native integration enables a whole new class of powerful SQL/MR functions that can be written and then used by “everyday” business analysts through traditional BI tools. One example of this is a SQL/MR function we’ve developed called “nPath”, which enables elegant time-series analysis of data (used a lot in click-stream analysis) with a single pass of the database, expressible in SQL. Lots of resources here for people wanting to know the business applications of MapReduce in the real world, how we’ve integrated SQL with MapReduce, how to write the functions, and more: http://www.asterdata.com/mapreduce/index.php

There is no argument that MapReduce/Hadoop has already proven itself to be a highly scalable and fault-tolerant mechanism for data-intensive operations in the cloud, whether to compute or to aggregate; at the end of the day it’s the same old distributed computing via grid jobs that’s showing its magic.

The whole NoSQL movement and everything related to it isn’t about shedding everything existing and going a new way; it’s about making people aware, letting them out of local maxima and helping them see the world beyond, which is to realize “There is more than one way to do it” (the Perl mantra), and there always is.

Always sticking to traditional approaches or systems to do a next-generation or different sort of job surely isn’t the way to go; we have to change the solution space when the problem space changes.

I think the future is about hybrid technology, to the point where it won’t even need to be called hybrid… We already see combinations of technologies working in a complementary way with each other: the [Hadoop + HBase + Hive] combination that Bradford mentioned, Facebook’s [Hadoop + Cassandra + Hive], and LinkedIn’s [Hadoop + Voldemort + RDBMS (Oracle, MySQL)]. (Reference: http://developer.yahoo.net/blog/archives/2009/06/nosql_meetup.html)

[…] of information, though in truth they act more like a distributed file system than like a DBMS. There is some belief that you won’t need or want those centralized DBMS’s once you have …. However, I believe a hybrid model will reign for at least the foreseeable future, since businesses […]