RDBMS Vendors Embrace Hadoop

In years to come, we might remember October 2011 as the month the big database vendors gave in to the dark side and embraced Hadoop. In October, both Microsoft and Oracle announced product offerings that included Hadoop as the enabler of their "big data" solutions. The third of the big three database vendors - IBM - had already embraced Hadoop back in 2010.

Hadoop has become virtually synonymous with the processing of "big data." Big data has a variety of definitions, but generally refers to the proliferation of high-volume, loosely structured data that is increasingly generated as the "data exhaust" of modern web-based business, and which is increasingly analyzed to create competitive advantage. Hadoop is an open source platform capable of scaling economically to handle virtually any volume of unstructured "big" data.

Hadoop and the big data movement are not strictly disruptive to the RDBMS - even the most zealous Hadoop advocate admits that Hadoop is a supplement to, rather than a replacement for, the relational database. But until recently both Microsoft and Oracle were proposing solutions for data management - including big data - that relied on their own proprietary software. It was, therefore, something of a surprise to find each endorsing the open source Hadoop stack.

Oracle, of course, has been heavily promoting its Exadata database machine as a general purpose solution for virtually all high-end database processing requirements - whether they're OLTP or data warehouse in nature. Exadata certainly is capable of handling very large data volumes, but the price per terabyte of Exadata storage is at least 10 times that of a commodity Hadoop cluster. And because Exadata is an RDBMS, data must be loaded into a fixed schema before it can be queried, unlike Hadoop, where the schema can be defined when the data is read.
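The difference between a fixed schema and a schema defined at read time can be illustrated with a minimal Python sketch. It is not Hadoop code; the log lines, the field names and the read_with_schema helper are all hypothetical, invented purely for illustration. The raw data is stored untouched, and each reader imposes its own interpretation when it reads.

```python
import csv
import io

# Hypothetical raw log lines, stored as-is with no schema declared up front.
RAW_LOG = """2011-10-03,alice,/home,200
2011-10-03,bob,/cart,500
2011-10-04,alice,/checkout,200
"""


def read_with_schema(raw, schema):
    """Apply (name, converter) pairs to each raw line at read time.

    Fields beyond the end of the schema are simply ignored, so different
    readers can interpret the same stored data in different ways.
    """
    for row in csv.reader(io.StringIO(raw)):
        yield {name: convert(value) for (name, convert), value in zip(schema, row)}


# One reader cares only about who visited which page...
visits_schema = [("day", str), ("user", str), ("page", str)]
# ...another reads the same data with a numeric status code for error analysis.
errors_schema = visits_schema + [("status", int)]

visits = list(read_with_schema(RAW_LOG, visits_schema))
errors = [r for r in read_with_schema(RAW_LOG, errors_schema) if r["status"] >= 500]
```

In a relational database the equivalent of errors_schema would have to be declared, and the data converted to fit it, before a single row could be loaded.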

Oracle appears to have realized that Hadoop would become a significant part of the enterprise landscape no matter what, and decided to embrace it rather than resist, announcing its "big data appliance." The appliance consists of 18 servers in a single rack, powered by Apache Hadoop.

Meanwhile, Microsoft was developing a proprietary alternative to Hadoop, called Dryad. Like Hadoop, Dryad was a framework optimized for massively parallel data flows. Unlike Hadoop, Dryad was optimized for the Microsoft platform - including Azure - and aligned with Microsoft's LINQ query language and High Performance Computing (HPC) platform.

In November, however, Microsoft announced that the Dryad project would effectively be discontinued and simultaneously announced that Hadoop would be made available both on Windows Server and the Azure cloud under the name "Project Isotope."

Isotope involves the development of a distribution of Hadoop optimized to run on the Windows operating system - and in Microsoft Azure - together with integrations to Microsoft's analytic tools such as Excel and PowerPivot. These integrations will leverage the SQL-like interface to Hadoop known as Hive, and will allow BI tools to access Hadoop in a similar manner to the integrations provided for other databases.

Isotope also aims to integrate Hadoop into the Microsoft management framework. In particular, authentication and authorization will be integrated with Microsoft Active Directory. This should be attractive to the many enterprises where Active Directory is the standard for identity management. Microsoft also will offer integrations with SQL Server, System Center and other components of the Microsoft enterprise stack.

The sudden and simultaneous embrace of Hadoop by Microsoft and Oracle should help eliminate any lingering doubts about the relevance of Hadoop to the enterprise. Hadoop is already firmly established as a critical technology in many larger enterprises, and the endorsement of the leading database vendors likely will encourage more widespread adoption.