Why Hadoop-like ‘Dryad’ Could Be Microsoft’s Big Data Star

Microsoft’s HPC division opened up the company’s Dryad parallel-processing technologies as a Community Technology Preview (CTP) last week, a first step toward giving Windows HPC Server users a production-ready big data tool designed specifically for them. The available components — Dryad, DSC and DryadLINQ – appear to be an almost part-for-part comparison with the base Hadoop components (Hadoop MapReduce, Hadoop Distributed File System and the SQL-like Hive programming language) and hint that Microsoft wants to offer Windows/.NET shops their own stack on which to write massively parallel applications. Dryad could be a rousing success, in part because Hadoop — which is written in Java — isn’t ideally suited to run atop Windows or support .NET applications.

When Microsoft broke into the high-performance computing market in 2005 with its Windows Compute Cluster Server, it affirmed the idea that, indeed, HPC was taking place on non-Linux boxes. Certainly, given the prevalence of large data volumes across organizations of all types, there’s equal, if not greater, demand for Hadoop-like analytical tools even in Windows environments. Presently, however, Apache only supports Linux as a production Hadoop environment; Windows is development-only. There are various projects and tools floating around for running Hadoop on Windows, but they exist outside the scope of Apache community support. A Stack Overflow contributor succinctly summed up the situation in response to a question about running Hadoop on Windows Server:

What might give the Microsoft developer community even more hope is the Dryad roadmap that Microsoft presented in August. ZDNet’s Mary Jo Foley noted a variety of Dryad subcomponents that will make Dryad even more complete, including a job scheduler codenamed “Quincy.”

There’s no guarantee that the production version of Dryad will resemble the current iteration too closely. As Microsoft notes in the blog post announcing the Dryad CTP, “The DryadLINQ, Dryad and DSC programming interfaces are all in the early phases of development and might change significantly before the final release based on your feedback.” A prime example of this is that Dryad has only been tested on 128 individual nodes. In the Hadoop world, Facebook is running a cluster that spans 3,000 nodes (sub req’d). Dryad has promise as Hadoop for the Microsoft universe, but, clearly, there’s work to be done.