Archives

Tag: Dataflow

For several years I thought MapReduce was the only paradigm for distributed data processing. Only a few month ago, watching “Clash of the Titans Beyond MapReduce” (a really interesting talk at The Hive meetup) Matei Zaharia (@matei_zaharia), CTO of Databricks and co-creator of Spark, cited Dryad, a programming paradigm developed by Microsoft. I don’t know any open project which implement it but was widely used by Microsoft Research.

Dryad is a general-purpose distributed execution engine for coarse-grain data-parallel applications. A Dryad application combines computational “vertices” with communication “channels” to form a dataflow graph.

Dryad runs the application by executing the vertices of this graph on a set of available computers, communicating as appropriate through files, TCP pipes, and shared-memory FIFOs. The vertices provided by the application developer are quite simple and are usually written as sequential programs with no thread creation or locking. Concurrency arises from Dryad scheduling vertices to run simultaneously on multiple computers, or on multiple CPU cores within a computer. The application can discover the size and placement of data at run time, and modify the graph as the computation progresses to make efficient use of the available resources.

Dryad is designed to scale from powerful multi-core single computers, through small clusters of computers, to data centers with thousands of computers. The Dryad execution engine handles all the difficult problems of creating a large distributed, concurrent application: scheduling the use of computers and their CPUs, recovering from communication or computer failures, and transporting data between vertices.