SciPy + MapReduce with Disco

Description

MapReduce has become one of the two dominant paradigms in distributed
computing (along with MPI). Yet implementing an algorithm as a MapReduce job,
especially in Python, often forces us to sacrifice efficiency (BLAS routines,
etc.) in favor of data parallelism.

In my work writing distributed learning algorithms to process terabytes of
Twitter data at SocialFlow, I've come to advocate a form of "vectorized
MapReduce" that integrates efficient numerical libraries like numpy/scipy
into the MapReduce setting. This yields both faster per-machine performance
and reduced I/O, which is often the major bottleneck. I'll also highlight
features of Disco (a Python/Erlang MapReduce implementation from Nokia) that
make it a compelling choice for writing scientific MapReduce jobs in Python.
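
To make the idea concrete, here is a minimal sketch of what "vectorized
MapReduce" might look like, using plain Python lists to stand in for
distributed partitions rather than Disco itself (the function names and the
Gram-matrix example are illustrative assumptions, not part of the talk):
instead of emitting one (key, value) pair per record, each mapper runs a
single BLAS-backed operation over its whole partition and emits one dense
partial result, so the reduce side sees only a handful of small arrays.

```python
import numpy as np

def vectorized_map(partition):
    """Map over a whole partition at once: one BLAS-backed matmul and
    one emitted pair, instead of len(partition) tiny key/value pairs."""
    X = np.asarray(partition)            # rows of feature vectors
    yield ("xtx", X.T @ X)               # partial Gram matrix for this shard
    yield ("count", np.array(len(X)))    # partial row count

def reduce_sums(pairs):
    """Sum partial results by key -- the reduce-side I/O stays tiny."""
    acc = {}
    for key, value in pairs:
        acc[key] = acc.get(key, 0) + value
    return acc

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = rng.standard_normal((1000, 4))
    partitions = np.array_split(data, 4)  # stand-in for distributed shards
    pairs = [kv for p in partitions for kv in vectorized_map(p)]
    result = reduce_sums(pairs)
    # The summed partial Gram matrices equal the full Gram matrix.
    assert np.allclose(result["xtx"], data.T @ data)
```

In a real Disco job the map and reduce callables would be handed to the
framework, but the payoff is the same: the shuffle carries a few dense
arrays per partition rather than one pair per input record.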