As MapReduce fades, Apache Spark is now a top-level project

MapReduce was fun and pretty useful while it lasted, but it looks like Spark is set to take the reins as the primary processing framework for new Hadoop workloads. The technology took a meaningful, if not huge, step toward that end on Thursday when the Apache Software Foundation announced that Spark is now a top-level project.

Spark has already garnered a large and vocal community of users and contributors because it’s faster than MapReduce (in memory and on disk) and easier to program. This means it’s well suited for next-generation big data applications that might require lower-latency queries, real-time processing or iterative computations on the same data (e.g., machine learning). Spark’s creators from the University of California, Berkeley, have created a company called Databricks to commercialize the technology.

However, MapReduce isn’t yesterday’s news quite yet. Although many new workloads and projects (such as Hortonworks’ Stinger) use alternative processing frameworks, there’s still a lot of tooling for MapReduce that Spark doesn’t have yet (e.g., Pig and Cascading), and MapReduce is still quite good for certain batch jobs. Plus, as Cloudera co-founder and Chief Strategy Officer Mike Olson explained in a recent Structure Show podcast, there are a lot of legacy MapReduce workloads that aren’t going anywhere anytime soon even as Spark takes off.

If you want to hear more about Spark and its role in the future of Hadoop, come to our Structure Data conference March 19-20 in New York. Databricks co-founder and CEO Ion Stoica will be speaking as part of our Structure Data Awards presentation, and we’ll have the CEOs of Cloudera, Hortonworks, and Pivotal talking about the future of big data platforms and how they plan to capitalize on them.

I get the impression that MapReduce and Spark serve different use cases: MapReduce targets batch and analytic processing, while Spark meets real-time needs. Both use HDFS, with Spark able to take in several “streams” of real-time input data while MapReduce runs against static big data, both SQL and NoSQL.