The whole thing has $30 million in projected funding (half government, half industry) and a 6-year plan (which they’re 2 years into).

Specific projects of note include:

Mesos, a cluster manager. I don’t know much about Mesos, but it seems to be in production use, most notably at Twitter supporting Storm.

Spark, a replacement for MapReduce and the associated execution stack.

Shark, a replacement for Hive.

Mike Franklin* and his colleagues, who recently introduced me to all this, are focused on the database parts, including Spark and Shark. A recent slide deck gives details; Slide 11 in particular shows some of the project elements (I gather that everything on that slide is expected some time in 2013). A fuller accounting of project components may be found on the AMPLab website.

An alternate approach to fault tolerance, in which data doesn’t have to be written to disk between steps.
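
To make the fault-tolerance idea concrete, here is a toy sketch. Spark's published design records how each dataset was derived (its lineage), so a lost in-memory result can be recomputed from its inputs rather than reloaded from an intermediate copy on disk. The Python below is not Spark's API; the `Dataset` class and its `map`, `collect`, and `lose` methods are hypothetical names made up for illustration, and it ignores partitioning and distribution entirely.

```python
from typing import Callable, List, Optional


class Dataset:
    """Toy in-memory dataset that remembers its lineage (hypothetical, not Spark's API)."""

    def __init__(self,
                 source: Optional[List[int]] = None,
                 parent: Optional["Dataset"] = None,
                 fn: Optional[Callable[[int], int]] = None) -> None:
        self.source = source    # base data, assumed to sit in reliable storage
        self.parent = parent    # lineage: the dataset this one was derived from
        self.fn = fn            # lineage: the per-element transformation applied
        self.cache: Optional[List[int]] = None  # in-memory result; may be lost

    def map(self, fn: Callable[[int], int]) -> "Dataset":
        # Nothing is computed or written here; only the recipe is recorded.
        return Dataset(parent=self, fn=fn)

    def collect(self) -> List[int]:
        # Compute on demand, rebuilding from the parent if the cache is gone.
        if self.cache is None:
            if self.parent is None:
                self.cache = list(self.source)
            else:
                self.cache = [self.fn(x) for x in self.parent.collect()]
        return self.cache

    def lose(self) -> None:
        # Simulate a node failure that wipes the in-memory result.
        self.cache = None


base = Dataset(source=[1, 2, 3])
doubled = base.map(lambda x: x * 2)
shifted = doubled.map(lambda x: x + 1)

first = shifted.collect()

# Drop both intermediate results, then recover them from lineage alone.
doubled.lose()
shifted.lose()
recovered = shifted.collect()
```

No intermediate result is ever written out between the two `map` steps; the only durable input is the base data, and everything downstream can be rebuilt from the recorded transformations.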

The most obvious improvements in Shark over Hive are:

It uses Spark, which performs better than MapReduce.

It has columnar, in-memory data structures.

Not spilling intermediate results to disk is an important point. We normally think of this as a big deal in complex query execution, for example as an aspect of the design of Impala or Hadapt. But it’s perhaps even more important in iterative machine learning algorithms, which seem to be top-of-mind as a design point for the AMPLab guys.
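
A small sketch of why this matters for iterative algorithms: each pass over the data either goes back to storage or reuses what is already in memory. The code is plain Python, not Spark; `load_points` is a hypothetical stand-in for an expensive read from disk, and the "model" is just a gradient-descent estimate of a mean.

```python
from typing import List

load_count = 0  # counts simulated reads from storage


def load_points() -> List[float]:
    """Hypothetical stand-in for an expensive read from disk."""
    global load_count
    load_count += 1
    return [1.0, 2.0, 3.0, 4.0]


def step(estimate: float, points: List[float], rate: float = 0.5) -> float:
    """One gradient-descent step toward the mean of the points."""
    gradient = sum(estimate - p for p in points) / len(points)
    return estimate - rate * gradient


def fit_reloading(iterations: int) -> float:
    """Re-read the input on every iteration, as a disk-bound pipeline would."""
    estimate = 0.0
    for _ in range(iterations):
        estimate = step(estimate, load_points())
    return estimate


def fit_in_memory(iterations: int) -> float:
    """Read the input once and keep it in memory across iterations."""
    points = load_points()
    estimate = 0.0
    for _ in range(iterations):
        estimate = step(estimate, points)
    return estimate


load_count = 0
reloaded_result = fit_reloading(20)
reload_reads = load_count

load_count = 0
in_memory_result = fit_in_memory(20)
memory_reads = load_count
```

Both versions arrive at the same estimate, but the first touches storage once per iteration and the second only once in total. With real data volumes and dozens of iterations, that difference dominates the running time.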

There seems to be quite a bit of interest in and even adoption of these projects. The AMPLab guys seemed more comfortable talking about that for the record via email, and so with permission I quote (lightly edited):

We’ve seen Spark used for a variety of analytics and statistical learning applications, mostly on Hadoop and Hive data. These range from replacing Hive or Pig for simple SQL queries, to anomaly detection, to interactive dashboards where users can drill into data. Two examples of companies that have talked publicly about their Spark use cases are:

Conviva (Ion Stoica’s video analytics company), one of our earlier users, which has used it to replace a large fraction of their queries.

Quantifind, a company that performs predictive analytics and text mining on social data to help marketers at large entertainment companies.

Several companies have also contributed to the open source projects. For example, Yahoo! has contributed a JDBC server to Shark, and is working on a bytecode optimizer.

We have a growing user community. Our meetup group is approaching 500 members. To date, meetups have been hosted by Airbnb, Groupon, Yelp, Palantir, Conviva, and Klout. More details at http://www.meetup.com/spark-users/.

Finally, we held a Big Data bootcamp for industrial practitioners back in August that offered two days of training using Spark and Shark. The bootcamp was sold out for on-site attendance and 5000 people attended via online live streaming. Details at http://ampcamp.berkeley.edu.

You can find the list of public contributors to Spark and Shark at the following two GitHub pages:

Thanks for covering AMPLab. The work on Shark/Spark is very innovative and looks as if it has excellent potential. The other interesting team in Soda Hall is Joe Hellerstein’s group, who have been doing work around CALM. That work is more relevant to OLTP, though it also has interesting applications to analytics. I hope you will have a chance to interview them the next time you visit UC Berkeley, assuming you have not done so already.

As you mention, our discussions with Curt have focused on a few components of the BDAS stack that have had recent releases.

While AMPLab’s emphasis is on analytics, we do work on OLTP as well. Projects include the MDCC (Multi-Data Center Consistency) protocol, and the Probabilistically Bounded Staleness (PBS) framework – the latter of which is being done in collaboration with Joe and his group.