Search form

Earlier this year we open sourced Scalding, a Scala API for Cascading that makes it easy to write big data jobs in a syntax that’s simple and concise. We use Scalding heavily — for everything from custom ad targeting algorithms to PageRank on the Twitter graph. Since open sourcing Scalding, we’ve been improving our documentation by adding a Getting Started guide and a Rosetta Code page that contains several MapReduce tasks translated from other frameworks (e.g., Pig and Hadoop Streaming) into Scalding.

Today we are excited to tell you about the 0.8.0 release of Scalding.

What’s new

There are a lot of new features, for example, Scalding now includes a type-safe Matrix API. The Matrix API makes expressing matrix sums, products, and simple algorithms like cosine similarity trivial. The type-safe Pipe API has some new functions and a few bug fixes.

In the familiar Fields API, we’ve added the ability to add type information to fields which allows scalding to pick up Ordering instances so that grouping on almost any scala collection becomes easy. There is now a function to estimate set size in groupBy: approxUniques (a naive implementation requires two groupBys, but this function uses HyperLogLog). Since many aggregations are simple transformations of existing Monoids (associative operations with a zero), we added mapPlusMap to simplify implementation of many reducing operations (count how many functions are implemented in terms of mapPlusMap).

Cascading and scalding try to optimize your job to some degree, but in some cases for optimal performance, some hand-tuning is needed. This release adds three features to make that easier:

forceToDisk forces a materialization and helps when you know the prior operation filters almost all data and should not be limited to just before a join or merge.

Map-side aggregation in Cascading is done in memory with a threshold on when to spill and poor tuning can result in performance issues or out of memory errors. To help alleviate these issues, we now expose a function in groupBy to specify the spillThreshold.

We make it easy for Scalding Jobs to control the Hadoop configuration by allowing overriding of the config.

Algebird

Algebird is our lightweight abstract algebra library for Scala and is targeted for building aggregation systems (such as Storm). It was originally developed as part of Scalding’s Matrix API, but almost all of the common reduce operations we care about in Scalding turn out to be instances of Monoids. This common library gives Map-merge, Set-union, List-concatenation, primitive-type algebra, and some fancy Monoids such as HyperLogLog for set cardinality estimation. Algebird has no dependencies and should be easy to use from any scala project that is doing aggregation of data or data-structures. For instance in the Algebird repo, type “sbt console” and then:

Future work

We are thrilled to see industry recognition of Scalding; the project has received a Bossie Award and there’s a community building around Scalding, with adopters like Etsy and eBay using it in production. In the near future, we are looking at adding optimized skew joins, refactoring the code base into smaller components and using Thrift and Protobuf lzo compressed data. On the whole, we look forward to improving documentation and nurturing a community around Scalding as we approach a 1.0 release.

If you’d like to help work on any features or have any bug fixes, we’re always looking for contributions. Just submit a pull request to say hello or reach out to us on the mailing list. If you find something missing or broken, report it in the issue tracker.

Acknowledgements

Scalding and Algebird are built by a community. We’d like to acknowledge the following folks who contributed to the project: Oscar Boykin (@posco), Avi Bryant (@avibryant), Edwin Chen (@echen), Sam Ritchie (@sritchie), Flavian Vasile (@flavianv) and Argyris Zymnis (@argyris).