Spark 1.6 Released

The Spark team has announced the release of Spark 1.6.0; for a full list of new features, review the release notes. On the Databricks blog, Michael Armbrust, Patrick Wendell, and Reynold Xin announce the release and summarize key enhancements. In November, the same authors announced a preview of the release.

The contributor base continues to grow, as shown in the chart below:

Key enhancements include:

Datasets API

Automatic memory management

Improvements to Spark SQL performance and API

Faster streaming state management

Expanded machine learning capabilities

Let’s review each of these in turn.

Datasets API

The most important new feature in Spark 1.6 is the Datasets API, which allows users to easily express transformations on domain objects while also providing the performance advantages of the Spark SQL execution engine. The Datasets API supports static typing and user functions that run directly on existing JVM types (such as user classes, Avro schemas, etc.). It leverages the Spark Catalyst optimizer and Project Tungsten's fast in-memory encoding.

On the Databricks blog, Michael Armbrust, Wenchen Fan, Reynold Xin and Matei Zaharia introduce you to Spark Datasets. The team expects to enhance the API over the next several releases.
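A minimal sketch of the API, assuming a Spark 1.6 `SQLContext` is available and a JSON file of the hypothetical shape below exists:

```scala
// Hypothetical domain object; field names are assumptions for illustration.
case class Person(name: String, age: Long)

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._

// Read a DataFrame and convert it to a strongly typed Dataset[Person].
val people = sqlContext.read.json("people.json").as[Person]

// Transformations operate on the domain object, not on generic Rows,
// so typos in field names fail at compile time rather than at runtime.
val adultNames = people.filter(_.age >= 18).map(_.name)
```

Because the lambda operates on `Person` directly, the compiler checks field access, while Catalyst and Tungsten still optimize the physical plan underneath.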

Automatic Memory Management

In previous releases, Spark partitioned available memory into two static regions, one for execution and the other to cache hot data. Developers specified the desired memory configuration, but tuning it required expertise; operations that needed more than the configured amount of memory spilled to disk, degrading performance.

With Release 1.6, the Spark memory manager automatically expands and contracts the two regions as needed at runtime. Because developers no longer need to guess at allocations, or over-allocate one region to accommodate a single demanding operation, more memory is available for other operations.
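A sketch of the configuration knobs behind the new unified memory manager (the values shown are the 1.6 defaults; most applications should not need to change them):

```scala
// Spark 1.6 unified memory manager settings (defaults shown).
val conf = new org.apache.spark.SparkConf()
  // Fraction of the JVM heap shared dynamically by execution and storage.
  .set("spark.memory.fraction", "0.75")
  // Portion of that region whose cached blocks are protected from eviction.
  .set("spark.memory.storageFraction", "0.5")
  // Set to true to fall back to the pre-1.6 static split.
  .set("spark.memory.useLegacyMode", "false")
```

The legacy-mode flag preserves the old static behavior for applications that were tuned against it.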

Spark SQL already performs reasonably well against other SQL engines. It will be interesting to see benchmarks against the new TPC standards.

Spark Streaming: Improved State Management

In a post on the Databricks blog from July 2015, Tathagata Das, Matei Zaharia and Patrick Wendell describe how Spark Streaming works. In previous releases, updateStateByKey allowed the user to maintain per-key state via an update function, which is called for each key and uses the new data together with the key's existing state to generate an updated state.

Based on user feedback, the Spark team identified the need for a new capability in the streaming API, called mapWithState, that tracks deltas in the data rather than scanning the entire state on every batch. The team reports a 10X speedup in state management through this enhancement.
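A sketch of the new API, assuming a `DStream[(String, Int)]` of word counts named `wordStream` (a hypothetical input for illustration):

```scala
import org.apache.spark.streaming.{State, StateSpec}

// Maintain a running count per key. The function is invoked only for keys
// that appear in the current batch, instead of scanning all tracked keys.
def updateCount(key: String, value: Option[Int], state: State[Int]): (String, Int) = {
  val newCount = state.getOption.getOrElse(0) + value.getOrElse(0)
  state.update(newCount)
  (key, newCount)
}

val counts = wordStream.mapWithState(StateSpec.function(updateCount _))
```

Touching only the keys present in each batch is what makes state updates scale with batch size rather than with total state size.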

Expanded Machine Learning Capabilities

The team has added new feature transformers for ML pipelines, including:

— The ChiSqSelector transformer operates on labeled data with categorical features. It performs a Chi-squared test of independence between each feature and the class label, then selects the top-scoring features.

— QuantileDiscretizer takes a column with continuous features and outputs a column with binned categorical features.
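A sketch of both transformers; the column names and parameter values are assumptions for illustration, and each transformer would be fit and applied within a pipeline:

```scala
import org.apache.spark.ml.feature.{ChiSqSelector, QuantileDiscretizer}

// Keep the 50 features most associated with the label by the Chi-squared test.
val selector = new ChiSqSelector()
  .setNumTopFeatures(50)
  .setFeaturesCol("features")
  .setLabelCol("label")
  .setOutputCol("selectedFeatures")

// Bin a continuous column into four approximately equal-frequency buckets.
val discretizer = new QuantileDiscretizer()
  .setInputCol("hour")
  .setOutputCol("hourBucket")
  .setNumBuckets(4)
```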

Also new in ML: an algorithm for Accelerated Failure Time (AFT) analysis, a parametric survival regression model for censored data, sometimes called a log-linear model for survival analysis. AFT is easier to parallelize than the Cox proportional hazards model.
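A sketch of fitting the new estimator, assuming a hypothetical `training` DataFrame with a survival time in "label", a "censor" column (1.0 if the event was observed, 0.0 if censored), and a "features" vector column:

```scala
import org.apache.spark.ml.regression.AFTSurvivalRegression

val aft = new AFTSurvivalRegression()
  .setLabelCol("label")
  .setCensorCol("censor")
  .setFeaturesCol("features")

// Fit the parametric survival model on the (assumed) training data.
val model = aft.fit(training)
```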

An additional new ML feature is pipeline persistence: the ability to save and load ML pipelines.
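In its simplest form, persistence looks like the sketch below, assuming a fitted `PipelineModel` named `pipelineModel` and a hypothetical path:

```scala
// Save a fitted pipeline to a path, then reload it later or elsewhere.
pipelineModel.save("/tmp/spark-model")
val restored = org.apache.spark.ml.PipelineModel.load("/tmp/spark-model")
```

This lets a pipeline trained in one job be reloaded for scoring in another, without retraining.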