Spark Gets a Dedicated Big Data Platform

Spark users can now use a new Big Data platform provided by intelligence company Atigeo, which bundles most of the UC Berkeley stack into a unified framework optimized for low-latency data processing that can provide significant improvements over more traditional Hadoop-based platforms.

UC Berkeley offers a number of projects as part of its stack for managing data processing at scale. While Hadoop has historically been the leader in Big Data systems, Spark has gained a lot of traction in recent months, culminating in March when Atigeo announced the release of its xPatterns Big Data platform, which focuses on Spark and related projects. According to David Talby, SVP of Engineering at Atigeo, Spark has surpassed MapReduce as an execution framework, and it is only natural to have a platform dedicated to it:

We use HDFS as the underlying cheap storage, and will continue to do so, and some of our legacy customers still use MapReduce and Hive – both of which are still available within xPatterns. However, for new customers & deployments we consider MapReduce a legacy technology and recommend all new code to be written in Spark as the lowest-level execution framework, given the substantial speed advantages and simpler programming model.
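The "simpler programming model" mentioned in the quote can be illustrated loosely with a word count written two ways. This is a minimal sketch in plain Python, not the Hadoop or Spark APIs: the function names (`map_phase`, `shuffle`, `reduce_phase`, `wordcount_mapreduce`, `wordcount_chained`) and the sample input are assumptions made for illustration only. The first version spells out the separate map, shuffle, and reduce phases that a MapReduce job is organized around; the second expresses the same computation as one chained expression, roughly in the spirit of Spark's `flatMap` / `map` / `reduceByKey` style.

```python
from collections import Counter, defaultdict
from itertools import chain
from typing import Dict, Iterable, List, Tuple


def map_phase(line: str) -> List[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word, 1) for word in line.split()]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle step: group all emitted values by their key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups: Dict[str, List[int]]) -> Dict[str, int]:
    """Reduce step: sum the grouped values for each key."""
    return {key: sum(values) for key, values in groups.items()}


def wordcount_mapreduce(lines: Iterable[str]) -> Dict[str, int]:
    """Word count written as three explicit phases (MapReduce style)."""
    pairs = chain.from_iterable(map_phase(line) for line in lines)
    return reduce_phase(shuffle(pairs))


def wordcount_chained(lines: Iterable[str]) -> Dict[str, int]:
    """The same word count as a single chained expression."""
    return dict(Counter(word for line in lines for word in line.split()))


if __name__ == "__main__":
    sample = ["spark on mesos", "shark on spark"]  # hypothetical input
    print(wordcount_mapreduce(sample))
    print(wordcount_chained(sample))
```

Both functions return the same counts for the same input; the difference is only in how much structure the programmer has to write by hand.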

A common use case when dealing with data at scale is querying it with SQL-like languages. Hadoop has Hive and Spark has Shark; both serve a similar purpose, but their performance characteristics differ. Hive has historically been slow, but it has been going through a series of heavy improvements that can make it up to 45 times faster. Taking this into account, along with the very active community behind Hive, it is easy to understand Atigeo's decision to support both Hive and Shark, as David explains:

For SQL-like querying, we still support Hive side-by-side with Shark, since Shark does not yet fully support all the operators and edge cases that we require.

Spark is only one layer of the UC Berkeley stack, which includes other projects that can be used in enterprise-grade Big Data deployments.

Atigeo's platform includes Spark and Shark, as well as Tachyon to provide easy and fast data sharing between Hadoop and Spark. For the remaining projects, Atigeo doesn't have anything to announce at the moment, but David mentions that Atigeo is "evaluating these technologies and determining our plans to incorporate them in the future, as they mature and as our customers present concrete use cases that require them."

Also included in xPatterns is Apache Mesos, a tool used to manage and share cluster resources among various data processing frameworks such as Hadoop or Spark. This enables users to allocate resources efficiently regardless of the framework being used. Mesos is very similar in nature to YARN, which is more often associated with the Hadoop stack; Mesos was developed at UC Berkeley and so is a more natural fit for Spark projects. David commented on why Atigeo decided to favor Mesos over YARN in their platform:

Mesos was available earlier and more mature, and to date is more technically capable. Today, Spark on YARN only runs in static mode (coarse grained) – you allocate a fixed number of cores in memory from the cluster for each execution framework, which can only be used by that framework. In order to have better utilization, we use Spark on Mesos in dynamic mode (fine-grained), where the number of cores is allocated dynamically by Mesos. So for example, today we have MapReduce, Spark, and two Shark Servers running on Mesos – and any of these frameworks can get the cluster’s full resource capacity if the other frameworks are idle or under-utilized. Additionally, Mesos already supports other execution frameworks – Storm, Aurora, Chronos and Marathon are concrete examples that are of interest to us. As YARN matures or adds these capabilities and is able to support our customers’ needs, we expect to add support for it too.
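The utilization argument in the quote can be made concrete with a toy model. The sketch below is plain Python and does not reproduce how Mesos or YARN actually schedule work; the function names (`static_allocation`, `dynamic_allocation`), the equal-slice static policy, the round-robin dynamic policy, and the core counts are all assumptions chosen for illustration. It only contrasts a fixed per-framework share with a pool that idle frameworks leave open to busy ones.

```python
from typing import Dict


def static_allocation(total_cores: int, demand: Dict[str, int]) -> Dict[str, int]:
    """Coarse-grained toy model: each framework owns a fixed, equal slice.

    A framework can never use more than its slice, even if others are idle.
    """
    share = total_cores // len(demand)
    return {name: min(share, wanted) for name, wanted in demand.items()}


def dynamic_allocation(total_cores: int, demand: Dict[str, int]) -> Dict[str, int]:
    """Fine-grained toy model: cores are handed out one at a time, round-robin,
    to whichever frameworks still want more, until the pool is empty."""
    granted = {name: 0 for name in demand}
    free = total_cores
    while free > 0:
        progressed = False
        for name, wanted in demand.items():
            if free > 0 and granted[name] < wanted:
                granted[name] += 1
                free -= 1
                progressed = True
        if not progressed:  # every framework is satisfied; leave the rest idle
            break
    return granted


if __name__ == "__main__":
    # Hypothetical 32-core cluster with the four frameworks named in the quote;
    # only Spark is busy, and it wants more cores than the cluster has.
    demand = {"mapreduce": 0, "spark": 40, "shark-1": 0, "shark-2": 2}
    print("static :", static_allocation(32, demand))
    print("dynamic:", dynamic_allocation(32, demand))
```

In this hypothetical run, the static split caps Spark at its 8-core slice and leaves 22 cores unused, while the dynamic pool gives Spark 30 cores and the second Shark server the 2 it asked for.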

The Spark community is going strong today and has even surpassed Hadoop MapReduce in number of contributors. A new Big Data platform that gives Spark more traction is therefore good news, as other projects slowly shift towards the Spark model.