Deep Dive Into Databricks’ Big Speedup Plans for Apache Spark

Alex Woodie

Apache Spark rose to prominence in the Hadoop world as a faster, easier-to-use alternative to MapReduce. But as fast as Spark is today, it won’t hold a candle to the future versions that the folks at Databricks are now developing.

Last week Databricks shared its development roadmap for Spark with the world. The company has near-term and long-term plans to boost the performance of the platform; the plans are grouped into something called Project Tungsten.

In a telephone briefing, Databricks co-founder Reynold Xin gave Datanami the low-down on the changes coming to Spark, why they’re necessary in light of enhancements made to the underlying hardware, and how they’ll impact Spark users.

One of the first changes that Databricks has planned is to improve how Spark utilizes memory. Because Spark runs on the Java Virtual Machine (JVM)–it is written largely in Scala–it currently relies on the JVM and its garbage collection routines to manage the memory that applications need.

While this approach works and keeps Spark programmers out of the memory management business (which can be quite tedious), the folks at Databricks feel that, going forward, the JVM and its associated garbage collection routines will be too computationally expensive to continue to use in the long run.

“So what we’re doing as part of the Tungsten initiative is to sidestep the JVM garbage collection and try to manage memory efficiently ourselves,” Xin says. “We don’t want the overhead of garbage collection, and in particular we don’t want our users to worry about the overhead and having to trim them.”

Databricks is planning two ways to improve the memory management of Spark under the Tungsten initiative. The first involves letting Spark pre-allocate a large chunk of space in the JVM’s managed memory for applications. While this doesn’t get the JVM out of the memory management business for Spark apps, it should bring incremental improvements, because the JVM is only managing one (large) object, instead of many objects, Xin explains.
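The first approach can be sketched as a simple bump allocator: the program grabs one large byte array up front and hands out offsets into it itself. This is an illustrative toy, not Spark's actual implementation, but it shows why the GC's job gets easier when it tracks one long-lived object instead of millions of small ones.

```java
// Illustrative sketch of pre-allocating one large region and carving it up
// manually, so the JVM's garbage collector sees a single object instead of
// many. Not Spark's actual implementation; all names here are hypothetical.
public class BumpArena {
    private final byte[] region;   // the single large object the GC tracks
    private int next = 0;          // bump pointer into the region

    public BumpArena(int sizeInBytes) {
        this.region = new byte[sizeInBytes];
    }

    // Allocate `size` bytes; returns an offset, or -1 if the arena is full.
    public int allocate(int size) {
        if (next + size > region.length) return -1;
        int offset = next;
        next += size;
        return offset;
    }

    public void putByte(int offset, byte v) { region[offset] = v; }
    public byte getByte(int offset)         { return region[offset]; }

    // Freeing everything at once is a pointer reset -- no GC work involved.
    public void reset() { next = 0; }
}
```

Note the trade-off: the JVM still owns the region, but its collector no longer chases per-record objects, which is where the incremental win comes from.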

The second plan is to bypass the JVM completely and go entirely off-heap with Spark’s memory management, an approach that will get Spark closer to bare metal, but also test the skills of the Spark developers at Databricks and the Apache Software Foundation. “I think this is the optimum approach in the long run,” Xin says, “but it’s a riskier approach because it’s relatively untested. You need to worry about whole new things.”
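One tame way to see off-heap storage from JVM code is a direct `ByteBuffer`, whose backing memory lives outside the garbage-collected heap. This is a minimal sketch of the concept only; the class name is hypothetical, and Tungsten's own off-heap work went further, toward raw memory access.

```java
import java.nio.ByteBuffer;

// Sketch of off-heap storage: a direct ByteBuffer's backing memory sits
// outside the Java heap, so the garbage collector never scans or moves it.
// Illustrative only -- not how Spark itself implements off-heap memory.
public class OffHeapInts {
    private final ByteBuffer buf;

    public OffHeapInts(int capacity) {
        // allocateDirect requests memory outside the GC-managed heap
        buf = ByteBuffer.allocateDirect(capacity * Integer.BYTES);
    }

    public void set(int index, int value) { buf.putInt(index * Integer.BYTES, value); }
    public int  get(int index)            { return buf.getInt(index * Integer.BYTES); }
}
```

The "whole new things" Xin alludes to show up exactly here: once data is off-heap, the application, not the JVM, is responsible for layout, bounds, and lifetime.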

While Xin frets about the prospect of gray hairs, the risk posed to the user will be minimal. The API will not change, and programmers won’t have to do anything different. “It’s basically a major engineering investment,” he says. “I don’t think there will be a lot of downside. I think if it’s done well, I don’t think users will notice any regression. A lot of workloads very likely will see order of magnitude gain.”

SQL query processing in Spark stands to get a major speed-up, Xin says. “That’s actually a very important goal of this project,” he says. “The other thing is a lot of the advanced machine learning workloads will also get faster because they are… heavily CPU bound.”

As Xin explains, these changes are needed to help Spark and Hadoop apps get more out of today’s faster hardware. “Back in the day, Hadoop was so [poor performance-wise]…that anything we did was much better,” he says. The reliance on spinning disk and 1Gb Ethernet networks meant that Spark still had headroom when it came to the CPU itself. Just moving the data in and out was the bottleneck, as Moore’s Law kept plenty of processor capacity in reserve.

That dynamic has changed with the advent of very fast SSDs, speedy 10Gb Ethernet networks, and the decay of Moore’s Law, Xin says. “The underlying hardware is actually becoming much better, compared with the CPU and memory subsystems,” he says. “So as a result, before it was fairly easy for a Spark program to saturate the network and I/O, and now when we look at it, it’s actually harder because now it’s underutilizing the I/O and memory. So the goal is to squeeze as much as possible out of the new hardware.”

Databricks has a few other tricks up its sleeve with Project Tungsten besides bypassing the JVM to boost memory management, including cache-aware computation. As Xin and Databricks engineer Josh Rosen explained in a blog post last week, cache-aware computation will enable Spark to take advantage of today’s L1, L2, and L3 on-chip caches.

“When profiling Spark user applications, we’ve found that a large fraction of the CPU time is spent waiting for data to be fetched from main memory,” Xin and Rosen write. “As part of Project Tungsten, we are designing cache-friendly algorithms and data structures so Spark applications will spend less time waiting to fetch data from memory and more time doing useful work.”
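A classic cache-friendly trick in this spirit is to pack each record's sort key and its index into one primitive value, then sort the packed array: comparisons read sequential memory instead of chasing pointers to scattered records. The sketch below assumes 32-bit keys and is illustrative, not Spark's actual sorter.

```java
import java.util.Arrays;

// Illustrative cache-friendly sort: pack a 32-bit sort key and a 32-bit
// record index into one long, then sort the packed array. The comparison
// loop scans contiguous memory, which keeps the CPU caches well fed.
// Hypothetical sketch -- not Spark's actual sort implementation.
public class PackedSort {
    // Returns record indices ordered by key (ties broken by index).
    public static int[] sortIndicesByKey(int[] keys) {
        long[] packed = new long[keys.length];
        for (int i = 0; i < keys.length; i++) {
            packed[i] = ((long) keys[i] << 32) | (i & 0xFFFFFFFFL);
        }
        Arrays.sort(packed);              // contiguous, cache-friendly scan
        int[] order = new int[keys.length];
        for (int i = 0; i < packed.length; i++) {
            order[i] = (int) packed[i];   // low 32 bits = original index
        }
        return order;
    }
}
```

The point is the data structure, not the sort algorithm: by co-locating keys, the hot loop waits on main memory far less often.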

The company is also looking into code generation to further accelerate Spark. There is already some code generation for SQL and DataFrames in Spark. But with future releases, Databricks will be broadening the code generation coverage to most built-in expressions, the company says. “In addition, we plan to increase the level of code generation from record-at-a-time expression evaluation to vectorized expression evaluation, leveraging JIT’s capabilities to exploit better instruction pipelining in modern CPUs so we can process multiple records at once,” Xin and Rosen write.
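The difference between record-at-a-time and vectorized evaluation can be sketched as follows. Everything here is a hypothetical toy, not Spark's generated code: the first method pays an interface call per row, while the second is the kind of straight-line loop over a column of primitives that a JIT can unroll and pipeline.

```java
// Sketch contrasting record-at-a-time expression evaluation with a
// specialized, vectorized-style loop. Names are illustrative only.
public class ExprEval {
    interface Expr { double eval(double x); }   // one virtual call per record

    public static double sumRecordAtATime(double[] col, Expr e) {
        double sum = 0;
        for (double v : col) sum += e.eval(v);  // interface call per row
        return sum;
    }

    // "Generated" loop specialized for the expression x * 2 + 1:
    public static double sumVectorized(double[] col) {
        double sum = 0;
        for (int i = 0; i < col.length; i++) {
            sum += col[i] * 2 + 1;              // straight-line arithmetic
        }
        return sum;
    }
}
```

Both methods compute the same answer; the specialized loop simply gives the JIT a much better shot at instruction pipelining, which is the gain Xin and Rosen describe.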

But wait, there’s more! Databricks is also exploring the potential to use GPUs to accelerate certain types of workloads, such as deep learning algorithms and some graph analytic workloads, Xin says. The company will be using the OpenCL library to enable Spark to leverage GPUs in clusters when they are available. Also on the far horizon is the potential use of LLVM compiler technologies to take advantage of the Single Instruction, Multiple Data (SIMD) and Streaming SIMD Extensions (SSE) instructions in modern x86 chips.

Databricks will start to expose the JVM memory management enhancements with the upcoming release of Spark version 1.4. That release, which will ship in June, will include the enhancements but leave them switched off by default. They will start to be used by default in version 1.5, which is slated for September. Some of the other items on the roadmap, such as code generation and cache-aware computation, will be added in future releases.

Spark has come a long way in a short amount of time. But judging by Project Tungsten and the product’s roadmap, it has a long way to go to fulfill its creators’ vision of helping developers easily build fast, data-intensive applications.