Spark ML Runs 10x Faster on GPUs, Databricks Says

Alex Woodie

Apache Spark machine learning workloads can run up to 10x faster by moving them to a deep learning paradigm on GPUs, according to Databricks, which today announced that its hosted Spark service now runs on Amazon’s new GPU cloud.

Databricks, the primary commercial venture behind Apache Spark, today announced that it’s now supporting TensorFrames, the new Spark library based on Google’s (NASDAQ: GOOG) TensorFlow deep learning framework, on its hosted Spark service, which runs on Amazon Web Services (NASDAQ: AMZN). The deep learning service will be generally available within two weeks, the company says.

TensorFrames, which was unveiled this March as a technical preview, lets Spark harness TensorFlow for the purpose of programming deep neural networks, the primary computational method powering so-called “deep learning” algorithms. TensorFrames is also available to on-prem Spark users as a GitHub project, but it’s not yet available for download in the Apache Spark project, which limits its usefulness for the time being.

Common Spark machine learning tasks, such as image processing and text analysis, run up to 10 times faster using TensorFrames running on GPUs, Databricks said in a blog post today. What’s more, the code behind a simple numerical task like kernel density estimation was three times shorter using TensorFrames compared to using optimized Scala code, and it was four times less expensive to run, in terms of AWS resources (CPU vs. GPU).
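To see why kernel density estimation is a natural fit for this style of computation, here is a minimal NumPy sketch of a Gaussian KDE (an illustrative example, not Databricks’ TensorFrames code): the whole estimate reduces to one dense pairwise-difference array, exactly the kind of single tensor operation that maps well onto a GPU.

```python
import numpy as np

def gaussian_kde(samples, xs, bandwidth=0.5):
    """Evaluate a Gaussian kernel density estimate at points xs.

    Every evaluation point is compared against every sample in one
    vectorized (tensor-style) operation, rather than in nested loops.
    """
    samples = np.asarray(samples, dtype=float)
    xs = np.asarray(xs, dtype=float)
    # Pairwise differences: shape (len(xs), len(samples))
    diffs = xs[:, None] - samples[None, :]
    # Gaussian kernel applied elementwise to the whole array at once
    kernels = np.exp(-0.5 * (diffs / bandwidth) ** 2)
    norm = bandwidth * np.sqrt(2.0 * np.pi) * len(samples)
    return kernels.sum(axis=1) / norm

# Density of three sample points, evaluated on a grid
grid = np.linspace(-5.0, 7.0, 1201)
density = gaussian_kde([0.0, 1.0, 2.0], grid)
```

Expressed this way, the estimator is a handful of lines; the equivalent hand-optimized Scala loop code is where the reported 3x difference in code length comes from.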

The addition of deep learning to the super-popular Spark framework is important, Databricks says, because it allows Spark developers to perform a range of data analysis tasks—including data wrangling, interactive queries, and stream processing—within a single framework. That helps avoid the complexity inherent in using multiple frameworks and libraries.

Practical uses for Spark-based deep learning include image recognition, handwriting recognition, and language translation. Medical researchers could use TensorFrames and GPUs to better detect tumors in pathology images, the company says, while linguists would benefit from language translation that’s nearly on par with humans.

Databricks is preconfiguring TensorFrames on AWS. The software side of the setup includes Apache Spark, the TensorFrames library and initialization scripts, and NVIDIA’s CUDA and cuDNN libraries (users can also use other deep learning libraries, such as Caffe). The hardware is composed of Amazon EC2 g2.2xlarge (1 GPU) and g2.8xlarge (4 GPUs) instance types. Databricks says that p2 (1-16 GPUs) instance types are coming soon.

Spark machine learning jobs run considerably faster and cheaper as TensorFrames on GPUs compared to optimized Scala code on CPUs, according to Databricks (image source: Databricks)

Databricks says the preconfiguration work saves each customer about 60% compared to configuring the setup themselves. The company further tweaks the Spark instance on the GPUs to prevent contention. “GPU context switching is expensive, and GPU libraries are generally optimized for running single tasks,” the company says in the blog. “Therefore, reducing Spark parallelism per executor results in higher throughput.”
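One standard way to reduce Spark parallelism per executor is to raise the number of CPU slots each task claims, so fewer tasks contend for the GPU at once. The following configuration fragment is an illustrative sketch using Spark’s real `spark.executor.cores` and `spark.task.cpus` settings; the specific values are assumptions, not Databricks’ published configuration:

```
# spark-defaults.conf (illustrative values)
# Each executor is given 8 cores...
spark.executor.cores  8
# ...and each task claims all 8, so at most one task runs per executor
# at a time, avoiding GPU context switching between concurrent tasks.
spark.task.cpus       8
```

With `spark.task.cpus` equal to `spark.executor.cores`, the GPU-backed library runs one task at a time per executor, which is the higher-throughput regime the blog post describes.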

While Databricks uses Apache Spark in its hosted service, the version of Spark that Databricks customers have access to is not available to the general public. So when will TensorFrames come to Apache Spark? That’s not yet clear.

At the Strata + Hadoop World conference last month, Databricks CEO Ali Ghodsi told Datanami that TensorFrames will eventually get its own library in the open source Apache Spark framework, right along with MLlib, SparkSQL, Spark Streaming, and GraphX. “It’s coming,” Ghodsi said.