This fall, Cloudera is intensifying its commitment to make Apache Spark the replacement for MapReduce as the core data execution engine for workloads in the Hadoop ecosystem. Cloudera execs speak with IDN about their roadmap for improving Spark’s scale, security, management and more.

Earlier this month, Cloudera unveiled a roadmap for how the company intends to make Hadoop and Spark work better together. The roadmap is part of Cloudera’s One Platform Initiative document, which details Cloudera programs and investments to help Spark overcome some key enterprise-class limitations – especially in the areas of scalability, security, management and streaming, according to Cloudera execs.

“We have many customers committing to Spark. At the same time they tell us that for Spark to take the next step, we need to fill in some crucial gaps,” Cloudera’s Jai Ranganathan, director for product management, told IDN. “We’re making that commitment to deliver these capabilities with our partners and Apache.”

Even prior to the release of its Spark improvement roadmap, Cloudera engineers had openly expressed concerns that continuing to rely on MapReduce as the go-to data execution engine for Hadoop would hold back adoption and thwart exciting new use cases.

“While MapReduce plays an important role for Hadoop, especially for I/O intensive workloads, it remains hard to program. Further, MapReduce also doesn’t handle real-time or interactive workloads very well at all,” Ranganathan said. “We’re committed to helping Spark deliver more performance for solutions that are now in high demand, such as real-time analytics,” he added. One metric of Cloudera’s commitment to Spark is the fact that the company has more Spark committers on staff than all the other Hadoop vendors combined, Ranganathan noted.

He highlighted two other big data trends that affirm Spark’s huge upside.

User adoption is exploding. At present, Cloudera reports it has more than 150 customers using Spark in production. Particularly exciting is the trend that new customers are showing especially high interest in using Spark within Cloudera’s Hadoop stack, according to Ranganathan.

Activity is taking off. Spark is capturing the interest of developers and attracting a growing number of committers. In the last six months, in fact, Spark has been more active than the core Apache Hadoop project itself, he added.

“We’re convinced that Hadoop is the platform that will dominate the landscape in the next decade. Spark is a key part of that platform. Bringing it in more completely, and delivering the same security, governance, operational and other strengths Hadoop offers, is crucial,” wrote Mike Olson, Cloudera’s co-founder and chief strategy officer.

“Spark dramatically outperforms MapReduce on latency and throughput, but today it simply can’t compete with MapReduce on scale,” Olson wrote, and revealed why increasing Spark’s scalability is so important. Some Cloudera customers are already handling petabytes of data across thousands of nodes using MapReduce and Hadoop, Olson said. “If Spark is genuinely going to replace MapReduce for general-purpose workloads, it has to scale that big, and bigger, in the Hadoop ecosystem,” he added.

To massively expand Spark’s scale, Cloudera is working on technologies to allow it to handle jobs on thousands of executors – each running simultaneously on large multi-tenant clusters with over 10,000 nodes, according to the One Platform Initiative roadmap.

Further, even though Cloudera has already improved Spark’s YARN and HDFS integration, Olson said Spark needs more – including dynamic resource allocation, better task-level elasticity, and improvements to the Spark “job history” server. Cloudera is also working with Intel to optimize Spark for next-gen CPUs to boost performance for common Spark workloads such as machine learning and numerical analysis.
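To make the idea of dynamic resource allocation concrete, here is a minimal Python sketch that assembles the settings Spark already exposes for it on YARN. The property names are existing Spark configuration keys; the helper names (dynamic_allocation_conf, to_submit_args) and the executor limits are hypothetical and chosen only for illustration. The article does not say how Cloudera's roadmap would change these settings.

```python
def dynamic_allocation_conf(min_executors, max_executors):
    """Return Spark properties that enable dynamic executor allocation.

    Assumption: the job runs on YARN with the external shuffle service
    available, which Spark requires for dynamic allocation.
    """
    if min_executors < 0 or max_executors < min_executors:
        raise ValueError("need 0 <= min_executors <= max_executors")
    return {
        "spark.dynamicAllocation.enabled": "true",
        "spark.shuffle.service.enabled": "true",
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
    }


def to_submit_args(conf):
    """Render a property dict as spark-submit style '--conf key=value' arguments."""
    args = []
    for key in sorted(conf):  # sorted so the output order is deterministic
        args += ["--conf", "{}={}".format(key, conf[key])]
    return args
```

Calling to_submit_args(dynamic_allocation_conf(2, 200)) yields a list of arguments that could be appended to a spark-submit command line, letting the cluster grow and shrink the job's executor count between the two illustrative bounds.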

Security. Big data initiatives used to be all about scale and ease-of-use. Now, customer demand for stronger security is taking center stage.

Big data projects must now be able to guarantee data privacy, grant and revoke access privileges correctly, track data access accurately and report reliably when regulators ask, according to Olson.

As outlined in its One Platform Initiative, Cloudera intends to deliver encryption of data at rest and data in motion (during transmission over a network). To ensure this higher-level encryption won’t impact performance, Cloudera is teaming with Intel to integrate Intel’s advanced encryption libraries with Spark. These libraries can access CPU resources directly to dramatically speed up computations.

For long-running streaming jobs, Cloudera is also looking to give Spark the ability to renew credentials automatically.

Cloudera even looks to secure one other popular use case. Because SparkSQL is often used to pull data from Hive tables, Cloudera is adding an Apache Sentry HDFS plug-in to ensure a Spark user can’t get around any of the security and privacy policies that are specified on those tables. Cloudera also says it will extend this table-level protection to column- and view-level security, according to Olson.

Management. To put more workload management into the hands of users, Cloudera is committing to Spark-on-YARN improvements for better multi-tenancy, performance and ease of use. For better visibility, Cloudera is working on ways to help Spark send better reports on resource consumption and utilization. It is even working on software that would automate configuration on an ongoing basis, which would keep clusters well-tuned and humming through all types of operations.

Performance. Cloudera engineers are even working on ways to push the envelope on Spark’s current performance benchmarks to meet emerging use cases, Olson noted.

“Even though Spark is fast, there’s room for improvement in stream processing. Performance will continue to be a focus area across the platform, but in Spark Streaming in particular, there are some obvious changes we can make, in persistent mutable state management and elsewhere, that will deliver some big benefits,” he wrote.
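For readers unfamiliar with what "mutable state" means in a streaming job, the following toy Python sketch may help. It is not Spark code: the function names (update_state, run_stream) are hypothetical, and the example only mimics the general pattern behind stateful operations such as Spark Streaming's updateStateByKey, where a running value per key is carried from one micro-batch to the next.

```python
def update_state(state, batch):
    """Fold one micro-batch of (key, count) events into the running totals.

    Returns a new dict rather than mutating the old one, so every batch
    pays the cost of copying the full state. That is deliberately naive.
    """
    new_state = dict(state)
    for key, value in batch:
        new_state[key] = new_state.get(key, 0) + value
    return new_state


def run_stream(batches):
    """Process batches in order and return a snapshot of the state after each."""
    state = {}
    snapshots = []
    for batch in batches:
        state = update_state(state, batch)
        snapshots.append(dict(state))
    return snapshots
```

Because the whole state is copied on every batch, even an empty batch costs time proportional to the number of keys being tracked. That kind of overhead is one plausible reason state management shows up as a performance target, though the article does not describe the specific changes Cloudera has in mind.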

Cloudera’s Spark Efforts Come as Apache Releases Spark 1.5

Cloudera’s focus on beefing up Spark comes as Apache has released Spark 1.5, an update that sports notable features to boost performance for many key use cases – including data processing, machine learning and real-time analytics. Spark 1.5 also sports improved cluster management.

According to Apache, the new work in Spark 1.5 represents more than 1,400 patches from some 230 contributors across 80 separate organizations. Engineers from Cloudera and Databricks were among the most active contributors in this latest release.

Many of the bottlenecks to Spark's performance had stemmed from its dependence on Java Virtual Machine (JVM) garbage collection and memory management. Among the most impactful updates in Spark 1.5 are innovations that allow Spark to sidestep these JVM limitations with direct access to CPU cache memory.