Cloudera Launches Effort to Unite Apache Spark and Apache Hadoop

New One Platform Initiative to Advance Spark and Enable the Next Generation of Analytic Applications

PALO ALTO, Calif. – September 9, 2015 – Cloudera, the leader in enterprise analytic data management powered by Apache Hadoop™, today announced the One Platform Initiative, an effort to accelerate Apache Spark development for the enterprise. Spark is already the most popular open source project in the Hadoop ecosystem, and this initiative will enable it to become the successor to Hadoop’s original MapReduce framework for general Hadoop data processing. By embedding Spark deeply and broadly across the platform, in areas spanning management, security, scalability, and streaming, this initiative will help make the next generation of analytic applications possible.

Over the past 18 months, Spark has seen wide adoption, with over 200 of Cloudera’s customers - including Allstate, Avvo, Barclays, Concur, DigitalGlobe, RelayHealth, and Santander UK - running Spark across diverse industries and for multiple use cases. Recognizing Spark’s potential to become the next general processing framework for Hadoop, due to its developer ease-of-use, modular flexibility, and performance, Cloudera invested ahead of the market in core engineering, support, services, and training to make customers successful with Spark.

"Spark is rapidly becoming a popular choice to complement Hadoop as businesses want a friendly, fast, and versatile engine to cover analytics needs around streaming, graph, and even machine learning," said Nik Rouda, senior analyst, ESG. "Cloudera is making big investments in developing and supporting Spark as a full-fledged component of their robust offerings. The big data market will continue to evolve rapidly, but this ensures Cloudera will be not only relevant, but remain a leader going forward."

As the first Hadoop vendor to ship and support Spark, Cloudera has been a leader in the Spark community and, in particular, in integrating Spark and Hadoop. With over 5x the Spark engineering resources of other Hadoop vendors, Cloudera has contributed over 370 patches and 43,000 lines of code to Spark and has made its development a key initiative with its partner, Intel. As a result, Spark is a deeply integrated and widely used component of Cloudera’s Hadoop platform. This production experience has provided considerable insight into the challenges of running Spark in customer environments at scale, and extensive knowledge of engineering and analytics teams’ requirements.

"Spark is well on its way to succeeding MapReduce in enabling jobs with hundreds of executors each, running simultaneously on large multi-tenant clusters with tens of thousands of nodes, but there is still some heavy lifting to do," said Mike Olson, founder and chief strategy officer, Cloudera. "It’s an ambitious goal, but with the community of committers and supporters, and our leadership, we think that’s highly achievable."

However, for Spark to fulfill its potential, there remain several core areas that need work. The One Platform Initiative will focus community efforts on these four needs: Security, Scale, Management, and Streaming.

Strengthening Spark Security
Many enterprises, particularly in highly regulated industries such as financial services, government or healthcare, have extensive security and compliance needs when deploying and using new tools, such as Spark. As the provider of the industry’s only distribution to have passed a PCI compliance audit, Cloudera has long been focused on comprehensive security. There has already been progress in making Spark secure, including adding Kerberos integration for authentication and role-based access controls through HDFS Sync with Apache Sentry. The One Platform Initiative will work to ensure that Spark meets strict regulatory guidelines and is fully integrated with Hadoop security, with development efforts focusing on governance, encryption including integration with Intel’s Advanced Encryption libraries, and fine-grained security controls.

Spark at Hadoop Scale
For Spark to succeed MapReduce, it will need to match or exceed the scale of MapReduce jobs that are running today, which often involve petabytes of data across thousands of nodes. Cloudera already supports the world’s largest Spark deployments, and these need to continue to grow. The One Platform Initiative will help ensure Spark can handle jobs across tens of thousands of nodes in multi-tenant clusters, requiring improved reliability, stability and performance.

Managing Spark
Making Spark easier to manage is necessary for wide enterprise adoption and supporting mission-critical production applications. Cloudera has led this effort by integrating Spark with Hadoop YARN for shared resource management, connecting to other Hadoop frameworks including Impala and Apache Solr, and adding useful metrics for diagnostics. The One Platform Initiative will continue to make Spark even easier to manage with automated configurations; improved multi-tenancy, performance, and ease of use for Spark-on-YARN; more visibility into resource usage; and improved PySpark installation for Python access.

Streaming
Streaming workloads are some of the most popular for Spark, especially with the exponential growth in Internet of Things (IoT) data and the desire for real-time analytics. To meet customer production needs, Cloudera has already worked to ensure zero data loss with Spark Streaming and developed integrations with the most popular data ingestion tools, Kafka and Flume. For the future goal of ensuring Spark Streaming can support most common stream processing workloads, performance will be a key focus area, as well as enabling new users to access streaming capabilities through higher-level language extensions.

Cloudera is revolutionizing enterprise data management by offering the first unified Platform for big data, an enterprise data hub built on Apache Hadoop. Cloudera offers enterprises one place to store, access, process, secure, and analyze all their data, empowering them to extend the value of existing investments while enabling fundamental new ways to derive value from their data. Cloudera's open source big data platform is the most widely adopted in the world, and Cloudera is the most prolific contributor to the open source Hadoop ecosystem. As the leading educator of Hadoop professionals, Cloudera has trained over 40,000 individuals worldwide. Over 1,700 partners and a seasoned professional services team help deliver faster time to value. Leading organizations in every industry plus top public sector organizations globally run Cloudera in production.

Cloudera, Cloudera's Platform for Big Data, Cloudera Enterprise Data Hub Edition, Cloudera Enterprise Flex Edition, Cloudera Enterprise Basic Edition and CDH are trademarks or registered trademarks of Cloudera Inc. in the United States, and in jurisdictions throughout the world. All other company and product names may be trade names or trademarks of their respective owners.