Apache Spark is revolutionizing the big data industry due to performance advantages and the inclusion of standard SQL support on Hadoop.

But not all data is equal, and getting insights in a latency-sensitive fashion can sometimes mean millions of dollars, or lives. For example, real-time fraud detection in the financial services industry can protect vast sums of money. Similarly, in the health care industry, detecting cardiac arrests before they happen could save a significant number of lives.

Disruptive technology with this level of potential benefit deserves to be tested. To that end, IBM, Lenovo, Intel and Mellanox joined forces to build an industry-first, ready-to-use reference solution for Spark deployment that addresses both scale and capacity. The goal of this project was not only to highlight the performance and scalability of a Spark cluster solution, but also to provide infrastructure building blocks for Spark deployment, along with troubleshooting and optimization techniques.

The architecture of this solution stack is a balanced configuration of high-performance compute, storage and networking components. For the compute element, the Lenovo x3650 M5 server was selected for the Spark data worker jobs and the Lenovo x3550 server for the Spark master function. Each of the x3650 M5 servers is configured with Intel E5-2697 v4 high-performance processors and loaded with 1.5 TB of memory. Although Apache Spark is a fast, in-memory data processing engine, it can also work on data sets that exceed the memory footprint, so we wanted storage with performance closer to that of memory. For this reason we chose Intel NVMe SSDs, which provide up to 450K I/O operations per second at minimal latency. Networking is important when considering performance at scale; for this, the Mellanox 100G network interface card was selected.

The Hadoop-DS benchmark (a derivative of TPC-DS) was chosen because it requires many of the SQL:2003 features, which Spark 2.0 supports. Spark SQL has been one of the primary interfaces that Spark applications use, and these extended SQL capabilities drastically reduce the effort needed to port legacy applications over to Spark.
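To give a feel for the kind of standard SQL such a benchmark exercises, here is a minimal sketch of a correlated scalar subquery, one of the query shapes that decision-support workloads commonly rely on. It is not taken from Hadoop-DS and it does not run on Spark: it uses Python's built-in sqlite3 module as a stand-in engine so the example stays self-contained, and the `sales` table, its columns and the sample rows are all hypothetical.

```python
import sqlite3

# Hypothetical sample rows: (store, item, amount). Not benchmark data.
SAMPLE_ROWS = [
    ("A", "pen", 1.0),
    ("A", "ink", 3.0),
    ("B", "cap", 10.0),
    ("B", "hat", 30.0),
    ("B", "bag", 20.0),
]

# Correlated scalar subquery: the inner query is re-evaluated per outer row
# and returns the average sale amount for that row's store.
QUERY = """
    SELECT s.store, s.item, s.amount
    FROM sales AS s
    WHERE s.amount > (
        SELECT AVG(t.amount) FROM sales AS t WHERE t.store = s.store
    )
    ORDER BY s.store, s.item
"""


def above_store_average(rows):
    """Return the sales whose amount exceeds their own store's average."""
    conn = sqlite3.connect(":memory:")  # stand-in engine, not Spark
    try:
        conn.execute("CREATE TABLE sales (store TEXT, item TEXT, amount REAL)")
        conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in above_store_average(SAMPLE_ROWS):
        print(row)
```

A legacy application that already issues queries of this shape can, in principle, carry them over with little rewriting once the target engine accepts the same standard syntax, which is the porting benefit described above.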

The total solution consisted of 30 x3650 M5 servers in 2 racks, with a 100G Mellanox switch managing the data traffic between these servers. Red Hat served as the OS, Hadoop HDFS as the distributed data storage, and Spark 2.0 as the data processing engine.
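As a rough sizing aid, the figures above can be combined into aggregate numbers for the cluster. The sketch below assumes that all 30 x3650 M5 servers carry the 1.5 TB of memory described earlier and that they are split evenly across the 2 racks; the even split is an assumption, and the `ClusterSpec` class and its field names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterSpec:
    """Hypothetical container for the worker-node figures quoted in the text."""
    worker_nodes: int
    memory_tb_per_node: float
    racks: int

    def total_memory_tb(self) -> float:
        # Aggregate memory across all worker nodes.
        return self.worker_nodes * self.memory_tb_per_node

    def nodes_per_rack(self) -> float:
        # Assumes nodes are spread evenly across racks (not stated in the text).
        return self.worker_nodes / self.racks


# Figures from the article: 30 servers, 1.5 TB each, 2 racks.
REFERENCE_CLUSTER = ClusterSpec(worker_nodes=30, memory_tb_per_node=1.5, racks=2)

if __name__ == "__main__":
    print(f"Aggregate memory: {REFERENCE_CLUSTER.total_memory_tb()} TB")
    print(f"Nodes per rack:   {REFERENCE_CLUSTER.nodes_per_rack()}")
```

Under those assumptions the cluster offers 45 TB of aggregate memory, which gives a sense of how much of a data set could stay memory-resident before the NVMe storage tier comes into play.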

The infrastructure building blocks, software stack and Hadoop-DS are illustrated below:

To learn more about the Spark SQL at scale benchmark results, scalability benefits and optimization techniques, here are two conference sessions to watch out for: