Bridging the Gap with Spark and SAP HANA

While Hadoop can store and access vast amounts of detailed data at lower cost, businesses still face high-performance demands and a bevy of business questions.

Emerging in-memory frameworks, such as Apache Spark and SAP HANA Vora, are giving enterprises the tools to overcome the limitations of batch-oriented processing and achieve real-time, iterative access to data on Hadoop clusters.

These frameworks also make it possible to work in a unified way that offers a pipeline of transformations while ensuring the security, governance, and operational administration necessary for production environments.

During a recent DBTA webinar, John O’Brien, CEO and principal advisor at Radiant Advisors, and Amit Satoor, senior director of product and solution marketing at SAP, discussed how to bridge the divide between big data and enterprise data to improve decisions through greater context.

O’Brien explained that the first generation of big data requirements was driven by internet companies that needed to store and crunch big datasets, scale affordably to petabytes, and handle all types of data, and that couldn’t be limited to SQL.

Over time, the requirements have changed. The capabilities enterprise analysts now require include optimized SQL-on-Hadoop engines, fast performance, iterative capabilities for enterprise analytics, in-memory processing (distributed for scalability) to avoid disk I/O, and support for data science applications that work iteratively on datasets and create temporary datasets.

Spark can help in all these areas, O’Brien said. Users can get data via Spark SQL or Spark Streaming, gain access to a machine learning library and graph engines, use high-performance, iterative, and programmable data handling, and use API-based programming for speed, with support for Java, Python, Scala (the language Spark is natively written in), and SparkR.

SAP HANA Vora is another solution that can help users through this process. The platform is an in-memory query engine that leverages and extends the Apache Spark execution framework to provide enriched interactive analytics on Hadoop, according to Satoor.

Vora offers a HANA-Spark adapter for improved performance between distributed systems. Users can achieve coherence between business data and big data, compile queries so that applications and data analysis work more efficiently across nodes, and more.