Using SparkR to Scale Data Science Applications in Production. Lessons from the Field

R is a hugely popular platform for Data Scientists to create analytic models in many different domains. But when these applications move from the science lab to the production environment of a large enterprise, a new set of challenges arises. Independently of R, Spark has been very successful as a powerful general-purpose computing platform. With the introduction of SparkR, an exciting new option for productionizing Data Science applications has become available. This talk gives insight into two real-life projects at major enterprises where Data Science applications in R were migrated to SparkR.
• Dealing with platform challenges: R was not installed on the cluster. We show how to execute SparkR on a YARN cluster with a dynamic deployment of R.
• Integrating Data Engineering and Data Science: we highlight the technical and cultural challenges that arise from closely integrating these two different areas.
• Separation of concerns: we describe how to disentangle ETL and data preparation from analytic computing and statistical methods.
• Scaling R with SparkR: we present what options SparkR offers to scale R applications and how we applied them to different areas such as time series forecasting and web analytics.
• Performance Improvements: we show benchmarks for an R application that took over 20 hours on a single-server, single-threaded setup. With moderate effort we were able to reduce that to 15 minutes with SparkR, and we show how we plan to reduce it further to less than a minute.
• Mixing SparkR, SparkSQL and MLlib: we show how we combined the three different libraries to maximize efficiency.
• Summary and Outlook: we describe what we have learnt so far, what the biggest gaps currently are and what challenges we expect to solve in the short- to mid-term.

Heiko Korndorf is the CEO and Founder of Wireframe Ltd in Liechtenstein. Wireframe focuses on Hadoop- and Spark-based solutions for enterprises in various industries. Over the last 20 years, Heiko has worked on large-scale IT projects across Europe for clients such as Mercedes-Benz/Daimler, JLR Jaguar Land Rover, BP, Deutsche Telekom, SAP, ABN AMRO, and British Gas. He holds an M.S. in Computer Science from the University of Zurich.

Apache, Apache Spark, Spark, and the Spark logo are trademarks of the Apache Software Foundation.
The Apache Software Foundation has no affiliation with and does not endorse the materials provided at this event.