CERN’s Next Generation Data Analysis Platform with Apache Spark

The CERN experiments and their particle accelerator, the Large Hadron Collider (LHC), will soon have collected a total of one exabyte of data. Moreover, the next upgrade of the accelerator, the high-luminosity LHC, will dramatically increase the rate of particle collisions, thus boosting the potential for discoveries but also generating unprecedented data challenges.

To process and analyse all these data, CERN is investigating approaches that complement the traditional ones, which rely mainly on Grid and batch jobs for data reconstruction, calibration and skimming, followed by a phase of local analysis of the reduced data. The new techniques should allow interactive analysis of much larger datasets by transparently exploiting dynamically pluggable resources.

In this context, Spark is being used at CERN to process large physics datasets in a distributed fashion. ROOT, the most widely used tool for high-energy physics analysis, implements a layer on top of Spark to distribute computations across a cluster of machines. This makes it possible for physics analyses written in either C++ or Python to be parallelised on Spark clusters while reading their input data from EOS, CERN’s mass storage system. In addition, a second important use case of Spark at CERN has recently emerged.

The LHC logging service, which collects data from the accelerator to provide information on how to improve the performance of the machine, is currently migrating its architecture to leverage Spark for its analytics workflows. This talk will discuss the unique challenges of these two use cases and how SWAN, the CERN service for interactive web-based analysis, now supports them through a new feature: users can dynamically plug Spark clusters into their sessions and offload computations to those resources.
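As a rough illustration of the idea of distributing an analysis across pluggable resources, the following minimal sketch splits a dataset into ranges, fills a partial histogram per range, and merges the partial results. It is not ROOT's or SWAN's actual implementation: the function names (`split_ranges`, `fill_partial_histogram`, `distributed_histogram`), the sample values and the thread pool standing in for a Spark cluster are all assumptions made for this example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split_ranges(n_entries, n_partitions):
    """Split [0, n_entries) into contiguous, near-equal (start, end) ranges."""
    base, extra = divmod(n_entries, n_partitions)
    ranges, start = [], 0
    for i in range(n_partitions):
        # The first `extra` partitions take one additional entry each.
        end = start + base + (1 if i < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges


def fill_partial_histogram(values, bin_width):
    """Map step: histogram one partition; keys are integer bin indices."""
    return Counter(int(v // bin_width) for v in values)


def distributed_histogram(data, n_partitions, bin_width, executor_map=map):
    """Run the map step through `executor_map`, then merge the partial results.

    `executor_map` is the pluggable backend (hypothetical parameter): the
    built-in `map` runs locally, `pool.map` runs on a pool of workers.
    """
    ranges = split_ranges(len(data), n_partitions)
    partials = executor_map(
        lambda r: fill_partial_histogram(data[r[0]:r[1]], bin_width), ranges
    )
    # Reduce step: adding Counters sums the counts bin by bin.
    return reduce(lambda a, b: a + b, partials, Counter())


# Made-up sample values, for illustration only.
masses = [91.2, 88.7, 125.1, 90.4, 124.8, 3.1, 9.5, 91.0]

# A thread pool stands in for the cluster that would be plugged into a session.
with ThreadPoolExecutor(max_workers=4) as pool:
    hist = distributed_histogram(masses, 3, 10.0, executor_map=pool.map)

local_hist = distributed_histogram(masses, 3, 10.0)
```

The analysis code is identical whether it runs locally or on the pool; only the `executor_map` argument changes, which mirrors the notion of offloading the same computation to resources attached at runtime.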

Enric Tejedor received his Ph.D. in Computer Science from the Technical University of Catalonia (UPC, Spain) in 2013. He conducted his doctoral research at the Barcelona Supercomputing Center, where he focused on parallel programming models for distributed infrastructures and participated in several EU research projects. As part of his Ph.D., he was also an intern at the IBM T.J. Watson Research Center (NY, USA). In 2015 he joined CERN, where he works on the parallelisation of high-energy physics analysis software, the development and operation of cloud-based analysis services, and the development of automatic Python bindings for C++.
