Grid Monitoring at CERN with Elastic

The Worldwide LHC Computing Grid (WLCG) project [1] is a global collaboration of more than 170 computing centres in 42 countries, using over half a million processor cores, and linking up national and international grid infrastructures. It gives a community of over 10,000 physicists near real-time access to Large Hadron Collider (LHC) data, with an average of two million jobs running every day.

Mission & Background

The mission of the WLCG project is to provide global computing resources to store, distribute, and analyze the ~30 petabytes of data generated annually by the Large Hadron Collider (LHC) at CERN on the Franco-Swiss border. Accomplishing this mission currently requires more than 300 petabytes of disk and 200 petabytes of tape storage.

The WLCG Grid Monitoring team at CERN is in charge of providing tools and services that allow the monitoring and understanding of the complex WLCG infrastructure. Without this understanding, an efficient use of the system would be impossible. The team has developed multiple applications to retrieve, display, and analyze the monitoring data. The different applications are tuned to their expected audience: high-level views for management, detailed views for service administrators and individual users, and illustrative views for the general public. Most of these applications are based on web servers on top of relational databases. Recently, the increased volume and complexity of monitoring data has required a move to modern, state-of-the-art technologies.

The next section describes three monitoring tasks which we use as pilot applications for our Elasticsearch evaluation. They cover various areas of the computing activities on the WLCG infrastructure: data access and distribution, data processing, and service health and performance. There are dedicated dashboards for each of these areas.

The WLCG Data Collection

The data collected by the WLCG experiments is replicated to several sites. This increases the speed at which it can be processed, and it provides a failover mechanism in case any site is temporarily unavailable. On average, there are 25 million files transferred every day at an average speed of 10 GB/s. The WLCG Transfers Dashboard, shown in Figure 1, uses multiple filters for time, experiments, countries, and sites to allow the creation of detailed views of the data movements.

Figure 1. WLCG Transfers Dashboard [2]

The raw data collected by the LHC detectors has to be processed before it becomes available to the physicists for their analysis. In addition, large volumes of simulated data are produced for comparison with the data obtained from the LHC detectors. All of this data is distributed within the WLCG infrastructure. The Job Monitoring Dashboard application, shown in Figure 2, follows the status of the data processing, analysis, and simulation. On average, there are more than 2 million jobs per day, with 250,000 concurrent jobs at any given time distributed among more than 170 sites.

Figure 2. CMS Job Monitoring application [3]

Finally, each site provides different services that are continuously being tested to detect any issues as quickly as possible. In total, there are more than a thousand metrics being used to determine whether services and sites are working properly. The Site Status Board, shown in Figure 3, stores, displays, and combines all these metrics. The metrics have different granularity, ranging from minutes to years. For example, one of the metrics checks every 60 minutes whether a file can be written to the storage service. A different metric records the amount of annual storage that each site pledges to the collaboration.

Figure 3. Site Status Board (SSB) application for one of the WLCG experiments [4]

Relational Database → Elasticsearch

The current applications in these three areas use the Experiment Dashboard Framework [5]. The web frontends use open source JavaScript libraries, such as jQuery, DataTables, and Highcharts, and the server side is Apache and Python on top of relational databases. To cope with the increase in scale and complexity, we are evaluating Elasticsearch as an alternative to relational databases.

The team has already gained production experience with Elasticsearch. In particular, the Messaging service [6] uses Elasticsearch to monitor the status of their services. At the moment, there are more than two billion documents. A Kibana dashboard has been deployed, and, for security reasons, restricted to the service managers. The information is also fed into an Esper engine [7] that is responsible for generating alarms in case of issues.

We started our Elasticsearch journey with an evaluation of Elasticsearch 0.90.0 in 2013. Insertion and query times were promising, but the lack of multi-field grouping prevented a full migration at that point [8].

The evaluation was resumed with Elasticsearch 1.x, in order to expand the usage of Elasticsearch to the following three areas:

Data transfer movements between the sites

Job processing

Status of the sites and services
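The multi-field grouping that blocked the first evaluation is what the Elasticsearch 1.x aggregations framework now provides, since terms aggregations can be nested inside one another. As a minimal sketch (the field names `src_site`, `dst_site`, and `bytes` are illustrative assumptions, not our actual schema), a transfer-dashboard query grouping by source and destination site could be built like this:

```python
import json

# Hypothetical query body grouping transfer documents first by source
# site, then by destination site, summing the transferred bytes.
# Elasticsearch 1.x allows such nested (multi-field) aggregations.
query = {
    "size": 0,  # only the aggregation results are needed, not the hits
    "aggs": {
        "by_source": {
            "terms": {"field": "src_site"},
            "aggs": {
                "by_destination": {
                    "terms": {"field": "dst_site"},
                    "aggs": {
                        "total_bytes": {"sum": {"field": "bytes"}}
                    },
                }
            },
        }
    },
}

# In practice this body would be POSTed to /<index>/_search;
# here we only serialize it to show the request structure.
print(json.dumps(query, indent=2))
```

The nesting is the key point: with 0.90.x facets, each facet grouped on a single field independently, so a source-by-destination breakdown like this required post-processing on the client side.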

Future Elastic Projects at CERN

We are currently evaluating how we can further benefit from the Elasticsearch components. The current evaluation contains several independent aspects:

Adopt Logstash to insert data into the cluster

Use a single Elasticsearch cluster for all the different use cases, ensuring that the authorization of each of the tenants can be fulfilled

Describe the different type of documents and indices needed for each of the applications

Evaluate Kibana for visualization. This particular area has a lower priority, since the team has already implemented the necessary web interfaces, and the effort is focused on modifying those web interfaces to read from Elasticsearch
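To make the Logstash part of this list concrete, a minimal pipeline for feeding monitoring records into the cluster could look like the following sketch. The file path, field name, and index pattern are illustrative assumptions, not our production configuration:

```conf
# Hypothetical Logstash pipeline: read JSON monitoring records from a
# file and index them into one Elasticsearch index per day.
input {
  file {
    path  => "/var/log/wlcg/transfers.json"   # illustrative path
    codec => "json"
  }
}

filter {
  date {
    match => [ "timestamp", "ISO8601" ]       # use the event's own time
  }
}

output {
  elasticsearch {
    host  => "localhost"
    index => "transfers-%{+YYYY.MM.dd}"       # daily indices ease retention
  }
}
```

Time-based indices of this kind also fit the multi-tenancy point above, since per-application index patterns make it easier to scope each tenant's access.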

While it is too early for a final verdict, the experience with Elasticsearch has been very positive so far, and we will continue to expand our evaluation.

Pablo Saiz is a computer scientist from Spain who has been working at CERN for the last 15 years. During that time, he has been involved with several areas related to distributed computing for the WLCG (Worldwide LHC Computing Grid). He was the main developer and project leader of AliEn, the ALICE Environment on the GRID. AliEn is the system used by ALICE (one of the LHC experiments) to distribute the data and workload among the more than seventy sites participating in the experiment. Currently, he leads a team that is responsible for the monitoring applications used by the WLCG. The team provides applications to monitor data distribution, job processing and service status. Pablo is the main designer of tools like the Site Status Board (SSB) and the Service Availability Monitoring (SAM3).

Want to hear Pablo speak? Then join us for the Elastic{ON} Tour in Munich on November 10. He will be there talking about all things Elastic @ CERN. We are currently sold out but you can sign up for the wait list and we'll let you know when more tickets become available.