Icinga at CERN
Technology & Science

Not too long ago, CERN and its Large Hadron Collider (LHC) made scientific history with the discovery of the elusive Higgs Boson. As it happened, Icinga was quietly monitoring away behind the scenes.

Deep under the Franco-Swiss border, the 27km LHC ring collides subatomic particles travelling at 99.9999991% of the speed of light, producing more than 25 petabytes of data annually in the process. Along the ring, detectors at four sites capture this data for experiments that seek to discover the origins of matter and anti-matter, extra dimensions and, of course, the famed Higgs Boson.

At three of these four sites, Icinga has replaced Nagios to monitor thousands of hosts and services, improving check latency with the help of Mod-Gearman along the way. We’re proud to say that Icinga helps make sure CERN’s LHC is on track.

Challenge

100m under the Franco-Swiss border lies a 27km ring known as the Large Hadron Collider (LHC), which collides subatomic particles at energies of up to 14TeV. Stationed at four sites, detectors weighing up to 12,000 tonnes record data for experiments that aim to discover the origins of matter and anti-matter, verify the existence of the Higgs Boson and search for extra dimensions, among other goals. To help ensure that all runs smoothly, Icinga monitors three of these experiments: LHCb, CMS and ATLAS (Figure 1).

A control system and data acquisition chain forms the backbone of the experiment, running on Linux and Windows machines as well as embedded processors. Initially, monitoring began with a single Nagios instance. However, as the IT team tried to scale, issues started to surface. The average service check latency of 328 seconds was simply too slow. A new solution was called for, and the administrators came across Icinga and its lively community.

Solution

Thanks to Icinga’s configuration compatibility, migration was relatively easy. To make future maintenance even easier, however, configuration files were reorganised to make full use of groups and inheritance. This way, adding a new machine to an existing category such as database, computing node or storage simply meant changing one configuration file. Today, the LHCb experiment is monitored by a single Icinga instance in a failover setup combined with Mod-Gearman, NRPE and NSClient++. This includes a few customised checks, such as GPFS and file system speed, besides SNMP checks and other special performance measurements.
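To illustrate why groups and inheritance reduce maintenance, here is a minimal sketch in Python rather than in Icinga's own configuration syntax. It mimics how a `use` directive pulls attributes from a parent template, so that a new machine only needs the settings that differ. The host names, addresses and attribute values are hypothetical, and the sketch handles only a single parent per object; it is not CERN's actual configuration.

```python
def resolve(name, objects):
    """Return the effective attributes of an object after template inheritance."""
    obj = objects[name]
    merged = {}
    parent = obj.get("use")
    if parent is not None:
        # Start from the parent's effective attributes (resolved recursively).
        merged.update(resolve(parent, objects))
    # The object's own attributes override anything inherited.
    merged.update({key: value for key, value in obj.items() if key != "use"})
    return merged


# Hypothetical definitions: one generic template, one category, one machine.
objects = {
    "generic-host": {"check_interval": 5, "max_check_attempts": 3},
    "storage-node": {"use": "generic-host", "hostgroups": "storage"},
    "store042": {"use": "storage-node", "address": "10.0.0.42"},
}

effective = resolve("store042", objects)
```

Under these assumptions, adding another storage machine means adding one short entry that names its category template; everything else is inherited.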

The central Icinga server schedules monitoring checks, which 60 distributed Mod-Gearman workers fetch from one of its queues, execute and then push the results back to another queue (Figure 2). With this new setup, a single Icinga instance is able to monitor a massive environment of 2000+ hosts and 40,000+ services. To top it off, service check latency improved from an average of 328 seconds to less than 1 second.
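The pattern described above, a scheduler feeding a job queue and workers returning results through a second queue, can be sketched in a few lines of Python. This is an in-process analogy using threads and the standard `queue` module, not Mod-Gearman's actual implementation, which distributes jobs through a Gearman job server across machines. The function name `run_checks` and the example checks are hypothetical.

```python
import queue
import threading

UNKNOWN = 3  # Nagios/Icinga plugin return code for "UNKNOWN"


def run_checks(checks, num_workers=4):
    """Run named checks on worker threads via a job queue and a result queue."""
    jobs = queue.Queue()
    results = queue.Queue()

    def worker():
        while True:
            item = jobs.get()
            if item is None:  # sentinel: no more work for this worker
                break
            name, check = item
            try:
                status = check()
            except Exception:
                status = UNKNOWN  # a crashing check must not kill the worker
            results.put((name, status))

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()

    # The "scheduler" side: enqueue every check, then one sentinel per worker.
    for name, check in checks.items():
        jobs.put((name, check))
    for _ in threads:
        jobs.put(None)
    for thread in threads:
        thread.join()

    # All workers have finished, so the result queue is complete.
    collected = {}
    while not results.empty():
        name, status = results.get()
        collected[name] = status
    return collected


# Hypothetical checks that simply return a status code.
example = run_checks({"ping": lambda: 0, "disk": lambda: 1}, num_workers=2)
```

The point of the design is that the scheduler never executes a check itself, so check execution scales by adding workers rather than by growing the central instance.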

CHECKING ON HIGGS BOSON

At a second and third site, detectors at the CMS (Compact Muon Solenoid) and ATLAS (A Toroidal LHC Apparatus) experiments look into the existence of the Higgs Boson, extra dimensions and dark matter.

At CMS, Icinga monitors 3000 hosts and 70 switches with one central instance. Combined with a single Mod-Gearman worker, NRPE and check_multi, Icinga handles 90,000 checks every 2 minutes. With checks ranging from basic network usage, errors and disk utilisation to RAID arrays, equipment temperature and other special services, Icinga keeps an eye on the entire environment.
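To put the CMS figure in perspective, the short calculation below converts "90,000 checks every 2 minutes" into a sustained rate. It assumes the checks are spread evenly over the interval, which the source does not state; the function name is hypothetical.

```python
def checks_per_second(total_checks, interval_seconds):
    """Average check rate, assuming checks are spread evenly over the interval."""
    return total_checks / interval_seconds


cms_rate = checks_per_second(90_000, 2 * 60)
print(cms_rate)  # 750.0
```

Under that assumption, the central instance and its worker sustain an average of 750 checks per second.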

ATLAS, on the other hand, uses a couple of Icinga instances running on virtual machines alongside Nagios. Of the total 3000 hosts, these monitor about 90 critical ones on two networks. The monitoring setup helps ATLAS maximise its use of LHC beam time and harvest as much data as possible for physicists to analyse.

The Future

EXPANDING INTO THE FUTURE

Plans are underway, however, to migrate ATLAS completely to Icinga, Mod-Gearman and Ganglia, covering the total of 3000 hosts and 100,000 checks there. These include hardware monitoring via IPMI, and will most likely run on a single central Icinga instance with Mod-Gearman workers, in line with the other sites.

An extension of the Icinga environment at CMS is also in the works, with more dedicated services currently being added to monitor the majority of the experiment’s software. By expanding Icinga’s monitoring reach, CERN’s IT teams can be sure that they’re on top of LHC operations, and the experiments can get on with the real rocket science.

Funnily enough, Icinga was monitoring behind the scenes when the elusive Higgs Boson was found. Indeed, as long as the LHC and its experiments continue to collide and collect unhindered, Icinga will be there for many more discoveries to come.