Monitoring the GOV.UK infrastructure

We recently started publishing Incident Reports when things go wrong on GOV.UK. These reports recognise that all technologies inevitably run into problems, and that sometimes these problems affect users. But as well as publishing these reports, we’re keen to reduce the need for them by spotting and fixing technical problems before they become serious disruptions to users.

As a Web Operations Engineer on GOV.UK, I’m keen to ensure we constantly consider how we can improve our service monitoring. Ideally, monitoring is carefully tuned to strike the right balance between alerting too much and too little. We want to be alerted only when something has the potential to affect users, so we can focus our time on those issues.

This post covers some recent steps we’ve taken to improve our monitoring techniques, and how we've tied these techniques into our GOV.UK disaster recovery strategy.

Our GOV.UK disaster recovery strategy

Our GOV.UK disaster recovery (DR) strategy replicates all data in real time over a virtual private network (VPN) to instances located in a separate datacentre.

This means that if the very worst happens and GOV.UK loses connectivity to its main datacentre, we can rest easy knowing we have the very latest data from before the disaster, as well as our nightly offsite backups. The VPN has to be stable for this replication to work reliably.

Monitoring our whole technology stack with Icinga

We use the open source monitoring tool Icinga to keep track of everything that’s going on across our technology stack, including both our VPN and DR datacentre.

To make monitoring easier, we treat our machines in the DR datacentre as location transparent. Icinga does not need to know whether a virtual machine (VM) resides in the DR datacentre or in our VPN, which saves us from maintaining a list of external IP addresses.

Icinga checks whether a machine is available by pinging it. If Icinga does not receive a ping reply within a configured timeframe, it treats the server as unavailable. When network issues arise with the VPN, Icinga deems every server on the other side of the VPN tunnel unavailable and notifies us about each of them.
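As a rough illustration of why a VPN problem is so noisy under this scheme, the sketch below models the ping check in Python. It is not Icinga's implementation: the host names, the 5 second timeout and the `ping_results` data are all made up for the example.

```python
from typing import Dict, List, Optional

# Hypothetical value; the real timeout is whatever Icinga is configured with.
PING_TIMEOUT_SECONDS = 5.0


def is_available(reply_time: Optional[float], timeout: float = PING_TIMEOUT_SECONDS) -> bool:
    """A host counts as available only if a ping reply arrived within the timeout."""
    return reply_time is not None and reply_time <= timeout


def hosts_to_alert_on(ping_results: Dict[str, Optional[float]]) -> List[str]:
    """With no dependency information, every host that fails its ping check raises an alert."""
    return sorted(host for host, reply in ping_results.items() if not is_available(reply))


# Hypothetical results while the VPN tunnel is broken (None means no reply was received).
ping_results = {
    "vpn-gateway": None,
    "dr-backend-1": None,
    "dr-backend-2": None,
    "main-frontend-1": 0.02,
}

print(hosts_to_alert_on(ping_results))
```

Every machine behind the tunnel shows up in the list alongside the VPN itself, even though there is only one underlying fault.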

We’re confident in the resiliency of our platforms and we only want our engineers disturbed when there’s a serious problem that’s within their control to remediate. Recently, when the VPN had an issue, Icinga thought there were several machines down and called to let us know. Unfortunately this was at 4am, and there was no way to resolve the issue.

To prevent such spurious alerts in the future, we decided to establish a parent/child relationship between the VPN and the hosts that depend on it. Icinga makes this relatively straightforward to set up, because it allows one defined host to be marked as dependent on another.

Configuring the parent/child relationship in Icinga

We used a configuration management tool called Puppet to enforce the desired state of our infrastructure.

Puppet let us create a resource for each host in our VPN, which Icinga refers to when monitoring. Puppet also created a number of parent resources and linked each host to its parent, so that Icinga knows a host can only be reached if its parent is available.
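Our Puppet code isn't reproduced in this post, but the shape of what gets generated can be sketched in Python. The example assumes the Nagios-style object syntax that Icinga 1.x understands, where a host's `parents` directive names the host it sits behind. The `render_host` helper, the host names and the addresses are hypothetical, and a real host definition would need further directives (or a template) that are left out here.

```python
from typing import Optional


def render_host(host_name: str, address: str, parent: Optional[str] = None) -> str:
    """Render one Nagios-style host object; `parents` is only emitted when a parent is given."""
    lines = [
        "define host {",
        f"    host_name  {host_name}",
        f"    address    {address}",
    ]
    if parent is not None:
        lines.append(f"    parents    {parent}")
    lines.append("}")
    return "\n".join(lines)


# The VPN endpoint has no parent; a machine behind the tunnel names the VPN as its parent.
# Addresses come from the 192.0.2.0/24 documentation range.
print(render_host("vpn-gateway", "192.0.2.1"))
print(render_host("dr-backend-1", "192.0.2.10", parent="vpn-gateway"))
```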

When Icinga detects that a parent host is down, all dependent hosts go into an “unreachable” state. Unreachable means Icinga is aware that it cannot contact the host, but knows not to notify PagerDuty about it.
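The distinction between “down” and “unreachable” can be sketched in a few lines of Python. This is a simplified model rather than Icinga's own logic: it assumes each host has at most one parent, and the `PARENTS` topology and host names are invented for the example.

```python
from typing import Dict, List, Optional

# Hypothetical topology: each host maps to its parent (None means no parent).
PARENTS: Dict[str, Optional[str]] = {
    "vpn-gateway": None,
    "dr-backend-1": "vpn-gateway",
    "dr-backend-2": "vpn-gateway",
}


def host_state(host: str, check_ok: Dict[str, bool]) -> str:
    """Return 'up', 'down' or 'unreachable' for a host.

    A host that fails its check is 'unreachable' if its parent is itself
    not up, and 'down' otherwise.
    """
    if check_ok[host]:
        return "up"
    parent = PARENTS[host]
    if parent is not None and host_state(parent, check_ok) != "up":
        return "unreachable"
    return "down"


def hosts_to_page(check_ok: Dict[str, bool]) -> List[str]:
    """Only hosts that are genuinely 'down' trigger a page."""
    return sorted(h for h in PARENTS if host_state(h, check_ok) == "down")


# The VPN fails, so every check across the tunnel fails too.
vpn_outage = {"vpn-gateway": False, "dr-backend-1": False, "dr-backend-2": False}
print(hosts_to_page(vpn_outage))

# The VPN is fine but one machine behind it has failed.
single_failure = {"vpn-gateway": True, "dr-backend-1": False, "dr-backend-2": True}
print(hosts_to_page(single_failure))
```

In the first case only the VPN itself is reported; in the second, the failed machine is reported because its parent is still up.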

The end result

After a bit of testing, breaking and prodding to ensure everything worked as expected, we rolled out the new monitoring configuration. Now GOV.UK monitoring works as required because Icinga can distinguish between ‘down’ and ‘unreachable’ states.

If the VPN goes down, our monitoring system lets us know. Icinga then automatically suppresses any ‘unreachable’ notifications from hosts that rely on the VPN, drastically reducing the amount of noise we hear. This means we can work more efficiently without having to sort through alerts that aren’t relevant. We can keep GOV.UK ticking over without being distracted by issues outside our control.