OK: Monitoring enabled

To run a website like Booking.com you need a lot
of servers. Not just webservers, but also mail servers, database servers, name
servers, proxy servers and a lot more. How do we make sure they are all
operational and doing what we want them to do? We monitor them.

What? Where?

The first decision we must make is what to monitor on each server
and where. Running out of disk space
is an issue we want to watch out for on every machine, but we don't really
care about replication delay on anything but database servers in a
replication chain. So each server has a set of core checks, on top of which we
add checks based on the role of the machine.
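As a rough sketch of that idea (the role names and check lists here are invented for illustration, not our actual configuration), composing a machine's check list could look like:

```python
# Hypothetical sketch: a server's checks are the core set that every
# machine gets, plus the checks attached to each of its roles.

CORE_CHECKS = ["disk_space", "load", "ssh"]

ROLE_CHECKS = {
    "database": ["replication_delay", "slow_queries"],
    "webserver": ["http_response", "worker_count"],
    "mailserver": ["mail_queue_length"],
}

def checks_for(roles):
    """Return the full list of checks for a server with the given roles."""
    checks = list(CORE_CHECKS)
    for role in roles:
        checks.extend(ROLE_CHECKS.get(role, []))
    return checks

print(checks_for(["database"]))
```

A plain webserver would never get the replication delay check this way, while a database server gets it on top of the core set.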

Who? Where?

After we've decided what to check for, we obviously want to inform our staff of
any failures. We use multiple methods for this: email, XMPP (Jabber) and SMS
text messages to the two lucky sysadmins who carry the on-call phone 24/7.
To make this easier on our staff, we've deployed
a "follow the sun" model.
That means our team in Singapore kicks the day off with monitoring. When they are done,
the European team takes over. By the time the (West Coast) US team walks into
their office, they take over, handing it back to Singapore when they walk out.

This only leaves us with weekends and public holidays. Since we're an
Amsterdam-based company, most of the staff works in European offices, so we
distribute the weekend shifts amongst these sysadmins.

How?

Hardware

We use Nagios for our monitoring. But since our system is
rather big and complex (did I mention we monitor roughly 5,000 physical
servers?), we can't have all monitoring done by one machine. That would be
silly, for what would happen if that machine went down? Quis custodiet ipsos
custodes? We always run our monitoring servers in pairs so that if one goes
down, the other can alert us to that event.

But a single pair would only harden our setup a little: it wouldn't be powerful
enough to monitor all our machines at our scale of operations. Besides the
scaling factor, we would also run into a lot of firewall issues. So we have a
set of monitoring servers in each of our network segments, including a set
outside of our own network.

Software

As mentioned, we rely on Nagios. But we also rely on Puppet for distributing
the checks around our machines, and on a thing we like to call ServerDB. Some
people might call this a CMDB (we don't).

In ServerDB we store information about all our physical assets, and it is here
that we assign roles to them. A role simply defines what the server is
supposed to do. Is it routing emails? Is it serving XML traffic? Is it
sending out faxes? All our servers are grouped by at least one role (multiple
roles are possible).

ServerDB is also the place where we define what state the machine is
supposed to be in. Is it in production? Or is it currently being set up? Is it
in maintenance mode? Based on this state, we decide whether we want to wake
sysadmins up when something goes wrong.
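A minimal sketch of that decision (the state names and the rule itself are illustrative assumptions, not our actual logic):

```python
# Hypothetical sketch: only critical problems on machines in production
# should wake a sysadmin; any other state means the noise is expected.
PAGEABLE_STATES = {"production"}

def should_page(state, severity):
    """Return True when a failure warrants an SMS to the on-call sysadmin."""
    return severity == "critical" and state in PAGEABLE_STATES

should_page("production", "critical")   # True: wake someone up
should_page("maintenance", "critical")  # False: expected downtime
```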

Now that we have groups of servers (based on roles), we can add checks to them.
Since we use plain old Nagios, the check scripts can be written in basically any
language (we try to stick to Perl and Python) and can check basically anything
you can dream of, as long as it produces a value that can be compared against a
threshold.
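For example, a minimal disk space plugin in Python might look like this. It is a sketch, not one of our actual plugins; only the exit code convention (0 = OK, 1 = WARNING, 2 = CRITICAL) is standard Nagios, and the thresholds are made up:

```python
#!/usr/bin/env python
"""Minimal Nagios-style disk space check (illustrative sketch).

Nagios reads the exit code: 0 = OK, 1 = WARNING, 2 = CRITICAL,
and shows the single status line we print.
"""
import os
import sys

def check_disk(path="/", warn_pct=80, crit_pct=90):
    """Return (exit_code, status_line) for disk usage of `path`."""
    st = os.statvfs(path)
    used_pct = 100.0 * (1 - st.f_bavail / float(st.f_blocks))
    if used_pct >= crit_pct:
        return 2, "CRITICAL - %s is %.0f%% full" % (path, used_pct)
    if used_pct >= warn_pct:
        return 1, "WARNING - %s is %.0f%% full" % (path, used_pct)
    return 0, "OK - %s is %.0f%% full" % (path, used_pct)

if __name__ == "__main__":
    code, message = check_disk()
    print(message)
    sys.exit(code)
```

Swap the body of `check_disk` for a replication delay query or a mail queue count and the surrounding plumbing stays the same; that is what makes Nagios checks so easy to write.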

All that is left now is to instruct the monitoring servers to fetch this
information and use it. A few cronjobs query our calendar (who is supposed to
be "on call"?) and query ServerDB (what should we check, and where?).
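The last step of such a cronjob, turning the ServerDB answer into something Nagios can load, might be sketched like this. The hostname and field layout are invented; only the `define service` block syntax is standard Nagios:

```python
# Hypothetical sketch: turn a ServerDB answer (a host and its checks)
# into Nagios "define service" blocks for a monitoring server to load.

def nagios_services(host, checks):
    """Render one Nagios service definition per check for `host`."""
    blocks = []
    for check in checks:
        blocks.append(
            "define service {\n"
            "    host_name            %s\n"
            "    service_description  %s\n"
            "    check_command        check_%s\n"
            "}\n" % (host, check, check)
        )
    return "\n".join(blocks)

print(nagios_services("db-1.example.com", ["disk_space", "replication_delay"]))
```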

Guys! We're almost running out of space!! ZOMG!

When a machine misbehaves badly or simply has a component failure, the system
administrators are informed by SMS. But we don't want to pay that much
attention to every warning we get. A warning might tell us that a disk is
filling up, but there's still enough space left for now. These warnings (as
well as the critical errors) are usually best monitored through the Nagios
web interface. But as I mentioned, we have quite a few monitoring hosts. To give
us a centralised view of our entire setup, we use
Multisite, which also allows
us to create custom dashboards for different teams and views.

The future of monitoring

As with most of our sub-systems, scaling the business up always makes it fun
and interesting. With monitoring this is no different. In the past we've used
and tested other tools and Nagios forks but so far we haven't been able to find
anything that works for us. For one, we would like to use our data a little
smarter. We already throw a huge amount of data at our
Graphite servers so why check again with
Nagios?

Another problem we have to tackle is making failover to the secondary
monitoring server (remember we operate them in pairs?) automatic.

And by the time we're all done with that, the next issue will probably present
itself. We just need to make sure we read the warnings before they become
critical.