Saturday, 24 September 2011

I need a replacement for Nagios. It's been around for over ten years and is possibly the most widely deployed system monitoring software. But it doesn't tick every box for me. There are people who will harp on about the power of Nagios and its massive user base with billions of plugins - but this doesn't cut it, because there are inherent weaknesses that are difficult to overcome by hacking it into some kind of Frankenstein-ish beast. Here are my main gripes:

File-based configuration doesn't scale well - Managing Nagios config files by hand drives me mad. It's not that the files are particularly hard to read, it's just that as the number of checks grows, it becomes difficult and time-consuming to manage these files. As a result, some checks just don't get created. Yes, there are tools with web interfaces that help, but I've never found these to be as flexible as I would like.

Lack of integrated graphing - Decent graphs save so much time when I'm trying to investigate a problem. There are some good graphing tools like Cacti, and some monitoring systems try to integrate them, but the result is less than optimal. At some point I just know that, despite the interface, these are separate components working in different ways that aren't really aware of each other.

No integrated support for industry standards like SNMP or IPMI - You can get a lot of info via SNMP and it's especially useful on hardware (where you may not be able to install a Nagios agent). Using an array of scripts is not a flexible or efficient way of collecting data on thousands of items.

Need better checks with less effort - While you can script the collection of virtually any piece of data, I don't really want to spend that much of my life creating scripts. I would also like more complex checks that can more precisely define a condition that you want to be notified of. For example, I don't really care if a web server is receiving 50 requests a second. But I do care if it's receiving 50 requests a second and other web servers in the load-balanced cluster are only getting 10 hits per second.
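To make the load-balancer example concrete, here is a minimal Python sketch of that kind of relative check. It is not how Nagios or any other tool implements it; the function name `imbalanced_servers`, the `ratio` parameter and the host names are all made up for illustration.

```python
from statistics import mean


def imbalanced_servers(requests_per_sec, ratio=3.0):
    """Return servers whose request rate exceeds `ratio` times the mean of their peers.

    `requests_per_sec` maps a (hypothetical) server name to its current rate.
    """
    flagged = []
    for name, rate in requests_per_sec.items():
        # Compare each server against the other members of the cluster only.
        peers = [r for other, r in requests_per_sec.items() if other != name]
        if not peers:
            continue  # a single server has nothing to be compared with
        if rate > ratio * mean(peers):
            flagged.append(name)
    return sorted(flagged)


# 50 requests/s on its own is fine; 50 while the peers see 10 is not.
cluster = {"web1": 50, "web2": 10, "web3": 10}
print(imbalanced_servers(cluster))
```

The point is that the absolute number never appears in the condition: a cluster where every server handles 50 requests a second flags nothing.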

So my key requirements for a new monitoring system were as follows:

Easier configuration

Integrated graphing

Built in support for SNMP (IPMI is a bonus)

The ability to make more complex checks

While commercial solutions are an option, they're not cheap - they can easily pass the £100k mark for a few hundred servers, so I started looking at open source solutions first. A friend of a friend recommended Zabbix. I had a look at the feature set: web-based configuration, integrated graphing, SNMP and native agents, complex checks - looks like we're in business. A year on and overall, I'm happy to recommend it. Here are my likes and dislikes:

Like

Easy configuration of checks - Monitoring new items is easy once you understand how Zabbix works, and making changes to many machines is a doddle.

Integrated graphing - You can create graphs easily and mix data from different hosts, which is great. Better yet is the ability to generate a graph of any numeric data collected on demand, so you don't even need to configure anything.

Complex checks - Zabbix has a range of mathematical functions for building an expression that evaluates whether to trigger an alert. So let's say I'm checking the load on a system. I can set up a trigger to alert if the load is five times larger than the average load of the last hour, and stays at this level for at least ten minutes. This is a very powerful capability.
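The logic of that load trigger can be sketched in plain Python. This is deliberately not Zabbix trigger syntax, just an illustration of the condition under some assumptions of mine: samples arrive as `(timestamp_in_seconds, load)` pairs, the "last hour" baseline is the hour before the ten-minute window, and the names `should_alert`, `factor`, `baseline_secs` and `sustain_secs` are hypothetical.

```python
from statistics import mean


def should_alert(samples, now, factor=5.0, baseline_secs=3600, sustain_secs=600):
    """Alert if every sample in the last `sustain_secs` exceeds `factor` times
    the average of the `baseline_secs` before that window.

    Assumes samples are collected regularly, so the sustain window is
    actually covered by data.
    """
    sustain_start = now - sustain_secs
    baseline_start = sustain_start - baseline_secs
    baseline = [v for t, v in samples if baseline_start <= t < sustain_start]
    recent = [v for t, v in samples if sustain_start <= t <= now]
    if not baseline or not recent:
        return False  # not enough data to judge either way
    threshold = factor * mean(baseline)
    # "Stays at this level" means even the lowest recent sample is over the threshold.
    return min(recent) > threshold


# One sample a minute: an hour at load 1.0, then ten minutes at load 6.0.
history = [(t, 1.0) for t in range(0, 3600, 60)]
history += [(t, 6.0) for t in range(3600, 4201, 60)]
print(should_alert(history, now=4200))
```

A single spike in the last minute would not fire, because the minimum over the ten-minute window would still be at the normal level.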

Maps - I haven't really used these in anger yet, but you can create maps of selected hosts, host groups or individual checks. This is really useful for showing dependencies between systems visually, especially when you have a chain of connected processes and you're trying to diagnose a problem.

IT Services - Zabbix 'IT Services' are used to show a high level view of an IT service which may consist of multiple hosts and checks. When an alert is triggered, this view shows how an entire service may be affected. I deal with systems with many interconnected components and when one fails it's hard to remember all the other things that might be affected. IT services are good for problem diagnosis and showing a high level view of what client services are affected.
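The idea behind that roll-up view can be sketched as a small dependency tree. This is only an illustration of the concept, not Zabbix's data model: the `services` layout, the check names and the function `affected_services` are all invented for the example.

```python
def affected_services(services, failed_checks):
    """Return the names of services that depend, directly or through
    sub-services, on at least one failed check."""
    failed = set(failed_checks)

    def is_affected(name, seen):
        if name in seen:  # guard against accidental cycles in the tree
            return False
        seen = seen | {name}
        node = services[name]
        if failed & set(node.get("checks", [])):
            return True
        return any(is_affected(child, seen) for child in node.get("children", []))

    return sorted(name for name in services if is_affected(name, frozenset()))


# Hypothetical service tree: two client-facing services share one database.
services = {
    "Online shop": {"children": ["Web tier", "Database"]},
    "Web tier": {"checks": ["web1:http", "web2:http"]},
    "Database": {"checks": ["db1:mysql"]},
    "Reporting": {"children": ["Database"]},
}
print(affected_services(services, ["db1:mysql"]))
```

With the database check failing, both the shop and reporting show up as affected, which is exactly the thing that is hard to keep in your head at 3am.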

Zabbix has a business model to make money - Even though I haven't paid for support yet (my rollout is far from complete), I'm reassured to see a strategy that can work in the business world. The truth is that while some developers will work for free, it's very difficult to compete with commercial competitors because most developers like to pay their rent as well. Zabbix has a fast development cycle, commercial training and a conference a few days away. There's also a book on Zabbix which I found really useful.

Dislike

Logical, but unergonomic interface - It seems to take one click too many to get where you want. The interface needs to better reflect the workflow of an engineer investigating a problem, although improvements have been made with each release.

If you're considering moving up from Nagios, Zabbix is well worth a look.