Why I Enthusiastically Switched from Cacti to Zabbix for System Monitoring

Cacti is a “complete network graphing solution” according to their website. It has also been a thorn in my side for a long time.

See what I did there? Thorn… because it’s a cactus… never mind.

When Cacti was in a steady state–when I could get it to a steady state–it was good. Not great, because it took a lot of effort to get it into what I consider “steady state”, but good. The rest of the time… thorny.

There are five major things that have driven me up the wall. In no particular order:

Round Robin Database (RRD) sucks

The concept behind RRD is cool: a fixed-size, circular database (oldest data overwritten by the newest data) makes good sense for the type of data that a network graphing solution collects. In practice, using RRD means:

Another software dependency that needs to be updated, patched, and integrated in the Cacti ecosystem

Manually managing all of the RRD files that are generated for all of the data sources you’re collecting. RRD stores its data in individual files on the file system, you see, and the more data sources you collect with Cacti, the more RRD files you have to store and manage.

Your data is fragmented. The data you’ve collected from the network–and only that data–is in the RRD files. The information about the devices you’re collecting from (IP address, SNMP community string, when the poller last ran) is all stored in a separate, relational database (which is almost always MySQL). To back up your Cacti install, you need to back up both groups of data. Using different tools. And of course they have different restore methods. And you’re almost guaranteed different restore points.
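To make the circular-database idea concrete, here is a minimal Python sketch of a fixed-size store that overwrites its oldest sample. It only illustrates the concept: the class and method names are made up for this example, and real RRD files also consolidate samples into averages over time, which this sketch leaves out.

```python
from collections import deque


class RoundRobinArchive:
    """Fixed-size store: once full, each new sample evicts the oldest one."""

    def __init__(self, slots):
        # deque(maxlen=...) drops items from the left when it overflows
        self.samples = deque(maxlen=slots)

    def update(self, timestamp, value):
        self.samples.append((timestamp, value))

    def fetch(self):
        return list(self.samples)


# Three slots, four samples taken 300 seconds apart (illustrative values)
rra = RoundRobinArchive(slots=3)
for minute, load in enumerate([0.5, 0.7, 0.9, 1.1]):
    rra.update(minute * 300, load)

print(rra.fetch())
```

With three slots and four updates, the first sample is gone: only the samples taken at 300, 600, and 900 seconds remain.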

Using SNMP for data collection is tedious

And I feel like I should know. I’ve written multiple SNMP MIBs and implemented those MIBs in an SNMP agent.

Getting even simple data out of SNMP like “how full is my disk?” can differ between operating systems.

Getting simple data out of SNMP like “how many messages has my SMTP server processed today?” can be downright awful if that simple bit of data isn’t part of an existing, well-known MIB.

And using multiple operating systems means dealing with multiple SNMP agents: OpenBSD has its snmpd(8), FreeBSD has bsnmpd, and pretty much everything else uses Net-SNMP. Each one has its own nuances and features.

Code quality

Do. Not. Get. Me. Started.

The presentation and markup code is entangled with the application code. Reading the code is like trying to pull that CAT5 cable out of the drawer you shoved it in, having the RJ45 connector get caught on everything, and then just giving up and opening a new package. Except you can’t give up with Cacti. You’re forced to look at the code because…

There’s very, very little error checking! Because apparently in Cacti land everything runs perfectly all the time. I’ve had to look under the hood at Cacti far more times than I’d care to admit in order to debug some sort of problem or errant behavior.

Customization is a necessity

Due to the limitations around SNMP and getting all the data out that you want (eg, SMTP transactions), customization is almost mandatory when using Cacti.

Be prepared to deal with some or all of these in order to poll your custom data source: shell scripts, PHP scripts, XML files, data source templates, data queries, and graph templates.

It’s fragile

It feels like a house of cards.

Cacti itself is just a bunch of PHP files so of course there’s a dependency on PHP and then PHP needs a bunch of modules to do things like create images and those modules depend on shared libraries and of course don’t forget the RRD tool set oh and MySQL and those modules oh and of course there’s font libraries too which you probably want so your graphs look nice and puuurdy.

If any of those pieces is missing, or gets upgraded and doesn’t quite work right with the rest, then it’s back to peeking under the hood to try and figure out what’s going on.

The fragility was really the last straw. Every time I did an operating system upgrade and subsequently upgraded all of the third-party software on my Cacti machine, I would inevitably end up troubleshooting some aspect of the Cacti ecosystem.

And that is why I’ve switched to Zabbix. And with enthusiasm!

Introduction to Zabbix

The Zabbix website bills Zabbix as “the ultimate enterprise-level software designed for real-time monitoring of millions of metrics collected from tens of thousands of servers, virtual machines and network devices.” What I cannot convey to you, reader, from this quote is just how different the Zabbix website feels from Cacti’s site. The Zabbix site has the polish and pop of a site run by… a company. Well, turns out, that’s because it is. Zabbix is the product and Zabbix LLC is the company which leads development on the product and is in the business of selling support, training, and integration services to companies that use the product.

Now even though there is a commercial entity behind Zabbix, the product is:

Open source (GPLv2)

Developed in the open (publicly accessible bug/feature tracker and source code repository)

Free to download and use

The product is multi-platform (OpenBSD, FreeBSD, Linux, Solaris/Illumos/OmniOS, macOS, Windows), looks amazing, and was very, very easy to get to a point where I was trending data. I’m so impressed with how professional and polished this software is that I cannot believe it was a free download. It’s just that good.

[Figure: Zabbix Operating System Templates]

The Zabbix server UI makes heavy use of templates, which greatly helps in getting everything to a useful state. The software also includes a feature called “discovery” where it can automatically detect things like number of CPUs, file systems/drives, and network interfaces in a server or network device and automatically create data sources (which in Zabbix nomenclature are called “items”), triggers, and associated graphs for each one. After just a few clicks, Zabbix will discover all the items, begin polling, and set up the graphs and so on. No extra fussing around. Very clean. Very simple.

Discovery can even go a step further and automatically discover hosts on the network and add them to the inventory (and subsequently create items, graphs, etc) but I haven’t made use of that capability yet so cannot comment other than to mention that I know it exists.

Software Architecture

Zabbixes architecture…err, Zabbix’s architecture? Zabbixez? Aaaanyways, the architecture of this application has four basic building blocks:

The server

The web front end

The agents

The proxies

[Figure: Zabbix Architecture]

What Zabbix calls “the server” is actually a couple dozen processes that run on a “server” machine and take care of tasks such as talking to the relational database, executing the polls by talking to the agents/proxies, and sending notifications due to events. The server is the controller of the Zabbix environment.

The web front end is a discrete piece of the architecture and can be colocated on the same machine as the server or put somewhere else such as a dedicated web server box. If you’re thinking ahead, yes, this does strongly suggest that the Zabbix developers have thought about proper separation of presentation and application code! This separation also supports a highly scalable deployment by letting each piece run on its own dedicated hardware and/or VM instance.

The agents are the bits of software that run on the devices that you want to monitor and relay telemetry back to the server for storage in the database. This is a key enabler of why Zabbix supports so many platforms. More on this in the next section.

Lastly, the proxy is just that: a proxy between the server and a bunch of devices that are to be monitored. The use case for a proxy is if the server sits in a different security zone than the devices to be monitored. It’s easier to punch a hole between the two zones for the server and proxy to talk to each other than for the server to talk to multiple different devices in the other zone. I also understand that a proxy can help scale the Zabbix installation by offloading data collection duties from the primary server, but I don’t quite understand how well that scales since the primary server ends up having to handle all the data that the proxy collects anyways.

Regardless of how the data gets to the server, once it’s there it’s all stored in a relational database (MySQL, PostgreSQL, Oracle, DB2, or SQLite); no RRDs or separate on-disk data files. Aside from a very small .conf file on the file system, all of the configuration for the server is stored in the relational database along with all of the data that is collected from all of the hosts.

High Fidelity Data Without SNMP

A key differentiator between Zabbix and NMSes that rely solely on SNMP for data collection is that Zabbix can actually put an agent directly on the host that’s being monitored. This is a very powerful feature that enables capabilities that SNMP-based NMSes either don’t have or cannot do easily.

First off, since the agent is the one that will be doing the actual data collection from the host (ie, fetching the current CPU load, or number of logged-in users), the agent is the piece of software that has to be platform-dependent. For example, the OpenBSD agent knows how to talk to the OS and find the CPU load. Same for the Windows agent, and the Solaris agent. This relieves the Zabbix server processes of having to know all of these platform-specific mechanisms; the agent takes care of interrogating the platform using its built-in knowledge of the platform and simply communicates back to the server using a common protocol.

[Figure: Zabbix Data Exchange]

In this sort of architecture, the server software can stay very lean and relatively simple.
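The split between platform-specific agents and a common protocol can be sketched in a few lines of Python. This is my own illustration of the idea, not Zabbix's actual code or wire format: the collector functions, the "cpu.load" key, and the JSON message layout are all assumptions made up for the example, and the returned values are placeholders.

```python
import json


def openbsd_cpu_load():
    # Placeholder: a real agent would ask the OS (e.g. via sysctl)
    return 0.42


def windows_cpu_load():
    # Placeholder: a real agent would query a performance counter
    return 0.17


# Only the agent side knows which platform-specific call to make
COLLECTORS = {
    "openbsd": openbsd_cpu_load,
    "windows": windows_cpu_load,
}


def report(platform, hostname):
    """Collect the value natively, then wrap it in one common message format."""
    value = COLLECTORS[platform]()
    message = {"host": hostname, "key": "cpu.load", "value": value}
    return json.dumps(message, sort_keys=True)


print(report("openbsd", "web1"))
print(report("windows", "dc1"))
```

The server only ever sees the common message, so adding a new platform means adding a collector on the agent side and nothing on the server side.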

Another benefit this brings is that in order to extend the software to be able to gather additional or unique pieces of data that it’s not able to gather out of the box, all of the customization for collecting that data is done at the agent level. The agent natively supports running third-party commands or scripts which can be used to gather additional bits of data. Once the data is collected and fed back into the agent by the script, it’s sent back to the server over the existing communications channel using the common communications protocol; no jumping through hoops or any unnatural acts trying to figure out how to get the data off the host and into the database. This sort of thing is a very manual process with Cacti and involves some combination of SNMP MIBs, XML, and PHP scripts.
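As a rough sketch of what such a script might look like, here is a small Python example that answers the earlier “how many messages has my SMTP server processed?” question by counting delivery lines in a mail log. Everything specific here is an assumption for illustration: the "status=sent" pattern assumes Postfix-style log lines, the sample lines are invented, and the item key and script path in the comment are hypothetical. A real script would read the actual log file rather than a hard-coded sample.

```python
import re

# Hooked into the agent with a line like this in the agent config
# (hypothetical key name and script path):
#   UserParameter=smtp.sent,/usr/local/bin/smtp_sent.py

SENT = re.compile(r"status=sent")

# Invented sample lines standing in for the real mail log
SAMPLE_LOG = [
    "Mar  1 10:00:01 mx postfix/smtp[123]: A1: to=<a@example.com>, status=sent (250 ok)",
    "Mar  1 10:00:02 mx postfix/smtp[124]: B2: to=<b@example.com>, status=deferred (timeout)",
    "Mar  1 10:00:03 mx postfix/smtp[125]: C3: to=<c@example.com>, status=sent (250 ok)",
]


def count_delivered(log_lines):
    """Count log lines that look like a completed SMTP delivery."""
    return sum(1 for line in log_lines if SENT.search(line))


# The agent takes whatever the script prints as the item's value
print(count_delivered(SAMPLE_LOG))
```

The script's only job is to print one value; getting that value to the server is the agent's problem, which is the whole point.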

One thing that seems to be the norm with Zabbix is sub-five-minute data collection intervals. I’m not certain if this is due to user demand or because their software can handle it. I have a feeling it’s probably a mix of both.

Cacti installs with a five-minute default polling interval. There is support for going to sub-five-minute intervals, but it involves multiple steps to configure and requires deleting the historical data for the data source(s) you want to poll more frequently. By contrast, a default install of Zabbix is set up to gather data for most items on a 60-second interval. These intervals are configurable on a per-item basis which allows a very high degree of control.

Another unique Zabbix feature is dynamic customization of the interval based on the current date and time. A common use case is to poll something more frequently during business hours (Mon-Fri, 0900 – 1700) and less frequently outside of that range. Again, this is configurable on a per-item basis.

[Figure: Zabbix Flexible Schedule: Mon-Fri 0900 – 1700]
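The logic behind a time-based interval is simple enough to sketch in Python. This only illustrates the decision being made, not how Zabbix implements or configures it, and the two interval values are hypothetical numbers I picked for the example.

```python
from datetime import datetime, time

BUSINESS_INTERVAL = 60   # seconds; hypothetical business-hours value
DEFAULT_INTERVAL = 300   # seconds; hypothetical off-hours value


def polling_interval(now):
    """Return the polling interval that applies at the given moment."""
    is_weekday = now.weekday() < 5                      # Monday=0 ... Friday=4
    in_hours = time(9, 0) <= now.time() < time(17, 0)   # 0900 up to 1700
    if is_weekday and in_hours:
        return BUSINESS_INTERVAL
    return DEFAULT_INTERVAL


print(polling_interval(datetime(2017, 3, 1, 10, 0)))   # a Wednesday morning
print(polling_interval(datetime(2017, 3, 4, 10, 0)))   # a Saturday morning
```

The Wednesday morning lands on the 60-second interval and the Saturday morning falls back to 300 seconds.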

Security Focused

In my opinion, this is one of the key ways to measure the quality of a piece of software: how present and integrated are the security features. In the case of Zabbix, they do a few things out of the box that protect the integrity and privacy of the system and harden it against attack.

The communication between agent and server can be encrypted using public key infrastructure (PKI) (ie, certificates issued by a certificate authority) or pre-shared key (PSK). In either case, all communication between an agent and the server is secured ensuring that:

The communications messages cannot be intercepted and read

The communications messages cannot be intercepted and altered

A rogue agent cannot attempt to impersonate the real agent

I can’t speak for the Windows agent (haven’t used it and don’t plan to), but all of the Unix-like agents as well as the server give up their root permissions and live their lives as a non-privileged user. This is table stakes these days as far as security goes, but all the same, it demonstrates thoughtfulness and forethought to engineer the software such that it does not require root privileges to perform its duties. There isn’t even a privileged parent process; all processes run as the regular user _zabbix.

Community

This is another strong measure of the quality of software: the strength of the community surrounding the software often parallels the strength of the software. Zabbix appears to have a very high quality and deep community.

Besides these official resources, there are also plenty of blogs and articles across the web that talk about using Zabbix and offer instructions or help on upgrading or creating new templates.

One thing I’ve struggled to find is a web-accessible version of their source code repository. It appears they use SVN as their SCM of choice but their repo (svn://svn.zabbix.com) does not appear to have a web-based front end for browsing the repo, history, diffs, etc. Even looking to GitHub fails because what looks like the official git mirror of the SVN repo is empty.

Other than the missing web-based SVN interface, I’ve found the community to be great and it has been able to answer all my questions and more.

Wrapping Up

Let me try and sum all of this up. How does Zabbix address the pain points I had with Cacti?

No RRDs; all data is stored in a relational database.

SNMP is an optional method for data collection; the primary method is using the native Zabbix agent which is multi-platform, provides dozens of different data items, supports highly granular data collection intervals, and is easily extendable.

The code is high quality, with a clear separation between application logic and interface/UI code. The software architecture is well thought out and promotes scalability and clear separation of duties. There is a public bug and issue tracker where defects are discussed, reviewed, and resolved.

Zabbix is highly functional out of the box thanks to sensible and wide-ranging templates for various operating systems and applications. The templating system makes customization rather easy and efficient, with many customizations being possible just by amending or adding to the existing OS/app templates.

I don’t have enough stick time with Zabbix to reasonably comment on its stability or resiliency; however, based on a) my experience so far and b) everything else about Zabbix, I have little worry that the software will prove itself to be resilient across operating system and application upgrades.

Further Reading

In case you’re interested in learning more about Zabbix, these are some of the links I referred back to regularly as I was getting my feet wet.

This was a phenomenal article. It was very well written, gave an excellent overview and awesome practical experience feedback. I wish I had the time to write this up myself.

I have been testing and using zabbix the last year or two. After using nagios, solar winds, cacti, mrtg, Zenoss and so many more I didn’t think I could find the golden goose but I have with zabbix. It has yet to miss any of my wants and the recent addition of encryption prevented me from having to use stunnel in between zabbix entities. The use of proxies and encryption really make this work as a monitoring platform for cloud scale, hyper diverged, secure clouds, and RESTful API integration.

The only thing I am not a fan of in zabbix is the graph quality, however grafana has a plugin written by someone to graph zabbix item information via the zabbix server API. The plugin also supports triggers and warnings and gauges for zabbix data visualization. The plugin gives you all the sleek pretty graphing and visualization that comes with Grafana.

I would only add a couple of points.

First, I believe encryption is not supported natively between the zabbix server and the front end in the current 3.2.x, so I may still use stunnel between those two components to encrypt that traffic. Note that if your UI front end lives on the same machine as the server, this is probably not an issue.

Second, zabbix not only provides monitoring, alerting and notification but you can also implement self healing as part of zabbix ‘Actions’. Actions can be used for all kinds of things such as short term fixes for open development issues, e.g. a restart of a particular process if it crashes and doesn’t show up in the process list, or a reboot of a server via IPMI if the operating system crashes. A past vendor I used to work with could not figure out why their software rarely but consistently crashed the OS every couple of weeks on 1/50 hosts; when this happened, I wrote a zabbix action to go to the IPMI on the node and force a restart.

Third, when I had custom templates and scripts and checks all configured I was happy. Once I finally learned the true power of zabbix’ low level discovery I was amazed. I cannot express well enough how much anyone reading this or using zabbix should try implementing zabbix low level discovery prototypes. Low level discovery almost seems like a hidden feature as it’s not overly obvious how to get to item prototypes in the UI.

Discovery will let zabbix automatically create n-number of items to check, as well as automatically create triggers for alerting and actions for any n number of items discovered that match. This means for example as soon as you spin up a new AWS instance, it could automatically detect your new instance and start monitoring it.

Zabbix low level discovery can even go a step further and run those low level item discoveries across network subnets and automatically discover the hosts for you, add them to monitored hosts with all those item checks.

For example, an item prototype can be created for discovering the file systems on nodes and automatically creating items to check for free space, used space, read only flags, etc. The zabbix discovery rule is what creates an item for each file system that is found even though you only have to create one item prototype for file systems. A similar example is an index# item for each SNMP interface OID. For switches this means I only have to create one item prototype for interface.status for all 52 ports on the switch. Thank god it scales like this.

Lastly, an alternative to low level discovery, as scanning the network may be a security no-no, is that the active zabbix agents also support automatic registration. If you include the active zabbix agent config in your virtual machine image or container image, then once it starts it will automatically register the new virtual machine with zabbix and automatically start monitoring it.

Thanks a lot for the feedback and thanks very much for the detailed insight and additional information. Some thoughts on a couple things I picked out from your comment:

– Good call on encryption from the web front-end to the server: since I do run them on the same machine, I wasn’t thinking of that scenario.
– Love the mention of actions. This isn’t a feature I’ve used yet, but I can definitely see the utility of it and I’m sure it won’t take long for me to find a use for them.
– Lastly, love your comment about LLD for things like switches because it reminds me of something I was thinking about the other day:

I was mentally comparing how Cacti does graphing of items in SNMP tables vs how Zabbix does it. Cacti, like Zabbix, will collect a list of interfaces (as an example) via SNMP but unlike Zabbix, Cacti then leaves it up to the operator to run through this list–which on any sort of business class switch is going to be at least 24 items long–and choose a) which interface(s) to graph and b) which metric to graph.

So imagine your example of a switch with 48 ports + 4 uplink ports. That’s 52 ports. Now you want to graph bytes, packets, errors, and interface status. In Cacti that’s 52×4 check marks you have to select, multiple pages of interfaces you have to click through and multiple discrete graphs you have to create. All manually.

As you say, this can all be done in Zabbix with a single template that contains discovery rules for each of those metrics.

Taking it a step further, the regular expression support on discovery rules is gold. For example, on certain models of Cisco switches, there are internal and/or software interfaces that do not represent front panel ports. These interfaces might not be interesting to graph. No problem. Just attach a regex to the LLD template that ignores interfaces with names like “null0” or “StackPort1”. The resulting interfaces that are graphed are now all relevant and interesting without any noise.
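To put the switch example in concrete terms, here is a short Python sketch of what discovery plus a regex filter amounts to: take the list of interface names, drop the uninteresting ones, and stamp out one item per metric for the rest. This is only an illustration of the idea and not Zabbix's implementation; the interface naming, the metric names, and the item key format are made up for the example.

```python
import re

# Interfaces to ignore, per the "null0" / "StackPort1" example
IGNORE = re.compile(r"^(null\d+|stackport\d+)$", re.IGNORECASE)

# One prototype per metric; names are illustrative only
METRICS = ["bytes", "packets", "errors", "status"]


def discover_items(interface_names):
    """Filter out ignored interfaces, then create one item per metric."""
    kept = [name for name in interface_names if not IGNORE.match(name)]
    return [f"{metric}[{name}]" for name in kept for metric in METRICS]


# 48 access ports + 4 uplinks, plus two internal interfaces
interfaces = [f"Gi1/0/{port}" for port in range(1, 53)] + ["Null0", "StackPort1"]
items = discover_items(interfaces)

print(len(items))
```

That comes out to 52 × 4 = 208 items, the same 208 check marks from the Cacti workflow, except nobody had to click anything.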

I’m not a big Cacti fan, but this is not really how anyone using Cacti for more than a few devices deploys a switch. With the Autom8 plugin, I get consistent graphs basically instantaneously for any freshly added switch, with just the same kind of matching behaviour that you describe here. I think autom8 is now built in to Cacti 1.x

I think that the thing about proxies giving you better scaling will be to do with the number of TCP connections from the monitored devices. As this number gets into the tens-of-thousands range you start having to do exotic things to cope, even if you can handle the data that’s arriving. By putting proxies in, the number of connections will drop to more manageable levels.

There’s also potentially less overhead if the proxy sends data from multiple devices in each packet.

I have found that Zabbix is often overlooked in the monitoring community. I have used Nagios (and its spinoffs, like Centreon, Shinken) for almost ten years quite extensively. When I took that new job, I decided to give Zabbix a go, and 5 years later I am never going back. I have a medium sized install (500 hosts, 21000 items, 8000 triggers, a couple of proxies) and I have to say Zabbix has always scaled without any problem.

What is really nice is the REST API, which gives even more power to the 2017 admin^H devops, in this age of containers. Even though you can’t run any agent (well really you can but IMHO you shouldn’t) Zabbix still delivers with LLD.

Finally I guess you got fed up with SNMP, which I can understand, but it still is a very convenient way to deal with network equipment and various appliances. Zabbix handles it better than lots of other “solutions” I came across.

Your blog post, at best, seems catty, and at worst comes off as a paid marketing hit piece for Zabbix because their marketing people didn’t want to dig on another free OSS project in an A-B comparison.

Let me take a moment to address some of your points that struck a sour note.

RRD. Yeah, it uses RRD, because when Cacti was designed that was pretty much state of the art for storing time series data and making graphs from it in an open source environment. There is a MySQL database for UI controls and to fill in the gaps where RRD might be lacking. Sure, there are a lot of things that are better a decade after this design decision.

Cacti has a problem leaving RRD because a lot of IT types on Cacti are externally tapping those files for other things such as context aware alerting — something Zabbix and the rest of the new kids on the block only started thinking about when people in the local DevOps gatherings started bitching about the stupid state of CPU and disk monitoring on thinly provisioned VMs and AWS instances.

Code quality. Yeah, it’s PHP and it isn’t very good in places. Partly because it is based on a 10 year old code base, and partly because the principals were moving quickly. I don’t excuse this.

Community. What, never visited cacti.net? The forums are open for indexing so every search engine can index messages going back to 2007. They are still issuing bug fixes, the code base was moved to Git and people still using it have been doing their thing. But let’s be real here, in its current form, it isn’t going much further because *surprise*, people have figured out better ways of doing things. The resistance to running agents on hosts has all but dissolved.

SNMP as optional monitoring. Yeah, that is spiffy and shiny and new. Except people have been using Cacti with Nagios NRPE for pretty much the entire time Cacti has been around. At no time was SNMP the only way to get data into Cacti.

I would certainly hope Zabbix is an improvement over Cacti. Every monitoring tool has had ample time to figure out what is good and what is bad with Cacti and build something better.

Do I use Cacti? Not any longer, I’ve moved onto Splunk and Prometheus. But I’m not about to point to Cacti, easily a ten year old project, and sneer because it is still doing some things like it’s 2007. That forgets that even parallel or competing software projects are built on the shoulders of those who worked in the space before them. Especially in free open source software.