Monitoring Our Network Infrastructure With Sensu

Cumulus Networks provides a service known as the Cumulus Workbench. The workbench is built from physical switches, virtual machines running in Google Compute Engine (GCE), virtual machines running on our own hardware, and bare metal servers. It allows prospective customers and partners to prototype network topologies, test out different configuration management tools, and get a general feeling for open networking. We also use the workbench for our boot camp classes.

Right now, we are completely rewriting the workbench backend! Many of the changes that we’re making are to the technical plumbing, so they’re behind the scenes. Monitoring the various workbench components is critical, as any downtime can easily affect a prospective sale or even an in-progress training session. Since our infrastructure is a mix of virtual machines, physical servers and switches, I needed one place to help me monitor the health of the entire system.

We use Puppet for automating our internal infrastructure. I chose Puppet because most of my operational experience is with it, but I firmly believe that the best automation tool is the one that you choose to use! If you want more details on how we use Puppet for automation, I will be speaking in depth about it in this webinar. I paired Puppet with Hiera, a key/value data store, which lets us keep much of our site-specific data, such as passwords, in one contained location. Watch this space: we will soon announce the public release of our new workbench code!

I decided against using Icinga or Nagios. In my experience, Nagios's performance suffers greatly after a few hundred nodes, which makes it hard to run the monitoring server on the smallest instance possible. Icinga is a popular fork of the original Nagios codebase, and much of its development effort is focused on performance and speed enhancements. Icinga has an excellent community and great performance, but I wanted to go for a more radical change. So I chose Sensu, which is normally a server monitoring tool, and am using it to monitor Cumulus Linux.

Sensu uses a distributed model, which scales well over a large number of nodes. Its check system is also very flexible, thanks to handlers. A handler is an action that is performed when a check reports a problem. For example, if the web server is down, a handler could perform a service restart. It doesn't solve the root problem, but when it's late at night and I want the problem solved, the less work I have to do, the better!
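To make the handler idea concrete, here is a minimal sketch in Python of a restart handler like the one described above. It assumes the handler is run as a Sensu pipe handler that receives the event as JSON on standard input, and that a check status of 2 means critical. The check name `check_apache`, the service name `apache2`, and the `service ... restart` command are hypothetical placeholders, not part of our actual workbench code.

```python
import json
import subprocess

# Hypothetical mapping from check names to the service each one guards.
SERVICE_FOR_CHECK = {"check_apache": "apache2"}

CRITICAL = 2  # exit status a check uses to report a critical problem


def restart_command(event):
    """Return the restart command for a failing check, or None to do nothing."""
    check = event.get("check", {})
    service = SERVICE_FOR_CHECK.get(check.get("name"))
    # Skip unknown checks, non-critical results, and "resolve" events.
    if service is None:
        return None
    if check.get("status") != CRITICAL or event.get("action") == "resolve":
        return None
    # Assumption: the host restarts services with the "service" command.
    return ["service", service, "restart"]


def handle(stream, run=subprocess.call):
    """Read one JSON event from stream and restart the mapped service if needed."""
    command = restart_command(json.load(stream))
    if command is not None:
        run(command)
    return command
```

Installed as a handler, the script would finish by calling `handle(sys.stdin)` (after importing `sys`). Keeping the decision in `restart_command` separate from the side effect makes it easy to try the logic against a sample event without restarting anything.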

Using Puppet modules to deploy Sensu means I spend less time configuring the monitoring server itself and more time configuring checks.
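For readers who have not written a check before, here is a rough sketch in Python of what one could look like. It relies on the Nagios-style convention that Sensu checks follow: print one line of output and exit with 0 for OK, 1 for warning, or 2 for critical. The disk-usage subject, the function names, and the 80/90 percent thresholds are illustrative assumptions, not one of our production checks.

```python
import shutil

OK, WARNING, CRITICAL = 0, 1, 2  # exit statuses a check reports


def disk_used_percent(path="/"):
    """Return the percentage of the filesystem at path that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def evaluate(used_percent, warn=80.0, crit=90.0):
    """Map a usage percentage to an (exit status, message) pair."""
    # Thresholds are hypothetical defaults; tune them per host.
    if used_percent >= crit:
        return CRITICAL, "CRITICAL: %.1f%% used" % used_percent
    if used_percent >= warn:
        return WARNING, "WARNING: %.1f%% used" % used_percent
    return OK, "OK: %.1f%% used" % used_percent
```

Run as a check, the script would call `evaluate(disk_used_percent())`, print the message, and pass the status to `sys.exit`. Any handlers attached to the check then decide what to do with a non-zero result.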

We want to hear about your monitoring stories too! What are you using and why? What have you monitored? Do you have any custom checks that would help others? Join or start a conversation in the Cumulus Networks Community!