Redundant Monitoring and Alerting

Introduction

After over a decade of running my own servers (well, mostly virtual servers), I finally decided to look into uptime and service monitoring and alerting.

My servers host Wikis, Forums, Projects, and other personal stuff, none of which makes me any money, so I try to keep costs down.
One of the cheapest ways to run your own personal servers is on a VPS (Virtual Private Server). Usually that ends up being OpenVZ/Virtuozzo, KVM, or Xen based.
While OpenVZ is typically the cheapest, it is not a good option when you want to run all services inside Docker containers, as it is basically a container itself. That leaves Virtual Machine based solutions like KVM or Xen (or VMs on any other hypervisor).

Back in the day, when Nagios was the go-to tool, monitoring and alerting (and often even stats) were part of the same tool.
Today, in the days of cloud, Docker, and microservices, these are increasingly split into smaller units, where one tool just does the monitoring while another excels at managing alerts.

Monitoring

Any server can go down for various reasons, so it is important to monitor our servers and, more importantly, our services.
Some checks are as simple as a ping to an IP address, though most monitoring services support HTTP/HTTPS and can check the response for certain content. Still others can render the HTML in a variety of actual browsers.
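To make the HTTP content check concrete, here is a minimal sketch in Python using only the standard library. It is not how any particular monitoring service works internally; the function names `is_healthy` and `check_url`, the 10-second timeout, and the rule "2xx status plus expected text" are all assumptions chosen for illustration.

```python
import http.client
import urllib.request


def is_healthy(status: int, body: str, expected_text: str) -> bool:
    """Treat a response as healthy if it is a 2xx and contains the expected text."""
    return 200 <= status < 300 and expected_text in body


def check_url(url: str, expected_text: str, timeout: float = 10.0) -> bool:
    """Fetch `url` and report whether the service behind it looks healthy.

    Hypothetical helper: a hosted monitoring service does the same kind of
    probe, just from its own servers and on a schedule.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
            return is_healthy(response.status, body, expected_text)
    except (OSError, http.client.HTTPException):
        # Covers DNS failures, refused connections, timeouts, and HTTP error
        # statuses (urllib raises HTTPError, an OSError subclass, for 4xx/5xx).
        return False


# Example call (placeholder URL and text, replace with your own service):
# check_url("https://wiki.example.org/", "Main Page")
```

Checking for content as well as the status code catches the case where the web server is up but serves an error page or an empty default site instead of the actual application.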

Alerting

Monitoring is good to have, but we also need to be notified when it detects that a service is down. An email from the monitoring service might be enough, provided your mobile phone is set up to check email regularly and alerts don’t drown in the mass of email you already receive.
Then there are also smartphone apps that receive push notifications, pop up on screen, and possibly even have their own alert tone.

Redundant Monitoring and Alerting Setup

By combining (the free tiers of) two or more monitoring services and two or more alerting services, we can create a (free) redundant setup. I like the redundancy for multiple reasons:

free may stop: free tiers may go away at any time (maybe because the company running it folds, gets bought, … or simply no longer offers free services)

service redundancy: the monitoring or alerting service itself may have problems

geographic redundancy: different services run from different locations

timing: free monitoring is often limited in frequency, e.g. at most one check every 5 min. With two services on that schedule, the longest gap between checks falls somewhere between 2.5 min and 5 min, depending on how the two schedules are offset.
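The timing range can be worked out with a few lines of Python. This is a small sketch under simplifying assumptions: both services probe at a perfectly regular interval, and the second one runs a fixed `offset` after the first. The function name `longest_gap` is made up for this example, and real services rarely keep such a strict schedule.

```python
def longest_gap(interval: float, offset: float) -> float:
    """Longest time between consecutive checks when two services each probe
    every `interval` minutes and the second runs `offset` minutes after the first.
    """
    offset %= interval  # an offset of one full interval is the same as none
    # The two schedules split every interval into two gaps: `offset` and the rest.
    return max(offset, interval - offset)


# Try a few offsets for two services that each check every 5 minutes.
for offset in (0, 1, 2.5, 4):
    print(f"offset {offset} min -> longest gap {longest_gap(5, offset)} min")
```

If both services happen to check at the same moment (offset 0), nothing is gained and the gap stays at 5 minutes; if they are perfectly interleaved (offset 2.5), the gap halves to 2.5 minutes. Since the offset usually cannot be controlled on free tiers, the real value lands somewhere in between.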