A sysadmin's thoughts about the Internet and technologies…

Smarter monitoring with Monit

Monit is a popular, well-known tool for monitoring “mission-critical” services and applications. It is open source, free, and relatively simple to install and use. Today, we’ll look at a lesser-known side of Monit, which I call “smart” monitoring.

This is not a “how to install” post but rather a reflection on how you can monitor a service and still miss the warning when it fails. My road here started, as usual, with some unusual problems in custom applications that must always run on a set of redundant servers. These (legacy) applications were originally watched by custom monitoring software that did only the basics, i.e. it checked whether the application existed in the process list with a simple ps and grep command. If the result contained the name of the application, the monitoring software assumed that everything was fine and that there was nothing to report. Move along… This description also fits most of the Monit installations I’ve seen in the field. I never thought about it much until I started getting support calls about these legacy applications crashing (thankfully not often) without the monitoring alerting us. What was going on here? The monitoring system was failing us, and I couldn’t let that happen. To get more power, I decided to use Monit to monitor these applications instead. Here’s a very basic monitoring script you could have with Monit:
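A minimal sketch of such a check, assuming Postfix keeps its pidfile at /var/spool/postfix/pid/master.pid and is controlled through a SysV init script. Both paths are assumptions, so adjust them to your system:

```
# Assumed paths: adjust the pidfile and the init script to your system.
check process postfix with pidfile /var/spool/postfix/pid/master.pid
    start program = "/etc/init.d/postfix start"
    stop program  = "/etc/init.d/postfix stop"
```

No explicit rule is needed for the restart: when a process check finds that the process does not exist, Monit's default action is to restart it using the start program.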

This Monit script makes sure that the Postfix process ID exists; if the process does not exist, Monit starts Postfix automatically. I call this “dumb monitoring”.

The thing is, an application or service being loaded into memory doesn’t mean it’s fully working. Let me give you an easy (and stupid) example. Let’s say we have an SMTP service, Postfix, running and working properly: it’s able to receive and deliver email without any problem. You have enabled monitoring of the service with Monit; it checks whether the process ID of Postfix/Sendmail exists on the system, and each time the check runs, the result is positive. From Monit’s point of view, the service is running fine. But alas, you have forgotten to open port 25 on your external firewall, and the SMTP service, even though it’s running properly, cannot receive, send or process any email at all. This is a “fail scenario”. You might say that the service is running, but from my point of view it is not, since no email can be processed by the SMTP service. Piece of cake: we simply need to add another line to our monitoring script, which gives us:
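A sketch of the process check extended with a port test, again assuming the pidfile /var/spool/postfix/pid/master.pid and a SysV init script (both assumptions). The alert action also assumes that a mail server and an alert recipient are already configured globally with set mailserver and set alert:

```
# Assumed paths: adjust the pidfile and the init script to your system.
check process postfix with pidfile /var/spool/postfix/pid/master.pid
    start program = "/etc/init.d/postfix start"
    stop program  = "/etc/init.d/postfix stop"
    # Connect to port 25 and check for a valid SMTP greeting.
    if failed port 25 protocol smtp then alert
```

Without a host, the connection test targets localhost, so as written it would not go through an external firewall. To test that path, the same rule would have to run from a Monit instance outside the firewall, pointing at the server's public address with the host option.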

Now this script checks that the Postfix process ID exists AND that the service accepts connections on port 25. If Postfix is not running, Monit will try to start it, and if there is no answer on port 25, Monit will send an email to let us know about it. End of story! Or is it?

Let’s push this further and say that, on this same server, the partition Postfix writes to is full, so Postfix cannot process email because it cannot write to the disk. Will the previous monitoring script warn us about this? The Postfix service is running and answers when you speak to it on port 25, but in reality it won’t be able to process email because there is no space left on the disk. You could of course have another monitoring script check the available space, but that is not the point of my example; I warned you, it’s a stupid example. Fortunately for us, Monit also supports generic protocols through two really useful commands: send and expect. We can modify our initial script to send an email through our monitored SMTP service with them. Of course, this will also spam a mailbox of our choice, so we would need to add a rule on the receiving mailbox to automatically delete the email coming from Monit and keep things clean.
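A sketch of what such a send/expect conversation might look like. The pidfile path, the init script and both email addresses are hypothetical placeholders, and the alert action assumes that set mailserver and set alert are configured globally:

```
# Assumed paths and addresses: replace them with your own.
check process postfix with pidfile /var/spool/postfix/pid/master.pid
    start program = "/etc/init.d/postfix start"
    stop program  = "/etc/init.d/postfix stop"
    if failed host 127.0.0.1 port 25 type tcp
        # Wait for the SMTP greeting, then walk through a full delivery.
        expect "^220.*"
        send "HELO localhost\r\n"
        expect "^250.*"
        send "MAIL FROM:<monit@example.com>\r\n"
        expect "^250.*"
        send "RCPT TO:<monit-probe@example.com>\r\n"
        expect "^250.*"
        send "DATA\r\n"
        expect "^354.*"
        # Headers, blank line, body, then the lone dot that ends DATA.
        send "Subject: Monit probe\r\n\r\nTest message from Monit.\r\n.\r\n"
        # The server answers 250 only once it has accepted the message.
        expect "^250.*"
    then alert
```

If any expect does not match, for example because the server answers with an error code instead of 250, the connection test fails and Monit raises the alert.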

The last expect “^250.*” in this script is the confirmation that the Postfix service has accepted the email and put it in its queue for delivery. This way, the “problem” of the full partition would be fully tested. Now, a small warning: do not use this script on a serious production system; it is only here for the purposes of this post. Even I do not use it on my monitored production systems.

Oh no you can’t see everything Bill…

The point here is that it’s easy to miss a failing service with basic monitoring. When you implement a system like this, you need to put some thought into it to make sure you cover a wide range of possible outcomes when a system fails. Of course, it’s almost impossible to cover every possible cause, because our imagination isn’t big enough to think of them all. Only experience can show us more possibilities.

System administrator and consultant for more than 14 years. I've used computers since I was a kid. I've specialized in networking, servers and the inner workings of the Internet. My blog is meant as a personal point of view on some technologies, the web, sciences and the Internet in general. If you are wondering why this website is in French and English, that's because I'm a French Canadian who also speaks English and sometimes, when I'm drunk, dabbles in Spanish.
