I spent a frustrating 10 days evaluating server/network monitoring packages like Nagios, Cacti, OpenNMS, Zabbix, Hyperic and Zenoss. All of them were complex, over-featured, cumbersome and not well organized. I am so happy with FreeNATS! It is simple, intuitive, does everything I need and the support is great. Thank you Dave for such a superb package!

Did I pay you to say that? And there was me thinking the UI was creaky and totally counter-intuitive!! Hmm... perhaps you should never ever upgrade to version 1, where I have attempted (badly) to slightly swishy-up the interface. It also has the (admittedly overcomplex) node-side agents and CLI discovery/import tools, but you don't have to use these.

Thanks for your praise and all the bugs you've reported!

In an ongoing pathetic attempt to garner posts, anything non-supporty is worthy of a prize. I'll search the archives and find something as excellent as the previous high-value prizes.

Actually, I was quite looking forward to the node agents (in push mode; I have no desire whatsoever to install Apache on a bunch of clients). Is there any other reason not to upgrade to 1.0 from 0.04?

BTW - if you're taking votes on new features, at the top of my list would be authenticated SMTP. My cell carrier rejects messages from my local computer and my ISP requires authenticated SMTP. Sending an alert to my cell is really important.

I know what you mean about Apache - as well as having to install it, half the time clever security messes with your results, so you can only see processes visible to the httpd user etc! The push seems to work pretty well.

Dave, I decided to give 1.x a try (a new install rather than an upgrade). It's working beautifully. SMTP auth works fine and node-side push works great as well.

I did clarify one point of confusion on the wiki when configuring a node-side script, but otherwise a smooth install! Actually I think I may like the new 1.x interface better than the 0.04 version.

One nice feature would be a mechanism to deal with false positives. I have one server that I'm monitoring with a TCP connection which gives a false positive due to some unknown 'hiccup'. Perhaps a mechanism that requires 2 failures within a specific time before generating an alert?
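To make the "2 failures within a specific time" idea concrete, here is a rough Python sketch. It is not a FreeNATS feature; the `FailureWindow` class, its `threshold` and `window_s` parameters and the 300 second default are all made up for illustration.

```python
from collections import deque


class FailureWindow:
    """Hypothetical helper: alert only after `threshold` failures within `window_s` seconds."""

    def __init__(self, threshold=2, window_s=300.0):
        self.threshold = threshold
        self.window_s = window_s
        self._failures = deque()  # timestamps of recent failures, oldest first

    def record_failure(self, now):
        """Record a failure at time `now` (seconds); return True if an alert should fire."""
        self._failures.append(now)
        # Forget failures that have fallen out of the time window.
        while self._failures and now - self._failures[0] > self.window_s:
            self._failures.popleft()
        return len(self._failures) >= self.threshold
```

A single hiccup never reaches the threshold, so it shows up as a failure but produces no alert; two failures close together do.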

I (and some others) have had problems with false positives before. To work around it I implemented the per-test retries and delay options. These seemed to solve most problems with "internet randomness". Another possible cause is if you're running lots and lots of nodes then it might be a load distribution problem.

With regard to the TCP port test - this does support retries but does not currently support a timeout. Helpfully in the code I find:

I had totally forgotten I'd never implemented timeout for the TCP test. It's one of the last hard-coded tests so I'm going to rewrite it as a module and put in timeout.

Generally though, with troublesome (usually far distant) web servers, putting a retry of 2 or 3 will solve it. All it means in reality is that you wait x test cycles longer to get a real failure report.
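For illustration, a TCP port check with both retries and a timeout might look roughly like the Python sketch below. This is not the FreeNATS code (which is not shown in this thread); `tcp_check` and its parameter names are invented for the example, and only the standard `socket.create_connection` call is real. The `connect` argument exists purely so the function can be tried without a network.

```python
import socket


def tcp_check(host, port, attempts=2, timeout=30.0, connect=socket.create_connection):
    """Return True if a TCP connection opens within `attempts` tries.

    Hypothetical sketch: each try waits at most `timeout` seconds.
    """
    for _ in range(attempts):
        try:
            conn = connect((host, port), timeout=timeout)
        except OSError:
            # Refused, unreachable or timed out: fall through to the next attempt.
            continue
        conn.close()
        return True
    return False
```

With `attempts=2`, one random hiccup followed by a successful connect still counts as a pass; only two failures in a row are reported as a failure.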

Another option... although putting in some sort of delayed-retry is not something I plan to do, it would be perfectly possible to code this as event-based logic, for example:

If node fails then switch on alerting (as it's off to start with, no alert is generated on this first cycle)
If node passes then close the alert (if open) and switch off alerting

In this way you would "see" the test fail in history but only receive an alert once it had failed for a second time.
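The two rules above can be sketched in Python as follows. This only models the logic; it does not use FreeNATS's actual event system, and `Node`, `handle_result` and the return values are hypothetical names chosen for the example.

```python
class Node:
    """Hypothetical per-node state: alerting starts switched off."""

    def __init__(self):
        self.alerting = False
        self.alert_open = False


def handle_result(node, passed):
    """Apply the two rules to one test cycle; return 'alert', 'closed' or None."""
    if passed:
        # Rule 2: close any open alert and switch alerting off again.
        was_open = node.alert_open
        node.alert_open = False
        node.alerting = False
        return "closed" if was_open else None
    if node.alerting and not node.alert_open:
        # Alerting was switched on by an earlier failure, so this one alerts.
        node.alert_open = True
        return "alert"
    # Rule 1: first failure only switches alerting on, no alert yet.
    node.alerting = True
    return None
```

Feeding it fail, fail, pass gives no alert on the first cycle, an alert on the second, and a close on the third. A lone failure followed by a pass produces nothing at all.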

I could also easily whip up a better version of the TCP test module (once I've done it) that could support funkier stuff like wait for x then retry in the actual test parameters (rather than being a system wide feature).

Thanks for any wiki edits! Any attempt to clarify my deranged mumblings will be much appreciated I'm sure.

Ok with 2 attempts and 30s timeout: the test will try and open the connection with a timeout of 30s i.e. it will fail after 30s. If the test fails it will be tried once more (a total of 2 attempts) again with 30s timeout.

If the http test is failing with no response (unlike for example httpd server just not running which most likely gives an immediate failure to connect) and times out twice then this would give a failure response after around 60s.

For a server in the US that also has major load around 1am our time as stuff runs, I actually ended up at 3 attempts and a 120s timeout. Although this does mean the test sequence will take 6 minutes to fail (and send me an alert), I have found it works well.
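The arithmetic behind those figures, as a small Python sketch. `worst_case_seconds` is a hypothetical helper, and the optional `delay_s` argument assumes any per-test delay is simply added between attempts, which this thread does not confirm.

```python
def worst_case_seconds(attempts, timeout_s, delay_s=0):
    """Longest time before a failure is reported, if every attempt times out.

    Assumption: attempts run one after another, with `delay_s` between them.
    """
    return attempts * timeout_s + (attempts - 1) * delay_s


print(worst_case_seconds(2, 30))        # 2 attempts x 30s  = 60 seconds
print(worst_case_seconds(3, 120) / 60)  # 3 attempts x 120s = 6.0 minutes
```

A server that refuses the connection outright fails much faster than this, since no attempt has to wait for the timeout.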

There is no UDP test right now but I'll add it to the list of things to look at.