5 Answers

I'm not super familiar with Zenoss, but when I used Nagios for this sort of thing we'd make the C/C++ process listen on a socket and write a custom Nagios plugin that the process would hand diagnostic and status information to.

The first step is to choose the library you want to use to make your process listen. Something like C++ Socket Library will do for that. Nothing complicated there, just make the process listen.

Then you have to define the response your process will send to a particular stimulus. In practice (at least with Nagios) this meant defining the 'service' and then sending the process the signal that corresponded to that service. The simplest thing you can do is create a 'process ping' that just checks whether you can successfully connect to the running process. If you can, then the custom Nagios plugin knows that the process is at least still alive.
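To make the plugin side of the 'process ping' a bit more concrete, here is a minimal sketch of how the result of a connection attempt could be turned into a Nagios result. The exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) follow the usual Nagios plugin convention; the `PONG` reply, the `CheckResult` struct and the `evaluate_ping` function are names I made up for illustration, and the actual socket connection is left out.

```cpp
#include <string>

// Exit codes a Nagios plugin is expected to return.
enum class NagiosStatus : int { Ok = 0, Warning = 1, Critical = 2, Unknown = 3 };

// What the plugin prints (message) and exits with (status).
struct CheckResult {
    NagiosStatus status;
    std::string message;
};

// Hypothetical helper: 'connected' says whether the plugin could connect to
// the process, 'reply' is whatever the process sent back (assumed protocol:
// the process answers "PONG" when it is healthy).
CheckResult evaluate_ping(bool connected, const std::string& reply) {
    if (!connected) {
        return {NagiosStatus::Critical, "PROCESS CRITICAL - could not connect"};
    }
    if (reply == "PONG") {
        return {NagiosStatus::Ok, "PROCESS OK - process replied to ping"};
    }
    return {NagiosStatus::Warning, "PROCESS WARNING - unexpected reply: " + reply};
}
```

The plugin's `main` would print `message` on one line and return `static_cast<int>(status)` as its exit code.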

There's much more sophisticated stuff you can do, but the idea is simple enough. You can write your own little library of process-listening code encapsulated in objects and pull it into your custom C++ code in a standardized manner whenever you build one (or all) of your executables.
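As a rough idea of what such a reusable object might look like, the sketch below maps service names to callbacks that produce the reply. `StatusRegistry`, `register_service` and `handle` are hypothetical names, not part of Nagios or any socket library; the listening socket would simply pass each received request string to `handle` and write the result back.

```cpp
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical registry: each executable registers the services it wants to
// expose, and the socket code dispatches incoming requests through it.
class StatusRegistry {
public:
    using Handler = std::function<std::string()>;

    // Register (or replace) the handler for a named service.
    void register_service(const std::string& name, Handler handler) {
        handlers_[name] = std::move(handler);
    }

    // Produce the reply for a request; unknown services get an error string.
    std::string handle(const std::string& request) const {
        auto it = handlers_.find(request);
        if (it == handlers_.end()) {
            return "ERR unknown service: " + request;
        }
        return it->second();
    }

private:
    std::map<std::string, Handler> handlers_;
};
```

An executable could then call `registry.register_service("PING", [] { return std::string("PONG"); });` at startup and add more services (queue depth, last error, and so on) as needed.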

I am not familiar with the products you name, but on Windows I monitor memory consumption using perfmon. There are some special counters, such as non-paged pool faults, that show you whether your program contains memory leaks. The leaks may be small and therefore take a long time to show up, but in my opinion this is a simple way to check.

On Windows you can do a lot using perfmon, even remotely.
Or you can use WMI to attach to the same counters and automate actions based on them.

I'm picking up on this because we recently went through the very same process as you.
We were looking for a lightweight, non-blocking, open-source solution that allows exposing metrics from within C/C++ services and then monitoring them remotely (we have around 3000 such services).

SNMP came closest, but integrating it into the source and the monitoring system is a pain, and it is not suitable for our real-time processes.

In the end, we decided to develop a new solution called CMX, which uses shared memory, and made it open source. You can check it out here:
www.cern.ch/cmx

I am not super familiar with the C++ side of things, but in Java we make extensive use of CodaHale metrics in conjunction with Graphite. CodaHale stores metrics per instance in the instance's local memory, then uses a background thread to flush them to a Graphite server every minute (configurable). In Graphite we can aggregate across instances as well as identify faulty instances. If you do not want the complexity of maintaining a Graphite cluster, you can use HostedGraphite.

This setup means there is no single point of failure for metrics aggregation or reporting: time-based aggregation happens on the nodes themselves, and aggregation across instances for reporting happens in a distributed Graphite cluster (or HostedGraphite).
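For the C++ side, the same pattern could be sketched roughly as follows. This is not CodaHale's API (that library is Java); `Counter` and `graphite_line` are names I invented. The only external detail relied on is Graphite's plaintext protocol, where each metric is sent as a `path value timestamp` line. The background thread and the TCP connection to the Graphite server are left out.

```cpp
#include <atomic>
#include <cstdint>
#include <string>

// Hypothetical in-process counter: hot paths call increment(), and the
// reporter thread calls drain() once per flush interval.
class Counter {
public:
    void increment(std::uint64_t n = 1) {
        value_.fetch_add(n, std::memory_order_relaxed);
    }

    // Return the count accumulated since the last drain and reset it to zero.
    std::uint64_t drain() {
        return value_.exchange(0, std::memory_order_relaxed);
    }

private:
    std::atomic<std::uint64_t> value_{0};
};

// Format one metric in Graphite's plaintext protocol: "path value timestamp\n".
// 'path' must not contain spaces; 'unix_time' is seconds since the epoch.
std::string graphite_line(const std::string& path, std::uint64_t value,
                          std::int64_t unix_time) {
    return path + " " + std::to_string(value) + " " +
           std::to_string(unix_time) + "\n";
}
```

A reporter thread would wake up every minute, call `drain()` on each counter, and write the resulting lines to the Graphite server's plaintext port.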

Lastly, you can use Seyren to provide alerts on top of the monitoring data.

If you're on Windows, you tend to write to the event log and then use WMI or a similar mechanism to read the events. If you want monitoring, you add performance monitor counters to your app and let perfmon read them. Both are system services in Windows.

On Linux, things obviously tend to be more flexible, but I've always seen Nagios-style monitors implemented, with a custom socket sending data to a Nagios-style server.

All that said, I have seen several places where SNMP is used, and frankly, I can't see a reason why you wouldn't use it, especially if you're running a completely heterogeneous environment.