Tuesday, May 12, 2009

JMX for distributed application monitoring and rule based auto healing

One of our applications is distributed across multiple machines spanning different networks. We needed a framework to configure and monitor it from a central location. While we were at it, we also dreamt of having some amount of rule-based auto healing, essentially tuning through reconfiguration.

Most of our applications were in Java, though there were external native components like a database, memcached, httpd and some Python components. We also needed to monitor operating system statistics: memory, disk space and CPU (user, system and wait times, plus context switches). JMX with a few additional components fitted the bill nicely and satisfied most of our requirements.

The configuration of each process was also published as an MBean. This allowed us to view the configuration each process was running with, and to modify it at run time. Notification handlers in the application would dynamically reconfigure the application when the configuration changed.
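To make the idea concrete, here is a minimal sketch of a configuration MBean with a notification handler, using only the standard javax.management API. The attribute PoolSize, the ObjectName example.app:type=WorkerConfig and all class names are hypothetical and chosen purely for illustration; the AtomicInteger stands in for whatever the application would really resize.

```java
import java.lang.management.ManagementFactory;
import java.util.concurrent.atomic.AtomicInteger;
import javax.management.Attribute;
import javax.management.AttributeChangeNotification;
import javax.management.MBeanServer;
import javax.management.NotificationBroadcasterSupport;
import javax.management.NotificationListener;
import javax.management.ObjectName;

public class ConfigMBeanSketch {

    /** Standard MBean interface; JMX derives a read/write "PoolSize" attribute from it. */
    public interface WorkerConfigMBean {
        int getPoolSize();
        void setPoolSize(int poolSize);
    }

    /** Holds the configuration and emits a notification whenever it changes. */
    public static class WorkerConfig extends NotificationBroadcasterSupport
            implements WorkerConfigMBean {
        private int poolSize;
        private long sequence;

        public WorkerConfig(int poolSize) { this.poolSize = poolSize; }

        @Override
        public synchronized int getPoolSize() { return poolSize; }

        @Override
        public synchronized void setPoolSize(int newSize) {
            int oldSize = poolSize;
            poolSize = newSize;
            sendNotification(new AttributeChangeNotification(
                    this, ++sequence, System.currentTimeMillis(),
                    "PoolSize changed", "PoolSize", "int", oldSize, newSize));
        }
    }

    /**
     * Registers the config MBean, changes it through the MBean server (as a
     * management console would), and returns the pool size the handler applied.
     */
    public static int reconfigureViaJmx(MBeanServer server, int initial, int updated)
            throws Exception {
        ObjectName name = new ObjectName("example.app:type=WorkerConfig"); // hypothetical name
        AtomicInteger appliedPoolSize = new AtomicInteger(initial); // stand-in for a real pool

        // Notification handler: the application reacts to the change at run time.
        NotificationListener handler = (notification, handback) -> {
            if (notification instanceof AttributeChangeNotification) {
                AttributeChangeNotification change = (AttributeChangeNotification) notification;
                if ("PoolSize".equals(change.getAttributeName())) {
                    appliedPoolSize.set((Integer) change.getNewValue());
                }
            }
        };

        server.registerMBean(new WorkerConfig(initial), name);
        try {
            server.addNotificationListener(name, handler, null, null);
            server.setAttribute(name, new Attribute("PoolSize", updated));
            return appliedPoolSize.get();
        } finally {
            server.unregisterMBean(name);
        }
    }

    public static void main(String[] args) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        System.out.println("applied pool size: " + reconfigureViaJmx(server, 4, 8));
    }
}
```

The handler here runs in the thread that calls the setter, which is the default behaviour of NotificationBroadcasterSupport when no executor is supplied; a real application may prefer to hand the reconfiguration off to its own thread.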

Each machine in our network ran a process that used Hyperic Sigar to collect operating system statistics and publish them as MBeans. Sigar is a cross-platform library that uses JNI underneath to get the job done.
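The shape of such a statistics MBean can be sketched roughly as follows. This is not Sigar code: to keep the example self-contained it substitutes the JDK's own OperatingSystemMXBean and File.getUsableSpace() for the Sigar calls, which expose far fewer numbers (no wait times or context switches, for instance). The HostStats names and the ObjectName are hypothetical.

```java
import java.io.File;
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class HostStatsSketch {

    /** Read-only attributes that a monitoring node can poll. */
    public interface HostStatsMBean {
        int getAvailableProcessors();
        double getSystemLoadAverage(); // negative when the platform cannot report it
        long getUsableDiskBytes();
    }

    /** Stand-in collector; a Sigar-backed version would implement the same interface. */
    public static class HostStats implements HostStatsMBean {
        private final OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        private final File volume;

        public HostStats(File volume) { this.volume = volume; }

        @Override
        public int getAvailableProcessors() { return os.getAvailableProcessors(); }

        @Override
        public double getSystemLoadAverage() { return os.getSystemLoadAverage(); }

        @Override
        public long getUsableDiskBytes() { return volume.getUsableSpace(); }
    }

    /** Publishes the statistics MBean under a hypothetical name and returns that name. */
    public static ObjectName publish(MBeanServer server) throws Exception {
        ObjectName name = new ObjectName("example.monitoring:type=HostStats");
        server.registerMBean(new HostStats(new File(".")), name);
        return name;
    }

    public static void main(String[] args) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        ObjectName name = publish(server);
        System.out.println("processors:   " + server.getAttribute(name, "AvailableProcessors"));
        System.out.println("load average: " + server.getAttribute(name, "SystemLoadAverage"));
        System.out.println("usable disk:  " + server.getAttribute(name, "UsableDiskBytes"));
    }
}
```

Each getter is evaluated when the attribute is read, so the monitoring side always sees a fresh value without the publishing process having to run its own polling loop.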

Instead of having a central monitoring node looking at all the machines, we broke it up into an expandable, fractal kind of structure. Each cluster of machines, all lying in the same network, was assigned a JMX monitoring node. Such a monitoring node would know about and connect to all the processes running on all the machines in its cluster, including the process hosting the Sigar library. This cluster monitoring node would also embed the Drools engine to be able to run rules locally. The rules go through the MBean data and, where an error is correctable, attempt to correct it by reconfiguring the systems through their configuration MBeans. The rules also publish cluster-specific compact (summary) statistics into a summary MBean in the JMX monitoring node.
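One pass of such a monitoring node might look like the sketch below. It is only an illustration under several assumptions: the rule is written as a plain Java condition rather than in Drools, the worker processes are simulated by MBeans in a local MBean server, and every name (Worker, ClusterSummary, QueueDepth, PoolSize, the thresholds and the ObjectNames) is hypothetical.

```java
import java.util.Set;
import javax.management.Attribute;
import javax.management.MBeanServer;
import javax.management.MBeanServerConnection;
import javax.management.MBeanServerFactory;
import javax.management.ObjectName;

public class ClusterRuleSketch {

    /** A monitored process: one statistic plus one tunable configuration value. */
    public interface WorkerMBean {
        int getQueueDepth();
        int getPoolSize();
        void setPoolSize(int poolSize);
    }

    public static class Worker implements WorkerMBean {
        private final int queueDepth;
        private volatile int poolSize;

        public Worker(int queueDepth, int poolSize) {
            this.queueDepth = queueDepth;
            this.poolSize = poolSize;
        }

        @Override public int getQueueDepth() { return queueDepth; }
        @Override public int getPoolSize() { return poolSize; }
        @Override public void setPoolSize(int poolSize) { this.poolSize = poolSize; }
    }

    /** Compact per-cluster statistics, read by the next monitoring level. */
    public interface ClusterSummaryMBean {
        int getWorkerCount();
        int getMaxQueueDepth();
        int getReconfiguredCount();
    }

    public static class ClusterSummary implements ClusterSummaryMBean {
        private int workerCount;
        private int maxQueueDepth;
        private int reconfiguredCount;

        synchronized void update(int workers, int maxDepth, int reconfigured) {
            workerCount = workers;
            maxQueueDepth = maxDepth;
            reconfiguredCount = reconfigured;
        }

        @Override public synchronized int getWorkerCount() { return workerCount; }
        @Override public synchronized int getMaxQueueDepth() { return maxQueueDepth; }
        @Override public synchronized int getReconfiguredCount() { return reconfiguredCount; }
    }

    static final int QUEUE_LIMIT = 100; // hypothetical threshold
    static final int MAX_POOL = 32;     // hypothetical upper bound

    /** One rule pass: read the statistics, heal what is correctable, publish a summary. */
    public static void evaluate(MBeanServerConnection conn, ClusterSummary summary)
            throws Exception {
        Set<ObjectName> workers =
                conn.queryNames(new ObjectName("example.app:type=Worker,*"), null);
        int maxDepth = 0;
        int reconfigured = 0;
        for (ObjectName worker : workers) {
            int depth = (Integer) conn.getAttribute(worker, "QueueDepth");
            int pool = (Integer) conn.getAttribute(worker, "PoolSize");
            maxDepth = Math.max(maxDepth, depth);
            // Rule: a backed-up queue is corrected by doubling the pool, up to MAX_POOL.
            if (depth > QUEUE_LIMIT && pool < MAX_POOL) {
                conn.setAttribute(worker,
                        new Attribute("PoolSize", Math.min(pool * 2, MAX_POOL)));
                reconfigured++;
            }
        }
        summary.update(workers.size(), maxDepth, reconfigured);
    }

    /** Simulates a two-process cluster in a private MBean server and runs one pass. */
    public static ClusterSummary runOnce() throws Exception {
        MBeanServer server = MBeanServerFactory.newMBeanServer();
        server.registerMBean(new Worker(250, 4), new ObjectName("example.app:type=Worker,name=a"));
        server.registerMBean(new Worker(10, 4), new ObjectName("example.app:type=Worker,name=b"));
        ClusterSummary summary = new ClusterSummary();
        server.registerMBean(summary, new ObjectName("example.monitoring:type=ClusterSummary"));
        evaluate(server, summary);
        return summary;
    }

    public static void main(String[] args) throws Exception {
        ClusterSummary summary = runOnce();
        System.out.println("workers=" + summary.getWorkerCount()
                + " maxQueueDepth=" + summary.getMaxQueueDepth()
                + " reconfigured=" + summary.getReconfiguredCount());
    }
}
```

Because evaluate only depends on MBeanServerConnection, the same pass could in principle be run against a remote process by obtaining the connection from JMXConnectorFactory.connect(...).getMBeanServerConnection() instead of a local server.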

A cluster of clusters would similarly be monitored and summarized by another, larger (and remote) JMX monitoring node. Since each of the smaller clusters publishes only summarized data, accessing it remotely is not a major problem. Such clusters would typically be drawn along the administrative boundaries of system administrators. This model also lends itself well to federated application administration at different granularities.

The network and monitoring statistics get reduced and compacted at each level until all data converges at the central NOC (network operations center). The user interface at the central NOC displays the status of the next-level clusters and any alarms therein. Each cluster can be drilled into in stages, down to the last process.