Enterprise-Wide Network Management with OpenNMS

Network management is hard. Because the average end user considers "the network" to be anything on the other side of the keyboard, a good network manager needs to understand networking hardware, such as routers and switches, as well as server hardware, operating systems, and applications.

The work involved in keeping a network running increases exponentially with its size. While it is relatively easy to manage 10 devices, it is much harder to manage 100 or 1,000. Enterprises often consist of tens of thousands of devices, and in the past only expensive commercial applications were up to the task. From its conception, OpenNMS has been an enterprise-grade network management solution developed under the open source model. Its aim is to provide a viable alternative to products such as Hewlett-Packard's OpenView, CA Unicenter, and Micromuse Netcool.

Why Open Source?

Because the scope of network management is so large, it is impossible for one product to do it all. Usually an enterprise must buy several applications and then pay a consulting company to glue them all together with scripts, web pages, and the like. The problem arises when one or more of those applications changes and the customizations no longer work. Rarely is there any code management for the glue, and companies must either hold off on upgrades or repeatedly pay for consulting.

Enter OpenNMS, a platform that allows users to add network management features over time. Anyone with a useful piece of management code is welcome to submit it to the project; the application design itself allows for expansion.

Its open source nature is also useful when problems arise. Anyone who has spent time managing a network knows that vendors don't always follow specifications properly. In one example, a switch vendor was sending an improperly formatted Simple Network Management Protocol (SNMP) trap. A zero had been left off an Object ID (OID), and this caused OpenNMS to discard the trap. Someone opened a support ticket with the vendor, including the Ethereal trace and the RFC showing the required information--but when the vendor did not respond, the OpenNMS developers wrote a quick modification to work around the problem. A problem that would have taken a commercial application vendor weeks to address took only hours to fix in OpenNMS.

Two versions of OpenNMS are always available--a production (stable) release and a development (unstable) release. Because networks differ so much, having OpenNMS freely available means that it gets tested in more scenarios than a commercial vendor could undertake in a lab. By the time the development release is ready to become production, the application has become very robust.

What Does OpenNMS Actually Do?

Currently OpenNMS focuses on three areas: service polling, data collection, and event management.

When OpenNMS began, the management buzzword was SLA (service-level agreement). People wanted to understand how available their network services were. The accepted method was to take data from their management system, such as whether the device responded to ping and perhaps some information gathered from SNMP, and then try to perform an SLA calculation. It was kind of like trying to determine the fuel economy of your car by measuring the temperature of the exhaust manifold and the speed of your fuel pump.

OpenNMS took a simpler and more accurate route. The application simulates a user, so if OpenNMS is testing the availability of a web server, it retrieves a web page. For a DNS server, it performs a lookup. In terms of the car analogy, it measures how far the car went and how much fuel it used. Because OpenNMS is free, it is practical to run multiple instances in remote locations to measure service levels. This means that the Geneva office can measure its service level from Geneva's point of view, not that of the data center in New York.
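
To make the idea concrete, here is a minimal Python sketch of a user-simulating poll. It is not OpenNMS code. The function names (http_check, dns_check, poll) are hypothetical, and any URL or hostname passed to them is a placeholder. Each check does what a user would do, and the poller records only whether the attempt succeeded and how long it took.

```python
import socket
import time
import urllib.request


def http_check(url, timeout=5.0):
    """Retrieve a web page the way a browser would; raises OSError on failure."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        response.read(1)  # make sure the server actually sends content


def dns_check(hostname):
    """Perform a name lookup; raises OSError (socket.gaierror) on failure.

    Assumption: this goes through the local resolver. Testing one specific
    DNS server would need a dedicated DNS client library.
    """
    socket.getaddrinfo(hostname, None)


def poll(check, *args):
    """Run one check and return (available, elapsed_seconds, reason)."""
    start = time.monotonic()
    try:
        check(*args)
    except OSError as exc:  # URLError, timeouts, and gaierror are all OSError
        return False, time.monotonic() - start, str(exc)
    return True, time.monotonic() - start, ""
```

Calling poll(http_check, url) or poll(dns_check, hostname) on a schedule yields a series of up/down observations taken from the user's side of the network, which is the raw material for a service-level figure.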

Out of the box, OpenNMS monitors more than 25 services, including HTTP and HTTPS, DNS and DHCP, and even Citrix and RADIUS. Even a simple ping is treated as a service (ICMP). By default, it tests each service every 5 minutes, which is similar to other network management products. OpenNMS also has an interesting feature called a downtime model, in which it can change its polling frequency when it detects an outage.

Suppose you need the ability to measure a service level of 99.99 percent availability over a month. This equates to about 4 minutes and 20 seconds of allowable downtime. However, if your software polls the service only once every 5 minutes, even a single outage will register as an SLA violation (because the shortest outage measurable will be 5 minutes long). OpenNMS addresses this by temporarily shortening the polling interval to 30 seconds when it detects an outage. After 5 minutes it goes back to a 5-minute cycle, and it backs off even further the longer the outage lasts. (All of this is configurable.) Thus, OpenNMS could detect multiple 30-second outages that would fall within the SLA for the month.
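
The arithmetic and the polling schedule can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the OpenNMS implementation: the function and table names are hypothetical, the month is assumed to be 30 days, and the third row of the table (a 10-minute interval after 12 hours) is an invented example of "backing off further."

```python
NORMAL_INTERVAL = 300  # seconds; the default 5-minute polling cycle

# (outage age in seconds at which the row starts, polling interval in seconds)
DOWNTIME_MODEL = [
    (0, 30),         # first 5 minutes of an outage: poll every 30 seconds
    (300, 300),      # after 5 minutes: back to the 5-minute cycle
    (43200, 600),    # hypothetical further back-off after 12 hours
]


def allowed_downtime_seconds(availability_percent, days=30):
    """Downtime budget for a period, given a target availability."""
    period = days * 24 * 60 * 60
    return period * (1 - availability_percent / 100)


def polling_interval(outage_age_seconds):
    """Interval to use now; pass None when the service is up."""
    if outage_age_seconds is None:
        return NORMAL_INTERVAL
    interval = NORMAL_INTERVAL
    for begins_at, step in DOWNTIME_MODEL:
        if outage_age_seconds >= begins_at:
            interval = step  # the last matching row wins
    return interval


budget = allowed_downtime_seconds(99.99)  # roughly 259 seconds
outages_within_budget = int(budget // 30)  # 30-second outages that still fit
```

With a fixed 5-minute cycle, the smallest outage the poller can record is 300 seconds, which already exceeds the budget. With the 30-second interval in effect during the first minutes of an outage, several short outages can be recorded in a month without breaching the target.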