£10,251 pledged of £5,000 goal (83 backers)

Introduction

Observium is an Open Source, auto-discovering network monitoring platform, written in PHP, that supports a wide range of devices and operating systems. We collect data and status via SNMP and an optional Agent and present the information in a useful-to-engineers manner.

To keep things simple to manage, we try to discover everything that can be graphed or monitored on a device automatically. You usually don't know you need to graph something until after the event or outage! We even try to automatically discover neighbouring devices seen via CDP and LLDP tables or OSPF neighbour tables.

We believe that one of the key purposes of an NMS is to help engineers understand their networks. One of the very first features implemented was the ability to visualise a device's place in the network based on the devices it connects to directly and the devices its interfaces share subnets with.

Ports page screenshot showing which devices are directly reachable on this port using either IPv4 or IPv6.

We began the project in 2006 with the intention of replacing labour-intensive monitoring tools like MRTG, Cacti and Nagios. We started out as network engineers with very little programming or development knowledge, but a definite idea of how we wanted to present the information so that it would make our day-to-day work easier, especially during an outage.

Since then we believe we've succeeded in creating a unique network and server status visualisation platform, one that gives thousands of organisations an easy-to-manage and pleasant-to-use way of managing their network and server estates.

A device overview page for a Cisco router, showing a number of the metrics collected.

There is a limited live demonstration of the software on our Demo Site.

In the past 6 months our user base has increased dramatically. We're usually one of the first recommendations on Reddit and other sites when people ask for monitoring software suggestions. We were featured in Linux Format in 2010 and have appeared on FLOSS Weekly, a TWiT podcast.

Why Alerting?

Far and away our most requested feature is up/down and threshold alerting. It's the natural companion to the metrics and status visualisation, as we already collect all of the data we need.

Until now we've been hesitant as it's a fairly mammoth task which needs to be planned and implemented properly.

We now feel that the rest of the project has reached a state where we can turn our focus to adding a real alerting system to Observium.

We've helped a lot of people kick their Cacti habit; now we want to help them get off the Nagios for good.

We want to design the alerting aspect of the project along the same lines as the rest of the platform. We want as much autodiscovery and sane defaults as possible, so that new devices can be monitored and alerted with the minimum of human intervention.

We all know that when a new device is deployed it can take a few weeks before anyone gets around to braving the alerting system to add it; we want to make that less tiresome. Using Observium's existing auto-discovery features, a correctly configured device would be automatically discovered and added to the alerting system.

The Plan

We've decided on some basic parameters for how an Observium-style alerting system should work:

Use the existing Observium database for host and entity information (an entity is a port, a drive, a sensor, etc)

Use the existing Observium pollers to collect metrics, rather than a separate poller

These bring up a number of challenges, somewhat unique among alerting systems:

No other alerting system treats different “types” of entity in the way we do. Most have a generic list of entities that they check; we have a dozen different database tables in different formats

We need to know what to monitor and have sane defaults. We need to monitor almost everything someone would need to monitor automatically, out of the box

We need to have some method of easily defining general conditions that apply to an entire network of similar devices

We need to be able to override these general conditions both per-device and per-entity

We plan to have each poller module build and pass an array of metrics and states to a metric/state checker which checks the values against a series of conditions for that entity generated from the database.

This checker will put alerts into a queue which will be sent out via a separately executed alert dispatcher.
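The flow described above (poller output, checker, queue, dispatcher) can be sketched roughly as follows. This is only an illustration in Python, not Observium code: the function names, the dictionary layout of metrics and conditions, and the in-memory queue are all assumptions made for the example. A real system would keep the queue somewhere persistent, since the dispatcher runs separately.

```python
from collections import deque
import operator

# Hypothetical set of comparison operators a condition may use.
OPERATORS = {">": operator.gt, ">=": operator.ge, "<": operator.lt,
             "<=": operator.le, "==": operator.eq, "!=": operator.ne}

alert_queue = deque()  # in-memory stand-in for a persistent alert queue


def check_entity(entity_type, entity_id, metrics, conditions, queue):
    """Compare one entity's polled metrics against its conditions.

    Every condition that matches appends one alert to the queue.
    Returns the number of alerts raised.
    """
    raised = 0
    for cond in conditions:
        value = metrics.get(cond["metric"])
        if value is None:
            continue  # the poller module did not supply this metric
        if OPERATORS[cond["op"]](value, cond["value"]):
            queue.append({
                "entity_type": entity_type,
                "entity_id": entity_id,
                "metric": cond["metric"],
                "observed": value,
            })
            raised += 1
    return raised


def dispatch(queue, send):
    """Drain the queue, handing each alert to a caller-supplied send function."""
    sent = 0
    while queue:
        send(queue.popleft())
        sent += 1
    return sent


# Example: what a port poller module might pass to the checker.
port_metrics = {"ifInErrors_rate": 12.0, "ifOperStatus": "down", "ifMtu": 1500}
port_conditions = [
    {"metric": "ifInErrors_rate", "op": ">", "value": 5},
    {"metric": "ifOperStatus", "op": "==", "value": "down"},
    {"metric": "ifMtu", "op": "!=", "value": 1500},
]

raised = check_entity("port", 42, port_metrics, port_conditions, alert_queue)
messages = []
sent = dispatch(alert_queue, messages.append)
```

Keeping the checker and the dispatcher apart means a slow or failing notification channel does not hold up polling.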

Overview of the alerting system.

The entity conditions will be generated from a series of database tables at poll-time, allowing the creation of checks with host, entity or global scope.

Alert conditions generation
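One way to picture global, host and entity scope, along with the per-device and per-entity overrides mentioned earlier, is a "most specific rule wins" merge. The sketch below is an assumption about how such a merge could work, not a description of the actual schema: the row fields (`scope`, `device_id`, `entity_id`, `metric`, `value`) and the function name are invented for illustration.

```python
# Higher number means more specific, so it overrides lower ones.
SCOPE_PRIORITY = {"global": 0, "host": 1, "entity": 2}


def conditions_for(rows, entity_type, device_id, entity_id):
    """Pick, per metric, the most specific condition that applies to an entity."""
    chosen = {}
    for row in rows:
        if row["entity_type"] != entity_type:
            continue
        scope = row["scope"]
        if scope == "host" and row.get("device_id") != device_id:
            continue
        if scope == "entity" and (
            row.get("device_id") != device_id or row.get("entity_id") != entity_id
        ):
            continue
        current = chosen.get(row["metric"])
        if current is None or SCOPE_PRIORITY[scope] > SCOPE_PRIORITY[current["scope"]]:
            chosen[row["metric"]] = row
    return chosen


# Hypothetical rows, as they might be read from the database at poll time.
rows = [
    {"entity_type": "port", "scope": "global", "metric": "util_pct", "value": 90},
    {"entity_type": "port", "scope": "host", "device_id": 7,
     "metric": "util_pct", "value": 80},
    {"entity_type": "port", "scope": "entity", "device_id": 7, "entity_id": 3,
     "metric": "util_pct", "value": 95},
]

uplink = conditions_for(rows, "port", device_id=7, entity_id=3)
other_port = conditions_for(rows, "port", device_id=7, entity_id=4)
other_host = conditions_for(rows, "port", device_id=8, entity_id=1)
```

Here the port with its own rule gets the 95% threshold, other ports on device 7 get 80%, and every other port falls back to the global 90%.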

The intention is to allow checks to also be limited to entities with specific attributes. For example, we could limit link-speed and duplex checks to only Ethernet interfaces.
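Limiting a check by attribute could be as simple as a filter applied before the check runs. In the sketch below the `attributes` field and the `applies_to` helper are hypothetical; `ethernetCsmacd` is the standard ifType name for Ethernet interfaces.

```python
def applies_to(check, entity):
    """Return True when the entity has every attribute the check requires."""
    required = check.get("attributes", {})
    return all(entity.get(name) == wanted for name, wanted in required.items())


# Hypothetical duplex check, limited to Ethernet interfaces.
duplex_check = {
    "metric": "ifDuplex",
    "op": "!=",
    "value": "fullDuplex",
    "attributes": {"ifType": "ethernetCsmacd"},
}

ports = [
    {"ifDescr": "GigabitEthernet0/1", "ifType": "ethernetCsmacd"},
    {"ifDescr": "Tunnel0", "ifType": "tunnel"},
    {"ifDescr": "Loopback0", "ifType": "softwareLoopback"},
]

checked = [p["ifDescr"] for p in ports if applies_to(duplex_check, p)]
```

A check with no `attributes` entry applies to every entity of its type.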

Some examples of checks for the 'port' entity type might include

Bits/sec in/out

Bits/sec in/out as percentage of interface speed

Errors/sec in/out

Unicast/nonunicast/broadcast packets in/out

ADSL SNR/noise margin/sync speed

MTU

Interface link speed

Duplex mode

Promiscuous mode
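Several of the checks listed above are rates derived from counters rather than raw values. As an illustration of the arithmetic only, the sketch below turns two octet counter readings into bits/sec and a percentage of interface speed. The function names and the wrap handling are assumptions for the example, and a negative delta is treated as a counter wrap even though in practice it could also mean the device restarted.

```python
def bits_per_second(prev_octets, curr_octets, interval_seconds, counter_bits=64):
    """Convert two octet counter readings into an average bit rate."""
    delta = curr_octets - prev_octets
    if delta < 0:
        # Assume the counter wrapped once between the two polls.
        delta += 2 ** counter_bits
    return delta * 8 / interval_seconds


def utilisation_percent(bps, if_speed_bps):
    """Express a bit rate as a percentage of link speed (None if speed unknown)."""
    if if_speed_bps <= 0:
        return None
    return 100.0 * bps / if_speed_bps


# 3,375,000,000 octets in a 300 second poll interval on a 100 Mbit/s port.
rate = bits_per_second(1_000_000, 1_000_000 + 3_375_000_000, 300)
percent = utilisation_percent(rate, 100_000_000)

# A 32-bit counter that wrapped between polls.
wrapped_rate = bits_per_second(4_294_967_000, 704, 10, counter_bits=32)
```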

We also intend to allow an alert to be delayed for a set period of time. For example, you might not want to be alerted if an interface is above 90% utilization unless it's been that way for 30 minutes.
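A delayed alert of this kind amounts to remembering when a condition first became true and only firing once it has stayed true for long enough. The class below is a minimal sketch under that assumption. Its name and fields are invented for the example, and a real implementation would need to store the "breached since" time between poller runs.

```python
class DelayedCondition:
    """Fire only after a value has stayed above a threshold for delay_seconds."""

    def __init__(self, threshold, delay_seconds):
        self.threshold = threshold
        self.delay_seconds = delay_seconds
        self.breached_since = None  # time of the first poll above the threshold

    def update(self, value, now):
        """Record one polled value; return True if an alert should be raised."""
        if value <= self.threshold:
            self.breached_since = None  # condition cleared, restart the clock
            return False
        if self.breached_since is None:
            self.breached_since = now
        return now - self.breached_since >= self.delay_seconds


# 90% utilisation sustained for 30 minutes, polled every 5 minutes.
cond = DelayedCondition(threshold=90, delay_seconds=30 * 60)
results = [cond.update(95, t) for t in range(0, 1801, 300)]

# A single dip below the threshold resets the timer.
after_dip = cond.update(50, 2100)
after_recovery = cond.update(95, 2400)
```

In this example the first six polls above 90% stay quiet and the seventh, 30 minutes after the first breach, raises the alert.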

A brief mock-up showing some example global alerts.

What are we funding?

The funding goal will pay for 3 months of our time to implement the alerting framework and configuration interface, and to hook into as many of Observium's polling modules as possible.

Until now, Observium development has been ad-hoc, squeezed in between paying jobs. To properly implement the alerting system we need to be able to spend a decent block of time working on it.

To do this, expenses will have to be reduced to the bare minimum, and only ramen and the occasional piece of roadkill will be consumed. It'll be tough, but it'll be worth it!

Stretch Goals

Now that we've been very generously funded to our initial goal amount by a single contributor in the first hour, we need to start thinking beyond the alerting system.

Other things we have on the drawing board include:

A defined plugin system to replace the existing "apps" system to allow graphing and alerting of *nix applications

A Cacti-a-like system for graphing arbitrary OIDs and data from scripts

Better data collection support for more complex device types like load balancers, storage arrays and firewalls

A daemon to proxy SNMP and Agent requests to reach hosts within private networks

Expansion of the ISP-specific feature set including VRFs and Pseudowires

Better support for routing protocol data collection from OSPF, EIGRP and IS-IS

Once the campaign is completed we'll allow backers to vote on which features they'd like to be prioritized after the alerting system.

Risks and challenges

The primary risk is that we don't manage to fully implement the alerting system within the time afforded to us by the funding.

Even in this situation, nothing will be wasted: any development work we've done will get us ever closer to having a finished, usable alerting system. We'll get there; it might just take a little longer!