In this blog we come to the final piece of setting up Event Grouping, namely setting the time window. The time window is important because the principle on which event grouping is based is that events that occur in the same place at the same time are likely to have the same underlying cause. If no consideration of time was made then the event grouping might easily group alarms that have different causes, and that will also be the risk if the time window is set too long. On the other hand if the time window is set too short the event grouping may fail to catch those alarms that take a little while to generate, such as performance thresholds being breached. Like ScopeID then, setting the time window requires a bit of domain knowledge.

This diagram shows how we initially saw time windows working:

The actual implementation is slightly different but it still follows the principle of the first alarm setting a time window which following alarms can extend if necessary. As the container closes when the time window expires without further extension, we call the OMNIbus field that contains the time window length QuietPeriod, the concept being that the time window closes after a quiet period of that length.

The default QuietPeriod is held as a property in the master.properties table and is set as 900 seconds at install. This may be too long on busy systems but can easily be edited. This default value is used when the QuietPeriod field in an event is set to the default of zero.

QuietPeriod can also be set on an alarm by alarm basis, and for that we need to consider how alarms are generated and published. These can be quite different. Some alarms are generated through instrumentation detecting a change in state and are then published immediately and unsolicited. The time between the cause occurring and the alarm being received in an event management application is short enough to be considered near real time. Most are probably not as fast as the reporting standards required for IEC 61850 compliance in electrical substations, which require an alarm to reach the target system within 4 milliseconds of the condition arising, but then few IT systems can literally melt things the way a substation short circuit can. However there are a large number of alarm types where arrival in OMNIbus is within seconds of the condition causing the alarm arising. Not all arrive that quickly though, and the causes of delay vary:

alarms that are not solicited but need to be retrieved by polling the event table inside a device will be delayed by the length of the polling cycle, which is typically one minute. Most OMNIbus Telco Service Monitors (TSMs) included a polling application that ran every minute

alarms may be delayed because a system goes through an automatic retry or reset process before reporting an alarm; this too may add up to a minute's delay before the alarm appears in OMNIbus

alarms that are generated by testing a sample of data, for example bit error rate test (BERT) alarms, will be delayed by the length of the sampling period

alarms that are created by external performance monitoring systems reporting that a counter, or the delta between counter values, has exceeded a threshold will be delayed by the delta period. This can be a significant delay, with polling cycles of 15 or 30 minutes or even an hour being common

QuietPeriod needs to be set so that a container does not miss these delayed alarms, though with the last case a different approach may ultimately be needed.

The other consideration is how long it might take for the impact of an alarm to be felt. Datalinks may fail, but if there is redundancy the impact may only be congestion detected an hour later. Another common delayed impact is when a server or equipment rack switches to battery power when the mains power fails. Only when the batteries drain an hour or more later will any impact be detected. However, given that in such a case the mains failure is almost certainly the underlying cause of the incident, we want the battery backup alarm inside the event container and not on its own in a different one.

A final consideration is the likelihood that a particular alarm is reporting a condition that will trigger other alarms. If the likelihood is high then the QuietPeriod should be set long enough to catch these symptom alarms, but an alarm that is clearly a symptom should not extend the time window. Nor should a Resolution event: any event where Type=2 should have a QuietPeriod of 1 (0 triggers the default of 900, remember).

Each environment will be different, but I suggest this rule of thumb (a sketch of how it might be expressed in a rules file follows the list):

Likely cause alarms have QuietPeriod = 120

Possible cause alarms have QuietPeriod = 60

Symptom alarms have QuietPeriod = 1

Resolution events have QuietPeriod = 1

Environmental alarms where the impact is likely to be delayed (e.g. power fail, fan failure) have QuietPeriod = 900 or more
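By way of illustration, the rule of thumb above might be expressed in a rules file along these lines. This is a sketch only: the $EventClass variable is a placeholder for however your rules file classifies alarms, and only the QuietPeriod field itself comes from the event grouping automation.

# Illustrative only: $EventClass is assumed to have been set earlier in the
# rules file when the alarm was classified
switch($EventClass)
{
    case "LikelyCause":
        @QuietPeriod = 120
    case "PossibleCause":
        @QuietPeriod = 60
    case "Symptom":
        @QuietPeriod = 1
    case "Resolution":
        @QuietPeriod = 1
    case "Environmental":
        @QuietPeriod = 900
    default:
        @QuietPeriod = 0    # 0 means fall back to the master.properties default
}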

In the next blog I will describe how this EventGrouping has been implemented in the OMNIbus FixPacks.

In my previous blog I described how generic event grouping has been implemented in OMNIbus. OMNIbus v8.1 Fix Pack 4 has now been released so what I described is now generally available. In this blog I want to expand on Scope ID and the considerations needed to populate that field in OMNIbus events.

The ScopeID field has to be populated, as the automations pass over events with a blank ScopeID. However ScopeID does not have to be populated on insert, so post-insert event enrichment via Netcool Impact can be implemented. While the intention was to make this event grouping an out of the box feature, selecting a suitable entry for ScopeID is not that straightforward. It will vary from technology to technology, from vendor to vendor and from customer to customer. What I propose to do here is set out some guidelines.

Some networks are nicely hierarchical and here it is relatively easy to select a ScopeID. A GSM or 3G radio access network for example has cell sites logically grouped around a base station controller, so the base station controller is a good candidate for ScopeID. If the network operator has adopted the 3GPP standard naming convention it is also easy to extract the BSC name from the node name of the alarming device. This gives some quite logical groupings:

Here the twisties are opened, but initially all this would be listed under a single line.

Now we don't know whether the two sites are connected, but there is a strong possibility that they are. Neither can we be certain that a PCM failure has caused those BERT alarms, but again there is a strong possibility that it did. The possibilities are certainly strong enough that cutting just one ticket is the sensible decision.

Networks that are not so hierarchical are a different matter though. To select a ScopeID we need to remember the principles: how far can the impact of an alarm be felt, and can we have any symptoms or knock-on alarms grouped under the same ScopeID? It's also worth considering where the common root cause of a collection of alarms might lie.

If the main source of our alarms is customer premises equipment in domestic properties, such as cable modems or smart meters, then a postal or zip code lifted from the inventory will work well. Power and communications cables tend to run along streets, and postal codes are allocated to streets, so a break in a cable will be felt along a street rather than diffusely across a district.

Enterprises might choose to scope alarms by the user service, for example grouping together the servers that deliver a service. That will allow a service manager to see whether a problem is common to a number of servers. Another approach, if the information is available, is to group by alarm class. Remember the grouping also involves a time window. A Communications group alarm with four site members may point at an underlying fiber problem, an Equipment group alarm may contain both the servers affected and the PSU that has failed, a Configuration group alarm may show the sites affected by a botched change.

Another possibility is to treat a large node as the ScopeID and sub-divide it by alarm class. The sub-division may use the SiteName field, but there is no reason why SiteName has to be a location. For large routers and similar devices that are very chatty this provides a measure of summarisation that can be very useful.

Here the summary of the lower level containers reports the normalised alarm description of the highest cause code alarm.

Setting ScopeID requires a knowledge of the domain being managed. As does setting the length of the time window, which will be the topic next time.

In my last blog describing a new approach to event grouping based on the assumption that alarms that occur at the same time in the same place probably have the same underlying cause, I ended by saying event grouping by alarm scope needed to define three things:

define what same time meant,

define what same place meant

select which of the group of alarms sharing the above two attributes is the most likely cause.

In this blog I will go into more detail about how we implemented this in the OMNIbus fix packs. However, before I do that, here is a link to the Object Server SQL file included in Fix Pack 4.

I will deal with "same place" first. As I said in an earlier blog, we started this work in response to a request from a major Asian telco. Most of their use cases revolved around correlating infrastructure and environmental alarms - power and air conditioning - to networking problems. One example was that if a cell site raised an alarm that it was switching to battery operation because of a mains power failure, and an hour later the cell site went off air, then these alarms should be linked. It is after all a very reasonable assumption that the backup batteries drained after an hour and that is the reason for the cell site being down. Similarly it was expected that a fan failure would see equipment cabinet temperatures rising, and that as a result communications links might start clocking up framing errors or bit error rate test threshold alarms. So the first step was to populate alarms with the node location's site name. Ideally SiteName would be a unique identifier and would be included among the tokens sent by the element management system, as indeed it is in most cases. Failing that, a lookup statement or an Impact policy could enrich the alarm from some inventory file.

Grouping by site name does not however bring in alarms from other sites that are related. One site might have a cabinet power failure and as a result there is a communications link failure. The site at the other end of that link will also generate alarms, for example Loss of Signal or Loss of Frame (or both) but as this is a different site those alarms won't be in the event grouping. This is where the concept of scope comes in. We define alarm scope as being the extent to which the impact of an alarm can be felt. Thus, as a comms link failure can be detected at the remote end of the link, the scope of that alarm should cover both A and B end sites. It should also cover the link itself, because that link might be based on transmission equipment that is supporting other links that are also in alarm. This means the choice for scope ID might be wider than just two sites.

Two new fields have been created in OMNIbus, @SiteName and @ScopeID. The diagram below represents how a GSM/3G network RAN might be presented in this way.

In practice then for a cellular network the ScopeID can be set to the BSC name, and in many instances that name can be extracted from the Fully Distinguished Name that is in the alarm itself. A typical 3GPP standards compliant DN might read: "GSM/BSC-43141/BCF-11/BTS-11/TRX-10". It's a simple task to extract "BSC-43141" out of there and use that to populate ScopeID.
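As a minimal rules file sketch of that extraction - the $dn token name is an assumption, so substitute whichever element your probe presents the distinguished name in:

# Assumes the 3GPP distinguished name arrives in a token called $dn,
# e.g. "GSM/BSC-43141/BCF-11/BTS-11/TRX-10"
if (regmatch($dn, "BSC-[0-9]+"))
{
    # extract() returns the part of the string matched by the bracketed
    # section of the regular expression
    @ScopeID = extract($dn, "(BSC-[0-9]+)")
}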

In real cellular networks though there might be multiple management domains using different inventories and naming conventions so we have provided a third level of scoping called ScopeAlias so that the same scope called by different names can be linked together, as in this example:

Scope Aliasing is implemented using a custom table in OMNIbus to link ScopeIDs.
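The fix pack ships its own definition in the SQL file linked above, but purely as an illustration of the idea, a linking table might look something like this (the database, table and column names here are assumptions, not the shipped schema):

-- Illustrative only: maps each locally used ScopeID to a common ScopeAlias
create table alerts.scope_alias persistent
(
    ScopeID    varchar(255) primary key,
    ScopeAlias varchar(255)
);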

These examples are for GSM/3G networks. Other domains require a different approach. I will come back to determining ScopeID and Scope Alias in a later blog.

The next step is to define "same time". To do that let's reflect on how events and alarms are generated and sent. A network card may apparently send a "link down" event instantaneously after a cable is pulled, but in reality what has happened is that the card has detected that certain control signals - which may be as simple as a voltage on a pin - are no longer present. That will take a millisecond or two to detect. At the other end of the link the network card has all its physical level indicators still working, but the card has detected that the logical framing of the carrier signal is no longer present. It may be many seconds before the automatic resynchronisation processes have been tried and failed, and thus many seconds before the alarm is generated. Often though the physical problem is more of a dirty joint than a clean break, and in those circumstances the distant end may detect the problem through increased errors in a background error rate test, a test that takes minutes to run. Or the errors may be detected by a performance management application collecting SNMP metrics every fifteen minutes. "Same time" therefore has to be a time window rather than a fixed time.

The way this is implemented in this Event Grouping automation is to define a quiet period. This is a period after the first alarm when new alarms can be added to the container. If that period is quiet, i.e. no new alarms are added, then the container is closed. Quiet period can be defined by alarm type and if no quiet period is defined then a default is held as a property. This is set to fifteen minutes in the initial installation but most users will want to reduce that.

If a new alarm comes in during the quiet period it can extend the time window if the alarm requires that.

As Quiet Period can be defined in the rules file it makes sense to set this according to the type of alarm. This can be fairly long: an alarm reporting that a device has switched to battery power should be prepared to keep the container open for the hour or two it takes to drain the battery, because that is how long it will take for other effects to be noticed. On the other hand low priority symptom alarms should not extend the quiet period and can be given a QuietPeriod of 1 second - not zero, as that triggers the default to be applied.

The remaining question is which alarm is pointing out the underlying cause of the problem. In a previous blog I wrote about different techniques used historically and the upsides and downsides. What we are doing here is a simplified codebook approach. Rather than score all the possible alarms against each other or create loads of cause and effect relationships we have simply given each alarm a weighting, and as this is an integer determining which is the highest weighted alarm in a group is easy and efficient. And rather than do this for potentially hundreds of alarm types we have defined sixteen generic alarm types and in the rules file we map the vendor alarm codes to these. Our initial normalised alarm class list is as follows:

Normalised Alarm Code | Description | Cause | Impact

Physical
160 | Control Shut Down | 160 | 10
150 | Power Loss | 150 | 20
140 | Catastrophic Failure | 140 | 30
130 | General Failure | 130 | 40

Sensor
120 | Environmental Warning, inc Door Open and similar alarms | 120 | 50
110 | Performance Failure | 110 | 60
100 | Performance Degradation | 100 | 70
90 | Performance Warning | 90 | 80

Operational
80 | Inoperative State, Change of State | 80 | 120
70 | Heartbeat Loss | 70 | 90
60 | Control Path Loss | 60 | 100
50 | Operational Warning, inc running on backup | 50 | 110

Functional
40 | Non-Functional | 40 | 160
30 | Missing Component | 30 | 150
20 | Workarounds in execution | 20 | 140
0 | Informational events | 0 | 130

The recommended way of implementing the necessary rules file changes is as follows:
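The shipped include files are the definitive reference for this, but the general pattern - a lookup table defined at the top of the existing rules file and an include statement at the bottom - can be sketched as follows. The file paths, the $EventCode token and the CauseWeight/ImpactWeight field names are all assumptions for illustration.

# Near the top of the existing rules file: a lookup table keyed on the
# vendor's event code (path and file name are illustrative)
table NormalisedClass = "/opt/netcool/rules/includes/normalised_class.lookup"
default = {"0", "0"}

# ... existing vendor-specific processing ...

# Near the bottom of the existing rules file
include "/opt/netcool/rules/includes/event_grouping.include"

# Conceptually, inside the include: map the vendor code to the normalised
# class weights used by the grouping automations
[$CauseWeight, $ImpactWeight] = lookup($EventCode, NormalisedClass)
@CauseWeight = int($CauseWeight)
@ImpactWeight = int($ImpactWeight)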

It's a common situation. A pile of alarms and events hit the event management console and the network operator is faced with the questions "what happened here?" and "what do I need to fix?". That information is probably in those alarms but will the operator have time to dig it out before the next wave strikes? In telcos and major enterprises that is rarely the case. Over the years a number of event management strategies have been developed to direct operators to those alarms that have the answers to those two questions.

The simplest approach was just to filter out the crap. If an event is purely informational then discard it; if it is reporting a failed component then let it through but hide it from operators unless it is of the highest severity. Back in the mid-noughties I was speaking with a tier 1 telco's NOC manager who said their policy was that first line operators dealt with critical alarms first, then they started on the major alarms and finally the minor alarms - only the operators never finished dealing with the critical alarms. The biggest weakness of this approach is that alarm severity is defined by a software engineer working for the equipment vendor, and severity is only relevant in terms of that piece of equipment. A dev manager at a major NEP also told me once that budget was more important than technology when it came to instrumenting equipment, which is another factor to consider. I think we can conclude that simple filtering has had its day.

Another approach has been to try and identify cause and effect, what many people call root cause analysis. These days I consider that any sales person who utters the acronym RCA in front of a customer has committed a hanging offence, as that term now means so many different things to different people. At the very least we should qualify what sort of analysis we mean when we say the dreaded TLA "RCA". There are three basic types.

The first is Topology RCA, which is what Tivoli's Network Manager does. The principle is that if you can discover how a network is put together then you can determine which alarms are related to each other, and from that which is the probable root cause alarm. Initially Network Manager worked solely with connectivity alarms - unsuccessful ping tests - and the analysis was to find the common point of failure, but the basic principle has been extended to other classes of alarms. This approach does require a topology model though, and these days the models can be very complex, with different layers that need stitching together.

Closely related to Topology RCA is Containment RCA. This is what the AdvCorr automations included in OMNIbus do. I've blogged about AdvCorr before ( What's in a name - correlation by exploiting naming conventions ) but to summarise here, AdvCorr applies two tests: could the alarm be a root cause or is it always just a symptom, and does the reporting node contain other nodes in alarm, or is it a member of a container headed by another node in alarm as the parent? The containment is usually physical - ports on a card in a rack - but can be logical, such as the GSM network example of my earlier blog. AdvCorr is neat, but alarms do need to hold the information needed to determine containment and that is not always the case.

The third approach is to attempt to tabulate cause and effect. If a cellular base station controller reports a control channel failure to a cell together with an uplink failure to the same cell, we can say that the latter caused the former. This can be done for other pairs or trios of alarms, even for larger groupings. An entire "codebook" of alarm causation can be assembled. It will however be a very big book given the numbers of alarm types defined these days. There are over 6000 defined alarms for Alcatel 5620 SAM for example, and Huawei alarms are also listed in their thousands. Because of this, codebook systems have fallen out of favour.

All these approaches aim for a high degree of certainty, but is that possible in cases where not all devices are fully instrumented, where not all alarms occur in the same managed domains, and where the delivery of some alarms may be impeded by the very alarm condition they are trying to report?

Late last year we were presented with an alarm cause analysis problem by a large Asian telco. The customer was leaning towards a codebook type solution, but we felt that while that was feasible for the three or four scenarios cited, it was not something that would scale, and that adding more and more scenarios would make the system unwieldy. We therefore decided to go back to first principles.

We started by deciding that our approach would be more one of guidance than direction, and that we would only alter alarms by adding some extra fields that we would use for setting up event relationships in the WebGUI Event Viewer. Adding extra fields also meant we would not risk any automations a customer had already implemented. We also decided that amendments to rules files would be in the form of include files, so that the changes to existing rules files would be limited to one or two include statements at the bottom plus, possibly, a lookup table definition at the top. Customers' modifications to standard rules files would be safe.

We then made the following assumption - alarms that occur at the same time in the same place probably have the same underlying cause. If that was our assumption then we had to do three things, define what same time meant, define what same place meant and which of the group of alarms sharing those attributes is the most likely cause.

Same time could not mean exactly the same time because some alarms are not generated until a condition has existed for a while. An x in y policy might be in place for example, or more commonly an alarm is the result of a regular monitoring period exceeding a threshold for errors. Same time therefore means within the same time window.

Same place could mean within the same node, or within the same site but that would not cover communications alarms, most commonly when both ends of a circuit report a problem on the link connecting them. Therefore we developed the concept of alarm scope. The scope of an alarm extends as far as the impact of that alarm might be felt, for example the scope of that communications alarm would be the two sites connected together and the link between them.

One ticket per incident is a Holy Grail in Network Operations. Being able to correlate a stream of alarms sharing the same underlying cause into a single line entry and using that to drive the problem resolution systems is something telcos and major enterprises have looked for for years.

Consider the economics of event generation and management. If it costs $100 to instrument a router or access point and that device now sends on average one event every ten minutes, then the cost per event, averaged over a year, is about a fifth of a cent.

If those events go to an event management system costing $200,000 amortised over two years, with $20,000 pa staff costs and handling 15,000 events per day, then each event costs about two cents to handle.

However each event that causes a ticket to be opened starts costing serious money. If we estimate an average of 30 minutes work by a level one tech on $9 an hour before being handed on to a level 2 tech on $18 an hour who spends two hours on the problem then we are talking of $40 per incident. That's acceptable if the ticket cites a real problem, however a ticket that is opened in duplicate or that requires no action can still cost $5 to process. A dozen of those a day and we could be talking of $20,000 a year spent on essentially useless work. Or to put it another way, one network technician wasting their time all year.

Clearly, cutting down on unnecessary ticket cutting is a money saver.

It's also a time saver, and that might mean network operators have time to look beyond the red flood of critical events. Event severity is normally set by the equipment vendor and they set severity according to the demands of the hardware. However what is critical to the workings of one box may not be critical to the end user. Not only that, but the underlying cause of a critical event may be reported as a lower priority event. Look at these events from a real 3G network (node names are anonymised):

Both critical events are indeed critical (for those unfamiliar with cellular networks, "BCCH Missing" means the control channel to a cell is quiet, i.e. the cell is dead), but neither is reporting the underlying cause. The underlying cause is in each case the severity 4 alarm listed above it.

I've blogged about event correlation before, but now that a new automation for event grouping has been included with OMNIbus - in Fix Pack 3 for v8.1, with an improved version in FP4 out shortly - this is a good time to revisit the topic. Next time I'll cover historical techniques and introduce this new event grouping feature.

Network status visualisation has come a long way since the days when an event list was the only way to view alarms. With the imminent arrival of new versions of OMNIbus and WebGUI the quality of visualisation available is about to take another big step forward and this brings with it a new problem. As operators' displays can now be highly sophisticated visualisations it becomes more important to demonstrate new offerings with potential users and test whether they meet requirements and are usable.

One of the first probes most of us come across, at least if we do a formal training course, is the humble simnet probe. This probe may serve no practical purpose in a real world operational set up, but it is a useful tool for testing and demonstrating visualisations. Or at least it would be if it could be configured with more realistic alarms than the rather 1990s datacentre ones it comes with. This blog describes how I configured the simnet probe to simulate smartmeter alarms.

A little bit of simnet probe 101 here. The probe automatically generates four alarm types; the nodes, and the alarm type the probe will generate for each of them, are defined in a definitions file, simnet.def by default. The probe reads through this definitions file and generates events, which are then manipulated by a rules file in the usual way before being inserted into the Object Server alerts.status table.

The four alarm types are Link Down/Up, Node offline/online, Diskspace Alert and Port Failure. However for our purposes it is important to distinguish how each of these alarm types differs in execution. The table below sets things out:

vtype | Alarm type | Alarm execution
0 | Link Down/Up | An alarm is generated followed some time later by a clear alarm. If the rules file is set up correctly this will demonstrate the generic clear automation as well as put alarms into the system.
1 | Node offline/online | An alarm is generated followed shortly by a second alarm. Typically the second alarm is not treated as a clear.
2 | Diskspace Alert | An alarm is created with a random integer field of between 75 and 100, which is used to simulate % disk space utilisation.
3 | Port Failure | An alarm is generated with a random integer field of between 1 and 8, which is used to simulate port failures on the sort of semi-intelligent switches in use c1995.

The simnet probe generates tokens which are then assigned to Object Server fields by the rules file. This is where we can step in and change those assignments to achieve more realistic alarms for our demonstrations.

I wanted to create a simulation of a summary console for an electrical supplier whose customers have been equipped with smartmeters. I had the documentation from one of our partners to guide me to typical alarm types. From that I could select suitable alarms and map them to the ones the simnet probe was going to provide; a rules file sketch of that mapping follows the list of system alarms below.

vtype | Alarm type | Notes
0 | Primary Power Fail/Primary Power Restored | An alarm sent when the main electrical power has failed. Typically the fail alarm is a last gasp alarm and may or may not get through, so it is not unusual for there to be restore events with no corresponding fail alarm.
1 | Security alarms, e.g. tamper, inversion | Other alarms that may clear.
2 | Communications delays, power levels | Alarms that require a metric in them, for example round trip delays.
3 | System Alarms | Alarms that might be sent when the smartmeter's internal software detects a problem. Eight possible alarms were selected:

Clock Error Detected

Checksum Error

Config Error Detected

Fatal Error

Low Battery Detected

Temperature Threshold Breached

Demand Overload Detected

Measurement Error Detected
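A hedged sketch of how that remapping might look in the rules file. The token names ($SimnetVtype, $SimnetState, $SimnetValue) and the lookup file path are placeholders rather than the probe's real element names, so check the simnet probe documentation and adjust accordingly.

# Illustrative only: remap the simnet probe's generic alarms to smartmeter ones.
# Token names are placeholders for the probe's real elements.
table SystemAlarms = "/opt/netcool/rules/smartmeter_system_alarms.lookup"
default = {"System Alarm"}

switch($SimnetVtype)
{
    case "0":
        if (match($SimnetState, "down")) {
            @Summary = "Primary Power Fail"
            @Severity = 5
            @Type = 1
        } else {
            @Summary = "Primary Power Restored"
            @Severity = 1
            @Type = 2
        }
    case "1":
        @Summary = "Security alarm: tamper or inversion detected"
        @Severity = 4
    case "2":
        @Summary = "Communications round trip delay " + $SimnetValue + " ms"
        @Severity = 3
    case "3":
        # Map the random 1-8 value onto the eight system alarms via the lookup
        @Summary = lookup($SimnetValue, SystemAlarms)
        @Severity = 4
    default:
}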

Having set out a plan, the first step was to create a definition file, smartmeter.simnet.def. Since part of the pitch for a Netcool solution in the smartmeter space is its ability to scale, this definition file needed to be biggish, at least 500 lines. I created it in (ahem) Microsoft Excel. Excel has features such as fill, random numbers and sort, which meant I could create a long list of Meter ID numbers, assign random vtype numbers and probabilities, and later sort them to add enrichment fields for a lookup file. The spreadsheet is then saved as a text file. Other spreadsheet programs probably have the same features.

A further refinement is to add a lookup table so the alarms can be enriched. In production systems I'd advocate using Netcool Impact for event enrichment but for a simple visualisation demo I prefer the simplicity of a lookup table. Again I used Excel with its sort capabilities. In particular I wanted to make power fail alarms happen in only one or two geographies. I also wanted to assign substation IDs and create customers for meters. The last requires a list of names. The US Congress has just over 500 members and lists of senators and house members are on the internet, so it wasn't difficult to get those names. The first few lines of the lookup file are these, and coincidentally include a name non-Americans may have heard of:

I made up some county names by following the history of some New York boroughs. Readers may or may not know that Harlem and Brooklyn take their names from the Dutch towns of Haarlem and Breukelen, a throw back to the days when New York was New Amsterdam. So I found more towns and villages from the same district in the Netherlands and anglicised the spellings. The curious might like to work back, but if it's a quiz there is no prize.

The final piece of work is to create a props file that tells the probe which def and rules file to use, and then, while we're about it, we can slow down the operation of the probe so there is time to talk about things when giving the demo.

And the result? My first effort was a WebGUI 7.4.1 offering:

Wires are set up so that clicking on an icon on the map or in the county summary changes the event list window so that only the events relevant to that icon are shown.

I have uploaded the rules and other files to this blog site; they are available here.

Recently, in my blog on using Node-RED with OMNIbus, I made a statement that the advantage of Node-RED was that the state of a variable could be carried over to subsequent runs through the rules script, as this was not available in probe rules. I was told, by no less an authority than Kristian Stewart, that this was incorrect. Probe rules can create and modify probe properties on the fly and these can be used to carry variables through. I thought I must give this a go, as to me this had been a well-kept secret.

I was of course well aware that in probe rules @Variable indicates a field in the target Object Server and $Variable indicates either a token passed to the rules file or a variable created by the rules file. What was new to me was that there is a third prefix and that %Variable indicates a probe property. Typical uses of this capability are to change a property dynamically, for example

%RawCapture = 1

will turn on raw capture for the event being processed. (And all subsequent events unless a matching %RawCapture = 0 is put in earlier in the rules file to turn raw capture off at the start of each run of event processing).

In this case the property (RawCapture) is one defined in the props file. However if the property is not defined in the props file then it will be created as a transient property. Transient, because it will be destroyed if the probe is restarted.

Now this gave me an idea. Probes such as the ping probe do more than give a binary result of good or bad; they can also give an intermediate poor result. Or to be specific, the ping probe reports a node to be alive, slow or not reachable. When the probe reports a slow response it would be useful if, on de-duplication, the event indicated whether things are getting worse or better, so that we can have an event display like this:

As with all probe rules file work it is important to understand what data we have to work with. With the ping probe, as with others, this is available in the probe documentation, in the section on elements. It is also useful when developing rules files to turn on details with this line near the end of the rules file

details ($*)

This gives us examples of the tokens received by the probe as well as any $ variables created in the rules.

From this we can see that the ping probe returns some useful information. Or at least it does if the $status is "slow"; for some reason this is not returned for the "alive" status. The first thing then is to extract the trip time from the icmp stats and put it into a field in the Object Server.

The second thing is to set up the probe so that the slow status threshold is set a bit lower than the default. The defaults were after all defined in the days when 64kbps was thought to be quite a fast line speed and 2Mbps was a typical core network line speed. In today's world of Gigabit Ethernet, triple digit millisecond delays can indicate problems with buffers or congested paths. The trip and trigger time properties can be set globally in the props file, but I prefer to set them individually for each host in the ping file:

To hold the round trip time in the event I created a field called PingDelay, and put it into the display view as well, and put a couple of lines in the rules file to populate it. (Details do not update on de-duplication so the icmp_stats display shown above will be that at event insertion and stay the same thereafter)
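The lines themselves are short. A sketch, with the caveat that the $TripTime token name is an assumption - use whichever element the ping probe actually presents the round trip time in (or extract() it from the icmp statistics string first):

# Populate the custom PingDelay field from the round trip time token
# ($TripTime is a placeholder name) and force it to update on de-duplication
@PingDelay = int($TripTime)
update(@PingDelay)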

The update line is to make sure that de-duplicated events carry the latest information. The OMNIbus field could be given the property of update on de-duplication but I regard an update command in the rules file to be a more reliable option.

We now want to persist the value of @PingDelay so that the next time the probe rules run we can test whether the result is worse or better than the previous one. So if we want to track the performance of pings to www.ibm.com we can use a dynamic property and use that to test things:
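A sketch of the idea for a single host; %LastPingDelay is a transient property created on the fly, and the exact type conversions may need adjusting:

# Compare this poll's delay with the value remembered from the previous run
if (int(%LastPingDelay) > 0)
{
    if (@PingDelay > int(%LastPingDelay)) {
        @Summary = @Summary + " - response degrading"
    }
    if (@PingDelay < int(%LastPingDelay)) {
        @Summary = @Summary + " - response improving"
    }
}
# Remember this poll's value for the next run through the rules
%LastPingDelay = @PingDelay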

This will work if the ping probe is only pinging a single host. If the ping file contains multiple hosts then this section of code needs wrapping inside a switch statement with a separate header for each host, for example:
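Something along these lines, assuming @Node has already been set to the host name and using one transient property per host:

switch(@Node)
{
    case "www.ibm.com":
        if (@PingDelay > int(%LastPingDelayIBM)) {
            @Summary = @Summary + " - response degrading"
        }
        %LastPingDelayIBM = @PingDelay
    case "www.example.com":
        if (@PingDelay > int(%LastPingDelayExample)) {
            @Summary = @Summary + " - response degrading"
        }
        %LastPingDelayExample = @PingDelay
    default:
}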

Last time in this blog I looked at creating alarms when data values collected every five minutes suddenly change significantly, what I called spike alarms. I only looked at creating the alarms; I didn't look further into what the event life cycle would be. This is important because you create problems for OMNIbus if there is no means of clearing alarms when they are no longer relevant. The plan I propose is:

Create an alarm when a 5% difference between successive measures is detected.

Clear the spike alarm if measures return to baseline within ten minutes

Reduce the severity and change alarm to a step alarm if measures remain at new levels for thirty minutes

Reduce the severity and change to a trend alarm if measures settle between spike level and baseline.

Step alarms may merely be reporting a planned change and spike alarms are unlikely to be reacted to immediately so what is the value of creating them? Well not a lot if the events come as single spies, but if they come in battalions (I'm referencing Hamlet here if you hadn't noticed) then that is useful to know. In OMNIbus this will be indicated by an alarm with a high Tally count, but more usefully, if OMNIbus is fronting a service management tool such as Maximo then these spike alarms can be sent into a modelling package which determines whether maintenance procedures should be reviewed.

Drawing up a flow chart might be helpful.

To recap on earlier blogs, I am using a piece of software from IBM Hursley called Node-RED to provide more sophisticated processing of data than a probe rules file can manage. The most important extra feature of Node-RED is that its flows can remember the results of previous runs using its context{} object, unlike a probe rules file which can only run with what it is given. (Or so I thought, but it appears that probes can be made stateful by creating a property on the fly and assigning it a value - something for a later blog I think.) There are three things that need to be remembered between individual runs (a sketch of how they drive the detection follows the list), namely:

context.store - which is set to the measured value at the end of the run so that it is remembered for the next time

context.sbase - which is set to the store value when the first alarm is created so subsequent runs can compare metrics against the alarm baseline

context.salarm - which is used to mark whether an alarm state has been set and also whether the clear times have expired
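A compact sketch of how these three context variables might drive the spike/step life cycle for one transfer. The thresholds follow the plan above, but the code and the pipe-delimited payload layout are illustrative rather than the exact flow used in the demo:

// Illustrative only: 'value' is the metric parsed from msg.payload for one
// transfer; the payload format is a placeholder
var value = msg.payload.value;
var now = Date.now();

if (context.salarm === undefined) { context.salarm = 0; }

if (context.salarm === 0) {
    // No alarm yet: compare this measure with the last one stored
    if (context.store !== undefined && context.store > 0 &&
        Math.abs(value - context.store) / context.store > 0.05) {
        context.sbase = context.store;   // remember the pre-spike baseline
        context.salarm = now;            // remember when the alarm was raised
        context.store = value;
        msg.payload = "SpikeAlarm|" + value + "|" + context.sbase + "\n\n";
        return msg;                      // send the raise to the socket probe
    }
} else {
    var age = (now - context.salarm) / 60000;   // minutes since the raise
    if (age <= 10 && Math.abs(value - context.sbase) / context.sbase < 0.01) {
        // Back at baseline within ten minutes: clear the spike alarm
        context.salarm = 0;
        context.store = value;
        msg.payload = "SpikeClear|" + value + "\n\n";
        return msg;
    }
    if (age >= 30) {
        // Still away from baseline after thirty minutes: downgrade to a step alarm
        context.salarm = 0;
        context.store = value;
        msg.payload = "StepAlarm|" + value + "|" + context.sbase + "\n\n";
        return msg;
    }
}

context.store = value;   // remember this measure for the next run
return null;             // nothing to send this time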

As before, it is important to have some understanding of the monitored environment. As we have been storing all measures in a log file it is a simple job to get a day's worth of data and put it into Excel to create a chart.

From this chart we can see that only the international transfers have the step pattern these alarms are looking for; the national transfers (North-South, Scotland-England) have a different sort of variation. Each transfer needs to be analysed individually and that is done by creating array variables and using a for loop to step through each one. This was covered in my previous blog. Using arrays means that limiting the analysis to the three international transfers is easily done, simply by restricting the for loop scope to cover just values 3 to 5 instead of 3 to 7, like so:

for (var i = 3; i < 6; i++) {

In summary then, over the last few blogs I have covered how to obtain three sorts of alarms out of a single small web page of data. Obviously monitoring the UK National Grid in this way is not a practical proposition, but I hope I have demonstrated the principles and that these techniques can be translated to more real world applications. I should also say that while I have used Node-RED with OMNIbus to create this monitoring package, IBM has other tools as well. The new smart analytics for Cloud can certainly do similar things and at a much higher scale, but that may be overkill for many customers. ITM agents can also perform some of the things Node-RED has been used for, but again, ITM may not be suitable for all customers.

So far I have only displayed the alarms generated from metrics in simple event lists. Obviously a WebGUI display with maps and charts as well as event lists would provide a much better overview.

Node-RED function node javascript

The script below is what I used in this demo and is included for information only. I'm sure it could be improved upon.

// This function checks for big changes between successive collections

Over the last few weeks I have been blogging an exercise in creating meaningful alarms from regular usage statistics (Getting more from Netcool OMNIbus). In this exercise I am using some statistics provided by the UK's National Grid and manipulating them using a tool developed by IBM Hursley (Node-RED) before feeding the alarms to OMNIbus via a TCP socket probe. Node-RED uses javascript to carry out far more tests and comparisons than are possible with a probe rules file.

So far I have written about how to use Node-RED to detect when the frequency of the grid drops below 50Hz - and to clear the alarm when three consecutive monitoring periods report the frequency is above 50Hz. And a second exercise was to generate an alarm when the trend in demand was steadily upwards, and again provide a clear when demand returned to the previous baseline level.

The final alarm type to generate is the spike or step alarm. This is where the measured values suddenly jump. In my terminology the spike is when the monitored value jumps and returns back to the baseline, the step is when it jumps and stays at or around the new level.

This time we will monitor the system transfers, but first it will be necessary to redesign the Node-RED flow so that the parsing of the National Grid page and the various alarm analyses are in separate function nodes.

The parsing of the HTML is now carried out in a dedicated function node and the output forwarded to all three analysis function nodes. The msg.payload output from this parse function is structured into a stream of name-value pairs which can then be used in the analysis functions.

The msg.payload is sent out to the socket probe as before and results in alarms appearing in OMNIbus.

As in previous blogs however it's not enough to create an alarm, it's also necessary to define what the life of an alarm is - how is it going to be cleared and what processes need to be applied to it. With the simple threshold alarm it was an obvious step to create a clear alarm when metrics recrossed the threshold in the homeward direction and let the generic clear take care of things in the normal way. With the trend event the clear alarm was created when the trend went the opposite way, but a modified clear automation was needed to identify when the trend had returned to the point when the initial alarm was raised. These spike alarms have a different life, and a plan needs to be defined for them. The plan I propose is:

Create an alarm when a 5% difference between successive measures is detected.

Clear the spike alarm if measures return to baseline within ten minutes

Reduce the severity and change alarm to a step alarm if measures remain at new levels for thirty minutes

Reduce the severity and change to a trend alarm if measures settle between spike level and baseline.

Step alarms may merely be reporting a planned change and spike alarms are unlikely to be reacted to in isolation. The value of picking them up is as an input into the service models a service tool such as Maximo provides.

In my last blog ( Getting more out of OMNIbus - creating a trend event ) I described a basic mechanism for producing a trend alarm and then clearing it when conditions returned back to a lower level. There are a couple of tweaks that can be done to improve this. Specifically I added three extra checks:

A threshold so that we don't clutter up the event list window with alarms for demand levels well below any level that might cause concern.

An early bailout that is triggered by the second measure dropping back below the baseline figure

An early bailout that occurs when it becomes clear that the five out of six increases will not be achieved

The early bailouts are put in so that we don't risk a trend being missed because it crosses the boundary between two sets of six measures.

The comparison with context.counter is to ensure we don't inadvertently bypass the initialisation step. The curly brackets {} need to enclose all the comparison steps, and this can be made safe by following them with an "else" section.

As the early bailouts can be triggered at a number of points, it makes for neater programming to use javascript's function capability. The bailout function is defined and then that function can be called at other points in the script. The bailout function's main job is to re-initialise the counters and the context{} fields. The measure that triggers the bailout is in fact used as the baseline for the next 30 minutes of monitoring. The code is:
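A sketch of such a function, assuming the counter and value names used elsewhere in this series (context.first and context.rises are my placeholder names for the baseline and the count of period-on-period increases):

// Re-initialise the counters and context fields; the measure that triggered
// the bailout becomes the baseline for the next thirty minutes of monitoring
function bailout(value) {
    context.counter = 1;     // restart the six-sample cycle
    context.first = value;   // new baseline
    context.store = value;   // most recent value seen
    context.rises = 0;       // period-on-period increases so far
    return msg;
}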

The return instruction in a function returns the program to the point where the function was called. This means that a "return msg" instruction within a function does not output that msg from the Node-RED function node. To cause that output another return command has to follow the function call. Thus the two early bailout checks are:
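Under the same assumptions, the two checks might look like this, with the extra return after each call so the message actually leaves the function node:

// Bailout 1: the second measure has dropped back below the baseline
if (context.counter === 2 && value < context.first) {
    bailout(value);
    return msg;
}

// Bailout 2: enough non-increases already that five out of six rises is impossible
if ((context.counter - 1 - context.rises) >= 2) {
    bailout(value);
    return msg;
}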

Having created the bailout function it can also be called if the demand metric is below the threshold set for monitoring.

One more alarm type needs to be addressed - a spike alarm if there is a sudden step change in a monitored metric. And then finally the complete monitoring package needs to be unified under WebGUI. These will be the topics of the final blogs in this series.

Here I am going to continue with the theme of monitoring the UK National Grid using Netcool OMNIbus and Node-RED - see my earlier blogs in "Getting more out of Netcool OMNIbus" - and look at creating trend warning events.

Threshold events are a useful way of detecting problems from metrics, but they are really only applicable where a clear limit can be defined, for example alarming when a current transformer reports 90 amps on a circuit where a breaker will cut in at 100 amps. Often though we are less interested in crossing a threshold than we are in knowing a potential runaway situation has arisen. In our National Grid example we want to know when demand is rising unusually fast.

First of all though we need to understand the environment we are monitoring. In a previous blog I mentioned that I was routing every piece of collected data to a log file. Now it's time to run that file through Excel so that we can plot a typical day on the UK's national grid. The result is what one would expect, with a rise in demand in the morning and a further rise in late afternoon as domestic demand picks up when people go home but industrial and commercial demand has not yet started to tail off:

So for this exercise we want to be able to pick up if demand increases faster than a typical early morning spurt, or if the late afternoon to early evening climb continues for longer than usual. And, always important in event management, we also need to be able to clear any alarms raised when the condition causing them has gone away. This is trending for events and operations though, which is a much more short-term affair than the sort of data analysis trending done over a longer period for capacity planning purposes.

Specifically I propose three conditions to be looked out for over a thirty minute period (six five minute collections):

A rise in demand greater than 5%

A steady rise in demand where five out of the six monitoring periods are an increase on the one before

A fall in demand where the mean of the last three periods is less than the mean of the first three periods

The last condition is to be used to clear alarms and will require a modification to the Generic Clear automation, of which more anon. The other two alarms are also prioritised: if the 5% increase occurs then the steady rise is assumed and will not be separately alarmed. This means that there will never be more than one alarm per metric. The mean of three values is also calculated and used instead of a single value, to try and obviate false positives generated by a single rogue metric.

Our Node-RED flow then is the same as in my previous blog except that the analysis function is not looking for a threshold that has been breached but checking collected values against previous ones. Node-RED has a defined object - context{} - which can be used to store values from one execution of the script to the next. We will use this to hold the first value collected, the means calculated and the most recent value collected as well as counters to check progress.

The first step in our function script is the parsing of the HTTP data as before, but the next step is to initialise the counters if it's the first execution of the script. We can test for first execution by checking if context.counter is defined or not.
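For example, the initialisation step might look like this; the demand variable and the context field names other than context.counter are placeholders for whatever the parse step produced:

// First execution: context.counter is not yet defined, so set everything up
if (context.counter === undefined) {
    context.counter = 1;       // number of five-minute samples seen so far
    context.first = demand;    // baseline demand at the start of the window
    context.store = demand;    // most recent demand value
    context.rises = 0;         // count of period-on-period increases
    return null;               // nothing to report yet
}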

The actual decision making on whether an alarm has to be generated is only carried out when six collected values have been assessed. Each of the three possible alarm conditions is tested, though the order in which they are examined means that if the first test is passed the others aren't examined. This meets our prioritisation requirement.
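In outline, and again as a sketch rather than the exact script (meanFirstThree and meanLastThree are assumed to have been calculated from the stored samples):

// Run the decision step only once six samples have been collected
if (context.counter >= 6) {
    var alarmType = "";
    if ((context.store - context.first) / context.first > 0.05) {
        alarmType = "Demand rise greater than 5%";   // highest priority
    } else if (context.rises >= 5) {
        alarmType = "Steady rise in demand";         // only if the 5% test failed
    } else if (meanLastThree < meanFirstThree) {
        alarmType = "Demand falling back";           // used to clear earlier alarms
    }
    // ... build the pipe-delimited payload from alarmType and the context
    // values, then reset context.counter, context.first and context.rises
    // ready for the next thirty-minute window
}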

The alarm can then be created from these values and sent to OMNIbus via the TCP socket probe as before.

If we use the rules file from last time we will notice that de-duplication acts on these alarms. The result is that if a trend continues for a second thirty-minute period the existing alarm is updated with a new LastOccurrence time and the Tally count is incremented. The two problem alarm types can also overwrite each other. This may not be the behaviour desired, so a line or two is needed in the rules file to define different Identifiers for different instances - adding the timestamp to the Identifier will resolve that.
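For example (the $time token here is an assumption for whatever timestamp the flow sends with each record):

# Make each trend alarm unique so it is not de-duplicated with earlier ones
@Identifier = @Identifier + " " + $time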

Whether we want to do this depends on what strategy we decide to adopt for clearing the events. Initially I propose here to clear a rising trend alarm when demand falls back below the reported level. However if multiple rising trend alarms are deduplicated the reported level will be updated and thus the earlier alarms will be prematurely cleared. If the alarms are not deduplicated then each individual rising alarm will be cleared in turn. This, in my opinion, provides for a much more useful display.

We could use the Generic Clear automation to clear the alarms we are creating, except that Generic Clear would need modifying to take account of the different demand levels. Modifications to Generic Clear are dangerous as they could have unintended effects elsewhere so I prefer to make a copy and modify that. That does mean a change needs to be made to the filter statement so that the copy only works on the alarms we want it to. The astute will have noticed that in the code printed above I have used Types 5 and 6 instead of 1 and 2 for the problem and resolution indications. We also need to create a problem_trend_events table in the alerts database by copying the problem_events table and then adding a Demand field to it.

-- Remove all entries from the problem_trend_events table
delete from alerts.problem_trend_events;
end

The result is that rising trend alarms are cleared when a falling trend alarm goes below the value when they were triggered. This gives a visualisation of the trend situation as things subside back to normal.

This set up now works. However there are some improvements that could be made to make it more robust and efficient, and I'll cover these in the next blog.

In my previous blog (https://www.ibm.com/developerworks/community/blogs/More_from_OMNIbus/entry/getting_more_out_of_omnibus_creating_a_threshold_event_with_hysteresis?lang=en) I showed how using Node-RED with a TCP Socket probe meant we could have a threshold event that would not clear until three consecutive monitoring periods had been within the threshold. Quite often another threshold breach would occur before the third clear period and therefore the alarm would not clear. Additionally, if the temporal period of the DeleteClears automation is greater than the threshold monitoring period, the threshold may be breached again while the previously cleared event is still in the system and then this event would be updated. In short, it would be useful if we could indicate whether a threshold breach had been present for the entire life of the event or only for part of it - and if the latter how great a part.

In cases like the threshold events created by the Node-RED flow I described last time where we know that each update occurs every five minutes we can calculate how long the threshold was breached, namely by multiplying @Tally by 300 seconds. Since we know how long the alarm has been active from the @FirstOccurrence and @LastOccurrence fields we can calculate a percentage, and if we put that in another field then we have the KPI we can display.

There is one little gotcha. If you think the life of the event is the difference between LastOccurrence and FirstOccurrence you will get strange results, percentages of 198% for example. That is because the first occurrence of the event comes at the end of a monitoring period so we need to add the monitorPeriod value (300 seconds) to the life of the OMNIbus event to get the actual total monitoring period.

Note also that I created a new field, @ThresholdEventKPI, to hold the result. Obviously this field needs to be created before the automation or OMNIbus will report an SQL error.
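A minimal sketch of a temporal trigger along those lines. The ThresholdEventKPI, Tally, FirstOccurrence and LastOccurrence fields come from the text above; the trigger name, group, AlertGroup filter and the 300 second period are assumptions to be adjusted to your own events:

-- Recalculate the 'time in breach' KPI every five minutes for the
-- Node-RED threshold events (the AlertGroup value is an assumption)
create or replace trigger set_threshold_kpi
group default_triggers
priority 10
comment 'Percentage of the event lifetime for which the threshold was breached'
every 300 seconds
begin
    update alerts.status
        set ThresholdEventKPI = (Tally * 300 * 100) / (LastOccurrence - FirstOccurrence + 300)
        where AlertGroup = 'NGFrequency' and Type = 1;
end;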

This approach will not unfortunately work in those cases where repeat alarms come in randomly, but in any cases where the source is a timed or polled collection of data this automation, with a few tweaks, should give some extra information.

In my last blog (https://www.ibm.com/developerworks/community/blogs/More_from_OMNIbus/entry/getting_more_out_of_omnibus_a_new_approach_to_event_creation_more_suited_to_smarter_infrastructure_monitoring?lang=en) I set up Node-RED, an IBM Hursley development, to work with OMNIbus taking a feed from the UK's National Grid. That blog just set up a TCP socket probe to receive the Node-RED output. In this blog I will take that further and set up a threshold event that will be triggered if the frequency on the grid drops below the nominal 50 Hertz.

Setting up a threshold event can be done in a probe rules file but what I want to do here is to add some hysteresis. The alarm will be triggered by the grid frequency dropping below 50Hz but a clear event won't be sent until there have been three successive measures of frequency above 50Hz. The idea of this is that an alarm condition has to be clear for a period before the alarm can be cleared.

There are alternative methods of introducing hysteresis. One possibility is to set the clear threshold higher than the trigger threshold, for example trigger an event when frequency drops below 50Hz but don't clear the event until the frequency reaches 50.1Hz. This is how hysteresis is introduced in TNPM. However that is not the best way when a tight threshold is required, as in this case.

The first step is to set up the alarm trigger in Node-RED

From the last blog we have a demonstration flow. A couple of modifications need to be made:

The output that sent every update to OMNIbus has now been redirected to a log file. It might be of interest to graph the measures of demand and frequency over time but that is not the task of OMNIbus. What we want in OMNIbus is events sent when something is wrong, in this case when the frequency drops below 50Hz. So we are using a second output to drive the probe. (For debugging I also set up a third port with just a debug node attached)

To recall matters from my last blog, this flow is triggered every five minutes and an HTTP GET is made to the National Grid website. The NG returns a summary of its status, which is rendered by a browser as:

This output is then fed to a function node (Parse NG output) which contains a javascript script to parse that data into tokens and those are then sent to a log file or via a TCP socket to an OMNIbus socket probe and thus to OMNIbus.

However the javascript in the function node can also do comparisons and manipulations, and this is what I want to examine first.

Parsing the data from the National Grid uses javascript split and parse functions to extract the demand and frequency values:
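The exact split indexes depend on the layout of the returned page, so treat the following as a sketch of the approach rather than a copy of the script:

// Split the returned HTML into tokens and pull out the figures we need.
// The marker strings and indexes are placeholders for the real page layout.
var body = msg.payload;
var demandText = body.split("Demand:")[1].split("MW")[0];
var frequencyText = body.split("Frequency:")[1].split("Hz")[0];

var demand = parseInt(demandText.replace(/[^0-9]/g, ""), 10);        // megawatts
var frequency = parseFloat(frequencyText.replace(/[^0-9.]/g, ""));   // hertz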

Now we have some variables to work with we can create a threshold alarm. As the AC frequency of the grid decreases under stress we can use this as a warning indicator. Nominally the grid frequency is 50Hz, so let's trigger an alarm when it drops below 50Hz. A simple "if" branch will do that, and then all we need to do is craft the message to send to the socket probe.

When the grid frequency drops below 50Hz a single line message delimited by pipe characters and terminated by two newline characters is sent to the socket probe, which, assuming the props file and rules file are configured to do so, will create an alarm in OMNIbus.
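In sketch form, with the caveat that the field order in the pipe-delimited record is a placeholder and must match what the socket probe's rules file expects:

// Raise an alarm when the frequency drops below the nominal 50Hz
if (frequency < 50.0) {
    msg.payload = "NationalGrid|Frequency|" + frequency + "|" + demand + "|5|1\n\n";
    return msg;   // send the raise to the socket probe
}
return null;      // nothing to report this time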

Now we could create a clear alarm in the same way when the frequency goes above 50Hz again, but that would be too simplistic. We want something a bit more sophisticated, namely:

only check for frequency being above 50Hz when it is in an alarm state from a previous below 50Hz state

do not send a clear event until three consecutive above 50Hz reports have been received

To do this we need to hold values from previous inputs. Node-RED has an object defined called Context{} which can be used as an array to hold these values. So the first step is to add a couple of lines to the alarm raise code to set context values for when an alarm has raised and a counter:

context.falarm = 1;    // flag: the flow is now in an alarm state
context.fcount = 3;    // consecutive above-50Hz readings required before clearing

Now we can use these to test whether an above 50Hz metric should trigger a clear event. Don't forget that for the Generic Clear automation to work the ProblemType needs to be set to 2 and the Severity needs to be set to 1.
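Below is a sketch of that test, assuming fcount counts down from the 3 set in the raise code; the message tokens are again illustrative, with the ProblemType of 2 and Severity of 1 included so that the Generic Clear automation can pair the clear with the earlier raise:

// Only look for recovery while the flow is in an alarm state
if (context.falarm === 1 && frequency >= 50.0) {
    context.fcount = context.fcount - 1;     // one more consecutive above-50Hz reading
    if (context.fcount <= 0) {
        context.falarm = 0;                  // three in a row: leave the alarm state
        // ProblemType 2 and Severity 1 for the Generic Clear automation
        msg.payload = "NationalGrid|Frequency recovered|" + frequency + "|2|1\n\n";
        return msg;
    }
}
// a reading back below 50Hz goes through the raise branch, which resets fcount to 3
return null;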

The result is that after the third above-50Hz message is received while the flow is in an alarm state, a clear event is sent, and OMNIbus' Generic Clear automation uses it to clear both events to a green state.

In practice it will often be found that one or two above-50Hz measures come in but the third reading is back below 50Hz, so the clear event is not sent; a raise event, however, is. Deduplication takes care of this within OMNIbus, and the tally count suggests a way to create a further KPI showing how serious the problem is. That, however, is something for the next instalment.

Reference Material

Complete Function Node Code:

// does a simple text extract parse of the http output to provide an
// object containing the uk power demand, frequency and time
// context {};
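What follows is a sketch of the rest of that function, pieced together from the steps described above; the split strings, token order and variable names are assumptions rather than the original code:

var body = msg.payload;

// extract the figures from the National Grid page text
var demand = parseInt(body.split("Demand:")[1], 10);
var frequency = parseFloat(body.split("Frequency:")[1]);

// raise: grid frequency has dropped below the nominal 50Hz
if (frequency < 50.0) {
    context.falarm = 1;     // flag that we are in an alarm state
    context.fcount = 3;     // consecutive good readings needed before clearing
    msg.payload = "NationalGrid|Frequency below 50Hz|" + frequency + "|" + demand + "\n\n";
    return msg;
}

// clear: only while in an alarm state, and only after three consecutive
// above-50Hz readings; ProblemType 2 / Severity 1 for Generic Clear
if (context.falarm === 1) {
    context.fcount = context.fcount - 1;
    if (context.fcount <= 0) {
        context.falarm = 0;
        msg.payload = "NationalGrid|Frequency recovered|" + frequency + "|2|1\n\n";
        return msg;
    }
}

return null;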

It may be hard to believe, but at one time IT devices didn't send alarms; they just scrolled a log file, like a syslog, and operators were expected to recognise there was a problem from the messages scrolling up the screen. Among the earliest OMNIbus probes were the syslog and logfile probes, which replaced the operators' eyeballs with code to recognise problems. From the mid 1980s, though, IT devices started to be programmed to detect problems and to alert operators, initially through flashing LEDs and then through messages to attached text-only terminals. Or printers: I recall my network support team being alerted to a problem on our dial-in modem racks by the sound of the dot matrix printer springing to life.

In the intervening years we became accustomed to IT devices alerting us to problems through alarm messages following standards such as SNMP or X.733. The truth is, though, that engineering for alarms was constrained by budgets and release schedules, but the alarm catalogue was generally rich enough for that not to be a major problem. It was, however, common to deploy performance monitoring - for example the Quallaby product that is now TNPM - to fill some of the gaps.

Today, however, we are encountering the same problems with limited alarm provisioning as we start to support smarter infrastructure. The sensors and smart meters are mostly set up to send regular metrics rather than alarms. A smart meter will certainly alarm if primary power is cut, but if it reports that power utilisation is 3.7 kilowatts, is that good or bad?

How then do we turn regular metrics into alarms that can be used to warn of anomalous situations and trigger actions to investigate them? With the number of smarter infrastructure monitors being rolled out the data to be analysed will grow huge so it will have real value to be able to detect these anomalies quickly. The problem is not just finding a needle in a haystack - it's knowing there is a needle there to look for. Or five or ten needles.

One obvious way to create an alarm from a metric is to set a threshold. It's simple to do with a line or two in a probe rules file, but the problem is that it is static. There are many times when there is a clear maximum level like a power limit or a temperature setting that will not change, but often it is difficult to set a threshold at a level that will catch a problem without generating a number of false positives first.

Static thresholds are limiting, so what we need to do is create some dynamic thresholds. These can be thresholds set against a baseline, a threshold that compares a value against the previous value, or a threshold on the trend over an hour or so. Those are all independent thresholds. Other thresholds might require comparing a value against two or three different metrics collected from other sensors; for example, a crane motor drawing x amps might be normal when it is lifting a heavy load but not when it is merely moving the hoist around.
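As a flavour of what the second kind looks like, here is a minimal JavaScript sketch, written in the style of the Node-RED function nodes used later in this post, of a threshold on the change since the previous reading; the metric, the delta limit and the message text are all illustrative:

// Dynamic threshold: alert when the reading jumps by more than MAX_DELTA
// compared with the previous sample
var MAX_DELTA = 5.0;

var value = parseFloat(msg.payload);     // current reading from the sensor
var previous = context.previous;         // reading remembered from the last run
context.previous = value;                // remember this one for next time

if (previous !== undefined && Math.abs(value - previous) > MAX_DELTA) {
    msg.payload = "Reading changed by " + (value - previous).toFixed(2) + " since the last sample";
    return msg;                          // forward an alert downstream
}
return null;                             // within tolerance, send nothing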

I think we can safely say that the logic required here is beyond the capabilities of a probe rules file. What we need is a flexible tool to do some pre-processing of metrics before the probe creates the alarm. We could do what we did before and bring in a performance tool like TNPM or even ITM, but in this blog I am going to draw your attention to a piece of software being developed in IBM Hursley called Node-RED. I won't give a full description of what Node-RED is or what it does; I refer you to the intranet site node-red.org for that. Instead I will show the architecture I used and describe how the first integration between OMNIbus and Node-RED was achieved.

The architecture deployed is as in this diagram

In this exercise we have used option 1, the socket probe. A proof of concept of option 3 has been done but that is still too immature to consider further.

Monitoring the UK's National Grid

For this exercise we need a data source that can provide us with a regular stream of data and as the chaps in Hursley have created a demo exercise using the UK's national electricity grid, it seemed a good idea to use that. In that exercise Node-RED collects grid metrics from the URL http://www.nationalgrid.com/ngrealtime/realtime/systemdata.aspx. That URL returns:

Demand: 51291MW
17:00:00 GMT
Frequency: 50.032Hz
17:01:45 GMT

System Transfers

The Node-RED demo also provides some sample code to collect this data at five-minute intervals and to parse it so that demand and frequency can be isolated, so we may as well use this to get us started. Node-RED also provides nice Debug nodes, so we can test what each function node actually delivers.
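Using the sample values above, the output from the parsing node sent on to the probe is a single pipe-delimited line along the lines of the one below; the token order and the leading name token are assumptions rather than the demo flow's actual output:

NationalGrid|51291|50.032|17:00:00 GMT|17:01:45 GMT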

A TCP socket probe can now be set up with properties set to parse single lines using the pipe symbol ("|") as a delimiter and two blank lines to indicate the end of a record. In Node-RED a TCP output node is connected to the parse function and configured to send data to the TCP port and IP address of the socket probe.

The probe now receives a set of tokens, which can be viewed in Alert Details if the rules file is configured for that, or in the probe log if the message level is set to debug.

From here on in it is simple rules-file work to get the event correctly presented in OMNIbus.

Naturally you will use deduplication to ensure that the event merely shows the latest status.

So now we have the basic integration. In the next blog on this topic I will cover extending this to produce a grid monitoring visualisation using Node-RED and OMNIbus together.

One of the real selling points of Netcool OMNIbus has always been that it can reduce the number of alarms presented to network operators by really large proportions; 80% or more alarm reduction is not unusual. De-duplication, i.e. updating an existing alarm with a new last occurrence and tally rather than creating a new entry for a duplicate alarm, has been a key tool in alarm reduction. The trouble with de-duplication, however, is that it smothers alarm history. If an operator is faced with a transient problem, for example occasional failures in accessing a service reported by ITCAM, and can see two or three alarms with high tally counts, how can that operator deduce whether the service failures are down to an intermittent problem with a router interface or whether the stream of high memory utilisation events from a server is the cause? Being able to track which of these intermittent events occur simultaneously would be very useful.

One way to do this would be to use archived alarms and build TCR reports, but this assumes that all alarms are being archived and that the operator has both ability and authority to create the TCR reports. It would also be a rather heavyweight solution to an issue that should really be a matter of selecting an alarm or three, right-clicking and selecting a "track alarm" tool. In this blog I am going to describe a possible way of achieving that.

This first blog will describe how I went about proving the concept. In later blogs I will cover how to provide wrappers and reporting tools to turn the concept into a solution.

What I didn't want to do was to change the de-duplication process in any way. The approach I wanted to take was to select a tiny subset of the alarms coming in and have them write severity changes, with timestamps, to a custom table. To minimise the risk of overloading Object Servers with extra automations I thought it best to push that work out to the probe. That is possible by checking an alarm's identity against a lookup table and using the genevent function that came with OMNIbus 7.3.1 to write to the custom table, in addition to the probe writing the alarm to the normal alerts.status table. This does mean that when a new alarm is selected for monitoring the probe needs to re-read the rules file, but that can be achieved through the new HTTP features in OMNIbus 7.4.

The first step is to create the custom table that will hold the alarm status changes. I've called this table statmon.status, and its key fields are described below.

The key fields for monitoring and graphing are ChangeTime and Severity, but I've added other fields for manageability purposes. Identifying Owner and Group opens up the possibility of running multiple traces by different operators later on. Having LocalIdentifier as well as Identifier gets round the issue that many rules files put either @Severity or @Type into the make-up of @Identifier, which we don't want in this application, and the MonitorCount field lets us use an integer instead of a string as a filter in charts.

Now that we have somewhere to put the results, we need to modify our probe rules file to go into monitor mode and post the changes in Severity. The first part of this is to create a lookup table, in a file, that identifies which alarms should send their new Severity to the statmon.status table. This is a three-column table: a key plus two variables.

Later on we will look at scripts, invoked from a tool, that add or remove lines from this file, but in this proof of concept we will edit the file manually.

In this proof of concept I have also used the simnet probe to generate events, so I added these lines to the simnet.rules. These lines are not specific to the simnet probe so could be added to other rules files as an include.

At the top of the rules it is necessary to register the target Object Server tables as we will be sending data to multiple tables, and also define the lookup table as a file:

As stated, the LocalIdentifier variable is necessary to strip @Severity and/or @Type out of the @Identifier field, otherwise we would only be monitoring a subset of the alarms we want to. LocalIdentifier can then be used as the key to the lookup table. I followed standard practice in setting a variable with the system time (getdate), and I also thought it useful to be able to identify those alarms being monitored, which I did by setting @Acknowledged to 10. This had an unexpected but useful effect in that the event was given a grey colour in the Active Event List.

At this stage I haven't used the Owner and Group fields, but the genevent tool sends the key data for tracking an event to the custom table as well as sending the alarm through the normal route. Only those events whose LocalIdentifier matches an entry in the lookup table with "yes" in it will do that; all other events will ignore this section of the rules.

The result is that the probe sends the Severity of each new alarm to our statmon.status table with a timestamp.

Now that we have this data we need to be able to visualise it. For this proof of concept I thought I'd use WebGUI's charts, and boy, what fun that was. Charting in WebGUI is deprecated and the future lies with DASH and the visualisations created by the RAVE project, which has meant that there are now some weird bugs in charting. However, I was able to put some charts together to demonstrate the principle.

These two traces show that the alarms we are monitoring are not related - much as you would expect with a simnet probe - but they do show the possibilities of this approach to transient alarm monitoring.

Now that we have proved the concept, we need to move on to making it a workable solution.

Firstly we need to have a simple way of selecting an alarm for monitoring and starting the monitoring using a tool that modifies the lookup table and gets the probe to re-read its rules file, along with another tool that stops the monitoring and tidies things up.

We also need some automations to perform housekeeping on the statmon.status table, and possibly an automation that culls monitoring that has been forgotten about.