Our organization has been facing a "tsunami" of alerts since we recently plugged in the monitoring tool. This, of course, badly reduces the Service Delivery team's productivity.
A short investigation of the alerts that pop up every day (around 1500 / day) reveals that:
- a large number are caused by the same incident
- the cause can be technical or application-related
- eradication plans often involve both Operations and Project teams.

Has anybody already faced the same problem and drawn up a procedure (or better, a process) to bring this issue under control?

First of all, are the alerts valid? If you don't care what an alert is telling you, then turn it off. If you do care, then raise problem records, prioritise them and fix the underlying issue.

Also, can your system aggregate alerts, i.e. raise 1 incident with a count variable? For example, if a directory is full now (causing an alert), it's likely to still be full in 10 minutes (when the alert fires again). A sensible monitoring system would see that as 1 incident that started at time X and has caused Y alerts. If a user rings the service desk every 10 minutes to tell them that their PC is still broken, do you record this as separate incidents?
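To illustrate the "1 incident with a count variable" idea, here is a minimal Python sketch. It is not taken from any particular monitoring product: the class names, the fields and the choice of (source, check) as the aggregation key are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    source: str            # hypothetical field, e.g. a host name
    check: str             # hypothetical field, e.g. "disk_full"
    started_at: datetime   # time X: when the first alert arrived
    last_seen: datetime    # when the most recent alert arrived
    alert_count: int = 1   # Y: how many alerts this incident has absorbed


class AlertAggregator:
    """Folds repeated alerts for the same source/check into one open incident."""

    def __init__(self):
        self.open_incidents = {}

    def on_alert(self, source, check, when):
        key = (source, check)
        incident = self.open_incidents.get(key)
        if incident is None:
            # First alert for this condition: open a new incident.
            incident = Incident(source, check, started_at=when, last_seen=when)
            self.open_incidents[key] = incident
        else:
            # Repeat alert: bump the counter instead of opening a new incident.
            incident.alert_count += 1
            incident.last_seen = when
        return incident

    def resolve(self, source, check):
        # Close the incident so that a later alert opens a fresh one.
        return self.open_incidents.pop((source, check), None)


# The same "directory full" alert firing every 10 minutes on one server.
aggregator = AlertAggregator()
start = datetime(2024, 1, 1, 9, 0)
for i in range(3):
    incident = aggregator.on_alert(
        "srv01", "disk_full", start + timedelta(minutes=10 * i)
    )
```

After the loop there is a single open incident for srv01 with an alert count of 3, rather than three separate incidents.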

Lastly, it's worth mentioning that all a monitoring tool does is alert you to incidents before a user does. Therefore, if you are having problems with how to handle the alerts, I would turn it off and get your incident management process sorted out first.

Can you set up the monitoring tool so that it can identify Severity 1, 2, 3 etc. type events?

By this I mean: don't stop the events from being created, but look at the conditions used to fire the event. If the condition is a warning event, then set it to create a Severity 3 incident. If the event is for a major issue, e.g. a CPU failure or a database going offline, then have it create a Sev 1 incident and have your incident management tool send an alert to the relevant people. That way you get alerted to the major incidents but have all the others recorded as events / incident records for review.
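A rough Python sketch of that mapping follows. The event type names, the rule table and the `notify` callback are assumptions made up for the example, not the configuration of any real tool; the default of Severity 3 for unknown event types is also an assumption.

```python
# Hypothetical mapping from event condition to incident severity.
SEVERITY_RULES = {
    "cpu_failure": 1,
    "database_offline": 1,
    "warning": 3,
}
DEFAULT_SEVERITY = 3    # assumption: unknown event types are only recorded
NOTIFY_AT_OR_ABOVE = 1  # only Sev 1 incidents send an alert to people


def create_incident(event_type, notify):
    """Always records an incident, but only notifies for major ones."""
    severity = SEVERITY_RULES.get(event_type, DEFAULT_SEVERITY)
    incident = {"event_type": event_type, "severity": severity}
    # Lower number means more severe, hence the <= comparison.
    if severity <= NOTIFY_AT_OR_ABOVE:
        notify(incident)
    return incident


# Stand-in for "send an alert to the relevant people".
paged = []
create_incident("cpu_failure", paged.append)
create_incident("warning", paged.append)
```

Both events end up as incident records, but only the CPU failure reaches the `paged` list.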

It does help, but it does not solve the overall issue of relating the flood of events that come in. Get a handle on the important ones first and then look to deal with the rest.

Granted, a CPU failure can trigger a multitude of other events. Sometimes I have found you just have to stop the flood of events, fix the issue and then switch it back on.
_________________Mark O'Loughlin
ITSM / ITIL Consultant

NMS tools also have filters like: if an alert clears w/in 5 minutes, it disappears.
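As a sketch of such a filter in Python (assumed logic, not the behaviour of any specific NMS product): alerts that clear within the window are dropped, alerts that outlast it are kept for ticketing, and the dropped ones are counted per alert name so that repeated flapping stays visible. The dictionary keys and the function name are hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta

CLEAR_WINDOW = timedelta(minutes=5)


def triage_alerts(alerts, now):
    """Splits alerts into those needing a ticket and a count of suppressed ones.

    Each alert is a dict with 'name', 'raised_at' and 'cleared_at'
    (None while the alert is still active).
    """
    kept = []
    suppressed = Counter()
    for alert in alerts:
        cleared_at = alert["cleared_at"]
        if cleared_at is not None:
            if cleared_at - alert["raised_at"] <= CLEAR_WINDOW:
                # Cleared by itself within the window: no ticket, but count it.
                suppressed[alert["name"]] += 1
            else:
                kept.append(alert)
        elif now - alert["raised_at"] > CLEAR_WINDOW:
            # Still active after the window: needs a ticket.
            kept.append(alert)
        # Otherwise the alert is still inside the window: wait and see.
    return kept, suppressed


t0 = datetime(2024, 1, 1, 9, 0)
alerts = [
    {"name": "link_flap", "raised_at": t0, "cleared_at": t0 + timedelta(minutes=2)},
    {"name": "disk_full", "raised_at": t0, "cleared_at": None},
    {"name": "cpu_high", "raised_at": t0 + timedelta(minutes=8), "cleared_at": None},
]
kept, suppressed = triage_alerts(alerts, now=t0 + timedelta(minutes=10))
```

Here only disk_full needs a ticket; link_flap is suppressed but counted, and cpu_high is still inside its 5 minute window.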

However, this does not mean that the team doing the NMS monitoring can ignore all alerts just because they appear/disappear w/in 5 minutes.

And then, if the incident ticket warrants it, a Problem ticket is raised to solve the unknown underlying problem. Restoring service is the Incident!
_________________John Hardesty
ITSM Manager's Certificate (Red Badge)

Thank you for those very quick answers.
Actually, I think I could have been more specific in my question: my current concern is to set up a sound organization and process to decide, for a set of correlated alerts (e.g. same object, same server ...), whether I should:
- tune up my monitoring tools (alerting threshold for instance),
- change something on the hardware,
- update the instructions manual,
- ask analysts to patch their developments,
- whatever ...

I was thinking more of setting up a process, like a draft I could send you (since I don't know how to attach an image here).
Whom do I have to involve? In which cases? Who should be in charge of coordination? And so on ...

Please excuse my English (we French people are not always very at ease with your language).
Kind regards,
JLB

You have to have a defined Incident mgmt process first.
This should have an addendum to deal with automated tools, alerts & system monitoring alerts.

These SM alerts should be used to create incidents

If the alert says 'insufficient memory... system crash',

then the system people for that system would get the incident, restore service and THEN investigate why there was insufficient memory.

This is PROBLEM MGMT.

For example, the investigation reveals that an application (call it SarkozyThoughtProcess; hey, I saw a comment that his wife ... like his six brains) needs # amount of RAM and ## amount of hard drive for swap space.

There seems to be insufficient hard drive space. Therefore the solution to the problem is: add a new hard drive with ##^8 and use it solely for this application.

a change is raised to implement
it gets approved
it gets scheduled
it gets implemented

---------Meanwhile, the system keeps suffering the alerts, the system team restores service, and tickets are generated (incidents, which are linked to the existing problem that is being dealt with).

And lo, after the implementation, the alerts disappear.
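The linking of recurring incidents to an already open problem record could be sketched like this in Python. The record structure, the ticket IDs and the idea of matching on a "root cause key" are assumptions for illustration only; a real ITSM tool would have its own data model.

```python
from dataclasses import dataclass, field


@dataclass
class Problem:
    problem_id: str        # hypothetical ID format, e.g. "PRB-1"
    root_cause_key: str    # hypothetical key, e.g. "srv01:insufficient_memory"
    status: str = "open"
    incident_ids: list = field(default_factory=list)


class ProblemRegister:
    """Keeps problems by root cause so recurring incidents can be linked."""

    def __init__(self):
        self._by_cause = {}

    def open_problem(self, problem_id, root_cause_key):
        problem = Problem(problem_id, root_cause_key)
        self._by_cause[root_cause_key] = problem
        return problem

    def link_incident(self, incident_id, root_cause_key):
        # Returns the open problem the incident was linked to, or None if
        # there is no open problem for this cause.
        problem = self._by_cause.get(root_cause_key)
        if problem is None or problem.status != "open":
            return None
        problem.incident_ids.append(incident_id)
        return problem

    def close_problem(self, root_cause_key):
        # Called once the change has been implemented and verified.
        problem = self._by_cause[root_cause_key]
        problem.status = "closed"
        return problem


register = ProblemRegister()
cause = "srv01:insufficient_memory"
problem = register.open_problem("PRB-1", cause)

# While the change is pending, each new incident is restored and linked.
register.link_incident("INC-101", cause)
register.link_incident("INC-102", cause)

# After the change is implemented, the problem is closed.
register.close_problem(cause)
relinked = register.link_incident("INC-103", cause)
```

If an incident with the same cause shows up after closure, `link_incident` returns None, which is a hint that the fix may not have worked and a new investigation is needed.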

-----------------------------

In regards to your question... all five can be done or not be done... depending on the results of the analysis of the problem.

If the alert threshold is set too low or too high, then this should feed into the System mgmt people to investigate the impact of more or fewer alerts.

....NOTE: Before the 5 minute rule went into effect, every alert had to have an incident ticket, otherwise it did not clear.
We generated thousands of useless tickets.
_________________John Hardesty
ITSM Manager's Certificate (Red Badge)

What I understand is that the process I'm trying to define seems to be a hazy mix of incident and problem mgt, since the number of alerts is a bit confusing and generates far more problems than my team is able to deal with.
I shall think about an effective dashboard that helps to follow up so many eradication action plans scattered among so many IT teams.
Any template?

In any case, MANY MANY thanks for your advice: one often needs somebody to remind one of the basics of Service Management when operations are ... intense.
Regards,
JLB

So then your incident (and problem) process would also be linked in, to do something about each level through a series of incident ticket states.
_________________John Hardesty
ITSM Manager's Certificate (Red Badge)