Defining an Incident

In most of my posts I use “Incident” as shorthand for “Service Outage”, but truthfully, an Incident is defined more broadly than a disruption in Services that end users notice.

So what conditions should be logged as Incidents?

Four conditions should be the basis for entering Incident records in your IT Service Management system:

A Service outage

A Service degradation

An Event that increases the risk of a Service outage

An Event that increases the risk of a Service degradation

Definitions

Service outage – This is the common understanding of what defines an Incident. When end users’ Services are disrupted, people who use ITIL terminology call it an Incident.

Service degradation – When a user’s Services are in a degraded state (slow performance, critical functions not working, etc.) an Incident should be logged. What level of degradation triggers an Incident is something I’ve written about in a previous post.

An Event occurs that increases the risk of a Service outage – Let’s say you have a server with 3 drives configured for RAID 5 and one of those drives fails. The Service is still running, but the risk of a Service outage has significantly increased: RAID 5 tolerates only a single drive failure, so if another drive in that array fails before the first is replaced and rebuilt, you will have significant data loss. Hopefully you have sufficient monitoring to alert you to the event rather than relying on someone noticing the red light on the array as they walk by, but regardless of how it is detected, an Incident should be logged.

An Event occurs that increases the risk of Service degradation – Let’s take a scenario where two sites are connected by an FDDI ring with an ISDN fallback. Again, hopefully you have monitoring to tell you when the primary FDDI ring has failed; the secondary ring should still be able to handle the users’ volume. But what if that secondary ring also goes down and you have to fail over to the ISDN connection? The users’ Services will be seriously degraded. Even though the risk of a complete Service outage remains very low with this triple redundancy, the risk of Service degradation rose dramatically the moment the primary ring failed, so an Incident should be logged.
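The four conditions can be sketched as a small classification routine. This is only an illustration, not how any particular ITSM tool works; the names `Condition` and `classify` and the four boolean inputs are assumptions made for the example.

```python
from enum import Enum
from typing import Optional


class Condition(Enum):
    """The four conditions that warrant an Incident record (names are illustrative)."""
    OUTAGE = "Service outage"
    DEGRADATION = "Service degradation"
    OUTAGE_RISK = "Event that increases the risk of a Service outage"
    DEGRADATION_RISK = "Event that increases the risk of a Service degradation"


def classify(service_up: bool,
             degraded: bool,
             outage_risk_raised: bool,
             degradation_risk_raised: bool) -> Optional[Condition]:
    """Return the most severe condition that applies, or None if no Incident is needed.

    The ordering (outage, then degradation, then the two risk conditions)
    is an assumption of this sketch.
    """
    if not service_up:
        return Condition.OUTAGE
    if degraded:
        return Condition.DEGRADATION
    if outage_risk_raised:
        return Condition.OUTAGE_RISK
    if degradation_risk_raised:
        return Condition.DEGRADATION_RISK
    return None


# RAID 5 example: Service still up and healthy, but one drive has failed.
raid_case = classify(service_up=True, degraded=False,
                     outage_risk_raised=True, degradation_risk_raised=False)

# FDDI example: primary ring down, secondary ring still carrying the load.
fddi_case = classify(service_up=True, degraded=False,
                     outage_risk_raised=False, degradation_risk_raised=True)
```

In both examples the users notice nothing, yet `classify` returns a condition rather than `None`, which is the point: an Incident record is still warranted.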

Downstream

Many IT technicians don’t understand that all of these conditions warrant recording as an Incident. This significantly affects downstream processes such as Problem, Configuration, Change, and Availability Management, among others.

How can you plan for high Availability if you don’t capture events that fall short of a Service outage? How can you identify Problems if you don’t record Incidents that don’t directly affect the users’ perception of the Service?

Automation

Many tools try to automate the recording of Incidents when non-user-affecting events occur, but most of them generate so many spurious events that the volume of invalid Incidents created makes the feature not worth using. Only with strong correlation rules would I trust automated Incident creation.
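To make “correlation rules” concrete, here is a deliberately simple sketch of one: only open an Incident when the same event repeats on the same source within a time window, and never open a second Incident for a source and event type that already has one. The class name `Correlator`, its fields, and the default window and threshold are all assumptions for illustration; real monitoring tools have their own rule engines and their own defaults.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

Key = Tuple[str, str]  # (event source, event kind), both hypothetical labels


@dataclass
class Correlator:
    window_seconds: float = 300.0  # assumed window; tune per Service
    min_events: int = 2            # assumed threshold; tune per Service
    _seen: Dict[Key, List[float]] = field(default_factory=dict)
    _open: Set[Key] = field(default_factory=set)

    def observe(self, source: str, kind: str, timestamp: float) -> bool:
        """Record an event; return True only when a new Incident should be opened."""
        key = (source, kind)
        # Keep only the events for this key that are still inside the window.
        recent = [t for t in self._seen.get(key, [])
                  if timestamp - t <= self.window_seconds]
        recent.append(timestamp)
        self._seen[key] = recent
        if key in self._open:
            return False  # an Incident already exists; suppress the duplicate
        if len(recent) >= self.min_events:
            self._open.add(key)
            return True
        return False  # a one-off event, possibly spurious


correlator = Correlator(window_seconds=60, min_events=2)
results = [
    correlator.observe("array1", "drive_failed", 0),   # first sighting
    correlator.observe("array1", "drive_failed", 30),  # repeated inside the window
    correlator.observe("array1", "drive_failed", 40),  # Incident already open
    correlator.observe("array2", "drive_failed", 41),  # different source
]
```

A rule this crude would still miss real single events and let some noise through, which is why the choice of threshold and window needs to be driven by what each Service can tolerate.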

Training

The best thing to do is to train all the IT technicians to understand that Incidents are not just for Service outages and to have good Service Level documentation to inform IT when Service degradation should trigger Incident creation.

One Response

Don – Excellent article. The problem of overstating the frequency of Incidents by monitoring tools certainly does diminish the value of Incident tracking, if not of the monitoring tools themselves.

Another fairly recent development that waters down the value of Incident Management is the co-opting of the term “Incident” by the Service Desk function, wherein every call to the help desk is called an Incident, rather than a Service Request or Trouble Ticket.

I don’t know if we can ever get that particular horse back into the ITIL barn! It could be that ITIL will have to dumb itself down so that the ITSM tool developers can understand it.