I'm currently rewriting our Incident Management process and I was wondering what people's views were on the following issue I've encountered.

A bit of background:

Our Incidents are primarily detected by 1st Line staff who constantly monitor alarm screens and check our CIs. We only very occasionally have Incidents raised by our customer or 3rd party suppliers on the phone or via email.

A number of Incidents related to Problems/KEs have immediate impact on our availability SLAs and so require immediate action by 1st line staff to resolve them and make the CI available again. (Every second counts in some of our availability SLAs.) These Incidents would not be escalated or assigned to anyone; merely dealt with by the person who detected them.

The issue:

Some of our Incidents will require immediate action before being logged (and then immediately closed), while others will need to be logged first, escalated, assigned etc.

I am wary of having a different process depending on an Incident's impact/category/urgency. Is this a valid concern or do you think I should go ahead and make this distinction?

Do you have a toolset in place? Is there a response time defined in your SLAs, or is it just availability? Do you have different SLAs tailored to different customers (customer SLAs), or are they service based? Do you have OLAs in place to underpin the SLAs? Also, if you are going to proceed with writing a different process (which does no harm if done right), how would you go about controlling your staff to ensure that they log the incidents after they are resolved and do not bypass logging?

We are in the process of setting up ChangeGear (http://www.sunviewsoftware.com/products/overview.aspx) for our Incident, Problem, Change, Config & Knowledge management purposes. Currently this is all done with separate tools; however, I'm writing this process with ChangeGear in mind.

We have resolution times in our SLAs but not "response times" as such - the resolution times and availability requirements go hand in hand.

We have only one customer, with agreed Incident severities that dictate the SLAs. We do have OLAs in place.

Your last question is the main issue, I think. The problem we're facing is that because of the few Incidents that require resolution before logging, people are treating all Incidents in this way. This has led to Incidents not being logged, or logged so long after the event that the Incident records are missing key information.

I had thought about looking into automation between our alarm systems and ChangeGear, so that the Incidents that our guys need to resolve immediately are automatically logged - however I think this is a long way off, and I don't want to write the process with this in mind until I'm sure it's actually feasible.

Before I start: in any event, you need an umbrella document that describes your incident management in terms of policy, strategy and process structure.

You say that seconds count. In that case quality counts and your monitoring staff will have the capabilities and judgement to recognize and deal with these events that require instant fix. You also say that these events are associated with known errors. It will be best to stick to that and not allow "new" incidents to be treated in this way.

I am also assuming that these "emergency" incidents are not happening every two minutes, because then you would probably have to deal with it as an operational activity with dedicated staff and their own processes.

The true issue is getting it right. Normally logging first makes good sense because then you have a record come what may. But it sounds like you can achieve that through your monitoring and alerting tools. Perhaps the staff can press a few buttons while assimilating the event to get the record started; that way it will have a time stamp for when it was picked up as well as a glaring demand to be completed.
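As a rough illustration of the "press a few buttons" idea, here is a minimal Python sketch of a stub record that captures the detection time straight away and stays visibly incomplete until the details are filled in. The names (StubIncident, open_stub, complete) and the fields are assumptions made up for this example; they are not features of ChangeGear or any other tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class StubIncident:
    """Hypothetical minimal record started at the moment of detection."""
    ci_id: str            # affected configuration item
    known_error_id: str   # known error the operator matched the event to
    operator: str
    # Time stamp is taken automatically when the stub is created.
    detected_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )
    resolution_notes: Optional[str] = None
    completed_at: Optional[datetime] = None

    @property
    def is_complete(self) -> bool:
        # An incomplete stub is the "glaring demand" to finish the record.
        return self.completed_at is not None


def open_stub(ci_id: str, known_error_id: str, operator: str) -> StubIncident:
    """Start the record with the bare minimum, before any fix is attempted."""
    return StubIncident(ci_id, known_error_id, operator)


def complete(stub: StubIncident, notes: str) -> StubIncident:
    """Fill in the details once service is restored."""
    if not notes.strip():
        raise ValueError("resolution notes are required")
    stub.resolution_notes = notes
    stub.completed_at = datetime.now(timezone.utc)
    return stub
```

The point of the design is that the time stamp costs the operator nothing, and anything left with is_complete set to False is easy to chase later.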

Obviously you still need to translate that into a proper incident record as soon as possible and so you might need some end of day/shift check to confirm that things were caught up with.
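A sketch of what such an end of shift check might look like in Python, assuming each record carries a detection time and a completion time. The Record fields, the function name and the 5-minute grace period are all hypothetical values chosen for illustration.

```python
from datetime import datetime, timedelta
from typing import List, NamedTuple, Optional, Tuple


class Record(NamedTuple):
    """Hypothetical incident record as exported for the shift check."""
    incident_id: str
    operator: str
    detected_at: datetime
    completed_at: Optional[datetime]  # None means the record was never finished


def end_of_shift_check(
    records: List[Record],
    grace: timedelta = timedelta(minutes=5),  # assumed tolerance, not a real SLA
) -> List[Tuple[Record, str]]:
    """Return records that were never completed or were completed too late."""
    flagged = []
    for rec in records:
        if rec.completed_at is None:
            flagged.append((rec, "not completed"))
        elif rec.completed_at - rec.detected_at > grace:
            flagged.append((rec, "completed late"))
    return flagged
```

A supervisor could run this over the shift's records and follow up on anything it returns.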

You also need good and frequent audit/review to ensure things are working as you require.

How do you identify when it is correct to follow this accelerated process? Well, the event will link to a known error, and that means your staff have to be on top of the known errors (if they have to search for one, then a well-designed incident logging process might speed them up rather than slow them down). It also has to pose an immediate threat to a service (any service, or some specific services?) or involve a failure in one of a set of specified CIs (again, if the logging process gets you there quickly, it might be quicker to log the incident than not).
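Written down as a rule, the test above might look something like this Python sketch. The function name, the parameters and the idea of holding approved known errors and specified CIs as simple sets are assumptions for illustration only.

```python
from typing import Optional, Set


def use_accelerated_path(
    known_error_id: Optional[str],
    ci_id: str,
    threatens_service: bool,
    approved_known_errors: Set[str],
    specified_cis: Set[str],
) -> bool:
    """Decide whether an event may be fixed first and logged afterwards.

    Both conditions must hold:
      1. the event links to a known error on the approved list, and
      2. it poses an immediate threat to a service OR it involves one of
         the specified CIs.
    Anything "new" (no known error) always goes through normal logging.
    """
    if known_error_id is None or known_error_id not in approved_known_errors:
        return False
    return threatens_service or ci_id in specified_cis
```

Keeping the rule this small is deliberate: if operators cannot apply it from memory, the accelerated path probably saves no time.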

You are still left with issues. People will err on the safe side and, if your process is not rigorous enough, too many incident logs will be deferred. There is a cost in terms of the reliability of your incident records and in the extra levels of audit and checking needed to compensate. That has to be weighed against the savings of a few(?) seconds from instant action.

These are just some thoughts on some of the issues you want to look at, hopefully helpful, but far from comprehensive. I wouldn't like to say whether you should or should not go ahead without a very detailed understanding of the practicalities, costs, risks and business imperatives at the very least.

"Method goes far to prevent trouble in business: for it makes the task easy, hinders confusion, saves abundance of time, and instructs those that have business depending, both what to do and what to hope."
William Penn 1644-1718

Since your client does not require a response time, I don't see how the delay from logging is a problem. Logging a ticket shouldn't take more than 2 minutes.

Anyway, in your scenario for urgent incidents I would recommend an easy solution like this:

- Leave the incident unprocessed (since you're saying that 2 minutes of logging can affect availability).
- Have the engineer who is working on the incident send an acknowledgement (preferably a saved draft, to buy time) to the rest of the support staff so they know he is working on it.
- The engineer can now focus on restoring the service as quickly as possible.
- Once service is restored, the engineer can update the incident record with the details.
- If he doesn't, audits will show it. Unlogged incidents will be very easy to trace since engineers are sending acknowledgements with their names. Verify the time they logged the ticket and why it took so long, then hit them with a stick and ensure they don't do it again.
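The audit in the last step could be sketched roughly like this in Python. It assumes each acknowledgement and each ticket share some reference so they can be matched up; that reference, the function name and the 30-minute threshold are all hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Tuple


def audit_acknowledgements(
    acks: Dict[str, Tuple[str, datetime]],   # ref -> (engineer, time ack was sent)
    tickets: Dict[str, datetime],            # ref -> time the ticket was logged
    max_delay: timedelta = timedelta(minutes=30),  # assumed tolerance
) -> Tuple[List[Tuple[str, str]], List[Tuple[str, str, timedelta]]]:
    """Compare acknowledgements against logged tickets.

    Returns two lists:
      - unlogged: (ref, engineer) where an ack exists but no ticket does
      - late:     (ref, engineer, delay) where the ticket came too long after
    """
    unlogged = []
    late = []
    for ref, (engineer, ack_time) in sorted(acks.items()):
        logged_at = tickets.get(ref)
        if logged_at is None:
            unlogged.append((ref, engineer))
        elif logged_at - ack_time > max_delay:
            late.append((ref, engineer, logged_at - ack_time))
    return unlogged, late
```

Because every acknowledgement carries a name, both lists point straight at the person to talk to.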

However, I don't recommend the steps I listed above to other organizations, but they might fit yours. Remember, ITIL should be tailored.

Ali Makahleh
Configuration Management(Blue Badge),
ITILV2 Service Manager(Red Badge),
ITILV3 Expert(Lilac Badge) Certified.

"If you can't describe what you are doing as a process, you don't know what you're doing." W. Edwards Deming.

First, if you do not have a record of the incident and how it was dealt with, how are you going to justify the staff?

Second, write your process without regard to the tool first.

A fool with a tool is still a fool.

You need to define the policy for Incident, Problem, Change, Config and Release first, then the process and then the procedures. The procedure document is where you talk about the tool.

A critical incident needs to be created as an incident record so that all the staff working on it have a central repository for it.

If the SD / NOC is the team that resolves the issue, use the two person rule: 1 person deals with the incident resolution while the other gets the paperwork started. Once the paperwork has been created, that person can move on to the next issue, while the individual who is resolving the incident can finish the paperwork after the solution has fixed the issue.

It does no good to the service you are providing if you create the ticket days later and don't get updates from the engineer in a timely manner.

You basically have no proof of your team doing anything.

Also, and I shout this as loud as I can - the tool is NOT a replacement for staff, nor is it really a staff multiplier. It is a tool. That is all.

If you have multiple incidents happening at the same time that are equally critical, you are screwed in more ways than one if you don't have a properly staffed SD.

John Hardesty
ITSM Manager's Certificate (Red Badge)

Thank you all very much for your replies - I would have replied sooner but have been a bit snowed under!

Diarmid,

Thanks very much for your comments. These “emergency” Incidents are, essentially, immediately taking a CI out of service and impacting on our SLAs. They must be resolved immediately.

Operators create an Incident record for each one, and this is audited daily by the service centre’s Supervisors.

They are the most common Incidents we encounter – perhaps 30-40 between 7am and 6pm each day. They are associated with Known Errors but will, depending on a new Release, reduce to a negligible level within the next 2 months.

I was intending to very clearly specify a limited list of Known Errors that the accelerated Incident process could apply to. However, as you say, the difficulty is ensuring that this approach is not taken in response to all Incidents.

TCO,

Again, thanks for your comments. The Operators are both logging and resolving these Incidents – they are not released to engineers. However your approach is basically what I was thinking of specifying.

UKVIKING

I agree with you - I’m in no way suggesting that these Incidents are not logged, whatever approach we take it's imperative that they're recorded. Creation of the Incident records is deferred for no longer than 5 minutes – certainly not days. In an ideal world I would take the 2 person rule but unfortunately resources don’t really allow for this.

I’m not sure what you mean regarding fool/tool, could you explain? I’m not suggesting that our tools are dictating our processes; TCO enquired whether we had a toolset, which is why I mentioned it.

In general:

What I envisaged was an Incident type that is similar to a Standard Change - they are pre-defined and can skip (or in this case swap around) some stages of the process. We can then check whether the "Emergency Incident" approach is being applied to "Normal Incidents" inappropriately and raise it with the staff member if required.
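To make that check concrete, here is a small Python sketch that looks for incidents handled under the "Emergency Incident" model without matching one of the pre-defined Known Errors, grouped by operator so it can be raised with the right person. The dictionary keys, the model names and the function name are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def misapplied_emergency_model(
    incidents: Iterable[dict],
    predefined_known_errors: Set[str],
) -> Dict[str, List[str]]:
    """Group, per operator, the incidents that used the emergency model
    without being linked to a Known Error on the pre-defined list."""
    by_operator = defaultdict(list)
    for inc in incidents:
        if inc["model"] != "emergency":
            continue  # normal incidents follow the normal flow; nothing to check
        if inc.get("known_error_id") not in predefined_known_errors:
            by_operator[inc["operator"]].append(inc["id"])
    return dict(by_operator)
```

An empty result means the accelerated route was only used where it was allowed.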

The reason I asked about the tool is that there are a lot of tools on the market that save you the logging time, and you can tailor them the way you like. For example, your 1st line would just have to pick up the alert, which will be automatically logged (staff might have to classify the incident), and that literally takes seconds. Then the 1st line can concentrate on restoring the service by referring to the KEs or KB. Once that's done, your staff can update the solution and resolve the ticket.

You can tailor those tools the way you want, and they can make your life easier if you use them right. You can automate most of the steps that I mentioned in my previous approach.

Just make sure you define the process first and then tailor your tool to it. As UKVIKING mentioned earlier, a fool with a tool is still a fool.

Also, if you think that investment in problem management is necessary, to build a rich, structured Known Error Database and to prevent incidents proactively so that you reduce the number of incidents you're getting and improve your availability, then go ahead. You should know better than us on that. Another suggestion: read about availability techniques and try the expanded incident lifecycle technique to see where most time is spent on each incident and where it needs improvement.
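For the incident lifecycle suggestion, a minimal Python sketch of the kind of breakdown it gives you: time spent between consecutive milestones of an incident, averaged over a set of incidents. The milestone names and the idea that every incident has a time stamp for each one are assumptions for the example; your tool may record different points.

```python
from collections import defaultdict
from typing import Dict, List, Optional

# Assumed milestones, in order, each stored as a datetime on the incident.
MILESTONES = ["occurred", "detected", "diagnosed", "repaired", "recovered", "restored"]


def stage_seconds(incident: dict) -> Dict[str, float]:
    """Seconds spent between each pair of consecutive milestones."""
    durations = {}
    for start, end in zip(MILESTONES, MILESTONES[1:]):
        durations[f"{start}->{end}"] = (incident[end] - incident[start]).total_seconds()
    return durations


def average_stage_seconds(incidents: List[dict]) -> Dict[str, float]:
    """Average time per stage across incidents (empty dict if none given)."""
    totals = defaultdict(float)
    for inc in incidents:
        for stage, secs in stage_seconds(inc).items():
            totals[stage] += secs
    return {stage: total / len(incidents) for stage, total in totals.items()}


def slowest_stage(incidents: List[dict]) -> Optional[str]:
    """Name of the stage with the highest average time, or None if no data."""
    averages = average_stage_seconds(incidents)
    if not averages:
        return None
    return max(averages, key=averages.get)
```

Whichever stage comes out slowest is where to look first, whether that is detection, diagnosis or the fix itself.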

Yes, got you. We do have (potentially) the ability to have these Incidents automatically log records when they arise - however I don't know whether this is 100% possible yet, and so I don't want to mention the functionality in the process.

I'll have a read up on availability techniques. My ITIL books have arrived today (300 quid, oof!) so I intend to bury myself in them for the next few months anyway!

Cheers for the advice. I think I'm probably going to be on this site quite a bit, having read some of the good help you're all giving each other.

If you don't have the resources to have a two-person team work on a ticket, how are you going to deal with 2 or more incidents, where the limited resources fix the incident / restore service and then move on to the next ticket?

By then, the person who has worked on 2 or 3 consecutive incidents and restored service will have forgotten what they did on each.

This can be alleviated by writing it down as it is done. Sort of like the old timestamp rule for financial transactions

The phrase "a fool with a tool is still a fool" describes companies and organizations that think that if they buy or use a tool to do X, all of their problems are over.

As you stated in the first post, you are looking at your processes. You need to go policy, process, procedure, work instruction, where the last 2 should be tool centric.

If your availability SLAs are written that badly and your SD / NOC / monitoring centre is so resource deprived (low staff numbers), I can almost guarantee that your availability SLA will be breached every month.

Especially if there are multiple incidents requiring instant response happening in a short period.

Finally,
The incident process should be the incident process. The process flow should be the same for all incidents.

The only difference is the time scale between each stage, from high priority / high severity to low priority / low severity incidents.

This way all staff know exactly what the process is and don't require re-training / re-enforcement when they move from covering one type to another.

The time scale between stages can be measured in seconds/minutes for high pri/severity and in hours/days for low pri/sev.

But the process should be the process.
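One way to picture "same flow, different time scale" is the Python sketch below: a single ordered list of stages used for every incident, with only the per-stage targets varying by priority. The stage names and every duration here are invented for illustration, not taken from any real SLA.

```python
from datetime import datetime, timedelta
from typing import Dict

# One process flow for every incident, whatever its priority.
STAGES = ["logged", "categorised", "diagnosed", "resolved", "closed"]

# Hypothetical time allowed for each stage, measured from the previous one.
TARGETS: Dict[str, Dict[str, timedelta]] = {
    "high": {
        "logged": timedelta(seconds=30),
        "categorised": timedelta(minutes=1),
        "diagnosed": timedelta(minutes=2),
        "resolved": timedelta(minutes=5),
        "closed": timedelta(minutes=10),
    },
    "low": {
        "logged": timedelta(hours=1),
        "categorised": timedelta(hours=4),
        "diagnosed": timedelta(days=1),
        "resolved": timedelta(days=3),
        "closed": timedelta(days=5),
    },
}


def stage_deadlines(detected_at: datetime, priority: str) -> Dict[str, datetime]:
    """Walk the same stages in the same order; only the clock differs."""
    deadlines = {}
    due = detected_at
    for stage in STAGES:
        due = due + TARGETS[priority][stage]
        deadlines[stage] = due
    return deadlines
```

Nothing in stage_deadlines branches on priority apart from the lookup of the targets, which is the whole point: staff learn one flow and the priority only sets the pace.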

And lastly - please don't equate the incident management process with the change management process. IM restores existing service to what it was. CM redefines / adjusts / remakes (changes) the existing service in some respect.
While a resolution to the incident may involve CM, the incident mgmt process should then involve the CM process - as stated in both the CM and IM policy and process documentation - and this would include some sort of review from the CM p.o.v.

Believe me, the SLA and staffing issues are both things that drive me mad. However I don't have a say in them (no matter how much I drive the point home).

Quote:

Finally,
The incident process should be the incident process. The process flow should be the same for all incidents.

The only difference is the time scale between each stage, from high priority / high severity to low priority / low severity incidents.

This way all staff know exactly what the process is and don't require re-training / re-enforcement when they move from covering one type to another.

The time scale between stages can be measured in seconds/minutes for high pri/severity and in hours/days for low pri/sev.

And lastly - please don't equate the incident management process with the change management process. IM restores existing service to what it was. CM redefines / adjusts / remakes (changes) the existing service in some respect.
While a resolution to the incident may involve CM, the incident mgmt process should then involve the CM process - as stated in both the CM and IM policy and process documentation - and this would include some sort of review from the CM p.o.v.

Just to clarify - I'm using a Standard Change as an example of a Change where the process flow differs, as I'm considering having an Incident type where the process flow differs. I'm not suggesting involving the CM process in this particular conundrum.

if the answer is "yes" then one of several Incident Models is used instead of the 'general' Incident Management process. I've done one of these models for each of the types of incident that require immediate action/slightly different approach etc. For instance, one of the Incident Models is:

Can I just warn you: someone is masquerading as you on another thread, claiming you are an idiot.

I know it's not true because you appreciated our advice.