I was wondering how others handle additional tasks that may spawn as a result of an incident, and how you handle managing those tasks. I actually have a specific instance in mind that I'll provide that I'm looking for feedback on.

Let us say that a PC has a hardware problem that causes the customer to place an incident with the service desk. The incident is triaged, and escalated to the operational team responsible for hardware issues. In order to restore service to the customer they decide to replace the failed PC with a spare device from their used stock. The customer is now completely functional again, and is no longer impacted by that failed device. The technician from the ops team however still needs to get the PC repaired, and it may be that spare parts need to be ordered and that device may be inoperable on a work bench for a few days until the parts arrive, at which point it is fixed and then placed back into spare stock for break/fix use.

The question becomes, does the Incident remain open until that original PC is repaired? I wouldn't think so since service has been restored for the customer, so there's no reason for them to have to see all the back-end work occurring to repair a device that is no longer impacting them operationally. I don't want my metrics for how quickly an incident is closed (which I use specifically to track restoration of service to the customer) to be skewed by these types of scenarios. However, if you do close the incident, how do you ensure that the failed device is repaired, and that the repair is properly documented? My question may not even be so much of an ITIL question, but more of an operations question.

Regardless, I welcome feedback on my question. Thanks in advance to everyone for their thoughtful commentary.

Once the laptop has been restored to the agreed service level, the incident is closed.

As for the parts etc, that is not ITIL or IT SM but inventory and stock management, as well as the service that is linked to the vendor providing the laptop hardware repair service.

John Hardesty
ITSM Manager's Certificate (Red Badge)

If you want to take a more nuanced approach to Incident Management (and it appears that you do) then you need to distinguish between an incident being RESOLVED and CLOSED.

Once service is restored (new PC delivered) the Incident is RESOLVED. However, the Incident should not be CLOSED until root cause (the underlying problem) is known and addressed. Only then may the Incident be CLOSED.

That way you preserve your "time to resolution" record, but place the Incident in a state where additional actions are still required. If there is difficulty getting at the root cause, then you invoke Problem Management.

This is how you take full advantage of the power of the ITIL approach to Incident and Problem Management.
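To make the RESOLVED/CLOSED split above concrete, here is a minimal sketch of how a ticket record could keep the two timestamps apart, so that "time to resolution" is measured to the point service was restored and not to the point the ticket was finally closed. The class, field, and method names are my own assumptions for illustration, not taken from any particular ITSM tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Incident:
    # Hypothetical record layout, for illustration only.
    opened_at: datetime
    resolved_at: Optional[datetime] = None  # set when service is restored
    closed_at: Optional[datetime] = None    # set when root cause is recorded
    root_cause: Optional[str] = None

    @property
    def status(self) -> str:
        if self.closed_at is not None:
            return "CLOSED"
        if self.resolved_at is not None:
            return "RESOLVED"
        return "OPEN"

    def resolve(self, when: datetime) -> None:
        """Service is restored; the resolution clock stops here."""
        self.resolved_at = when

    def close(self, when: datetime, root_cause: str) -> None:
        """Record root cause and close; only allowed after resolution."""
        if self.resolved_at is None:
            raise ValueError("resolve the incident before closing it")
        self.root_cause = root_cause
        self.closed_at = when

    def time_to_resolution(self) -> Optional[timedelta]:
        """Metric based on resolved_at, so later follow-up work does not skew it."""
        if self.resolved_at is None:
            return None
        return self.resolved_at - self.opened_at
```

With this shape, a PC swap that restores service in two hours reports a two-hour resolution time even if the ticket stays RESOLVED for days while follow-up work continues.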

First of all, I agree with you 100% that it is not in the purview of Incident Management to do root cause analysis. That is the domain of Problem Management, as you said. I still prefer to use two states - RESOLVED and CLOSED - to represent incidents where normal service has been restored. In my organization (a large university), the engineering/admin team quickly knows root cause for the great majority of incidents that occur. Invoking PM in this case is not required, as they can simply enter the information in the correct field, and the Incident closes automatically.

In the rare case (~4 - 6 times per year on my campus) that root cause is not known, then PM addresses it, and the Problem Report has links to all incidents arising from this problem. When the Problem is closed, the linked incidents automatically have the root cause explanation placed in the appropriate field and have their status field changed to CLOSED. The RESOLVED state indicates that root cause remains to be determined for this incident, and that is useful information to our IT organization.
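For anyone wondering what that automatic propagation might look like, here is a rough sketch. It assumes a Problem record that holds references to its linked Incident records; all class and field names are hypothetical and this is not how any specific tool implements it.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class IncidentRecord:
    # Hypothetical incident ticket; starts out RESOLVED (service restored).
    incident_id: str
    status: str = "RESOLVED"
    root_cause: Optional[str] = None


@dataclass
class ProblemRecord:
    # Hypothetical problem ticket that links to the incidents it explains.
    problem_id: str
    linked: List[IncidentRecord] = field(default_factory=list)
    status: str = "OPEN"
    root_cause: Optional[str] = None

    def link(self, incident: IncidentRecord) -> None:
        self.linked.append(incident)

    def close(self, root_cause: str) -> None:
        """Close the problem and push the root cause to linked incidents."""
        self.root_cause = root_cause
        self.status = "CLOSED"
        for incident in self.linked:
            # Only incidents still waiting on a root cause are updated.
            if incident.status == "RESOLVED":
                incident.root_cause = root_cause
                incident.status = "CLOSED"
```

The point of the design is that the root cause is entered once, on the Problem, and every linked Incident picks it up without anyone editing the incident tickets by hand.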

Of course, no two organizations implement ITIL the same way. Each adapts it to what works best for them. Some of them use RESOLVED to mean that IT says it is fixed, and the CLOSED status to mean that the customer reporting the outage has confirmed normal operations. Since 99% of incidents are discovered and fixed by our engineers before a customer reports it, that makes no sense for us, since there is rarely a "caller" involved.

You need to re-read the section on Problem Management, as you appear to be missing a key point about it and about incident mgmt

Not every unknown root cause can be found, nor should it be

If you get 10k incidents a month, most of them will have a known cause - keyboard/brain interface failure or some external influence or S.h !!

Incident Management should not care about the reason for the incident happening, only that the service is restored by the best means possible - regardless of whether this destroys the evidence of the incident in order to return the service to an agreed level.

Resolved and Closed should be tightly linked - Resolve it when the service has been restored, close it when the customer/user agrees or administratively after 3 days

If the incident recurs w/in the 3 days, a new incident is still raised; and the same rules apply.
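The resolve-then-close rule above is simple enough to sketch. This is only an illustration under my own naming: the function name, the parameters, and treating "3 days" as exactly 72 hours are all assumptions.

```python
from datetime import datetime, timedelta

# Assumption: "3 days" is read as 72 hours after resolution.
AUTO_CLOSE_AFTER = timedelta(days=3)


def should_close(resolved_at: datetime, now: datetime, user_confirmed: bool) -> bool:
    """Close when the user agrees, or administratively once 3 days have passed."""
    return user_confirmed or (now - resolved_at) >= AUTO_CLOSE_AFTER
```

A recurrence inside the 3 days does not reopen anything in this sketch; it is raised as a new incident and goes through the same check.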

so now to PM
If 10% of the monthly incidents have no known cause, that is 1000 per month that have to be investigated, analysed, assessed, and have a solution proposed, tested, approved and deployed. Most companies - including schools - do not have 10 - 20 people in PM

What has to happen is that the PM team lead takes the 1000 candidates for PM review and cherry-picks the ones that can be solved in accordance with the workload and staff on hand.
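As a back-of-the-envelope sketch of that cherry-picking, assuming the team lead ranks candidates by how often they recur and has a fixed number of staff-days to spend. The ranking rule, the effort estimates, and every name below are my assumptions; the figures 10,000 and 10% are the ones from the example above.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    # Hypothetical fields for a problem candidate.
    name: str
    recurrences: int    # how many incidents it has caused
    effort_days: float  # rough estimate of investigation effort


def pick_for_review(candidates: List[Candidate], capacity_days: float) -> List[Candidate]:
    """Greedy pick: most-recurring first, skipping anything that no longer fits."""
    chosen = []
    remaining = capacity_days
    for c in sorted(candidates, key=lambda c: c.recurrences, reverse=True):
        if c.effort_days <= remaining:
            chosen.append(c)
            remaining -= c.effort_days
    return chosen


# The arithmetic from the post: 10% of 10,000 incidents per month.
monthly_incidents = 10_000
unknown_percent = 10
pm_candidates = monthly_incidents * unknown_percent // 100  # 1000
```

Whatever ranking you actually use, the capacity limit is what decides how many of the candidates ever get looked at.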

The PM team should reflect the architecture supported. The PM team investigates the causes; they may farm out the work to 3rd and 4th tier engineers to find the solution - if possible

The PM Team should be more proactive than reactive so the % of unknowns is reduced

John Hardesty
ITSM Manager's Certificate (Red Badge)

My approach to Incident management is clearly not the same as yours, but that's really okay. You obviously have a great deal of experience and knowledge of ITIL, which is wonderful. I have a lot as well, and I believe in paying attention to what other ITSM people have to say, and find out what I can learn from how others do things. So - thank you for sharing your views.

As I have said - no two organizations implement ITIL in the same way. The point is to leverage these best practices in a way that delivers the maximum value to the business (university in my case). That is our responsibility as IT professionals. No one is "wrong" in how they do it, and strict ITIL compliance is not as simple to assess as one might think.

I have to say that if you are experiencing thousands of Incidents per month in your IT organization, then either you have a spectacularly unstable IT system, or else you have bought into the extremely non-ITIL re-definition of Incident as "a request to the Service Desk". This redefinition has been promoted by some of the major ITSM tool people recently (Remedy, ServiceNow, others) and has almost negated the relevance of the Incident Management metrics. Why debate whether or not a certain event is an Incident, when it will be logged as an Incident in your system no matter what?

And of course when a service request ordering a new laptop is an Incident, root cause does not need to be determined. In actual ITIL Incidents however (read the definition), root cause is very important for EVERY Incident.

But - as I said earlier, there is no purely ITIL implementation, to my knowledge. You have adapted ITIL in the way that works for you, which is fine. I (and many like me) find a great deal of value in the actual ITIL approach to Incident and Problem Management. I do find it frustrating and sad that so much miseducation of younger IT professionals is being pushed by the tool vendors. The momentum of their effort may ultimately lead, by force of numbers, to a revision of the actual ITIL best practices for Incident and Problem management. (Note how Rob England - also known as IT Skeptic - understands ITIL Incidents and IM, but believes that they should revise the definition to the vendor's approach, as you do: itskeptic . org/ content/ how-itil-gets-incident-vs-problem-wrong [Edited by Admin: no direct links please]).

For myself, I will remain a purist, as I have been for several decades now.

I believe that we are mostly in agreement here, UNVIKING, though we may be talking around each other a bit... and 10k ITIL Incidents per month seems very high to me, even for the multi-national organizations with whom I have worked. (Plus, I hold that root cause needs to be recorded for Incidents every time, without fail.)

Anyway - recording root cause in each Incident ticket is good housekeeping, providing many benefits. For one thing, it avoids the need to look up the associated Problem ticket (if there even is one!) to find the root cause of past Incidents. Doing this does not mean that the ITIL Problem Management process is subverted. PM is still invoked if root cause analysis is required (most times, it is not).

I would add that, while the stated purpose of ITIL Incident Management is to restore normal service delivery ASAP, there is much more than this immediate benefit to the IM process, or else there would be no need to file Incident Reports. Those reports themselves have tremendous value, even after service has been restored. Maximizing that value is part of my responsibility. Reporting and Metrics is not part of that stated ITIL purpose of IM, but it is a major benefit of the IM process. One useful metric is the percentage of Incidents with unknown root cause, for example. There are many more.
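The unknown-root-cause metric mentioned above is easy to compute once the field exists. A minimal sketch, assuming each incident's root cause field is exported as a string, with None or an empty string meaning "not known"; the function name and that convention are my assumptions.

```python
from typing import Iterable, Optional


def percent_unknown_root_cause(root_causes: Iterable[Optional[str]]) -> float:
    """Percentage of incidents whose root cause field is empty or missing."""
    causes = list(root_causes)
    if not causes:
        return 0.0  # no incidents in the period, nothing to report
    unknown = sum(1 for cause in causes if not cause)
    return 100.0 * unknown / len(causes)
```

Tracked month over month, a number like this shows whether Problem Management is actually shrinking the pool of unexplained incidents.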

While I would argue that my approach is the more purely ITIL-compliant approach, there is definitely room for individual interpretation here. The main point is to understand the framework, and to leverage the principles to maximize the value to the IT organization. Then - I aim to keep an open mind to how others are doing things, to learn where I can get better.

I have done IT SM for 30 years. I have been certified in ITIL since 2001 - 9/12.
I hold the ITIL V2 Red Badge and ITIL Expert.

ITIL IM does not care about the root cause. It only cares about restoring service. If the service is restored, the customer - in truth - does not care, and frankly neither should the IM resolution teams.

All they need to record is what they did to restore the service - it may be as simple as reboot or restart or cancel a stuck process.

Root cause is the sole responsibility of the PM team - in both its reactive and proactive processes. They can spend the time trying to figure out why it happened so it should not happen again.

The Service Desk mgmt and the general management for IT will push the PM team to concentrate on proactive work and, if there are enough resources, some reactive PM work: taking a set of incidents that recur to find out why and try to remove the cause.

If Incident Management spends time trying to figure out the root cause INSTEAD of restoring the service, the customer will suffer because the service is not restored.

However, sometimes during incidents, it is quite obvious what the cause of the incident is - and of course the remedy. These incidents should have the KNOWN cause recorded against the incident so statistics can be done to track.

But to spend days and weeks of man-hours trying to find an unknown cause while NOT restoring the service is very poor IT SM.

Statistically speaking, any incident has a 95% chance of being caused by something that is quite obvious or is found during the incident troubleshooting. The remaining 5% are the ones that the PM team should be given - provided that they can find out why.

It is the law of diminishing returns with Problem mgmt.

You have a finite set of resources who can only work on so many problem investigations at one time w/o impacting them all.

I agree that the better the information recorded at the closure of incidents, the better the ability to report

John Hardesty
ITSM Manager's Certificate (Red Badge)

Again, I believe that we agree on most things here, though I think I am hearing you warn about tier-2 IT support spending time on root cause while a system is down, if we have a root cause field in our Incident reports. Not to worry. IT professionals are passionate about "their" services performing well. The need to, at some point, make an entry in that field will not lead them astray! Nor will they be scandalized at entering PM data on an IM form.

As you stated, PM is needed for perhaps 5% of Incidents. But root cause is needed for 100% of Incidents. Additionally, the person(s) following up with PM work are often the same ones who handled the Incident. Once the Incident is RESOLVED, then they take any additional time needed to determine root cause - upon which the Incident is CLOSED. Occasionally, the problem is sufficiently obscure to do formal PM on it. The Problem Report is linked to the Incident Report, which in turn is linked to any Service Desk trouble tickets reporting symptoms of the Incident (which get CLOSED when the Incident is RESOLVED). Filing root cause in the Problem Report automatically enters the info into all linked Incident Reports' root cause fields and closes them.

Very clean, and the two processes are both leveraged to maximum advantage.

Stantonl

PS. One place where I believe that we differ is that I have never allowed the Service Desk people to get involved with Incident or Problem management in any way. They do not have the training nor the experience to know whether a user's problem is due to laptop issues, user error, or a failed IT component. They log the Trouble Ticket, and if they cannot figure out the issue, escalate to tier-2. If it proves to be an actual Incident, tier-2 files the Incident report, linked to the Trouble Ticket.

I still do not agree that all incidents have to have a root cause assigned.

A closure code - yes

Which for the most part will be the resolution and/or represent the known cause

So, we agree to disagree.

Do you keep incidents in a non-closed but resolved state until such time as there is a root cause assigned?
Do you penalise staff for not completing the root cause?
If so, how do you know the RC selected is the correct one?
Are you penalised by your customers for still having incidents open months after resolution but still without an RC?

I firmly agree that the resolution team should complete the incident record before closure

Of course, there are incidents for which root cause is never known. We have invented a status for those that we call "Closed Pending Recurrence" or CPR. There are incidents for which the decision has been made that further investigation is not a productive use of time. Things can stay in this state forever, and I find that the number of CPR statuses is useful information.

We do not use our ITIL tools to rate performance of technical staff. I want them to see these tools as a help in doing their jobs - not a potential club to be used against them. We do give reminders to staff who are not completing the documentation, or updating their IRs and/or CRs. As far as customers are concerned, they never see Incident Reports. They would only see Trouble Tickets entered by the Service Desk, and these are closed as soon as service is restored to the caller. If level-2 support determines that the customer's issue was caused by an actual ITIL incident, then an IR is created for it and worked from the tech side. The user-facing aspect is handled by the Service Desk with their Trouble Tickets. We don't need hundreds of callers demanding to know how we got to the bottom of the network slowdown that hit us for an hour last month - and the callers really don't care about things at that level.

As you have stated many times, ITIL is a set of suggestions. Each organization will use them or not, in whatever way they find serves them the best. We are getting a wealth of value from our change and incident management processes.