Hey folks,
I'd like to define a scenario/use case and get some opinions on how ITIL would specify it be handled.

8:00am a critical application process on a Windows server dies
8:01am an agent generates an SNMP trap and sends it to an event browser
8:05am tech personnel see the trap message and create a corresponding incident record
8:07am tech personnel log into the system and restart the application process
8:10am tech confirms that the application is back up, but no root cause has been established.

What is the next step from an Incident Management perspective?

Based on the following assumptions, what is the next step from an ITIL Problem Management perspective? What might the full lifecycle of the Problem look like?

==========
Assume that if so inclined, someone could review system metrics and see that memory consumption had steadily been increasing on the server. No other information is immediately available.
However, someone, if diligent, could check the process occasionally and see that its memory consumption has been growing steadily since it was restarted. From there, someone could also check the vendor's knowledge base and see that a patch has been published for a known memory leak.
==========
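
For what it's worth, the "steadily increasing" check described in the assumptions could be sketched roughly like this. It is only an illustration: the function names, the 1 MB/minute threshold, and the sample readings are all made up, and a real monitoring tool would pull the numbers from its own metrics store.

```python
from statistics import mean


def memory_growth_slope(samples):
    """Least-squares slope of (minutes, MB) samples, in MB per minute."""
    times = [t for t, _ in samples]
    usage = [m for _, m in samples]
    t_bar, m_bar = mean(times), mean(usage)
    denom = sum((t - t_bar) ** 2 for t in times)
    if denom == 0:
        raise ValueError("need samples taken at two or more distinct times")
    return sum((t - t_bar) * (m - m_bar) for t, m in samples) / denom


def looks_like_leak(samples, min_slope_mb_per_min=1.0):
    """Flag steady growth: usage never drops and the slope exceeds a threshold.

    The threshold is an assumed value for illustration, not an ITIL figure.
    """
    usage = [m for _, m in samples]
    never_drops = all(b >= a for a, b in zip(usage, usage[1:]))
    return never_drops and memory_growth_slope(samples) >= min_slope_mb_per_min


# Hypothetical readings: (minutes since restart, process memory in MB)
readings = [(0, 200), (30, 260), (60, 320), (90, 380)]
print(memory_growth_slope(readings), looks_like_leak(readings))
```

A result like this would only be evidence for the problem investigation; it does not replace checking the vendor's knowledge base.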

In this case, from the point of view of incident management, the incident can be resolved because the system is working and is not causing a negative impact on the users' services.

However, if you are operating problem management, you would raise a new problem record against the incident so that its root cause can be identified. The actions you've mentioned, such as checking memory and checking known issues with the vendor, would be part of the investigation that an analyst working on problem records would go through. If the vendor does have a patch for the memory leak, at this point (correct me if I am wrong, someone!) you can create a known error record because you know what the root cause is and you have a workaround for it, even though it is not in place yet.

You then raise a change record to apply the patch to the server concerned. The risk of this would be considered by the change manager and CAB according to your change management process. Then, assuming they approve it and add it to the schedule of changes, someone applies the patch. At this point the change, problem, and known error records can be closed. (It may be that you have a set period of time after the patch goes in to monitor the change before actually closing it, from a QA point of view.)

Should a Known Error be tracked as an attribute on the problem record, or should it be tracked as a separate record of its own?

Hello Nikhil,

Based on my experience, most organizations like to simply have it be represented as an attribute on the Problem record. They find it easier to simply "tag" an existing record than to enter a new one. However, it's not wrong to break it out as a separate record if you desire.

The feedback we get is that creating another record for a "Known Error" is tedious and highly redundant, as most of the information is already in the Problem record: mitigation details, scheduled releases to correct it, and transparency through its ties to specific Changes. Cloning or re-entering all of this data into a separate record is not something that most people "want" to do. We've found that it's simply easier to select a field that tags a Problem as a KE and to enhance the existing information in the Problem record to reflect actions and details around the KE.

As a result of all of this, in the system we offer, we simply allow users to tag a Problem as a KE. They have the ability to easily filter out Known Error Problem records from Unknown Error Problem records and that seems to keep partners satisfied.
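
To make the "tag" idea concrete, here is a minimal sketch. It is not our product's data model; the class, field names, and record IDs are hypothetical and only show a Known Error flag living on the Problem record, with a filter on top.

```python
from dataclasses import dataclass, field


@dataclass
class ProblemRecord:
    # All field names here are illustrative assumptions.
    problem_id: str
    summary: str
    root_cause: str = ""
    workaround: str = ""
    known_error: bool = False  # the "tag" on the Problem record
    related_changes: list = field(default_factory=list)


def known_errors(problems):
    """Problems that have been tagged as Known Errors."""
    return [p for p in problems if p.known_error]


def unknown_errors(problems):
    """Problems still without the Known Error tag."""
    return [p for p in problems if not p.known_error]


problems = [
    ProblemRecord("PRB-1", "App process dies on Windows server",
                  root_cause="memory leak", workaround="restart process",
                  known_error=True, related_changes=["RFC-7"]),
    ProblemRecord("PRB-2", "Intermittent network drops"),
]
print([p.problem_id for p in known_errors(problems)])
```

The separate-record approach would instead need a second record type that points back at the Problem, which is where the duplicated data entry comes from.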

Again, neither way is wrong. Capturing and managing data is all good. As long as you can successfully pitch either one, your enterprise will win.

You then raise a change record to apply the patch to the server concerned.

Hello itilimp,

We find that most organizations break this down even further. At the point in your process description where you recommend entering a Change record to apply the patch, they actually enter a Service Request to apply the patch. The Service Request results in a Task or series of Tasks for Service Groups to execute. The Service Group doing the work will then make a decision as to whether the work necessary is big enough to warrant a Release (that groups a number of Changes together) or a Change. Upon making this decision such Service Groups will create the new Release and/or Change records and then act upon them, appropriately, as they move through the process.

Some questions have come up as a result of reading this thread. First, I need to restate the scenario.
I've modified it to show how the various ITSM processes AND the ICTIM processes would fit. These are the acronyms in use: ICTO is ICTIM Operations; IM is Incident Management; PM is Problem Management; CM is Change Management; AU is Automation; SD is Service Desk.

1. 8:00am a critical application process on a Windows server dies - AU
2. 8:01am an agent generates an SNMP trap and sends it to an event browser - AU
3. 8:05am tech personnel see the trap message and create a corresponding incident record - ICTO (can be AU)
4. 8:07am tech personnel log into the system and restart the application process - ICTO
5. 8:10am tech confirms that the application is back up, but no root cause has been established - ICTO
6. day end SD Manager runs reports of incident volumes and distributes - SD
7. next day PM manager reviews reports and sees multiple incidents over a period of time against this server for this issue - PM
8. next day PM tech opens a problem against this issue and begins Root Cause Analysis - PM
9. later PM tech finds root cause (memory leak), finds solution (patch) and flags the problem as a known error (including the recycle as the workaround) - PM (I am agreeing with Frank on this approach to KEs.)
10. later PM tech opens an RFC to get the patch installed - PM
11. later change manager does impact analysis and gets CAB approval - CM
12. later change implemented via tasks to ICTO personnel - ICTO (It's possible that the SD can also implement, depending on your company's stance. Our SD isn't at that maturity level yet.)

So, first question for all is, how do you integrate ICTIM into your ITIL implementations (which are usually ITIL ITSM implementations)?
As for PM work, please respond with how you do RCA work. Do you have a specific methodology? How robust is it? For example, if the solution to the problem were a bit more complex than a vendor's patch, maybe the RCA would reveal that it wasn't cost-justified to solve the known error? In that case, step 10 would not be completed, since it's been decided to "live with it". If we follow this course, step 10 would be replaced with something like this: ICTO modifies the monitoring and alerting to accumulate, say, 10 of the SNMP alerts for this issue before an incident is created. (Of course, each time an SNMP alert pops for this issue, an ICTO tech would perform the workaround of a recycle, just without a logged ticket except for the 10th time.) In other words, to prevent incidents being created for something we've decided to live with, we might choose not to track every time we recycle due to this issue. After all, the SNMP logs would have all of the historical information, so why clog our ticket database with meaningless FYI tickets?
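
To show what I mean by accumulating alerts, here's a rough sketch. The class name, the alert key, and the reset-after-logging behaviour are my own assumptions for illustration; a real event browser would do this in its own rule or correlation configuration.

```python
from collections import defaultdict


class AlertAccumulator:
    """Log an incident only on every Nth alert for the same issue."""

    def __init__(self, threshold=10):
        self.threshold = threshold
        self.counts = defaultdict(int)

    def record(self, alert_key):
        """Count one alert; return True when an incident should be logged."""
        self.counts[alert_key] += 1
        if self.counts[alert_key] >= self.threshold:
            self.counts[alert_key] = 0  # start accumulating again
            return True
        return False


acc = AlertAccumulator(threshold=10)
# Hypothetical key identifying "this issue on this server"
decisions = [acc.record("server01/app-process-down") for _ in range(20)]
incident_positions = [i + 1 for i, d in enumerate(decisions) if d]
print(incident_positions)
```

The tech still performs the recycle on every alert; only the ticket creation is throttled.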
Another question, and this one is for Frank Guerino: when working with known problems (or unknown error problems, as you called them), the same sort of thing could happen, i.e. doing RCA reveals that determining the root cause would be a waste of time compared to simply executing a workaround every time the issue comes up. In that case, you'd have a known problem that has been deliberately kept as a known problem, never moving to the known error and, ultimately, the elimination steps of the process. How would you deal with these?

Hello Lexxone,

It depends on the organization we deal with. Many will simply keep the Problem open, with a resolution statement that states that there is no intention to resolve the Problem in the near term. Others will do this "and" tag it as a Known Error, to ensure they can quickly bring the Problem up, on search, keeping the Root Cause blank or putting in a comment stating it hasn't been found yet and why.

However, the workaround you mention would typically be found in the Incident details, not the Problem, as it's an Incident that will be "the issue that comes up" at the Help Desk, not the Problem. The HD staff will look in their Knowledge Management System to see if an Incident has ever occurred before, what the resolution is, whether or not there are repeat occurrences of the Incident, whether or not a formal Problem has been registered against the Incident, what the state of the Problem is, what work has been scheduled (or not) to address the Problem, what Product Release the Changes are scheduled for, who's worked on the associated Incidents, Problems, Products, Releases, and Changes, what documentation is associated with each of them, and so on. Success for Help Desk staff and alternate support resources is quick access to and transparency into any and all details that impact their customers' satisfaction.

7. next day PM manager reviews reports and sees multiple incidents over a period of time against this server for this issue - PM

First of all, this use case didn't specify anything about multiple incidents over a period of time. Let's say this is the first time they've seen this particular problem. So there isn't a history of problems with this CI. What we have is an incident w/o a root cause.

QUESTION: All too frequently I see incidents closed (appropriately) w/o any subsequent RCA to try to reduce the chance of the problem happening again. ACCORDING TO ITIL, does every incident w/o a root cause trigger the creation of a problem? If not, why?

I had to look in the Book: since the goal of PM is to minimize the adverse impact on the business of Incidents & Problems that are caused by errors in the IT infrastructure, the existence or lack of a Root Cause may not necessarily be THE deciding factor on whether or not to open a problem.
Isn't it up to the business itself to decide? Or am I splitting hairs?

QUESTION: All too frequently I see incidents closed (appropriately) w/o any subsequent RCA to try to reduce the chance of the problem happening again. ACCORDING TO ITIL, does every incident w/o a root cause trigger the creation of a problem? If not, why?

The problem manager should be reviewing incident statistics to identify problems, prioritise them and log them, and then decide whether or not they currently justify further investigation and any resulting RFCs.

Investigating a one-off incident with low business impact would not be an efficient use of problem management resources. If incident management were to create problem records for every incident, then incident management would be slowed down and the problem manager would be overwhelmed.

Of course, there is nothing preventing incident management from highlighting problems. However, in most cases incident management is more likely to highlight problems that come from those who shout the loudest, or that are tedious, rather than those problems that, once solved, would give the biggest increase in the efficiency of incident management.

Remember that incident management includes escalation through to higher levels of technical support. It's assumed that through these channels all incidents can be resolved using the incident management process. During the incident management process, problem-solving skills will no doubt be used to identify and work around or fix the incident; however, this is still incident management.

Let's take a simple example...

A user logs an incident. They have no network connection.

Through incident management it's determined that the fault is with the user's network wall socket. A nearby wall socket is used to get the user running again, and action is taken for the socket to be replaced.

Whilst we have a "problem", then a "known cause" followed by a "circumvention" followed by a "permanent resolution", it's not a problem, that's standard incident management.

However, our eagle-eyed problem manager has spotted that 8 of these sockets have failed in the past 4 months. This is a high failure rate, so a problem is logged and investigation begins.

We already have a circumvention, as we can replace the network sockets. However, we still have no idea why we have the failures.

It's determined that the underlying cause is some poor quality installation work undertaken by the contractor who installed the network points. We now have the underlying cause in addition to our workaround.

So we now have a known error.

For permanent resolution we haul the contractors back in to check and replace any additional faulty network points. This will go through change management if appropriate.

We might now put some processes in place or review our relationship with that contractor to prevent a recurrence.
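
If it helps, the kind of report the problem manager uses to spot the "8 sockets in 4 months" pattern could be sketched like this. The function name, the 120-day window, the threshold of 5, and the incident data are all invented for illustration; your service desk tool's reporting would normally do this for you.

```python
from collections import Counter
from datetime import date, timedelta


def recurring_failures(incidents, today, window_days=120, min_count=5):
    """Return categories with at least min_count incidents in the window.

    incidents is a list of (category, date_logged) pairs. The window and
    threshold are assumed values, to be set by the problem manager.
    """
    cutoff = today - timedelta(days=window_days)
    counts = Counter(cat for cat, logged in incidents if logged >= cutoff)
    return {cat: n for cat, n in counts.items() if n >= min_count}


# Hypothetical data: 8 socket failures spread over roughly 4 months,
# plus a one-off incident that should not become a problem candidate.
today = date(2024, 5, 1)
incidents = [("network wall socket failure", today - timedelta(days=15 * i))
             for i in range(8)]
incidents.append(("printer jam", today - timedelta(days=3)))

candidates = recurring_failures(incidents, today)
print(candidates)
```

The one-off incident stays out of the list, which matches the point above about not spending problem management effort on low-impact, non-recurring incidents.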