I would like to have your opinion on the following subject.
By definition, a known error is a problem for which a root cause is known and a workaround or alternative has been found.

In practice, I notice that on repetitive incidents, a problem is triggered and analysts search for the best alternative to close the incidents and restore the service. Then they start searching for the root cause. In that case, the definition would be ok.

It can also happen that no workaround is available yet, but the root cause is found, and during error control (when searching for a suitable fix or replacement) we have a valid alternative.
This means that only one condition is met: the root cause is known.

I'm describing the best practices in our company, and I was wondering whether the validation and acceptance of a workaround should be a separate process in problem management, on top of problem control and error control. Also, the trigger for finding and accepting a workaround is driven more by incident management.

Your opinions please: how did you describe this in your process definition, and does ITIL need a correction in its known error statement?

Assume you don't have a workaround: you start identifying the problem, start RCA and at the end find the root cause
you fix the failing component, and start planning the change
so, you passed the phase of "KE" and are working in error control
at that point you find a workaround for closing your incidents
only then the two conditions are met

the definition in the phase KE assumes you have a root cause and a workaround; while you can have found your failing component without having an alternative available

A condition that you need to bear in mind is that the relationship between Incident and Problem can be somewhat linear, i.e. incident resolved, then problem mgt takes over.

Ed, you can have a condition where you know root cause but no workaround in place. I would call that an Incident! The workflow and corresponding resolution strategy would still fall under the jurisdiction of Incident Mgt. According to the ITIL books...

Investigation and diagnosis may become an iterative process, starting with a different specialist support group and following elimination of a previous possible cause. It may involve multisite support groups and support staff from different vendors. It may continue overnight with a new shift of support staff taking over the next day. All this demands a rigorous, disciplined approach and a comprehensive record of actions taken with corresponding results.

Tip:

If it is not clear which support group should investigate or resolve a User-related Incident, the Service Desk, as the owner of all Incidents, should coordinate the Incident Management process. If there are differences of opinion or there are any other issues arising, then the Service Desk should escalate the Incident to the Problem Management team.

In other words, as long as the service interruption/degradation exists (by virtue of there not being a suitable workaround or alternative), work is performed under Incident Mgt. If all else fails, engage Problem!

A problem is the underlying cause of one or more incidents. It will become a known error when the root cause and a temporary workaround or a permanent fix is identified.

You have a root cause and you have identified a permanent fix (replacing the component). Therefore you have a known error.
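As a rough sketch of how I read that definition (the record fields and function name below are made up for illustration and are not taken from the ITIL books or any tool):

```python
from dataclasses import dataclass


@dataclass
class ProblemRecord:
    # Hypothetical fields, chosen only to mirror the wording of the definition.
    root_cause_known: bool = False
    workaround_identified: bool = False
    permanent_fix_identified: bool = False


def is_known_error(record: ProblemRecord) -> bool:
    """Root cause known AND (temporary workaround OR permanent fix identified)."""
    return record.root_cause_known and (
        record.workaround_identified or record.permanent_fix_identified
    )


# The scenario above: root cause found, replacement planned, no workaround yet.
scenario = ProblemRecord(root_cause_known=True, permanent_fix_identified=True)
print(is_known_error(scenario))
```

Read this way, the missing workaround does not matter as long as the permanent fix has been identified; a record with only the root cause known would still count as a problem.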

Jason.

Jason,

I figured I'd add some more information to the mix.

Many development organizations would not agree with this definition. To many organizations, a "Known Error" is one for which they can reproduce the problem, typically with a repeatable testcase of some form. This is the only criterion for it to be a "Known Error". In other words, to many organizations, a KE is a Problem that has been verified to be an accurate and repeatable error.

A KE does not have to have an identified fix, as the fix may not be scheduled for a number of Releases in the future and coming up with the fix, itself, may not be possible until someone spends a great deal of time analyzing/debugging the problem and evaluating options to fix it.

A KE does not have to have a workaround, as many Problems may never have a valid or acceptable workaround. Example: It may be acceptable to have a memory leak in a product, where memory randomly gets reset by other functionality in the product, making it benign to the End Users. In this case, the development team(s) may make a decision not to worry about it for a very long period of time, as it's hurting no one.

A Problem, until it is verified to be a "Known Error" or a "Repeatable Error" is typically a perceived Problem or can even be an anticipated Problem that needs to be addressed, at some point. In these cases, a Problem may never actually progress to become a "Known Error".

Development teams will typically work with stakeholders such as Product Managers, Marketing, Sales, and Customers to prioritize which Problems will or won't be addressed to improve future Releases of Products and/or Services. They will use this Problem list, in conjunction with their list of new feature Requirements and Risks that will drive work in these future Releases. These teams will not typically close a Problem until they have some formal signoff that proves that the Problem has been fixed completely, typically from the stakeholder(s) that originally identified the Problem or were victims of the Incidents that were symptomatically caused by the Problem.

A KE does not have to have a workaround, as many Problems may never have a valid or acceptable workaround. Example: It may be acceptable to have a memory leak in a product, where memory randomly gets reset by other functionality in the product, making it benign to the End Users. In this case, the development team(s) may make a decision not to worry about it for a very long period of time, as it's hurting no one.

Frank, I agree with your point and I'd take your explanation a little further. The K.E. is likely related to the concept of Proactive Problem Mgt. This memory leak may have accompanied the application during the transition into production. If it is benign to users, and no service-related impact is present, there really isn't a need for a workaround, because by default the workaround is intended to reduce the impact of repeat instances of an existing Problem or KE. HOWEVER, the second this benign KE exerts impact on a user, enough to record an Incident, any tactic used to reduce the impact or restore service would essentially become the Workaround in the KE record, and that "proactive" problem would be recognized as a conventional reactive one with an associated workaround.

So Sherlock, to answer your question, in the 'traditional' or 'reactive' sense of problem mgt, there should always be a workaround to a given KE, and the workaround addresses the action to take on the repeat occurrence of incidents. However, in the absence of Incidents caused by that KE, your KE can still exist without a workaround if it's recognized as a "Proactive Problem".

Ed, you can have a condition where you know root cause but no workaround in place. I would call that an Incident! The workflow and corresponding resolution strategy would still fall under the jurisdiction of Incident Mgt.

My point here was that if you take Sherlock's situation:

"It can also happen that no workaround is available yet, but the root cause is found, and during error control (when searching for a suitable fix or replacement) we have a valid alternative.
This means that only one condition is met: the root cause is known"

I disagree with him because you have an alternative. This makes it a Known Error for me.

The term "known error" is pretty unambiguous. If you know which CI(s) is in error then you know that and record it. It would be very odd to identify the root cause of an incident (or potential incident through proactive PM) and decide that you could not record and manage it under the error control subprocess.

I don't think for a second the Problem Management authors intended this.

A known error without a corresponding workaround or solution/RFC would still be a known error - it would not be a 'special' kind of problem or incident. However, one of two things is certain:

If you are instantiating and recording known errors by setting a status value on your problem record, the incident records should stay in an unresolved state if no workaround is available. (Of course, if you just want to keep your information management consistent, you could have a boilerplate 'workaround' that says - 'decided to live with it' - to cover such cases, provided the real activity behind that 'workaround' was a negotiated and documented agreement from the customer on that course of 'inaction'.)

If you are keeping separate error records from your problem records (there are some good reasons to do so), you would 'resolve' your problem. A problem is an 'unknown' cause of one or more actual or potential incidents - so if it's not unknown there is no 'problem' in terms of this process (natural language is a different thing again.) But as above you would leave the incident(s) open or have a special case handler.

If you are instantiating and recording known errors by setting a status value on your problem record the incident records should stay in an unresolved state if no workaround is available.

If you are keeping separate error records from your problem records (there are some good reasons to do so), you would 'resolve' your problem. A problem is an 'unknown' cause of one or more actual or potential incidents - so if it's not unknown there is no 'problem' in terms of this process (natural language is a different thing again.) But as above you would leave the incident(s) open or have a special case handler.

RJP, I agree with your comments, and that was really the point that I was making, that the Incident should remain unresolved. The reason I stated that IM would be the jurisdiction under which the issue gets addressed is because if you read the "TIP" from The Book in my initial post, you get the distinct impression that resolution activities for that WIP incident are still handled as an incident. If the investigation goes 'pear-shaped', then Problem can be engaged, even while the incident remains unresolved.

However, I do have a question about your comment:

"A known error without a corresponding work around or solution/rfc would still be a known error " - RJP

Back to Sherlock's original question, how could this be if the definition of a KE is "An Incident or Problem for which the root cause is known AND for which a temporary Work-around or a permanent alternative has been identified."

Back to Sherlock's original question, how could this be if the definition of a KE is "An Incident or Problem for which the root cause is known AND for which a temporary Work-around or a permanent alternative has been identified."

It can't. But I for one feel very comfortable putting this down to a poor choice of words. There are more glaring clunkers in the books than this one. After all the aim here is intelligent application, not exegesis.

However, some exegetical light begins to dawn... looking again at the offending passage I was struck by a slightly different discordant note: The conflating of Incidents and Problems - the Incident Life Cycle doesn't produce known errors. But I believe the problem and incident management chapters didn't have the same authors anyway.

Perhaps the following is worth considering:

The sentence reflects the majority case of how incidents will be viewed from a problem management perspective.

The most pressing problems are going to be raised on those incidents that were not resolved in incident management.

In these cases resolution of those incidents becomes dependent on Problem management, which normally will a) find the root cause and raise an RFC, but also b) work equally hard to get a workaround so that Service restoration doesn't have to wait for the implementation of the RFC.

In the majority of these cases the workaround will be arrived at once the error is identified (otherwise Incident Management would have got there first).

But if the question is really about closing incidents there is only one hard rule - you can't close an incident while the Service is disrupted. And the corollary - you must close it if Service is restored - whether by workaround or not. So in the end it's moot. The guidelines in the Incident Management chapter trump Problem Management on this point.

Many development organizations would not agree with this definition. To many organizations, a "Known Error" is one for which they can reproduce the problem, typically with a repeatable testcase of some form. This is the only criterion for it to be a "Known Error". In other words, to many organizations, a KE is a Problem that has been verified to be an accurate and repeatable error.

A KE does not have to have an identified fix, as the fix may not be scheduled for a number of Releases in the future and coming up with the fix, itself, may not be possible until someone spends a great deal of time analyzing/debugging the problem and evaluating options to fix it.

The critical part of problem management according to ITIL is that the problem manager has responsibility for a problem throughout its lifecycle.

In a development environment, steps to reproduce an error are an essential component of the problem record, but I would not consider a reproducible problem a known error.

A known error indicates that the analysing/debugging has been completed (i.e. a root cause found) and a fix proposed (which might well be a patch). This indicates the end of the investigative processes that locate the problem and identify a fix. If you do not identify a fix and/or workaround then you still have a problem.

Perhaps where you are going wrong is assuming that problem management themselves need to perform the root cause analysis and develop the fix. It may well be that the Problem Manager highlights the problem with the development manager to get resources assigned to the problem.

Although at this point the problem has been delegated to the development manager, the problem manager still needs to track its progress and ensure that it is ultimately fixed.

If the problem has little or no business impact then of course it can be left alone; invoking the problem/change/release process for a very minor fault is not justified. Bashing one minor bug has the potential to generate side effects that will cause problems with a greater business impact.

Very low business impact problems should still be noted and kept open as problems, as they may well be worth resolving in a major update to your software package, when a major testing process will be undertaken.

Jason,

I believe we are all in agreement. The nuance that allows this is an open known error versus a closed known error.

The open known error would be a reproducible case which results in an error.

The closed known error would include the appropriate fix to resolve the reproducible case.
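To make the open/closed distinction concrete, here is a minimal sketch of how it could be modelled. The state names and the two input flags are my own illustration, not terms from ITIL or from any particular tool:

```python
from enum import Enum


class KnownErrorState(Enum):
    # Hypothetical state names, used only for this illustration.
    NOT_A_KNOWN_ERROR = "not a known error"
    OPEN = "open known error"
    CLOSED = "closed known error"


def known_error_state(reproducible: bool, fix_identified: bool) -> KnownErrorState:
    """Classify a problem under the open/closed known error view described above."""
    if not reproducible:
        # Still a perceived or anticipated problem, not yet verified.
        return KnownErrorState.NOT_A_KNOWN_ERROR
    if fix_identified:
        return KnownErrorState.CLOSED
    return KnownErrorState.OPEN


print(known_error_state(reproducible=True, fix_identified=False).value)
print(known_error_state(reproducible=True, fix_identified=True).value)
```

Under this reading, the development view (reproducible case) and the ITIL view (root cause plus fix or workaround) describe two stages of the same record rather than two competing definitions.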