I know that a known error is a problem with an identified root cause and a workaround. However, it is not clear to me at which point we consider the root cause sufficiently identified. Here is my example:

We had a major incident with application ABC. After the incident was resolved, a problem was opened to identify the root cause.

1.) It was investigated first by the application specialists for application ABC. They identified that the application did not work because the database cluster was responding slowly. For them, the root cause was identified at this point.

2.) After that, the SQL admins were involved, and they identified that the SQL cluster was slow because all resources were being used by stored procedure xzy1. For them, the root cause was identified.

3.) Then the SQL developers were involved, and they found that procedure xzy1 was faulty because line 21 refers to a table that is not indexed.

At which of these three points can we say that the problem became a known error? My interpretation is point 3, when the root cause is identified well enough that a fix and an RFC can be produced, but I am not sure that I am right.

I would say that it's sufficient to mark a Problem as a known error when you identify the first root cause. However, each root cause is itself a Problem. What you're doing is Root Cause Analysis, and you've discovered that the original Problem has three sub-problems associated with it. The reality is that you don't yet know which of the three (or which combination of them) caused your major Incident. However, it's more important to have identified the three individual Problems and to work on resolving all of them than it is to know exactly when to mark the original Problem as a known error. The key is that you can now do so and move on. I recommend that you don't get too caught up in small things like this. Just mark the Problem as a KE so that you can efficiently address other work that needs to be done. Whether you tag the original Problem as a KE when you find the first root cause or the third makes no real difference to the operation of your greater enterprise, so I recommend you don't spend too much time worrying about it.

Where do you identify a workaround? That's basically when you move to KE. Franck is right in stressing that you need to focus on operational rules that make things work, not so much on the theory. However, the theory helps in building operational rules. To me:
at stage 1) you've identified that THERE IS a problem, and you open one.
at stage 2) you have identified THE PROBLEM (location, nature, characteristics).
at stage 3) you have identified the ROOT CAUSE.

I would say you can consider it a known error at any stage where you identify a workaround (which can be as simple as rebooting the system). If there is no satisfactory workaround, then it never goes into the KE state. That's usually a sign that a high-priority change request is needed (once you have identified the change that will solve the issue).
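To make the difference between the two readings concrete, here is a minimal sketch. The `Problem` record and both function names are hypothetical and not taken from any ITSM tool; it only contrasts "root cause plus workaround" with "workaround is enough".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Problem:
    """Hypothetical problem record; the field names are illustrative only."""
    summary: str
    root_cause: Optional[str] = None
    workaround: Optional[str] = None


def is_known_error_strict(problem: Problem) -> bool:
    # Reading from the original question: root cause AND workaround are both known.
    return problem.root_cause is not None and problem.workaround is not None


def is_known_error_workaround_only(problem: Problem) -> bool:
    # Reading from this reply: a workaround alone is enough, at any stage.
    return problem.workaround is not None


# Stage 1 of the example: the problem is open and a reboot is the only workaround so far.
stage_1 = Problem(summary="Application ABC down", workaround="Reboot the system")

# Stage 3 of the example: the root cause has been traced to procedure xzy1.
stage_3 = Problem(
    summary="Application ABC down",
    root_cause="Procedure xzy1 refers to a table that is not indexed",
    workaround="Reboot the system",
)

print(is_known_error_workaround_only(stage_1))  # True
print(is_known_error_strict(stage_1))           # False
print(is_known_error_strict(stage_3))           # True
```

Under the workaround-only rule the record is already a known error at stage 1, while the strict rule waits until a root cause is recorded.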

Thanks once more for the helpful answers. I agree that the point at which you turn a problem into a known error may not be crucial for the enterprise. On the other hand, if one of the KPIs is the average time to turn a problem into a known error (or to identify the root cause), then this definition is not irrelevant. The more I think about it, the more convinced I am that this KPI is wrong.
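A rough sketch of why the definition matters for that KPI is below. All timestamps are invented for illustration, and the stage names simply map to points 1 to 3 of the example; nothing here reflects real data.

```python
from datetime import datetime
from statistics import mean

# Hypothetical timeline for one problem record; every timestamp is made up.
problem = {
    "opened": datetime(2024, 1, 1, 9, 0),
    "stage_1": datetime(2024, 1, 1, 13, 0),  # application team: database cluster is slow
    "stage_2": datetime(2024, 1, 2, 9, 0),   # SQL admins: procedure xzy1 uses all resources
    "stage_3": datetime(2024, 1, 4, 9, 0),   # SQL developers: table is not indexed
}


def hours_to_known_error(record, stage):
    """Hours from opening the problem until the chosen stage was reached."""
    return (record[stage] - record["opened"]).total_seconds() / 3600


def average_hours_to_known_error(records, stage):
    """The KPI: mean time to known error, given which stage counts as 'known error'."""
    return mean(hours_to_known_error(r, stage) for r in records)


# Same data, three different KPI values, depending only on the definition.
kpi_by_definition = {
    stage: average_hours_to_known_error([problem], stage)
    for stage in ("stage_1", "stage_2", "stage_3")
}
print(kpi_by_definition)
# {'stage_1': 4.0, 'stage_2': 24.0, 'stage_3': 72.0}
```

With identical work done, the reported figure ranges from 4 to 72 hours, so the KPI measures the chosen definition as much as it measures performance.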

Hi, I agree with what both Frank and JP have said, and I would just like to add the following.

Much of what has been said so far addresses the "reactive" side of Problem Management. There is also the "proactive" side of Problem Management, and in your example you appear to have uncovered several additional factors that may or may not have caused this particular major incident, but that can and will result in future incidents.

Problem Management's job is not only to find solutions for existing incidents, but also to prevent incidents from occurring in the first place.

If you want a meaningful KPI, you can show those requesting it "the number of problems identified and solved before they could impact the business", together with an estimate of what the impact would have been (in $, time, outage, etc.) had each problem been recorded as an incident.

IT should toot their own horn a little, and here is an opportunity to do just that.