Generally I would take a known error to be one where you have found the root cause of the problem (i.e. diagnosed what the issue is) and have a potential workaround.

However, I have a grey area situation: we know what the issue is (a database is corrupted), what the workaround is (rebuild it), and how to permanently fix it (the system is running on legacy infrastructure and will have to be replaced at some point in the next few years), but we don't actually know what is originally causing the database to become corrupted.
Am I correct in my belief that, as the root cause has not been identified, I cannot log this as a "known error"?

I would say, consider it from a management point of view. The Incident > Problem > Known Error > RFC > Change chain of processes is there for a specific purpose: To ensure that services are operating as expected.

In this context it is sufficient to consider a 'root cause' as the combination of 'CI in error status' + 'Action required to restore functionality'. So in the service restoration chain the root cause is the cause [i]of the incident[/i], not the cause of the failure that caused the incident.

The technical 'cause' may need to be understood in some cases, but if the fix is clear without that, there is often no additional value in pursuing it. If you do, you are (perhaps) doing the vendor's job for them.

Or in slightly different words:

The 'root cause' is adequately discovered when you can answer 'Why did this happen' and 'What must be done to remedy it'. In your example the root cause was found. There is always a deeper why: Don't go any deeper than is necessary to ensure services are effectively delivered at optimal cost.

We know what the issue is (a database is corrupted), what the workaround is (rebuild it), and how to permanently fix it (the system is running on legacy infrastructure and will have to be replaced at some point in the next few years), but we don't actually know what is originally causing the database to become corrupted.

Hi Jamiey,

I have a question: How could you know how to "permanently fix it" if you "don't actually know what is causing the database to become corrupted"?

How do you know that legacy infrastructure is causing the corruption? If you're not certain, you're still dealing with a Problem rather than a K.E.

my 2 cents (or in another week, 2 kronor, as I will be gallivanting around Sweden)
I've had a few of these: always major incidents, don't really know the root cause, but upgrade(s) will either fix - or change the infrastructure, so that any problem will be 'new' :-/
I like Frank Guerino's answer, but really, it's just semantics. Good semantics, but semantics nonetheless. I want action.
I will not accept inability to determine root cause as a license to do nothing - no one gets off my PM-hook that easily.
If the answer is 'upgrade', I want problem resolution in 3 months (that's our current slack target). If it isn't resolved by then, I want mitigation so that no more severe incidents occur before the problem is resolved. Even when waiting for a patch/upgrade from an external vendor, we should ask for mitigation in the meantime.
I poked our managers with this stick again yesterday, and I think the repetition is actually starting to pay off. Either that or they're wearing good padding. But they seemed to get it, & I'm actually looking forward to their reports when I get back in a month.
Best of luck to you!
/Sharon

If you don't know the underlying root cause it's a problem not a known error.

PS rebuilding the DB isn't really a workaround either, but a recovery process.

A typical workaround would be to restore the DB from a backup (perhaps leaving it in read-only mode temporarily so that at least the app is available), and then apply the transaction log against it to recover any lost data.

Generally I would take a known error to be one where you have found the root cause of the problem (i.e. diagnosed what the issue is) and have a potential workaround.

However, I have a grey area situation: we know what the issue is (a database is corrupted), what the workaround is (rebuild it), and how to permanently fix it (the system is running on legacy infrastructure and will have to be replaced at some point in the next few years), but we don't actually know what is originally causing the database to become corrupted.
Am I correct in my belief that, as the root cause has not been identified, I cannot log this as a "known error"?

Thanks in advance...

One of the principles of the ITIL Framework is to take away the IT Cowboy mentality. In the old days, we would do exactly as you suggest. Loosely define an issue and then find a shotgun approach to fix it. And many times we were wrong.

The reason Problem Management is so specific about requiring that the Root Cause (and Configuration Item at fault) be identified is that it forces us in IT to accurately determine what is the failing component before we write up the Request for Change.

It may be that there is an untrained user who is doing something in the app that should never be done during the production day and causing the corruption. In which case the Root Cause is a procedure that is being followed, and the CI at Fault may be the Training Material or New User Training Syllabus.

In fact, by following your example of replacing the entire app with an upgraded version, you may bring the Problem over into the new system. Do you have to work every Problem through to resolution? No. It may be too costly to do the required investigation to identify the Root Cause and CI at Fault. Should you stop implementing a new system just because the old system still has identified Problems? No, the chances are that the majority of outstanding Problems will be addressed if a new system is implemented.

But once the new system is implemented, there needs to be a time period when the old system's Problems are left open to see if they will reoccur in the new system. Because, truthfully, you never did the work required to successfully take them into the Known Error realm.

Some organizations' management teams think that having Problems not followed through to Known Errors is a terrible thing. They shouldn't believe this. It is natural for a mature Problem Management process to uncover many undiagnosed Problems. It is then up to the Process Manager to determine which Problems need the additional resources/time/money spent on them to do full Root Cause analysis.

A known error is an incident or problem for which the root cause is known and for which a temporary or permanent alternative has been identified.

Much of the preceding discussion has hinged on what defines knowing the root cause.

Root cause analysis can only go so far. If you know that your database is being corrupted (let's assume this means the tables are corrupted) under a certain set of circumstances, then after a reasonable amount of investigation you can legitimately conclude that it has something to do with the database application itself, if you can find no other reason. So you contact the supplier and open a problem record with them, and they say (predictably) that you need to upgrade to version xyz.

In this case you still don't know exactly what is causing the problem, i.e. down to the specific bug in the code, but you do know enough to say it is the root cause, so of course it is a known error.

Much of IT is now supplied as a black box, whether it is a software app, an outsourced MPLS service, etc. So your own RCA can often only identify which black box is the problem, and then you are at the mercy of the supplier. Often they will say upgrade; perhaps they already have a known error for the problem.

So I think RJP was exactly right when he said:

Quote:

The 'root cause' is adequately discovered when you can answer 'Why did this happen' and 'What must be done to remedy it'. In your example the root cause was found.

BTW. we had exactly this problem a few years ago with MySQL 4 when tables in our network management app were periodically corrupted. After a lot of investigation we still could not work out why this happened, but we knew we could fix it by rebuilding the tables.

We knew that sooner or later we would have to upgrade to 5.0 and so we accelerated testing on the new platform, upgraded and the problem went away.

For dboylan to suggest (perhaps accidentally) that Jamiey's approach is somehow 'cowboy mentality' is well wide of the mark, especially as he cannot know the extent of testing, investigation that was done.


Dave

I am sorry if I implied that Jamiey's approach was that of an IT Cowboy. I was trying to (perhaps poorly) draw an analogy of how IT used to handle issues in the past by saying "We don't know the underlying cause of the issue. Let's start spending money to fix it." and how ITIL says we should attempt to resolve errors in the infrastructure.

ITIL says that we must determine Root Cause before we can define a Problem as a Known Error. If that is not possible, then we might still implement fixes for the Problem even though the Root Cause is unknowable. Having done that, we need to be aware that any attempted resolution might not succeed because we never defined the cause. Hence the need to keep the Problem open until sufficient time has passed and we can be reasonably sure the error is fixed.

Theoretically, a known error database maintains a record of all the known/identified resolutions for a problem.
From a KEDB perspective, a Known Error has only one relation to a Problem Request: the final resolution or workaround that is the outcome of that request. Only when one of these is obtained should the problem request be closed.

Hence, until a resolution or workaround is obtained, the incident or problem does not qualify as a known error in any sense. There is no implication that a known error is a necessary outcome of a problem request.

Moreover, if a resolution resulting from Root Cause Analysis has been obtained and implemented through a Change Request, the initial problem statement never appears in the KEDB, as the problem has been permanently fixed.
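To make the distinction the thread keeps circling concrete, the qualification rule can be sketched as a tiny classification helper. This is a hypothetical model for illustration only (the record and field names are invented, not any real KEDB tool's API): a problem record qualifies as a known error only once the root cause is identified and at least one workaround or resolution is on file.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ProblemRecord:
    """Hypothetical problem record; field names are illustrative only."""
    description: str
    root_cause: Optional[str] = None          # None until RCA succeeds
    workarounds: List[str] = field(default_factory=list)
    resolution: Optional[str] = None          # permanent fix, if found

    def is_known_error(self) -> bool:
        # A Known Error needs BOTH an identified root cause and at
        # least one temporary or permanent alternative.
        return self.root_cause is not None and (
            bool(self.workarounds) or self.resolution is not None
        )


# The scenario from the original question: a workaround exists,
# but the root cause of the corruption is still unknown.
p = ProblemRecord(
    description="Database tables become corrupted intermittently",
    workarounds=["Rebuild the corrupted tables"],
)
print(p.is_known_error())   # False: still a Problem, not a Known Error

p.root_cause = "Defect in legacy DB engine (per vendor problem record)"
print(p.is_known_error())   # True: now it qualifies as a Known Error
```

Under this model, Jamiey's record flips to a known error only if (as Dave argues) pinning the fault on the legacy database engine itself is accepted as a sufficient root cause.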

Quote:

Moreover, if a resolution resulting from Root Cause Analysis has been obtained and implemented through a Change Request, the initial problem statement never appears in the KEDB, as the problem has been permanently fixed.

Hmmm... you might be sticking a little too much to the concepts.

A knowledge base is supposed to provide increased knowledge as time goes on and experience accumulates... So I would not remove anything from the KEDB even if the problem is (supposedly permanently) solved, as I would fear losing some valuable information later on... If you have not (yet) seen incidents and problems recurring weeks or months after initial resolution... you are probably quite young

I would only clear KBs of older records as part of the necessary clean-up required to avoid "problems" (space, performance, costs, ...).

JP Gilles

JP - I suggested this because, post-Change, there could be configuration changes to the IT components. Sometimes, if a resolution that was available before the change was implemented is prescribed to the user, it may cause further complications.

For example, an executable file that works like a patch to fix a problem, with a later release fixing the problem permanently.
However, if the problem recurs and the old executable file is sent to the customer, it may corrupt the application, because when the permanent fix was implemented the old executable was never intended to be used for a recurrence of the problem. PHEW! not sure if I am clear.
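The risk described here (prescribing a fix that was valid before a change but harmful after it) can be guarded against by recording, with each KEDB entry, the configuration it applies to. A minimal sketch, with invented names for illustration only:

```python
from dataclasses import dataclass


@dataclass
class KnownErrorFix:
    """Hypothetical KEDB entry; the fix is pinned to a CI version range."""
    summary: str
    fix: str
    applies_to_max_version: str   # last app version this fix is safe for

    def is_safe_for(self, installed_version: str) -> bool:
        # Naive dotted-version comparison; real tooling would be stricter.
        def parts(v: str) -> list:
            return [int(x) for x in v.split(".")]
        return parts(installed_version) <= parts(self.applies_to_max_version)


# The old patch executable was only safe before release 2.0 fixed the bug.
entry = KnownErrorFix(
    summary="App data corruption on save",
    fix="Apply old patch executable",
    applies_to_max_version="1.9",
)
print(entry.is_safe_for("1.8"))   # True: pre-change system, patch is valid
print(entry.is_safe_for("2.1"))   # False: post-change, don't resend the patch
```

With a check like this, a stale workaround is filtered out automatically instead of relying on support staff remembering which fixes a Change has superseded.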

The interesting information I want to keep in the KB is not only the solution that was identified (which may no longer be appropriate as the context has evolved) but rather all the findings from the investigation phase.

Let me take an example:

18 months ago, when moving a server to version X.Y of the OS, traffic performance problems arose (very slow response times). A lot of investigation was done (network, cabling, CPU, server tuning, ...) before finding that the firmware on the network card needed a fix. That solved the issue.

Last week the server was upgraded again: the traffic performance problems reappeared... If the way you describe incidents and problems is well planned and strictly followed (another of my BIG subjects...), PM would immediately direct investigations toward the network card firmware... If the information on the first problem has been deleted, they may spend hours on the same investigations... (people may have changed, and even if not, I would not rely on people's memory: that's what a KB is supposed to provide...).
The real solution there might be to replace the card with a newer version/model in order to avoid implementing fixes every time... (*)

CONCLUSION: the solution that was supposed to be permanent in the first place (problem solved and no more incidents) just proved not to be so permanent....

anyhow, if you spend some time in support, you will build your own experience that will enrich your knowledge of the ITIL framework with concrete aspects...

(*) the example is just for illustration... I acknowledge that proper change management would have determined that the firmware needed to be adapted, or the card changed, for the OS version to come...

JP Gilles


We have many Known Errors in our problem DB which are waiting on vendor upgrades/patches. A lot of the time problem management finds the root cause, which is the application itself, so we have to note in the workaround that we are waiting for the new version/patch from the vendor to remove the error. There isn't actually a workaround, but it is a known error at the vendor level.
Would you agree with this approach ?
If we were not to mark these types of problems as KEs because we don't know which line of code the vendor has updated (as suggested in other posts in this thread), I believe we would have a lot of open problems.
How do other folks mark their vendor-application-related problems?