I'm fighting a battle, not only against the Service Management but also against the Consultant who is supposed to know ITIL inside out.

Firstly, they hold the view that a Known Error can be closed without a permanent fix, and that a workaround alone is sufficient to close the ticket.

Secondly, they feel that the Known Error Database is the list of all closed tickets. I'm currently looking at 269 closed Problem tickets and wonder how the Service Desk (in India) would ever be able to find the relevant live Known Error buried among 260 closed problems that have had permanent fixes over the past few years. And all of this with the customer hanging on the phone!

I don't have real problem management experience, but from what I recall from my ITIL exam... (and if I'm wrong I'm sure someone here will correct me!)

The Known Error can be closed with a workaround because it is not the same as the problem record itself. They are separate entities.

Incident management would search known errors with the hope of finding a workaround to get the user up and working again with minimum disruption.
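
As a rough illustration of that lookup, here is a minimal Python sketch. The `KnownError` record, its fields, and `find_workarounds` are made-up names for this example, not part of ITIL or any particular tool. The point is only that the search is restricted to known errors that are still live.

```python
from dataclasses import dataclass


@dataclass
class KnownError:
    # Hypothetical record layout, for illustration only.
    ke_id: str
    symptoms: str
    workaround: str
    is_open: bool = True  # still live: no permanent fix applied yet


def find_workarounds(known_errors, search_text):
    """Return open known errors whose symptoms mention every search word."""
    words = search_text.lower().split()
    return [
        ke for ke in known_errors
        if ke.is_open and all(w in ke.symptoms.lower() for w in words)
    ]


kedb = [
    KnownError("KE-1", "Email client cannot send mail", "Send a fax instead"),
    KnownError("KE-2", "Email client cannot send mail",
               "Restart the mail service", is_open=False),
]

# Only the live record is offered to the Service Desk agent.
matches = find_workarounds(kedb, "email send")
```

With the closed records filtered out, the agent only has to read the handful of entries that can still cause incidents.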

If the tool you are using isn't up to the task of quickly searching past incidents, known error records, and knowledge bases for such workarounds, then perhaps the tool should be reviewed or staff trained?

Meanwhile, whoever is working on the problem (that may or may not be linked to the known error) can continue their Root Cause Analysis without the pressure of a user being down.

1. Be sure to distinguish between an incident and a problem. The incident is the disruption in service. The problem is the unknown root cause of it.

2. When an incident is reported, you only have an incident and the Service Desk is going to try to get the customer up & running in as little time as possible.

3. If the root cause of that incident cannot be found, a problem record is raised, independently of the resolution of the incident. (e.g. e-mail doesn't work ==> send a fax is a workaround, incident can be closed, problem stays open)

4. The problem management team can then work on finding the cause of the problem, and document a standard workaround to apply to it. Once that is done, your problem has become a Known Error (this is the Problem Control process)

5. From this point on, anyone who reports an e-mail problem should be presented with the workaround provided by the Problem Management team.

6. The Problem Management team will then evaluate how/whether to fix the problem for good (Error Control process). In the case of an e-mail server being down, they probably won't spend much time wondering whether they will find a permanent fix, but other problems may stay Known Errors because there is no business case for fixing them. If it doesn't make financial sense for the company to fix the problem, you could stay with a workaround forever, that is true.
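
The steps above can be sketched as a small state model. This is only an illustration under my own assumptions: the class names, fields, and helper functions are hypothetical and not taken from ITIL or any tool.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ProblemStatus(Enum):
    OPEN = "open"                # root cause still unknown
    KNOWN_ERROR = "known error"  # root cause found, workaround documented


@dataclass
class Problem:
    summary: str
    status: ProblemStatus = ProblemStatus.OPEN
    root_cause: Optional[str] = None
    workaround: Optional[str] = None


@dataclass
class Incident:
    summary: str
    closed: bool = False
    problem: Optional[Problem] = None


def close_with_workaround(incident: Incident) -> Problem:
    """Steps 2-3: service is restored and the incident closed; because the
    root cause is unknown, a separate problem record is raised."""
    incident.closed = True
    incident.problem = Problem(summary=incident.summary)
    return incident.problem


def record_known_error(problem: Problem, root_cause: str, workaround: str) -> None:
    """Step 4: Problem Control ends when the cause and a standard
    workaround are documented; the problem is now a known error."""
    problem.root_cause = root_cause
    problem.workaround = workaround
    problem.status = ProblemStatus.KNOWN_ERROR


incident = Incident("E-mail doesn't work")
problem = close_with_workaround(incident)   # incident closed, problem open
record_known_error(problem, "Mail server down", "Send a fax")
```

Note that the incident is closed while the problem record lives on, which is the separation described in step 3.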

The Known Error database contains known errors only, but past incidents are also relevant to incident investigation, so I believe the way this is handled will depend on the tool you're using.

Some time ago I combed through the ITIL blue book looking for a clear definition of a known error. The only one I could find says that a known error is a faulty CI. To be a little more subtle I would suggest....

A known error is a record indicating a CI whose current state is the identified-root-cause of one or more incidents.

So, in a way, a known error is what a problem 'becomes' after root cause analysis is done. It is for this reason that another line in the book says that raising a known error may be as simple as changing the status on a problem record.

However I would stress the "may be" - generally it isn't, and especially where multiple CI states are found to be at the heart of a problem. I would always advocate keeping known errors in a separate table, with a one-to-one relationship to CI information (even if it's just details on the KE record).

In the term known error, known means live. You can think of a problem as a best-effort initial summary of a set of symptoms thought to be caused by one (or multiple related) unknown errors.

When an error becomes 'known' there are a couple of actions and a couple of choices. You can decide to fix the CI - in which case you raise an RFC, and wait for the fix. You can also find a workaround to address further occurrences of incidents. You can decide the workaround is sufficient, for various reasons, and not raise an RFC. You can decide there is no workaround (or it would be too expensive), and no fix is warranted - and take the case back to the customers of the affected service and get them to sign off on an adjustment to the SLA that incorporates acceptance of the incidents. Or you may even identify the error, but be unable to fix it - e.g., an error in proprietary software that has to wait until the vendor decides it's worth issuing a patch or update.
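
To make those choices concrete, here is a simplified Python sketch. The `Disposition` names and the three yes/no inputs are my own assumptions for illustration. In practice the options are not mutually exclusive (you would normally keep a workaround in place while waiting for an RFC or a vendor patch), and the decision involves business judgment that a function like this cannot capture.

```python
from enum import Enum


class Disposition(Enum):
    # Hypothetical labels for the outcomes described above.
    RAISE_RFC = "raise an RFC and wait for the fix"
    WORKAROUND_ONLY = "workaround is sufficient, no RFC"
    ACCEPT_IN_SLA = "no workaround and no fix: adjust the SLA"
    AWAIT_VENDOR = "cannot fix it ourselves: wait for the vendor"


def choose_disposition(fix_warranted: bool,
                       fix_in_our_control: bool,
                       has_workaround: bool) -> Disposition:
    """Pick the main course of action for a newly identified known error."""
    if fix_warranted and not fix_in_our_control:
        return Disposition.AWAIT_VENDOR
    if fix_warranted:
        return Disposition.RAISE_RFC
    if has_workaround:
        return Disposition.WORKAROUND_ONLY
    return Disposition.ACCEPT_IN_SLA


# An error in proprietary software: worth fixing, but not ours to fix.
vendor_case = choose_disposition(fix_warranted=True,
                                 fix_in_our_control=False,
                                 has_workaround=True)
```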

Now, because a known error records a CI whose state is causing incidents, it stays open for as long as the CI is in a state that causes those incidents. The workaround lives with the error, and for just as long, because it is there to handle each new incident caused by the error. If it is never fixed, the record is never closed. You should however have a way of recording in the CMDB that the error state of the CI in question is 'accepted'.

Once the error is eradicated, it is 'closed'. This may mean put into a closed status and left in the table, or archived into a past errors table, with its workaround, or even (gasp) deleted. Whatever works best with your processes or tools.

As intimated above, however, knowledge may have been gained in the problem management and error control processes that could be of value. What causes people to want to keep the error live is the workaround. They want staff to have access to it and not have to redo the PM process if the error occurs again in a similar CI.

Capturing and preserving this is 'knowledge management' - not problem management. It should be done, but it is a separate process. The key difference is that error control information is about actual events, while knowledge management is about types of events. It is not good practice to keep your errors forever just to keep your workarounds.
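
One way to picture that separation is sketched below in Python, purely as an assumption-laden example: the `KnownError` fields, the `error_type` key, and the dictionary standing in for a knowledge base are all hypothetical. When the error is closed, the workaround is copied into the knowledge base under the type of event, so the error record itself no longer has to stay open.

```python
from dataclasses import dataclass


@dataclass
class KnownError:
    # Hypothetical record: one actual error on one specific CI.
    ke_id: str
    ci_name: str
    error_type: str
    workaround: str
    status: str = "open"


def close_known_error(ke, knowledge_base):
    """Close an eradicated error, keeping its workaround as general
    knowledge filed under the type of event rather than the actual CI."""
    knowledge_base.setdefault(ke.error_type, []).append(ke.workaround)
    ke.status = "closed"


knowledge_base = {}  # maps a type of event to the workarounds learned for it
ke = KnownError("KE-7", "mailsrv01", "mail queue stalls",
                "Restart the queue service")
close_known_error(ke, knowledge_base)
```

If a similar error turns up on another CI later, staff can look up the event type in the knowledge base without the old error having been kept live.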

I'd like to support what Fabian and RJP supplied with the following...

I believe there are two common interpretations of Known Error that confuse people.

The first is that an error is "known". In other words it's enough to be classified as a "Known Error" as long as you know about it.

The second is that a Known Error is a Problem, where the error is known, such that the Problem is now a Defect (or spawns a Defect, depending on your beliefs).

My clients, typically, tend to believe that the ITIL definition is more aligned with the latter, not the former, where a Known Error really correlates to a Defect that needs to be addressed through work that is encapsulated in one or more Requests For Change.

You may want to take the position that a "Known Incident", with its own workaround, is different from a "Known Error", which is a Problem/Defect that needs one or more RFCs to fix it. A good knowledge base will allow you to query across either, individually or together, to allow you to answer questions for your customers.