We are in the process of starting an ITIL implementation; I have Incident, Request and Change Management done. Problem Management has me confused.

I understand the concept, but I don't understand where the known errors are recorded and where work arounds are captured. What types of categories need to be created for Problem Management? Do problem categories get created on the fly as problems occur?

Problem management is a tough cookie as most organizations do not have the resources or toolsets to implement it properly. Problem Management, in my opinion, is typically implemented as reactive triage rather than proactive detection, diagnosis, and remediation.

Try this whitepaper as a starter...
bitpipe.com/detail/RES/1110995399_466.html

Work arounds are ideally provided by Incident Management, though Problem Management also uses them, depending on the urgency of the fix required. A work around can lead to a Known Error once a way to circumvent the error is identified. A good way to start Problem Management is, as mentioned, with the reactive approach: Identify, Record, Classify, Investigate, Diagnose (say an error is identified), Identify & Record Error, Assess Error, Record Error Resolution, Close Error and Associated Problems.
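As a rough illustration, the reactive steps listed above can be treated as an ordered sequence that a problem record moves through. This is only a sketch: the stage identifiers and the `next_stage` helper are hypothetical names, not something taken from ITIL or from any particular tool.

```python
from typing import Optional

# Stage names paraphrase the reactive steps listed above (hypothetical identifiers).
PROBLEM_STAGES = [
    "identify",
    "record",
    "classify",
    "investigate",
    "diagnose",
    "identify_and_record_error",
    "assess_error",
    "record_error_resolution",
    "close_error_and_associated_problems",
]


def next_stage(current: str) -> Optional[str]:
    """Return the stage that follows `current`, or None after the final stage."""
    position = PROBLEM_STAGES.index(current)  # raises ValueError for unknown stages
    if position + 1 < len(PROBLEM_STAGES):
        return PROBLEM_STAGES[position + 1]
    return None
```

A real toolset would likely allow loops (for example, back from Diagnose to Investigate); the strictly linear order here is a simplifying assumption.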

Now, if there is a record of errors and problems, then each time you receive an incident you check whether a known error exists for the incident. If so, use the error resolution; else, if a problem is open for the incident, add the incident to the problem's list; else, if it is a routine incident, treat it as applicable.

If none of the above exists, create a new problem record and proceed with the steps mentioned before.
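The decision order described above could be sketched roughly like this. Everything in it is an assumption made for illustration: the `KnownError` and `Problem` classes, the `triage` function, and especially matching by exact symptom text, which a real tool would replace with categories or a search.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class KnownError:
    symptom: str
    resolution: str


@dataclass
class Problem:
    symptom: str
    incident_ids: List[str] = field(default_factory=list)


def triage(incident_id: str, symptom: str,
           known_errors: List[KnownError],
           open_problems: List[Problem],
           routine_symptoms: Set[str]) -> Tuple[str, object]:
    """Apply the checks in the order described in the post (hypothetical helper)."""
    # 1. A known error exists for this incident: use its recorded resolution.
    for known_error in known_errors:
        if known_error.symptom == symptom:
            return ("use_error_resolution", known_error.resolution)
    # 2. A problem is already open: add the incident to the problem's list.
    for problem in open_problems:
        if problem.symptom == symptom:
            problem.incident_ids.append(incident_id)
            return ("added_to_open_problem", problem)
    # 3. A routine incident: handle it as usual, no problem record needed.
    if symptom in routine_symptoms:
        return ("routine", None)
    # 4. None of the above: create a new problem record.
    new_problem = Problem(symptom=symptom, incident_ids=[incident_id])
    open_problems.append(new_problem)
    return ("new_problem_created", new_problem)
```

Calling `triage` twice with the same unknown symptom creates a problem record on the first call and attaches the second incident to it on the next call.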

Known Errors are Problem records which have a work around identified and/or RFC initiated.

Work arounds are used in/derived from Incident Management. Think in terms of 'how can I restore service in this incident? How can I work around the issue?'. In most cases, the work around is derived from the investigation that takes place during the lifecycle of the incident, which makes sense since Incident Mgt's goal is to restore service in the quickest, most efficient way possible. The Service Desk, when resolving the incident, can choose to raise the issue to the Problem Mgt team to have a problem record created, to help determine a more permanent solution and to vet the work around that's been used. The Problem Mgt team would create the problem record and investigate a permanent solution. The next time the incident occurs (most likely with the same symptom) the SD can search the K.E. database for an approved work around to expedite the resolution of the incident. At the same time, the SD would also relate the incident to the problem, heightening the urgency and impact of the problem, and thus raising the priority of the related problem investigation. Before long a permanent solution is identified, an RFC is applied, and the possibility of incident recurrence is theoretically eliminated.
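To make the point about related incidents raising a problem's priority concrete, here is a minimal sketch. The 1-to-3 scales, the threshold of five incidents per escalation step, and the function names are all assumptions for illustration; your own impact/urgency matrix would replace them.

```python
MAX_LEVEL = 3  # assumed scale: 1 = low, 2 = medium, 3 = high


def escalate(base_level: int, linked_incidents: int, step: int = 5) -> int:
    """Raise a level by one for every `step` linked incidents, capped at MAX_LEVEL."""
    return min(MAX_LEVEL, base_level + linked_incidents // step)


def problem_priority_score(linked_incidents: int,
                           base_impact: int = 1,
                           base_urgency: int = 1) -> int:
    """Combine escalated impact and urgency into a score (higher = investigate sooner)."""
    impact = escalate(base_impact, linked_incidents)
    urgency = escalate(base_urgency, linked_incidents)
    return impact * urgency
```

With these assumed numbers, a problem with no related incidents scores 1, and every incident the Service Desk relates to it pushes the score toward the maximum of 9.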

To answer your other question, it would be prudent to align your problem categories with the likely root causes in your environment. The official CCTA ITIL documentation (Service Support) gives a good recommendation on how the categorization structure should be organized. The book isn't cheap, but it's an invaluable resource.

Gord,
can Problem management ever create a workaround?
Is merely restoring service a workaround?
Some of the docs I've seen define a workaround as a method of avoiding the problem (even though the root cause and permanent fix have not been found).

If for example a server hangs, is rebooting the machine (to restore service) considered a workaround?

Problem Management differs from Incident Management in that it looks for the root cause of one or more incidents. Once it finds one, it creates a Known Error record and ideally raises an RFC to fix the Error. Incident Management may find a root cause, but it doesn't look for one (if it does find one, that's a bonus).

A fix may not be found by Problem Management for every Error - in fact one source of Known Errors is the 'bug reports' that vendors put out. These, and any work arounds, should go into your Known Errors Database - but finding a fix may be beyond the reach of your team (obviously they won't recompile an application whose source code they don't have access to).

So the example you gave is quite a good one - an app might run some services that are a bit 'twitchy'. The symptoms are known and the work around is to bounce the service. Putting up the Known Error and Work Around records is something Problem Management would be expected to do.

Remember that the real difference between Incident Management and Problem Management is not the tasks they undertake - there are a lot of overlaps, and they access common information. It is their objectives (and sequence) that set them apart...

The 'health-check' for activities within each discipline is: is undertaking that activity going to undermine the capability of the process to achieve its objectives? Finding a work around for a known error where:

* it is not possible to raise an RFC for a fix,
* or where for cost (or other reasons) a conscious decision is made not to spend resources on a fix,

If we go with the strict definition of a workaround, then the other example we're debating is if a workaround results in fewer occurrences of the problem but does not completely prevent it... what is that?

For example, a memory leak causes a server to hang after running for about a week. The root cause is not known yet, but we know that if we reboot the server on Wednesdays we lessen the chance. However, on occasion, depending on system load, the problem still occurs.

Perhaps 'workaround' is a concept that shifts its meaning slightly when going from the Incident Management to Problem Management contexts.

From the IM point of view I tend to think in terms of recovery - getting things going again. Often that action isn't going to prevent the incident recurring.

And this is after all why PM is there. PM has a brief to eliminate problems, and following on from that, where not able to eliminate the problem itself it would try to eliminate the 'impact' of the problem by looking for a stable work around.

But now we are in territory where work arounds look like 'fixes', and strictly speaking those should go through Change Management via an RFC.

And how many work arounds do people find that are both stable and not changes?

So for my part I think I would call any activity that ameliorates a known error, that isn't a change, a 'work around'. I'd just say there are better or worse 'work arounds' depending on how stable the service delivery situation is.

A 'partial' work around seems to me like a perfectly reasonable term to employ in the case of less-than-ideal responses to a known error. And I agree, the goal should be to find work arounds that prevent incidents recurring.

Perhaps the Known Errors Database needs something that indicates 'Known Error, no work around, recommended incident resolution procedure is...'.

That would use the best known action for resolving incidents with this Known Error indicated as the cause - and going back to your original example, that would be restarting the service.

That would be a good thing - simplicity of the example aside - because you want to ensure that once a resolution to an incident is found, staff don't have to go through the investigation and diagnosis every time it occurs - until a real work around is found.
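One way such a Known Errors Database entry might look is sketched below. The field names, the three work around states (none, 'partial', stable) and the `service_desk_action` helper are hypothetical; they only illustrate the idea of recording a recommended incident resolution procedure when no real work around exists yet.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class WorkaroundStatus(Enum):
    NONE = "none"        # Known Error, no work around
    PARTIAL = "partial"  # reduces recurrence but does not prevent it
    STABLE = "stable"    # prevents the incident recurring


@dataclass
class KnownErrorRecord:
    error_id: str
    symptom: str
    workaround_status: WorkaroundStatus = WorkaroundStatus.NONE
    workaround: Optional[str] = None
    recommended_resolution: Optional[str] = None  # best known recovery action


def service_desk_action(record: KnownErrorRecord) -> str:
    """Pick what the Service Desk should do for an incident caused by this error."""
    # Prefer a recorded work around (partial or stable) when one exists.
    if record.workaround_status is not WorkaroundStatus.NONE and record.workaround:
        return record.workaround
    # Otherwise fall back to the recommended incident resolution procedure.
    if record.recommended_resolution:
        return record.recommended_resolution
    # Nothing recorded yet: staff have to investigate and diagnose.
    return "investigate and diagnose"
```

For the hung-service example, a record with no work around and `recommended_resolution="restart the service"` would send staff straight to the restart instead of repeating the diagnosis.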