I am a newbie to ITIL and need some advice from you guys. Thanks in advance.

I am working on the problem management process. ITIL only tells you what to do in problem management, but never says how to do it. For example:

1. How do we define the standards for severity and priority? I know it is business related, but I need a model or method to define them.

2. Do we always perform root cause analysis for every problem? What does "root" mean? For example, a network connection goes down because a network card is broken. Is this the "root", or do we need to dig deeper to find out why the card broke? How many "whys" should we ask?

3. How do we define the target resolution date for a problem? Two weeks? Three weeks? Half a year? In reality, it is difficult to tell how long it should take to find the reason for a problem.

4. What are the criteria for closing a problem? If we spend months and can't find the reason, or we know the reason but there is no way to fix it, or we do have a solution but the cost of implementing it is too high, can we close the problem?

1. Problems often have a much longer time-scale than incidents, as it usually isn't known how long they will take to fix; there's usually quite a lot of investigation and sometimes trial fixes (preferably not on a live system!).

You need to identify the levels of service you provide so that they can be grouped by severity. There must be some services/functions that, if unavailable, would have a critical effect on the business: a bank unable to process BACS transfers, for example. Work out your "essential", "must have" and "nice to have" services and work from there.

2. Root cause should be as accurate as possible. You should try to get as close as you can to the exact reason for the problem, because you could end up with either:
- Root is a defective network card
- Root is the fact that the server is at ground level and sucks in dust through the vents, coating the card and making it overheat.
In this example, the first only appears to be a root cause; the second is the real root cause.

3. How critical is the service? Can the business function while you test solutions, or have you got to just slap in a replacement and get the failed item on the bench for testing?

4. If you can't fix a server that reboots itself, then take the problem to the supplier. If the kit is out of warranty then you need to consider replacing it with an identical part or a new equivalent. If management bitch at the cost, then you need a serious buttock-covering letter stating that senior management are happy to run without or run 'at risk'. Make sure management know (in writing!) what services/functions will be unavailable, and make sure your service desk know what's going on. If it impacts the users, they'll be the first to cop the flak!
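To make point 1 above a bit more concrete, here is a minimal Python sketch of the "essential / must have / nice to have" grouping. The service names, the four-level severity scale and the rule that a degraded service drops one level are all assumptions made for illustration; they do not come from ITIL or from this thread.

```python
from enum import IntEnum


class Tier(IntEnum):
    """Business importance of a service (1 = most important)."""
    ESSENTIAL = 1
    MUST_HAVE = 2
    NICE_TO_HAVE = 3


# Hypothetical service catalogue; replace with your own services.
SERVICE_TIERS = {
    "bacs_transfers": Tier.ESSENTIAL,
    "email": Tier.MUST_HAVE,
    "intranet_news": Tier.NICE_TO_HAVE,
}


def severity(service, fully_unavailable):
    """Return a severity from 1 (highest) to 4 (lowest).

    Assumed rule: severity equals the service tier when the service is
    completely down, and is one level lower when it is only degraded.
    """
    tier = SERVICE_TIERS[service]
    return int(tier) + (0 if fully_unavailable else 1)


print(severity("bacs_transfers", fully_unavailable=True))
print(severity("email", fully_unavailable=False))
```

The value of a table like this is less the code than the fact that the business has to agree the tiers up front, so nobody argues about severity while the service is down.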

From an outside point of view, you set up an accurate problem management process to improve general customer satisfaction, but from an inside point of view you mainly want to improve the ratio (resources used) / (result obtained).

Sticking with the example you proposed (network connection down), for me the root cause is the broken network card. Going further with the problem investigation is a waste of resources.

If, and only if, you start to get several broken network cards, so that you start to look at a broken card as an incident, it makes sense to go further with the investigation and look for the root cause of the broken network cards.
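The "only dig deeper once it keeps happening" rule above could be sketched like this. The threshold of three failures and the component names are assumptions for illustration only; pick whatever number fits your environment.

```python
from collections import Counter

# Assumed threshold: how many failures of the same component justify
# a deeper root cause investigation.
RECURRENCE_THRESHOLD = 3


def needs_deeper_investigation(failed_components, component,
                               threshold=RECURRENCE_THRESHOLD):
    """Return True once `component` has failed at least `threshold` times."""
    return Counter(failed_components)[component] >= threshold


# Hypothetical failure log: one entry per recorded failure.
log = ["network_card", "disk", "network_card", "network_card"]
print(needs_deeper_investigation(log, "network_card"))
print(needs_deeper_investigation(log, "disk"))
```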

Hi pac,
This is one of those questions where there is no true/false answer; the best choice is a trade-off that depends on the context.

Problem management is of course a more “intelligent” practice than incident management, but it is still far from academic research, where the aim is to give a detailed answer to the starting question and so to get as close to the ultimate root cause as possible.

If you are talking about ITIL problem management, you are talking about a business reality, where the cursor between the search for the truth (or, more concretely, the most satisfactory answer) and the point at which an answer can be considered acceptable is always driven by economic logic.

So, back to this example: if the interruption of the network connection happened on just one machine and the network card costs 4 dollars, then for me replacing the network card is the solution to the problem. I won’t have an engineer spend two days on the issue to discover why the card broke down.

If the network card costs 3000 dollars, maybe I’d hesitate to consider replacing it a solution to the problem.

On the other hand, if the card breaks again in the near future then, as I said in my previous post, this becomes an incident, and replacing the network card is of course only a workaround that calls for a deeper investigation of the root cause.

Just to add my views to the debate above. I agree you need to consider the cost of the fix to the problem / incident, but I think the critical factor which has been ignored is the cost to the business of the failure. A network card may cost $4 to replace, but the failure of a network card could cost the business thousands of dollars, or expose you to unacceptable risk.
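One rough way to put the two costs side by side is to compare the price of the investigation with the loss you expect if the failure keeps happening. This is only a sketch: the function name, the one-year horizon and every figure in the example calls are assumptions, not numbers from this thread apart from the $4 card.

```python
def worth_investigating(investigation_cost, failures_per_year,
                        business_cost_per_failure, replacement_cost):
    """Return True if the expected yearly loss exceeds the investigation cost.

    Expected yearly loss = failures per year
                           * (business cost per failure + replacement cost).
    """
    expected_yearly_loss = failures_per_year * (
        business_cost_per_failure + replacement_cost
    )
    return expected_yearly_loss > investigation_cost


# $4 card, one failure a year, no real business impact,
# two engineer-days at an assumed $800 per day.
print(worth_investigating(1600, 1, 0, 4))

# Same card, but each failure costs the business an assumed $5000.
print(worth_investigating(1600, 1, 5000, 4))
```

The first call says the investigation isn't worth it and the second says it is, which is the point of the post above: the price of the card hardly matters once the business cost of the outage is counted.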

If we know it's the network card at fault, then it's a (Known) error, not a "Problem".
Error control requires that we assess the error, considering factors such as cost and risk, and we may choose either to apply the resolution or to leave it as-is.

Is replacing the network card a workaround or a permanent resolution? I guess it depends on your root cause analysis.

I agree with andy. The decision whether to investigate more deeply should be based on the impact on the business. But in reality, especially with network problems, the impact is sometimes difficult to tell, so people tend to raise the priority. Convincing the customer that the problem is fixed can be a headache...

I have also heard the 5 questions rule of thumb for determining Root Cause:

What happened? The network went down
Why? A server's network card failed
Why? The card overheated
Why? There was dust coating the card
Why? The floors in the server room are never vacuumed
Why? There is a policy against letting cleaning crews into the data center due to the sensitive nature of the equipment

So what is the root cause of the failure? A policy that is preventing the floors from being cleaned.

Again, it's just a simple rule of thumb that can be taken less far or even further.
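If it helps to see the chain written down, here is a small Python sketch of the example above. The list holds the answers from this post (the last one shortened); the function name and the `max_whys` parameter are made up for illustration, and stopping earlier or later is just a matter of changing that number.

```python
# chain[0] is what happened; each later entry answers "Why?" for the
# entry before it.
WHY_CHAIN = [
    "The network went down",
    "A server's network card failed",
    "The card overheated",
    "There was dust coating the card",
    "The floors in the server room are never vacuumed",
    "There is a policy against letting cleaning crews into the data center",
]


def root_cause(chain, max_whys=5):
    """Return the answer reached after asking "Why?" at most max_whys times."""
    if not chain:
        raise ValueError("chain must contain at least the initial symptom")
    # Stop at the end of the chain if fewer answers are known.
    depth = min(max_whys, len(chain) - 1)
    return chain[depth]


print(root_cause(WHY_CHAIN))              # ask all five whys
print(root_cause(WHY_CHAIN, max_whys=1))  # stop at the failed card
```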

I've used the "5 Whys" tool successfully for a couple of years now, and find that first defining a set of possible reasons for a failure, grouping those reasons into similar kinds, and then performing the 5 Whys on each group usually gives you from one to a few different root causes.
I guess in a perfect world we would all like to see one root cause, which has been identified as the real root cause of the problem. In the real world, however, this may not be the case. It could very well be a combination of several things that brings the problem to light. By eliminating each one in turn and then reviewing the Root Cause Analysis, however, you usually find that some of the other possible root causes disappear.
It's not a perfect science, but it does work in the end!