I have a question about best practice and the major incident process. For those of you who have implemented Problem Management: do your Problem Managers oversee the Major Incident process?

At first I believed this was still a function of the Incident Management process and that the Problem Management process would pick up after the service was restored. However, this has recently become a matter of debate, and I'm finding it difficult to pin down the true best practice.

At my company, we have actually split it up. Resolving the disruption itself is under the control of incident management. After that, a structural improvement should be described and communicated (not yet executed) within 5 working days. This part is under the control of problem management. Have a look at Kepner-Tregoe (google it) and its ATS (Analytic Trouble Shooting) method to continue where ITIL stops.

Thank you for the opinion. My company handles it exactly as you described. I was confused because a training course for Problem Management practitioners stated that it was best practice to have Problem Management oversee the entire Major Incident process.

Think of incident management as the fire brigade and ambulance after a plane crash, and of the organisation analysing the crash (the NTSB in the US) as problem management.

Both organisations can have conflicting interests.

The NTSB would like nothing to be touched or damaged, because they want to analyse the cause of the crash.
The fire brigade is only interested in getting all the people out of the plane. If they have to cut open the body of the plane to do so, they will, possibly damaging evidence of the cause of the crash.

I heard this example in a training course I attended. It explained a lot for me, especially the conflicting interests!

Hi,
A Major Incident is handled by a Major Incident Team. The team needs to comprise all the senior line managers, e.g. the CIO, Operations Manager, Development Manager, etc. It needs to be headed by the Chairman or at least the CIO of the organization. This is a requirement for ISO 20000 certification as well.
So what does this mean? It means the "Major Incident" concept is needed only by very big organizations, where a failure can affect a whole country's operations or something like that. It is something that happens very, very rarely.
On the functional side, the Problem Manager and Change Managers will always be part of the Major Incident Team.
Why does the concept of "Major Incident" exist? It is the link between Incident Management and Service Continuity Management.
Incident -> Major Incident -> Service Continuity
Hope this clarifies the doubts.
Shubhendu

The way I have seen this implemented in some organizations is through problem management. Major incidents are handled by the problem management team, though Incident and Problem Management may share staff. Typically, Problems are registered by the second-level support team of Incident Management, but overall control of major incident handling lies with problem management (that is, with the process owner for problem management).

I am not sure I quite agree with Shubhendu on the ISO 20000 certification. Could you tell us which clause in the standard requires a major incident team to be formed?

... After this, a structural improvement should be described and communicated (not yet executed) within 5 working days. This bit is under control of problem management....
hope this helps,

Michiel

Hi Michiel - Do you have any metrics on how long it takes, after the Problem resolution (structural improvement) is communicated, for it to actually get implemented? Our management team does not believe me when I tell them that we are slow, with our current average of 50% of resolutions implemented within 3 months of a major incident. They want to know 'industry standards'...
*seeking enlightenment in Canada*
/Sharon E

I do not have the exact metrics available from my own company, and do not know if there is an industry standard at all, but we report on the % of proposed improvements realised per branch within two months (3 branches in total until recently), and those range from 40-60%. From my personal experience I'd say you are doing a good job (besides having worked at our own data centers, I've been stationed with 10 different customers with approx. 20-25 data centers; many of these customers do not have these mechanisms in place at all).
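
For anyone who wants to track a comparable figure, a "% of proposed improvements realised within N days" number could be computed roughly as below. This is only a sketch: the function name, the record layout and the sample dates are made up for illustration and are not data from either company.

```python
from datetime import date, timedelta


def pct_implemented_within(improvements, window_days):
    """Return the share (0-100) of communicated improvements that were
    implemented within window_days of being communicated."""
    if not improvements:
        return 0.0
    window = timedelta(days=window_days)
    on_time = sum(
        1
        for communicated, implemented in improvements
        if implemented is not None and implemented - communicated <= window
    )
    return 100.0 * on_time / len(improvements)


# Hypothetical records: (date communicated, date implemented or None if still open)
sample = [
    (date(2024, 1, 10), date(2024, 2, 20)),
    (date(2024, 1, 15), date(2024, 5, 1)),
    (date(2024, 2, 1), None),
    (date(2024, 2, 5), date(2024, 3, 1)),
]

print(pct_implemented_within(sample, 60))
```

Improvements that are still open count against the percentage here, which seems closest to how the figures in this thread are described, but that is an assumption.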

To understand major incidents better, I would suggest getting a basic understanding of the SOX (Sarbanes-Oxley) requirements.

Major incidents are usually issues with business-critical IT configurations that affect the business or its financial applications.

Major incidents, while they have to be resolved by the Incident Management team itself, should have a Lead role at the Service Desk who manages all of them (there may be none, one, or several on any given day). The major incident handling process should first be framed as a skeleton for the entire Service Desk, describing WHO will conclude that it is in fact a major incident, WHAT the criteria are for deciding this, WHAT actions need to be taken, the ESCALATION procedure, WHO needs to be contacted if the primary contacts are not reachable, etc. Business Continuity & Disaster Recovery is definitely part and parcel of handling major incidents.
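
The skeleton described above (criteria, plus an escalation list with fallbacks) could be written down as something like the following. It is a minimal sketch under stated assumptions: the field names, the 500-user threshold, the "either criterion is enough" rule and the roles in the chain are all hypothetical and would need to be agreed with the business.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    business_critical: bool  # touches a business-critical or financial application
    users_affected: int


@dataclass
class Contact:
    role: str
    reachable: bool


def is_major(incident, user_threshold=500):
    """WHAT criteria: hypothetical rule, either condition alone is enough."""
    return incident.business_critical or incident.users_affected >= user_threshold


def next_contact(chain):
    """ESCALATION: return the first reachable contact in order, or None."""
    for contact in chain:
        if contact.reachable:
            return contact
    return None


# Hypothetical escalation chain, primary contact first
chain = [
    Contact("Major Incident Lead", reachable=False),
    Contact("Service Desk Manager", reachable=True),
    Contact("Operations Manager", reachable=True),
]

incident = Incident(business_critical=True, users_affected=40)
if is_major(incident):
    print(next_contact(chain).role)
```

Keeping the criteria and the contact order as plain data makes it easier to review them with management than burying them in a procedure document.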