I have a quick query. Say a P1 (Sev 1) incident has been reported and the resolution teams have not resolved it within the agreed 4 hours, so the SLA has been breached, and they still do not understand the root cause. In such a scenario, what is to be done and which roles are involved?

Thanks and regards,

GB

Functional (to call on greater expertise) and hierarchical (your bigwigs talking to the customer's bigwigs) escalation should be invoked according to your own escalation model, if you have one.

In resolving an incident the SLA is largely irrelevant. It is more appropriate for retrospective analysis of performance.

So your question is really: there has been a high impact/risk incident and it is proving difficult to resolve (perhaps every moment the cost/risk is increasing); what do you do?

Well, like Boris said, you escalate. You identify the resources (management, technical, physical or whatever) that you need to make progress and you invoke the authority to get them. But you do this the moment you realize it is required; you don't wait for some SLA clock to reach a critical point.

Your target for incident resolution is to minimise cost and risk to your customer, not to meet some predefined time constraint.
_________________
"Method goes far to prevent trouble in business: for it makes the task easy, hinders confusion, saves abundance of time, and instructs those that have business depending, both what to do and what to hope."
William Penn 1644-1718

So in this case, is the SLA largely irrelevant? Shouldn't the Incident Manager be on top of the incident timelines? Say the incident has used up 75% of its resolution time; shouldn't the Incident Manager act appropriately then?

In spite of doing all that, even if the SLA gets breached (or rather the workaround fails), should the IM not be looking for a contingency plan to recover the services?

As far as I'm concerned, the SLA is always irrelevant when working to resolve an incident.

The focus is on service to the customer and therefore on resolution of the incident.

Escalation criteria should not be mapped on SLA times but on factors that indicate the need for escalation, such as high risks and impacts (this will probably map to priority level), lack of knowledge, expertise or resources. By all means have rules about how long a service has been unavailable, but not to the extent of precluding earlier escalation when warranted.
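
To make that concrete, here is a rough sketch in Python of escalation rules keyed on need rather than on the SLA clock. Everything in it is an assumption for illustration: the `Incident` fields, the `escalation_reasons` function and the backstop minutes are invented, not taken from ITIL or from any tool. The duration rule is only a backstop and never blocks an earlier escalation.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    priority: int                 # 1 = highest risk/impact (hypothetical scale)
    minutes_unavailable: int      # how long the service has been down
    has_required_expertise: bool  # does the current team know enough?
    has_required_resources: bool  # does it have what it needs to progress?


# Hypothetical duration backstops per priority, in minutes.
BACKSTOP_MINUTES = {1: 60, 2: 240}


def escalation_reasons(inc: Incident) -> list[str]:
    """Return every reason to escalate now; an empty list means none yet."""
    reasons = []
    if not inc.has_required_expertise:
        reasons.append("functional: missing knowledge or expertise")
    if not inc.has_required_resources:
        reasons.append("hierarchical: need authority to obtain resources")
    backstop = BACKSTOP_MINUTES.get(inc.priority)
    if backstop is not None and inc.minutes_unavailable >= backstop:
        reasons.append("hierarchical: outage duration backstop reached")
    return reasons
```

With these rules, a P1 that is only ten minutes old but lacks the right expertise escalates immediately; nobody waits for a clock to reach 75% of anything.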

As for your other point, incident resolution is about restoring service by any practical means. So you should always be looking for the quickest way to achieve this. It would be contrary to good service to spend time trying for a "perfect fix" until you are nearing some SLA value and then quickly putting in a workaround that had been available from the off.

If your SLA commits you to being measured against every individual incident rather than on overall performance, then it is not a good SLA. It makes you a hostage to fortune since the time required for the resolution of incidents is more closely related to their nature than to some arbitrary (or averaged) figure. It also is bad for the customer since it focusses staff on meeting targets for each incident based on their duration rather than on their risk and impact. You should always be fixing the most costly first, even if that means that ten others breach their SLA target. This is why such individual targets should not exist.
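
As a rough illustration of fixing the most costly first, the Python sketch below orders the same queue two ways: by estimated cost to the customer, and by how close each incident is to its individual SLA target. The `OpenIncident` fields and the figures are invented for the example, and `cost_per_hour` assumes you can put some estimate on ongoing cost and risk.

```python
from dataclasses import dataclass


@dataclass
class OpenIncident:
    ref: str
    cost_per_hour: float  # assumed estimate of ongoing cost/risk to the customer
    hours_open: float
    sla_hours: float      # per-incident target, kept only for contrast


def by_cost(incidents):
    """Most costly incident first, regardless of its SLA clock."""
    return sorted(incidents, key=lambda i: i.cost_per_hour, reverse=True)


def by_sla_clock(incidents):
    """Least time remaining against the individual target first."""
    return sorted(incidents, key=lambda i: i.sla_hours - i.hours_open)


queue = [
    OpenIncident("INC-A", cost_per_hour=100.0, hours_open=3.5, sla_hours=4.0),
    OpenIncident("INC-B", cost_per_hour=5000.0, hours_open=0.5, sla_hours=4.0),
]

print([i.ref for i in by_cost(queue)])       # ['INC-B', 'INC-A']
print([i.ref for i in by_sla_clock(queue)])  # ['INC-A', 'INC-B']
```

Working to the SLA clock puts the cheap incident first simply because it is older, which is exactly the behaviour that individual targets encourage.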

I agree and disagree with Diarmid. As a long-time Service Level Manager I'm not prepared to put aside SLA targets easily, but the priority is service to the customer. At the end of the incident, if everything has been done to restore service as quickly as possible and the timescale has been missed, then that's fine so long as it's an exception.