Transcript of Problem management in practice

PROBLEM MANAGEMENT IN PRACTICE
by Thomas Fejfer, BlueHat P/S
Mobile: +45 40 15 97 82
E-mail: tf@bluehat.dk
Web: http://www.bluehat.dk/

Note: This case is simplified to make it easier to communicate, and times, business type, etc. have been changed to anonymize the company.

Introduction

ABC Spedition PLC transports and stores food in Scandinavia. The transport is carried out by trucks.

IT systems provide information about:
- What must be loaded on a given truck
- Where the truck must deliver its load

29 July
Major incident

A possible failure in a central network component caused approx. 4 hours of unavailability for most of the IT services.

After a reboot of the component, all of the IT services were available again.

27 August
Major incident

A failure similar to the July 29 incident caused approx. 4 hours of unavailability for most of the IT services. A reboot was tried, but had no effect.

A team was formed to investigate and diagnose the issue. After an hour the team reported that they had found the root cause: a hardware failure in a module in a central network component (distribution switch). The module was replaced, and all IT services were available again.

1 September
Problem

The CIO was not convinced that the IT specialists had prevented a similar incident from happening in the future, and therefore initiated a problem investigation.

Conclusion

- Do not confuse the problem with a cause: the switch module was one of several causes, not the problem
- Problem solving is teamwork: at least one member of the team should not have deep knowledge of the nature of the problem
- Create a simple cause & effect diagram to explain the cause and effect relationships
- ITIL® Incident and Problem Management describes well how to manage incidents and problems, but not how to solve them

ITIL® is a Registered Trade Mark of the Office of Government Commerce in the United Kingdom and other countries.

Whitepaper: How Kepner-Tregoe can improve your ITIL processes
Download at: http://www.bluehat.dk/downloads/ (no registration needed)

ITIL® Problem Management describes well how to manage problems but not how to solve them.

Therefore, it is a huge challenge for many IT organizations to get Problem Management to work in everyday life.

In many cases, the Problem Management process acts as an Incident Management process, and the IT organization achieves at best only limited value from the process compared with its potential.

- Slow down and think before you act: assess the risk before you start implementing solutions
- Return to "normal" operation as quickly as possible: be aware of different perceptions of normal operation
- Incident Management and Problem Management: be aware of the fundamental differences

Problem description (extract)

A hardware failure in the primary distribution switch caused extensive unavailability of IT services.

Problem solving team

The problem solving team was staffed with:
- 1 problem coordinator with limited knowledge of network and servers, but with knowledge and experience of problem solving methodologies
- 3 subject matter experts

PRINCIPLE: Problem solving is teamwork, and at least one member of the team must not have deep knowledge of the nature of the problem.

Sketch

Failover is switching (automatic or manual) to a redundant or standby computer server, system, or network upon the failure or abnormal termination of the previously active application, server, system, or network.

Fallback is the process (automatic or manual) of restoring a system, component, or service in a state of failover back to its original state (before failure).

Source: Wiki

Cause-effect diagram

A timeline (chronological analysis) is a valuable tool for problem solving, but it cannot replace a cause-effect diagram, as it does not explain all cause and effect relationships.

PRINCIPLE: Always perform an analysis of causality to explain why the problem occurred.

Identify possible solutions

Actions aimed at getting back to "normal operation" as quickly as possible had caused additional unavailability. A possible solution was to address this by training.

PRINCIPLE: Problems are not solved in general. Always focus on how to prevent one specific incident.
PRINCIPLE: Problems can be complicated, but solutions must not be, because complex solutions are new problems.
PRINCIPLE: Problems are solved when specific solutions are implemented.

Decision and economy

PRINCIPLE: Choose the best solution.

Timeline

06:00 The first user reports slow response times
07:00 Several users have reported slow response times
07:05 Network specialist began investigation
07:15 Sporadic loss of packets over "Distribution Switch A"
07:30 "Distribution Switch A" rebooted. Automatic failover to "Distribution Switch B". All services were available
07:45 "Distribution Switch A" was up and running after the reboot. Automatic fallback to "Distribution Switch A". IT services were unavailable
08:00 Automatic failover/fallback with less than 1 sec. interval
10:00 "Distribution Switch A" turned off. Automatic failover to "Distribution Switch B". All services were available
10:15 HW supplier received the log from the switch
10:45 Supplier reported back a hardware failure in a module in "Distribution Switch A"
11:45 Module replaced and tested
12:00 "Distribution Switch A" turned on. Automatic fallback to "Distribution Switch A". All services were available
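The timeline of the 27 August incident can also be analysed mechanically. The following is a minimal Python sketch and not part of the original case material. It sorts the entries chronologically and adds up how long the services spent in each state. The `Event` type, the `state` field and the state names are hypothetical. Treating "slow response times" as a `degraded` state is an assumption. The other states are taken from entries that explicitly say the services were available or unavailable.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Event:
    time: str                    # "HH:MM" on 27 August
    text: str                    # shortened wording of the timeline entry
    state: Optional[str] = None  # service state stated by the entry, if any


# Deliberately unsorted, as entries often arrive from different sources.
EVENTS: List[Event] = [
    Event("07:05", "Network specialist began investigation"),
    Event("06:00", "First user reports slow response times", "degraded"),  # assumption: slow = degraded
    Event("07:15", "Sporadic loss of packets over Distribution Switch A"),
    Event("07:00", "Several users have reported slow response times"),
    Event("07:30", "Switch A rebooted, failover to Switch B", "available"),
    Event("10:15", "HW supplier received the log from the switch"),
    Event("10:00", "Switch A turned off, failover to Switch B", "available"),
    Event("08:00", "Failover/fallback with less than 1 sec. interval"),
    Event("07:45", "Switch A up again, fallback to Switch A", "unavailable"),
    Event("10:45", "Supplier reported hardware failure in a module"),
    Event("11:45", "Module replaced and tested"),
    Event("12:00", "Switch A turned on, fallback to Switch A", "available"),
]


def to_minutes(hhmm: str) -> int:
    """Convert "HH:MM" to minutes since midnight."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def chronological(events: List[Event]) -> List[Event]:
    """Return the events in chronological order."""
    return sorted(events, key=lambda e: to_minutes(e.time))


def minutes_per_state(events: List[Event]) -> Dict[str, int]:
    """Sum the minutes between consecutive state-changing entries, per state."""
    totals: Dict[str, int] = {}
    state: Optional[str] = None
    since = 0
    for event in chronological(events):
        if event.state is None:
            continue  # entry does not say anything about the service state
        now = to_minutes(event.time)
        if state is not None:
            totals[state] = totals.get(state, 0) + (now - since)
        state, since = event.state, now
    return totals


for event in chronological(EVENTS):
    print(event.time, event.text)
print(minutes_per_state(EVENTS))
```

Under these assumptions the sketch reports 90 minutes of degraded and 135 minutes of unavailable service, 225 minutes in total. That is roughly in line with the "approx. 4 hours" stated for the incident if slow response is counted as unavailability. Such a calculation only supports the chronological analysis; it says nothing about cause and effect.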

PRINCIPLE: A timeline documents events in chronological order and is very useful to show which events may have been triggered by others, or to discount any claims that are not supported by the sequence of events.

The problem description has the fields What, Undesired outcome, Where and When.
When: August 27, 2011, from 06:00 am to 12:00 pm

PRINCIPLE: Always start with defining the business impact to get a unique starting point for problem solving.

The 3 subject matter experts on the problem solving team:
- 2 network specialists
- 1 platform (server and OS) specialist

Cause-effect diagram: simple (extract)

Possible solutions are identified by going systematically through each cause and asking:
- "What can we do to remove the cause?" or
- "What can we do to prevent the cause from having a negative effect?"

The best solution should:
a) Prevent recurrence or minimize the adverse impact, including similar occurrences at e.g. different locations
b) Be within your control
c) Be simple
d) Provide reasonable value for its cost
e) Not cause other unacceptable problems
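The solution criteria a) to e) lend themselves to a simple screening step. The sketch below is only an illustration in Python: the `Candidate` type, the `rated` helper and all ratings are hypothetical and not taken from the case. The first two candidate solutions (training and a failover notification) are mentioned in the case; the third is an invented counter-example that shows a rejection.

```python
from dataclasses import dataclass
from typing import Dict, List

# The five criteria from the list above, keyed by their letter.
CRITERIA = (
    "a) prevents recurrence or minimizes the adverse impact",
    "b) is within your control",
    "c) is simple",
    "d) provides reasonable value for its cost",
    "e) causes no other unacceptable problems",
)


@dataclass
class Candidate:
    cause: str              # the cause this solution addresses
    action: str             # the proposed solution
    meets: Dict[str, bool]  # criterion -> does the solution meet it?


def rated(failing: str = "") -> Dict[str, bool]:
    """Hypothetical helper: all criteria met except the letters in `failing`."""
    return {criterion: criterion[0] not in failing for criterion in CRITERIA}


def failed_criteria(candidate: Candidate) -> List[str]:
    """Return the criteria the candidate does not meet (unrated counts as not met)."""
    return [c for c in CRITERIA if not candidate.meets.get(c, False)]


def shortlist(candidates: List[Candidate]) -> List[Candidate]:
    """Keep only candidates that meet every criterion."""
    return [c for c in candidates if not failed_criteria(c)]


# Ratings are assumptions made for illustration only.
CANDIDATES = [
    Candidate("Restore actions caused additional unavailability", "Training", rated()),
    Candidate("Failovers go unnoticed", "Notify the service desk on failover", rated()),
    Candidate("Hardware failure in a switch module",
              "Replace both distribution switches (hypothetical)", rated("cd")),
]

for candidate in CANDIDATES:
    problems = failed_criteria(candidate)
    verdict = "keep" if not problems else "reject: " + "; ".join(problems)
    print(f"{candidate.action} -> {verdict}")
```

The screening does not choose between the remaining candidates; it only removes those that fail a criterion before the actual decision is made.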

Note (29 July incident): This is classic ITIL incident management. An intermediate action (a workaround) is used to restore service.
Note (27 August incident): Many will see this as problem management, but it is not. In this case a corrective action is used to restore service, and therefore it is pure incident management.
Note (1 September): This is where ITIL problem management starts!

Fast version of Kepner-Tregoe Decision Analysis:
- What are our objectives?
- What alternatives do we have?
- Which alternative best fits our needs?
- What could go wrong with that choice?

If we do not know when a failover has occurred, then we have no redundancy. We need to set up a notification that is sent to the service desk in case of failover (estimated cost € 2.000). Total cost € 3.000.
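The failover notification decided on in this case could work roughly as follows. This is a minimal Python sketch under stated assumptions, not the solution the company implemented. How the active switch is observed (for example via monitoring polls) is left out. The function name `watch_active_switch`, the `notify` callback, the one-second flapping window and the sample observations are all hypothetical.

```python
from typing import Callable, Iterable, List, Tuple

PRIMARY = "Distribution Switch A"
STANDBY = "Distribution Switch B"


def watch_active_switch(
    observations: Iterable[Tuple[float, str]],
    notify: Callable[[str], None],
    flap_window_s: float = 1.0,
) -> int:
    """Notify on every change of the active switch and return the number of changes.

    `observations` yields (timestamp in seconds, name of the active switch).
    A change that follows the previous change within `flap_window_s`
    seconds is marked as flapping.
    """
    previous = None
    last_change = None
    changes = 0
    for timestamp, active in observations:
        if previous is not None and active != previous:
            # Moving away from the primary is a failover, returning to it a fallback.
            kind = "fallback" if active == PRIMARY else "failover"
            flapping = last_change is not None and timestamp - last_change < flap_window_s
            message = f"{kind} to {active} at t={timestamp}s"
            if flapping:
                message += " (flapping)"
            notify(message)
            last_change = timestamp
            changes += 1
        previous = active
    return changes


# Hypothetical observations, in seconds after 06:00, loosely following the timeline.
SAMPLE = [
    (0, PRIMARY),
    (5400, STANDBY),    # 07:30 failover
    (6300, PRIMARY),    # 07:45 fallback
    (7200.0, STANDBY),  # 08:00 failover ...
    (7200.5, PRIMARY),  # ... and fallback less than 1 sec. later
]

messages: List[str] = []
change_count = watch_active_switch(SAMPLE, messages.append)
for line in messages:
    print(line)
```

In practice `notify` would create a ticket or send a message to the service desk. Collecting the messages in a list keeps the sketch self-contained.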