How to prepare for and respond to data centre emergencies

Data centre operations and maintenance teams should always be prepared to act swiftly and surely without warning. Unforeseen problems, failures, and dangers can lead to injury or downtime. Good preparation can quickly and safely mitigate the impact of emergencies, and help prevent them from happening again. This article describes a framework for an effective emergency preparedness and response strategy for mission critical facilities.

Even an expertly engineered data centre cannot guarantee 100% availability. Good preparation is the best defence, and will help ensure responses are timely, effective, and error-free. Table 1 gives a short overview of key aspects of an effective emergency preparedness and response programme for data centres.

Emergency operating procedures

Emergency operating procedures (EOPs) are used for handling crises and disasters as soon as they are detected. EOPs should exist as documents, preferably maintained through a computerised document management system (CDMS). Each procedure describes an approved set of actions for responding to a crisis or disaster. The response should cover how to safely isolate the fault and how to restore service or redundancy. The aim of an EOP is to have facility operators carry out the correct actions in the correct sequence, both for safety and to minimise the duration and impact of the emergency.

An EOP has multiple functions. First, it assists operators in placing the affected system(s) into a controlled and stabilised condition as quickly as possible. Second, it provides step-by-step guidance to ensure all activities are carried out in a safe and deliberate manner. This is done to prevent further (or wider) service interruption, equipment damage, or personal injury. Such negative, and possibly devastating, effects result from performing work in an uncontrolled manner: omitting essential steps, or performing them incorrectly or half-heartedly. A third function of EOPs is as a training tool for new operators. They should be used as the basis for scenario drills and testing in staff training programmes. Finally, EOPs are important to have when the facility is audited or evaluated by customers or management, because they demonstrate effective emergency preparedness and response.
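The step-by-step nature of an EOP can be illustrated with a short sketch. This is only one possible way to hold a procedure as an ordered checklist so that steps cannot be skipped or performed out of sequence. The class name, method names, and the example steps are all hypothetical placeholders and do not represent a real procedure.

```python
from dataclasses import dataclass


@dataclass
class EmergencyProcedure:
    """An EOP held as an ordered checklist (hypothetical structure)."""
    title: str
    steps: list[str]
    completed: int = 0  # number of steps signed off so far

    def next_step(self):
        """Return the next step to perform, or None when the EOP is finished."""
        if self.completed < len(self.steps):
            return self.steps[self.completed]
        return None

    def sign_off(self, step):
        """Record a step as done, refusing skipped or out-of-order steps."""
        expected = self.next_step()
        if step != expected:
            raise ValueError(f"expected {expected!r}, got {step!r}")
        self.completed += 1


# Placeholder steps for illustration only, not a real procedure.
eop = EmergencyProcedure(
    title="Loss of one cooling unit",
    steps=[
        "Confirm alarm",
        "Isolate failed unit",
        "Verify redundant unit carries load",
    ],
)
eop.sign_off("Confirm alarm")
print(eop.next_step())
```

Rejecting an out-of-order sign-off mirrors the concern above that omitted or incorrectly performed steps can widen an outage. A paper checklist with sign-off boxes serves the same purpose.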

The EOP is the most important tool for ensuring operational stability and recovery after a failure event. It should be a well-practised and rehearsed procedure, so that all facility staff are aware of their responsibilities and tasks in the EOP process. Before any EOPs are developed, first draw up a list of all the likely and/or high-risk failure scenarios, then write an EOP for each one. Of course, data centre operators and their managers cannot foresee all problems, but they can prepare for the worst and hope for the best.

Crisis management plan

The crisis management plan (CMP) is a set of policies and procedures to help data centre operators prepare for, respond to, and learn from crisis situations that could eventually lead to a true emergency or disaster that would then require the execution of EOPs. The CMP should be closely reviewed by all major stakeholders who would participate in the process.

Preparation and prevention

The best crisis management tool is prevention. It is commonly known that most data centre outages are a direct or indirect result of human error. To minimise errors, data centre personnel should undergo intensive training in change management procedures to ensure proper behaviour and execution for work in or around critical facility systems. All data centre work procedures (standard operating procedures/SOPs) should be created with safety and operational risk mitigation as the primary goal. It is recommended that all procedures be peer reviewed on site and undergo an additional review by a quality assurance specialist.

Detection and incident classification

Not all events appear out of nowhere or are easily identifiable at first glance. It is important to be able to recognise their early warning signs and threshold qualities. There is a distinction between an urgent situation and a crisis. An urgent situation that is being managed with a proven process or procedure would not normally be considered a crisis. One of the defining characteristics of a crisis is a loss of control. If a situation passes outside the boundaries of what can be reliably managed and becomes, or threatens to become, out of control, a crisis may ensue. Another characteristic of a crisis is a high level of severity. Data centre infrastructure management (DCIM) software tools can be an effective way to centrally monitor data centre system state changes and alarms, and so provide more proactive notification of problems and conditions that could lead to a crisis or disaster.

Table 1: Overview of key elements of an emergency preparedness and response strategy for data centres.

| Category | Element | Description |
| --- | --- | --- |
| Emergency response procedures | Emergency operating procedures (EOPs) | EOPs provide a plan of action for safely isolating faults and restoring service or redundancy. |
| Emergency response procedures | Crisis management plan (CMP) | A detailed step-by-step plan of action on what to do in the event of a crisis situation. |
| Emergency response procedures | Emergency drills | Emergency drills, scheduled and performed in line with the top ten identified operational risks, help ensure readiness. |
| Emergency response procedures | Incident notification | A process that ensures any safety or mission critical event is made known to appropriate personnel. |
| Incident management | Incident identification and reporting | All incidents must be reported immediately once the situation is stabilised. A brief summary of the incident should be sent to the appropriate distribution list. |
| Incident management | Failure analysis | A comprehensive programme to determine a root cause is required for any incident that involves an injury or system downtime, or has the likelihood of doing so. |

Response and mitigation

Once a crisis or disaster has been declared, the first inclination on the part of well-meaning operators might be to immediately jump in and take action to fix the problem. Until the situation is fully understood and a well-considered response plan created, however, such actions run the risk of causing further harm or downtime. Except in obvious cases requiring immediate action (e.g., fire), the proper course of action is to craft a plan of action with subject matter experts and key stakeholders. The time invested in these activities will often, in the long run, provide a safer, surer, and longer lasting solution than hasty action.

After any first response activities, the primary task is to assess the situation. Basic information must be put together about the scope and severity of the incident, as well as the state and stability of the plant. This data must be quickly established and continuously updated in order to ensure good decision making and accurate communications.

Recovery and analysis

Once the incident has been fully resolved, a failure analysis report should be prepared and issued to key stakeholders. It is best to do this quickly – within one week of the incident’s resolution – while the experience is still fresh in people’s minds.

Fig. 1: DCIM software including building management systems (BMS) can be very helpful in simplifying and automating incident notification (and reporting).

Escalation procedures

As situations go from normal to urgent to potential crisis or even disaster level, escalation of the problem must take place. This ensures that the right know-how and resources are brought to bear at the right time.

Proper escalation of business-impacting incidents, as well as “near-misses”, is an important element of an emergency preparedness and response strategy. Communication between data centre staff, management, customers and vendors is crucial: it ensures that the situation is brought under control and that all necessary resources are focused on the incident. Table 2 provides an example of an escalation procedure and timelines.

All incidents should be assigned a class level based on severity, Class 1 being the most serious and Class 5 being the least serious. Summary definitions of the event classes are as follows:

Class 1: Overrides all other classes. Threat to human life is more important than threats to the IT load. Emergency response teams must be notified. Covers fire, natural disasters, threats to human life, and physical security threats. After the event has been stabilised, data centre management must decide how to proceed.

Class 2: An event that interrupts IT function, or in which “N” is lost in any building system, mechanical or electrical. These are mainly “recovery” situations that need direct management decision making before recovery actions can be performed.

Class 3: No further backup systems are available, i.e., redundancy has been reduced from “N+1” to “N”. Also covers any non-scheduled generator runs.

Class 4: Critical systems redundancy is still available, i.e., “N+1” exists. Class 4 may be difficult to define due to the many definitions of “redundancy” that may exist.

Class 5: Designed to notify the immediate supervisors of threats. Examples include strong wind warnings and lightning storm warnings. This class is mainly for notification of situations that could escalate to a higher class.
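The class definitions above can be read as a priority-ordered decision rule, with the most serious matching condition winning. The sketch below is a minimal illustration of that reading. The field names are hypothetical, and treating any reported event that matches no other condition as Class 5 is an assumption rather than something the definitions state.

```python
from dataclasses import dataclass


@dataclass
class EventFlags:
    """Observed conditions for one event (field names are hypothetical)."""
    life_safety_threat: bool = False            # fire, natural disaster, security threat
    it_function_interrupted: bool = False       # IT function interrupted or "N" lost
    redundancy_reduced_to_n: bool = False       # "N+1" reduced to "N"
    unscheduled_generator_run: bool = False
    fault_with_redundancy_intact: bool = False  # fault occurred, "N+1" still exists


def classify(event: EventFlags) -> int:
    """Return the incident class, checking the most serious condition first."""
    if event.life_safety_threat:
        return 1  # overrides all other classes
    if event.it_function_interrupted:
        return 2
    if event.redundancy_reduced_to_n or event.unscheduled_generator_run:
        return 3
    if event.fault_with_redundancy_intact:
        return 4
    return 5  # assumption: anything else reported is an advisory


print(classify(EventFlags(redundancy_reduced_to_n=True)))
```

Checking Class 1 first reflects the rule that a threat to human life overrides every other class, even when the IT load is also affected.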

Similar escalation procedures should be put into place for facility incidents and for vendor escalation. For multi-facility data centres, a 24/7 operations centre should be available as a centralised resource to coordinate escalation procedures.

Table 2: An example escalation procedure and timelines based on the level of severity (incident class).

| Incident class | Facility manager | Service manager | Operations manager | Operations director |
| --- | --- | --- | --- | --- |
| Class 1: Life safety | Immediate | +20 minutes | +30 minutes | +1 hour |
| Class 2: Critical | Immediate | +30 minutes | +1 hour | +2 hours |
| Class 3: Serious | Immediate | +1 hour | +4 hours | +24 hours |
| Class 4: Significant | Immediate | Next business day | +2 business days | +5 business days |
| Class 5: Advisory | Advisory | | | |
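The timelines in Table 2 lend themselves to a simple lookup that turns a detection time into notification deadlines for each role. The sketch below shows one possible encoding and makes several assumptions: business days are taken to be Monday to Friday with no holiday calendar, "Next business day" is treated as one business day, and Class 5 is treated as advisory only, with no timed escalation. The function and variable names are hypothetical.

```python
from datetime import datetime, timedelta

ROLES = (
    "Facility manager",
    "Service manager",
    "Operations manager",
    "Operations director",
)

# Offsets copied from Table 2. A timedelta is elapsed time;
# a plain int is a number of business days.
ESCALATION = {
    1: (timedelta(0), timedelta(minutes=20), timedelta(minutes=30), timedelta(hours=1)),
    2: (timedelta(0), timedelta(minutes=30), timedelta(hours=1), timedelta(hours=2)),
    3: (timedelta(0), timedelta(hours=1), timedelta(hours=4), timedelta(hours=24)),
    4: (timedelta(0), 1, 2, 5),
}


def add_business_days(start, days):
    """Advance by whole business days (assumed Monday to Friday, no holidays)."""
    current = start
    while days > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            days -= 1
    return current


def notification_deadlines(incident_class, detected_at):
    """Map each role to the time by which it should be notified."""
    if incident_class == 5:
        return {}  # assumption: advisory only, no timed escalation
    deadlines = {}
    for role, offset in zip(ROLES, ESCALATION[incident_class]):
        if isinstance(offset, timedelta):
            deadlines[role] = detected_at + offset
        else:
            deadlines[role] = add_business_days(detected_at, offset)
    return deadlines


detected = datetime(2024, 3, 1, 10, 0)  # a Friday
for role, due in notification_deadlines(2, detected).items():
    print(f"{role}: {due:%Y-%m-%d %H:%M}")
```

Keeping the offsets in a single table makes it straightforward to adjust the timelines to a facility's own escalation policy without touching the logic.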

Emergency drills

The primary function of a drill is to evaluate the proficiency of an operator’s response to emergency events. Written and oral tests can demonstrate knowledge, but more importantly, drills show both knowledge and proficiency of action.

The drills should be based on real-world conditions and an understanding of the underlying principles of how the equipment and systems function. A drill report document allows for the evaluation and recording of individual performance. It is also useful to use some drills as an opportunity to train operators and enhance their knowledge of the data centre environment and installed equipment.

Drills should be mandatory and should be created for each emergency operating procedure that addresses anticipated events of high probability and/or high severity. Each facility should establish a goal of each data centre operations team member participating in at least one drill per month, but must in all circumstances meet any contractual obligations regarding drill requirements. Emphasis should be placed on the top ten EOPs, in combination with the current threat evaluation.
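Tracking the goal of one drill per team member per month can be as simple as comparing a drill log against the team roster. The following is a minimal sketch. The log format of (operator, date) pairs, the function name, and the example names are hypothetical.

```python
from datetime import date


def operators_missing_drill(team, drill_log, year, month):
    """Return team members with no recorded drill in the given month."""
    drilled = {
        name
        for name, day in drill_log
        if day.year == year and day.month == month
    }
    return sorted(set(team) - drilled)


# Placeholder roster and log for illustration only.
team = ["Asha", "Ben", "Chen"]
drill_log = [
    ("Asha", date(2024, 5, 3)),
    ("Ben", date(2024, 4, 28)),
    ("Chen", date(2024, 5, 20)),
    ("Asha", date(2024, 5, 21)),
]
print(operators_missing_drill(team, drill_log, 2024, 5))
```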

Incident management

Mission critical facilities expend large amounts of capital and human effort to ensure continuous operation. These environments should be highly controlled and monitored as a result. As part of that control scheme, it is important to be aware of, document, and report on any unexpected events that might affect operations or safety. An effective process has three elements: incident notification, incident reporting, and failure analysis.

Incident notification and identification

Incident notification includes the systems, process, and people involved in alerting stakeholders that an incident has occurred. It is important for notification of events to be timely. Who should be alerted needs to be determined and planned for ahead of time. As previously shown in Table 2, incidents can be classified by level of urgency or criticality. This classification system can then govern who needs to be notified, when, and how often. DCIM software including building management systems (BMS) can be very helpful in simplifying and automating incident notification (and reporting).

Incident reporting

Incident reporting is a generic process that may be augmented or superseded by an existing system (e.g., DCIM or BMS) or process. Once the initial incident has been detected and responded to, and the right people notified, it is recommended that an incident report be completed within 24 hours and sent out to all appropriate stakeholders. It is good practice to start filling out the report as soon as there is time to do so (while the event is still fresh in people’s minds), and to continue updating it throughout the event, if the situation allows. A standardised template should be used to report all incidents. This helps ensure that all of the relevant information is gathered every time.
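A standardised template can be expressed as a fixed record structure, so that every report carries the same fields and gaps are easy to spot. The sketch below is one possible shape, not a prescribed format. The field names are hypothetical, and measuring the 24-hour window from the time of detection is an assumption.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

REPORT_WINDOW = timedelta(hours=24)  # assumption: counted from detection


@dataclass
class IncidentReport:
    """A standardised incident report (field names are hypothetical)."""
    incident_class: int
    detected_at: datetime
    summary: str
    systems_affected: list[str] = field(default_factory=list)
    actions_taken: list[str] = field(default_factory=list)
    updates: list[tuple[datetime, str]] = field(default_factory=list)

    def add_update(self, when, note):
        """Append a timestamped note as the event develops."""
        self.updates.append((when, note))

    def report_due(self):
        """Time by which the completed report should be sent out."""
        return self.detected_at + REPORT_WINDOW

    def missing_fields(self):
        """List template fields that are still empty."""
        required = {
            "summary": self.summary,
            "systems_affected": self.systems_affected,
            "actions_taken": self.actions_taken,
        }
        return [name for name, value in required.items() if not value]


# Placeholder incident for illustration only.
report = IncidentReport(
    incident_class=3,
    detected_at=datetime(2024, 3, 1, 10, 0),
    summary="Cooling unit fault; redundant unit carrying load",
)
print(report.missing_fields())
print(report.report_due())
```

Checking for empty fields before distribution supports the point above that a template helps gather all relevant information every time.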

Incident follow up

Once the initial incident is reported, follow-up activities may occur. For instance, an incident may occur when a system component fails and service is restored through a redundant unit, and a vendor is called in to repair the failed component. The incident report would be filled out right away and can be approved and distributed before the repair is carried out. Once the vendor arrives and performs the work, the incident report can be updated. Action items, recommendations and supporting information would be included in the follow-up.

Failure analysis

The failure analysis process is designed to provide a standard method for determining and documenting the root cause of an incident, whenever it is determined that such an investigation is necessary. It focuses on why the situation occurred rather than the “who, what, how, when and where”. The failure analysis report should still describe what happened, where and when, and how it was responded to. Accurate, detailed documentation of problems can provide valuable “lessons learned” to operators.

Conclusion

Companies must understand the importance of an emergency preparedness and response strategy for sustainable operations. To respond effectively to the different kinds of risks and crises in data centres, organisations must act quickly and know what to expect in unexpected situations. By implementing the best practices developed over many years at Schneider Electric, organisations can protect expensive assets such as data centres and secure the best return on investment.