Let's defeat downtime

Recent times have seen some high-profile airlines fall victim to system downtime. What can aviation do to guard against such crises? And how might we recover rapidly?

Failed network router. Reservation and check-in systems down. Automatic transfer switch malfunction. Global IT crash. They can all result in downtime and disruption. But need that be an inevitability?

As air transport industry operations depend ever more on IT, a robust business continuity plan is the vital bridge between service availability and service restoration.

Failing to plan for high-risk incidents, or lacking a plan that mitigates the risk, is not an option. Otherwise, when an incident does occur, it can take hours to devise a recovery mechanism before that mechanism can, in turn, be implemented by operational teams.

That means the incident doesn’t get resolved for a considerable length of time. Needless to say, such scenarios can all too easily result in revenue loss and vastly diminished customer satisfaction – not to mention the untold damage to organizational reputation.

So what does it take to build a robust disruption continuity plan? What measures are needed to ensure this plan is always up to date? We asked SITA’s experts for advice.

Foundation first

According to Mathew White, SITA’s VP APAC Geography Services Operations: “Any IT business continuity plan must be grounded on four key foundation principles, with everything else built on that.”

The first principle is simple, says White. “It’s important to have a plan that works and not one that’s merely spoken about.

“That means a well-designed technical solution to ensure business availability, including the use of redundant technology. Second, you need to have a way of implementing preventative maintenance and proactive monitoring.”

Principles three and four, according to White, are that organizations must carry out a business impact and threat/risk analysis, and put in place a rapid recovery system that includes regular testing of systems and processes.

Testing

What’s clear, explains White, is that you must start with a regular failover testing schedule, or otherwise ensure that your high-availability solution is working as designed. Here, dummy scenarios are frequently enacted to test system resiliency. The build design plays a crucial part in the success of failover testing.

“What also greatly supports system resiliency is geographic diversity,” he continues, “whereby servers are in different countries and even perhaps on different continents.”

Redundancy

SITA’s VP Service Operations Manuel Garcia-Fernandez believes that being dual-equipped, while consistently carrying out failover testing, is an important step towards a vision of zero downtime for the air transport industry.

He regards a redundancy setup, especially redundancy across geographies, as a vital part of this.

“Both capability and flexibility come into play; the capability of being able to do it, with the flexibility of it being able to deliver what’s best,” says Garcia-Fernandez.

Cloud infrastructure and services offer a perfect example. With their rising popularity, this type of redundancy setup will soon be the norm. In parallel, a consistent and well-established failover testing schedule will help to mitigate and address outages when they happen.

Here, disaster recovery solutions or shared host access are made possible across different cloud locations. So if one location fails, the recovery solution takes over in another, ensuring no break in service.
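As a rough illustration of that takeover, the sketch below picks which location should serve traffic based on a health flag. It is a minimal example under stated assumptions, not a description of any SITA system: the `Location` class, the `pick_active` function and the location names are all hypothetical, and a real setup would rely on the health checks and routing of its cloud provider.

```python
from dataclasses import dataclass


@dataclass
class Location:
    name: str      # hypothetical location label
    healthy: bool  # result of the most recent health check (assumed to exist)


def pick_active(locations, preferred):
    """Return the preferred location if healthy, else the first healthy alternative."""
    by_name = {loc.name: loc for loc in locations}
    if preferred in by_name and by_name[preferred].healthy:
        return preferred
    for loc in locations:
        if loc.healthy:
            return loc.name
    return None  # no healthy location left: a total outage


# Hypothetical example: the preferred location has failed its health check.
locations = [
    Location("singapore", healthy=False),
    Location("frankfurt", healthy=True),
    Location("atlanta", healthy=True),
]
active = pick_active(locations, preferred="singapore")
print(active)
```

A regular failover test amounts to deliberately marking the preferred location as unhealthy and confirming that service continues from the alternative.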

Risk reduction

Risk is inherent in anything that’s delivered for the air transport industry, including ‘technological unpredictabilities’.

Reducing risk requires:

Implementing a risk assessment program

Ensuring a comprehensive monitoring capability

Incorporating a service management role

Assessing the business impact

“Put simply, risk assessment is about considering specific areas of high potential linked with high business impact and then working on solutions to reduce this,” White explains.

It requires a constant consideration of where future failures may arise, the likelihood of the failure occurring and the potential outcomes that may result.
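One common way to turn likelihood and outcome into a working priority list is to score each risk as likelihood multiplied by impact. The sketch below assumes that convention and a 1 to 5 scale for both values. The scale, the `risk_score` function and the sample entries are illustrative assumptions, not a method described by the experts quoted here.

```python
def risk_score(likelihood, impact):
    """Score a risk as likelihood x impact, each on an assumed 1-5 scale."""
    if not (1 <= likelihood <= 5 and 1 <= impact <= 5):
        raise ValueError("likelihood and impact must be between 1 and 5")
    return likelihood * impact


# Hypothetical register entries: (description, likelihood, impact).
risks = [
    ("single network router", 2, 5),
    ("check-in kiosk printer", 4, 1),
    ("power transfer switch", 3, 5),
]

# Highest score first, so mitigation effort goes where it matters most.
ranked = sorted(risks, key=lambda r: risk_score(r[1], r[2]), reverse=True)
for description, likelihood, impact in ranked:
    print(description, risk_score(likelihood, impact))
```

Re-scoring the register at each review keeps the ranking in line with where future failures are now most likely to arise.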

Alongside this there must be a comprehensive monitoring capability that enables your service to be both proactive and preventative.

“This gives you a ‘view into the future’ and allows you time to be prepared – a valuable asset needed during any incident,” explains White.

“You can't always have the resiliency you want for various reasons. So you've got to have risk mitigation around that fact, with plans to continually look and whenever you can, work towards a resiliency solution that’s best for your business.”

Service management is the face of services provided to customers. The purpose of creating this role is to enhance the organization’s ability to work closely with customers on regular improvement plans.

In the view of Garcia-Fernandez, these plans enable service managers and their customers to assess any potential risks, especially those relating to a single point of failure, and to identify areas for improvement.

“They should take place at least once a year or even more frequently if possible,” he asserts.

Don't ignore history

SITA’s Director of Service Operations, Gustavo Romero, underlines the importance of focusing on both the past and the present in risk assessment.

“Risk validation means considering both the history as well as the current health of systems,” he says.

“By assessing both history and current system data, you’re better able to make recommendations to prevent incidents from happening and to reduce their impact if they do occur.

“Having this information to hand enables you to make more informed decisions,” according to Romero.

Always improve

Echoing this advice, Garcia-Fernandez emphasizes the importance of “looking backwards to progress forward.”

He firmly believes that what’s important is learning from the past and building on this knowledge.

“A knowledge base with key learnings from any previous outage, and the ability to use that to assess actions taken before, helps to fix the problem more quickly than starting from scratch,” he explains.

Making knowledge management an important part of a disruption continuity plan results in continuous improvement.

“By constantly learning from errors in the past, you’re able to make any service system more robust and you instill a process of continual improvement,” he adds.

The customer's shoes

Ultimately, defeating downtime is about putting yourself into the customer’s shoes to assess the business impact of services.

That means recognizing the vital difference between a mission-critical and a non-mission-critical requirement. Failure to consider this when designing a service solution for customers could result in either over-providing or under-providing.

“However, when you’re talking about departure control in airports and all-important turnaround times, you must ensure business sustainability. Here you need to have mission-critical type resiliency and availability to meet the customer's business needs.”

Proactive

Continuous monitoring that’s both proactive and preventative is what makes a disruption continuity plan robust.

These two characteristics are inter-dependent: constant proactive monitoring helps identify and detect potential risks; this then provides the necessary information for the preventative measures to be taken.
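The link between the two can be sketched as a warning threshold set below the point of failure, so that a reading triggers preventative action while there is still time to act. This is a minimal sketch under stated assumptions: the `assess` function, the threshold values and the sample readings (percentage utilization of some resource) are hypothetical.

```python
def assess(samples, warn_at, fail_at):
    """Classify the latest reading as 'healthy', 'at_risk' or 'failing'."""
    if not samples:
        raise ValueError("no samples to assess")
    latest = samples[-1]
    if latest >= fail_at:
        return "failing"   # too late for prevention: activate recovery
    if latest >= warn_at:
        return "at_risk"   # early warning: take preventative measures now
    return "healthy"


# Hypothetical utilization readings, in percent, oldest first.
readings = [40, 55, 72]
status = assess(readings, warn_at=70, fail_at=90)
print(status)
```

The gap between the two thresholds is the time that monitoring buys for preventative measures.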

Proactivity is key because what makes the difference is detecting failure quickly. Having done that, you must be able to quickly activate a disaster recovery protocol should any eventuality happen.

“What customers value is a first response of ‘how can we help’ or ‘this is how we can help’,” says Romero, “instead of ‘what’s happened?’.”

The main difference lies in the added ability of being able to understand the incident through proactive monitoring and impact evaluation, which then enables implementation of a solution that’s more relevant and realistic.

“Staying proactive means you’re immediately aligned with customer needs and better able to deliver what’s right for them.

“But proactive monitoring is not only about having that constant surveillance to guarantee and ensure service availability,” adds Romero.

“It’s also about providing important input to other aspects of the service that will result in higher and better service availability for all.”

Customer centricity

All things considered, Garcia-Fernandez believes that defeating downtime demands a customer-centric mindset. Of course, it’s important to have the capability to respond and deliver service, but demonstrating flexibility and care is equally fundamental.

“It’s all about ensuring business continuity by responding smartly and quickly and being flexible, keeping services running and achieving maximum uptime,” he says.

“The resolving teams who implement the service solutions also perform a triage on what the main root cause is. With well-prepared plans, this can be quickly implemented.

“In addition, customer centricity is a pillar in any service improvement plan. It means regularly assessing the plan from the point of view of specific risks related to the customer.

“This achieves a continuous improvement process to enhance areas of the service. It becomes an integral part in all future dealings with your customers.”

While the capability to deliver a disruption continuity service solution is important, the flexibility to care makes the difference, he concludes.

“Here we monitor the various services and provide support across the many solutions we have, offering an expert monitoring and support service, which is a market differentiator for SITA.”

Keep your infrastructure up to date

By Rohit Bhatnagar, SITA Command Center, Singapore

In our business we must always improve. That means keeping IT infrastructure up to date. Outdated and unreliable underlying technologies are unlikely to be compatible with the concept of IT readiness. Older systems may be more vulnerable (e.g. to malware attacks), less resilient (e.g. to a power surge) and harder to recover in the event of a disaster.

Up-to-the-minute IT infrastructure that’s compatible with the latest compliance patches and operating systems will make it easier to bring new systems online in a disaster recovery scenario.

Recovery plans in a crisis - simplicity rules

An overly complex recovery plan can be a real problem when the moment comes to use it. In times of real crisis, plans need to be crisp, familiar and actionable, and teams must be trained to act swiftly and put the plan into effect. It’s been proven that too many tables, look-ups and references won’t work.

The best disaster recovery plan is one that can easily be used by knowledgeable people to guide them through a failover/recovery.

In SITA’s Command Centers, we challenge our readiness every six months with an intense test of our people, processes and tools. With clear guidelines, procedures and training we are able to achieve continuous improvement year over year.