Failed network router. Reservation and check-in systems down. Automatic transfer switch malfunction. Global IT crash. They can all result in downtime and disruption. But need that be an inevitability?
As air transport industry operations depend ever more on IT, a robust business continuity plan is the vital bridge between service availability and service restoration.
Failing to plan for high-risk incidents, or lacking a plan that mitigates the risk, is not an option. Otherwise, when an incident does occur, it can take hours to devise a recovery mechanism before that mechanism can, in turn, be implemented by operational teams.
That means the incident doesn’t get resolved for a considerable length of time. Needless to say, such scenarios can all too easily result in revenue loss and vastly diminished customer satisfaction – not to mention the untold damage to organizational reputation.
So what does it take to build a robust disruption continuity plan? What measures are needed to ensure this plan is always up to date? We asked SITA’s experts for advice.
According to Mathew White, SITA’s VP APAC Geography Services Operations: “Any IT business continuity plan must be grounded on four key foundation principles, with everything else built on that.”
The first principle is simple, says White. “It’s important to have a plan that works and not one that’s merely spoken about.
“That means a well-designed technical solution to ensure business availability, including the use of redundant technology. Second, you need to have a way of implementing preventative maintenance and proactive monitoring.”
Principles three and four, according to White, are that organizations must carry out a business impact and threat/risk analysis, and put in place a rapid recovery system that includes regular testing of systems and processes.
What greatly supports system resiliency is geographic diversity, whereby servers are in different countries and even perhaps on different continents. Of course, cloud services use this approach.
Mathew White, VP Geography Services Operations, SITA
What’s clear, explains White, is that you must start with a regular failover testing schedule or ensure your high-availability solution is working as designed. Here, dummy scenarios are frequently enacted to test system resiliency. The build design plays a crucial part in the success of failover testing.
“This relates to the duality that’s built into the design – dual lines, dual power, dual everything – to ensure great resiliency,” says White.
“What also greatly supports system resiliency is geographic diversity,” he continues, “whereby servers are in different countries and even perhaps on different continents.”
SITA’s VP Service Operations Manuel Garcia-Fernandez believes that being dual-equipped, while consistently carrying out failover testing, is an important step towards a vision of zero downtime for the air transport industry.
He regards a redundancy setup – especially redundancy across geographies – as a vital part of this.
“Both capability and flexibility come into play; the capability of being able to do it, with the flexibility of it being able to deliver what’s best,” says Garcia-Fernandez.
Cloud infrastructure and services offer a perfect example. With their rising popularity, this type of redundancy setup will soon be the norm. In parallel, a consistent and well-established failover testing schedule will help to mitigate and address outages when they happen.
Here, disaster recovery solutions or shared host access are made possible across different cloud locations. So if one area fails, the recovery solution effectively takes over in another location, ensuring no break in service.
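As a rough illustration of the takeover described above, the following minimal Python sketch picks the first healthy location from a priority list and then runs a simple failover drill. The `Location` class, the `pick_active` function and the site names are hypothetical, chosen only for this example; a real deployment would rely on its own health checks and routing.

```python
from dataclasses import dataclass


@dataclass
class Location:
    name: str      # hypothetical site label, e.g. a cloud region
    healthy: bool  # outcome of the most recent health check (assumed input)


def pick_active(locations: list[Location]) -> Location:
    """Return the first healthy location, in priority order."""
    for location in locations:
        if location.healthy:
            return location
    raise RuntimeError("no healthy location available")


# Failover drill: simulate loss of the primary and confirm the secondary takes over.
sites = [Location("primary-eu", True), Location("secondary-apac", True)]
before = pick_active(sites).name
sites[0].healthy = False  # dummy failure scenario
after = pick_active(sites).name
print(before, "->", after)
```

Running a drill like this on a schedule is one way to check that the redundant location really does take over, rather than assuming it will.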
Risk is inherent in anything that’s delivered for the air transport industry, including ‘technological unpredictabilities’.
Reducing risk requires:
- Implementing a risk assessment program
- Ensuring a comprehensive monitoring capability
- Incorporating a service management role
- Assessing the business impact
“Put simply, risk assessment is about considering specific areas of high potential linked with high business impact and then working on solutions to reduce this,” White explains.
It requires a constant consideration of where future failures may arise, the likelihood of the failure occurring and the potential outcomes that may result.
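One common way to make this kind of assessment concrete is to score each risk as likelihood multiplied by impact and rank the results. The sketch below assumes that convention; it is not a method described by the experts quoted here. The 1 to 5 scales, the `Risk` class and the example entries are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    name: str
    likelihood: int  # 1 (rare) to 5 (almost certain); illustrative scale
    impact: int      # 1 (minor) to 5 (mission-critical); illustrative scale

    @property
    def score(self) -> int:
        # Simple likelihood-times-impact score (an assumed convention)
        return self.likelihood * self.impact


def rank_risks(risks: list[Risk]) -> list[Risk]:
    """Return the risks sorted so the highest score comes first."""
    return sorted(risks, key=lambda r: r.score, reverse=True)


# Hypothetical register entries; the numbers are made up for the example.
register = [
    Risk("network router failure", 3, 5),
    Risk("back-office printer outage", 4, 1),
    Risk("automatic transfer switch malfunction", 2, 5),
]

for risk in rank_risks(register):
    print(f"{risk.score:>2}  {risk.name}")
```

The ranking puts the high-potential, high-impact items at the top, which is where mitigation effort would be directed first.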
Alongside this there must be a comprehensive monitoring capability that enables your service to be both proactive and preventative.
“This gives you a ‘view into the future’ and allows you time to be prepared – a valuable asset needed during any incident,” explains White.
“You can't always have the resiliency you want for various reasons. So you've got to have risk mitigation around that fact, with plans to continually look and, whenever you can, work towards a resiliency solution that’s best for your business.”
Service management is the face of services provided to customers. The purpose of creating this role is to enhance the organization’s ability to work closely with customers on regular improvement plans.
In the view of Garcia-Fernandez, these plans enable service managers and their customers to assess potential risks, especially those relating to a single point of failure, and to identify areas for improvement.
“They should take place at least once a year or even more frequently if possible,” he asserts.
Don't ignore history
SITA’s Director of Service Operations, Gustavo Romero, underlines the importance of an historical and present focus on risk assessment.
“Risk validation means considering both the history as well as the current health of systems,” he says.
“By assessing both history and current system data, you’re better able to make recommendations to prevent incidents from happening and to reduce their impact if they do occur.
“Having this information to hand enables you to make more informed decisions,” according to Romero.
Echoing this advice, Garcia-Fernandez emphasizes the importance of “looking backwards to progress forward.”
He firmly believes that what’s important is learning from the past and building on this knowledge.
“A knowledge base with key learnings from any previous outage, and the ability to use that to assess actions taken before, helps to fix the problem more quickly than starting from scratch,” he explains.
Making knowledge management an important part of a disruption continuity plan results in continuous improvement.
“By constantly learning from errors in the past, you’re able to make any service system more robust and you instill a process of continual improvement,” he adds.
The customer's shoes
Ultimately, defeating downtime is about putting yourself into the customer’s shoes to assess the business impact of services.
That means recognizing the vital difference between a mission-critical and a non-mission-critical requirement. Failure to consider this when designing a service solution for customers could result in either over-providing or under-providing.
SITA’s White again: “For instance, a back-office administration function for an airport is probably looking for something that’s reliable but isn't business-critical.
“However, when you’re talking about departure control in airports and all-important turnaround times, you must ensure business sustainability. Here you need to have mission-critical type resiliency and availability to meet the customer's business needs.”
Continuous monitoring that’s both proactive and preventative is what makes a disruption continuity plan robust.
These two characteristics are inter-dependent: constant proactive monitoring helps identify and detect potential risks; this then provides the necessary information for the preventative measures to be taken.
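To illustrate how proactive monitoring can feed preventative action, the sketch below raises a warning before a metric reaches the point of failure. The thresholds, the `classify` function and the sample readings are assumptions made for this example only.

```python
def classify(utilisation: float, warn_at: float = 0.75, fail_at: float = 0.95) -> str:
    """Label a utilisation reading (0.0 to 1.0) as ok, warning or critical.

    The thresholds are illustrative defaults, not recommended values.
    """
    if utilisation >= fail_at:
        return "critical"
    if utilisation >= warn_at:
        return "warning"
    return "ok"


# Hypothetical readings for one link, oldest first.
readings = [0.52, 0.61, 0.78, 0.83]
states = [classify(r) for r in readings]

# The first warning is the cue for preventative action, before any failure occurs.
if "warning" in states:
    first = states.index("warning")
    print(f"preventative action suggested from reading {first} ({readings[first]:.0%})")
```

The warning band between the two thresholds is what buys time: it flags a potential risk while the service is still available.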
Proactivity is key because what makes the difference is detecting failure quickly. Having done that, you must be able to quickly activate a disaster recovery protocol should any eventuality happen.
“What customers value is a first response of ‘how can we help’ or ‘this is how we can help’,” says Romero, “instead of ‘what’s happened?’.”
The main difference lies in the added ability of being able to understand the incident through proactive monitoring and impact evaluation, which then enables implementation of a solution that’s more relevant and realistic.
“Staying proactive means you’re immediately aligned with customer needs and better able to deliver what’s right for them.
“But proactive monitoring is not only about having that constant surveillance to guarantee and ensure service availability,” adds Romero.
“It’s also about providing important input to other aspects of the service that will result in higher and better service availability for all.”
All things considered, Garcia-Fernandez believes that defeating downtime demands a customer-centric mindset. Of course, it’s important to have the capability to respond and deliver service, but demonstrating flexibility and care is equally fundamental.
“It’s all about ensuring business continuity by responding smartly and quickly and being flexible, keeping services running and achieving maximum uptime,” he says.
“The resolving teams who implement the service solutions also perform a triage on what the main root cause is. With well-prepared plans, this can be quickly implemented.
“In addition, customer centricity is a pillar in any service improvement plan. It means regularly assessing the plan from the point of view of specific risks related to the customer.
“This achieves a continuous improvement process to enhance areas of the service. It becomes an integral part in all future dealings with your customers.”
While the capability to deliver a disruption continuity service solution is important, the flexibility to care makes the difference, he concludes.