Risk-Based Decision Making

Several factors have prompted principal maintenance inspectors in the U.S. Federal Aviation Administration (FAA) to avoid precipitous enforcement action against air carriers, such as grounding an aircraft fleet, if alternative corrective action resolves the safety issue. According to Keith Frable, FAA principal maintenance inspector for United Airlines, risk-based decision making is being introduced as “a new way forward … a new path where we are not inconveniencing passengers but we’re still having continual operational safety at the airlines.” Frable spoke in April at the World Aviation Training Conference and Tradeshow (WATS 2015) in Orlando, Florida, U.S.

The policy shift took into account poor decisions about grounding airline fleets¹ based on non-risk-related practices; reductions in the number of FAA maintenance inspectors; reductions in the number of maintenance professionals at airlines; and the number of new maintenance engineers added, in the context of required qualifications and employee turnover. “It was an initiative about a year and a half ago to get a team together and start this process; however, the training [for some categories of inspectors] is not out yet,” he said in April.

Risk-based decision making involves FAA flight standards district offices, principal maintenance inspectors and their ongoing relationship with the airline counterparts that they oversee. This means airline maintenance leaders and maintenance technicians — as representatives of the regulated entity — will play an important role in adjusting local safety cultures, in communicating and in providing information that enables the newer process to work as intended, he said.

Prescriptive Culture

Before being hired to oversee airline maintenance operations under Federal Aviation Regulations (FARs) Part 121, FAA maintenance inspectors had been indoctrinated for years in prescriptive system safety. “We teach the regulations [under that philosophy]. We teach people to follow the regulations. … We teach people to follow a prescribed procedure in the aircraft maintenance manual. … We teach people to follow the airworthiness directive [AD]. … We teach people to follow the checklists and to stick to prescriptive language. … This culture of … adherence to rules and regulations was developed way before [most of today’s aviation safety inspectors, with average age in their 50s and 60s, were hired],” he said.

Frable’s presentation objective, he noted, was explaining “why risk-based decision making is so new for Flight Standards and why [it is] such a struggle to change the culture of Flight Standards. … This is very foreign to [principal operations inspectors] to be able to make those decisions of not grounding an airplane because an airworthiness directive isn’t accomplished on an airplane. Their initial reaction is ground it, then fix it, then fly it.”

To launch the process while training catches up with policy, Frable made other requests of the attendees. “As an operator, you can also provide risk-based analysis to your [principal maintenance inspectors]. … You need to be in the forefront … I told United [Airlines], ‘It’s your operation. You have the problem. … You operate the airplanes. Tell me the risk analysis you performed, the SMS [safety management system] process you used, what the initial risk assessment is [and] what the mitigations are going to be. … We don’t always know [at FAA] the proper mitigation and then what’s going to be the [post-mitigation action].”

Principal maintenance inspectors need all the relevant information to determine what the level of risk would be in continuing to operate affected airplanes, whether flight operations should be shut down and whether FAA and the operator need to put fixes in place right away, he said, noting that FARs Part 135, commuter and on-demand operations, and Part 145 repair stations, are not required to have a safety management system (SMS) but stand to benefit from incorporating SMS components into their operations.

“The better you get [with SMS,] the better you’re going to help the FAA inspector help you make … sound decisions based on identified risk. … If you come up with a risk-based decision [and] a risk-based mitigation, you have to stick to that plan,” Frable said. “You’ll want to make that decision based on the likelihood and severity of the event, and what that really means to your company.”
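Frable’s likelihood-and-severity framing follows the familiar SMS risk matrix. As a rough illustration only — the category names, index weighting, and thresholds below are invented for this sketch and are not FAA-published values — combining the two dimensions into a risk level might look like this:

```python
# Illustrative likelihood/severity risk rating. Category names and
# thresholds are assumptions for the sketch, not FAA-published values.

LIKELIHOOD = ["extremely improbable", "extremely remote", "remote",
              "probable", "frequent"]
SEVERITY = ["negligible", "minor", "major", "hazardous", "catastrophic"]

def risk_level(likelihood: str, severity: str) -> str:
    """Combine likelihood and severity indices into a coarse risk level."""
    score = LIKELIHOOD.index(likelihood) + SEVERITY.index(severity)
    if score >= 6:
        return "high"    # e.g., stop operations, mitigate immediately
    if score >= 4:
        return "medium"  # mitigate on a defined schedule
    return "low"         # monitor; continue operating

# A severity-driven event whose likelihood is extremely remote does not
# automatically land in the "ground the fleet" band.
print(risk_level("extremely remote", "catastrophic"))  # → medium
```

The point of the sketch is the one Frable makes throughout: severity alone does not set the response; likelihood moves the outcome just as much.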

Gradual Implementation

Frable said this cultural shift will take time and experience, as well as training of FAA principal maintenance inspectors. “[The question now is,] ‘How can we get them past initial reactions to violate you [i.e., to allege airline noncompliance with a regulation or a requirement] or to ground your airplanes without having severe consequences to the flying public and to your operation?’ … How do we get in there and change the way they think about this entire process?”

Examples of how the process works — as part of Flight Standards’ transition from the Air Transportation Oversight System to the Safety Assurance System (SAS) of risk-based analysis — were taken from real-life stories of FAA experiences at United Airlines, “where we’ve worked through these issues and they’ve let me release this information … because it was done in a methodical, thorough process,” he said.

“For the Part 135 inspector and the Part 145 inspector, it incorporates decisions based on risk, so they have to go out and do their inspection, bring that data back and then decide what implications that has on the operation of your fleet or in the operation of your certificate. There’s no training for them on that, so it’s a … very big gap.”

Thrust Reverser Example

Regarding the FAA–United Airlines relationship while implementing risk-based decision making, Frable said, “We have [real-time] feedback, we have meetings. [We ask,] ‘Where are we on the project?’ if they put a mitigation in place. ‘Are we seeing the event [recur]? Has [your] mitigation worked?’” (See “Nonpunitive Interviews Yield Insights Into Aircraft Ground Damage.”)

The new process came into sharp focus in the case of noncompliance with an AD concerning thrust reversers. The stated intent in the AD was to prevent a thrust reverser from deploying in flight, he said, and the wording of the intent itself implied high risk.

Frable said, “They didn’t perform an AD task, so their initial conversation was, ‘We missed the AD cycle on these airplanes because the [task] card wasn’t done, whatever the reason. And now we have airplanes flying out there with the potential of a thrust reverser opening up in flight because we didn’t accomplish the AD.’”

He said that, historically, a principal maintenance inspector’s immediate reaction likely would be to say, “Let’s shut down 85 airplanes around the world, inconvenience passengers and shut the operation down until we get that checked out.”

In the risk-based decision making paradigm, inspectors make a more cautious and deliberate assessment first. In the example, United Airlines brought a sound decision to the discussion derived from detailed analysis of the thrust reverser situation, he said.

“The [uncompleted] task was an ancillary task; it wasn’t part of the AD. It was a task that was done on a secondary backup system that had a 4,000-hour flight-cycle check [interval] on it. … Where’s the criticality on a 4,000-hour check? [It’s] pretty minor. [The maintenance task is not done] to prevent the [in-flight thrust reverser deployment]. It is there as a backup check. You [check] a pin [on a circuit board and] make sure [you don’t] have voltage at a certain spot. … Even if voltage is there, it wouldn’t contribute to what the AD was for.”

Instead of grounding aircraft, FAA permitted a staged-but-rapid assessment of the risk factors by United Airlines, starting on a Friday night with four airplanes already in “remain overnight” status, including one in a hangar for a heavy maintenance check.

“If on Saturday [morning,] those four airplanes had voltage [present and so] met the … required-maintenance [criteria], then … this would raise the likelihood and, in my opinion, raise the severity. That would tell me that [United Airlines] needed to do more of those checks quicker,” Frable said. “On Saturday morning, all four checked out ‘good.’ On Sunday, they [planned to check] another 10 [but instead of continuing checks for four to five days], all were checked [in three days, which brought them into AD compliance]. … Our risk assessment was valid, and we proved it was valid. There was no risk to the flying public, and there was no risk that [noncompliance with] this AD was going to cause the catastrophic loss of an airplane.”
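The staged approach — check a small first batch, and accelerate the remaining checks only if a finding raises the assessed likelihood — can be sketched as follows. The tail numbers, fleet size, and function names here are hypothetical, invented purely for illustration:

```python
# Hypothetical sketch of the staged compliance check described above.
# Fleet data and names are invented for illustration.

def staged_check(fleet_findings, initial_batch=4):
    """fleet_findings maps tail number -> True if voltage was present
    (a finding that would raise likelihood and severity)."""
    tails = list(fleet_findings)
    first_batch = tails[:initial_batch]
    # Any finding in the first batch invalidates the low-likelihood
    # assessment: "do more of those checks quicker."
    return any(fleet_findings[t] for t in first_batch)

# All four "remain overnight" airplanes checked out clean, so the risk
# assessment held and the rest of the checks proceeded as planned.
fleet = {f"TAIL{i:02d}": False for i in range(14)}  # invented 14-airplane sample
print(staged_check(fleet))  # → False: no escalation needed
```

A clean first batch validates the initial risk assessment; a single finding flips the plan toward immediate fleet-wide checks, which mirrors the escalation condition Frable describes.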

A similar situation at United Airlines occurred when FAA inspectors discovered that a manufacturer of auxiliary power units (APUs) had discontinued a process called the third bearing wash. FAA’s immediate concern was that, theoretically, discontinuing this practice could have a life-limiting effect on bearings — essentially a type of damage that could lead to bearing failure and an uncontained failure of the APU.

“[We found] out that the third bearing wash was required as a result of [another manufacturer’s] maintenance task card vetted by United Airlines,” and apparently APUs were being returned to line service without the benefit of this process, although it was still specified in the aircraft maintenance manual. Frable worked with United Airlines engineers to check for bearing discrepancies and to rate the risk based on checking bearings from a sample of APUs.

The engineers’ report at the end of this work led to a risk-based decision that potential severity was high but that the probability of an unsafe outcome was extremely remote. Frable ruled that no grounding would be required, and further analysis justified the risk-based decision. The engineers ultimately concluded that performance or nonperformance of the third bearing wash had no effect on safety, and that, ironically, all the APUs of initial concern were in maintenance parts stores at the time. “We could have grounded airplanes. We could have made that decision based on a knee-jerk reaction,” Frable said.

Hangar Nose Drop

Another event raised similar initial concerns of potentially catastrophic outcome. In this case, Frable received a late-night message saying that the nose of a United Airlines aircraft had dropped to the hangar floor during maintenance. No one was injured or killed in this ground accident. “So again … you don’t shut everything down. You let the company work the process. The airplane is safe. It’s in the hangar,” Frable said. He visited the site to observe whether the nose gear safety pins had been inserted, and assisted otherwise in the airline’s investigation the next morning.

“[Critics] could say, ‘You’re letting them fly around with AD noncompliances. You’re letting them drop airplanes on the nose, and you’re not shutting them down. You’re not stopping [their investigation] process. You could say that, but in reality — if you have decisions based on risk and you have a great relationship — [this] is going to help [the airline] and it is going to help the FAA make those calls. [Airlines still] do get violated [but] that’s the last thing we want to do. As a [principal maintenance inspector], am I going to violate the company [because an airplane was dropped on the nose]? Typically, I’d say no,” he said.

“If you put a different person in the same position, would they make the same mistake? If the answer is ‘no,’ the procedures are there, the task cards are written properly, the requirement to pin the nose gear when [the airplane] comes into the hangar is there. … The company has established a protocol, everything is there. …

“So you go to the individual who made a conscious decision not to follow the procedures set forth by the company. I want [the airline] to fix the guy who was not following them … to fix that culture. I don’t want [them] to fix what’s already in place and what’s already working. … We have an ASAP [aviation safety action program, and this situation] would be handled through ASAP.”

Ultimately, post-mitigation analysis by the airline is critical in all such situations, he said. “Did it work? Was it effective? What were the lessons learned? Is there something else we should have done for that event? … If it wasn’t a good, comprehensive fix and it was ineffective, how can we make it effective and what follow-up do we need to do to make that mitigation effective?” he said.

Note

In 2008, for example, the FAA was involved in grounding of airplanes by American Airlines, Southwest Airlines and Delta Air Lines, Frable said, adding, “[Delta’s McDonnell Douglas MD-88] fleet was grounded for basically the routing of wiring in the landing gear and for [cable-sheath] tie wraps. … If you risk-rate that, was it a high risk? I would say no. The likelihood of a catastrophic event would be low and the severity would be low. … However, the decision was made based on it being [noncompliance with] an airworthiness directive and an airworthiness directive alone.”

Nonpunitive Interviews Yield Insights Into Aircraft Ground Damage

The results led to insightful, nonpunitive interviews with employees involved in aircraft damage events, said Lisa Crocket, senior manager, quality assurance for line operations safety assessment, United Airlines. “The data I have so far already give a completely different picture of what [SOP] compliance looks like and why. We’re going from [assuming] complacency to [saying,] ‘Holy smokes, they didn’t have the materials they needed!’ That is … a completely different way to mitigate the issue.” She spoke in April at the World Aviation Training Conference and Tradeshow (WATS 2015) in Orlando, Florida, U.S.

“We collected data and used that data to point to very specific systemic fixes. ASAP can tell you what’s wrong. LOSA can tell you how often it happens in the real world. I hope you use this example to look beyond the ‘who’ and get to the ‘what’ — and find a model that works for you, that helps you get to risk-based decision making,” Crocket told the audience. “I hope [trainers and training content developers] actually incorporate [our example] into some of your training so you can see these underlying factors where people are not compliant, [and] find a way to train those out of your organization. … We always want [to] get to the ‘why.’ How did it happen?”

She described her study as involving an unidentified organization within the airline that had resisted previous efforts to resolve this ground-damage trend, and some employees had told her that the problem seemed too large to fix. “They didn’t necessarily see it as a high risk or they didn’t understand the ‘why’ [of] it. [They were] great at telling us what the problem was and how often it happened, but [they] weren’t getting to the correction.”

Previous attempts had recognized that equipment failures and malfunctions sometimes were factors in a specific case, Crocket said. “But largely, the problem … was that employees failed to follow standard operating procedures. … They were being complacent. … Complacency [seemed to be] a big issue, but how do we know it’s really complacency? Is it an assumption? Can we measure it? … We also wondered if there were other factors underneath that were causing people to not comply with procedures.

“When we found nonconformances by employees, we would fix them. [We told ourselves,] ‘So this employee didn’t follow the SOP and we worked with that employee, we trained them, we disciplined [them]. Whatever the path was — that employee is fixed.’ Then [at another] location, the same thing happened: aircraft damage, nonconformance to the standard operating procedures, [and] we fixed that one.”

She became dissatisfied with the prospect of repeating this cycle without resolving issues at one time for the entire airline. “We decided to look at the problem systemically — when we found factors, we would map them to a human factors model [that] encourages us to look not only at ‘what happened’ and ‘why it happened’ on the employee level, but also look for factors higher in the organization,” Crocket said. “It’s a little harder to [turn] that ship, but you’re fixing [the issue] across the organization.”

To look deeper into causal factors, she proposed a hybrid of well-established, peer-to-peer LOSA techniques and a quality assurance audit. “When we found a nonconformance, our goal was to interview the employee nonpunitively. … Something happens in the real world that prevents employees from following that procedure [learned in training,] or something impacts their work out in the organization that causes them to make different decisions. … We visited 19 locations and collected 225 observations [and] great interview notes from people who didn’t conform to SOP. [I told interviewees,] ‘Listen, we have something to learn [and] I’m gathering data. [Can you] help me understand why you took this path instead of this path?’ … I cannot tell you how rich the data are when you don’t ‘hammer’ the person … when you give them free rein to talk.”

Several factors driving SOP nonconformance emerged from these interviews and the related sources of factual information. “The first driver of nonconformance … was called a decision error — when the employee consciously decides to vary from that procedure. It’s the wrong decision, but it’s a decision. [Some said,] ‘I did it this way because I was in a hurry. I did it this way to get the plane out because we were behind.’

“The second [driver was called] a skill-based error. … It’s something you do time after time, and you hardly even have to think about it. It’s the same thing as ‘being on autopilot.’ We would approach an employee and say, ‘We noticed you missed a step …’ and they would say, ‘[No,] I did that.’ And we said, ‘You didn’t, actually. … That’s part of your checklist, and you missed that.’ So we dug deeper to ensure that they weren’t just kind of smoking us [i.e., trying to mislead investigators]. If they really believed that they had completed the checklist, that was a skill-based error, and we classified it as that.

“The third driver of nonconformance … was called a routine violation. … It’s a variance from the rule [because I know] I’m not going to get in trouble and they let me [deviate]. We found that was a high driver.”

Compared with these insights, attributing aircraft-damage events to the vague category of complacency fails to explain or solve correctable issues. “[As to deliberate choices in] decision making, the skill-based error, being on autopilot and routine violations, we can train and we can set up processes to engineer those out of our company,” Crocket said.

Investigators recognized another causal level in some interviews. These background factors — overconfidence, lack of proficiency and complacency — came into play by influencing the employee to decide not to comply with the SOP. “Overconfidence [is apparent if the person says,] ‘I’ve been doing this job for 30 years; nothing bad ever happened. I’ve never hit an airplane; I’m fine. … [Or they say,] ‘I haven’t worked in this area for a while.’ Largely, in training [on belt loaders, for example, or] whatever the procedure is, they should be familiar with it. [They’ve done] it 1 million times, they should know how even if they haven’t done it in three months.”

Because of these beliefs, the airline had not recognized the actual risks of taking people out of their normal area of operation or out of their normal job position and then returning them to that position several months later, even though some of those employees struggled at first to meet the required level of communication and coordination within the team.

“We [also] did find complacency in the study. … It didn’t even make it past 5 percent [of causes of aircraft damage],” she added.

With the human factors model, supervision also was addressed. Regarding inadequate supervision, the study noted cases in which supervisors were not adequately monitoring maintenance employees’ work or, when monitoring, they were not reinforcing the employee’s training on how to correct a known problem. “When I did the study, I was taken aback by [some supervisors’] failure to correct a known problem, which actually relates back to my routine violation.”

At the highest level of the scope of the study, the LOSA staff identified aircraft damage causal factors at the organization level. “What we found in that category was either that there was no accountability or weak accountability for standard operating procedures,” she said, recalling her recommendations to senior management. “We should be making sure that supervisors are out in the operation more, and that they have the tools they need to ensure that people are adhering to SOP consistently. In addition, the leadership needs to provide those supervisors with the structure to accomplish this. … Now, we have actually trained our LOSA observers to interact with the person who didn’t conform to SOP so that they can grab those interview notes and use [them] so we can give ‘color commentary’ [i.e., background details as] to the reasons why people aren’t following SOP. We just implemented this April 1, and it already has transformed how we look at what’s wrong out there.”

Crocket said such LOSA data are considered to be “owned” by each stakeholder entity, such as the maintenance division, station operations or ground operations. “They own their data, and we’ve given them a really easy tool that — literally, in three clicks [of a computer mouse] inserts the data and ranks it by risk. … I do it on a systemwide basis so that I can take that information and roll it up to leadership as a system compliance.”
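The kind of risk ranking and system-wide rollup Crocket describes might look something like the following sketch. The stations, nonconformance drivers, and numeric risk scores below are invented for illustration; this is not United’s actual tool:

```python
# Illustrative sketch (not United's actual tool) of LOSA observations
# tagged with a nonconformance driver and ranked by an assumed risk score.

from collections import Counter

observations = [
    {"station": "IAH", "driver": "decision error",    "risk": 4},
    {"station": "ORD", "driver": "routine violation", "risk": 3},
    {"station": "DEN", "driver": "skill-based error", "risk": 2},
    {"station": "ORD", "driver": "complacency",       "risk": 1},
]

# Rank individual findings by risk, highest first, for each data "owner."
ranked = sorted(observations, key=lambda o: o["risk"], reverse=True)

# Roll the same data up system-wide for leadership, as Crocket describes.
by_driver = Counter(o["driver"] for o in observations)

print([o["station"] for o in ranked])  # → ['IAH', 'ORD', 'DEN', 'ORD']
print(by_driver.most_common())
```

Each stakeholder works from the ranked view of its own findings, while the driver counts support the system-level compliance picture reported upward.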

In sum, noncompliance with SOPs cannot be presumed to occur solely because people are complacent or because they don’t care, Crocket said, adding, “We have to find those reasons so that we can make better decisions — [through] risk-based decision making — on how to mitigate them.”