A very normal accident

It was early evening on 24 September 2022 when an offshore AW139 helicopter inbound to Houma-Terrebonne Airport in Louisiana, USA, declared a mayday. A lot had already happened in the cockpit by the time the co-pilot hit the press-to-transmit switch.

The first sign of trouble was a smell of burning plastic permeating the aircraft, but there was no smoke, nor were there any abnormal indications, so crew thoughts turned to the air conditioning unit, which they decided to turn off.

A few minutes later a loud “whoof” sound caught the pilots’ attention and within seconds they were engulfed in orange smoke. Cockpit visibility was zero due to the thick smoke, and emanating from somewhere around them the crew heard the rotor low audio warning (the highest priority warning there is, which advises the pilots of critically low rotor speed). At the same time they encountered a rapid overspeed of both engines and a significant and uncommanded movement of both the collective and cyclic flight controls. Unable to clear the smoke by opening his small ventilation window, the co-pilot attempted to open the cockpit door but was unable to do so due to high airspeed. He finally managed to clear the smoke by removing the entire cockpit window.

The crew now fought to regain control of the helicopter, holding the flight controls hard forward and right to counter the uncommanded movement. Although they had managed to lower the collective (power) lever fully, both engines were still producing power beyond their design limits and the aircraft was climbing rapidly through 3500 feet as a result. Unable to establish a descent with normal control movements, the crew opted to try to control the climb by selecting one engine to idle. This action caused a sudden and unexpected drop in rotor speed, forcing them to return the engine to flight condition. Leaning on the collective with full body weight to keep it down, and still climbing at high airspeed, they resorted to pushing the cyclic control hard forward to force a descent, causing the helicopter to race past VNE (its never-exceed speed).

Compound emergencies such as this one, with multiple and confusing failure modes in the cockpit, are as challenging as it gets. The crew were forced back onto the old maxim of aviate-navigate-communicate, starting with working out just how to keep the aircraft flying. However, at the risk of seeming less than enthusiastic in recognising the heroic efforts of the pilots, given the extreme complexity of the emergency unfolding, I’m going to suggest that the exceptional circumstances in which they found themselves were actually those of a very normal accident.

The effect of interactive complexity on safety

Now might be a good time to introduce the work of Charles Perrow and his book Normal Accidents: Living with High-Risk Technologies (1984). Perrow was a sociologist and not a safety scientist, but he was amongst the first to describe the characteristics of systemic accident sequences that were later popularised in the 1990s by the much better known James Reason, whose Swiss Cheese Model is familiar to most in professional aviation and beyond. Perrow argued that any system that works with complex and high-risk technology is characterised by what he calls ‘interactive complexity’. A modern aircraft like the AW139 is an example of this: a technological system which has many thousands of interacting components and engineered parts. So according to Perrow, the complex interaction of failure modes the crew faced in this incident is to be expected as entirely ‘normal’.

When we experience two or more failures amongst multiple components they are likely to interact in some unexpected way. And two or more failures can easily interact in such a way that can break the system. No operator can be reasonably expected to be capable of figuring out many of these interactions in real time and responding accordingly, and this inability to understand how multiple failures could occur and interact is pretty much true across any industry. Furthermore, Perrow also argues that it is simply not possible for aircraft or system designers to consider every permutation of how when X fails, Y might also be impacted. The outcome of this is unanticipated – and unanticipatable – risk. After a component failure, accident or incident the designer might respond with a new control measure and this control might solve one problem, but it might introduce three new ones.
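To get a feel for why designers cannot enumerate every interaction, here is a minimal sketch that simply counts how many distinct pairs and triples of components could fail together as a system grows. The function name and the component counts are illustrative assumptions, not figures from Perrow or from the AW139, and counting combinations says nothing about which of them actually matter.

```python
from math import comb


def interaction_counts(n_components: int, max_order: int = 3) -> dict[int, int]:
    """Count the distinct k-component failure combinations for k = 2..max_order.

    Hypothetical helper for illustration: it treats every component as able to
    interact with every other one, which is an upper bound rather than a model
    of any real aircraft.
    """
    return {k: comb(n_components, k) for k in range(2, max_order + 1)}


if __name__ == "__main__":
    # Illustrative system sizes only.
    for n in (10, 100, 1000):
        print(n, interaction_counts(n))
```

Even at a thousand components, far fewer than a real helicopter has, the number of possible three-way combinations is already beyond anything a design review could examine one by one.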

Therefore, he concludes, technology is both a risk control and a hazard itself. The act of adding technology is at best risk neutral. Continually adding more technology in the belief that we are adding more layers of defence in a system is flawed because we are in fact adding more combinations of possible failure modes. In other words, there is a direct trade-off between increasing safety by adding in more controls, and decreasing safety by adding complexity. For example, it is a simple and inevitable fact that pilots’ understanding of their own aircraft is decreasing. The aviation industry is an example of what should be accepted as a more general truth: year on year we are creating ever more complex systems and organisations. What can we do about this safety paradox? There is a case to make that simplicity should be a key objective in achieving safety within any system. And if you aren’t convinced by this then consider that while any fool can make a system larger and more complex, it takes a genius to make something simpler.

Mis-understanding risk in complex systems

Let’s go a little deeper into this idea of interactive complexity, as it clearly plays a starring role in the nature of the accident that we left our pilots grappling with above. Redundancy of critical systems is a key safety principle in aircraft design. When one layer fails we have always got other layers of safety to keep us airborne. The problem with this philosophy is that it assumes that each layer is independent of the others, and they’re not. Hence redundancy didn’t do much of a favour for this crew.

This principle is well explained using the example of probabilistic risk assessment. In aviation, as in many high-risk industries, many organisations use Fault Tree Analysis to assist the risk assessment process. The idea of this tool is that it examines the probability of each individual event and then calculates – through a complicated logic structure – the confluence of different events, taking the probabilities of each and combining them together. Such confluences are not probable. In fact, each individual combination of them is statistically quite improbable, but the tool achieves the task of examining the possible combinations of events and outputs a likelihood. What is missing from this kind of analysis, however, is that it assumes that we truly know the probability of each individual event and – more importantly still – that we can treat them as independent things and combine them. It never takes into account all of the factors that might make these apparently improbable combinations likely to happen all at once.

For example, in an air traffic control tower the chance of diode number 337 failing at the same time as wiring cable number 454 failing might be 1:100,000 multiplied by 1:100,000. So the chance of them happening at the same time is considered to be 1 in 10 billion. But if the electrical plant room is in the basement and the entire ground floor of the building is under water then they are both guaranteed to fail at the same time. If the maintenance regime is failing at an air operator then two apparently unrelated parts on an aircraft could well both be under-maintained and therefore likely to fail. Risk assessments don’t take into account these reasons why apparently independent events might actually be quite likely to happen at the same time.
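The diode-and-cable example can be sketched numerically. The snippet below compares the textbook calculation, which multiplies the two probabilities as if they were independent, with a simple common-cause model in which a single shared event (such as the flood) takes out both parts at once. The function names and the flood probability are hypothetical values chosen for illustration; only the 1:100,000 figures come from the example above.

```python
def p_both_fail_independent(p_a: float, p_b: float) -> float:
    """Joint failure probability if the two parts really are independent."""
    return p_a * p_b


def p_both_fail_common_cause(p_a: float, p_b: float, p_common: float) -> float:
    """Joint failure probability when a shared event can fail both parts.

    Assumed model: with probability p_common the shared event occurs and both
    parts fail for certain; otherwise they fail independently as before.
    """
    return p_common + (1.0 - p_common) * p_a * p_b


if __name__ == "__main__":
    p_diode = 1 / 100_000   # from the example in the text
    p_cable = 1 / 100_000   # from the example in the text
    p_flood = 1 / 1_000     # hypothetical chance of the plant room flooding

    independent = p_both_fail_independent(p_diode, p_cable)
    with_flood = p_both_fail_common_cause(p_diode, p_cable, p_flood)

    print(f"independent estimate: {independent:.3e}")
    print(f"with common cause:    {with_flood:.3e}")
    print(f"ratio:                {with_flood / independent:.3e}")
```

Under these assumed numbers the joint failure is dominated almost entirely by the shared cause, so the individual part reliabilities barely matter. That is the point the fault tree misses when it treats the two events as unrelated.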

There is no better illustration of this than a paper written by John Downer (2013) called Disowning Fukushima, which reflects on the credibility of nuclear risk assessments and argues that these sorts of calculations are fundamentally unworkable. Can we objectively and accurately calculate the probability of suffering a catastrophic nuclear meltdown? Downer describes how one of the manufacturers of nuclear plant equipment for the Fukushima reactor had calculated the risk of a core damage incident as one per reactor for every 1.6 million years. They therefore decided that the probability of a core meltdown was so small that it was not even worth calculating a number for it. The risk assessment experts at Fukushima judged that the reactors were at risk of an incident once in a thousand years.

If this figure sounds ridiculously low to you, it’s because it is. Their estimates did not include any consideration of where that reactor happened to be. The assumptions of the number crunchers were only based on the reactor itself. Japanese history has not been well recorded for the last one thousand years, but even if we think back over the last one hundred years and focus on those things that have nothing to do with a nuclear reactor we should pause for thought. How many events have happened to Japan in the last hundred years which are capable of flattening a city, never mind a nuclear power plant? There have been multiple city-destroying earthquakes, and multiple city-destroying floods and tornadoes, not to mention a war that did a pretty good job of flattening multiple Japanese cities in that time. All of this shows just how badly the calculated figures fail to take into account all of the possible causes of such an accident.

Epilogue

Returning to Louisiana and the stricken AW139 helicopter, we have an apparently unrelated loss of engine control, flight control failure, and smoke in the cockpit, all of which are interacting in unusual and apparently inexplicable ways. What we do know from the initial accident report from the NTSB is that a wiring loom chafing against a flight control run above the pilots’ heads was responsible for a localised fire which deformed the flight controls and provoked a series of interactive failures. Just as Perrow described, no design mitigation or risk calculation could have reasonably anticipated how this chain of events would play out, and no pilot could be expected to be able to understand exactly what was happening in real time. The pilots could only focus on what was working and fly the aircraft as best they could. They did an outstanding job of doing just that, eventually managing a power-off landing of the helicopter from which everyone on board walked away.

The AW139 helicopter is a complex system with millions of potential interactions between its engineered parts. The aircraft also operates within the context of another complex system – the construct of civil aviation itself. It stands to reason that within a system-of-systems with literally billions of potential billion-to-one type failure modes it is inevitable we see accidents from time to time. Crucially, as this accident shows us, when things go wrong that you don’t expect, we need the operators there. Human operators are a really important source of capacity and resilience in a complex system. After all, you cannot pre-programme automatic responses to unanticipated threats and conditions. The human ability to react to unforeseen and unforeseeable circumstances remains unique and as yet unmatched.
