Drift towards danger and the normalization of deviance (2017)
High-hazard activities often face safety issues due to incomplete procedures, leading to accepted unsafe practices. This normalization of deviance can result in catastrophic failures, highlighting the need for comprehensive safety management.
High-hazard activities depend on established rules and procedures to ensure safety, but these guidelines are often incomplete, leading to deviations by frontline workers. This gap between "work-as-imagined" and "work-as-done" is recognized in human factors research, particularly by French ergonomists who studied the differences between prescribed and actual work. Over time, these deviations can lead to a normalization of deviance, where unsafe practices become accepted due to their repeated use without immediate negative consequences. Jens Rasmussen's concept of "drift to danger" describes how organizational behavior can shift towards riskier practices under pressures for cost-effectiveness and efficiency. This gradual process often goes unnoticed until an accident occurs, as safety boundaries are not clearly defined and can change over time. The phenomenon is not driven by malicious intent but is a natural outcome of adaptive behaviors within complex systems. Historical accidents, such as the Challenger and Columbia space shuttle disasters, exemplify how normalization of deviance can lead to catastrophic failures. These incidents highlight the importance of understanding the systemic factors that contribute to safety lapses, emphasizing that safety management must consider both proactive and reactive measures across all levels of an organization.
- High-hazard activities rely on incomplete rules and procedures, leading to deviations in practice.
- Normalization of deviance occurs when unsafe practices become accepted over time.
- "Drift to danger" describes the gradual shift towards riskier behaviors due to organizational pressures.
- Safety boundaries are often fuzzy and can change, complicating risk management.
- Historical accidents illustrate the systemic nature of safety failures and the need for comprehensive safety management.
- Personal experiences highlight how individuals can unconsciously adopt unsafe practices over time.
- References to historical examples and literature illustrate the widespread nature of this phenomenon across different fields.
- Concerns are raised about the lack of practical guidance on combating normalization of deviance.
- Connections are made between normalization of deviance and high-stakes situations, such as foreign policy and organizational culture.
- Discussion includes the idea that normalization of deviance can lead to catastrophic outcomes, drawing parallels to various domains beyond physical safety.
I see the same forces working on my software practices. For instance I start with wide and thorough testing coverage and over time reduce it to the places I usually see problems and ignore the rest. Sometimes production can be nearly maimed before I notice and adjust my grip.
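One lightweight way to resist that particular slide is to make the current coverage level an explicit baseline that can only be lowered on purpose. A minimal sketch, assuming coverage.py's JSON report (`coverage json` writes a `totals.percent_covered` field); the baseline file name and ratchet logic are invented for this example, not standard tooling:

```python
# Coverage "ratchet" for CI: fail if coverage drops below the recorded baseline,
# so relaxing the standard has to be an explicit, reviewed decision rather than
# a quiet drift. Assumes coverage.py's JSON report; the baseline file is hypothetical.
import json
import sys
from pathlib import Path

BASELINE_FILE = Path("coverage-baseline.json")  # hypothetical file checked into the repo


def current_coverage(report_path: str = "coverage.json") -> float:
    report = json.loads(Path(report_path).read_text())
    return float(report["totals"]["percent_covered"])


def main() -> int:
    current = current_coverage()
    if BASELINE_FILE.exists():
        baseline = float(json.loads(BASELINE_FILE.read_text())["percent_covered"])
    else:
        baseline = current

    if current + 0.01 < baseline:  # small tolerance for floating-point noise
        print(f"Coverage dropped: {current:.2f}% < baseline {baseline:.2f}%")
        return 1

    # Ratchet upward: improvements become the new normal.
    BASELINE_FILE.write_text(json.dumps({"percent_covered": max(current, baseline)}))
    print(f"Coverage OK: {current:.2f}% (baseline {baseline:.2f}%)")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```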
"The Overton window is the range of policies politically acceptable to the mainstream population at a given time. It is also known as the window of discourse.
The term is named after the American policy analyst Joseph Overton, who proposed that an idea's political viability depends mainly on whether it falls within this range, rather than on politicians' individual preferences. According to Overton, the window frames the range of policies that a politician can recommend without appearing too extreme to gain or keep public office given the climate of public opinion at that time."
While originally about politics, I feel it can be applied to many other aspects of humanity and maybe is just a specialized form of the normalization of deviance.
"Safety may not at all be the result of decisions that were or were not made, but rather an underlying stochastic variation that hinges on a host of other factors, many not easily within the control of those who engage in fine-tuning processes. Empirical success, in other words, is no proof of safety. Past success does not guarantee future safety. Murphy's law is wrong: everything that can go wrong usually goes right, and then we draw the wrong conclusion."
"Why, in hindsight, do all all these other parts (in the regulations, the manufacturer, the airline, the maintenance facility, the technician, the pilots) appear suddenly "broken" now? How is it that a maintenance program which, in concert with other programs like it never revealed any fatigue failures or fatigue damage after 95 million flight hours, suddenly became "deficient"? Why did none of these deficiencies strike anybody as deficiencies at the time?"
The central idea is not to (stop at) discovering what mistakes were made, but to understand why they didn't seem like mistakes to the individuals making them, and what suppressed the influence of anyone who might have warned otherwise.
"First we were at an altitude where we probably weren't thinking all that sharply to begin with, and then we got tired, cold, and hungry, and that's when we made the stupid mistake that killed ${COLLEAGUE}."
A detailed analysis of the organizational culture at NASA, undertaken by sociologist Diane Vaughan after the [Challenger shuttle destruction] accident, showed that people within NASA became so accustomed to an unplanned behaviour that they no longer considered it deviant, despite the fact that they far exceeded their own basic safety rules. This is the primary case study for Vaughan's development of the concept of normalization of deviance.
Even here they have a section on how the safety performance boundary is fuzzy and dynamic.
I wonder though what things look like with super high dimensions. When there are a hundred different things that go into whether or not you're being safe, that boundary's fuzzy and dynamic nature might extend clear across the entire space. And the fact that failures happen due to rare occurrences suggests that we're not starting at a point of safety but actually starting in a danger zone that we've just been lucky enough not to see fail yet.
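A toy calculation of that intuition: with many independent ways to go wrong, each individually rare, the probability that at least one of them bites compounds quickly over time. The dimension count and per-dimension probability below are made up purely to illustrate the shape of the curve:

```python
dims = 100            # independent things that all have to go right (assumed)
p_per_dim = 0.0001    # chance any one of them goes wrong on a given day (assumed)

# Probability that a single day passes with nothing going wrong in any dimension.
p_clean_day = (1 - p_per_dim) ** dims
print(f"Chance a single day passes cleanly: {p_clean_day:.4f}")   # ~0.9900

# Over longer horizons the "lucky so far" region shrinks towards zero.
for days in (30, 365, 3650):
    p_incident = 1 - p_clean_day ** days
    print(f"Chance of at least one incident within {days:>4} days: {p_incident:.3f}")
```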
100% unit test coverage comes to mind (even for simple getters). Where some might see a slide towards danger as the coverage goes down, another sees more time to verify the properties that really matter. And I don't see why we can't get into the scenario where both are right and wrong in incomparable ways.
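A hedged illustration of that trade-off: a line-coverage metric counts the trivial getter test and the invariant test equally, yet they buy very different amounts of confidence. The `Account` class and its invariant are invented for this sketch; the property test uses the real `hypothesis` library:

```python
from hypothesis import given, strategies as st


class Account:
    def __init__(self, balance_cents: int = 0):
        self._balance_cents = balance_cents

    @property
    def balance_cents(self) -> int:
        return self._balance_cents

    def withdraw(self, amount_cents: int) -> None:
        if amount_cents <= 0 or amount_cents > self._balance_cents:
            raise ValueError("invalid withdrawal")
        self._balance_cents -= amount_cents


def test_getter():  # bumps coverage, verifies almost nothing
    assert Account(100).balance_cents == 100


@given(start=st.integers(min_value=0, max_value=10**9),
       amount=st.integers(min_value=-10**9, max_value=10**9))
def test_balance_never_goes_negative(start: int, amount: int):
    # The property that actually matters, whatever path the code takes.
    account = Account(start)
    try:
        account.withdraw(amount)
    except ValueError:
        pass
    assert account.balance_cents >= 0
```

Dropping `test_getter` barely changes what is actually verified, while dropping the property test removes most of it, even though a coverage report may score the two moves similarly.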
Normalization of Deviance – https://www.lesswrong.com/posts/4nzfts9AYXy6htcQo/normalizat...
I thought the "Safety Third" sticker was pretty hilarious, and eventually learned that it was part of this odd movement about fifteen years ago regarding the same thing as this article. More here: https://mikerowe.com/2022/03/the-origin-of-safety-third/
I put the sticker on my laptop, and once got confronted by a confused and possibly angry worksite manager who saw it in a cafe and demanded an explanation. I'll never forget how he took the slogan as some kind of personal affront.
Practical Drift Towards Failure - https://news.ycombinator.com/item?id=21406452 - Oct 2019 (1 comment)
I don't see practical guidance in the article on how to do it? Do I just sit down, throw my arms in the air, and complain, "oh, how things are going in a bad way"?
> (in particular if they are encouraged by a “cheaper, faster, better” organizational goal)
This struck me, I have never remotely worked for a place that seriously believed "you get what you pay for." I wonder what that would be like.
The reactor control systems are powered by the reactor itself, but this isn't considered a liability, because once started, such a device is not intended to be stopped; shutdowns are large costly affairs intended to occur rarely for refueling. The reaction is regarded as a force of nature like a running river. But the reactor can be operated in high vs. low power modes. Notably, as a system, the device is most hazardous when transitioning between power modes, especially towards low power mode.
It was expected that in certain emergencies, reactor power would be lowered to the point where the steam generator turbines' inertia would act like a battery of reserve power used to keep cooling the reactor, but knowing precisely how well this works required verification. To conduct the test, the operators intentionally drove the reactor towards the edge of its low-power operational limits, overriding safety protocols and subsystems to create the preconditions of the experiment. Disaster ensued when the operators, fearing they had lowered power too much and were on the brink of an expensive non-routine shutdown, goosed it, creating a feedback loop into overpower. They made a last-ditch attempt to control the crisis using the emergency core shutdown system, a mechanism of last resort, but a poorly handled design edge case caused the shutdown mechanism itself to create an enormous power surge that blew apart the core cooling system: a 3-gigawatt-thermal core spiked to 30 gigawatts thermal and the lid blew off, so to speak.
The disaster was directly caused by testing the facility's ability to handle a theoretical emergency, and would have been avoided if the test had not been performed.
But beyond this, the test protocol required driving the machine into a hazardous state, leading to the operators' accidental discovery of a tripwire for a catastrophic failure mode that, although it had been a matter of conjecture in contingency planning, was regarded by planners as so unlikely that the needed retrofit of the emergency shutdown system was deferred. "Off" is the least-desired operational state of the reactor, so making an expensive effort to address a conjectured hazard in the system's least likely mode of operation was not a high priority.
There's a vague parallel between the Chernobyl disaster and the Pan Am/KLM disaster at Tenerife, where a constellation of exceptional conditions led to a collision of two fully loaded 747s. The ostensible cause was an off-by-one error by an arriving flight crew member in counting taxiways, bringing his plane into the path of the other during the other's take-off, combined with the departing crew assuming that a routine but ambiguous figure of speech on the part of control meant clearance to take off, when actually it was just control's acknowledgment of the departing captain's statement of readiness to proceed with take-off.
And Titanic will not be forgotten.
In these disasters, everybody was fully engaged and driving into mayhem with everything running according to plan, but under an unlikely confluence of conditions.
Philosophically, a proper plan depends on equality between the conditions of the plan and the execution of events, but paradoxically there's only one place for true equality in the entire universe: in concept. So all plans are at best provisional. This observation could lead to more wonder about the contours of probability gradients in system designs.
Related
Why Heroism Is Bad and What We Can Do to Stop It
Heroism in site reliability engineering can obscure systemic issues, create unrealistic workload expectations, and lead to burnout. Encouraging discussions about service level objectives and allowing failures can improve system reliability.
Deep Adaptation opens up necessary conversation about breakdown of civilisation (2020)
Deep Adaptation highlights the necessity of preparing for potential societal collapse due to risks like climate change and pandemics. It promotes "collapsology" to study these threats and encourages public discourse.
Kobayashi Maru Management (2018)
The article explores "Kobayashi Maru" management, emphasizing the need for preparation, clear communication, and proactive leadership to navigate unexpected challenges and prevent crises in organizational settings.
Safety First
The article highlights how production pressures in tech companies undermine the "safety first" concept, suggesting that true safety requires allowing engineers to extend deadlines without consequences, despite management's productivity concerns.
Designing Organisations That Work
Dan Davies' "The Unaccountability Machine" critiques modern organizations' lack of accountability, contrasting management and cybernetic revolutions, and emphasizes Stafford Beer’s Viable System Model to improve organizational design and outcomes.