Air traffic failure caused by two locations 3600nm apart sharing 3-letter code
An air traffic control failure in the UK on August 28, 2023, affected over 700,000 passengers, causing 1,500 flight cancellations due to waypoint confusion, prompting a review for system improvements.
An investigation into a significant air traffic control failure in the UK on August 28, 2023, revealed that the incident was triggered by confusion over flightplan waypoints from a French Bee flight. The failure of the UK air navigation service NATS' flightplan processing system affected over 700,000 passengers, resulting in more than 1,500 flight cancellations and numerous delays. The issue arose when the system misidentified exit waypoints due to coincidentally identical identifiers for two geographically distant waypoints. The primary system, FPRSA-R, initially identified the entry point correctly but failed to process the exit point, leading to a critical exception error. This caused both the primary and backup systems to disconnect within 20 seconds of receiving the flightplan, halting all automatic processing and forcing controllers to revert to manual operations. The independent review panel noted that the misidentification of the waypoints was a key factor in the failure, highlighting the need for improved systems to prevent similar occurrences in the future.
- The UK air traffic control failure on August 28, 2023, affected over 700,000 passengers.
- The incident was caused by confusion over flightplan waypoints with identical identifiers.
- More than 1,500 flights were canceled, and many others were delayed due to the failure.
- Both primary and backup systems disconnected within 20 seconds of the flightplan receipt.
- An independent review panel is assessing the incident to improve future air traffic control systems.
Related
Ryanair Boeing 737 MAX plunges 2,000ft in 17 seconds
An investigation is underway as a Ryanair Boeing 737 Max rapidly descended over 2,000ft in 17 seconds near London Stansted Airport. No injuries reported. Concerns raised over the incident's cause.
Mass worldwide IT outage affects airlines, media and banks
IT outages globally impact airlines, broadcasters, hospitals. American Airlines faces flight disruptions from Crowdstrike software issues. Microsoft acts amid unconfirmed causes. UK trains, GP surgeries, Sky News, KLM, Lufthansa, SAS Airline, hospitals, Paris Olympics, airports, services affected.
Mass worldwide IT outage affects airlines, media and banks
Reports indicate a global IT outage affecting airlines, media, banks, and railways worldwide. Major disruptions in flights, broadcasts, and stock exchanges are occurring. Companies are addressing the situation with mitigation actions.
CrowdStrike's impact on aviation
On July 19, 2024, a CrowdStrike software update caused the largest IT outage, affecting 8.5 million Windows computers, disrupting services, and grounding flights for major airlines, particularly Delta and United.
Accident: United B752 enroute on 19-Sep-24, TCAS resolution 2 passengers injured
On September 19, 2024, a United Airlines flight experienced a TCAS advisory, resulting in two passenger injuries. The FAA classified it as an accident, prompting discussions on improving TCAS protocols.
When automated systems are first put in place for something high risk, "just shut down if you see something that may be an error" is a totally reasonable plan. After all, literally yesterday everyone was functioning without the automated system; if it doesn't seem to be working right, better to switch back to the manual process we were all using yesterday than to risk a catastrophe.
In that situation, switching back to yesterday's workflow is something that won't interrupt much.
A couple of decades (or honestly even just a couple of years) later, that same fault handling, left in place without much consideration because it is rarely triggered, is itself catastrophic: switching back to a rarely used and much more inefficient manual process is extremely disruptive, and even raises the risk of catastrophic mistakes itself.
The general engineering challenge is how we deal with little-used, little-seen functionality (I'm definitely thinking of fault handling, but there may be other cases) that was totally reasonable when put in place but has not aged well. Nobody has noticed or realized it, and even if they did, it might be hard to convince anyone it's a priority to improve, and the longer you wait, the more expensive it gets.
Bad News: the system can't recover from an error in an individual flight plan, bringing the whole system down with it (along with the backup system since it was running the same code).
From the day of:
https://news.ycombinator.com/item?id=37292406 - 33 points by woodylondon on Aug 28, 2023 (23 comments)
Discussions after:
https://news.ycombinator.com/item?id=37401864 - 22 points by bigjump on Sept 6, 2023 (19 comments)
https://news.ycombinator.com/item?id=37402766 - 24 points by orobinson on Sept 6, 2023 (20 comments)
https://news.ycombinator.com/item?id=37430384 - 34 points by simonjgreen on Sept 8, 2023 (68 comments)
It seems "reject the individual flight plan" might be a better system response than "go down hard to prevent corruption".
The root cause seems to be a bad assumption that a failure to interpret a plan must be a serious coding error, but it's hard to say for sure.
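To make the "reject individual flight plan" idea concrete, here is a minimal sketch of per-plan error isolation. Everything in it (the `Processor` class, `FlightPlanError`, the three-letter identifiers) is hypothetical and invented for illustration; it is not how FPRSA-R is built, and a real safety-critical system may have good reasons to stop on states it truly cannot explain.

```python
from dataclasses import dataclass, field


class FlightPlanError(Exception):
    """Raised when a single flight plan cannot be interpreted (hypothetical)."""


@dataclass
class Processor:
    known_waypoints: set[str]
    accepted: list[str] = field(default_factory=list)
    quarantined: list[tuple[str, str]] = field(default_factory=list)

    def process(self, plan_id: str, route: list[str]) -> None:
        # Treat an uninterpretable route as bad input, not as a program bug.
        unknown = [wp for wp in route if wp not in self.known_waypoints]
        if unknown:
            raise FlightPlanError(f"unresolvable waypoints: {unknown}")
        self.accepted.append(plan_id)

    def process_all(self, plans: dict[str, list[str]]) -> None:
        for plan_id, route in plans.items():
            try:
                self.process(plan_id, route)
            except FlightPlanError as exc:
                # Set the one bad plan aside for manual handling and carry on.
                self.quarantined.append((plan_id, str(exc)))


processor = Processor(known_waypoints={"AAA", "BBB", "CCC"})
processor.process_all({
    "FP1": ["AAA", "BBB"],
    "FP2": ["AAA", "ZZZ"],  # made-up plan with an unknown waypoint
    "FP3": ["BBB", "CCC"],
})
```

Only the narrow, expected exception type is caught, so a genuinely unexpected failure would still surface instead of being silently swallowed.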
/* This should never happen */
if (waypoints.matchcount > 2) {
I was three days in my jeans at business meetings. My bag came back through Lima, Peru and Houston. My bag was having more fun than me.
https://news.ycombinator.com/item?id=37461695 ("UK air traffic control meltdown (jameshaydon.github.io)")
Let's look at point 2.28: "Several factors made the identification and rectification of the failure more protracted than it might otherwise have been. These include:
• The Level 2 engineer was rostered on-call and therefore was not available on site at the time of the failure. Having exhausted remote intervention options, it took 1.5 hours for the individual to arrive on-site to perform the necessary full system re-start which was not possible remotely.
• The engineer team followed escalation protocols which resulted in the assistance of the Level 3 engineer not being sought for more than 3 hours after the initial event.
• The Level 3 engineer was unfamiliar with the specific fault message recorded in the FPRSA-R fault log and required the assistance of Frequentis Comsoft to interpret it.
• The assistance of Frequentis Comsoft, which had a unique level of knowledge of the AMS-UK and FPRSA-R interface, was not sought for more than 4 hours after the initial event.
• The joint decision-making model used by NERL for incident management meant there was no single post-holder with accountability for overall management of the incident, such as a senior Incident Manager.
• The status of the data within the AMS-UK during the period of the incident was not clearly understood.
• There was a lack of clear documentation identifying system connectivity.
• The password login details of the Level 2 engineer could not be readily verified due to the architecture of the system."
WHAT DOES "PASSWORD LOGIN DETAILS ... COULD NOT BE READILY VERIFIED" MEAN?
EDIT: Per NATS Major Incident Investigation Final Report - Flight Plan Reception Suite Automated (FPRSA-R) Sub-system Incident 28th August 2023 https://www.caa.co.uk/publication/download/23340 (PDF) ... "There was a 26-minute delay between the AMS-UK system being ready for use and FPRSA-R being enabled. This was in part caused by a password login issue for the Level 2 Engineer. At this point, the system was brought back up on one server, which did not contain the password database. When the engineer entered the correct password, it could not be verified by the server."
It's pretty readable and quite interesting.
My first thought was that this was some parasitic capacitance bug in a board design causing a failure in an aircraft.
https://chaos.social/@russss/111048524540643971
Time to tick that "repeat incident?" box in the incident management system, guys.
Only well into the article did it dawn on me that "nm" could stand for something else, and I guessed it was "nautical miles". Live and learn...
Still, it turned out to be an interesting read.
Related: The editorialized HN title uses nanometers (nm) when they possibly mean nautical miles (nmi). What would a flight control system make of that?
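For scale, here is the arithmetic behind the two readings of "3600nm" in the title. The only fact relied on is that the international nautical mile is defined as exactly 1852 metres; the variable names are just for illustration.

```python
METRES_PER_NAUTICAL_MILE = 1852  # exact, by international definition

distance = 3600

# Read as nautical miles: roughly the width of an ocean.
as_nautical_miles_km = distance * METRES_PER_NAUTICAL_MILE / 1000

# Read as nanometres: a few thousandths of a millimetre.
as_nanometres_m = distance * 1e-9

print(f"{as_nautical_miles_km:.1f} km vs {as_nanometres_m:.1e} m")
```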
From Sept 2023 (flightglobal.com):
- Comments: https://news.ycombinator.com/item?id=37430384
Also some more detailed analysis:
- https://jameshaydon.github.io/nats-fail/
- Comments: https://news.ycombinator.com/item?id=37461695
Having ambiguous names can likewise lead to disaster, as seen here, even if this incident had only mild consequences. (Having worked on place name ambiguity academically, I met people who flew to the wrong country due to city name ambiguity and more.)
At least artificial technical names/labels should be globally unambiguous.
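Where identifiers cannot be made globally unique, one common mitigation is to resolve a name using context, for example by picking the candidate closest to the previous point on the route and refusing to guess when none is plausibly close. The sketch below is an illustration under assumptions: the identifier "XYZ", the coordinates, and the 500 nmi threshold are all made up, and nothing here is taken from the NATS report.

```python
import math

# Mean Earth radius (about 6371 km) expressed in nautical miles.
EARTH_RADIUS_NMI = 6371.0 / 1.852


def great_circle_nmi(a, b):
    """Haversine distance between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_NMI * math.asin(math.sqrt(h))


def resolve(ident, candidates, previous_fix, max_leg_nmi=500.0):
    """Pick the candidate position nearest to the previous fix.

    Raises LookupError instead of guessing when even the nearest
    candidate is farther away than max_leg_nmi.
    """
    nearest = min(candidates, key=lambda pos: great_circle_nmi(previous_fix, pos))
    if great_circle_nmi(previous_fix, nearest) > max_leg_nmi:
        raise LookupError(f"no plausible position for {ident!r}")
    return nearest


# Two made-up positions sharing the made-up identifier "XYZ".
xyz_candidates = [(49.0, 0.0), (48.0, -99.0)]
chosen = resolve("XYZ", xyz_candidates, previous_fix=(50.0, -1.0))
```

A proximity rule like this only reduces ambiguity; it can still pick wrongly when two same-named points are both within range, which is why the failure path raises rather than returning a best guess.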
"As designed" here sounds like a big PR move to hide the fact that they let an uncaught exception crash the entire software...
How about: don't trust your inputs, guys?
That's quite a DoS vulnerability...
This is hack-on-hack stuff, but I am wondering if there is a low-cost fix for a design behaviour which can't be altered without every airline, and every other airline system worldwide, accommodating the changes needed to remove 3-letter code collisions.
Gate the problem. Require routing for TLA collisions to be done by hand, or to be fixed in post into two paths which avoid the collision (insert an intermediate waypoint).
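The gating idea could look roughly like this: scan the waypoint table once for identifiers that occur more than once, then flag any route that mentions one of them for manual handling. This is a sketch with invented data; `find_collisions` and `needs_manual_routing` are hypothetical helper names, not part of any real flight planning system.

```python
from collections import defaultdict


def find_collisions(waypoints):
    """Map each identifier that appears more than once to all its positions."""
    by_ident = defaultdict(list)
    for ident, position in waypoints:
        by_ident[ident].append(position)
    return {ident: pos for ident, pos in by_ident.items() if len(pos) > 1}


def needs_manual_routing(route, collisions):
    """Return the identifiers in a route that are ambiguous."""
    return [ident for ident in route if ident in collisions]


# Made-up waypoint table: "AAA" is used for two different places.
table = [("AAA", (1.0, 1.0)), ("BBB", (2.0, 2.0)), ("AAA", (50.0, 50.0))]
collisions = find_collisions(table)
flagged = needs_manual_routing(["BBB", "AAA"], collisions)
```

The trade-off is more manual work for the affected routes in exchange for never letting the automated path guess between same-named points.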
not nanometres as you might assume from being used to normal units
=3
> This forced controllers to revert to manual processing, leading to more than 1,500 flight cancellations and delaying hundreds of services which did operate.
Aug 2023: “UK air traffic woes caused by 'invalid flight plan data'”
https://www.theregister.com/2023/08/30/uk_air_traffic_woes_i... --
(-11 down votes and counting)