November 14th, 2024

Air traffic failure caused by two locations 3600nm apart sharing 3-letter code

An air traffic control failure in the UK on August 28, 2023, affected over 700,000 passengers, causing 1,500 flight cancellations due to waypoint confusion, prompting a review for system improvements.

An investigation into a significant air traffic control failure in the UK on August 28, 2023, revealed that the incident was triggered by confusion over flightplan waypoints from a French Bee flight. The failure of the UK air navigation service NATS' flightplan processing system affected over 700,000 passengers, resulting in more than 1,500 flight cancellations and numerous delays. The issue arose when the system misidentified exit waypoints due to coincidentally identical identifiers for two geographically distant waypoints. The primary system, FPRSA-R, initially identified the entry point correctly but failed to process the exit point, leading to a critical exception error. This caused both the primary and backup systems to disconnect within 20 seconds of receiving the flightplan, halting all automatic processing and forcing controllers to revert to manual operations. The independent review panel noted that the misidentification of the waypoints was a key factor in the failure, highlighting the need for improved systems to prevent similar occurrences in the future.

- The UK air traffic control failure on August 28, 2023, affected over 700,000 passengers.

- The incident was caused by confusion over flightplan waypoints with identical identifiers.

- More than 1,500 flights were canceled, and many others were delayed due to the failure.

- Both primary and backup systems disconnected within 20 seconds of the flightplan receipt.

- An independent review panel is assessing the incident to improve future air traffic control systems.
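As a rough illustration of the ambiguity described above, the sketch below resolves a duplicated identifier by picking the candidate nearest the previous fix on the route and rejecting implausibly long legs. This is not how FPRSA-R works. The identifier `ABC`, the coordinates, the `max_leg_nm` threshold, and the function names are all hypothetical.

```python
import math

EARTH_RADIUS_NM = 3440.065  # mean Earth radius expressed in nautical miles

# Hypothetical database: one identifier shared by two distant waypoints.
# Coordinates are illustrative (latitude, longitude) pairs in degrees.
WAYPOINTS = {"ABC": [(48.0, -99.0), (49.3, 0.3)]}


def distance_nm(a, b):
    """Great-circle (haversine) distance in nautical miles between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_NM * math.asin(math.sqrt(h))


def resolve_waypoint(ident, previous_fix, database, max_leg_nm=1000.0):
    """Pick the candidate closest to the previous fix; refuse implausible legs."""
    candidates = database.get(ident, [])
    if not candidates:
        raise LookupError(f"unknown waypoint {ident!r}")
    best = min(candidates, key=lambda c: distance_nm(previous_fix, c))
    if distance_nm(previous_fix, best) > max_leg_nm:
        # Assumed sanity limit: no candidate is plausibly the next point on the route.
        raise ValueError(f"no plausible match for {ident!r}")
    return best
```

With the previous fix close to the second candidate, `resolve_waypoint("ABC", (50.0, -1.0), WAYPOINTS)` picks `(49.3, 0.3)` rather than the identically named point several thousand nautical miles away.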

Ryanair Boeing 737 MAX plunges 2,000ft in 17 seconds

An investigation is underway as a Ryanair Boeing 737 Max rapidly descended over 2,000ft in 17 seconds near London Stansted Airport. No injuries reported. Concerns raised over the incident's cause.

Mass worldwide IT outage affects airlines, media and banks

IT outages globally impact airlines, broadcasters, hospitals. American Airlines faces flight disruptions from Crowdstrike software issues. Microsoft acts amid unconfirmed causes. UK trains, GP surgeries, Sky News, KLM, Lufthansa, SAS Airline, hospitals, Paris Olympics, airports, services affected.

Mass worldwide IT outage affects airlines, media and banks

Reports indicate a global IT outage affecting airlines, media, banks, and railways worldwide. Major disruptions in flights, broadcasts, and stock exchanges are occurring. Companies are addressing the situation with mitigation actions.

CrowdStrike's impact on aviation

On July 19, 2024, a CrowdStrike software update caused the largest IT outage, affecting 8.5 million Windows computers, disrupting services, and grounding flights for major airlines, particularly Delta and United.

Accident: United B752 enroute on 19-Sep-24, TCAS resolution 2 passengers injured

On September 19, 2024, a United Airlines flight experienced a TCAS advisory, resulting in two passenger injuries. The FAA classified it as an accident, prompting discussions on improving TCAS protocols.

50 comments

By @jrochkind1 - 5 months

I don't know how long that failure mode has been in place or if this is relevant, but it makes me think of analogous situations I've encountered:

When automated systems are first put in place for something high-risk, "just shut down if you see something that may be an error" is a totally reasonable plan. After all, literally yesterday everyone was functioning without the automated system; if it doesn't seem to be working right, better to switch back to the manual process we were all using yesterday than to risk a catastrophe.

In that situation, switching back to yesterday's workflow is something that won't interrupt much.

A couple of decades -- or honestly even just a couple of years -- later, that same fault system, left in place without much consideration because it is rarely triggered, is itself catastrophic: switching back to a rarely used and much less efficient manual process is extremely disruptive, and itself raises the risk of catastrophic mistakes.

The general engineering challenge is how we deal with little-used, little-seen functionality (definitely thinking of fault-handling, but there may be other cases) that was totally reasonable when put in place but has not aged well. Nobody has noticed or realized it, and even if they did, it might be hard to convince anyone it's a priority to improve; and the longer you wait, the more expensive it gets.

By @jp57 - 5 months

FYI: nm = nautical miles, not nanometers.

By @FateOfNations - 5 months

Good news: the system successfully detected an error and didn't send bad data to air traffic controllers.

Bad news: the system can't recover from an error in an individual flight plan, bringing the whole system down with it (along with the backup system, since it was running the same code).

By @steeeeeve - 5 months

You know there's a software engineer somewhere that saw this as a potential problem, brought up a solution, and had that solution rejected because handling it would add 40 hours of work to a project.

By @Jtsummers - 5 months

There's been some prior discussion on this over the past year, here are a few I found (selected based on comment count, haven't re-read the discussions yet):

From the day of:

https://news.ycombinator.com/item?id=37292406 - 33 points by woodylondon on Aug 28, 2023 (23 comments)

Discussions after:

https://news.ycombinator.com/item?id=37401864 - 22 points by bigjump on Sept 6, 2023 (19 comments)

https://news.ycombinator.com/item?id=37402766 - 24 points by orobinson on Sept 6, 2023 (20 comments)

https://news.ycombinator.com/item?id=37430384 - 34 points by simonjgreen on Sept 8, 2023 (68 comments)

By @jmvoodoo - 5 months

So, essentially the system has a serious denial of service flaw. I wonder how many variations of flight plans can cause different but similar errors that also force a disconnect of primary and secondary systems.

Seems "reject individual flight plan" might be a better system response than "down hard to prevent corruption"

Bad assumption that a failure to interpret a plan is a serious coding error seems to be the root cause, but hard to say for sure.
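A minimal sketch of the "reject individual flight plan" idea, assuming a hypothetical `parse_route` function and plain-string flight plans (neither comes from the real system): each plan is handled in its own `try` block, so one uninterpretable plan is quarantined for manual review while the rest keep flowing.

```python
def parse_route(plan):
    """Hypothetical parser: split a route string and reject ambiguous repeats."""
    waypoints = plan.split()
    if len(waypoints) < 2:
        raise ValueError("route needs at least an entry and an exit point")
    if len(set(waypoints)) != len(waypoints):
        raise ValueError("duplicate identifier in route")
    return waypoints


def process_all(plans):
    """Process every plan independently; a bad plan never stops the loop."""
    accepted, quarantined = [], []
    for plan in plans:
        try:
            accepted.append(parse_route(plan))
        except Exception as exc:
            # Isolate the failure: keep the raw plan and the reason for a human.
            quarantined.append((plan, str(exc)))
    return accepted, quarantined


accepted, quarantined = process_all(["AAA BBB CCC", "AAA DVL DVL", "DDD EEE"])
```

Here the second plan ends up in `quarantined` with its error message, and the first and third are still accepted. Whether swallowing every exception type is acceptable in a safety-critical system is a separate design question; the point is only that the blast radius is one plan.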

By @convivialdingo - 5 months

I guarantee that piece of code has a comment like

  /* This should never happen */
  if (waypoints.matchcount > 2) {

By @GnarfGnarf - 5 months

Funny airport call letters story: I once headed to Salt Lake City, UT (SLC) for a conference. My luggage was processed by a dyslexic baggage handler, who sent it to... SCL (Santiago, Chile).

I was three days in my jeans at business meetings. My bag came back through Lima, Peru and Houston. My bag was having more fun than me.

By @perihelions - 5 months

Original (2023) thread with 446 comments,

https://news.ycombinator.com/item?id=37461695 ("UK air traffic control meltdown (jameshaydon.github.io)")

By @amiga386 - 5 months

This is old news, but what's new news is that last week, the UK Civil Aviation Authority openly published its Independent Review of NATS (En Route) Plc's Flight Planning System Failure on 28 August 2023 https://www.caa.co.uk/publication/download/23337 (PDF)

Let's look at point 2.28: "Several factors made the identification and rectification of the failure more protracted than it might otherwise have been. These include:

• The Level 2 engineer was rostered on-call and therefore was not available on site at the time of the failure. Having exhausted remote intervention options, it took 1.5 hours for the individual to arrive on-site to perform the necessary full system re-start which was not possible remotely.

• The engineer team followed escalation protocols which resulted in the assistance of the Level 3 engineer not being sought for more than 3 hours after the initial event.

• The Level 3 engineer was unfamiliar with the specific fault message recorded in the FPRSA-R fault log and required the assistance of Frequentis Comsoft to interpret it.

• The assistance of Frequentis Comsoft, which had a unique level of knowledge of the AMS-UK and FPRSA-R interface, was not sought for more than 4 hours after the initial event.

• The joint decision-making model used by NERL for incident management meant there was no single post-holder with accountability for overall management of the incident, such as a senior Incident Manager.

• The status of the data within the AMS-UK during the period of the incident was not clearly understood.

• There was a lack of clear documentation identifying system connectivity.

• The password login details of the Level 2 engineer could not be readily verified due to the architecture of the system."

WHAT DOES "PASSWORD LOGIN DETAILS ... COULD NOT BE READILY VERIFIED" MEAN?

EDIT: Per NATS Major Incident Investigation Final Report - Flight Plan Reception Suite Automated (FPRSA-R) Sub-system Incident 28th August 2023 https://www.caa.co.uk/publication/download/23340 (PDF) ... "There was a 26-minute delay between the AMS-UK system being ready for use and FPRSA-R being enabled. This was in part caused by a password login issue for the Level 2 Engineer. At this point, the system was brought back up on one server, which did not contain the password database. When the engineer entered the correct password, it could not be verified by the server."

By @sam0x17 - 5 months

I've posted this here before, but they really need globally unique codes for all the airports, waypoints, etc, it's crazy there are collisions. People always balk at this for some reason but look at the edge cases that can occur, it's crazy CRAZY

By @gadders - 5 months

If you want to, you can read the final report from the UK Civil Aviation Authority here: https://www.caa.co.uk/publication/download/23340

It's pretty readable and quite interesting.

By @junon - 5 months

For people skimming the comments who are confused: 3600nm here is nautical miles, not nanometers.

My first thought was that this was some parasitic capacitance bug in a board design causing a failure in an aircraft.

By @fyt2024 - 5 months

Is nm the official abbreviation for nautical miles? I assume it is natural miles. For me it is nanometers.

By @NovemberWhiskey - 5 months

So, exactly the same airline (French Bee) and exactly the same route (LAX-ORY) and exactly the same waypoint (DVL) as last September, resulting in exactly the same failure mode:

https://chaos.social/@russss/111048524540643971

Time to tick that "repeat incident?" box in the incident management system, guys.

By @IlliOnato - 5 months

What brought me to read this article was a confusion: how can two locations related to air traffic be 3600 nanometers apart? Was it two points within some chip, or something?

Only partway into the article did it dawn on me that "nm" could stand for something else, and I guessed it was "nautical miles". Live and learn...

Still, it turned out to be an interesting read)

By @_pete_ - 5 months

The DVL really is in the details.

By @tempodox - 5 months

When there's no global clearing house for those identifiers, maybe namespaces would help?

Related: The editorialized HN title uses nanometers (nm) when they possibly mean nautical miles (nmi). What would a flight control system make of that?

By @cbhl - 5 months

Hmm, is this the same incident which happened last year? Or is this a new incident?

From Sept 2023 (flightglobal.com):

- https://archive.is/uiDvy

- Comments: https://news.ycombinator.com/item?id=37430384

Also some more detailed analysis:

- https://jameshaydon.github.io/nats-fail/

- Comments: https://news.ycombinator.com/item?id=37461695

By @jll29 - 5 months

Unique IDs that are not really unique are the beginning of all evil, and there is a special place in hell for those that "recycle" GUIDs instead of generating new ones.

Having ambiguous names can likewise lead to disaster, as seen here, even if this incident had only mild consequences. (Having worked on place name ambiguity academically, I met people who flew to the wrong country due to city name ambiguity and more.)

At least artificial technical names/labels should be globally unambiguous.

By @chefandy - 5 months

As an aside, that site's cookie policy sucks. You can opt out of some, but others, like "combine and link data from other sources", "identify devices based on information transmitted automatically", "link different devices" and others can't be disabled. I feel bad for people that don't have the technical sophistication to protect themselves against that kind of prying.

By @mkj - 5 months

Sounds like the kind of thing fuzzing would find easily, if it was applied. Getting a spare system to try it on might be hard though.

By @mmaunder - 5 months

There’s little to no authentication on filing flight plans which makes this a potentially bigger problem. I’m sure it’s fixed but the mechanism that caused the failure is an assertion that fails by disconnecting the critical systems entirely for “safety”. And the backup failed the same way. Bet there are similar bugs.

By @mirages - 5 months

"and it generated a critical exception error. This caused the FPRSA-R primary system to disconnect, as designed,"

"As designed" here sounds like a big PR move to hide the fact they let an uncaught exception crash the entire software...

How about: don't trust your inputs, guys?

By @polskibus - 5 months

I would've thought that in the flight industry they got "business key" uniqueness right ages ago. If a key is multi-part, then each check should compare all parts, not just one. Alternatively, force all airport codes to be globally unique.
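To make the multi-part key point concrete, here is a small sketch in which a waypoint is only ever looked up by its full key. The `region` qualifier, the class names, and the sample data are assumptions for illustration, not the scheme any real flight-planning system uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WaypointKey:
    ident: str   # short identifier, which may repeat worldwide
    region: str  # hypothetical qualifier that makes the key unique


class WaypointRegistry:
    def __init__(self):
        self._by_key = {}

    def add(self, key, position):
        # Uniqueness is enforced on the whole key, never on ident alone.
        if key in self._by_key:
            raise ValueError(f"duplicate waypoint key: {key}")
        self._by_key[key] = position

    def get(self, key):
        # Callers must supply every part of the key to get a position.
        return self._by_key[key]

    def regions_for(self, ident):
        """List the regions in which a bare identifier is in use."""
        return sorted(k.region for k in self._by_key if k.ident == ident)


registry = WaypointRegistry()
registry.add(WaypointKey("ABC", "US"), (48.0, -99.0))
registry.add(WaypointKey("ABC", "EU"), (49.3, 0.3))
```

A bare `"ABC"` is no longer enough to fetch a position; `registry.regions_for("ABC")` returns `["EU", "US"]`, which makes the ambiguity visible instead of silently resolving it.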

By @cryptonector - 5 months

> Just 20s elapsed between the receipt of the flightplan and the shutdown of both FPRSA-R systems, causing all automatic processing of flightplan data to cease and forcing reversion to manual procedures.

That's quite a DoS vulnerability...

By @klysm - 5 months

I'm curious what part of the code rejected the validity of the flight plan. I'm also curious what keys are actually used for lookups when they aren't unique.

By @ggm - 5 months

Could you front-end the software with a proxy which bounces code-collision requests and limits the damage to the specific route, and not the entire system's integrity?

This is hack-on-hack stuff, but I am wondering if there is a low-cost fix for a design behaviour which can't be altered without every airline, and every other airline system worldwide, accommodating the changes needed to remove 3-letter code collisions.

Gate the problem. Require routing for TLA collisions to be done by hand, or be fixed in post into two paths which avoid the collision. (intrude an intermediate waypoint)

By @aeroevan - 5 months

What's crazy is that this hasn't happened before; waypoints that share a name aren't uncommon.

By @dboreham - 5 months

Headline still hasn't been fixed? (Correct abbreviation is NM).

By @Optimal_Persona - 5 months

Well, 3600 billionths of a meter IS kinda close...just sayin'

By @entropyie - 5 months

Initially read this as 3600 nanometres... :-)

By @mjan22640 - 5 months

The title sounds like an AMD cpu issue.

By @craigds - 5 months

oh nautical miles !

not nanometres as you might assume from being used to normal units

By @ipunchghosts - 5 months

Title should be nmi

By @jojohohanon - 5 months

Is it just me or was it basically impossible to decipher what those three letter codes were?

By @Joel_Mckay - 5 months

In other news, goat carts are still getting 100 furlong–firkin–fortnight on dandelions.

By @muffwiggler - 5 months

3600 nanometers? That's cool.

By @hobs - 5 months

People posting on this forum saying "ah well software's failure case isn't as bad"

> This forced controllers to revert to manual processing, leading to more than 1,500 flight cancellations and delaying hundreds of services which did operate.

By @J05ephu5M13r - 5 months

It's like déjà vu all over again, Yogi.

Aug 2023: “UK air traffic woes caused by 'invalid flight plan data'”

https://www.theregister.com/2023/08/30/uk_air_traffic_woes_i... --

(-11 down votes and counting)
