August 8th, 2024

CrowdStrike releases root cause analysis of the global Microsoft breakdown

A faulty CrowdStrike software update caused a global outage affecting approximately 8.5 million Windows devices, disrupting various sectors and costing Australian businesses over $1 billion. Legal actions are being considered, including by Delta Airlines.

CrowdStrike has released a root cause analysis (RCA) detailing the software update error that led to a significant global outage affecting approximately 8.5 million Windows devices in July. The analysis revealed that a single undetected defect in a Rapid Response Content update for its Falcon sensor caused the crash, which produced the infamous "Blue Screen of Death" (BSOD) across numerous systems. In the incident, referred to as the "Channel File 291" incident, the Falcon sensor expected 20 input fields but the update supplied 21, leading to an out-of-bounds memory read and subsequent system failure. Experts criticized CrowdStrike for the oversight, noting that such a basic programming error should have been caught by quality assurance processes. The outage had widespread repercussions, disrupting operations in sectors from airports to supermarkets, and is estimated to have cost Australian businesses over $1 billion. CrowdStrike's CEO, George Kurtz, has been called to testify before Congress about the incident, and the company has committed to improving its testing protocols to prevent future occurrences. Legal implications are also being weighed: Delta Airlines claims a $500 million loss and plans to pursue compensation from CrowdStrike, which has denied any wrongdoing.

- CrowdStrike's software update error caused a global outage affecting approximately 8.5 million Windows devices.

- The incident was due to a mismatch in expected input fields during a software update.

- The outage disrupted various sectors, leading to an estimated $1 billion loss for Australian businesses.

- CrowdStrike's CEO is set to testify before Congress regarding the incident.

- Legal actions are being considered by affected companies, including Delta Airlines.

3 comments
By @yuliyp - 8 months
Don't bother reading the article, read the write-up the article author attempted to paraphrase (poorly) at https://www.crowdstrike.com/wp-content/uploads/2024/08/Chann...

The way I read it: they coded up this template type expecting it to take up to 21 parameters, but an intermediate representation expected it to have only 20. The code path which read the 21st parameter was not covered by tests, nor was it exercised by the first deployed rapid response content using this template type. When that untested code ran for the first time, kaboom.
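For what it's worth, here is a minimal C sketch of that class of bug (not CrowdStrike's actual code; the struct and field names are made up for illustration): the consumer's storage and validation were written against 20 fields, the new content claims 21, and trusting the claimed count walks the read off the end of the array.

```c
#include <stdio.h>

#define EXPECTED_FIELDS  20   /* what the consumer was written against */
#define DELIVERED_FIELDS 21   /* what the new template instance claims to carry */

/* Hypothetical content record: storage was sized for the old assumption. */
struct content {
    int         field_count;              /* count claimed by the content itself */
    const char *fields[EXPECTED_FIELDS];  /* only room for 20 entries */
};

/* The buggy pattern: trust the caller's index, no bounds check.
 * Asking for index 20 (the 21st field) reads past the end of `fields` --
 * in kernel mode that out-of-bounds read is enough to crash the machine. */
const char *read_field(const struct content *c, int idx)
{
    return c->fields[idx];
}

/* The boring fix: validate against the real capacity, not the claimed count. */
const char *read_field_checked(const struct content *c, int idx)
{
    if (idx < 0 || idx >= EXPECTED_FIELDS)
        return NULL;
    return c->fields[idx];
}

int main(void)
{
    struct content c = { .field_count = DELIVERED_FIELDS };
    for (int i = 0; i < EXPECTED_FIELDS; i++)
        c.fields[i] = "ok";

    /* The consumer iterates to the claimed count of 21. With read_field()
     * the final iteration is the out-of-bounds read; with the checked
     * variant it is merely a missing value. */
    for (int i = 0; i < c.field_count; i++) {
        const char *v = read_field_checked(&c, i);
        printf("field %2d: %s\n", i, v ? v : "(out of range)");
    }
    return 0;
}
```

In user mode that stray read is a garbage value or a segfault; in a kernel-mode driver the same out-of-bounds read takes the whole machine down, which is why every affected boot ended in a BSOD.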

By @PreInternet01 - 8 months
You might think that one would tire of unrealistic hot takes on the entire situation at some point (especially the 'Microsoft breakdown' in this one is grating), but I have to admit I still enjoy them, mostly since there are some promising opportunities here down the road that require the story to be kept alive.

First and foremost: bust mono-cultures in critical IT environments. Meaningful vendor diversity should be mandatory-by-law, and just like in the early days of the Internet, when nobody in their right mind would move into a data center that didn't have at least separate 'Cisco' and 'Juniper' uplink connectivity (so a vendor-specific bug wouldn't wipe you off the net: "ah, that unexpected BGP behavior took down all our peers for several hours" was, like, a weekly occurrence then), we shouldn't accept check-in-counter-workstations running exclusively on Windows today. It's already inexcusable that 'boot into the last OS snapshot that still worked' (which is commercial-off-the-shelf stuff used in most schools, just because 'little Johnny broke the PC again' is just a fact of life there) is apparently beyond oh-so-important mega-corps like Delta, but we need to go further.

So: mandatory hardware switches that select a boot into either Windows (running off one brand of SSD), or Linux or another completely-independent alternative (running on another brand of storage device, so that 'well, today is the day that all Seagate drives refuse to respond, due to an uptime integer overflow' is recoverable). One environment runs Chrome/Edge, Office365-connected-using-that-lovely-proprietary-Exchange-protocol, the Microsoft TN3270 emulator and what-have-you. The other runs Firefox, OpenOffice with Thunderbird-over-IMAPS, Mocha TN3270 -- you get the idea. Run one environment for 3-to-4 days, then a mandatory switch-over, then back again, all year round. The training and making-everything-look-and-work-the-same development efforts will be fun, but nothing insurmountable.

Next up: Clownstrike and cohorts. Much has been made of the fact that custom kernel-mode drivers were used to track system activity, leading to this particular incident, but I can assure you, as a repeat victim of Enterprise Security Solutions, that even in plain old user mode, it's entirely feasible to mess things up beyond recovery using just the programming equivalent of a butter knife. And getting rid of layers like these will just never happen: "simply make the underlying OS secure enough" is not compatible with an open market for general computing, nor with the ambitions of Security Bureaucrats, and there will always be a need for 'zero day mitigation' which, when done reasonably well, can actually have dramatically good outcomes. I see this on a regular basis when fighting email phishing and spam: if you happen to catch a novel attack strategy and block it right away (which often requires an entirely new class of checks), you prevent a significant number of incidents. Snooze for an hour or two, and the (often painful) damage is done. So, "external vendors pushing cutting-edge code to all your devices on a real-time basis" is here to stay, and while certain players of course need to step up their basic validation game, regulation can only do so much.

So, what can be done: force Microsoft to ship https://learn.microsoft.com/en-us/sysinternals/downloads/sys... as a default and supported part of the OS, open-source it, have an industry-leading bug-bounty program, plus provide a rock-solid user-mode API that not only satisfies MS requirements, but also those of accredited third-party vendors. On Linux, where most interesting full-system observability tooling is still proprietary as well, find someone to open-source theirs (hey, Elastic has https://github.com/elastic/ebpf and some related stuff like https://github.com/open-telemetry/opentelemetry-ebpf-profile..., so perhaps that's a good basis?), with pretty much the same requirements: rock-solid, open to reasonable criticism/improvements and a generally safe interface to whatever observability is deemed to be required.
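On the Linux/eBPF point, the shape of such a 'generally safe interface' already exists: BPF programs are checked by the in-kernel verifier before they are allowed to run, so an out-of-bounds read of the kind above is rejected at load time rather than crashing the box. As a rough sketch (nothing to do with Elastic's code; the map and program names are illustrative), a tiny tracepoint program that counts execve calls, built with clang/libbpf:

```c
// exec_count.bpf.c -- minimal eBPF observability sketch
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __uint(max_entries, 1);
    __type(key, __u32);
    __type(value, __u64);
} exec_count SEC(".maps");

SEC("tracepoint/syscalls/sys_enter_execve")
int count_execve(void *ctx)
{
    __u32 key = 0;
    __u64 *val = bpf_map_lookup_elem(&exec_count, &key);

    /* The verifier insists on this NULL check; omit it and the program
     * is refused at load time instead of crashing at run time. */
    if (val)
        __sync_fetch_and_add(val, 1);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```

User space loads it and reads the map, and the worst a buggy version can do is fail to load -- roughly the property you'd want a Windows-side user-mode observability API to guarantee as well.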

TL;DR: the time has come for legally-mandated dual-ecosystem critical-use workstations with vendor-agnostic and safe observability APIs. First to deliver wins!