July 24th, 2024

Falcon Content Update Preliminary Post Incident Report

A faulty CrowdStrike update on July 19, 2024 caused Windows crashes. The issue affected hosts running Falcon sensor version 7.11 and above, but not Mac or Linux systems. CrowdStrike reverted the update, plans enhanced testing, validation, and deployment strategies, and will give customers more control over content updates.

CrowdStrike conducted a preliminary Post Incident Review (PIR) following a content configuration update that caused Windows system crashes on July 19, 2024. The issue stemmed from a Rapid Response Content update error, leading to a Blue Screen of Death (BSOD) on affected systems running Falcon sensor version 7.11 and above. Mac and Linux hosts were not impacted. The faulty update was reverted, preventing further impact on systems. CrowdStrike outlined the differences between Sensor Content and Rapid Response Content, emphasizing the need for rigorous testing and validation processes to prevent similar incidents. To avoid future occurrences, CrowdStrike plans to enhance Rapid Response Content testing, add validation checks, improve error handling, and implement a staggered deployment strategy with enhanced monitoring. Customers will also receive more control over content updates and detailed release notes. CrowdStrike aims to release a comprehensive Root Cause Analysis to the public once the investigation is concluded.

Related

Crowdstrike – Statement on Falcon Content Update for Windows Hosts

CrowdStrike addresses a Windows host content update defect, reassuring customers that Mac and Linux hosts are unaffected. The issue, which was not a cyberattack, has been resolved. Impacted customers are receiving support and guidance for recovery.

Technical Details on Today's Outage

A faulty sensor update for Windows systems caused a temporary outage on July 19, 2024; it was not a cyberattack. The issue affected some users but was fixed by 05:27 UTC. Systems running Falcon sensor for Windows version 7.11+ that were online between 04:09 and 05:27 UTC might have been impacted by a logic error in an update targeting malicious named pipes. Linux and macOS systems were unaffected. CrowdStrike is investigating the root cause and supporting affected customers.

CrowdStrike broke Debian and Rocky Linux months ago

CrowdStrike's faulty update caused a global Blue Screen of Death issue on 8.5 million Windows PCs, impacting sectors like airlines and healthcare. Debian and Rocky Linux users had faced similar CrowdStrike-caused disruptions months earlier, highlighting compatibility and testing concerns. Organizations are urged to handle updates carefully.

Technical Details: Falcon Update for Windows Hosts

CrowdStrike issued a Windows sensor update causing crashes on July 19, 2024, fixed by 05:27 UTC. Customers using affected versions may have experienced issues. Linux and macOS systems were unaffected. CrowdStrike is investigating and providing remediation guidance.

CrowdStrike Incident Preliminary Post Incident Review

A faulty Windows content update on July 19, 2024 caused system crashes resulting in a BSOD. Affected Windows hosts were identified and addressed, and measures were taken to prevent future incidents. The CEO apologized and said operations were returning to normal; Mac and Linux hosts remained unaffected.

AI: What people are saying
The comments on the CrowdStrike incident reveal significant concerns regarding their update process and testing protocols.
  • Many commenters express disbelief that the update was deployed without adequate testing, highlighting a lack of proper validation and oversight.
  • There is a consensus that the incident reflects poor process management and a failure to learn from past mistakes.
  • Several users emphasize the need for better customer control over updates and more cautious deployment strategies.
  • Criticism is directed at the vague language used in CrowdStrike's postmortem, which obscures the real issues and fails to provide clear solutions.
  • Concerns are raised about the implications of such failures for critical infrastructure and the potential legal ramifications for CrowdStrike.
40 comments
By @squirrel - 6 months
There’s only one sentence that matters:

"Provide customers with greater control over the delivery of Rapid Response Content updates by allowing granular selection of when and where these updates are deployed."

This is where they admit that:

1. They deployed changes to their software directly to customer production machines;

2. They didn’t allow their clients any opportunity to test those changes before they took effect; and

3. This was cosmically stupid and they’re going to stop doing that.

Software that does 1. and 2. has absolutely no place in critical infrastructure like hospitals and emergency services. I predict we’ll see other vendors removing similar bonehead “features” very very quietly over the next few months.

By @openasocket - 6 months
I work on a piece of software that is installed on a very large number of servers we do not own. The CrowdStrike incident is exactly our nightmare scenario. We are extremely cautious about updates: we roll them out very slowly, with tons of metrics and automatic rollbacks. I’ve told my manager to bookmark articles about the CrowdStrike incident and share them with anyone who complains about how slow the update process is.

The two golden rules are to let host owners control when to update whenever possible, and when that isn't possible, to deploy very, very slowly. If a customer has a CI/CD system, you should make it possible for them to deploy your updates through the same mechanism, so your change gets all the same deployment safety guardrails, automated tests, and rollbacks for free. When that isn't possible, deploy very slowly and monitor. If you start seeing disruptions in metrics (like agents suddenly not checking in because of a reboot loop), roll back or at least pause the deployment.
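
A rough sketch of the rollout discipline described above; the ring sizes, thresholds, and the fleet/metrics client objects are assumptions for illustration, not any vendor's real API:

    import time

    ROLLOUT_RINGS = [0.001, 0.01, 0.05, 0.25, 1.0]   # fraction of the fleet per stage
    CHECKIN_DROP_LIMIT = 0.02                        # abort if >2% of targeted hosts go silent
    SOAK_MINUTES = 30

    def staged_rollout(update, fleet, metrics):
        deployed = []
        for ring in ROLLOUT_RINGS:
            targets = fleet.sample(fraction=ring, exclude=deployed)
            fleet.deploy(update, targets)
            deployed.extend(targets)

            time.sleep(SOAK_MINUTES * 60)            # let telemetry accumulate

            # Agents stuck in a reboot loop stop checking in, so a drop in
            # check-ins is the canary signal mentioned above.
            drop = metrics.checkin_drop_rate(targets)
            if drop > CHECKIN_DROP_LIMIT:
                fleet.rollback(update, deployed)
                raise RuntimeError("rollout aborted at ring %.1f%%: %.1f%% of hosts "
                                   "stopped checking in" % (ring * 100, drop * 100))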

By @Cyphase - 6 months
Lots of words about improving testing of the Rapid Response Content, very little about "the sensor client should not ever count on the Rapid Response Content being well-formed to avoid crashes".

> Enhance existing error handling in the Content Interpreter.

That's it.

Also, it sounds like they might have separate "validation" code, based on this; why is "deploy it in a realistic test fleet" not part of validation? I notice they haven't yet explained anything about what the Content Validator does to validate the content.

> Add additional validation checks to the Content Validator for Rapid Response Content. A new check is in process to guard against this type of problematic content from being deployed in the future.

Could it say any less? I hope the new check is a test fleet.

But let's go back to, "the sensor client should not ever count on the Rapid Response Content being well-formed to avoid crashes".
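
A minimal illustration of that point, assuming a made-up binary layout: treat the content as untrusted input, bounds-check every read, and fail closed (reject the file) instead of crashing:

    import logging
    import struct

    class MalformedContent(Exception):
        pass

    def read_u32(buf, offset):
        if offset + 4 > len(buf):                    # bounds check before every read
            raise MalformedContent("truncated field at offset %d" % offset)
        return struct.unpack_from("<I", buf, offset)[0]

    def load_rules(buf):
        """Toy channel-file loader; the field layout is invented."""
        rules, offset = [], 0
        while offset < len(buf):
            count = read_u32(buf, offset)
            offset += 4
            if count > 1024:                         # sanity limit on a declared size
                raise MalformedContent("implausible parameter count %d" % count)
            params = []
            for _ in range(count):
                params.append(read_u32(buf, offset))
                offset += 4
            rules.append(params)
        return rules

    def load_rules_or_skip(buf):
        try:
            return load_rules(buf)
        except MalformedContent as err:
            logging.error("rejecting channel file: %s", err)   # fail closed, keep running
            return []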

By @mdriley - 6 months
> Based on the testing performed before the initial deployment of the Template Type (on March 05, 2024), trust in the checks performed in the Content Validator, and previous successful IPC Template Instance deployments, these instances were deployed into production.

It compiled, so they shipped it to everyone all at once without ever running it themselves.

They fell short of "works on my machine".

By @red2awn - 6 months
> How Do We Prevent This From Happening Again?

> Software Resiliency and Testing

> * Improve Rapid Response Content testing by using testing types such as:

> * Local developer testing

So no one actually tested the changes before deploying?!

By @amluto - 6 months
CrowdStrike is more than big enough to have a real 2000s-style QA team. There should be actual people with actual computers whose job is to break the software and write bug reports. Nothing is deployed without QA sign off, and no one is permitted to apply pressure to QA to sign off on anything. CI/CD is simply not sufficient for a product that can fail in a non-revertible way.

A good QA team could turn around a rapid response update with more than enough testing to catch screwups like this and even some rather more subtle ones in an hour or two.

By @cataflam - 6 months
Besides skipping actual testing (!) and a staged rollout (!), it looks like they also weren't fuzzing this kernel driver, which routinely takes instant worldwide updates. Oops.
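
A sketch of the sort of fuzzing being suggested; parse_channel_file stands in for the real interpreter entry point, and in practice a coverage-guided fuzzer (libFuzzer, AFL, Hypothesis) would do this job far better:

    import os
    import random

    def fuzz(parse_channel_file, seed_files, iterations=100_000):
        corpus = [open(path, "rb").read() for path in seed_files]
        for _ in range(iterations):
            data = bytearray(random.choice(corpus))
            if random.random() < 0.1:
                data = bytearray(os.urandom(random.randrange(64)))   # pure junk, possibly empty
            for _ in range(random.randint(1, 16)):                   # flip a handful of bytes
                if data:
                    data[random.randrange(len(data))] = random.randrange(256)
            try:
                parse_channel_file(bytes(data))      # must never crash the process
            except ValueError:
                pass                                 # cleanly rejecting bad input is fine
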
By @rurban - 6 months
They bypassed the tests and staged deployment, because their previous update looked good. Ha.

What if they implemented a release process and followed it, like everyone else does? Hackers at the workplace, sigh.

By @anonu - 6 months
In my experience with outages, usually the problem lies in some human error not following the process: Someone didn't do something, checks weren't performed, code reviews were skipped, someone got lazy.

In this postmortem there are a lot of words, but not one of them actually explains what the problem was, which is: what was the process in place, and why did it fail?

They also say a "bug in the content validation". Like what kind of bug? Could it have been prevented with proper testing or code review?

By @aenis - 6 months
"Based on the testing performed before the initial deployment of the Template Type (on March 05, 2024), trust in the checks performed in the Content Validator, and previous successful IPC Template Instance deployments, these instances were deployed into production."

So they did not test this update at all, even locally. It's going to be interesting to see how this plays out in the courts. The contract they have with us limits their liability significantly, but this, surely, is gross negligence.

By @nodesocket - 6 months
Why do they insist on using what sounds like pseudo-military jargon throughout the document?

For example, "sensors"? I mean, how about hosts, machines, or clients?

By @romwell - 6 months
This reads like a bunch of baloney to obscure the real problem.

The only relevant part you need to see:

>Due to a bug in the Content Validator, one of the two Template Instances passed validation despite containing problematic content data.

Problematic content? Yeah, this tells us exactly nothing.

Their mitigation is "ummm we'll test more and maybe not roll the updates to everyone at once", without any direct explanation of how that would prevent this from happening again.

Conspicuously absent:

— fixing whatever produced "problematic content"

— fixing whatever made it possible for "problematic content" to cause "ungraceful" crashes

— rewriting code so that the Validator and Interpreter would use the same code path to catch such issues in test (a sketch of this idea follows after this comment)

— allowing the sysadmins to roll back updates before the OS boots

— diversifying the test environment to include actual client machine configurations running actual releases as they would be received by clients

This is a nothing sandwich, not an incident review.
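
A sketch of the shared-code-path idea from the list above: the release-time validator simply runs the interpreter's own loader, so the two cannot disagree about what counts as valid. The function names are invented; load_rules here is whatever loader the sensor itself uses (for example, the toy one sketched earlier in the thread):

    def validate_channel_file(raw_bytes, load_rules):
        """Release gate: accept a channel file only if the interpreter's own
        loader accepts it."""
        try:
            rules = load_rules(raw_bytes)
        except Exception as err:              # anything the interpreter rejects fails the release
            return False, "interpreter rejected content: %s" % err
        if not rules:
            return False, "content parsed to zero rules"
        return True, "ok (%d rules)" % len(rules)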

By @Cyphase - 6 months
Direct link to the PIR, instead of the list of posts: https://www.crowdstrike.com/blog/falcon-content-update-preli...
By @meshko - 6 months
I so hate it when people fill these postmortems with marketing speak. Don't they know it is counterproductive?
By @dmitrygr - 6 months
> How Do We Prevent This From Happening Again?

> * Local developer testing

Yup... now that all machines are internet-connected, telemetry has replaced QA departments. There are actual people in positions of power who think that they do not need QA and can just test on customers. If there is anything right in the world, crowdsuck will be destroyed by lawsuits and every decisionmaker involved will never work as such again.

By @coremoff - 6 months
Such a disingenuous review; waffle and distraction to hide the important bits (or rather bit: bug in content validator) behind a wall of text that few people are going to finish.

If this is how they are going to publish what happened, I don't have any hope that they've actually learned anything from this event.

> Throughout this PIR, we have used generalized terminology to describe the Falcon platform for improved readability

Translation: we've filled this PIR with technobabble so that when you don't understand it you won't ask questions for fear of appearing slow.

By @aeyes - 6 months
Do you see how they only talk about technical changes to prevent this from happening again?

To me this was a complete failure on the process and review side. If something so blatantly obvious can slip through, how could I ever trust them to prevent an insider from shipping a backdoor?

They are auto updating code with the highest privileges on millions of machines. I'd expect their processes to be much much more cautious.

By @yashafromrussia - 6 months
Well, I'm glad they at least released a public postmortem on the incident. To be honest, I feel naive saying this, but having worked at a bunch of startups my whole life, I expected companies like CrowdStrike to do better than shipping an update they never tested on their own machines, with no way to roll it back.
By @mianos - 6 months
I see a path to this every day.

An actual scenario: Some developer starts working on pre-deployment validation of config files. Let's say in a pipeline.

Most of the time the config files are OK.

Management says: "Why are you spending so long on this project, the sprint plan said one week, we can't approve anything that takes more than a week."

Developer: "This is harder than it looks" (heard that before).

Management: "Well, if the config file is OK then we won't have a problem in production. Stop working on it".

Developer: Stops working on it.

A config file with a syntax error slips through... The rest is history.
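
A sketch of the pipeline gate the developer in this scenario never got to finish: parse every config with the same loader production uses, so a syntax error fails the build instead of the fleet. JSON and the configs/ directory are stand-ins for whatever format and layout the real project uses:

    import json
    import pathlib
    import sys

    def validate_configs(config_dir="configs"):
        failures = []
        for path in sorted(pathlib.Path(config_dir).glob("**/*.json")):
            try:
                json.loads(path.read_text())         # ideally the production loader itself
            except Exception as err:                 # any parse error blocks the deploy
                failures.append("%s: %s" % (path, err))
        return failures

    if __name__ == "__main__":
        problems = validate_configs()
        for line in problems:
            print("CONFIG ERROR:", line)
        sys.exit(1 if problems else 0)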

By @notepad0x90 - 6 months
One lesson I've learned from this fiasco is to examine my own self when it comes to these situations. I am so befuddled by all the wild opinions, speculations and conclusions as well as observations of the PIR here. You can never have enough humility.
By @CommanderData - 6 months
"We didn't properly test our update."

Should be the TL;DR. On threads there's information about CrowdStrike slashing QA team numbers; whether that was a factor should be looked at.

By @Scaevolus - 6 months
"problematic content"? It was a file of all zero bytes. How exactly was that produced?
By @nine_zeros - 6 months
Will managers continue to push engineers even when engineers advise to go slower or no?
By @thayne - 6 months
So this event is probably close to a worst-case scenario for an untested sensor update. But have they never had issues with such untested updates before, like an update resulting in false positives on legitimate software? Because if they did, that should have been a clue that these types of updates should be tested too.
By @SirMaster - 6 months
I feel like, for a system that is this widely used and installed in such a critical position, a BSOD crash caused by a faulting kernel module like this should make the system automatically roll back and try the previous version on subsequent boot(s).
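
A sketch of that boot-time fallback; the state file location, the threshold, and the activate/rollback hooks are all invented for illustration:

    import json
    import pathlib

    STATE = pathlib.Path("/var/lib/sensor/boot_state.json")
    MAX_FAILED_BOOTS = 2

    def on_early_boot(activate_latest, rollback_to_previous):
        state = json.loads(STATE.read_text()) if STATE.exists() else {"failed": 0}
        if state["failed"] >= MAX_FAILED_BOOTS:
            rollback_to_previous()            # too many crashes: revert to previous content
            state["failed"] = 0
        else:
            state["failed"] += 1              # assume this boot will fail...
            activate_latest()
        STATE.write_text(json.dumps(state))

    def on_successful_boot():
        STATE.write_text(json.dumps({"failed": 0}))   # ...and clear the counter if it didn't
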
By @jvreeland - 6 months
I really dislike reading websites that take up over half the screen and make me read off to the side like this. I can fix it by zooming in, but I don't understand why they thought making the navigation take up that much of the screen, without being collapsible, was a good move.
By @1970-01-01 - 6 months
>When received by the sensor and loaded into the Content Interpreter, problematic content in Channel File 291 resulted in an out-of-bounds memory read triggering an exception.

Wasn't 'Channel File 291' a garbage file filled with null pointers? Meaning it's problematic content in the same way as filling your parachute bag with ice cream and screws is problematic.

By @m3kw9 - 6 months
Still have kernel access
By @sgammon - 6 months
1) Everything went mostly well

2) The things that did not fail went so great

3) Many many machines did not fail

4) macOS and Linux unaffected

5) Small lil bug in the content verifier

6) Please enjoy this $10 gift card

7) Every windows machine on earth bsod'd but many things worked

By @Ukv - 6 months
A summary, to my understanding:

* Their software reads config files to determine which behavior to monitor/block

* A "problematic" config file made it through automatic validation checks "due to a bug in the Content Validator"

* Further testing of the file was skipped because of "trust in the checks performed in the Content Validator" and successful tests of previous versions

* The config file causes their software to perform an out-of-bounds memory read, which it does not handle gracefully

By @gostsamo - 6 months
Cowards. Why don't you just stand up and admit that you didn't bother testing everything you send to production?

Everything else is smoke and the smell of sulfur.

By @EricE - 6 months
A file full of zeros is an "undetected error"? Good grief.
By @duped - 6 months
Here is my summary with the marketing bullshit ripped out.

Falcon configuration is shipped both with direct driver updates ("Sensor Content") and out of band ("Rapid Response Content"). "Sensor Content" consists of scripts (*) that ship with the driver. "Rapid Response Content" is data that can be delivered dynamically.

One way that "Rapid Response Content" is implemented is with templated "Sensor Content" scripts. CrowdStrike can keep the behavior the same but adjust the parameters by shipping "channel" files that fill in the templates.

"Sensor content", including the templates, are a part of the normal test and release process and goes through testing/verification before being signed/shipped. Customers have control over rollouts and testing.

"Rapid Response Content" is deployed through a different channel that customers do not have control over. Crowdstrike shipped a broken channel file that passed validation but was not tested.

They are going to fix this by adding testing of "Rapid Response Content" updates and supporting the same rollout controls they offer for the driver itself.

(*) I'm using the word "script" here loosely. I don't know what these things are, but they sound like scripts.
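
As a rough mental model of that template mechanism (the class, fields, pipe names, and matching logic are guesses for illustration, not CrowdStrike internals):

    from dataclasses import dataclass

    @dataclass
    class NamedPipeTemplate:                  # "Sensor Content": ships with the driver,
        max_params: int = 8                   # tested and signed as part of a release

        def instantiate(self, params):        # "Rapid Response Content": only parameters
            if len(params) > self.max_params:
                raise ValueError("too many parameters for this template")
            return lambda pipe_name: any(p in pipe_name for p in params)

    # A channel file, in this model, is pure data: the parameter values that
    # fill in an already-shipped template.
    channel_params = [r"\\.\pipe\evil-c2", r"\\.\pipe\bad-implant"]
    detect = NamedPipeTemplate().instantiate(channel_params)
    print(detect(r"\\.\pipe\evil-c2-session-7"))   # True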

---

In other words, they have scripts that would crash given garbage arguments. The validator is supposed to check this before they ship, but the validator screwed it up (why is this a part of release and not done at runtime? (!)). It appears they did not test it, they do not do canary deployments or support rollout of these changes, and everything broke.

Corrupting these channel files sounds like a promising way to attack CS; I wonder if anyone is going down that road.

By @romwell - 6 months
Copying my content from the duplicate thread[1] here:

This reads like a bunch of baloney to obscure the real problem. The only relevant part you need to see:

>Due to a bug in the Content Validator, one of the two Template Instances passed validation despite containing problematic content data.

Problematic content? Yeah, this tells us exactly nothing.

Their mitigation is "ummm we'll test more and maybe not roll the updates to everyone at once", without any direct explanation of how that would prevent this from happening again.

Conspicuously absent:

— fixing whatever produced "problematic content"

— fixing whatever made it possible for "problematic content" to cause "ungraceful" crashes

— rewriting code so that the Validator and Interpreter would use the same code path to catch such issues in test

— allowing the sysadmins to roll back updates before the OS boots

— diversifying the test environment to include actual client machine configurations running actual releases as they would be received by clients

This is a nothing sandwich, not an incident review.

[1] https://news.ycombinator.com/item?id=41053703