Falcon Content Update Preliminary Post Incident Report
CrowdStrike faced Windows crashes due to a faulty update on July 19, 2024. The issue affected Falcon sensor versions 7.11 and above but not Mac or Linux systems. CrowdStrike reverted the update, plans enhanced testing, validation, and deployment strategies, and will provide more control to customers.
CrowdStrike conducted a preliminary Post Incident Review (PIR) following a content configuration update that caused Windows system crashes on July 19, 2024. The issue stemmed from a Rapid Response Content update error, leading to a Blue Screen of Death (BSOD) on affected systems running Falcon sensor version 7.11 and above. Mac and Linux hosts were not impacted. The faulty update was reverted, preventing further impact on systems. CrowdStrike outlined the differences between Sensor Content and Rapid Response Content, emphasizing the need for rigorous testing and validation processes to prevent similar incidents. To avoid future occurrences, CrowdStrike plans to enhance Rapid Response Content testing, add validation checks, improve error handling, and implement a staggered deployment strategy with enhanced monitoring. Customers will also receive more control over content updates and detailed release notes. CrowdStrike aims to release a comprehensive Root Cause Analysis to the public once the investigation is concluded.
Related
Crowdstrike – Statement on Falcon Content Update for Windows Hosts
CrowdStrike addresses a Windows host content update defect, reassuring Mac and Linux hosts are safe. The issue, not a cyberattack, is resolved. Impacted customers receive support and guidance for recovery.
Technical Details on Today's Outage
CrowdStrike faced a temporary outage on July 19, 2024, caused by a sensor update on Windows systems, not a cyberattack. The issue affected some users but was fixed by 05:27 UTC. Systems using Falcon sensor for Windows version 7.11+ between 04:09-05:27 UTC might have been impacted due to a logic error from an update targeting malicious named pipes. Linux and macOS systems were unaffected. CrowdStrike is investigating the root cause and supporting affected customers.
CrowdStrike broke Debian and Rocky Linux months ago
CrowdStrike's faulty update caused a global Blue Screen of Death issue on 8.5 million Windows PCs, impacting sectors like airlines and healthcare. Debian and Rocky Linux users also faced disruptions, highlighting compatibility and testing concerns. Organizations are urged to handle updates carefully.
Technical Details: Falcon Update for Windows Hosts
CrowdStrike issued a Windows sensor update causing crashes on July 19, 2024, fixed by 05:27 UTC. Customers using affected versions may have experienced issues. Linux and macOS systems were unaffected. CrowdStrike is investigating and providing remediation guidance.
CrowdStrike Incident Preliminary Post Incident Review
CrowdStrike faced a system crash on July 19, 2024, caused by a faulty Windows content update, resulting in a BSOD. Measures were taken to prevent future incidents, with affected Windows hosts identified and addressed. CEO apologized, ensuring normal operations, while Mac and Linux hosts remained unaffected.
- Many commenters express disbelief that the update was deployed without adequate testing, highlighting a lack of proper validation and oversight.
- There is a consensus that the incident reflects poor process management and a failure to learn from past mistakes.
- Several users emphasize the need for better customer control over updates and more cautious deployment strategies.
- Criticism is directed at the vague language used in CrowdStrike's postmortem, which obscures the real issues and fails to provide clear solutions.
- Concerns are raised about the implications of such failures for critical infrastructure and the potential legal ramifications for CrowdStrike.
"Provide customers with greater control over the delivery of Rapid Response Content updates by allowing granular selection of when and where these updates are deployed."
This is where they admit that:
1. They deployed changes to their software directly to customer production machines;
2. They didn’t allow their clients any opportunity to test those changes before they took effect; and
3. This was cosmically stupid and they’re going to stop doing that.
Software that does 1. and 2. has absolutely no place in critical infrastructure like hospitals and emergency services. I predict we’ll see other vendors removing similar bonehead “features” very very quietly over the next few months.
The two golden rules are to let host owners control when to update whenever possible, and, when that isn’t possible, to deploy very, very slowly. If a customer has a CI/CD system, you should make it possible for them to deploy your updates through the same mechanism, so your change gets all the same deployment safety guardrails, automated tests, and rollbacks for free. When that isn’t possible, deploy very slowly and monitor. If you start seeing disruptions in metrics (like agents suddenly not checking in because of a reboot loop), roll back or at least pause the deployment.
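To make the "deploy slowly and monitor" rule concrete, here is a minimal sketch of a staged rollout loop. Everything in it is an assumption for illustration: the `deploy`, `is_healthy`, and `rollback` callbacks, the stage fractions, and the 2% threshold are hypothetical and do not describe CrowdStrike's actual pipeline.

```python
from typing import Callable, List, Sequence


def staged_rollout(
    hosts: Sequence[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
    stage_fractions: Sequence[float] = (0.01, 0.10, 0.50, 1.0),  # hypothetical stages
    max_unhealthy_ratio: float = 0.02,  # hypothetical abort threshold
) -> bool:
    """Deploy to growing slices of the fleet; roll back and stop on bad health."""
    deployed: List[str] = []
    for fraction in stage_fractions:
        # Cumulative number of hosts that should have the update after this stage.
        target = max(1, int(len(hosts) * fraction))
        for host in hosts[len(deployed):target]:
            deploy(host)
            deployed.append(host)
        # Health signal, e.g. "agent checked in after the update".
        unhealthy = sum(1 for h in deployed if not is_healthy(h))
        if deployed and unhealthy / len(deployed) > max_unhealthy_ratio:
            for h in deployed:
                rollback(h)
            return False  # stop before the remaining hosts are touched
    return True
```

With 100 hosts and a health check that always fails, this sketch touches a single host before rolling back, which is the whole point of starting with a small slice.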
> Enhance existing error handling in the Content Interpreter.
That's it.
Also, it sounds like they might have separate "validation" code, based on this; why is "deploy it in a realistic test fleet" not part of validation? I notice they haven't yet explained anything about what the Content Validator does to validate the content.
> Add additional validation checks to the Content Validator for Rapid Response Content. A new check is in process to guard against this type of problematic content from being deployed in the future.
Could it say any less? I hope the new check is a test fleet.
But let's go back to, "the sensor client should not ever count on the Rapid Response Content being well-formed to avoid crashes".
It compiled, so they shipped it to everyone all at once without ever running it themselves.
They fell short of "works on my machine".
> Software Resiliency and Testing
> * Improve Rapid Response Content testing by using testing types such as:
> * Local developer testing
So no one actually tested the changes before deploying?!
A good QA team could turn around a rapid response update with more than enough testing to catch screwups like this and even some rather more subtle ones in an hour or two.
What if they implemented a release process and followed it? Like everyone else does. Hackers at the workplace, sigh.
In this post mortem there are a lot of words, but not one of them actually explains what the problem was, which is: what was the process in place and why did it fail?
They also say a "bug in the content validation". Like what kind of bug? Could it have been prevented with proper testing or code review?
So they did not test this update at all, even locally. It's going to be interesting to see how this plays out in court. The contract they have with us limits their liability significantly, but this, surely, is gross negligence.
ex. sensors? I mean how about hosts, machines, clients?
This reads like a bunch of baloney to obscure the real problem. The only relevant part you need to see:
>Due to a bug in the Content Validator, one of the two Template Instances passed validation despite containing problematic content data.
Problematic content? Yeah, this is telling exactly nothing.
Their mitigation is "ummm we'll test more and maybe not roll the updates to everyone at once", without any direct explanation on how that would prevent this from happening again.
Conspicuously absent:
— fixing whatever produced "problematic content"
— fixing whatever made it possible for "problematic content" to cause "ungraceful" crashes
— rewriting code so that the Validator and Interpreter would use the same code path to catch such issues in test
— allowing the sysadmins to roll back updates before the OS boots
— diversifying the test environment to include actual client machine configurations running actual releases as they would be received by clients
This is a nothing sandwich, not an incident review.
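The point above about the Validator and Interpreter sharing a code path can be sketched as follows. This is an illustration only: the pipe-separated record format, the `EXPECTED_FIELDS` count, and all function names are hypothetical, since the PIR does not describe the real formats or components in this detail.

```python
from dataclasses import dataclass
from typing import List


class ContentError(ValueError):
    """Raised when a content record is malformed."""


@dataclass(frozen=True)
class TemplateInstance:
    fields: List[str]


EXPECTED_FIELDS = 3  # hypothetical field count for this illustration


def parse_instance(raw: str) -> TemplateInstance:
    """The single parser that both validation and interpretation rely on."""
    fields = raw.split("|")
    if len(fields) != EXPECTED_FIELDS:
        raise ContentError(f"expected {EXPECTED_FIELDS} fields, got {len(fields)}")
    if any(not f for f in fields):
        raise ContentError("empty field")
    return TemplateInstance(fields)


def validate(raw: str) -> bool:
    """Release-time check: accept only what the runtime parser accepts."""
    try:
        parse_instance(raw)
        return True
    except ContentError:
        return False


def interpret(raw: str) -> str:
    """Runtime use: goes through the same parser as validate()."""
    instance = parse_instance(raw)
    return instance.fields[-1]
```

Because `validate` is just "try the real parser", the two components cannot disagree about what counts as well-formed input, which is the failure mode the comment is describing.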
> * Local developer testing
Yup... now that all machines are internet connected, telemetry has replaced QA departments. There are actual people in positions of power that think that they do not need QA and can just test on customers. If there is anything right in the world, crowdsuck will be destroyed by lawsuits and every decisionmaker involved will never work as such again.
If this is how they are going to publish what happened, I don't have any hope that they've actually learned anything from this event.
> Throughout this PIR, we have used generalized terminology to describe the Falcon platform for improved readability
Translation: we've filled this PIR with technobabble so that when you don't understand it you won't ask questions for fear of appearing slow.
To me this was a complete failure on the process and review side. If something so blatantly obvious can slip through, how could I ever trust them to prevent an insider from shipping a backdoor?
They are auto updating code with the highest privileges on millions of machines. I'd expect their processes to be much much more cautious.
An actual scenario: Some developer starts working on pre deployment validation of config files. Let's say in a pipeline.
Most of the time the config files are OK.
Management says: "Why are you spending so long on this project, the sprint plan said one week, we can't approve anything that takes more than a week."
Developer: "This is harder than it looks" (heard that before).
Management: "Well, if the config file is OK then we won't have a problem in production. Stop working on it".
Developer: Stops working on it.
Config file with a syntax error slips through... The rest is history.
Should be the tldr. On threads there's information about CrowdStrike slashing QA team numbers; whether that was a factor should be looked at.
Wasn't 'Channel File 291' a garbage file filled with null pointers? Meaning it's problematic content in the same way as filling your parachute bag with ice cream and screws is problematic.
2) The things that did not fail went so great
3) Many many machines did not fail
4) macOS and Linux unaffected
5) Small lil bug in the content verifier
6) Please enjoy this $10 gift card
7) Every windows machine on earth bsod'd but many things worked
* Their software reads config files to determine which behavior to monitor/block
* A "problematic" config file made it through automatic validation checks "due to a bug in the Content Validator"
* Further testing of the file was skipped because of "trust in the checks performed in the Content Validator" and successful tests of previous versions
* The config file causes their software to perform an out-of-bounds memory read, which it does not handle gracefully
Everything else is smoke and the smell of sulfur.
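The last bullet above, an out-of-bounds read that is not handled gracefully, comes down to a missing bounds check. Here is a minimal sketch of the defensive version. Python stands in for native driver code here, where the check has to be explicit; the names `read_param` and `rule_matches` are hypothetical.

```python
from typing import Optional, Sequence


def read_param(params: Sequence[str], index: int) -> Optional[str]:
    """Return params[index] only if the index is in range, else None."""
    # Explicit check: negative indexes would silently wrap around in Python,
    # and in native code an unchecked index reads memory outside the array.
    if 0 <= index < len(params):
        return params[index]
    return None


def rule_matches(params: Sequence[str], index: int, observed: str) -> bool:
    """Evaluate one rule; a malformed rule is skipped instead of crashing."""
    expected = read_param(params, index)
    if expected is None:
        return False  # degrade gracefully: treat the bad rule as a non-match
    return expected == observed
```

Whether a bad rule should count as a non-match or be reported upstream is a design choice; the sketch only shows that the failure can be contained at the read.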
Falcon configuration is shipped with both direct driver updates ("sensor content"), and out of band ("rapid response content"). "Sensor Content" are scripts (*) that ship with the driver. "Rapid response content" are data that can be delivered dynamically.
One way that "Rapid Response Content" is implemented is with templated "Sensor Content" scripts. CrowdStrike can keep the behavior the same but adjust the parameters by shipping "channel" files that fill in the templates.
"Sensor Content", including the templates, is part of the normal test and release process and goes through testing/verification before being signed/shipped. Customers have control over rollouts and testing.
"Rapid Response Content" is deployed through a different channel that customers do not have control over. CrowdStrike shipped a broken channel file that passed validation but was not tested.
They are going to fix this by adding testing of "rapid response" content updates and support the same rollout logic they do for the driver itself.
(*) I'm using the word "script" here loosely. I don't know what these things are, but they sound like scripts.
---
In other words, they have scripts that would crash given garbage arguments. The validator is supposed to check this before they ship, but the validator screwed it up (why is this a part of release and not done at runtime? (!)). It appears they did not test it, they do not do canary deployments or support rollout of these changes, and everything broke.
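On the question of why validation happens only at release and not at runtime: one common pattern is for the client to re-check each update and keep the last known-good content when the check fails. The sketch below assumes a hypothetical `ContentStore` class and a caller-supplied `is_valid` check; nothing here is taken from the Falcon sensor.

```python
from typing import Callable, Optional


class ContentStore:
    """Holds the active content and only swaps it for updates that validate."""

    def __init__(self, is_valid: Callable[[bytes], bool]) -> None:
        self._is_valid = is_valid
        self._active: Optional[bytes] = None

    def offer(self, candidate: bytes) -> bool:
        """Try to activate new content; keep the previous content on failure."""
        if not self._is_valid(candidate):
            return False  # last known-good content stays active
        self._active = candidate
        return True

    @property
    def active(self) -> Optional[bytes]:
        return self._active
```

For example, with a toy check that rejects empty or all-zero payloads (`lambda b: any(b)`), a rejected update leaves the earlier content in place instead of taking the host down.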
Corrupting these channel files sounds like a promising way to attack CS, I wonder if anyone is going down that road.