July 20th, 2024

Technical Details on Today's Outage

On July 19, 2024, a CrowdStrike sensor configuration update crashed some Windows hosts; the incident was not a cyberattack. The faulty update was remediated by 05:27 UTC. Systems running Falcon sensor for Windows version 7.11 and above that were online between 04:09 and 05:27 UTC may have been impacted by a logic error in an update targeting malicious named pipes. Linux and macOS systems were unaffected. CrowdStrike is investigating the root cause and supporting affected customers.

On July 19, 2024, a CrowdStrike sensor configuration update for Windows systems caused a system crash (blue screen) on some hosts. The issue was resolved by 05:27 UTC the same day and was not the result of a cyberattack. Customers using Falcon sensor for Windows version 7.11 and above between 04:09 and 05:27 UTC may have been affected. The problem stemmed from a logic error triggered by an update targeting malicious named pipes used in cyberattacks. CrowdStrike has fixed the error in Channel File 291 and continues to protect against named pipe abuse. Systems running Linux or macOS were not impacted. CrowdStrike is conducting a root cause analysis to prevent similar incidents and is providing support for affected customers. For more information, customers can refer to CrowdStrike's blog or Support Portal.
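
To make the failure mode concrete, here is a minimal, hypothetical C sketch of how a kernel-mode component that interprets externally delivered content can take down a whole machine when a logic error lets it use invalid data. All names are invented for illustration; this is not CrowdStrike's code.

    /* Hypothetical illustration only, not CrowdStrike code: a content
     * interpreter that trusts its input dereferences an invalid pointer
     * when a record is malformed. In user mode this kills one process;
     * in kernel mode the same fault is a bugcheck (blue screen) for the
     * entire system.
     */
    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    struct content_record {
        uint32_t type;        /* kind of rule this record encodes   */
        uint32_t name_offset; /* offset of a string inside the blob */
    };

    /* Naive interpreter: assumes name_offset is always valid. */
    static const char *record_name(const uint8_t *blob, size_t blob_len,
                                   const struct content_record *rec)
    {
        /* BUG: no check that name_offset < blob_len, so a malformed
         * record makes this read far out of bounds. */
        (void)blob_len;
        return (const char *)(blob + rec->name_offset);
    }

    int main(void)
    {
        uint8_t blob[64] = {0};
        struct content_record bad = { .type = 1, .name_offset = 0xFFFFFFFF };

        memcpy(blob, &bad, sizeof bad);

        /* Almost certainly crashes here (out-of-bounds read). In a
         * kernel driver, the equivalent fault blue-screens the host. */
        printf("%s\n", record_name(blob, sizeof blob,
                                   (const struct content_record *)blob));
        return 0;
    }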

AI: What people are saying
The CrowdStrike outage on July 19, 2024, has sparked a range of reactions and concerns.
  • Many commenters are frustrated with the lack of technical details and transparency in CrowdStrike's explanation.
  • There is skepticism about the company's update and deployment processes, with calls for more gradual rollouts and better testing.
  • Some users are concerned about the potential for similar issues in the future, questioning the company's assurances of "no risk."
  • Several comments highlight the potential security risks and exploitability of the configuration files involved in the incident.
  • There is a general sentiment of disappointment and distrust towards CrowdStrike's handling of the situation and their communication.
20 comments
By @dang - 3 months
Related ongoing thread:

CrowdStrike Update: Windows Bluescreen and Boot Loops - https://news.ycombinator.com/item?id=41002195 - July 2024 (3590 comments)

By @PedroBatista - 3 months
Light on technical and light on details.

Putting the actual blast radius aside, this whole thing seems a bit amateurish for a "security company" that pulls the contracts they do.

By @tail_exchange - 3 months
Can someone who actually understands what CrowdStrike does explain to me why on earth they don't have some kind of gradual rollout for changes? It seems like their updates go out everywhere all at once, and this sounds absolutely insane for a company at this scale.
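
For illustration, a staged rollout gate can be as simple as the hedged sketch below: each host hashes its ID into a bucket, and a new content version is applied only once the rollout percentage covers that bucket. The names are invented; this is not how CrowdStrike actually deploys content.

    /* Hypothetical staged-rollout gate, not a real CrowdStrike mechanism.
     * Each host hashes its ID into a stable bucket 0..99; a new content
     * version is applied only once the rollout percentage covers that
     * bucket, so a bad update reaches a small canary ring first.
     */
    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    /* 64-bit FNV-1a hash, reduced to a bucket in 0..99. */
    static unsigned bucket_for_host(const char *host_id)
    {
        uint64_t h = 0xcbf29ce484222325ULL;
        for (const char *p = host_id; *p; ++p) {
            h ^= (uint8_t)*p;
            h *= 0x100000001b3ULL;
        }
        return (unsigned)(h % 100);
    }

    /* Apply the update only if this host falls inside the current ring. */
    static int should_apply_update(const char *host_id, unsigned rollout_pct)
    {
        return bucket_for_host(host_id) < rollout_pct;
    }

    int main(void)
    {
        const char *hosts[] = { "host-0001", "host-0002", "host-0003" };
        unsigned rollout_pct = 5; /* canary ring: 5% of the fleet */

        for (size_t i = 0; i < 3; ++i)
            printf("%s -> %s\n", hosts[i],
                   should_apply_update(hosts[i], rollout_pct) ? "update now"
                                                              : "wait");
        return 0;
    }
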
By @rdtsc - 3 months
> The update that occurred at 04:09 UTC was designed to target newly observed, malicious named pipes being used by common C2 frameworks in cyberattacks

The obvious joke here is CS runs the malicious C2 framework. So the system worked as designed: it prevented further execution and quarantined the affected machines.

But given they say that’s just a configuration file (then why the hell is it suffixed with .sys?), it’s actually plausible. A smart attacker could disguise themselves and use the same facilities as CS. CS tries to block them and ends up blocking itself in the process?
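
For context on what "malicious named pipes" refers to: C2 frameworks commonly create named pipes with recognizable names for inter-process communication, and endpoint content updates add detections for newly observed pipe names. The Windows-only sketch below just creates a benign, invented pipe to show the kind of object such a rule inspects; it is not CrowdStrike detection logic.

    /* Windows-only illustration of the artifact in question. C2 frameworks
     * commonly create named pipes with recognizable names, and endpoint
     * content updates add detections for newly observed pipe names. The
     * pipe name below is invented and benign.
     */
    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        HANDLE pipe = CreateNamedPipeA(
            "\\\\.\\pipe\\example_benign_pipe", /* pipe name (illustrative) */
            PIPE_ACCESS_DUPLEX,                 /* read/write               */
            PIPE_TYPE_BYTE | PIPE_WAIT,         /* byte stream, blocking    */
            1,                                  /* max instances            */
            4096, 4096,                         /* out/in buffer sizes      */
            0,                                  /* default timeout          */
            NULL);                              /* default security         */

        if (pipe == INVALID_HANDLE_VALUE) {
            fprintf(stderr, "CreateNamedPipeA failed: %lu\n", GetLastError());
            return 1;
        }

        puts("Named pipe created; a \"malicious named pipe\" detection rule "
             "matches objects like this by name.");
        CloseHandle(pipe);
        return 0;
    }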

By @nonfamous - 3 months
>>> Systems that are not currently impacted will continue to operate as expected, continue to provide protection, and have no risk of experiencing this event in the future.

Given that this incident has now happened twice in the space of months (first on Linux, then on Windows), and that as stated in this very post the root cause analysis is not yet complete, I find that statement of “NO RISK” very hard to believe.

By @ungreased0675 - 3 months
This seems very unsatisfying. Not sure if I was expecting too much, but that’s a lot of words for very little information.

I’d like more information on how these Channel Files are created, tested, and deployed. What’s the minimum number of people that can do it? How fast can the process go?

By @hatsunearu - 3 months
I'm not a big expert but honestly this read like a bunch of garbage.

> Although Channel Files end with the SYS extension, they are not kernel drivers.

OK, but I'm pretty sure usermode software can't cause a BSOD. Clearly something running in kernel mode ate shit and that brought the system down. Just because a channel file not in kernel mode ate shit doesn't mean your kernel mode software isn't culpable. This just seems like a sleazy dodge.

By @patrickthebold - 3 months
>The configuration update triggered a logic error that resulted in an operating system crash.

> We understand how this issue occurred and we are doing a thorough root cause analysis to determine how this logic flaw occurred.

There are always going to be flaws in the logic of the code; the trick is to not have single errors be so catastrophic.
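
One way to keep a single bad update from being catastrophic is to validate new content before use and fall back to the last known good version on failure. The sketch below is a hypothetical illustration of that pattern, not CrowdStrike's actual loader; all names are invented.

    /* Hypothetical "fail safe" content load, not CrowdStrike's loader:
     * validate new content before use and keep the last known good
     * version if validation fails, so one bad update degrades coverage
     * instead of crashing the host.
     */
    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define CONTENT_MAGIC 0x43464C31u /* invented header marker */

    struct content {
        uint32_t magic;
        uint32_t record_count;
        /* rule records would follow here */
    };

    static int validate_content(const uint8_t *blob, size_t len)
    {
        struct content hdr;
        if (len < sizeof hdr) return 0;              /* truncated    */
        memcpy(&hdr, blob, sizeof hdr);
        if (hdr.magic != CONTENT_MAGIC) return 0;    /* wrong format */
        if (hdr.record_count > 100000) return 0;     /* sanity bound */
        return 1;
    }

    /* Returns the content to use: the candidate if valid, else current. */
    static const uint8_t *load_content(const uint8_t *current,
                                       const uint8_t *candidate, size_t len)
    {
        if (validate_content(candidate, len)) {
            puts("new content accepted");
            return candidate;
        }
        puts("new content rejected; keeping last known good");
        return current;
    }

    int main(void)
    {
        struct content good = { CONTENT_MAGIC, 2 };  /* well-formed */
        uint8_t bad[] = { 0xAA, 0xBB };              /* malformed   */
        const uint8_t *active = NULL;

        active = load_content(active, (const uint8_t *)&good, sizeof good);
        active = load_content(active, bad, sizeof bad);

        printf("still using %s content\n", active ? "the good" : "no");
        return 0;
    }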

By @pneumonic - 3 months
> we are doing a "root cause analysis to determine how this logic flaw occurred"

That's going to find a cause: a programmer made an error. That's not the root of the problem. The root of the problem is allowing such an error to be released (especially obvious because of its widespread impact).

By @kyriakos - 3 months
Why is everyone blaming Microsoft? Is this something of an oversight on their side too? Can someone explain?
By @jchiu1106 - 3 months
Where are the technical details?
By @isthisreallife2 - 3 months
So - a malformed configuration is capable of crashing a kernel process. Sounds very exploitable. Very
By @canistel - 3 months
> This issue is not the result of or related to a cyberattack.

Must be corrected to "the issue is not the result of or related to a cyberattack by external agents".

By @geuis - 3 months
Weak.

Very weak, and an overly corporate level of ass-covering. And it doesn't even come close to doing that.

They should just let the EM of the team involved provide a public detailed response that I'm sure is floating around internally. Just own the problem and address the questions rather than trying to play at politics, quite poorly.

By @0nate - 3 months
The lower you go in system architecture, the greater the impact when defects occur. In this instance, the Crowdstrike agent is embedded within the Windows Kernel, and registered with the Kernel Filter Engine illustrated in the diagram below.

https://www.nathanhandy.blog/images/blog/OSI%20Model%20in%20...

If the initial root cause analysis is correct, Crowdstrike has pushed out a bug that could have been easily stopped had software engineering best practices been followed: Unit Testing, Code Coverage, Integration Testing, Definition of Done.
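
As a hypothetical illustration of the kind of cheap test that catches a content-parsing defect before it ships: feed the parser well-formed, truncated, and empty inputs and assert it never accepts garbage. parse_rule() below is an invented stand-in, not CrowdStrike's parser (little-endian record layout assumed).

    /* Hypothetical parser test, not CrowdStrike's. parse_rule() reads one
     * record laid out as [u32 length][length bytes of name] and must
     * reject anything malformed. The asserts encode the cases a unit
     * test would have to cover before shipping.
     */
    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Returns 1 and fills out[] on success, 0 on any malformed input. */
    static int parse_rule(const uint8_t *buf, size_t len, char out[64])
    {
        uint32_t name_len;
        if (len < sizeof name_len) return 0;                    /* too short */
        memcpy(&name_len, buf, sizeof name_len);
        if (name_len >= 64) return 0;                           /* too long  */
        if ((size_t)name_len > len - sizeof name_len) return 0; /* truncated */
        memcpy(out, buf + sizeof name_len, name_len);
        out[name_len] = '\0';
        return 1;
    }

    int main(void)
    {
        char name[64];

        /* well-formed record */
        uint8_t good[8] = { 4, 0, 0, 0, 'p', 'i', 'p', 'e' };
        assert(parse_rule(good, sizeof good, name) == 1);
        assert(strcmp(name, "pipe") == 0);

        /* truncated record: claims 100 bytes of name, provides none */
        uint8_t truncated[4] = { 100, 0, 0, 0 };
        assert(parse_rule(truncated, sizeof truncated, name) == 0);

        /* empty input */
        assert(parse_rule(NULL, 0, name) == 0);

        puts("all parser tests passed");
        return 0;
    }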

By @automatoney - 3 months
To my biased ears it sounds like these configuration-like files are a borderline DSL that maybe isn't being treated as such. I feel like that's a common issue - people assume because you call it a config file, it's not a language, and so it doesn't get treated as actual code that gets interpreted.
By @bryan_w - 3 months
It kinda feels like someone added a watch for c:\COM\COM like we did back in the day on AOL
By @timbelina - 3 months
Can someone aim me at some RTFM that describes the sensor release and patching process, please? I'm lost trying to understand.

When a new version 'n' of the sensor is released, we upgrade a selected batch of machines and do some tests (mostly waiting around :-)) to see that all is well. Then we upgrade the rest of the fleet by OU. However, 'cause we're scaredy cats, we leave some critical kit on n-1 for longer, and some really critical kit even on n-2. (Yeah, I know there's a risk in not applying patches, but there are other outage-related risks that we balance; forget that for now.) Our assumption is that n-1, n-2, etc. are old, stable releases, so when fan and shit collided yesterday, we just hopped on the console, did a policy update to revert to n-2, and assumed we'd dodged the bullet. But of course, that failed... you know what they say about assumptions :-)

So in a long-winded way that leads to my three questions: Why did the 'content update' take out not just n but n-whatever sensors equally as effectively? Are the n-whatever versions not actually stable? And if the n-whatever versions are not actually stable and are being patched, what's the point of the versioning? Cheers!
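
One possible answer, sketched below as a hypothetical model rather than CrowdStrike's documented behavior: the sensor binary version and the channel-file (content) version travel through separate update channels, so a policy that pins the sensor at n-1 or n-2 still receives every content push. All field and function names are invented.

    /* Hypothetical model of why sensor-version pinning did not help:
     * the sensor binary version and the channel-file (content) version
     * are delivered through separate channels, so a policy pinned at n-1
     * still accepts every content push. All names are invented.
     */
    #include <stdio.h>

    struct host_policy {
        int pinned_sensor_version;  /* e.g. n-1, set by the customer */
    };

    struct update {
        int sensor_version;         /* 0 means "no sensor change"    */
        int content_version;        /* channel-file revision         */
    };

    static int accept_update(const struct host_policy *p, const struct update *u)
    {
        /* Sensor upgrades respect the customer's version pin... */
        if (u->sensor_version && u->sensor_version > p->pinned_sensor_version)
            return 0;
        /* ...but content updates are applied regardless of the pin. */
        return 1;
    }

    int main(void)
    {
        struct host_policy pinned  = { .pinned_sensor_version = 710 }; /* "n-1" */
        struct update sensor_bump  = { .sensor_version = 711, .content_version = 0 };
        struct update content_push = { .sensor_version = 0,   .content_version = 291 };

        printf("sensor upgrade accepted: %d\n", accept_update(&pinned, &sensor_bump));
        printf("content push accepted:   %d\n", accept_update(&pinned, &content_push));
        return 0;
    }
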
By @xyst - 3 months
“Technical” detail report reads more like a lawyer generated report. This company is awful.

If I ever get a sales pitch from these shit brains, they will get immediately shut down.

Also fuck MS and their awful operating system that then spawned this god awful product/company known as “CrowdStrike Falcon”