July 28th, 2024

Microsoft technical breakdown of CrowdStrike incident

The blog discusses a CrowdStrike outage caused by a memory safety issue with the CSagent driver, emphasizing the importance of Windows' security features and future enhancements for better security integration.


Windows is a widely used platform for businesses requiring high security and availability. It offers various operating modes that allow users to restrict execution to approved software and drivers, enhancing security and reliability. Users can rely on integrated security monitoring and detection features or opt for third-party solutions from a diverse ecosystem of vendors.

The blog discusses the recent CrowdStrike outage, attributing it to a memory safety issue in the CSagent driver. Microsoft analyzed the incident using Windows Error Reporting (WER) kernel crash dumps, identifying global crash patterns and specific faulting details related to the CSagent module. The analysis revealed a read out-of-bounds access violation: the driver attempted to read memory outside the range it could validly access. The discussion includes technical insights into the crash dump analysis, showcasing the use of the Microsoft WinDbg kernel debugger for troubleshooting.

The blog emphasizes the importance of leveraging Windows' integrated security capabilities to improve overall security and reliability. It also highlights future enhancements in Windows' extensibility for security products, aiming to provide better support for both customers and security vendors. This incident serves as a reminder of the critical need for robust security practices and of the risks associated with third-party security solutions. By understanding these challenges, organizations can better integrate and manage their security tools within the Windows environment.
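
To make the "read out-of-bounds" failure class from the summary concrete, here is a minimal sketch of a bounds-checked read. It is an illustration only and not CrowdStrike's or Microsoft's code: the helper name `read_field` and the sample data are hypothetical, and a real kernel driver would be written in C or C++, where an unchecked read of this kind can crash the whole system instead of raising an exception.

```python
from typing import Optional, Sequence


def read_field(fields: Sequence[str], index: int) -> Optional[str]:
    """Return fields[index], or None when index is outside the sequence.

    Hypothetical helper: it validates the index before reading instead of
    trusting that the caller supplied enough fields.
    """
    # Reject negative indices too; Python would otherwise wrap them around.
    if index < 0 or index >= len(fields):
        return None
    return fields[index]


if __name__ == "__main__":
    fields = ["a", "b", "c"]
    print(read_field(fields, 1))   # index is in range
    print(read_field(fields, 20))  # index is out of range: handled, no crash
```

The design choice is to turn an invalid index into a value the caller must handle, which is the same idea as validating input counts before dereferencing in lower-level code.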

AI: What people are saying
The comments reflect a range of opinions and concerns regarding the CrowdStrike outage and its implications for security software and Windows' kernel features.
  • Many commenters question how the faulty driver passed Microsoft's quality control and WHQL verification processes.
  • There is a strong sentiment that Microsoft shares responsibility for the incident, with suggestions for better user-mode security implementations.
  • Some commenters express skepticism about CrowdStrike's technical competency and the influence of non-technical leadership.
  • Legal implications and potential negligence lawsuits against CrowdStrike and Microsoft are discussed.
  • Several users advocate for improved user feedback mechanisms in Windows to prevent similar issues in the future.
25 comments
By @rdtsc - 6 months
> We plan to work with the anti-malware ecosystem to take advantage of these integrated features to modernize their approach, helping to support and even increase security along with reliability.

> Providing safe rollout guidance, best practices, and technologies to make it safer to perform updates to security products.

> Reducing the need for kernel drivers to access important security data.

They are being as diplomatic as they can, but it's definitely a slap to CS. Read as "they don't know how to roll things out, they need guidance on basic QA practices, we'll happily teach them...". Then, they list a set of facilities running in user-mode to avoid needing to run as many things in kernel mode.

I would be interested to know what the water cooler discussion about CS was like inside Microsoft. Especially in teams that needed to respond to customers about "Your Windows OS is broken, our hospital patients are suffering...".

By @dmattia - 6 months
I suppose I was expecting something more authoritative here. They confirm that there was an attempted read-out-of-bounds, as CrowdStrike said, but that's not really new information at this point. I suppose we'll need to wait for more detailed analysis from CrowdStrike at some point.

This post explains why security software has historically run in kernel-mode, and really seems to be pushing new technology that Microsoft has that would push security vendors into user-mode (with APIs that attempt to assist with many of the reasons why they have historically used kernel-mode).

CrowdStrike already runs in user-mode on both Mac and Linux (from what I can tell), and it seems like running in user-mode on Windows would significantly lessen the risk of catastrophic failures like a blue-screen-of-death. I know the bulk of the failures here belong to CrowdStrike, but I can't help but think about the fact that Apple kicked security vendors out of kernel-mode a while back, and that if Windows had done similarly, an issue like this probably wouldn't have been possible. By even offering kernel-mode options to external vendors, I believe Microsoft is creating risk for themselves.

By @Animats - 6 months
So how did this kernel level driver get through WHQL verification? The Static Driver Verifier should have caught this.[1] Do some security vendors get to bypass that? Microsoft is very quiet about that.

That's the sort of thing a negligence lawyer focuses on. Partner at Brown Rudnick: "The most likely legal theory will be one of negligence. [Congress] will drag the guy over the coals, they'll maybe implicate him and his company and put in place a negligence action. There'll maybe be a couple of plaintiffs lawyers who dig up some exceptional theory on negligence, and get some class action lawsuits going. Again, we still don't know all the facts in this case, and there are other dimensions which have not yet been fully explored, including how CrowdStrike had access to kernel level updates on the Microsoft operating system? How come Microsoft didn't have any control over these updates being pushed on their kernel?"[2]

The first two class actions are already starting.

[1] https://learn.microsoft.com/en-us/windows-hardware/drivers/d...

[2] https://www.channele2e.com/analysis/crowdstrike-legal-and-li...

By @akira2501 - 6 months
> where security and availability are non-negotiable.

Yep. You just have to pretend that everyone who deployed Windows had an actual competitive choice available to them.

> A second benefit of loading into kernel mode is tamper resistance.

I guess availability is negotiable after all.

By @squirrel - 6 months
Telling that there’s no mention of eBPF, which is standard on Linux and available on Windows, but hasn’t been brought into the main Windows OS. Static analysis might or might not have caught the Blue Friday bug, but it certainly increases the protection level over the current do-as-you-wish model for kernel modules.
By @EasyMark - 6 months
Oh I like this breakdown a lot. Fairly technical, links to resources used, flow of debug process, didn’t get lost in the weeds of details or in how clever they were. I wish more debug retrospectives were like this. It seems like you usually end up with either 100 pages of analysis or a couple of vague paragraphs.
By @userbinator - 6 months
I'm going to be the controversial one here and say that, as bad as CrowdStrike was, the alternative of having only Microsoft be able to decide what people can do is far worse. I've already seen many others trying to use this incident to advocate for digital totalitarianism.
By @superposeur - 6 months
I’m surprised no one has yet noted that Microsoft itself is a chief CrowdStrike competitor.
By @zh3 - 6 months
I do have to wonder how many agonising layers of review this went through with the marketing and legal departments as part of shifting the blame.

If you want to decide which OS/distros to avoid for critical stuff, look to see who's learning from the incident (even if not bitten by it) compared to those saying "it wasn't our fault" (and that's not just MS).

By @tonymet - 6 months
Did either release from MS or CrowdStrike explain how this crash bypassed QC? I'm still baffled that a 100% repro crash even made it anywhere near the later stages of QC. This is something easily caught by the earliest CI phases, at the developer and at least the first build automation phase, let alone human QC.
By @jacobgorm - 6 months
I used to work on Control Flow Integrity (CFI/XFI) research at places like MSR Silicon Valley and VMware, as far back as 2006. Back then, sandboxing a kernel module like ramdisk.sys was doable with a lot of binary rewriting magic, and later with custom LLVM passes, but nowadays it should be a simple matter of compiling the code with clang and the appropriate flags, to completely rule out this type of memory safety error, turning a BSOD into a polite log message and disabling the faulty driver.
By @eqvinox - 6 months
> Move tool-tip APIs from kernel to user mode

?!?!

By @WalterBright - 6 months
What I heard is that CrowdStrike normally rate limits pushing a fix. This is so that if the fix is bad, the damage is limited. But for some reason, the rate limiter was turned off and the update went out to everyone.
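
The rate-limited rollout this comment describes can be sketched as a deterministic percentage gate. This is a hypothetical illustration of the general technique, not CrowdStrike's actual mechanism: the function names, the use of SHA-256 for bucketing, and the host identifiers are all assumptions.

```python
import hashlib


def rollout_bucket(host_id: str) -> int:
    """Map a host identifier to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(host_id.encode("utf-8")).digest()
    # Use the first four bytes of the hash as an unsigned integer.
    return int.from_bytes(digest[:4], "big") % 100


def should_receive_update(host_id: str, rollout_percent: int) -> bool:
    """Return True if this host falls inside the current rollout percentage."""
    return rollout_bucket(host_id) < rollout_percent


if __name__ == "__main__":
    hosts = [f"host-{n}" for n in range(1000)]
    for percent in (1, 10, 50, 100):
        selected = sum(should_receive_update(h, percent) for h in hosts)
        print(f"{percent}% stage: {selected} of {len(hosts)} hosts selected")
```

Because the bucket depends only on the host identifier, raising the percentage only ever adds hosts, so a bad update caught at an early stage reaches only a small fraction of the fleet.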
By @waterTanuki - 6 months
I am still to this day gobsmacked how a company the size of Microsoft doesn't do all of its security in-house like Apple, which locked down kernel access to macOS some time ago. The blame is mostly on CrowdStrike, but Microsoft does share responsibility in allowing third parties to pepper the kernel with whatever code they want to.
By @ldjkfkdsjnv - 6 months
The true story is that I bet some major divisions of CrowdStrike are run by non-technical people that got there through non-meritocratic means. There's generally been no repercussions for their underperformance, much like Boeing. CrowdStrike's business is built on relationships, not technical supremacy. And bada bing bada boom, we have a complete failure of basic technical competency (no rigorous rollout process).
By @gjsman-1000 - 6 months
Reminder that Microsoft could have programmed Windows to notice if a driver has caused a blue screen three times in a row, and prompt if you want to disable the driver on boot. After all, Windows already collects how many times a driver causes a crash. This would have made recovery one click instead of heading into Safe Mode and needing BitLocker keys.

But they didn’t.

And Microsoft, I argue, also has blood on their hands for every hospital this hit. Giving users a prompt to disable the driver, after three successive failed boots, would have saved lives.
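
The recovery prompt proposed in this comment can be sketched as a small counter of consecutive crashing boots. This is a hypothetical illustration of the commenter's idea and does not describe any existing Windows feature: the class name, method names, and driver file names are assumptions, and the threshold of three comes from the comment.

```python
from typing import Optional


class BootCrashTracker:
    """Track consecutive boots that crashed in the same driver (hypothetical)."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self.last_driver: Optional[str] = None
        self.consecutive = 0

    def record_boot(self, crashed_driver: Optional[str]) -> None:
        """Record one boot; pass None when the boot succeeded."""
        if crashed_driver is None:
            # A clean boot resets the streak.
            self.last_driver = None
            self.consecutive = 0
        elif crashed_driver == self.last_driver:
            self.consecutive += 1
        else:
            # A different driver crashed, so start a new streak.
            self.last_driver = crashed_driver
            self.consecutive = 1

    def driver_to_offer_disabling(self) -> Optional[str]:
        """Return the driver to prompt about, or None if below the threshold."""
        if self.consecutive >= self.threshold:
            return self.last_driver
        return None


if __name__ == "__main__":
    tracker = BootCrashTracker()
    for _ in range(3):
        tracker.record_boot("example.sys")  # hypothetical driver name
    print(tracker.driver_to_offer_disabling())
```

Resetting the streak on any clean boot keeps the prompt from firing on drivers that crash only occasionally.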

By @rldjbpin - 6 months
one thing from this whole fiasco that i wish were brought into the conversation is the fact that (crucial/market-dominant) digital/IT services don't have the same level of liability as mundane, physical goods.

a simple plastic covering of your new dyson has more legal scrutiny and action (see the "children may choke" warnings they all need to come with) than software that we otherwise block in the name of "national security".

given how overvalued tech companies are in this region, i believe it is high time to start legally recognizing the real-life impact of digital tech. to hell with the "but muh innovation" argument.

By @DeathMetal3000 - 6 months
“Windows has announced a commitment around the Rust programming language as part of Microsoft’s Secure Future Initiative (SFI) and has recently expanded the Windows kernel to support Rust.”
By @janice1999 - 6 months
At least they're not blaming the European Union in this breakdown (as they did earlier).
By @sammyteee - 6 months
I stopped reading after "Windows is an open and flexible platform"
By @someonehere - 6 months
Unless actually required by your org, choose the N-1 policy in CS to avoid snafus like this in the future. It’s in the console, so use it.
By @aurelien - 6 months
You use a shoddily made distribution meant for secretaries and gamers, and you blindly try to explain where the problem is.

You are the clowns of the world, that's all ... xD