July 19th, 2024

CrowdStrike fixes start at "reboot up to 15 times" and get more complex from there

A faulty update to CrowdStrike's Falcon security software caused Windows crashes, impacting businesses. Microsoft and CrowdStrike advise rebooting affected systems multiple times or restoring from backups to resolve issues. CrowdStrike CEO apologizes and promises support.

A buggy update to CrowdStrike's Falcon security software caused Windows-based systems to crash, leading to major disruptions for various businesses. Microsoft and CrowdStrike have pulled the affected update, advising IT admins on various fixes. The first recommendation is to reboot affected machines multiple times to try to grab a non-broken update before the faulty driver causes the blue screen of death (BSOD). If rebooting doesn't work, admins can restore systems using a backup from before the buggy update was released or manually delete the problematic file. Deleting the file is particularly time-consuming for systems using BitLocker encryption, as it requires the recovery key to unlock encrypted disks. CrowdStrike CEO George Kurtz expressed apologies for the inconvenience caused and assured impacted customers of support in restoring their systems. Both Microsoft and CrowdStrike are continuously updating their recommendations for fixes as the situation evolves.
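The manual file-deletion step can be sketched as follows. This is a minimal illustration, not CrowdStrike's or Microsoft's tooling: the function names are hypothetical, the driver directory is passed in by the caller, and the default pattern `C-00000291*.sys` reflects the file naming given in CrowdStrike's published workaround (which locates the files under `%WINDIR%\System32\drivers\CrowdStrike`). Check the vendors' current guidance before deleting anything, and note that on BitLocker-protected machines the disk must be unlocked with the recovery key first.

```python
from pathlib import Path


def find_faulty_channel_files(driver_dir, pattern="C-00000291*.sys"):
    """Return the files in driver_dir whose names match the faulty-update pattern.

    The pattern is an assumption based on the published workaround;
    pass a different one if the guidance changes.
    """
    return sorted(p for p in Path(driver_dir).glob(pattern) if p.is_file())


def remove_faulty_channel_files(driver_dir, dry_run=True):
    """List the matching files, and delete them only when dry_run is False."""
    matches = find_faulty_channel_files(driver_dir)
    if not dry_run:
        for path in matches:
            path.unlink()
    return matches
```

Calling `remove_faulty_channel_files(some_dir)` with the default `dry_run=True` only reports what would be removed, which is a safer first pass on a machine booted into a recovery environment.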

37 comments
By @ziizii - 3 months
Has anyone discerned the root cause of this in the software?

As in, what exactly is wrong in these C-00000291*.sys files that triggers the crash in csagent.sys, and why?

By @surfingdino - 3 months
This is a global multi-layer failure: Microsoft allowing kernel mods by third-party software, CrowdStrike not testing this, DevSecOps not doing a staged/canary deployment, half the world running the same OS, things that should not be connected to the internet but are by default. Microsoft and CrowdStrike drove a horse and a cart through all redundancy and failover designs and showed very clearly where there were no such designs in place.
By @Connector2542 - 3 months
Hello, IT, have you tried turning it on and off again 15 times?

Seriously though - this entire outage is the poster child for why you NEVER have software that updates without explicit permission from a sysadmin. If I were in Congress, I would make it illegal; it's an obvious national security issue.

By @scrollaway - 3 months
Those focusing on QA, staged rollouts, permission management, etc. are misguided. Yes, of course a serious company should do those things, but CrowdStrike is a compliance checkbox ticker.

They exist solely to tick the box. That’s it. Nobody who pushes for them gives a shit about security or anything that isn’t “our clients / regulators are asking for this box to be ticked”. The box is the problem. Especially when it’s affecting safety critical and national security systems. The box should not be tickable by such awful, high risk software. The fact that it is reflects poorly on the cybersecurity industry (no news to those on this forum of course, but news to the rest of the world).

I hope the company gets buried into the ground because of it. It’s time regulators take a long hard look at the dangers of these pretend turnkey solutions to compliance and we seriously evaluate whether they follow through on the intent of the specs. (Spoiler: they don’t)

By @JCM9 - 3 months
It’s looking like many impacted end-user machines are hard bricked unless you can get into the hard drive to delete the file causing this. Even if you can do that, it’s not something that is easy (or potentially even possible) to automate at scale, so this is looking like an ugly fix for many impacted devices. This is basically the nightmare scenario for fleet management… devices are broken and you can’t fix them remotely. You need to send hands-on-keyboard folks into the field to touch each device.
By @bluedino - 3 months
DevSecOps should have, you know, tested these updates before they were approved for release company-wide.

If I can't commit code to our app without a branch, pull requests, code review...why can the infrastructure team just send shit out willy-nilly?

"Always allow new updates" must have been checked, or someone just goes through a dashboard and blindly clicks "Approve"

By @munchler - 3 months
So, in other words, there's a race condition in the CrowdStrike Falcon driver at startup time. That, in itself, should be a major cause for alarm, but here we are depending on it to fix this problem.
By @t-writescode - 3 months
The individual person that pressed the "go" button (if there was a person), is going to henceforth be __the best__ DevOps person to ever have on your team. They have learned a multi-trillion-dollar lesson that no amount of training could have prepared them for.

And the CrowdStrike CTO has either been given the ammunition to get __whatever they ask for, ever again__ with regard to appropriate allocation of resources for devops *or* they'll be fired (whether or not it's their fault).

And let me be very clear. This is absolutely, positively and wholly not the person that pressed the button's fault. Not even a little. At a company as integral as CrowdStrike, the number of mistakes and errors that had to have happened long before it got to "Joe the Intern Press Button" is huge and absurd. But many of us have been in (a much, much, *MUCH* smaller version of) Joe's shoes, and we know the gut sinking feeling that hits when something bad happens. A good company and team won't blame Joe and will do everything they can to protect Joe from the hilariously bad systemic issues that allowed this to happen.

By @idiotlogical - 3 months
>reboot up to 15 times

I see my org's SCCM admins have been consulted

By @ilkkao - 3 months
Some government should force them to release a technical postmortem. It feels like they won't do it otherwise.
By @AlienRobot - 3 months
>The first and easiest is simply to try to reboot affected machines over and over, which gives affected machines multiple chances to try to grab CrowdStrike's non-broken update before the bad driver can cause the BSOD.

I thought it was BSOD'ing on boot? I don't understand how this works. It auto-updates on boot? From the internet?

By @peterleiser - 3 months
They should change their name to "IT CrowdStrike"
By @greenavocado - 3 months
Who bought massive quantities of put options in anticipation of this event?
By @smsm42 - 3 months
Wow we're progressing from "if it doesn't work just reboot it" to "if the reboot doesn't fix it, you're just not rebooting it hard enough!"
By @mystickphoenix - 3 months
Taking the opportunity to plug my favorite blog post ever:

"the truth is everything is breaking all the time, everywhere, for everyone"

https://www.stilldrinking.org/programming-sucks

By @devwastaken - 3 months
Fine CrowdStrike 10% of the company's value. That's the only way to ensure they won't try to kill people in the future.
By @MangoCoffee - 3 months
All the comments are asking why run Windows. CrowdStrike runs on macOS and Linux too. It’s just that this time, CrowdStrike fucked up on Windows. It doesn't mean CrowdStrike won't fuck up on other OSes, and it seems like CrowdStrike fucked up on Linux as well. https://news.ycombinator.com/item?id=41005936

I feel like we are better off running open-source software. Everyone can see where the mistakes are instead of running around like a chicken with its head cut off.

By @seydor - 3 months
I would like to have the power to press the button that deploys this update
By @breakingcups - 3 months
Sounds great for data consistency.
By @sershe - 3 months
It's surprising that people mention all kinds of bogeymen but don't mention automatic updates.

Automatic updates should be considered harmful. At a minimum, there should be staged rollouts, with a significant gap (days) for issues to surface in the consumer case. Ideally, in the banks/hospitals/... example, their IT should be reading release notes and pushing the update only when necessary, starting with their own machines in a staged manner. As one '90s IT guy I worked with used to say, "you don't roll out a new Windows version before SP1 comes out."
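A staged rollout of this kind is often implemented by hashing each machine into a stable bucket and enabling buckets gradually. The following is a minimal sketch under that assumption; the function names, the 100-bucket split, and the use of a host ID string are all hypothetical and do not describe how CrowdStrike or any particular fleet tool works.

```python
import hashlib


def rollout_bucket(host_id: str, buckets: int = 100) -> int:
    """Map a host ID to a stable bucket number in the range [0, buckets)."""
    digest = hashlib.sha256(host_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def should_receive_update(host_id: str, percent_enabled: int) -> bool:
    """True if this host falls inside the currently enabled share of the fleet."""
    return rollout_bucket(host_id) < percent_enabled
```

Because the bucket depends only on the host ID, raising `percent_enabled` from 1 to 10 to 100 over several days only ever adds machines, so a bad update caught at 1% never reaches the rest.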

By @jl2718 - 3 months
Remember the “Terminator” movies?

SkyNet, according to the story, was a lot like CrowdStrike. This makes me think about how it could have broken out of its sandbox. Everybody is using AI coding assistants, automated test cases, automated integration testing and deployment. Its objective is to pass all the tests and deploy. But now it has learned economic and military effects, so it has to triage and optimize for those, at which point it starts controlling the machines it’s tasked with securing.

By @kazinator - 3 months
The fact that something like CrowdStrike can crash the Windows kernel ... is also part of the reason security products like CrowdStrike are needed in the first place.
By @danans - 3 months
It's pretty random that an arbitrary number of reboots up to 15 times fixes the issue.

That sounds like there is either:

- some kind of upstream issue with deploying a fix (so most of the reboots are effectively no-ops relative to the fix)

- some kind of local reboot threshold before the system bypasses the bad driver file somehow.

The former I can see because of the complexity of update deployment on the internet, but if it's the latter then that's very non-deterministic behavior for local software.
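One rough way to reason about the first possibility, and about the race the article describes between fetching the fixed update and hitting the crash: assume, purely for illustration, that each reboot independently has the same chance `p_single` of fetching the fix in time. Neither that independence nor any particular value of `p_single` comes from CrowdStrike or Microsoft; the function name is hypothetical.

```python
def chance_fixed_within(reboots: int, p_single: float) -> float:
    """Probability that at least one of `reboots` attempts fetches the fix.

    Assumes every attempt succeeds independently with probability p_single,
    so the chance that all attempts fail is (1 - p_single) ** reboots.
    """
    return 1.0 - (1.0 - p_single) ** reboots
```

Under this model, a per-reboot chance of 0.2 gives roughly a 96% chance of success within 15 reboots, which would make "up to 15 times" a probabilistic ceiling and not a fixed threshold.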

By @ijidak - 3 months
Do they not roll out their new agents in small increments?

I'm trying to understand how there is such a serious issue at this scale.

By @hamilyon2 - 3 months
If this is what it takes for us collectively to wake up, I'd say it is a bargain.

Pretty sure nothing will change though

By @sergiotapia - 3 months
Recompute Base Encryption Hash key type problem! https://www.youtube.com/watch?v=DlbrL1H1ngs

Seems like people need to be at the physical box to fix and it's complex even then.

By @octacat - 3 months
Funny, many news agencies blamed Microsoft for this. So having a walled garden, as on Android or iOS, is beneficial for Google/Apple: regular developers cannot release unverified software or software that works in kernel space.
By @HumblyTossed - 3 months
This.Is.Pathetic.

Seriously. Software should NOT be so bad that your fix begins with "reboot up to X times".

By @jhaile - 3 months
Why should any application be able to crash the OS? Poor OS design.
By @dist-epoch - 3 months
Who knew: "Did you try rebooting it?" actually works :)
By @ilrwbwrkhv - 3 months
Anyone who used Windows over Linux for critical software deserves to burn. Windows is a niche operating system for games. What are people thinking?
By @more_corn - 3 months
Doesn’t rebooting into Safe Mode with Networking fix the problem? (CrowdStrike is not running, but the updater can run and get the fix.)
By @willcipriano - 3 months
"Alright I bought us some time"
By @nimbius - 3 months
i work for a diesel truck maintenance and repair shop and it's been hell on earth this morning.

- our IT wizard says the fixes won't work on lathes/CNC systems. we may need to ship the controllers back to the manufacturer in Wisconsin.

- AC is still not running. sent the apprentice to get fans from the shop floor.

- building security alarms are still blaring, need to get a ladder to clip the horns and sirens on the outside of the building. still can't disarm anything.

- still no phones. IT guy has set up two "emergency" phones...one is a literal rotary phone. stresses we still cannot call 911 or other offices. fire sprinklers will work, but no fire department will respond.

- no email, no accounting, nothing. I am going to the bank after this to pick up cash so i can make payday for 14 shop technicians. was warned the bank likely would either not have enough, or would not be able to process the account (if they open at all today.)

By @thepasswordis - 3 months
Tech has become such an unbelievable house of cards full of various people covering their asses by offloading these tasks to third party trusted actors.

Consider the npm supply chain attack a few weeks ago, or the attempted SSH attack before that, or the SolarWinds attack before that.

This type of thing is institutionally supported, and in some cases when you’re working with the government, practically required.

We’re going to see more of this.