July 22nd, 2024

The CrowdStrike Failure Was a Warning

A systems failure at CrowdStrike led to a global IT crisis affecting various sectors, emphasizing the risks of centralized, fragile structures. The incident calls for diverse infrastructure and enhanced resilience measures.

A critical systems failure at CrowdStrike triggered a global IT disaster affecting banks, airlines, and health-care systems, highlighting the vulnerability of hyperconnected systems designed for optimization over resilience. The incident underscores the risks posed by centralized, fragile structures in a world where small errors can lead to widespread crises. The interconnected nature of modern societies, driven by globalization and digitization, amplifies the potential for catastrophic, instantaneous risks. The CrowdStrike outage, caused by human error, is a stark reminder of the fragility of our digital infrastructure and the need for greater resilience. The author argues for a shift toward more diverse digital infrastructure, stringent testing protocols, and enhanced redundancy to mitigate future disasters. The incident is a warning that current systems prioritize optimization at the expense of resilience, and the author calls for a reevaluation of how critical infrastructure is designed and managed to prevent similar crises.

Related

Microsoft has serious questions to answer after the biggest IT outage in history

The largest IT outage in history stemmed from a faulty software update by CrowdStrike, impacting millions of Windows computers globally. Mac and Linux systems remained unaffected. Concerns arise over responsibility and prevention measures.

It's not just CrowdStrike – the cyber sector is vulnerable

A faulty update from CrowdStrike's Falcon Sensor caused a global outage, impacting various industries. The stock market reacted negatively. The incident raises concerns about cybersecurity reliance, industry concentration, and the need for resilient tech infrastructure.

CrowdStrike debacle provides road map of American vulnerabilities to adversaries

A national digital meltdown caused by a software bug, not a cyberattack, exposed network fragility. CrowdStrike's flawed update highlighted cybersecurity complexity. Ongoing efforts emphasize the persistent need for digital defense.

CrowdStrike fail and next global IT meltdown

A global IT outage caused by a CrowdStrike software bug prompts concerns over centralized security. Recovery may take days, highlighting the importance of incremental updates and cybersecurity investments to prevent future incidents.

Global CrowdStrike Outage Proves How Fragile IT Systems Have Become

A global software outage stemming from a faulty update by cybersecurity firm CrowdStrike led to widespread disruptions. The incident underscored the vulnerability of modern IT systems and the need for thorough testing.

17 comments
By @jpgvm - 6 months
The correct lesson is to stop introducing more vulns into your systems by running "security" products. CrowdStrike was just an outage but could have just as easily been SolarWinds 2.0.

CrowdStrike is probably less bad than the alternatives I have run into, which are largely developed by very low-cost engineers (cough, TrendMicro, cough), but even so, they aren't NT kernel engineers, nor do they have the NT kernel release process.

Companies need to find ways to live without this crap or this will keep happening, and it will be a lot worse one day. Self-compromising your own systems with RATs/MDMs/EDR/XDR/whatever other acronym soup is needed to please the satanic CISSPs is just a terrible idea in general.

By @theoa - 6 months
You are only as good as your weakest link:

> A Microsoft spokesman said it cannot legally wall off its operating system in the same way Apple does because of an understanding it reached with the European Commission following a complaint. In 2009, Microsoft agreed it would give makers of security software the same level of access to Windows that Microsoft gets.

https://www.wsj.com/tech/cybersecurity/microsoft-tech-outage...

By @mylastattempt - 6 months
I found the article unbearable, just a convoluted way to say that this incident would have had a lot less impact if CrowdStrike had fewer customers or more competitors. A real page filler without any insight or solutions. Just look at this paragraph, completely void of anything useful:

> This time, the digital cataclysm was caused by well-intentioned people who made a mistake. That meant the fix came relatively quickly; CrowdStrike knew what had gone wrong. But we may not be so lucky next time. If a malicious actor had attacked CrowdStrike or a similarly essential bit of digital infrastructure, the disaster could have been much worse.

Gee, the damage from an honest mistake (what does the author even base that on) is most likely easier to fix than the damage done by a malicious actor with bad intent. I feel so enlightened!

By @teeheelol - 6 months
The warning in this case is to hire security people who actually have a clue and who include vendor software in their risk assessment.

Literally every time I see stuff like this go down, the security software had exactly zero engineering research put into it whereas everything else did.

If people did this, CrowdStrike would either not exist or look completely different.

By @Kaibeezy - 6 months
By @wruza - 6 months
It was a seizure, not a warning. More stupid bandaids will be slapped on in a hurry without any input from people who understand how this tumor works.
By @iwontberude - 6 months
Having monopolies and oligopolies is like having a small gene pool.
By @zelon88 - 6 months
The culture at MS$ is to servitize enough of their products and then force customers to use them. That way, the products won't be so exposed to users and Microsoft will be able to limit their own liability without actually improving the back end product.

Servitization is a clever way to move your perpetual-license customers over to perpetual service contracts, while also further obfuscating and locking down the underlying operating environment.

This is in the best interest of Microsoft's bottom line, at the expense of all private business, government, or anyone who values consumer experience really. It reduces the number of drive-by security incidents, but when WW3 happens and 75% of our economy is hosted in a whopping 12 datacenters across 3 companies I'm sure we'll be screwed. I mean just depth charging Google fiber today would probably take down 25% of the world economy.

By @akira2501 - 6 months
"Was a Warning."

No. It was just a failure. The warnings have been trumpeted for decades.

It should have been no surprise that the giant company that was trusted to secure our single source of OS software against "supply chain attacks" ended up committing the largest "supply chain attack" yet seen on Earth.

We are effectively still in the wild west. The gold rush has to end before we can truly civilize the place.

By @jijji - 6 months
Is it true that the owner of CrowdStrike is an ex-employee of McAfee, the same company that got sold because they had massive downtime for basically the same reason?
By @gquere - 6 months
The article advocates for even more market fragmentation? Even though that isn't the issue at all?
By @mannyv - 6 months
If these machines were backend systems (which most of the ones that mattered were), you have to ask: why are they running malware detection when they should have minimal-to-no surface area for an attacker?

That's the real question here.

By @blooalien - 6 months
Sorry, but if decades of warnings from actual qualified security experts, who were hired specifically for their expertise in such matters, went ignored for long enough to reach this point, then this incident isn't gonna change much of anything. It'll be news for a short while, then forgotten. No lessons will have been learned, and few if any changes will be made. More things like this will happen in the future. Guaranteed...
By @mikewarot - 6 months
I'm a grumpy old man on the internet.... let's just get that out of the way

The root cause is NOT capitalism, nor is it users, Microsoft, or even CrowdStrike. You can't legislate, regulate, or "be more careful next time" your way out of this. Hell, blaming the users won't even work.

Here are 3 stories:

---

Imagine yourself as an inspector for the Army. The 17th Fortress has exploded this month, and nobody can figure out why. You've checked all the surviving off-site records, and are reasonably sure that the crates of dynamite that used to make up the foundations and structure of the fort were properly inspected, and even updated on a regular basis.

You more closely inspect the records, looking for any possible soldier or supplier who might have caused this loss. It might possibly be communist infiltration, or one of those pacifists!

You encounter an old civilian, who remembers a time when forts were built out of wood or bricks, and suggests that. But he's not a professional soldier, what could he know?

---

Imagine you're a fire inspector. You've been to your 4th case this month of complete electrical network outage. This time, the cause seems to be that Lisa Douglas at Green Acres had Eb Dawson climb the pole, and he plugged in one too many appliances to the electrical.

If only there were a way to make sure that an overload anywhere couldn't take down the grid, and ruin so many people's days. You desperately want a day without house fires, and so many linemen being called out to test and repair circuits before connecting them back to the grid.

It will take some time before the boilers and generators get back on line from their cold restart. In the meantime, business in town has ground to a halt.

The paperwork and processes to track and certify each appliance don't seem efficient.

There's this grumpy old guy who talks about fuses and circuit breakers, but he's just a crank.

---

The United States found itself embedded in yet another foreign entanglement in Vietnam. There was a severe problem planning air strikes, because two sources of information were required to plan them, and no single computer could be trusted with both. The strikes themselves were classified, but the locations of the enemy radar installations couldn't be trusted to the computers, because they were occasionally accessed by enemy sources. Thus the methods and means of locating the enemy radar equipment could become known, and thus rendered ineffective.

A study was done[1], and the problems were solved. There were systems based on the results of these studies[5], and they worked well.[2] Unfortunately, people thought that it was unnecessary to incorporate these measures, and they defaulted to the broken ambient authority model we're stuck with today. Here's some more reading, if you're interested.[3]
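
For readers who haven't met the term, here is a rough Python sketch of the difference between ambient authority and passing an explicit capability. The function names are made up for illustration, and Python itself does not enforce the restriction; real capability systems (see [3]) do. Treat it as a picture of the idea, not an implementation.

```python
import io
from typing import TextIO


def count_lines_ambient(path: str) -> int:
    # Ambient authority: the function receives only a name and relies on
    # whatever the whole process happens to be allowed to open.
    with open(path, encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def count_lines_capability(source: TextIO) -> int:
    # Capability style: the caller hands over one already-open, readable
    # object, and the function is written to use nothing else.
    return sum(1 for _ in source)


if __name__ == "__main__":
    document = io.StringIO("first\nsecond\n")
    print(count_lines_capability(document))  # 2
```

In the first version, a bug or a malicious input can point the function at any file the process can reach. In the second, the worst it can do is misread the one thing it was handed.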

---

If you're bored... I've even got a conspiracy theory that explains how I think we actually got here, if it wasn't simply historical forces (which I think it was, 95% certainty).[4] If true, those forces would still be here today, actively suppressing any such stories.

[1] https://csrc.nist.rip/publications/history/ande72.pdf

[2] https://srl.cs.jhu.edu/pubs/SRL2003-02.pdf

[3] https://github.com/dckc/awesome-ocap

[4] https://news.ycombinator.com/item?id=40107150

[5] https://web.archive.org/web/20120919111301/http://www.albany...

By @Falkon1313 - 6 months
What baffles me is just how many IT personnel in so many organizations around the world apparently just blindly hit the "Deploy this zero-day update to all production systems without any testing" button instead of the "Test this update on our test systems first" button.

Or maybe even just looking up the update online to see whether any problems had been reported before deploying it wholesale across their organizations.

Are these the same IT people whose systems all went offline in the left-pad incident because they 'accidentally' set their production servers to be dependent on a third-party repository?

I've worked at some low-budget places that didn't have much in the way of a vetting process, but even there auto-deploying unknown updates to third-party dependencies into production was always a capital N No.
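
The "test systems first" approach can be sketched as a ring-based rollout. This is a minimal Python illustration under stated assumptions: `staged_rollout`, `deploy`, `healthy`, and the host names are all hypothetical, and it does not describe how CrowdStrike or any particular deployment tool behaves.

```python
from typing import Callable, List, Sequence


def staged_rollout(
    rings: Sequence[Sequence[str]],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
) -> List[str]:
    """Deploy one ring at a time and halt at the first ring that fails its health check.

    Returns the hosts that received the update, in order.
    """
    updated: List[str] = []
    for ring in rings:
        for host in ring:
            deploy(host)
            updated.append(host)
        # Do not move on to the next ring unless every host in this one is healthy.
        if not all(healthy(host) for host in ring):
            break
    return updated


if __name__ == "__main__":
    rings = [["test-1"], ["prod-1", "prod-2"], ["prod-3"]]
    # Simulate a bad update: every host that receives it becomes unhealthy.
    broken = set()
    touched = staged_rollout(rings, deploy=broken.add, healthy=lambda h: h not in broken)
    print(touched)  # ['test-1']
```

With the test ring first, a bad update stops after one machine instead of reaching production.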