July 19th, 2024

Microsoft has serious questions to answer after the biggest IT outage in history

The largest IT outage in history stemmed from a faulty software update by CrowdStrike, which took offline a large number of the Windows machines that run more than 70% of the world's desktop computers. Mac and Linux systems remained unaffected. The incident raises questions about responsibility and prevention measures.

The article discusses what could be the largest IT outage in history, caused by a faulty software update from cybersecurity company CrowdStrike. Microsoft Windows runs on over 70% of the world's desktop computers, and the update took a large number of those machines offline. The economic impact is significant, with disruption reported globally. Other software families, such as Mac and Linux systems, were not affected. The article raises questions about CrowdStrike's responsibility and about what Microsoft could do to prevent such outages in the future. The incident highlights the risks of over-reliance on a single system and the importance of redundancy in critical digital infrastructure. While emergency services and essential sectors seem to have weathered the outage, the event prompts a broader discussion on the reliability of the software underpinning global operations.

Related

Worldwide BSOD outage caused by Crowdstrike

A widespread IT outage affects Australian institutions and global companies due to a software issue with Crowdstrike. Major sectors experience disruptions, with ongoing efforts to resolve the outages.

Cybersecurity platform Crowdstrike down worldwide, users logged out of systems

CrowdStrike, a cybersecurity platform, faced a global outage affecting users in countries like India, Japan, Canada, and Australia due to a technical error in its Falcon product. Users encountered disruptions, including BSOD errors. CrowdStrike is actively working on a fix.

Microsoft outage: Chaos as internet down and flights grounded around the world

A global IT outage, possibly linked to Crowdstrike antivirus software, caused chaos worldwide. Windows crashes affected sectors like healthcare and transportation. Crowdstrike's shares dropped. Various services faced disruptions, prompting calls for system modernization.

Major Windows BSOD issue takes banks, airlines, and broadcasters offline

A global outage caused by a faulty update from CrowdStrike led to Windows machines experiencing Blue Screen of Death issues, affecting banks, airlines, and broadcasters worldwide. Recovery efforts are ongoing.

Microsoft/Crowdstrike outage ground planes, banks and the London Stock Exchange

A cybersecurity program update failure caused global disruptions affecting businesses and services like United Airlines, McDonald’s, and the London Stock Exchange. Microsoft and CrowdStrike faced issues, but the problem was resolved without a cyberattack. CrowdStrike's shares dropped 20%, and Microsoft's fell 2.9%. The incident, involving Windows and security software, is one of the largest IT outages, surpassing past disruptions.

15 comments
By @Lx1oG-AWb6h_ZG0 - 3 months
Apparently Crowdstrike also brought down Linux hosts in the same way in April but it didn’t get widely reported: https://news.ycombinator.com/item?id=41005936
By @jamescun - 3 months
Not sure what questions Microsoft have to answer. A third-party vendor shipped defective software.

I guess the only question they could answer is why they don't provide a framework like Apple do with Endpoint Security for third-party vendors to use.

By @velcrovan - 3 months
“The Entire Culture Around So-Called ‘Software Engineering’ And Our Collective Failure to Build Strong Legal Institutions Around That Culture has serious questions to answer after the biggest IT outage in history”

there fixed it for you

By @commandlinefan - 3 months
I predict that the people who were actually responsible (the "deadline above all else" crowd) will not be the ones who are actually blamed.
By @johnnyo - 3 months
I don’t see how this is Microsoft’s fault or issue.

MS can’t prevent a software vendor from breaking the machine.

By @bradford - 3 months
I see discussion about who's at fault: Microsoft or Crowdstrike.

But one thing I don't get about this: what was the role of the enterprise admins?

Most administrators at large companies are cautious about rolling out new software versions to their employees. They (normally?) test before broad deployment.

Seems like one of three things would have had to have happened for this to be missed:

1. Admins ignored testing this update prior to enterprise rollout.

2. Crowdstrike forced the update on unwilling users.

3. Crowdstrike does not provide a framework for such pre-rollout testing, and enterprises chose to use it anyway.

Can anyone offer insight?

[Disclosure: I'm a Microsoft employee, but not an enterprise admin]

By @luma - 3 months
I'm not sure what to make of this but I'm noting something odd this morning: coverage of this event out of the UK near-unanimously is laying this outage on Microsoft. BBC ran a story this morning that didn't mention Crowdstrike until the 4th paragraph, and headline after headline is repeating the message that Microsoft caused a global outage.

Reporting from the US and elsewhere seems to be a bit more on point. Is it just because the Brits went to press earlier in the day before the problem was understood?

By @multimoon - 3 months
I can’t explain enough how much I dislike Microsoft the corporation, but this wasn’t their fault - a 3rd party kernel driver crashed the system.
By @kkfx - 3 months
Well... The only question should be architectural:

- why automatic, silent upgrades

- why no boot environments/generations at boot to reboot into a previous snapshot of the system (since NTFS does have snapshots), meaning why no integration between the storage and the system management

- why massive rollout instead of partitioned testing rollout slowly propagating

For the rest, it is a third-party tool, not mandated by the vendor, so... it's a user choice.
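
The "partitioned testing rollout slowly propagating" idea from the list above could be sketched roughly as follows. This is only a minimal illustration under stated assumptions, not a description of how CrowdStrike or Microsoft distribute updates: the ring sizes, the failure threshold, and the names `bucket`, `staged_rollout`, and `apply_update` are all hypothetical.

```python
import hashlib
from typing import Callable, Iterable, List

# Hypothetical ring boundaries: cumulative fraction of the fleet reached per stage.
RINGS = (0.01, 0.10, 0.50, 1.00)


def bucket(host: str) -> float:
    """Map a host name to a stable value in [0, 1) using SHA-256."""
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    # 48 bits fit exactly in a float, so the result is always below 1.0.
    return int.from_bytes(digest[:6], "big") / 2**48


def staged_rollout(
    hosts: Iterable[str],
    apply_update: Callable[[str], bool],
    max_failure_rate: float = 0.02,
) -> List[str]:
    """Update hosts ring by ring and halt once a ring fails too often.

    apply_update(host) is a caller-supplied (hypothetical) function that
    returns True when the host is healthy after the update.
    Returns the hosts that were updated successfully.
    """
    hosts = list(hosts)
    updated: List[str] = []
    lower = 0.0
    for upper in RINGS:
        # Each host belongs to exactly one ring, decided by its stable bucket.
        ring = [h for h in hosts if lower <= bucket(h) < upper]
        failures = 0
        for host in ring:
            if apply_update(host):
                updated.append(host)
            else:
                failures += 1
        # Stop propagating if this ring exceeded the allowed failure rate.
        if ring and failures / len(ring) > max_failure_rate:
            break
        lower = upper
    return updated
```

With a scheme like this, an update that crashes every machine would stop after the first non-empty ring instead of reaching the whole fleet at once.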

By @duxup - 3 months
This is a pretty empty article that just seems like spin on the current issues going on.
By @arshiiita - 3 months
The problematic driver was downloaded from Microsoft-managed infrastructure, even though it was a third-party module. Microsoft needs to do a better job at running integration tests between the Windows kernel and driver updates for sure. They can’t publish security updates without running integration tests; this is the basics of software engineering. Windows is their product, not Crowdstrike's. 100% Microsoft failure here.
By @1vuio0pswjnm7 - 3 months
I have been 100% Microsoft-free in all computers and networks I control for decades. I know I should miss Excel, etc. but strangely I feel like I have sacrificed nothing. There is more I can do without Microsoft than with it.
By @geodel - 3 months
The answer is a cloud-based Windows OS, which MS has been working towards for many years.
By @MattGaiser - 3 months
I'd be curious whether people would want the cost effective fix for this, which is basically to eliminate vendors for anything important.

Near complete vertical integration of security, like with Apple.

By @lloydatkinson - 3 months
What an infuriatingly poor article, Tom Clarke should be ashamed.

> A software update from cybersecurity company CrowdStrike has now taken a large number of those machines offline.

So Tom opens the article with the admission that it is CrowdStrike, not Microsoft.

> Thankfully, the update that caused the Microsoft meltdown did not affect these other software families - if it had, the impacts could have been catastrophic.

This is such a strawman (like the rest of the article honestly) I don't know where to begin. Inflammatory language.

A fucking "meltdown"? A meltdown of Microsoft, no less? Putting aside the fact that Microsoft and Windows are not the same thing, there is again nothing "meltdown"-like in anything Microsoft did or could have done.

> There are serious questions of course for CrowdStrike. As a leading provider of security software for large companies like Microsoft.

Tom, were you paid by CrowdStrike or what? What do you mean "of course"? It is literally the only party that should be answering questions here. I suspect that even if Tom were to "question" Microsoft, their answer about kernels, drivers, privileges, and how shipping seemingly untested code into the core of an operating system is a bad idea wouldn't even be comprehensible to him.

> The situation may also lead to calls from Microsoft users about what more the company could do to ensure products made for their software aren't going to cause major outages like this one.

This is getting absurd now, and I just can't give more energy to this. OK, it could now insist that only memory-safe languages such as Rust are allowed for drivers. Or outright permanently blacklist drivers from certain vendors. The bitching and moaning from manufacturers would then, of course, have people like Tom writing articles like "Microsoft is making manufacturers' lives harder, think of the poor IT professionals!".

> Any engineer will tell you over-reliance on one system leaves you open to a "single point of failure". Critical digital infrastructure has to have redundancy - back up systems - built in to ensure it is resilient.

Please Tom, tell us more on your thoughts about memory safe languages, failure recovery modes, the unikernel vs microkernel debate, and how it's just a simple matter of overnight making operating systems "not a single point of failure".

This entire article is some kind of exercise in trying to get everything wrong while meeting a minimum word count, and I bet with some ChatGPT thrown in there too.

I flagged this post because I think it's far below even the minimum quality level for HN. It's outright clickbait drivel.