July 24th, 2024

CrowdStrike Timeline Mystery

On July 19, 2024, a faulty CrowdStrike update crashed Windows systems worldwide, affecting about 8.5 million devices and causing significant disruption, including more than 5,000 canceled flights.

On July 19, 2024, a faulty update to the CrowdStrike Falcon sensor configuration for Windows systems caused widespread system crashes, affecting millions of machines globally. The update, intended to enhance security, triggered a logic error that produced blue screens of death (BSOD) on impacted machines. Bitsight estimates that the incident resulted in a 15% to 20% drop in the number of systems connected to CrowdStrike Falcon servers. The update was released at 04:09 UTC; the issue was identified and the change reverted by 05:27 UTC, 78 minutes later. By then, approximately 8.5 million devices had already been affected, disrupting sectors including airlines and healthcare, with over 5,000 flights canceled.

CrowdStrike and Microsoft collaborated on remediation steps, which required manually deleting the faulty file from each affected machine and so complicated recovery. Bitsight's analysis of traffic data showed a significant drop in the number of unique IPs contacting CrowdStrike servers after the update. Notably, a traffic spike was also observed on July 16, three days before the outage, raising the question of whether the two events are correlated. As organizations increasingly rely on external software, the incident underscores the importance of staged updates and plans for operational disruption. Bitsight continues to investigate the traffic patterns as organizations work to recover.
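The 15% to 20% figure is a relative drop in distinct source IPs. The summary does not describe how Bitsight collects or processes its data, so the following is only a minimal sketch of that arithmetic. The record format, the function names, and the synthetic addresses are all assumptions made for illustration, not Bitsight's data or method.

```python
from collections import defaultdict


def unique_ips_per_day(records):
    """Count distinct source IPs per day from (day, ip) pairs."""
    seen = defaultdict(set)
    for day, ip in records:
        seen[day].add(ip)
    return {day: len(ips) for day, ips in seen.items()}


def percent_drop(baseline, observed):
    """Relative drop from baseline to observed, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - observed) / baseline


# Synthetic records (hypothetical): repeated contacts from the same IP
# on the same day are counted once.
records = (
    [("2024-07-18", f"198.51.100.{i % 100}") for i in range(250)]
    + [("2024-07-19", f"198.51.100.{i % 82}") for i in range(200)]
)

counts = unique_ips_per_day(records)
drop = percent_drop(counts["2024-07-18"], counts["2024-07-19"])
print(counts, drop)
```

A real analysis would also need a baseline that accounts for weekday and weekend cycles, since a single day-over-day comparison can mistake normal variation for a drop.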

Related

Cybersecurity platform Crowdstrike down worldwide, users logged out of systems

CrowdStrike, a cybersecurity platform, faced a global outage affecting users in countries like India, Japan, Canada, and Australia due to a technical error in its Falcon product. Users encountered disruptions, including BSOD errors. CrowdStrike is actively working on a fix.

CrowdStrike fixes start at "reboot up to 15 times", gets more complex from there

A faulty update to CrowdStrike's Falcon security software caused Windows crashes, impacting businesses. Microsoft and CrowdStrike advise rebooting affected systems multiple times or restoring from backups to resolve issues. CrowdStrike CEO apologizes and promises support.

Technical Details on Today's Outage

CrowdStrike faced a temporary outage on July 19, 2024, caused by a sensor update on Windows systems, not a cyberattack. The issue affected some users but was fixed by 05:27 UTC. Systems using Falcon sensor for Windows version 7.11+ between 04:09-05:27 UTC might have been impacted due to a logic error from an update targeting malicious named pipes. Linux and macOS systems were unaffected. CrowdStrike is investigating the root cause and supporting affected customers.

Global CrowdStrike Outage Proves How Fragile IT Systems Have Become

A global software outage stemming from a faulty update by cybersecurity firm CrowdStrike led to widespread disruptions. The incident underscored the vulnerability of modern IT systems and the need for thorough testing.

Microsoft says 8.5M systems hit by CrowdStrike BSOD, releases USB recovery tool

Microsoft reported that a faulty CrowdStrike security update affected 8.5 million Windows systems and released a USB recovery tool to delete the problematic file. The incident highlights the need for thorough update testing.

18 comments
By @G1N - 6 months
> As Bitsight continues to investigate the traffic patterns exhibited by CrowdStrike machines across organizations globally, two distinct points emerge as “interesting” from a data perspective. Firstly, on July 16th at around 22:00 there was a huge traffic spike, followed by a clear and significant drop off in egress traffic from organizations to CrowdStrike. Second, there was a significant drop, between 15% and 20%, in the number of unique IPs and organizations connected to CrowdStrike Falcon servers, after the dawn of the 19th.

> While we can not infer what the root cause of the change in traffic patterns on the 16th can be attributed to, it does warrant the foundational question of “Is there any correlation between the observations on the 16th and the outage on the 19th?”. As more details from the event emerge, Bitsight will continue investigating the data.

Interested to know how they're capturing sample data for IPs accessing Crowdstrike Falcon APIs and the corresponding packet data.

EDIT: Not to mention that they're able to distill their dataset to group IPs by their representative organizations. Since they have that info I feel a proper analysis would include actually analyzing which orgs (types, country of origin, etc) started dropping off starting on the 16th. Alas since this seems like just a marketing fluff piece we'll never get anything substantial :(

By @paxys - 6 months
I'm not sure what exactly they are trying to say. They saw some CrowdStrike traffic logs, saw a random spike a few days before the outage, and...that's it? Why is that "strange", and how does it relate to the incident timeline?

Just a random security company with a fluff piece with "CrowdStrike" in the title trying to get in the headlines.

By @philsnow - 6 months
I would be interested to know what the distribution of release times for these "channel files" is like. Dropping them at 8pm Eastern time is in line with some companies' idea of well-timed system maintenance windows, whereas others prefer to do things during the workday so that if they need all hands on deck, they can get them more easily.

The latter works better with organizations that release often and have reasonable surety that their updates are not going to cause disruption -- it becomes a normal part of the day, most commonly it causes no noticeable disruption at all, and thus it makes sense to not have to have eng / ops working late hours for the release. This surety can come from different ways, but the one I've seen is having a very methodical rollout with at least a smoke-test (affecting a very small subset of "production", not internal or lab machines, so in CRWD's case it would be customers' machines), and then rolling out to a random %age of machines starting with 1%, and depending on your level of confidence, some schedule that gets you to 100% before the end of business for your easternmost co-workers.

Some additional things to gain confidence can include a 1% rollout to a set of machines that is picked to ideally provide exposure to every type of machine in the fleet, and 100% rollout to customers who have agreed to be at the cutting edge (how you get them to accept that risk is an exercise for the reader, but maybe cut them a deal like 30% off their license).

The reason I'm curious about the distribution of channel file drops, for the case of Crowdstrike, is that if it's an atypically-timed release, that could indicate that it's a response to whatever caused the dip in traffic on the 16th mentioned in the Bitsight article.

Edit: From what I understand, Crowdstrike does have at least some segmentation of releases for the kernel extension, but it appears the configuration file / channel file updates seem to be "Oh well, fire ze missiles".
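The cohort selection described in this comment (a random percentage of machines, a small set chosen to cover every machine type, and opted-in early adopters) can be sketched roughly as follows. This is an illustrative sketch only: the function names, the release identifier, and the hash-based bucketing are assumptions, and nothing here reflects how CrowdStrike actually distributes channel files.

```python
import hashlib


def rollout_bucket(machine_id: str, release_id: str) -> int:
    """Deterministically map a machine to a bucket in 0..9999 (hypothetical scheme)."""
    digest = hashlib.sha256(f"{release_id}:{machine_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % 10_000


def in_cohort(machine_id, release_id, percent, early_adopters=frozenset()):
    """True if the machine should receive the release at this rollout percentage."""
    if machine_id in early_adopters:
        return True  # opted-in customers always get the release first
    # Each bucket is 0.01% of the fleet, so percent * 100 buckets are included.
    return rollout_bucket(machine_id, release_id) < percent * 100


def canary_per_type(fleet: dict[str, str]) -> set[str]:
    """Pick one machine per machine type; fleet maps machine_id -> type."""
    chosen = {}
    for machine_id in sorted(fleet):
        chosen.setdefault(fleet[machine_id], machine_id)
    return set(chosen.values())


machines = [f"host-{i}" for i in range(1000)]
at_1 = {m for m in machines if in_cohort(m, "release-A", 1)}
at_20 = {m for m in machines if in_cohort(m, "release-A", 20)}
print(len(at_1), len(at_20))
```

Because the bucket depends only on the machine and release identifiers, raising the percentage only ever adds machines to the cohort, and including the release identifier in the hash means a different subset goes first on each release.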

By @subract - 6 months
How exactly is Bitsight collecting the data used in this analysis? I understand it’s just a sampling, but how are they sampling traffic between two arbitrary parties (Crowdstrike and customers in this case)?
By @ss64 - 6 months
The obvious inference from this is that the bad update was trickled out to some customers on the 16th and it took them 2 days to report the issue because they were all busy figuring out why every machine was blue-screening. Alternatively it took CrowdStrike 2 days to notice that their traffic was disappearing and put 2 and 2 together as to why.
By @Arch485 - 6 months
I wonder if CrowdStrike did do a phased rollout (as they should), but didn't notice that the update was causing crashes?

Not sure if that would make them more or less incompetent...

By @ramenmeal - 6 months
A lot of evidence, but no claim.
By @Lerc - 6 months
This image at https://www.bitsight.com/sites/default/files/2024/07/23/Uniq... makes a compelling argument that something happened concurrently with the third set of weekday peaks. Although considering how similar each peak was for the first two weeks I would say the divergence was earlier than the dotted line where the short sharp peak appeared.

Something happened, the nature of that something might be unrelated to the BSOD crash. Could just be another piece of software doing an update at a different frequency that sometimes changes the timing of the crowdstrike update.

You'd need a longer term view of data searching for beat patterns to detect that.

If the something was a one-off effect, like admins taking sick days to watch the Euro final, I'm not sure how you could positively identify the cause.

By @geor9e - 6 months
I don't know what the point of this article was, but at my work, we push updates to millions of devices in people's homes, but we only "open the flood gates" briefly. A blip of time, so 0.01% get the update, we check that they all came back online and reported in healthy, 0.1%, next day 1%, next day 20%, next day 100%. There have been a couple times where we had to refund 0.01% of the customers who got bricked as guinea pigs and called us angrily, luckily it's never been 1%. I get that security updates can't wait a whole day, but can't they at least wait until Windows reboots? I wonder why Crowdstrike pushed to all 8.5 million before checking if any came back online.
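The gated schedule in this comment (0.01%, 0.1%, 1%, 20%, 100%, with a health check between stages) might look roughly like the sketch below. The `push` and `is_healthy` callbacks are hypothetical placeholders for "send the update" and "the device came back online and reported healthy"; the waiting period between stages is omitted.

```python
from typing import Callable, Sequence

# Stage percentages taken from the comment above.
STAGES = (0.01, 0.1, 1.0, 20.0, 100.0)


def staged_rollout(
    fleet: Sequence[str],
    push: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    stages: Sequence[float] = STAGES,
) -> dict:
    """Push to a growing share of the fleet, halting at the first unhealthy batch."""
    updated = 0
    for percent in stages:
        # Cumulative number of devices that should be updated after this stage.
        target = max(1, round(len(fleet) * percent / 100))
        batch = fleet[updated:target]
        for device in batch:
            push(device)
        updated += len(batch)
        unhealthy = [d for d in batch if not is_healthy(d)]
        if unhealthy:
            return {"halted_at": percent, "updated": updated, "unhealthy": unhealthy}
    return {"halted_at": None, "updated": updated, "unhealthy": []}


fleet = [f"dev-{i}" for i in range(10_000)]
pushed = []
# Hypothetical failure: one device in the second stage does not report back healthy.
result = staged_rollout(fleet, pushed.append, lambda d: d != "dev-5")
print(result)
```

In this example the rollout stops after 10 of 10,000 devices, which is the point of the approach: a bad update reaches only the devices in the stages already opened.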
By @metadat - 6 months
PR advertisement stunt disguised as an uninformative nothingburger blog post.
By @genter - 6 months
Anyone know what tool was used to produce the graphs?
By @refulgentis - 6 months
TL;DR: Strange, our deep packet inspector traffic data shows a drop in traffic from July 16th-18th!

It's incredibly creepy that they A) are collecting this much data from customers B) are comfy drilling into it by IP/organization and C) have enough spare time to do so for a marketing blog post.

Also, for god's sake, you're a company, you're supposed to look professional. If you're going to use AI art for your blog at least don't be lazy: load up Photopea and either fix the broken text or magic wand it out. It'll take you 5 minutes.

By @bdjsiqoocwk - 6 months
"there's a spike we don't know how to explain" saved you a click
By @iJohnDoe - 6 months
How are they capturing this data?
By @MrBuddyCasino - 6 months
CrowdStrike is a rather interesting company, in that it is politically connected. Some additional background by Mike Benz:

https://x.com/mikebenzcyber/status/1816177071757893823

https://x.com/mikebenzcyber/status/1816196876686999962