July 19th, 2024

Major Microsoft 365 outage caused by Azure configuration change

A Microsoft 365 outage in Central US, triggered by an Azure configuration change, impacted services like Teams and Xbox Live. Microsoft addressed the issue, rerouting traffic to mitigate disruptions.


A major Microsoft 365 outage was caused by an Azure configuration change, affecting customers in the Central US region. The outage, which started around 6:00 PM EST, impacted various Microsoft 365 apps and services including Microsoft Defender, Teams, and OneDrive for Business. Xbox Live service was also affected, with users experiencing login issues. Microsoft acknowledged the problem and worked on rerouting traffic to alleviate the impact. The issue stemmed from a buggy configuration change in Azure backend workloads, causing connectivity problems between Azure Storage clusters and compute resources. While most services are back online, some customers still face difficulties accessing services like Microsoft Teams. This incident follows previous outages in July 2022 and January 2023, highlighting the challenges of maintaining service reliability in cloud environments.

10 comments
By @macote - 9 months
That's an interesting recommendation...

> We've received feedback from customers that several reboots (as many as 15 have been reported) may be required, but overall feedback is that reboots are an effective troubleshooting step at this stage.

https://azure.status.microsoft/en-ca/status

By @gjsman-1000 - 9 months
I'm not sure if this has anything to do with it; but I got an email this morning at 2 AM saying that my Microsoft account password was changed. The email was authentic - it came from Microsoft's servers, and had no buttons for me to click. It said that the IP Address of the password change was my own ISP in my area... and that the reset came from a phone number I didn't recognize on my security info, that even now does not show on my account dashboard.

I ran in a panic to reset my password... only to discover, the password was "changed" to the exact same password? And how would they even get in the account without 2FA? My "sign in history" also showed no trace of anything unusual. At this point, and reading these headlines, I feel more confident something's broke.

By @jameskilton - 9 months
And CrowdStrike hitting them at the same time must have been just unbelievable.
By @CWuestefeld - 9 months
OK, it's a pedantic nit, but doesn't anybody have editors anymore?

> This massive outage started around 6:00 PM EST

We're currently in daylight saving time, so the time should read "EDT".
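The abbreviation can be checked against the 21:56 UTC start time given in Microsoft's status report quoted further down the thread. This is a minimal Python sketch, assuming the article's "EST" means US Eastern time (`America/New_York`, which the article does not state) and that the system time zone database is available to `zoneinfo`:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Start of customer impact according to Microsoft's status report (UTC).
impact_start_utc = datetime(2024, 7, 18, 21, 56, tzinfo=timezone.utc)

# Assumption: the article's "EST" refers to US Eastern time.
eastern = impact_start_utc.astimezone(ZoneInfo("America/New_York"))

# Local wall-clock time with the abbreviation in force on that date.
print(eastern.strftime("%Y-%m-%d %H:%M %Z"))
```

Under that assumption the result is 17:56 EDT, which fits "around 6:00 PM"; read literally as EST (UTC-5), the same instant would be 16:56.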

By @isaacremuant - 9 months
This entire thing is not surprising but it's fascinating to see unfold. In many dimensions, even listening to the wild speculation or weird reporting around it.
By @vb-8448 - 9 months
I guess we can call it a "black Thursday" for Microsoft.
By @hypeatei - 9 months
The Central US outage really exposed a lot of shortcomings both at my company and at MSFT apparently.

Even having regional failover available wouldn't help, because the control plane was unreliable, meaning failover couldn't be triggered by anyone.

By @terom - 9 months
https://azure.status.microsoft/en-us/status/history/ doesn't seem to have links to the individual incidents. Some reports claim the Azure / Microsoft 365 outages were related to CrowdStrike, but this sounds like an entirely separate incident.

AFAIK the broken CrowdStrike channel update happened at 2024-07-19 06:05 UTC and was "fixed" (rolled back) at 06:47 UTC, but I don't have a proper source for that timeline.

EDIT: https://azure.status.microsoft/en-gb/status claims 2024-07-18 19:00 UTC as the approximate start of impact for the CrowdStrike update. It would be nice to find a proper source for the start and mitigation timelines...

EDIT: Reddit threads reporting symptoms start at approx 2024-07-19 05:00 UTC. That would mean the CrowdStrike impact started soon after the Azure recovery.

---

What happened?

Between 21:56 UTC on 18 July 2024 and 12:15 UTC on 19 July 2024, customers may have experienced issues with multiple Azure services in the Central US region including failures with service management operations and connectivity or availability of services. A storage incident impacted the availability of Virtual Machines which may have also restarted unexpectedly. Services with dependencies on the impacted virtual machines and storage resources would have experienced impact.

What do we know so far?

We determined that a backend cluster management workflow deployed a configuration change causing backend access to be blocked between a subset of Azure Storage clusters and compute resources in the Central US region. This resulted in the compute resources automatically restarting when connectivity was lost to virtual disks hosted on impacted storage resources.

How did we respond?

21:56 UTC on 18 July 2024 – Customer impact began

22:13 UTC on 18 July 2024 – Storage team started investigating

22:41 UTC on 18 July 2024 – Additional teams engaged to assist investigations

23:27 UTC on 18 July 2024 – All deployments in Central US stopped

23:35 UTC on 18 July 2024 – All deployments paused for all regions

00:45 UTC on 19 July 2024 – A configuration change was confirmed as the underlying cause

01:10 UTC on 19 July 2024 – Mitigation started

01:30 UTC on 19 July 2024 – Customers started seeing signs of recovery

02:51 UTC on 19 July 2024 – 99% of all impacted compute resources recovered

03:23 UTC on 19 July 2024 – All Azure Storage clusters confirmed recovery

03:41 UTC on 19 July 2024 – Mitigation confirmed for compute resources

Between 03:41 and 12:15 UTC on 19 July 2024 – Services which were impacted by this outage recovered progressively and engineers from the respective teams intervened where further manual recovery was needed. Following an extended monitoring period, we determined that impacted services had returned to their expected availability levels.
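To make the response times easier to compare, the timeline above can be turned into elapsed time since the start of impact. This is only a sketch: the labels are shortened paraphrases, the helper `utc` is a hypothetical convenience function, and the 00:45 entry is assumed to fall on 19 July because that is the only date that fits the sequence.

```python
from datetime import datetime, timezone

def utc(day, hour, minute):
    """Build a UTC timestamp in July 2024 (hypothetical helper)."""
    return datetime(2024, 7, day, hour, minute, tzinfo=timezone.utc)

# Selected milestones copied from the timeline above.
timeline = [
    ("Customer impact began", utc(18, 21, 56)),
    ("Storage team started investigating", utc(18, 22, 13)),
    ("All deployments in Central US stopped", utc(18, 23, 27)),
    # Assumption: 19 July, the only date consistent with the ordering.
    ("Configuration change confirmed as cause", utc(19, 0, 45)),
    ("Mitigation started", utc(19, 1, 10)),
    ("Mitigation confirmed for compute resources", utc(19, 3, 41)),
    ("Services back at expected availability", utc(19, 12, 15)),
]

start = timeline[0][1]
# Elapsed time of every milestone relative to the start of impact.
elapsed = {label: when - start for label, when in timeline}

for label, delta in elapsed.items():
    minutes = int(delta.total_seconds() // 60)
    print(f"+{minutes // 60:02d}h{minutes % 60:02d}m  {label}")
```

Under these assumptions, the cause was confirmed 2 h 49 min after impact began, compute mitigation was confirmed at 5 h 45 min, and services were back at expected availability at 14 h 19 min.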

By @boringg - 9 months
Honestly is anyone surprised that Microsoft is having these issues? They still run the same sloppy software as they always have.