Major Microsoft 365 outage caused by Azure configuration change
A Microsoft 365 outage in Central US, triggered by an Azure configuration change, impacted services like Teams and Xbox Live. Microsoft addressed the issue, rerouting traffic to mitigate disruptions.
A major Microsoft 365 outage was caused by an Azure configuration change, affecting customers in the Central US region. The outage, which started around 6:00 PM EST, impacted various Microsoft 365 apps and services including Microsoft Defender, Teams, and OneDrive for Business. Xbox Live service was also affected, with users experiencing login issues. Microsoft acknowledged the problem and worked on rerouting traffic to alleviate the impact. The issue stemmed from a buggy configuration change in Azure backend workloads, causing connectivity problems between Azure Storage clusters and compute resources. While most services are back online, some customers still face difficulties accessing services like Microsoft Teams. This incident follows previous outages in July 2022 and January 2023, highlighting the challenges of maintaining service reliability in cloud environments.
> We've received feedback from customers that several reboots (as many as 15 have been reported) may be required, but overall feedback is that reboots are an effective troubleshooting step at this stage.
I ran in a panic to reset my password... only to discover, the password was "changed" to the exact same password? And how would they even get in the account without 2FA? My "sign in history" also showed no trace of anything unusual. At this point, and reading these headlines, I feel more confident something's broke.
> This massive outage started around 6:00 PM EST
We're currently in daylight saving time, so the time should read "EDT".
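For anyone who wants to check the arithmetic, here is a minimal sketch. It assumes the 21:56 UTC start of impact given in the Azure status post quoted further down, and it uses fixed offsets (UTC-4 for EDT, UTC-5 for EST) rather than a time zone database. The helper name `to_local` is just for illustration.

```python
from datetime import datetime, timedelta, timezone

# Fixed-offset zones (an assumption for this sketch; no DST rules are applied).
EDT = timezone(timedelta(hours=-4), "EDT")
EST = timezone(timedelta(hours=-5), "EST")

def to_local(utc_stamp, tz):
    """Convert a 'YYYY-MM-DD HH:MM' UTC timestamp to the given fixed-offset zone."""
    utc = datetime.strptime(utc_stamp, "%Y-%m-%d %H:%M").replace(tzinfo=timezone.utc)
    return utc.astimezone(tz).strftime("%H:%M %Z")

print(to_local("2024-07-18 21:56", EDT))  # 17:56 EDT
print(to_local("2024-07-18 21:56", EST))  # 16:56 EST
```

Only the EDT reading lands near the article's "around 6:00 PM", which is consistent with the label being a typo rather than the time being off by an hour.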
Even having regional failover available wouldn't help, because the control plane was unreliable, meaning it couldn't be triggered by anyone.
AFAIK the broken CrowdStrike channel update happened at 2024-07-19 06:05 UTC and was "fixed" (rolled back) at 06:47 UTC, but I don't have a proper source for that timeline?
EDIT: https://azure.status.microsoft/en-gb/status claims 2024-07-18 19:00 UTC as the approximate start of impact for the CrowdStrike update. It would be nice to find a proper source for the start and mitigation timelines...
EDIT: Reddit threads reporting symptoms start at approx 2024-07-19 05:00 UTC. That would mean the CrowdStrike impact started soon after the Azure recovery.
---
What happened?
Between 21:56 UTC on 18 July 2024 and 12:15 UTC on 19 July 2024, customers may have experienced issues with multiple Azure services in the Central US region including failures with service management operations and connectivity or availability of services. A storage incident impacted the availability of Virtual Machines which may have also restarted unexpectedly. Services with dependencies on the impacted virtual machines and storage resources would have experienced impact.
What do we know so far?
We determined that a backend cluster management workflow deployed a configuration change causing backend access to be blocked between a subset of Azure Storage clusters and compute resources in the Central US region. This resulted in the compute resources automatically restarting when connectivity was lost to virtual disks hosted on impacted storage resources.
How did we respond?
21:56 UTC on 18 July 2024 – Customer impact began
22:13 UTC on 18 July 2024 – Storage team started investigating
22:41 UTC on 18 July 2024 – Additional teams engaged to assist investigations
23:27 UTC on 18 July 2024 – All deployments in Central US stopped
23:35 UTC on 18 July 2024 – All deployments paused for all regions
00:45 UTC on 19 July 2024 – A configuration change was confirmed as the underlying cause
01:10 UTC on 19 July 2024 – Mitigation started
01:30 UTC on 19 July 2024 – Customers started seeing signs of recovery
02:51 UTC on 19 July 2024 – 99% of all impacted compute resources recovered
03:23 UTC on 19 July 2024 – All Azure Storage clusters confirmed recovery
03:41 UTC on 19 July 2024 – Mitigation confirmed for compute resources
Between 03:41 and 12:15 UTC on 19 July 2024 – Services which were impacted by this outage recovered progressively and engineers from the respective teams intervened where further manual recovery was needed. Following an extended monitoring period, we determined that impacted services had returned to their expected availability levels.
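To put the timeline above in terms of elapsed time, the following is a minimal sketch that computes how long after the start of impact each milestone occurred. The timestamps are copied from the post, with one assumption: the 00:45 entry is treated as 19 July, since that is the only date consistent with the chronological order. The shortened labels and the function names are illustrative and not part of the status post.

```python
from datetime import datetime, timezone

# Selected milestones from the status post (all times UTC).
# Assumption: the 00:45 entry falls on 19 July, as the ordering implies.
TIMELINE = [
    ("2024-07-18 21:56", "Customer impact began"),
    ("2024-07-18 22:13", "Storage team started investigating"),
    ("2024-07-18 23:27", "All deployments in Central US stopped"),
    ("2024-07-19 00:45", "Configuration change confirmed as cause"),
    ("2024-07-19 01:10", "Mitigation started"),
    ("2024-07-19 03:41", "Mitigation confirmed for compute resources"),
    ("2024-07-19 12:15", "Services back at expected availability"),
]

def parse_utc(stamp):
    """Parse a 'YYYY-MM-DD HH:MM' string as a timezone-aware UTC datetime."""
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M").replace(tzinfo=timezone.utc)

def elapsed_since_start(timeline):
    """Return (label, minutes since the first entry) for every entry."""
    start = parse_utc(timeline[0][0])
    return [
        (label, int((parse_utc(stamp) - start).total_seconds() // 60))
        for stamp, label in timeline
    ]

def format_minutes(minutes):
    """Render a minute count as e.g. '3h14m'."""
    hours, mins = divmod(minutes, 60)
    return f"{hours}h{mins:02d}m"

for label, minutes in elapsed_since_start(TIMELINE):
    print(f"+{format_minutes(minutes)}  {label}")
```

Under these assumptions, mitigation started 3h14m after impact began, compute mitigation was confirmed at 5h45m, and the full incident window ran 14h19m.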