Microsoft says 365 outage was amplified by internal errors

Microsoft’s latest outage on Tuesday might have been amplified by its own unforced errors, the company said in an incident report.

“While the initial trigger event was a distributed denial-of-service (DDoS) attack, which activated our DDoS protection mechanisms, initial investigations suggest that an error in the implementation of our defenses amplified the impact of the attack rather than mitigating it,” the report said.

The Microsoft 365 outage on Tuesday is the latest in a series of unforced errors by major IT vendors.

Failure to adequately test systems before roll-out was also a factor in the CrowdStrike incident on July 19, and behind DigiCert’s short-notice revocation of erroneously issued SSL certificates earlier this week.

The July 19 incident was caused by a flaw in CrowdStrike’s security sensor software that testing had failed to uncover, and it cost users millions of dollars in repairs and lost business opportunities.

A root cause analysis of the DigiCert incident showed that process failures during the modernization of a software system had likewise been missed during testing.

Steps Microsoft took to mitigate the outage

The latest problems with Microsoft 365 began to appear around 11:45 UTC on Tuesday, when an unexpected usage spike resulted in Azure Front Door (AFD) and Azure Content Delivery Network (CDN) components performing below acceptable thresholds, leading to intermittent errors, timeouts, and latency spikes, Microsoft said.

The dip in performance affected a subset of Microsoft 365 and other services, including Azure App Services, Application Insights, Azure IoT Central, Azure Log Search Alerts, and Azure Policy, as well as the Azure portal itself.

The services affected also included the Microsoft 365 admin center, Intune, Entra, and Power Platform.

The company said it began investigating immediately. Once it understood that a DDoS attack was behind the usage spike, it implemented networking configuration changes to support its DDoS protection efforts and performed failovers to alternate networking paths to provide relief.

“Our initial network configuration changes successfully mitigated majority of the impact by 14:10 UTC,” the company wrote in the report.

However, it noted that despite these early efforts, several enterprise customers reported less than 100% availability, an issue the company began mitigating at 18:00 UTC.

Without giving further details in the incident report, Microsoft said that it then took a different approach to resolving the issue, starting with Asia Pacific and Europe.

“After validating that this revised approach successfully eliminated the side effect impacts of the initial mitigation, we rolled it out to regions in the Americas. Failure rates returned to pre-incident levels by 19:43 UTC,” the company wrote in the incident report, adding that the incident was finally mitigated at that point.

Additional steps promised by Microsoft

In its initial report, Microsoft said internal teams would complete an investigation to understand the incident in more detail.

“We will publish a Preliminary Post Incident Review (PIR) within approximately 72 hours, to share more details on what happened and how we responded. After our internal retrospective is completed, generally within 14 days, we will publish a Final Post Incident Review with any additional details and learnings,” the company wrote in the report.

This is Microsoft’s eighth incident this year, according to the company’s service status page.

Last year was also riddled with outages for Microsoft 365 users. Azure’s service page shows that the last incident reported in 2023 was in September, when the US East region faced issues.
