Post

Azure West US Power Failure: Why Restoring Power Did Not Restore Service

A technical analysis of the February 2026 Azure West US power interruption, where the recovery phase proved more damaging than the initial failure, as unhealthy storage components blocked compute restoration across the entire region's PaaS stack for nearly 20 hours.

Azure West US Power Failure: Why Restoring Power Did Not Restore Service

Power Is Back. Nothing Works. Now What?

When power returns to a data center, the expectation is that services come back online. Generators keep critical systems running through outages, UPS units bridge the gap, and once mains power restores, the infrastructure returns to normal operation. That is what most cloud customers assume happens when a cloud provider reports a power interruption resolved. February 7, 2026 in Azure West US demonstrated that assumption can be dangerously wrong.

A power interruption hit Availability Zone AZ01 in Azure’s West US region. By most accounts, the interruption itself lasted a manageable amount of time. Power was restored. And then the real problems started.

A subset of storage and compute infrastructure within AZ01 failed to return to a healthy state after power restoration. The storage components did not declare themselves failed. They did not trigger clean failover procedures. They came back online in a degraded, inconsistent state that Azure’s health monitoring classified as available but that caused failures in any operation that actually tried to use them. This is, operationally, one of the worst failure modes in distributed infrastructure: the component that should be failed is reporting itself as healthy.

The Twenty-Hour Cascade

The storage components’ unhealthy state became a ceiling that every dependent service hit when it tried to recover. Azure Cache for Redis could not fully initialize because its persistence layer was backed by storage in an inconsistent state. Azure Cosmos DB partition replicas could not complete failover procedures. Azure Database for PostgreSQL instances could not perform recovery operations that required writing to storage. Azure Databricks clusters came up but hit errors when reading and writing data. Azure Synapse Analytics workloads failed mid-execution. Azure Service Bus message queues became unavailable. Azure SQL Database instances entered recovery mode and stayed there.

The entire West US PaaS stack was effectively held hostage by storage components that were technically present but operationally broken.

Recovery took close to 20 hours, which is extraordinary for what was initially reported as a power interruption. The diagnostic challenge compounded the timeline: when storage components report themselves healthy but behave incorrectly, the telemetry that Azure’s operations teams rely on to detect failures sends mixed signals. Systems appear available on the health dashboard while workloads are failing in ways that look like application-level problems rather than infrastructure failures.

This delayed the root cause diagnosis. Teams investigating failures in Redis, Cosmos DB, and PostgreSQL were initially looking at the individual service layers rather than the common storage substrate underneath all of them.

What Good Cloud Architecture Should Look Like

James Hamilton, who spent years as a distinguished engineer at Amazon thinking about these exact problems, wrote extensively about the importance of designing for fast, clean failure. A component that fails in an ambiguous, partially-healthy state is harder to recover from than one that fails cleanly and completely.

The Azure West US incident is a practical illustration of that principle. Had the storage components failed completely and triggered clean failover to healthy AZ02 and AZ03 infrastructure, the cascade would have been contained and recovery would have been faster. The partial, ambiguous failure state defeated the automated recovery logic that depends on clear signal about what is healthy and what is not.

For architects designing workloads on Azure or any cloud provider, the incident highlights something about availability zone configuration that is frequently underspecified. Using multiple availability zones provides resilience only when the service is actually deployed in an active multi-zone configuration and when the application layer is designed to tolerate individual zone failures. A deployment that runs in a single zone with the expectation of automatic failover to another zone provides much weaker guarantees than an active-active multi-zone deployment where all zones are continuously serving traffic.

The Telemetry Gap

One operational detail that received attention in post-incident analysis was the role of delayed telemetry in extending the recovery timeline. Azure’s internal monitoring systems receive metrics and health signals from infrastructure components at regular polling intervals. Under normal conditions, this polling cadence provides sufficient visibility. Under the specific conditions of AZ01’s degraded storage components, the health signals were misleading and the polling cadence meant that the operations team was always working with information that was several minutes old.

By the time anomalous behavior was confirmed, diagnosed, and routed to the team with authority to take corrective action, the window for quick intervention had passed. The incident added roughly two to three hours to the recovery timeline that could potentially have been recovered with real-time health signal visibility rather than polled metrics.

This is a known trade-off in monitoring system design. Real-time health signals require more infrastructure, more bandwidth, and more processing capacity than sampled metrics. Most organizations, including cloud providers, optimize toward cost and simplicity. The West US incident is a data point for where that optimization has its limits.

A Useful Thing to Test Before You Need It

For teams running production workloads in Azure West US or any single-zone configuration, this incident provides motivation for a test that most teams avoid scheduling: actually failing over to a backup zone in a controlled way before the infrastructure does it for you in an uncontrolled one.

Failover procedures that have never been exercised under real conditions have a way of failing in their own interesting ways when called upon during an actual incident. The Azure West US recovery would have been faster for customers who had previously validated their failover path than for those who discovered the gaps in their recovery procedures at the same time they were trying to execute them.

References

  1. Azure Service Health Dashboard, West US region incident, February 7-8, 2026
  2. Microsoft Azure blog, post-incident summary, West US AZ01 power failure
  3. DataCenterDynamics, “Azure West US outage: power restoration does not equal service restoration,” February 2026
  4. James Hamilton, “On Designing and Deploying Internet-Scale Services,” USENIX LISA 2007
  5. NetworkWorld, “Azure 20-hour outage traced to storage recovery failure,” February 2026
This post is licensed under CC BY 4.0 by the author.