Introduction

A few years ago when we started building Windows Azure SQL Database, our Cloud RDBMS service, we assumed that fault-tolerance was a basic requirement of any cloud database offering. Our cloud customers have a diverse set of needs for storage solutions but our focus was to address the needs of the customers who needed an RDBMS for their application. For example, one of our early adopters was building a massive ticket reservation system in Windows Azure. Their application required relational capabilities with concurrency controls and transactional guaranties with consistency and durability.

To build a true RDBMS service we had to be fault-tolerant while ensuring that all atomicity, consistency, isolation and durability (ACID) characteristics of the service matched that of a SQL Server database. In addition, we wanted to provide elasticity and scale capabilities for customers to create and drop thousands of databases without any provisioning friction. Building a fault-tolerant (FT) system at cloud scale required a good deal of innovation.

We began by collecting a lot of data on various failure types and for a while we reveled in the academic details of the various system failure models. Ultimately, we simplified the problem space to the following two principles:

Hardware and software failures are inevitable

Operational staff make mistakes that lead to failures

There were two driving factors behind the decision to simplify our failure model. First, a fault-tolerant system requires us to deal with low-frequency failures, planned outages, as well as high-frequency failures. Second, at cloud scale, the low-frequency failures happen every week if not every day.

Our designs for fault-tolerance started to converge around a few solutions once we assumed that all components are likely to fail, and, that it was not practical to have a different FT solution for every component in the system. For example, if all components in a computer are likely to fail then we might as well have redundant computers instead of investing in redundant components, such as power supplies and RAID.

We finally decided that we would build fault-tolerant SQL databases at the highest level of the stack instead of building fault-tolerant systems that run database servers that host databases. Last but not least, the FT functionality would be an inherent part of the offering without requiring configurations and administration by operators or customers.

Fault-Tolerant SQL Databases

Customers are most interested in the resiliency of their own databases and less interested in the resiliency of the service as a whole. 99.9% uptime for a service is meaningless if “my database” is part of the 0.1% of databases that are down. Each and every database needs to be fault-tolerant and fault mitigation should never result in the loss of a committed transaction. There are two major technologies that provide the foundation for the fault-tolerant databases:

Database Replication

Failure Detection & Failover

Together, these technologies allow the databases to tolerate and mitigate failures in an automated manner without human interventions while ensuring that committed transactions are never lost in user’s databases.

Database Fault-Tolerance in a Nutshell

Windows Azure SQL Database maintains multiple copies of all databases in different physical nodes located across fully independent physical sub-systems, such as server racks and network routers. At any one time, Windows Azure SQL Database keeps three replicas of each database – one primary replica and two secondary replicas. Windows Azure SQL Database uses a quorum-based commit scheme where data is written to the primary and one secondary replica before we consider the transaction committed. If any component fails on the primary replica, Windows Azure SQL Database detects the failure and fails over to the secondary replica. In case of a physical loss of the replica, Windows Azure SQL Database creates a new replica automatically. Therefore, there are at least two replicas of each database that have transactional consistency in the data center. Other than the loss of an entire data center all other failures are mitigated by the service.

The replication, failure detection and failover mechanisms of Windows Azure SQL Database are fully automated and operate without human intervention. This architecture is designed to ensure that committed data is never lost and that data durability takes precedence over all else.

The Key Customer Benefits:

Customers get the full benefit of replicated databases without having to configure or maintain complicated hardware, software, OS or virtualization environments

Full ACID properties of relational databases are maintained by the system

Failovers are fully automated without loss of any committed data

Routing of connections to the primary replica is dynamically managed by the service with no application logic required

The high level of automated redundancy is provided at no extra charge

If you are interested in additional details, the next two sections provide more information about the internal workings of our replication and failover technologies.

Windows Azure SQL Database Replication Internals

Redundancy is the key to fault-tolerance and in Windows Azure SQL Database. Redundancy within Windows Azure SQL Database is maintained at the database level therefore each database is made physically and logically redundant. Redundancy for each database is enforced throughout the database’s lifecycle. Every database is replicated before it’s even provided to a customer to use and the replicas are maintained until the database is dropped by the customer. Each of the three replicas of a database is stored on a different node. Replicas of each database are scattered across nodes such that no two copies reside in the same “failure domain,” e.g., under the same network switch or in the same rack. Replicas of each database are assigned to nodes independently of the assignment of other databases to nodes, even if the databases belong to the same customer. That is, the fact that replicas of two databases are stored on the same node does not imply that other replicas of those databases are also co-located on another node.

For each database, at each point in time one replica is designated to be the primary. A transaction executes using the primary replica of the database (or simply, the primary database). The primary replica processes all query, update, and data definition language operations. It ships its updates and data definition language operations to the secondary replicas using the replication protocol for Windows Azure SQL Database. The system currently does not allow reads of secondary replicas. Since a transaction executes all of its reads and writes using the primary database, the node that directly accesses the primary partition does all the work against the data. It sends update records to the database’s secondary replicas, each of which applies the updates. Since secondary replicas do not process reads, each primary has more work to do than its secondary replicas. To balance the load, each node hosts a mix of primary and secondary databases. On average, with 3-way replication, each node hosts one primary database for every 2 secondary replica. Obviously, two replicas of a database are never co-located on the same physical node.

Another benefit of having each node host a mix of primary and secondary databases is that it allows the system to spread the load of a failed node across many live nodes. For example, suppose a node S hosts three primary databases PE, PF, and PG. If S fails and secondary replicas for PE, PF, and PG are spread across different nodes, then the new primary database for PE, PF, and PG can be assigned to three different nodes.

The replication protocol is specifically built for the cloud to operate reliably while running on a collection of hardware and software components that are assumed to be unreliable (component failures are inevitable). The transaction commitment protocol requires that only a quorum of the replicas be up. A consensus algorithm, similar to Paxos, is used to maintain the set of replicas. Dynamic quorums are used to maintain availability in the face of multiple failures.

The propagation of updates from primary to secondary is managed by the replication protocol. A transaction T’s primary database generates a record containing the after-image of each update by T. Such update records serve as logical redo records, identified by table key but not by page ID. These update records are streamed to the secondary replicas as they occur. If T aborts, the primary sends an ABORT message to each secondary, which deletes the updates it received for T. If T issues a COMMIT operation, then the primary assigns to T the next commit sequence number (CSN), which tags the COMMIT message that is sent to secondary replicas. Each secondary applies T’s updates to its database in commit-sequence-number order within the context of an independent local transaction that corresponds to T and sends an acknowledgment (ACK) back to the primary. After the primary receives an ACK from a quorum of replicas (including itself), it writes a persistent COMMIT record locally and returns “success” to T’s COMMIT operation. A secondary can send an ACK in response to a transaction T’s COMMIT message immediately, before T’s corresponding commit record and update records that precede it are forced to the log. Thus, before T commits, a quorum of nodes has a copy of the commit.

Updated records are eventually flushed to disk by primaries and secondary replicas. Their purpose is to minimize the delta between primary and secondary replicas in order to reduce any potential data loss during a failover event.

Updates for committed transactions that are lost by a secondary (e.g., due to a crash) can be acquired from the primary replica. The recovering replica sends to the primary the commit sequence number of the last transaction it committed. The primary replies by either sending the queue of updates that the recovering replica needs or telling the recovering replica that it is too far behind to be caught up. In the latter case, the recovering replica can ask the primary to transfer a fresh copy. A secondary promptly applies updates it receives from the primary node, so it is always nearly up-to-date. Thus, if it needs to become the primary due to a configuration change (e.g., due to load balancing or a primary failure), such reassignment is almost instantaneous. That is, secondary replicas are hot standbys and provide very high availability.

Failure Detection & Failover Internals

A large-scale distributed system needs a highly-reliable failure detection system that can detect failures reliability, quickly and as close as possible to customer. The Windows Azure SQL Database distributed fabric is paired with the SQL engine so that it can detect failures within a neighborhood of databases.

Centralized health monitoring of a very large system is inefficient and unreliable. The Windows Azure SQL Database failure detection is completely distributed so that any node in the system can be monitored by several of its neighbors. This topology allows for an extremely efficient, localized, and fast detection model that avoids the usual ping storms and unnecessarily delayed failure detections.

Although we collect detailed component-level failure telemetry for subsequent analysis we only use high-level failure signatures detected by the fabric to make failover decisions. Over the years we have improved our ability to fail-fast and recover so that degraded conditions of an unhealthy node do not persist.

Because the failover unit in Windows Azure SQL Database is the database, each database’s health is carefully monitored and failed over when required. Windows Azure SQL Database maintains a global map of all databases and their replicas in the Global Partition Manager (GPM). The global map contains the health, state, and location of every database and its replicas. The distributed fabric maintains the global map. When a node in Windows Azure SQL Database fails, the distributed fabric reliably and quickly detects the node failure and notifies the GPM. The GPM then reconfigures the assignment of primary and secondary databases that were present on the failed node.

Since Windows Azure SQL Database only needs a quorum of replicas to operate, availability is unaffected by failure of a secondary replica. In the background, the system simply creates a new replica to replace the failed one.

Replicas which are only temporarily unavailable for short periods of time are simply caught up with the small number of missing transactions that they missed. Its node asks an operational replica to send it the tail of the update queue that the replica missed while it was down. Allowing for quick synchronization of temporarily unavailable secondary replicas is an optimization which avoids the complete recreation of replicas when not strictly necessary.

If a primary replica fails, one of the secondary replicas must be designated as the new primary and all of the operational replicas must be reconfigured according to that decision. The first step in this process relies on the GPM to choose a leader to rebuild the database’s configuration. The leader attempts to contact the members of the entire replica set to ensure that there are no lost updates. The leader determines which secondary has the latest state. That most up-to-date secondary replica propagates changes that are required by the other replicas that are missing changes.

All connections to Windows Azure SQL Database databases are managed by a set of load-balanced Gateway processes. A Gateway is responsible for accepting inbound database connection requests from clients and binding them to the node that currently hosts the primary replica of a database. The Gateways coordinate with the distributed fabric to locate the primary replica of a customer’s databases. In the event of a fail-over, the Gateways renegotiate the connection binding of all connections bound to the failed primary to the new primary as soon as it is available.

The combination of connection Gateways, distributed fabric, and the GPM can detect and mitigate failures using the database replicas maintained by Windows Azure SQL Database.