In this posting we provide an overview of the Windows Azure Storage architecture to give some understanding of how it works. Windows Azure Storage is a distributed storage software stack built completely by Microsoft for the cloud.

3 Layer Architecture

The storage access architecture has the following 3 fundamental layers:

Front-End (FE) layer – This layer takes the incoming requests, authenticates and authorizes the requests, and then routes them to a partition server in the Partition Layer. The front-ends know what partition server to forward each request to, since each front-end server caches a Partition Map. The Partition Map keeps track of the partitions for the service being accessed (Blobs, Tables or Queues) and what partition server is controlling (serving) access to each partition in the system.

Partition Layer – This layer manages the partitioning of all of the data objects in the system. As described in the prior posting, all objects have a partition key. An object belongs to a single partition, and each partition is served by only one partition server. This is the layer that manages what partition is served on what partition server. In addition, it provides automatic load balancing of partitions across the servers to meet the traffic needs of Blobs, Tables and Queues. A single partition server can serve many partitions.

Distributed and replicated File System (DFS) Layer – This is the layer that actually stores the bits on disk and is in charge of distributing and replicating the data across many servers to keep it durable. A key concept to understand here is that the data is stored by the DFS layer, but all DFS servers are (and all data stored in the DFS layer is) accessible from any of the partition servers.

These layers and a high level overview are shown in the below figure:

Here we can see that the Front-End layer takes incoming requests, and a given front-end server can talk to all of the partition servers it needs to in order to process the incoming requests. The partition layer consists of all of the partition servers, with a master system to perform the automatic load balancing (described below) and assignments of partitions. As shown in the figure, each partition server is assigned a set of object partitions (Blobs, Entities, Queues). The Partition Master constantly monitors the overall load on each partition sever as well the individual partitions, and uses this for load balancing. Then the lowest layer of the storage architecture is the Distributed File System layer, which stores and replicates the data, and all partition servers can access any of the DFS severs.

Lifecycle of a Request

To understand how the architecture works, let’s first go through the lifecycle of a request as it flows through the system. The process is the same for Blob, Entity and Message requests:

DNS lookup – the request to be performed against Windows Azure Storage does a DNS resolution on the domain name for the object’s Uri being accessed. For example, the domain name for a blob request is “<your_account>.blob.core.windows.net”. This is used to direct the request to the geo-location (sub-region) the storage account is assigned to, as well as to the blob service in that geo-location.

Front-End Server Processes Request– The request reaches a front-end, which does the following:

Get the response from the partition server, and send it back to the client.

Partition Server Processes Request– The request arrives at the partition server, and the following occurs depending on whether the request is a GET (read operation) or a PUT/POST/DELETE (write operation):

GET – See if the data is cached in memory at the partition server

If so, return the data directly from memory.

Else, send a read request to one of the DFS Servers holding one of the replicas for the data being read.

PUT/POST/DELETE

Send the request to the primary DFS Server (see below for details) holding the data to perform the insert/update/delete.

DFS Server Processes Request – the data is read/inserted/updated/deleted from persistent storage and the status (and data if read) is returned. Note, for insert/update/delete, the data is replicated across multiple DFS Servers before success is returned back to the client (see below for details).

Most requests are to a single partition, but listing Blob Containers, Blobs, Tables, and Queues, and Table Queries can span multiple partitions. When a listing/query request that spans partitions arrives at a FE server, we know via the Partition Map the set of partition servers that need to be contacted to perform the query. Depending upon the query and the number of partitions being queried over, the query may only need to go to a single partition server to process its request. If the Partition Map shows that the query needs to go to more than one partition server, we serialize the query by performing it across those partition servers one at a time sorted in partition key order. Then at partition server boundaries, or when we reach 1,000 results for the query, or when we reach 5 seconds of processing time, we return the results accumulated thus far and a continuation token if we are not yet done with the query. Then when the client passes the continuation token back in to continue the listing/query, we know the Primary Key from which to continue the listing/query.

Fault Domains and Server Failures

Now we want to touch on how we maintain availability in the face of hardware failures. The first concept is to spread out the servers across different fault domains, so if a hardware fault occurs only a small percentage of servers are affected. The servers for these 3 layers are broken up over different fault domains, so if a given fault domain (rack, network switch, power) goes down, the service can still stay available for serving data.

The following is how we deal with node failures for each of the three different layers:

Front-End Server Failure – If a front-end server becomes unresponsive, then the load balancer will realize this and take it out of the available servers that serve requests from the incoming VIP. This ensures that requests hitting the VIP get sent to live front-end servers that are waiting to process requests.

Partition Server Failure – If the storage system determines that a partition server is unavailable, it immediately reassigns any partitions it was serving to other available partition servers, and the Partition Map for the front-end servers is updated to reflect this change (so front-ends can correctly locate the re-assigned partitions). Note, when assigning partitions to different partition servers no data is moved around on disk, since all of the partition data is stored in the DFS server layer and accessible from any partition server. The storage system ensures that all partitions are always served.

DFS Server Failure – If the storage system determines a DFS server is unavailable, the partition layer stops using the DFS server for reading and writing while it is unavailable. Instead, the partition layer uses the other available DFS servers which contain the other replicas of the data. If a DFS Server is unavailable for too long, we generate additional replicas of the data in order to keep the data at a healthy number of replicats for durability.

Upgrade Domains and Rolling Upgrade

A concept orthogonal to fault domains is what we call upgrade domains. Servers for each of the 3 layers are spread evenly across the different fault domains, and upgrade domains for the storage service. This way if a fault domain goes down we lose at most 1/X of the servers for a given layer, where X is the number of fault domains. Similarly, during a service upgrade at most 1/Y of the servers for a given layer are upgraded at a given time, where Y is the number of upgrade domains. To achieve this, we use rolling upgrades, which allows us to maintain high availability when upgrading the storage service.

We upgrade a single domain at a time for our storage service using rolling upgrades. A key part for maintaining availability during upgrade is that before upgrading a given domain, we proactively offload all the partitions being served on partition servers in that upgrade domain. In addition, we mark the DFS servers in that upgrade domain as being upgraded so they are not used while the upgrade is going on. This preparation is done before upgrading the domain, so that when we upgrade we reduce the impact on the service to maintain high availability.

After an upgrade domain has finished upgrading we allow the servers in that domain to serve data again. In addition, after we upgrade a given domain, we validate that everything is running fine with the service before going to the next upgrade domain. This process allows us to verify production configuration, above and beyond the pre-release testing we do, on just a small percentage of servers in the first few upgrade domains before upgrading the whole service. Typically if something is going to go wrong during an upgrade, it will occur when upgrading the first one or two upgrade domains, and if something doesn’t look quite right we pause upgrade to investigate, and we can even rollback to the prior version of the production software if need be.

Now we will go through the lower to layers of our system in more detail, starting with the DFS Layer.

DFS Layer and Replication

Durability for Windows Azure Storage is provided through replication of your data, where all data is replicated multiple times. The underlying replication layer is a Distributed File System (DFS) with the data being spread out over hundreds of storage nodes. Since the underlying replication layer is a distributed file system, the replicas are accessible from all of the partition servers as well as from other DFS servers.

The DFS layer stores the data in what are called “extents”. This is the unit of storage on disk and unit of replication, where each extent is replicated multiple times. The typical extent sizes range from approximately 100MB to 1GB in size.

When storing a blob in a Blob Container, entities in a Table, or messages in a Queue, the persistent data is stored in one or more extents. Each of these extents has multiple replicas, which are spread out randomly over the different DFS servers providing “Data Spreading”. For example, a 10GB blob may be stored across 10 one-GB extents, and if there are 3 replicas for each extent, then the corresponding 30 extent replicas for this blob could be spread over 30 different DFS servers for storage. This design allows Blobs, Tables and Queues to span multiple disk drives and DFS servers, since the data is broken up into chunks (extents) and the DFS layer spreads the extents across many different DFS servers. This design also allows a higher number of IOps and network BW for accessing Blobs, Tables, and Queues as compared to the IOps/BW available on a single storage DFS server. This is a direct result of the data being spread over multiple extents, which are in turn spread over different disks and different DFS servers, since any of the replicas of an extent can be used for reading the data.

For a given extent, the DFS has a primary server and multiple secondary servers. All writes go through the primary server, which then sends the writes to the secondary servers. Success is returned back from the primary to the client once the data is written to at least 3 DFS servers. If one of the DFS servers is unreachable when doing the write, the DFS layer will choose more servers to write the data to so that (a) all data updates are written at least 3 times (3 separate disks/servers in 3 separate fault+upgrade domains) before returning success to the client and (b) writes can make forward progress in the face of a DFS server being unreachable. Reads can be processed from any up-to-date extent replica (primary or secondary), so reads can be successfully processed from the extent replicas on its secondary DFS servers.

The multiple replicas for an extent are spread over different fault domains and upgrade domains, therefore no two replicas for an extent will be placed in the same fault domain or upgrade domain. Multiple replicas are kept for each data item, so if one fault domain goes down, there will still be healthy replicas to access the data from, and the system will dynamically re-replicate the data to bring it back to a healthy number of replicas. During upgrades, each upgrade domain is upgraded separately, as described above. If an extent replica for your data is in one of the domains currently being upgraded, the extent data will be served from one of the currently available replicas in the other upgrade domains not being upgraded.

A key principle of the replication layer is dynamic re-replication and having a low MTTR (mean-time-to-recovery). If a given DFS server is lost or a drive fails, then all of the extents that had a replica on the lost node/drive are quickly re-replicated to get those extents back to a healthy number of replicas. Re-replication is accomplished quickly, since the other healthy replicas for the affected extents are randomly spread across the many DFS servers in different fault/upgrade domains, providing sufficient disk/network bandwidth to rebuild replicas very quickly. For example, to re-replicate a failed DFS server with many TBs of data, with potentially 10s of thousands of lost extent replicas, the healthy replicas for those extents are potentially spread across hundreds to thousands of storage nodes and drives. To get those extents back up to a healthy number of replicas, all of those storage nodes and drives can be used to (a) read from the healthy remaining replicas, and (b) write another copy of the lost replica to a random node in a different fault/upgrade domain for the extent. This recovery process allows us to leverage the available network/disk resources across all of the nodes in the storage service to potentially re-replicate a lost storage node within minutes, which is a key property to having a low MTTR in order to prevent data loss.

Another important property of the DFS replication layer is checking and scanning data for bit rot. All data written has a checksum (internal to the storage system) stored with it. The data is continually scanned for bit rot by reading the data and verifying the checksum. In addition, we always validate this internal checksum when reading the data for a client request. If an extent replica is found to be corrupt by one of these checks, then the corrupted replica is discarded and the extent is re-replicated using one of the valid replicas in order to bring the extent back to healthy level of replication.

Geo-Replication

Windows Azure Storage provides durability by constantly maintaining multiple healthy replicas for your data. To achieve this, replication is provided within a single location (e.g., US South), across different fault and upgrade domains as described above. This provides durability within a given location. But what if a location has a regional disaster (e.g., wild fire, earthquake, etc.) that can potentially affect an area for many miles?

We are working on providing a feature called geo-replication, which replicates customer data hundreds of miles between two locations (i.e., between North and South US, between North and West Europe, and between East and Southeast Asia) to provide disaster recovery in case of regional disasters. The geo-replication is in addition to the multiple copies maintained by the DFS layer within a single location described above. We will have more details in a future blog post on how geo-replication works and how it provides geo-diversity in order to provide disaster recovery if a regional disaster were to occur.

Load Balancing Hot DFS Servers

Windows Azure Storage has load balancing at the partition layer and also at the DFS layer. The partition load balancing addresses the issue of a partition server getting too many requests per second for it to handle for the partitions it is serving, and load balancing those partitions across other partition servers to even out the load. The DFS layer is instead focused on load balancing the I/O load to its disks and the network BW to its servers.

The DFS servers can get too hot in terms of the I/O and BW load, and we provide automatic load balancing for DFS servers to address this. We provide two forms of load balancing at the DFS layer:

Read Load Balancing - The DFS layer maintains multiple copies of data through the multiple replicas it keeps, and the system is built to allow reading from any of the up to date replica copies. The system keeps track of the load on the DFS servers. If a DFS server is getting too many requests for it to handle, partition servers trying to access that DFS server will be routed to read from other DFS servers that are holding replicas of the data the partition server is trying to access. This effectively load balances the reads across DFS servers when a given DFS server gets too hot. If all of the DFS servers are too hot for a given set of data accessed from partition servers, we have the option to increase the number of copies of the data in the DFS layer to provide more throughput. However, hot data is mostly handled by the partition layer, since the partition layer caches hot data, and hot data is served directly from the partition server cache without going to the DFS layer.

Write Load Balancing – All writes to a given piece of data go to a primary DFS server, which coordinates the writes to the secondary DFS servers for the extent. If any of the DFS servers becomes too hot to service the requests, the storage system will then choose different DFS servers to write the data to.

Why Both a Partition Layer and DFS Layer?

When describing the architecture, one question we get is why do we have both a Partition layer and a DFS layer, instead of just one layer both storing the data and providing load balancing?

The DFS layer can be thought of as our file system layer, it understand files (these large chunks of storage called extents), how to store them, how to replicate them, etc, but it doesn’t understand higher level object constructs nor their semantics. The partition layer is built specifically for managing and understanding higher level data abstractions, and storing them on top of the DFS.

The partition layer understands what a transaction means for a given object type (Blobs, Entities, Messages). In addition, it provides the ordering of parallel transactions and strong consistency for the different types of objects. Finally, the partition layer spreads large objects across multiple DFS server chunks (called extents) so that large objects (e.g., 1 TB Blobs) can be stored without having to worry about running out of space on a single disk or DFS server, since a large blob is spread out over many DFS servers and disks.

Partitions and Partition Servers

When we say that a partition server is serving a partition, we mean that the partition server has been designated as the server (for the time being) that controls all access to the objects in that partition. We do this so that for a given set of objects there is a single server ordering transactions to those objects and providing strong consistency and optimistic concurrency, since a single server is in control of the access of a given partition of objects.

In the prior scalability targets post we described that a single partition can process up to 500 entities/messages per second. This is because all of the requests to a single partition have to be served by the assigned partition server. Therefore, it is important to understand the scalability targets and the partition keys for Blobs, Tables and Queues when designing your solutions (see the upcoming posts focused on getting the most out of Blobs, Tables and Queues for more information).

Load Balancing Hot Partition Servers

It is important to understand that partitions are not tied to specific partition servers, since the data is stored in the DFS layer. The partition layer can therefore easily load balance and assign partitions to different partition servers, since any partition server can potentially provide access to any partition.

The partition layer assigns partitions to partition severs based on each partition’s load. A given partition server may serve many partitions, and the Partition Master continuously monitors the load on all partition servers. If it sees that a partition server has too much load, the partition layer will automatically load balance some of the partitions from that partition server to a partition server with low load.

When reassigning a partition from one partition server to another, the partition is offline only for a handful seconds, in order to maintain high availability for the partition. Then in order to make sure we do not move partitions around too much and make too quick of decisions, the time it takes to decide to load balance a hot partition server is on the order of minutes.

Summary

The Windows Azure Storage architecture had three main layers – Front-End layer, Partition layer, and DFS layer. For availability, each layer has its own form of automatic load balancing and dealing with failures and recovery in order to provide high availability when accessing your data. For durability, this is provided by the DFS layer keeping multiple replicas of your data and using data spreading to keep a low MTTR when failures occur. For consistency, the partition layer provides strong consistency and optimistic concurrency by making sure a single partition server is always ordering and serving up access to each of your data partitions.

Hi, how can we achieve data consistency across data-centers? assume there are two populations of users in two locations A and B, and it would be faster for each population to go the nearby data-center. Assume each employee record is a single blob. Is there a way to keep data consistent if multiple writes/updates happen on the same employee record but on two different data-centers? or do I have to implement my own logic for that?

If you want to have consistent writes across multiple data centers, then that has to be done at the application level. The geo-replication referred to above is asynchronous, where the update is first committed at the primary data center and success returned back to the client. Then in the background the update is sent from the primary to the secondary data center. I’ll have more information posted about geo-replication by the end of the year.