This service provides a simple storage service for applications using RESTful interfaces, providing maximum data availability and storage capacity.

I explain here how some parts of the storage and replication in Swift works.

If you don’t know Swift and want to read a more “shallow” overview first, you can read John Dickinson’s Swift Tech Overview.

How Swift storage works

If we refer to the CAP theorem, Swift chose availability and partition tolerance and dropped consistency. That means that you’ll always get your data, they will be dispersed on many places, but you could get an old version of them (or no data at all) in some odd cases (like some server overload or failure). This compromise is made to allow maximum availability and scalability of the storage platform.

But there are mechanisms built into Swift to minimize the potential data inconsistency window: they are responsible for data replication and consistency.

Swift uses the principle of consistent hashing. It builds what it calls a ring. A ring represents the space of all possible computed hash values divided in equivalent parts. Each part of this space is called a partition.

The following schema (stolen from the Riak project) shows the principle nicely:

In a simple world, if you wanted to store some objects and distribute them on 4 nodes, you would split your hash space in 4. You would have 4 partitions, and computing hash(object) modulo 4 would tell you where to store your object: on node 0, 1, 2 or 3.

But since you want to be able to extend your storage cluster to more nodes without breaking the whole hash mapping and moving everything around, you need to build a lot more partitions. Let’s say we’re going to build 210 partitions. Since we have 4 nodes, each node will have 2<sup>10</sup> ÷ 4 = 256 partitions. If we ever want to add a 5th node, it’s easy: we just have to re-balance the partitions and move 1⁄4 of the partitions from each node to this 5th node. That means all our nodes will end up with 2<sup>10</sup> ÷ 5 ≈ 204 partitions. We can also define a weight for each node, in order for some nodes to get more partitions than others.

With 210 partitions, we can have up to 210 nodes in our cluster. Yeepee.

Concretely, when building one Swift ring, you’ll have to say how much partitions you want, and this is what this value is really about.

Data duplication

Now, to assure availability and partitioning (as seen in the CAP theorem) we also want to store replicas of our objects. By default, Swift stores 3 copies of every objects, but that’s configurable.

In that case, we need to store each partition defined above not only on 1 node, but on 2 others. So Swift adds another concept: zones. A zone is an isolated space that does not depends on other zone, so in case of an outage on a zone, the other zones are still available. Concretely, a zone is likely to be a disk, a server, or a whole cabinet, depending on the size of your cluster. It’s up to you to chose anyway.

Consequently, each partitions has not to be mapped to 1 host only anymore, but to N hosts. Each node will therefore store this number of partitions:

number of partition stored on one node = number of replicas × total number of partitions ÷ number of node

Examples:

We split the ring in 210 = 1024 partitions. We have 3 nodes. We want 3 replicas of data.

We split the ring in 211 = 2048 partitions. We have 5 nodes. We want 3 replicas of data.

→ Each node will store 2<sup>11</sup> × 3 ÷ 5 ≈ 1129 partitions.

We split the ring in 211 = 2048 partitions. We have 6 nodes. We want 3 replicas of data.

→ Each node will store 2<sup>11</sup> × 3 ÷ 6 = 1024 partitions.

Three rings to rule them all

In Swift, there is 3 categories of thing to store: account, container and objects.

An account is what you’d expect it to be, a user account. An account contains containers (the equivalent of Amazon S3’s buckets). Each container can contains user-defined key and values (just like a hash table or a dictionary): values are what Swift call objects.

Swift wants you to build 3 different and independent rings to store its 3 kind of things (accounts, containers and objects).

Internally, the two first categories are stored as SQLite databases, whereas the last one is stored using regular files.

Note that this 3 rings can be stored and managed on 3 completely different set of servers.

Data replication

Now that we have our storage theory in place (accounts, containers and objects distributed into partitions, themselves stored into multiple zones), let’s go the replication practice.

When you put something in one of the 3 rings (being an account, a container or an object) it is uploaded into all the zones responsible for the ring partition the object belongs to. This upload into the different zones is the responsibility of the swift-proxy daemon.

But if one of the zone is failing, you can’t upload all your copies in all zones at the upload time. So you need a mechanism to be sure the failing zone will catch up to a correct state at some point.

That’s the role of the swift-{container,account,object}-replicator processes. This processes are running on each node part of a zone and replicates their contents to nodes of the other zones.

When they run, they walk through all the contents from all the partitions on the whole file system and for each partition, issue a special REPLICATE HTTP request to all the other zones responsible for that same partition. The other zone responds with information about the local state of the partition. That allows the replicator process to decide if the remote zone has an up-to-date version of the partition.

In case of account and containers, it doesn’t check at the partition level, but check each account/container contained inside each partition.

If something is not up-to-date, it will be pushed using rsync by the replicator process. This is why you’ll read that the replication updates are “push based” in Swift documentation.

# Pseudo code describing replication process for accounts # The principle is exactly the same for containers for account in accounts: # Determine the partition used to store this account partition = hash(account) % number_of_partitions # The number of zone is the number of replicas configured for zone in partition.get_zones_storing_this_partition(): # Send a HTTP REPLICATE command to the remote swift-account-server process version_of_account = zone.send_HTTP_REPLICATE_for(account): if version_of_account < account.version() account.sync_to(zone)

This replication process is O(number of account × number of replicas). The more your number of account will increase and the more you will want replicas for your data, the more the replication time for your accounts will grow. The same rule applies for containers.

# Pseudo code describing replication process for objects for partition in partitions_storing_objects: # The number of zone is the number of replicas configured for zone in partition.get_zones_storing_this_partition(): # Send a HTTP REPLICATE command to the remote swift-object-server process verion_of_partition = zone.send_HTTP_REPLICATE_for(partition): if version_of_partition < partition.version() # Use rsync to synchronize the whole partition # and all its objects partition.rsync_to(zone)

This replication process is O(number of objects partitions × number of replicas). The more your number of objects partitions will increase, and the more you will want replicas for your data, the more the replication time for your objects will grow.

I think this is something important to know when deciding how to build your Swift architecture. Choose the right number the number of replicas, partitions and nodes.

Conclusion

I recommend to chose correctly the different Swift parameters for your setup. The replication process optimization consists in having the minimum amount of partitions per node, which can be done by: