Tuesday, February 14, 2012

Under the hood of Swift. The Ring

This is the first post in series that summarizes our analysis of Swift architecture. We've tried to highlight some points that are not clear enough in the official documentation. Our primary base was an in-depth look into the source code. The Ring is the vital part of Swift architecture. This half database, half configuration file keeps track of where all data resides in the cluster. For each possible path to any stored entity in the cluster, the Ring points to the particular device on the particular physical node.

There are three types of entities that Swift recognizes: accounts, containers and objects. Each type has the ring of its own, but all three rings are put up the same way. Swift services use the same source code to create and query all three rings. Two Swift classes are responsible for this tasks: RingBuilder and Ring respectively.

Ring data structure

Every Ring of three in Swift is the structure that consists of 3 elements:

a list of devices in the cluster, also known as devs in the Ring class;

a list of lists of devices ids indicating partition to data assignments, stored in variable named _replica2part2dev_id;

an integer number of bits to shift an MD5-hashed path to the account/container/object to calculate the partition index for the hash (partition shift value, part_shift).

List of devices

A list of devices includes all storage devices (disks) known to the ring. Each element of this list is a dictionary of the following structure:

Key

Type

Value

id

integer

Index of the devices list

zone

integer

Zone the device resides in

weight

float

The relative weight of the device to the other devices in the ring

ip

string

IP address of server containing the device

port

integer

TCP port the server uses to serve requests for the device

device

string

Disk name of the device in the host system, e.g. sda1. It is used to identify disk mount point under /srv/node on the host system

meta

string

General-use field for storing arbitrary information about the device. Not used by servers directly

Some device management can be performed using values in the list. First, for the removed devices, the 'id' value is set to 'None'. Device IDs are generally not reused. Second, setting 'weight' to 0.0 disables the device temporarily, as no partitions will be assigned to that device.

Partitions assignment list

This data structure is a list of N elements, where N is the replica count for the cluster. The default replica count is 3. Each element of partitions assignment list is an array('H'), or Python compact efficient array of short unsigned integer values. These values are actually index into a list of devices (see previous section). So, each array('H') in the partitions assignment list represents mapping partitions to devices ID.

The ring takes a configurable number of bits from a path's MD5 hash and converts it to the long integer number. This number is used as an index into the array('H'). This index points to the array element that designates an ID of the device to which the partition is mapped. Number of bits kept from the hash is known as the partition power, and 2 to the partition power indicates the partition count.

For a given partition number, each replica's device will not be in the same zone as any other replica's device. Zones can be used to group devices based on physical locations, power separations, network separations, or any other attribute that could make multiple replicas unavailable at the same time.

Partition Shift Value

This is the number of bits taken from MD5 hash of '/account/[container/[object]]' path to calculate partition index for the path. Partition index is calculated by translating binary portion of hash into integer number.

Ring operation

The structure described above is stored as pickled (see Python pickle) and gzipped (see Python gzip.GzipFile) file. There are three files, one per ring, and usually their names are:

account.ring.gz
container.ring.gz
object.ring.gz

These files must exist in /etc/swift directory on every Swift cluster node, both Proxy and Storage, as services on all these nodes use it to locate entities in cluster. Moreover, ring files on all nodes in the cluster must have the same contents, so cluster can function properly.

There are no internal Swift mechanisms that can guarantee that the ring is consistent, i.e. gzip file is not corrupt and can be read. Swift services have no way to tell if all nodes have the same version of rings. Maintenance of ring files is administrator's responsibility. These tasks can be automated by means external to the Swift itself, of course.

The Ring allows any Swift service to identify which Storage node to query for the particular storage entity. Method Ring.get_nodes(account, container=None, obj=None) is used for identification of target Storage node for the given path (/account[/container[/object]]). It returns the tuple of partition and dictionary of nodes. The partition is used for constructing the local path to object file or account/container database. Nodes dictionary elements have the same structure as the devices in list of devices (see above).

Ring management

Swift services can not change the Ring. Ring is managed by swift-ring-builder script. When new Ring is created, first administrator should specify builder file and main parameter of the Ring: partition power (or partition shift value), number of replicas of each partition in cluster, and the time in hours before a specific partition can be moved in succession:

When the temporary builder file structure is created, administrator should add devices to the Ring. For each device, required values are zone number, IP address of the Storage node, port on which server is listening, device name (e.g. sdb1), optional device meta-data (e.g., model name, installation date or anything else) and device weight:

Device weight is used to distribute partitions between the devices. More the device weight, more partitions are going to be assigned to that device. Recommended initial approach is to use the same size devices across the cluster and set weight 100.0 to each device. For devices added later, weight should be proportional to the capacity. At this point, all devices that will initially be in the cluster, should be added to the Ring. Consistency of the builder file can be verified before creating actual Ring file:

In case of successful verification, the next step is to distribute partitions between devices and create actual Ring file. It is called 'rebalance' the Ring. This process is designed to move as few partitions as possible to minimize the data exchange between nodes, so it is important that all necessary changes to the Ring are made before rebalancing it:

The whole procedure must be repeated for all three rings: account, container and object. The resulting .ring.gz files should be pushed to all nodes in cluster. Builder files are also needed for the future changes to rings, so they should be backed up and kept in safe place. One of approaches is to put them to the Swift storage as ordinary objects.

Physical disk usage

Partition is essentially the block of data stored in the cluster. This does not mean, however, that disk usage is constant for all partitions. Distribution of objects between the partitions is based on the object path hash, not the object size or other parameters. Objects are not partitioned, which means that an object is kept as a single file in storage node file system (except very large objects, greater than 5Gb, which can be uploaded in segments - see the Swift documentation).

The partition mapped to the storage device is actually a directory in structure under /srv/node/<dev_name>. The disk space used by this directory may vary from partition to partition, depending on size of objects that have been placed to this partition by mapping hash of object path to the Ring.

In conclusion it should be said that the Swift Ring is a beautiful structure, though it lacks a degree of automation and synchronization between nodes. I'm going to write about how to solve these problems in one of the following posts.