Data hosted on long-term storage systems experience gradual changes in
access patterns as part of their information lifecycles. For example,
empirical studies by companies such as Facebook show that as image data
age beyond their creation times, they become more and more unlikely to be
accessed by users, with access rates dropping exponentially at times [1].
Long retention periods, as is the case with data stored on cold storage
systems like Swift, increase the possibility of such changes.

Tiering is an important feature provided by many traditional file & block
storage systems to deal with changes in data “temperature”. It enables
seamless movement of inactive data from high performance storage media to
low-cost, high capacity storage media to meet customers’ TCO (total cost of
ownership) requirements. As scale-out object storage systems like Swift are
starting to natively support multiple media types like SSD, HDD, tape and
different storage policies such as replication and erasure coding, it becomes
imperative to complement the wide range of available storage tiers (both
virtual and physical) with automated data tiering.

Swift users and operators can adapt to changes in access characteristics of
objects by transparently converting their storage policies to cater to the
goal of matching overall business needs ($/GB, performance, availability) with
where and how the objects are stored.

The following are some examples of how objects can be moved between Swift
containers of different storage policies as they age.

[SSD-based container] -> [HDD-based container]

[HDD-based container] -> [Tape-based container]

[Replication policy container] -> [Erasure coded policy container]

In some customer environments, a Swift container may not be the last storage
tier. Examples of archival-class stores lower in cost than Swift include
specialized tape-based systems [2] and public cloud archival solutions such as
Amazon Glacier and Google Nearline storage. Analogous to this proposed feature
of tiering in Swift, Amazon S3 already has built-in support to move
objects between S3 and Glacier based on user-defined rules. Red Hat Ceph has
recently added tiering capabilities as well.

The main goal of this document is to propose a tiering feature in Swift that
enables seamless movement of objects between containers belonging to different
storage policies. It is “seamless” because users will not experience any
disruption in namespace, access API, or availability of the objects subject to
tiering.

Through new Swift API enhancements, Swift users and operators alike will have
the ability to specify a tiering relationship between two containers and the
associated data movement rules.

The focus of this proposal is to identify, create and bring together the
necessary building blocks towards a baseline tiering implementation natively
within Swift. While this narrow scope is intentional, the expectation is that
the baseline tiering implementation will lay the foundation and not preclude
more advanced tiering features in the future.

The following in-progress Swift features (aka specs) have been identified as
core dependencies for this tiering proposal.

Swift Symbolic Links [3]

Changing Storage Policies [4]

A few other specs are classified as nice-to-have dependencies, meaning that
if they evolve into full implementations we will be able to demonstrate the
tiering feature with advanced use cases and capabilities. However, they are
not considered mandatory requirements for the first version of tiering.

The proposed tiering implementation depends on several building blocks, some
of which are unique to tiering, like the requisite API changes. They will be
described in their entirety. Others like symlinks are independent features and
have uses beyond tiering. Instead of re-inventing the wheel, the tiering
implementation aims to leverage specific constructs that will be available
through these in-progress features.

For a quick overview of the tiering implementation, please refer to the Figure
(images/tiering_overview.png). It highlights the flow of actions taking place
within the proposed tiering engine.

1. Swift client creates a tiering relationship between two Swift containers by
marking the source container with appropriate metadata.
2. A background process named tiering-coordinator examines the source container
and iterates through its objects.
3. Tiering-coordinator identifies candidate objects for movement and de-stages
each object to the target container by issuing a copy request to an object server.
4. After an object is copied, tiering-coordinator replaces it with a symlink in
the source container pointing to the corresponding object in the target container.

The metadata values can be set during the creation of the source container
(PUT) operation or they can be set later as part of a container metadata
update (POST) operation. Object age refers to the time elapsed since the
object’s creation time (creation time is stored with the object as
‘X-Timestamp’ header).

The user semantics of setting the X-Container-Tiering-* container metadata are
as follows.
When objects in the source container become older than the specified threshold
time, they become candidates for being de-staged to the target container. There
are no guarantees on when exactly they will be moved or the precise location of
the objects at any given time. Swift will operate on them asynchronously and
relocate objects based on user-specified tiering rules. Once the tiering
metadata is set on the source container, the user can expect levels of
performance, reliability, etc. for its objects commensurate with the storage
policy of either the source or target container.
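As a rough illustration of what a client might send with the container PUT or POST, the sketch below assembles the tiering headers for a source container. Only the X-Container-Tiering-* prefix and X-Container-Tiering-Age appear in this proposal; the X-Container-Tiering-Target name, the use of seconds as the unit, and the helper name build_tiering_headers are assumptions made for the example.

```python
def build_tiering_headers(target_container, age_seconds):
    """Return container metadata headers describing a tiering rule.

    Hypothetical helper: the target header name and the unit (seconds)
    are assumptions, not part of the proposed API.
    """
    if not target_container or "/" in target_container:
        raise ValueError("target must be a bare container name")
    if age_seconds <= 0:
        raise ValueError("age threshold must be positive")
    return {
        "X-Container-Tiering-Target": target_container,  # assumed header name
        "X-Container-Tiering-Age": str(int(age_seconds)),
    }


# Tier objects older than 30 days into a container named "archive".
headers = build_tiering_headers("archive", 30 * 24 * 3600)
```

The resulting dictionary could be attached to either the container creation request or a later metadata update.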

One can override the tiering metadata for individual objects in the source
container by setting per-object metadata of the form X-Object-Tiering-*.

Presence of tiering metadata on an object will imply that it will take
precedence over the tiering metadata set on the hosting container. However,
if a container is not tagged with any tiering metadata, the objects inside it
will not be considered for tiering regardless of whether they are tagged with
any tiering related metadata or not. Also, if the tiering age threshold on the
object metadata is lower than the value set on the container, it will not take
effect until the container age criterion is met.
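The precedence rules above can be condensed into a small function. This is a sketch of one reading of those rules, not a definitive implementation: the function names are hypothetical, ages are plain numbers in the same unit, and "no metadata" is modeled as None.

```python
def effective_age_threshold(container_age, object_age=None):
    """Return the age an object must exceed before it may be tiered.

    container_age: threshold from the container metadata, or None if the
                   container carries no tiering metadata.
    object_age:    optional per-object override, or None.
    """
    if container_age is None:
        # Untagged container: its objects are never considered.
        return None
    if object_age is None:
        return container_age
    # The object value takes precedence, but a lower value cannot take
    # effect before the container criterion is met, so the larger wins.
    return max(container_age, object_age)


def is_candidate(now, created_at, container_age, object_age=None):
    threshold = effective_age_threshold(container_age, object_age)
    return threshold is not None and (now - created_at) > threshold
```

For example, with a container threshold of 100 and an object override of 10, the object still waits until its age exceeds 100.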

An important invariant preserved by the tiering feature is the namespace of
objects. As will be explained in later sections, after objects are moved they
will be replaced immediately by symlinks that will allow users to continue
foreground operations on objects as if no migrations have taken place. Please
refer to section 7 on open questions for further commentary on the API topic.

To summarize, here are the steps that a Swift user must perform in order to
initiate tiering between objects from a source container (S) to a target
container (T) over time.

It will also be possible to create cascading tiering relationships between
more than two containers. For example, a sequence of tiering relationships
between containers C1 -> C2 -> C3 can be established by setting appropriate
tiering metadata on C1 and C2. When an object is old enough to be moved from
C1, it will be deposited in C2. The timer will then start on the moved object
in C2 and depending on the age settings on C2, the object will eventually be
migrated to C3.
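To make the restarting timer concrete, the following sketch computes the earliest time at which an object could leave each container in a cascade. It assumes the object is moved the instant it becomes eligible, which the proposal does not guarantee, so the results are lower bounds only. The function name is hypothetical.

```python
def earliest_departure_times(created_at, age_thresholds):
    """Earliest time the object can leave each container in the chain.

    age_thresholds lists the age setting of each source container in
    order (for C1 -> C2 -> C3, the settings on C1 and C2).
    """
    times = []
    clock = created_at
    for age in age_thresholds:
        # The timer restarts when the object lands in the next container.
        clock += age
        times.append(clock)
    return times


# Ages in days: 30 on C1, 90 on C2.
schedule = earliest_departure_times(0, [30, 90])  # [30, 120]
```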

The tiering-coordinator is a background process similar to container-sync,
container-reconciler and other container-* processes running on each container
server. We can potentially re-use one of the existing container processes,
specifically either container-sync or container-reconciler to perform the job of
tiering-coordinator, but for the purposes of this discussion it will be assumed
that it is a separate process.

The key actions performed by tiering-coordinator are:

(a) Walk through containers marked with tiering metadata

(b) Identify candidate objects for tiering within those containers

(c) Initiate copy requests on candidate objects

(d) Replace source objects with corresponding symlinks

We will discuss (a) and (b) in this section and cover (c) and (d) in subsequent
sections. Note that in the first version of tiering, only one metric,
object age, will be used to determine the eligibility of an object for
migration.

The tiering-coordinator performs its operations in a series of rounds. In each
round, it iterates through containers whose SQLite DBs it has direct access to
on the container server it is running on. It checks if the container has the
right X-Container-Tiering-* metadata. If present, it starts the scanning process
to identify candidate objects. The scanning process leverages a convenient (but
not necessary) property of the container DB that objects are listed in the
chronological order of their creation times. That is, the first index in the
container DB points to the object with the oldest creation time, followed by the
next younger object and so on. As such, the scanning process described below is
optimized for the object age criterion chosen for tiering v1 implementation.
For extending to other tiering metrics, we refer the reader to section 6.1 for
discussion.

Each container DB will have two persistent markers to track the progress of
tiering – tiering_sync_start and tiering_sync_end. The marker tiering_sync_start
refers to the starting index in the container DB up to which objects have already
been processed. The marker tiering_sync_end refers to the index beyond which
objects have not yet been considered for tiering. All the objects that fall
between the two markers are the ones for which tiering is currently in progress.
Note that the presence of persistent markers in the container DB helps with
quickly resuming from previous work done in the event of container server
crash/reboot.

When a container is selected for tiering for the first time, both the markers
are initialized to -1. If the first object is old enough to meet the
X-Container-Tiering-Age criterion, tiering_sync_start is set to 0. Then the
second marker tiering_sync_end is advanced to the lesser of two values:
(i) tiering_sync_start + tier_max_objects_per_round (the latter will be a
configurable value in /etc/swift/container.conf) or (ii) the largest
index in the container DB whose corresponding object meets the tiering age
criterion.

The above marker settings will ensure two invariants. First, all objects
between (and including) tiering_sync_start and tiering_sync_end are candidates
for moving to the target container. Second, it will guarantee that the number
of objects processed on the container in a single round is bounded by the
configuration parameter (tier_max_objects_per_round, say = 200). This will
ensure that the coordinator process will round robin effectively amongst all
containers on the server per round without spending an undue amount of time on
only a few.
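The marker arithmetic can be sketched as follows. The sketch models the container DB as a list of creation timestamps ordered oldest first and treats tiering_sync_start as the last index already processed (-1 when nothing has been processed), so the window for a round is the indices after tiering_sync_start up to and including tiering_sync_end. That half-open reading is an interpretation chosen to keep the per-round bound exact; the function name and parameters are hypothetical.

```python
from bisect import bisect_right


def advance_end_marker(created_at, sync_start, now, age_threshold,
                       max_per_round):
    """Return the new tiering_sync_end for one round.

    created_at: creation timestamps, oldest first (index = DB position).
    sync_start: last index already processed, or -1.
    """
    cutoff = now - age_threshold
    # Index of the youngest object that is still old enough, -1 if none.
    largest_eligible = bisect_right(created_at, cutoff) - 1
    # Lesser of the per-round bound and the eligibility limit, but never
    # behind the start marker.
    return max(sync_start, min(sync_start + max_per_round, largest_eligible))


timestamps = [0, 10, 20, 30, 40, 50]
first_end = advance_end_marker(timestamps, -1, now=100, age_threshold=65,
                               max_per_round=2)
```

In this example four objects (indices 0 to 3) are old enough, but the first round stops at index 1 because of the per-round bound; a second round starting from 1 would reach index 3.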

After the markers are fixed, tiering-coordinator will issue a copy request
for each object within the range. When the copy requests are completed, it
updates tiering_sync_start = tiering_sync_end and moves on to the next
container. When tiering-coordinator re-visits the same container after
completing the current round, it restarts the scanning routine described
above from tiering_sync_start = tiering_sync_end (except they are not both
-1 this time).

In a typical Swift cluster, each container DB is replicated three times and
resides on multiple container servers. Therefore, without proper
synchronization, tiering-coordinator processes can end up conflicting with
each other by processing the same container and the same objects within it.
This can potentially lead to race conditions with non-deterministic behavior.
We can overcome this issue by adopting the approach of divide-and-conquer
employed by the container-sync process. The range of object indices between
(tiering_sync_start, tiering_sync_end) can be initially split up into as
many disjoint regions as the number of tiering-coordinator processes
operating on the same container. As they work through the object indices,
each process might additionally complete others’ portions depending on the
collective progress. For a detailed description of how container-sync
processes implicitly communicate and make group progress, please refer
to [7].
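As an illustration of the initial split only, the sketch below divides an inclusive index range into contiguous disjoint regions, one per coordinator process. Contiguous regions are an assumption made for simplicity; the sketch does not model how processes later pick up each other's unfinished portions, and it is not a description of how container-sync itself partitions work. The function name is hypothetical.

```python
def split_range(first, last, workers):
    """Split indices first..last (inclusive) into disjoint regions.

    Returns a list of (low, high) inclusive pairs, at most one per worker.
    """
    total = last - first + 1
    if total <= 0 or workers <= 0:
        return []
    base, extra = divmod(total, workers)
    regions = []
    low = first
    for i in range(workers):
        # The first `extra` workers take one additional index each.
        size = base + (1 if i < extra else 0)
        if size == 0:
            break
        regions.append((low, low + size - 1))
        low += size
    return regions


# Three container replicas, ten candidate objects.
regions = split_range(0, 9, 3)  # [(0, 3), (4, 6), (7, 9)]
```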

For each candidate object that the tiering-coordinator deems eligible to move to
the target container, it issues an ‘object copy’ request using an API call
supported by the object servers. The API call will map to a method used by
object-transferrer daemons running on the object servers. The
tiering-coordinator can select any of the object servers (by looking up the ring
data structure corresponding to the object in the source container policy) as a
destination for the request.

The object-transferrer daemon is supposed to be optimized for converting an
object from one storage policy to another. As per the ‘Changing policies’ spec,
the object-transferrer daemon will be equipped with the right techniques to move
objects between Replication -> EC, EC -> EC, etc. Alternatively, in the absence
of object-transferrer, the tiering coordinator can simply make use of the
server-side ‘COPY’ API that vanilla Swift exposes to regular clients. It can
send the COPY request to a swift proxy server to clone the source object into
the target container. The proxy server will perform the copy by first reading in
(GET request) the object from any of the source object servers and creating a
copy (PUT request) of the object in the target object servers. While this will
work correctly for the purposes of the tiering coordinator, making use of the
object-transferrer interface is likely to be a better option. Leveraging the
specialized code in object-transferrer through a well-defined interface for
copying an object between two different storage policy containers will make the
overall tiering process efficient.
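For the fallback path through the proxy server, the sketch below builds the pieces of a server-side COPY request: the verb, the URL path of the source object, and the Destination header naming the target. It only constructs the request and sends nothing. The account name in the usage line and the helper name are placeholders.

```python
from urllib.parse import quote


def build_copy_request(account, source_container, target_container, obj):
    """Return (method, path, headers) for a Swift server-side COPY."""
    path = quote(f"/v1/{account}/{source_container}/{obj}")
    headers = {
        # The object keeps its name in the target container.
        "Destination": quote(f"/{target_container}/{obj}"),
    }
    return "COPY", path, headers


method, path, headers = build_copy_request("AUTH_test", "S", "T", "O")
```

A real coordinator would also need to attach authentication and handle the proxy's response, which is outside the scope of this sketch.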

Here is an example interface represented by a function call in the
object-transferrer code:

def copy_object(source_obj_path, target_obj_path):
    """Copy the object at source_obj_path to target_obj_path."""

The above method can be a wrapper over similar functionality used by the
object-transferrer daemon. The tiering-coordinator will use this interface to
call the function through an HTTP call.

copy_object('/A/S/O', '/A/T/O')

where S is the source container and T is the target container. Note that the
object name in the target container will be the same as in the source container.

Upon receiving the copy request, the object server will first check if the
source path is a symlink object. If it is a symlink, it will respond with an
error to the tiering-coordinator to indicate that a symlink already exists.
This behavior will ensure idempotence and guard against situations where
tiering-coordinator crashes and retries a previously completed object copy
request. Also, it avoids tiering for sparse objects such as symlinks created
by users. Secondly, the object server will check if the source object has
tiering metadata in the form of X-Object-Tiering-* that overrides the default
tiering settings on the source container. It may or may not perform the object
copy depending on the result.

After an object is successfully copied to the destination container, the
tiering-coordinator will issue a ‘symlink create’ request to the proxy server to
replace the source object by a reference to the destination object. Waiting
until the object copy is completed before replacing it by a symlink ensures
safety in case of failures. The system could end up with an extra target
object without a symlink pointing to it, but never the converse (a symlink
without a target object), which would constitute data loss. Note that the
symlink feature is currently work-in-progress and will also be available as an
external API to Swift clients.

When the symlink is created by the tiering-coordinator, it will need to ensure
that the original object’s ‘X-Timestamp’ value is preserved on the symlink
object. Therefore, it is proposed that in the symlink creation request, the
original time field can be provided (tiering-coordinator can quickly read the
original values from container DB entry) as object user metadata, which is
translated internally to a special sysmeta field by the symlink middleware.
On subsequent user requests, the sysmeta field storing the correct creation
timestamp will be sent to the user.
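The ordering and idempotence properties described above can be demonstrated with a toy in-memory model. The dictionary-based store, the field names and the function name are all hypothetical; real objects, symlinks and sysmeta are not modeled. The sketch also assumes the copy in the target container is stamped with the time of the move.

```python
def tier_object(store, src_path, dst_path, now):
    """Copy src to dst, then replace src with a symlink.

    store maps object paths to dicts. Returns a short status string.
    """
    obj = store[src_path]
    if "symlink_to" in obj:
        # A retried request after a crash, or a user-created symlink:
        # refuse, so that repeating the call changes nothing.
        return "already-symlink"
    # Step 1: copy first. A crash after this point leaves at worst an
    # extra target object with no symlink pointing to it.
    store[dst_path] = {"data": obj["data"], "timestamp": now}
    # Step 2: replace the source, preserving its original creation time.
    store[src_path] = {"symlink_to": dst_path,
                       "timestamp": obj["timestamp"]}
    return "tiered"


store = {"/A/S/O": {"data": b"payload", "timestamp": 100.0}}
first = tier_object(store, "/A/S/O", "/A/T/O", now=500.0)
second = tier_object(store, "/A/S/O", "/A/T/O", now=600.0)
```

The second call is rejected and leaves both entries untouched, which is the behavior a crashed and restarted coordinator relies on.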

With the symlink successfully created, Swift users can continue to issue object
requests like GET, PUT to the original namespace /Account/Container/Object. The
Symlink middleware will ensure that the swift users do not notice the presence
of a symlink object unless a query parameter ‘?symlink=true’ [3] is explicitly
provided with the object request.

Users can also continue to read and update object metadata as before. It is not
entirely clear at the time of this writing if the symlink object will store a
copy of user metadata in its own extended attributes or if it will fetch the
metadata from the referenced object for every HEAD/GET on the object. We will
defer to whichever implementation the symlink feature chooses to provide.

An interesting race condition is possible due to the time window between object
copy request and symlink creation. If there is an interim PUT request issued by
a swift user between the two, it will be overwritten by the internal symlink
created by the tiering-coordinator. This is incorrect behavior that we need
to protect against. We can use the same technique [8] (with help of a second
vector timestamp) that container-reconciler uses to resolve a similar race
condition. The tiering-coordinator, at the time of symlink creation, can detect
the race condition and undo the COPY request. It will have to delete the object
that was created in the destination container. Though this is wasted work in
the face of such race conditions, we expect them to be rare scenarios. If the
user conceives tiering rules properly, there ought to be little to no
foreground traffic for the object that is being tiered.
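A simplified version of the detect-and-undo step is sketched below. It substitutes a plain timestamp comparison for the vector-timestamp technique of [8], which is not modeled here, and it runs single-threaded, so it ignores the atomicity a real check-and-replace would need. The in-memory store and all names are hypothetical.

```python
def create_symlink_if_unchanged(store, src_path, dst_path,
                                timestamp_at_copy):
    """Create the symlink only if the source was not rewritten meanwhile.

    Returns True if the symlink was created, False if the copy was undone.
    """
    current = store.get(src_path)
    if current is None or current["timestamp"] != timestamp_at_copy:
        # An interim PUT (or DELETE) happened: discard the stale copy and
        # leave the user's newer object in place.
        store.pop(dst_path, None)
        return False
    store[src_path] = {"symlink_to": dst_path,
                       "timestamp": timestamp_at_copy}
    return True


# The source was overwritten at t=200 after being copied at t=100.
store = {
    "/A/S/O": {"data": b"new", "timestamp": 200.0},
    "/A/T/O": {"data": b"old", "timestamp": 150.0},
}
created = create_symlink_if_unchanged(store, "/A/S/O", "/A/T/O", 100.0)
```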

The first version of tiering implementation will be heavily tailored (especially
the scanning mechanism of tiering-coordinator) to the object age criterion. The
convenient property of container DBs, which store objects in the same order as
they are created/overwritten, lends itself to very efficient linear scanning for
candidate objects.

In the future, we should be able to support advanced criteria such as read
frequency counts, object size, metadata-based selection, etc. For example,
consider the following hypothetical criterion:

“Tier objects from container S to container T if older than 1 month AND size >
1GB AND tagged with metadata ‘surveillance-video’”
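Expressed as code, such a compound criterion is simply a predicate over per-object attributes. The sketch below is illustrative only: the ObjectInfo record, the 30-day month, the reading of 1GB as 1024**3 bytes and the representation of tags as a set are all assumptions.

```python
from dataclasses import dataclass, field

MONTH_SECONDS = 30 * 24 * 3600  # assumed length of "1 month"
ONE_GB = 1024 ** 3              # assumed meaning of "1GB"


@dataclass
class ObjectInfo:
    age_seconds: float
    size_bytes: int
    tags: set = field(default_factory=set)


def matches_criterion(obj):
    """Older than 1 month AND larger than 1GB AND tagged as video."""
    return (obj.age_seconds > MONTH_SECONDS
            and obj.size_bytes > ONE_GB
            and "surveillance-video" in obj.tags)


old_video = ObjectInfo(45 * 24 * 3600, 2 * ONE_GB, {"surveillance-video"})
small_video = ObjectInfo(45 * 24 * 3600, 1024, {"surveillance-video"})
```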

When the metadata search feature [5] is available in Swift, tiering-coordinator
should be able to run queries to quickly retrieve the set of object names that
match ad-hoc criteria on both user and system metadata. As the metadata search
feature evolves, we should be able to leverage it to add custom metadata such
as read counts, etc., for our purposes.

The first implementation of tiering will only support object movement between
Swift containers. In order to establish a tiering relationship between a swift
container and an external storage backend, the backend must be mounted in Swift
as a native container through the DiskFile API or other integration mechanisms.
For instance, a target container fully hosted on GlusterFS or Seagate Kinetic
drives can be created through Swift-on-file or Kinetic DiskFile implementations
respectively.

The Swift community believes that a similar integration approach is necessary
to support external storage systems as tiering targets. There is already work
underway to integrate tape-based systems in Swift. In the same vein, future
work is needed to integrate external systems like Amazon Glacier or vendor
archival products via DiskFile drivers or other means.

This section is structured as a series of questions and possible answers. With
more feedback from the swift community, the open issues will be resolved and
merged into the main document.

Q1: Can the target container exist on a different account than the source
container?

Ans: The proposed API assumes that the target container is always on the same
account as the source container. If this restriction is lifted, the proposed
API needs to be modified appropriately.

Q2: When the client sets the tiering metadata on the source container, should
the target container exist at that time? What if the user has no permissions on
the target container? When is all the error checking done?

Ans: The error checking can be deferred to the tiering-coordinator process. The
background process, upon detecting that the target container is unavailable, can
skip performing any tiering activity on the source container and move on to the
next container. However, it might be better to detect errors in the client path
and report early. If the latter approach is chosen, middleware functionality is
needed to sanity check tiering metadata set on containers.

Q3: How is the target container presented to the client? Would it be just like
any other container with read/write permissions?

Ans: The target container will be just like any other container. The client is
responsible for manipulating the contents in the target container correctly. In
particular, it should be aware that there might be symlinks in source container
pointing to target objects. Deletions or overwrites of objects directly using
the target container namespace could render some symlinks useless or obsolete.

Q4: What is the behavior when conflicting tiering metadata are set over a
period of time? For example, if the tiering age threshold is increased on a
container with a POST metadata operation, will previously de-staged objects
be brought back to the source container to match the new tiering rule?

Ans: Perhaps not. The new tiering metadata should probably only be applied to
objects that have not yet been processed by tiering-coordinator. Previous
actions performed by tiering-coordinator based on older metadata need not be
reversed.

Q5: When a user issues a PUT operation to an object that has been de-staged to
the target container earlier, what is the behavior?

Ans: The default symlink behavior should apply but it’s not clear what it will
be. Will an overwrite PUT cause the symlink middleware to delete both the
symlink and the object being pointed to?

Q6: When a user issues a GET operation to an object that has been de-staged to
the target container earlier, will it be promoted back to source container?

Ans: The proposed implementation does not promote objects back to an upper tier
in a way that is seamless to the user. If needed, such a behavior can be easily
added with the help of a tiering middleware in the proxy server.

Q7: There is a mention of the ability to set cascading tiering relationships
between multiple containers, C1 -> C2 -> C3. What if there is a cycle in this
relationship graph?

Ans: A cycle should be prevented, else we can run into at least one complicated
situation where a symlink might be pointing to an object on the same container
with the same name, thereby overwriting the symlink! It is possible to detect
cycles at the time of tiering metadata creation in the client path with a
tiering-specific middleware that is entrusted with the cycle detection by
iterating through existing tiering relationships.
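A minimal sketch of such a check follows. It assumes each container has at most one tiering target, so existing relationships can be held in a dictionary mapping source to target; how a middleware would actually gather those relationships is not addressed. The function name is hypothetical.

```python
def creates_cycle(relationships, source, target):
    """Return True if adding source -> target would close a cycle.

    relationships: dict mapping each source container to its target.
    """
    seen = {source}
    current = target
    # Follow the chain of targets until it ends or revisits a container.
    while current is not None:
        if current in seen:
            return True
        seen.add(current)
        current = relationships.get(current)
    return False


existing = {"C1": "C2", "C2": "C3"}
closes_loop = creates_cycle(existing, "C3", "C1")   # True
extends_chain = creates_cycle(existing, "C3", "C4")  # False
```

The walk visits each container at most once, so its cost is linear in the length of the existing chain.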

Q8: Are there any unexpected interactions of tiering with existing or new
features like SLO/DLO, encryption, container sharding, etc.?

Ans: SLO and DLO segments should continue to work as expected. If an object
server receives an object copy request for an SLO manifest object from a
tiering-coordinator, it will iteratively perform the copy for each constituent
object. Each constituent object will be replaced by a symlink. Encryption
should also work correctly as it is almost entirely orthogonal to the tiering
feature. Each object is treated as an opaque set of bytes by the tiering engine
and it does not pay any heed to whether the object is cipher text or not.
Dealing with container sharding might be tricky. Tiering-coordinator expects
to linearly walk through the indices of a container DB. If the container DB
is fragmented and stored in many different container servers, the scanning
process can get complicated. Any ideas there?