Ceph: A Linux petabyte-scale distributed file system

Exploring the Ceph file system and ecosystem

M. Tim Jones
Published on June 04, 2010


As an architect in the storage industry, I have an affinity for file
systems. These systems are the user interfaces to storage systems, and
although they all tend to offer a similar set of features, they can also
differ in notable ways. Ceph is no different, and it offers some of the
most interesting features you'll find in a file system.

Ceph began as a PhD research project in storage systems by Sage Weil at the
University of California, Santa Cruz (UCSC). As of late March 2010, you can
find Ceph in the mainline Linux kernel (since 2.6.34).
Although Ceph may not be ready for production environments, it's still
useful for evaluation purposes. This article explores the Ceph file system
and the unique features that make it an attractive alternative for
scalable distributed storage.

Ceph goals

Why "Ceph"?

"Ceph" is an odd name for a file system and breaks the typical
acronym trend that most follow. The name is a reference to
the mascot at UCSC (Ceph's origin), which happens to be "Sammy," the banana slug, a
shell-less mollusk in the cephalopods class. Cephalopods, with their multiple tentacles,
provide a great metaphor for a distributed file system.

Developing a distributed file system is a complex endeavor, but it's
immensely valuable if the right problems are solved. Ceph's goals can be
simply defined as:

Easy scalability to multi-petabyte capacity

High performance over varying workloads (input/output operations per
second [IOPS] and bandwidth)

Strong reliability

Unfortunately, these goals can compete with one another (for example,
scalability can reduce or inhibit performance or impact reliability). Ceph's
designers have developed some very interesting concepts (such as dynamic
metadata partitioning and data distribution and replication), which this
article explores shortly. Ceph's design also incorporates fault-tolerance
features to protect against single points of failure, with the assumption
that storage failures on a large scale (petabytes of storage) will be the
norm rather than the exception. Its design does not assume particular
workloads but includes the ability to adapt to changing distributed
workloads to provide the best performance. It does all of this with the
goal of POSIX compatibility, allowing it to be transparently deployed for
existing applications that rely on POSIX semantics (through Ceph-proposed
enhancements). Finally, Ceph is open source distributed storage and part
of the mainline Linux kernel (2.6.34).

Ceph architecture

Now, let's explore the Ceph architecture and its core elements at a high
level. I then dig down another level to identify some of the key aspects
of Ceph to provide a more detailed exploration.

The Ceph ecosystem can be broadly divided into four segments (see Figure
1): clients (users of the data), metadata servers (which cache and
synchronize the distributed metadata), an object storage cluster (which
stores both data and metadata as objects and implements other key
responsibilities), and finally the cluster monitors (which implement the
monitoring functions).

Figure 1. Conceptual architecture of the Ceph ecosystem

As Figure 1 shows, clients perform metadata operations (to identify the
location of data) using the metadata servers. The metadata servers manage
the location of data and also where to store new data. Note that metadata
is stored in the storage cluster (as indicated by "Metadata I/O"). Actual
file I/O occurs between the client and the object storage cluster. In this
way, higher-level POSIX functions (such as open, close, and rename) are
managed through the metadata servers, whereas data-path POSIX functions
(such as read and write) are managed directly through the object storage
cluster.
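To make this split concrete, the following sketch shows the idea in
Python. The class and method names here are invented for illustration;
this is not Ceph's actual client API, only the shape of the control and
data paths.

Listing 1. A minimal sketch of the metadata/data path split (hypothetical names)

class MetadataServer:
    """Handles namespace operations such as open, close, and rename."""
    def open(self, path):
        # Returns the inode and striping layout that the client needs
        # to talk to the object storage cluster directly.
        return {"ino": 0x1234, "size": 4096, "stripe_unit": 4 * 2**20}

class ObjectStorageCluster:
    """Handles the actual file I/O (read and write) as objects."""
    def read(self, ino, offset, length):
        ...  # the client computes object IDs from ino/offset and reads OSDs

mds = MetadataServer()
osc = ObjectStorageCluster()

layout = mds.open("/data/results.log")    # metadata path, through the MDS
data = osc.read(layout["ino"], 0, 4096)   # data path, bypassing the MDS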

Another perspective of the architecture is provided in Figure 2. A set of
servers access the Ceph ecosystem through a client interface, which
understands the relationship between metadata servers and object-level
storage. The distributed storage system can be viewed in a few layers,
including a format for the storage devices (the Extent and B-tree-based
Object File System [EBOFS] or an alternative) and an overarching management
layer, called Reliable Autonomic Distributed Object Storage (RADOS),
designed to manage data replication, failure detection, and recovery, as
well as subsequent data migration. Finally, monitors are used to identify
component failures, including subsequent notification.

Figure 2. Simplified layered view of the Ceph ecosystem

Ceph components

With the conceptual architecture of Ceph under your belt, you can dig down
another level to see the major components implemented within the Ceph
ecosystem. One of the key differences between Ceph and traditional file
systems is that rather than concentrating intelligence in the file system
itself, Ceph distributes that intelligence around the ecosystem.

Figure 3
shows a simple Ceph ecosystem. The
Ceph Client is the user of the Ceph file system. The Ceph Metadata Daemon
provides the metadata services, while the Ceph Object Storage Daemon
provides the actual storage (for both data and metadata). Finally, the
Ceph Monitor provides cluster management. Note that there can be many Ceph
clients, many object storage endpoints, numerous metadata servers
(depending on the capacity of the file system), and at least a redundant
pair of monitors. So, how is this file system distributed?

Figure 3. Simple Ceph ecosystem

Ceph client

Kernel or user space

Early versions of Ceph utilized Filesystem in Userspace (FUSE), which
pushes the file system into user space and can greatly simplify its
development. But today, Ceph has been integrated into the mainline
kernel, making it faster, because user space context switches are no
longer necessary for file system I/O.

As Linux presents a common interface to the file systems (through the
virtual file system switch [VFS]), the user's perspective of Ceph is
transparent. The administrator's perspective will certainly differ, given
the potential for many servers encompassing the storage system (see the Related topics section for information on creating a
Ceph cluster). From the users' point of view, they have access to a large
storage system and are not aware of the underlying metadata servers,
monitors, and individual object storage devices that aggregate into a
massive storage pool. Users simply see a mount point, from which
standard file I/O can be performed.

The Ceph file system—or at least the client interface—is
implemented in the Linux kernel. Note that in the vast majority of file
systems, all of the control and intelligence is implemented within the
kernel's file system source itself. But with Ceph, the file system's
intelligence is distributed across the nodes, which simplifies the client
interface but also provides Ceph with the ability to massively scale (even
dynamically).

Rather than rely on allocation lists (metadata to map blocks on a disk to a
given file), Ceph uses an interesting alternative. A file from the Linux
perspective is assigned an inode number (INO) from the metadata server,
which is a unique identifier for the file. The file is then carved into
some number of objects (based on the size of the file). Using the INO and
the object number (ONO), each object is assigned an object ID (OID). Using
a simple hash over the OID, each object is assigned to a placement group.
The placement group (identified as a PGID) is a conceptual
container for objects. Finally, the mapping of the placement group to
object storage devices is a pseudo-random mapping using an algorithm called
Controlled Replication Under Scalable Hashing (CRUSH). In
this way, mapping of placement groups (and replicas) to storage devices
does not rely on any metadata but instead on a pseudo-random mapping function.
This behavior is ideal, because it minimizes the overhead of storage and
simplifies the distribution and lookup of data.

The final component for allocation is the cluster map. The cluster map is
an efficient representation of the devices that make up the storage
cluster. With a PGID and the cluster map, you can locate any object.
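The whole allocation chain is easy to sketch. In the following Python
fragment, a simple modulo placement stands in for CRUSH (the real
algorithm also accounts for device weights and failure domains), and the
cluster map is reduced to a list of device names; everything else follows
the INO, OID, and PGID flow described above.

Listing 2. The file-to-device mapping chain, with a stand-in for CRUSH

import hashlib

NUM_PLACEMENT_GROUPS = 256
cluster_map = ["osd0", "osd1", "osd2", "osd3"]   # toy cluster map

def object_id(ino, ono):
    # OID = inode number plus the object's number within the file
    return f"{ino:x}.{ono:08x}"

def placement_group(oid):
    # A simple hash over the OID selects a placement group (PGID)
    digest = int(hashlib.sha1(oid.encode()).hexdigest(), 16)
    return digest % NUM_PLACEMENT_GROUPS

def crush_stand_in(pgid, replicas=2):
    # Pseudo-random but repeatable mapping of a PG to storage devices:
    # no per-object metadata is needed, only the PGID and the cluster map.
    start = pgid % len(cluster_map)
    return [cluster_map[(start + i) % len(cluster_map)]
            for i in range(replicas)]

ino = 0x1234                      # assigned by the metadata server
for ono in range(3):              # a file carved into three objects
    oid = object_id(ino, ono)
    pgid = placement_group(oid)
    print(oid, "-> PG", pgid, "->", crush_stand_in(pgid))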

The Ceph metadata server

The job of the metadata server (cmds) is to manage the file system's
namespace. Although both metadata and data are stored in the object
storage cluster, they are managed separately to support scalability. In
fact, metadata is further split among a cluster of metadata servers that
can adaptively replicate and distribute the namespace to avoid hot spots.
As shown in Figure 4, the metadata servers manage portions of the
namespace and can overlap (for redundancy and also for performance). The
mapping of metadata servers to namespace is performed in Ceph using
dynamic subtree partitioning, which allows Ceph to adapt to changing
workloads (migrating namespaces between metadata servers) while preserving
locality for performance.

Figure 4. Partitioning of the Ceph namespace for metadata servers
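To make the idea concrete, the following toy sketch illustrates subtree
migration. The load numbers and migration policy are invented for clarity;
real Ceph metadata servers measure load with counters on the namespace
tree and migrate authority over subtrees accordingly.

Listing 3. A toy illustration of dynamic subtree partitioning

# Which metadata server is authoritative for which subtree
partition = {"/": "mds0", "/home": "mds0", "/scratch": "mds1"}
load = {"mds0": 0.95, "mds1": 0.20}   # invented load figures

def authority(path):
    # Longest-prefix match finds the MDS responsible for a path
    best = max((p for p in partition if path.startswith(p)), key=len)
    return partition[best]

def rebalance():
    # Migrate a hot subtree from an overloaded MDS to an idle one
    if load["mds0"] > 0.8 and load["mds1"] < 0.5:
        partition["/home"] = "mds1"

print(authority("/home/alice/data"))   # mds0
rebalance()
print(authority("/home/alice/data"))   # mds1 after migration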

But because each metadata server simply manages the namespace for the
population of clients, its primary role is that of an intelligent metadata
cache (because actual metadata is eventually stored within the object
storage cluster). Metadata updates are cached in a short-term journal,
which eventually is pushed to physical storage. This behavior allows the
metadata server to serve recent metadata back to clients (which is common
in metadata operations). The journal is also useful for failure recovery:
if the metadata server fails, its journal can be replayed to ensure that
metadata is safely stored on disk.

Metadata servers manage the inode space, converting file names to metadata.
The metadata server transforms the file name into an inode, file size, and
striping data (layout) that the Ceph client uses for file I/O.
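The journaling behavior is simple to sketch. The structures below are
purely illustrative (not Ceph's on-disk format): updates append to a
short-term journal first, flush lazily to the object store, and can be
replayed after a failure.

Listing 4. A sketch of metadata journaling and replay (illustrative structures)

journal = []    # short-term journal (would live on fast storage)
stored = {}     # metadata as eventually persisted in the object store

def update_metadata(path, entry):
    journal.append((path, entry))    # the update is safe once journaled

def flush():
    while journal:                   # background push to physical storage
        path, entry = journal.pop(0)
        stored[path] = entry

def replay(journal_snapshot):
    # Failure recovery: re-apply journaled updates to the object store
    for path, entry in journal_snapshot:
        stored[path] = entry

update_metadata("/data/a", {"ino": 1, "size": 0, "layout": "4MB stripes"})
replay(list(journal))                # as if recovering after an MDS crash
assert stored["/data/a"]["ino"] == 1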

Ceph monitors

Ceph includes monitors that implement management of the cluster map, but
some elements of fault management are implemented in the
object store itself. When object storage devices fail or new devices are
added, monitors detect and maintain a valid cluster map. This function is
performed in a distributed fashion where map updates are communicated with
existing traffic. Ceph uses Paxos, which is a family of algorithms for
distributed consensus.
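One piece that is easy to illustrate is the epoch-numbered cluster map
itself. In the sketch below, the monitors' Paxos round is omitted (only
its result, a new map epoch, appears), and map propagation is reduced to
a compare-and-adopt step that rides on ordinary traffic; all names are
invented.

Listing 5. Epoch-numbered cluster maps, propagated on existing traffic

class ClusterMap:
    def __init__(self, epoch, osds):
        self.epoch = epoch
        self.osds = osds

def monitor_mark_failed(current, failed_osd):
    # Monitors agree (via Paxos, not shown) and publish a new map epoch
    osds = [o for o in current.osds if o != failed_osd]
    return ClusterMap(current.epoch + 1, osds)

def maybe_update(local, received):
    # Any peer holding a newer epoch shares it with stale peers
    return received if received.epoch > local.epoch else local

m1 = ClusterMap(1, ["osd0", "osd1", "osd2"])
m2 = monitor_mark_failed(m1, "osd1")
client_map = maybe_update(m1, m2)
print(client_map.epoch, client_map.osds)   # 2 ['osd0', 'osd2']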

Ceph object storage

Similar to traditional object storage, Ceph storage nodes include not only
storage but also intelligence. Traditional drives are simple targets that
only respond to commands from initiators. But object storage devices are
intelligent devices that act as both targets and initiators to support
communication and collaboration with other object storage devices.

From a storage perspective, Ceph object storage devices perform the mapping
of objects to blocks (a task traditionally done at the file system layer
in the client). This behavior allows the local entity to best decide how
to store an object. Early versions of Ceph implemented a custom low-level
file system on the local storage called EBOFS. This system
implemented a nonstandard interface to the underlying storage tuned for
object semantics and other features (such as asynchronous notification of
commits to disk). Today, the B-tree file system (BTRFS) can be used at the
storage nodes, which already implements some of the necessary features
(such as embedded integrity).
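The division of labor at a storage node can be sketched as follows: the
daemon stores whole objects and lets the local file system (EBOFS
historically, BTRFS today) decide how those objects map to blocks. The
path below is hypothetical.

Listing 6. Storing objects as local files, leaving block layout to the file system

import os

OBJECT_STORE = "/var/lib/osd0"   # hypothetical mount of a BTRFS volume

def write_object(oid, data):
    # One object becomes one local file; the local file system decides
    # how to lay the file's blocks out on disk.
    path = os.path.join(OBJECT_STORE, oid)
    with open(path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())     # commit to disk before acknowledging

def read_object(oid):
    with open(os.path.join(OBJECT_STORE, oid), "rb") as f:
        return f.read()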

Because the Ceph clients implement CRUSH and do not have knowledge of the
block mapping of files on the disks, the underlying storage devices can
safely manage the mapping of objects to blocks. This allows the storage
nodes to replicate data (when a device is found to have failed).
Distributing the failure recovery also allows the storage system to scale,
because failure detection and recovery are distributed across the
ecosystem. Ceph calls this RADOS (see Figure 3).

Other features of interest

As if the dynamic and adaptive nature of the file system weren't enough,
Ceph also implements some interesting features visible to the user. Users
can create snapshots, for example, in Ceph on any subdirectory (including
all of the contents). It's also possible to perform file and capacity
accounting at the subdirectory level, which reports the storage size and
number of files for a given subdirectory (and all of its nested contents).
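Assuming a Ceph file system mounted at /mnt/ceph, both features can be
exercised with ordinary file operations, as the sketch below shows. Ceph
exposes snapshots through a hidden .snap directory and recursive
accounting through virtual extended attributes; the attribute names shown
may vary by version.

Listing 8. Snapshots and subdirectory accounting from a mounted client (assumed mount point)

import os

subdir = "/mnt/ceph/projects"

# Snapshot the subdirectory (and all of its contents) by creating a
# named entry in its hidden .snap directory.
os.mkdir(os.path.join(subdir, ".snap", "before-cleanup"))

# Recursive accounting: total bytes and file count under the subdirectory.
rbytes = int(os.getxattr(subdir, "ceph.dir.rbytes").decode())
rfiles = int(os.getxattr(subdir, "ceph.dir.rfiles").decode())
print(rbytes, "bytes in", rfiles, "files")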

Ceph status and future

Although Ceph is now integrated into the mainline Linux kernel, it's
properly noted there as experimental. File systems in this state are
useful to evaluate but are not yet ready for production environments. But
given Ceph's adoption into the Linux kernel and the motivation by its
originators to continue its development, it should be available soon to
solve your massive storage needs.

Other distributed file systems

Ceph isn't unique in the distributed file system space, but it is unique in
the way that it manages a large storage ecosystem. Other examples of
distributed file systems include the Google File System (GFS), the General
Parallel File System (GPFS), and Lustre, to name just a few. The ideas
behind Ceph appear to offer an interesting future for distributed file
systems, as massive scales introduce unique challenges to the massive
storage problem.

Going further

Ceph is not only a file system
but an object storage ecosystem with enterprise-class features. In the Related topics section, you'll find information on
how to set up a simple Ceph cluster (including metadata servers, object
servers, and monitors). Ceph fills a gap in distributed storage, and it
will be interesting to see how the open source offering evolves in the
future.