Ceph: The Distributed File System Creature from the Object Lagoon

Did you ever see one of those terrible Sci-Fi movies involving a killer Octopus? Ceph, while named after just such an animal, is not a creature about to eat an unlucky Spring Breaker, but a new parallel distributed file system. The client portion of Ceph just went into the 2.6.34 kernel so let's learn a bit more about it.

The last two years have seen a large number of file systems added to the kernel with many of them maturing to the point where they are useful, reliable, and in production in some cases. In the run up to the 2.6.34 kernel, Linus recently added the Ceph client. What is unique about Ceph is that it is a distributed parallel file system promising scalability and performance, something that NFS lacks.

High-level view of Ceph

One might ask about the origin of the name Ceph since it is somewhat unusual. Ceph is really short for cephalopod, which is the class of molluscs to which the octopus belongs. So it’s really short for octopus, sort of. If you want more detail, take a look at the Wikipedia article about Ceph. Now that the name has been partially explained, let’s look at the file system.

Ceph was started by Sage Weil for his PhD dissertation at the University of California, Santa Cruz in the Storage Systems Research Center in the Jack Baskin School of Engineering. The lab is funded by the DOE/NNSA involving LLNL (Lawrence Livermore National Labs), LANL (Los Alamos National Labs), and Sandia National Laboratories. He graduated in the fall of 2007 and has kept developing Ceph. As mentioned previously, his efforts have been rewarded with the integration of the Ceph client into the upcoming 2.6.34 kernel.

The design goals of Ceph are to create a POSIX file system (or close to POSIX) that is scalable, reliable, and has very good performance. To reach these goals Ceph has the following major features:

It is object-based

It decouples metadata and data (many parallel file systems do this as well)

It uses a dynamic distributed metadata approach

These three features and how they are implemented are at the core of Ceph (more on that in the next section).

However, probably the most fundamental assumption in the design of Ceph is that large-scale storage systems are dynamic and that failures are guaranteed. The first part of the assumption, that storage systems are dynamic, means that storage hardware is added and removed and that the workloads on the system keep changing. The second part is that hardware failures are presumed to happen, so the file system needs to be adaptable and resilient.

More in Depth

With the general view of Ceph in mind, let’s dive down into some more details to understand how it’s implemented and what it means. Below in Figure 1 is an overview of the layout of Ceph.

Figure 1: System layout of Ceph.

There are client nodes (the happy smiling faces), a metadata cluster, and the object storage cluster where the data is stored. When a client wants to open a file, it contacts the metadata cluster, referred to as the MDS (MetaData Server), which is in fact a cluster of servers. The MDS returns information to the client that tells it what its capabilities are (what it can and cannot do), the file size, striping information (the data is striped across multiple storage devices for performance reasons), and something called a file inode (used by Ceph). Once this information is received, the client sends and receives data directly to and from the Object Storage Devices (OSD’s) that make up the data storage cluster. During the data transactions the MDS is checked to see if there are any changes. If there are none, everything proceeds normally. If there are changes, the MDS notifies the client and the OSD’s. Once everything is done and the close request is sent to the MDS and OSD’s, the client updates the MDS with any details, and the MDS marks the file as closed and updates the metadata information.

Object-Based Storage
The system layout serves as a guide for further discussing the details and features of Ceph. One of the first features that needs to be explained is the object-based approach of the file system. In an object-based file system, the data is broken into objects, each of which is assigned an object ID number and a small amount of metadata and then sent for storage on the Object Storage Devices (OSD’s). The file system metadata for that file then consists of a number of object ID’s that define all of the data, as well as other information about the file (e.g. access/modify dates, etc.). Typically, the metadata does not know precisely where the file is located and relies on the OSD’s for the storage and retrieval of the actual data. The OSD takes care of the lower-level functions itself (kind of a “smart” hard drive if you will). The file system interacts with the OSD’s at a high level, requesting the object itself or information about the object rather than asking for a range of inodes or blocks or something similar.
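
To make the idea concrete, here is a minimal sketch of breaking a file into objects and keeping only the object ID's in the file's metadata. It is an illustration, not Ceph's actual code: the object size, the "<inode>.<index>" ID format, and the names StoredObject, split_into_objects, and reassemble are all assumptions made for this example.

```python
from dataclasses import dataclass

OBJECT_SIZE = 4  # bytes per object; tiny on purpose so the example stays readable


@dataclass
class StoredObject:
    object_id: str  # hypothetical ID format: "<inode in hex>.<object index>"
    data: bytes


def split_into_objects(inode: int, payload: bytes, object_size: int = OBJECT_SIZE):
    """Cut a file's bytes into fixed-size objects, each with its own ID."""
    objects = []
    for index, offset in enumerate(range(0, len(payload), object_size)):
        chunk = payload[offset:offset + object_size]
        objects.append(StoredObject(f"{inode:x}.{index:08x}", chunk))
    return objects


def reassemble(objects):
    """Rebuild the file contents by concatenating objects in ID order."""
    ordered = sorted(objects, key=lambda obj: obj.object_id)
    return b"".join(obj.data for obj in ordered)


payload = b"hello object world"
objects = split_into_objects(inode=0x1234, payload=payload)

# The file-level metadata holds object IDs, not block addresses.
file_metadata = {
    "size": len(payload),
    "object_ids": [obj.object_id for obj in objects],
}
```

Note that file_metadata says nothing about which drive or block holds each object. That lookup is left to the object store, which is the point of the object-based approach.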

While there have only been experimental OSD drives, the typical way of creating an OSD is to use a middle layer of software between the object-based file system and the file system on the drive itself (or even the drive itself). In this approach the drive is just a regular hard drive such as those we currently use. Typically the OSD middle layer converts the object request into a file system request on the underlying drive.

Initially Ceph used something called EBOFS (Extent and B-tree based Object File System) but support was dropped in mid-2009. It was replaced with btrfs which promises to give as good or better performance than EBOFS. In addition, btrfs has a few features that EBOFS does not. Namely,

Copy-on-write semantics for file data (who doesn’t like a COW?)

Well maintained and tested (it’s in the kernel and under heavy development)

"... To avoid reinventing the wheel, Ceph will use btrfs on individual storage nodes (OSDs) to store object data, and we will focus on adding any additional functionality needed to btrfs where it will hopefully benefit non-Ceph users as well. ..."

For example, there is a recent patch that adds some features to btrfs that help Ceph.

Distributed Metadata
Another key aspect of Ceph that distinguishes it from other file systems is that it uses something Sage terms “Dynamic Distributed Metadata Management.” The first keyword is distributed, meaning multiple metadata servers, unlike Lustre which only has one metadata server. Being distributed means that the loss of a metadata server (MDS) won’t cause the entire file system to crash.

The second keyword in the title is Dynamic. This means that the metadata can actually be moved or redistributed from one MDS to another. If an MDS goes down or is added, portions of the file system directory hierarchy are moved to better balance performance and capacity. This distribution is based on the workload but preserves locality in each MDS’s workload, which improves performance because the metadata can be aggressively prefetched.

Dynamic metadata also means that over time the metadata is redistributed to make better use of resources including load balancing for systems that don’t even add storage hardware. So if a certain part of the directory tree was used more often than others, it can either be divided across MDS nodes or consolidated to a single MDS coupled with aggressive caching.
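
The balancing idea can be sketched in a few lines. This is a toy model under stated assumptions, not Ceph's actual balancer: each directory subtree has a "heat" (a made-up request count), whole subtrees are moved so that locality is preserved, and the names mds_load and rebalance_once are hypothetical.

```python
def mds_load(assignment, heat):
    """Total heat (request count) of the subtrees each MDS currently serves."""
    return {mds: sum(heat[s] for s in subtrees) for mds, subtrees in assignment.items()}


def rebalance_once(assignment, heat):
    """Move one whole subtree from the busiest MDS to the idlest, if that helps.

    Returns (subtree, source, destination), or None when no move would
    reduce the imbalance.
    """
    load = mds_load(assignment, heat)
    busiest = max(load, key=lambda m: (load[m], m))
    idlest = min(load, key=lambda m: (load[m], m))
    gap = load[busiest] - load[idlest]

    # Moving a subtree with heat >= gap would only swap the roles of the two servers.
    movable = [s for s in assignment[busiest] if heat[s] < gap]
    if not movable:
        return None

    # Pick the subtree whose heat is closest to half the gap.
    subtree = min(movable, key=lambda s: (abs(gap / 2 - heat[s]), s))
    assignment[busiest].remove(subtree)
    assignment[idlest].append(subtree)
    return subtree, busiest, idlest


# Made-up workload: mds1 has just been added and serves nothing yet.
heat = {"/home/alice": 55, "/home/bob": 30, "/var": 10, "/scratch": 4}
assignment = {"mds0": ["/home/alice", "/home/bob", "/var", "/scratch"], "mds1": []}

first_move = rebalance_once(assignment, heat)
second_move = rebalance_once(assignment, heat)
```

With these numbers the first call hands the hot /home/alice subtree to the new server, and the second call finds nothing worth moving. Because whole subtrees move together, each MDS still serves a contiguous piece of the directory tree.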

Reliability through Replication
Typical file systems, even distributed parallel ones, rely on data storage units that have RAID or SAN fail-over mechanisms to help maintain data access. This also includes redundant power supplies, possibly redundant RAID controllers, redundant network cards, and other costly hardware solutions. An example of this is Lustre. Ceph takes the opposite approach and uses replication to help maintain access to data. Ceph maintains copies of data across the OSD’s to ensure that the loss of any OSD or multiple OSD’s will not cause the loss of data. If an OSD is lost, the objects that it contained also exist on other OSD’s, and those surviving copies are immediately copied to other remaining OSD’s so that the proper number of copies is maintained. The copies are spread out so that no “hot spots” develop in the replication process and as much replication as possible takes place in parallel.
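
The recovery step can be sketched as follows. This is a simplified illustration, not how Ceph actually schedules recovery: the placement dictionary, the OSD names, and the function rereplicate are assumptions for the example, and the "least loaded survivor" rule is just one simple way to avoid hot spots.

```python
def rereplicate(placement, failed_osd, all_osds, replicas):
    """Restore the replica count after an OSD failure.

    placement maps object ID -> list of OSDs holding a copy and is updated
    in place. Returns the copy operations as (object_id, source, target).
    """
    survivors = [osd for osd in all_osds if osd != failed_osd]

    # Count how many objects each surviving OSD already holds.
    load = {osd: 0 for osd in survivors}
    for holders in placement.values():
        for osd in holders:
            if osd in load:
                load[osd] += 1

    copies = []
    for object_id in sorted(placement):
        holders = [osd for osd in placement[object_id] if osd != failed_osd]
        while len(holders) < replicas:
            candidates = [osd for osd in survivors if osd not in holders]
            if not candidates or not holders:
                break  # nowhere to copy to, or no surviving copy to read from
            # Send the new copy to the least-loaded survivor to avoid hot spots.
            target = min(candidates, key=lambda osd: (load[osd], osd))
            copies.append((object_id, holders[0], target))
            holders.append(target)
            load[target] += 1
        placement[object_id] = holders
    return copies


all_osds = ["osd0", "osd1", "osd2", "osd3"]
placement = {
    "a": ["osd0", "osd1"],
    "b": ["osd0", "osd2"],
    "c": ["osd1", "osd2"],
    "d": ["osd0", "osd3"],
}
copies = rereplicate(placement, failed_osd="osd0", all_osds=all_osds, replicas=2)
```

In this example the three affected objects are each read from a different survivor and written to a different survivor, so the copy operations are independent of one another and could run in parallel.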

Using replication does mean that you use more capacity to store the same data but it also means that you don’t need parity disks or “spare” disks making 100% use of all the storage in the OSD’s. It also means that you don’t develop hot spots in the OSD’s waiting for a RAID rebuild. Moreover, since you don’t need to do a RAID rebuild you don’t need the compute power, saving money and electrical power.

Distributed Object Storage
One way to achieve better performance is to stripe data across multiple OSD’s (something like RAID-0). Ceph does this and uses replication to ensure that the loss of an OSD does not mean that the data is lost. The component of Ceph that does this is called RADOS (Reliable Autonomic Distributed Object Store). Figure 2 below presents how the data from a file is broken into objects and distributed to the OSD’s.

Figure 2: Ceph Distributed Object Storage.

A file is broken into objects and then these objects are mapped into placement groups (PG’s) using a simple hash function. Then the placement groups are assigned to OSD’s using a component of Ceph called CRUSH (Controlled Replication Under Scalable Hashing). CRUSH is a pseudo-random data distribution function that efficiently maps each PG to an ordered list of OSD’s where copies of the object are stored. One feature of CRUSH is that it is a globally known function, so any component of Ceph (client, MDS, OSD) can compute the location of an object. This means that you don’t have to involve the MDS to compute the location of an object.
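
The two-step mapping can be sketched as below. To be clear about the assumptions: this is not the CRUSH algorithm. It swaps in rendezvous (highest-random-weight) hashing as a much simpler stand-in that shares one key property, namely that anyone who knows the function and the list of OSD's can compute the placement without asking a central server. The PG count, OSD names, and function names are made up for the example.

```python
import hashlib

NUM_PGS = 8
OSDS = ["osd0", "osd1", "osd2", "osd3", "osd4"]
REPLICAS = 2


def stable_hash(text: str) -> int:
    """Deterministic hash (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.sha256(text.encode()).digest()[:8], "big")


def object_to_pg(object_id: str, num_pgs: int = NUM_PGS) -> int:
    """Step 1: a simple hash maps an object to a placement group."""
    return stable_hash(object_id) % num_pgs


def pg_to_osds(pg: int, osds=OSDS, replicas: int = REPLICAS):
    """Step 2 (stand-in for CRUSH): rank OSDs by a per-PG pseudo-random score
    and return the top ones as an ordered list."""
    ranked = sorted(osds, key=lambda osd: stable_hash(f"{pg}:{osd}"), reverse=True)
    return ranked[:replicas]


def locate(object_id: str):
    """Any client, MDS, or OSD can run this locally to find an object's copies."""
    return pg_to_osds(object_to_pg(object_id))
```

Calling locate("1234.00000000") on any node returns the same ordered list of two OSD's, with no lookup table involved. A side effect of the ranking approach is that removing an OSD that does not hold a given PG leaves that PG's placement unchanged.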

Relaxation of POSIX (sort of)
Ceph uses the phrase “near-POSIX” because it has the ability to relax some of the POSIX semantics to improve performance (see the recent article POSIX IO Must Die!). In particular it uses a subset of a proposed set of extensions for POSIX for HPC (High-Performance Computing).

A classic example illustrating why extensions are needed for POSIX is that when a file is opened by multiple clients (usually happens in HPC) where each client has either multiple writers or a mix of readers and writers, the metadata server will revoke any read caching and write buffering capabilities to make sure that all clients access the data correctly. This forces the client IO to suddenly become synchronous and the performance drops tremendously particularly for small files (POSIX is at least enforcing consistency – always good). However, some applications already know that they don’t have consistency issues because of the design of the application (this is common in HPC applications) but they have to suffer a severe performance penalty because POSIX has chosen to trust no one – even if the application is correct because each writer or reader works on an independent part of the file.

The proposed POSIX extensions have options to address this issue as well as others. In particular, there is an option, O_LAZY, for the open() syscall that explicitly relaxes coherency for a shared-write file. It assumes that the application is managing its own coherency. As previously mentioned, in HPC many applications can read/write to a single file from many processes since each process works on an independent part of the file. Using the O_LAZY option means that these applications can run at higher speeds by keeping the caching and buffering that would otherwise be revoked for a shared file.
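
The access pattern in question looks roughly like the sketch below, where several writers share one file but each touches only its own byte range. Note the assumptions: the code does not use O_LAZY (it is a proposed extension, not a flag in Python's os module), threads on one machine stand in for separate HPC processes, and the record size and writer count are arbitrary. It needs a POSIX system for os.pwrite and os.pread.

```python
import os
import tempfile
import threading

RECORD = 8   # bytes owned by each writer
WRITERS = 4  # stand-ins for separate HPC processes


def write_region(fd: int, rank: int) -> None:
    """Write only to this writer's own byte range, so ranges never overlap."""
    payload = bytes([ord("A") + rank]) * RECORD
    os.pwrite(fd, payload, rank * RECORD)


def shared_write() -> bytes:
    """Have all writers fill their regions of one shared file, then read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "shared.dat")
        fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
        try:
            threads = [
                threading.Thread(target=write_region, args=(fd, rank))
                for rank in range(WRITERS)
            ]
            for thread in threads:
                thread.start()
            for thread in threads:
                thread.join()
            return os.pread(fd, RECORD * WRITERS, 0)
        finally:
            os.close(fd)


result = shared_write()
```

Because no two writers touch the same bytes, the final contents do not depend on the order in which the writes land. That is the property that lets such an application safely ask the file system to relax coherency.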

Summary

Ceph has a number of features that make it very attractive for the ever-growing file systems we are all experiencing. It is designed for scalability, reliability, and performance. At the same time it assumes that hardware will fail or be added, so it has a design that can adapt to these situations. Ceph breaks the file system into two pieces: (1) metadata, and (2) data. This allows each piece to be designed in the most efficient manner to achieve these three goals of Ceph.

Ceph uses a dynamic distributed metadata server (MDS) that is not only clustered but also adapts to the changing workload. It will automatically distribute portions of the hierarchical directory tree to other servers in the MDS cluster to better balance the load as the workload changes. In addition, if an MDS is added, it will move portions of the metadata to that new box, again better distributing the load.

The concept of replication is used along with Object Storage Devices (OSD’s) so that all the space on all the drives is used (no parity drives, no spare drives). During the writing of an object to Ceph, it is automatically replicated to other OSD’s so that the loss of one or more OSD’s won’t result in the loss of data. If an OSD is lost, the objects are re-replicated so that the number of copies of the objects is maintained.

While the Ceph client was recently included in the 2.6.34 kernel (it was in an “rc” version where rc = release candidate), it is still considered not ready for prime-time. It also uses btrfs as the underlying storage mechanism for the OSD’s, and btrfs itself is still in development. But including the client in the kernel does three things. First, it gives a vote of confidence to Ceph. Second, since it’s in the kernel it should get some more “development eyes” examining the code. And third, it should get more testing.

If you’re feeling “experimental” or have an upcoming need for larger amounts of storage, then give Ceph a try. It’s really not a scary octopus about to eat your boat.
