Distributed File Systems and Object Stores on Linode (Part 2) — Ceph

In part 1 of this series, we looked at GlusterFS. Now in part 2, we look at an entirely different kind of storage system - the Ceph Object Store.

Ceph is actually an ecosystem of technologies offering three different storage models: object storage, block storage and filesystem storage. Interestingly, Ceph’s approach is to treat object storage as its foundation, and provide block and filesystem capabilities as layers built upon that foundation. Its scope is far broader than GlusterFS’s, and consequently its architecture is more complex.

In this article, I’ll cover Ceph’s object store: its architecture, its deployment on the Linode cloud, and how its performance compares with the AWS S3 object store. If you are not familiar with object stores, read the introductory section of part 1 of this series.

Ceph Architecture

Ceph Object Store Architecture

Ceph object storage is provided by two components working together:

the core Ceph Storage Cluster (or RADOS) which is the actual object store, and

the Ceph/RADOS Object Gateway (or RGW) on top of it which provides S3- and Swift- compatible HTTP interfaces for clients.

Ceph Storage Cluster

In Ceph, disks for object storage are called Object Storage Devices (OSDs). A Ceph Storage Cluster consists of OSD daemon nodes, where the objects are actually stored, and Monitor (MON) nodes that maintain and monitor the state of the cluster. Each OSD node has one or more OSDs for object storage, runs an OSD daemon per OSD, and has a dedicated journal disk to store a journal log per daemon.

Ceph Storage Cluster Architecture

Client applications store objects in the Ceph Storage Cluster by adding them to Pools.

A pool is a notional storage area in which all objects are related in some way. Pools are similar to GlusterFS Volumes. For example, all videos may be stored in a videos pool, and all documents in a documents pool. Or there may be one pool per client application.

One also often comes across the concept of Placement Groups (PGs) in Ceph documentation, but these are by and large an internal detail of Ceph, and you generally need not bother about them, apart from setting the number of PGs in each pool.
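For that one setting, a rule of thumb that has long appeared in Ceph’s documentation is to aim for roughly 100 PGs per OSD, divide by the pool’s replica count, and round up to the next power of two. The sketch below only encodes that arithmetic. The function name, the default of 100 and the example cluster are illustrative assumptions, not values from this article.

```python
def suggested_pg_count(num_osds: int, replicas: int, target_pgs_per_osd: int = 100) -> int:
    """Rule-of-thumb PG count: (OSDs * target) / replicas, rounded up to a power of two."""
    if num_osds < 1 or replicas < 1:
        raise ValueError("num_osds and replicas must be positive")
    raw = num_osds * target_pgs_per_osd / replicas
    power = 1
    while power < raw:
        power *= 2  # keep doubling until we reach or pass the raw estimate
    return power


if __name__ == "__main__":
    # Hypothetical cluster: 10 OSDs, 3 replicas per object.
    print(suggested_pg_count(10, 3))  # 1000 / 3 = 333.3, rounded up to 512
```

Treat the result as a starting point to check against the documentation for your Ceph release rather than a fixed answer.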

Expanding the capacity of the Storage Cluster is easy. Just provision a new OSD node and inform Ceph about it along with its OSD mount path. Everything else is taken care of by Ceph.

Ceph Object Gateway

Also known as RADOS Gateway (RGW), this component presents Amazon S3-compatible (and OpenStack Swift-compatible) RESTful HTTP interfaces for clients to store and retrieve objects from a Ceph Storage Cluster.

It supports federated deployment, multi-site deployment and S3 concepts such as regions and zones. While the interface is S3-compatible, there are behavioral differences, such as RGW’s strong consistency (in a single cluster, that is; multi-site is still eventual) versus S3’s eventual consistency at the time of writing.

RGW deployment involves deploying the Ceph Object Gateway daemon (or “radosgw”) on one or more nodes. It’s responsible for resolving object URLs to pool and object names, and for user authentication and authorization.
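To give a feel for the URL resolution step, here is a minimal sketch of how an S3-style URL can be split into a bucket name and an object key, for both the path style and the subdomain style of addressing. This is not radosgw’s actual implementation, and it stops at the bucket and key without modelling how those map to pools. The function name and the `objects.example.com` endpoint are hypothetical.

```python
from typing import Tuple
from urllib.parse import unquote, urlparse


def split_s3_url(url: str, endpoint_host: str) -> Tuple[str, str]:
    """Return (bucket, key) for a path-style or subdomain-style S3 URL."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    path = unquote(parsed.path.lstrip("/"))
    endpoint_host = endpoint_host.lower()

    if host == endpoint_host:
        # Path style: http://endpoint/bucket/key
        bucket, _, key = path.partition("/")
    elif host.endswith("." + endpoint_host):
        # Subdomain style: http://bucket.endpoint/key
        bucket = host[: -len(endpoint_host) - 1]
        key = path
    else:
        raise ValueError(f"{host!r} is not served by {endpoint_host!r}")

    if not bucket:
        raise ValueError("URL does not name a bucket")
    return bucket, key


if __name__ == "__main__":
    endpoint = "objects.example.com"  # hypothetical gateway hostname
    print(split_s3_url("http://objects.example.com/videos/cat.mp4", endpoint))
    print(split_s3_url("http://videos.objects.example.com/2017/cat.mp4", endpoint))
```

The subdomain style is the reason wildcard DNS entries for the gateway’s hostname matter: every bucket name becomes a hostname that clients must be able to resolve.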

For receiving object store HTTP requests from clients, a radosgw daemon can either use its own embedded web server, or receive them via an external FastCGI-capable web server like httpd / nginx / lighttpd.

There are many possible architectures for deploying RGW, differing in their levels of scalability and deployment complexity. Here are a few of them…

Using radosgw’s embedded web server: One of the simplest would be to deploy radosgw on MON nodes with their embedded civetweb web servers enabled, and use a Linode NodeBalancer to load balance them and provide a single HTTP endpoint with optional SSL termination.

Ceph Object Gateway using embedded web servers

It’s not the most scalable of designs (due to colocating on MON nodes and using embedded web servers), but is good enough for low to medium client loads.

Using dedicated radosgw nodes: Another strategy is to deploy radosgw on dedicated nodes. Since NodeBalancer can’t speak FastCGI, deploy a lightweight FastCGI-capable web server like lighttpd along with radosgw. Finally, configure a NodeBalancer to proxy and load balance the lighttpd instances and terminate SSL.

Dedicated RGW layer, independently scalable from Storage Cluster

It’s suitable for medium loads, and can scale to high loads by adding more radosgw nodes.

Using dedicated radosgw and web server nodes: A third option would be to have dedicated web servers in addition to dedicated radosgw nodes.

Independently scalable web and RGW layers

In this architecture, the Storage Cluster, the RGW layer and the web server layer are all independently scalable, which makes this suitable for high to very high loads.

In the table above, the Object Storage column is the GBs available for object storage after dedicating 100 GB for the OSD journal disk and 5 GB for OS + swap disks.
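The subtraction behind that column is simple enough to sketch. The 100 GB and 5 GB reservations come from the paragraph above; the function names, the `replicas` parameter and the 10,000 GB target are assumptions added for illustration.

```python
import math

JOURNAL_GB = 100  # reserved for the OSD journal disk
OS_SWAP_GB = 5    # reserved for the OS + swap disks


def object_storage_gb(total_disk_gb: int) -> int:
    """Disk space left for OSD data after the journal and OS/swap reservations."""
    usable = total_disk_gb - JOURNAL_GB - OS_SWAP_GB
    if usable <= 0:
        raise ValueError("node is too small to hold any object data")
    return usable


def osd_nodes_needed(target_gb: int, total_disk_gb: int, replicas: int = 1) -> int:
    """OSD nodes needed to hold target_gb of objects, each stored `replicas` times."""
    return math.ceil(target_gb * replicas / object_storage_gb(total_disk_gb))


if __name__ == "__main__":
    # Example: nodes with 384 GB of disk each.
    print(object_storage_gb(384))             # 279
    print(osd_nodes_needed(10_000, 384))      # 36
    print(osd_nodes_needed(10_000, 384, 3))   # 108
```

Because the reservation is a fixed 105 GB per node, smaller nodes lose a larger share of their disk to overhead, which is one argument for fewer, larger OSD nodes.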

Coming to Monitor nodes, there should always be more than one, and since they work in quorum configuration, there should always be an odd number of monitor nodes. So, a minimum of 3 is required.

Depending on where you look, other recommendations have been made, such as 1 monitor node for every 20 OSD nodes. Linode 8 GB nodes with 8 GB RAM, 4 cores and 96 GB storage are sufficient for nodes that act as dedicated monitors. But if you wish to additionally deploy radosgw and web servers — or other services such as DNS — on the same nodes, higher configurations are preferable.
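The reasoning about odd monitor counts can be checked with a few lines of arithmetic. The sketch below assumes quorum means a strict majority of monitors and combines it with the "1 monitor per 20 OSD nodes" guideline; both function names are made up for this example.

```python
import math


def tolerated_monitor_failures(monitors: int) -> int:
    """Monitors that can fail while a strict majority (quorum) is still alive."""
    majority = monitors // 2 + 1
    return monitors - majority


def suggested_monitor_count(osd_nodes: int, osd_nodes_per_monitor: int = 20) -> int:
    """At least 3 monitors, about 1 per 20 OSD nodes, rounded up to an odd number."""
    count = max(3, math.ceil(osd_nodes / osd_nodes_per_monitor))
    return count if count % 2 == 1 else count + 1


if __name__ == "__main__":
    for n in (3, 4, 5):
        print(n, "monitors tolerate", tolerated_monitor_failures(n), "failure(s)")
    print(suggested_monitor_count(80))  # 80 / 20 = 4, rounded up to 5
```

Going from 3 to 4 monitors does not raise the number of failures the cluster survives, which is why even counts add cost without adding resilience.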

Disks and filesystems

Ceph recommends that OSD data, OSD journal and OS be on separate disks. For a Linode server, this means a minimum of 4 disks — one for OSD data, one for OSD journal, one for the OS, and one for swap.

Ceph recommends XFS as the filesystem for OSD data and OSD journal disks. OS disks can be whatever filesystem is recommended for the OS.

CPU and RAM

For the candidate OSD node configurations above based on storage, we get virtualized Xeon E5–2680 processors with these cores and memories:

These are clearly far higher than recommended by Ceph. Instead of wasting these computing resources, one way to use them is to run the OSD nodes as compute nodes for Hadoop or Spark as well (although Ceph recommends against doing this).

Network

Ceph recommends a 1 Gbps minimum network bandwidth, but this is contingent on object size and required time to replicate an object. Higher throughput configurations are better. The above three candidate configurations with 6, 8 and 10 Gbps seem sufficient.
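To see how object size and bandwidth interact, here is a back-of-the-envelope sketch of the ideal time to copy one object across a link. It ignores protocol overhead, disk speed and contention, so real replication will be slower. The function name and the 512 MB example are assumptions.

```python
def transfer_seconds(object_mb: float, link_gbps: float) -> float:
    """Ideal time to move one object over a link (no overhead, no contention)."""
    if link_gbps <= 0:
        raise ValueError("link_gbps must be positive")
    bits = object_mb * 8_000_000  # 1 MB taken as 10**6 bytes
    return bits / (link_gbps * 1_000_000_000)


if __name__ == "__main__":
    # A 512 MB object over the bandwidths mentioned above.
    for gbps in (1, 6, 8, 10):
        print(f"{gbps} Gbps: {transfer_seconds(512, gbps):.2f} s")
```

At 1 Gbps a 512 MB object needs about 4 seconds per copy even in the ideal case, which is why large-object workloads benefit from the higher-bandwidth configurations.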

Object Gateway nodes

The first RGW deployment option described previously (civetweb and radosgw running on the monitor nodes, fronted by a Linode NodeBalancer) achieves high availability without adding any nodes at all, but with limited scalability.

But other possibilities — depending on scalability requirements — exist, such as running radosgw daemons on dedicated nodes and apache/nginx reverse proxies on their own dedicated nodes, as depicted in the other two deployment options. This way every layer is independently scalable and securable.

DNS

DNS servers should be installed in the private network for OSD node lookups. Since the Storage Cluster’s client — radosgw — is also in the same private network, there’s no need for split horizon DNS. A pair of monitor nodes themselves can act as DNS servers.

For S3 API external clients to be able to correctly resolve domain names in object and subdomain URLs, Linode’s DNS capabilities can be used. Those domains should resolve to either the NodeBalancer’s IP address, or if using a dedicated web server layer, round-robin to the apache/nginx reverse proxy nodes.

Performance of Ceph Object Store on Linode cloud

Ceph performance tests were done on a Ceph Storage Cluster of 2x Linode 24 GB instances (a modest configuration with 384 GB of disk space and 2 Gbps outgoing bandwidth on each), one 2 GB monitor node and one 2 GB admin node. Both storage instances were configured as Ceph OSD nodes as well as RGW nodes. They were fronted by a Linode NodeBalancer configured for HTTP round-robin balancing with no stickiness.

The test clients were 2x quad-vCPU, 15 GB RAM machines with 8 Gbps outgoing bandwidth, located in a different external cloud, and were used to test both Ceph and S3. They were configured to run the same COSBench small file and large file workloads in distributed mode using the S3 API to communicate, first with the Ceph cluster on Linode and then with AWS S3.

Small files performance

Small file performance matters when the store is being used for user-facing activities like web resource serving or image storage. Response times matter more than throughput.

Ceph vs S3 small file response times

Despite the modest 24 GB instances, the Ceph cluster compared well, with consistently lower 99th-percentile response times for both reads and writes.
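A 99% response time is the value that 99% of requests completed within, so it captures the slow tail that an average would hide. The sketch below uses the nearest-rank definition of a percentile. That definition is an assumption and may differ in detail from how COSBench computes its figure, and the sample latencies are invented.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if len(samples) == 0:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range 1..100")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


if __name__ == "__main__":
    # Invented response times in milliseconds; one slow outlier.
    latencies_ms = [12, 15, 11, 240, 14]
    print(percentile(latencies_ms, 50))  # 14, the typical request
    print(percentile(latencies_ms, 99))  # 240, the slow tail
```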

Ceph vs S3 small file throughputs

Ceph’s small file throughputs were lower than S3’s. It’s possible that read throughputs could have been higher if higher configuration nodes were used. I was not able to pinpoint why write throughputs were lower, given the 40 Gbps incoming bandwidth of the instances. Perhaps the NodeBalancer’s bandwidth or configuration affected throughput.

Large files performance

For big data use cases, large file throughput matters. Response times too may be important for some use cases, such as storing large blobs of realtime sensor readings.

Ceph vs S3 large file throughputs

Here, the Linode cluster was on par with S3 while reading and writing moderately large files of 64 MB to 512 MB.

It did well on 512 MB to 2 GB files too, but no equivalent S3 run was performed for comparison.

Ceph vs S3 large file response times

Response times too were on par with S3. If larger configuration instances were used, it’s very likely that these response times would have been even lower.

System performance

During both small and large file tests, I noticed that the cluster nodes — despite running both OSD daemons and RGW daemons — hardly consumed any CPU or memory. That’s good, because it means they can also serve as compute instances for big data deployments like Hadoop or Spark.

Performance vs Costs

While running these performance comparisons against S3, I incurred a surprisingly high transfer cost after just 1 hour of testing and about 250 GB of transfer. Digging a little deeper, I found that S3’s pricing was dominated not by monthly storage capacity used but by TBs of data transferred out of S3 every month.

For example, for a 10 TB capacity Ceph cluster on Linode, total cost of ownership — including storage nodes, monitor nodes and NodeBalancer — and total free outgoing transfer are as follows:

Linode TCO for a 10TB Ceph cluster

S3 pricing for the same 10 TB capacity (counting only storage and transfer pricing, not number of requests or anything else) is much lower at low outgoing transfer volumes, but rapidly grows more expensive than Linode’s TCO at transfer volumes above 40 TB/month:

This relationship does not hold for higher storage capacities, because Linode cluster TCO increases at a much higher rate than S3’s. But if outgoing data volume every month is consistently more than 6 or 7 times the total storage capacity, then S3 transfer costs start to dominate and Linode cluster TCO works out to be less expensive than S3.
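The break-even point can be estimated with a simple linear model: a fixed monthly cluster TCO on one side, and S3 storage plus per-TB transfer charges on the other. The sketch below assumes flat per-TB prices and that the cluster’s outgoing transfer stays within its free quota. The function names are hypothetical, and the numbers in the example are placeholders, not real Linode or AWS prices.

```python
def s3_monthly_cost(stored_tb: float, transfer_out_tb: float,
                    storage_price_per_tb: float, transfer_price_per_tb: float) -> float:
    """S3 cost counting only storage and outgoing transfer, at flat per-TB prices."""
    return stored_tb * storage_price_per_tb + transfer_out_tb * transfer_price_per_tb


def break_even_transfer_tb(cluster_tco: float, stored_tb: float,
                           storage_price_per_tb: float, transfer_price_per_tb: float) -> float:
    """Monthly transfer-out volume above which S3 costs more than the fixed-cost cluster."""
    if transfer_price_per_tb <= 0:
        raise ValueError("transfer_price_per_tb must be positive")
    remaining = cluster_tco - stored_tb * storage_price_per_tb
    return max(0.0, remaining / transfer_price_per_tb)


if __name__ == "__main__":
    # Placeholder prices for illustration only; substitute current price lists.
    tco, stored, storage_price, transfer_price = 1000.0, 10.0, 25.0, 50.0
    threshold = break_even_transfer_tb(tco, stored, storage_price, transfer_price)
    print(f"S3 becomes more expensive above {threshold} TB/month of transfer out")
```

With real tiered pricing the curve is not perfectly linear, so use this only to locate the rough crossover before doing a detailed comparison.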

Deploy on Linode

Ceph-linode is a set of interactive scripts that automatically create and provision secured Ceph object stores on the Linode cloud. They are written in Python, use Ansible for deployment, and support Ubuntu 14.04 / 16.04 LTS, Debian 8 and CentOS 7.

This is still a work in progress and the first production-ready release is expected around February 15, 2017. See https://github.com/pathbreak/ceph-linode for detailed installation instructions and documentation. Contributions and suggestions are welcome.

Conclusions

Ceph is an excellent object store with great features, easy deployment, good documentation and good tooling. It performs well on Linode even on modest configurations. It has advanced high availability and disaster recovery features like multi-site and federated architectures, which were not covered in this article but may prove useful for big data use cases.

TCO of self-hosted Ceph clusters can be reduced in multiple ways. Linode’s NodeBalancer is an inexpensive approach to load balance between RGW instances, without deploying any load balancing nodes or software. TCO can be further reduced by using higher configuration nodes for the cluster and additionally utilizing them as compute instances. For applications that involve very high monthly transfer volumes out of S3 compared to stored capacity, it’s possible a self-hosted cluster can actually reduce TCO, and this can be easily checked using the spreadsheet linked earlier.

Credits

A big thank you to Dave Roesch and Keith Craig for providing Linode infrastructure and suggestions that made this article possible.

About me: I’m a software consultant and architect specializing in big data, data science and machine learning, with 14 years of experience. I run Pathbreak Consulting, which provides consulting services in these areas for startups and other businesses. I blog here and I’m on GitHub. You can contact me via my website or LinkedIn.


While Karthik’s views and cloud situations are solely his and don’t necessarily reflect those of Linode, we are grateful for his contributions.