Month: January 2016

Ceph OSDs: A Ceph OSD Daemon (Ceph OSD) stores data, handles data replication, recovery, backfilling, rebalancing, and provides some monitoring information to Ceph Monitors by checking other Ceph OSD Daemons for a heartbeat. A Ceph Storage Cluster requires at least two Ceph OSD Daemons to achieve an active + clean state when the cluster makes two copies of your data (Ceph makes 3 copies by default, but you can adjust it).

Monitors: A Ceph Monitor maintains maps of the cluster state, including the monitor map, the OSD map, the Placement Group (PG) map, and the CRUSH map. Ceph maintains a history (called an “epoch”) of each state change in the Ceph Monitors, Ceph OSD Daemons, and PGs. Ceph uses the Paxos algorithm, which requires a consensus among the majority of monitors in a quorum. With Paxos, the monitors cannot determine a majority for establishing a quorum with only two monitors. A majority of monitors must be counted as such: 1:1, 2:3, 3:4, 3:5, 4:6, etc. Side note: some ceph docs advise not to comingle Montior and Ceph OSD daemons on the same host or you may encounter performance issues. But in deployment guides and the Mellanox high performance paper, they do comingle them. For all test purposes, I plan to comingle them (deploy monitor on ecs nodes) and evaluate performance under load. I am also still trying to estimate how many monitors per Ceph OSD. We’ll have 480 Ceph OSDs per rack, and we’ll want either 3, 5, or 7 monitors. I’m going to take a shot in the dark and go with 5.

RADOS GW This is what provides the S3 and SWIFT API access to Ceph file storage. You can install this on the OSD nodes (simplest) or select a handful of external VMs to run these. You would setup multiple RADOS GW nodes, and place a load balancer like haproxy or nginx/lua_proxy in front of them.

OSD Notes:

OSD Journal Location: stores a daemon’s journal by default on /var/lib/ceph/osd/$cluster-$id/journal – on a ECS node, this would be an SSD, which is recommended by CEPH. However, you could point it to an SSD partition instead of a file for even faster performance.

OSD Journal Size: The expected throughput number should include the expected disk throughput (i.e., sustained data transfer rate), and network throughput. For example, a 7200 RPM disk will likely have approximately 100 MB/s. Taking the min() of the disk and network throughput should provide a reasonable expected throughput. Some users just start off with a 10GB journal size. For example:osd journal size = 10000

OSD’s can be removed gracefully: http://docs.ceph.com/docs/master/rados/operations/add-or-rm-osds/#removing-osds-manual

Check Max Threadcount: If you have a node with a lot of OSDs, you may be hitting the default maximum number of threads (e.g., usually 32k), especially during recovery. You can increase the number of threads using sysctl to see if increasing the maximum number of threads to the maximum possible number of threads allowed (i.e., 4194303) will help. For example:sysctl -w kernel.pid_max=4194303

Crush MAP

The CRUSH algorithm determines how to store and retrieve data by computing data storage locations. CRUSH empowers Ceph clients to communicate with OSDs directly rather than through a centralized server or broker. With an algorithmically determined method of storing and retrieving data, Ceph avoids a single point of failure, a performance bottleneck, and a physical limit to its scalability.

CRUSH maps contain a list of OSDs, a list of ‘buckets’ for aggregating the devices into physical locations, and a list of rules that tell CRUSH how it should replicate data in a Ceph cluster’s pools. By reﬂecting the underlying physical organization of the installation, CRUSH can model—and thereby address—potential sources of correlated device failures. Typical sources include physical proximity, a shared power source, and a shared network. By encoding this information into the cluster map, CRUSH placement policies can separate object replicas across different failure domains while still maintaining the desired distribution. For example, to address the possibility of concurrent failures, it may be desirable to ensure that data replicas are on devices using different shelves, racks, power supplies, controllers, and/or physical locations.

The short of this is that in ceph.conf, you can define a host’s location, which subsequently defines the location of each Ceph OSD operating on that host. A location is a collection of key pairs consisting of Ceph predefined types.

Each CRUSH type has a value. The higher this value, the less specific the grouping is. So when deciding where to place data chunks or replicants of an object, Ceph OSDs will consult the crush maps to find other Ceph OSDs in other host, chassis, and racks. The fault domain policies can be defined and tweaked.

published: true

These are just some notes I took as I did my first run through on installing Ceph on some spare ECS Hardware I had access to. Note that currently, no one would actually recommend doing this, but it was a good way for me to get started with Ceph.

Installation

Following this guide http://docs.ceph.com/docs/hammer/start/quick-ceph-deploy/

I set this up the first time in the lab, nodes:

ljb01.osaas.lab (admin-node)

rain02-r01-01.osaas.lab (mon.node1)

rain02-r01-03.osaas.lab (osd.0)

rain02-r01-04.osaas.lab (osd.1)

In a more fleshed out setup, I would probably have a dedicated admin node (instead of the jump), and we would start off with the layout like this:

The first ‘caveat’ was that it tells you to configure a user (“ceph”) that can sudo up, which I did. But ceph-deploy attempts to modify the ceph user, which it can’t do while ceph is logged in.

[rain02-r01-03][DEBUG ] Setting system user ceph properties..

For step 6, adding OSDs, I diverged again to add disks instead of a directory. http://docs.ceph.com/docs/hammer/rados/deployment/ceph-deploy-osd/

I will setup sdaa, sdab, and sdac. Note that while I could use a separate disk partition (like an ssd) to maintain the journal, we only have one ssd in ECS hardware and it hosts the OS. So we’ll let each disk maintain its own journal.

PG’s are known as placement groups. http://docs.ceph.com/docs/master/rados/operations/placement-groups/
That page recommends that for 5-10 OSDs, (I have 9) we set this number to 512. I’m defaulted at 64. But then the tool tells me otherwise.

I’ll put this down as a question for later and set it to 128.
This does nothing, so I learned what I really need to do is make more pools. I make a new pool, but my HEALTH_WARN has changed to reflect my mistake.

Object Storage Gateway

Ceph does not provide a quick way to install and configure object storage gateways. You essentially have to install apache, libapache2-mod-fastcgi, rados, radosgw, and create a virtualhost. While you could do this on only a portion of your OSD nodes, it seems like it would make most sense to do it on each OSD node so that each node can be part of the pool.

Fun fact: You can set quotas and read/write capabilities on users. It also can do usage statistics for a given time period.

All of the CLI commands can be implemented over API: http://docs.ceph.com/docs/hammer/radosgw/adminops/ – in this, just adding /admin/
(configurable) to the url. You can give any S3 user admin capabilities. It’s the same backend authentication for both.

I also confirmed that by installing radosgw on a second node, all user ids and bucket was still available. Clustering confirmed.

Automation

When it comes to automating this, there are several options.

Build our own ceph-formula up into something that fully manages ceph.

Pros

It will do what we want it to.

Cons

Our current ceph-formula currently only installs packages.

Lots of work involved

Refactor public ceph-salt formula to meet our needs.

https://github.com/komljen/ceph-salt

Pros:

ceph-salt seems to cover most elements, including orcehstration

uses a global_variables.jinja much like we use map.jinja

Cons

I’m sure we’ll find something wrong with it. (big grin)

maintained by 1 person

last updated over a year ago

Use Kolla to setup Ceph:

http://docs.openstack.org/developer/kolla/ceph-guide.html

Pros:
* Openstack team might be using Kolla – standardization
* Already well built out
* Puts ceph components into docker containers (though some might consider this a con)

Cons:

It’s reported that it work primarily on Redhat/Centos; less so on Ubuntu

Uses ansible as underlying management – this introduces a secondary management system over ssh

Is heavily opinionated based on Openstack architecture (some might say this is a pro)

Use Ansible-Ceph:

https://github.com/ceph/ceph-ansible

Pros:

Already well built out

Highly flexible/configurable

Works on Ubuntu

Not opinionated

maintained by ceph project

large contribution base

Cons:

Uses ansible as underlying management – this introduces a secondary management system (in additon to salt) over ssh