Introduction

Ceph is a distributed object store and file system designed to provide excellent performance, reliability and scalability. See http://ceph.com for more information.

Proxmox VE supports Ceph's RADOS Block Device (RBD) for VM and container disks. The Ceph storage services are usually hosted on external, dedicated storage nodes. Such storage clusters can sum up to several hundreds of nodes, providing petabytes of storage capacity.

For smaller deployments, it is also possible to run Ceph services directly on your Proxmox VE nodes. Recent hardware has plenty of CPU power and RAM, so running storage services and VM/CTs on the same node is possible.

This article describes how to set up and run Ceph storage services directly on Proxmox VE nodes. If you want to install and configure an external Ceph storage cluster, read the Ceph documentation. Connecting Proxmox VE to an external Ceph storage then works as described in the section Ceph Client.

Advantages

Easy setup and management with CLI and GUI support on Proxmox VE

Thin provisioning

Snapshots support

Self healing

No single point of failure

Scalable to the exabyte level

Setup pools with different performance and redundancy characteristics

Data is replicated, making it fault tolerant

Runs on economical commodity hardware

No need for hardware RAID controllers

Easy management

Open source

Why do we need a new command line tool (pveceph)?

For use within the specific Proxmox VE architecture we provide our own tool, pveceph. Proxmox VE provides a distributed file system (pmxcfs) to store configuration files.

We use this to store the Ceph configuration. The advantage is that all nodes see the same file, and there is no need to copy configuration data around using ssh/scp. The tool can also use additional information from your Proxmox VE setup.

Tools like ceph-deploy cannot take advantage of that architecture.

Recommended hardware

Note:
Use only HBA cards or the onboard controller.
RAID controllers can have an extremely negative performance impact (also in JBOD mode).

You need at least three identical servers for the redundant setup. Here are the specifications of one of our test lab clusters with Proxmox VE and Ceph (three nodes):

A single enterprise class SSD for the Proxmox VE installation (because we run Ceph monitors there and write quite a lot of logs); we use one Samsung SM863 240 GB per host.

Use at least two SSDs as OSD drives. You need high quality, enterprise class SSDs here; never use consumer or "PRO" consumer SSDs. In our test setup, we have 4 Intel SSD DC S3520 1.2 TB, 2.5" SATA SSDs per host for storing the data (OSD, no extra journal). This setup delivers about 14 TB of raw storage (4 x 1.2 TB x 3 hosts). With a replication factor of 3, you can store up to about 4.7 TB (100%). But to be prepared for failed disks and hosts, you should never fill your storage up to 100%.

As a general rule, the more OSDs the better; a fast CPU (high GHz) is also recommended. NVMe cards are also possible, e.g. a mix of slow SATA disks with SSD/NVMe journal devices.

Again, if you expect good performance, always use enterprise class SSDs only. We have had good results in our test labs with:

SATA SSDs:

Intel SSD DC S3520

Intel SSD DC S3610

Intel SSD DC S3700/S3710

Samsung SSD SM863

NVMe PCIe 3.0 x4 as journal:

Intel SSD DC P3700

By adding more OSD SSD/disks into the free drive bays, the storage can be expanded. Of course, you can add more servers too as soon as your business is growing, without service interruption and with minimal configuration changes.

If you do not want to run virtual machines and Ceph on the same host, you can just add more Proxmox VE nodes and use these for running the guests and the others just for the storage.

Installation of Proxmox VE

Before you start with Ceph, you need a working Proxmox VE cluster with 3 nodes (or more). We install Proxmox VE on a fast and reliable enterprise class SSD, so we can use all bays for OSD (Object Storage Devices) data. Just follow the well known instructions on Installation and Cluster_Manager.

Note:

Use ext4 if you install on SSD (at the boot prompt of the installation ISO you can specify parameters, e.g. "linux ext4 swapsize=4").

Ceph on Proxmox VE 5.1

In Proxmox VE 5.1, the only available Ceph version is Luminous, which is stable and production ready.

Network for Ceph

All nodes need access to a separate 10Gb network interface, exclusively used for Ceph. We use network 10.10.10.0/24 for this tutorial.

It is highly recommended to use 10Gb for that network to avoid performance problems. Bonding can be used to increase availability.
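As an illustration only (the interface name and address are assumptions and depend on your hardware and addressing plan), the dedicated Ceph interface on the first node could be configured in /etc/network/interfaces roughly like this:

# dedicated Ceph network on node1 (example interface name ens19)
auto ens19
iface ens19 inet static
        address 10.10.10.1
        netmask 255.255.255.0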

Installation of Ceph packages

You now need to select 3 nodes and install the Ceph software packages there. We wrote a small command line utility called 'pveceph' which helps you perform these tasks; it also lets you choose which version of Ceph to install. Log in to all your nodes and execute the following on each of them:

node1# pveceph install --version luminous

node2# pveceph install --version luminous

node3# pveceph install --version luminous

This sets up an 'apt' package repository in /etc/apt/sources.list.d/ceph.list and installs the required software.
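To verify which Ceph version was actually installed, you can for example run:

node1# ceph --version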

Create initial Ceph configuration

After installation of packages, you need to create an initial Ceph configuration on just one node, based on your private network:

node1# pveceph init --network 10.10.10.0/24

This creates an initial config at /etc/pve/ceph.conf. That file is automatically distributed to all Proxmox VE nodes by using pmxcfs. The command also creates a symbolic link from /etc/ceph/ceph.conf pointing to that file. So you can simply run Ceph commands without the need to specify a configuration file.
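The generated file looks roughly like the following sketch (this is only an illustration; the exact keys and the generated fsid depend on your setup and on the pveceph version, so do not copy it verbatim):

[global]
     auth client required = cephx
     auth cluster required = cephx
     auth service required = cephx
     cluster network = 10.10.10.0/24
     public network = 10.10.10.0/24
     fsid = <generated cluster UUID>
     osd journal size = 5120
     osd pool default min size = 2
     osd pool default size = 3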

Creating Ceph Monitors

After that you can create the first Ceph monitor service using:

node1# pveceph createmon

Continue with CLI or GUI

As soon as you have created the first monitor, you can start using the Proxmox GUI (see the video tutorial on Managing Ceph Server) to manage and view your Ceph configuration.

Of course, you can continue to use the command line tools (CLI). We continue with the CLI in this wiki article, but you should achieve the same results no matter which way you finish the remaining steps.

Creating more Ceph Monitors

You should run 3 monitors, one on each node. Create them via GUI or via CLI. So please log in to the next node and run:

node2# pveceph createmon

And execute the same steps on the third node:

node3# pveceph createmon
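Once all three monitors are running, you can check that they have formed a quorum and see the overall cluster state with:

node3# ceph -s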

Note:

If you add a node where you do not want to run a Ceph monitor, e.g. another node for OSDs, you need to install the Ceph packages with 'pveceph install'.

Creating Ceph OSDs

First, please be careful when you initialize your OSD disks, because this basically removes all existing data from those disks. So it is important to select the correct device names. The Proxmox VE GUI displays a list of all disks, together with device names, usage information and serial numbers.
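You can create an OSD either via the GUI (Ceph > OSD) or on the command line. Assuming /dev/sdX is the empty disk you want to use (the device name is only an example, adapt it to your system), a typical call looks like:

node1# pveceph createosd /dev/sdX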

This partitions the disk (data and journal partition), creates the file systems, starts the OSD and adds it to the existing CRUSH map, so afterwards the OSD is running and fully functional. Please create at least 12 OSDs, distributed among your nodes (4 on each node).

You can create OSDs containing both journal and data partitions, or you can place the journal on a dedicated SSD. Using an SSD journal disk is highly recommended if you expect good performance.
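For example, to place the journal on a separate SSD, pveceph accepts a journal device option (device names are again only illustrative; check 'pveceph help createosd' for the options available in your version):

node1# pveceph createosd /dev/sdX -journal_dev /dev/sdY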

Note:

In order to use a dedicated journal disk (SSD), the disk needs to have a GPT partition table. You can create this with 'gdisk /dev/sd(x)'. If there is no GPT, you cannot select the disk as journal. Currently the journal size is fixed to 5 GB.
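After creating the OSDs you can verify that they all show up and are up and in, for example with:

node1# ceph osd tree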

Ceph Pools

The standard installation creates some default pools, so you can either use the standard 'rbd' pool, or create your own pools using the GUI.
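Pools can also be created on the command line with pveceph. A sketch only (the pool name is just an example; see the placement group calculation below for a suitable pg_num value):

node1# pveceph createpool mypool -size 3 -min_size 2 -pg_num 512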

In order to calculate the number of placement groups for your pools, you can use the rule of thumb shown below.
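A common rule of thumb from the Ceph documentation is to target roughly 100 placement groups per OSD: total PGs = (number of OSDs x 100) / replica count, rounded up to the next power of two. With the 12 OSDs and replica count 3 suggested above, this gives 12 x 100 / 3 = 400, rounded up to 512 placement groups.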

noout

"Periodically, you may need to perform maintenance on a subset of your cluster, or resolve a problem that affects a failure domain (e.g., a rack). If you do not want CRUSH to automatically rebalance the cluster as you stop OSDs for maintenance, set the cluster to noout first:"

ceph osd set noout

This can also be done from the Proxmox VE GUI under Ceph > OSD, and it is very important to do so before planned maintenance.
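Once the maintenance is finished, do not forget to clear the flag again so that Ceph resumes rebalancing:

ceph osd unset noout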

Disabling Cephx

Research this first: check the forum, the Ceph mailing list and http://docs.ceph.com/docs/master/rados/configuration/auth-config-ref/: "The cephx protocol is enabled by default. Cryptographic authentication has some computational costs, though they should generally be quite low. If the network environment connecting your client and server hosts is very safe and you cannot afford authentication, you can turn it off. This is not generally recommended."

Our Ceph network is isolated, and we wanted to speed up Ceph performance, so we did the following.

cd /etc/pve/priv
# Do NOT run the following line: the file would be recreated by
# /usr/share/perl5/PVE/API2/Ceph.pm and using the old keys later would no longer work.
# mv ceph.client.admin.keyring ceph.client.admin.keyring-old
mkdir /etc/pve/priv/ceph/old
mv /etc/pve/priv/ceph/*keyring /etc/pve/priv/ceph/old/
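The cephx protocol itself is then switched off in /etc/pve/ceph.conf. According to the Ceph auth configuration reference linked above, this is done roughly with the following settings (a sketch only; all Ceph daemons and clients have to be restarted afterwards, see the next step):

[global]
     auth cluster required = none
     auth service required = none
     auth client required = none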

Finally, start Ceph again. To start all daemons on a Ceph node (irrespective of type), execute the following:
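On a systemd based system such as Proxmox VE, this is typically:

systemctl start ceph.target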