DRBD and Nested LVM

DRBD is an open-source, block-level storage replication system designed to provide distributed storage in cluster environments. Nodes of a cluster write changes locally and DRBD replicates those changes to the other nodes. It supports both Primary/Secondary (master/slave) and Primary/Primary (master/master) configurations. I'm mostly familiar with DRBD through our virtualization infrastructure, where I implemented it in a two-node cluster to provide persistent storage for virtual appliances.

DRBD can be used on top of a variety of backing devices, such as single disks, mdadm RAIDs, hardware RAIDs, and LVM logical volumes. While configuring DRBD directly on any of these devices is relatively straightforward, using an LVM volume as the backing device has two primary advantages that make the added complexity worthwhile: the familiar LVM tools for backing-device management, and convenient online backing-device expansion. If we decide to add disks to our array later, we can easily resize the backing device with standard LVM commands.
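As a sketch of that expansion workflow (the volume-group and resource names here are placeholders, and `drbdadm resize` is the standard DRBD command for picking up a grown backing device online):

```shell
# Run on BOTH nodes: grow the backing logical volume.
# "array0/r0" is a placeholder VG/LV name.
lvextend -L +100G /dev/array0/r0

# Run on ONE node: tell DRBD to recognize the new backing-device
# size; the additional space is then synchronized to the peer.
drbdadm resize r0
```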

Let’s walk through the configuration process for a simple two-node, dual-primary cluster. I’ll assume your nodes already have identical local storage configured. I’ll also assume you have DRBD installed and the ‘drbd’ kernel module loaded on your nodes; LinBit, the organization that develops DRBD, has a commercial repository for subscribers, while the EPEL repository has packages for RHEL/CentOS users.
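A quick way to confirm DRBD is ready on each node (the exact `/proc/drbd` output format varies by DRBD version):

```shell
# Load the DRBD kernel module if it isn't loaded already
modprobe drbd

# Verify the module is present
lsmod | grep drbd

# /proc/drbd reports the running DRBD version and, once resources
# are configured, the state of each one
cat /proc/drbd
```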

Backing LVM Configuration

First, on each node, we create an LVM physical volume and volume group on our local storage device.
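That step might look like the following; `/dev/sdb` is a placeholder for your local array, and the volume-group names match those used in the resource configuration further down:

```shell
# On node1: initialize the local array as an LVM physical volume
# and create a volume group on it
pvcreate /dev/sdb
vgcreate n1_array0 /dev/sdb

# On node2, the same steps with that node's VG name:
pvcreate /dev/sdb
vgcreate n2_array0 /dev/sdb
```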

Next, we need to create volumes for each DRBD resource. A resource is a chunk of storage managed by DRBD. In a two-node cluster, we want a minimum of two resources, each designated to operate primarily on one node or the other. A configuration like this is necessary to minimize data loss during a split-brain scenario. If only one resource is configured and a node failure occurs followed by unsuccessful fencing, both nodes will continue to write data to the same resource independently of one another; we now have a split-brain cluster. When it comes time to resolve this scenario, we will have to choose one node’s data over the other’s, forcing us to discard data written by one node. We want to minimize the chance of data loss as much as possible, and using a dedicated resource for each node helps.

In my cluster, I’ve configured three resources: one resource provides shared storage to each node for configuration files and installation images while the other resources provide the block-level storage for cluster appliances.
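Creating the three backing volumes might look like this (sizes are placeholders; run the same commands on node2 with the `n2_array0` volume group):

```shell
# On node1; repeat on node2 substituting n2_array0.
# r0: shared volume for configuration files and installation images
lvcreate -L 50G  -n r0 n1_array0
# r1 and r2: block-level storage for each node's appliances
lvcreate -L 500G -n r1 n1_array0
lvcreate -L 500G -n r2 n1_array0
```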

DRBD Configuration

With backing storage partitioned, we can start configuring DRBD itself. First, we perform the global configuration by editing global_common.conf in /etc/drbd.d:

global {
    # Don't report usage statistics to LinBit
    usage-count no;
}
common {
    handlers {
        # Reboot on emergencies
        pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
        pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
        local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
        # Action to take to fence a peer.
        # "Fencing" means isolating problematic nodes in the cluster to
        # maintain cluster consistency.
        fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    }
    syncer {
        # Rate limit synchronizations.
        # Synchronization is performed when one node is detected to have
        # out-of-date data; it is uncommon during normal operation. Its
        # counterpart is replication.
        rate 20M;
        al-extents 3389;
    }
    disk {
        # Fencing policy. This tells DRBD to fence the peer when a primary
        # resource loses its replication link. If the fencing action fails,
        # shoot the other node in the head (power it off).
        fencing resource-and-stonith;
        # DRBD won't flush "cached" data to the disk; our hardware RAID
        # controller handles this (only safe with a battery-backed or
        # non-volatile write cache)
        disk-flushes no;
        # Disable flushing on the metadata device, the device that stores
        # DRBD metadata. With internal metadata, this is the backing
        # device itself.
        md-flushes no;
        # Disable disk barriers, not supported after kernel 2.6.32
        disk-barrier no;
    }
    net {
        # Use TCP send buffer auto-tuning
        sndbuf-size 0;
        # Protocol "C" tells DRBD not to tell the operating system that
        # a write is complete until the data has reached persistent
        # storage on both nodes. This is the slowest option, but it is
        # also the only one that guarantees consistency between the
        # nodes. It is also required for dual-primary, which we will
        # be using.
        protocol C;
        # Allow dual-primary, meaning both nodes may actively use DRBD
        # resources simultaneously.
        allow-two-primaries yes;
        # These tell DRBD what to do after a split-brain when neither
        # node was primary, when one node was primary, and when both
        # nodes were primary. Since we will be running dual-primary, we
        # cannot safely recover automatically; the only safe option is
        # for the nodes to disconnect from one another and let a human
        # decide which node's data to invalidate.
        after-sb-0pri discard-zero-changes;
        after-sb-1pri discard-secondary;
        after-sb-2pri disconnect;
    }
}

resource r0 {
    # This sets the device name of this DRBD resource.
    device /dev/drbd0;
    # This tells DRBD what the backing device is for this resource.
    # This value will change from host to host.
    # Uncomment on node1
    disk /dev/n1_array0/r0;
    # Uncomment on node2
    # disk /dev/n2_array0/r0;
    # This controls the location of the metadata. When "internal" is used,
    # as we use here, a little space at the end of the backing device is
    # set aside (roughly 32 MB per 1 TB of raw storage). External metadata
    # can be used to put the metadata on another partition when converting
    # existing file systems to be DRBD backed, when there is no extra space
    # available for the metadata.
    meta-disk internal;
    net {
        # How DRBD will verify blocks have replicated without error.
        verify-alg md5;
        # LEAVE COMMENTED OUT IN PRODUCTION.
        # Uncommenting this makes DRBD verify every single block as it is
        # received instead of relying on TCP integrity checks. It is not
        # recommended in production or under heavy load, as it can
        # generate false errors.
        #data-integrity-alg md5;
    }
    # Tell DRBD where the other nodes are.
    # The name used here must match what is returned by "uname -n".
    on node1 {
        address 10.255.254.1:7788;
    }
    on node2 {
        address 10.255.254.2:7788;
    }
}
resource r1 {
    device /dev/drbd1;
    # Uncomment on node1
    disk /dev/n1_array0/r1;
    # Uncomment on node2
    # disk /dev/n2_array0/r1;
    meta-disk internal;
    net {
        verify-alg md5;
        #data-integrity-alg md5;
    }
    on node1 {
        address 10.255.254.1:7789;
    }
    on node2 {
        address 10.255.254.2:7789;
    }
}
resource r2 {
    device /dev/drbd2;
    # Uncomment on node1
    disk /dev/n1_array0/r2;
    # Uncomment on node2
    # disk /dev/n2_array0/r2;
    meta-disk internal;
    net {
        verify-alg md5;
        #data-integrity-alg md5;
    }
    on node1 {
        address 10.255.254.1:7790;
    }
    on node2 {
        address 10.255.254.2:7790;
    }
}
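With the configuration written, the resources are initialized and brought online with `drbdadm`. The commands below are a sketch; the `primary --force` form is DRBD 8.4 syntax (DRBD 8.3 used `drbdadm -- --overwrite-data-of-peer primary` instead):

```shell
# On BOTH nodes: write DRBD metadata to each backing device
drbdadm create-md r0 r1 r2

# On BOTH nodes: attach the backing devices and connect to the peer
drbdadm up r0 r1 r2

# On ONE node only: declare this node's (empty) data authoritative
# to kick off the initial synchronization
drbdadm primary --force r0 r1 r2

# Once synchronization completes, promote the resources on the
# second node as well for dual-primary operation
drbdadm primary r0 r1 r2
```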

Clustered LVM Configuration

At this point, we’ve finished configuring DRBD and have three empty yet synchronized resources. Now, we will add clustered LVM partitioning on top of each resource. We must use cLVM because DRBD is only in charge of replication and low-level integrity of our data; it does not provide the locking that prevents the same data from being updated by both nodes at the same time. That function must come from layers above DRBD.

First, we edit ‘/etc/lvm/lvm.conf’ to enable clustered locking and to ignore nested LVM partitions that may be created by our cluster appliances.
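A minimal sketch of the relevant settings (the filter patterns are illustrative and must be adjusted to your device names):

```
# /etc/lvm/lvm.conf

# Use clustered locking via clvmd
locking_type = 3

# Only scan the DRBD devices and the local backing array for physical
# volumes; reject everything else so LVM ignores any nested volumes
# created inside VM disks
filter = [ "a|^/dev/drbd.*|", "a|^/dev/sdb.*|", "r|.*|" ]

# Don't cache device scan results; required for the filter to be
# applied reliably
write_cache_state = 0
```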

GFS Configuration

For r1 and r2, we can think of the LVM partitioning as a filesystem that provides block-level storage to cluster services; in my case, that means raw disks used by VMs. For r0 to share files between nodes, we need more than replicated block-level storage; we need a fully functional filesystem that allows both nodes to access files concurrently. For this, we’ll use GFS2.

GFS2, Global Filesystem 2, is designed for concurrent access on shared storage, exactly what we want on top of our r0 resource. We set up GFS2 in a similar fashion to ext4 or XFS. First, we create our clustered volume group using LVM and then format the volume with GFS2:
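A sketch of those steps; the volume-group, filesystem, and cluster names are placeholders, and the cluster name in `-t` must match the name in your cluster configuration:

```shell
# On ONE node: create a clustered volume group and a volume on r0
pvcreate /dev/drbd0
vgcreate --clustered y shared_vg /dev/drbd0
lvcreate -l 100%FREE -n shared shared_vg

# Format with GFS2: lock_dlm for cluster-wide locking,
# "-t <clustername>:<fsname>" for the lock-table name, and one
# journal per node (-j 2 for our two-node cluster)
mkfs.gfs2 -p lock_dlm -t mycluster:shared -j 2 /dev/shared_vg/shared
```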

We can test GFS2 by creating some files on one node and verifying the data on another.
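That test might look like the following (the mount point is a placeholder):

```shell
# On BOTH nodes: mount the shared filesystem
mkdir -p /shared
mount -t gfs2 /dev/shared_vg/shared /shared

# On node1: write a test file
echo "hello from node1" > /shared/test.txt

# On node2: the file should be immediately visible
cat /shared/test.txt
```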

Summary

DRBD isn’t very useful by itself; it’s designed to be used in conjunction with a cluster manager running other services. The manager handles starting prerequisite services in the proper order as well as handling the promotion of resources from secondary to primary. A cluster manager will also help us during failure scenarios by fencing nodes and restarting critical services. This adds some automated intelligence to how the cluster operates, freeing up the administrator to do other things.
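As one illustration of how a cluster manager might own the DRBD promotion, here is a Pacemaker sketch using `pcs`; the resource names are placeholders, and the exact `pcs` syntax varies between Pacemaker versions:

```shell
# Define a DRBD resource managed by the ocf:linbit:drbd agent
pcs resource create p_drbd_r0 ocf:linbit:drbd \
    drbd_resource=r0 op monitor interval=30s

# Run it as a multi-state (master/slave) resource; master-max=2
# lets both nodes be promoted for dual-primary operation
pcs resource master ms_drbd_r0 p_drbd_r0 \
    master-max=2 master-node-max=1 clone-max=2 notify=true
```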

Keep in mind that setting up DRBD in production is not to be taken lightly even if the configuration is fairly simple. Plenty of hardware and use-case testing is required before deciding on a stable configuration that can be supported in the long term. It’s essential to have the proper hardware and networking capability that can support high rates of data transfer with low latency. Perhaps the most important consideration is the speed of the backing storage devices; LinBit recommends highly performant 10K or 15K RPM disks. Less expensive 7.2K RPM disks can be used, but they’ll likely be a bottleneck in your configuration, limiting the overall performance of your cluster. Choose wisely!