Open Source Storage

The storage landscape is evolving - moving from an expensive, proprietary model to embrace a software defined future. Even EMC acknowledge this shift with the introduction of ViPR ... so why not write about it!

Thursday, 5 April 2018

My previous post showed you how to get deduplication working on Linux with VDO. In some ways, that's the post that could cause trouble - if you start using vdo across a number of hosts, how can you easily establish monitoring or even alerting?

So that's the problem we're going to focus on in this post.

Monitoring

There are a heap of different ways to monitor systems, but the rising star currently is Prometheus. Historically, I've used monitoring systems that require clients to push data to a central server, but Prometheus turns this around. With Prometheus, data collection is initiated by the Prometheus server itself - it's called a 'scrape' job. This approach simplifies client configuration and management, which is a huge bonus for large installations.

To make vdo data available, we need an exporter. The exporter provides an HTTP endpoint that the Prometheus server will scrape metrics from. There are a heap of exporters available for Prometheus covering a plethora of different subsystems, but since vdo is new there isn't something you can just pick up and run with. Well, that was the case...

vdo_exporter Project

The scrape job simply issues a GET request to the "/metrics" HTTP API endpoint on a host. Developing an API endpoint for this in Python is fairly straightforward, and given the metrics themselves are all nicely grouped together under sysfs, it seemed a bit of a no-brainer to develop an exporter. My exporter can be found here. The project's repo contains the Python code, a systemd unit file and what I hope is a sensible README file documenting how to install the exporter (if you have a firewall active, remember to open port 9286!).
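To make the idea concrete, here's a minimal sketch of what an exporter of this kind does. It is not the code from the vdo_exporter project: the sysfs location (/sys/kvdo/<volume>/statistics), the "vdo_" metric prefix and the "volume" label are all assumptions made for illustration.

```python
import os
from http.server import BaseHTTPRequestHandler, HTTPServer

SYSFS_ROOT = "/sys/kvdo"  # assumed location of the vdo statistics


def read_stats(stats_dir):
    """Return {stat_name: float} for every numeric file in stats_dir."""
    stats = {}
    for name in sorted(os.listdir(stats_dir)):
        path = os.path.join(stats_dir, name)
        if not os.path.isfile(path):
            continue
        try:
            with open(path) as f:
                stats[name] = float(f.read().strip())
        except (OSError, ValueError):
            continue  # skip unreadable or non-numeric entries
    return stats


def format_metrics(volume, stats, prefix="vdo_"):
    """Render stats as lines of the Prometheus text exposition format."""
    lines = []
    for name, value in sorted(stats.items()):
        lines.append('%s%s{volume="%s"} %s' % (prefix, name, volume, value))
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Answer GET /metrics with the stats of every volume found."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = ""
        if os.path.isdir(SYSFS_ROOT):
            for volume in sorted(os.listdir(SYSFS_ROOT)):
                stats_dir = os.path.join(SYSFS_ROOT, volume, "statistics")
                if os.path.isdir(stats_dir):
                    body += format_metrics(volume, read_stats(stats_dir))
        payload = body.encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port=9286):
    """Run the endpoint on the port mentioned above (blocks forever)."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling serve() starts the endpoint. A real exporter would also want HELP/TYPE lines and proper error handling, which is why the project's code is the better starting point.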

I'm leaving the installation of the exporter as an exercise for the reader, and will use the rest of this article to show you how to quickly stand up prometheus and grafana to collect and visualise the vdo statistics. For this example, I'm again using Fedora so for other distributions you may have to tweak 'stuff'.

Containers to the Rescue!

The prometheus and grafana projects both provide docker images on docker hub, so assuming you already have docker installed on your machine you can grab the images with the following;

Containers are inherently stateless, but for monitoring and dashboards we need to make sure that these containers use either different docker volumes, or persist data to the host's filesystem. For this exercise, I'll be exposing some directories on the host's filesystem (change these to suit!)

With the directories in place for the persistent data within the containers, and the compose file ready you just need to start the containers. Run the docker-compose command from the directory that holds your docker-compose.yml file.

Configuring Prometheus

You should already have the vdo_exporter service running on your hosts that are using vdo, so the next step is to create a scrape job in prometheus to tell it to go and fetch the data. This is done by editing the prometheus.yml file - in my case this is in /opt/docker/grafana-prom/prom-etc. Under the scrape_configs section add something like this to collect data from your vdo host(s)

Enter the prometheus details (and ensure you set the data source as the default)

The grafana directory in the vdo_exporter project holds a file called VDO_Information.json. This json file is the dashboard definition, so we need to import it.

Click on the grafana icon again, highlight the Dashboards entry, then select the import option from the pop-up menu.

Click on "Upload .json File", and pick the VDO_Information.json file to upload.

Now select the dashboard icon (to the right of the Grafana logo), and select "VDO Information". You should then see something like this

As you add more hosts that are vdo enabled, just add the host's ip to the prometheus scrape configuration and reload prometheus. Simples..
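Before adding a host to the scrape configuration, it can be worth confirming that its exporter is answering. The sketch below fetches /metrics the same way a scrape would and parses the response into a dictionary. The function names are made up for this example, and the parser only copes with simple "name{labels} value" lines (no timestamps), so treat it as a quick check rather than a full implementation of the format.

```python
from urllib.request import urlopen


def parse_metrics(text):
    """Parse Prometheus text-format samples into {series: float}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines plus HELP and TYPE comments
        series, _, value = line.rpartition(" ")
        try:
            samples[series] = float(value)
        except ValueError:
            continue  # ignore lines this simple parser can't handle
    return samples


def fetch_metrics(host, port=9286, timeout=5):
    """GET /metrics from an exporter, as a Prometheus scrape would."""
    url = "http://%s:%d/metrics" % (host, port)
    with urlopen(url, timeout=timeout) as response:
        return parse_metrics(response.read().decode("utf-8"))
```

Something like fetch_metrics("my-vdo-host") (a placeholder host name) should return a non-empty dictionary if the exporter is up and port 9286 is reachable.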

Grafana provides a notifications feature which enables you to define threshold based alerting. You could define a trigger for low "physical space" conditions, or alert based on recovery being active - I leave that up to you! Grafana supports a number of different notification endpoints including PagerDuty, Sensu and even email! So take some time and review the docs to see how Grafana could best integrate into your environment.

And Remember...

VDO is not the proverbial "silver bullet". The savings from any compression and deduplication technology are dependent on the data you're storing, and vdo is no different. Also, each vdo volume requires additional RAM, so if you want to move vdo out of the test environment into production you'll need to plan for additional CPU and RAM to "make the magic happen"™.

Wednesday, 4 April 2018

Whether you're using proprietary storage arrays or software defined storage, the actual cost of capacity can sometimes provoke responses like, "why do you need all that space?" or "OK, but that's all the storage you're going to get, so make it last".

The problem is that storage is a commodity resource - it's like toner or ink in a printer. When you run out, things will stop and lots of people tend to lose their sense of humor. Controlling storage growth has been going on for over 10 years in the proprietary storage space, with one of the most successful companies being NetApp, who introduced data deduplication with their A-SIS (Advanced Single Instance Storage) feature back in 2007. The message was that if you wanted to reduce storage consumption, you basically had to buy the more expensive "stuff" in the first place.

This was the "status quo" until Red Hat acquired Permabit in mid 2017...now compression and deduplication features are heading towards a Linux server near you!

That's the history lesson, now let's look at how you can kick the tyres on open source based compression and deduplication. For the remainder of this article, I'll walk through the steps you need to quickly get "dedupe" up and running with Fedora.

Installation

Since we're just testing, create a vm and install Fedora 27. Use libvirt, parallels, virtualbox...whatever takes your fancy - or maybe just use a cloud image in AWS. The choice is yours! Just try to ensure the vm has something like; 2 vcpus, 4GB RAM, an OS disk (20GB) and a data disk for vdo testing.

By default, new volumes are created with compression and deduplication enabled. If you don't like that you can play with the --compression or --deduplication flags.

A vdo volume is actually a device mapper device, in this case /dev/mapper/vdo0. It's this 'dm' device that you'll use from here on in.

Usage

Now you have a vdo volume, the next step is to get it deployed and understand how to report on space savings. The first thing is filesystem formatting. Make sure you use the -K switch to avoid issuing discards - remember, a vdo volume is in effect a thin provisioned volume.

[root@f27-vdo ~]# mkfs.xfs -K /dev/mapper/vdo0

With the filesystem in place, the next step would normally be updating fstab...right? Well not this time. For vdo volumes, the boot time startup sequence between fstab and the vdo service is a problem - so we need to use a mount service to ensure vdo volumes are mounted correctly.

The vdo rpm provides a sample mount service definition (/usr/share/doc/vdo/examples/systemd/VDO.mount.example). For this example, I'm going to mount the vdo volume at /mnt/vdo0

At this point you've used the vdo command to create the volume, but there is also a command to look at the volume's statistics called vdostats. To give us something to look at I copied the same 200MB disk image to the volume 20 times, which will also help to explain vdo overheads.

Wait a minute...at a logical layer, the filesystem says that it's 4.5G used, but at the physical vdo layer it's saying practically the same thing AND that there's a 95% saving! So which is right? The answer is both :) The vdo subsystem persists metadata on the volume (lookup maps etc), which accounts for a chunk of the physical space used, and the savings value is derived purely from the logical blocks "in" and the physical, unique blocks written. If you need to understand more you can dive into the sysfs filesystem.

The most useful stats I've found to understand how space is consumed are;

overhead_blocks_used : metadata for the volume. The overhead is proportional to the physical size of the volume; for example, on an 8TB device, the overhead was around 9GB

data_blocks_used: this is the count of the physical blocks consumed by user data

logical_blocks_used: the count of blocks consumed at the filesystem level

In my case, "overhead_blocks_used" was around 4GB and "data_blocks_used" around 200MB. The savings% value only applies to actual user data written to the volume, so it's derived from the ratio of data_blocks_used to logical_blocks_used: 200MB of unique data against roughly 4GB of logical data means only about 5% is physically stored, which equates to a saving of around 95%. Now it makes sense!
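The arithmetic can be checked with a few lines of Python. This is a sketch of the calculation as I understand it, not code taken from vdostats: it assumes the counters are in 4 KiB blocks and uses rounded figures from the test above (20 copies of a 200MB image and roughly 4GB of overhead).

```python
BLOCK_SIZE = 4096  # assumption: the counters are in 4 KiB blocks
MIB = 1024 * 1024


def savings_percent(data_blocks_used, logical_blocks_used):
    """Space saving for user data only, as a percentage."""
    if logical_blocks_used == 0:
        return 0.0
    saved = logical_blocks_used - data_blocks_used
    return 100 * saved / logical_blocks_used


def physical_used_bytes(data_blocks_used, overhead_blocks_used):
    """Physical space consumed: unique user data plus volume metadata."""
    return (data_blocks_used + overhead_blocks_used) * BLOCK_SIZE


logical = 20 * 200 * MIB // BLOCK_SIZE   # 20 copies of a 200MB image
data = 200 * MIB // BLOCK_SIZE           # one unique copy actually stored
overhead = 4 * 1024 * MIB // BLOCK_SIZE  # roughly 4GB of metadata

print(savings_percent(data, logical))             # 95.0
print(physical_used_bytes(data, overhead) / MIB)  # 4296.0
```

So the physical layer reports a little over 4GB used (almost all of it overhead), while the saving on user data is 95%, which is why both numbers can be right at the same time.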

Final Words

Deduplication is a complex beast, but hopefully the above will at least get you up and running with this new Linux feature.

If you decide to use vdo across a number of servers, running vdostats isn't really a viable option. For that it would be more useful to leave the command line behind and look at solutions like prometheus and grafana to track capacity usage and generate alerts. Spoiler alert!...that's the subject of my next post :)

Sunday, 10 December 2017

There is no doubt that Ansible is a pretty cool automation engine for provisioning and configuration management. ceph-ansible builds on this versatility to deliver what is probably the most flexible Ceph deployment tool out there. However, some of you may not want to get to grips with Ansible before you install Ceph...weird right?

No, not really.

If you're short on time, or just want a cluster to try ceph for the first time, a more guided installation approach may help. So I started a project called ceph-ansible-copilot.

The idea is simple enough; wrap the ceph-ansible playbook with a text GUI. Very 1990's, I know, but now instead of copying and editing various files you simply start the copilot tool, enter the details and click 'deploy'. The playbook runs in the background within the GUI and any errors are shown there and then...no more drowning in an ocean of scary ansible output :)

The features and workflows of the UI are described in the project page's README file.

Enough rambling, let's look at how you test this stuff out. The process is fairly straightforward;

configure some hosts for Ceph

create the Ansible environment

run copilot

The process below describes each of these steps using CentOS7 as the deployment target for Ansible and the Ceph cluster nodes.

1. Configure Some Hosts for Ceph

Call me lazy, but I'm not going to tell you how to build vm's or physical servers. To follow along, the bare minimum you need are a few virtual machines - as long as they have some disks on them for Ceph, you're all set!

2. Create the Ansible environment

Typically for a Ceph cluster you'll want to designate a host as the deployment or admin host. The admin host is just a deployment manager, so it can be a virtual machine, a container or even a real (gasp!) server. All that really matters is that your admin host has network connectivity to the hosts you'll be deploying ceph to.

The main playbook for ceph-ansible is in /usr/share/ceph-ansible - this is where you need to run copilot from (it will complain if you try to run it in some other place!)

> cd /usr/share/ceph-ansible
> copilot

Then follow the UI.

Example Run

Here's a screen capture showing the whole process, so you can see what you get before you hit the command line.

The video shows the deployment of a small 3 node ceph cluster, 6 OSDs, a radosgw (for S3), and an MDS for cephfs testing. It covers the configuration of the admin host, the copilot UI and finally a quick look at the resulting ceph cluster. The video is 9mins in length, but for those of us with short attention spans, here's the timeline so you can jump to the areas that interest you.

So far I've only tested 'simple' deployments using the packages from ceph.com (community deployments) against a CentOS target. So like I said, more testing is needed, a lot more...but for now there's enough of the core code there for me to claim a victory and write a blog post!

Aside from the testing, these are the kinds of things that I'd like to see copilot handle

collocation rules (which daemons can safely run together)

resource warnings (if you have 10 HDD's but not enough RAM, or CPU...issue a warning)

handle the passwordless ssh setup. copilot already checks for passwordless ssh, so instead of leaving it to the admin to resolve any issues, just add another page to the UI.

That's my wishlist - what would you like copilot to do? Leave a comment, or drop by the project on github.

Tuesday, 5 July 2016

Recently I've been working on converging glusterfs with oVirt - hyperconverged, open source style. oVirt has supported glusterfs storage domains for a while, but in the past a virtual disk was stored as a single file on a gluster volume. This helps some workloads, but file distribution and functions like self heal and rebalance have more work to do. The larger the virtual disk, the more work gluster has to do in one go.

Enter sharding.

The shard translator was introduced with version 3.7, and enables large files to be split into smaller chunks (shards) of a user defined size. This addresses a number of legacy issues when using glusterfs for virtual machine storage - but does introduce an additional level of complexity. For example, how do you now relate a file to its shards, or vice-versa?

The great thing is that even though a file is split into shards, the implementation still allows you to relate files to shards with a few simple commands.

Firstly, let's look at how to relate a file to its shards;

And now, let's go the other way. We start with a shard, and end with the parent file.
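As a rough sketch of the relationship in both directions, the Python below works purely on names. It assumes the shard translator's usual layout: the first block stays in the original file, and the remaining blocks live on the bricks under a hidden .shard directory as <GFID>.<N>, where GFID is the parent file's gfid and N starts at 1. The function names, and the gfid and sizes you feed them, are illustrative - check the naming against your own bricks before relying on it.

```python
import re

# A shard name is the parent's gfid (a 36 character UUID) plus an index.
SHARD_NAME = re.compile(r"^(?P<gfid>[0-9a-fA-F-]{36})\.(?P<index>\d+)$")


def shard_names(gfid, file_size, shard_size):
    """File to shards: names expected under .shard for one file."""
    if file_size <= shard_size:
        return []  # small files fit entirely in the base file
    blocks = -(-file_size // shard_size)  # ceiling division
    # Block 0 is the base file itself, so shards are numbered from 1.
    return ["%s.%d" % (gfid, n) for n in range(1, blocks)]


def parent_gfid(shard_name):
    """Shard to file: return (gfid, index) for a shard name, or None."""
    match = SHARD_NAME.match(shard_name)
    if match is None:
        return None
    return match.group("gfid"), int(match.group("index"))
```

On a brick, a file's gfid is held in its trusted.gfid extended attribute, so that value is the link between the file you see on the mount point and the entries under .shard.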

Sunday, 29 May 2016

These days hyperconverged strategies are everywhere. But when you think about it, sharing the finite resources within a physical host requires an effective means of prioritisation and enforcement. Luckily, the Linux kernel already provides an infrastructure for this in the shape of cgroups, and the interface to these controls is now simplified with systemd integration.

So let's look at how you could use these capabilities to make Gluster a better neighbour in a collocated or hyperconverged model.

First, some common systemd terms we should be familiar with;

slice : a slice is a concept that systemd uses to group together resources into a hierarchy. Resource constraints can then be applied to the slice, which defines

how different slices may compete with each other for resources (e.g. weighting)

how resources within a slice are controlled (e.g. cpu capping)

unit : a systemd unit is a resource definition for controlling a specific system service

NB. More information about control groups with systemd can be found here

In this article, I'm keeping things simple by implementing a cpu cap on glusterfs processes. Hopefully, the two terms above are big clues, but conceptually it breaks down into two main steps;

define a slice which implements a CPU limit

ensure gluster's systemd unit(s) start within the correct slice.

So let's look at how this is done.

Defining a slice

Slice definitions can be found under /lib/systemd/system, but systemd provides a neat feature where /etc/systemd/system can be used to provide local "tweaks". This override directory is where we'll place a slice definition. Create a file called glusterfs.slice, containing;

[Slice]
CPUQuota=200%

CPUQuota is our means of applying a cpu limit on all resources running within the slice. A value of 200% defines a 2 cores/execution threads limit.
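If you want to see what a given percentage means in practice, this small worked example converts a CPUQuota value into the quota/period pair the kernel's CFS scheduler works with. It assumes the default 100ms period, and the function names are just for illustration.

```python
def cpu_quota_to_cfs(quota_percent, period_us=100000):
    """Translate a CPUQuota percentage into (quota_us, period_us)."""
    if quota_percent <= 0:
        raise ValueError("CPUQuota must be positive")
    return quota_percent * period_us // 100, period_us


def cores(quota_percent):
    """Number of full CPUs' worth of run time the quota allows."""
    return quota_percent / 100


print(cpu_quota_to_cfs(200))  # (200000, 100000)
print(cores(350))             # 3.5
```

In other words, with CPUQuota=200% everything in the slice shares 200ms of CPU time in every 100ms window, however many processes are running.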

Updating glusterd

Next step is to give gluster a nudge so that it shows up in the right slice. If you're using RHEL7 or CentOS7, cpu accounting may be off by default (you can check in /etc/systemd/system.conf). This is OK, it just means we have an extra parameter to define. Follow these steps to change the way glusterd is managed by systemd.

glusterd is responsible for starting the brick and self heal processes, so by ensuring glusterd starts in our cpu limited slice, we capture all of glusterd's child processes too. Now the potentially bad news...this 'nudge' requires a stop/start of gluster services. If you're doing this on a live system you'll need to consider quorum, self heal etc. However, with the settings above in place, you can get gluster into the right slice by;

Time for some more systemd coolness ;) The resource constraints that are applied by the slice are dynamic, so if you need more cpu, you're one command away from getting it;

# systemctl set-property glusterfs.slice CPUQuota=350%

Try the 'systemd-cgtop' command to show the cpu usage across the complete control group hierarchy.

Now if jumping straight into applying resource constraints to gluster is a little daunting, why not test this approach with a tool like 'stress'. Stress is designed to simply consume components of the system - cpu, memory, disk. Here's an example .service file which uses stress to consume 4 cores

Tuesday, 26 April 2016

In the past, gluster users have been able to open up their gluster volumes to iSCSI using the tgt daemon. This has been covered on other blogs and is also documented on gluster.org.

But tgt has been superseded in more recent distros by LIO. LIO provides a number of different local storage options to be utilised as SCSI targets, including; FILEIO, BLOCK, PSCSI and RAMDISK. These SCSI targets are implemented as modules in kernel space, but what isn't immediately obvious is that LIO also provides a userspace framework called TCMU. TCMU enables userspace files to become iSCSI targets.

With LIO, the easiest way to exploit gluster as an iSCSI target was through the FILEIO 'storage engine' over FUSE. However, the high number of context switches incurred within FUSE is likely to reduce the performance potential to your 'client' - especially for random I/O access patterns.

Until now, FUSE was your only option. But Andy Grover at Red Hat has just changed things. Andy has developed tcmu-runner which utilises the TCMU framework, allowing a glusterfs target to be used over gluster's libgfapi interface. Typically, with libgfapi you can expect less context switching, and improved performance.

For those like me, with short attention spans, here's what the improvement looked like when I compared LIO/FUSE with LIO/gfapi using a couple of fio based workloads.

Read Improvement

Mixed Workload Improvement

In both charts, IOPS and latency improve significantly using LIO/GFAPI, and further still by adopting the arbiter volume.

As you can see, for a young project, these results are really encouraging. The bad news is that to try tcmu-runner you'll need to either build systems based on Fedora 24/rawhide or compile it yourself from the github repo. Let's face it, there's always a price to pay for new shiny stuff :)

For the remainder of this article, I'll walk through the configuration of LIO and the iSCSI client that I used during my comparisons.

Preparing Your Environment

In the interests of brevity, I'm assuming that you know how to build servers, create a gluster trusted pool and define volumes. Here's a checklist of the tasks you should do in order to prepare a test environment;

build 3 Fedora24 nodes and install gluster (3.7.11) on each peer/node

on each node, ensure /etc/glusterfs/glusterd.vol contains the following setting - option rpc-auth-allow-insecure on. This is needed for gfapi access. Once added, you'll need to restart glusterd.

install targetcli (targetcli-2.1.fb43-1) and tcmu-runner (tcmu-runner-1.0.4-1) on each of your gluster nodes

form a gluster trusted pool, and create a replica 3 volume or replica with arbiter volume (or both!)

Issue saveconfig to commit the configuration (config is stored in /etc/target/saveconfig.json)

Configuring LIO - Node 2

When a LUN is defined by targetcli, a wwn is automatically generated for it. This is neat, but to ensure multipathing works we need the LUN exported by the gateways to share the same wwn - if they don't match, the client will see two devices, not two paths to the same device.

So for subsequent nodes, the steps are slightly different.

On the first node, look at /etc/target/saveconfig.json. You'll see a storage object item for the gluster file you've just created, together with the wwn that was assigned (highlighted).

(if you cd to /backstores/user:glfs and use help create you'll see a summary of the options available when creating the LUN)

With the LUN in place, you can follow steps 4-7 above to create the iqn, portal and LUN masking for this node.
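Since a wwn mismatch only shows up later as a confusing extra device on the client, a quick comparison of the two gateways' saved configurations can save some head scratching. The sketch below assumes saveconfig.json holds a top-level "storage_objects" list whose entries carry "name" and "wwn" keys - check that against your own file, as the layout may differ between targetcli versions. The function names are made up for this example.

```python
import json


def load_saveconfig(path="/etc/target/saveconfig.json"):
    """Read a saved LIO configuration from disk."""
    with open(path) as f:
        return json.load(f)


def lun_wwns(saveconfig):
    """Map storage object name -> wwn from a parsed saveconfig.json."""
    return {
        so["name"]: so.get("wwn")
        for so in saveconfig.get("storage_objects", [])
    }


def mismatched_wwns(first, second):
    """Names of LUNs defined on both gateways but with different wwns."""
    a, b = lun_wwns(first), lun_wwns(second)
    return sorted(name for name in a.keys() & b.keys() if a[name] != b[name])
```

Copy the second node's saveconfig.json somewhere on the first, load both with load_saveconfig(), and an empty list from mismatched_wwns() means the client should see two paths to each device rather than two devices.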

At this point you have;

3 gluster nodes

a gluster volume with a file defined, serving as an iscsi target

2 gluster nodes defined as iscsi gateways

each gateway exports the same LUN to a client (supporting multipathing)

Next up...configuring the client.

Client Configuration

To get the client to connect to your 'exported' LUN(s), you first need to ensure that the following rpms are installed on the client; device-mapper-multipath, iscsi-initiator-utils and preferably sg3_utils. With these packages in place you can move on to configure multipathing and connect to your LUN(s).

Multipathing : the example below shows a devices section from /etc/multipath.conf that I used to ensure my exported LUNs are seen as multipath devices. With this in place, you can take a node down for maintenance and your LUN remains accessible (as long as your volume has quorum!)

You can see in this example, I have three LUNs exported, and each one has two active paths (one to each gluster node). By default, the iscsi node definition (in /var/lib/iscsi/nodes) uses a setting of node.startup=automatic, which means LUN(s) will automagically reappear on the client following a reboot.

But from the client's perspective, how do you know which LUN is from which glusterfs volume/file? For this, sg_inq is your friend...

The highlighted text shows the configuration string you specified when you created the LUN in targetcli. If you run the same command against the devices themselves (/dev/sdf or /dev/sdg) you'd see the connection string from each of the respective gateways. Nice and easy!

And Finally...

Remember, this is all shiny and new - so if you try it, expect some rough edges! However, I have to say that it looks promising, and during my tests I didn't lose any data...but YMMV :)

This step creates a private key (.key) and associated certificate (.pem) on each node. The common name (CN) I've used is the hostname, so each certificate is unique to each gluster node and/or client. You may opt for a different scheme - but the important thing is the CN chosen here is reflected in step 6.

Combine the pem files to a single file

Use scp to copy the .pem file from each node to a single node in the cluster (I'm calling it the primary host for the purpose of this article)

The comma separated list should consist of the CNs used when generating the .pem files on each host, from step '1'.

Start the volume

# gluster vol start <volume-name>

Check SSL is enabled on the I/O Path

Although you can use vol info to check the SSL setting is in place, the best way to confirm that SSL is actually being used is to look at one of the log files;

# grep SSL /var/log/glusterfs/glustershd.log
[2015-03-31 06:58:34.674091] I [socket.c:3799:socket_init] 0-vol-client-2: SSL support on the I/O path is ENABLED
[2015-03-31 06:58:34.679316] I [socket.c:3799:socket_init] 0-vol-client-1: SSL support on the I/O path is ENABLED
[2015-03-31 06:58:34.680784] I [socket.c:3799:socket_init] 0-vol-client-0: SSL support on the I/O path is ENABLED
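Two of the more mechanical parts of this procedure, combining the .pem files and checking the log, lend themselves to a little Python. This is an illustrative sketch rather than part of the steps themselves: the function names are made up, the destination file is whatever you choose, and the log pattern is simply taken from the sample output above.

```python
import re

SSL_ENABLED = re.compile(
    r"\] \d+-(\S+): SSL support on the I/O path is ENABLED"
)


def combine_pem(pem_paths, dest_path):
    """Concatenate the per-node .pem files into a single file."""
    with open(dest_path, "w") as dest:
        for path in pem_paths:
            with open(path) as src:
                text = src.read()
            # Keep each certificate on its own lines in the output.
            dest.write(text if text.endswith("\n") else text + "\n")


def ssl_enabled_clients(log_text):
    """Return the client names whose log lines show SSL is ENABLED."""
    return sorted(set(SSL_ENABLED.findall(log_text)))
```

For a three brick volume like the one in the log above, you'd expect ssl_enabled_clients() to return three names, one for each client connection.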