SPOF #1 – Storage Node

To start with we’re going to need a minimum of two storage servers, although the system should scale to any (?) number of storage nodes. I’m using ZFS in a RAID-Z2 configuration as the storage platform, which introduces an additional layer of complexity, but at the same time a layer of flexibility and resilience you’re not going to get from any other option. If you hear anyone mention “BTRFS” as an alternative to ZFS, run away (screaming).

Why ZFS?

Software RAID without the traditional RAID-5 failings

RAID-5 resilience with striped reads, which is MUCH faster than traditional RAID

Checksums and low-level data integrity checks

Works well in ‘hot swap’ mode

Snapshots, incremental backups, etc.

You can obviously run on any hardware, but ideally steer clear of “onboard” SATA controllers if you can and opt for something like an LSI MegaRAID controller. This might set you back £150, but it will be well worth it in terms of performance. You will find it difficult to push more than 500MB/s through a stock motherboard, even with 8 SATA III disks, whereas the same disks on a good controller will give you double. (Onboard controllers typically have very limited bus bandwidth, so even if each channel can manage, for example, 180MB/s, the controller itself can’t really handle more than 3 channels flat out.)

Also worth a mention: *never* use the RAID software supplied on the card. You will always get better performance by making your controller present each disk as a JBOD device (or RAID0/1 device) and then using software RAID (ZFS in this case) to do all the RAIDing. Sounds mad (!) but if you think about it, your average RAID controller uses a 750MHz PowerPC chip, whereas your average server is running a 6 or 8 core 64bit chip at 3GHz. Which one is going to give the best throughput?!

So, starting with a stock Ubuntu 12.04, first thing to do is add ZFS as follows;
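As a rough sketch of that step, the commands below show one way to add ZFS from a PPA. The PPA name (ppa:zfs-native/stable) and the package name (ubuntu-zfs) are my assumptions for a 12.04-era install, so check them against your release before running anything. The run wrapper is a hypothetical helper that only prints the commands unless you set DRY_RUN=0.

```shell
#!/bin/sh
# Dry-run wrapper: prints each command unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Assumed names, not confirmed by this post: verify them for your release.
ZFS_PPA="ppa:zfs-native/stable"
ZFS_PACKAGE="ubuntu-zfs"

install_zfs() {
    # python-software-properties provides add-apt-repository on 12.04
    run sudo apt-get install -y python-software-properties
    run sudo add-apt-repository "$ZFS_PPA"
    run sudo apt-get update
    run sudo apt-get install -y "$ZFS_PACKAGE"
}

install_zfs
```

Run it once as-is to review the four commands, then again with DRY_RUN=0 to actually execute them.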

And we should be ready to roll, so next we need to see what sort of hardware we have. Now in order to be truly portable and to work with hot-swapping, i.e. so we can survive either the system or the user changing the order in which the component disks are presented to the system, we’ll work with the disk IDs as identifiers, rather than relying on device names. Take a look in /dev/disk/by-id to see what’s available on your system;
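A plain "ls -l /dev/disk/by-id" will do the job, but the listing also includes partition entries and usually more than one alias per disk. The helper below is a small sketch that narrows it down to whole disks; the function name and the default "ata-" prefix are my own choices, so adjust the prefix to whatever your controller presents.

```shell
#!/bin/sh
# List whole-disk IDs in a by-id style directory.
# Usage: list_disk_ids [DIR] [PREFIX]
list_disk_ids() {
    dir="${1:-/dev/disk/by-id}"
    prefix="${2:-ata-}"
    for path in "$dir"/"$prefix"*; do
        [ -e "$path" ] || continue      # nothing matched the pattern
        name="${path##*/}"              # strip the directory part
        case "$name" in
            *-part[0-9]*) ;;            # skip partition entries
            *) echo "$name" ;;
        esac
    done
}

list_disk_ids /dev/disk/by-id ata-
```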

As you can see I have 8 drives (which are 1TB each) and a 64GB SSD that I’m using as a root filesystem. Now we’re going to create a RAID-Z2 pool across all 8 disks, with two disks’ worth of parity, which should allow the array to survive up to 2 simultaneous drive failures.
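Here is a minimal sketch of how that pool creation could look. The pool name "srv" is inferred from the filesystem names used later in this post, and the disk IDs in the example call are placeholders, not real devices. The function only prints the command so you can check it before running it for real.

```shell
#!/bin/sh
# Build (but do not run) a zpool create command for a RAID-Z2 pool.
# Usage: zpool_create_cmd POOL DISK_ID...
zpool_create_cmd() {
    pool="$1"
    shift
    # raidz2 needs two parity disks plus at least one data disk
    if [ "$#" -lt 3 ]; then
        echo "raidz2 needs at least 3 disks, got $#" >&2
        return 1
    fi
    cmd="zpool create $pool raidz2"
    for id in "$@"; do
        cmd="$cmd /dev/disk/by-id/$id"
    done
    echo "$cmd"
}

# Placeholder IDs: substitute the ones listed on your own system.
zpool_create_cmd srv ata-EXAMPLE_1 ata-EXAMPLE_2 ata-EXAMPLE_3 ata-EXAMPLE_4
```

Once the printed command looks right, run it as root and check the result with "zpool status srv". With 8 x 1TB disks and two disks' worth of parity you should end up with roughly 6TB of usable space, before filesystem overhead.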

If you run “df” you should also find that it’s created a mount-point and automatically mounted the filesystem for you. At boot time this is handled by the “mountall” package you installed earlier as part of the ZFS PPA. However, in order to activate this, you need to edit /etc/default/zfs and set ZFS_MOUNT='yes'.
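If you would rather script that edit than open an editor, something like the following should work. It is only a sketch: the function name is mine, and it assumes the setting either appears as an uncommented ZFS_MOUNT= line or is missing altogether.

```shell
#!/bin/sh
# Set ZFS_MOUNT='yes' in a defaults file, replacing an existing
# ZFS_MOUNT= line or appending one if none is present.
# Usage: enable_zfs_mount FILE
enable_zfs_mount() {
    file="$1"
    tmp="$(mktemp)" || return 1
    if grep -q '^ZFS_MOUNT=' "$file"; then
        sed "s/^ZFS_MOUNT=.*/ZFS_MOUNT='yes'/" "$file" > "$tmp"
    else
        cat "$file" > "$tmp"
        echo "ZFS_MOUNT='yes'" >> "$tmp"
    fi
    cat "$tmp" > "$file"    # write back in place, keeping owner and mode
    rm -f "$tmp"
}
```

Try it on a copy of the file first, then run it as root against /etc/default/zfs.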

You should also create a file called /etc/modprobe.d/zfs.conf and insert options zfs zfs_arc_max=2684354560 zfs_arc_min=0, which will limit the amount of cache space that ZFS can use (2.5GiB in this case). Unfortunately ZFS uses its own cache (the ARC), which is not integrated into Linux’s page cache (yet?), so if you don’t add this limit there is the potential for ZFS to consume all free memory, and this can cause deadlock problems when the system page cache can’t get enough memory to operate. There is an argument for having this set by default, however … [!]
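The zfs_arc_max value is given in bytes, which is not the easiest thing to read. This little helper (my own naming, purely illustrative) builds the options line from a size in MiB, so you can pick a limit that suits the RAM in your box.

```shell
#!/bin/sh
# Print a modprobe options line capping the ZFS cache at SIZE_MIB mebibytes.
# Usage: arc_options_line SIZE_MIB
arc_options_line() {
    mib="$1"
    bytes=$((mib * 1024 * 1024))
    echo "options zfs zfs_arc_max=$bytes zfs_arc_min=0"
}

arc_options_line 2560   # 2560 MiB = 2.5 GiB = 2684354560 bytes
```

To write the file you could pipe the output through "sudo tee /etc/modprobe.d/zfs.conf". The setting is read when the zfs module loads, so expect it to take effect after the next reboot.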

Now we can get ready to implement our clustered network filesystem, so we’ll make some filesystems in readiness;

zfs create srv/bricks
zfs create srv/isos
zfs create srv/images

We’re going to store filesystem data in the bricks folder, installation ISO images in the isos folder, and Virtual Machine images in the images folder.

Networking

Obviously the speed at which our network filesystems will work will be dependent on the speed of our network connections, so I’m opting for a 3-NIC approach, although you could use 5, or indeed use more expensive 10G NICs, which in a few years’ time I’m sure everyone will be using. So, assuming you have the right hardware (!), edit your /etc/network/interfaces file to look something like this;
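What follows is not my exact configuration, just a sketch of the kind of stanza involved: a statically addressed bridge sitting on top of one physical NIC, which you would repeat for each NIC you want to use. The function name, the device names (br0, eth0) and the addresses are all illustrative assumptions. The bridge_* options come from the bridge-utils package.

```shell
#!/bin/sh
# Print an ifupdown stanza for a static bridge over one NIC.
# Usage: bridge_stanza BRIDGE NIC ADDRESS NETMASK
bridge_stanza() {
    cat <<EOF
iface $2 inet manual

auto $1
iface $1 inet static
    address $3
    netmask $4
    bridge_ports $2
    bridge_stp off
    bridge_fd 0
EOF
}

# Example values only: substitute your own devices and address range.
bridge_stanza br0 eth0 192.168.1.10 255.255.255.0
```

Review the output before pasting it into /etc/network/interfaces, and keep a console session handy in case the network doesn't come back up.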

Obviously your device names and address ranges will need to suit your hardware and network, and you need to make sure you have the bridge-utils package installed. We’re pretty much ready for the next stage now, just bear in mind you need to duplicate all this on a second server. (Incidentally, I’ve called my servers data1 and data2 in this instance.)
