Disk Alignment

First and foremost, code dealing with raw devices usually uses the concept
of dividing the disk up into blocks of a certain size. Addressing blocks
instead of each byte separately has the huge advantage that a 32-bit
unsigned integer can manage 2³² blocks instead of 2³² bytes, vastly
increasing the maximum addressable size without needing more memory.

The downside is that this means we need to take care when reading
and writing data: we cannot overwrite parts of a block, because the only
command we have available is "overwrite block number 1284319 with this
data". There is simply no command for "overwrite the second half of
block 1284319". Hence, we always need to write complete blocks.

But what happens if we don't have enough data to overwrite a complete
block? Well, then the system reads the old block into RAM, modifies only
the bits and bytes that need to change, and writes back the complete
block. This is called a read-modify-write cycle.
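The cycle can be illustrated with a toy Python model. This is purely illustrative, not real device code: the "device" is just a dict mapping block numbers to byte buffers, and the block size is an assumption for the sketch.

```python
BLOCK_SIZE = 4096  # bytes per device block (assumption for this sketch)

def write_bytes(device, offset, data):
    """Write arbitrary bytes to a device that only supports whole-block
    writes; 'device' is modeled as a dict mapping block number -> bytes."""
    first = offset // BLOCK_SIZE
    last = (offset + len(data) - 1) // BLOCK_SIZE
    for block_no in range(first, last + 1):
        start = block_no * BLOCK_SIZE
        if offset <= start and offset + len(data) >= start + BLOCK_SIZE:
            # The write covers this block completely: no read needed.
            chunk = data[start - offset:start - offset + BLOCK_SIZE]
            device[block_no] = bytes(chunk)
        else:
            # Partial overwrite: read the old block, patch it, write it back.
            block = bytearray(device[block_no])                  # read
            lo = max(offset, start)
            hi = min(offset + len(data), start + BLOCK_SIZE)
            block[lo - start:hi - start] = data[lo - offset:hi - offset]  # modify
            device[block_no] = bytes(block)                      # write
```

Note how the aligned, full-block case skips the read entirely, which is exactly what proper alignment buys us.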

This immediately becomes relevant as soon as we try to make actual use
of a device, because then we need to layer multiple block-oriented
somethings on top of each other. In order to prevent read-modify-write
from happening, we have to make sure that upper layers only overwrite
byte ranges that the lower layers consider a full block. We do that
by ensuring the file system writes a minimum of, say, 4096 Bytes to
a position that is a multiple of 4096 Bytes away from the beginning
of the file system. If the lower layer uses a block size of 4096 Bytes,
we will now always overwrite a full block. If the lower layer uses a
smaller block size that 4096 is a multiple of, this will also work.
So using such a file system also works when the lower layer uses 2048,
1024 or 512 Bytes as its block size. 1536 bytes would not work,
because in order to write a single file system block, the block device
would have to write 2.66 of its own blocks. For that, it would require
at least one read-modify-write cycle.
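The rule from the paragraph above boils down to a single divisibility check, which we can verify for the block sizes mentioned:

```python
def aligned(fs_block_size, device_block_size):
    """A file-system write of fs_block_size bytes, at an offset that is a
    multiple of fs_block_size, only hits whole device blocks if the device
    block size divides the file system block size evenly."""
    return fs_block_size % device_block_size == 0

# 4096-byte FS blocks line up with 512, 1024, 2048 and 4096-byte devices...
for dev in (512, 1024, 2048, 4096):
    assert aligned(4096, dev)

# ...but not with a 1536-byte device: 4096 / 1536 = 2.66 device blocks,
# so at least one read-modify-write cycle would be required.
assert not aligned(4096, 1536)
```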

Most data handling systems divide their storage into a metadata area
and a data area. Since the data area is the one much more frequently
written to, we have to make sure its blocks are properly aligned. The
size of the metadata area is relevant in this regard since it acts as
a padding that moves the whole data area back a bit. So, as you probably
guessed already, this padding needs to be a multiple of the block
size. The block size of the data area itself must then again be
aligned with the block sizes of other systems in the stack.

So let's take a quick look at paddings and block sizes of different
tools:

Tool             Padding                  Block Size
---------------  -----------------------  ----------------------------
Hardware RAID    Hopefully, 0             Chunk Size (e.g. 256KiB),
                                          Stripe Width (e.g. 1MiB)
Partition Table  2048 Sectors → 1MiB      2048 Sectors → 1MiB
LVM              1MiB (1st Phys. Extent)  4MiB (VG Extent Size)
File Systems     Mostly aligned to 1MiB   4096 Bytes (4KiB)
QCow2 VM Image   2MiB                     depends on guest file system
Raw VM Image     0                        depends on guest file system

For qcow2 and raw VM images, make sure the guest's file systems do not
use a smaller block size than the host's file system, to avoid
read-modify-writes.

Now with regard to the block sizes, there's another pitfall to be aware of:
The VG Extent size and RAID Chunk Size differ from the block size in a
file system in that they do not denote a minimum write IO size. Instead,
they are only relevant for storage allocation. An LV that uses an extent
size of 4MiB is perfectly capable of performing 4KiB IOs. The extent size
only matters for where a newly-allocated file system starts.

So the easiest way to get all those numbers to play nicely with one
another is to use the defaults wherever possible. This minimizes the amount
of fiddling you have to do, and minimum fiddling means minimum mistakes.
Since a single mistake can flush your whole system's performance down
the drain, minimum mistakes is a good thing, especially when you allow
the defaults to work with things you do frequently (like, creating a
partition table in a newly-created raw VM image in an LVM logical volume).

If you don't use RAID, make sure you use partition tables with their
partitions always starting at sector 2048 (or a multiple thereof), and
being a multiple of 2048 sectors in length. That is what gparted's "Align to 1MiB"
button does, as does pretty much every modern operating system on the
planet. If your partition table starts at sector 63, your OS is not one
of those. You should then do the partitioning with a live CD or something
before installing the OS. If that is not an option and you're using a VM
store backed by a file system that does not reside on SSDs, you may also
consider switching that file system's block size to 512 Bytes.
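Checking whether a partition start is aligned is simple arithmetic, assuming 512-byte sectors as in the 2048-sectors-equals-1MiB rule above:

```python
SECTOR_BYTES = 512  # assumption: classic 512-byte sectors

def partition_aligned(start_sector, alignment_bytes=1024 * 1024):
    """True if the partition starts on a 1MiB boundary."""
    return (start_sector * SECTOR_BYTES) % alignment_bytes == 0

assert partition_aligned(2048)    # modern default: 2048 * 512B = 1MiB
assert not partition_aligned(63)  # legacy DOS alignment, to be avoided
```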

If you do use RAID, things get a little more complicated.

Laying Out RAID

Every RAID level except for RAID-1 employs a technique called striping.
This means that when writing a certain amount of data, the data gets split
into chunks, and each chunk is going to land on a different disk. For
instance, if you're writing 1MiB of data to a RAID that has a chunk size
of 256KiB, you're going to overwrite four chunks, because
1MiB / 256KiB = 4. If you're writing more chunks than your array has
data disks, at some point this is going to wrap around. So if you have 6
data disks in your array, writing 4 chunks does not need to wrap, but if
you only had 3, then one disk would get two chunks to write. The maximum
amount of data that can be written without having to wrap around is
called the Stripe Width, and is calculated as the number of data disks
times the chunk size. So for instance, a RAID-5 array of 8 disks with a
chunk size of 256KiB would then have a Stripe Width of 1792KiB:
7 * 256KiB = 1792KiB.
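The stripe width calculation from the paragraph above, as a small sketch (the helper name is mine, not a real tool's API):

```python
def stripe_width_kib(total_disks, parity_disks, chunk_kib):
    """Stripe width = number of data disks times the chunk size."""
    data_disks = total_disks - parity_disks
    return data_disks * chunk_kib

# RAID-5 over 8 disks (1 disk's worth of parity) with 256KiB chunks:
assert stripe_width_kib(8, 1, 256) == 1792  # 7 * 256KiB, as in the text
```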

Now take a look at the table again. Since most offsets default to 1MiB
and LVM's extent size is also a multiple of 1MiB, it would be nice if
the start of a stripe + 1MiB = the start of another stripe. This is
easily achieved if your Stripe Width is 1MiB, but 512 KiB would
work just as well.
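We can check directly which data disk counts satisfy this with 256KiB chunks: an offset that is a multiple of 1MiB lands on a stripe boundary only if the stripe width divides 1MiB evenly.

```python
CHUNK_KIB = 256
MIB_KIB = 1024

# Data disk counts (with 256KiB chunks) whose stripe width divides 1MiB:
good = [n for n in range(2, 9) if MIB_KIB % (n * CHUNK_KIB) == 0]
assert good == [2, 4]  # 512KiB and 1MiB stripe widths, respectively
```

This arithmetic is the reason behind the "two or four data disks" rule: three disks give a 768KiB stripe width, and anything past four gives one larger than 1MiB, so 1MiB-aligned writes would land mid-stripe.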

If all your partitions and LVs start exactly at the beginning of a RAID
stripe, you will then be able to perform filesystem tuning. At least
Ext3, Ext4 and XFS can be tuned by giving them information about the
RAID layout, but for that to work correctly, the first byte of the file
system has to be aligned to the start of a stripe. Otherwise, none
of the calculations the file system performs will match reality.
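For ext4, this tuning means passing the RAID geometry to mkfs via its stride and stripe-width extended options, both measured in file system blocks. A sketch of the arithmetic (the helper is hypothetical; the option names are from the mke2fs documentation):

```python
def ext4_raid_params(chunk_kib, data_disks, fs_block_bytes=4096):
    """Compute the values mkfs.ext4 expects for its
    -E stride=...,stripe-width=... extended options, in FS blocks."""
    fs_block_kib = fs_block_bytes // 1024
    stride = chunk_kib // fs_block_kib  # FS blocks per RAID chunk
    width = stride * data_disks         # FS blocks per full stripe
    return stride, width

# 4 data disks, 256KiB chunks, 4KiB FS blocks:
assert ext4_raid_params(256, 4) == (64, 256)
```

With these numbers you would run something like `mkfs.ext4 -E stride=64,stripe-width=256 /dev/...`; the equivalent for XFS is the `su`/`sw` pair of mkfs.xfs options.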

Note

Stripe over either two or four data disks. Not three, not five, not
six, not seven. Two or four. End of story.

RAID controllers are most efficient when overwriting a full stripe.
With random IO, that is never going to happen ever. Don't try to make
it happen. It won't. But we can reduce the pain by using a big chunk
size (in my experience, 256KiB works splendidly), so the RAID controller
can process even large IO requests by updating a single chunk per stripe.

Note

Use a chunk size of 256KiB. If that is not possible, choose the
biggest one available.

I also did a bunch of tests comparing hardware RAID to software RAID.
The results indicate that hardware RAID is better suited for RAID levels
that involve parities and mirroring (RAID-1, 5, 6). Software RAID is
better suited for striping (RAID-0). I figured this is because then
the Linux kernel can distribute the load to multiple devices in parallel
instead of having to shove everything down one single pipe (citation needed).

Hardware and software RAID aren't mutually exclusive. For instance, layering
a software RAID-0 instance over four hardware RAID-5/6 instances with four
data disks each gets you a total of 16 data disks. If each one of those is
1.2TB in size, that's about 20TB, and you didn't even have to violate the
"four disks per array" principle from above. However, this setup does
introduce some fuzziness about what "the beginning of a stripe" exactly
is. Still, I have yet to see one of those systems be brought to its limit,
so this configuration works extremely well in practice.

Caching

Caching is the most effective optimization in IT.

First and foremost, Linux uses the host's RAM for caching, so
equipping your storage system with far more than enough RAM makes sense.
After a while, you'll find the file system cache filling up every last
bit of RAM. Curiously, the block device buffer doesn't seem to be as
effective, so I prefer using a file storage backend for VM images.

And for the love of god, use your raid controller's cache. Always buy a
friggin' CacheVault. With it, you might even be able to use slower disks,
because the RAID controller gets to feed them more sequential IO.

Note

Buy a fucking CacheVault.

On the other hand, do not enable your disks' own caches, because CacheVault
doesn't protect those. There's an option in the RAID controller to
control the disks' caches. Make sure it is set to disabled.

A word on fragmentation

You might be worried about file system fragmentation. When writing data
sequentially, fragmentation is an issue because it disturbs the sequence
in which data is written to disk, thereby causing the disk to do random-ish
IO. This reduces the maximum achieved throughput, and is therefore bad.

When putting VM images into a file system, this is not an issue because
VMs write randomly anyway.