Aligning ASM Disks on Linux

Linux is a wonderful operating system. However, there are a number of things one needs to do to make sure it runs as efficiently as possible. Today, I would like to share one of them. It has to do with using ASM (Automatic Storage Management) disks.

In Linux, there are two major ways to create ASM disks:

- you can use the ASMlib kernel driver

- you can use device-mapper (devmapper) devices

You could also use /dev/raw devices, but I don’t recommend this at all. I will write another blog explaining why.

Regardless of which approach you take, you have to create partitions on your LUNs. Starting with version 2, ASMlib won’t let you use the entire disk. You have to create a partition.

The reason for forcing the creation of this partition is to make it explicit that something exists on that device and that it's not empty. Otherwise, some OS tools will assume the disk is unused and could label it, or simply start using it, and overwrite your precious Oracle data.


Most people would use the “fdisk” command provided by Linux distributions. This command is quite old, and so has some old-fashioned DOS-style behaviours built in.

When you create your partition, by default, the unit of measure is based on cylinders. Here’s a typical print command from fdisk on a 35 GB disk:

Take a look at the Units value: 16065 sectors of 512 bytes, i.e. 8,225,280 bytes per cylinder, which is 8 MB minus 159.5 kB. This is a very weird number, totally misaligned with any possible stripe size or stripe width (stride).
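Where that odd number comes from can be checked with a little arithmetic. This is a sketch, assuming the classic DOS-compatible geometry that fdisk reports by default (255 heads, 63 sectors per track, 512-byte sectors):

```python
# Classic DOS-compatible geometry assumed by fdisk by default:
# 255 heads x 63 sectors/track = 16065 sectors per cylinder.
SECTOR = 512              # bytes per sector
cyl_sectors = 255 * 63    # sectors per cylinder

cyl_bytes = cyl_sectors * SECTOR
print(cyl_bytes)                          # 8225280 bytes per cylinder
print(8 * 1024**2 - cyl_bytes)            # 163328 bytes short of 8 MB
print((8 * 1024**2 - cyl_bytes) / 1024)   # 159.5, i.e. "8 MB minus 159.5 kB"
```

No power-of-two stripe size divides 8,225,280 evenly, which is why cylinder-based units can never land a partition on a stripe boundary.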

This, by itself, is not a big deal, since the best practice is to have one partition per LUN, representing the entire device. However, this is not the end of it. If you switch to sector mode, you will see the true start offset:

Notice the start sector: the 63rd sector, which is the 31.5 kB boundary. This value does not align with any stripe size; stripe sizes are usually powers of two, 64 kB and up.

The result is that, every so often, a block will be split between two separate hard disks, and the data will be returned at the speed of the slower (busier) device.

Assuming the typical 64 kB stripe (way too low, as I will discuss in another post) and an 8 kB database block size, every 8th block will be split between two devices. If you do the math, that's about 12% of all your I/O. Not a huge number on its own, but consider how disks are arranged in RAID 5: instead of a logical write costing two reads and two writes (read data and parity, update, then write them back), each logical write could become four reads and four writes, significantly increasing your disk activity.
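The "every 8th block" figure can be verified by brute force. A minimal sketch, assuming a 64 kB stripe, an 8 kB database block, and the default partition start at sector 63 (31.5 kB):

```python
STRIPE = 64 * 1024   # stripe size in bytes
BLOCK = 8 * 1024     # database block size in bytes
START = 63 * 512     # partition starts at sector 63 = 31.5 kB

# Count how many of the first 10,000 blocks straddle a stripe boundary,
# i.e. begin in one stripe and end in the next.
n = 10_000
split = sum(
    1
    for k in range(n)
    if (START + k * BLOCK) // STRIPE != (START + k * BLOCK + BLOCK - 1) // STRIPE
)
print(split / n)  # 0.125 -> exactly every 8th block, ~12% of all I/O
```

With the start moved to any multiple of 64 kB, the same loop counts zero split blocks.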

The solution?

Before you create your partitions, switch to sector mode (the "u" command in fdisk), and create your partitions at an offset that is a power of 2.

I typically create my partitions at the 16th megabyte (sector 32768). Essentially, I "waste" 16 MB, but gain aligned I/O for stripe widths of up to 16 MB.
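To convince yourself that a 16 MB starting offset covers every reasonable stripe width, here is a quick check (a sketch, assuming 512-byte sectors):

```python
SECTOR = 512
start = 32768 * SECTOR   # partition start: 16 MB into the LUN

# 16 MB divides evenly by every power-of-two stripe width up to 16 MB,
# so the partition start never lands in the middle of a stripe.
width = 64 * 1024            # from 64 kB...
while width <= 16 * 1024**2:  # ...up to 16 MB
    assert start % width == 0, f"not aligned for {width}-byte stripes"
    width *= 2
print("aligned for all power-of-two stripe widths up to 16 MB")
```

The same check against the default start of sector 63 fails at the very first width, which is the whole problem in a nutshell.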

This way, the disk is aligned. Once the partition is aligned on at least a 1 MB boundary, the ASM files will also be aligned on 1 MB boundaries.

This alignment also applies to ext3 file systems. ext3 takes it a step further, allowing you to provide the array stride as a parameter at creation time, optimizing write performance (I have not tested this). Look in the man pages for more information.

The alignment has to do with the RAID-level stripe size, not with the ASM stripe size.

There's always a "penalty" on misaligned I/O: you are using two devices when you could have used only one. Note that using multiple devices is not bad by itself. What's bad is reading small amounts of data from multiple devices when one would have done.

Hi Christo,
I was told by the SAN administrators here that there's no way to change the stripe size for an EMC LUN to 1 MB: it has already been set to 128 KB for one big chunk of RAID 5, and every LUN will come from that one.

In this case, there's no choice but to accept that as a fact. Is it still necessary to do the alignment as above for a 128 KB stripe size?

Thanks Christo. Since we aren't going to get any stripe size other than 128 KB in this case, I guess I could set the first sector to 256 (128 K) instead of 32768 (16 M), so that it aligns with the 128 KB stripe size. Do you see any issues with doing this?
Thanks,
Hai


It is a very cool article, thank you for sharing it with us. I do have a few questions/doubts about the approach and the way to validate it.

— In an enterprise SAN infrastructure, LUNs are presented from the storage backend to Oracle, the consumer, without any reliable information (I assume here, please correct me if I am wrong) about the physical layout of the underlying disks. I mean that the SAN infrastructure hides information such as cylinders/sectors/etc., since each LUN isn't a single physical disk anymore; it is a combination of disks and the stripe technology used by the particular vendor (with, hopefully, their own optimizations). By the time a LUN is presented to Linux/fdisk (I assume), it is a long way from being a simple disk. Therefore, even assuming we make the right offset to avoid double I/O in certain cases, I would like to have a good method to verify my configuration (see the second point).

— Let's imagine we made a 16 MB offset when creating our partition, on the assumption that it will improve I/O performance by about 12%. The question: is there any good way to validate that assumption? I would like to be certain that I am releasing a storage chunk to my production space consumers in the best possible configuration. Can you think of a simple test to validate the suggested assumption? I am guessing that iostat can't be used in this case, since we are talking about lower-level I/O splitting at the device-driver layer, or, in the case of a SAN, hidden somewhere in the backend I/O controllers.

— It looks like the way you put it, the suggestion applies to fdisk only. Most probably other volume managers (like Veritas, etc.) have the same issue, but each should be checked separately (with vendor support, etc.).

— If the reason to introduce the fdisk step is to "make explicit that something exists", I would instead introduce additional maintenance procedures for the technical staff, make sure everybody is aware of the configuration chosen, and go with direct device usage (/dev/sdc rather than /dev/sdc1) to avoid an additional I/O layer (the partition) and the risk of losing 12% of I/O performance. Are there other reasons to use partitions? I assume here that fdisk doesn't help address device permissions and naming issues.

[…] (Where ?? are the characters used to identify the device.) The reason we are using 32768 as the start block for the device is because of the way scsi/san volume mapping uses the first few blocks of a device and subsequently causes excessive write I/O. This is better explained in the following link: http://www.pythian.com/news/411/aligning-asm-disks-on-linux […]

Hi Christo,
Awesome article, but I have a few doubts.
I am trying a 1 MB offset on one of the raw disks (/dev/sdb). On the first partition (/dev/sdb1) I specified the first sector as 2048 and the last sector as +4G. When I try to create another partition (the second partition), I see the suggested first sector value is 2. If I select 2, will the next partition also have a 1 MB offset?