Saturday, March 17, 2012

Random notes on iSCSI storage

When you're using ESXi/vSphere as your virtualization host, iSCSI storage is genuinely useful. That's because VMware's VMFS3 filesystem is a clustered filesystem by default. If your iSCSI target can operate in cluster mode -- i.e., accept initiators from multiple hosts connected at the same time -- then iSCSI block storage can be used for ultra-quick failovers on your VMware servers, among other things. And performance is *significantly* better than NFS datastores, because with VMFS3 VMware can store vmdk files as physically contiguous extents, while VMware has no control over how an NFS server physically lays out vmdk files on disk. This matters because all modern operating systems use an "elevator" algorithm for their filesystem cache flushes, which assumes that the underlying block storage is physically contiguous from block 0 to block n. If the underlying storage is *not* physically contiguous, you end up with either the possibility of lost writes (if the NFS host is running in asynchronous mode) or with the NFS host's disks thrashing all over the place and performance sucking like a Rentboy.com male prostitute at a Republican convention.

So anyhow, I just wanted to share a technique I used to rescue a failing machine. The machine involved was a Red Hat Enterprise Linux 4 machine that I wanted to migrate to virtualization for the simple reason that one of its drives had failed. Only about 30GB of the first drive held actual data; most of the disk was empty.

So, first things first: I created a blank virtual machine on the ESXi host and told VMware to create a drive big enough to hold all the data on the old RHEL4 machine. Then I attached that virtual machine's hard drive to a CentOS 6 virtual machine as a second virtual disk, exported that disk via tgtd / iSCSI, and connected to the target from the RHEL4 machine's iSCSI initiator. On the RHEL4 machine I then dd'ed the first hundred blocks of its physical hard drive to the iSCSI drive (which showed up as something like /dev/sdc -- I'd checked /proc/partitions before telling the initiator to scan, so I'd know what appeared), ran 'sfdisk -R /dev/sdc' to make the kernel re-read the partition table on /dev/sdc, then copied the /boot partition (after unmounting it) byte for byte: 'dd if=/dev/sda1 of=/dev/sdc1'. Then I did:
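The plumbing for that step looked roughly like the following sketch. Target names, hostnames, and device paths here are illustrative, not the ones from my setup, and I'm showing the modern open-iscsi initiator commands -- RHEL4's older linux-iscsi initiator instead takes its target list from /etc/iscsi.conf.

```shell
# On the CentOS 6 VM: export the new virtual disk (here /dev/sdb) via tgtd.
tgtadm --lld iscsi --op new --mode target --tid 1 \
    -T iqn.2012-03.local.rescue:rhel4-migrate
tgtadm --lld iscsi --op new --mode logicalunit --tid 1 --lun 1 -b /dev/sdb
tgtadm --lld iscsi --op bind --mode target --tid 1 -I ALL

# On the RHEL4 machine: discover and log in, then check /proc/partitions
# to see which device node (here /dev/sdc) showed up.
iscsiadm -m discovery -t sendtargets -p centos6-vm
iscsiadm -m node -T iqn.2012-03.local.rescue:rhel4-migrate -p centos6-vm -l
cat /proc/partitions

# Copy the MBR plus partition table, make the kernel re-read it,
# then clone /boot byte for byte.
dd if=/dev/sda of=/dev/sdc bs=512 count=100
sfdisk -R /dev/sdc
umount /boot
dd if=/dev/sda1 of=/dev/sdc1
```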

pvcreate /dev/sdc2
vgcreate rootgroup /dev/sdc2
lvcreate -n rootvol -L 16G rootgroup
lvcreate -n swapvol -L 2G rootgroup
lvcreate -n extravol -L 16G rootgroup
vgchange -a y rootgroup
lvscan
mkfs -t ext3 /dev/mapper/rootgroup-rootvol
mkswap /dev/mapper/rootgroup-swapvol
mkfs -t ext3 /dev/mapper/rootgroup-extravol

I then mounted the new volumes in their correct hierarchy (so that when I chrooted into them I'd see /boot and friends in the right places) and did the typical pipelined tar commands to do file-by-file copies of / and /extra to their new locations. While that was going on I edited the new /etc/fstab, then chrooted into the new environment, mounted /proc and /sys, and ran mkinitrd to capture the new root volume. I do suggest having a rescue disk handy as an ISO image on an ESXi datastore so you can boot it in case of problems -- which I did need, though for reasons unrelated to any of this (it was related to the failure that caused the migration in the first place).
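The copy-and-chroot dance is the usual idiom; roughly this, with mount points and the /extra layout assumed from my description above:

```shell
# Mount the new volumes in their final hierarchy.
mount /dev/mapper/rootgroup-rootvol /mnt/newroot
mkdir -p /mnt/newroot/boot /mnt/newroot/extra
mount /dev/sdc1 /mnt/newroot/boot
mount /dev/mapper/rootgroup-extravol /mnt/newroot/extra

# Classic pipelined tar: file-by-file copy preserving permissions and links.
(cd / && tar cf - --one-file-system .) | (cd /mnt/newroot && tar xpf -)
(cd /extra && tar cf - .) | (cd /mnt/newroot/extra && tar xpf -)

# Fix up fstab for the new volume names, then chroot and rebuild the
# initrd so it can find the LVM root volume at boot.
vi /mnt/newroot/etc/fstab
mount --bind /proc /mnt/newroot/proc
mount --bind /sys /mnt/newroot/sys
chroot /mnt/newroot
mkinitrd -f /boot/initrd-$(uname -r).img $(uname -r)
```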

So how did this data transfer perform? Well, basically at the full speed of the source hard drive, which was a 500GB IDE hard drive.

Anyhow, having used the Linux iSCSI target daemon, tgtd, here as well as extensively for other projects, let me just say that it sucks big-time compared to "real" targets. How does it suck? Let me count the ways:

Storage management, period, doesn't exist with tgtd. For example, you can't increase the size of a target once you've created it by adding more backing store to an already-running iSCSI target; its size simply is what it is.

tgtd gets into regular fights with the Linux kernel about who owns the block devices it's trying to export. That makes it nearly useless for exporting raw block devices: if there's an md array or an LVM volume set on the device, the kernel will claim it long before tgtd gets ownership. And since you have no control over what the initiator puts onto a block device, you're stuck -- you have to manually stop the target, deactivate the RAID array and/or volume group, then restart the target to get control of the physical device so you can export it.
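In practice that workaround looks something like this (the array, volume group, and device names are illustrative, and the service name assumes a sysvinit-style tgtd package):

```shell
# Stop tgtd, then pry the exported device away from the kernel's
# auto-assembly before restarting the target.
service tgtd stop
vgchange -a n initiatorvg   # deactivate any LVM VG the kernel auto-activated
mdadm --stop /dev/md0       # stop any md array assembled from the device
service tgtd start          # now tgtd can claim /dev/sdb for export
```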

tgtd has the most obscure failure mode I've ever encountered: if it can't do something it will still happily export the volume, just as a 0-length volume. WTF?!

My conclusion: tgtd is a toy, useful only for experimenting and one-off applications. It doesn't have the storage management capabilities needed for a serious iSCSI target. Some of that storage management could be built around it, but the fact that you cannot modify a tgtd target while anybody is connected to it means you can't do things that the big players -- or even the little guys, like the Intransa appliance I'm using as the backing store for my ESXi host -- have been able to do for years. Even on the antique nine-year-old Intransa realm that's hosting some of our older data (which is migrating to a new one, but that takes time), I can expand the size of an iSCSI target in real time. I then tell my initiator to re-scan, it notices "hey, my target has gotten bigger!" and informs the kernel, and I can use the OS's native utilities to grow an existing filesystem into the additional space. None of that is possible with tgtd, for the simple reason that tgtd won't do real-time live storage management. Toy. Just sayin'.
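For comparison, here's what that online-grow workflow looks like from the initiator side against a target that can resize live, assuming an open-iscsi initiator and ext3 on LVM (device and volume names are made up):

```shell
# 1. Grow the LUN on the storage appliance (vendor-specific, not shown).
# 2. Tell the initiator to rescan its sessions; the kernel sees the
#    bigger disk.
iscsiadm -m node -R
# 3. Grow the LVM stack, then the filesystem -- all while mounted.
pvresize /dev/sdc
lvextend -L +50G /dev/datavg/datavol
resize2fs /dev/datavg/datavol   # ext3 supports online growth
```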

