Web-Notizen von Stefan Hinker

What's up with LDoms: Part 3 - A closer look at Disk Backend Choices

In this section, we'll have a closer look at virtual disk backends and the various choices available here. As a little reminder, a disk backend, in LDoms speak, is the physical storage used when creating a virtual disk for a guest system. In other virtualization solutions, these are sometimes called virtual disk images, a term that doesn't really fit all the options available in LDoms. The options are:

Physical LUNs, in any variant that the Control Domain supports. This of course includes SAN, iSCSI and SAS, including the internal disks of the host system.

Logical volumes such as ZFS volumes, but also SVM or VxVM volumes.

Regular Files. These can be stored in any filesystem, as long as they're accessible by the LDoms subsystem. This includes storage on NFS.

Each of these backend types has its own set of characteristics that should be considered when deciding which one to use. Let's look at them in a little more detail.

LUNs are the most generic option. By assigning a virtual disk to a LUN backend, the guest essentially gains full access to the underlying storage device, whatever that might be. It will see the volume label of the LUN, it can see and alter the partition table of the LUN, and it can read or set SCSI reservations on that device. Depending on how the LUN is connected to the host system, this very same LUN could also be attached to a second host and a guest residing on it, either so that the two guests share the data on that one LUN or to support live migration. If there is a filesystem on the LUN, the guest will be able to mount that filesystem, just like any other system with access to that LUN, be it virtualized or direct. Bear in mind that most filesystems are non-shared filesystems, and virtualization doesn't change that. For the IO domain (the domain to which the physical LUN is connected), LUNs mean the least possible amount of work. All it has to do is pass data blocks to and from the LUN, with a bare minimum of driver layers involved.
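As a rough illustration of what exporting a LUN looks like, here is a minimal sketch that only builds the two `ldm` command lines and prints them, without running anything. The device path, volume name, service name `primary-vds0`, disk name and guest name `ldg1` are all hypothetical placeholders, and the helper function is mine, not part of LDoms.

```python
import shlex


def lun_export_commands(lun_path, volume, service, disk_name, guest):
    """Build (but do not run) the ldm commands that hand a LUN to a guest."""
    target = f"{volume}@{service}"
    return [
        # Register the LUN as a backend of the virtual disk service.
        ["ldm", "add-vdsdev", lun_path, target],
        # Attach that backend to the guest as a virtual disk.
        ["ldm", "add-vdisk", disk_name, target, guest],
    ]


# All names below are hypothetical placeholders.
commands = lun_export_commands(
    "/dev/dsk/c3t5d0s2", "lun0", "primary-vds0", "bootdisk", "ldg1"
)
for cmd in commands:
    print(shlex.join(cmd))
```

The commands would be run in the control domain. The guest then sees the whole LUN, label and partition table included.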

Flat files, on the other hand, are the simplest option, very similar in user experience to what one would do in a desktop hypervisor like VirtualBox. The easiest way to create one is with the "mkfile" command. For the guest, there is no real difference from LUNs. The virtual disk will, just like in the LUN case, appear to be a full disk, partition table, label and all. Of course, it'll be completely empty at first, so the first thing the guest usually needs to do is write a label to the disk. The main difference from LUNs is in the way these image files are managed. Since they are files in a filesystem, they can be copied, moved and deleted, all of which should be done with care, especially if the guest is still running. They can be managed by the filesystem, which means attributes like compression, encryption or deduplication in ZFS could apply to them, fully transparent to the guest. If the filesystem is a shared filesystem like NFS or SAM-FS, the file (and thus the disk image) could be shared with another LDom on another system, for example as a shared database disk or for live migration. Their performance will be affected by the filesystem, too. The IO domain might cache some of the file, hoping to speed up operations. If there are many such image files on a single filesystem, they might affect each other's performance. These files, by the way, need not be empty initially. A typical use case would be a Solaris ISO image file. Adding it to a guest as a virtual disk allows that guest to boot (and install) off that ISO image as if it were a physical CD drive.
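For a file backend, the only extra step is creating the file. The following sketch again just assembles the command lines; the image path, the size of 20g and all volume, service and guest names are assumptions chosen for illustration.

```python
import shlex


def file_backend_commands(image_path, size, volume, service, disk_name, guest):
    """Build (but do not run) the commands for a file-backed virtual disk."""
    target = f"{volume}@{service}"
    return [
        # Create an empty image file of the requested size.
        ["mkfile", size, image_path],
        # Export the file through the virtual disk service.
        ["ldm", "add-vdsdev", image_path, target],
        # Attach it to the guest, which will see an unlabeled disk.
        ["ldm", "add-vdisk", disk_name, target, guest],
    ]


# Hypothetical path and names.
commands = file_backend_commands(
    "/ldoms/ldg1/disk0.img", "20g", "ldg1-disk0", "primary-vds0", "disk0", "ldg1"
)
for cmd in commands:
    print(shlex.join(cmd))
```

If the file lives on NFS, the same path would have to be reachable from every host that is supposed to share the image.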

Finally, there are logical volumes, typically created with volume managers such as Solaris Volume Manager (SVM), Veritas Volume Manager (VxVM) or, of course, ZFS. For the guest, again, these look just like ordinary disks, very much like files do. The difference from files is in the management layer: the logical volumes are created straight from the underlying storage, without a filesystem layer in between. In the database world, we would call these "raw devices", and their device names in Solaris are very similar to those of physical LUNs. We need different commands to find out how large these volumes are, or how much space is left on the storage devices underneath. Other than that, however, they are very similar to files in many ways. Sharing them between two host systems is likely to be more complex, as one would need the corresponding cluster volume managers, which typically only really work in combination with Solaris Cluster. One type of volume that deserves special mention is the ZFS volume. It offers all the features of a normal ZFS dataset: clones, snapshots, compression, encryption, deduplication, etc. Especially with snapshots and clones, ZFS volumes lend themselves as the ideal backend for all use cases that make heavy use of these features.
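To make the snapshot and clone workflow a bit more concrete, here is a sketch that lists the `zfs` commands for a "golden" volume and one clone of it, plus the block device path the clone would be exported under. The pool name `ldoms`, the dataset names and the snapshot name are hypothetical, and in practice you would install the guest into the golden volume before taking the snapshot.

```python
import shlex


def zvol_clone_commands(pool, golden, size, clone, snap="golden"):
    """Build (but do not run) zfs commands for a golden volume and a clone."""
    volume = f"{pool}/{golden}"
    snapshot = f"{volume}@{snap}"
    clone_dataset = f"{pool}/{clone}"
    commands = [
        # Create the ZFS volume that holds the master image.
        ["zfs", "create", "-V", size, volume],
        # Freeze its current state (after the guest has been installed).
        ["zfs", "snapshot", snapshot],
        # Create a writable copy that initially uses almost no extra space.
        ["zfs", "clone", snapshot, clone_dataset],
    ]
    # ZFS volumes appear as block devices below /dev/zvol/dsk.
    backend = f"/dev/zvol/dsk/{clone_dataset}"
    return commands, backend


# Hypothetical pool and dataset names.
commands, backend = zvol_clone_commands("ldoms", "gold-disk0", "20g", "test1-disk0")
for cmd in commands:
    print(shlex.join(cmd))
print("backend for ldm add-vdsdev:", backend)
```

The returned backend path is what would be passed to `ldm add-vdsdev` in place of a LUN or file path.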

For the sake of completeness, I'd like to mention that you can export all of these backends to a guest with or without the "slice" option. I consider this less useful in most cases, so I'll refer you to the relevant section in the admin guide if you want to know more about it.

Lastly, you have the option to export these backends read-only to prevent any changes from the guests. Keep in mind that even mounting a UFS filesystem read-only would require a write operation to the virtual disk. The most typical use case for this is probably an ISO image, which can indeed be mounted read-only. You can also export one backend to more than one guest. In the physical world, this would correspond to using the same SAN LUN on several hosts, and the same restrictions with regard to shared filesystems etc. apply.
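Both the read-only export and the "slice" export are requested through the `options=` argument of `ldm add-vdsdev`. The sketch below shows how that argument is put together; the helper function, the ISO path and the volume and service names are assumptions for illustration only.

```python
import shlex


def add_vdsdev(backend, volume, service, read_only=False, single_slice=False):
    """Build (but do not run) an ldm add-vdsdev command with export options."""
    options = []
    if read_only:
        options.append("ro")      # guest cannot write to the backend
    if single_slice:
        options.append("slice")   # export as a single slice, not a full disk
    cmd = ["ldm", "add-vdsdev"]
    if options:
        cmd.append("options=" + ",".join(options))
    cmd += [backend, f"{volume}@{service}"]
    return cmd


# Hypothetical ISO image exported read-only as an installation medium.
iso_cmd = add_vdsdev("/export/iso/sol-11.iso", "sol11-iso", "primary-vds0",
                     read_only=True)
print(shlex.join(iso_cmd))
```

Without either flag, the function produces a plain full-disk, read-write export.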

So now that we know about all these different options, when should we use which kind of backend? The answer, as usual, is: It depends!

LUNs require a SAN (or iSCSI) infrastructure which we tend to associate with higher cost. On the other hand, they can be shared between many hosts, are easily mapped from host to host and bring a rich feature set of storage management and redundancy with them. I recommend LUNs (especially SAN) for both boot devices and data disks of guest systems in production environments. My main reasons for this are:

They are very light-weight on the IO domain

They avoid any double buffering of data in the guest and in the IO domain because there is no filesystem layer involved in the IO domain.

Redundancy for the device and the data path is easy

They allow sharing between hosts, which in turn allows cluster implementations and live migration

All ZFS features can be implemented in the guest, if desired.

For test and development, my first choice is usually the ZFS volume. Unlike VxVM, it comes free of charge, and its features like snapshots and clones meet the typical requirements of such environments to quickly create, copy and destroy test environments. I explicitly recommend against using ZFS snapshots/clones (files or volumes) over a longer period of time. Since ZFS records the delta between the original image and the clones, the space overhead will eventually grow to a multiple of the initial size and may even prevent further IO to the virtual disk once the zpool is full. Also keep in mind that ZFS is not a shared filesystem. This prevents guests that use ZFS files or volumes as virtual disks from doing live migration. Which leads directly to the recommendation for files:
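Because a full zpool can stall IO to every virtual disk stored in it, it may be worth watching pool capacity when clones are kept around. Here is a small sketch that parses the output of `zpool list -H -o name,capacity` and reports pools above a threshold. The sample output, the pool names and the 80 percent threshold are made up for illustration.

```python
def pools_near_full(zpool_output, threshold=80):
    """Return (pool, percent_used) pairs at or above the threshold.

    Expects the output of `zpool list -H -o name,capacity`,
    i.e. one pool per line: name, whitespace, capacity such as "91%".
    """
    warnings = []
    for line in zpool_output.splitlines():
        if not line.strip():
            continue  # skip empty lines
        name, capacity = line.split()
        used = int(capacity.rstrip("%"))
        if used >= threshold:
            warnings.append((name, used))
    return warnings


# Made-up sample output; on a real system this would come from zpool itself.
sample = "ldoms\t91%\nrpool\t42%\n"
print(pools_near_full(sample))
```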

I recommend files on NFS (or other shared filesystems) in all those cases where SAN LUNs are not available but shared access to disk images is required for live migration (or because cluster software like Solaris Cluster or RAC is running in the guests). The functionality is mostly the same as for LUNs, with the exception of SCSI reservations, which don't work with a file backend. However, the CPU requirements in the IO domain and the performance of NFS files are likely to differ from those of SAN LUNs, which is why I strongly recommend using SAN LUNs for all production use cases.

Nice wrapup on the different choices. For the LiveMigration part, is using files in the storage box shared by different nodes the only possibility? In other words, If you allocate LUN's directly to a host, LiveMigration is not possible?

Live Migration is possible with all storage backends that allow sharing between several hosts. Specifically, this includes SAN LUNs, iSCSI LUNs and files on NFS. I did mention this in the above article.