Archive for the ‘Virtual Storage’ Category

Our open storage partner, Nexenta Systems Inc., hit a milestone this month by releasing NexentaStor 4.0.1 for general availability. This release is significant mainly because it is the first commercial release of NexentaStor based on the Open Source Illumos kernel and not Oracle’s OpenSolaris (now closed source). With this move, NexentaStor’s adhering to the company’s promise of “open source technology” that enables hardware independence and targeted flexibility.

Some highlights in 4.0.1:

Faster Install times

Better HA Cluster failover times and “easier” cluster manageability

Support for large memory host configurations – up to 512GB of DRAM per head/controller

Note, the 18TB Community Edition EULA is still hampered by the “non-commercial” language, restricting it’s use to home, education and academic (ie. training, testing, lab, etc.) targets. However, the “total amount of Storage Space” license for Community is a deviation from the Enterprise licensing (typical “raw” storage entitlement)

2.2 If You have acquired a Community Edition license, the total amount of Storage Space is limited as specified on the Site and is subject to change without notice. The Community Edition may ONLY be used for educational, academic and other non-commercial purposes expressly excluding any commercial usage. The Trial Edition licenses may ONLY be used for the sole purposes of evaluating the suitability of the Product for licensing of the Enterprise Edition for a fee. If You have obtained the Product under discounted educational pricing, You are only permitted to use the Product for educational and academic purposes only and such license expressly excludes any commercial purposes.

– NexentaStor EULA, Version 4.0; Last updated: March 18, 2014

For those who operate under the Community license, this means your total physical storage is UNLIMITED, provided your space “IN USE” falls short of 18TB (18,432 GB) at all times. Where this is important is in constructing useful arrays with “currently available” disks (SATA, SAS, etc.) Let’s say you needed 16TB of AVAILABLE space using “modern” 3TB disks. The fact that your spinning disks are individually larger than 600GB indicates that array rebuild times might run afoul of failure PRIOR to the completion of the rebuild (encountering data loss) and mirror or raidz2/raidz3 would be your best bet for array configuration.

SOLORI Note: Richard Elling made this concept exceedingly clear back in 2010, and his “ZFS data protection comparison” of 2, 3 and 4-way mirrors to raidz, raidz2 and raidz3 is still a great reference on the topic.

By “raw” licensing standards, the 3-way mirror would require a 76TB license while the raidz2 volume would require a 51TB license – a difference of 25TB in licensing (around $5,300 retail). However, under the Community License, the “cost” is exactly the same, allowing for a considerable amount of flexibility in array loadout and configuration.

Why do I need 54TiB in disk to make 16TB of “AVAILABLE” storage in Raidz2?

The RAID grouping we’ve chosen is 6-disk raidz2 – that’s akin to 4 data and 2 parity disks in RAID6 (without the fixed stripe requirement or the “write hole penalty.”) This means, on average, one third of the space consumed on-disk will be in the form of parity information. Therefore, right of the top, we’re losing 33% of the disk capacity. Likewise, disk manufacturers make TiB not TB disks, so we lose 7% of “capacity” in the conversion from TiB to TB. Additionally, we like to have a healthy amount of space reserved for new block allocation and recommend 30% unused space as a target. All combined, a 6-disk raidz array is, at best, 43% efficient in terms of capacity (by contrast, 3-way mirror is only 22% space efficient). For an array based on 3TiB disks, we therefore get only 1.3TB of usable storage – per disk – with 6-disk raidz (by contrast, 10-disk raidz nets only 160GB additional “usable” space per disk.)

SOLORI’s Take: If you’re running 3.x in production, 4.0.1 is not suitable for in-place upgrades (yet) so testing and waiting for the “non-disruptive” maintenance release is your best option. For new installations – especially inside a VM or hypervisor environment as a Virtual Storage Appliance (VSA) – version 4.0.1 presents a better option over it’s 3.x siblings. If you’re familiar with 3.x, there’s not much new on the NMV side outside better tunables and snappier response.

There have always been things about virtualizing the enterprise that have concerned me. Most boil down to Uncle Ben’s admonishment to his nephew, Peter Parker, in Stan Lee’s Spider-Man, “with great power comes great responsibility.” Nothing could be more applicable to the state of modern virtualization today.

Back in “the day” when all this VMware stuff was scary and “complicated,” it carried enough “voodoo mystique” that (often defacto) VMware admins either knew everything there was to know about their infrastructure, or they just left it to the experts. Today, virtualization has reached such high levels of accessibility that I think even my 102 year old Nana could clone a live VM; now that is scary.

Enter Veeam Backup, et al

Case in point is Veeam Backup and Recovery 6 (VBR6). Once an infrastructure exceeds the limits of VMware Data Recovery (VDR), it just doesn’t get much easier to backup your cadre of virtual machines than VBR6. Unlike VDR, VBR6 has three modes of access to virtual machine disks:

Network – VBR6 backup server/proxy accesses virtual machine disks from ESXi hosts similar in a manner similar to the way the vSphere Client grants access to virtual machine disks across the LAN – slower, with more overhead;

For block-based storage, option (1) appears to be the best way to go: it’s fast with very little overhead in the data channel. For those of us with grey hair, think VMware Consolidated Backup proxy server and you’re on the right track; for everyone else, think shared disk environment. And that, boys and girls, is where we come to the point of today’s lesson…

Enter Windows Server, Updates

For all of its warts, my favorite aspect of VMware Data Recovery is the fact that it is a virtual appliance based on a stripped-down Linux distribution. Those two aspects say “do not tamper” better than anything these days, so admins – especially Windows admins – tend to just install and use as directed. At the very least, the appliance factor offers an opportunity for “special case” handling of updates (read: very controlled and tightly scripted).

The other “advantage” to VMDR is that is uses a relatively safe method for accessing virtual machine disks: something more akin to VBR6’s “virtual appliance” mode of operation. By allowing the ESXi host(s) to “proxy” access to the datastore(s), a couple of things are accomplished:

Unlike “Direct SAN Access” mode, no additional initiators need to be added to the target(s)’ ACL;

If the host can access the VMDK, it stands a good chance of being backed-up fairly efficiently.

However, VBR6 installs onto a Windows Server and Windows Server has no knowledge of what VMFS looks like nor how to handle VMFS disks. This means Windows disk management needs to be “tweaked” to ignore VMFS targets by disabling “automount” in VBR6 servers and VCB proxies. For most, it also means keeping up with patch management and Windows Update (or appropriate derivative). For active backup servers with a (pre-approved, tested) critical update this might go something like:

And the Wheels Came Off…

Service Pack 1 for Windows Server 2008 R2 requires a BCD update, so existing installations of VCB or VBR5/6 will fail to update. In an environment where there is no VCB or VBR5/6 testing platform, this could result in a resume writing event for the patching guy or the backup administrator if they follow Microsoft’s advice and “fix” SP1. Why?

Done, right? Possibly in more ways than one. By GLOBALLY enabling automount, rebooting Windows Server and installing SP1, you’ve opened-up the potential for Windows to write a signature to the VMFS volumes holding your critical infrastructure. Fortunately, it doesn’t have to end that way.

Avoiding the Avoidable

Veeam’s been around long enough to have some great forum participants from across the administrative spectrum. Fortunately, a member posted a solution method that keeps us well away from VMFS corruption and still solves the SP1 issue in a targeted way: temporarily mounting the “hidden” system partition instead of enabling the global automount feature. Here’s my take on the process (GUI mode):

On the pop-up, click the “Add…” button and accept the default drive letter offered, click “OK”;

Now “try again” the installation of Service Pack 1 and reboot;

Once SP1 is installed, re-run Disk Management;

Right-click on the “System Reserved” partition and select “Change Drive Letter and Paths..”

Click the “Remove” button to unmap the drive letter;

Click “Yes” at the “Are you sure…” prompt;

Click “Yes” at the “Do you want to continue?” prompt;

Reboot (for good measure).

This process assumes that there are no non-standard deployments of the Server 2008 R2 boot volume. Of course, if there is no separate system reserved partition, you wouldn’t encounter the SP1 failure to install issue…

SOLORI’s Take: The takeaway here is “consider your environment” (and the people tasked with maintaining it) before deploying Direct SAN Access mode into a VMware cluster. While it may represent “optimal” backup performance, it is not without its potential pitfalls (as demonstrated herein). Native access to SAN LUNs must come with a heavy dose of respect, caution and understanding of the underlying architecture: otherwise, I recommend Virtual Appliance mode (similar to Data Recovery’s take.)

While no VMFS volumes were harmed in the making of this blog post, the thought of what could have happened in a production environment chilled me into writing this post. Direct access to the SAN layer unlocks tremendous power for modern backup: just be safe and don’t forget to heed Uncle Ben’s advice! If the idea of VMFS corruption scares you beyond your risk tolerance, appliance mode will deliver acceptable results with minimal risk or complexity.

In my last post, I mentioned a much complained about “idle” CPU utilization quirk with NexentaStor when running as a virtual machine. After reading many supposed remedies on forum postings (some reference in the last blog, none worked) I went pit-bull on the problem… and got lucky.

For me, the problem cropped-up while running storage benchmarks on some virtual storage appliances for a client. These VSA’s are bound to a dedicated LSI 9211-8i SAS/6G controller using VMware’s PCI pass-through (Host Configuration, Hardware, Advanced Settings). The VSA uses the LSI controller to access SAS/6G disks and SSDs in a connected JBOD – this approach allows for many permutations on storage HA and avoids physical RDMs and VMDKs. Using a JBOD allows for attachments to PCIe-equipped blades, dense rack servers, etc. and has little impact on VM CPU utilization (in theory).

So I was very surprised to find idle CPU utilization (according to ESXi’s performance charting) hovering around 50% from a fresh installation. This runs contrary to my experience with NexentaStor, but I’ve seen reports of such issues on the forums and even on my own blog. I’ve never been able to reproduce more than a 15-20% per vCPU bias between what’s reported in the VM and what ESXi/vCenter sees. I’ve always attributed this difference to vSMP and virtual hardware (disk activity) which is not seen by the OS but is handled by the VMM.

CPU record of idle and IOzone testing of SAS-attached VSA

During the testing phase, I’m primarily looking at the disk throughput, but I notice a persistent CPU utilization of 50% – even when idle. Regardless, the 4 vCPU VSA appears to perform well (about 725MB/sec 2-process throughput on initial write) despite the CPU deficit (3 vCPU test pictured above, about 600MB/sec write). However, after writing my last blog entry, the 50% CPU leach just kept bothering me.

After wasting several hours researching and tweaking with very little (positive) effect, a client e-mail prompted a NMV walk through with resulted in an unexpected consequence: the act of powering-off the VSA from web GUI (NMV) resulted is significantly reduced idle CPU utilization.

Getting lucky: noticing a trend after using NMV to reboot for a client walk-through of the GUI.

Working with the 3 vCPU VSA over the previous several hours, I had consistently used the NMC (CLI) to reboot and power-off the VM. The fact of simply using the NMV to shutdown the VSA couldn’t have anything to do with idle CPU consumption, could it? Remembering that these were fresh installations I wondered if this was specific to a fresh installation or could it show up in an upgrade. According to the forums, this only hampered VMs, not hardware.

I grabbed a NexentaStor 3.1.0 VM out of the library (one that had been progressively upgraded from 3.0.1) and set about the upgrade process. The result was unexpected: no difference in idle CPU from the previous version; this problem was NOT specific to 3.1.2, but specific to the installation/setup process itself (at least that was the prevailing hypothesis.)

Regression into my legacy VSA library, upgrading from 3.1.1 to 3.1.2 to check if the problem follows the NexentaStor version.

If anything, the upgraded VSA exhibited slightly less idle CPU utilization than its previous version. Noteworthy, however, was the extremely high CPU utilization as the VSA sat waiting for a yes/no response (NMC/CLI) to the “would you like to reboot now” question at the end of the upgrade process (see chart above). Once “no” was selected, CPU dropped immediately to normal levels.

Now it seemed apparent that perhaps an vestige of the web-based setup process (completed by a series of “wizard” pages) must be lingering around (much like the yes/no CPU glutton.) Fortunately, I had another freshly installed VSA to test with – exactly configured and processed as the first one. I fired-up the NMV and shutdown the VSA…

Confirming the impact of the "fix" on a second fresh installed NexentaStor VSA

After powering-on the VM from the vSphere Client it was obvious. This VSA had been running idle for some time, so it’s idle performance baseline – established prior across several reboots from CLI – was well recorded by the ESXi host (see above.) The resulting drop in idle CPU was nothing short of astounding: the 3 vCPU configuration has dropped from a 50% average utilization to 23% idle utilization. Naturally, these findings (still anecdotal) have been forwarded on to engineers at Nexenta. Unfortunately, now I have to go back and re-run my storage benchmarks; hopefully clearing the underlying bug has reduced the needed vCPU count…

In this In-the-Lab segment we’re going to look at how to recover from a failed ZFS version update in case you’ve become ambitious with your NexentaStor installation after the last Short-Take on ZFS/ZPOOL versions. If you used the “root shell” to make those changes, chances are your grub is failing after reboot. If so, this blog can help, but before you read on, observe this necessary disclaimer:

NexentaStor is an appliance operating system, not a general purpose one. The accepted way to manage the system volume is through the NMC shell and NMV web interface. Using a “root shell” to configure the file system(s) is unsupported and may void your support agreement(s) and/or license(s).

That said, let’s assume that you updated the syspool filesystem and zpool to the latest versions using the “root shell” instead of the NMC (i.e. following a system update where zfs and zpool warnings declare that your pool and filesystems are too old, etc.) In such a case, the resulting syspool will not be bootable until you update grub (this happens automagically when you use the NMC commands.) When this happens, you’re greeted with the following boot prompt:

grub>

Grub is now telling you that it has no idea how to boot your NexentaStor OS. Chances are there are two things that will need to happen before your system boots again:

We’ll update both in the same recovery session to save time (this assumes you know or have a rough idea about your intended boot checkpoint – it is usually the highest numbered rootfs-nmu-NNN checkpoint, where NNN is a three digit number.) The first step is to load the recovery console. This could have been done from the “Safe Mode” boot menu option if grub was still active. However, since grub is blown-away, we’ll boot from the latest NexentaStor CD and select the recovery option from the menu.

Import the syspool

Then, we login as “root” (empty password.) From this “root shell” we can import the existing (disks connected to active controllers) syspool with the following command:

# zpool import -f syspool

Note the use of the “-f” card to force the import of the pool. Chances are, the pool will not have been “destroyed” or “exported” so zpool will “think” the pool belongs to another system (your boot system, not the rescue system). As a precaution, zpool assumes that the pool is still “in use” by the “other system” and the import is rejected to avoid “importing an imported pool” which would be completely catastrophic.

With the syspool imported, we need to mount the correct (latest) checkpointed filesystem as our boot reference for grub, destroy the local zfs.cache file (in case the pool disks have been moved, but still all there), update the boot archive to correspond to the mounted checkpoint and install grub to the disk(s) in the pool (i.e. each mirror member).

List the Checkpoints

# zfs list -r syspool

From the resulting list, we’ll pick our highest-numbered checkpoint; for the sake of this article let’s say it’s “rootfs-nmu-013” and mount it.

Install Grub to Each Mirror Disk

Unmount and Reboot

# umount /tmp/syspool
# sync
# reboot

Now, the system should be restored to a bootable configuration based on the selected system checkpoint. A similar procedure can be found on Nexenta’s site when using the “Safe Mode” boot option. If you follow that process, you’ll quickly encounter an error – likely intentional and meant to elicit a call to support for help. See if you can spot the step…

As features are added to ZFS – the ZFS (filesystem) code may change and/or the underlying ZFS POOL code may change. When features are added, older versions of ZFS/ZPOOL will not be able to take advantage of these new features without the ZFS filesystem and/or pool being updated first.

Since ZFS filesystems exist inside of ZFS pools, the ZFS pool may need to be upgraded before a ZFS filesystem upgrade may take place. For instance, in ZFS pool version 24, support for system attributes was added to ZFS. To allow ZFS filesystems to take advantage of these new attributes, ZFS filesystem version 4 (or higher) is required. The proper order to upgrade would be to bring the ZFS pool up to at least version 24, and then upgrade the ZFS filesystem(s) as needed.

Systems running a newer version of ZFS (pool or filesystem) may “understand” an earlier version. However, older versions of ZFS will not be able to access ZFS streams from newer versions of ZFS.

For NexentaStor users, here are the current versions of the ZFS filesystem (see “zfs upgrade -v”):

Here’s a maintenance note for SMB environments attempting 100% virtualization and relying on SAN-based file shares to simplify backup and storage management: beware the chicken-and-egg scenario on restart before going home to capture much needed Zzz’s. If your domain controller is virtualized and it’s VMDK file lives on SAN/NAS, you’ll need to restart SMB services on the NexentaStor appliance before leaving the building.

Here’s the scenario:

An afterhours SAN upgrade in non-HA environment (maybe Auto-CDP for BC/DR, but no active fail-over);

Shutdown of SAN requires shutdown of all dependent VM’s, including domain controllers (AD);

AD-based DNS will not be available at SAN re-boot if all DNS servers are virtual and only one SAN is involved;

Lack of DNS resolution on re-boot will cause a failure for DNS name based NTP service synchronization;

NexentaStor SMB service will fail to properly initialize AD credentials;

VMware 4.1 now pushes AD authentication all the way to ESX hosts, enabling better credential management and security but creating a potential AD dependency as well;

Using auto-startup order on the remaining ESX host for AD and vCenter could automate the process (steps 17 & 19), however, I prefer the “manual” approach after a SAN upgrade in case the upgrade failure is detected only after ESX host is restarted (i.e. storage service interaction in NFS/iSCSI after upgrade).

SOLORI’s Take: This is a great opportunity to re-think storage resources in the SMB as the linchpin to 100% virtualization. Since most SMB’s will have a tier-2 or backup NAS/SAN (auto-sync or auto-CDP) for off-rack backup, leveraging a shared LUN/volume from that SAN/NAS for a backup domain controller is a smart move. Since tier-2 SAN’s may not have the IOPs to run ALL mission critical applications during the maintenance interval, the presence of at least one valid AD server will promote a quicker RTO, post-maintenance, than coming up cold. [This even works with DAS on the ESX host]. Solution – add the following and you can ignore step 15:

3a. Migrate always-on AD server to LUN/volume on tier-2 SAN/NAS;

24. Migrate always-on AD server from LUN/volume on tier-2 SAN/NAS back to tier-1;

Since even vSphere Essentials Plus has vMotion now (a much requested and timely addition) collapsing all remaining VM’s to a single ESX host is a no brainer. However, migrating the storage is another issue which cannot be resolved without either a shutdown of the VM (off-line storage migration) or Enterprise/Enterprise Plus version of vSphere. That is why the migration of the AD server from tier-2 is reserved for last (step 17) – it will likely need to be shutdown to migrate the storage between SAN/NAS appliances.

The early summer storms have taken its toll on Alabama and UPS failures (and short-falls) have been popping-up all over. Add consolidated, shared storage to the equation and you have a recipe for potential data loss – at least this is what we’ve been seeing recently. Add JBOD’s with separate power rails and limited UPS life-time and/or no generator backup and you’ve got a recipe for potential data loss.

Even with ZFS pools, data integrity in a power event cannot be guaranteed – especially when employing “desktop” drives and RAID controllers with RAM cache and no BBU (or perhaps a “bad storage admin” that has managed to disable the ZIL). When this happens, NexentaStor (an other ZFS storage devices) may even show all members in the ZFS pool as “ONLINE” as if they are awaiting proper import. However, when an import is attempted (either automatically on reboot or manually) the pool fails to import.

From the command line, the suspect pool’s status might look like this:

Not good. This probably indicates that something is not right with the array. Let’s try to force the import and see what happens:

Nope. Now this is the point where most people start to get nervous, their neck tightens-up a bit and they begin to flip through a mental calendar of backup schedules and catalog backup repositories – I know I do. However, it’s the next one that makes most administrators really nervous when trying to “force” the import:

In this case, something must have happened to corrupt metadata – perhaps the non-BBU cache on the RAID device when power failed. Expensive lesson learned? Not yet. The ZFS file system still presents you with options, namely “acceptable data loss” for the period of time accounted for in the RAID controller’s cache. Since ZFS writes data in transaction groups and transaction groups normally commit in 20-30 second intervals, that RAID controller’s lack of BBU puts some or all of that pending group at risk. Here’s how to tell by testing the forced import as if data loss was allowed:

root@NexentaStor:~# zpool import -nfF pool0
Would be able to return data to its state as of Fri May 7 10:14:32 2010.
Would discard approximately 30 seconds of transactions.

If the first output is acceptable, then proceeding without the “n” option will produce the desired effect by “rewinding” the last couple of transaction groups (read ignoring) and imported the “truncated” pool. The “import” option will report the exact number of “seconds” worth of data that cannot be restored. Depending on the bandwidth and utilization of your system, this could be very little data or several MB worth of transaction(s).

What to do about the second option? From the man pages on “zpool import” Sun/Oracle says the following:

Imports all pools found in the search directories. Identical to the previous command, except that all pools with a sufficient number of devices available are imported. Destroyed pools, pools that were previously destroyed with the “zpool destroy” command, will not be imported unless the-D option is specified.

-omntopts

Comma-separated list of mount options to use when mounting datasets within the pool. See zfs(1M) for a description of dataset properties and mount options.

-oproperty=value

Sets the specified property on the imported pool. See the “Properties” section for more information on the available pool properties.

-ccachefile

Reads configuration from the given cachefile that was created with the “cachefile” pool property. This cachefile is used instead of searching for devices.

-ddir

Searches for devices or files in dir. The -d option can be specified multiple times. This option is incompatible with the -c option.

-D

Imports destroyed pools only. The -f option is also required.

-f

Forces import, even if the pool appears to be potentially active.

-F

Recovery mode for a non-importable pool. Attempt to return the pool to an importable state by discarding the last few transactions. Not all damaged pools can be recovered by using this option. If successful, the data from the discarded transactions is irretrievably lost. This option is ignored if the pool is importable or already imported.

-a

Searches for and imports all pools found.

-Rroot

Sets the “cachefile” property to “none” and the “altroot” property to “root”.

-n

Used with the -F recovery option. Determines whether a non-importable pool can be made importable again, but does not actually perform the pool recovery. For more details about pool recovery mode, see the -F option, above.

No real help here. What the documentation omits is the “-X” option. This option is only valid with the “-F” recovery mode setting, however it is NOT well documented suffice to say it is the last resort before acquiescing to real problem solving… Assuming the standard recovery mode “depth” of transaction replay is not quite enough to get you over the hump, the “-X” option gives you an “extended replay” by seemingly providing a scrub-like search through the transaction groups (read “potentially time consuming”) until it arrives at the last reliable transaction group in the dataset.

Lessons to be learned from this excursion into pool recovery are as follows:

The data integrity functions in ZFS are solid when used appropriately. When architecting your HOME/SOHO/SMB NAS appliance, pay attention to the hidden risks of “promised performance” that may walk you down the plank towards a tape backup (or resume writing) event. Better to leave the 5-15% performance benefit on the table or purchase adequate BBU/UPS/Generator resources to supplant your system in worst-case events. In complex environments, a pending power loss can be properly mitigated through management supervisors and clever scripts: turning down resources in advance of total failure. How valuable is your data???

In Medio Stat Veritas

SOLORI's Take and Quick Take posts express my personal opinion unless explicitly attributed to other sources. Where possible, supporting facts are presented to properly frame and ground these opinions, however they are presented "AS-IS" without regard to warranty or promise: expressed or implied.

Comments are open to all registered users and may be edited for decorum. Spam is deleted with prejudice.