Wednesday, August 17, 2011

I have been working on upgrading our vSphere host hardware and migrating VMs from our old EMC Celerra NS350 to a newer HP EVA4400 (that in and of itself is worthy of its own blog post). We had purchased the EVA two years ago to host our ERP data, and it has run with a single host accessing it ever since.

So we ordered additional disks and shelves for the EVAs at both the primary and DR datacenters. I installed the HBAs in the hosts, added them to the fabric, created zones, created the LUNs, masked them off, etc, etc. Everything was going great until...

I went to set up replication. I set up array-based replication (ABR) for the first LUN - no problem. Storage vMotioned a VM over to it and it replicated without issue. Tried to set up replication for the second LUN - major obstacle time. HP's replication mechanism for the EVA, Continuous Access (CA), is licensed based on capacity. And of course, we had licensed 1TB but needed more like 12TB. Great. Meanwhile, there were grumblings and doubts among others on the IT team about whether CA was even the right choice for replicating this data.

Come on HP, really? Does any vendor license replication by capacity anymore? You don't do this with LeftHand/P4000 or 3PAR arrays. Frustrating...

Now I'll be the first to tell you that I hate, hate, hate vendor lock-in. Technology changes so fast that whatever you're using today probably isn't what you'll be using 3, 5, or 10 years from now. Again, a good topic that deserves its own post. This is one reason that, as a vSphere and storage engineer, I've become a fan of host-based replication (HBR). There are third-party products that provide this capability for virtual machines today: Veeam Backup and Replication and Quest vReplicator, just to name a couple.

But here comes vSphere 5 and SRM 5. We'll be entitled to both when they're released. As part of the upgrade we'll get the capability to replicate VMs using vSphere Replication 1.0 for free. I've started setting up a test environment and will post my experiences with this new feature. One thing I'm really curious about is how the bits actually get replicated. Different arrays handle this differently. I will have my investigative hat on at VMworld and will ask the storage vendors for all the gory details. I'll follow up with another article detailing how different vendors implement their replication (geesh, I've got a lot of writing to do!).

In the meantime, I've gathered some information on vSphere Replication 1.0, all of which is publicly available. Exciting stuff! Here are the details:

- This feature is included with all editions of SRM 5
- VMs can be replicated from any storage to any storage, including local disk
- Replicated disks can be placed on any ESXi-compatible disks/filesystem
- Breaks storage vendor lock-in
- Replication is an attribute of the VM (not the LUN or some other element)
- You can choose which VMDKs to replicate within the VM
  - In some cases you may not want to replicate the system drive/VMDK, only the data drive/VMDK
- Disks are replicated in a "group consistent" manner
- Does not use CBT to track and replicate deltas. Instead, VMware developed another technology that tracks I/O changes to VMDKs and captures them in a "PSF" or persistent state file. It does not use VM snapshots
- Initial "seed" copy can be made in advance by FTP, external disk/sneaker net, etc.
  - Saves bandwidth - great if you have a slower WAN connection and/or a large number of VMs to replicate
- RPO can be set on a per-VM basis
  - 5 minutes to ?
  - If you need an RPO smaller than 5 minutes, you've got other challenges to face!

Some limitations:

- The VM must be powered on
  - My guess is that the thinking here is that if it's powered off, it must not be critical enough to recover in a DR scenario. I hope VMware reconsiders this one. I don't have any such VMs today, but I can see the possibility in the future.
- Will not replicate swap, logs, or dumps
- Will replicate VMs with snapshots. However, the snapshots themselves will not be replicated. Instead, the I/O from the source snapshot is written to the destination VM, effectively making the destination VM look like the source VM after collapsing the snapshot.
- If you need to protect more than 500 VMs, not only do you have a large environment, you'll need to use ABR or find an alternative HBR solution that can scale higher (if one exists). With an environment of that size I'd recommend working with your VMware account representative and/or storage vendor.

For a storage geek like me this is pretty exciting stuff. I think a lot of VMware customers, from the small SMB to the mid-sized and even some larger companies, are going to benefit from this new feature.

Monday, August 1, 2011

Don't you love it when, during a standard log review of your vSphere environment, you find an error like this that zaps the next four hours of your time? Not! Maybe this will save you some time.

Scenario
I had the ESXi 4.1 hosts in my vSphere cluster set up to send remote syslog to the VMware vMA appliance per Simon's excellent instructions: Using vMA as Your ESXi Syslog Server
I recently upgraded our vSphere cluster hardware which included a fresh installation of ESXi.
With that in mind, while recently reviewing tasks and events in vCenter, I noticed the error message "Cannot login vi-admin00@IPADDRESS", where IPADDRESS was the IP of the vMA system. I found this error in all of the hosts' local events, and it occurred often.

Troubleshooting
Reading through the comments of the above post, I noticed someone else had the same problem, but there were no responses. I made the "chown" change on the syslog directory, but this did not solve the problem.

I then ran the following command directly on the vMA appliance:

vilogger list --server SERVERNAME

Per the results, I found that the host was "enabled" but had an "Authentication Failure". This got me wondering about that vi-admin00 account in the original error message. The vMA has a "vi-admin" local account, but what is "vi-admin00"? I fired up the vSphere client and logged directly in to one of the hosts. Sure enough, the account didn't exist.

Solution
A little more investigation (er, Google searching), and I found the answer here: How to Remove Stale Targets from vMA
Apparently, rebuilding/replacing the hosts wiped out the accounts vilogger creates, including vi-admin00!

The first step to fix this is to remove the server. I did not need to use the "force" parameter:

sudo vifp removeserver SERVERNAME
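For reference, the end-to-end cleanup looks roughly like the sketch below. This is a hedged outline, not a transcript: SERVERNAME is a placeholder for your host's name, addserver will prompt for the host's root credentials, and you should verify the exact vilogger options against your vMA build's help output.

```shell
# Remove the stale target from vMA's managed-host list
# (its fastpass account, e.g. vi-admin00, no longer exists on the rebuilt host)
sudo vifp removeserver SERVERNAME

# Re-add the host so vMA creates fresh fastpass accounts on it
# (prompts for the host's root credentials)
sudo vifp addserver SERVERNAME

# Re-enable log collection for the host
vilogger enable --server SERVERNAME

# Verify: the host should show as enabled with no "Authentication Failure"
vilogger list --server SERVERNAME
```

After re-adding the host, the "Cannot login vi-admin00@IPADDRESS" events stopped appearing in the hosts' local event logs.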