vSphere Replication to secondary data center.


This is stage 1 of the next phase of our DR plan: replicating all mission-critical VMs to the secondary data center (which comprises the hardware that used to be our primary data center). Stage 2 will be migrating some more site-specific VMs to the secondary DC and replicating them back to the primary. Stage 3 will be a fully automated failover scenario via SRM (yay!).

We have two buildings 14 miles apart, connected via Metro Ethernet (layer 2, 100 Mb/s, ~3 ms latency! Not cheap, but you get what you pay for). The secondary DC is the old gear from that other project here on SW, the one about migrating live into a new VMware cluster? Yeah, that one.

Funny bit about the OS on that long-in-the-tooth SAN: the silly Flash web interface wouldn't let me resize a LUN into the newly available disk because of some errant bounds checking, but with the same actual space on disk it happily let me create a new LUN of the correct size. Really? I'm going to have to juggle? Yup, really. Grab your unicycle. After some faffing about and creating some tiny placeholder datastores, I finally got my LUNs set up at DC2 to mirror my production LUNs at DC1, and storage was ready for some replicatin'!

Then there was the vSphere part. This is where I made use of all that labor time my vendor quoted me. I said, "Call me when it works," and they did just that. You're in for some fun times when you install with credentials that later change, and the front end has no interface for updating those same credentials. Yay, block time: my vendor got to sit on the phone with support instead of me.

Oh hey, guess what? Those RPOs? And how VMware calculates them? Yeah, that means you have zero control over the replication schedule. To their credit, I was hammering my tiny 10 Mb pipe for, uh, 13 days (?) to get the initial replication seeded, and it actually worked. Thereafter, with 24-hour RPOs for 18 different VMs, VMware just kind of figures it out, but that meant I was getting 101% bandwidth saturation between the two sites - both of which house approximately 40% of our total user population - during production hours! Oh, joy. There's nothing like getting a call from your CFO and being able to tell that it's Not Good (TM), but not actually being able to make out the words, because that voice traffic rides the same lines as my randomly scheduled replication data dumps. We got that 10 Mb bumped to 100 pretty quickly after that.
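For the curious, the seeding numbers roughly check out. A minimal back-of-the-envelope sketch (the total dataset size is my inference, not a figure from the post; real replication traffic also carries protocol overhead, so actual payload would be somewhat less):

```python
# Rough check: how much data does a saturated 10 Mb/s link move in ~13 days?
# Both figures are from the post; everything derived from them is approximate.

LINK_MBPS = 10   # the "tiny 10M pipe" at seeding time, in megabits/s
SEED_DAYS = 13   # rough duration of the initial seed

bytes_per_sec = LINK_MBPS * 1_000_000 / 8          # 1.25 MB/s
total_bytes = bytes_per_sec * SEED_DAYS * 86_400   # seconds in 13 days
total_tb = total_bytes / 1e12

print(f"~{total_tb:.1f} TB seeded")  # ~1.4 TB
```

So the initial sync pushed somewhere around 1.4 TB across the wire, which is why it took the better part of two weeks.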

... that's all for now, folks. I'll maybe come back and update this later, but suffice it to say for now that I'm happy with replication, and 6 minutes per VM to push ~3 GB per day and hit my 24-hour RPOs makes me very happy indeed.
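Those steady-state numbers can be sanity-checked the same way. A quick sketch, assuming the "~3G per day" means roughly 3 GB of changed blocks per VM per sync:

```python
# Does ~3 GB of deltas in ~6 minutes per VM fit the upgraded 100 Mb/s link?
# Both inputs are the post's own rough figures.

DELTA_GB = 3       # approximate changed data per VM per day
SYNC_MINUTES = 6   # observed time to replicate one VM's delta

mbps = DELTA_GB * 1e9 * 8 / (SYNC_MINUTES * 60) / 1e6
print(f"~{mbps:.0f} Mb/s during a sync")  # ~67 Mb/s
```

That works out to roughly 67 Mb/s while a sync is running, which comfortably fits the 100 Mb link, and also neatly explains why the old 10 Mb pipe fell over whenever a sync landed during production hours.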