Kyle Brandt

Our New York Peer 1 datacenter at 75 Broad is still running on generator power, but as a precaution we decided to fail over Stack Overflow, Careers, and the rest of the Stack Exchange network to our secondary datacenter in Oregon last evening. It turned out to be the right call: the refueling trucks can’t get to the facility, so Peer 1 is shutting down all power in about 30 minutes.

We recently tested a lot of this, but this is our first time failing over everything at once. It is going pretty well, but we have run into a few issues so far:

An index reorg job kicked off right before we failed over. This meant that our SQL replication partners across the country were 40 gigabytes behind, so Stack Overflow had to remain read-only for about an hour

Because the status message on our sites is stored in the database, which was read-only, we couldn’t update it to let everyone know the site would be read-only for about an hour

We realized we have to transfer the AD FSMO role forcefully since the NY DCs were shut down, and we don’t know how much fuel is left

Our backup monitoring system isn’t permitted as an SNMP manager via the Group Policy, so we have to update that
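As a rough back-of-the-envelope check on the replication backlog mentioned above, the sketch below computes the average throughput that draining 40 gigabytes in about an hour would imply. It assumes decimal gigabytes and that the whole hour was spent catching up, neither of which the post confirms; the function name is purely illustrative.

```python
def implied_throughput_mbit_s(backlog_gb: float, minutes: float) -> float:
    """Average throughput (Mbit/s) needed to drain `backlog_gb` in `minutes`.

    Assumes decimal units (1 GB = 10**9 bytes) and a steady transfer rate.
    """
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    total_bits = backlog_gb * 10**9 * 8
    return total_bits / (minutes * 60) / 10**6


# Figures from the post: 40 GB behind, read-only for about an hour.
rate = implied_throughput_mbit_s(backlog_gb=40, minutes=60)
print(f"Implied average throughput: {rate:.1f} Mbit/s")
```

Under those assumptions the backlog works out to a little under 90 Mbit/s sustained across the country, which gives a feel for why the catch-up took as long as it did.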

We have some open concerns, and will be keeping a close eye on the following:

Oregon has some lower-end Dell switches, and we hope they handle the load. We will be shipping the current 2960S switches to OR once we upgrade our NY switches to the Nexus 5k/2k line in a couple of weeks

Our load balancers out in OR are a little tight on CPU

We have 5 web servers in OR instead of 10. However, the combined CPU load on the NY web tier is usually 100-200% (out of 1000%), so I think it will be okay.
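The web tier reasoning can be checked with a few lines of arithmetic. This is a minimal sketch, assuming the OR web servers are comparable in spec to the NY ones and that load spreads evenly across them; neither is stated in the post, and the function name is just for illustration.

```python
def per_server_utilization(combined_load_pct: float, servers: int) -> float:
    """Average CPU utilization per server, in percent.

    `combined_load_pct` is the summed load across the tier, where one
    fully busy server counts as 100%.
    """
    if servers <= 0:
        raise ValueError("servers must be positive")
    return combined_load_pct / servers


NY_COMBINED_LOAD = (100.0, 200.0)  # typical range quoted in the post
OR_WEB_SERVERS = 5                 # half of the 10 servers in NY

for load in NY_COMBINED_LOAD:
    util = per_server_utilization(load, OR_WEB_SERVERS)
    print(f"{load:.0f}% combined load -> about {util:.0f}% per OR web server")
```

On those assumptions, the usual NY load would put each of the five OR servers at roughly 20-40% CPU, which leaves headroom for spikes.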

However, in the big picture, we have successfully failed over to Oregon! Today is going to feel like taking the sub below its depth rating; you can watch the Das Boot video to share our feelings.

As an admin who had to deal with switching around FSMO roles because of a hurricane, I will be interested in how you handle bringing NY back up when it’s said and done. It became quite the mess for a very young straight-outta-school me, and I haven’t had to handle a similar situation since, so I’d like to know the “right” way just in case

You guys will love those Nexus switches once you get them up; they are crazy fast. We just implemented them at my workplace about a month ago, and we yanked a couple of miles of Cat6 from an ESX cluster. A quarter of the cabling at 10 times the speed.

Oh, and FSMO roles can be a PITA, especially if you have to ‘offline’ transfer them and then move them back later without hosing everything. Getting them before the primary DCs go down saves some headache, but in this case I can see how time wasn’t in your favor.

From what I can remember (this was 2005… so forgive me if I’m wrong), I thought once you moved them offline, you could never even boot the original FSMO role holder again or Very Bad Things™ would happen to the entire forest. Am I wrong?

If you have two RID masters, you will get problems. The other four you might get away with, but don’t take the risk. Take the old DCs out of AD completely (there’s a KB article on removing a failed DC somewhere). Don’t switch them back on, wipe the hard drives and reinstall the OS from scratch. And give them a new name – I never trust that a forcibly-removed DC is completely out of AD.

It’s remarkably hard to break anything with role seizure, and it’s impossible when the systems that have had their role seized are turned off. All the roles are relinquished when the DC gets the knowledge of the role seizure replicated inbound; the only way it breaks is if you have changes made on both sides that conflict.

We are planning on moving our NYC facility to an Internap at 111th so we have more room for expansion. To avoid downtime during the move, we were preparing to fail over to OR. We have always had servers in the second datacenter, so we probably could have waddled through a failover in any case, but because of our practice sessions we were in a much better position.