Nuts & Bolts: Potpourri

As my final installment in the Nuts & Bolts series, I want to hit a few of the questions that were sent in that I didn’t get a chance to get to earlier in the week. I hope you’ve enjoyed reading these as much as I’ve enjoyed writing them.

What colocation provider did you choose, and why?

After an exhaustive (and exhausting!) selection process, we chose ServerCentral to host our infrastructure. They have an awesome facility with some of the most thoughtful and redundant datacenter design I’ve ever seen. On top of top-notch facilities, they have a great network via their sister company nLayer.

Finding a partner who could manage the hardware for us without us having to be onsite was a big deal for us too. The quality of “remote hands” support from datacenter to datacenter is, well, let’s just call it inconsistent and be generous. ServerCentral has a great reputation with its customers in that regard and we’ve found their support to be excellent. They manage all of the physical installations, hardware troubleshooting, and maintenance for us.

They do a mean cabling job too.

How do you bootstrap new hardware when it is installed?

We have a PXE installation server that handles installation of bare metal machines using the Ubuntu preseed unattended installation mechanism. All we do is add a small snippet of data with things like the MAC address of the primary network interface and a hostname to our Chef configuration management system and it generates the required configuration on our installation server. Our installations are extremely bare bones with just enough operating system to run our Chef configuration management recipes for final configuration.
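To make the bootstrap mechanics a little more concrete, here’s a minimal sketch (not our actual tooling) of the naming convention PXELINUX uses to find a per-host boot configuration: it looks for a file named after the ARP hardware type (01 for Ethernet) plus the NIC’s MAC address, lowercased, with colons replaced by dashes. An install server that generates per-host configs needs to produce exactly this filename.

```python
def pxelinux_config_name(mac: str) -> str:
    """Return the per-host PXELINUX config filename for a given MAC.

    PXELINUX searches pxelinux.cfg/ for '01-' (Ethernet ARP type)
    followed by the MAC address, lowercase, colon-separated octets
    joined with dashes.
    """
    return "01-" + mac.lower().replace(":", "-")

# Example: the MAC address provided by the datacenter technicians
print(pxelinux_config_name("00:1E:C9:AB:CD:EF"))
# -> 01-00-1e-c9-ab-cd-ef
```

A configuration management system like Chef can template the file at that path with the kernel, initrd, and preseed URL for the host, and the next PXE boot does the rest.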

The typical workflow is that hardware arrives at ServerCentral and is installed by their technicians. The technicians cable it per our specifications, configure the DRAC remote access cards for us, and provide us with the MAC address of the primary interface. From there we can configure our installation server, power the machine on, and within about 5 minutes have a machine that is ready to go. It works great.

How did you do such a large migration from Rackspace to your own colo? I’m assuming you had something like VMWare Motion to move some of it without downtime/interruption?

Actually, it was pretty straightforward, at least in the broad strokes. There was a lot of work involved, but the general process goes something like this:

Set up a database server at the new facility, restore a recent backup to it, and connect it to the production server over a VPN for replication. We also set up the old production server to continually warm the cache of the new server so that it’s ready as soon as we flip the switch.
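That replication step can be sketched as follows, assuming a MySQL-style primary/replica setup; the host, credentials, and binlog coordinates are hypothetical and would come from the notes recorded with the backup:

```sql
-- On the NEW database server, after restoring the backup:
CHANGE MASTER TO
  MASTER_HOST = '10.0.0.5',             -- old production master, over the VPN
  MASTER_USER = 'repl',
  MASTER_PASSWORD = '...',
  MASTER_LOG_FILE = 'mysql-bin.000123', -- binlog file recorded with the backup
  MASTER_LOG_POS = 4;                   -- binlog position recorded with the backup
START SLAVE;
SHOW SLAVE STATUS\G                     -- watch Seconds_Behind_Master fall to 0
```

Once the replica catches up, the cutover is just a matter of stopping writes on the old master and promoting the new server.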

Set up the production web/application/proxy tier at the new facility. The vast majority of this is boilerplate that already exists in our configuration management system, so it’s largely a matter of adding a configuration entry and running Chef.

Test!

Change DNS to point to the new site, and set up the old one to proxy to the new one to catch any DNS stragglers.
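The straggler-catching step can be sketched with a simple reverse proxy running at the old facility; this assumes nginx, and all names and addresses are hypothetical:

```nginx
# Runs at the OLD site after the DNS change; any client still resolving
# the old address is transparently forwarded to the new facility.
server {
    listen 80;
    server_name example.com;

    location / {
        proxy_pass http://new-site.example.com;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $remote_addr;
    }
}
```

Once the old DNS TTL has long expired and the proxy logs go quiet, the old equipment can be safely decommissioned.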

There are a lot of details about timing of the few dozen steps involved in the final switchover, but we were able to reduce that to a checklist and repeat it successfully. Campfire, which was the last application to move, was only down for about 25 minutes during the migration. This is one situation where having all of our data on S3 was admittedly a huge benefit, since we didn’t have to worry about immediately replicating a huge amount of data (outside of the database) to the new site.

Why did you choose the 2U Dell R710 servers instead of a 1U server, since it looks like you’re only using 2 drives in most of them?

First of all, you guys looked pretty closely at the pictures.

There were three main reasons why we stuck with the 2U Dell R710s:

Flexibility. What is an application server with modest disk requirements today may need to be repurposed down the road into a role that requires more capacity.

Consistency. We use the Dell R710 chassis for absolutely everything in our infrastructure: database servers, application servers, proxy servers, everything. This makes configuration and spares much easier to manage, since they’re all very similarly configured with perhaps some changes in the memory, hard drive, and CPU configurations.

They’re small enough. The limiting factor in most modern datacenters is much more likely to be power than physical space, and the cost structure reflects that. The 2U form factor strikes a nice balance between ease of maintenance and space efficiency. There wouldn’t be a significant change in our hosting bill if we went with 1U devices.

How do you manage redundancy and fault-tolerance?

Absolutely nothing goes into production unless it has a minimum of one other system to fail over to. This philosophy carries through from our network, to our databases, to our application servers, and so on.

To give just one example, let me talk about how our servers are connected to the network. At the top of each of our cabinets are a pair of 48 port Cisco 3750G switches, in a stacked configuration. We run three main networks (VLANs) in our environment, a general purpose network, a storage dedicated network, and a network for our remote management systems.

On each server, we have five network interfaces in use: one for the remote access card, and the other four for the general purpose and storage networks. We run one cable to each switch for each network and bond the two ports into a single logical interface using 802.3ad link aggregation. This configuration ensures that we can lose any cable, network port, or switch without losing connectivity to a server.
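On Ubuntu with the ifenslave package, one such bonded pair might look roughly like this; the interface names and addresses are hypothetical:

```
# /etc/network/interfaces — one cable from the pair goes to each
# switch in the stack, bonded with 802.3ad (LACP).
auto bond0
iface bond0 inet static
    address 10.1.1.10
    netmask 255.255.255.0
    bond-slaves eth0 eth1     # eth0 -> switch 1, eth1 -> switch 2
    bond-mode 802.3ad         # LACP link aggregation
    bond-miimon 100           # link monitoring interval, in ms
    bond-lacp-rate fast
```

Aggregating across two physical switches like this works because the stacked 3750Gs present themselves as a single logical switch, so the LACP bundle can span both members.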

This same kind of thought process was repeated throughout our entire environment and played a big role in the reason we chose the Isilon storage systems. I won’t say that we don’t have any single points of failure, but we try to eliminate them as much as we possibly can. We make sure that we know about the ones we can’t eliminate and have plans to respond to failures.

By the way, the photo above and the rest of the photos in the series were taken by John Williams, one of the great sysadmins on my team. Who knew he was multitalented?!

Jeffrey R.

on 30 Jul 10

Mark: Excellent work on the “Nuts & Bolts” series. This type of information is invaluable and really points to why (at least some of the why) 37S is so successful.

One constructive criticism though. I would add cross-references between the different posts. That way it is easier to follow the whole series (and send pointers to your friends). Hmmm, I wonder what inspired this comment?

scotts

on 31 Jul 10

This has been an extremely informative set of articles, with tangible information you don’t see very often… I encourage you to make it a regular series.

Mark, thank you for those posts. Having a peek into how large web operations work is invaluable, and I am happy the devops (or whatever you want to call them) are sharing more and more.

I have a question: as you said, all your systems are highly available on several levels. What I have been wondering for a while is: is it really necessary to invest in network (switch and bonded network cards) or even electrical redundancy when you have application/service redundancy? Is it really necessary to double the number of switches and power supplies when you have several application servers (for example)?

MI

on 31 Jul 10

Eric, that’s a great question. The short answer, at least in our case, is that we don’t have enough hardware to back down on per-server redundancy. If you have a hundred racks of equipment and you lose one because its switch died or the power strip failed, that’s probably not a big deal. If you have fewer servers, such that losing half a rack or a rack of equipment would be a significant percentage of your capacity, the story is very different.

We’re very firmly in the second camp. We considered building an architecture such that we could lose an entire cabinet of equipment without suffering an outage. As we looked at it, we came to the conclusion that it was infeasible given the aggregate amount of equipment that is involved in our infrastructure.

Walt

on 01 Aug 10

I can agree with Mark’s comment. I am responsible for an infrastructure with 600 operating system instances and 400+ physical servers. This is spread across multiple physical hosts and plenty of backup everything – network, power, cooling, clustered servers, you name it.

All I can tell you after living with my own operation for a while is regardless of what you do, if you haven’t identified the single point of failure in your infrastructure, you will – eventually, and probably soon. It is out there, just waiting to bite you in your ass.

Paul

on 02 Aug 10

How many hours did it take you to test the apps after moving them to the new facilities?