Scaling SoftLayer

SoftLayer is in the business of helping businesses scale. You need 1,000 cloud computing instances? We'll make sure our system can get them online in 10 minutes. You need to spin up some beefy dedicated servers loaded with dual 8-core Intel Xeon E5-2670 processors and high-capacity SSDs for a new application's I/O-intensive database? We'll get it online anywhere in the world in under four hours. Everywhere you look, you'll see examples of how we help our customers scale, but what you don't hear much about is how our operations team scales our infrastructure to ensure we can accommodate all of our customers' growth.

When we launch a new data center, there's usually a lot of fanfare. When AMS01 and SNG01 came online, we talked about the thousands of servers that are online and ready. We meet huge demand for servers on a daily basis, and that presents us with a challenge: What happens when the inventory of available servers starts dwindling?

Truck Day.

Truck Day is not limited to a single day of the year (or even a single day in a given month) ... It's what we call any date our operations team sets for delivery and installation of new hardware. We tell all of our teams about the next Truck Day in each location so SLayers from every department can join the operations team in unboxing and preparing servers/racks for installation. The operations team gets more hands to speed up the unloading process, and every employee has an opportunity to get first-hand experience in how our data centers operate.

If you want a refresher course about what happens on a Truck Day, you can reference Sam Fleitman's "Truck Day Operations" blog, and if you want a peek into what it looks like, you can watch Truck Day at SR02.DAL05. I don't mean to make this post all about Truck Day, but Truck Day is instrumental in demonstrating the way SoftLayer scales our own infrastructure.

Let's say we install 1,000 servers to officially launch a new pod. Because each pod has slots for 5,000 servers, we have space/capacity for 3,000-4,000 more servers in the server room, so as soon as more server hardware becomes available, we'll order it and start preparing for our next Truck Day to supplement the pod's inventory. You'd be surprised how quickly 1,000 servers can be ordered, and because it's not very easy to overnight a pallet of servers, we have to take into account lead time and shipping speeds ... To accommodate our customers' growth, we have to stay one step ahead in our own growth.
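The lead-time reasoning above can be sketched as a simple reorder check. This is only an illustration, not SoftLayer's actual planning tool: the function names, the daily order rate and the lead time are all hypothetical, and only the 5,000-slot pod size and the 1,000-server launch batch come from the post.

```python
def free_slots(pod_capacity: int, installed: int) -> int:
    """Upper bound on how many more servers the pod can hold."""
    return pod_capacity - installed


def days_of_inventory(available: int, daily_orders: float) -> float:
    """How long the available servers last at a steady order rate."""
    return available / daily_orders


def should_order_now(available: int, daily_orders: float, lead_time_days: float) -> bool:
    """True when inventory would run out before a new shipment could arrive."""
    return days_of_inventory(available, daily_orders) <= lead_time_days


# Figures from the post: a 5,000-slot pod launched with 1,000 servers.
remaining = free_slots(5000, 1000)

# Hypothetical figures: 50 servers ordered per day, 21-day delivery lead time.
order_needed = should_order_now(available=1000, daily_orders=50, lead_time_days=21)
```

With these made-up numbers, the launch inventory lasts 20 days but a shipment takes 21, so the next Truck Day has to be scheduled before the pod even opens.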

This morning in a meeting, I saw a pretty phenomenal bullet that got me thinking about this topic:

Truck Day — 4/3 (All Sites): 2,673 Servers

In nine different data center facilities around the world, more than 2,500 servers were delivered, unboxed, racked and brought online. Last week. In one day.

Now I know the operations team wasn't looking for any kind of recognition ... They were just reporting that everything went as planned. Given the fact that an accomplishment like that is "just another day at SoftLayer" for those guys, they definitely deserve recognition for the amazing work they do. We host some of the most popular platforms, games and applications on the Internet, and the DC-Ops team plays a huge role in scaling SoftLayer so our customers can scale themselves.

Comments

May I ask, however, whether you'd disclose whether or not your ops team opens up the skins on all of these boxes? I used to take delivery of *much* smaller quantities from an unnamed tier 1 x86 vendor, and in spite of the fact that the SKUs were identical, I could see the components were different even with my bare eyes. That is, the vendor had such wide variation in their suppliers' flow of components that my measly (<dozen) deliveries would consist of what boils down to different computers with the same SKU. This variation made things very, very difficult, as from a given lot there would be varying uptime attributes. We switched vendors.

Check out the video on this page, and you can see the process the servers go through from box to rack. At around 1:35, you can watch a time-lapse of some of the Server Build Technicians booting up each of the servers to check the config ... From there, the servers are sorted into stacks of their respective configs. We add barcodes to the boxes for our internal tracking purposes, and the servers are racked.

Our server vendor is Supermicro, and we typically don't see many inconsistencies in build quality or components ... Thankfully!

Hi Kevin,
Great question! The answer is, of course, that this is automated too. We run every server through a check-in process that pulls very detailed information about all hardware components, including models, versions, bus speeds, BIOS dates, chipset versions and, the big one, firmware versions. We use several tools to pull all this information (no one tool I've found does it all). We store that information in a database and compare it against the hardware information we expected from the vendor. Anything that doesn't match gets stopped so human eyes can look at it. Firmware is automatically updated and hardware is tested.
As Mr. Hazard mentioned, Supermicro is pretty good about not changing major components on a whim. When they do make changes, they usually mark them by changing the revision on the board. That gets checked and logged as well. One secret: other unnamed vendors usually have ways of electronically marking the changes they make, but you have to dig deep to find where they store that data (that gets checked too :).
All that being said, there's no electronic check like human eyes. We do have people in the know at each site spot-verify the models that come in, checking for modified chassis, fans and chipsets, plus general "this looks different" checks.
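
For readers curious what the "compare against what we expected" step might look like, here is a minimal sketch in Python. It is not SoftLayer's check-in tooling: the field names, the sample values and the find_mismatches function are all assumptions made up for illustration. The idea is simply that each discovered component value is compared with the expected one, and any difference is listed for a person to review.

```python
def find_mismatches(expected: dict, discovered: dict) -> list:
    """Return human-readable differences between expected and discovered hardware."""
    issues = []
    for field, want in expected.items():
        got = discovered.get(field)
        if got is None:
            issues.append(f"{field}: missing (expected {want})")
        elif got != want:
            issues.append(f"{field}: expected {want}, found {got}")
    # Components the manifest did not mention are also worth a look.
    for field in sorted(discovered.keys() - expected.keys()):
        issues.append(f"{field}: unexpected value {discovered[field]}")
    return issues


# Hypothetical manifest entry and discovery result for one server.
expected = {"board_revision": "1.02", "bios_date": "2012-03-01", "nic_firmware": "4.1"}
discovered = {"board_revision": "1.10", "bios_date": "2012-03-01", "nic_firmware": "4.1"}

issues = find_mismatches(expected, discovered)
needs_human_review = bool(issues)
```

In this made-up case the board revision differs, so the server would be held back for someone to inspect rather than going straight to the rack.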
