Author Archive: Sam Fleitman

Ever wonder what a SoftLayer data center looked like before it became a SoftLayer data center? Each one of our facilities is built from a "pod" concept: You can walk into any of our server rooms in any of our facilities around the country (soon to be "around the world"), and you'll see the same basic layout, control infrastructure and servers. By building our data center space in this way, we're able to provide an unparalleled customer experience. Nearly every aspect of our business benefits from this practice, many in surprising ways.

From an operations perspective, our staff can work in any facility without having to be retrained, and the data center construction process becomes a science that can be replicated more quickly with each subsequent build-out. From a sales perspective, every product and technology can be made available from all of our locations. From a network perspective, the network architecture doesn't deviate significantly from place to place. From a finance perspective, if we're buying the same gear from the same vendors, we get better volume pricing. From a marketing perspective ... I guess we have a lot of really pretty data center space to show off.

We try to keep our customers in the loop when it comes to our growth and expansion plans by posting pictures and updates as we build new pods, and with our newest facility in San Jose, CA, we've been snapping photos throughout the construction progress. If you've been patiently reading this part of the blog before scrolling down to the pictures, you get bonus points ... If you looked at the pictures before coming back up to this content, you already know that I've included several snapshots that show some of the steps we take when outfitting new DC space.

The first look at our soon-to-be data center is not the flashiest, but it shows you how early we get involved in the build-out process. The San Jose facility is brand new, so we have a fresh canvas for our work of art. If I were to start talking your ear off about the specifics of the space, this post would probably go into next week, so I'll just show you some of the most obvious steps in the evolution of the space.

The time gap between the first picture and the second picture is pretty evident, but the drastic change is pretty impressive. Raised floor, marked aisles, PDUs ... But no racks.

Have no fear, the racks are being assembled.

They're not going to do much good sitting in the facility's office space, though. Something tells me the next picture will have them in a different setting.

Lucky guess, huh? You can see in this picture that the racks are installed in front of perforated tiles (on the cold aisle side) and on top of special tiles that allow us to snake cabling from under the floor to the rack without leaving open space for the cold air to sneak out where it's not needed.

The next step in the process requires five very expensive network switches in each rack. Two of the switches are for public network traffic, two are for private network traffic and one is for out-of-band management network traffic.
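To put that per-rack layout in concrete terms, here is a minimal sketch that tallies switches by role for a given number of racks. The role labels and the function name are illustrative assumptions, not anything from SoftLayer's tooling; only the two/two/one split comes from the description above.

```python
from collections import Counter

# Per-rack switch roles as described above: two public, two private,
# one out-of-band management. The role names are illustrative labels.
RACK_SWITCH_ROLES = ["public", "public", "private", "private", "management"]


def switches_needed(rack_count):
    """Return the total number of switches per role for rack_count racks."""
    per_rack = Counter(RACK_SWITCH_ROLES)
    return {role: count * rack_count for role, count in per_rack.items()}


# Hypothetical row of 10 racks.
totals = switches_needed(10)
print(totals)
print("total switches:", sum(totals.values()))
```

The rack count of 10 is only an example; the post does not say how many racks are in a row or a pod.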

Those switches won't do much good for the servers if the servers can't be easily connected to them, so the next step is to attach and bind all of the network cable from the switches to where the servers will be. As you'll see in the next pictures, the cabling and binding is done with extreme precision ... If any of the bundles aren't tightly wound, the zip ties are cut and the process has to be restarted.

While the cables are being installed, we also work to prepare our control row with servers, switches, routers and appliances that mirror the configurations we have in our other pods.

When the network cables are all installed, it's a pretty amazing sight. When the cables are plugged into the servers, it's even more impressive ... Each cable is pre-measured and ready to be attached to its server with enough length to get it to the port but not so much that it leaves slack.

One of the last steps before we actually get the servers installed is to install the server rails (which make installing the server a piece of cake).

The servers tend to need power, so the power strips are installed on each rack, and each power strip is fed from the row's PDU.

Every network and power cable in the data center is labeled and positioned exactly where it needs to be. The numbers on the cables correspond with ports on our switches, spots in the rack and plugs on the power strip so we can immediately track down and replace any problem cables we find.
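As a rough illustration of how a numbered label can point straight at a switch port, a rack position and a power strip plug, here is a small sketch. The label format `RACK-Uxx-Pyy-Ozz` and every field name are made up for this example; the actual labeling scheme isn't described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CableLabel:
    rack: str         # hypothetical rack identifier
    rack_unit: int    # position ("spot") in the rack
    switch_port: int  # port number on the switch
    power_plug: int   # outlet number on the power strip


def parse_label(text):
    """Parse a made-up label of the form 'RACK-Uxx-Pyy-Ozz' into its parts."""
    rack, unit, port, plug = text.split("-")
    # Drop the one-letter prefix (U, P, O) and convert the rest to a number.
    return CableLabel(rack, int(unit[1:]), int(port[1:]), int(plug[1:]))


label = parse_label("R12-U07-P07-O07")
print(label)
```

With a scheme along these lines, a technician holding a suspect cable can read off exactly which port, rack position and outlet to check.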

If you've hung around with me for this long, I want to introduce you to a few of the team members that have been working night and day to get this facility ready for you. While I'd like to say I could have done all of this stuff myself, that would be a tremendous lie, and without the tireless efforts of all of these amazing SoftLayer folks, this post would be a whole lot less interesting.

A funny realization you might come to is that in this entire "data center" post, there's not a single picture of a customer server ... Is it a data center if it doesn't have data yet?

In catching up on some of my blog reading, I ran across this blog by Jill Eckhaus of AFCOM (a professional organization for data center managers). Yes, I realize that article is four months old, but like I said – I’m catching up.

One of the things that really concerns me about articles and blogs such as this one is their repeated emphasis on “data security” and “loss of control” of your infrastructure. Both of those points are easy to state because they prey on the natural fear of any system administrator or data center manager.

System administrators have long ago come to realize that, in the proper environment, there is no real downside to not being able to physically place their hands upon their servers. In the proper environment the system administrator can power on or off the server, can get instant KVM access to the server, can boot the server into a rescue kernel to try to salvage a corrupt file system, can control network port speeds and connectivity, can reload the operating system, can instantly add and manage services such as load balancers and firewalls, can manage software licenses and naturally, can control full access to the server with root or administrator level privileges. In other words, there is no “loss of control” and “data security” is still up to the system administrator.

The data center managers are understandably concerned about outsourcing because it can potentially impact their jobs. But let’s face it – in today’s economy, the capital outlay required to acquire new datacenter space or additional datacenter equipment is extremely difficult to justify. In those cases sometimes the only two options are to do nothing or to outsource to an available facility. Of course, another option is to jeopardize your existing facility by trying to cram even more services into an already overloaded data center. If a data center manager is trying to build a fiefdom of facilities and personnel, outsourcing is certainly going to be a concern. One interesting aspect of outsourcing is that datacenter management jobs are still there; they are just at consolidated and oftentimes more efficient facilities.

In reality, “data security” and “loss of control” should be of no more or less concern if you are using your own data center versus if you are doing the proper research and selecting a viable outsourcing opportunity with a provider that can prove it has the processes, procedures and tools in place to handle the job for you.

(In the spirit of full disclosure: I am both a local and national AFCOM member and find the organization and the information they make available to be quite useful.)

A lot of the session was focused on end user security regarding spyware, rogue anti-virus, malware and other general badware. But part of the discussion was in regards to the security efforts of the hosting industry in general and SoftLayer specifically. Some of the things we deal with in the hosting industry are second nature to those of us that have been here for a while. But when you start talking about it in front of a different crowd, you begin to appreciate the different perspectives that are out there.

For instance, one common perception (held by some, but obviously not by all) is that once we are made aware of a server that has malware on it, all we have to do is pull the plug on the server and the problem is resolved. However, sometimes the consequences of doing so are high enough to be worthy of a second look. For instance, consider the scenario where SoftLayer rents a server to a customer. That customer slices the server into virtuals using Parallels’ Virtuozzo product and rents a virtual to another customer. That customer puts cPanel on it to sell shared hosting accounts. Now SoftLayer is two layers removed from the actual end user. If that end user’s website gets compromised and begins to distribute malware, how do we at SoftLayer deal with the problem? Ideally, we tell our customer and they tell their customer and they tell the end user about the problem. The end user reacts quickly and cleans up the site. That’s not anywhere close to “best case scenario”, but I would call that a reasonable real-world response.

The problem is, if any of the individuals in that chain of communication fails to react quickly, then the response time for that issue is drastically impacted and more people are potentially victimized by the malware. At what point do we pull the plug on the server? At what point do we decide that all of the other customers on the server have to suffer because of the one bad apple or because of a slow response time from one customer in the chain of communication? Websense did a study that showed that in the second half of 2007, over half of all sites distributing malware were themselves compromised sites, so the scenario described above is actually a very common problem. It also highlights that there is one more victim in the incident: the web site owner.

We deal with every abuse report that we receive as prudently and expeditiously as possible. In some cases, we pull the plug immediately. In others, we try very hard to work with the customer to resolve the issue. But in all cases, we are constantly working to act as quickly as possible on each individual case.

This is just one of the many scenarios that we have to deal with and it highlights why having a good relationship with your provider is such an important factor when choosing someone to help supply or service your IT needs.

No – this isn’t one of those blogs or editorials ranting and railing about how no one out there is able to provide good customer service anymore. This isn’t about how no one in the service industry – from restaurants to retail and everything in between – seems to care about the customer anymore. People have been writing those stories for the past 50 years (about half as long as they have been writing about the coming demise of baseball). This is just a short little missive lamenting how the same people that complain about lack of service are often people that work in the service industry themselves.

I often find myself in a retail store wondering why I can’t get help locating an object. Or in a restaurant wondering where the wait staff is. Or trying to work my way through an automated phone help system. Part of me sympathizes with the wait staff knowing that they are probably just too busy to get to my table. Maybe the restaurant is understaffed or maybe they have an unexpected rush of customers. And part of me even realizes the operational value of the automated phone system. The ability to reduce head count and lower costs with an automated system seems like a great idea (and sometimes it is).

But when I find myself in those aggravating situations and my anger is just about to get the better of me, I generally come back to the fact that I, like everyone else who works at SoftLayer, am in the customer service industry. Oh, I might complain to a manager or I might tip less or I might shop at that location less. But more important than that, I try to use that experience as a reminder of how important customer service is. I’m not talking about just the ability to provide the product the customer is looking for – I mean the ability to answer questions in a timely manner, to answer the phone as quickly as possible, to handle outages as quickly and professionally as possible, to provide customers with frequent updates and most importantly, to treat every customer interaction with the level of urgency that the customer thinks it deserves.

And THAT’s the important part – not just solving the problem, but making sure that the customer’s expectations are met.

If you read through some of the previous blogs on this site such as our CEO’s “SoftLayer Thinks ‘Outside the Box’” or the blog written by one of our super developers, Mr McAloon, entitled “Simplicity”, or Mr Rushe’s “An Interview with an elevator” (OK – that has nothing to do with what I’m referring to, but it’s one of the funniest blogs on this site), one thing you’ll notice is that at SoftLayer, we try to automate and simplify things for the customer. Our customer portal has a LOT of customer features. There are automated OS reloads, the ability to boot into a rescue kernel, the capability to add IP addresses on demand, add and configure a firewall or a local or global load balancer, the ability to edit your DNS settings (forward or reverse) and – my favorite – the ability to reboot your server via IPMI or the power strip. You can also manage your CDN services, monitor your NAS or iSCSI storage, configure backups, use the free KVM services, check your bandwidth and of course, handle all of the usual things like opening support tickets or checking your invoice. Or, if you want to integrate any or all of those features into your own management system, there is a full API available for your use.

With all of that functionality in the portal, one of the challenges we continuously run into is educating new customers on all of the features. Not just educating them on how to use the features – but that the features actually exist in the customer portal. A lot of our customers are either new to On-Demand IT Infrastructure Services (aka the hosting environment) or come from other competitors that only offer a fraction of the features that we are able to provide. For instance, you would be amazed at how many customers open “reboot” tickets. While we respond to tickets quickly, it is actually faster for the customer to click on the “reboot” button in the portal than to click on the “create new ticket” link in the portal and then type out a reboot request.

As ways to address that issue, we created a private customer forum so that customers can share ideas, comments and suggestions with each other. We have also not only created the KnowledgeLayer FAQ database, but we have integrated that directly into the support ticket feature of the portal (when you open a ticket, the FAQ system will automatically recommend related fixes before you even submit the ticket). We also have tutorials directly linked inside the portal and even have all of our API documentation available for review.

So one of the challenges we have at SoftLayer isn’t just creating and deploying the new features and services that keep us out in front of the pack, but educating our customers of their existence and their ease of use. BTW, that’s a great problem to have!

The growth in energy demanded by, and used in, IT environments is a well documented phenomenon. Datacenters are using more energy as CPUs get faster, hard drives become larger, and end user demand for access to data and applications continues to increase. Prices for the underlying hardware and services continue to fall, which just fuels more demand.

Datacenter operators have done their best to maximize the use of every available asset within a facility in order to operate highly efficient environments. Much of the emphasis to date has been on proper datacenter alignment: hot-aisle/cold-aisle configurations, blanking panels to cover gaps in server racks, and sealing holes under raised floors to better contain cold air have become commonplace in the data center.

However, in most large organizations, there are many areas that need more attention. Departments within a large company often have competing goals that negate green IT efforts. One example of this would be –

The system administrators and developers want the biggest, fastest machines they can get with the most expandability. This enables them to add memory or hard drives as utilization increases – which makes their jobs much easier to perform and helps them better maintain customer SLAs.

The purchasing (and finance) department’s primary goal is to save money. The focus is to work with the vendors to reduce the overall hardware cost.

The disconnect between those two departments will often leave the datacenter manager out in the heat (definitely not “out in the cold”). That person’s job essentially becomes “just find a place to put it” until the datacenter is full enough that the answer becomes “no more”. It then becomes a “fix it now” problem as the company struggles with plans to build more datacenter space. So, the problem is a short term view (meeting quarterly earnings results) versus long term direction (to achieve a sustainable and efficient operations environment that may have a short term cost implication).

What should happen is that the disparate groups need to work together throughout the entire planning process. The purchasing department, the system administrators, developers, and the datacenter managers should build a common plan and set realistic expectations in order to optimize the IT infrastructure required and to best meet business, operations, and efficiency objectives.

Let’s continue the example from above… if a server is ordered just because it’s more expandable (more expansion slots, RAM slots and hard drive bays), that means that the power supply has to be bigger to support the potential need of those future components. A server power supply is most efficient (wastes the least amount of power doing the conversion) when it is running at 80-90% load. If a power supply is oversized to support potential future needs, then it is operating at a much lower efficiency than it should – thus generating more heat, wasting more power and requiring more cooling, which in turn requires more power to run the ACs.
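To make the load argument concrete, here is a minimal sketch that checks whether a server's draw lands in that 80-90% band for a given power supply rating. The wattages are hypothetical, and the function names are mine; the only figure taken from the discussion above is the 80-90% range.

```python
# Figure from the discussion above: a supply is most efficient at
# roughly 80-90% load.
EFFICIENT_BAND = (0.80, 0.90)


def load_fraction(draw_watts, psu_rating_watts):
    """Fraction of the power supply's rated capacity actually in use."""
    return draw_watts / psu_rating_watts


def in_efficient_band(draw_watts, psu_rating_watts, band=EFFICIENT_BAND):
    """True if the load fraction falls inside the efficient band."""
    low, high = band
    return low <= load_fraction(draw_watts, psu_rating_watts) <= high


# Hypothetical server drawing 425 W.
right_sized = in_efficient_band(425, 500)   # 85% load on a 500 W supply
oversized = in_efficient_band(425, 1000)    # 42.5% load on a 1000 W supply
print(right_sized, oversized)
```

The same server sits comfortably in the band on the smaller supply and well below it on the larger one, which is the situation the "buy for expandability" approach tends to create.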

That might seem like a small price to pay for expandability, but when that scenario is multiplied over an entire datacenter, the scope of the problem becomes very significant. This could lead to lost efficiency of well over 20% as a business plans and buys ahead of demand for the computing capacity it may need in the future.

So, what is the other option? Is purchasing right? Should IT simply buy a small server, at a lower total cost, and scale as the business scales? The problem with this is that it tends to lead to exponential growth in all aspects of IT – more racks to house smaller servers, additional disks, more space and power over time, increased obsolescence of components, and more lost efficiency.

The problem is considerably more complex than either option. The simple fact is that the “fixes” for IT go well beyond implementing a hot-aisle cold-aisle layout and sealing up holes under the raised floor of the datacenter. Now that those things have become “best practices,” it’s time to start highlighting all of the other things that can be done to help improve energy efficiency.

At SoftLayer, we promote an energy efficient focus across the entire company. Datacenter best practices are implemented in all of the datacenter facilities we occupy; we use hot-aisle cold-aisle configurations, we use blanking panels, we use 208v power to the server, we pay very close attention to energy efficient components such as power supplies, hard drives and of course CPUs, and we recycle whatever we can.

Most importantly, we deliver a highly flexible solution that allows customers to scale their businesses as they grow – they do not need to over buy or under buy, since we will simply “re-use” the server for the next customer that needs it. Individually, the energy savings from each of these might be fairly small. But, when multiplied across thousands and thousands of servers and multiple datacenters – these many small things become one large thing quickly – a huge savings in energy consumption over a traditional IT environment.

Ultimately, SoftLayer believes that we can never be satisfied with our efforts. As soon as one set of ideas becomes commonplace or a best practice, we need to be looking for the next round of improvements and bringing those new ideas and practices forward so all can benefit.

In Steve's last post he talked about the logic of outsourcing. The rationale included the cost of redundant internet connections, the cost of the server, UPS, small AC, etc. He covers a lot of good reasons to get the server out of the broom closet and into a real datacenter. However, I would like to add one more often overlooked component to that argument: the Spares Kit.

Let's say that you do purchase your own server and you set it up in the broom closet (or a real datacenter for that matter) and you get the necessary power, cooling and internet connectivity for it. What about spare parts?

If you lose a hard drive on that server, do you have a spare one available for replacement? Maybe so - that's a common part with mechanical features that is liable to fail - so you might have that covered. Not only do you have a spare drive, the server is configured with some level of RAID so you're probably well covered there.

What if that RAID card fails? It happens - and it happens with all different brands of cards.

What about RAM? Do you keep a spare RAM DIMM handy or if you see failures on one stick, do you just plan to remove it and run with less RAM until you can get more on site? The application might run slower because it's memory starved or because now your memory is not interleaved - but that might be a risk you are willing to take.

How about a power supply? Do you keep an extra one of those handy? Maybe you keep a spare. Or, you have dual power supplies. Are those power supplies plugged into separate power strips on separate circuits backed up by separate UPSs?

What if the NIC on the motherboard gets flaky or goes out completely? Do you keep a spare motherboard handy?

If you rely on out of band management of your server via an IPMI, Lights Out or DRAC card - what happens if that card goes bad while you're on vacation?

Even if you have all necessary spare parts for your server or you have multiple servers in a load balanced configuration inside the broom closet, what happens if you lose your switch or your load balancer or your router or your... What happens if that little AC you purchased shuts down on Friday night and the broom closet heats up all weekend until the server overheats? Do you have temperature sensors in the closet that are configured to send you an alert - so that now you have to drive back to the office to empty the water pail of the spot cooler?

You might think that some of these scenarios are a bit far-fetched, but I can certainly assure you that they're not. At SoftLayer, we have spares of everything. We maintain hundreds of servers in inventory at all times, we maintain a completely stocked inventory room full of critical components, and we staff it all 24/7 and back it all up with a 4 hour SLA.

Some people do have all of their bases covered. Some people are willing to take a chance, and even if you convince your employer that it's ok to take those chances, how do you think the boss will respond when something actually happens and critical services are offline?

"ah - I don't need backups."
"Too busy to do backups - I'll get to that later."
"Backups? It costs too much."
"I don't need backups - MTBF of a Raptor is 1.2 Million hours."
"Oops - I forgot about doing backups."

Backups are one of the most commonly forgotten tasks of a system administrator. In some cases, they are never implemented. In other cases, they are implemented but not maintained. In other cases, they are implemented with a great backup and recovery plan - but the system usage or requirements change and the backups are not altered to compensate.

A hard drive really is a fairly reliable piece of IT equipment. The WD 150GB Raptor has a rating of 1.2 Million hours MTBF. With that kind of mean time between failures, you would think that you would never have to worry about a hard drive failing. How willing are you to take that chance? What if you double your odds by setting up two drives in a RAID 1 configuration? Now can you afford to take that chance? How willing are you to gamble with your data?

What if one of your system administrators accidentally deletes the wrong file? Maybe it's your apache config file. Maybe it's a piece of code you have been working on all day. Or, maybe your server gets compromised and you now have unknown trojans and back doors on your server. Now what do you do?

Working in a datacenter with thousands of servers, there are thousands and thousands of hard drives. When you see that many hard drives in production, you are naturally going to see some of them fail. I have seen small drives fail, large drives fail, and I have even seen RAID 1 mirrors completely fail beyond recovery. Is it bad hardware? Nope. Is it Murphy's Law? Nope. It's the laws of physics. Moving parts create heat and friction. Heat and friction cause failures. No piece of IT equipment is immune to failure.

That 1.2 million hours MTBF looks pretty impressive. For a round number, let's say there are 15,000 drives in the SL datacenter. 1,200,000 hours / 15,000 drives = 80 hours. That means that every 80 hours, one hard drive in the SL datacenter could potentially fail. Now how impressive is that number?
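For anyone who wants to plug in their own numbers, the same back-of-the-envelope arithmetic can be sketched as follows. It is a rough estimate that treats MTBF as a constant, independent failure rate across the fleet, which is a simplifying assumption; the function name is mine.

```python
HOURS_PER_YEAR = 24 * 365


def hours_between_failures(mtbf_hours, drive_count):
    """Rough expected time between drive failures across a whole fleet."""
    return mtbf_hours / drive_count


# Figures from the paragraph above: 1.2 million hours MTBF, 15,000 drives.
interval = hours_between_failures(1_200_000, 15_000)
failures_per_year = HOURS_PER_YEAR / interval

print(f"one failure roughly every {interval:.0f} hours")
print(f"about {failures_per_year:.1f} failures per year")
```

Doubling the drive count halves the interval, so the bigger the fleet, the less comfort that headline MTBF figure provides.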

Ultimately, regardless of the levels of redundancy you implement, there is always a chance of a failure - hardware or human - that results in data loss. The question is - how important is that data to you? In the event of a catastrophic failure, are you willing to just perform an OS reload and start from scratch? Or, if a file is deleted and unrecoverable, are you willing to start over on your project? And lastly, how much downtime can you afford to endure?

Regardless of how much redundancy you can build into your infrastructure with the likes of load balancers, RAID arrays, active/passive servers, hot spares, etc, you should always have a good plan for doing backups as well as checking and maintaining those backups.

The SoftLayer contingent recently returned from attending HostingCon 2007 in Chicago and I have to say, it was a great experience. We had a lot of opportunities to meet up with many of our customers, meet with a lot of vendors and potential vendors as well as visit with some of our competitors.

While there, I had the privilege of participating in a panel discussion on "Green Hosting: Hope or Hype". Isabel Wang did a great job of moderating the discussion with Doug Johnson, Dallas Kashuba, and myself. The overall premise of the panel discussion was to talk about green initiatives: how they affect the hosting industry, what steps hosting companies can take and whether it is something we should be pursuing.

It was interesting to hear the different approaches that companies take to be green. Should companies focus their efforts on becoming carbon neutral by purchasing carbon credits as DreamHost does, on promising to plant a tree for each server purchased as Dell does, on virtualization strategies as SWSoft does, or on eliminating the initial impact on the environment as we have done at SoftLayer? You can probably tell from one of my previous blog posts where SoftLayer is focusing our efforts to help make a difference.

Besides the efforts of the individual companies on the panel, there were some good questions from the audience that helped spur the conversation. Does the hosting industry need its own organization for self regulation or are entities such as The Green Grid sufficient? Do any of the hosting industry customers really care if a company is "green"? Should a hosting company care if it’s "green"? And, what exactly does "being green" mean?

While there are differing opinions on all of those questions, there really isn't a "wrong" answer. Ultimately all of the steps companies take - no matter how small - will help to some extent. And no matter what the motivation - whether a company is "being green" in an effort to gain publicity, to save money or to simply "make a difference" - it's all worth it in the end.

The article discusses the successful deployment of servers from a remote location. The author talks about being able to remotely configure and deploy some new servers from the confines of a ski lodge. Of course, they had to have someone at their offices to receive the server shipment, unbox the servers, rack them up, get them all cabled, make sure space, power and cooling would all be sufficient and then put in a CD. Things that weren't mentioned probably included throwing away all of the packaging material, doing QA on the hardware to verify it was all correct and changing any BIOS settings.

Beyond all of that, there are many things that are just inherent to the process that they didn’t refer to, including having to find the right server vendor, negotiating pricing for the servers, making sure all of the pieces and parts were going to be shipped, tracking the shipment dates, contacting the vendor multiple times to try to find out why the shipment wasn't going to be on time, having available datacenter space and infrastructure, putting those dang cage nuts in the server racks, having available switch ports, making sure the network was configured correctly, providing network security, making sure all of the software licenses were up to date, etc, etc, etc.

Or, as so many of you already know - they could have gotten their servers from a dedicated hosting provider such as SoftLayer (hint, hint) and had the servers purchased, configured, QA’d and online within just a couple of hours and with no more effort than just filling out a signup form. It’s hard to imagine there are still so many people out there doing things the hard way.