Posts Tagged 'Redundancy'

For starters, let me say that if you were affected by Hurricane Ike, my thoughts are with you as you try to clean up and get life back to as close to normal as possible. I can’t even fathom what it would be like to be away from home and work and not know what lies ahead, or not even be allowed back to see the damage.

A few friends of mine live in the area surrounding Houston and I have heard from them that they still don’t have power and gas is hard to come by. It sure makes me understand what I take for granted every day. Even with short power outages at my house due to spring storms I find myself opening the microwave and then realizing that my burrito isn’t going to be very warm when I think it is done.

This particular blog post came to me when I emailed one of my friends in far north Houston at his work email address. He works for a company that is also located in north Houston. After no response, I decided to call him to get an update but ended up leaving a message. Before he called me back that evening, my email to his work address bounced. When I did talk to him, I asked him a simple question: “Are your company’s servers sitting in a closet onsite at your office?” He responded with what I already knew: “Yep!” Now, he is a national account manager, and he has been meeting at his boss’s house this week to go over what they will need to do once things are back to normal. I asked him if it would help if they could still receive emails from their customers and post updates on their website. He of course said, “Yep!”

So here is my plea to all you savvy IT gurus out in the world. Outsource it! Then you don’t bear the brunt of the storm when everything is down. Your servers, whether you are still receiving email, and whether your website is still up and running will probably, and should probably, be the last things on your mind when a disaster like Ike strikes. Leave that part to us.

One of the hot topics over the past couple of weeks in our growing industry has been how to minimize downtime should your (or your host’s) data center experience catastrophic failure leading to outages that could span multiple days.

Some will think that it is the host’s responsibility to essentially maintain a spare data center into which they can migrate customers in case of catastrophe. The reason we don’t do this is simple economics. To maintain this type of redundancy, we’d need to charge you at least double our current rates. Because costs begin jumping exponentially instead of linearly as extensive redundancy is added, we’d likely need to charge you more than double our current rates. You know what? Nobody would buy at that point. It would be above the “reservation price” of the market. Go check your old Econ 101 notes for more details.

Given this economic reality, we at SoftLayer provide the infrastructure and tools for you to recover quickly from a catastrophe with minimal cost and downtime. But, every customer must determine which tools to use and build a plan that suits the needs of the business.

One way to do this is to maintain a hot-synched copy of your server at a second of our three geographically diverse locations. Should catastrophe happen to the location of your server, you will stay up and have no downtime. Many of you do this already, even keeping servers at multiple hosts. According to our customer surveys, 61% of our customers use multiple providers for exactly that reason – to minimize business risk.

Now I know what you’re thinking – “why should I maintain double redundancy and double my costs if you won’t do it?” Believe me, I understand this. I realize that your profit margins may not be able to handle a doubling of your costs. That is why SoftLayer offers the infrastructure and tools for an affordable alternative to running double infrastructure in multiple locations in case of catastrophe.

SoftLayer’s eVault offering can be a great, cost-effective alternative to placing servers in multiple locations. Justin Scott has already blogged about the rich backup features of eVault and how his backup data is in Seattle while his server is in Dallas, so I won’t restate what he has already said. I will add that eVault is available in each of our data centers, so no matter where your server is at SoftLayer, you can work with your sales rep to have your eVault backups in a different location. Thus, for prices that are WAY lower than an extra server (eVault starts at $20/month), you can keep near real-time backups of your server data off site. And because the data transfer between locations happens on SoftLayer’s private network, your data is secure and the transfer doesn’t count toward your bandwidth allotment.

So let’s say your server is in our new Washington DC data center and your eVault backups are kept in one of our Dallas data centers. A terrorist group decides to bomb data centers in the Washington DC area in an attempt to cripple US government infrastructure, and our facility is affected and won’t be back up for several days. At this point, you can order a server in Dallas, and once it is provisioned in an hour or so, you restore the eVault backup of your choice, point your DNS records at the new server, wait for the old records to expire according to their TTL, and you’re rolling again.
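To get a feel for how long that "brief downtime" might be, here is a minimal sketch that simply adds up the sequential steps of the recovery just described. Only the one-hour provisioning figure comes from the post; the restore time and the DNS TTL are assumptions you would replace with your own numbers.

```python
from datetime import timedelta

def estimated_recovery_time(provision, restore, dns_ttl):
    """Add up the sequential recovery steps: order a server, restore, wait out DNS."""
    return provision + restore + dns_ttl

rto = estimated_recovery_time(
    provision=timedelta(hours=1),    # "an hour or so", per the post
    restore=timedelta(hours=2),      # assumption: depends on how much data you restore
    dns_ttl=timedelta(minutes=30),   # assumption: worst case is one full TTL on your records
)
print(rto)  # 3:30:00
```

If the TTL dominates the total, lowering it on your records ahead of time is one cheap way to shorten the recovery.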

Granted, you do experience some downtime with this recovery strategy. But the tradeoff is that you are up and running smoothly after the brief downtime at a cost for this contingency that begins at only $20 per month. And when you factor in your SLA credit on the destroyed server, this offsets the cost of ordering a new server, so the cost of your eVault is the only cost of this recovery plan.

This is much less than doubling your costs with offsite servers to almost guarantee no downtime. The reason that I throw in the word “almost” is that if an asteroid storm takes out all of our locations and your other providers’ locations, you will experience downtime. Significant downtime.

It’s a fact -- all software ends up relying on a piece of hardware at some point. And hardware can fail. But the secret is to create redundancy to minimize the impact if hardware does fail.

RAID arrays, load balancers, redundant power supplies, cloud computing - the list goes on. And we support them all. Many of these options are not mandatory, but I wish they were! That’s where the customer comes in – it is critical to understand the value of the application and data sitting on the hardware and set a redundancy and recovery plan that fits.

Keep your DATA safe:

RAID - For starters, *everyone* should have RAID 1, 5, or 10. This keeps your server online in the event of a single drive failure.

The best approach – RAID 10 all the way. You get the benefits of RAID 0 (striping data across drives, so reads and writes can run almost twice as fast) and the security of RAID 1 (mirroring data on 2 separate drives) all rolled into one. I think every server should have this as a default.
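To make the tradeoff between those RAID levels concrete, here is a rough sketch that computes usable capacity and the number of drive failures each level is guaranteed to survive. The function name and the 150 GB drive size are just for illustration, and RAID 10 is assumed to be built from two-way mirrors.

```python
def raid_summary(level, drives, drive_gb):
    """Return usable capacity and the number of drive failures always survived."""
    if level == 1:
        if drives < 2:
            raise ValueError("RAID 1 needs at least 2 drives")
        # Every drive holds the same data.
        usable, survives = drive_gb, drives - 1
    elif level == 5:
        if drives < 3:
            raise ValueError("RAID 5 needs at least 3 drives")
        # One drive's worth of space goes to parity.
        usable, survives = (drives - 1) * drive_gb, 1
    elif level == 10:
        if drives < 4 or drives % 2:
            raise ValueError("RAID 10 needs an even number of drives, at least 4")
        # Striped two-way mirrors: half the raw space is usable. A second
        # failure is only survivable if it hits a different mirror pair.
        usable, survives = (drives // 2) * drive_gb, 1
    else:
        raise ValueError("only RAID 1, 5, and 10 are covered here")
    return {"level": level, "usable_gb": usable, "failures_always_survived": survives}

for level, drives in [(1, 2), (5, 4), (10, 4)]:
    print(raid_summary(level, drives, drive_gb=150))
```

With four 150 GB drives, RAID 5 leaves 450 GB usable and RAID 10 leaves 300 GB. That lost capacity is the price of the extra speed and the simpler rebuild.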

Separate Backups – eVault Backup, iSCSI Storage, FTP/NAS Storage, your own NAS server or just a different server. Lose data just once (or have the ability to recover it painlessly) and these will pay for themselves. Remember, hardware is not the only way in which you can lose data – hackers, software failures, and human error will always be a risk.

StorageLayer. Use it or lose it.

Going further:

Redundant servers in different locations – spread your servers out across different datacenters and use a load balancer. Nothing is safer than a duplicate server thousands of miles away. That’s why we have invested in a second data center – to keep your data and business safe.

Solid state drives are just that – a drive with no moving parts. No more platters or read/write heads. I mean come on, hard drives are essentially using the same basics that old record players use. CDs use this technology too. And you see where those went (can you say iPod? I prefer my iPod touch. I have never had an iPod until now so I skipped right to the new fancy pants model. Can you tell I just got it?).

Faster, faster, faster! – Processors, memory, drives, network – everything is getting much faster. And part of that speed comes from redundancy (dual- and quad-core processors, dual- and quad-processor motherboards). See? Redundancy is the way of the future!

We have 4 Intel Xeon quad-core Tigertown processors on one motherboard. That’s 16 cores on one server! Shazam!

Robot DC patrol sharks – yep. Got the plans on my desk right now. But I can’t take all the credit. Josh R. suggested this one; I just make things happen.

I work to keep all of our hardware running in tip top condition. But I look at the bigger picture when it comes to hardware – how to completely eliminate the impact of any hardware issue. That’s why I suggest all the redundancies listed above. While I can reduce the probability of hardware issues with testing, monitoring of firmware updates, proper handling procedures, choosing quality components, etc., redundancy is the ultimate way to make hardware issues invisible.

In Steve's last post he talked about the logic of outsourcing. The rationale included the cost of redundant internet connections, the cost of the server, UPS, small AC, etc. He covers a lot of good reasons to get the server out of the broom closet and into a real datacenter. However, I would like to add one more often overlooked component to that argument: the Spares Kit.

Let's say that you do purchase your own server and you set it up in the broom closet (or a real datacenter for that matter) and you get the necessary power, cooling and internet connectivity for it. What about spare parts?

If you lose a hard drive on that server, do you have a spare one available for replacement? Maybe so - that's a common part with mechanical features that is liable to fail - so you might have that covered. Not only do you have a spare drive, the server is configured with some level of RAID so you're probably well covered there.

What if that RAID card fails? It happens - and it happens with all different brands of cards.

What about RAM? Do you keep a spare RAM DIMM handy or if you see failures on one stick, do you just plan to remove it and run with less RAM until you can get more on site? The application might run slower because it's memory starved or because now your memory is not interleaved - but that might be a risk you are willing to take.

How about a power supply? Do you keep an extra one of those handy? Maybe you keep a spare. Or, you have dual power supplies. Are those power supplies plugged into separate power strips on separate circuits backed up by separate UPSs?

What if the NIC on the motherboard gets flaky or goes out completely? Do you keep a spare motherboard handy?

If you rely on out of band management of your server via an IPMI, Lights Out or DRAC card - what happens if that card goes bad while you're on vacation?

Even if you have all necessary spare parts for your server, or you have multiple servers in a load balanced configuration inside the broom closet, what happens if you lose your switch or your load balancer or your router or your... What happens if that little AC you purchased shuts down on Friday night and the broom closet heats up all weekend until the server overheats? Do you have temperature sensors in the closet that are configured to send you an alert - so that now you have to drive back to the office to empty the water pail of the spot cooler?

You might think that some of these scenarios are a bit far-fetched, but I can certainly assure you that they're not. At SoftLayer, we have spares of everything. We maintain hundreds of servers in inventory at all times, we maintain a completely stocked inventory room full of critical components, and we staff it all 24/7 and back it all up with a 4-hour SLA.

Some people do have all of their bases covered. Some people are willing to take a chance, and even if you convince your employer that it's ok to take those chances, how do you think the boss will respond when something actually happens and critical services are offline?

"ah - I don't need backups."
"Too busy to do backups - I'll get to that later."
"Backups? It costs too much."
"I don't need backups - MTBF of a Raptor is 1.2 Million hours."
"Oops - I forgot about doing backups."

Backups are one of the most commonly forgotten tasks of a system administrator. In some cases, they are never implemented. In other cases, they are implemented but not maintained. In other cases, they are implemented with a great backup and recovery plan - but the system usage or requirements change and the backups are not altered to compensate.

A hard drive really is a fairly reliable piece of IT equipment. The WD 150GB Raptor has a rating of 1.2 Million hours MTBF. With that kind of mean time between failures, you would think that you would never have to worry about a hard drive failing. How willing are you to take that chance? What if you double your odds by setting up two drives in a RAID 1 configuration? Now can you afford to take that chance? How willing are you to gamble with your data?
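To put rough numbers on that gamble, here is a back-of-the-envelope sketch. It assumes a constant failure rate (an exponential model) and that the two drives in a mirror fail independently. Both are simplifying assumptions, and real drives in the same chassis tend to do worse than this.

```python
import math

MTBF_HOURS = 1_200_000   # rating quoted above for the 150GB Raptor
HOURS_PER_YEAR = 8760

# Chance that a single drive fails within one year under the exponential model.
p_one = 1 - math.exp(-HOURS_PER_YEAR / MTBF_HOURS)

# For two drives in RAID 1 (independence assumed, no replacement during the year):
p_at_least_one = 1 - (1 - p_one) ** 2   # you will be swapping a drive
p_both = p_one ** 2                     # the mirror is lost

print(f"single drive fails in a year: {p_one:.2%}")          # 0.73%
print(f"at least one of two fails:    {p_at_least_one:.2%}")  # 1.45%
print(f"both fail in the same year:   {p_both:.4%}")          # 0.0053%
```

The mirror makes losing both copies unlikely, but you are now about twice as likely to deal with a drive failure at all. And none of these numbers cover a deleted file or a compromised server.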

What if one of your system administrators accidentally deletes the wrong file? Maybe it's your apache config file. Maybe it's a piece of code you have been working on all day. Or, maybe your server gets compromised and you now have unknown trojans and back doors on your server. Now what do you do?

Working in a datacenter with thousands of servers, there are thousands and thousands of hard drives. When you see that many hard drives in production, you are naturally going to see some of them fail. I have seen small drives fail, large drives fail, and I have even seen RAID 1 mirrors completely fail beyond recovery. Is it bad hardware? Nope. Is it Murphy's Law? Nope. It's the laws of physics. Moving parts create heat and friction. Heat and friction cause failures. No piece of IT equipment is immune to failure.

That 1.2 million hours MTBF looks pretty impressive. For a round number, let's say there are 15,000 drives in the SL datacenter. 1,200,000 hours / 15,000 drives = 80 hours. That means that, on average, you can expect one hard drive in the SL datacenter to fail every 80 hours. Now how impressive is that number?
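The same arithmetic as a quick sketch, so you can plug in your own fleet size. The 15,000-drive figure is the round number used above, and the per-year line is just a further division, not a measured statistic.

```python
MTBF_HOURS = 1_200_000   # rated MTBF per drive
DRIVES = 15_000          # round number for the fleet
HOURS_PER_YEAR = 8760

# With many drives running at once, the expected time between failures
# somewhere in the fleet is the per-drive MTBF divided by the drive count.
hours_between_failures = MTBF_HOURS / DRIVES
failures_per_year = HOURS_PER_YEAR / hours_between_failures

print(hours_between_failures)  # 80.0
print(failures_per_year)       # 109.5
```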

Ultimately, regardless of the levels of redundancy you implement, there is always a chance of a failure - hardware or human - that results in data loss. The question is - how important is that data to you? In the event of a catastrophic failure, are you willing to just perform an OS reload and start from scratch? Or, if a file is deleted and unrecoverable, are you willing to start over on your project? And lastly, how much downtime can you afford to endure?

Regardless of how much redundancy you can build into your infrastructure with the likes of load balancers, RAID arrays, active/passive servers, hot spares, etc., you should always have a good plan for doing backups as well as checking and maintaining those backups.
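As one small piece of "checking and maintaining," here is a minimal sketch of a freshness check you could run from cron. The directory layout, the `*.tar.gz` pattern, and the 26-hour threshold are all assumptions for illustration. A fresh file is no proof that the backup restores, so a periodic test restore is still worth doing.

```python
import time
from pathlib import Path

def newest_backup_age_hours(backup_dir, pattern="*.tar.gz", now=None):
    """Return the age in hours of the newest matching file, or None if there is none."""
    files = [p for p in Path(backup_dir).glob(pattern) if p.is_file()]
    if not files:
        return None
    newest_mtime = max(p.stat().st_mtime for p in files)
    now = time.time() if now is None else now
    return (now - newest_mtime) / 3600

def backup_is_stale(backup_dir, max_age_hours=26, pattern="*.tar.gz"):
    """True if no backup exists or the newest one is older than max_age_hours."""
    age = newest_backup_age_hours(backup_dir, pattern)
    return age is None or age > max_age_hours

if __name__ == "__main__":
    # "/backups" is a hypothetical path; point this at wherever your backups land.
    if backup_is_stale("/backups"):
        print("WARNING: no recent backup found")
```

The 26-hour default gives a nightly job a little slack before the warning fires.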