Posts Tagged 'Raid'

The jet flow gates in the Hoover Dam can release up to 73,000 cubic feet — the equivalent of 546,040 gallons — of water per second at 120 miles per hour. Imagine replacing those jet flow gates with a single garden hose that pushes 25 gallons per minute (or 0.42 gallons per second). Things would get ugly pretty quickly. In the same way, a massive "big data" infrastructure can be crippled by insufficient IOPS.

IOPS — Input/Output Operations Per Second — measure computer storage in terms of the number of read and write operations it can perform in a second. IOPS are a primary concern for database environments where content is being written and queried constantly, and when we take those database environments to the extreme (big data), the importance of IOPS can't be overstated: If you aren't able to perform database reads and writes quickly in a big data environment, it doesn't matter how many gigabytes, terabytes or petabytes you have in your database ... You won't be able to efficiently access, add to or modify your data set.

As we worked with 10gen to create, test and tweak SoftLayer's MongoDB engineered servers, our primary focus centered on performance. Since the performance of massively scalable databases is dictated by the read and write operations to that database's data set, we invested significant resources into maximizing the IOPS for each engineered server ... And that involved a lot more than just swapping hard drives out of servers until we found a configuration that worked best. Yes, "Disk I/O" — the number of input/output operations a given disk can perform — plays a significant role in big data IOPS, but many other factors limit big data performance. How is performance impacted by network-attached storage? At what point will a given CPU become a bottleneck? How much RAM should be included in a base configuration to accommodate the load we expect our users to put on each tier of server? Are there operating system changes that can optimize the performance of a platform like MongoDB?

The resulting engineered servers are a testament to the blood, sweat and tears that were shed in the name of creating a reliable, high-performance big data environment. And I can prove it.

Most shared virtual instances — the scalable infrastructure many users employ for big data — use network-attached storage for their platform's storage. When data has to be queried over a network connection (rather than from a local disk), you introduce latency and more "moving parts" that have to work together. Disk I/O might be amazing on the enterprise SAN where your data lives, but because that data is not stored on-server with your processor or memory resources, performance can sporadically go from "Amazing" to "I Hate My Life" depending on network traffic. When I tested the IOPS for network-attached storage from a large competitor's virtual instances, I saw an average of around 400 IOPS per mount. It's difficult to say whether that's "not good enough" because every application will have different needs in terms of concurrent reads and writes, but it certainly could be better. We performed some internal testing of the IOPS for the hard drive configurations in our Medium and Large MongoDB engineered servers to give you an apples-to-apples comparison.

Before we get into the tests, here are the specs for the servers we're using:

The numbers shown in the table below reflect the average number of IOPS we recorded with a 100% random read/write workload on each of these engineered servers. To measure these IOPS, we used a tool called fio with an 8k block size and iodepth at 128. Remembering that the virtual instance using network-attached storage was able to get 400 IOPS per mount, let's look at how our "base" configurations perform:
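For reference, here's a minimal sketch of the kind of fio run we're describing, wrapped in a bit of Python to pull the IOPS out of fio's JSON output. Only the 8k block size, the iodepth of 128 and the 100% random read/write workload come from the tests above; the target path, runtime and 50/50 read/write split are assumptions for illustration.

```python
# Minimal sketch of a random read/write benchmark: 8k blocks, iodepth 128.
# The target path, runtime and read/write mix are placeholders.
import json
import subprocess

def run_random_rw_benchmark(target_file, runtime_sec=60):
    """Run fio against target_file and return (read_iops, write_iops)."""
    cmd = [
        "fio",
        "--name=randrw-test",
        "--filename={}".format(target_file),
        "--rw=randrw",          # 100% random, mixed reads and writes
        "--rwmixread=50",       # assumed 50/50 split
        "--bs=8k",              # 8k block size, as in the tests above
        "--iodepth=128",        # queue depth of 128, as in the tests above
        "--ioengine=libaio",
        "--direct=1",           # bypass the page cache
        "--size=1G",
        "--runtime={}".format(runtime_sec),
        "--time_based",
        "--group_reporting",
        "--output-format=json",
    ]
    result = json.loads(subprocess.check_output(cmd))
    job = result["jobs"][0]
    return job["read"]["iops"], job["write"]["iops"]

if __name__ == "__main__":
    read_iops, write_iops = run_random_rw_benchmark("/mnt/data/fio-testfile")
    print("read IOPS: {:.0f}, write IOPS: {:.0f}".format(read_iops, write_iops))
```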

Clearly, the 400 IOPS per mount results you'd see in SAN-based storage can't hold a candle to the performance of a physical disk, regardless of whether it's SAS or SSD. As you'd expect, the "Journal" reads and writes have roughly the same IOPS between all of the configurations because all four configurations use 2 x 64GB SSD drives in RAID1. In both configurations, SSD drives provide better Data mount read/write performance than the 15K SAS drives, and the results suggest that having more physical drives in a Data mount will provide higher average IOPS. To put that observation to the test, I maxed out the number of hard drives in both configurations (10 in the 2U MD server and 34 in the 4U LG server) and recorded the results:

It should come as no surprise that by adding more drives into the configuration, we get better IOPS, but you might be wondering why the results aren't "betterer" when it comes to the IOPS in the SSD drive configurations. While the IOPS numbers improve going from four to ten drives in the medium engineered server and six to thirty-four drives in the large engineered server, they don't increase as significantly as the IOPS differences in the SAS drives. This is what I meant when I explained that several factors contribute to and potentially limit IOPS performance. In this case, the limiting factor throttling the (ridiculously high) IOPS is the RAID card we are using in the servers. We've been working with our RAID card vendor to test a new card that will open a little more headroom for SSD IOPS, but that replacement card doesn't provide the consistency and reliability we need for these servers (which is just as important as speed).

There are probably a dozen other observations I could point out about how each result compares with the others (and why), but I'll stop here and open the floor for you. Do you notice anything interesting in the results? Does anything surprise you? What kind of IOPS performance have you seen from your server/cloud instance when running a tool like fio?

Not everyone enjoys or has the benefit of taking what they learn at work and applying it at home, but I consider myself lucky because the things I learn at work are often very useful for my hobbies. As an electronics and PC gaming fanatic, I always enjoy tips that increase the performance of my equipment. Common among PC gaming enthusiasts is the obsession with making their gaming rig excel in every aspect by upgrading the video card, RAM, processor, etc. Before working at SoftLayer, I had only considered buying better hardware to improve performance and never really looked into the advantages of different types of setups for a computer.

This new area of exploration for me started shortly after my first days at SoftLayer when I was introduced to RAID (Redundant Array of Inexpensive Disks) for our servers. In the past, I had heard mention of the term but never had any idea of what that entailed and was only familiar with our good ole bug killer brand Raid. You can imagine my excitement as I learned more about its intricacies and how the different types of RAID could benefit my computer’s performance.

Armed with this new knowledge, I was determined to reconfigure my gaming PC at home to reap the benefits. Upon looking at the different RAID setups, I decided to go with RAID 0 because I did not want to sacrifice storage space, and my data was not critical enough to need the mirroring that RAID 1 provides.

One thing led to another as I became occupied for a good amount of time with benchmarking drive performance in my old setup versus my new setup. In the end, I was happy to report a significant performance gain in what I now refer to as my “killer setup”. Applications would launch noticeably faster and even in games where videos were stored locally on hard drives, the cinematic scenes would come up faster than before.

To add to the hype, a coworker was also building a new computer in anticipation of a new game called Final Fantasy XIV. It felt like a competition to outdo each other with better benchmark scores. I'm already planning ahead for future upgrades since this time around I only used SATA drives. For my next upgrade I would love to run a RAID 0 with two SSD drives to see what kind of boost I would get.

So for business or pleasure, have you ever considered the benefits of setting up a RAID system?

There is some confusion out there about the best way to back up your data. In this article we will go over several good ways to back up and store your backups, along with a few approaches that are not recommended.

When it comes to backups, storing them off your server (or at the very least on a secondary drive that isn't running your system) is the best solution, with off-site storage being the recommended course.

When RAID comes into consideration, remember that even though the drives are redundant (as in a mirror setup), there are still several situations that can cause a complete RAID failure: the RAID controller failing, the array developing a bad stripe, drive failure on more than one drive (this does happen, though rarely), or out-of-date firmware on the drives or the RAID card causing errors. Using a network storage device like our EVault service or NAS storage is also an excellent way to store backups off-system. The last thing to consider is keeping your backups up to date. I suggest making a new backup every week at minimum (if you have very active sites or databases, I would recommend an every-other-day or daily backup). It is up to you or your server administrator to keep up with your backups and make sure they stay current. If you have a hardware failure and your backups are well out of date, it's almost like not having them at all.
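If you want to script the "keep your backups current and off the server" part, here is a minimal sketch, assuming a Linux box with rsync available; the source path, staging directory and remote destination are placeholders you would swap for your own data and your EVault/NAS/secondary-server target.

```python
# Minimal sketch of a dated, off-server backup: archive a directory locally,
# then push the archive to a remote host with rsync. All paths and the
# remote destination are placeholders.
import datetime
import subprocess
import tarfile

SOURCE_DIR = "/var/www"                                 # what to back up (placeholder)
LOCAL_ARCHIVE_DIR = "/backups"                          # staging area on a secondary drive
REMOTE_DEST = "backup-user@nas.example.com:/backups/"   # off-system target (placeholder)

def make_backup():
    stamp = datetime.date.today().isoformat()
    archive_path = "{}/site-backup-{}.tar.gz".format(LOCAL_ARCHIVE_DIR, stamp)

    # Create the compressed archive.
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(SOURCE_DIR)

    # Copy it off the server so a RAID or controller failure can't take
    # the backup down along with the data.
    subprocess.check_call(["rsync", "-av", archive_path, REMOTE_DEST])

if __name__ == "__main__":
    make_backup()   # run weekly (or daily for busy sites) from cron
```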

In closing, consider the service you provide and whether your data is safe, secure, and recoverable. These things are key to running a successful server and website.

When considering these two RAID options (RAID 0 and RAID 1), there are a few points you'll want to weigh before making your final choice.

The first thing to consider is your data, so ask yourself these questions:

Is it critical that your data be recoverable?

Do you have backups of your data that can be restored if something happens?

Do you want some kind of redundancy and the ability to have a failed drive replaced without your data being destroyed?

If you have answered yes to most of these, you are going to want to look at a RAID 1 configuration. With RAID 1 you have two drives of like size matched together in an array, which consists of an active drive and a mirror drive. Either of these drives can be replaced should one go bad, without any loss of data and without taking the server offline. Of course, this assumes that the RAID card you are using is up to date on its firmware and supports hot swapping.

If you answered no to most of these questions other than the backup question (you should always have backups), a RAID 0 setup is probably sufficient. This is used mostly for disk access speed and does not contain any form of redundancy or failover. If you have a drive failure while using RAID 0, your data will be lost 99% of the time. This is an unsafe RAID method and should only be used when the data contained on the array is not critical in any way. Unfortunately, with this solution there is no course of action after a failure other than replacing the drives and rebuilding a fresh array.
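To make the difference concrete, here's what building each array looks like with Linux software RAID (mdadm) rather than a hardware RAID card; this is an illustration only, the device names are placeholders, and mdadm --create will destroy whatever is on those drives.

```python
# Sketch of creating a two-drive RAID 1 (mirror) versus a two-drive RAID 0
# (stripe) with Linux software RAID. Device names are placeholders and these
# commands will wipe the drives, so treat this as an illustration only.
import subprocess

def create_raid1(devices=("/dev/sdb", "/dev/sdc")):
    """Mirror: either drive can fail and be replaced without data loss."""
    subprocess.check_call([
        "mdadm", "--create", "/dev/md0",
        "--level=1", "--raid-devices=2", *devices,
    ])

def create_raid0(devices=("/dev/sdb", "/dev/sdc")):
    """Stripe: the speed and capacity of both drives, but no redundancy.
    Lose one drive and the whole array is gone."""
    subprocess.check_call([
        "mdadm", "--create", "/dev/md0",
        "--level=0", "--raid-devices=2", *devices,
    ])
```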

I hope this helps to clear up some of the confusion regarding these two RAID options. There are several other levels of RAID, which I would suggest fully researching before you consider using one of them.

I won’t pretend to know the ins and outs of the cloud software we use (okay, maybe a little :) ), but I know the gist of it as far as hardware is concerned: redundancy. Entire servers were the last piece of the puzzle needed to complete full hardware redundancy. In my original article, Hardwhere? (http://theinnerlayer.softlayer.com/2008/hardwhere/), I talked about using load balancers to spread the load to multiple servers (a service we already had at the time) and alluded to cloud computing.

Now cloud services are a reality.

This is a dream come true for me as the hardware manager. Hardware will always have failures and living in the cloud eliminates customer impact. Words cannot describe what it means to the customer. Never again will a downed server impact service.

Simply put, when you use a SoftLayer CloudLayer Computing Instance, your software is running on one or more servers. If one of these should fail, the load of your software is shifted to another server in the “cloud” seamlessly. We call this HA or High Availability.

If there is a sad part to all of this, it would be that I have spent considerable effort optimizing the hardware department to minimize customer downtime in the event of hardware failures. But I have a rather odd way of looking at my job. I believe the end game of any job I do is complete automation and/or elimination of the task altogether. (Can you say the opposite of job security?) I have a running joke where I say: “Until I have automated and/or proceduralized everything down to perfection with one big red button, there is still work to be done!”

Cloud computing eliminates the customer impact of hardware failures. Bam! Even though this has nothing to do with my hardware department planning, policies and procedures, I have no ego in the matter. If it solves the problem, I don’t care who did the work and was the genius behind it all, as long as it moves us forward with the best products and optimal customer satisfaction!

We have taken the worry out of hosting: no more deciding which RAID is best. No more worrying about how to keep your data available in the event of a hardware failure. CloudLayer does it for you and has all the same service options as a dedicated server and more! One more step toward a big red button for the customer!

Now back to working on the DC patrol sharks (they keep eating the techs!). New project: tech redundancy!

I love working at SoftLayer. I get to play with the newest hardware before anyone else. Intel, Adaptec, Supermicro… The list goes on. If they are going to release something new, we get to play with it first. I also like progression. Speed, size, performance, reliability; I like new products and technologies that make big jumps in these areas. I am always looking to push components and complete systems to the limits.

But alas, Thomas Norris stole my thunder! Check out his article “SSD: A Peek into the Future” for the complete skinny on the SSDs we use. I seem to be a bit too concise for a nice long blog anyway. But not to worry, I’ve got some nifty numbers that will blow the jam out of your toes!

Solid State Drives (SSDs) represent a large jump in drive performance, not to mention smaller physical size, lower power consumption, and lower heat output. The majority of drive activity is random read/write, and SSDs have drastically improved in this area compared to mechanical drives. The result is a dramatic overall performance increase.

This is a comparison of the Intel 32GB X25-E Extreme drive vs. other drives we carry. Note the massive jump in the random read/write speed of the SSD drive.

No more waiting on physical R/W heads to move around. How archaic!

Please note that no performance utility should be used to definitively judge a component or system; in the end, real-world usage is the final judge. But performance tests can give you a good idea of how a component or system compares to others.

Single drive performance increases directly translate into big improvements for RAID configurations as well. I have compared two of our fastest SATA and SAS four drive RAID 10 setups to a four drive SSD RAID 10 using an Adaptec 5405 Controller.

The Adaptec 5405 RAID controller certainly plays a part in the performance increase, on top of the simple speed doubling that comes from two drives being read simultaneously. (See my future blog on the basics of RAID levels, or check Wikipedia.)

Propeller heads read on:

The numbers indicate a multiplied increase if you take the base drive speed (Cheetah – 11.7mbps / X25-E – 64.8mbps) and double it (the theoretical increase a RAID 10 would give): 23.4mbps and 129.6mbps respectively. Actual performance tests show 27.3mbps and 208.1mbps. That means the Cheetahs are getting a 15% performance boost on random read/write and the X25-E a whopping 37% thanks to the RAID card. Hooray for math!
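If you want to see one plausible way to reproduce those percentages, here's the arithmetic spelled out in a few lines; the only inputs are the single-drive and RAID 10 numbers quoted above, and the "boost" is read as the share of the measured result that the simple doubling doesn't explain.

```python
# Reproduce the RAID-card "boost" figures from the single-drive numbers above.
drives = {
    "Cheetah (SAS)": {"single": 11.7, "raid10_measured": 27.3},
    "X25-E (SSD)":   {"single": 64.8, "raid10_measured": 208.1},
}

for name, d in drives.items():
    theoretical = d["single"] * 2                  # naive doubling from two drives read at once
    measured = d["raid10_measured"]
    boost = (measured - theoretical) / measured    # extra speed credited to the RAID card
    print("{}: theoretical {:.1f}, measured {:.1f}, boost {:.0%}".format(
        name, theoretical, measured, boost))

# Prints roughly 14% for the Cheetahs and 38% for the X25-E, in line with the
# "about 15%" and "whopping 37%" quoted above.
```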

Once again, this is all performance tests and a bit of math speculation. The only real measure of performance, IMO, is how it performs the job you need it to do.

It’s a fact -- all software ends up relying on a piece of hardware at some point. And hardware can fail. But the secret is to create redundancy to minimize the impact if hardware does fail.
RAID arrays, load balancers, redundant power supplies, cloud computing - the list goes on. And we support them all. Many of these options are not mandatory, but I wish they were! That’s where the customer comes in – it is critical to understand the value of the application and data sitting on the hardware and set a redundancy and recovery plan that fits.

Keep your DATA safe:

RAID - For starters *everyone* should have a RAID 1, 5, or 10. This keeps your server online in the event of a drive failure.

The best approach – RAID 10 all the way. You get the benefits of a RAID 0 (striping across 2 drives so you get the data almost twice as fast) and the security of RAID 1 (mirroring data on 2 separate drives) all rolled into one. I think every server should have this as a default.

Separate Backups – EVault Backup, iSCSI Storage, FTP/NAS Storage, your own NAS server or just a different server. Lose data just once (or have the ability to recover it painlessly) and these will pay for themselves. Remember, hardware is not the only way you can lose data – hackers, software failures, and human error will always be a risk.

StorageLayer. Use it or lose it.

Going further:

Redundant servers in different locations – spread your servers out across different datacenters and use a load balancer. Nothing is safer than a duplicate server thousands of miles away. That’s why we have invested in a second data center – to keep your data and business safe.

Solid state drives are just that – drives with no moving parts. No more platters or read/write heads. I mean, come on, hard drives essentially use the same basics that old record players use. CDs use this technology too. And you see where those went (can you say iPod? I prefer my iPod touch. I had never owned an iPod until now, so I skipped right to the fancy new model. Can you tell I just got it?).

Faster, faster, faster! Processors, memory, drives, network – everything is getting much faster. And partly through redundancy (dual- and quad-core processors, dual- and quad-processor motherboards). See? Redundancy is the way of the future!

We have 4 Intel Xeon quad-core Tigerton processors on one motherboard. That’s 16 cores on one server! Shazam!

Robot DC patrol sharks – yep. Got the plans on my desk right now. But I can’t take all the credit, Josh R. suggested this one, I just make things happen.

I work to keep all of our hardware running in tip-top condition. But I look at the bigger picture when it comes to hardware – how to completely eliminate the impact of any hardware issue. That’s why I suggest all of the redundancies listed above. While I can reduce the probability of hardware issues with testing, monitoring firmware updates, proper handling procedures, choosing quality components and so on, redundancy is the ultimate way to make hardware failures invisible.

In Steve's last post he talked about the logic of outsourcing. The rationale included the cost of redundant internet connections, the cost of the server, UPS, small AC, etc. He covers a lot of good reasons to get the server out of the broom closet and into a real datacenter. However, I would like to add one more often-overlooked component to that argument: the Spares Kit.

Let's say that you do purchase your own server and you set it up in the broom closet (or a real datacenter for that matter) and you get the necessary power, cooling and internet connectivity for it. What about spare parts?

If you lose a hard drive on that server, do you have a spare one available for replacement? Maybe so - that's a common part with mechanical features that is liable to fail - so you might have that covered. Not only do you have a spare drive, the server is configured with some level of RAID so you're probably well covered there.

What if that RAID card fails? It happens - and it happens with all different brands of cards.

What about RAM? Do you keep a spare RAM DIMM handy or if you see failures on one stick, do you just plan to remove it and run with less RAM until you can get more on site? The application might run slower because it's memory starved or because now your memory is not interleaved - but that might be a risk you are willing to take.

How about a power supply? Do you keep an extra one of those handy? Maybe you keep a spare. Or, you have dual power supplies. Are those power supplies plugged into separate power strips on separate circuits backed up by separate UPSs?

What if the NIC on the motherboard gets flaky or goes out completely? Do you keep a spare motherboard handy?

If you rely on out of band management of your server via an IPMI, Lights Out or DRAC card - what happens if that card goes bad while you're on vacation?

Even if you have all necessary spare parts for your server or you have multiple servers in a load balanced configuration inside the broom closet; what happens if you lose your switch or your load balancer or your router or your... What happens if that little AC you purchased shuts down on Friday night and the broom closet heats up all weekend until the server overheats? Do you have temperature sensors in the closet that are configured to send you an alert - so that now you have to drive back to the office to empty the water pail of the spot cooler?

You might think that some of these scenarios are a bit far-fetched, but I can certainly assure you that they're not. At SoftLayer, we have spares of everything. We maintain hundreds of servers in inventory at all times, we maintain a completely stocked inventory room full of critical components, and we staff it all 24/7 and back it all up with a four-hour SLA.

Some people do have all of their bases covered. Some people are willing to take a chance, and even if you convince your employer that it's ok to take those chances, how do you think the boss will respond when something actually happens and critical services are offline?