Posted by Soulskill on Wednesday September 02, 2009 @10:09AM
from the we-know-exactly-what-you'd-do-with-that-much-storage dept.

Chris Pirazzi writes "Online backup startup BackBlaze, disgusted with the outrageously overpriced offerings from EMC, NetApp and the like, has released an open-source hardware design showing you how to build a 4U, RAID-capable, rack-mounted, Linux-based server using commodity parts that contains 67 terabytes of storage at a material cost of $7,867. This works out to roughly $117,000 per petabyte, which would cost you around $2.8 million from Amazon or EMC. They have a full parts list and diagrams showing how they put everything together. Their blog states: 'Our hope is that by sharing, others can benefit and, ultimately, refine this concept and send improvements back to us.'"
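For anyone who wants to check the summary's arithmetic, here is a minimal sketch. The pod cost and capacity come from the summary; treating a petabyte as 1000 TB is an assumption, and the result covers raw hardware only (no staff, power, or hosting).

```python
# Figures taken from the summary above.
POD_COST_USD = 7_867
POD_CAPACITY_TB = 67
# Assumption: decimal units, 1 PB = 1000 TB.
TB_PER_PB = 1000

cost_per_tb = POD_COST_USD / POD_CAPACITY_TB
cost_per_pb = cost_per_tb * TB_PER_PB
# Ceiling division: whole pods needed to reach at least 1 PB of raw capacity.
pods_per_pb = -(-TB_PER_PB // POD_CAPACITY_TB)

print(f"${cost_per_tb:,.2f} per TB")
print(f"${cost_per_pb:,.0f} per PB of raw capacity")
print(f"{pods_per_pb} pods to reach 1 PB")
```

This lands on the "roughly $117,000 per petabyte" quoted in the summary.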

Before realizing that we had to solve this storage problem ourselves, we considered Amazon S3, Dell or Sun Servers, NetApp Filers, EMC SAN, etc. As we investigated these traditional off-the-shelf solutions, we became increasingly disillusioned by the expense. When you strip away the marketing terms and fancy logos from any storage solution, data ends up on a hard drive.

That's odd, where I work we pay a premium for what happens when the power goes out, what happens when a drive goes bad, what happens when maintenance needs to be performed, what happens when the infrastructure needs upgrades, etc. This article left out a lot of buzzwords but they also left out the people who manage these massive beasts. I mean, how many hundreds (or thousands) of drives are we talking here?

You might as well add a few hundred thousand a year for the people who need to maintain this hardware and also someone to get up in the middle of the night when their pager goes off because something just went wrong and you want 24/7 storage time.

We don't pay premiums because we're stupid. We pay premiums so we can relax and concentrate on what we need to concentrate on.

That's all fine and dandy but where is my support going to come from when this server has issues? Are they throwing in free maintenance and upgrades for this server when it no longer meets requirements? If not, this figure is highly disingenuous.

But when we priced various off-the-shelf solutions, the cost was 10 times as much (or more) than the raw hard drives.

Um... and what do you plan on running these disks with? HDs don't magically store and retrieve data on their own. The HDs are cheap compared to the other parts that create a storage system. That's like saying a Ferrari is a ripoff because you can buy an engine for $3,000.

The point is that the costs of services like Amazon or NetApp, etc include the costs for support, server maintenance, upgrades, etc. That they are only comparing this to just the bare minimum price for this company to construct their server is highly misleading.

Yeah, this only works if you're the geeks building the hardware to begin with. The real cost is in setup and maintenance. Plus, if the shit hits the fan, the CxO is going to want to find some big butts to kick. 67TB of data is a lot to lose (though it's only about 35 disks at max cap these days).

These guys, however, happen to be the geeks, the maintainers, and the people-whose-butts-get-kicked-anyway. This is not a project for a one- or two-man IT group that has to build a storage array for their 100-200 person firm. These guys are storage professionals with the hardware and software know-how to pull it off. Kudos to them for making it and sharing their project. It's a nice, compact system. It's a little bit of a shame that there isn't OTS software, but at this level you're going to be doing grunt work on it with experts anyway.

FWIW, Lime Technology (lime-technology.com) will sell you a case, drive trays, and software for a quasi-RAID system that will hold 28TB for under $1500 (not including the 15 2TB drives - another $3k on the open market). This is only one-fault tolerant (though failure is more graceful than a traditional RAID). I don't know if they've implemented hot spares or automatic failover yet (which would put them up to 2 fault tolerant on the drives, like RAID6).

It's great having someone tell you they will be there in three hours to replace your power supply, that you then have to dedicate a staff person to be with when they go out on the shop floor because some moron in security requires it. If they had just left a few spare parts you could do it yourself because everything just slides into place anyway.

That 2.683M also pays for salaries, pretty building(s), advertising, research, conventions, and more advertising.

I could hire a couple of dedicated staff to have 24x7 support for far less than 2.683M, plus a duplicate system worth of spare parts.

This stuff isn't rocket science. Most companies don't need high-speed, fiber-optic disk array subsystems for a significant amount of their data, only for a small subset that needs blindingly fast speed. The rest can sit on cheap arrays. For example, all of my network accessible files that I open very rarely but keep on the network because it gets backed up. All of my 5 copies of database backups and logs that I keep because it's faster to pull it off of disk than request a tape from offsite. And it's faster to backup to disk, then to tape.

BackBlaze is a good example of someone that needs a ton of storage, but not lightning-fast access. Having a reliable system is more important to them than one that has all the tricks and trappings of an EMC array that probably 10% of all EMC users actually use, but they all pay for.

These guys build their own hardware, think it might be able to be improved on or help the community, and they release the specs, for free, on the Internet. They then get jumped on by people saying "bbbb-but support!". They're not pretending to offer support, if you want support, pay the 2MM for EMC, if you can handle your own support in-house, maybe you can get away with building these out.

It's like looking at KDE and saying "But we pay Apple and Microsoft so we get support" (even though, no you don't). The company is just releasing specs, if it fits in your environment, great, if not, bummer. If you can make improvements and send them back up-stream, everyone wins. Just like software.

I seem to recall similar threads whenever anyone mentions open routers from the Cisco folks.

Backup: depends on the backup strategy. I could make this happen for less than an additional 10%. But ok, point taken.

Redundancy: You mean as in plain redundancy? These are RAID arrays are they not? You want redundancy at the server level? Now you're increasing the scope of the project which the article doesn't address. (Scope error)

Hosting: Again, the point of the article was the hardware. That's a little like accounting for the cost of a trip to your grandmother's, and factoring in the cost of your grandmother's house. A little out of scope.

Cooling: I could probably get the whole project chilled for less than 6% of the total cost, depending on how cool you want the rig to run.

You will more than likely NOT have to take a node offline. The design looks like they place the drives into slip down hot plug enclosures. Most rack mounted hardware is on rails, not screwed to the rack. You roll the rack out, log in, fail the drive that is bad, remove it, hot plug another drive and add it to the array. You are now done.

They went RAID 6, even though it is slow as shit, for the added failsafe mechanisms.

It's better at what they need it for. Based on the services and software they describe on their site, it looks like they store data in the classic redundant chunks distributed over multiple 'disposable' storage systems. In this situation most of the added redundancy that vendors put in their products doesn't add much value to their storage application. Thus having racks and racks of basic RAIDs on cheap disks and paying a few on-site monkeys to replace parts is more cost effective than going to a more stable/tested enterprise storage vendor.

If you build a petabyte stack using 1.5TB disks you need about 800 drives including RAID overhead. With an MTBF for consumer drives of 500,000 hours, a drive will fail roughly every 10-15 days, if your design is good and you create no hotspots/vibration issues.
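A quick way to sanity-check that estimate is sketched below. It assumes independent drives with a constant failure rate (the usual reading of an MTBF figure); the function names are made up for illustration. With the figures exactly as stated (800 drives, 500,000 hours) the expected gap comes out closer to 26 days, and a 10-15 day gap would correspond to an effective per-drive MTBF of roughly 190,000 to 290,000 hours.

```python
HOURS_PER_DAY = 24


def days_between_failures(drive_count: int, mtbf_hours: float) -> float:
    """Expected days between failures across the whole fleet.

    Assumption: drives fail independently at a constant rate, so the
    fleet-wide failure rate is drive_count / mtbf_hours.
    """
    return mtbf_hours / drive_count / HOURS_PER_DAY


def implied_mtbf_hours(drive_count: int, days_between: float) -> float:
    """Per-drive MTBF that would produce one fleet failure every days_between days."""
    return days_between * HOURS_PER_DAY * drive_count


stated = days_between_failures(800, 500_000)
print(f"Stated figures: one failure about every {stated:.1f} days")
for days in (10, 15):
    print(f"A {days}-day gap implies an MTBF of {implied_mtbf_hours(800, days):,.0f} hours")
```

Either way, at this scale drive replacement is a routine chore rather than an emergency.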

Rebuild times on large RAID sets are such that it is only a matter of time before they run a double drive failure and lose their customers' data. The money they saved by going cheap will be spent on lawyers when they get the liability claims in.

If you RTFA, you will see that they are using RAID6 with 2 parity drives per raid, so a double drive failure can be handled, and it is only the less likely triple drive failure that will ruin them.
It seems weak that they don't have hot-swappable drives in this configuration, but they have software that is managing the data across disk sets, and presumably they have redundant copies of data that keep the data accessible when one of their servers is taken down to replace a drive (if they don't, the downtimes due to replacing drives will make the service useless). This redundancy may also save them in the case that they actually lose a RAID set.
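To put rough numbers on the double versus triple failure argument, here is a minimal sketch. The 15-drive array size and the 24-hour rebuild window are illustrative assumptions, not figures from this thread; the 500,000-hour MTBF is the one quoted in the comment above. It also assumes independent failures at a constant rate, which is optimistic, since drives from the same batch in the same chassis tend to fail together.

```python
from math import comb, exp


def p_extra_failures(min_extra: int, surviving: int,
                     window_hours: float, mtbf_hours: float) -> float:
    """Probability that at least min_extra of the surviving drives fail
    during the rebuild window (binomial tail, independent drives)."""
    # Chance that one given drive fails inside the window (exponential model).
    p = 1.0 - exp(-window_hours / mtbf_hours)
    return sum(
        comb(surviving, k) * p**k * (1.0 - p) ** (surviving - k)
        for k in range(min_extra, surviving + 1)
    )


# Assumptions for illustration only: 15-drive array, one drive already dead,
# 24-hour rebuild, 500,000-hour MTBF.
SURVIVING = 14
REBUILD_HOURS = 24.0
MTBF_HOURS = 500_000.0

# Single parity loses data on 1 more failure; RAID6 needs 2 more.
p_single_parity_loss = p_extra_failures(1, SURVIVING, REBUILD_HOURS, MTBF_HOURS)
p_raid6_loss = p_extra_failures(2, SURVIVING, REBUILD_HOURS, MTBF_HOURS)

print(f"Single parity, loss during one rebuild: {p_single_parity_loss:.2e}")
print(f"RAID6, loss during one rebuild:         {p_raid6_loss:.2e}")
```

Under these assumptions the second parity drive buys several orders of magnitude per rebuild, which is the point being made above; longer rebuilds or correlated failures narrow that gap.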

I have worked in disk storage design. This was a very cool project. This looks like a promising start and in some ways represents the future of storage; COTS parts. Others have pointed out some areas of improvement, cooling and the like.

And I think I would use dual micro ATX motherboards, perhaps in their own cases to make them replaceable in case of failure.

I realize that the layout of the drives was done with an eye toward airflow, but I personally don't like to see drives set on their edges. It's probably a personal bias, but I like to see drives set flat. The bearings seem to last longer that way. Just my personal experience.

And, one final point, storage density is reaching the point where we can jam a lot of storage into a small space. Perhaps we have reached the point where we can start to spread things out and do things like put the drives in a separate enclosure or multiple enclosures. It makes designing, installing, and servicing easier. Use eSATA ports on the SATA cards to make external storage easier.

It's the Google model: you don't replace failed components. (This isn't meant for a case where you have 1 'server'; this is meant for when you have hundreds of these pods.) The labor is better served deploying a new pod with 45 new disks than replacing one disk in 45 pods.

As evidence of that, I submit that dozens of companies like the one in this article have existed over the years, and only a handful of them still exist. Those that still do have either exited the storage array business, or have evolved their offerings into something that costs a lot more to build and support than a pile of disks.

I like how you dismiss a detailed real world design example based simply on a claimed feature without any further substantiation. Very classy. I'm not saying you are wrong, but would it kill you to go into a little more detail about why these folks need "luck" when they are clearly very successful with their existing design?

Forgive me; I've committed the sin of working for one of those name-brand storage companies.

The real value in a data storage system isn't in the hardware, it's in the data. And the real cost incurred in a data storage system is measured in the inability of the customer to access that data quickly, efficiently and (in the case of a disaster) at all.

If you need to crunch the data quickly, a higher-performing system is going to save you money in the end. Look at all the benchmarks: no home-grown systems are anywhere on the lists. If you want to stream through your data at several gigabytes per second, you need to pay for a fast interconnect. Putting 45 drives behind a single 1GbE just doesn't cut it.
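To give a feel for the interconnect point, here is a back-of-the-envelope sketch. It assumes the link runs at its full theoretical line rate with no protocol overhead and that the disks can keep up, so real numbers would be worse; the function name is made up for illustration.

```python
SECONDS_PER_DAY = 86_400


def transfer_days(capacity_tb: float, link_gbit_per_s: float) -> float:
    """Days needed to stream capacity_tb (decimal TB) over a link at line rate."""
    total_bytes = capacity_tb * 1e12
    bytes_per_second = link_gbit_per_s * 1e9 / 8
    return total_bytes / bytes_per_second / SECONDS_PER_DAY


print(f"67 TB over 1GbE:  {transfer_days(67, 1):.1f} days")
print(f"67 TB over 10GbE: {transfer_days(67, 10):.1f} days")
```

That works out to about six days just to read one full pod over a single gigabit link, which is fine for a backup target and hopeless for anything that needs to scan the data.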

Similarly, if you want to ensure that the data is protected (integrity, immutable storage for folks who need to preserve data and be certain it hasn't been tampered with, etc) and stored efficiently (single instance store, or dedupe, so you don't fill your petabytes of disks with a bajillion copies of the same photos of Anna Kournikova) then you need to pay for the extra goodness in that software and hardware as well.

Finally, if you want extremely high availability, then the cost of the hardware is minuscule compared to the cost of downtime. We had customers that would lose millions of dollars per service interruption. They're willing to pay a million dollars to eliminate or even reduce downtime.

These folks are essentially just building a box that makes a bunch of disks behave like a honking big tape drive. It's a viable business--that's all some folks need. But EMC et al are not going to lose any sleep over this.

What failure rate are you using to "virtually guarantee" that you'll get data corruption with 45 drives? What failure rate in your RAM, CPU, and motherboard are you using to guarantee that the ZFS checksums are not themselves corrupted? Not to mention the high possibility of bugs in a younger file system, and the different performance characteristics among FSes.

I'm not saying ZFS is a bad plan, at least if you're running enough spindles, but if you're going to "virtually guarantee" silent corruption with fewer than 100 drives I'd like to see some documentation for the non-detectable failure rates you're expecting.

It's also worth noting that in a lot of data, a small amount of bit-flips might not be worth protecting against at all. Or they might be better protected at the application level instead of the block level -- for example, if the data will be transmitted to another system before it is consumed, as would be typical for a disk-host like this, a single checksum of the entire file (think md5sum) could be computed at the end-use system, rather than computing a per-block checksum at the disk host and then just assuming the file makes it across the network and through the other system's I/O stack without error.
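The whole-file check described above can be sketched in a few lines. This is a minimal illustration using Python's standard `hashlib`; the function names are made up here, and MD5 is used only because md5sum is the example given. It catches accidental corruption, not deliberate tampering.

```python
import hashlib


def file_md5(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks
    so large files never have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def arrived_intact(path: str, expected_hex: str) -> bool:
    """Compare the received file against the digest recorded at the source."""
    return file_md5(path) == expected_hex.lower()
```

The sender records the digest once when the file is first written; the consuming system recomputes it after the data has crossed the disks, the network, and its own I/O stack, so a single comparison covers every hop.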

How about reading the section "A Backblaze Storage Pod is a Building Block".

<snip> the intelligence of where to store data and how to encrypt it, deduplicate it, and index it is all at a higher level (outside the scope of this blog post). When you run a datacenter with thousands of hard drives, CPUs, motherboards, and power supplies, you are going to have hardware failures — it's irrefutable. Backblaze Storage Pods are building blocks upon which a larger system can be organized that doesn't allow for a single point of failure. Each pod in itself is just a big chunk of raw storage for an inexpensive price; it is not a "solution" in itself.

Emphasis mine. I believe there are quite a few successful and reliable storage vendors not using ZFS. We get the point, you like it. Doesn't mean you can't succeed without it. Be more open minded.

The real solution here is to design custom ASICs which can tolerate more failures than standard RAID6, and to store redundant data on a completely different controller. That way, if one board on a rack goes titsup (or is merely down for maintenance) chances are better that sufficient data is available to reconstruct the original file. The Reed-Solomon coding tech is well known and lends itself well to custom ASICs. The part that isn't well developed is the network routing/transport mechanism that lets you efficiently shuttle large quantities of data between boards in the rack. The general idea is well known in the literature (read: "Efficient Dispersal of Information for Security, Load Balancing, and Fault Tolerance" by Michael Rabin) but the hardware to do the many-to-many interconnection isn't available as an off-the-shelf architectural component. Rather than trying to optimise the system from the viewpoint of how cheaply the MAID (massive array of inexpensive disks) can be constructed, anyone interested in this area should be more interested in how to guarantee /availability/ of the data for the least cost. For that you need better strategies than "let's see how many commodity disks we can fit in a rack".

Raw storage will always be cheaper than the effort of designing fault-tolerant, high-availability systems, but it's worth the effort to at least implement "good enough" systems to attempt to achieve these qualities rather than sticking with the dumb "stack-em-high" approach. Scalability matters, or else your "super cluster" will quickly be overtaken by the next dumb implementation when the next 18-month increment rolls around.