Let's say you have gigglebytes of data to store and you aren't sure you want to use a CDN. Amazon's S3 doesn't excite you. And you aren't quite ready to join the grid nation. You want to keep it all in house. Wouldn't it be nice to have something like the Google File System you could use to create a unified file system out of all your disks sitting on all your nodes?

According to Robin Harris, a.k.a StorageMojo (a great blog BTW), you can now have your own GFS: Parascale launches Google-like storage software. Parascale calls their softwate a Virtual Storage Network (VSN). It "aggregates disks across commodity Linux x86 servers to deliver petabyte-scale file storage. With features such as automated, transparent file replication and file migration, Parascale eliminates storage hotspots and delivers massive read/write bandwidth." Why should you care?

I don't know about you, but the "storage problem" is one the most frustrating parts of building websites. There's never a good answer that is affordable. Should you build a SAN or a NAS? How do you make it redundant? How do you make it perform? How do you back it up? How do you grow it without a defense appropriations sized budget? Should you use RAID? Which level and where for what reason? Should you use SCSI, iSCSI, SAS, SATA, or alpha beta? Which vendor should you use? There are so many conflicting opinions about everything. It's all a confusing mess to me.

So I like the simplicity of buying commodity nodes with just a bunch of disks attached. But the question has always been how do you turn all those disks into a unified storage system without writing a ton of software on top? Harris says this is what Parascale has done for you:

VSN, like GFS, builds availability and scalability around low-cost servers and disks. NAS appliances rely on costly low-volume boxes that are closed and don't scale. GFS has been deployed in production clusters of over 5,000 servers, proving the scalability of the architecture. Fast, reliable, low-cost and massively scalable storage powers the growth of new applications like Web 2.0, video-on-demand, and hi-resolution image archiving. Parascale is the first of a new generation of software-only storage solutions.

They make a big deal out of it being a software only system. Harris says why this is a good thing:

I like software-based systems because hardware is a commodity. When you create custom hardware you also create low-volume, high-cost components whose economics go from bad to worse. If you *need* to do it, then go for it. But data is getting cooler and the requirement for specialized high-performance hardware is shrinking relative to the market.

Other systems use an appliance model. Appliances can add a lot of value, but they are also a way of monetizing you. A software system on commodity hardware has the potential to give good value. Will it? I didn't see pricing so it's hard to tell. Even odder is their pricing model. You are leasing the software per year, per disk spindle. Do you have any idea how much this will cost? Neither do I. I sounds like it could be horribly expensive or really reasonable. We'll have to see.

Another thing that bothers me is that you can't run a database on top of their file system. This means I need an entire separate storage system for my database. You can run a database on a NAS or SAN, so this is a definite disadvantage.

Anyway, it's just another interesting option to consider when architecting your website.

Related Articles

LiveJournal created an open source distributed file system called MogileFS that builders may find useful.

The only thing I could find on this is GMailFS which is a program to convert your (2GB) Gmail account into a storage service. This is 'not' even a non-commercial solution much less a commercial solution requiring one to store stuff over the internet to an email account. The Google File System (GFS), however is based on MapReduce (Map and Reduce): http://labs.google.com/papers/mapreduce-osdi04.pdf