With Hadoop and CouchDB all over in Blogs and related news what's a distributed-fault-tolerant storage (engine) that actually works.

CouchDB doesn't actually have any distribution features built-in, to my knowledge the glue to automagically distribute entries or even whole databases is simply missing.

Hadoop seems to be very widely used - at least it gets good press, but still has a single point of failure: The NameNode. Plus, it's only mountable via FUSE, I understand the HDFS isn't actually the main goal of Hadoop

GlusterFS does have a shared nothing concept but lately I read several posts that lead me to the opinion it's not quite as stable

Lustre also has a single point of failure as it uses a dedicated metadata server

Ceph seems to be the player of choice but the homepage states it is still in it's alpha stages.

So the question is which distributed filesystem has the following feature set (no particular order):

POSIX-compatible

easy addition/removal of nodes

shared-nothing concept

runs on cheap hardware (AMD Geode or VIA Eden class processors)

authentication/authorization built-in

a network file system (I'd like to be able to mount it simultaneously on different hosts)

Nice to have:

locally accessible files: I can take a node down mount the partition with a standard local file system (ext3/xfs/whatever...) and still access the files

I'm not looking for hosted applications, rather something that will allow me to take say 10GB of each of our hardware boxes and have that storage available in our network, easily mountable on a multitude of hosts.

10 Answers
10

I think you'll have to abandon the POSIX requirement, very few systems implement that - in fact even NFS doesn't really (think locks etc) and that has no redundancy.

Any system which uses synchronous replication is going to be glacially slow; any system which has asynchronous replication (or "eventual consistency") is going to violate POSIX rules and not behave like a "conventional" filesystem.

I can't speak to the rest, but you seem to be confused between a 'distributed storage engine' and a 'distributed file system'. They are not the same thing, they shouldn't be mistaken for the same thing, and they will never be the same thing. A filesystem is a way to keep track of where things are located on a hard drive. A storage engine like hadoop is a way to keep track of a chunk of data identified by a key. Conceptually, not much difference. The problem is that a filesystem is a dependency of a storage engine... after all, it needs a way to write to a block device, doesn't it?

All that aside, I can speak to the use of ocfs2 as a distributed filesystem in a production environment. If you don't want the gritty details, stop reading after this line: It's kinda cool, but it may mean more downtime than you think it does.

We've been running ocfs2 in a production environment for the past couple of years. It's OK, but it's not great for a lot of applications. You should really look at your requirements and figure out what they are -- you might find that you have a lot more latitude for faults than you thought you did.

As an example, ocfs2 has a journal for each machine in the cluster that's going to mount the partition. So let's say you've got four web machines, and when you make that partition using mkfs.ocfs2, you specify that there will be six machines total to give yourself some room to grow. Each of those journals takes up space, which reduces the amount of data you can store on the disks. Now, let's say you need to scale to seven machines. In that situation, you need to take down the entire cluster (i.e. unmount all of the ocfs2 partitions) and use the tunefs.ocfs2 utility to create an additional journal, provided that there's space available. Then and only then can you add the seventh machine to the cluster (which requires you to distribute a text file to the rest of the cluster unless you're using a utility), bring everything back up, and then mount the partition on all seven machines.

See what I mean? It's supposed to be high availability, which is supposed to mean 'always online', but right there you've got a bunch of downtime... and god forbid you're crowded for disk space. You DON'T want to see what happens when you crowd ocfs2.

Keep in mind that evms, which used to be the 'preferred' way to manage ocfs2 clusters, has gone the way of the dodo bird in favor of clvmd and lvm2. (And good riddance to evms.) Also, heartbeat is quickly going to turn into a zombie project in favor of the openais/pacemaker stack. (Aside: When doing the initial cluster configuration for ocfs2, you can specify 'pcmk' as the cluster engine as opposed to heartbeat. No, this isn't documented.)

For what it's worth, we've gone back to nfs managed by pacemaker, because the few seconds of downtime or a few dropped tcp packets as pacemaker migrates an nfs share to another machine is trivial compared to the amount of downtime we were seeing for basic shared storage operations like adding machines when using ocfs2.

Just wanted to comment that this is exactly my experience with OCFS2 / Pacemaker vs. NFS as well. After trying OCFS2 as a clustered data store for a while I found it extremely lacking. Meanwhile our HA NFS system has been running like a charm.
–
Kamil KisielJun 4 '09 at 20:00

1

OCFS2 is definitely not what I'm looking at. By distributed I don't mean something with a central instance of storage but rather something where I can easily add/remove nodes that provide storage while still being up with the rest of the "cluster"
–
Server HorrorJun 8 '09 at 7:40

Since I'm still getting upvotes on this answer, I should add that we're now using GlusterFS in production as a replacement for nfs. However, we do NOT store VM disk images, database storage files (sqlite or myisam or whatever), or other files that are prone to change frequently on glusterfs as it causes replication lash. Those we store locally on VM hosts in LVM and use DRBD to distribute to failover sites, or use built-in replication.
–
Karl KatzkeSep 27 '12 at 22:31

Lustre allows for multiple metadata-stores in active/passive configuration for redundancy, so no single point of failure.

OCFS2 might also be worth looking at.

Note that cutting out the requirement for multiple simultaneous network access makes it possible to switch to something like iSCSI or even cifs or nfs. The downside is you have to 'carve out' pieces of your uberArray into bites for each server that needs space.

Do you really, absolutely positively need POSIX semantics? Life gets a lot easier if you can use a custom datastore. We have an internally written datastore that is effectively a very large distributed key-value store. You store a file in it and you get a token back. If you want the file back, give it the token you were given earlier. It's distributed, is shared-nothing, data is replicated three times, nodes can be added and removed at will, both storage servers and controlling servers.

Unfortunately I really need POSIX semantics. We have a lot of "legacy apps" that store stuff to the local filesystem. Rewritting all of that is definitely outside any budget
–
Server HorrorJun 3 '09 at 22:41

I suspect you're going to have to abandon some of your other requirements. I'd be looking at GlusterFS, Lustre, OCFS2, GFS, but I doubt you'll find one that has shared-nothing.
–
David PashleyJun 3 '09 at 23:08

Unless it's for academic/development purposes, this kind of thing should be approached starting with the overall requirements for the project. Most distributed filesystems are not mature enough for a serious deployment - for example, what do you do if the whole thing flakes out. If it's for academic/development purposes, then this is actually a good thing as you could learn a great deal and fix a lot of bugs.

The comment questioning whether you really need POSIX semantics is a good start. Non-POSIX "filesystem" semantics can be so much more flexible, leading to much more reliable systems.

If this is a legacy application, I really wonder why a modern distributed filesystem might be considered the best solution.

Don't get me wrong - these are amazingly fun toys. I just wouldn't want to be responsible for a complex interdependent solution that is not commonly used and would be very difficult to fix when it flakes out.

Lustre also has a single point of failure as it uses a dedicated metadata server

Lustre is designed to support failover and a MDS/MDT/OSS can have a number of addresses which it can be contacted at, heartbeat can be used to migrate the service around.

Be aware that some recent versions have had issues where the unmount appears to work but there is data still in flight to the disc, However the double mount protection should have helped ( apart from the interesting issues that has had )....