Storage Cluster: A Challenge to LJ Staff and Readers

For a few years I have been trying to create a "distributed cluster storage
system" (see below) on standard Linux hardware.
I have been unsuccessful. I have looked into buying one and they do exist,
but are so expensive I can't afford one.
They also are designed for much larger enterprises and have tons of features I don't want or need.
I am hoping the Linux community can help me create this low cost
"distributed cluster storage system" which I think other small businesses could use.
Please help me solve this so we can publish the solution to the open source
community.

I am open to any reasonable solution (including buying one) that I can afford (under $3000).
I already have some hardware for this project, which includes all the nodes
and 2 data servers:
2 @ Supermicro systems with dual Xeon 3.0GHz CPUs, 8GB RAM and 4 @ 750GB Seagate SATA HDs.

I have tried to use all of these technologies at one point or another in
various combinations to create my solution but have not succeeded.
DRBD, NFS, GFS, OCFS, AoE, iSCSI, heartbeat, ldirectord, round-robin DNS,
PVFS, cman, CLVM, GlusterFS and several fencing solutions.

Description of my "distributed cluster storage system":

Data server: 2 units (appliances/servers) that each have 4+ drives in a RAID5 disk set
(3 active, 1 hot spare).
These 2 units can be active/passive or active/active; I don't care which.
These two units should mirror each other in real time.
If 1 unit fails for any reason the other picks up the load and carries on
without any delay or hang time on the clients.
If a unit fails, when it comes back up, I want the data to be re-synced
automatically. Then the unit should "come back on-line" (assuming its normal
state is active) after it is synced.
It would be even more ideal if the data servers could be 1-N instead of just 1-2.

Data clients: Each cluster node machine (the clients) in the server farm (CentOS 5.4 OS)
will mount 1 or more data partitions (in read/write mode) provided by the data server(s).
Multiple clients will mount the same partition at the same time in r/w, so network file locking is needed.
If a server goes down (multiple HD failure, network issue, power supply,
etc.) the other server takes over 100% of the traffic and the client machines never know.

To re-cap:

2 data servers mirroring each other in real time.

Auto fail over to the working server if one fails (without the clients
needing to be restarted, or even being interrupted).

Auto re-sync of the data if a failed unit comes back on-line, when the
sync is done the unit goes active again (assuming its normal state is active).

What follows is a (partial) description of what I've tried and why it's failed
to live up to the requirements:

For the most part I got all the technology listed working as advertised.
The problems mostly arise when one server fails: any solution that uses heartbeat
or round-robin DNS will hang the network file system for 3 seconds or more while it fails over.
While this is not a problem for protocols like HTTP, FTP and SSH
(which are what heartbeat was designed for), it poses a big problem for a file system.
If a web page takes 3 extra seconds to load, you may not even notice,
or you hit your reload button and it works; a file system, however,
is nowhere near that tolerant of delays. Lots of things break when a mount
point is unavailable for 3 seconds.

So with DRBD in active/passive mode and heartbeat I set up a
"Highly Available NFS Server" with these instructions.

With Linux's implementation of NFS, the NFS mount will hang and require
manual intervention to recover, even after server #2 takes over the virtual IP.
Sometimes you have to reboot the client machines to recover the NFS mounts.
That is not "highly available" anymore.

With iSCSI I could not find a good way to replicate the LUNs to multiple
machines. As I understand the technology, it is a one-to-one relationship, not one-to-many.
And again, it would use heartbeat or round-robin DNS for redundancy, which
would hang if one of the servers went down.

Moving on to GFS, I find that almost all of the fencing solutions available
for GFS make GFS completely unusable.
If you have a power-switch fence, your client machines will be cold rebooted
when the problem might be a flaky network cable or an overloaded network switch.
Ask any experienced sysadmin: cold rebooting any server is very dangerous.
A simple example: if your boot loader was damaged for some reason after the machine
came up, it could run for years without a problem, but the moment you reboot, the boot loader will bite you.
If you have a managed network switch you could lose all communications with a
server because the network file system was down.
Many servers will need network communication for other things that are not reliant
on the network file system.
Again, very high risk in my opinion for what could/should be solved another way.
The one solution I do like is the AoE fencing method. This simply removes the
MAC of your unreachable client from the "allowable MACs" list on the AoE software server.
This should not affect anything on the client machine except the network file system.
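
That fence can be sketched with the vblade AoE target (from the vblade/aoetools project). The MAC addresses, shelf/slot numbers, interface and backing device below are all placeholder assumptions:

```shell
# vblade's -m flag restricts the export to the listed client MACs, so
# re-exporting without a failed client's MAC cuts off only that client's
# AoE access -- nothing else on the client is touched.
command -v vbladed >/dev/null 2>&1 || { echo "vblade not installed; command is illustrative"; exit 0; }
vbladed -m 00:16:3e:aa:bb:01,00:16:3e:aa:bb:02 0 1 eth0 /dev/sdb
```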

I did get a DRBD, AoE, heartbeat, GFS and AoE-fence combination working,
but again when a server goes down there is a hang time on the network file
system of at least 3 seconds.

Finally there is glusterfs.
This seemed like the ideal solution, and I am still trying to work with
the glusterfs community on getting this to work.
The 2 problems with this solution are:

When a server goes down there is still a delay/timeout that the mount point is unavailable.

When a failed server comes back up it has no way to "re-sync" with the working server.

The reason for the second item is that this is a client-side replication solution.
This means that each client is responsible for writing its files to each server,
so the servers are basically unaware of each other. The advantage of
client-side replication is scalability: according to the GlusterFS community,
it scales to hundreds of servers and petabytes of storage.

A final note on all of this: I run one simple test on all my network file systems
that I know will bring consistent results. What I have found is that any network
file system will be, at best, 10-25% the speed of a local file system.
(Also note this is over 1Gb copper Ethernet, not Fibre Channel.)
When running (this creates a 1GB test file):

dd of=/network/mount/point/test.out if=/dev/zero bs=1024 count=1M

I get about 11MB/s on NFS, 25MB/s on GlusterFS, and 170+MB/s on the local file system.
So you have to be ok with the performance hit, which is acceptable for what I am doing.
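
A variation of that benchmark worth trying (the target path is a placeholder): bs=1024 forces a million tiny writes, while bs=1M issues the same 1GB in larger chunks, and conv=fsync makes dd flush to the server before it reports a rate, so the client page cache can't inflate the number.

```shell
# Benchmark sketch -- point TARGET at the network mount under test.
TARGET=${TARGET:-/tmp/ddtest.out}
dd if=/dev/zero of="$TARGET" bs=1M count=1024 conv=fsync
rm -f "$TARGET"
```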

P.S. I just learned about Lustre, now under the Sun/Oracle brand. I will be testing it out soon.

Chad Columbus is a freelance IT consultant and application developer.
Feel free to contact him if you have a need for his services.

Comments

I do not have the NFS solution currently set up; I have moved on to testing a GlusterFS solution. From the testing and notes I took, hard/soft made no difference. I was using /etc/fstab. Finally, the 3-5 second hang is the problem: even if the file system recovers, httpd can't be without its files for that long; it gets in a bad way or dies. I would guess my other applications, like Asterisk, would also die.

Hmmm... I'm surprised that Apache (httpd) noticed the 3-second NFS timeout and actually died. Do you have error logs?

Did other applications have problems?

And honestly, why are you running httpd on 40 systems, along with other applications? I'd probably re-architect my system to have application-specific systems that are tuned for that specific app. This way I get the most performance, without worrying about other apps that might take up all the memory on the system, etc.

I understand what you're trying to do here, and I'm sure it can be done. But you need to provide lots more detail of your setups, including logs, application settings, error messages, etc.

One thought might be to set up the NFS automounter so that you really only have a master/slave relationship. So if the NFS master goes away, the clients look at the slave. This does NOT handle writing back to NFS; it's really only good for read-only loads, which might not be what you're trying to achieve here.

Now, for the 3 seconds failover hangs. Which HA software are you using, and how is it configured? Are you using heartbeat over Ethernet? Serial cable? How low can you set your heartbeat timeouts? How do you handle STONITH (Shoot The Other Node In The Head) to make sure it's down? Do you have power strips that you can toggle via serial? That might be the simplest way to make sure that when the failover happens, it's quick and dirty and darn well going to happen.

What you really want is an Active/Active solution, so that the clients can write to a single namespace, no matter which server they are writing to. But that might involve a more expensive network to get the bandwidth and latency down.

Again, you really really really need to provide more details on what exactly you've tried, what errors you've gotten and hey, it would even help to understand *why* your app fails with a 3 second hang while the cluster fails over.

Does it fail if you've just got a test app that opens a directory, reads all the files in there, computes their SHA1s, then loops over the directory again and again, doing the same read-and-compute and comparing against the saved version?
Print out the time taken to read the files and compute the hashes, etc., in a loop. Then fail over the cluster.
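
A minimal script along those lines (the directory and pass count are placeholders): hash every file once as a baseline, then re-hash in a loop while timing each pass; fail the cluster over mid-run and watch for stalls or mismatches.

```shell
#!/bin/sh
# Repro sketch of the test described above; DIR and PASSES are assumptions.
DIR=${1:-/network/mount/point}
PASSES=${2:-10}
baseline=$(find "$DIR" -type f -exec sha1sum {} + 2>/dev/null | sort)
i=0
while [ "$i" -lt "$PASSES" ]; do
    start=$(date +%s)
    current=$(find "$DIR" -type f -exec sha1sum {} + 2>/dev/null | sort)
    echo "pass $i took $(( $(date +%s) - start ))s"
    [ "$current" = "$baseline" ] || echo "HASH MISMATCH on pass $i"
    i=$((i + 1))
done
```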

Which reminds me, how do you failover the cluster in your testing? Do you yank cables? Hard power off the system?

I've used Netapp clusters at work, and when they fail over, the clients (mostly NFS) just hang until the cluster comes back to life. Most apps/users never notice. So I'm wondering why your httpd is so sensitive. Maybe you can tweak settings on httpd to not time out so damn quickly.

Maybe you can use a tool like Puppet to ensure that services are running, and restart them if they aren't. Apache is a standard example for Puppet. Maybe the inotify framework (see http://en.wikipedia.org/wiki/Inotify) can help you monitor and fix such issues too.

I run a couple of web servers redundantly, and I just use rsync to keep them synced. There is no networked file system in my setup, so the 3-second, 5-second problem goes away.

Basically, I'm turning your problem on its side. I trade integrity (do both nodes match?) for simplicity and reliability. In your case, you have high assurance that both nodes have the same data at the same time, but you have a complex setup and you have your 3-second problem.

So, I think the real solution you're looking for is something that isn't a network file system, but instead a way to keep two independent local disks in sync. I run rsync every 10 minutes, and that's fine for me. You might need something snappier and more efficient.

To go off on a bit of a tangent, I don't even do automatic failover. The #1 cause of outages in my environment, by a huge margin, is ... planned downtime. In fact, in 10 years doing this, I can think of only a few instances where an automatic failover system might have helped me.

Now, weigh that against the risk of increased problems, outages, etc., that comes with a more complex system. It's always possible that a system intended to improve stability results in a net reduction of stability.

So, as you're considering all these options, ask yourself, which ones are better than running rsync every 10 minutes?

I have about 40 nodes, each with one 80GB HD in a 2.5" form factor.
To change each of them over to hold 2TB of data is very expensive, and since each is only capable of holding 2 HDs, it is also limiting. What happens when the data grows to 5TB, 10TB, etc.?

My client with the really fast local RAID arrays has a similar set-up to yours; one I designed, in fact. The main difference is that I set up a development server that all their personnel work on. When a change is committed, they run a "push" application that rsyncs the files to the production machines, and nightly we run a "global rsync" to make sure nothing has drifted out of sync. This avoids the need to rsync every 10 minutes and the overhead that requires (which becomes a problem as systems scale up).


I know that DRBD works, and I am pretty sure Pacemaker is just heartbeat under another name. Can you tell us about your solution? I do want this set-up, but I also want to share the solution with the Linux community.
Can you post a short paraphrased "how to" so we can understand what you are proposing?

Rajagopal,
Thanks for the link I intend to read it carefully.
I looked at NFS over UDP, but everything I read said to stay away from it.
Do you know if something has changed?
I will see if I can work with it after reading the pdf.

So you want an HA file system on several systems.
You have 2 storage servers that will contain the disks in an HA way.
You then want to export some sort of block device or file system out to the other servers.

I have not played with DRBD, but I am assuming it replicates a block device between the two file servers and gives you a virtual block device; it appears you can then combine this with LVM to split out the volumes.

Next you need to export this to the servers, so you will need to set up an iSCSI target. This can use any block device, so set up DRBD and LVM and export that. Setting this up is a bit of a pain on Debian, as you need to compile the kernel driver, but there are guides.

On the client side you will be mounting this block device on all nodes in the cluster and using a clustered file system to make it work. I work as an Oracle DBA and use OCFS2, though GFS might also do the job.

The next step is to mount the exported iSCSI device using the iSCSI initiator. Test it one node at a time by mounting the initiator, creating a file system and a test file, unmounting the file system and target, and mounting it on the second node.
Set up OCFS2 between the nodes; this involves entering the IP addresses of the other nodes and configuring it to load the services on startup.
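
The client-side steps above could look something like this, using open-iscsi and ocfs2-tools (the target address, IQN, device name, and mount point are all placeholder assumptions):

```shell
# Client-side sketch -- run as root on a cluster node.
command -v iscsiadm >/dev/null 2>&1 || { echo "open-iscsi not installed; commands are illustrative"; exit 0; }

# Discover and log in to the target exported by the storage pair:
iscsiadm -m discovery -t sendtargets -p 192.168.0.10
iscsiadm -m node -T iqn.2010-01.local.san:drbd0 -p 192.168.0.10 --login

# Make the OCFS2 file system ONCE, from one node only (-N = node slots):
mkfs.ocfs2 -N 4 -L shared /dev/sdb

# Then mount it on every node:
mount -t ocfs2 /dev/sdb /mnt/shared
```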

You still need something at the application level to be cluster-aware. Reads will be fine, but you could potentially have some file-level corruption if two copies of the application write to the same file. An Oracle RAC database will run fine active/active, but for PostgreSQL or MySQL you want only one node at a time.

Feel free to contact me. I have set up a single iSCSI target shared by multiple servers, but have not played around with active/active at the storage level.

Since this file system will be used for many applications, it is not really practical to leave the write lock in the application space; we need it at the file system level. I thought OCFS was a true network file system and did allow for network locks. Everything you suggest is reasonable; as a matter of fact, I have tried it almost exactly as you have listed (with the exception of LVM, which reduces performance, and the features it provides are not needed for my implementation). The real questions are: how do we handle a server failure? How do we take it out of service? How do we re-sync it on recovery? How do we keep clients from hanging while it is down?

Is NFS stateless? If so, you ought to be able to have the backup server mirror the primary (including who's mounting its FSs if this is applicable). When the primary server crashes, have the secondary server assume the primary's MAC and IP addresses and become the primary. When the failed server restarts, it re-syncs and fetches the secondary MAC and IP addresses from the primary server.

If the secondary server crashes, the primary should assume the secondary's MAC and IP addresses. When the secondary returns, it re-syncs and gets the secondary addresses from the primary, as expected.
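
The address takeover described above could be sketched like this on the surviving server (interface, MAC, and addresses are placeholder assumptions; claiming the MAC as well as the IP avoids even an ARP-cache update on the clients, but the failed node must never come back with that MAC still set):

```shell
# Takeover sketch -- run as root on the server assuming the primary role.
[ "$(id -u)" -eq 0 ] || { echo "not root; commands are illustrative"; exit 0; }

ip link set dev eth0 down
ip link set dev eth0 address 00:16:3e:aa:bb:01   # assume the primary's MAC
ip link set dev eth0 up
ip addr add 192.168.0.100/24 dev eth0            # assume the shared service IP
arping -c 3 -U -I eth0 192.168.0.100             # gratuitous ARP announcement
```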

NFS is not really stateless, but that is not a problem, as you can make the needed files part of the replicated data between the 2 servers.
Your solution is a fine one, and is one of the things I tried (see "Highly Available NFS Server"); the problem is all the magic of switching the IP fast enough and seamlessly enough not to hang the clients.

In my HA/DRBD/NFS server cluster I store /var/lib/nfs/ on the DRBD disk so that the NFS state data is available to either cluster node when it is active. I use a drbdlinks heartbeat resource to make /var/lib/nfs a symlink to the /var/lib/nfs/ directory on the DRBD disk(/fserv/var/lib/nfs/ in my case ).

This does help with recovery after a failover. It is still not perfect though.
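
The layout described above amounts to the following one-time move (PREFIX points at a scratch directory so this can be rehearsed safely; on the real cluster the paths would be / and the DRBD mount, e.g. /fserv, done once while the cluster services are stopped):

```shell
# Rehearsal sketch: move the NFS state directory onto the replicated disk
# and leave a symlink behind, so whichever node is active sees the same
# state after a failover.
PREFIX=${PREFIX:-/tmp/nfs-rehearsal}
rm -rf "$PREFIX"                                  # start the rehearsal dir clean
mkdir -p "$PREFIX/var/lib/nfs" "$PREFIX/fserv/var/lib"
mv "$PREFIX/var/lib/nfs" "$PREFIX/fserv/var/lib/nfs"
ln -s "$PREFIX/fserv/var/lib/nfs" "$PREFIX/var/lib/nfs"
```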

...but I have read every comment with great interest, as this subject has had me scribbling "wouldn't it be great if..." diagrams many a time. Ultimately, doesn't it all come down to this: whatever piece of kit your server is connected to right now to provide the file service must be replaced almost instantaneously with its backup if it disappears for whatever reason. Even as a mental exercise that's a difficult one to achieve. Whatever mechanism the two devices use to know the other is still alive (or not) has to have a latency of half that you require for the service they provide.

My main reason for commenting is so I can see the follow ups. I look forward to a solution being pieced together and published in full in an LJ later this year!

I'm not 100% sure this is what you're looking for, but have you seen the Nasuni Filer? It uses cloud storage (Amazon's S3 is currently the cheapest provider), but has intelligent caching (for local-fileserver speeds) and uses encryption to secure your data.

After reading their docs very carefully, I have already decided that Lustre is not really an option, as it relies on load balancing for failover. This seems to be very common: network file systems are designed and created by someone, but the authors don't seem to think that failover/load balancing is part of the file system, so they leave that to a third-party solution/application. This would be fine, except that I have not found a load-balancing solution that is designed for file systems (with near-instantaneous failover). Anyone have a really fast load-balancer solution?

I'm curious as to how you mounted the filesystem on the client when using DRBD. I'd probably have tried using the automounter to mount the filesystem, as that tends to recover more gracefully than having something hard-coded in /etc/fstab. I think you're on the right track though. If you still have the DRBD setup, try using autofs to mount the filesystem on the client and see how that works.

Bill Childers is the Virtual Editor for Linux Journal. No one really knows what that means.

I used DRBD just to replicate the data partition on the server, so that on failover, the second server would go active with an exact copy of the primary in near real time. This worked fine. The problem is the delay in the load balancer switching to the secondary server: the load balancer needs 3+ seconds to make the switch, and this delay makes all the NFS clients hang. I will have to read up on the autofs mount option and see if it somehow overcomes the shortcomings of the NFS hang. Have you tried this yourself with a standard partition? Basically, just mount an NFS partition, then block the client with iptables on the server and see if the NFS mount hangs on the client. When it does (which it will), release the iptables block and see if the NFS mount recovers. If it does, then we will be one step closer to making the DRBD + NFS solution work.
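
The server-side half of that test could look like this (the client address is a placeholder; note this only covers NFS over TCP 2049 — NFSv3 side channels such as portmapper and mountd would need additional rules):

```shell
# Failure-simulation sketch -- run as root on the NFS server.
[ "$(id -u)" -eq 0 ] || { echo "not root; commands are illustrative"; exit 0; }

CLIENT=192.168.0.50
iptables -I INPUT -s "$CLIENT" -p tcp --dport 2049 -j DROP
# ...watch the client's mount hang, then restore and see if it recovers:
iptables -D INPUT -s "$CLIENT" -p tcp --dport 2049 -j DROP
```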

I'm really surprised that you've not tried autofs. This should definitely solve the hanging problem on the clients. According to RedHat (I know you're not using them) this is the way NFS should be used exclusively.

I've not used NFS without autofs since RHEL (or CentOS) 4.0.

Also, just to put things in perspective, a 3 second delay is NOT a lot. With EMC CLARiiON SAN arrays, which are definitely not within your budget, the time it takes to trespass (fail-over) LUNs from SPA to SPB (the two service processors in the array) is normally in the 30-second range, PowerPath (EMC's multipathing software) has a default timeout of 60 seconds.

I'm not saying that you need to have resiliency on the application layer, but you can definitely have resiliency on the client's OS layer (below the application) and autofs definitely gives you that.
Thanks.
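
For reference, an autofs setup of the kind suggested above is only a couple of lines (the mount point, map name, server name, and export path here are assumptions, not a tested recipe):

```
# /etc/auto.master -- hand /mnt/data to the automounter; unmount after 60s idle
/mnt/data  /etc/auto.data  --timeout=60

# /etc/auto.data -- one map entry per share
share  -fstype=nfs,rw,hard,intr  fileserver:/export/share
```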

Cleversafe (http://cleversafe.org) might be an option, though I'm not sure how well it scales to the low end. Basically it distributes parts of a disk block to multiple storage nodes. It codes the block parts so that you only need a subset of all the nodes alive to reconstruct the block. It then has an iscsi interface so you can use multipathing to provide redundancy to the cluster itself.

From what I have read, this seems like a reasonable approach. However, it seems to need 16 servers (based on their examples), and I only have 2. If I were deploying something much larger I would definitely try it out, but as it stands it is just too big for my needs and resources.
I can't even test it to see if I can find any holes in it. I have found that several solutions seemed to work on paper, then had issues in the real world.

This sounds a lot like GlusterFS, which I have set up already.
GlusterFS's problem is that when a server goes down, it takes about 5 seconds (one time) for the server to time out; it then gets marked by clients as down and the delay is gone. The problem is that the 5-second delay is long enough to kill httpd. I might try Cleversafe and see how long it takes to recover from a downed server.

Trying to build a redundant file system is hard, as you've shown. Often it's easier to build redundancy closer to the application level.

Mogilefs might be a better approach if you're looking to serve a huge number of files over HTTP and want availability...

FWIW, Cleversafe has a file based store that you can access through HTTP instead of iSCSI. It's mostly made to be an origin server for CDN because they're going the "lots of slow cheap disk" philosophy for their commercial offering.

I would use this as a general file system. That includes using it for http files, config files, asterisk recordings, e-mail, ftp, and more. (Note I do not include log files, as I feel you should log locally and merge nightly.)

The idea for me is to be able to create a farm of servers that all share the same storage back end. When we (the Linux community) talk about clusters, we mostly talk about compute clusters. However, I have a more general cluster solution in mind. I don't want/need many computers to solve a single problem; I have many users that want access to many services. Most applications my users want are more or less stateless (think HTTP and e-mail), in that there is nothing special about a single server in my cluster except for the data. If I have shared storage, I can throw an almost unlimited number of servers at the cluster and handle as much traffic as my bandwidth will allow.

You've really opened a good subject and I'll be watching this thread pretty closely too...

I've been wanting to implement a similar solution and have had the same results...

As for network latency... for my important clients I've gone as far as bonding multiple NICs in an effort to decrease access times... It works, but it's not a real good solution for multiple clients, unless of course you have the budget.

Jerry,
For the most part I can live with the latency issues; I do not need super-fast data. I do have some clients that run local SAS RAID arrays on every server just to overcome the data-speed issues, so I understand the need for fast data. I like to say that when it comes to storage, you can have big, slow and cheap, or small, fast and cheap, but if you want big and fast, it won't be cheap. I think the same applies here. If you want fast, it will cost you to go to FC on all the servers. Even with bonded NICs (which I do run on some critical servers), there is just a limit to what you can do over copper.

Storage solutions are not "one size fits all". It sounds like you have solved the storage problem for 95% of your infrastructure.

As for the remaining 5%, the question needs to be:

Why do you need a speedy recovery?

Is it for file-locking of a particular application?
Is it to keep users happy?
Is it a particular application that can't handle a 3 second timeout?

Perhaps the remaining bit is best handled by throwing a different solution at it. This will cost less money and end up helping a lot in the long run, because the solution is tailored to the problem you are trying to solve.

If it's for users on the web who need dynamic content, throw something static at them for 3 seconds and then make it dynamic when the hiccup is over.

If it's for a relational DB, you will probably have to get something fast like FC; there is no other way to do it.

If it's file locking, you already know how to use file systems that can handle file contention; just fine-tune them.

If it's an application, talk to the developers and ask them why there needs to be less than 3 seconds between file reads/writes, etc.

I agree that I have a solution that works as long as nothing is down, and you are right, the systems are up 95+% of the time.

It may be that it can only be done with special hardware (FC) and/or software (proprietary), but I hope that is not the case.

The first application that I know can't handle the 3-5 second delay is httpd.
That by itself is a deal breaker for me. When there is a delay of that length, I have to restart httpd (once the delay is over) on all the machines. This means an outage is not really 3-5 seconds; it lasts as long as it takes to notice httpd is down, restart httpd on all nodes (about 40), and for httpd to start up and begin serving pages. It also means the load will spike as requests queue up on the load balancer, and may overwhelm the nodes when the flood gates are opened. All of this, and we are only talking about 1 of the 10-12 applications that will depend on the file system.

It would not be reasonable/acceptable for the network file system solution to make a simple outage into a complex one. Which the httpd issue could be.

I don't necessarily agree that focusing on the application side would cost less money. A commercial solution to the problem is a one-time cost, while getting 10-12 sets of application developers to reprogram their code could take many man-hours of work (read: paying salary) or may not be possible at all (read: busy open-source developers who don't have time to modify their applications).

I may be at the end of the road on this, and it could be that there is not an inexpensive solution to the problem, but I am not ready to cry uncle yet ;)