I have a need for shared storage of around 300GB worth of 200×200 image files. These files are written once, then resized and stored. Once stored they never change again – they might get deleted.

They get served to around 10 Squid machines and the cache times are huge, like years. In other words this is a very low IO setup: very few writes, reasonably few reads, and the data isn't that big, just a lot of files – around 2 million.

In the past I used a DRBD + Linux-HA + NFS setup to host this, but I felt there was a bit too much magic involved, and I also felt it would be nice to be able to use both nodes at a time rather than running active-passive.

I considered many alternatives; in the end I settled on GlusterFS based on the following:

It stores plain files: each storage brick just holds a lot of files on ext3 or whatever, and you can still safely read those files directly on the bricks. In the event of a filesystem failure or other incident, your existing tool set for dealing with filesystems still applies.

It seems very simple: use a FUSE driver, store some xattr data with each file, and let the client sort out replication.

I had concerns about FUSE, but felt that with my low IO load it would not be a problem; the Gluster authors are very insistent – almost insultingly so when asked about it on IRC – that FUSE issues are just FUD.

It is very flexible in how you can lay out data: you can build all of the basic RAID-style setups using reasonably priced machines as storage bricks.

There is no metadata server. Most cluster filesystems need a metadata server on dedicated hardware, kept resilient using DRBD and Linux-HA – exactly the setup I wish to avoid, and overkill when all I need is a 2 node cluster.

Going in I had a few concerns:

There is no way to know the state of your storage in a replicated setup. The clients take care of data syncing, not the servers, so there is no health indicator anywhere.

To re-sync your data after a maintenance event you need to run ls -lR to read each file; this validates each file, syncing out any inconsistent ones. This seemed very odd to me, and in the end my fears were well founded.
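For the curious, the resync procedure amounts to stat()ing every file on a client mount; a minimal sketch, assuming /mnt/gluster is your client mount point:

```shell
# Walk the whole client mount; in a replicated setup each stat() makes
# the client compare replicas and heal any file that is out of sync.
# /mnt/gluster is an assumed mount point -- substitute your own.
MOUNT=/mnt/gluster
ls -lR "$MOUNT" > /dev/null
# equivalent, and easier to restrict to a subtree:
find "$MOUNT" -print0 | xargs -0 stat > /dev/null
```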

The documentation is extremely poor and incomplete. What there is applies to older versions, and the code had a massive refactor in version 3.

I built a few test setups, first on EC2 then on some of my own VMs, tried to break it in various ways, tried to corrupt data and come up with scenarios where the wrong file would be synced, and found it overall to be sound. I went through the docs, identified the documented shortfalls, and verified whether these still existed in 3.0; mostly I found they didn't apply anymore.

We eventually ordered kit, I built the replicas using their suggested tool, set it up and copied all my data onto the system. Immediately I saw that small files were totally going to kill this setup. Doing an rsync of 150GB took many days over a Gigabit network. IRC suggested that if I was worried about the initial build being slow, I could use rsync to prep each machine directly, then start the FS layer and sync it with ls -lR.

I tested this theory out and it worked: files copied onto my machines quickly, and judging by the write traffic to the disks and network, the ls -lR at the end found little to change – both bricks were in sync.

We cut over 12 client nodes to the storage and at first it was great. Load averages were higher, which I expected since IO would be a bit slower, but nothing to worry about. A few hours in, all client IO just stopped. Doing an ls, or a stat on a specific file, would take 2 or 3 minutes to respond. For a web app this is, predictably, completely unbearable.

A quick bit of investigation suggested that the client machines were all doing lots of data syncing – very odd, since all the data was in sync to start with, so what gives? It seemed that with 12 machines all resyncing data the storage bricks just couldn't cope; they were showing very high CPU. We shut down the 2nd brick in the replica and IO performance recovered; we were able to run, but now without a 2nd active host.

I asked on the IRC channel for advice on debugging this and roughly got the following options:

Recompile the code with debugging enabled, shut down everything and deploy the new code. It would perform worse, but at least you can find out what's happening.

Make various changes to the cluster setup files – tweaking caches etc. These at least didn't require recompiles or total downtime, so I was able to test a few of them.

Get the storage back in sync by firewalling the bulk of my clients off the 2nd brick, leaving just one – say a dev machine – then start the 2nd brick, fix the replica with ls -lR, and finally enable all the nodes. I was able to test this, but even with a single node doing file syncs, IO on all the connected clients failed, even though my bricks weren't overloaded IO- or CPU-wise.
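For reference, the firewalling option is just a couple of rules on the second brick; the addresses and port range here are assumptions for illustration (24007 is glusterd's port, with the brick processes listening just above it in the 3.x series):

```shell
# On the 2nd brick: let one dev client in, drop all other Gluster traffic
# (10.0.0.50 and the 24007:24100 range are placeholder values).
iptables -A INPUT -s 10.0.0.50 -p tcp --dport 24007:24100 -j ACCEPT
iptables -A INPUT -p tcp --dport 24007:24100 -j DROP

# From the dev client, heal with a full walk, then open up again:
ls -lR /mnt/gluster > /dev/null
iptables -D INPUT -p tcp --dport 24007:24100 -j DROP
```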

I posted to the mailing list hoping to hear from the authors who don’t seem to hang out on IRC much and this was met with zero responses.

At this point I decided to ditch GlusterFS. I don't have a lot of data about what actually happened or caused it; I can't say with certainty what events were killing all the IO – and that really is part of the problem: it is too hard to debug issues in a GlusterFS cluster when you need to recompile and take it all down.

Debugging complex systems is all about data: being able to get debug information when needed, being able to graph metrics, being able to instrument the problem software. With GlusterFS this is either not possible or too disruptive. Even if the issues can be overcome, getting to that point is simply too disruptive to operations because the software is not easily managed.

Had the problem been something else – not replication related – I might have been better off, as I could have enabled debug on one of the bricks. But since at that point I had just one brick with valid data, and any attempt to sync the second node killed IO, running debug code meant unmounting all connected clients and rebuilding/restarting my only viable storage server.

The bottom line is that while GlusterFS seems simple and elegant, it is too hard or impossible to debug should you run into problems. An HA file system should not require a complete shutdown to try out suggested tweaks, recompiles etc. Going down that route could mean days or even weeks of regular service interruption, which is not acceptable in the modern web world. Technically it might be sound and elegant; from an operations point of view it is not suited.

One small side note: as GlusterFS stores a lot of its magic data in extended attributes on the files, I found that my GlusterFS based storage was about 15 to 20% bigger than my non-GlusterFS storage, which seems a huge amount of waste. Not a problem these days with cheap disks, but worth noting.

26 Comments

This is why I prefer something like MogileFS for use cases like these.
Sure, it has central metadata storage (although it can be made HA). It's just more "accessible": metadata is in databases, the synchronisation/file serving/… is readable Perl code, etc.

“Debugging complex systems is all about data, it’s all about being able to get debug information when needed, it’s about being able to graph metrics, it’s about being able to instrument the problem software. This is not possible or too disruptive with GlusterFS. ”

Ok, fair point, you need to be able to debug issues in complex systems, and if GlusterFS won't let you do that, there's a problem.

“it is too hard to debug issues in a GlusterFS cluster as you need to recompile and take it all down.”

Wait, what? You’re unwilling to compile a debug release, yet you want to debug? And the fact that you need to compile a debug release in the first place is your sticking argument for debugging being too difficult?

I do not need to compile debugging into my kernel-level filesystems to instrument them with the wealth of kernel tools that exist, like iostat, iotop and other similar tools.

Similarly I do not need to reboot my server when I change buffers, etc via the standard Unix sysctl tools.

I am able to tune, monitor, debug, graph etc all about traditional filesystems without rebooting, without recompiling.
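To make the contrast concrete, this is the sort of live inspection and tuning a kernel filesystem allows; the specific values here are illustrative, not recommendations:

```shell
# Observe per-device IO latency and utilisation, live (needs sysstat):
iostat -x 5 2

# Read and change a writeback tunable on the running system (as root),
# no recompile, no remount:
sysctl vm.dirty_ratio
sysctl -w vm.dirty_ratio=10
# the same knob is also just a file:
cat /proc/sys/vm/dirty_ratio
```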

Compare using well-known, standard kernel interfaces to debug and tune a running filesystem versus recompiling/restarting GlusterFS, and you'll know what I mean.

Also, like I pointed out, it might have been more viable had my problem not left only 1 node of my cluster operable; had I two working nodes I could probably have run debug code on one without interrupting my production system.

I’m a little confused about one thing. You say at one point that you had “12 machines all doing resyncs of data” and that was causing the storage bricks to become overloaded. Why were your 12 clients doing resyncs all at the same time? For the same files? Even if it’s for different files, that could be a lot of metadata traffic.

I had the same experience. Gluster is slow and hard to administer.
A lot of things you want to do, you’ll have to do outside of gluster.

Glusterfs is good for archiving but not for performance storage.
Since metadata is decentralised, all stat operations take crazy long.
Even with their 3.1.1-1 release, which promised to be faster, I experienced a 100-times slowdown just doing a "du -sch *".

Eventually they will get it right, but for now I think they are missing the boat on what a network filesystem is. For all its shortcomings, NFS should still be considered over GlusterFS for a few more years….

I’m the community guy at Gluster. I tried reaching out to you on Twitter but hadn’t heard back. In any case, there are a couple of things here:

1. Yes, documentation needs work, but we’ve improved it since your post, and that work continues

2. We’ve released 3.2 since then, which I’d be curious if you’ve tried

3. I’m not sure if you tried this, but we have support and services staff on hand who are ready and willing to help with this type of thing. Would you be interested in working with them to get to the heart of your issues?

I am currently researching GlusterFS. I have a Red Hat GFS cluster whose hardware is due for replacement, so I am interested to see if anyone has compared the performance and complexity of GlusterFS and GFS. With GFS the nodes mount the common storage over FC (i.e. no data replication), so with GlusterFS I could get additional tolerance if I lose storage on one server.

It sounds like you have a few other options available to you apart from GlusterFS. You could use RedHat Cluster with GFS2 on top of DRBD (up to 3 nodes) to gain an active-active setup. But since your data is seldom written, you would probably be better off with a solution like lsyncd. This can also scale up to many nodes with some careful configuration and a ring replication topology. Best of all, reads incur no overhead at all.

@jon elf: in the end we wrote a small non-POSIX system, i.e. a web service that just distributed the files via a deterministic algorithm onto a set of storage nodes, similar in design to how S3 and other such systems work.
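A minimal sketch of the idea – hash a file's key and map it onto a fixed list of storage nodes. All names and the hash choice here are mine; the real service was an internal web service:

```shell
# Deterministic placement: the same key always maps to the same node,
# so any frontend can compute where a file lives without a lookup.
place() {
  local key="$1"
  local nodes=(store1 store2 store3 store4)  # hypothetical node list
  local h
  h=$(printf '%s' "$key" | md5sum | cut -c1-8)
  echo "${nodes[$(( 0x$h % ${#nodes[@]} ))]}"
}

place "images/12345_200x200.jpg"   # always returns the same node
```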

we were lucky that we could easily replace the existing file IO code with an API call

Hey R.I.Pienaar, thanks for a good post. I use gluster with some PHP web applications to store image data, and I have had my fair share of problems, but I found using these mount options worked better for me.

Note, this was posted in 2010, today in 2016 with glusterfs 3.8.4 I have a very similar type of setup, I put up the first gluster server with all the files on it, and added in the second server with nothing on it. It’s taking about 12 hours to sync 2.5 TB of data, and btw I didn’t understand the ls -lR thing to begin with so a bunch of that time is not “active”… considering that the hardware speed limit of the disks is probably close to 6 to 10 hours minimum for this much data transfer, I think it’s reasonable.

hoping I won’t see the kinds of sudden i/o overload you saw, but one thing I’ll say is that it seems to me that glusterfs is really for use *on the cluster* and for sharing to third parties something like Samba or NFS is required for security reasons if nothing else. I also think it would introduce fewer glusterfs native clients into the mix and this could be very helpful. My setup is two hosts each with one large btrfs brick replicating to each other over a local private lan through a gigabit switch. Each node has mounted the gluster volume via native client, and then each exports the glusterfs native mounted filesystem via either NFS or SMB to people who actually use the files (with Kerberos security).

This system wouldn’t have 12+ native clients all trying to replicate things, and with glusterfs 3.8.4 on this system it seems acceptable initially.

Thanks for your comment. Yes, this post is ancient; I hope the picture today is a bit more rosy.

I am curious why you use BTRFS to replicate across nodes and then have gluster as well? Doesn’t this basically mean you have 2 cluster filesystems trying to replicate the same data? I guess I am just misunderstanding your setup