not so random musings and mutterings about high performance computing, business, entrepreneurship, and the economy

sad/exciting time ahead

One of our customers has become fed up with the issues they’ve run into on Gluster. It started about a year ago, with some odd outages in the 3.0.x system, and didn’t improve with 3.2.x … in some instances it got worse. RDMA support in 3.0.x was pretty good, though there were other bugs (which were annoying). The migration to 3.2.x was rocky. Libraries left over from 3.0.x were somehow picked up and some things just failed.

Suffice it to say that this customer was experiencing Gluster outages on a weekly basis. Usually involving a long phone call to me for the post mortem. And there came a point in time, after watching Gluster get absorbed by Red Hat, and realizing that the ties that I had with the Gluster engineering team had now been … er … reduced … that our ability to get this customer the support they needed was now problematic.

Add to this a hardware RAID issue (the vendor has trouble admitting that they have a problem, despite a reproducible problem we’ve seen pretty much everywhere). Sadly they are still the best vendor of the lot, even though their support is not what one might call “good”, or even “acceptable”. Or “workable”.

So we absorb most of the brunt of their failures. And the software bugs.

It got to the point with this one customer that we were spending 5-6 hours per week discussing the latest failure (usually at a Gluster level).

So the customer has kicked Gluster to the curb. I am re-commissioning one of the machines now. This is a fairly sizable storage cluster, and one of the early “successes” that we had with Gluster. It had just worked in the early days, though there were some bugs. The mistake appears to have occurred in moving them off of 3.0.x to higher numbered versions. That’s when things went from annoying to terrible.

That’s the sad part.

The exciting part is that they are going to be giving Fraunhofer Parallel File system a try. Initial tests on our recommendation have been … very encouraging.

The hardware is solid (modulo environmental changes). We even have a workaround for the recalcitrant RAID vendor. And we have some stuff in development which should handle these issues for us soon, so we won’t have these problems to worry about anymore.

11 thoughts on “sad/exciting time ahead”

Crap. Getting closer to a small gluster deployment here. Non-free FSes aren’t an option for a few reasons. If gluster isn’t working (and it didn’t first time ’round here, either), I might just try ceph.

It works. This particular customer had an … er … odd … networking requirement that needed both 10GbE and IB for different portions of the same cluster at once. After 3.0.x, Gluster did not handle this well at all.

Remember that this is just one of many customers using GlusterFS. Most of the rest are happy with it, though we have one other that may be getting fed up with bugs (like the add-brick silently succeeding … though it really didn’t).

Most of the rest of the GlusterFS customers are using it over 10GbE or RDMA (not both). Most are happy with it.

Yeah … that design might not make it too happy. Fraunhofer certainly is ready, and can easily handle this. Ceph is (if you skip the BTRFS fs, and use a different backing store).

A nicer feature of Fraunhofer is the distributed metadata, that we can put on flash. Customer noted an immediate (and significant) uptick in stat heavy load performance.
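For anyone who wants a feel for what “stat heavy” means on their own file system, here is a rough sketch. It is not what the customer ran. The function names (`time_stat_walk`, `make_sample_tree`) are made up for illustration, and it only times `os.stat()` calls over a directory tree. Point it at a real mount to compare metadata behaviour; a warm client cache will flatter the numbers.

```python
import os
import tempfile
import time


def time_stat_walk(root):
    """Walk `root`, call os.stat() on every file, return (count, seconds)."""
    count = 0
    start = time.perf_counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            os.stat(os.path.join(dirpath, name))
            count += 1
    return count, time.perf_counter() - start


def make_sample_tree(root, n_dirs=4, files_per_dir=25):
    """Create a small tree of empty files so the walk has something to stat."""
    for d in range(n_dirs):
        sub = os.path.join(root, f"dir{d}")
        os.makedirs(sub, exist_ok=True)
        for f in range(files_per_dir):
            open(os.path.join(sub, f"file{f}"), "w").close()


if __name__ == "__main__":
    # Hypothetical demo on a throwaway directory; swap in a real mount point
    # (and skip make_sample_tree) to look at an actual file system.
    with tempfile.TemporaryDirectory() as tmp:
        make_sample_tree(tmp)
        n, secs = time_stat_walk(tmp)
        print(f"{n} stat calls in {secs:.4f} s")
```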

What’s also nice is that both sets of developers show up on this blog every now and then :), and leave good comments.

FWIW: I don’t see Ceph and Fraunhofer as directly competitive, though there is some overlap. I see two different problem sets being largely addressed. And both sets of developers are doing great jobs finding/fixing bugs throughout the kernel systems as well. I have a fairly high degree of confidence in these groups.

This is not to say that I don’t like the Gluster group. Far from it, they are wonderful people. It’s simply that Gluster has a different (and orthogonal) focus to what we need for high performance file systems. They are very focused upon “the cloud”, and making standing up “cloud file systems” easy. Nothing wrong with this. It’s just different (very different) from standing up high performance, highly reliable, long term lifetime file systems.

We are still looking at how to hook our IP into Gluster, but our IP is pretty generic in that we simply need a way to get data in and out, so a translator layer is a pretty “simple” mechanism in this regard. But we are also looking at how to hook our IP in more generally directly to the OS stack through SCSI targets and things like that, so we can, again, sit as a layer atop other goodness. Same IP, slightly different interface.

So don’t take any of my comments as casting aspersions upon Gluster. I am not. It’s that one customer with a hard network had issues. And these issues caused lots of outages. We would still use it for some scale-out NAS systems, where performance was less critical. Our siCluster-NAS is generally built atop Gluster and our DeltaV, and this has been a very reliable product.

Oh, I’m not reading anything negative into this. This is a vast space full of unsolved problems and barely explored solutions, and every group needs to pick bits to tackle. Now that I think of it… Other places I read of people dropping RDMA and going to IPoIB for Gluster… I wonder if it’s the same problem. That *might* be a solution for me, for now.

I am interested in HekaFS & the “cloud-ish” aspect, though. We’re a silly .edu, so our resources need to be too-flexible and zero-cost. 😉 If HekaFS gives a multi-tenant (aka multi-class/research/*funding*-group) view and keeps enough of Gluster’s performance… That would be very interesting in terms of setting up a flexible large-storage funding model.

As far as Ceph: I’ve been running btrfs at home without problems, aside from the nasty interaction with Debian’s fsync-paranoid package manager. Hasn’t eaten any data on me on kinda-unreliable hardware, although a single user produces at least two orders of magnitude fewer I/O ops.

@Jason:
“Non-free FSes aren’t an option for a few reasons.”
…this does not mean that you’re thinking fhgfs would be non-free (i.e. have any sort of licensing costs or so), does it?

Regarding btrfs: One of my colleagues is using btrfs on his desktop as root file system. It turned out in his case that after a while, more than 60% of his disk space was filled up with old copies of btrfs internal metadata and he needed to use a special workaround to get rid of those. So you might want to watch out for that.

@Jason – if you really want to use apt/dpkg without the fsync() issues then you can run it with libeatmydata (packaged as “eatmydata”) to avoid it. Not recommending it of course!

Also you want to be using 3.2.x for btrfs at present, though I think there might be a fix for the regression in 3.3.x coming in a future stable release (the first btrfs fix in a stable series I believe!).

@Chris:
I assumed that Jason was referring to “free” in terms of costs, not in terms of some fsf.org definition – because he later said “We’re .edu, so all our resources need to be zero-cost”.

Regarding open-sourcing the server components of fhgfs: We keep asking people for feedback on whether we need to do that. So far, the majority of people answer that they just want their cluster file system to do its job reliably – and if that’s the case then they don’t care about whether it’s open-source software or not. In fact, we even offer access to the sources (and permission to modify them) to users with a support contract, but so far there’s no real interest in that.

@Sven: For me to support it, I’d want free and not just “no-cost.” The latter category has a history of changing. I’m not implying anything about the developers’ intentions, just what has happened even over developers’ objections in other projects. I remember getting quite burnt by the SSH license re-interpretation, for example. Plus, I like knowing that I have the source just in case things go utterly wrong. I’ve had to go spelunking in raw disks in the past, and having the code that managed the raw bits helped. If we had money to put behind requests, etc., my perspectives on this would be different. I don’t expect fhgfs development to jump at my no-money requests or provide any guarantees without appropriate remuneration. With free software, sometimes I can trade things (e.g. patches) for borrowing someone’s attention.

@Chris: Yup, I use libeatmydata, and that machine’s running Debian’s 3.2 kernel.

On btrfs: I haven’t run into issues with the disk being magically full, nor issues when it was really full. Might be luck. But desktop performance doesn’t seem any different than ext4. Server loads are different, and I’m definitely not stressing the FS.