Beyond Filesystems

14 June, 2009 10:16 pm

Historically, I’ve been a bit of a filesystem purist. If somebody called something a filesystem that wasn’t fully integrated into the kernel (so that any application could transparently read, write, and even map files within a familiar hierarchical directory/file structure), or if it didn’t support consistency at least as strong as NFS’s (preferably stronger), I’d be quick to correct them. I’d say that what they had wasn’t a filesystem, with the clear implication that it was something less than a filesystem. My views have softened a bit over the years, though, and now I see a lot more value than I used to in data stores that aren’t filesystems. I’ve always seen value in databases, of course, but over the last few years more and more people have been putting data into things that are neither filesystems nor databases. In many cases, the value of these other data stores, and/or of what’s stored in them, exceeds that of the filesystems and databases (including the quasi-databases that I’m sure database purists complain about the same way I have about quasi-filesystems).

At first I wondered why people were doing things this way. Many of these folks were doing distributed programming. Were they simply ignorant about distributed filesystems? Were they too lazy to learn about the wheel that already existed before they went and invented their own? Those two explanations seemed most likely for a while, but it turns out that there are better reasons as well, and it’s those reasons I’ll explain here.
The most prominent example of a non-filesystem, non-database data store is the distributed key-value store. Create a key (often some sort of hash) and attach some arbitrary amount of data to it; provide the key later, and get the data back. Examples include Amazon’s Dynamo and S3, Project Voldemort, and many more. For the most part, these share some characteristics:

Weak consistency – readers might get stale data (or no data) even after a “completed” write.
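The whole contract really is that small. Here’s a minimal in-memory sketch of the put/get interface; the names are illustrative, not any particular store’s API, and the content-derived key stands in for the “often some sort of hash” above:

```python
import hashlib

class KVStore:
    """Toy in-memory key-value store: opaque keys map to opaque blobs."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # No hierarchy, no rename, no partial update: just bind key -> value.
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

def content_key(blob):
    """Derive the key from the data itself, hash-style."""
    return hashlib.sha1(blob).hexdigest()

store = KVStore()
k = content_key(b"some arbitrary blob of data")
store.put(k, b"some arbitrary blob of data")
```

A real store distributes `_data` across many nodes and relaxes the guarantee that `get` after `put` returns what was written, which is where the rest of this post comes in.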

One of the most fundamental operations in any data store is turning some sort of name into an identifier or handle for a particular unique object. In a filesystem, this means looking up the unique ID (typically an inode number) for each directory component of the path, from the root on down. This can be so expensive even for a local filesystem that most operating systems have a special “name cache” just for this purpose. Much of the design effort for any distributed filesystem is oriented around doing the same thing efficiently across a network, and that’s where rename operations have earned a position of special notoriety. In addition to the ordering problems at the heart of the great fsync debate, renames make it possible for an already-resolved partial path – which might be the top of an entire large directory subtree – to disappear suddenly. Even more fun can be had when parts of the old path are repopulated with new directories/files, so that two nodes trying to resolve the same path might get different answers depending on their prior name-cache state. Basically, rename (and the rewriting of symbolic links) requires that name caches be kept consistent. In a distributed system, that can involve quite a bit of complexity and performance cost.
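A toy resolver makes the cost visible. Assume a table mapping (parent inode, name) pairs to inodes – the part that would live on disk or across the network – with a name cache in front of it:

```python
# Toy path resolution with a name cache. The directory table stands in
# for what would be disk reads (local FS) or round trips (distributed FS).
ROOT = 1
directory = {            # (parent inode, component) -> child inode
    (1, "usr"): 2,
    (2, "include"): 3,
    (3, "stdio.h"): 4,
}

name_cache = {}
lookups = 0              # count of "expensive" uncached lookups

def lookup(parent, name):
    global lookups
    if (parent, name) in name_cache:
        return name_cache[(parent, name)]
    lookups += 1                       # the costly part: disk or network
    inode = directory[(parent, name)]
    name_cache[(parent, name)] = inode
    return inode

def resolve(path):
    inode = ROOT
    for component in path.strip("/").split("/"):
        inode = lookup(inode, component)
    return inode

resolve("/usr/include/stdio.h")        # three real lookups, one per component
resolve("/usr/include/stdio.h")        # all three come from the name cache
```

Now imagine `/usr` being renamed: every cached entry beneath it is silently wrong, and in a distributed system every node’s copy of that cache has to find out.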

Consistency problems can be solved, of course. There’s additional complexity involved, and some performance cost, but I’ve long held that those could be kept to a minimum and that the benefits are worth it. The thing that really changed my mind about this was an observation in the Dynamo paper: strong consistency reduces availability. I’ve always thought of data availability in terms of data not being lost or stranded on the other side of a failed network connection. The Dynamo insight is that many applications have to do a lot of work within a small acceptable-response-time window, and to make sure that they fit into that window they have to impose deadlines on all sub-operations, including data access. If consistency issues make data unavailable within that deadline, then they’ve made it unavailable, period, with practically the same effect as if the data were unavailable in any other sense. As someone who has done hard-realtime work I have trouble applying the term “realtime” to a web app that’s responding to a user, but by most definitions they would qualify. In such an environment, which has become extremely common, forgoing (some) consistency to ensure that data are available – meaning available within a given time – makes perfect sense.
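To make that insight concrete, here’s a sketch of a deadline-bounded read. The `Replica` class and its `get(key, timeout=...)` method are hypothetical, invented for the example; the point is only that a timely stale answer beats a fresh one that arrives after the deadline:

```python
import time

class Replica:
    """Toy replica: answers from its own copy of the data, but only
    if it can do so within the caller's timeout."""
    def __init__(self, data, latency_s):
        self.data, self.latency_s = data, latency_s
    def get(self, key, timeout):
        if self.latency_s > timeout:
            raise TimeoutError          # couldn't answer in time
        time.sleep(self.latency_s)
        return self.data.get(key)

def read_with_deadline(replicas, key, deadline_s):
    """Return the first answer any replica gives within the deadline.
    An answer that misses the deadline is as useless as no answer."""
    start = time.monotonic()
    for replica in replicas:
        remaining = deadline_s - (time.monotonic() - start)
        if remaining <= 0:
            break
        try:
            return replica.get(key, timeout=remaining)
        except TimeoutError:
            continue
    return None   # unavailable within the deadline == unavailable, period

fresh = Replica({"k": "new"}, latency_s=1.0)     # up to date but slow
stale = Replica({"k": "old"}, latency_s=0.001)   # behind but fast

value = read_with_deadline([fresh, stale], "k", deadline_s=0.05)
```

Here `value` is the stale `"old"`: the consistent replica simply couldn’t fit inside the response-time window.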

This same consistency/availability tradeoff is the reason that these kinds of key-value stores usually eschew not only consistency within a file/object but also the whole structure of hierarchical names, renames, etc. They don’t need the naming flexibility. They do need lookups to be consistently fast, which means easily distributing keys among many nodes and not having to worry about them moving around. As the Dynamo paper also explains, many applications are happy with only primary-key access, and they don’t really need the primary key to be anything human-readable because the key is already stored in a database anyway. If anybody wants more complex naming, they can implement it readily enough on top of the raw key-value functionality. If they don’t, they can get the primary key for something from their database query result and use it directly to get their data without any further name lookups.
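Distributing keys among nodes without any directory lookups is what consistent hashing buys. A minimal sketch (illustrative only, not Dynamo’s actual partitioning code): each node claims points on a hash ring, and a key belongs to the next node clockwise from the key’s own hash, so ownership is computed locally from one hash with zero metadata round trips:

```python
import bisect
import hashlib

def hash_point(s):
    """Map a string to a point on the ring."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class HashRing:
    """Toy consistent-hash ring. Each node gets several virtual points
    so keys spread evenly; adding a node moves only nearby keys."""
    def __init__(self, nodes, vnodes=64):
        self.ring = sorted((hash_point(f"{n}:{i}"), n)
                           for n in nodes for i in range(vnodes))
        self.points = [p for p, _ in self.ring]

    def node_for(self, key):
        # First ring point at or after the key's hash, wrapping around.
        i = bisect.bisect(self.points, hash_point(key)) % len(self.ring)
        return self.ring[i][1]

ring = HashRing(["node-a", "node-b", "node-c"])
owner = ring.node_for("user:42")   # one hash, no name resolution at all
```

Compare that with the multi-step path resolution earlier: this is why “consistently fast” lookups and flat key namespaces go together.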

The conclusion I draw from this exploration is that filesystems are overkill for many people. They provide functionality those people don’t need, and in return exact a cost those people are unwilling to pay, so alternatives have been developed. I think it’s time for more people to realize that there are no longer only two ways to store and access your data. Key-value stores aren’t just something layered on top of filesystems or databases. They – even more than object storage or content-addressable storage – represent a primary data-access paradigm in their own right.

I always wondered why one needs to put the exact location of a .h file into a specific file, rather than letting the computer automatically know where a file is situated (or where the DATA of that file is, so the compiler/computer can use it).

I just don’t think that configure scripts are a smart way of doing things in the year 2009. The “checking for sys/config.h” lines seem so incredibly stupid to me, especially if you listen to developers complaining that “on mac os x the file paths are all different from windows/linux and now i have to PORT my app to make it work…” It seems like stupid extra work, when the inherent logic in the application is already finished…

The problem is – things that already exist are a LOT HARDER to change. Even if things suck, if they just work more or less, people accept the suckiness.

It sounds like the issue you’re talking about is not solved so much by key-value stores or other things outside of filesystems, but by things that can be done within the filesystem. I’ve long favored an approach kind of like Plan 9, where each session can essentially have its own namespace, created by combining pieces of the system namespace however it pleases. Combine that with something like unionfs, which lets you have a directory tree containing the union of two (or more) others instead of one hiding the other(s), and I think you have a more elegant solution to the include-file and similar problems. Just union together the system includes and the project includes and your private includes into a single well-known location, and your compiler won’t even need to know that they’re actually coming from different places.
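The read side of that union idea is just an ordered search. A few lines make it concrete (the directory names are hypothetical; a real unionfs does this in the kernel, per lookup):

```python
import os

def union_open(name, layers):
    """Look up `name` in an ordered stack of directories; the first hit
    wins, so earlier layers shadow later ones. This is the read side of
    a union mount, in miniature."""
    for layer in layers:
        candidate = os.path.join(layer, name)
        if os.path.exists(candidate):
            return candidate
    raise FileNotFoundError(name)

# A compiler pointed at one well-known include location would, underneath,
# be searching something like: private, then project, then system includes.
layers = ["./my-includes", "./project/include", "/usr/include"]
```

The compiler only ever sees one namespace; which physical directory a header came from is the union’s problem, not the build system’s.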

I see object storage, as represented in things like Lustre or pNFS, as a pretty direct replacement for block storage. The unit of addressing and access control is a finer-grained object instead of a coarser-grained LUN, the object has more responsibility for data placement etc., but for the most part the two are interchangeable building blocks for higher-level entities such as filesystems. This is even reflected in pNFS using basically the same protocol with both object and block layouts as options.

Key-value stores are not so much building blocks for filesystems as alternatives to them. They follow fundamentally different rules, especially with respect to consistency. Imagine the case where X writes something, then notifies Y which reads it. It would be unacceptable for either a block device or a filesystem to give Y anything but the data that X just wrote. In many key-value stores this is completely acceptable behavior if the elapsed time is shorter than that required for “eventual consistency” to do its thing.
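That scenario is easy to demonstrate with a toy two-replica store where replication is an explicit, deferred step rather than part of the write (names here are mine, not any real store’s):

```python
class EventualStore:
    """Toy eventually-consistent store: two replicas, with replication
    driven by hand so the inconsistency window is visible."""

    def __init__(self):
        self.a, self.b = {}, {}     # two replicas of the data
        self.pending = []           # writes not yet propagated to b

    def put(self, key, value):
        # X's write "completes" as soon as replica A has it.
        self.a[key] = value
        self.pending.append((key, value))

    def get(self, key):
        # Y happens to be routed to replica B.
        return self.b.get(key)

    def anti_entropy(self):
        # "Eventual consistency doing its thing": propagate to replica B.
        while self.pending:
            k, v = self.pending.pop(0)
            self.b[k] = v

store = EventualStore()
store.put("k", "new")       # X writes, then notifies Y
stale = store.get("k")      # Y reads before propagation: gets nothing
store.anti_entropy()
fresh = store.get("k")      # after convergence: "new"
```

A filesystem or block device returning `stale` here would be a bug; for this class of store it’s the documented contract, and applications have to be written for it.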