Data about Data

Warning: Long, technical post

One of the few remaining icky areas of the Nautilus codebase is the metadata store. It’s got some weird, inefficient XML file format, the code is pretty nasty, and the data is not accessible to other apps. It’s been on my list of things to replace for quite some time, and yesterday I finally got rid of it.

The new system is actually pretty cool, both in the API used to access it and in how it works internally. So, I’m gonna spend a few bits on explaining how it works.

Let’s start with the requirements and then we can see how to fulfil them. We want:

A generic per-file key-value store with string and string list values. (String lists are required by Nautilus for e.g. emblems)

All apps should be able to access the store for both writing and reading.

Access, in particular read access, needs to be very efficient, even when used in typical I/O fashion (lots of small calls intermixed with other file I/O). Getting the metadata for a file should not be significantly more expensive than a stat syscall.

Removable media should be handled in a “sane” way, even if multiple volumes may be mounted in the same place.

We don’t require transactional semantics for the database (i.e. no need to guarantee that a returned metadata set is written to stable storage). What we want is something I call “desktop transaction semantics”.
By this I mean that in case of a crash, it’s fine to lose what you changed in the recent history. However, things that were written a long time ago (even if recently overwritten) should not get lost. You either get the “old” value or the “new” value, but you never ever get neither or a broken database.

Homedirs on NFS should work, without risking database corruption if two logins with the same homedir write concurrently. It is fine if doing so may lose some of these writes, as long as the database is not corrupted. (NFS is still used in a lot of places like universities and enterprise corporations.)

Seems like a pretty tall order. How would you do something like that?

Performance

For performance reasons it’s not a good idea to require IPC for reading data, as doing so can block things for a long time (especially when data are contended; compare e.g. with how gconf reads are a performance issue on login). To avoid this we steal an idea from dconf: all reads go through mmapped files.

These are opened once and the file format in them is designed to allow very fast lookups using a minimal number of page faults. This means that once things are in a steady state, lookup is done without any syscalls at all, and is very fast.
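To make the read path concrete, here is a minimal Python sketch of a lookup in a memory-mapped file. The layout (a record count followed by fixed-size records sorted by key) is a made-up stand-in and not the actual gvfs file format, and the names `build_table`, `open_table` and `lookup` are hypothetical.

```python
import mmap
import struct

# Hypothetical layout: a 4-byte record count, then fixed-size records sorted
# by key. This is NOT the real gvfs tree format, only an illustration.
KEY_SIZE = 32
VALUE_SIZE = 32
HEADER = struct.Struct("<I")
RECORD = struct.Struct(f"<{KEY_SIZE}s{VALUE_SIZE}s")


def build_table(path, items):
    """Write (key, value) string pairs to path, sorted by encoded key.

    Keys and values longer than 32 bytes would be truncated by struct.
    """
    records = sorted((k.encode(), v.encode()) for k, v in items.items())
    with open(path, "wb") as f:
        f.write(HEADER.pack(len(records)))
        for key, value in records:
            f.write(RECORD.pack(key, value))


def open_table(path):
    """Map the file once; later lookups only touch the mapped memory."""
    with open(path, "rb") as f:
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)


def lookup(table, key):
    """Binary search over the mapped records, with no read() calls."""
    (count,) = HEADER.unpack_from(table, 0)
    wanted = key.encode()
    lo, hi = 0, count
    while lo < hi:
        mid = (lo + hi) // 2
        offset = HEADER.size + mid * RECORD.size
        raw_key, raw_value = RECORD.unpack_from(table, offset)
        found = raw_key.rstrip(b"\0")
        if found == wanted:
            return raw_value.rstrip(b"\0").decode()
        if found < wanted:
            lo = mid + 1
        else:
            hi = mid
    return None
```

A client would call `open_table()` once and keep the mapping around; each `lookup()` then touches only a handful of pages, which is the property the real format is designed for.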

Writes

Metadata writes are handled by a single process that ensures that concurrent writes are serialized when writing to disk.

Clients talk to the metadata daemon via dbus. The daemon is started automatically by dbus when first used, and it may exit when idle.

Desktop Transaction semantics

In order to give any consistency guarantees for file writes fsync() is normally used. However, this is overkill and in some cases a serious system performance problem (see the recent ext3/4 fsync discussion). Even without the ext3 problem, an fsync requires a disk spin-up and rotation to guarantee that the data is on disk before we can return from a metadata write call, which is quite costly (on the order of several milliseconds at least).

In order to solve this I’ve made the file format for a single database be in two files. One file is the “tree” which contains a static, read only, metadata tree. This file is replaced using the standard atomic replace model (write to temp, fsync, rename over).
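The atomic replace model mentioned above can be sketched in a few lines of Python. This is a generic illustration of the write-to-temp, fsync, rename pattern, not the gvfs code, and `atomic_replace` is a hypothetical helper name.

```python
import os
import tempfile


def atomic_replace(path, data):
    """Replace path with data so readers see the old or the new file, never a mix."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem for the rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())    # contents reach the disk before the rename
        os.replace(tmp_path, path)    # atomically swap in the new file
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

The fsync here is the expensive part, which is why this path is only used for the rarely rewritten tree file.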

However, we rarely change this file, instead all writes go to another file, the “journal”. As the name implies this is a journal oriented format where each new operation gets written at the end of the journal. Each entry has a checksum so that we can validate the journal on read (in case of crash) and the journal is never fsynced.
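As a rough sketch of a checksummed append-only journal, assuming a made-up entry layout (length, CRC32, payload) rather than the real gvfs one, the idea looks like this in Python. `append_entry` and `read_entries` are hypothetical names.

```python
import struct
import zlib

# Hypothetical entry layout: 4-byte payload length, 4-byte CRC32, then payload.
ENTRY_HEADER = struct.Struct("<II")


def append_entry(journal_path, key, value):
    """Append one key/value entry to the end of the journal."""
    payload = f"{key}\0{value}".encode()
    header = ENTRY_HEADER.pack(len(payload), zlib.crc32(payload))
    with open(journal_path, "ab") as journal:
        journal.write(header + payload)    # deliberately no fsync here


def read_entries(journal_path):
    """Return the valid (key, value) entries, stopping at the first damaged one."""
    entries = []
    try:
        with open(journal_path, "rb") as journal:
            data = journal.read()
    except FileNotFoundError:
        return entries
    offset = 0
    while offset + ENTRY_HEADER.size <= len(data):
        length, checksum = ENTRY_HEADER.unpack_from(data, offset)
        start = offset + ENTRY_HEADER.size
        payload = data[start:start + length]
        if len(payload) < length or zlib.crc32(payload) != checksum:
            break    # torn or corrupted tail after a crash: ignore the rest
        key, _, value = payload.decode().partition("\0")
        entries.append((key, value))
        offset = start + length
    return entries
```

After a crash the reader simply drops whatever fails validation at the tail, so recent writes may be lost but the older entries remain usable.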

After a timeout (or when full) the journal is “rotated”, i.e. we create a new “tree” file containing all the info from the journal and a new empty journal. Once something is rotated into the “tree” it is generally safe for long term storage, but this slow operation happens rarely and not when a client is blocking for the result.
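A rotation step could look roughly like the following Python sketch. It assumes a JSON tree file purely for brevity (the real tree is a binary format), it resets the journal by truncating it, and the ordering of the steps is my assumption rather than a description of the daemon. `rotate` is a hypothetical name.

```python
import json
import os
import tempfile


def rotate(tree_path, journal_path, journal_entries):
    """Fold journal entries into a new tree file, then start an empty journal."""
    try:
        with open(tree_path, "r", encoding="utf-8") as f:
            tree = json.load(f)
    except FileNotFoundError:
        tree = {}
    for key, value in journal_entries:    # later entries override earlier ones
        tree[key] = value

    directory = os.path.dirname(os.path.abspath(tree_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tree-")
    with os.fdopen(fd, "w", encoding="utf-8") as tmp:
        json.dump(tree, tmp)
        tmp.flush()
        os.fsync(tmp.fileno())            # the one slow fsync, paid only here
    os.replace(tmp_path, tree_path)       # readers see the old or the new tree

    # Only once the new tree is in place is the journal reset.
    with open(journal_path, "wb"):
        pass
    return tree
```

Because the journal is only emptied after the new tree has been renamed into place, a crash in the middle leaves either the old tree plus the full journal or the new tree.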

NFS homedirs

It turns out that this setup is mostly OK for the NFS homedir case too. All we have to do is put the journal file in a non-NFS location like /tmp so that multiple clients won’t scribble over each other. Once a client rotates the journal, the result will be visible to every client in a safe fashion (although some clients may lose recent writes in case of concurrent updates).

There is one detail with atomic replace on NFS that is problematic. Due to the stateless nature of NFS, an open file may be removed on the server by another client (the server doesn’t know you have the file open), which would later cause an error when we read from the file. Fortunately we can work around this by opening the database file in a specific way[1].

Removable media

The current Nautilus metadata database uses a single tree based on pathnames to store metadata. This becomes quite weird for removable media where the same path may be reused for multiple disks and where one disk can be mounted in different places. Looking at the database it seems like all these files are merged into a single directory, causing various problems.

The new system uses multiple databases. libudev is used to efficiently look up the filesystem UUID and label for a mount and, if one is available, that is used as the database id, storing paths relative to that mount. We also have a standard database for your homedir (not based on UUID etc, as the homedir often migrates between systems, etc) and a fall-back “root” database for everything not matching the previous databases.
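The selection logic can be sketched as a pure function in Python. Everything here is illustrative: the `Mount` record, the `pick_database` name, the id prefixes, and in particular the choice to check the homedir before the mounts are my assumptions, and the mount list is passed in rather than obtained from libudev.

```python
import posixpath
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Mount:
    # Hypothetical record; a real implementation would fill this from libudev.
    mount_point: str
    uuid: Optional[str] = None
    label: Optional[str] = None


def _inside(path, parent):
    """True if path is parent itself or lies below it."""
    return path == parent or path.startswith(parent.rstrip("/") + "/")


def pick_database(path: str, home: str, mounts: List[Mount]) -> Tuple[str, str]:
    """Return (database id, path as stored in that database) for a file."""
    path = posixpath.normpath(path)
    if _inside(path, home):
        # Assumption: the homedir database wins over any mount identity.
        return "home", posixpath.relpath(path, home)
    # Prefer the deepest mount point that contains the path.
    for mount in sorted(mounts, key=lambda m: len(m.mount_point), reverse=True):
        if _inside(path, mount.mount_point):
            relative = posixpath.relpath(path, mount.mount_point)
            if mount.uuid:
                return "uuid-" + mount.uuid, relative
            if mount.label:
                return "label-" + mount.label, relative
            break    # no usable identity: fall back to the root database
    return "root", path
```

With this shape, the same USB stick maps to the same database no matter where it is mounted, while a volume without UUID or label ends up in the path-based fall-back.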

This means that we should seamlessly handle removable media as long as there are useful UUIDs or labels and have a somewhat ok fall-back otherwise.

Integration with platform

All this is pretty much invisible to applications. Thanks to the gio/GVfs split and the extensible gio APIs, things are automatically available to all applications without using any new APIs once a new GVfs is installed. Metadata can be gotten with the normal g_file_query_info() calls by requesting things from the “metadata” namespace. Similar standard calls can be used to set metadata.
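For illustration, here is roughly what that looks like through the PyGObject bindings instead of the C API. This is a sketch that assumes PyGObject and a GVfs with metadata support are installed; the helper names are hypothetical, and the key used in the usage note is a made-up example, not a key defined by Nautilus.

```python
def metadata_attribute(key):
    """Build the gio attribute name for a key in the metadata namespace."""
    return "metadata::" + key


def read_metadata(path):
    """Return all metadata attributes of a file as a dict of strings."""
    # Imported lazily so the helper above works without PyGObject installed.
    from gi.repository import Gio

    gfile = Gio.File.new_for_path(path)
    info = gfile.query_info("metadata::*", Gio.FileQueryInfoFlags.NONE, None)
    names = info.list_attributes("metadata") or []
    return {name: info.get_attribute_as_string(name) for name in names}


def write_metadata(path, key, value):
    """Set one string-valued metadata key on a file."""
    from gi.repository import Gio

    gfile = Gio.File.new_for_path(path)
    gfile.set_attribute_string(
        metadata_attribute(key), value, Gio.FileQueryInfoFlags.NONE, None
    )
```

Calling `write_metadata("/tmp/testfile", "my-app-note", "hello")` followed by `read_metadata("/tmp/testfile")` should then show the key under the name `metadata::my-app-note`, provided the metadata daemon is reachable.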

Also, the standard gio copy, move and remove operations automatically affect the metadata databases. For instance, if you move a file its metadata will automatically move with it.

Relation to Tracker

I think I have to mention this since the Tracker team want other developers to use Tracker as a data store for their applications, and I’m instead creating my own database. I’ll try to explain my reasons and how I think these should cooperate.

First of all there are technical reasons why Tracker is not a good fit. It uses sqlite, which is not safe on NFS. It uses a database, so each read operation is an IPC call that gets resolved to a database query, causing performance issues. It is not impossible to make database use efficient, but it requires a different approach than how file I/O normally looks. You need to do larger queries that do as much as possible in one operation, whereas we instead inject many small operations between the ordinary I/O calls (after each stat when reading a directory of files, after each file copy, move or remove, etc).

Secondly, I don’t feel good about storing the kind of metadata Nautilus uses in the Tracker database. There are various vague problems here that all interact. I don’t like the mixing of user-specified data like custom icons with auto-extracted or generated data. The Tracker database is a huge (gigabytes), complex database with information from lots of sources, mostly autogenerated. This risks the data not being backed up. Also, people having problems with Tracker are prone to removing the databases and reindexing just to see if that “fixes it”, or due to database format changes on upgrades. Also, the generic database model seems like overkill for the simple stuff we want to store, like icon positions and spatial window geometry.

Additionally, Tracker is a large dependency, and using it for metadata storage would make it a hard dependency for Nautilus to work at all (to e.g. remember the position of the icons on the desktop). Not everyone wants to use Tracker at this point. Some people may want to use another indexer, and some may not want to run Tracker for other reasons. For instance, many people report that system performance suffers when using Tracker. I’m sure this is fixable, but at this point it’s imho not yet mature enough to force upon every Gnome user.

I don’t want to be viewed like any kind of opponent of Tracker though. I think it is an excellent project, and I’m interested in using it, fixing issues it has and helping them work on it for integration with Nautilus and the new metadata store.

Tracker already indexes all kinds of information about files (filename, filesize, mtime, etc) so that you can do queries for these things. Similarly it should extract metadata from the metadata store (the size of this pales in comparison to the text indexes anyways, so no worries). To facilitate this I want to work with the Tracker people to ensure tracker can efficiently index the metadata and get updates when metadata changes for a file.

Where to go from here

While some initial code has landed in git, not everything is finished. There are some loose ends in the metadata system itself, plus we need to add code to import the old Nautilus metadata store into the new one.

We can also start using metadata in other places now. For instance, the file selector could show emblems and custom icons, etc.

Footnotes

[1] Remove-safe opening a file on NFS:
Link the file to a temporary filename, open the temp file, unlink the tempfile. Now the NFS client on your OS will “magically” rename the tempfile to .nfsXXXXX something and will track this fd to ensure this gets removed when the fd is closed. Other clients removing the original file will not cause the .nfsXXXX link on the server to be removed.
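The three steps of the footnote translate directly into Python. This is only a sketch: `open_remove_safe` and the `.open-` name prefix are hypothetical, and on a local filesystem the unlinked file is simply kept alive by the open descriptor instead of being renamed to a .nfsXXXX file.

```python
import os
import uuid


def open_remove_safe(path):
    """Open path through a temporary hard link that is unlinked right away."""
    directory = os.path.dirname(os.path.abspath(path))
    # Hard links cannot cross filesystems, so the temp name lives next to path.
    tmp_path = os.path.join(directory, ".open-" + uuid.uuid4().hex)
    os.link(path, tmp_path)              # 1. link the file to a temporary name
    try:
        handle = open(tmp_path, "rb")    # 2. open the temporary name
    finally:
        os.unlink(tmp_path)              # 3. unlink it while the fd is open
    return handle
```

The returned handle stays readable even if another client later replaces or removes the original file.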

Very, very nice. We (gedit) are very happy with this (gedit also has an xml metadata format which we are unhappy with). One question popped up. gedit has a file browser plugin which shows a file listing. It has support for showing emblems, but getting emblems for files was never implemented. So I guess with this you can simply query for maybe metadata::nautilus-emblems? What is the result of that? I can imagine a list of maybe file names, or theme icon names? Also, how does it relate to the GEmblem stuff?

That’s the old tracker metadata standard, even tracker is not using that anymore.

Pádraig Brady:

The performance issue with xattrs comes from them being stored in a separate block, which causes extra seeks to open, something which is comparable to file open in cost (i.e. very expensive, like mimetype sniffing; see my various posts about this). xattrs are stored in the inode until they don’t fit, and recent filesystems have larger inodes by default, so it’s better there. However, once you pass a certain size you still get the performance issue.

Could it be fixed? Of course, kernel space can do whatever userspace can do. But are the kernel people interested in solving this issue? Not the ones I’ve spoken to at least.

William Lachance:
I haven’t measured, but I don’t think it’s a large difference. The previous nautilus code used in-process hashtable lookups, so it’s fairly efficient too, although limited in scope to nautilus.

Axel:
Well, the emblem data is available, we still need to read it in various places like the file selector. But it’s doable.

I think data should be *shared as far as possible*. (the user doesn’t care about what an application’s “relation” to a specific file is, the user cares about his/her relation to the file!)

Cryptic metadata like the current nautilus is a good example of a bad idea — no other applications access and use emblems.

+1 for xattrs if it works on modern desktop filesystems, like ext4.

+1 on storing _without_ nautilus prefix when It’s *user data*, not *user config*. Example of data: Emblems, annotations; example of config: icon size and position. The user data should be available in all desktop applications, not just nautilus!

ulrik:
I don’t quite understand what you mean by relationship to apps? Sounds weird. Keys are namespaced to avoid conflicts with multiple apps using the same name for what may be two different things. This is standard computer science and is not something that users will really see. Of course we would not use such namespacing for shared keys. The name of the key and whether or not other apps use emblems should be unrelated.

For ext4, xattrs are not significantly different from ext3, except that the default inode size is larger so more data fit before things go slow.
And, all the other reasons xattrs are not a good idea still hold.

This generally sounds pretty solid. Am I correct in thinking that all clients on the *same* machine share access to the journal and see the changes before rotation? Say, nautilus adds some metadata and also does something that causes a file to be added to “recent items” – if gnome-shell is watching the items and updates its display, will it immediately have access to the new metadata (to show emblems, say)?

alexl:
Just making the point that if the user marks a file “urgent” or “green” (or any emblem or label), for the user it’s the _file_ that is important and not what the file is like in nautilus.. I think it is clear in the “new” Gnome — with activities, centered around documents, nothing should be unique just for nautilus, none of the user data about files should be specific to nautilus, it should *be specific to the file* across all applications in the desktop.

That’s why emblems are pointless, since they only show up in nautilus.

ulrik:
Emblems only show up in nautilus because nothing else can read the old nautilus metadata store. With this metadata store all apps can read and use file emblems.

Now, emblems are clearly global metadata that should not use a prefixed name. However, other data may be truly application specific. Take for example the coordinates of the icons on the desktop directory. That’s not really useful for other apps, and if another app wants to store an “icon_pos” key it’s likely that sharing this with nautilus (i.e. using the same name) would break shit.

This sounds very nice. Is the metadata ever resync’d if a file gets moved by “mv”, and could removable-volume metadata be stored on the volume so that you could see the same emblems on files when you move a USB key between computers?

I suppose the biggest disappointment with metadata for me is that it never seems portable to other systems.

I’m under the impression that under this approach, if I put files with notes and emblems on them on a USB key and hand it to my girlfriend, or if I SFTP them to another machine (even if it’s running Nautilus), the metadata won’t carry with them.

I still dream of a day when I can annotate a bunch of my files and share them with their annotations. Right now, it seems as though a few lucky file formats get their own metadata standards, like ID3v2 tags on MP3s and EXIF data on JPEGs

I wish instead of doing “gvfs-copy /tmp/testfile /tmp/testfile2”, I could use cp and any standard command and have my files’ metadata carried along. The relationship between a file and its metadata when it’s stored in a separate database rather than being attached to a file is so fragile. It’s the reason why I generally cannot use photo management applications that use databases (or offer to create a copy of every photo in your collection?!).

I suppose a large issue with this is that most file formats today don’t expect metadata being tagged on them, and it would be difficult to convince Apple and Microsoft to engage in a general on-file metadata standard, I think. And one that could be efficient, since I suppose you don’t really want to be parsing a file for the metadata of each file.

Oh, and not to be a total downer/tool, I’m quite pleased with the solution that has been come up with. Much kudos to you for it. I hope this will help see metadata used more regularly and broadly on the desktop.

Also, perhaps in the future, I will be able to hover over files with notes and see a tooltip containing it, or something, in nautilus and other applications

anon:
Offtopic, but storing mimetype in xattrs *sounds* like an awesome idea!

However, the sniffing done on Linux is really a great feature; it’s so good that file extensions are not really necessary. Compare with OS X: if you change the file extension there, it has no idea what’s in a file anymore. Without the file extension, no way it’s reading a pdf! (Where Mac OS 9 and earlier had filetype bliss, 99% of files had well-defined filetypes in metadata, OS X is a big step backwards!)

So it *sounds* like a good idea but I still think sniffing is a real-world really neat feature.

I see the issue with not using xattrs and instead a separate metadata database as breaking ‘mv’, ‘cp’, etc and requiring the user to KNOW about the metadata database when they want to copy it along to another system. This should be more transparent to the user, and even if there are (even non-negligible) performance benefits I don’t think it benefits simplicity or clarity to the user.

However, having a global metadata spec is wonderful, so keep working on it.

For some context about performance I compared the gvfs metadata store with xattrs on ext4 in Fedora 11.

The test case is a directory with 10000 empty files, each file having one xattr key and one metadata key (same key and value in both cases, although each file has different value). Timings are made with dropped caches and with an average over 4 runs (although the times were pretty stable):

Talked with Alex about the perf side of this on ext4 a bit today, and one of the big problems when the xattrs are stored outside the inode is that even though the xattr blocks for consecutively created files may be contiguous on disk (at best), because of ext3/4 directory hashing, readdir returns things in essentially random order. Alex retested reading xattrs on all files in a directory w/ an LD_PRELOAD which does presorting, and the time went from 33s to 2s. Bummer when things conspire against you.

alexl: I tried that now, but the MetaData DBUS service wasn’t started. Then I added the location to the DBUS services path and then I get “DBus error org.freedesktop.DBus.Error.Spawn.ChildExited: Launch helper exited with unknown return code 1”.. Any idea what might be wrong?

Alex, I (and some others) are having trouble in Ubuntu Karmic Koala with the icon positions of symlinks on the Desktop. It has been suggested that the new metadata handling in Nautilus may be behind this. Basically, mounted drives, folders and files in the Desktop folder return to their positions on restart, but symlinks don’t. Instead they stack downwards on the left of the Desktop, messing with the carefully-chosen arrangement.

Can you see any reason why this is happening? It appeared a few weeks ago now, and could well coincide with the new metadata system release.

Stephen:
For various reasons metadata for the desktop special icons are not stored in gvfs metadata (this is due to the desktop dir being a virtual in-memory location, not a gvfs location). Instead they are stored in gconf. It could be that this is not working for some reason.