I’m going to put my half-baked ideas under a special Ideas tag on this site.

My first idea is a “cluster” version of ext4 for the special case where there is one writer and multiple readers.

Virt newbies commonly think this should “just work” – create an ext4 filesystem in the host and export it read-only to all the guests. What could possibly go wrong?

In fact this doesn’t work, and guests will see corrupt data, more often the more frequently the filesystem gets updated. The reason is that the guest kernel caches stale parts of the filesystem, and it can also be reading metadata which the host is simultaneously updating.

Another problem is that a guest process could open a file which the host then deletes (reusing its blocks). Really the host should be aware of which files and directories the guests have open and keep those around.

So it doesn’t work, but can it be made to work with some small changes to ext4 in the kernel?

You obviously need a communications path from the guests back to the host. Guests could use this to “reserve” or mark their interest in files, which the host would treat as if a local process had opened them. (In fact if the hypervisor is qemu, it could open(2) these files on the host side.)
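To make the “reserve” idea a bit more concrete, here is a minimal sketch of what the host side might keep track of. Everything in it is an assumption for illustration: the `ReservationTable` class, its method names and the guest IDs are hypothetical, and it says nothing about how the requests would actually travel from guest to host. It only shows the host holding a file descriptor open on a guest's behalf.

```python
import os
import tempfile


class ReservationTable:
    """Hypothetical host-side table of files that guests have asked to keep alive."""

    def __init__(self):
        # (guest_id, path) -> file descriptor held open on the guest's behalf
        self._fds = {}

    def reserve(self, guest_id, path):
        """Open `path` read-only so the file survives deletion while the guest uses it."""
        key = (guest_id, path)
        if key not in self._fds:
            self._fds[key] = os.open(path, os.O_RDONLY)
        return self._fds[key]

    def release(self, guest_id, path):
        """Drop one reservation; the kernel frees the file if this was the last reference."""
        fd = self._fds.pop((guest_id, path), None)
        if fd is not None:
            os.close(fd)

    def release_guest(self, guest_id):
        """Drop everything a guest held, e.g. when it shuts down or crashes."""
        for key in [k for k in self._fds if k[0] == guest_id]:
            os.close(self._fds.pop(key))

    def count(self):
        """Number of reservations currently held."""
        return len(self._fds)


def demo():
    """Reserve a file for a made-up guest, delete it on the host, then read it anyway."""
    table = ReservationTable()
    tmpdir = tempfile.mkdtemp()
    path = os.path.join(tmpdir, "data.txt")
    with open(path, "wb") as f:
        f.write(b"hello guest")
    fd = table.reserve("guest-1", path)
    os.unlink(path)              # the host deletes the file
    data = os.pread(fd, 64, 0)   # the reserved descriptor still reaches the data
    table.release("guest-1", path)
    os.rmdir(tmpdir)
    return data, table.count()
```

The `release_guest` method is there because a guest can die without releasing anything, and the host would need some way to clean up after it.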

Update

If qemu is going to open files on the host side, why not go the whole way and implement a paravirtualized filesystem? It wouldn’t need to be limited to just ext4 on the host side.

But how would we present it on the guest side? Presenting a read-only ext2 filesystem on the guest side is tempting, but not feasible. The problem again is what to do when files disappear on the host side — there is no way to communicate this to the guest except to give fake EIO errors which is hardly ideal. In any case qemu can already export directories as “virtual FAT filesystems”. I don’t know anyone who has a good word to say about this (mis-)feature.

So it looks like however it is done, there is a requirement for the guest to communicate its intentions to the host, even though the guest still would not be able to write.

9 responses to “Half-baked ideas: Cluster ext4”

“The problem again is what to do when files disappear on the host side — there is no way to communicate this to the guest except to give fake EIO errors which is hardly ideal.”

As you know, a file is only truly deleted when nothing references it anymore (as open()/close() in userland, some refcount in VFS). So you just need to make sure to keep the files open on the host for as long as they’re referenced in the guest.
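The behaviour described above is easy to check on a POSIX host. The following is only a sketch: the helper name `read_after_unlink` is made up for this example, and it assumes a local filesystem under the temporary directory (a network filesystem may behave differently).

```python
import os
import tempfile


def read_after_unlink(payload):
    """Write a file, unlink it while still holding it open, then read it back."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, payload)
        os.unlink(path)                       # remove the only directory entry
        exists = os.path.exists(path)         # the name is gone
        links = os.fstat(fd).st_nlink         # link count as seen through the open fd
        data = os.pread(fd, len(payload), 0)  # the contents are still readable
    finally:
        os.close(fd)                          # only now can the blocks be freed
    return exists, links, data
```

Calling `read_after_unlink(b"still here")` should report that the path no longer exists while the data is still returned intact. The blocks are only released once the last descriptor is closed.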

That means that virtio_fs needs an API/protocol that allows the guest to basically open() files in the host. The API could be similar to POSIX, but perhaps something closer to the VFS is better. See page 18 of http://www.slideshare.net/ericvh/9p-on-kvm

I think you want something that mmap()s the host’s files into the emulated physical memory space in the guest on demand. The guest would then use a special virtiofs which doesn’t use the guest kernel’s buffer cache, instead sharing the host’s buffer cache with all other guests.
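As a rough illustration of sharing the host's cache instead of keeping a private copy, the sketch below maps a file read-only and then updates it through a separate descriptor. It is a single-process stand-in for the idea, not a virtio implementation: the function name `shared_mapping_demo` is hypothetical, and it assumes a Linux host, where a shared mapping and `write` go through the same page cache.

```python
import mmap
import os
import tempfile


def shared_mapping_demo():
    """Map a file read-only, change it via another descriptor, and re-read the mapping."""
    writer_fd, path = tempfile.mkstemp()
    try:
        os.write(writer_fd, b"old contents")
        reader_fd = os.open(path, os.O_RDONLY)
        try:
            # ACCESS_READ gives a read-only shared mapping of the whole file.
            view = mmap.mmap(reader_fd, 0, access=mmap.ACCESS_READ)
            before = view[:3]
            os.pwrite(writer_fd, b"new", 0)   # the single writer updates the file
            after = view[:3]                  # the reader looks at the same pages
            view.close()
        finally:
            os.close(reader_fd)
    finally:
        os.close(writer_fd)
        os.unlink(path)
    return before, after
```

The reader never copies the file into its own buffer, so there is no second cache to go stale. It does not solve the problem of readers seeing a half-finished update.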

I wonder if you could leverage btrfs r/o snapshotting. If it can handle absurdly frequent snapshot creation and destruction efficiently, then you’d ‘just’ need to plumb the VM to recognize each subsequent snapshot as the same filesystem. And handle all the shared memory (1.8yr project in those scare quotes).

About the author

I am Richard W.M. Jones, a computer programmer. I have strong opinions on how we write software, about Reason and the scientific method. Consequently I am an atheist [To nutcases: Please stop emailing me about this, I'm not interested in your views on it] By day I work for Red Hat on all things to do with virtualization. I am a "citizen of the world".

My motto is "often wrong". I don't mind being wrong (I'm often wrong), and I don't mind changing my mind.

This blog is not affiliated with or endorsed by Red Hat and all views are entirely my own.