Multi-Threading with VFS

One of the new features in the BagIt Library (BIL) will be multi-threading of CPU-intensive bag processing operations, such as bag creation and verification. Most modern processors are multi-core, but the current version of the library does not use those extra cores, so bag operations take longer than they should. The new version of BIL should create and verify bags significantly faster than the old one. Of course, as we add CPUs, we shift the bottleneck to the hard disk and IO bus, but it’s an improvement nonetheless.

Writing proper multi-threaded code is a tricky proposition, though. Threading is a notorious minefield of subtle errors and difficult-to-reproduce bugs. When we turned on multi-threading in our tests, we ran into some interesting issues with the Apache Commons VFS library we use to keep track of file locations. It turns out that VFS is not really designed to be thread-safe. Some recent mailing-list traffic seems to indicate that this might be fixed sometime in the future, but it’s certainly not the case now.

Now, we don’t want to lose VFS – it’s a huge boon. Its support for various serialization formats and virtual files makes modeling serialized and holey bags a lot easier. So we had to figure out how to make VFS work cleanly across multiple threads.

The FileSystemManager is the root of one’s access to the VFS API. It does a lot of caching internally, and the child objects returned by its methods often hold links back to each other via the FileSystemManager, so sharing any of them across threads means sharing that internal state too. If you can give each thread its own FileSystemManager object, and keep everything obtained from it on that thread, then you should be good to go.
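
One way to get a manager per thread is a ThreadLocal. The following is only a minimal sketch of that idea, not the BagIt Library's actual code. PerThreadManager, StubManager, and run are hypothetical names invented for illustration, and StubManager merely stands in for a real FileSystemManager so that the example runs without VFS on the classpath. In real code the factory would construct and initialise a VFS manager instead.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

class PerThreadManagerDemo {

    /** Hands each thread its own lazily created manager instance. */
    static final class PerThreadManager<M> {
        private final ThreadLocal<M> local;

        PerThreadManager(Supplier<M> factory) {
            // The factory runs once per thread, on that thread's first get().
            this.local = ThreadLocal.withInitial(factory);
        }

        /** Returns the calling thread's instance, creating it if needed. */
        M get() {
            return local.get();
        }

        /** Drops the calling thread's instance so it can be collected. */
        void release() {
            local.remove();
        }
    }

    /** Hypothetical stand-in for a VFS FileSystemManager (an assumption). */
    static final class StubManager {
        final String owner = Thread.currentThread().getName();
    }

    /** Runs tasks on a fixed pool and collects the distinct managers used. */
    static Set<StubManager> run(int threads, int tasks) throws InterruptedException {
        PerThreadManager<StubManager> managers =
                new PerThreadManager<>(StubManager::new);
        Set<StubManager> seen = ConcurrentHashMap.newKeySet();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < tasks; i++) {
            // Each task only ever touches the manager owned by its own thread.
            pool.execute(() -> seen.add(managers.get()));
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        return seen;
    }

    public static void main(String[] args) throws InterruptedException {
        Set<StubManager> seen = run(4, 100);
        // Never more managers than worker threads, however many tasks ran.
        System.out.println("distinct managers: " + seen.size());
    }
}
```

The point of the design is that no task ever has to ask which manager it may use: whatever get() returns belongs to the current thread. The catch is that objects obtained from one thread's manager must not be handed to another thread, and with long-lived pool threads each manager lives until release() is called on its thread or the thread exits.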