The BeOS file system: an OS geek retrospective

One of the most-hyped features of the erstwhile BeOS was its fully buzzword- …

Journaling

Computers with disk storage can experience many kinds of failure modes. The magnetic material of a disk sector can go bad, the servo mechanism that moves the disk head can fail, the electronics that interface with the computer can die, or the user can reboot the machine in the middle of a disk write operation. If you run a disk long enough, eventually the electronics or mechanics of a disk will fail. You generally don’t have to wait very long for an impatient user to reboot during a file write operation. This is unfortunately a fairly common occurrence and it can have devastating results on a file system.

Consider the following situation. A user has created a file and it is the process of saving it to disk. Maybe it’s a developer working on an overdue project. The file system must make several updates before this file is completely saved. It must first save the contents of the file. It must save the metadata for the file, creation date, owner, file size, etc. It must also update the superblock. Consider what happens when a system powers down before these operations complete. In addition to losing your valuable work, the data structures can become corrupted and point to i-nodes and blocks that don’t exist. In addition to being even more late, your file system may subsequently fail to boot, losing all of your other files.

Journaling, or file system logging, alleviates some of these problems. While journaling doesn’t guarantee that a premature reboot won’t lose your file, it does guarantee that it won’t corrupt your file system.

Consider the following example of how journaling works. Let’s say a user creates a new file and saves it. The data must be saved to disk, and a new directory entry with correct metadata must be saved as well. Before these disk operations occur, BFS locks the unwritten blocks into RAM, and makes log entries in the file system journal for each block about to be written. After each block is successfully flushed to disk, the journal entries are marked as completed. If the system shuts down before the blocks are successfully flushed, the journal entries will list the incomplete operations. When the system restarts, it inspects the journal for entries that are not completed. The remaining entries can be ‘replayed’ when the system reboots to successfully complete the write operation, or the whole thing can be "rolled back," and the operation aborted. Either way, the file system is left in a consistent state where the directory structure and metadata accurately match the file data.

In BFS, the journal logs any changes made to the directory, the bitmap block, i-nodes, and extended attributes. It does not journal the actual user data. In this way, journaling protects file system consistency but does not provide data recovery features like a redundant disk array, or RAID.

While not immune to all disk errors, BFS is resistant to the common failure mode of a premature system shutdown. File systems without journaling, such as Linux EXT2, are vulnerable to file system inconsistencies and rely on lengthy scanning processes for recovery. BFS does not need disk scanning, and Haiku can start up quickly after a premature shutdown.

Improvements of Haiku BFS over BeOS

Haiku’s version of BFS has a number of improvements over the original BeOS BFS implementation. The B+tree is more robust. Haiku BFS uses a file cache for file data in addition to a block cache. This resulted in a factor of 10 speed improvement. Haiku’s BFS implements status changed time for files, and also has more fine-grained file status capability. The POSIX atime file was omitted from BeOS BFS for performance's sake. Haiku BFS includes a query optimized for hybrid regex that allows mixing of a static string with a regular expression. New inspection tools bfsinfo, bfswhich, chkindex, and recover were added for Haiku BFS. The reindex command was added to improve indexing of extended attributes.

A Quick Interview with a Senior Developer at BeOS, who worked on BFS

(He asked that his name not be used, to comply with the wishes of his current employer)

The book Practical File System Design describes the BeOS development environment as being short on time and scarce on developers. What were some of the positive aspects of that environment?

It was just a fun place to work at the time. We all got along really well, loved what we were doing and all fed off of each other's energy. Everyone was top-notch and it was just non-stop. Usually you get one or two of those aspects in a company, but when you have them all it's a very intoxicating environment: change is rapid, progress is good, you feel like you're really doing something important and it's just a lot of fun.

From conception to release, how long did it take to code and debug BFS?

Ten or eleven months. I had help from one other engineer to write the code that handled writing to the indirect blocks in a file.

What tools did you use to develop and debug BFS?

Emacs, make, and gcc. Initially I did some prototyping in user space and thus was able to use gdb but once it went into the kernel it was all "Welcome to Kernel Debugging Land" from there on out.

What was the biggest challenge in developing BFS?

Trying to get it done in such a short timeframe while supporting all the features we wanted.

Did BFS influence the design of any subsequent file systems?

Unlikely.

What are some hot topics of file systems today?

Data integrity and data de-duplication are probably the biggest areas of interest right now. People are also spending a lot of time trying to figure out how to provide reliable storage in the face of unreliable components. RAID-5/6 are OK, but as the size of drives go up, a lot of people are worried that when a failure happens, they're vulnerable to a total failure if another drive fails before they're done with reconstruction.

What new features or activities will the next generation of file systems support?

I'm not sure. I think it will be a little while yet before we sort out which layer is the right place for all the different parts of the problem of "storage." ZFS has gone down the path of putting everything in the FS. Other folks are putting in a different set of things. It's pretty clear that some of the functionality that existed in BFS does *not* belong in the FS.

In your opinion, what science fiction movie has the best use of a computer?

Hmmm, not sure. Star Trek? Their computers do everything and have a multitouch interface and they don't spend a lot of time futzing around with them.

A Quick Interview With Axel Dörfler, Developer of Open Source BFS

Before BFS, did you have previous experience working with file systems?

I had to write an application to recover some data from a BFS partitioned hard disk. That gave me all the knowledge I needed to write BFS.

What was the hardest part of rewriting BFS?

Making sure the B+tree implementation behaved correctly, as the one used in the original BFS (as mentioned in Dominic's book) was quite unstable, and inefficient.

What was the easiest part?

The actual BFS read-only implementation, as I could reuse most of the code I had written for recovery.

Did you find any surprises when rewriting BFS?

Yes, there are quite some illogical things like the log using block_runs, but requires them to have a length of one, or the almost superfluous double indirect block implementation.

Did you ever have any discussions with Dominic about BFS?

Yes, but they didn't really detail its flaws, rather my implementation.

What would you change in BFS 2.0?

A lot.

How long did the coding take?

I honestly don't remember. I think the read-only part took only two weeks or so, while the write part took a lot longer, and it took still longer to make it fully compliant to Be's BFS.

What was the biggest bug and how did you solve it?

The most annoying bug was the "vnode ID already exists" bug - there were like a dozen or so reasons why this one could happen, and that kind of minimized the joy of having found another issue with it (only a part of them were in BFS itself, though).

It's still there in a new incarnation as bug #5262, BTW, although this could have completely unrelated reasons (like memory corruption).

What lessons did you learn from rewriting BFS?

That it's much easier to have a working VFS when writing a file system.

Lastly, what is your current development PC?

Depending on where I am it's either a ThinkPad T60, or two year old Core 2 Quad with onboard Intel graphics.

Acknowledgements

This article relied heavily on the book Practical File System Design with the Be File System. It’s seldom that a file system is documented in such detail and in such a readable fashion. Also thanks to Axel Dörfler for fact checking.