An In-Depth Look at Reiserfs - page 3

Included in the Linux kernel

January 22, 2001

By Scott Courtney

Modern filesystem designs, such as OS/2's HPFS, NT's NTFS, and
Linux's popular ext2, do a very good job of implementing the
things discussed in the previous section. If you have a system
crash, it may take a great deal of time to check the metadata
during bootup, but the odds are good that you will still have
all your files when the check is done. As Linux takes on more
complex applications, runs on larger servers, and faces less
tolerance for downtime, there is a need for more sophisticated
filesystems that do an even better job of protecting data and metadata.
The journaled filesystems now available for Linux are
the answer to this need.

It's important to note here that we are talking about journaled
filesystems in general. There are a number of such systems
available, including "xfs" from Silicon Graphics, "Reiserfs"
from The Naming System Venture, "ext3" currently hosted at
Red Hat, and "Journaled File System" from IBM. In this article,
I use "journaled filesystem" in lower case to mean the generic
type of system, as opposed to the capitalized version that refers
specifically to IBM's software. You can find links to all of
these projects in the references attached to this article.

No matter which journaled filesystem is used, there are certain
principles that always apply. The term "journaled" means that
the filesystem maintains a log or record of what it is doing
to the main data areas of the disk, so that if a crash occurs
it can re-create anything that was lost. That can be a little
confusing, so let's take a closer look at this process.

In public speaking classes, there is an old saying that goes,
"Tell them what you're going to tell them, then tell them,
then tell them what you told them." This is similar to what
the journal does in a filesystem. When the system is about
to alter the metadata, it first makes an entry in the journal
saying, "Here is what I'm going to change." Then it makes the
change. Finally, it goes back to the journal and either marks
that change as "completed" or simply deletes the journal entry
entirely. There are variations on this sequence, and other
ways to accomplish the same thing, but this simplified view
will suffice for our purposes.
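That three-step sequence can be sketched in a few lines of code. The sketch below is purely illustrative, assuming an in-memory dictionary standing in for the disk's metadata area and a Python list standing in for the journal; the class and method names are invented for this example, and no real filesystem works at this level of simplicity.

```python
# A minimal sketch of write-ahead journaling. "JournaledStore" and its
# fields are hypothetical names chosen for illustration only.

class JournaledStore:
    def __init__(self):
        self.disk = {}       # stands in for the main data area: key -> value
        self.journal = []    # stands in for the on-disk journal

    def write(self, key, value):
        # 1. "Tell them what you're going to tell them": log the intended
        #    change, including the old value, before touching the disk.
        entry = {"key": key, "old": self.disk.get(key), "new": value}
        self.journal.append(entry)
        # 2. "Tell them": make the actual change to the main data area.
        self.disk[key] = value
        # 3. "Tell them what you told them": retire the journal entry.
        self.journal.remove(entry)
```

After a successful write, the main data area holds the new value and the journal is empty again; a crash between steps would leave an entry behind for recovery to act on.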

The idea is that the system can crash at any point in this
process, but such a crash won't have a lasting effect.
If the crash happens before the first journal entry, then
the original data is still on the disk. You lost your new
changes, but you didn't lose the file in its previous state.
If the crash happens during the actual disk update, you still
have the journal entry showing what was supposed to have
happened. So when the system reboots, it can simply replay
the journal entries and complete the update that was
interrupted, or it can back out a partially completed
update to restore the file's previous state. In either case,
you have valid data and not a trashed partition.
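The reboot-time recovery pass can be sketched the same way. This is a hypothetical function, assuming each journal entry left over after a crash is a small record holding the key, its old value, its new value, and a flag saying whether the change was marked complete before the crash; real journal formats are far more involved.

```python
# A sketch of crash recovery over leftover journal entries. The entry
# fields ("key", "old", "new", "committed") are invented for this example.

def recover(disk, journal):
    """Replay or back out any journal entries left behind by a crash."""
    for entry in journal:
        if entry.get("committed"):
            # Roll forward: the change was logged as complete, so make
            # sure the main data area actually reflects it.
            disk[entry["key"]] = entry["new"]
        else:
            # Roll back: the update may be partial, so restore the
            # previous, consistent state.
            if entry["old"] is None:
                disk.pop(entry["key"], None)
            else:
                disk[entry["key"]] = entry["old"]
    journal.clear()
    return disk
```

Either branch leaves the data in a consistent state: committed changes are completed, and interrupted ones are undone.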

These concepts will be familiar to anyone who has worked with
transactions in SQL databases. Replaying and completing an
operation that was interrupted is called "roll forward," and
backing out such an operation to restore its previous,
consistent state is called "roll back." Ideas that
were developed to prevent lost data in SQL databases are
also valuable on regular mass storage devices. That is
the real benefit of journaled filesystems.