Writing Stackable Filesystems

Now you can add a feature to your favorite filesystem without rewriting it.

Writing filesystems, or any kernel code, is hard. The kernel is a
complex environment to master, and small mistakes can cause severe
data corruption. Filesystems, however, offer a clean data access
mechanism that is transparent to user applications, which is why
developers so often want to add new features to them. In this
article, we provide a quick introduction so you can add new
functionality to existing filesystems without having to become a
kernel or filesystem expert.

So You Want to Be a Filesystem Developer?

Although Linux supports many filesystems, most of them fall into a
few similar categories, such as disk-based and network-based filesystems.
Making a filesystem stable and efficient takes years of effort, and
once it's stable and working, you don't want to break it by
throwing in new features. Besides, maintainers of filesystems
rarely accept feature-enhancement patches to their stable
filesystems. So, it is no surprise that the most popular
filesystems currently in use have not fundamentally changed in
years.

Suppose you want to write a simple encryption filesystem that
uses a single fixed cipher key to encrypt file data. Getting
portable C code for various ciphers is easy. Next, you have to tie
the calls to encrypt and decrypt data buffers into the filesystem.
Conceptually the problem is simple: encrypt any data that comes
from the write system call before it is written to disk, and
decrypt any data that comes from the disk before it is passed back
to the user process that called the read system call.

Your first inclination might be to copy the 5,000+ lines of
source code for ext2, study it and then add your cipher calls to
it. You should resist the urge to copy a whole other filesystem as
a starting point. Although it's only 5,000+ lines of code, kernel
code is at least an order of magnitude more complex to develop than
user-level code. If you actually end up putting the calls to your
cipher in the right place in this new filesystem, you'll find you
spent most of your time studying it, only to add a small number of
lines in some places. Even so, now you've got yourself a single
encrypting ext2 filesystem. What if you want an encrypting NFS
filesystem or any one of the plethora of other Linux
filesystems?

Incremental Filesystem Development

Linux, like most OSes, separates its filesystem code into two
components: native filesystems (ext2, NFS, etc.) and a
general-purpose layer called the virtual filesystem (VFS). The VFS
is a layer that sits between system call entry points and native
filesystems. The VFS provides a uniform access mechanism to
filesystems without needing to know the details of those
filesystems. When filesystems are initialized in the kernel, they
install a set of function pointers (methods in OO-speak) for the
VFS to use. The VFS, in turn, calls these function pointers
generically, without knowing which specific filesystem the pointers
represent. For example, an unlink system call gets translated into
a service routine sys_unlink, which invokes the vfs_unlink VFS
function, which invokes a filesystem-specific method by using its
installed function pointer: ext2_unlink for ext2, nfs_unlink for
NFS or the appropriate function for other filesystems. Throughout
this article, we refer to the specific filesystem method using
->, as in ->unlink().
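
The dispatch described above can be modeled in ordinary user-space C.
What follows is only a sketch: the structures are heavily simplified
stand-ins for the kernel's real ones, and every name beginning with
demo_ (as well as the handled_by and last_unlinked fields) is
hypothetical and exists only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-ins for kernel objects; NOT the real definitions. */
struct inode;
struct dentry { const char *name; };

/* The table of function pointers ("methods") a filesystem installs. */
struct inode_operations {
    int (*unlink)(struct inode *dir, struct dentry *dentry);
};

struct inode {
    const struct inode_operations *i_op; /* set by the owning filesystem */
    const char *handled_by;              /* demo-only bookkeeping */
    const char *last_unlinked;           /* demo-only bookkeeping */
};

/* Each "native filesystem" supplies its own ->unlink() method. */
static int demo_ext2_unlink(struct inode *dir, struct dentry *dentry)
{
    dir->handled_by = "ext2";
    dir->last_unlinked = dentry->name;
    return 0;
}

static int demo_nfs_unlink(struct inode *dir, struct dentry *dentry)
{
    dir->handled_by = "nfs";
    dir->last_unlinked = dentry->name;
    return 0;
}

static const struct inode_operations demo_ext2_ops = { demo_ext2_unlink };
static const struct inode_operations demo_nfs_ops  = { demo_nfs_unlink };

/*
 * The generic layer: it never names a specific filesystem. It only
 * follows whatever pointer was installed in the directory's inode.
 */
static int demo_vfs_unlink(struct inode *dir, struct dentry *dentry)
{
    assert(dentry != NULL);
    if (dir == NULL || dir->i_op == NULL || dir->i_op->unlink == NULL)
        return -1; /* no method installed for this operation */
    return dir->i_op->unlink(dir, dentry);
}
```

Calling demo_vfs_unlink on an inode whose i_op points at demo_ext2_ops
reaches demo_ext2_unlink, and the same call on an inode using
demo_nfs_ops reaches demo_nfs_unlink; the generic function is identical
in both cases.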

To solve this problem of how to develop our encryption
filesystem quickly, we employ the following adage: “Any problem in
computer science can be solved by adding another level of
indirection.” Luckily, the Linux VFS allows another filesystem to
be inserted right between the VFS and another filesystem. Figure 1
shows such a stackable encryption filesystem called Cryptfs.
Cryptfs is called stackable because it stacks on top of another
filesystem (ext2). Here, the VFS calls Cryptfs' ->write() method
(cryptfs_write); Cryptfs encrypts the user data it receives and
passes it down by calling the ->write() method below
(ext2_write).
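
To make the write path concrete, here is a minimal user-space sketch
of the same idea. It is not Cryptfs code: the single-byte XOR merely
stands in for a real cipher, the read and write signatures are much
simpler than the kernel's, and all demo_ and DEMO_ names (including the
in-memory "disk") are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_KEY       0x5A /* single fixed key; XOR is NOT a real cipher */
#define DEMO_DISK_SIZE 64

struct file;

/* Simplified method table; the kernel's versions take more arguments. */
struct file_operations {
    long (*write)(struct file *f, const char *buf, size_t len);
    long (*read)(struct file *f, char *buf, size_t len);
};

struct file {
    const struct file_operations *f_op;
    void *private_data;        /* upper layer: points at the lower file */
    char disk[DEMO_DISK_SIZE]; /* lower layer: pretend on-disk contents */
    size_t size;
};

static void demo_xor(char *dst, const char *src, size_t len)
{
    for (size_t i = 0; i < len; i++)
        dst[i] = (char)(src[i] ^ DEMO_KEY);
}

/* Lower ("ext2"-like) layer: stores bytes exactly as given. */
static long demo_ext2_write(struct file *f, const char *buf, size_t len)
{
    if (len > DEMO_DISK_SIZE)
        return -1;
    memcpy(f->disk, buf, len);
    f->size = len;
    return (long)len;
}

static long demo_ext2_read(struct file *f, char *buf, size_t len)
{
    size_t n = len < f->size ? len : f->size;
    memcpy(buf, f->disk, n);
    return (long)n;
}

/* Upper (stackable) layer: encrypt, then call the ->write() below. */
static long demo_cryptfs_write(struct file *f, const char *buf, size_t len)
{
    struct file *lower = f->private_data;
    char tmp[DEMO_DISK_SIZE];

    assert(lower != NULL);
    if (len > sizeof(tmp))
        return -1;
    demo_xor(tmp, buf, len);
    return lower->f_op->write(lower, tmp, len);
}

/* Upper layer: call the ->read() below, then decrypt in place. */
static long demo_cryptfs_read(struct file *f, char *buf, size_t len)
{
    struct file *lower = f->private_data;
    long n;

    assert(lower != NULL);
    n = lower->f_op->read(lower, buf, len);
    if (n > 0)
        demo_xor(buf, buf, (size_t)n);
    return n;
}

static const struct file_operations demo_ext2_fops =
    { demo_ext2_write, demo_ext2_read };
static const struct file_operations demo_cryptfs_fops =
    { demo_cryptfs_write, demo_cryptfs_read };
```

Writing through the upper file leaves only transformed bytes in the
lower file's disk buffer, while reading back through the upper file
returns the original data. The lower layer never learns that
encryption is happening.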

Figure 1. An Example Stackable Encryption Filesystem

In general, stackable filesystems can stand alone and be
mounted on top of any other existing filesystem mountpoint; this
means you only have to develop your (stackable) filesystem once,
and it will work with any other native (low-level) filesystem such
as ext2, NFS, etc. Moreover, as of Linux 2.4.20, stackable
filesystems even can be exported safely (via nfs-utils-1.0 or
newer) to remote NFS clients.

How a Stackable Filesystem Works

The basic function of a stackable filesystem is to pass an
operation and its arguments to the lower-level filesystem. The
following distilled code snippet shows how a stackable null-mode
pass-through filesystem called Wrapfs handles the ->unlink()
operation:
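
The listing below is a user-space model of that operation, written so
it can be compiled outside the kernel; it should not be read as the
Wrapfs source itself. The structures are simplified mock-ups, the
private_data, nr_entries and unlinked fields and the demo_ext2_unlink
function are invented for the illustration, and get_lower_inode and
get_lower_dentry are assumed spellings of the get_lower_* helpers
discussed in the text.

```c
#include <assert.h>
#include <stddef.h>

struct inode;
struct dentry;

struct inode_operations {
    int (*unlink)(struct inode *dir, struct dentry *dentry);
};

/* Mock objects; the real kernel structures are far larger. */
struct inode {
    const struct inode_operations *i_op;
    void *private_data; /* in the upper layer: the lower inode */
    int nr_entries;     /* demo-only: entries left in this directory */
};

struct dentry {
    const char *name;
    void *private_data; /* in the upper layer: the lower dentry */
    int unlinked;       /* demo-only: set once the entry is removed */
};

/* Follow the pointers stored when the upper objects were created. */
static struct inode *get_lower_inode(struct inode *inode)
{
    return inode->private_data;
}

static struct dentry *get_lower_dentry(struct dentry *dentry)
{
    return dentry->private_data;
}

/* Stand-in for the lower filesystem's own ->unlink() method. */
static int demo_ext2_unlink(struct inode *dir, struct dentry *dentry)
{
    if (dentry->unlinked)
        return -1; /* already gone */
    dentry->unlinked = 1;
    dir->nr_entries--;
    return 0;
}

static int wrapfs_unlink(struct inode *dir, struct dentry *dentry)
{
    int err;
    struct inode *lower_dir = get_lower_inode(dir);
    struct dentry *lower_dentry = get_lower_dentry(dentry);

    assert(lower_dir != NULL && lower_dentry != NULL);

    /* code to run before the lower call could go here */
    err = lower_dir->i_op->unlink(lower_dir, lower_dentry);
    /* code to run after the lower call could go here */

    return err;
}

static const struct inode_operations demo_ext2_iops = { demo_ext2_unlink };
static const struct inode_operations wrapfs_iops    = { wrapfs_unlink };
```

Note that wrapfs_unlink never touches the upper objects beyond looking
up their lower counterparts; a real implementation would also have to
keep the upper layer's state, locks and reference counts consistent.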

When the VFS needs to unlink a file in a Wrapfs filesystem,
it calls wrapfs_unlink, passing it the inode of the directory in
which the file to remove resides (dir) and the name of the entry to
remove (encapsulated in dentry).

Every filesystem keeps a set of objects that belong to it,
including inodes, directory entries and open files. When using
stacking, multiple objects represent the same file—only at
different layers. For example, our Cryptfs in Figure 1 may have to
keep a directory entry (dentry) object with the clear-text version
of the filename, while ext2 will keep another dentry with the
ciphertext (encrypted) version of the same name. To be truly
transparent to the VFS and other filesystems, stackable filesystems
keep multiple objects at each level.

This is why the first few actions that wrapfs_unlink takes
are to locate, from the arguments it gets, the inode and dentry
that correspond to the same objects, only at the filesystem mounted
below. These get_lower_* functions essentially follow pointers that
previously were stored in the private fields of Wrapfs' objects
when those objects were created. Once the lower objects are
located, the main magic of stacking takes place. We call the
lower-level filesystem's own ->unlink() method, through the
lower-level directory inode, and pass it the two lower
objects.

Wrapfs is a full-fledged stackable null-layer (or loopback)
filesystem that simply passes all operations and objects
(unmodified) between the VFS and the lower filesystem. Wrapfs
itself, however, is not easy to write for one main reason: it has
to treat the lower filesystem as if it were the VFS, while
appearing to the real Linux VFS as a lower-level filesystem. This
dual role requires careful handling of locks, reference counts,
allocated memory and so on. Luckily, someone already wrote and
maintains Wrapfs. Therefore, Wrapfs serves as an excellent template
for you to modify and add new functionality.
