The “Virtual File System” in Linux

This article outlines the VFS structure and gives an overview of how the Linux kernel accesses its file hierarchy. The information herein refers to Linux 2.0.x (for any x) and 2.1.y (with y up to at least 18).

The main data item in any Unix-like
system is the “file”, and a unique path name identifies each file
within a running system. Every file appears like any other file in
the way it is accessed and modified: the same system calls and the
same user commands apply to every file. This applies independently
of both the physical medium that holds information and the way
information is laid out on the medium. Abstraction from the
physical storage of information is accomplished by dispatching data
transfer to different device drivers. Abstraction from the
information layout is obtained in Linux through the VFS
implementation.

The Unix Way

Linux looks at its file system in the same way Unix
does—adopting the concepts of super block, inode, directory and
file. The tree of files accessible at any time is determined by how
the different parts are assembled, each part being a partition of
the hard drive or other physical storage device that is “mounted”
to the system.

While the reader is assumed to be well acquainted with the
concept of mounting a file system, I'll detail the concepts of
super block, inode, directory and file.

The super block
owes its name to its heritage, from when the first data block of a
disk or partition was used to hold meta information about the
partition itself. The super block is now detached from the concept
of data block, but it still contains information about each mounted
file system. The actual data structure in Linux is called
struct super_block and holds various
housekeeping information, like mount flags, mount time and device
block size. The 2.0 kernel keeps a static array of such structures
to handle up to 64 mounted file systems.
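The bookkeeping can be pictured with a greatly simplified sketch. The field names below are modeled on the real struct super_block in the 2.0 kernel headers, but the structure here is an illustrative miniature, not the kernel's actual declaration (which carries many more fields); the lookup helper is likewise hypothetical:

```c
#include <assert.h>
#include <stddef.h>

#define NR_SUPER 64  /* the 2.0 kernel's fixed limit on mounted file systems */

/* Greatly simplified sketch; the real struct super_block in
 * <linux/fs.h> holds many more fields. */
struct super_block {
    unsigned long s_dev;       /* device holding the file system (0 = free slot) */
    unsigned long s_flags;     /* mount flags, e.g. read-only */
    unsigned long s_time;      /* mount time */
    unsigned int  s_blocksize; /* device block size */
};

struct super_block super_blocks[NR_SUPER];

/* Find an unused slot in the static table, as the kernel
 * does at mount time. */
struct super_block *get_empty_super(void)
{
    for (int i = 0; i < NR_SUPER; i++)
        if (super_blocks[i].s_dev == 0)
            return &super_blocks[i];
    return NULL;               /* table full: no further mounts possible */
}
```

A fixed-size table like this is why the 2.0 kernel cannot exceed 64 simultaneous mounts.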

An inode is
associated with each file. Such an “index node” holds all the
information about a named file except its name and its actual data.
The owner, group, permissions and access times for a file are
stored in its inode, as well as the size of the data it holds, the
number of links and other information. The idea of detaching file
information from file name and data is what allows the
implementation of hard-links—and the use of “dot” and
“dot-dot” notations for directories without any need to treat
them specially. An inode is described in the kernel by a
struct inode.
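The detachment of file information from file names can be observed from user space. The following sketch (file names are arbitrary temporaries chosen for the example) creates a file, gives it a second name with link(2), and checks via stat(2) that both names resolve to one inode whose link count is two:

```c
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Two names, one inode: returns 1 when both names share the
 * inode and its link count is 2, 0 or -1 otherwise. */
int same_inode(void)
{
    const char *a = "/tmp/vfs_demo_a", *b = "/tmp/vfs_demo_b";
    struct stat sa, sb;
    FILE *f = fopen(a, "w");
    if (!f)
        return -1;
    fputs("hello\n", f);
    fclose(f);
    unlink(b);                 /* remove leftovers from a previous run */
    if (link(a, b) != 0)       /* create the hard link: a second name */
        return -1;
    stat(a, &sa);
    stat(b, &sb);
    int ok = (sa.st_ino == sb.st_ino) && (sa.st_nlink == 2);
    unlink(a);
    unlink(b);
    return ok;
}
```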

The directory is a
file that associates inodes to file names. The kernel has no
special data structure to represent a directory, which is treated
like a normal file in most situations. Functions specific to each
file system type are used to read and modify the contents of a
directory independently of the actual layout of its data.

The file itself is
associated with an inode. Usually files are data areas, but they
can also be directories, devices, fifos (first-in-first-out) or
sockets. An “open file” is described in the Linux kernel by a
struct file item; the structure holds a pointer
to the inode representing the file. file
structures are created by system calls like open,
pipe and socket, and are shared by
father and child across fork.
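The sharing of the file structure across fork is visible from user space as a shared file offset: when the child writes, the parent's position in the file moves too, because both processes point to the same struct file. A small sketch (the temporary file name is arbitrary):

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns the parent's file offset after the child has written
 * four bytes through the shared struct file. */
long shared_offset(void)
{
    int fd = open("/tmp/vfs_fork_demo", O_CREAT | O_TRUNC | O_RDWR, 0644);
    if (fd < 0)
        return -1;
    pid_t pid = fork();
    if (pid == 0) {                 /* child: advance the shared offset */
        write(fd, "abcd", 4);
        _exit(0);
    }
    waitpid(pid, NULL, 0);          /* parent: wait for the child */
    long off = lseek(fd, 0, SEEK_CUR);  /* 4: the child's write moved it */
    close(fd);
    unlink("/tmp/vfs_fork_demo");
    return off;
}
```

Had the parent and child each owned a private file structure, the parent's offset would still be zero.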

Object Orientedness

While the previous list describes the theoretical
organization of information, an operating system must be able to
deal with different ways to lay out information on disk. While it is
theoretically possible to look for an optimum layout of information
on disks and use it for every disk partition, most computer users
need to access all of their hard drives without reformatting, to
mount NFS volumes across the network, and sometimes even to access
those funny CD-ROMs and floppy disks whose file names can't exceed
8+3 characters.

The problem of handling different data formats in a
transparent way has been addressed by making super blocks, inodes
and files into “objects”; an object declares a set of operations
that must be used to deal with it. This keeps the kernel out of
big switch statements over the
different physical layouts of data, and new file system types can
be added and removed at run time.

The entire VFS idea, therefore, is implemented around sets of
operations to act on the objects. Each object includes a structure
declaring its own operations, and most operations receive a pointer
to the “self” object as the first argument, thus allowing
modification of the object itself.
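The pattern can be sketched in a few lines of plain C. The names below are hypothetical, chosen only to mirror the shape of the kernel's operations structures: an object carries a pointer to a table of function pointers, and each operation receives the "self" object as its first argument, so generic code never switches on the file system type:

```c
#include <assert.h>

struct inode;                        /* forward declaration */

/* A miniature "operations" table, in the style of inode_operations. */
struct inode_ops {
    long (*get_size)(struct inode *self);
};

struct inode {
    long size;
    const struct inode_ops *ops;     /* per-file-system-type behaviour */
};

/* One possible implementation, as an ext2-like module might supply. */
long ext2ish_get_size(struct inode *self)
{
    return self->size;
}

const struct inode_ops ext2ish_ops = { ext2ish_get_size };

/* Generic "kernel" code: dispatches through the table, with no
 * knowledge of which file system type is behind the inode. */
long vfs_get_size(struct inode *i)
{
    return i->ops->get_size(i);
}
```

A second file system type would simply supply its own table; vfs_get_size needs no change.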

All the data handling and buffering performed by the Linux
kernel is independent of the actual format of the stored data.
Every communication with the storage medium passes through one of
the operations structures. The file system type,
then, is the software module which is in charge of mapping the
operations to the actual storage mechanism—either a block device,
a network connection (NFS) or virtually any other means of storing
and retrieving data. These modules can either be linked to the
kernel being booted or compiled as loadable modules.

The current implementation of Linux allows use of loadable
modules for all file system types but root (the root file system
must be mounted before loading a module from it). Actually, the
initrd machinery allows loading of a module
before mounting the root file system, but this technique is usually
exploited only on installation floppies.

In this article I use the phrase “file system module” to
refer either to a loadable module or a file system decoder linked
to the kernel.

Here, in summary, is how file handling happens for any
given file system type, as depicted in Figure 1:

struct file_system_type is a
structure that declares only its own name and a
read_super function. At mount
time, the function is passed information about the storage medium
being mounted and is asked to fill a super block structure and to
load the inode of the root directory of the file system into
sb->s_mounted (where sb is
the super block just filled). The additional field
requires_dev is used by the file system type to
state whether it will access a block device: for example, the NFS
and proc types don't require a device, while
ext2 and iso9660 do. After
the super block is filled, struct
file_system_type is not used any more; only the super
block just filled will hold a pointer to it in order to be able to
give back status information to the user
(/proc/mounts is an example of such
information). The structure is shown in Listing 1.
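The registration and lookup machinery can be modeled loosely in user space. The structure below follows the four fields the 2.0 struct file_system_type declares, while the linked list and the helper functions are a simplified sketch of what register_filesystem and the mount path do, not the kernel's exact code:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct super_block;                  /* opaque for this sketch */

/* Modeled on the 2.0 declaration: a name, a read_super hook,
 * a requires_dev flag and a link to the next registered type. */
struct file_system_type {
    struct super_block *(*read_super)(struct super_block *sb,
                                      void *data, int silent);
    const char *name;
    int requires_dev;
    struct file_system_type *next;
};

struct file_system_type *file_systems;   /* head of the registered list */

/* Add a type to the list, as a module does when it is loaded. */
int register_filesystem(struct file_system_type *fs)
{
    fs->next = file_systems;
    file_systems = fs;
    return 0;
}

/* The first thing mount(2) needs: find a type by its name. */
struct file_system_type *get_fs_type(const char *name)
{
    for (struct file_system_type *fs = file_systems; fs; fs = fs->next)
        if (strcmp(fs->name, name) == 0)
            return fs;
    return NULL;
}
```

Once the matching type is found, its read_super is invoked to fill the super block, and the structure itself is no longer consulted except for status reporting.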

The super_operations structure
is used by the kernel to read and write inodes, write super block
information back to disk and collect statistics (to deal with the
statfs and fstatfs system
calls). When a file system is eventually unmounted, the
put_super operation is called—in standard
kernel wording “get” means “allocate and fill”, “read” means
“fill” and “put” means “release”. The
super_operations declared by each file system
type are shown in Listing 2.

After a memory copy of the inode has been created,
the kernel will act on it using its own operations. struct
inode_operations is the second set of operations declared
by file system modules; these operations deal mainly with
the directory tree. Directory-handling operations are part of the
inode operations because a separate
dir_operations structure would bring extra
conditionals into file system access. Instead, inode operations that
only make sense for directories do their own error checking.
The first field of the inode operations defines the file operations
for regular files; if the inode is a fifo, a socket or a
device, specific file operations will be used instead. Inode operations
appear in Listing 3; note that the definition of
rename was changed in release 2.0.1.

The file_operations, finally,
specify how data in the actual file is handled: the operations
implement the low-level details of read, write,
lseek and the other data-handling system calls. Since the
same file_operations structure is used to act on
devices, it also includes some fields that only make sense for
character or block devices. It's interesting to note that the
structure shown here is the structure declared in the 2.0 kernels,
while 2.1 changed the prototypes of read, write
and lseek to allow a wider range of file
offsets. The file operations (as of 2.0) are shown in Listing
4.
