Search

Trees in the Reiser4 Filesystem, Part I

One way of organizing information is to
put it into trees. When we organize information in a computer, we
typically sort it into piles called nodes, with a name (pointer)
for each pile. Some of the nodes contain pointers, and we can look
through the nodes to find those pointers to (usually other) nodes.

We are interested particularly in how to organize so we
actually can find things when we search for them. A tree is an
organizational structure that has some useful properties for that
purpose. We define a tree as follows:

A tree is a set of nodes organized into a root node
and zero or more additional sets of nodes, called subtrees.

Each of the subtrees is a tree.

No node in the tree points to the root node, and
exactly one pointer from a node in the tree points to each non-root
node in the tree.

The root node has a pointer to each of its
subtrees, that is, a pointer to the root node of the
subtree.

Note that a single node with no pointers is a tree, because
it is the root node. Also, a tree can be a linear tree of nodes
without branches.

Figure 1. Write Performance (30 Copies Linux Kernel 2.4.18)

Figure 2. Read Performance (Reading Linux Kernel 2.4.18 Source
Code)

Fine Points of the Definition

It is interesting to argue about whether finite should be a
part of the definition of trees. There are many ways of defining
trees, and which is the best definition depends on your purpose.
Professor Donald Knuth supplies several definitions of tree. As his
primary definition of tree he even supplies one which has no
pointers, edges or lines in the definition, only sets of
nodes.

Knuth defines trees as being finite sets of nodes. There are
papers on inifinte trees on the Internet. I think it more
appropriate to consider finite an additional qualifier on trees,
rather than bundling finite into the definition. However, I
personally only deal with finite trees in my research.

Edge is a term often used in tree definitions. A pointer is
unidirectional, meaning you can follow it from the node that has it
to the node it points to, but you cannot follow it back from the
node it points to, to the node that has it. An edge, however, is
bidirectional, meaning you can follow it in both directions.

Here are three alternative tree definitions, which are
interesting in how they are mathematically equivalent to each
other. They are not equivalent to the definition I supplied,
because edges are not equivalent to pointers. For all three of
these definitions, let there be no more than one edge connecting
the same two nodes:

a set of vertices (aka points) connected by edges
(aka lines) for which the number of edges is one less than the
number of vertices;

a set of vertices connected by edges that have no
cycles (a cycle is a path from a vertex to itself);

or a set of vertices connected by edges for which
there is exactly one path connecting any two vertices.

These three alternative definitions do not have a unique root
in their tree, and such trees are called free trees.

The definition I supplied is a definition of a rooted tree,
not a free tree. It also has no cycles, it has one less pointer
than it has nodes, and there is exactly one path from the root to
any node.

Graphs vs. Trees

Consider the purposes for which you might want to use a graph
and those for which you might want to use a tree. In a tree there
is exactly one path from the root to each node in the tree, and a
tree has the minimum number of pointers sufficient to connect all
the nodes. This makes it a simple and efficient structure. Trees
are useful when efficiency with minimal complexity is desired and
when there is no need to reach a node by more than one
route.

Reiser4 has both graphs and trees, with trees used when the
filesystem chooses the organization (in what we call the storage
layer, which tries to be simple and efficient) and graphs when the
user chooses the organization (in the semantic layer, which tries
to be expressive so that the user can do whatever he or she
wants).

Keys

We assign a key to everything stored in the tree. We find
things by their keys, and using them gives us additional
flexibility in how we sort things. If the keys are small, we have a
compact means of specifying enough information to find the thing.
It also limits what information we can use for finding
things.

This limit restricts the key's usefulness, and so we have a
storage layer, which finds things by keys, and a semantic layer,
which has a rich naming system (described in Part II of this
article). The storage layer chooses keys for things solely to
organize storage in a way that improves performance, and the
semantic layer understands names that have meaning to users. As you
read, you might want to think about whether this is a useful
separation that allows the freedom to add improvements that aid
performance in the storage layer, while escaping paying a price for
the side effects of those improvements on the flexible naming
objectives of the semantic layer.

Choosing Which Subtree

We start our search at the root, because from the root we can
reach every other node. But, how do we choose which subtree of the
root to go to from the root? The root contains pointers to its
subtrees. For each pointer to a subtree there is a corresponding
left delimiting key. Pointers to subtrees, and the subtrees
themselves, are ordered by their left delimiting key. A subtree
pointer's left delimiting key is equal to the least key of the
things in the subtree. Its right delimiting key is larger than the
largest key in the subtree, and it is the left delimiting key of
the next subtree of this node.

Each subtree contains only things whose keys are at least
equal to the left delimiting key of its pointer and are not more
than its right delimiting key. If there are no duplicate keys in
the tree, then each subtree contains only things whose keys are
less than its right delimiting key. If there are no duplicate keys,
then by looking within a node at its pointers to subtrees and their
delimiting keys, we know which subtree of that node contains the
thing we want.

Duplicate keys are a topic for another time. For now I will
only hint that when searching through objects with duplicate keys
we find the first of them in the tree. Then we search through all
the duplicates, one by one, until we find what we are looking for.
Allowing duplicate keys can allow for smaller keys, so there is
sometimes a trade-off between key size and the average frequency of
such inefficient linear searches.

The contents of each node in the tree are sorted within the
node. Therefore, the entire tree is sorted by key, and for a given
key we know exactly where to go to find at least one thing with
that key.

Node Size

We choose to make the nodes equal in size, so it is easier to
allocate the unused space between nodes, because it will be of a
size equal to some multiple of the size of a node. This way, there
aren't any problems of space being free but not large enough to
store a node. Also, disk drives have an interface that assumes
equal size blocks, which they find convenient for their
error-correction algorithms.

If having the nodes be equal in size is not very important,
perhaps because the tree fits into RAM, then using a class of
algorithms called skip lists is worth considering.

Reiser4 nodes are usually equal to the size of a page, which
if you use Linux on an Intel CPU is 4,096 (4k) bytes. There is no
measured empirical reason to think this size is better than others;
it is simply the one that Linux makes easiest and cleanest to
program. Quite honestly, we have been too busy to experiment with
other sizes.

Sharing Blocks Saves Space

If nodes are of equal size, how do we store large objects? We
chop them into pieces, and we call these pieces items. An item is
sized to fit within a single node.

Conventional filesystems store files in whole blocks. Roughly
speaking, this means that, on average, half a block of space is
wasted per file because not all of the last block of the file is
used. If a file is much smaller than a block, the space wasted is a
large percentage of the total file size. It is not effective to
store such typical database objects as addresses and phone numbers
in separately named files in a conventional filesystem because more
than 90% of the space in the storage blocks is wasted. By putting
multiple items within a single node in Reiser4, we are able to pack
multiple small pieces of files into one block. Our space efficiency
is roughly 94% for small files. This does not count
per-item-formatting overhead, whose percentage of total space
consumed depends on average item size and, for that reason, is hard
to quantify.

Aligning files to 4k boundaries does have advantages for
large files though. When a program wants to operate directly on
file data without going through system calls to do it, it can use
mmap() to make the file data part of the process' directly
accessible address space. Due to some implementation details,
mmap() needs file data to be 4k-aligned. If the data is already
4k-aligned, it makes mmap() much more efficient. In Reiser4 the
current default is that files larger than 16k are 4k-aligned. We
don't yet have enough empirical data and experience to know whether
16k is the optimal default value for this cutoff point.

Leaves, Twigs and Branches

Continuing with our tree metaphor, leaves are nodes that have
no children. Internal nodes are nodes that have children.

Figure 3. A Height = 4, Four-Level, Fanout = 3, Balanced Tree

A search starts with the root node, traverses two more
internal nodes and ends with a leaf node, which is a node that
holds the data and has no children.

An item is a data container that is contained entirely within
a single node. A node that contains items is called a formatted
node. When we store objects in the tree, we put their various parts
into items and unformatted leaves. Unformatted leaves (unfleaves)
are leaves that contain only data and do not contain any formatting
information. Only leaves can contain unformatted data. Pointers are
stored in items, and so all internal nodes are necessarily
formatted nodes.

Pointers to unfleaves are different in their structure from
pointers to formatted nodes. To clarify, extent pointers point to
unfleaves. An extent is a sequence of contiguous, in block number
order, unfleaves that belong to the same object. An extent pointer
contains the starting block number of the extent and a length.
Because the extent belongs to only one object, we can store only
one key for the extent, and we can calculate the key of any byte
within that extent. If the extent is at least two blocks long,
extent pointers are more compact than regular node pointers.

Node pointers are pointers to formatted nodes. We do not yet
have a compressed version of node pointers, but they should be
available soon. Notice that with extent pointers we don't have to
store the delimiting key of each node pointed to, while we need to
do this when using node pointers. We will probably introduce key
compression at the same time we add compressed node pointers. One
would expect keys to compress well because they are sorted into
ascending order. We expect our node and item plugin infrastructure
will make such features easy to add at a later date.

Twigs are parents of leaves, and extent pointers exist only
in twigs. (This is a very controversial design decision I will
discuss in Part II of this article.) Branches are internal nodes
that are not twigs.

You might think we would number the root level 1, but since
the tree grows at the top, it turns out to be more useful for
number 1 to be the level with the leaves where object data is
stored. The height of the tree depends upon how many objects we
have to store and what the fanout rate (average number of children)
of the internal and twig nodes will be.

Figure 4. A Reiser4 Tree, Including the Internal Nodes Called Twig
Nodes

Types of Items

Reiser4 includes many different kinds of items designed to
hold different types of information:

static_stat_data: holds the owner, permissions,
last access time, creation time, last modification time, size and
the number of links (names) to the file.

cmpnd_dir_item: holds directory entries and the
keys of the files they link to.

extent pointers: explained above.

node pointers: explained above.

bodies: hold parts of files not large enough to be
stored in unfleaves.

Units

We call a unit that which we must place as a whole into an
item, without splitting it across multiple items. When traversing
an item's contents, it is often convenient to do so in
units:

For body items the units are bytes.

For directory items the units are directory
entries. The directory entries contain a name and a key of the file
named (in practice the name and key may be compressed).

For extent items the units are extents. Extent
items contain only extents from the same file.

For static_stat_data the whole stat data item is
one indivisible unit of fixed size.

Figure 5. What Node Formats Look
Like

Conclusion

I have explained the basic structures of the Reiser4 tree,
but the fun stuff is yet to come. I have not yet explained how
other researchers structure their trees. Nor did you learn why
object contents are stored at the bottom of the tree, why high
fanout is important or what are the different kinds of balancing.
No hint have I yet given as to why balanced trees are better and
dancing trees are best. What I have most especially not done is
explain how a subtle and controversial tree structure change, which
you can see in the trees depicted in this article, doubled Reiser4
read speed compared to Reiser3. This will (space permitting) be in
Part II in next month's issue of Linux Journal.

Hans Reiser
(reiser@namesys.com)
entered UC Berkeley in 1979 after completing the eigth grade and
majored in “Systematizing”, an individual major based on the
study of how theoretical models are developed. His senior thesis
discussed how the philosophy of the hard sciences differs from that
of computer science, with the development of a naming system as a
case study. He is still implementing that naming system, of which
Reiser4 is the storage layer. In 1993 he went to Russia and hired a
team of programmers to develop ReiserFS. He worked full-time to pay
their salaries while spending nights and weekends arguing over
algorithms. In 1999 it began to work well enough that his mother
stopped suggesting a salaried job at a nice big company.