4. File Operations

We now discuss the handling of file system operations in fast indexing as
well as specific optimizations for common operations. Note that most of
this design relates to performance optimizations while a small part
(Section 4.4) addresses correctness.

Because the cost of SCAs can be high, it is important to
ensure that we minimize the number of times we invoke these algorithms and
the number of bytes they have to process each time. The way we store and
access encoded data chunks can affect this performance as well as the types
and frequencies of file operations. As a result, fast indexing takes into
account the fact that file accesses follow several patterns:

The most popular file system operation is stat(2), which results
in a
file lookup. Lookups account for 40-50% of all file system operations
[17,20].

Most files are read, not written. The ratio of reads to writes is
often 4-6 [17,20]. For example, compilers and editors
read in many header and configuration files, but only write out a handful of
files.

Files that are written are often written from beginning to end.
Compilers, user tools such as cp, and editors such as emacs
write whole files in this way. Furthermore, the unit of writing is usually
set to match the system page size. We have verified this by running a set
of common edit and build tools on Linux and recorded the write start
offsets, size of write buffers, and the current size of the file.

Files that are not written from beginning to end are often appended
to. The number of appended bytes is usually small. This is true for various
log files that reside in /var/log as well as Web server logs.

Very few files are written in the middle. This happens in two
cases. First, when the GNU ld creates large binaries, it
writes a sparse file of the target size and then seeks and writes the
rest of the file in a non-sequential manner. To estimate the frequency of
writes in the middle, we instrumented a null-layer file system with a few
counters. We then measured the number and type of writes for our large
compile benchmark (Section 7.3.1). We counted 9193
writes, of which 58 (0.6%) were writes before the end of a file.

Second, data-base files are also written in the middle. We surveyed our own
site's file servers and workstations (several hundred hosts totaling over
1TB of storage) and found that these files represented less than 0.015% of
all storage. Of those, only 2.4% were modified in the past 30 days, and
only 3% were larger than 100MB.

All other operations (together) account for a small fraction of file
operations [17,20].

We designed our system to optimize performance for the more common and
important cases while not harming performance unduly when the
seldom-executed cases occur. We first describe how the index file is
designed to support fast lookups, file reads, and whole file writes, which
together are the most common basic file operations. We then discuss support
for appending to files efficiently, handling the less common operation of
writes in the middle of files, and ensuring correctness for the infrequent
use of truncate.

4.1 Basic Operations

To handle file lookups fast, we store the original file's size in the index
table. Due to locality in the creation of the index file, we assume that
its name will be found in the same directory block as the original file
name, and that the inode for the index file will be found in the same inode
block as the encoded data file. Therefore reading the index file requires
reading one additional inode and often only one data block. After the index
file is read into memory, returning the file size is done by copying the
information from the index table into the ``size'' field in the current
inode structure. Remaining attributes of the original file come from the
inode of the actual encoded file. Once we read the index table into memory,
we allow the system to cache its data for as long as possible. That way,
subsequent lookups will find files' attributes in the attribute cache.

Since most file systems are structured and implemented internally for access
and caching of whole pages, we also encode the original data file in whole
pages. This improved our performance and helped simplify our code because
interfacing with the VFS and the page cache was more natural. For file
reads, the cost of reading in a data page is fixed: a fixed offset lookup
into the index table gives us the offsets of encoded data on the lower level
data file; we read this encoded sequence of bytes, decode it into exactly
one page, and return that decoded page to the user.

Using entire pages made it easier for us to write whole files, especially if
the write unit was one page size. In the case of whole file writes, we
simply encode each page size unit, add it to the lower level encoded file,
and add one more entry to the index table. We discuss the cases of file
appends and writes in the middle in Sections
4.2 and 4.3,
respectively.

We did not have to design anything special for handling all other file
operations. We simply treat the index file at the same time we manipulate
the corresponding encoded data file. An index file is created only for
regular files; we do not have to worry about symbolic links because the VFS
will only call our file system to open a regular file. When a file is
hard-linked, we also hard-link the index file using the name of the new
link with a the .idx extension added. When a file is removed from a
directory or renamed, we apply the same operation to the corresponding index
file.

4.2 Fast Tails

One common usage pattern of files is to append to them. Often, a small
number of bytes is appended to an existing file. Encoding algorithms such
as compression and encryption are more efficient when they encode larger
chunks of data. Therefore it is better to encode a larger number of bytes
together. Our design calls for encoding whole pages whenever possible.
Table 1 and Figure 3 show that only
the last page in the original file may be incomplete and that incomplete
page gets encoded too. If we append, say, 10 more bytes to the original
(upper) file of Figure 3, we have to keep it and the
index file consistent: we must read the 1020 bytes from 20480 until
21500, decode them, add the 10 new bytes, encode the new 1030 sequence of
bytes, and write it out in place of the older 1020 bytes in the lower file.
We also have to update the index table in two places: the total size of the
original file is now 21510, and word number 8 in the index file may be in a
different location than 10120 (depending on the encoding algorithm, it may
be greater, smaller, or even the same).

The need to read, decode, append, and re-encode a chunk of bytes for each
append grows worse as the number of bytes to append is small while the
number of encoded bytes is closer to one full page. In the worst case, this
method yields a complexity of O(n2) in the number of bytes that have to
be decoded and encoded, multiplied by the cost of the encoding and decoding
of the SCA. To solve this problem, we added a fast tails runtime
mount option that allows for up to a page size worth of unencoded data to be
added to an otherwise encoded data file. This is shown in the example in
Figure 4.

Figure:Size-changed file structure with fast-tail optimization.
A file system similar to Figure 3, only here we store up
to one page full of unencoded raw data. When enough raw data is collected
to fill a whole fast-tail page, that page is encoded.

In this example, the last full page that was encoded is page 4. Its data
bytes end on the encoded data file at offset 10000 (page 2). The last page
of the original upper file contains 1020 bytes (21500 less 20K). So we
store these 1020 bytes directly at the end of the encoded file, after offset
10000. To aid in computing the size of the fast tail, we add two more bytes
to the end of the file past the fast tail itself, listing the length of the
fast tail. (Two bytes is enough to list this length since typical page
sizes are less than 216 bytes long.) The final size of the encoded
file is now 11022 bytes long.

With fast tails, the index file does not record the offset of the last tail
as can be seen from the right-most column of Table 1.
The index file, however, does record in its flags field (first 12 bits of
the first word) that a fast tail is in use. We put that flag in the index
table to speed up the computations that depend on the presence of fast
tails. We append the length of the fast tail to the encoded data file to
aid in reconstruction of a potentially lost index file, as described in
Section 5.

When fast tails are in use, appending a small number of bytes to an existing
file does not require data encoding or decoding, which can speed up the
append operation considerably. When the size of the fast tail exceeds one
page, we encode the first page worth of bytes, and start a new fast tail.

Fast tails, however, may not be desirable all the time exactly because they
store unencoded bytes in the encoded file. If the SCA used is an encryption
one, it is insecure to expose plaintext bytes at the end of the ciphertext
file. For this reason, fast tails is a runtime global mount option that
affects the whole file system mounted with it. The option is global because
typically users wish to change the overall behavior of the file system with
respect to this feature, not on a per-file basis.

4.3 Write in the Middle

User processes can write any number of bytes in the middle of an existing
file. With our system, whole pages are encoded and stored in a lower level
file as individual encoded chunks. A new set of bytes written in the middle
of the file may encode to a different number of bytes in the lower level
file. If the number of new encoded bytes is greater than the old number, we
shift the remaining encoded file outward to make room for the new
bytes. If the number of bytes is smaller, we shift the remaining
encoded file inward to cover unused space. In addition, we adjust
the index table for each encoded data chunk which was shifted. We perform
shift operations as soon as our file system's write operation is invoked, to
ensure sequential data consistency of the file.

To improve performance, we shift data pages in memory and keep them in the
cache as long as possible: we avoid flushing those data pages to disk and
let the system decide when it wants to do so. That way, subsequent
write-in-the-middle operations that may result in additional inward or
outward shifts will only have to manipulate data pages already cached and in
memory. Any data page shifted is marked as dirty, and we let the
paging system flush it to disk when it sees fit.

Note that data that is shifted in the lower level file does not have to be
re-encoded. This is because that data still represents the actual encoded
chunks that decode into their respective pages in the upper file. The only
thing remaining is to change the end offsets for each shifted encoded chunk
in the index file.

We examined several performance optimization alternatives that would have
encoded the information about inward or outward shifts in the index table or
possibly in some of the shifted data. We rejected them for several reasons:
(1) it would have complicated the code considerably, (2) it would have made
recovery of an index file difficult, and (3) it would have resulted in
fragmented data files that would have required a defragmentation procedure.
Since the number of writes in the middle we measured was so small (0.6% of
all writes), we do consider our simplified design as a good cost vs. performance balance. Even with our simplified solution, our file
systems work perfectly correctly. Section 7.3.2 shows
the benchmarks we ran to test writes in the middle and demonstrates that our
solution produces good overall performance.

4.4 Truncate

One design issue we faced was with the truncate(2) system call.
Although
this call occurs less than 0.02% of the time [17,20],
we still had to ensure that it behaved correctly. Truncate can be used to
shrink a file as well as enlarge it, potentially making it sparse with new
``holes.'' We dealt with four cases:

1.

Truncating on a page boundary. In this case, we truncate the encoded
file exactly after the end of the chunk that now represents the last page of
the upper file. We update the index table accordingly: it has fewer pages
in it.

2.

Truncating in the middle of an existing page. This case results in a
partial page: we read and decode the whole page and re-encode the bytes
within the page representing the part before the truncation point. We
update the index table accordingly: it now has fewer pages in it.

3.

Truncating in the middle of a fast tail. In that case we just
truncate the lower file where the fast tail is actually located. We then
update the size of the fast tail at its end and update the index file to
indicate the (now) smaller size of the original file.

4.

Truncating past the end of the file is akin to extending the size of
the file and possibly creating zero-filled holes. We read and re-encode any
partially filled page or fast tail that used to be at the end of the file
before the truncation; we have to do that because that page now contains a
mix of non-zero data and zeroed data. We encode all subsequent zero-filled
pages. This is important for some applications such as encryption, where
every bit of data--zeros or otherwise--should be encrypted.