A Response to Ousterhout's Critique of LFS Measurements

Ousterhout references "a number of flaws in the earlier paper"
(the 1993 Usenix paper A
Log-Structured File System for UNIX), but does not
detail these flaws.
To date, we are aware of no technical inaccuracies in the paper.

Ousterhout cannot explain the measurements in Section 4, stating
that they are inconsistent with the simulated results in the
Rosenblum/Ousterhout paper that appeared in the February 1992 ACM
Transactions on Computer Systems. The simulated results in
the TOCS paper are for a different algorithm and different parameters
from those used by either Sprite-LFS or BSD-LFS. This is explained
in more detail below.

The optimization suggested by Ousterhout to improve transaction
processing performance has not yet been implemented. This optimization
affects the ability of the file system buffer cache to cache dirty data,
and the evaluation of its impact is beyond the scope of the 1995 Usenix paper.
We agree that this work will be interesting and encourage Ousterhout to
pursue this research avenue.

Ousterhout claims that the results in Section 5 are invalid.
We present additional data below demonstrating that the results in
Section 5 are indicative of real-world performance.

A Clarification of the 1993 Usenix Paper

There is one graph that, while not incorrect, is only partially explained
in the paper. Figure
9 on page 320 depicts the write performance of BSD-LFS, BSD-FFS, and the
raw disk on a benchmark that synchronously writes data to disk.
In the paper, the performance gap between LFS and FFS is attributed to the
fact that FFS begins asynchronously writing blocks to disk as soon as they
are filled, while LFS delays writes until approximately 800 KB of data have
accumulated.
This phenomenon does account for part of the performance gap, but
a second phenomenon also contributes. The maximum transfer unit between
memory and the disk is 56 KB. FFS places a rotational delay (rotdelay)
between 56 KB units; LFS does not. Because LFS allocates these units
contiguously, the start of the next unit has already passed under the head
by the time the next transfer is issued. Therefore, when data is being
written to disk, LFS loses a full rotation between 56 KB writes while FFS
loses only a rotdelay (approximately 1/4 revolution).
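
To give a feel for the magnitude of this effect, the following sketch
computes effective sequential-write bandwidth under a full-rotation loss
versus a quarter-rotation loss per 56 KB transfer. The spindle speed and
media rate are hypothetical round numbers of our choosing; neither appears
in the paper.

    /* Rough model of the per-transfer rotational penalty described
     * above. All disk parameters are hypothetical, for illustration. */
    #include <stdio.h>

    int main(void) {
        const double rpm        = 5400.0;            /* hypothetical spindle speed */
        const double rev_s      = 60.0 / rpm;        /* seconds per revolution */
        const double media_rate = 2.0 * 1024 * 1024; /* hypothetical media rate, bytes/s */
        const double xfer_bytes = 56.0 * 1024;       /* 56 KB maximum transfer unit */
        const double xfer_s     = xfer_bytes / media_rate;

        /* LFS loses a full revolution between 56 KB writes; FFS loses
         * only a rotdelay of roughly one quarter revolution. */
        double lfs_bw = xfer_bytes / (xfer_s + rev_s);
        double ffs_bw = xfer_bytes / (xfer_s + rev_s / 4.0);

        printf("LFS effective write bandwidth: %4.0f KB/s\n", lfs_bw / 1024);
        printf("FFS effective write bandwidth: %4.0f KB/s\n", ffs_bw / 1024);
        return 0;
    }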

Section 4 Cleaner Measurements

Ousterhout's questions about the measurements in Section 4 stem from the
following misinterpretation of the results presented in the
Rosenblum/Ousterhout paper in ACM Transactions on Computer Systems,
February 1992 (hereafter referred to as the TOCS paper).

The simulated results presented in Figure 4 of the TOCS paper
are for the greedy cleaning algorithm. This is not
the cleaning algorithm implemented in either Sprite-LFS or BSD-LFS, and
therefore the results of the simulation do not apply.

The results presented in the TOCS paper are for a single file system
size, segment size, and cleaning cluster size (the number of segments cleaned
per cleaner invocation). None of these parameters is the same for the system
measured in the 1995 Usenix paper or for the Sprite-LFS system measured in
the TOCS paper. All of these parameters affect performance.

The cost-benefit calculation discussed in the text of the TOCS
paper differs from that used in the simulation (the calculation described
in the text is the one used in both Sprite-LFS and BSD-LFS).

The age metric used in the cost-benefit simulation is different
from the age metric used in either Sprite-LFS or BSD-LFS.

The TOCS paper discusses two cleaning algorithms. The first is the
greedy algorithm, which always cleans the segment(s) with the
lowest utilization. This is the algorithm simulated to produce the results
in Figure 4 of the TOCS paper. The paper then goes on to explain that the
greedy algorithm is sub-optimal for workloads that show skewed distributions.
This observation leads to the second algorithm: cost-benefit. The
cost-benefit algorithm uses a combination of segment utilization and segment
age to select segments for cleaning.
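
For reference, the cost-benefit policy described in the text of the TOCS
paper (the version that, as noted above, is used in both Sprite-LFS and
BSD-LFS) cleans the segment that maximizes ((1 - u) * age) / (1 + u),
where u is the segment's utilization. The sketch below illustrates the
selection; the structure and function names are our own simplifications,
not the actual BSD-LFS code.

    #include <stdio.h>

    /* Cost-benefit selection as described in the text of the TOCS
     * paper: reading a segment costs 1, writing back the live data
     * costs u, and cleaning frees (1 - u) worth of space whose value
     * is weighted by age. The struct is a simplification. */
    struct seg_info {
        double utilization;  /* fraction of live data, 0.0 - 1.0 */
        double age;          /* seconds in BSD-LFS, minutes in Sprite-LFS */
    };

    static double
    cost_benefit(const struct seg_info *s)
    {
        return ((1.0 - s->utilization) * s->age) / (1.0 + s->utilization);
    }

    /* Select the segment with the highest cost-benefit score. */
    static int
    select_segment(const struct seg_info *segs, int nsegs)
    {
        int best = 0;
        for (int i = 1; i < nsegs; i++)
            if (cost_benefit(&segs[i]) > cost_benefit(&segs[best]))
                best = i;
        return best;
    }

    int main(void) {
        struct seg_info segs[] = {
            { 0.20, 100.0 },  /* cold, mostly empty */
            { 0.10,  10.0 },  /* hot, mostly empty  */
        };
        /* Age tips the choice toward the cold segment (index 0). */
        printf("clean segment %d first\n", select_segment(segs, 2));
        return 0;
    }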

What the TOCS paper fails to discuss is the selection of the age metric
in the cost-benefit algorithm. The TOCS paper presents simulated
results of a log-structured file system. An examination of the simulator
that we believe was used to derive these results reveals that the simulated
file system measures age in terms of the number of blocks that have been
written. The calculation of the cost-benefit metric uses the log (base 2)
of the number of blocks written since a segment was created. Therefore, the
age metric grows slowly. For example, on a 1 GB file system with 1 MB
segments and 4 KB blocks (2^18 blocks in all), the maximum age possible
after filling the file system the first time is 18. A segment that survives
another writing of the complete disk (all but one of the 1024 segments)
will have an age of 19.
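
The arithmetic is easy to verify: a 1 GB file system with 4 KB blocks
contains 2^18 blocks, so one complete filling yields an age of
log2(2^18) = 18, and a second complete writing roughly doubles the block
count, yielding 19. A minimal sketch (base 2 is the base that matches the
example in the text):

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        const double fs_bytes    = 1024.0 * 1024 * 1024;   /* 1 GB file system */
        const double block_bytes = 4096.0;                 /* 4 KB blocks */
        const double blocks      = fs_bytes / block_bytes; /* 2^18 = 262144 */

        /* Simulated age metric: log (base 2) of the number of blocks
         * written since the segment was created. */
        printf("age after filling the file system once: %.0f\n", log2(blocks));
        printf("age after a second complete writing:    %.0f\n", log2(2.0 * blocks));
        return 0;
    }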

In contrast to this simulated cleaning algorithm, both Sprite-LFS and
BSD-LFS use elapsed time as the measure of segment age.
The age of a segment is the age of the segment's newest file.
In the cost-benefit calculation, Sprite-LFS expresses the age in minutes
while BSD-LFS expresses the age in seconds.
In both cases, the selection of segments is affected by the write
rate of the application and the number of segments in the file system.
For example, consider the 1 GB file system described above. On BSD-LFS,
the first segment written in a pass over the entire file system will have
an age of 613 seconds for the first pass, growing to 1226 seconds for the
second pass. On Sprite-LFS, the corresponding ages are 10 minutes and 20
minutes.
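
These figures imply a sustained write rate of roughly 1.75 MB/s (writing
the other 1023 one-megabyte segments in 613 seconds); the rate is inferred
from the numbers above rather than stated in the paper. The sketch below
reproduces the ages in both systems' units:

    #include <stdio.h>

    int main(void) {
        const double nsegs      = 1024.0;          /* 1 GB / 1 MB segments */
        const double seg_bytes  = 1024.0 * 1024;   /* 1 MB segment size */
        const double write_rate = 1.75e6;          /* bytes/s; inferred from the
                                                      613-second figure, not
                                                      stated in the paper */

        /* Elapsed time to write the other 1023 segments of one pass. */
        double pass_s = (nsegs - 1) * seg_bytes / write_rate;

        printf("BSD-LFS age (seconds):    pass 1 = %.0f, pass 2 = %.0f\n",
            pass_s, 2.0 * pass_s);
        printf("Sprite-LFS age (minutes): pass 1 = %.0f, pass 2 = %.0f\n",
            pass_s / 60.0, 2.0 * pass_s / 60.0);
        return 0;
    }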

Our measurements show that when the disk is 50% full, the average segment
utilization is 48%, not the 25% or 33% expected by Ousterhout.
In our measurements for TPC/B, we increase the fullness of the file system
by creating a large file that is never accessed. In effect, this reduces
the size of the file system and reduces the impact of the age metric,
because segments are recycled more quickly.
The effects of the age metric were never addressed in the TOCS paper.
These effects are subtle and not yet well-understood.
We believe that these effects warrant further investigation
that is beyond the scope of the 1995 Usenix paper.
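
For concreteness, filling a file system with a never-accessed file is
straightforward to reproduce; the sketch below writes zero-filled blocks
so that disk space is actually allocated. The file name and block size are
illustrative choices, not taken from the paper.

    /* Bring a file system to a target fullness by creating one large,
     * never-accessed file. Names and sizes are illustrative only. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(int argc, char *argv[]) {
        char block[8192] = { 0 };
        long nblocks = (argc > 1) ? atol(argv[1]) : 1;
        int fd = open("fill.dat", O_WRONLY | O_CREAT | O_TRUNC, 0600);
        if (fd < 0) { perror("open"); return 1; }
        /* Write zero-filled blocks so space is actually allocated. */
        for (long i = 0; i < nblocks; i++)
            if (write(fd, block, sizeof(block)) != sizeof(block)) {
                perror("write");
                return 1;
            }
        close(fd);
        return 0;
    }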

Cleaning Optimization

Ousterhout proposes keeping indirect blocks in the cache when a file's
data blocks are written. This is not an unreasonable suggestion.
However, retaining indirect blocks is more complicated than Ousterhout
suggests.

First, Ousterhout ignores the fact that periodically, during checkpoint,
indirect blocks must be written to allow file system recovery. Indirect
blocks can be reconstructed by rolling forward after a crash. However,
there must be a consistent snapshot from which to begin the roll forward
process. When segments are cleaned, they cannot be rolled forward.
Therefore, the only segments eligible for cleaning are those written
before the most recent checkpoint.
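
The constraint can be stated as a simple predicate; the names in the
sketch below are invented for illustration and do not correspond to the
actual BSD-LFS code.

    /* A segment may be cleaned only if it was written before the most
     * recent checkpoint, so a consistent snapshot always exists from
     * which roll-forward can begin. */
    struct segment {
        unsigned long write_serial;  /* serial number when written */
    };

    static int
    segment_cleanable(const struct segment *seg, unsigned long checkpoint_serial)
    {
        return seg->write_serial < checkpoint_serial;
    }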

Second, normal cache replacement in 4BSD depends on the fact that the file
system can evict a dirty buffer from the cache at any time. In
LFS, this is not true. The creation and writing of a segment requires
the use of additional buffers to store segment meta-data such as
segment summaries and inode blocks. This has two main effects on
buffer cache management: dirty LFS blocks cannot be kept on the normal
LRU queue, and segment writes must be initiated before there is danger
of running out of buffers. This is discussed in more detail in Section
4.1.2 of
the 1993 Usenix paper.
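
A schematic rendering of these two effects, with invented names and
numbers that do not correspond to the actual BSD-LFS code:

    /* See Section 4.1.2 of the 1993 Usenix paper for the real design. */
    #define CACHE_BUFS     1024   /* total buffers in the cache */
    #define SEG_META_BUFS    32   /* reserve for segment summaries and
                                     inode blocks written with a segment */

    static int dirty_lfs_bufs;    /* dirty LFS buffers; kept off the normal
                                     LRU queue because they cannot be
                                     evicted individually as FFS buffers can */

    static void
    lfs_segwrite(void)
    {
        /* Build and write the current segment; stub for illustration. */
        dirty_lfs_bufs = 0;
    }

    static void
    lfs_buffer_dirtied(void)
    {
        dirty_lfs_bufs++;
        /* Start the segment write while enough clean buffers remain to
         * hold the segment meta-data, rather than waiting until the
         * cache is completely full. */
        if (dirty_lfs_bufs >= CACHE_BUFS - SEG_META_BUFS)
            lfs_segwrite();
    }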

Retaining dirty indirect blocks in the cache will reduce the effective
size of the buffer cache, penalizing all files being served by the LFS.
Although this modification might improve performance of the transaction
processing benchmark, the effects on other workloads are unknown,
and it is of questionable value to modify an algorithm to improve
the performance of one benchmark, potentially at the expense of others.
For example, we have modifications to the FFS that substantially
improve the performance of the create benchmarks in Sections 3 and 5.
However, since we do not yet understand the long term ramifications of
these modifications, we did not include them in this study.

FFS Fragmentation: Section 5 Results

The benchmark used in Section 5 is a slight modification of that used
in Section 3. The benchmark, designed by John Ousterhout, creates,
reads, writes, and deletes a large number of files and gathers data for
a wide range of file sizes. In Section 3, the benchmark places 100 files
in each directory while in Section 5, we place only 25 files in each
directory and report results only for reading and creating.

The purpose of Section 5 is to examine how effectively the Fast File
System can allocate files as a file system ages. In order for FFS to
achieve the performance shown in Section 3, it must succeed in allocating
blocks contiguously on disk. As files are created and deleted, the free
space on a file system could become sufficiently fragmented that FFS
would be unable to allocate files contiguously.
We evaluate the performance over time by running the same benchmark on
recreations of old file systems (in particular, in the Usenix paper, we
show the performance at four points in time over the past year).

Ousterhout has two complaints about this methodology. First, he believes
that since the files are created all at once, they will exhibit different
characteristics from files created under normal use. Second, he does not
understand why our measurements show degradation in performance for 8 KB
files, since such files occupy only a single block.

Let's begin with the second point, since it is trivial to explain.
The create test of an 8 KB file involves three writes: the new inode,
the directory to which the file is being added, and the data block.
In the "empty disk" case (i.e. running the benchmark on a newly
created file system), the directory and the 25 files contained in it
are allocated contiguously from the beginning of the cylinder group.
During the benchmark, the inodes and the modified directory
are written synchronously. The data is written asynchronously but is
scheduled before the next create begins. Therefore, the disk head
seeks back and forth between the inode area and the first data cylinder
in the cylinder group.
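
The sequence of disk operations for a single create can be sketched as
follows; the function names are invented for illustration, but the
ordering is the one described above.

    #include <stdio.h>

    static void write_sync(const char *what)  { printf("sync  write: %s\n", what); }
    static void write_async(const char *what) { printf("async write: %s\n", what); }

    static void
    create_8k_file(void)
    {
        write_sync("new inode (inode area)");
        write_sync("modified directory (first data cylinder)");
        write_async("8 KB data block (first data cylinder)");
        /* The head seeks back and forth between the inode area and the
         * first data cylinder of the cylinder group on every create. */
    }

    int main(void) { create_8k_file(); return 0; }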

When the benchmark is run on the recreated file system, there is no
guarantee that either the directory or any of the files appears at
the beginning of the cylinder group. Therefore, the seeks between the
inode area and the data blocks of the directory and new files are likely
to be longer. These longer seeks lead to lower performance.

Read performance is substantially better than write performance
because multiple files are in the disk's track buffer after the first
file is accessed. However, read performance can also degrade on
an older file system.
Since the files are not allocated contiguously, there is less
benefit from the track buffers, and performance suffers.

Ousterhout's first criticism of the benchmark refers to our ongoing
research in file layout characterization.
The report to which he refers is
File Layout and
File System Performance. In this report, we define a layout
score that indicates what fraction of a file's blocks are allocated
optimally. A file that is allocated optimally (contiguously) has a layout
score of 1.0 and a file with no two blocks contiguous has a layout score of
0.0. Consider a four-block file. We assume that a seek is
always required to access the first block of a file and use only the
remaining blocks in the layout score calculation. In the case of
our four-block file, let's say that the first two blocks are
contiguous and the last two blocks are contiguous, but a seek is
required between the second and third blocks.
Three blocks are considered in the layout score calculation. Of those,
two (the second and fourth) are allocated optimally, while one (the third)
is not. This yields a layout score of 2/3, or 0.67.
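
The calculation is simple to state precisely. In the sketch below, block
addresses are simplified to integer block numbers; a block counts as
optimally allocated when its address immediately follows its
predecessor's.

    #include <stdio.h>

    /* Layout score: fraction of a file's blocks, excluding the first,
     * allocated contiguously with their predecessor. */
    static double
    layout_score(const long blkno[], int nblocks)
    {
        if (nblocks < 2)
            return 1.0;  /* a one-block file is trivially optimal */
        int contiguous = 0;
        for (int i = 1; i < nblocks; i++)
            if (blkno[i] == blkno[i - 1] + 1)
                contiguous++;
        return (double)contiguous / (nblocks - 1);
    }

    int main(void) {
        /* The four-block example from the text: blocks 1-2 contiguous,
         * blocks 3-4 contiguous, a seek between blocks 2 and 3. */
        long blkno[] = { 100, 101, 200, 201 };
        printf("layout score: %.2f\n", layout_score(blkno, 4)); /* 0.67 */
        return 0;
    }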

Our research shows that layout score is a moderate indicator of
performance (its correlation coefficient is 0.7, so layout score
accounts for approximately 50% of the variance in performance we
observe, since 0.7^2 = 0.49). The other component is
the length of the seek/rotation required between blocks
that are not allocated contiguously. Layout score does not
model this.
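
The prediction step used later in this section (deriving expected
performance from a layout score via linear regression) can be sketched as
follows. The sample data points are invented for illustration; the fitted
equations actually used appear in the File Layout and File System
Performance technical report.

    #include <stdio.h>

    /* Fit throughput = a + b * score by least squares to measured
     * (layout score, throughput) pairs, then predict performance from
     * a layout score alone. */
    static void
    fit_line(const double x[], const double y[], int n, double *a, double *b)
    {
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (int i = 0; i < n; i++) {
            sx  += x[i];        sy  += y[i];
            sxx += x[i] * x[i]; sxy += x[i] * y[i];
        }
        *b = (n * sxy - sx * sy) / (n * sxx - sx * sx);
        *a = (sy - *b * sx) / n;
    }

    int main(void) {
        double score[]   = { 0.55, 0.70, 0.85, 0.95 };  /* invented */
        double thruput[] = { 1.9, 2.3, 2.8, 3.1 };      /* MB/s, invented */
        double a, b;
        fit_line(score, thruput, 4, &a, &b);
        printf("predicted throughput at layout score 0.80: %.2f MB/s\n",
            a + b * 0.80);
        return 0;
    }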

Ousterhout states that information about layout scores was present in earlier
drafts of the paper but has been omitted from the final paper.
This is true. At the time of submission, we did not yet have a tool
that permitted reconstruction of a file system from a snapshot. We now
have this tool. Rather than provide an abstract score that only
partially explains performance, we provide actual measured performance
in the final paper. The detailed discussion of layout scores appears
in the File Layout and
File System Performance technical report.

To address Ousterhout's concern over the validity of the results,
we calculated the layout scores for the
64 KB files created by the benchmark and then calculated the layout
score for the 64 KB files on the recreated file system before
the benchmark. We refer to these two sets of files as the benchmark
files and the old files respectively.
In most cases, the layout scores of the benchmark
files were better than those of the old files, while in one case they were
actually worse.

The table below shows the six file systems analyzed in the Usenix paper.
For each file system, we show three snapshot dates and the layout scores
for the benchmark files and the old files. The fifth column shows the
difference in layout score between the two sets of files. Using the
layout score of the old files and the equations derived from linear
regression of layout score versus performance, we show a predicted
read/write performance. Then we show the actual performance that
we measured on the benchmark files.

The data shows that the benchmark files have worse layout scores than
old files on one file system and better layout scores on the other five
file systems. On those five file systems the difference in layout
score produces differences between predicted and actual read performance that
are typically less than 5%.
For writes, the difference between predicted and actual performance is
typically less than 10%.
Even using the predicted numbers, in most cases, read performance over the
one year period is at least 85% of the best performance and write
performance is at least 75% of the best performance.

This analysis is not perfect in that the regression equations were derived
from the measured numbers and may not be correct for the files that exist
in the file system prior to the benchmark. The only other option would be
to select a random set of files to read/write. It is unclear that this
provides a better metric. The real question is, for a given file system
configuration, how fragmented are the files that are actually
accessed? A trace-driven study is probably the next step in this
particular examination.