Tuning Journaling File Systems

You can improve the performance of journaling file
systems by taking the time to tune them. Considerable effort
has gone into designing these file systems to be scalable and fast
without requiring significant expertise. Still, twisting just a
few knobs, such as setting mount options and placing
the journal on an external device, can make journaling
file systems run significantly faster.

File systems are part of our everyday lives. We store and
retrieve data constantly, but rarely do we think about how
each file system works. Perhaps that’s as it should be: Linux supports many different kinds of file
systems, and most are mature and robust. For example, the Linux
kernel supports the traditional Ext2 file
system (among others) and several cluster file systems
(Lustre, GFS, GPFS, and CXFS), and
also includes no fewer than four journaling file
systems that have been proven time and again in production
server environments, where high throughput and near-perennial
uptime are essential. (For additional information on journaling file
systems, see the October 2002 Linux Magazine
article titled “Journaling File Systems”, available
online at http://www.linux-mag.com/2002-10/jfs_01.html.)

But journaling file systems need not be limited to servers.
They can also benefit client machines, where
performance and reliability are often just as critical. However, the
jobs assigned to a workstation and the demands placed on a server
are radically different. To get the best throughput and
uptime out of both, you have to tune each
configuration to suit. Let’s use the open source
dbench benchmark (http://samba.org/ftp/tridge/dbench/) to tweak and
measure a number of different workloads and see how a little work
can yield big results.

The Linux kernel source includes four journaling
file systems: JFS from IBM (http://jfs.sourceforge.net/),
XFS from SGI (http://oss.sgi.com/projects/xfs/), Ext3, and ReiserFS
from Namesys (http://www.namesys.com). Namesys also develops Reiser4,
which is distributed separately from the mainline kernel. The output in Figure One shows a system with all four of
the in-kernel journaling file systems.

FIGURE ONE: A system with four journaling file systems: Ext3, ReiserFS, JFS, and XFS

During installation, your distribution picks one of these
journaling file systems as the default for each partition, but you
can typically change the choice. The default format during a Red
Hat install is Ext3; SuSE defaults the format to ReiserFS.

Figure Two shows that JFS, XFS, ReiserFS,
and Ext3 are independent “peers.” It is possible for a
single Linux machine to use all of these types of file systems at
the same time. A system administrator can configure a system to use
JFS on one partition/volume, and ReiserFS on another.

File system performance is often a major component of overall
system performance. To achieve optimal performance, the underlying
file system configuration must be balanced to match the
characteristics of the system’s primary application.

Before creating file systems, plan their
layout. The following are some general
considerations to be aware of when planning your system:

* I/O workload should be
distributed as evenly as possible across the disk drives.

* The number of file systems on
any one disk should be kept to a minimum. All of the Linux file
systems are better able to manage fragmentation in
a larger partition/volume than in a small, completely full
partition.

* If a large set of files (in
size, number, or both) has characteristics that make the files
significantly different from “typical” files, create a
separate file system for these files that is tuned to their
requirements.

Most parameters that affect file system performance are set once
and for all when a file system is created. Hence, it’s
simpler to provide the parameters you want when you run mkfs. Some mount
options can also be used to change the performance of the file
system.

Disable Access Times

The first performance tip is a simple change you can make via mount: Disable file access times if your
system doesn’t need them.

Linux records an atime, or access time, whenever a file is read.
However, few applications rely on this information, and it can be quite
costly to track, because reading a file then also causes its inode
to be updated on disk. To get a quick performance boost, simply disable
access time updates with the mount option noatime.

Use Ext3 Instead of Ext2

Ext3 is a minimal extension to Ext2 that adds support for
journaling. Ext3 uses the same disk layout and data structures as
Ext2, making it forward- and backward-compatible with Ext2.
Migration from Ext2 to Ext3 (and vice versa) is quite easy, and can
even be done in-place in the same partition. (The other journaling
file systems require the partition to be formatted with their own mkfs utility.) Despite the similarities,
Ext3 provides higher availability and performance than Ext2,
without sacrificing Ext2’s simplicity and
reliability.

Ext3’s first improvement is directory indexing. This
feature speeds up file access in directories containing
many files by using hashed B-trees to store the directory
information. If a directory grows beyond a single disk block,
it’s automatically indexed using a hash tree.

One way to enable dir_index is to use the tune2fs command:

# tune2fs -O dir_index /dev/hda1

This command only applies to those directories created on the
named filesystem after tune2fs runs. To
apply directory indexing to existing directories, run the e2fsck utility to optimize and reindex the
directories on the filesystem:

# e2fsck -D -f /dev/hda1
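
To confirm that the feature is switched on, you can look for dir_index on the “Filesystem features:” line that tune2fs -l prints. The following is a minimal sketch rather than a definitive tool: the function name has_dir_index is made up for illustration, and the sample line stands in for real tune2fs output.

```shell
#!/bin/sh
# has_dir_index (hypothetical helper): reads `tune2fs -l` style output on
# stdin and succeeds if dir_index appears on the "Filesystem features:" line.
has_dir_index() {
    awk -F: '/^Filesystem features:/ {
                 n = split($2, feats, " ")
                 for (i = 1; i <= n; i++)
                     if (feats[i] == "dir_index") found = 1
             }
             END { exit !found }'
}

# Sample text stands in for real output. On a live system (as root) use:
#   tune2fs -l /dev/hda1 | has_dir_index
sample='Filesystem features:      has_journal ext_attr dir_index filetype sparse_super'
if printf '%s\n' "$sample" | has_dir_index; then
    echo "dir_index enabled"
else
    echo "dir_index not enabled"
fi
```

Matching whole words from the features line avoids false hits on other feature names that merely contain the same substring.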

Another Ext3 enhancement is preallocation. This feature is
useful when using Ext3 with multiple threads appending to files in
the same directory. You can enable preallocation using the reservation option.

# mount -t ext3 -o reservation /dev/hda1 /ext3

You can further improve Ext3 performance by keeping the file
system’s journal on another device. An external log improves
performance because the log updates are saved to a different
partition than the data of the corresponding file system. When the
two partitions are on different disks, this
reduces the number of hard disk seeks.

To create an external Ext3 log, run the
mkfs utility on the journal device, making sure that the
block size of the external journal is the same as the block size of the
Ext3 file system. For example, the commands…

… use /dev/hda1 as the
external log for the Ext3 file system on
/dev/hdb1.

To measure the impact of the external journal, let’s run dbench. In the next few examples, the first
run of dbench measures Ext3 with an internal log; the second run
measures Ext3 with an external log. (Remember to place the dbench software on the partition being
benchmarked.)
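
An external-log setup of the kind measured in the second run could be created along the following lines. This is a sketch under stated assumptions, not the exact commands used for the benchmark: it relies on the standard mke2fs options -O journal_dev and -J device=, the 4 KB block size is an assumed default, and the helper name ext3_external_journal_cmds is made up for illustration. The function only prints the commands, so you can review them before running anything as root.

```shell
#!/bin/sh
# ext3_external_journal_cmds JOURNAL_DEV FS_DEV [BLOCK_SIZE]  (hypothetical helper)
# Prints, but does not run, the two mke2fs commands. Both use the same
# block size, as an external Ext3 journal requires.
ext3_external_journal_cmds() {
    journal_dev=$1
    fs_dev=$2
    block_size=${3:-4096}   # assumed default; must match on both devices
    # Step 1: format the journal partition as a dedicated journal device.
    echo "mke2fs -b $block_size -O journal_dev $journal_dev"
    # Step 2: create the Ext3 file system and attach the external journal.
    echo "mke2fs -b $block_size -J device=$journal_dev $fs_dev"
}

ext3_external_journal_cmds /dev/hda1 /dev/hdb1
```

Both commands destroy any data on the named partitions, which is one reason the sketch prints them instead of executing them.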

As you can see, the second run of dbench
shows increased throughput, from 84.6415 MB/sec to 117.848 MB/sec.
The time required to run dbench was also reduced, from 31 seconds
to 22 seconds. The dbench benchmark creates a very large amount of
metadata activity. Therefore, determining the metadata activity for
your system can help determine the type of tuning that will be most
useful.
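
To put the two runs side by side, it helps to express the difference as a percentage. The small helper below is only a sketch (the name pct_gain is illustrative); it uses awk for the floating-point arithmetic and the figures quoted above.

```shell
#!/bin/sh
# pct_gain BEFORE AFTER  (hypothetical helper)
# Prints the percentage change from BEFORE to AFTER with one decimal place.
pct_gain() {
    awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%.1f\n", (after - before) / before * 100 }'
}

pct_gain 84.6415 117.848   # throughput in MB/sec: prints 39.2
pct_gain 31 22             # elapsed time in seconds: prints -29.0
```

In other words, the external log delivered roughly 39 percent more throughput and cut the elapsed time by about 29 percent.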

Tweak ReiserFS

The ReiserFS journaling file system supports metadata
journaling, and has a unique design that differentiates it from the
other journaling file systems. Specifically, ReiserFS stores all
file system objects in a single B*-balanced
tree. ReiserFS also supports compact, indexed directories,
dynamic inode allocation, resizable items, and 60-bit offsets.

The tree contains four basic components:
stat() data, directory components, direct components,
and indirect components. You find a component by searching for its
key, which is made up of an ID, the offset within the object
being searched, and the item type. Directories grow
and shrink as their contents change. A hash of the
file name is used to keep an entry’s offset in the directory
permanent. For files, indirect components point to data blocks, and
direct components contain packed file data. All of the components
can be resized by rebalancing the tree.

ReiserFS is especially adept at managing lots and lots of small
files.

Like Ext3, the ReiserFS file system journal can be maintained
separately from the file system itself. To accomplish this, your
system needs two unused partitions. Assuming that
/dev/hda1 is the external journal and
/dev/hdb1 is the file system you want to create, simply
run the command:

# mkreiserfs -j /dev/hda1 /dev/hdb1

That’s all it takes.

In addition to an external journal, there are three
mount options that can change the performance of
ReiserFS:

* The hash option allows you to choose which hash algorithm to
use to locate and write files within directories. There are three
choices. The rupasov hashing algorithm is a
fast hashing method that preserves locality, mapping
lexicographically close file names to close hash values. The tea hashing algorithm is a Davies-Meyer
function that creates keys by thoroughly permuting bits in the
name. It achieves high randomness and, therefore, a low probability
of hash collision, but this entails performance costs. Finally, the r5 hashing algorithm is a modified version
of the rupasov hash with a reduced probability of collisions. r5 is
the default hashing algorithm. You can set the hash scheme using a
command such as mount -t reiserfs -o
hash=tea /dev/hdb1 /mnt/reiserfs.

* The nolog option disables journaling, and also provides a
slight performance improvement in some situations, albeit at the
cost of forcing fsck if the file system is
not cleanly unmounted. This is a good option to use when restoring
a file system from a backup. A sample command is
mount -t reiserfs -o nolog /dev/hdb1
/mnt/reiserfs.

* The notail option disables the packing of files into the
tree. By default, ReiserFS stores small files and “file
tails” directly in the tree.

It is possible to combine mount options by separating them with
a comma. Here’s an example that uses two mount options
(noatime, notail) to
increase file system performance:

# mount -t reiserfs -o noatime,notail /dev/hdb1 /mnt/reiserfs
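
After mounting, you may want to confirm which options actually took effect. One way is to inspect the options column of /proc/mounts. The sketch below assumes the usual /proc/mounts layout (device, mount point, type, options); the function name mount_has_option is made up for illustration, and the example runs against a sample line so that it does not depend on what is mounted on your machine.

```shell
#!/bin/sh
# mount_has_option MOUNTPOINT OPTION [MOUNTS_FILE]  (hypothetical helper)
# Succeeds if OPTION is among the comma-separated options listed for
# MOUNTPOINT. MOUNTS_FILE defaults to /proc/mounts.
mount_has_option() {
    awk -v mp="$1" -v opt="$2" '
        $2 == mp {
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] == opt) found = 1
        }
        END { exit !found }' "${3:-/proc/mounts}"
}

# Example against a sample line in /proc/mounts format:
sample=$(mktemp)
echo '/dev/hdb1 /mnt/reiserfs reiserfs rw,noatime,notail 0 0' > "$sample"
mount_has_option /mnt/reiserfs noatime "$sample" && echo "noatime is set"
mount_has_option /mnt/reiserfs nolog "$sample" || echo "nolog is not set"
rm -f "$sample"
```

On a live system, leave out the third argument to read /proc/mounts directly.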

Tweaking JFS

JFS for Linux is based on the IBM JFS file system for OS/2 Warp. JFS is well-suited to enterprise
environments and uses many advanced techniques to boost
performance, provide for very large file systems, and keep track
of changes to the file system. Some of the features of JFS
include:

* Directory
organization. Two different directory organizations are
provided: one is used for small directories, and the other for
large directories. The contents of a small directory, up to
8 entries excluding the self (. or
“dot”) and parent (.. or
“dot dot”) entries, are stored within the
directory’s inode. This eliminates the need for separate
directory block I/O and the need to allocate separate storage. The
contents of larger directories are organized in a B+ tree keyed on
name. The B+ tree provides faster directory lookup, inserts, and
deletes when compared to traditional, unsorted directory
indices.

* 64-bit. JFS is a full 64-bit file system. All of the
appropriate file system structure fields are 64 bits in size. This
allows JFS to support large files and partitions.

There are other advanced features in JFS, such as
allocation groups, which are shown in
Figure Two. Allocation groups speed file access times by
maximizing locality. (XFS also has this feature.)

Again, JFS file systems can be journaled on a separate device.
To create a JFS file system with the log on an external device, the
system needs to have two unused partitions. In the following
example, /dev/hda1 and /dev/hdb1 are spare partitions. /dev/hda1 is
used as the external log.

# mkfs.jfs -j /dev/hda1 /dev/hdb1

There is one mount option that can change the performance of the
JFS file system. The nointegrity option tells JFS not to
write to the journal, which allows for higher performance
when restoring a volume from backup media. The integrity of the
volume is not guaranteed if the system terminates abnormally.

The integrity option is the default. It
commits metadata changes to the journal. Use this option to remount
a volume where the nointegrity option was previously specified to
restore normal behavior.

The JFS jfs_tune utility
allows you to change the location of the journal of an existing file system. To create a
journal on an external device, say,
/dev/hda2, run:

# mkfs.jfs -J journal_dev /dev/hda2

Then attach the external journal to the file system, which is
located on /dev/hdb1.

# jfs_tune -J device=/dev/hda2 /dev/hdb1

Tweaking XFS

The XFS file system for Linux is based on SGI’s IRIX XFS file system technology. XFS
supports metadata journaling and extremely large disk farms. In
addition, XFS is designed to scale and deliver high performance.

XFS is a 64-bit file system. All of the file system counters in
the system are 64-bit, as are the addresses used for each disk
block and the unique number assigned to each file (the inode number).

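XFS supports delayed allocation. Instead of assigning disk blocks
as soon as an application writes data, XFS holds the data in memory
and postpones allocation until the data is flushed. This
feature allows the file system to optimize write performance. When
it comes time to write data to disk, XFS can allocate free space in
an intelligent way that optimizes file system performance, for example by
allocating a single, contiguous region on the disk to store this
data.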

XFS partitions the file system into regions called
Allocation Groups (AG). Each AG manages its own free
space and inodes, as shown in Figure Three.
In addition, AGs provide scalability and parallelism for the file
system. Files and directories are not limited to a single AG. Free
space and inodes within each AG are managed so that multiple
processes can allocate free space throughout the file system
simultaneously, thus reducing the bottleneck that can occur on
large, active file systems.

One option that can make a difference in an XFS file system is
the -i size=xxx option of mkfs.xfs. The default inode size is 256 bytes.
However, the inode size can be increased (up to 2 KB, and no more
than half the file system block size), which allows
more directories to keep their contents in the inode and so reduces the
disk I/O needed to read and write them. Because extents are also held in the inode if there is room, larger
inodes also reduce the number of files with out-of-inode
metadata. On the other hand, larger inodes need
more I/O to read, because inodes are read and written in clusters.

Another option that affects performance of the filesystem is the
log size: -l size=xxx. When there is a large amount of metadata
activity, a larger log translates to more elapsed time before
modified metadata is flushed to the disk. However, a larger log
also slows down recovery.

As with the other journaling file systems, an external log
improves performance because the log updates are saved to a
different partition than their corresponding file system. To create
an XFS file system with the log on an external device, you again
need two unused partitions. In the following example, /dev/hda1 and
/dev/hdb1 are spare partitions. The /dev/hda1 partition is used as
the external log.

# mkfs.xfs -l logdev=/dev/hda1 /dev/hdb1

At mount time, there are three XFS options that can alter
performance:

* osyncisdsync indicates that
O_SYNC is treated as O_DSYNC,
which is the behavior Ext2 gives you by default. Without this
option, O_SYNC file I/O syncs more metadata
for the file.

* logbufs=value sets the number of log buffers that are
held in memory. More buffers mean you can have more active transactions at
once, and can still perform metadata changes while the log is
synced to disk. The flip side is that the amount of metadata
changes that might be lost due to a system crash is greater. Valid
values are 2 through 8.

* logbsize=value sets the size of each log buffer held in
memory. Valid values are 16, 32, 64, 128, and 256 Kbytes.

For a metadata-intensive workload, the default log size could be
the limiting factor that reduces the performance of the file
system. Better results are achieved by creating file systems with a
larger log size. The following mkfs command
creates a log of 32,768 file system blocks (the b suffix denotes
blocks, so this is 128 MB with 4 KB blocks).

# mkfs -t xfs -l size=32768b -f /dev/hdb1

(Currently, in order to resize a log inside the volume, you need
to remake the file system.)

Benchmarking XFS

Let’s look at two ways to tune the XFS file system and run dbench. The first example uses the defaults
to format an XFS partition, which, by default, has the log inside
the same partition as the data. This test provides a baseline. The
second example uses the mount options logbufs and logbsize.
A third example uses an external log and the same two mount
options.
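
Because both mount options accept only a small set of values, it can be handy to check them before mounting. The following is a minimal sketch: the function name xfs_log_opts and the /mnt/xfs mount point are made up for illustration, the accepted ranges are the ones listed above, and the k suffix assumes that logbsize accepts kilobyte values written that way.

```shell
#!/bin/sh
# xfs_log_opts LOGBUFS LOGBSIZE_KB  (hypothetical helper)
# Prints a mount option string, or fails if a value is outside the ranges
# given in the text (logbufs 2 through 8; logbsize 16, 32, 64, 128, 256 KB).
xfs_log_opts() {
    case $1 in
        [2-8]) ;;
        *) echo "logbufs must be 2 through 8" >&2; return 1 ;;
    esac
    case $2 in
        16|32|64|128|256) ;;
        *) echo "logbsize must be 16, 32, 64, 128, or 256" >&2; return 1 ;;
    esac
    echo "logbufs=$1,logbsize=${2}k"
}

# Print the resulting mount command instead of running it:
opts=$(xfs_log_opts 8 32) && echo "mount -t xfs -o $opts /dev/hdb1 /mnt/xfs"
```

The example prints the mount command for review; drop the echo to run it as root.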

Finally, let’s place the log on an external device for
XFS. The example in this section runs dbench
with the same parameters as in the previous example, but the log is
placed on an external device, /dev/hdb1.

When the logbufs and
logbsize options are added, throughput increases from
92.7404 MB/sec to 96.4556 MB/sec. When the log is moved to an
external device, the throughput nearly doubles to 182.083 MB/sec.
Clearly, the external log increases file system performance under dbench, a program that has a large amount of
metadata activity.

Tuning the I/O Scheduler

The I/O scheduler orders pending I/O requests to minimize the
time spent moving the disk head. This, in turn, minimizes disk seek
time and maximizes hard disk throughput. Hence, tuning the I/O
scheduler can also help increase file system performance.

Linux I/O Schedulers

The Linux I/O schedulers present I/O requests to block devices
in an optimal order. There are currently four schedulers in the
kernel, each with a different notion of “high
performance”. All of them, however, maintain a dispatch
queue, which is a list of requests that have been selected for
submission to the device. The purpose of the I/O scheduler is to
sort and merge the I/O requests from the I/O queues in order to
increase efficiency and enable the best performance.

Using entries in the /sys file system, you can change and tune
the I/O scheduler for a given block device. Each scheduler
exposes its own directory tree of tuning options.
Let’s discuss the design of each of the schedulers and the areas
where one would be better than another.

The noop scheduler is a FIFO queue that provides only I/O
merging. It is a good choice if your application already sorts its I/O.

The deadline scheduler assigns a deadline to every request to bound
the latency of any I/O request. It implements merging and sorting,
plus the deadline mechanism to avoid starvation. It prefers reads
over writes.

The anticipatory scheduler tries to predict the future workload,
delaying I/O in order to merge requests and decrease the number
of seeks. It implements merging and sorting plus an algorithm to
minimize disk head movements. It is suggested for workstations and
older hardware.

The cfq scheduler uses a round-robin technique to
divide the available I/O bandwidth fairly among all I/O
requests. It implements sorting and merging. This is the default
I/O scheduler for the Red Hat Enterprise Linux 4 release and for the
SuSE SLES9 and SLES10 releases.

To see which scheduler a device is using, run the following
command. The active scheduler is shown in square brackets:

# cat /sys/block/hda/queue/scheduler
noop anticipatory deadline [cfq]

On newer kernels, you can change the scheduler without a reboot by
writing the name of another I/O scheduler to the same file.
The example shows how to switch to the deadline
scheduler.

# echo deadline > /sys/block/hda/queue/scheduler
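
If you manage several disks, a short script can report the active scheduler for each one by picking out the bracketed name. This is a sketch, not a definitive tool: the function name active_scheduler is made up for illustration, and the loop simply skips devices whose scheduler file is missing or unreadable.

```shell
#!/bin/sh
# active_scheduler LINE  (hypothetical helper)
# Prints the bracketed (active) scheduler name from a line in the format
# of /sys/block/<dev>/queue/scheduler.
active_scheduler() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}

active_scheduler "noop anticipatory deadline [cfq]"   # prints cfq

# On a live system, report the active scheduler for each block device:
for f in /sys/block/*/queue/scheduler; do
    [ -r "$f" ] || continue
    dev=${f#/sys/block/}
    echo "${dev%%/*}: $(active_scheduler "$(cat "$f")")"
done
```

Running the script before and after the echo command above is a quick way to confirm that the switch took effect.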

The default I/O scheduler can also be selected at boot time using the kernel parameter elevator=xxx. The
first option is noop, best for smart storage
controllers. Another is deadline, which
limits the maximum latency per request to disk. The third is anticipatory, which maximizes throughput at the cost of
increased latency, and is suitable for desktops. The fourth, completely fair queuing, abbreviated as cfq, compromises between reads and writes
and tries to balance throughput and latency, which is best for file
servers. (See the sidebar for additional information on the I/O
schedulers available in the 2.6.x series of
the kernel.)

A Need for Speed

You can improve the performance of journaling file systems by
taking the time to tune them. Considerable effort has gone into
designing these file systems to be scalable and fast
without requiring significant expertise. Still, twisting just a few knobs, such as setting mount options and placing the journal on an
external device, can make journaling file systems run
significantly faster.

Steve Best works in the IBM Linux Technology Center
in Beijing, China. His latest book, Linux Debugging
and Performance Tuning, is in stores now. He can be
contacted at sbest@us.ibm.com.