The pursuit of more performance from our storage systems is relentless. This includes more performance from the hardware (faster disks, SSDs), the network (bigger pipes, larger MTUs), the operating system (caching, IO schedulers), and the file system. There are many levers that can be pulled to improve performance, but this article will look at one particular piece – the file system journal device. In particular, the metadata performance of ext4 will be examined as the journal is moved to different devices.

Journaling for File Systems

Sometimes bad things such as power failures happen to systems. A power interruption or failure can corrupt a file system very quickly because an IO operation is interrupted and never completed. Consequently, the file system has to be checked with fsck, which means the entire file system has to be walked to find and correct any problems. As file systems grew, the amount of time it took to walk them increased greatly. For example, the author remembers performing an fsck on a 1TB file system in 2002-2003 that took several days. Having the system down for that amount of time is very painful.

One way to help improve fsck times is to use a journaled file system. Rather than IO operations happening directly to the file system, the operations are added to the journal (typically a log) in the order they are supposed to happen. Then the file system grabs the operation from the head of the journal and completes it, erasing the operation from the journal only after the operation is finished and the file system is satisfied that the operation is complete.

If power is lost during an operation on a journaled file system, the journal is simply “replayed” when the system comes back up, i.e. the operations in the journal are performed one at a time starting at the beginning. This means that the entire file system doesn’t necessarily have to be checked (walked). This works because any interruption happens before the operation is removed from the journal: even if the operation wasn’t completed on the file system, replaying it ensures that the IO operation actually occurs. If the interruption happened while the operation was being deleted from the journal, the file system can assume that the operation completed and simply deletes the “corrupted” entry from the head of the journal. As a result, the entire file system does not have to be walked to repair problems; only the journal needs to be replayed. Instead of spending a couple of days waiting for an fsck to finish, a very fast replay of the journal takes just minutes.

The journal can theoretically reside anywhere within the system on any device. It can be on the drive containing the file system, on a partition of another drive, or on any other block device you have lying around. But choosing the “best device” matters. The journal is critical to the integrity of the file system, so making sure it is on a device with some resiliency (the ability to tolerate errors or problems) is very important. At the same time, everyone loves more performance (no one has likely ever said, “you know, I want my storage to go slower”). Since the performance of the journal can be key to the performance of the file system, improving the performance of the journal device and the journal itself may help overall file system performance.

Testing the Metadata Performance

In this article three options for the journal device will be tested to determine the impact of journal device location on the metadata performance of ext4. The three device options are:

Journal on the same disk as the file system

Journal on a different disk from the file system

Journal on a ramdisk

The last option, using a ramdisk for the journal, is designed to measure the pinnacle of performance, but it is not likely to be the most resilient solution (a battery-backed ramdisk that can dump its contents to a drive or SSD would be better). It is included here as an “upper bound” on performance.

One of the ways that journal performance can impact overall file system performance is in metadata performance. This article will focus on metadata performance as measured by fdtree, a benchmark that has been used before to examine the metadata performance of various Linux file systems. To learn about fdtree and how it was used for benchmarking, please read the original article.

As a quick recap, the benchmark, fdtree, is a simple bash script that performs four different metadata tests:

Directory creation

File creation

File removal

Directory removal

It creates a specified number of files of a given size (in blocks) in a top-level directory. It then creates a specified number of sub-directories; within each of those, sub-directories are recursively created down to a specified number of levels, and every directory is populated with files.

Fdtree was used in four different configurations to stress the metadata capability:

Small files (4 KiB)

    Shallow directory structure

    Deep directory structure

Medium files (4 MiB)

    Shallow directory structure

    Deep directory structure

The two file sizes, 4 KiB (1 block) and 4 MiB (1,000 blocks), were used to get some feel for a range of performance as a function of the amount of data. The two directory structures were used to stress the metadata in different ways to discover whether there is any impact on metadata performance. The shallow directory structure means there are many directories but not very many levels down. The deep directory structure means there are not many directories at any particular level but there are many levels.

The command lines for the four combinations are:

Small Files – Shallow Directory Structure

./fdtree.bash -d 20 -f 40 -s 1 -l 3

This command creates 20 sub-directories from each upper level directory at each level (“-d 20”) and there are 3 levels (“-l 3”). It’s a basic tree structure. This is a total of 8,421 directories. In each directory there are 40 files (“-f 40”) each sized at 1 block (4 KiB) denoted by “-s 1”. This is a total of 336,840 files and 1,347,360 KiB total data.
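The directory count is just a geometric series – the top-level directory plus 20 sub-directories per directory over 3 levels – and the same arithmetic gives the totals quoted for the other three cases. A quick check in the shell:

# Directories: the top level plus 20^1 + 20^2 + 20^3 sub-directories
echo $((1 + 20 + 20**2 + 20**3))          # 8421
# Files: 40 per directory
echo $((40 * (1 + 20 + 20**2 + 20**3)))   # 336840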

Small Files – Deep Directory Structure

./fdtree.bash -d 3 -f 4 -s 1 -l 10

This command creates 3 sub-directories from each upper level directory at each level (“-d 3”) and there are 10 levels (“-l 10”). This is a total of 88,573 directories. In each directory there are 4 files each sized at 1 block (4 KiB). This is a total of 354,292 files and 1,417,168 KiB total data.

Medium Files – Shallow Directory Structure

./fdtree.bash -d 17 -f 10 -s 1000 -l 2

This command creates 17 sub-directories from each upper level directory at each level (“-d 17”) and there are 2 levels (“-l 2”). This is a total of 307 directories. In each directory there are 10 files each sized at 1,000 blocks (4 MiB). This is a total of 3,070 files and 12,280,000 KiB total data.

Medium Files – Deep Directory Structure

./fdtree.bash -d 2 -f 2 -s 1000 -l 10

This command creates 2 sub-directories from each upper level directory at each level (“-d 2”) and there are 10 levels (“-l 10”). This is a total of 2,047 directories. In each directory there are 2 files each sized at 1,000 blocks (4 MiB). This is a total of 4,094 files and 16,376,000 KiB total data.

Each test was run 10 times for each of the four combinations and the three journal devices. The test system ran a stock CentOS 5.3 distribution, but with a 2.6.30 kernel and with e2fsprogs upgraded to 1.41.9. The tests were run on the following system:

Gigabyte GA-MA78GM-US2H motherboard

An AMD Phenom II X4 920 CPU

8GB of memory

Linux 2.6.30 kernel

The OS and boot drive are on an IBM DTLA-307020 (a 20GB Ultra ATA/100 drive)

/home is on a Seagate ST3160827AS

There are two drives for testing, both Seagate ST3500641AS-RK drives with a 16 MB cache each: /dev/sdb and /dev/sdc.

The file system under test was always built on the first drive, /dev/sdb. The second drive, /dev/sdc, was used only for the second option, where the journal was placed on a separate drive.

Journaling Device Details

All three journal device options used the same journal size, 16MB. This size was chosen because CentOS boots with a number of ramdisks already created, and these devices are limited to 16MB. To keep the comparison fair, the journal size was held constant for all three cases.
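As a quick check of the default ramdisk size (blockdev reports the size in 512-byte sectors, so 32,768 sectors corresponds to 16MB; the default can be changed with the ramdisk_size= kernel boot parameter):

blockdev --getsize /dev/ram0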

The first journal device option was to keep the journal on the same disk as the file system. The drive was partitioned so that the first partition (/dev/sdb1) was used for the file system itself and the remaining approximately 16MB of the drive (/dev/sdb2) was used for the journal. The first step was to build the file system on /dev/sdb1.
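The build step looked approximately like the following (a sketch – the exact mke2fs options and the output listing from the original setup are not reproduced here):

mke2fs -t ext4 /dev/sdb1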

The internal journal was then removed with tune2fs. Notice that the line “Filesystem features” in the resulting tune2fs -l listing no longer has the entry “has_journal”, indicating that the file system no longer has a journal. The last step is to tell the file system that it has a journal and that it is on the second partition of the drive.
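A sketch of these steps, assuming the partition layout above (the “-b 4096” option is an assumption to match the file system’s 4 KiB block size, since an external journal must use the same block size as the file system):

# Remove the internal journal (clears the has_journal feature)
tune2fs -O ^has_journal /dev/sdb1
# Format the second partition as an external journal device
mke2fs -b 4096 -O journal_dev /dev/sdb2
# Attach the file system to the external journal
tune2fs -j -J device=/dev/sdb2 /dev/sdb1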

The second journal device option was to place the journal on a separate drive, following the same procedure. Once again the feature “has_journal” is not listed on the line “Filesystem features”, indicating that the journal has been “removed” from the file system. The final step is to tell the file system that it has a journal on a specific device – in this case /dev/sdc1.
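Only the journal device changes relative to the previous option (again a sketch, with the same assumed block size):

mke2fs -b 4096 -O journal_dev /dev/sdc1
tune2fs -j -J device=/dev/sdc1 /dev/sdb1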

Looking through the tune2fs -l listing of the file system you can see that it has a journal again (“has_journal” on the line “Filesystem features”) and that the journal device is listed as “0x0821” near the bottom of the listing (major number 8, minor number 33, which is /dev/sdc1).

The third journal device option is to place the journal on a ramdisk. This is done in a similar fashion to the previous option, where the journal was put on a second drive, but recall that an external journal has to be a block device. The technique used for getting a ramdisk block device is fairly simple and is based on this article. Despite that article being written for a 2.4 kernel, the techniques are the same.
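A sketch of the ramdisk variant, assuming the stock /dev/ram0 device is used for the journal:

# Zero the 16MB ramdisk so it is fully allocated
dd if=/dev/zero of=/dev/ram0 bs=1M count=16
# Format the ramdisk as an external journal device
mke2fs -b 4096 -O journal_dev /dev/ram0
# Remove the internal journal and attach the ramdisk journal
tune2fs -O ^has_journal /dev/sdb1
tune2fs -j -J device=/dev/ram0 /dev/sdb1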
