1.2 Major File Systems in Linux

SUSE Linux Enterprise Server offers a variety of file systems from which to choose. This section contains an overview of how these file systems work and which advantages they offer.

It is very important to remember that no file system best suits all kinds of applications. Each file system has its particular strengths and weaknesses, which must be taken into account. In addition, even the most sophisticated file system cannot replace a reasonable backup strategy.

The terms data integrity and data consistency, when used in this section, do not refer to the consistency of the user space data (the data your application writes to its files). Whether this data is consistent must be controlled by the application itself.

IMPORTANT: Unless stated otherwise in this section, all the steps required to set up or change partitions and file systems can be performed by using YaST.

1.2.1 Btrfs

Btrfs is a copy-on-write (COW) file system developed by Chris Mason. It is based on COW-friendly B-trees developed by Ohad Rodeh. Btrfs is a logging-style file system. Instead of journaling the block changes, it writes them in a new location, then links the change in. Until the last write, the new changes are not committed.
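The copy-on-write idea described above can be illustrated with a minimal sketch. This is a toy in-memory model, not Btrfs code: the CowStore class and its method names are hypothetical, and a plain dictionary stands in for the B-tree. It shows only the principle that new data goes to a new location and becomes visible when the root reference is switched.

```python
class CowStore:
    """Toy copy-on-write store: blocks are never overwritten in place."""

    def __init__(self):
        self.blocks = {}    # block address -> contents
        self.next_addr = 0  # next unused block address
        self.root = {}      # committed mapping: file name -> block address

    def _allocate(self, data):
        # Always place data in a fresh block.
        addr = self.next_addr
        self.blocks[addr] = data
        self.next_addr += 1
        return addr

    def write(self, name, data):
        # Step 1: write the new contents to a new location.
        new_addr = self._allocate(data)
        # Step 2: build a new mapping that links the change in.
        new_root = dict(self.root)
        new_root[name] = new_addr
        # Step 3: commit by switching the root reference.
        old_root, self.root = self.root, new_root
        # The previous mapping still points at the old blocks.
        return old_root

    def read(self, name, root=None):
        mapping = self.root if root is None else root
        return self.blocks[mapping[name]]


store = CowStore()
store.write("motd", "hello")
previous = store.write("motd", "hello, world")
print(store.read("motd"))            # current contents
print(store.read("motd", previous))  # contents as seen through the old root
```

Because the old mapping and the old blocks are left untouched, keeping a reference to a previous root is enough to read the earlier state, which is the property that makes snapshots inexpensive in a copy-on-write design.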

IMPORTANT: Because Btrfs is capable of storing snapshots of the file system, it is advisable to reserve twice as much disk space as the standard storage proposal. This is done automatically by the YaST Partitioner in the Btrfs storage proposal for the root file system.

Key Features

Btrfs provides fault tolerance, repair, and easy management features, such as the following:

Writable snapshots that allow you to easily roll back your system if needed after applying updates, or to back up files.

Multiple device support that allows you to grow or shrink the file system. The feature is planned to be available in a future release of the YaST Partitioner.

Compression to efficiently use storage space.

Use Btrfs commands to set up transparent compression. Compression and encryption functionality for Btrfs is under development and is currently not supported on SUSE Linux Enterprise Server.

Different RAID levels for metadata and user data.

Different checksums for metadata and user data to improve error detection.

Integration with Linux Logical Volume Manager (LVM) storage objects.

Integration with the YaST Partitioner and AutoYaST on SUSE Linux.

Offline migration from existing Ext2, Ext3, and Ext4 file systems.

Bootloader Support

Bootloader support for /boot on Btrfs is planned to be available beginning in SUSE Linux Enterprise 12.

Btrfs Subvolumes

Btrfs creates a default subvolume in its assigned pool of space. It allows you to create additional subvolumes that act as individual file systems within the same pool of space. The number of subvolumes is limited only by the space allocated to the pool.

If Btrfs is used for the root (/) file system, the YaST Partitioner automatically prepares the Btrfs file system for use with Btrfs subvolumes. You can cover any subdirectory as a subvolume. For example, Table 1-1, Default Subvolume Handling for Btrfs in YaST, identifies the subdirectories that we recommend you treat as subvolumes because they contain files that you should not snapshot, for the reasons given:

Table 1-1 Default Subvolume Handling for Btrfs in YaST

Path          Reason to Cover as a Subvolume
/opt          Contains third-party add-on application software packages.
/srv          Contains http and ftp files.
/tmp          Contains temporary files.
/var/crash    Contains memory dumps of crashed kernels.
/var/log      Contains system and applications’ log files, which should never be rolled back.
/var/run      Contains run-time variable data.
/var/spool    Contains data that is awaiting processing by a program, user, or administrator, such as news, mail, and printer queues.
/var/tmp      Contains temporary files or directories that are preserved between system reboots.

Snapshots for the Root File System

Btrfs provides writable snapshots with the SUSE Snapper infrastructure that allow you to easily roll back your system if needed after applying updates, or to back up files. Snapper allows you to create and delete snapshots, and to compare snapshots and revert the differences between them. If Btrfs is used for the root (/) file system, YaST automatically enables snapshots for the root file system.

To prevent snapshots from filling up the system disk, you can make the Snapper cleanup defaults more aggressive in the /etc/snapper/configs/root configuration file, or in the configuration for other mount points. Snapper provides three algorithms to clean up old snapshots, which are run by a daily cron job. The cleanup frequency is defined in the Snapper configuration for the mount point. Lower the TIMELINE_LIMIT parameters for daily, monthly, and yearly to reduce both the number of snapshots retained and how long they are kept. For information, see Adjusting the Config File in the SUSE Linux Enterprise Server Administration Guide.

Online Check and Repair Functionality

The scrub check and repair functionality is available as part of the Btrfs command line tools. It verifies the integrity of data and metadata, assuming the tree structure is fine. You can run scrub periodically on a mounted file system; it runs as a background process during normal operation.

RAID and Multipath Support

You can create Btrfs on Multiple Devices (MD) and Device Mapper (DM) storage configurations by using the YaST Partitioner.

Migration from Ext File Systems to Btrfs

You can migrate data volumes from existing Ext file systems (Ext2, Ext3, or Ext4) to the Btrfs file system. The conversion process occurs offline and in place on the device. The file system needs at least 15% of available free space on the device.

To convert the Ext file system to Btrfs, take the file system offline, then enter:

btrfs-convert <device>

To roll back the migration to the original Ext file system, take the file system offline, then enter:

btrfs-convert -r <device>

IMPORTANT: When you roll back to the original Ext file system, all data that you added after the conversion to Btrfs is lost. That is, only the original data is converted back to the Ext file system.

Btrfs Administration

Btrfs is integrated in the YaST Partitioner and AutoYaST. It is available during the installation to allow you to set up a solution for the root file system. You can use the YaST Partitioner after the install to view and manage Btrfs volumes.

Btrfs administration tools are provided in the btrfsprogs package. For information about using Btrfs commands, see the btrfs(8), btrfsck(8), mkfs.btrfs(8), and btrfsctl(8) man pages. For information about Btrfs features, see the Btrfs wiki.

The fsck.btrfs(8) tool will soon be available in the SUSE Linux Enterprise update repositories.

Btrfs Quota Support for Subvolumes

The Btrfs root file system subvolumes /var/log, /var/crash and /var/cache can use all of the available disk space during normal operation, and cause a system malfunction. To help avoid this situation, SUSE Linux Enterprise now offers Btrfs quota support for subvolumes. See the btrfs(8) manual page for more details.

1.2.2 Ext2

The origins of Ext2 go back to the early days of Linux history. Its predecessor, the Extended File System, was implemented in April 1992 and integrated in Linux 0.96c. The Extended File System underwent a number of modifications and, as Ext2, became the most popular Linux file system for years. With the creation of journaling file systems and their short recovery times, Ext2 became less important.

A brief summary of Ext2’s strengths might help you understand why it was, and in some areas still is, the favorite Linux file system of many Linux users.

Solidity and Speed

Being quite an old-timer, Ext2 underwent many improvements and was heavily tested. This might be the reason why people often refer to it as rock-solid. After a system outage when the file system could not be cleanly unmounted, e2fsck starts to analyze the file system data. Metadata is brought into a consistent state and pending files or data blocks are written to a designated directory (called lost+found). In contrast to journaling file systems, e2fsck analyzes the entire file system and not just the recently modified bits of metadata. This takes significantly longer than checking the log data of a journaling file system. Depending on file system size, this procedure can take half an hour or more. Therefore, it is not desirable to choose Ext2 for any server that needs high availability. However, because Ext2 does not maintain a journal and uses significantly less memory, it is sometimes faster than other file systems.

Easy Upgradability

Because Ext3 is based on the Ext2 code and shares its on-disk format as well as its metadata format, upgrades from Ext2 to Ext3 are very easy.

1.2.3 Ext3

Ext3 was designed by Stephen Tweedie. Unlike all other next-generation file systems, Ext3 does not follow a completely new design principle. It is based on Ext2. These two file systems are very closely related to each other. An Ext3 file system can be easily built on top of an Ext2 file system. The most important difference between Ext2 and Ext3 is that Ext3 supports journaling. In summary, Ext3 has three major advantages to offer:

Easy and Highly Reliable Upgrades from Ext2

The code for Ext2 is the strong foundation on which Ext3 could become a highly-acclaimed next-generation file system. Its reliability and solidity are elegantly combined in Ext3 with the advantages of a journaling file system. Unlike transitions to other journaling file systems, such as ReiserFS or XFS, which can be quite tedious (making backups of the entire file system and recreating it from scratch), a transition to Ext3 is a matter of minutes. It is also very safe, because re-creating an entire file system from scratch might not work flawlessly. Considering the number of existing Ext2 systems that await an upgrade to a journaling file system, you can easily see why Ext3 might be of some importance to many system administrators. Downgrading from Ext3 to Ext2 is as easy as the upgrade. Just perform a clean unmount of the Ext3 file system and remount it as an Ext2 file system.

Reliability and Performance

Some other journaling file systems follow the metadata-only journaling approach. This means your metadata is always kept in a consistent state, but this cannot be automatically guaranteed for the file system data itself. Ext3 is designed to take care of both metadata and data. The degree of care can be customized. Enabling Ext3 in the data=journal mode offers maximum security (data integrity), but can slow down the system because both metadata and data are journaled. A relatively new approach is to use the data=ordered mode, which ensures both data and metadata integrity, but uses journaling only for metadata. The file system driver collects all data blocks that correspond to one metadata update. These data blocks are written to disk before the metadata is updated. As a result, consistency is achieved for metadata and data without sacrificing performance. A third option to use is data=writeback, which allows data to be written into the main file system after its metadata has been committed to the journal. This option is often considered the best in performance. It can, however, allow old data to reappear in files after crash and recovery while internal file system integrity is maintained. Ext3 uses the data=ordered option as the default.
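The difference between the three data modes comes down to whether data blocks pass through the journal and when they are written relative to the metadata commit. The following sketch is a simplified model of the description above, not kernel code: the function names and step labels are hypothetical, and for data=writeback it shows only one of the orderings that the mode permits.

```python
# Illustrative step labels; these are not kernel function names.
DATA_TO_DISK = "write data blocks to main file system"
DATA_TO_JOURNAL = "write data blocks to journal"
META_TO_JOURNAL = "commit metadata to journal"


def write_sequence(mode):
    """Return a simplified order of steps for one file update."""
    if mode == "journal":
        # Both data and metadata go through the journal.
        return [DATA_TO_JOURNAL, META_TO_JOURNAL]
    if mode == "ordered":
        # Data reaches the disk before the metadata is committed.
        return [DATA_TO_DISK, META_TO_JOURNAL]
    if mode == "writeback":
        # No ordering is enforced; data may land after the commit.
        return [META_TO_JOURNAL, DATA_TO_DISK]
    raise ValueError(f"unknown data mode: {mode!r}")


def data_is_journaled(mode):
    """True if the mode writes data blocks to the journal."""
    return DATA_TO_JOURNAL in write_sequence(mode)


for mode in ("journal", "ordered", "writeback"):
    print(mode, "->", ", then ".join(write_sequence(mode)))
```

In this model, a crash between the two steps of the writeback sequence leaves committed metadata that refers to blocks whose new contents were never written, which is how old data can reappear in files after recovery.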

Converting an Ext2 File System into Ext3

To convert an Ext2 file system to Ext3:

Create an Ext3 journal by running tune2fs -j as the root user.

This creates an Ext3 journal with the default parameters.

To specify how large the journal should be and on which device it should reside, run tune2fs -J instead together with the desired journal options size= and device=. More information about the tune2fs program is available in the tune2fs man page.

Edit the file /etc/fstab as the root user to change the file system type specified for the corresponding partition from ext2 to ext3, then save the changes.

This ensures that the Ext3 file system is recognized as such. The change takes effect after the next reboot.

To boot a root file system that is set up as an Ext3 partition, include the modules ext3 and jbd in the initrd.

Edit /etc/sysconfig/kernel as root, adding ext3 and jbd to the INITRD_MODULES variable, then save the changes.

Run the mkinitrd command.

This builds a new initrd and prepares it for use.

Reboot the system.

Ext3 File System Inode Size and Number of Inodes

An inode stores information about the file and its block location in the file system. To allow space in the inode for extended attributes and ACLs, the default inode size for Ext3 was increased from 128 bytes on SLES 10 to 256 bytes on SLES 11. As compared to SLES 10, when you make a new Ext3 file system on SLES 11, the default amount of space pre-allocated for the same number of inodes is doubled, and the usable space for files in the file system is reduced by that amount. Thus, you must use larger partitions to accommodate the same number of inodes and files as were possible for an Ext3 file system on SLES 10.

When you create a new Ext3 file system, the space in the inode table is pre-allocated for the total number of inodes that can be created. The bytes-per-inode ratio and the size of the file system determine how many inodes are possible. When the file system is made, an inode is created for every bytes-per-inode bytes of space:

number of inodes = total size of the file system divided by the number of bytes per inode

The number of inodes controls the number of files you can have in the file system: one inode for each file. To address the increased inode size and reduced usable space available, the default for the bytes-per-inode ratio was increased from 8192 bytes on SLES 10 to 16384 bytes on SLES 11. The doubled ratio means that the number of files that can be created is one-half of the number of files possible for an Ext3 file system on SLES 10.
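The effect of the two default changes can be worked through with the formula above. The following sketch applies it to a hypothetical 100 GiB file system; the function names are illustrative, and the figures are approximations because the calculation ignores the rounding that the file system creation tools apply.

```python
GIB = 1024 ** 3


def inode_count(fs_size_bytes, bytes_per_inode):
    """Number of inodes = file system size / bytes-per-inode ratio."""
    return fs_size_bytes // bytes_per_inode


def inode_table_bytes(fs_size_bytes, bytes_per_inode, inode_size):
    """Space pre-allocated for all inodes."""
    return inode_count(fs_size_bytes, bytes_per_inode) * inode_size


fs_size = 100 * GIB  # hypothetical file system size

# (label, bytes-per-inode ratio, inode size) defaults from the text above
for label, ratio, size in (("SLES 10", 8192, 128), ("SLES 11", 16384, 256)):
    count = inode_count(fs_size, ratio)
    table = inode_table_bytes(fs_size, ratio, size)
    print(f"{label}: {count} inodes, {table / GIB:.4f} GiB for inodes")
```

With these inputs, the SLES 11 defaults yield half as many inodes as the SLES 10 defaults while the space set aside for inodes stays the same, because the doubled ratio offsets the doubled inode size.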

IMPORTANT: After the inodes are allocated, you cannot change the settings for the inode size or bytes-per-inode ratio. New inodes can be added only by re-creating the file system with different settings or by extending the file system. When you exceed the maximum number of inodes, no new files can be created on the file system until some files are deleted.

When you make a new Ext3 file system, you can specify the inode size and bytes-per-inode ratio to control inode space usage and the number of files possible on the file system. If the block size, inode size, and bytes-per-inode ratio values are not specified, the default values in the /etc/mke2fs.conf file are applied. For information, see the mke2fs.conf(5) man page.

Use the following guidelines:

Inode size:
The default inode size is 256 bytes. Specify a value in bytes that is a power of 2, at least 128, and no larger than the block size, such as 128, 256, 512, and so on. Use 128 bytes only if you do not use extended attributes or ACLs on your Ext3 file systems.

Bytes-per-inode ratio:
The default bytes-per-inode ratio is 16384 bytes. Valid bytes-per-inode ratio values must be a power of 2 equal to 1024 or greater in bytes, such as 1024, 2048, 4096, 8192, 16384, 32768, and so on. This value should not be smaller than the block size of the file system, because the block size is the smallest chunk of space used to store data. The default block size for the Ext3 file system is 4 KB.

In addition, you should consider the number of files and the size of files you need to store. For example, if your file system will have many small files, you can specify a smaller bytes-per-inode ratio, which increases the number of inodes. If your file system will have very large files, you can specify a larger bytes-per-inode ratio, which reduces the number of possible inodes.

Generally, it is better to have too many inodes than to run out of them. If you have too few inodes and very small files, you could reach the maximum number of files on a disk that is practically empty. If you have too many inodes and very large files, you might have free space reported but be unable to use it because you cannot create new files in space reserved for inodes.
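The guidelines above for valid values can be expressed as a small check to run before creating a file system. This is a sketch of those rules only; the function names are hypothetical, and the file system creation tools perform their own validation, which may differ in detail.

```python
def is_power_of_two(n):
    """True for 1, 2, 4, 8, and so on."""
    return n > 0 and (n & (n - 1)) == 0


def check_ext3_inode_settings(block_size, inode_size, bytes_per_inode):
    """Return a list of guideline violations (empty if none)."""
    problems = []
    if (not is_power_of_two(inode_size)
            or inode_size < 128
            or inode_size > block_size):
        problems.append(
            "inode size must be a power of 2 from 128 up to the block size")
    if not is_power_of_two(bytes_per_inode) or bytes_per_inode < 1024:
        problems.append(
            "bytes-per-inode ratio must be a power of 2 of at least 1024")
    elif bytes_per_inode < block_size:
        problems.append(
            "bytes-per-inode ratio should not be smaller than the block size")
    return problems


# SLES 11 defaults with the default 4 KB block size
print(check_ext3_inode_settings(4096, 256, 16384))
# A ratio below the block size is flagged
print(check_ext3_inode_settings(4096, 256, 2048))
```

Returning a list of messages rather than raising on the first violation lets a caller report every problem with a proposed combination at once.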

If you do not use extended attributes or ACLs on your Ext3 file systems, you can restore the SLES 10 behavior by specifying 128 bytes as the inode size and 8192 bytes as the bytes-per-inode ratio when you make the file system. Use any of the following methods to set the inode size and bytes-per-inode ratio:

Modifying the default settings for all new Ext3 file systems:
In a text editor, modify the defaults section of the /etc/mke2fs.conf file to set the inode_size and inode_ratio to the desired default values. The values apply to all new Ext3 file systems. For example:

blocksize = 4096
inode_size = 128
inode_ratio = 8192

At the command line:
Pass the inode size (-I 128) and the bytes-per-inode ratio (-i 8192) to the mkfs.ext3(8) command or the mke2fs(8) command when you create a new Ext3 file system. For example, use either of the following commands:

mkfs.ext3 -I 128 -i 8192 <device>

mke2fs -t ext3 -I 128 -i 8192 <device>

During installation with YaST:
Pass the inode size and bytes-per-inode ratio values when you create a new Ext3 file system during the installation. In the YaST Partitioner on the Edit Partition page under Formatting Options, select Format partition > Ext3, then click Options. In the File system options dialog box, select the desired values from the Block Size in Bytes, Bytes-per-inode, and Inode Size drop-down lists.

For example, select 4096 for the Block Size in Bytes drop-down list, select 8192 from the Bytes per inode drop-down list, select 128 from the Inode Size drop-down list, then click OK.

During installation with AutoYaST:
In an AutoYaST profile, you can use the fs_options tag to set the opt_bytes_per_inode ratio value of 8192 for -i and the opt_inode_density value of 128 for -I.

1.2.4 ReiserFS

Officially one of the key features of the 2.4 kernel release, ReiserFS has been available as a kernel patch for 2.2.x SUSE kernels since version 6.4. ReiserFS was designed by Hans Reiser and the Namesys development team. It has proven itself to be a powerful alternative to Ext2. Its key assets are better disk space utilization, better disk access performance, faster crash recovery, and reliability through data journaling.

Better Disk Space Utilization

In ReiserFS, all data is organized in a structure called a B*-balanced tree. The tree structure contributes to better disk space utilization because small files can be stored directly in the B* tree leaf nodes instead of being stored elsewhere and just maintaining a pointer to the actual disk location. In addition to that, storage is not allocated in chunks of 1 or 4 KB, but in portions of the exact size needed. Another benefit lies in the dynamic allocation of inodes. This keeps the file system more flexible than traditional file systems, like Ext2, where the inode density must be specified at file system creation time.

Better Disk Access Performance

For small files, file data and stat_data (inode) information are often stored next to each other. They can be read with a single disk I/O operation, meaning that only one access to disk is required to retrieve all the information needed.

Fast Crash Recovery

Using a journal to keep track of recent metadata changes makes a file system check a matter of seconds, even for huge file systems.

Reliability through Data Journaling

ReiserFS also supports data journaling and ordered data modes similar to the concepts outlined in Ext3. The default mode is data=ordered, which ensures both data and metadata integrity, but uses journaling only for metadata.

1.2.5 XFS

Originally intended as the file system for their IRIX OS, SGI started XFS development in the early 1990s. The idea behind XFS was to create a high-performance 64-bit journaling file system to meet extreme computing challenges. XFS is very good at manipulating large files and performs well on high-end hardware. However, even XFS has a drawback. Like ReiserFS, XFS takes great care of metadata integrity, but less care of data integrity.

A quick review of XFS’s key features explains why it might prove to be a strong competitor for other journaling file systems in high-end computing.

High Scalability through the Use of Allocation Groups

At the creation time of an XFS file system, the block device underlying the file system is divided into eight or more linear regions of equal size. Those are referred to as allocation groups. Each allocation group manages its own inodes and free disk space. Practically, allocation groups can be seen as file systems in a file system. Because allocation groups are rather independent of each other, more than one of them can be addressed by the kernel simultaneously. This feature is the key to XFS’s great scalability. Naturally, the concept of independent allocation groups suits the needs of multiprocessor systems.
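The division into allocation groups can be pictured with a short sketch. It is a toy model of the layout described above, not the XFS on-disk format: the function names are hypothetical, the device is measured in abstract blocks, and any blocks left over after an even split are ignored.

```python
def allocation_groups(total_blocks, group_count=8):
    """Split a device into equal linear regions.

    Returns a list of (start, end) pairs; end is exclusive.
    """
    if group_count < 1:
        raise ValueError("group_count must be at least 1")
    size = total_blocks // group_count
    return [(i * size, (i + 1) * size) for i in range(group_count)]


def group_of(block, groups):
    """Return the index of the allocation group that holds a block."""
    for index, (start, end) in enumerate(groups):
        if start <= block < end:
            return index
    raise ValueError("block is outside all allocation groups")


groups = allocation_groups(8000)  # hypothetical device of 8000 blocks
print(groups[0], groups[-1])
print(group_of(2500, groups))
```

Because each region covers its own range of blocks, operations on blocks in different groups touch disjoint bookkeeping, which is what allows several groups to be worked on at the same time.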

High Performance through Efficient Management of Disk Space

Free space and inodes are handled by B+ trees inside the allocation groups. The use of B+ trees greatly contributes to XFS’s performance and scalability. XFS uses delayed allocation, which handles allocation by breaking the process into two pieces. First, a pending transaction is stored in RAM and the appropriate amount of space is reserved. At this point, XFS has not yet decided where exactly (in file system blocks) the data should be stored. This decision is delayed until the last possible moment. Some short-lived temporary data might never make its way to disk, because it is obsolete by the time XFS decides where to save it. In this way, XFS increases write performance and reduces file system fragmentation. Because delayed allocation results in less frequent write events than in other file systems, data loss after a crash during a write is likely to be more severe.
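The two-step nature of delayed allocation can be sketched as follows. This is a toy model under simplifying assumptions, not XFS code: the DelayedAllocator class and its methods are hypothetical, space is counted in abstract blocks, and placement is a simple sequential counter.

```python
class DelayedAllocator:
    """Toy model: reserve space at write time, place blocks at flush time."""

    def __init__(self, free_blocks):
        self.free_blocks = free_blocks  # blocks not reserved or placed
        self.pending = {}               # name -> blocks reserved in RAM
        self.on_disk = {}               # name -> list of (start, count)
        self.next_block = 0             # next unplaced block on the device

    def write(self, name, blocks):
        # Piece 1: reserve the space, but do not choose a location yet.
        if blocks > self.free_blocks:
            raise OSError("no space left on device")
        self.free_blocks -= blocks
        self.pending[name] = self.pending.get(name, 0) + blocks

    def delete(self, name):
        # Pending data is dropped without ever reaching the disk.
        self.free_blocks += self.pending.pop(name, 0)
        for _, count in self.on_disk.pop(name, []):
            self.free_blocks += count

    def flush(self):
        # Piece 2: decide placement; each file's pending blocks
        # are laid out as one contiguous extent.
        for name, blocks in self.pending.items():
            extent = (self.next_block, blocks)
            self.on_disk.setdefault(name, []).append(extent)
            self.next_block += blocks
        self.pending.clear()


alloc = DelayedAllocator(free_blocks=100)
alloc.write("log", 3)
alloc.write("tmp", 5)
alloc.write("log", 2)
alloc.delete("tmp")  # short-lived file removed before the flush
alloc.flush()
print(alloc.on_disk)
print(alloc.free_blocks)
```

In this run the two separate writes to the same file end up in a single extent, and the temporary file is never placed at all. Everything still held in the pending table at the moment of a crash would be lost, which mirrors the trade-off noted above.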

Preallocation to Avoid File System Fragmentation

Before writing the data to the file system, XFS reserves (preallocates) the free space needed for a file. Thus, file system fragmentation is greatly reduced. Performance is increased because the contents of a file are not distributed all over the file system.