Design for Large Allocation Blocks

Contents

Problem statement

If the vast majority of the files are large, and the system is running
under tight memory pressure, such that the block allocation bitmaps
and buddy bitmaps are frequently getting evicted from the system, if
the filesystem's free space is moderately fragmented, file allocations
and unlinks may require reading multiple block bitmaps into memory,
and this can be a surprisingly large amount of overhead, since the
reading the various block bitmaps is a synchronous operation that
tends to involve a large number of seeks.

One way of solving this problem is to use a large block size than 4k
--- say, perhaps 256k or 1 megabyte. By reducing the number of blocks
needed to allocate the same file, and at the same time increasing the
amount of storage that can be covered by a block allocation bitmap, we
can address this problem, at the cost of increased internal
fragmentation. However Linux has a fundamental assumption that the
page size is >= the file system block size.

So the goal is to design a system that allows us to get the benefits
of larger block sizes without the complexities caused by large block
sizes.

Constraints

We want to keep the changes to the ext4 code base small and easily
testable. Ideally, no changes to the core kernel should be required.
What follows from this is a requirement that the fundamental block
size (as far as the VM is concerned) must remain at 4k.

Design

The solution to this problem is to use an increased allocation size as
far as the block allocaiton bitmaps are concerned. However, the size
of allocation bitmaps, and the granularity of blocks as far as the the
extent tree blocks are concerned, are still based on the original
maximum 4k block size.

We assume the fields previously used for defining the fragment size in
the ext4 superblock: s_log_frag_size and redefine it to be
s_log_alloc_size. This defines a size, the allocation blocksize. The
allocation blocksize is only enabled if a new INCOMPAT feature is
defined, EXT4_FEATURE_INCOMPAT_BIGALLOC, and it must be greater than
or equal to the block size.

The allocation blocksize controls the allocation granularity of a bit
in the block bitmap. However, the maximum size of a block bitmap
remains a single block size. Hence, for a file system with a 4k block
size and a 1 megabyte allocation blocksize, there can be a maximum of
32,768 (4k * 8) allocation blocks, and so a single block group can now
cover a maximum of 32GiB.

To minimize changes to ext, we will constrain an allocation block such
that it must contain only fixed file system metadata (i.e., block
bitmaps, inode bitmaps, and inode table) or data blocks belonging to a
single inode. Furthermore, if allocation block is to be used for data
blocks for an inode, it must be used contiguously starting at a file
offset which is a multiple of the allocation blocksize. The first
block of the allocation block must be mapped in the extent tree (or
indirect block), so that the ext4_map_blocks() can easily find (and
continue using) partially used allocation block.

Because we are not changing the definition of a block, the only
changes that need to be made are at the intersection of allocating to
an inode (or to file system metadata). This is good, because it means
the bulk of ext4 does not need to be changed

Kernel Changes required

1) Globally throughout ext4: uses of EXT4_BLOCKS_PER_GROUP() need to
be audited to see if they should be EXT4_BLOCKS_PER_GROUP() or
EXT4_ALLOC_BLOCKS_PER_GROUP().

2) ext4_map_blocks() and its downstream functions need to be changed so
that they understand the new allocation rules, and in particular
understand that before allocating a new block, they need to see if a
partially allocated block has already been allocated, and can be used
to fulfill the current allocation.

3) mballoc.c will need little or no changes, other than the
EXT4_BLOCKS_PER_GROUP()/EXT4_ALLOC_BLOCKS_PER_GROUP() audit discussed
in (1).

E2fsprogs changes required

Changes to dumpe2fs, debugfs, etc., to understand the new

superblock field.

Changes to mke2fs to understand how to allocate the inode, block,

and inode table metadata blocks.

Changes to e2fsck pass1 and pass 1b/c/d to understand that a single

allocation block can be shared across multiple file system metadata
blocks or a file's data blocks. (But the fact that allocation block
can be used contiguously vis-a-vis an inode's logical block space will
make these changes relatively straightforwarwd.)

Upsides

Block allocation and deallocation for large files should be much
faster, since a block bitmap will cover much more space, and it is
much less likely that a single file system operation will need to
touch multiple block bitmaps.

Using a larger allocation blocksize will also reduce external
fragmentation. By reducing the free space fragmentation, when files
are written (and later read) seeking between data blocks will be
greatly reduced.

Directory blocks will be contiguous, which will speed up situations
where a directory is very, very large.

Downsides

Internal fragmentation will be expensive for small files. So this is
only useful for file systems where most files are large, or where the
file system performance is more important than the losses caused by
internal fragmentation.

Directories will also be allocated in chucks of the allocation block
size. If this is especially large (such as 1 MiB), and there are a
large number of directories, this could be quite expensive.
Applications which use multi-level directory schemes to keep
directories small to optimize for ext2's very slow large directory
performance could be especially vulnerable.