Overview

In recent Linux kernels, filesystem DAX supports 2 MiB hugepage faults in addition to the standard 4 KiB page faults. This means that each filesystem DAX page fault can map either 4 KiB or 2 MiB of persistent memory into userspace.

Servicing page faults with 2 MiB hugepage mappings instead of 4 KiB mappings has several advantages. It will result in fewer page faults (a single 2 MiB hugepage fault instead of 512 page faults at 4 KiB), smaller page tables and less TLB contention. The end result of using filesystem DAX hugepages is reduced memory usage and increased performance.

However, for filesystem DAX to be able to use 2 MiB hugepages several things have to happen:

1. Our mmap() mapping has to be at least 2 MiB in size.

2. Our filesystem block allocation has to be at least 2 MiB in size.

3. Our filesystem block allocation has to have the same alignment as our mmap().

The first of these, the size of our mmap() region, is the most easily controlled. The filesystem block allocations are trickier. Luckily, the two filesystems that support filesystem DAX, ext4 and XFS, can both be asked for block allocations of a specific size and alignment. This feature was introduced to support RAID, but it works equally well for filesystem DAX.
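As a rough sketch, the three conditions above can be written down as a small shell check. This is a simplified model of the rules as stated here, not the kernel's actual decision logic; the function name `pmd_possible` and its four arguments (mapping length, allocation length, mapping address, allocation address) are hypothetical.

```shell
# 2 MiB in bytes, the size of one PMD mapping.
PMD_SIZE=$((2 * 1024 * 1024))

# pmd_possible MAP_LEN ALLOC_LEN MAP_ADDR ALLOC_ADDR  (hypothetical helper)
# Prints "pmd" if the three conditions listed above hold, otherwise "pte".
pmd_possible() {
    map_len=$1; alloc_len=$2; map_addr=$3; alloc_addr=$4

    # Conditions 1 and 2: both the mapping and the allocation span >= 2 MiB.
    if [ $(( $map_len < PMD_SIZE || $alloc_len < PMD_SIZE )) -eq 1 ]; then
        echo pte
        return
    fi

    # Condition 3: both addresses sit at the same offset within a 2 MiB unit.
    if [ $(( $map_addr % PMD_SIZE )) -eq $(( $alloc_addr % PMD_SIZE )) ]; then
        echo pmd
    else
        echo pte
    fi
}
```

For example, `pmd_possible $((4 * 1024 * 1024)) $((2 * 1024 * 1024)) 0x7f0000200000 0x240000000` reports `pmd`, while shrinking the mapping to 1 MiB reports `pte`.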

System Configuration

Here are the steps that I've used to successfully get filesystem DAX PMD (2 MiB) mappings:

1. First, the persistent memory namespace that backs the filesystem has to start at a 2 MiB aligned physical address. This is important because when we ask the filesystem for 2 MiB aligned and sized block allocations, it will provide those block allocations relative to the beginning of its block device. If the filesystem is built on top of a namespace whose data starts at a 1 MiB aligned offset, for example, a block allocation that is 2 MiB aligned from the point of view of the filesystem will still be only 1 MiB aligned from DAX's point of view. This will cause DAX to fall back to 4 KiB page faults.
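The arithmetic behind that fallback can be sketched in a few lines of shell. The helper name `dax_misalignment` and the example addresses are made up for illustration; the only assumption is that the physical address DAX sees is the namespace's data start plus the filesystem's offset.

```shell
PMD_SIZE=$((2 * 1024 * 1024))

# dax_misalignment NS_START FS_OFFSET  (hypothetical helper)
# Prints how far, in bytes, the resulting physical address is from a
# 2 MiB boundary. 0 means the allocation is usable for a 2 MiB fault.
dax_misalignment() {
    echo $(( ($1 + $2) % PMD_SIZE ))
}

# Namespace data starts on a 1 MiB boundary; the filesystem hands out an
# allocation that is 2 MiB aligned relative to the block device.
dax_misalignment 0x100000 0x200000      # prints 1048576 (1 MiB off)

# Same allocation on a namespace that starts on a 2 MiB boundary.
dax_misalignment 0x240000000 0x200000   # prints 0
```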

We can find the alignment of the persistent memory namespaces by looking at /proc/iomem, among other places:
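As a minimal sketch of that check, the function below reads /proc/iomem-style lines and reports whether each "Persistent Memory" range starts on a 2 MiB boundary. The function name `iomem_alignment` is hypothetical, and it assumes the ranges are labelled "Persistent Memory" and that the file is read as root (unprivileged readers may see zeroed addresses).

```shell
# iomem_alignment  (hypothetical helper)
# Reads /proc/iomem-style lines on stdin, e.g.
#   240000000-43fffffff : Persistent Memory
# and prints the start address of each such range plus its alignment.
iomem_alignment() {
    grep 'Persistent Memory' | while read -r range _rest; do
        start=${range%%-*}          # keep the hex start, drop "-end"
        if [ $(( 0x$start % (2 * 1024 * 1024) )) -eq 0 ]; then
            echo "0x$start aligned"
        else
            echo "0x$start unaligned"
        fi
    done
}
```

On a real system this would be run as `iomem_alignment < /proc/iomem`.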

If we create any partitions on top of our PMEM namespace, we must ensure that those partitions are likewise 2 MiB aligned. By default fdisk will create partitions that are 1 MiB (2048 sector) aligned from the start of the parent block device:
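A quick way to sanity-check a partition's start is to convert its start sector to bytes and test it against 2 MiB. This is a sketch; the helper name `partition_aligned` is hypothetical and it assumes 512-byte sectors unless told otherwise.

```shell
# partition_aligned START_SECTOR [SECTOR_SIZE]  (hypothetical helper)
# Prints "yes" if the partition starts on a 2 MiB boundary relative to
# the start of its parent block device, otherwise "no".
partition_aligned() {
    sector_size=${2:-512}
    if [ $(( ($1 * sector_size) % (2 * 1024 * 1024) )) -eq 0 ]; then
        echo yes
    else
        echo no
    fi
}

partition_aligned 2048   # 2048 * 512 B = 1 MiB -> no
partition_aligned 4096   # 4096 * 512 B = 2 MiB -> yes
```

The start sector of an existing partition can be read from sysfs, for example `partition_aligned "$(cat /sys/block/pmem0/pmem0p1/start)"`, where the device and partition names are placeholders.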

2. Once we have a block device that starts at a 2 MiB aligned persistent memory address, we then need to create a filesystem on top of it that will give us 2 MiB aligned and sized block allocations. Here are the commands to do that with either ext4 or XFS:
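As a sketch of what such commands can look like, the snippet below prints (but does not run) one mkfs invocation per filesystem. It assumes a 4 KiB filesystem block size and uses the RAID geometry options of mkfs.ext4 (`-E stride=`, counted in filesystem blocks) and mkfs.xfs (`-d su=,sw=`); the device path and both function names are placeholders.

```shell
# ext4_stride BLOCK_SIZE  (hypothetical helper)
# Number of filesystem blocks that make up 2 MiB.
ext4_stride() {
    echo $(( 2 * 1024 * 1024 / $1 ))
}

# print_mkfs_commands DEVICE  (hypothetical helper)
# Prints candidate mkfs command lines; nothing is executed or formatted.
print_mkfs_commands() {
    dev=$1
    block_size=4096     # assumption: 4 KiB filesystem blocks
    echo "mkfs.ext4 -b $block_size -E stride=$(ext4_stride "$block_size") -F $dev"
    echo "mkfs.xfs -f -d su=2m,sw=1 $dev"
}

print_mkfs_commands /dev/pmem0
```

With 4 KiB blocks the ext4 stride works out to 512 blocks, which is 2 MiB. On XFS it may also be worth setting a 2 MiB extent size hint on the mounted directory with xfs_io's `extsize` command.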

3. Now that we have a filesystem that can give us 2 MiB sized and aligned block allocations, we just need to create a file that will receive those allocations. To do this we need to begin with a file that is at least 2 MiB in size. We can do this with truncate(1), ftruncate(2), fallocate(1), posix_fallocate(3), etc. For example:

# fallocate --length 1G /mnt/dax/data

or

# truncate --size 1G /mnt/dax/data
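After creating the file, it can be handy to confirm that it really is at least 2 MiB long before mapping it. The helper below is a small sketch with a hypothetical name; it only checks the file's length, not how its blocks were allocated.

```shell
# file_big_enough PATH  (hypothetical helper)
# Succeeds (exit status 0) if the file is at least 2 MiB long.
file_big_enough() {
    size=$(( $(wc -c < "$1") ))     # file length in bytes
    [ "$size" -ge $((2 * 1024 * 1024)) ]
}
```

For example, `file_big_enough /mnt/dax/data && echo ok`.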

Verifying Results

Once we have a system that is capable of giving us 2 MiB filesystem DAX faults, we probably want to verify that we are actually succeeding in using faults of that size.

The way that I normally do this is by looking at the filesystem DAX tracepoints:

The first thing to look at is the NOPAGE return value at the end of the line. This means that the fault succeeded and didn't return a page cache page, which is expected for DAX. A 2 MiB fault that failed and fell back to 4 KiB DAX faults will instead look like this:

You can see that this fault resulted in a fallback to 4 KiB faults via the FALLBACK return code at the end of the line. The rest of the data in this line can help you determine why the fallback happened. In this case it was because I intentionally created an mmap() area that was smaller than 2 MiB.
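When a test generates many faults, reading the trace line by line gets tedious. The sketch below tallies the return code at the end of each line. It assumes the lines come from the `dax_pmd_fault_done` tracepoint and that the return code is the last whitespace-separated field; the function name `count_pmd_results` is hypothetical.

```shell
# count_pmd_results  (hypothetical helper)
# Reads trace output on stdin, keeps only dax_pmd_fault_done events and
# counts how many ended in NOPAGE (2 MiB fault succeeded) versus
# FALLBACK (fell back to 4 KiB faults).
count_pmd_results() {
    awk '/dax_pmd_fault_done/ { n[$NF]++ }
         END { printf "NOPAGE=%d FALLBACK=%d\n", n["NOPAGE"], n["FALLBACK"] }'
}
```

On a system with the tracepoint enabled this could be run as `count_pmd_results < /sys/kernel/debug/tracing/trace`, assuming tracefs is mounted at that path.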