I’ve removed much of the output here to make things more readable. The directory file is fragmented, requiring multiple single-block extents, which is common for directories in XFS. The directory would start as a single block. Eventually enough files will be added to the directory that it needs more than one block to hold all the file entries. But by this time, the blocks immediately following the original directory block have been consumed– often by the files which make up the content of the directory. When the directory needs to grow, it typically has to fragment.

What is really interesting about multi-block directories in XFS is that they are sparse files. Looking at the list of extents at the end of the xfs_db output, we see that the first two blocks are at logical block offsets 0 and 1, but the third block is at logical block offset 8388608. What the heck is going on here?

If you recall from our discussion of block directories in the last installment, XFS directories have a hash lookup table at the end for faster searching. When a directory consumes multiple blocks, the hash lookup table and “tail record” move into their own block. For consistency, XFS places this information at logical offset XFS_DIR2_LEAF_OFFSET, which is currently set to 32GB. 32GB divided by our 4K block size gives a logical block offset of 8388608.

From a file size perspective, we can see that xfs_db agrees with our earlier ls output, saying the directory is 8192 bytes. However, the xfs_db output clearly shows that the directory consumes three blocks, which should give it a file size of 3*4096 = 12288 bytes. Based on my testing, the directory “size” in XFS only counts the blocks that contain directory entries.

We can use xfs_db to examine the directory data blocks in more detail:

I’m using the addr command in xfs_db to select the startblock value from the first extent in the array (the zero element of the array).

The beginning of this first data block is nearly identical to the block directories we looked at previously. The only difference is that single block directories have a magic number “XDB3”, while data blocks in multi-block directories use “XDD3” as we see here. Remember that the value that xfs_db lobels dhdr.hdr.bno is actually the sector offset to this block and not the block number.

Again we see the same header information. Note that each data block has it’s own “free space” array, tracking available space in that data block.

Finally, we have the block containing the hash lookup table and tail record. We could use xfs_db to decode this block, but it turns out that there are some interesting internal structures to see here. Here’s the hex editor view of the start of the block:

The “forward” and “backward” links would come into play if this were a multi-node B+Tree data structure rather than a single block. Unlike previous magic number values, the magic value here (0x3df1) does not correspond to printable ASCII characters.

After the typical XFS header information, there is a two-byte value tracking the number of entries in the directory, and therefore the number of entries in the hash lookup table that follows. The next two bytes tell us that there is one unused entry– typically a record for a deleted file.

We find this unused record near the end of the hash lookup array. The entry starting at block offset 0x840 has an offset value of zero, indicating the entry is unused:

Interestingly, right after the end of the hash lookup array, we see what appears to be the extended attribute information from an inode. This is apparently residual data left over from an earlier use of the block.

At the end of the block is data which tracks free space in the directory:

The last four bytes in the block are the number of blocks containing directory entries– two in this case. Preceding those four bytes is a “best free” array that tracks the length of the largest chunk of free space in each block. You will notice that the array values here correspond to the dhdr.bestfree[0].length values for each block in the xfs_db output above. When new directory entries are added, this array helps the file system locate the best spot to place the new entry.

We see the two bytes immediately before the “best free” array are identical to the first entry in the array. Did the /etc directory once consume three blocks and later shrink back to two? Based on limited testing, this appears to be the case. Unlike directories in traditional Unix file systems, which never shrink once blocks have been allocated, XFS directories will grow and shrink dynamically as needed.

So far we’ve looked at the three most common directory types in XFS: small “short form” directories stored in the inode, single block directories, and in this case a multi-block directories tracked with an extent array in the inode. In rare cases, when the directory is very large and very fragmented, the extent array in the inode is insufficient. In these cases, XFS uses a B+Tree to track the extent information. We will examine this scenario in the next installment.

In the previous installment, we looked at small directories stored in “short form” in the inode. While these small directories can make up as much as 90% of the total directories in a typical Linux file system, eventually directories get big enough that they can no longer be packed into the inode data fork. When this happens, directory data moves out to blocks on disk.

In the inode, the data fork type (byte 5) changes to indicate that the data is no longer stored within the inode. Extents are used to track the location of the disk blocks containing the directory data. Here is the inode core and extent list for a directory that only occupies a single block:

The data fork type is 2, indicating an extent list follows the inode core. Bytes 76-79 indicate that there is only a single extent. The extent starts at byte 176 (0x0B0), immediately after the inode core. The last 21 bits of the extent structure show that the extent only contains a single block. Parsing the rest of the extent yields a block address of 0x8118e7, or relative block 71911 in AG 2.

We can extract this block and examine it in our hex editor. Here is the data in the beginning of the block:

Multiply the block offset 4926183 by 8 sectors per block to get the sector offset value 39409464 that we see in the directory block header.

Following the header is a “free space” array that consumes 12 bytes, plus 4 bytes of padding to preserve 64-bit alignment. The free space array contains three elements which indicate where the three largest chunks of unused space are located in this directory block. Each element is a 2 byte offset and a 2 byte length field. The elements of the array are sorted in descending order by the length of each chunk.

In this directory block, there is only a single chunk of free space, starting at offset 1296 (0x0510) and having 2376 bytes (0x0948) of space. The other elements of the free space array are zeroed, indicating no other free space is available.

The directory entries start at byte 64 (0x040) and can be read sequentially like a typical Unix directory. However, XFS uses a hash-based lookup table, growing up from the bottom of the directory block, for more efficient searching:

The last 8 bytes of the directory block are a “tail record” containing two 4 byte values: the number of directory entries (0x34 or 52) and the number of unused entries (zero). Immediately preceding the tail record will be an array of 8 byte records, one record per directory entry (52 records in this case). Each record contains a hash value computed from the file name, and the offset in the directory block where the directory entry for that file is located. The array is sorted by hash value so that binary search can quickly find the desired record. The offsets are in 8 byte units.

The xfs_db program can compute hash values for us:

xfs_db> hash 03_smallfile
0x3f07fdec

If we locate this hash value in the array, we see the byte offset value is 0x12 or 18. Since the offset units are 8 bytes, this translates to byte offset 144 (0x090) from the start of the directory block.

Here are the first six directory entries from this block, including the entry for “03_smallfile”:

Padding for alignment is only included if necessary. Our “03_smallfile” entry starting at offset 0x090 is exactly 24 bytes long and needs no padding for alignment. You can clearly see the padding in the “.” and “..” entries starting at offset 0x040 and 0x050 respectively.

Deleting a File

If we remove “03_smallfile” from this directory, the inode updates similarly to what we saw with the “short form” directory in the last installment of this series. The mtime and ctime values are updated, and the CRC32 and Logfile Sequence Number fields as well. The file size does not change, since the directory still occupies one block.

The “tail record” and hash array at the end of the directory block change:

The tail record still shows 34 entries, but one of them is now unused. If we look at the entry for hash 0x3F07FDEC, we see the offset value has been zeroed, indicating an unused record.

We also see changes at the beginning of the block:

The free space array now uses the second element, showing 24 (0x18) bytes free at byte offset 0x90– the location where the “03_smallfile” entry used to reside.

Looking at offset 0x90, we see that the first two bytes of the inode field are overwritten with 0xFFFF, indicating an unused entry. The next two bytes are the length of the free space. Again we see 0x18, or 24 bytes.

However, since inode addresses in this file system fit in 32 bits, the original inode address associated with this file is still clearly visible. The rest of the original directory entry is untouched until a new entry overwrites this space. This should make file recovery easier.

Not Quite Done With Directories

When directories get large enough to occupy multiple blocks, the directory structure gets more complicated. We’ll examine larger directories in our next installment.

XFS uses several different directory structures depending on the size of the directory. For testing purposes, I created three directories– one with 5 files, one with 50, and one with 5000 file entries. Small directories have their data stored in the inode. In this installment we’ll examine the inode of the directory that contains only five files.

We documented the “inode core” layout and the format of the extended attributes in Part 2 of this series. In this inode the file type (upper nibble of byte 2) is 4, which means it’s a directory. The data fork type (byte 5) is 1, meaning resident data.

Resident directory data is stored as a “short form” directory structure starting at byte offset 176, right after the inode core. First we have a brief header:

First we have a byte tracking the number of directory entries to follow the header. The next byte tracks how many directory entries require 64 bits for inode data. As we saw in Part 1 of this series, XFS uses variable length addreses for blocks and inodes. In our file system, we need less than 32 bits to store these addresses, so there are no directory entries requiring 64-bit inodes. This means the directory data will use 32 bits to store inodes in order to save space.

This has an immediate impact because the next entry in the header is the inode of the parent directory. Since byte 177 is zero, this field will be 32 bits. If byte 177 was non-zero, then all inode entries in the header and directory entries would be 64-bit.

The parent inode field in the header is the equivalent of the usual “..” link in the directory. The current directory inode (the “.” link) is found in the inode core in bytes 152-159. The short form directory simply uses these values and does not have explicit “.” and “..” entries.

After the header come a series of variable length directory entries, packed as tightly as possible with no alignment constraints. Entries are added to the directory in order of file creation and are not sorted in any way.

Here is a description of the fields and a breakdown of the values in the five directories in this inode:

First we have a single byte for the file name length in bytes. Like other Unix file systems, there is a 255 character file name limit.

The next two bytes are based on the byte offset the directory entry would have if it were a normal XFS directory entry and not packed into a short form directory in the inode. In a normal directory block, directory entries are 64-bit aligned and start at byte offset 96 (0x60) following the directory header and “.” and “..” entries. The directory entries here are all 18 or 20 bytes long, which means they would consume 24 bytes (0x18) in a normal directory block. Using a consistent numbering scheme for the offset makes it easier to write code that iterates through directory entries, even though the offsets don’t match the actual offset of each directory entry in the short form style.

Next we have the characters in the file name followed by a single byte for the file type. The file type is included in the directory entry so that commands like “ls -F” don’t have to open each inode to get the file type information. The file type values in the directory entry do not use the same number scheme as the file type in the inode. Here are the expected values for directory entries:

Finally there is a field to hold the inode associated with the file name. In our example, these inode entries are 32 bits. 64-bit inode fields will be used if the directory header indicates they are needed.

Deleting a File

When a file is deleted from (or added to) a directory, the mtime and ctime in the directory’s inode core are updated. The directory file size changes (bytes 56-63). The CRC32 checksum and the logfile sequence number fields are updated.

In the data fork, all directory entries after the deleted entry are shifted downwards, completely overwriting the deleted entry. Here’s what the directory entries look like after “03_smallfile”– the third entry in the original directory– is deleted:

The four remaining directory entries are highlighted above. However, after those entries you can clearly see the residue of the entry for “05_smallfile” from the original directory. So as short-form directories shrink, they leave behind entries in the unused “inode slack”. In this case the residue is for a file entry that still exists in the directory, but it’s possible that we might get residue of entries deleted from the end of the directory list.

When Directories Grow Up

Another place you can see short form directory residue is when the directory gets large enough that it needs to move out to blocks on disk. I created a sample directory that initially had five files and confirmed that it was being stored as a short form directory in the inode. Then I added 45 more files to the directory, which made a short form directory impossible. Here’s what the first part of the inode looks like after these two operations:

The data fork type (byte 5) is 2, meaning an extent list after the inode core, giving the location of the directory content on disk. You can see the extent highlighted starting at byte offset 176 (0xb0). But immediately after that extent you can see the residue of the original short-form directory.

The format of directories changes significantly when directory entries move out into disk blocks. In our next installment we will examine the structures in these larger directories.

Part 1 of this series was a quick introduction to XFS, the XFS superblock, and the unique Allocation Group (AG) based addressing scheme used in the file system. With this information, we were able to extract an inode from its physical location on disk

In this installment, we will look at the structure of the XFS inode. Since we will want to see what remains in the inode after a file is deleted, I’m going to create a small file for testing purposes:

To save time, we’ll use the xfs_db program to convert that inode address into the values we need to extract the inode from its physical location on disk. Then we’ll use dd to extract the inode as we did in Part 1.

XFS inodes start with the 2 byte magic number value “IN”. Inodes also have a CRC32 checksum (bytes 100-103) to help detect corruption. The inode includes its own absolute inode number (bytes 152-159) and the file system UUID (bytes 160-175), which should match the UUID value from the superblock. Whenever the inode is updated, bytes 112-119 track the “logfile sequence number” (LSN) of the journal entry for the update. The inode format has changed across different versions of the XFS file system, so refer to the inode version in byte 4 before decoding the inode. XFS v5 uses v3 inodes.

The size of the file (in bytes) is a 64-bit value in bytes 56-63. The original XFS inode tracked the number of links as a 16-bit value (bytes 6-7), which is no longer used. Number of links is now tracked as a 32-bit value found in bytes 16-19.

Timestamps include both a 32-bit “Unix epoch” style seconds field and a 32-bit nanosecond resolution fractional seconds field. The three classic Unix timestamps– atime, mtime, ctime– are found in bytes 32-55 of the inode. File creation time (btime) was only added in XFS v5, so that timestamp resides in bytes 144-151 in the upper portion of the inode core.

File ownership and permissions are tracked as in earlier Unix file systems. There are 32-bit file owner (bytes 8-11) and group owner (bytes 12-15) fields. File type and permissions are stored in a packed 16-bit structure. The low 12 bits are the standard Unix permissions bits, and the upper four bits are used for the file type.

The 12 permissions bits are grouped into four groups of 3 bits, and are often written in octal notation– in our case we have 0644. The first group of three represents the “special” bit flags: set-UID, set-GID, and “sticky” (none of these are set for our test file). The remaining three groups represent “read” (r), “write” (w), and “execute” (x) permissions for three categories. The first set of bits applies to the file owner, the second to members of the Unix group that owns the file, and the last group for everybody else. The permissions on our test file are 644 or 110 100 100 aka rw-r–r–. In other words, read and write access for the file owner, and read only access for group members and for all other users on the system.

The remaining space after the 176 bytes of inode core is used to track the data blocks associated with the file (the “data fork” of the file) and any extended attributes that may be set. There are multiple ways in which data and attributes may be stored– locally resident within the inode, in a series of extents, or in a more complex B+Tree indexed structure. The data fork type flag in byte 5 and the extended attribute type flag in byte 83 document how this information is organized. The possible values for these fields are:

Currently XFS only uses resident or “local” storage for extended attributes and small directories. There is a proposal to allow small files to be stored in the inode (similar to NTFS), but this is still under development. The data fork for our small test file is type 2– an array of extent structures. The extended attributes are type 1, meaning they are stored locally in the inode.

The data fork starts at byte 176, immediately after the inode core. The start of the extended attribute data is found at an offset from the end of the inode core. This offset is byte 82 of the inode core, and the units are multiples of 8 bytes. In our sample inode, the offset value is 0x23 or 35. Multiplying by 8 gives a byte offset of 280 from the end of the inode core, or 176+280=456 bytes from the beginning of the inode.

Extent Arrays

The most common storage option for file content in XFS is data fork type 2– an array of 16 byte extent structures starting immediately after the inode core. Bytes 76-79 indicate how many extent structures are in the array. Our file is not fragmented, so there is only a single extent structure in the inode.

Theoretically, the 336 bytes following the inode core could hold 21 extent structures, assuming no extended attribute data. If the inode cannot hold all of the extent information (an extremely fragmented file), then the data fork in the inode becomes the root of a B+Tree (data fork type 3) for tracking extent information. We will see an example of this in a later installment in this series.

The challenging thing about XFS extent structures is that they are not byte aligned. They contain four fields as follows:

Flag (1 bit) – Set if extent is preallocated but not yet written, zero otherwise

If you think this makes manually decoding XFS extent information challenging, you’d be correct. Let’s break the extent structure down into individual bits in order to make decoding a bit easier. The extent starts at byte offset 176 (0xb0), and I’ll use a little command-line magic to see the bits:

Looks like we got it right. Note that XFS null fills file slack space, which is typical for Unix file systems.

Extended Attributes

XFS allows arbitrary extended attributes to be added to the file. Attributes are simply name, value pairs. There is a 255 byte limit on the size of any attribute name or value. You can set or view attributes from the command line with the “attr” command.

If the amount of attribute data is small, extended attributes will be stored in the inode, just as they are in our sample file. Large amounts of attribute information may need to be stored in data blocks on disk, in which case the attribute data is tracked using extents just like the data fork.

As we discussed above, resident attribute information starts at a specific byte offset from the end of the inode core. In our sample file the offset is 280 bytes from the end of the inode core or 456 bytes (280 + 176) from the start of the inode.

The length field unit is bytes and includes the 4 byte header. Our sample file only contains a single attribute.

Each attribute structure is variable length, to allow attributes to be packed as tightly as possible. Each attribute structure starts with a single byte for the name length, then a byte for the value length, and a flag byte. The rest of the attribute structure is the name followed by the value, with no null terminators or padding for byte alignment.

Breaking down the single attribute we have in our sample inode, we see:

This attribute holds the SELinux context on our file, “unconfined_u:object_r:admin_home_t:s0”. While extended attribute values are not required to be null-terminated, SELinux expects it’s context labels to have null terminators. So the 38 byte value length is 37 printable characters and a null.

The flags field is designed to control access to the attribute information. The flags byte is defined as a bit mask, but only four values appear to be used currently:

128 Attribute is being updated
4 "Secure" - attribute may be viewed by all but only set by root
2 "Trusted" - attribute may only be viewed and set by root
0 No restrictions

The Inode After Deletion

When a file is deleted, changes are limited to a small number of fields in the inode core:

The 2 byte file type and permissions field is zeroed

Link count, file size, number of blocks, and number of extents are zeroed

ctime is set to the time the file was deleted

The offset to the extended attributes is zeroed

The data fork and extended attribute type bytes are set to 2, which would normally mean an extent array

The “Generation number” field (inode bytes 92-95) is incremented–more testing is required, but it appears this field may be a usage count for the inode

The CRC32 checksum and the LSN are updated

No other data in the inode changes. So while the number of extents value is zeroed and so is the offset to the start of the extended attributes, the actual extent and attribute data remains in the inode.

This means it should be straightforward to recover the original file by parsing whatever extent data exists starting at inode offset 176. The XFS FAQ points to two Open Source projects that appear to use this idea to recover deleted files, and a little Google searching turns up several commercial tools that claim to do XFS file recovery:

In limited testing it also appears that the data fork and the extended attribute information are not zeroed when the inode is reused. This means there is the possibility of finding remnants of data from a previous file in the unused or “slack” space in the inode.

Using xfs_db to View Inodes

xfs_db allows you to quickly view the inode values, even for inodes that are currently unallocated:

What’s Next?

XFS does not store file name information in the inode, which is pretty typical for Unix file systems. The only place where file names exist is in directory entries. In our next installment we will begin to examine the different XFS directory types. Yes, it’s complicated.

The XFS file system was originally developed by Silicon Graphics for their IRIX operating system. The Linux version is increasingly popular– Red Hat has adopted XFS as their default file system as of Red Hat Enterprise Linux v7. Unfortunately, while XFS is becoming more common on Linux systems, we are lacking forensic tools for decoding this file system. This series will provide insights into the XFS file system structures for forensics professionals, and document the current state of the art as far as tools for decoding XFS.

I would like to thank the XFS development community for their work on the file system and their help in preparing these articles. Links to the documentation, source code, and the mailing list are available from XFS.org. I wouldn’t have been able to do any of this work without these resources.

A Quick Overview of XFS

XFS is a modern journaled file system which uses extent-based file allocation and B+Tree style directories. XFS supports arbitrary extended file attributes. Inodes are dynamically allocated. The block size is 4K by default, but can be set to other values at file system creation time. All file system metadata is stored in “big endian” format, regardless of processor architecture.

Some of the structures in XFS are recognizable from older Unix file systems. XFS still uses 32-bit signed Unix epoch style timestamps, and has the “Year 2038” rollover problem as a result. XFS v5– the version currently used in Linux– does have a creation date (btime) field in addition to the normal last modified (mtime), access time (atime), and metadata change time (ctime) timestamps. XFS timestamps also have an additional 32-bit nanosecond resolution element. File type and permissions are stored in a packed 16-bit value, just like in older Unix file systems.

Very little data gets overwritten when files are deleted in XFS. Directory entries are simply marked as unused, and the extent data in the inode is still visible after deletion. File recovery should be straightforward.

In addition, standard metadata structures in XFS v5 contain a consistent unique file system UUID value, along with information like the inode value associated with the data structure. Metadata structures also have unique “magic number” values. These features facilitate file system and data recovery, and are very useful when carving or viewing raw file system data. Metadata structures include a CRC32 checksum to help detect corruption.

One interesting feature of XFS is that a single file system is subdivided into multiple Allocation Groups— four by default on RHEL systems. Each allocation group (AG) can be treated as a separate file system with its own inode and block lists. The intention was to allow multiple threads to write in parallel to the same file system with minimal interaction. This makes XFS a quite high performing file system on multi-core systems.

It also leads to a unique addressing scheme for blocks and inodes that uses a combination of the AG number and a relative block or inode offset within that AG. These values are packed together into a single address, normally stored as a 64-bit value. However the actual length of the relative portion of the address and the AG value can vary from file system to file system, as we will discuss below. In other words, it’s complicated.

The Superblock

As with other Unix file systems, XFS starts with a superblock which helps decode the file system. The superblock occupies the first 512 bytes of each XFS AG. The primary superblock is the one in AG 0 at the front of the file system, with the superblocks in the other AGs used for redundancy.

Only the first 272 bytes of the superblock are currently used. Here is a breakdown of the information from the superblock:

Rather than discussing all of these fields in detail, I am going to focus in on the fields we need to quickly get into the file system.

First we need basic file system structure size information like the block size (bytes 4-7) and inode size (bytes 104-105). XFS v5 defaults to 4K blocks and 512 byte inodes, which is what we see here.

As we’ll discuss below, the number of AGs (bytes 88-91) and the size of each AG in blocks (bytes 84-87) are critical for locating data’s physical location on the storage device. This file system has 4 AGs which each contain 2,427,136 blocks (roughly 9.6GB per AG or just under 40GB for the file system).

The superblock contains the inode number of the root directory (bytes 56-63)– this value is normally 64. We also find the starting block of the file system journal (bytes 48-55) and the journal length in blocks (bytes 96-99). We’ll cover the journal in a later article in this series.

While looking at file system metadata in a hex editor is always fun, XFS does include a program named xfs_db which allows for more convenient decoding of various file system structures. Here’s an example of using xfs_db to decode the superblock of our example file system:

Inode and Block Addressing

Typically XFS metadata uses “absolute” addresses, which contain both AG information and a relative offset from the start of that AG. This is what we find here in the superblock and in directory files. Sometimes XFS will use “AG relative” addresses that only include the relative offset from the start of the AG.

While XFS typically allocates 64-bits to hold absolute addresses, the actual size of the address fields varies depending on the size of the file system. For block addresses, the number of bits for the “AG relative” portion of the inode is the log2(AG size) value found in superblock byte 124. In the example superblock, this value is 22. So the lower 22 bits of the block address will be the relative block offset. The upper bits will be used to hold the AG number.

The first block of the file system journal is at address 0x800004. Let’s write that out in binary showing the AG and relative block offset portions:

Inode addressing is similar. However, because we can have multiple inodes per block, the relative portion of the inode address has to be longer. The length of relative inode addresses is the sum of superblock bytes 123 and 124– the log2 value of inodes per block plus the log2 value of blocks per AG. In our example this is 3+22=25.

The inode address of the root directory isn’t a very interesting example– it’s just inode offset 64 from AG 0. For a more interesting example, I’ll use my /etc/passwd file at inode 67761631 (0x409f5df). Let’s take a look at the bits:

Where does this inode physically reside on the storage device? The relative block location of an inode in XFS is simply the integer portion of the inode number divided by the number of inodes per block. In our case this is 652767 div 8 or block 81595. The inode offset in this block is 672767 mod 8, which equals 7.

Now that we know the AG and relative block number for this inode, we can extract it as we did the first block of the journal. We can even use a second dd command to extract the correct inode offset from the block:

Note that the xfs_db program can perform address conversions for us. However, in order to use xfs_db it must be able to attach to the file system in order to have the correct length for the AG relative portion of the address. Since this may no always be possible, knowing how to manually convert absolute addresses is definitely a useful skill.

Here is how to get xfs_db to convert the block and inode addresses we used in the examples above:

The first two commands convert the starting block of the journal (xfs_db refers to absolute block addresses as “fsblock” values) to the AG number (agno) and AG relative block offset (agblock). We can also use the convert command to translate inode addresses. Here we calculate the AG number, AG relative inode (agino), the AG relative block for the inode, and even the offset in that block where the inode resides (offset). The values from xfs_db match the values we calculated manually above. You will note that we can use either hex or decimal numbers as input.

Now that we can locate file system structures on disk, Part 2 of this series will focus on the XFS inode format. I hope you will return for the next installment.