All Broken Up

We spend unconscionable amounts of money making sure
that Windows NT has all the resources it needs—and it
needs plenty. Tons of RAM, multiple fast processors, striped
RAID sets with fast, fat, and wide SCSI drives are all
important components of a speed machine. But there are other
things you can do to keep performance up to snuff without
blowing your budget. For example, one of the most effective
yet unexciting tasks you can regularly perform on your
NT machines is to defragment the disk drives.

In many cases you probably already have a disk defragmentation
program installed. If you don’t, you need to correct that
mistake immediately. Regardless of your current situation,
I’m going to discuss some of the low-level details of
why defragmentation is important and how that process
is accomplished.

Inside NTFS

Before I discuss fragmentation solutions, let’s review
how data is allocated to disks in NT. I hope all your
volumes are formatted with NTFS rather than FAT, for reasons
I’ve outlined in these pages many times. As with most
file systems, NTFS is contained in a volume, which is
a logical partition on a physical disk—and, of course,
there can be multiple partitions on one disk. Unlike FAT,
which contains areas specifically formatted for use by
the various components of the file system, NTFS stores
all system files, including the Master File Table (MFT)
and the bootstrap file, as ordinary files.

As with the FAT file system, NTFS uses clusters to allocate
disk space. The size of the cluster is determined during
the format process and can range from 512 bytes up to
64K. The default cluster size for most disks today is
4K to support large partitions, avoid wasting disk space,
and minimize disk fragmentation (see Table 1). Also, keep
in mind that NTFS file compression isn’t supported on
any partition with a cluster size greater than 4K.

Table 1. Default cluster sizes in NTFS

NTFS volume size                  Default cluster size
Up to 512M                        512 bytes (or the sector size, if larger than 512 bytes)
Larger than 512M and up to 1G     1K
Larger than 1G and up to 2G       2K
Larger than 2G                    4K

Here’s an extreme example to illustrate the point. Let’s
say you have 5,000 files that are each 2K in size. On
a partition with 2K clusters, they’d consume 10M of disk
space (multiply 5,000 by 2,000 and you get 10M), with
each file fitting neatly in each cluster. Theoretically,
there wouldn’t be any wasted space or fragmentation. If
you copied those same files to a partition with 64K clusters,
they’d be allocated in 320M (5,000 x 64,000 = 320M) of
disk space with no cluster fragmentation but with massive
internal fragmentation, otherwise known as wasted space.
NTFS doesn’t concern itself with sector sizes and uses
a minimum of one complete cluster for each file, hence
the wasted space in the example. The sector size for hard
drives is determined when the drive is originally low-level
formatted and the tracks on the disk are broken up into
sectors.
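To make the arithmetic concrete, here’s a minimal Python sketch of the same calculation. It assumes a simplified model in which every file occupies whole clusters on disk (files small enough to live inside their MFT record are ignored), and it uses binary units (1K = 1,024 bytes), so the totals differ slightly from the rounded figures above. The function name allocation() is purely illustrative.

```python
KB = 1024
MB = 1024 * KB


def allocation(file_size: int, cluster_size: int) -> tuple[int, int, int]:
    """Return (clusters, allocated_bytes, slack_bytes) for one file.

    Simplified model: the file occupies whole clusters on disk; files small
    enough to be stored inside their MFT record are not considered.
    """
    clusters = (file_size + cluster_size - 1) // cluster_size  # round up
    allocated = clusters * cluster_size
    return clusters, allocated, allocated - file_size


FILE_COUNT = 5000
FILE_SIZE = 2 * KB

for cluster_size in (2 * KB, 64 * KB):
    _, allocated, slack = allocation(FILE_SIZE, cluster_size)
    print(f"{cluster_size // KB}K clusters: "
          f"{FILE_COUNT * allocated / MB:.1f}M allocated, "
          f"{FILE_COUNT * slack / MB:.1f}M wasted")
```

On 64K clusters the 5,000 files tie up 312.5M, of which about 302.7M is slack; on 2K clusters nothing is wasted.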

Avoiding Disk Fragmentation

On the other hand, if you use the same two partitions
and one 10M file, you’ll have something else to consider:
fragmentation. With the 64K cluster size, the 10M file
will be allocated just under 160 clusters, while the 2K
cluster partition would allocate a whopping 5,000 clusters.
The more clusters needed to store a file, the more likely
the clusters won’t be located contiguously on the partition.
This lack of contiguity means that the read/write head
of the physical disk has to move more often to access
any given file.
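The relationship between cluster size, cluster count and fragments can be sketched in a few lines of Python. Here a file is represented by the list of cluster numbers it occupies on the volume, and every break in that sequence starts a new fragment; the helper names and the sample cluster list are hypothetical. Taking 10M as 10 × 1,048,576 bytes, the counts come out as 160 and 5,120 clusters.

```python
KB = 1024
MB = 1024 * KB


def clusters_needed(file_size: int, cluster_size: int) -> int:
    """Number of whole clusters needed to hold file_size bytes."""
    return (file_size + cluster_size - 1) // cluster_size


def count_fragments(clusters: list[int]) -> int:
    """Count fragments: maximal runs of consecutive cluster numbers."""
    if not clusters:
        return 0
    fragments = 1
    for prev, cur in zip(clusters, clusters[1:]):
        if cur != prev + 1:  # a gap means the head must move elsewhere
            fragments += 1
    return fragments


print(clusters_needed(10 * MB, 64 * KB))               # 160
print(clusters_needed(10 * MB, 2 * KB))                # 5120
print(count_fragments([100, 101, 102, 500, 501, 90]))  # 3
```

The more entries a file’s cluster list has, the more opportunities there are for one of those gaps to appear.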

Because the read/write operation of a disk drive is the
slowest point in the disk access process, keeping file
fragmentation to a minimum can play a significant role
in system performance. When reading a sequential file
in one physical read operation, the system can use read-ahead
to extract more of the file’s data and keep it in cache
for later retrieval. Extracting this data from cache the
next time it’s needed is much faster than performing another
physical read. Obviously, in the real world, systems don’t
have uniformly sized files, but you get the point. Because
file sizes vary so widely, choosing a cluster size to avoid
fragmentation is a very poor strategy.
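A crude way to see why fragments hurt is to charge one head movement per fragment plus the time needed to transfer the bytes. The Python sketch below does exactly that. The 10 ms access time and 10M-per-second transfer rate are hypothetical round numbers, not measurements of any real drive, and the model ignores caching and read-ahead entirely.

```python
KB = 1024
MB = 1024 * KB


def estimated_read_ms(file_size: int, fragments: int,
                      access_ms: float = 10.0,
                      bytes_per_second: int = 10 * MB) -> float:
    """Very rough read-time model: one seek per fragment plus transfer time.

    access_ms and bytes_per_second are hypothetical defaults, not drive specs.
    """
    seek_time = fragments * access_ms
    transfer_time = file_size * 1000 / bytes_per_second
    return seek_time + transfer_time


size = 10 * MB
print(estimated_read_ms(size, 1))     # 1010.0: one contiguous run
print(estimated_read_ms(size, 160))   # 2600.0: every 64K cluster scattered
print(estimated_read_ms(size, 5120))  # 52200.0: every 2K cluster scattered
```

The worst cases assume every cluster lands in its own fragment, which allocators try hard to avoid, but the trend is the point: seek time grows with the fragment count while transfer time stays fixed.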

Making things more complex for us but more flexible for
the file system, NTFS refers to clusters by two kinds of
numbers: Logical Cluster Numbers (LCNs) and Virtual Cluster
Numbers (VCNs). LCNs number the clusters of a volume sequentially
from the start of the partition. Multiplying an LCN by
the partition’s cluster size gives the byte offset that
the disk driver uses to read and write data (very low-level
stuff). VCNs number the clusters of an individual file
sequentially, from 0 up to as many clusters as needed
to contain the file. NTFS addresses a file’s data by VCN,
and the file’s VCN-to-LCN mappings translate those numbers
into locations on the disk.
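Here’s a minimal Python sketch of that two-step translation. A file’s mapping is modeled as a list of runs, each recording a starting VCN, a starting LCN and a length in clusters. This is a simplified in-memory stand-in, not the compressed encoding NTFS actually stores on disk, and the Run type and sample values are hypothetical.

```python
from typing import NamedTuple


class Run(NamedTuple):
    start_vcn: int  # first VCN (file-relative cluster number) covered by the run
    start_lcn: int  # LCN (volume-relative cluster number) where the run begins
    length: int     # number of contiguous clusters in the run


def vcn_to_lcn(runs: list[Run], vcn: int) -> int:
    """Translate a file's VCN to the volume LCN that holds it."""
    for run in runs:
        if run.start_vcn <= vcn < run.start_vcn + run.length:
            return run.start_lcn + (vcn - run.start_vcn)
    raise ValueError(f"VCN {vcn} is not mapped")


def lcn_to_byte_offset(lcn: int, cluster_size: int) -> int:
    """Byte offset from the start of the volume: LCN x cluster size."""
    return lcn * cluster_size


# Hypothetical 5-cluster file stored in two fragments on a 4K-cluster volume.
runs = [Run(start_vcn=0, start_lcn=1000, length=3),
        Run(start_vcn=3, start_lcn=4200, length=2)]
CLUSTER_SIZE = 4096

for vcn in range(5):
    lcn = vcn_to_lcn(runs, vcn)
    print(f"VCN {vcn} -> LCN {lcn} -> byte offset "
          f"{lcn_to_byte_offset(lcn, CLUSTER_SIZE)}")
```

The jump from LCN 1002 to LCN 4200 between VCN 2 and VCN 3 is exactly the kind of discontinuity a defragmenter works to remove.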

Consistency is Key

The core of any NTFS volume is the Master File Table
(MFT), which is implemented as a file containing an array
of 1K records, regardless of sector size. Each record
represents a file within the partition and contains that
file’s attributes, such as the security descriptor, filenames,
timestamps, and, interestingly enough, the data. I call
this interesting because storing the data as just another
file attribute helps give NTFS a consistent architecture.
If the data fits within the 1K record, it’s stored in
the MFT and referred to as a resident attribute. Obviously,
very few files are this small, so an attribute that doesn’t
fit becomes a nonresident attribute: its value is stored
outside the MFT in one or more “runs” of contiguous clusters,
allocated from the next available clusters. As a file
grows in size, more runs are allocated to contain the
additional data. Although this process is usually associated
with data files, any attribute that can grow is handled
in the same manner. For example, if many users are granted
permissions to files individually rather than through
group membership, the Discretionary Access Control Lists
(DACLs) can grow too large to remain resident, in which
case they’ll be allocated in a run.
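The resident-versus-nonresident decision can be modeled with a short Python function. This is a deliberate simplification: it assumes a fixed 1K record, lumps everything already in the record (header and other attributes) into a single hypothetical bytes_used figure, and ignores the per-attribute header overhead real NTFS records carry.

```python
MFT_RECORD_SIZE = 1024  # 1K MFT records, as described above


def place_attribute(value_size: int, bytes_used: int,
                    cluster_size: int) -> tuple[str, int]:
    """Decide where an attribute's value goes in this simplified model.

    Returns ("resident", 0) if the value fits in the space left in the
    1K MFT record, otherwise ("nonresident", clusters_needed).
    """
    if bytes_used + value_size <= MFT_RECORD_SIZE:
        return "resident", 0
    clusters = (value_size + cluster_size - 1) // cluster_size
    return "nonresident", clusters


print(place_attribute(300, 400, 4096))     # ('resident', 0)
print(place_attribute(10_000, 400, 4096))  # ('nonresident', 3)
```

A growing DACL in this model simply crosses the same threshold as a growing data attribute, which is the consistency the paragraph above describes.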

Another example of nonresident file attributes being
stored in runs is a directory with a large number of files.
Directories are listed in the MFT like other files except
that they have an index root attribute containing a list
of the files associated with the directory. If the index
of files can’t be contained in the MFT record, a run is
created to allocate the overflowing information in as
many clusters as necessary to contain the filenames and
their associated VCN-to-LCN mappings. Such a consistent
approach to treating all information as attributes and
any increasing information as runs helps NTFS remain flexible
as different data types are created for future applications.
Regardless of its source or destination, data is simply
stored in attribute streams. NTFS doesn’t need to be concerned
with data types—it leaves that issue to higher-level application
processes.

Metadata Files

Along with the MFT, another set of files completes the
NTFS structure: metadata files. These files use a
$filename naming convention, and each has a particular
function in the file system. During the NT boot process,
the kernel loads all the device drivers, including the
NTFS file system driver. During the volume mounting process,
the NTFS driver looks for the $Boot file, which contains
the bootstrap code. The $Boot file is created during the
formatting process and is located at a specific disk address.
It records the physical disk address of the $MFT, where
the driver reads the MFT’s own VCN-to-LCN information
to obtain all of the MFT’s attributes and runs. The first
record in the MFT contains the attributes of the MFT itself.
In this manner the MFT first references itself, then all
other files in the partition. The second record in the
MFT contains the attributes of a partial copy of the MFT,
called $MFTMirr, which is a file placed in the middle of
the partition, away from the MFT, for redundancy purposes.
Because these are normal files, you can see them with
the DIR command (see Figure 1).

Figure 1. Metadata files
use a $filename naming convention and can be viewed
using the DIR command.
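To make the lookup concrete, here’s a Python sketch that pulls the relevant fields out of an NTFS boot sector: bytes per sector and sectors per cluster at offsets 0x0B and 0x0D, and the starting LCNs of $MFT and $MFTMirr at offsets 0x30 and 0x38. It runs against a synthetic sector built in memory with hypothetical values rather than reading a live volume, and the function name mft_location() is illustrative.

```python
import struct

NTFS_OEM_ID = b"NTFS    "  # "NTFS" followed by four spaces, at offset 3


def mft_location(boot_sector: bytes) -> dict[str, int]:
    """Return the cluster size and byte offsets of $MFT and $MFTMirr."""
    if boot_sector[3:11] != NTFS_OEM_ID:
        raise ValueError("not an NTFS boot sector")
    # Little-endian: 2-byte bytes-per-sector at 0x0B, 1-byte sectors-per-cluster at 0x0D.
    bytes_per_sector, sectors_per_cluster = struct.unpack_from("<HB", boot_sector, 0x0B)
    # 8-byte starting LCNs of $MFT (0x30) and $MFTMirr (0x38).
    mft_lcn, mftmirr_lcn = struct.unpack_from("<QQ", boot_sector, 0x30)
    cluster_size = bytes_per_sector * sectors_per_cluster
    return {
        "cluster_size": cluster_size,
        "mft_offset": mft_lcn * cluster_size,  # LCN x cluster size
        "mftmirr_offset": mftmirr_lcn * cluster_size,
    }


# Synthetic boot sector with hypothetical values: 512-byte sectors,
# 8 sectors per cluster (4K clusters), $MFT at LCN 4, $MFTMirr at LCN 1000.
sector = bytearray(512)
sector[3:11] = NTFS_OEM_ID
struct.pack_into("<HB", sector, 0x0B, 512, 8)
struct.pack_into("<QQ", sector, 0x30, 4, 1000)

print(mft_location(bytes(sector)))
# {'cluster_size': 4096, 'mft_offset': 16384, 'mftmirr_offset': 4096000}
```

Once the driver has the $MFT’s starting position, record 0 describes the rest of the MFT, which is how the table can end up anywhere on the partition.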

You can use the $MFTMirr file to locate the metadata
files if the MFT is somehow corrupt or missing. By implementing
the MFT as a normal file that references itself, NTFS
eliminates the need to locate it in any particular area
of the partition. This means that NTFS can relocate the
MFT file if it encounters a bad cluster or other disk
error. Two other interesting files are the $BadClus file,
which keeps a record of bad clusters on the disk, and
$Volume, which records the volume name, the NTFS version
number, and a bit that marks the volume as corrupt, meaning
CHKDSK needs to be run against it.

Additional Information

A great reference that delves into
the NTFS internals even further is
David A. Solomon's Inside Windows
NT, Second Edition, Microsoft
Press, ISBN 1-57231-677-2. Chapter
9 covers NTFS.

Protecting System Files

One of the most compelling architectural benefits of
NTFS is its ability to provide transaction-based recovery.
This doesn’t extend to the user’s data files, but it does
protect the NTFS system files. This means that if the
system has a power failure or otherwise comes crashing
down, the partition will always be in a consistent state
and ready to offer a useful file system to the operating
system. Applications can also work to protect user data
by periodically flushing the cache to the same log file
the system uses.

The transaction-based recovery process is managed by
the Log File Service (LFS), part of the NTFS device driver.
Each time an NTFS volume is mounted and then accessed
by an application, the partition goes through a recovery
process where unresolved I/O transactions are either completed
or rolled back to the last known consistent state, based
on information contained in a transaction log.

To accomplish this, every five seconds the NTFS driver
writes a checkpoint record into a metadata file called
$LogFile. The checkpoint marks the position of update
records that hold copies of two tables of transaction
information (see Figure 2). One is the dirty page table,
which contains changes to the file structure that haven’t
been written to the disk. The other is the transaction
table, which is a record of all disk transactions that
are underway but haven’t been completed.

Figure 2. The $LogFile
metadata file contains update records that are copies
of two tables of transaction information: file structure
changes that haven't been written to the disk, and
disk transactions that are underway but not complete.

During the recovery process the LFS can either redo the
steps that make up a complete transaction or undo a partial
set of steps of an uncompleted transaction. The LFS knows
whether to redo or undo the transaction based on the existence
of a record that declares the transaction complete or,
in database terminology, “committed.” If a commit record
exists, the LFS redoes the transaction’s steps. If there’s
no record declaring a transaction committed, the LFS undoes
each step recorded in $LogFile in the reverse order of
operation to roll back the transaction. In either case,
the file structure will be in a consistent, usable state.
NTFS writes these transaction records to the log file
whenever it performs operations such as creating, deleting,
or renaming files, setting security permissions, or making
any other type of change to the file system attributes.
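The redo/undo decision can be illustrated with a toy model. In this Python sketch a log record carries a transaction ID plus before and after values for a single piece of metadata; committed transactions are redone in log order and uncommitted ones are undone in reverse order. The record layout is hypothetical and far simpler than the real $LogFile format, and the sketch skips the work the real LFS does with the checkpointed dirty page and transaction tables.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class LogRecord:
    txn: int         # transaction ID
    kind: str        # "update" or "commit"
    key: str = ""    # which metadata item an update touches (hypothetical)
    old: Any = None  # undo information: value before the change
    new: Any = None  # redo information: value after the change


def recover(log: list[LogRecord], state: dict[str, Any]) -> dict[str, Any]:
    """Bring `state` to a consistent point using a simplified redo/undo log."""
    committed = {r.txn for r in log if r.kind == "commit"}
    # Redo pass: reapply every step of committed transactions, in log order.
    for r in log:
        if r.kind == "update" and r.txn in committed:
            state[r.key] = r.new
    # Undo pass: roll back steps of uncommitted transactions, newest first.
    for r in reversed(log):
        if r.kind == "update" and r.txn not in committed:
            state[r.key] = r.old
    return state


log = [
    LogRecord(1, "update", "a", old=0, new=1),
    LogRecord(2, "update", "b", old=0, new=2),
    LogRecord(1, "commit"),
    LogRecord(2, "update", "c", old=0, new=3),  # transaction 2 never commits
]
# On-disk state after a crash: some changes reached the disk, some didn't.
disk = {"a": 0, "b": 2, "c": 3}
print(recover(log, disk))  # {'a': 1, 'b': 0, 'c': 0}
```

Because redo writes the after value and undo writes the before value, running recover() a second time gives the same result, which matters if the system goes down again in the middle of recovery.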

Next Month: Tools of the Trade

As you can see, the NTFS environment is a busy and complex
place. Although the architecture of the file system is
designed to be efficient, the basis of the allocation
of disk space is still fundamentally at the cluster level.
Because of this, as files are deleted, expanded, and otherwise
altered, the MFT runs that keep track of the data attributes
can be scattered all over the disk in fragments that can
have a decided impact on I/O performance. Based on this
understanding of how NTFS functions, next month I’ll
discuss specific tools that help manage this problem,
and I’ll show how they actually work with NTFS.