NTFS Sparse Files For Programmers

Imagine that you are implementing a virtual drive that is based on a regular NTFS file.
Such drives are quite common: just look at the many file-based
'virtual encrypted disks'.

What if a user requests a drive of, say, 10 GB?
You have two options - either pre-allocate 10 gigabytes of storage
no matter how much is actually used, or implement a rather complex storage allocation
system that allows the base file to grow and shrink dynamically.

Neither alternative looks especially good. Pre-allocation wastes a lot
of space, and shrinking the file on the fly is likely to be a very time-consuming
operation.

Fortunately, Windows 2000 and later systems offer a better solution: sparse files.
When you need to free a chunk of in-file storage, you just tell the system that
that part of the file is no longer used, and the system frees the corresponding
disk space. This way a file can have a logical size of hundreds of gigabytes but occupy
only a few kilobytes of physical storage. The
FlexHEX tour
shows an example of a 250-megabyte file that uses only 64 KB of disk space.

Note that a sparse file is still fully usable by ordinary applications. When a program
attempts to read from a 'hole' area, the system fills the read buffer with zeros.
To the program, the operation looks like a successful read of a block of zeros.

However, a program that can distinguish between real data and sparse
zero areas may have significant advantages over a sparse-unaware application.
Imagine a terabyte-sized sparse file - a sparse-aware program will load, copy, or scan
such a file in no time, whereas an ordinary application will either take ages
to process a file that large or fail completely.

The Actual Sparse File Layout

Although you can declare any area as sparse using the FSCTL_SET_ZERO_DATA
control code, the system considers this simply as a recommendation, which it doesn't
have to follow. Windows will rearrange the actual sparse area layout as it sees fit
(our FAQ mentions this effect).

When it comes to a compressed or a sparse file, NTFS divides the file into chunks
called compression units. If a chunk occupies a disk area of the same size,
then the disk area contains uncompressed data. If the allocated space is less than
the compression unit size, then the data is compressed. If the compression unit has
no corresponding disk clusters, then it contains sparse zeros.

It follows that a sparse zero area is always aligned to compression
unit boundaries (if it is not, NTFS considers the unit compressed, not sparse).
So the question is: "How large is the compression unit?" The number of disk
clusters per compression unit may vary for different files, and even for different
streams of the same file; however, NTFS seems to use the same value of sixteen
clusters per unit for all data streams. Since NTFS hard drives are
formatted with a cluster size of 4 KB by default, the typical compression unit
size is 64 KB.

For example, if your file contains a sparse zero area 60000 bytes long followed
by a single real data byte, the resulting file will contain 60001 bytes of
data and no sparse zeros at all.

If you create a file containing 70000 sparse zero bytes followed by a real
byte, the resulting file will have a sparse zero area 65536 bytes long followed
by (70001 - 65536) = 4465 bytes of real data.
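
The arithmetic behind these two examples can be sketched as a small helper. This is only
a model of the alignment rule described above, not of what NTFS does internally:
ByteRange and SparsePart are hypothetical names, and the 64 KB default is the
"sixteen clusters of 4 KB" assumption from the text.

```cpp
#include <cstdint>

// Byte range [begin, end) of a file; empty when begin == end.
struct ByteRange {
    std::uint64_t begin;
    std::uint64_t end;
};

// Shrinks a requested zero range to the whole compression units it covers.
// unitSize is an assumption: 16 clusters * 4 KB = 64 KB.
inline ByteRange SparsePart(ByteRange requested, std::uint64_t unitSize = 16 * 4096)
{
    // Round the start up and the end down to unit boundaries.
    std::uint64_t begin = (requested.begin + unitSize - 1) / unitSize * unitSize;
    std::uint64_t end = requested.end / unitSize * unitSize;
    if (end <= begin)
        return ByteRange{requested.begin, requested.begin}; // no whole unit: nothing is sparse
    return ByteRange{begin, end};
}
```

For a hole covering bytes 0 to 60000 the helper returns an empty range; for 0 to 70000
it returns the first 65536 bytes, which leaves 70001 - 65536 = 4465 bytes stored as data.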

Can They Really Be That Large?

Jeffrey Richter and Luis Felipe Cabrera, in their article
"A File System for the 21st Century"
in the November 1998 issue of MSJ, wrote about sparse streams:
"Since a stream can hold as many as 16 billion billion bytes...". This is not
exactly true. You can create the largest possible sparse file, 16 terabytes, if and only if it
consists of a single sparse zero area with no data at all. Writing even a single data byte
drops the size limit far below that. The exact limit depends on many factors: availability
of system resources, the layout of the real data areas, even the order of the I/O requests;
in some cases the limit may be as low as several hundred gigabytes. If you don't mind
experimenting, you can use FlexHEX to find
your system limit.

It is probably safe to assume that you can always create a 300-500 gigabyte
sparse file, but any attempt to create a larger file might result in the
Disk full error, no matter how little real data has been written.
An amusing fact is that you may get this error even when you are marking some area
as a hole, thus releasing physical storage.

This does look strange because the NTFS on-disk format has no such limitation, and decoding
the data runs of a sparse stream is no more complex than obtaining the cluster list
for an ordinary file. Apparently Microsoft didn't believe anybody would ever need
a terabyte-sized file, and never cared about an efficient implementation.

Determining If A File Is Sparse

To check whether a file is sparse, use the GetFileAttributes, GetFileAttributesEx,
or GetFileInformationByHandle function. Note, however, that the first two functions
return the attributes of the unnamed stream. That is, if a file consists of a non-sparse main stream
and a sparse alternate stream, both GetFileAttributes
and GetFileAttributesEx will report the file as not sparse. If your application is stream-aware
and can work with sparse alternate streams, you should use GetFileInformationByHandle.

For example, FlexHEX shows a composite sparse
attribute in its File Properties window. To find the actual attribute, it
calls GetFileInformationByHandle for each stream of the file, and reports the file as
sparse if it has one or more sparse streams.
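
A per-stream check along these lines might look as follows. This is a sketch rather than
FlexHEX's actual code: IsStreamSparse is a hypothetical helper, and the stream is
addressed with the usual "file:stream" path syntax.

```cpp
#ifdef _WIN32 // Windows-only API; the guard keeps the snippet inert elsewhere
#include <windows.h>

// Reports whether one stream is sparse. path may name an alternate stream,
// e.g. L"file.dat:stream". Returns false if the stream cannot be opened.
bool IsStreamSparse(const wchar_t* path)
{
    HANDLE h = CreateFileW(path, GENERIC_READ,
                           FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE,
                           nullptr, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
    if (h == INVALID_HANDLE_VALUE)
        return false;

    BY_HANDLE_FILE_INFORMATION info;
    bool sparse = GetFileInformationByHandle(h, &info) &&
                  (info.dwFileAttributes & FILE_ATTRIBUTE_SPARSE_FILE) != 0;
    CloseHandle(h);
    return sparse;
}
#endif
```

Calling the helper once per stream and combining the results with a logical OR gives the
composite attribute described above.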

Marking The File As Sparse

Use the DeviceIoControl function with the FSCTL_SET_SPARSE control code
to mark the file as sparse:
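
A minimal sketch of such a call, assuming hFile was opened with write access;
MarkSparse is a hypothetical helper name, not part of the Windows API.

```cpp
#ifdef _WIN32 // Windows-only API; the guard keeps the snippet inert elsewhere
#include <windows.h>
#include <winioctl.h>

// Marks the stream opened as hFile as sparse.
// Returns true on success; call GetLastError() for details on failure.
bool MarkSparse(HANDLE hFile)
{
    DWORD bytesReturned = 0; // must be supplied when lpOverlapped is NULL
    return DeviceIoControl(hFile, FSCTL_SET_SPARSE,
                           nullptr, 0,  // no input buffer: the sparse flag is set
                           nullptr, 0,  // no output buffer
                           &bytesReturned, nullptr) != FALSE;
}
#endif
```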

If you don't mark the file as sparse, the FSCTL_SET_ZERO_DATA control code
will actually write zero bytes to the file instead of marking the region as a sparse
zero area.

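Note that on Windows 2000 and Windows XP marking a file as sparse is a one-way operation.
You cannot unmark a sparse file even if it contains no sparse area; the only way to convert
the file back to the non-sparse state is to recreate it from scratch. The documentation for
later Windows versions describes clearing the flag by passing a FILE_SET_SPARSE_BUFFER
structure with SetSparse set to FALSE, which is valid only when the file no longer has
any sparse regions.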

Converting A File Region To A Sparse Zero Area

No trouble here. Just specify the starting and the ending address (not the size!)
of the sparse zero block:
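
As a sketch, with ZeroRange being a hypothetical helper name and hFile a handle opened
with write access:

```cpp
#ifdef _WIN32 // Windows-only API; the guard keeps the snippet inert elsewhere
#include <windows.h>
#include <winioctl.h>

// Zeroes the byte range [start, end) of hFile. On a file already marked as
// sparse, the system may release the disk space behind the range.
bool ZeroRange(HANDLE hFile, LONGLONG start, LONGLONG end)
{
    FILE_ZERO_DATA_INFORMATION fzdi;
    fzdi.FileOffset.QuadPart = start;     // first byte to zero
    fzdi.BeyondFinalZero.QuadPart = end;  // first byte NOT to zero: an offset, not a size

    DWORD bytesReturned = 0;
    return DeviceIoControl(hFile, FSCTL_SET_ZERO_DATA,
                           &fzdi, sizeof(fzdi),
                           nullptr, 0,
                           &bytesReturned, nullptr) != FALSE;
}
#endif
```

To zero size bytes starting at offset, the call would be ZeroRange(hFile, offset, offset + size).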

Note, however, that this operation does not perform actual file I/O,
and unlike the WriteFile function, it neither moves the current
file I/O pointer nor sets the end-of-file marker. That is, if you
want to place a sparse zero block at the end of the file, you must move
the file pointer accordingly using the SetFilePointer
function and call the SetEndOfFile function, otherwise
DeviceIoControl will have no effect.

You may ask: "What if we set a new end-of-file marker without calling DeviceIoControl?",
for example, by executing the following function calls:
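
A sketch of those calls, assuming hFile is open for writing; the helper name
ExtendBy16MB is hypothetical, and the 16 MB distance is chosen to match the discussion
that follows.

```cpp
#ifdef _WIN32 // Windows-only API; the guard keeps the snippet inert elsewhere
#include <windows.h>

// Moves the end-of-file marker 16 MB past the current end of the file.
bool ExtendBy16MB(HANDLE hFile)
{
    const LONG kExtra = 16 * 1024 * 1024;
    // With a NULL high-part pointer this return value signals failure.
    if (SetFilePointer(hFile, kExtra, nullptr, FILE_END) == INVALID_SET_FILE_POINTER)
        return false;
    // The end-of-file marker is set to the current file pointer position.
    return SetEndOfFile(hFile) != FALSE;
}
#endif
```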

What will we find in those 16 megabytes between the old and the new end-of-file markers?
Sparse zeros, real zeros, just some junk? The right answer is: sparse zeros. You don't
need to call DeviceIoControl/FSCTL_SET_ZERO_DATA to create a sparse zero block
at the end of the file - simply moving the end-of-file marker will do the trick.

The last thing worth mentioning is that we can use FSCTL_SET_ZERO_DATA
on a non-sparse file as well. MSDN states that "It is equivalent
to using the WriteFile function to write zeros to a file." This is
not quite correct though - unlike WriteFile, FSCTL_SET_ZERO_DATA
affects neither the current file I/O pointer nor the end-of-file marker, exactly
as in the case of a sparse file. For instance, if the file pointer is at the start of
the zeroed range and you call WriteFile immediately after FSCTL_SET_ZERO_DATA,
the write will overwrite the zeros just written.

Querying The Sparse File Layout

Not much of a problem either - just specify what range you wish to query and
provide a sufficient buffer for output info. The following example
prints the positions and sizes of allocated blocks in a sparse file.
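
One way such a query loop could look. This is a sketch, not the article's original
listing: PrintAllocatedRanges is a hypothetical helper, and the 64-entry output buffer
is an arbitrary choice.

```cpp
#ifdef _WIN32 // Windows-only API; the guard keeps the snippet inert elsewhere
#include <windows.h>
#include <winioctl.h>
#include <cstdio>

// Prints the offset and length of every allocated range in hFile.
// Returns false if the file size or a range query cannot be obtained.
bool PrintAllocatedRanges(HANDLE hFile)
{
    LARGE_INTEGER fileSize;
    if (!GetFileSizeEx(hFile, &fileSize))
        return false;

    FILE_ALLOCATED_RANGE_BUFFER query;      // range to examine: the whole file
    query.FileOffset.QuadPart = 0;
    query.Length.QuadPart = fileSize.QuadPart;

    FILE_ALLOCATED_RANGE_BUFFER ranges[64]; // output buffer
    for (;;) {
        DWORD bytes = 0;
        BOOL ok = DeviceIoControl(hFile, FSCTL_QUERY_ALLOCATED_RANGES,
                                  &query, sizeof(query),
                                  ranges, sizeof(ranges), &bytes, nullptr);
        // ERROR_MORE_DATA: the buffer was filled and more ranges remain.
        bool more = !ok && GetLastError() == ERROR_MORE_DATA;
        if (!ok && !more)
            return false;

        DWORD n = bytes / sizeof(ranges[0]); // number of ranges returned
        for (DWORD i = 0; i < n; ++i)
            std::printf("offset %lld, length %lld\n",
                        static_cast<long long>(ranges[i].FileOffset.QuadPart),
                        static_cast<long long>(ranges[i].Length.QuadPart));

        if (!more || n == 0)
            return true;
        // Continue the query right after the last range already reported.
        LONGLONG next = ranges[n - 1].FileOffset.QuadPart + ranges[n - 1].Length.QuadPart;
        query.FileOffset.QuadPart = next;
        query.Length.QuadPart = fileSize.QuadPart - next;
    }
}
#endif
```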

Note that FSCTL_QUERY_ALLOCATED_RANGES returns the positions of
allocated areas, not the positions of non-zero areas. An allocated area
may consist of zeros as well; the only question that matters is whether
any given area occupies physical storage or not.

Determining the Actual File Size

To determine the actual file size, that is, the amount of physical storage
being used by the file, just sum up the lengths of all the allocated ranges. In the example code above,
ranges is the array of the ranges, and n contains the number of ranges,
so the code for finding the actual size may look as follows:
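
A sketch of that summation, assuming ranges and n were filled by an
FSCTL_QUERY_ALLOCATED_RANGES query. AllocatedBytes is a hypothetical helper; it is
written as a template so that it accepts FILE_ALLOCATED_RANGE_BUFFER or any structure
with the same Length.QuadPart member.

```cpp
#include <cstddef>

// Sums the Length fields of n allocated ranges.
// Range is expected to look like FILE_ALLOCATED_RANGE_BUFFER, that is, to
// expose a Length.QuadPart member holding the range size in bytes.
template <typename Range>
long long AllocatedBytes(const Range* ranges, std::size_t n)
{
    long long total = 0;
    for (std::size_t i = 0; i < n; ++i)
        total += ranges[i].Length.QuadPart;
    return total;
}
```

If the query had to be repeated because the output buffer was too small, the helper
would be called for each batch and the results added together.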

All the content is provided on an "as is" basis and without any warranty, express or implied.
You can use the supplied tools and examples for any commercial or non-commercial purpose without
charge. You can copy and redistribute these tools and examples freely, provided that you distribute
the original unmodified archives.

sparse.zip - Visual C++ source code
for a sparse-file-enabled Copy Stream utility.
Like the original utility, it performs a stream-to-stream copy; however, if the source stream
is sparse, the target stream will also be sparse.