Memory management is one of the most important and most difficult duties of
an operating system. This chapter presents a comprehensive overview of Windows
2000 memory management and the structure of the 4-GB linear address space. In
this context, the virtual memory addressing and paging capabilities of the Intel
i386 CPU family are explained, focusing on how the Windows 2000 kernel exploits
them. To aid the exploration of memory, this chapter features a pair of sample
programs: a kernel-mode device driver that collects information about the
system, and a user-mode client application that queries this data from the
driver via device I/O control and displays it in a console window. The "spy
driver" module will be reused in the remaining chapters for several other
interesting tasks that require execution of kernel-mode code. This
chapter—especially the first section—is tough reading because it puts
your hands directly on the CPU hardware. Nevertheless, I hope you won't skip
it, because virtual memory management is an exciting topic, and understanding
how it works provides insight into the mechanics of a complex operating system
such as Windows 2000.

Intel i386 Memory Management

The Windows 2000 kernel makes heavy use of the protected-mode virtual memory
management mechanisms of the Intel i386 CPU class. To get a better understanding
of how Windows 2000 manages its main memory, it is important to be at least
minimally familiar with some architectural issues of the i386 CPU. The term
i386 might look somewhat anachronistic because the 80386 CPU dates back
to the early days of Windows computing. Windows 2000 is designed for Pentium
CPUs and above. However, even these newer processors rely on the memory
management model originally designed for the 80386 CPU, with some important
enhancements, of course. Therefore, Microsoft usually labels the Windows NT and
2000 versions built for Intel processors "i386" or even
"x86." Don't be confused about that—whenever you read the
numbers 86 or 386 in this book, keep in mind that the corresponding information
refers to a specific CPU architecture, not a specific processor
release.

Basic Memory Layout

Windows 2000 uses a very straightforward memory layout for application and
system code. The 4-GB virtual memory space offered by the 32-bit Intel CPUs is
divided into two equal parts. Memory addresses below 0x80000000 are
assigned to user-mode modules, including the Win32 subsystem, and the remaining
2 GB are reserved for the kernel. Windows 2000 Advanced Server also supports an
alternative memory model commonly called 4GT RAM Tuning, which was
introduced with Windows NT 4.0 Server Enterprise Edition. This model features a
3-GB address space for user processes and a 1-GB space for the kernel. It is
enabled by adding the /3GB switch to the bootstrap command line in the
boot manager configuration file boot.ini.
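
The two layouts differ only in where the user/kernel boundary sits. The following minimal C sketch classifies a linear address under either layout. The helper name is_user_address and the two constants are my own illustrative inventions, not DDK definitions, and the sketch ignores any guard regions the system may place near the boundary.

```c
#include <stdbool.h>
#include <stdint.h>

/* Start of kernel space in the default 2-GB/2-GB layout. */
#define KERNEL_BASE_DEFAULT 0x80000000u
/* Start of kernel space when booted with /3GB (3 GB user, 1 GB kernel). */
#define KERNEL_BASE_3GB     0xC0000000u

/* Hypothetical helper: returns true if the linear address belongs to the
 * user-mode part of the 4-GB space under the selected layout. */
static bool is_user_address(uint32_t linear, bool booted_with_3gb)
{
    uint32_t kernel_base = booted_with_3gb ? KERNEL_BASE_3GB
                                           : KERNEL_BASE_DEFAULT;
    return linear < kernel_base;
}
```

With the default layout, 0x7FFFFFFF is the last user-mode address; with /3GB, the address 0x80000000 becomes a user-mode address as well.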

The Advanced Server and Datacenter variants of Windows 2000 support yet
another memory option named Physical Address Extension (PAE) enabled by
the boot.ini switch /PAE. This option exploits a feature of
some Intel CPUs (e.g., the Pentium Pro processor) that allows physical memory
larger than 4 GB to be mapped into the 32-bit address space. In this chapter, I
will ignore these special configurations. You can read more about them in
Microsoft's Knowledge Base article Q171793 (Microsoft 2000c), Intel's
Pentium manuals (Intel 1999a, 1999b, 1999c), and the Windows 2000 Device Driver
Kit (DDK) documentation (Microsoft 2000f).

Memory Segmentation and Demand Paging

Before delving into the technical details of the i386 architecture,
let's travel back in time to the year 1978, when Intel released the mother
of all PC processors: the 8086. I want to restrict this discussion to the most
significant milestones. If you want to know more, Robert L. Hummel's 80486
programmer's reference is an excellent starting point (Hummel 1992). It is
a bit outdated now because it doesn't cover the new features of the Pentium
family; however, this leaves more space for important information about the
basic i386 architecture. Although the 8086 was able to address 1 MB of Random
Access Memory (RAM), an application could never "see" the entire
physical address space because of the restriction of the CPU's address
registers to 16 bits. This means that applications were able to access a
contiguous linear address space of only 64 KB, but this memory window could be
shifted up and down in the physical space with the help of a set of 16-bit
segment registers. Each segment register defined a base address in 16-byte
increments, and the linear addresses in the 64-KB logical space were added as
offsets to this base, effectively resulting in 20-bit addresses. This archaic
memory model is still supported even by the latest Pentium CPUs, and it is
called Real-Address Mode, commonly referred to as Real Mode.
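
The Real Mode address computation described above fits into a single line of C. This is only a sketch; the function name real_mode_physical is made up for illustration. I mask the result to 20 bits to model the 20 address lines of the 8086, which means that sums beyond 1 MB wrap around to the bottom of memory.

```c
#include <stdint.h>

/* Sketch of 8086 Real Mode addressing: the 16-bit segment value is
 * multiplied by 16 (shifted left by 4 bits) and the 16-bit offset is
 * added. The mask models the 20 address lines of the 8086. */
static uint32_t real_mode_physical(uint16_t segment, uint16_t offset)
{
    return (((uint32_t)segment << 4) + offset) & 0xFFFFFu;
}
```

For example, the segment:offset pair 1234:0010 resolves to the 20-bit address 0x12350.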

An alternative mode was introduced with the 80286 CPU, referred to as
Protected Virtual Address Mode, or simply Protected Mode. It
featured a memory model where physical addresses were not generated by simply
adding a linear address to a segment base. To retain backward compatibility with
the 8086 and 80186, the 80286 still used segment registers, but they did not
contain physical segment addresses after the CPU had been switched to Protected
Mode. Instead, they provided a selector, comprising an index into a descriptor
table. The target entry defined a 24-bit physical base address, allowing access
to 16 MB of RAM, which seemed like an incredible amount then. However, the 80286
was still a 16-bit CPU, so the limitation of the linear address space to 64 KB
tiles still applied.

The breakthrough came in 1985 with the 80386 CPU. This chip finally cut the
ties of 16-bit addressing, pushing up the linear address space to 4 GB by
introducing 32-bit linear addresses while retaining the basic
selector/descriptor architecture of its predecessor. Fortunately, the 80286
descriptor structure contained some spare bits that could be reclaimed. While
moving from 16- to 32-bit addresses, the size of the CPU's data registers
was doubled as well, and new powerful addressing modes were added. This radical
shift to 32-bit data and addresses was a real benefit for programmers, at
least theoretically. In practice, it took several more years before the
Microsoft Windows platform was ready to fully support the 32-bit model. The
first version of Windows NT was released on July 26th, 1993, constituting the
very first incarnation of the Win32 API. Whereas Windows 3.x programmers still
had to deal with memory tiles of 64 KB with separate code and data segments,
Windows NT provided a flat linear address space of 4 GB, where all code and data
could be addressed by simple 32-bit pointers, without segmentation. Internally,
of course, segmentation was still active, as I will show later in this chapter,
but the entire responsibility for managing segments finally had been moved to
the operating system.

Another essential new 80386 feature was the hardware support for paging, or,
more precisely, demand-paged virtual memory. This is a technique that allows
memory to be backed up by a storage medium other than RAM—a hard disk, for
example. With paging enabled, the CPU can access more memory than physically
available by swapping out the least recently accessed memory contents to backup
storage, making space for new data. Theoretically, up to 4 GB of contiguous
linear memory can be accessed this way, provided that the backup media is large
enough—even if the installed physical RAM amounts to just a small fraction
of the memory. Of course, paging is not the fastest way to access memory. It is
always good to have as much physical RAM as possible. But it is an excellent way
to work with large amounts of data that would otherwise exceed the available
memory. For example, graphics and database applications require a large amount
of working memory, and some wouldn't be able to run on a low-end PC system
if paging weren't available.

In the i386 paging scheme, memory is subdivided into pages of 4-KB or 4-MB
size. The 80386 itself supports only 4-KB pages; the 4-MB option arrived with
the Pentium. On these newer CPUs, the operating system designer is free to
choose between the two options, and it is even possible to mix pages of both
sizes. Later I will show
that Windows 2000 uses such a mixed page design, keeping the operating system
in 4-MB pages and using 4-KB pages for the remaining code and data. The pages
are managed by means of a hierarchically structured page-table tree that
indicates for each page where it is currently located in physical memory. This
management structure also contains information on whether the page is actually
in physical memory in the first place. If a page has been swapped out to the
hard disk, and some module touches an address within this page, the CPU
generates a page fault, similar to an interrupt generated by a peripheral
hardware device. Next, the page fault handler inside the operating system kernel
will attempt to swap back this page to physical memory, possibly writing other
memory contents to disk to make space. Usually, the system will apply a
least-recently-used (LRU) schedule to decide which pages qualify to be swapped
out. By now it should be clear why this procedure is sometimes referred to as
demand paging: Physical memory contents are moved to the backup storage
and back on software demand, based on statistics of the memory usage of the
operating system and its applications.
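
To illustrate the least-recently-used idea, here is a toy simulation in C. It is emphatically not the Windows 2000 algorithm; it is a minimal sketch assuming a hypothetical machine with only three page frames, where every access carries a logical timestamp and a page fault evicts the frame with the oldest timestamp. All type and function names (Frame, Pager, pager_touch, count_faults) are invented for this example.

```c
#include <stddef.h>
#include <stdint.h>

#define FRAME_COUNT 3u  /* hypothetical: "physical memory" of three frames */

typedef struct {
    uint32_t page;      /* page number currently held by this frame */
    uint64_t last_use;  /* logical time of the most recent access */
    int      in_use;    /* 0 while the frame is still free */
} Frame;

typedef struct {
    Frame    frames[FRAME_COUNT];
    uint64_t clock;     /* incremented on every access */
} Pager;

/* Access a page. Returns 1 if this caused a page fault, 0 on a hit. */
static int pager_touch(Pager *p, uint32_t page)
{
    size_t victim = 0;

    p->clock++;

    /* Hit: the page is resident, so just refresh its timestamp. */
    for (size_t i = 0; i < FRAME_COUNT; i++) {
        if (p->frames[i].in_use && p->frames[i].page == page) {
            p->frames[i].last_use = p->clock;
            return 0;
        }
    }

    /* Fault: take the first free frame, or else the least recently used. */
    for (size_t i = 0; i < FRAME_COUNT; i++) {
        if (!p->frames[i].in_use) {
            victim = i;
            break;
        }
        if (p->frames[i].last_use < p->frames[victim].last_use) {
            victim = i;
        }
    }

    /* "Swap in" the requested page, replacing the victim's contents. */
    p->frames[victim].page = page;
    p->frames[victim].last_use = p->clock;
    p->frames[victim].in_use = 1;
    return 1;
}

/* Replay a sequence of page accesses and count the resulting faults. */
static unsigned count_faults(const uint32_t *pages, size_t count)
{
    Pager p = { .clock = 0 };   /* all frames start out free */
    unsigned faults = 0;

    for (size_t i = 0; i < count; i++) {
        faults += (unsigned)pager_touch(&p, pages[i]);
    }
    return faults;
}
```

Replaying the access sequence 1, 2, 3, 1, 4, 2, 1 yields five faults: three to fill the empty frames, one when page 4 evicts page 2, and one when page 2 returns and evicts page 3.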

The address indirection layer represented by the page-tables has two
interesting implications. First, there is no predetermined relationship between
the addresses used by a program and the addresses found on the physical address
bus of the CPU chip. If you know that a data structure of your application is
located at the address, say, 0x00140000, you still don't know
anything about the physical address of your data unless you examine the
page-table tree. It is up to the operating system to decide what this address
mapping looks like. Even more, the address translation currently in effect is
unpredictable, in part because of the probabilistic nature of the paging
mechanism. Fortunately, knowledge of physical addresses isn't required in
most application cases. This is something left for developers of hardware
drivers. The second implication of paging is that the address space is not
necessarily contiguous. Depending on the page-table contents, the 4-GB space can
comprise large "holes" where neither physical nor backup memory is
mapped. If an application tries to read from or write to such an address, it
will be aborted immediately by the system. Later in this chapter, I will show in
detail how Windows 2000 spreads its available memory over the 4-GB address
space.

The 80486 and Pentium CPUs use the very same i386 segmentation and paging
mechanisms introduced with the 80386, except for some exotic addressing features
such as the Physical Address Extension (PAE) of the Pentium Pro. Along with
higher clock frequencies, these newer models contain optimizations in other
areas. For example, the Pentium features a dual instruction pipeline that
enables it to execute two operations at the same time, as long as these
instructions don't depend on each other. If instruction A
modifies a register value, and the consecutive instruction B uses the modified
value for a computation, B cannot be executed before A has finished. But if
instruction B involves a different register, the CPU can execute A and B
simultaneously without adverse effects. This and other Pentium optimizations
have opened a wide field for compiler optimization. If this topic looks
interesting, see Rick Booth's Inner Loops (Booth 1997).

In the context of i386 memory management, three sorts of addresses must be
distinguished, termed logical, linear, and physical addresses in
Intel's system programming manual for the Pentium (Intel 1999c).

1. Logical addresses: This is the most precise specification of a
memory location, usually written in hexadecimal form as XXXX:YYYYYYYY,
where XXXX is a selector, and YYYYYYYY is a linear offset into
the segment addressed by the selector. Instead of a numeric XXXX value,
it is also possible to specify the name of a segment register holding the
selector, such as CS (code segment), DS (data segment),
ES (extra segment), FS (additional data segment #1),
GS (additional data segment #2), and SS (stack segment). This
notation is borrowed from the old "segment:offset" style of specifying
"far pointers" in 8086 Real-Mode.

2. Linear addresses: Most applications and many kernel-mode
drivers disregard logical addresses, which are also called virtual addresses.
More precisely, they are just interested in the offset part of a logical
address, which is referred to as a linear address. An address of this type
assumes a default segmentation model,
determined by the current values of the CPU's segment registers. Windows
2000 uses flat segmentation, with the CS, DS, ES, and SS
registers pointing to the same linear address space; therefore, programs can
safely assume that all code, data, and stack pointers can be cast among one
another. For example, a stack location can be cast to a data pointer at any time
without concern about the values of the corresponding segment
registers.

3. Physical addresses: This address type is of interest only if
the CPU works in paging mode. Basically, a physical address is the voltage
pattern measurable at the address bus pins of the CPU chip. The operating system
maps linear addresses to physical addresses by setting up page-tables. The
layout of the Windows 2000 page-tables, which has some very interesting
properties for debugging software developers, will be discussed later in this
chapter.

The distinction between virtual and linear addresses is somewhat artificial,
and some documentation uses both terms interchangeably. I will do my best to use
this nomenclature consistently. It is important to note that Windows 2000
assumes physical addresses to be 64 bits wide. This might seem odd on Intel i386
systems, which usually have a 32-bit address bus. However, some Pentium systems
can address more than 4 GB of physical memory. For example, the Physical Address
Extension (PAE) mode of the Pentium Pro CPU extends the physical address space
to 36 bits, allowing access to 64 GB of RAM (Intel 1999c). Therefore, the
Windows 2000 API functions involving physical addresses usually rely on the data
type PHYSICAL_ADDRESS, which is just an alias name for the
LARGE_INTEGER structure, as shown in Listing 4-1. Both types are
defined in the DDK header file ntdef.h. The LARGE_INTEGER is a
structural representation of a 64-bit signed integer, allowing interpretation as
a concatenation of two 32-bit quantities (LowPart and
HighPart) or a single 64-bit number (QuadPart). The
LONGLONG type is equivalent to the native Visual C/C++ type
__int64. Its unsigned sibling is called ULONGLONG or
DWORDLONG and is based on the native unsigned __int64
type.
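
The dual interpretation of the 64-bit value can be sketched with a portable stand-in. Note that this is not the DDK definition from ntdef.h: the type names LargeIntegerSketch and PhysicalAddressSketch and the helper make_physical are my own, I use the fixed-width types from stdint.h instead of the DDK's basic types, and the overlay of the two halves onto the 64-bit number assumes a little-endian CPU such as the i386.

```c
#include <stdint.h>

/* Simplified stand-in for LARGE_INTEGER: the same 64 bits can be read
 * as two 32-bit halves or as one signed 64-bit number. */
typedef union {
    struct {
        uint32_t LowPart;   /* bits 0..31 */
        int32_t  HighPart;  /* bits 32..63, carries the sign */
    } u;
    int64_t QuadPart;       /* the full 64-bit value */
} LargeIntegerSketch;

/* Mirrors the aliasing of PHYSICAL_ADDRESS to LARGE_INTEGER. */
typedef LargeIntegerSketch PhysicalAddressSketch;

/* Hypothetical helper: build a physical address from its two halves. */
static PhysicalAddressSketch make_physical(uint32_t low, int32_t high)
{
    PhysicalAddressSketch pa;
    pa.u.LowPart  = low;
    pa.u.HighPart = high;
    return pa;
}
```

On a system without PAE, HighPart is always zero. A 36-bit PAE address uses the lowest four bits of HighPart.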

Figure 4-1 outlines
the i386 memory segmentation model, showing the relationship between logical
and linear addresses. For clarity, I have drawn the descriptor table and the
segment as small, nonoverlapping boxes. However, this isn't a requirement.
Actually, a 32-bit operating system usually applies a segmentation layout as
shown in Figure 4-2.
This so-called flat memory model is based on segments that span the entire 4-GB
address space. As a side effect, the descriptor table becomes part of the segment
and can be accessed by all code that has sufficient access rights.

The memory model in Figure
4-2 is adopted by Windows 2000 for the standard code, data, and stack segments,
that is, all logical addresses that involve the CS, DS, ES, and SS
segment registers. The FS and GS segments are treated differently.
GS is not used by Windows 2000, and FS addresses special system
data areas inside the linear address space. Therefore, its base address is greater
than zero and its size is less than 4 GB. Interestingly, Windows 2000 maintains
different FS segments in user-mode and kernel-mode. More on this topic
follows later in this chapter.
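
The difference between a flat segment and a small segment such as FS can be sketched as follows. This is a simplified model that reduces a segment to a base address and a byte-granular limit; real descriptors carry more attributes. The names SegmentSketch and logical_to_linear are hypothetical, and the base and limit values in the usage note are illustrative only.

```c
#include <stdint.h>

/* Simplified segment: linear base address plus highest valid offset. */
typedef struct {
    uint32_t base;
    uint32_t limit;
} SegmentSketch;

/* Convert the offset part of a logical address to a linear address.
 * Returns -1 if the offset exceeds the segment limit, a situation in
 * which the CPU would raise a protection fault. */
static int64_t logical_to_linear(SegmentSketch seg, uint32_t offset)
{
    if (offset > seg.limit) {
        return -1;
    }
    return (int64_t)(uint32_t)(seg.base + offset);
}
```

In a flat segment with base 0 and limit 0xFFFFFFFF, every offset is identical to its linear address. In a small segment with, say, base 0x7FFDE000 and limit 0xFFF, the offset 0x18 maps to the linear address 0x7FFDE018, and the offset 0x1000 is rejected.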

In Figures 4-1 and 4-2, the selector portion of the logical address is shown
to point into a descriptor table determined by a register termed GDTR. This is
the CPU's Global Descriptor Table Register, which can be set by the operating
system to any suitable linear address. The first entry of the Global Descriptor
Table (GDT) is reserved, and the corresponding selector, called the "null
segment selector," is intended as an initial value for unused segment
registers. Windows 2000 keeps its GDT
at address 0x80036000. The GDT can hold up to 8,192 64-bit entries,
resulting in a maximum size of 64 KB. Windows 2000 uses only the first 128 entries,
restricting the GDT size to 1,024 bytes. Along with the GDT, the i386 CPU provides
a Local Descriptor Table (LDT) and an Interrupt Descriptor Table (IDT), addressed
by the LDTR and IDTR registers, respectively. Whereas the
GDTR and IDTR values are unique and apply to all tasks executed
by the CPU, the LDTR value is task-specific, and, if used, contains
a 16-bit GDT selector.
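
The table sizes quoted above follow from the layout of a 16-bit selector: the upper 13 bits form the table index, bit 2 chooses between the GDT and the LDT, and the lowest two bits hold the requested privilege level. Here is a minimal C sketch of this decomposition; the function and macro names are my own.

```c
#include <stdint.h>

#define DESCRIPTOR_SIZE   8u     /* each descriptor is 64 bits wide */
#define GDT_MAX_ENTRIES   8192u  /* 13 index bits: 2^13 entries */

/* Bits 3..15: index into the descriptor table. */
static uint16_t selector_index(uint16_t selector)
{
    return (uint16_t)(selector >> 3);
}

/* Bit 2: table indicator, 0 = GDT, 1 = LDT. */
static int selector_uses_ldt(uint16_t selector)
{
    return (selector >> 2) & 1;
}

/* Bits 0..1: requested privilege level (0 = highest, 3 = lowest). */
static unsigned selector_rpl(uint16_t selector)
{
    return selector & 3u;
}

/* Byte offset of the selected descriptor from the table base. */
static uint32_t descriptor_offset(uint16_t selector)
{
    return (uint32_t)selector_index(selector) * DESCRIPTOR_SIZE;
}
```

A full table of 8,192 descriptors at 8 bytes each occupies 65,536 bytes, and the 128 entries used by Windows 2000 occupy 1,024 bytes, matching the figures in the text.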

Figure 4-3 demonstrates
the complex mechanism of linear-to-physical address translation applied by the
i386 memory management unit if demand paging is enabled in 4-KB page mode. The
Page-Directory Base Register (PDBR) in the upper left corner contains the physical
base address of the page-directory. The PDBR is identical to the i386 CR3
register. Only the upper 20 bits are used for addressing. Therefore, the page-directory
is always located on a page boundary. The remaining PDBR bits are either flags
or reserved for future extensions. The page-directory occupies exactly one 4-KB
page, structured as an array of 1,024 32-bit page-directory entries (PDEs).
Similar to the PDBR, each PDE can be divided into a 20-bit page-frame number
(PFN) addressing a page-table, and an array of bit flags. Each page-table is
page-aligned and spans 4 KB, comprising 1,024 page-table entries (PTEs). Again,
the upper 20 bits are extracted from a PTE to form a pointer to a 4-KB data page.
Address translation takes place by breaking a linear address into three parts:
The upper 10 bits select a PDE out of the page-directory, the next lower 10 bits
select a PTE out of the page-table addressed by the PDE, and, finally, the lower
12 bits specify an offset into the data page addressed by the PTE.
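
The three-way split of a linear address is easy to express in C. The following sketch only performs the bit slicing; it does not walk any real page-tables. The function names are chosen for illustration.

```c
#include <stdint.h>

/* Upper 10 bits: index of the PDE inside the page-directory. */
static uint32_t pde_index(uint32_t linear)
{
    return linear >> 22;
}

/* Next lower 10 bits: index of the PTE inside the page-table. */
static uint32_t pte_index(uint32_t linear)
{
    return (linear >> 12) & 0x3FFu;
}

/* Lower 12 bits: byte offset inside the 4-KB data page. */
static uint32_t page_offset(uint32_t linear)
{
    return linear & 0xFFFu;
}
```

For the linear address 0x80400123, this yields PDE index 0x201, PTE index 0x000, and page offset 0x123.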

In the 4-KB paging scheme, the 4-GB linear address space is addressable by
means of a double-layered indirection mechanism. In the worst case, 1,048,576
PTEs are required to cover the entire range. Because each page-table holds 1,024
PTEs, this amounts to 1,024 page-tables, which is the number of PDEs the
page-directory contains. With the page-directory and each page-table consuming 4
KB, the maximum memory management overhead in this paging model is 4 KB plus 4
MB, or 4,100 KB. That's a reasonable price for a subdivision of the entire
4-GB space into 4-KB tiles that can be mapped to any linear address.
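
If you want to verify the overhead figures, the arithmetic can be spelled out in a few lines of C. The macro and function names are mine; the numbers are those given in the text.

```c
#include <stdint.h>

#define ADDRESS_SPACE      0x100000000ull  /* 4 GB */
#define PAGE_SIZE_4K       4096u           /* bytes per page */
#define ENTRY_SIZE         4u              /* bytes per PDE or PTE */
#define ENTRIES_PER_TABLE  (PAGE_SIZE_4K / ENTRY_SIZE)

/* Number of PTEs needed to map every 4-KB page of the 4-GB space. */
static uint32_t max_ptes(void)
{
    return (uint32_t)(ADDRESS_SPACE / PAGE_SIZE_4K);
}

/* Number of page-tables needed to hold that many PTEs. */
static uint32_t max_page_tables(void)
{
    return max_ptes() / ENTRIES_PER_TABLE;
}

/* Worst-case overhead in KB: one page-directory plus all page-tables. */
static uint32_t max_overhead_kb(void)
{
    uint32_t bytes = PAGE_SIZE_4K + max_page_tables() * PAGE_SIZE_4K;
    return bytes / 1024u;
}
```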

In 4-MB paging mode, things are much simpler because one indirection layer
is eliminated, as shown in Figure
4-4. Again, the PDBR points to the page-directory, but now only the upper
10 bits of the PDE are used, resulting in 4-MB alignment of the target address.
Because no page-tables are used, this address is already the base address of
a 4-MB data page. Consequently, the linear address now consists of two parts
only: 10 bits for PDE selection and 22 offset bits. The 4-MB memory scheme requires
no more than 4 KB overhead, because only the page-directory consumes additional
memory. Each of its 1,024 PDEs can address one 4-MB page. This is just enough
to cover the entire 4-GB address space. Thus, 4-MB pages have the advantage
of keeping the memory management overhead low, but at the price of a coarser
addressing granularity.
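
In 4-MB mode, the bit slicing reduces to two parts, and the physical address results from combining the upper 10 bits of the PDE with the 22-bit offset. The following C sketch shows both steps. The function names are invented for illustration, and the PDE value in the usage note is a made-up example rather than a value read from a real system.

```c
#include <stdint.h>

/* Upper 10 bits: index of the PDE inside the page-directory. */
static uint32_t pde_index_4m(uint32_t linear)
{
    return linear >> 22;
}

/* Lower 22 bits: byte offset inside the 4-MB data page. */
static uint32_t page_offset_4m(uint32_t linear)
{
    return linear & 0x003FFFFFu;
}

/* Combine the 4-MB aligned page base stored in the upper 10 bits of
 * the PDE with the offset taken from the linear address. */
static uint32_t physical_from_pde_4m(uint32_t pde, uint32_t linear)
{
    return (pde & 0xFFC00000u) | page_offset_4m(linear);
}
```

Given a hypothetical PDE value of 0x004000E3, the linear address 0x80400123 translates to the physical address 0x00400123. The low bits of the PDE are flags and do not contribute to the address.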

Both the 4-KB and 4-MB paging modes have advantages and disadvantages.
Fortunately, operating system designers don't have to settle on one of
them, but can run the CPU in mixed mode. For example, Windows 2000 works with
4-MB pages in the memory range 0x80000000 to 0x9FFFFFFF, where
the kernel modules hal.dll and ntoskrnl.exe are loaded. The
remaining linear address blocks are managed in 4-KB tiles. This mixed design is
recommended by Intel for improved system performance, because 4-KB and 4-MB page
entries are cached in different Translation Lookaside Buffers (TLBs) inside the
i386 CPU (Intel 1999c, pp. 3-22f). The operating system kernel is usually large
and is always resident in memory, so storing it in several 4-KB pages would
permanently use up valuable TLB space.

Note that all address translation steps are carried out in physical memory.
The PDBR and all PDEs and PTEs contain physical address pointers. The only linear
address found in Figures
4-3 and 4-4 is
the box in the lower left corner specifying the address to be converted to an
offset inside a physical page. On the other hand, applications must work with
linear addresses and are ignorant of physical addresses. However, it is possible
to fill this gap by mapping the page-directory and all of its subordinate page-tables
into the linear address space. On Windows 2000 and Windows NT 4.0, all PDEs and
PTEs are accessible in the address range 0xC0000000 to 0xC03FFFFF. This linear
memory area of 4-MB size is obviously the maximum amount of memory consumed by
the page-table layer in 4-KB paging mode. The PTE associated with a linear
address can be looked up by simply
using its most significant 20 bits as an index into the array of 32-bit PTEs
starting at 0xC0000000. For example, the PTE of address
0x00000000 is located at 0xC0000000. The PTE index of address
0x80000000 is computed by shifting it right by 12 bits to get at the
upper 20 bits, yielding 0x80000. Because each PTE takes four bytes, the
target PTE is found at 0xC0000000 + (4 * 0x80000) =
0xC0200000. This result looks interesting—obviously, the address that
divides the 4-GB address space in two equal halves is mapped to a PTE address
that divides the PTE array in two equal halves.
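
The PTE lookup just described translates directly into C. This sketch merely computes the linear address at which the PTE would be found; it does not dereference it, because that is possible only in kernel-mode. The names PTE_ARRAY_BASE and pte_address are my own.

```c
#include <stdint.h>

#define PTE_ARRAY_BASE 0xC0000000u  /* start of the mapped PTE array */
#define PTE_SIZE       4u           /* each PTE takes four bytes */

/* Linear address of the PTE that describes the given linear address:
 * the upper 20 bits serve as an index into the PTE array. */
static uint32_t pte_address(uint32_t linear)
{
    return PTE_ARRAY_BASE + (linear >> 12) * PTE_SIZE;
}
```

As computed in the text, the PTE of address 0x80000000 is found at 0xC0200000. The very last page of the address space, starting at 0xFFFFF000, has its PTE at 0xC03FFFFC, the last four bytes of the 4-MB PTE array.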

Now let's go one more step ahead and compute the entry address of the
PTE array itself. The general mapping formula is ((LinearAddress >>
12) * 4) + 0xC0000000. Setting LinearAddress to 0xC0000000
yields 0xC0300000. Let's pause for a moment: The entry at linear
address 0xC0300000 points to the beginning of the PTE array in physical
memory. Now look back to Figure
4-3. The 1,024 entries starting at address 0xC0300000 must be the
page-directory! This special PDE and PTE arrangement is exploited by various
memory management functions implemented in ntoskrnl.exe. For example,
the (documented) API functions MmIsAddressValid() and MmGetPhysicalAddress()
take a 32-bit linear address, look up its PDE and, if applicable, its PTE, and
examine their contents. MmIsAddressValid() simply checks whether
the target page is currently present in physical memory. If the test fails,
the linear address is either invalid or it refers to a page that has been flushed
to backup storage, represented by the set of system pagefiles. MmGetPhysicalAddress()
first extracts the page-frame number (PFN) corresponding to a linear address,
which is the base address of its associated physical page divided by the page
size. Next, it computes the offset into this page by extracting the least significant
12 bits of the linear address, and adds the offset to the physical base address
determined by the PFN.
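
The following C sketch mimics the steps just described. Please note that it is not the actual ntoskrnl.exe code: it operates on PDE and PTE values passed in as plain numbers instead of reading them from the addresses 0xC0300000 and 0xC0000000, and all names are hypothetical. The two flag bits are taken from the i386 entry format, where bit 0 marks an entry as present and bit 7 of a PDE marks a 4-MB page.

```c
#include <stdbool.h>
#include <stdint.h>

#define PDE_ARRAY_BASE    0xC0300000u  /* page-directory in linear space */
#define ENTRY_PRESENT     0x001u       /* bit 0: target is in physical RAM */
#define ENTRY_LARGE_PAGE  0x080u       /* bit 7 of a PDE: 4-MB page */

/* Linear address of the PDE that covers the given linear address. */
static uint32_t pde_address(uint32_t linear)
{
    return PDE_ARRAY_BASE + (linear >> 22) * 4u;
}

/* Simplified validity test in the spirit of MmIsAddressValid(): the PDE
 * must be present, and unless it maps a 4-MB page, so must the PTE. */
static bool address_valid_sketch(uint32_t pde, uint32_t pte)
{
    if (!(pde & ENTRY_PRESENT)) {
        return false;
    }
    if (pde & ENTRY_LARGE_PAGE) {
        return true;    /* no PTE is involved in 4-MB mode */
    }
    return (pte & ENTRY_PRESENT) != 0;
}

/* Physical address from a 4-KB PTE: PFN times page size plus offset. */
static uint32_t physical_from_pte(uint32_t pte, uint32_t linear)
{
    uint32_t pfn = pte >> 12;              /* page-frame number */
    return (pfn << 12) + (linear & 0xFFFu);
}
```

For example, a made-up PTE value of 0x01234067 contains the PFN 0x01234, so the linear address 0x00140ABC would translate to the physical address 0x01234ABC.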

More thorough examination of the implementation of
MmGetPhysicalAddress() reveals another interesting property of the
Windows 2000 memory layout. Before anything else, the code tests whether the
linear address is within the range 0x80000000 to 0x9FFFFFFF.
As already mentioned, this is the home of hal.dll and
ntoskrnl.exe, and it is also the address block where Windows 2000 uses
4-MB pages. The interesting thing is that MmGetPhysicalAddress()
doesn't care at all about PDEs or PTEs if the address is within this range.
Instead, it simply sets the top three bits to zero, adds the byte offset, as
usual, and returns the result as the physical address. This means that the
physical address range 0x00000000 to 0x1FFFFFFF is mapped 1:1
to the linear addresses 0x80000000 to 0x9FFFFFFF! Knowing that
ntoskrnl.exe is always loaded to the linear address
0x80400000, this means that the Windows 2000 kernel is always found at
physical address 0x00400000, which happens to be the base address of
the second 4-MB page in physical memory. In fact, examination of these memory
regions proves that the above assumptions are correct. You will have the
opportunity to see this with the memory spy presented in this chapter.
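
The shortcut taken for the 4-MB page range boils down to a range check and a bit mask. Here is a minimal C sketch of this special case. It covers only the 1:1 mapped range and is not a reimplementation of MmGetPhysicalAddress(); the function names are my own.

```c
#include <stdbool.h>
#include <stdint.h>

#define DIRECT_MAP_FIRST 0x80000000u  /* first 1:1 mapped linear address */
#define DIRECT_MAP_LAST  0x9FFFFFFFu  /* last 1:1 mapped linear address */

/* True if the linear address is inside the 1:1 mapped kernel range. */
static bool in_direct_mapped_range(uint32_t linear)
{
    return linear >= DIRECT_MAP_FIRST && linear <= DIRECT_MAP_LAST;
}

/* Clear the top three bits to obtain the physical address. This is
 * meaningful only for addresses inside the range checked above. */
static uint32_t direct_mapped_physical(uint32_t linear)
{
    return linear & 0x1FFFFFFFu;
}
```

Applied to the load address 0x80400000 of ntoskrnl.exe, the mask yields the physical address 0x00400000 mentioned above.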