Linux processes execute in a virtual environment that makes it appear as if
each process had the entire address space of the CPU available to itself. This
virtual address space extends from address 0 all the way to the maximum address.
On a 32-bit platform, such as IA-32, the maximum address is 2^32 - 1,
or 0xffffffff. On a 64-bit platform, such as IA-64, it is
2^64 - 1, or 0xffffffffffffffff.

While it is obviously convenient for a process to be able to access such a
huge address space, there are really three distinct, but equally important,
reasons for using virtual memory.

Resource virtualization. On a system with virtual memory,
a process does not have to concern itself with the details of how much physical
memory is available or which physical memory locations are already in use by
some other process. In other words, virtual memory takes a limited physical
resource (physical memory) and turns it into an infinite, or at least an
abundant, resource (virtual memory).

Information isolation. Because each process runs in its own
address space, it is not possible for one process to read data that belongs to
another process. This improves security because it reduces the risk of one
process being able to spy on another process and, e.g., steal a
password.

Fault isolation. Processes with their own virtual address
spaces cannot overwrite each other's memory. This greatly reduces the risk
of a failure in one process triggering a failure in another process. That is,
when a process crashes, the problem is generally limited to that process alone
and does not cause the entire machine to go down.

In this chapter, we explore how the Linux kernel implements its virtual
memory system and how it maps to the underlying hardware. This mapping is
illustrated specifically for IA-64. The chapter is structured as follows. The
first section provides an introduction to the virtual memory system of Linux and
establishes the terminology used throughout the remainder of the chapter. The
introduction is followed by a description of the software and hardware
structures that form the virtual memory system. Specifically, the second section
describes the Linux virtual address space and its representation in the kernel.
The third section describes the Linux page tables, and the fourth section
describes how Linux manages the translation lookaside buffer (TLB), which
is a hardware structure used to accelerate virtual memory accesses. Once these
fundamental structures are introduced, the chapter describes the operation of
the virtual memory system. Section five explores the Linux page fault handler,
which can be thought of as the engine driving the virtual memory system. Section
six describes how memory coherency is maintained, that is, how Linux ensures
that a process sees the correct values in the virtual memory locations it can
access. Section seven discusses how Linux switches execution from one address
space to another, which is a necessary step during a process context switch. The
chapter concludes with section eight, which provides the rationale for some of
the virtual memory choices that were made for the virtual memory system
implemented on IA-64.

4.1 Introduction to the Virtual Memory System

The left half of Figure 4.1 illustrates the virtual address space as it might
exist for a particular process on a 64-bit platform. As the figure shows, the
virtual address space is divided into equal-sized pieces called virtual
pages. Virtual pages have a fixed size that is an integer power of 2. For
example, IA-32 uses a page size of 4 Kbytes. To maximize performance, IA-64
supports multiple page sizes and Linux can be configured to use a size of 4, 8,
16, or 64 Kbytes. In the figure, the 64-bit address space is divided into 16
pages, meaning that each virtual page would have a size of 2^64/16 =
2^60 bytes, or 1024 Pbytes (1 Pbyte = 2^50 bytes). Such large
pages are not realistic, but the alternative of drawing a figure with several
billion pages of a more realistic size is, of course, not practical either.
Thus, for this section, we continue to illustrate virtual memory with this huge
page size. The figure also shows that virtual pages are numbered sequentially.
We can calculate the virtual page number (VPN) from a virtual
address by dividing it by the page size and taking the integer portion of the
result. The remainder is called the page offset. For example, dividing
virtual address 0x40000000000003f8 by the page size yields 4 and a
remainder of 0x3f8. This address therefore maps to virtual page number
4 and page offset 0x3f8.
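The VPN/offset split is plain integer division. The following Python sketch (illustrative only, using the chapter's unrealistically large page size of 2^60 bytes) reproduces the example calculation:

```python
# Illustrative sketch of the VPN/offset split, using the chapter's
# (unrealistically large) example page size of 2^60 bytes.
PAGE_SIZE = 2 ** 60

def split_virtual_address(va):
    """Return (virtual page number, page offset) for a virtual address."""
    vpn = va // PAGE_SIZE      # integer portion of the division
    offset = va % PAGE_SIZE    # remainder
    return vpn, offset

vpn, offset = split_virtual_address(0x40000000000003F8)
print(vpn, hex(offset))        # 4 0x3f8
```

Because the page size is a power of 2, the same split could also be computed with a shift and a bit mask, which is what real hardware does.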

Let us now turn attention to the right half of Figure 4.1, which shows the
physical address space. Just like the virtual space, it is divided into
equal-sized pieces, but in physical memory, those pieces are called page
frames. As with virtual pages, page frames are also numbered. We can
calculate the page frame number (PFN) from a physical address by
dividing it by the page frame size and taking the integer portion of the result.
The remainder is the page frame offset. Normally, page frames have the
same size as virtual pages. However, there are cases where it is beneficial to
deviate from this rule. Sometimes it is useful to have virtual pages that are
larger than a page frame. Such pages are known as superpages. Conversely,
it is sometimes useful to divide a page frame into multiple, smaller virtual
pages. Such pages are known as subpages. IA-64 is capable of supporting
both, but Linux does not use them.

While it is easiest to think of physical memory as occupying a single
contiguous region in the physical address space, in reality it is not uncommon
to encounter memory holes. Holes usually are caused by one of three
entities: firmware, memory-mapped I/O devices, or unpopulated memory. All three
cause portions of the physical address space to be unavailable for storing the
content of virtual pages. As far as the kernel is concerned, these portions are
holes in the physical memory. In the example in Figure 4.1, page frames 2 and 3
represent a hole. Note that even if just a single byte in a page frame is
unusable, the entire frame must be marked as a hole.

4.1.1 Virtual-to-physical address translation

Processes are under the illusion of being able to store data to virtual
memory and retrieve it later on as if it were stored in real memory. In reality,
only physical memory can store data. Thus, each virtual page that is in use must
be mapped to some page frame in physical memory. For example, in Figure 4.1,
virtual pages 4, 5, 8, 9, and 11 are in use. The arrows indicate which page
frame in physical memory they map to. The mapping between virtual pages and page
frames is stored in a data structure called the page table. The page
table for our example is shown on the left-hand side of Figure 4.2.

The Linux kernel is responsible for creating and maintaining page tables but
employs the CPU's memory management unit (MMU) to translate
the virtual memory accesses of a process into corresponding physical memory
accesses. Specifically, when a process accesses a memory location at a
particular virtual address, the MMU translates this address into the
corresponding physical address, which it then uses to access the physical
memory. This is illustrated in Figure 4.2 for the case in which the virtual
address is 0x40000000000003f8. As the figure shows, the MMU extracts
the VPN (4) from the virtual address and then searches the page table to find
the matching PFN. In our case, the search stops at the first entry in the page
table since it contains the desired VPN. The PFN associated with this entry is
7. The MMU then constructs the physical address by concatenating the PFN with
the page offset taken from the virtual address, which results in a physical
address of 0x70000000000003f8.
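As a rough sketch of this translation step (not kernel code), the Python fragment below mimics Figure 4.2. Only the mapping of virtual page 4 to page frame 7 is taken from the text; the other page-table entries are invented for illustration:

```python
PAGE_SIZE = 2 ** 60

# VPN -> PFN; only the entry 4 -> 7 comes from the example in the text,
# the remaining entries are invented for illustration.
page_table = {4: 7, 5: 6, 8: 1, 9: 0, 11: 4}

def translate(va):
    """Mimic the MMU: split the address, look up the PFN, rebuild the address."""
    vpn, offset = divmod(va, PAGE_SIZE)
    pfn = page_table[vpn]              # a missing entry would mean "not present"
    return pfn * PAGE_SIZE + offset    # concatenate PFN and page offset

print(hex(translate(0x40000000000003F8)))   # 0x70000000000003f8
```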

4.1.2 Demand paging

The next question we need to address is how the page tables get created.
Linux could create appropriate page-table entries whenever a range of virtual
memory is allocated. However, this would be wasteful because most programs
allocate much more virtual memory than they ever use at any given time. For
example, the text segment of a program often includes large amounts of error
handling code that is seldom executed. To avoid wasting memory on virtual pages
that are never accessed, Linux uses a method called demand paging. With
this method, the virtual address space starts out empty. This means that,
logically, all virtual pages are marked in the page table as not present.
When accessing a virtual page that is not present, the CPU generates a page
fault. This fault is intercepted by the Linux kernel and causes the page
fault handler to be activated. There, the kernel can allocate a new page frame,
determine what the content of the accessed page should be (e.g., a new, zeroed
page, a page loaded from the data section of a program, or a page loaded from
the text segment of a program), load the page, and then update the page table to
mark the page as present. Execution then resumes in the process with the
instruction that caused the fault. Since the required page is now present, the
instruction can now execute without causing a page fault.
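The demand-paging cycle can be sketched in a few lines of Python. Everything here is illustrative (the names and data structures are invented): a dictionary stands in for the page table, and a zeroed bytearray stands in for a freshly allocated page frame:

```python
PAGE_SIZE = 4096           # e.g., the IA-32 page size mentioned earlier

page_table = {}            # VPN -> page frame; an absent key means "not present"

def handle_page_fault(vpn):
    # Allocate a new, zeroed page frame and mark the page as present.
    page_table[vpn] = bytearray(PAGE_SIZE)

def load_byte(va):
    vpn, offset = divmod(va, PAGE_SIZE)
    if vpn not in page_table:          # access to a page that is not present...
        handle_page_fault(vpn)         # ...triggers the fault handler
    return page_table[vpn][offset]     # the retried access now succeeds

print(load_byte(0x2000))               # first touch faults, then reads 0
```

Note that the fault is completely transparent to the caller of load_byte, just as a real page fault is transparent to the faulting process.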

4.1.3 Paging and swapping

So far, we have assumed that physical memory is plentiful: Whenever we needed a
page frame to back a virtual page, we assumed a free page frame was available.
When a system has many processes or when some processes grow very large, the
physical memory can easily fill up. So what is Linux supposed to do when a page
frame is needed but physical memory is already full? The answer is that in this
case, Linux picks a page frame that backs a virtual page that has not been
accessed recently, writes it out to a special area on the disk called the
swap space, and then reuses the page frame to back the new virtual page.
The exact place to which the old page is written on the disk depends on what
kind of swap space is in use. Linux can support multiple swap space areas, each
of which can be either an entire disk partition or a specially formatted file on
an existing filesystem (the former is generally more efficient and therefore
preferable). Of course, the page table of the process from which Linux
"stole" the page frame must be updated accordingly. Linux does this
update by marking the page-table entry as not present. To keep track of
where the old page has been saved, it also uses the entry to record the disk
location of the page. In other words, a page-table entry that is present
contains the page frame number of the physical page frame that backs the
virtual page, whereas a page-table entry that is not present contains the
disk location at which the content of the page can be found.
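One way to picture this dual use of a page-table entry is as a single word whose low bit says whether the remaining bits hold a page frame number or a swap-space location. The encoding below is invented for illustration and does not match any particular architecture:

```python
PRESENT = 1 << 0    # illustrative layout: bit 0 = present, remaining bits = payload

def make_present_pte(pfn):
    return (pfn << 1) | PRESENT        # payload is a page frame number

def make_swapped_pte(swap_slot):
    return swap_slot << 1              # present bit clear; payload is a disk location

def decode_pte(pte):
    kind = "present" if pte & PRESENT else "swapped"
    return kind, pte >> 1              # PFN if present, disk location if not

print(decode_pte(make_present_pte(7)))    # ('present', 7)
print(decode_pte(make_swapped_pte(42)))   # ('swapped', 42)
```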

Because a page marked as not present cannot be accessed without first
triggering a page fault, Linux can detect when the page is needed again. When
this happens, Linux needs to again find an available page frame (which may cause
another page to be paged out), read the page content back from swap space, and
then update the page-table entry so that it is marked as present and maps
to the newly allocated page frame. At this point, the process that attempted to
access the paged-out page can be resumed, and, apart from a small delay, it will
execute as if the page had been in memory all along.

The technique of stealing a page from a process and writing it out to disk is
called paging. A related technique is swapping. It is a more
aggressive form of paging in the sense that it does not steal an individual page
but steals all the pages of a process when memory is in short supply.
Linux uses paging but not swapping. However, because both paging and swapping
write pages to swap space, Linux kernel programmers often use the terms
"swapping" and "paging" interchangeably. This is something
to keep in mind when perusing the kernel source code.

From a correctness point of view, it does not matter which page is selected
for page out, but from a performance perspective, the choice is critical. With a
poor choice, Linux may end up paging out the page that is needed in the very
next memory access. Given the large difference between disk access latency (on
the order of several milliseconds) and memory access latency (on the order of
tens of nanoseconds), making the right replacement choices can mean the
difference between completing a task in a second or in almost three hours!

The algorithm that determines which page to evict from main memory is called
the replacement policy. The provably optimal replacement policy
(OPT) is to choose the page that will be accessed farthest in the future. Of
course, in general it is impossible to know the future behavior of the
processes, so OPT is of theoretical interest only. A replacement policy that
often performs almost as well as OPT yet is realizable is the least recently
used (LRU) policy. LRU looks into the past instead of the
future and selects the page that has not been accessed for the longest period of
time. Unfortunately, even though LRU could be implemented, it is still not
practical because it would require updating a data structure (such as an LRU
list) on every access to main memory. In practice, operating
systems use approximations of the LRU policy instead, such as the clock
replacement or not frequently used (NFU) policies [11, 69].
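As a concrete example of such an approximation, here is a minimal sketch of clock (second-chance) replacement: the frames form a circular list, and a "hand" sweeps over them, clearing accessed bits as it goes and evicting the first frame whose bit is already clear. This is a textbook illustration, not the Linux implementation:

```python
def clock_evict(frames, hand):
    """frames: list of dicts with an 'accessed' bit; returns (victim, new hand)."""
    while True:
        if frames[hand]["accessed"]:
            frames[hand]["accessed"] = 0        # second chance: clear bit, move on
            hand = (hand + 1) % len(frames)
        else:
            return hand, (hand + 1) % len(frames)

frames = [{"accessed": 1}, {"accessed": 0}, {"accessed": 1}]
victim, hand = clock_evict(frames, 0)
print(victim)   # 1: the first frame whose accessed bit was already clear
```

The appeal of the clock policy is that it needs no work on ordinary memory accesses; the hardware sets the accessed bit, and the sweep happens only when a frame must be reclaimed.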

In Linux, the page replacement is complicated by the fact that the kernel can
take up a variable amount of (nonpageable) memory. For example, file data is
stored in the page cache, which can grow and shrink dynamically. When the kernel
needs a new page frame, it often has two choices: It could take away a page from
the kernel or it could steal a page from a process. In other words, the kernel
needs not just a replacement policy but also a memory balancing
policy that determines how much memory is used for kernel buffers
and how much is used to back virtual pages. The combination of page replacement
and memory balancing poses a difficult problem for which there is no perfect
solution. Consequently, the Linux kernel uses a variety of heuristics that tend
to work well in practice.

To implement these heuristics, the Linux kernel expects the platform-specific
part of the kernel to maintain two extra bits in each page-table entry: the
accessed bit and the dirty bit. The accessed bit is an
indicator that tells the kernel whether the page was accessed (read, written, or
executed) since the bit was last cleared. Similarly, the dirty bit is an
indicator that tells whether the page has been modified since it was last paged
in. Linux uses a kernel thread, called the kernel swap daemon kswapd,
to periodically inspect these bits. After inspection, it clears the accessed
bit. If kswapd detects that the kernel is starting to run low on memory, it
starts to proactively page out memory that has not been used recently. If the
dirty bit of a page is set, it needs to write the page to disk before the
page frame can be freed. Because this is relatively costly, kswapd
preferentially frees pages whose accessed and dirty bits are
cleared to 0. By definition such pages were not accessed recently and do not
have to be written back to disk before the page frame can be freed, so they can
be reclaimed at very little cost.
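This preference can be captured as a simple cost ranking over eviction candidates. The sketch below is purely illustrative; the real kswapd logic is considerably more involved:

```python
def reclaim_cost(page):
    """Rank eviction candidates: lower cost = cheaper to reclaim."""
    cost = 0
    if page["accessed"]:
        cost += 2      # recently used; a poor replacement choice
    if page["dirty"]:
        cost += 1      # must be written to swap space before the frame is freed
    return cost

candidates = [
    {"name": "a", "accessed": 1, "dirty": 1},
    {"name": "b", "accessed": 0, "dirty": 1},
    {"name": "c", "accessed": 0, "dirty": 0},
]
print(min(candidates, key=reclaim_cost)["name"])   # c: clean and unreferenced
```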

4.1.4 Protection

In a multiuser and multitasking system such as Linux, multiple processes
often execute the same program. For example, each user who is logged into the
system at a minimum is running a command shell (e.g., the Bourne-Again shell,
bash). Similarly, server processes such as the Apache web server often use
multiple processes running the same program to better handle heavy loads. If we
looked at the virtual space of each of those processes, we would find that they
share many identical pages. Moreover, many of those pages are never modified
during the lifetime of a process because they contain read-only data or the text
segment of the program, which also does not change during the course of
execution. Clearly, it would make a lot of sense to exploit this commonality and
use only one page frame for each virtual page with identical content.

With N processes running the same program, sharing identical pages
can reduce physical memory consumption by up to a factor of N. In
reality, the savings are usually not quite so dramatic, because each process
tends to require a few private pages. A more realistic example is illustrated in
Figure 4.3: Page frames 0, 1, and 5 are used to back virtual pages 1, 2, and 3
in the two processes called bash 1 and bash 2. Note that a total of nine virtual
pages are in use, but thanks to page sharing, only six page frames are needed.

Of course, page sharing cannot be done safely unless we can guarantee that
none of the shared pages are modified. Otherwise, the changes of one process
would be visible in all the other processes and that could lead to unpredictable
program behavior.

This is where the page permission bits come into play. The Linux
kernel expects the platform-specific part of the kernel to maintain three such
bits per page-table entry. They are called the R, W, and X permission bits and
respectively control whether read, write, or execute accesses to the page are
permitted. If an access that is not permitted is attempted, a page
protection violation fault is raised. When this happens, the kernel
responds by sending a segmentation violation signal (SIGSEGV) to the process.

The page permission bits enable the safe sharing of page frames. All the
Linux kernel has to do is ensure that all virtual pages that refer to a shared
page frame have the W permission bit turned off. That way, if a process
attempted to modify a shared page, it would receive a segmentation violation
signal before it could do any harm.
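The check performed on each access can be sketched as follows. The bit values are invented for illustration, and a Python exception stands in for the SIGSEGV signal:

```python
R, W, X = 1, 2, 4      # illustrative encodings of the three permission bits

def check_access(pte_perms, access):
    """Raise a 'protection violation' if the access is not permitted."""
    if not (pte_perms & access):
        raise PermissionError("page protection violation -> SIGSEGV")

check_access(R | X, R)          # reading a text-segment page: permitted
try:
    check_access(R | X, W)      # writing a shared read-only page
except PermissionError as e:
    print(e)                    # page protection violation -> SIGSEGV
```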

The most obvious place where page frame sharing can be used effectively is in
the text segment of a program: By definition, this segment can be executed and
read, but it is never written to. In other words, the text segment pages of all
processes running the same program can be shared. The same applies to read-only
data pages.

Linux takes page sharing one step further. When a process forks a copy of
itself, the kernel disables write access to all virtual pages and sets
up the page tables such that the parent and the child process share all page
frames. In addition, it marks the pages that were previously writable as
copy-on-write (COW). If the parent or the child process attempts
to write to a copy-on-write page, a protection violation fault occurs. When this
happens, instead of sending a segmentation violation signal, the kernel first
makes a private copy of the virtual page and then turns the write permission bit
for that page back on. At this point, execution can return to the faulting
process. Because the page is now writable, the faulting instruction can finish
execution without causing a fault again. The copy-on-write scheme is
particularly effective when a program does a fork() that is quickly
followed by an execve(). In such a case, the scheme is able to avoid
almost all page copying, save for a few pages in the stack or data segment [9,
64].
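The copy-on-write mechanism described above can be sketched as follows. Dictionaries stand in for page-table entries and shared page frames, and all names are invented for illustration:

```python
import copy

def fork_address_space(parent):
    """Share all pages with the child; mark writable ones copy-on-write."""
    child = {}
    for vpn, page in parent.items():
        if page["writable"]:
            page["writable"] = False   # disable write access for both processes
            page["cow"] = True
        child[vpn] = page              # same object: the frame is shared
    return child

def write_fault(space, vpn):
    """On a write fault, give the faulting process a private, writable copy."""
    page = space[vpn]
    if page.get("cow"):
        private = copy.deepcopy(page)  # copy the page content
        private["writable"] = True
        private["cow"] = False
        space[vpn] = private           # only this address space is changed

parent = {0: {"writable": True, "data": "hello"}}
child = fork_address_space(parent)
write_fault(child, 0)
print(child[0] is parent[0])   # False: the child now has its own copy
```

Until the first write fault, parent and child share every frame, which is what makes the fork()/execve() sequence so cheap.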

Note that the page sharing described here happens automatically and without
the explicit knowledge of the process. There are times when two or more
processes need to cooperate and want to explicitly share some virtual memory
pages. Linux supports this through the mmap() system call and through
System V shared memory segments [9]. Because the processes are cooperating, it
is their responsibility to map the shared memory segment with suitable
permission bits to ensure that the processes can access the memory only in the
intended fashion.