The IRIX Process Data Area (PRDA)

The first bug appeared when trying to run the IRIX version of
Photoshop. It manifested itself as an unexpected SIGSEGV,
which usually deeply depresses me. Tracing which bug in the emulation
subsystem caused a segmentation fault can be extremely difficult, since
the emulation inconsistency can be quite far away from the segmentation
fault. It is not easy, but at least we can try.

Here is the kernel trace before the segmentation fault, obtained with
ktrace:

The instruction that caused the exception is lw
$t7,3584($t6). It was supposed to load the word at the address
T6+3584 (3584 is 0xe00 in hexadecimal) into T7. Given that T6 is
0x200000, we end up accessing 0x200e00, where no memory is mapped,
hence the SIGSEGV.

Reading the code, it is interesting to note that T6 has just been
initialized by a constant value: lui stands for load upper
immediate, hence lui $t6,0x20 just placed 0x20 in the upper 16 bits
of T6, filling it with 0x200000. This attempt to read data at
0x200e00 is not a consequence of some data handed out by the
emulation subsystem.
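
The address arithmetic can be checked with a few lines of C. This is only a sketch of what the two instructions compute, not emulator code; the helper names mips_lui(), mips_lw_address() and faulting_address() are made up for the illustration.

```c
#include <assert.h>
#include <stdint.h>

/* lui rt, imm: put the 16-bit immediate in the upper half of the register. */
static uint32_t
mips_lui(uint16_t imm)
{
	return (uint32_t)imm << 16;
}

/* lw rt, offset(base): the address read is base plus the signed 16-bit offset. */
static uint32_t
mips_lw_address(uint32_t base, int16_t offset)
{
	return base + (uint32_t)(int32_t)offset;
}

/* Address read by the sequence: lui $t6,0x20 ; lw $t7,3584($t6). */
static uint32_t
faulting_address(void)
{
	uint32_t t6 = mips_lui(0x20);

	assert(t6 == 0x200000);
	return mips_lw_address(t6, 3584);
}
```

Calling faulting_address() gives 0x200e00, the address reported with the SIGSEGV.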

The idea that a page of memory should magically be mapped at that address
seems a bit odd, but it is worth writing a program to check it. Here is code
to do the job:
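
A minimal sketch of such a check follows. It is not the program that was run on IRIX. It assumes a POSIX system where write(2) on a pipe fails with EFAULT when the source buffer is not mapped, which lets us probe an address without taking a SIGSEGV. The function name is_readable() is invented for this example.

```c
#include <errno.h>
#include <unistd.h>

/*
 * Return 1 if one byte at addr can be read, 0 if the kernel reports
 * EFAULT (nothing readable is mapped there), and -1 on any other error.
 * The kernel does the read on our behalf, so a bad address produces an
 * error code instead of a signal.
 */
static int
is_readable(const void *addr)
{
	int fds[2];
	char sink;
	int result;

	if (pipe(fds) == -1)
		return -1;

	if (write(fds[1], addr, 1) == 1) {
		/* Drain the byte we just queued in the pipe. */
		result = (read(fds[0], &sink, 1) == 1) ? 1 : -1;
	} else {
		result = (errno == EFAULT) ? 0 : -1;
	}

	close(fds[0]);
	close(fds[1]);
	return result;
}
```

On IRIX, the question asked above would be is_readable((const void *)0x200000). On most other systems that address is not expected to be mapped.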

The next step is to dump this area and to have the NetBSD kernel
prepare it just like the IRIX kernel does. This could have been a
difficult job, but fortunately the "magic page" is documented. I have to
thank Chuck Silvers for pointing this out to me. We had an email exchange
about how to handle some issues related to the "magic page", and Chuck
called it the PRDA. I asked him why he used this name, and Chuck told me
this was in the sproc(2)
man page.

The sproc(2) man page explains that when you create a
thread sharing the virtual memory space with the parent, everything is
shared except the Process Data Area (PRDA). A reference to
<sys/prctl.h> is also given to get more information about
the PRDA. In this header file, we can find the actual definition of the
structures contained in the PRDA, making it much easier to emulate.

The real problem now is to implement this feature correctly on NetBSD.
Mapping and filling the PRDA at process creation time is easy, but having
our sproc(2) emulation share the whole virtual memory space
except for the PRDA is more difficult. We will cover this in the next
section.

Private Mappings in Share Groups

We have to handle shared virtual memory spaces that contain one private
page. In fact, we have to handle potentially multiple private pages,
since the PRDA is not the only situation where this property is needed. In
the IRIX mmap(2) man page, we can see that there is a
MAP_LOCAL option used to make a private mapping within the
share group shared virtual memory space.

Sharing the whole virtual memory space is simple: there is a field in
the proc structure called p_vmspace (this is defined in
<sys/proc.h>). This field is a pointer to a struct
vmspace (defined in <uvm/uvm_extern.h>) which describes a process
virtual address space. When we want to share the whole virtual address
space, we share the same struct vmspace among different
processes.

The struct vmspace contains a substructure called
vm_map (defined in
<uvm/uvm_map.h>) which in turn contains the list of map entries,
describing the various mappings in the process virtual address space. To
share only some pages, we must have different vmspace structures with
different lists of mappings. The map entries linked in the lists will be
the same for all the processes in the share group when they describe the
shared regions. For private regions, each process will have its own map
entries.
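
The layout can be pictured with a small user-space model. This is only a sketch under simplifying assumptions: the types backing_model, entry_model and space_model are invented stand-ins and do not match the real UVM structures. Each process owns its own entry list; entries for a shared region refer to the same backing object, while a private region such as the PRDA gets a distinct one in each process.

```c
#include <stddef.h>

/* Hypothetical stand-in for the memory object behind a mapping. */
struct backing_model {
	const char *name;
};

/* Hypothetical stand-in for a map entry covering [start, end). */
struct entry_model {
	unsigned long start;
	unsigned long end;
	struct backing_model *backing;
	struct entry_model *next;
};

/* Hypothetical stand-in for struct vmspace: one entry list per process. */
struct space_model {
	struct entry_model *entries;
};

/* Find the entry covering addr in one address space, or NULL. */
static const struct entry_model *
lookup_entry(const struct space_model *space, unsigned long addr)
{
	const struct entry_model *e;

	for (e = space->entries; e != NULL; e = e->next)
		if (addr >= e->start && addr < e->end)
			return e;
	return NULL;
}

/* Return 1 if both spaces map addr onto the same backing object. */
static int
same_backing(const struct space_model *a, const struct space_model *b,
    unsigned long addr)
{
	const struct entry_model *ea = lookup_entry(a, addr);
	const struct entry_model *eb = lookup_entry(b, addr);

	return ea != NULL && eb != NULL && ea->backing == eb->backing;
}
```

In this model, same_backing() returns 1 for an address in a shared region and 0 for an address inside the PRDA, since each member of the share group has its own PRDA.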

The real problem, since we have different lists for each process, is
keeping the lists synchronized. If a process modifies the mapping in a
shared region, the modification must be visible to all other
processes. What are the ways of modifying the virtual address space
mappings?

Through system calls, virtual memory mappings are affected by
mmap(2), munmap(2), break(2),
shmsys(2), mprotect(2), plock(2),
mpin(2), munpin(2), memcntl(2),
ptrace(2), and syssgi(2) commands such as
SGI_ELFMAP (see part 3 of this series to learn
about SGI_ELFMAP). For each of these system calls, if the
operation is successful, the map entry list must be kept in sync.

Additionally, memory mappings can be affected by page fault handling.
We have to handle these as well and keep the map entries in sync
within the share group each time a member takes a page fault. This is a
first for the NetBSD emulation subsystem, which has only been used to
emulate system calls and signal handling so far.

We therefore had to add per-emulation page fault handling to the
emulation subsystem. The goal is exactly the same as system call
emulation: emulation independent code has some hooks to handle emulation
specific behavior. The hook is usually done by using a pointer from
struct emul, which points to some emulation specific function
or data for each emulation. As an example, here is how the MIPS system
call handler handles error codes (sys/arch/mips/mips/syscall.c:syscall_plain()):

if (p->p_emul->e_errno)
error = p->p_emul->e_errno[error];

p is a pointer to the current process, and
error the error code that the kernel wants to return to
userland. e_errno is a field in struct emul
which points to an emulation dependent array, defining the translation
between native NetBSD error codes and emulated error codes.
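
The same pattern can be reproduced outside the kernel. The sketch below is a user-space model, not kernel code: struct emul_model is a cut-down stand-in for struct emul, and the error numbers in model_irix_errno are arbitrary values chosen for the example, not the real NetBSD or IRIX codes. Like the kernel snippet, it assumes the error code is a valid index into the table.

```c
#include <stddef.h>

/* Cut-down, hypothetical stand-in for struct emul. */
struct emul_model {
	const char *e_name;
	const int *e_errno;	/* NULL means: no translation needed */
};

/*
 * Illustrative table indexed by native error code. The values are made
 * up for this example: native codes 0..3 map to 0, 1, 2 and 11.
 */
static const int model_irix_errno[] = { 0, 1, 2, 11 };

/* Translate a native error code for the given emulation. */
static int
translate_errno(const struct emul_model *emul, int error)
{
	if (emul->e_errno != NULL)
		error = emul->e_errno[error];
	return error;
}
```

An emulation with a NULL e_errno gets the native code back unchanged, while one that points e_errno at model_irix_errno sees native code 3 turned into 11.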

Let us examine trap handling now. We introduced an e_fault
field to struct emul which points to a function responsible
for trap handling. Native NetBSD processes will want to use
uvm_fault() and IRIX will want to use
irix_vm_fault(), which is implemented in sys/compat/irix/irix_prctl.c.
In the MIPS trap handler (sys/arch/mips/mips/trap.c),
we have:

This is because we need the struct proc pointer in
irix_vm_fault(), and changing uvm_fault()
prototype would have caused too many invasive changes in NetBSD's virtual
memory subsystem. Because there is no strong requirement to have the same
prototype, we used a slightly different one. The struct
vm_map pointer can be easily derived from struct proc
(it is just &p->p_vmspace->vm_map, since vm_map is embedded in
struct vmspace), so the struct proc can just replace the vm_map
argument in the irix_vm_fault() prototype.
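
The dispatch can be modelled in user space as follows. This is a sketch, not the code from trap.c: every type and function name ending in _model or starting with fault_ is invented, and it assumes that a NULL e_fault pointer means the native handler should be used. The IRIX-style handler here only derives the map from the process, forwards to the native handler, and counts the call.

```c
#include <stddef.h>

struct fault_map {			/* stand-in for struct vm_map */
	int faults;
};

struct fault_vmspace {			/* stand-in for struct vmspace */
	struct fault_map vm_map;	/* embedded, as in the kernel */
};

struct fault_proc;

struct fault_emul {			/* stand-in for struct emul */
	int (*e_fault)(struct fault_proc *, unsigned long, int);
};

struct fault_proc {			/* stand-in for struct proc */
	const struct fault_emul *p_emul;
	struct fault_vmspace *p_vmspace;
	int emul_calls;
};

/* Plays the role of the native handler: it takes the map. */
static int
native_fault_model(struct fault_map *map, unsigned long va, int ftype)
{
	(void)va;
	(void)ftype;
	map->faults++;
	return 0;
}

/* Plays the role of the emulation handler: it takes the process. */
static int
irix_fault_model(struct fault_proc *p, unsigned long va, int ftype)
{
	p->emul_calls++;
	return native_fault_model(&p->p_vmspace->vm_map, va, ftype);
}

/* What the trap handler would do on a page fault in this model. */
static int
handle_fault(struct fault_proc *p, unsigned long va, int ftype)
{
	if (p->p_emul->e_fault != NULL)
		return (*p->p_emul->e_fault)(p, va, ftype);
	return native_fault_model(&p->p_vmspace->vm_map, va, ftype);
}
```

The two handlers have different first arguments, which is why the hook cannot simply point at the native function.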

irix_vm_fault() just calls uvm_fault() and then
synchronizes the virtual address space mappings across the share group.
The next point is how the actual sync is done. Whether requested from a
memory related system call or from irix_vm_fault(), the sync
is implemented by the irix_vm_sync() function (implemented in
sys/compat/irix/irix_prctl.c).

irix_vm_sync() takes the struct proc pointer
of the modified process as an argument. For each process in the share
group, it unmaps shared regions and remaps them as in the modified
process. The unmapping is done by uvm_unmap(9), and the copy of the modified process
mapping is done by uvm_map_extract(9).
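
The sync loop can be sketched with plain arrays. This is a user-space model built on assumptions, not the kernel implementation: the sync_entry and sync_proc types are invented, drop_shared() merely stands in for the unmapping step and copy_shared() for the extraction step, and each entry carries its own private flag for simplicity.

```c
#include <stddef.h>

#define SYNC_MAX_ENTRIES 8

struct sync_entry {
	unsigned long start;
	unsigned long end;
	int is_private;
};

struct sync_proc {
	struct sync_entry entries[SYNC_MAX_ENTRIES];
	size_t count;
};

/* Remove every shared entry from p, keeping only the private ones. */
static void
drop_shared(struct sync_proc *p)
{
	size_t i, kept = 0;

	for (i = 0; i < p->count; i++)
		if (p->entries[i].is_private)
			p->entries[kept++] = p->entries[i];
	p->count = kept;
}

/* Append the shared entries of src to dst, as long as there is room. */
static void
copy_shared(struct sync_proc *dst, const struct sync_proc *src)
{
	size_t i;

	for (i = 0; i < src->count && dst->count < SYNC_MAX_ENTRIES; i++)
		if (!src->entries[i].is_private)
			dst->entries[dst->count++] = src->entries[i];
}

/* Bring every other member of the group in line with the modified one. */
static void
share_group_sync(struct sync_proc *group[], size_t n,
    const struct sync_proc *modified)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (group[i] == modified)
			continue;
		drop_shared(group[i]);
		copy_shared(group[i], modified);
	}
}
```

After share_group_sync(), each other member keeps its own private entries and sees exactly the shared entries of the modified process.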

We have to keep track of which region is shared and which region is
private. This information could have been added to the VM map entries,
but adding emulation specific data there is not good practice. We
therefore use a linked list whose head is in struct
emuldata:

This list is modified when we create the PRDA and when the
MAP_LOCAL option is requested in
irix_sys_mmap(). When we modify the list, we have to compute
region intersections to avoid having information about the same region
twice. This is done through irix_isrr_insert() in
sys/compat/irix/irix_prctl.c.
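
To give an idea of the intersection work involved, here is a sketch of inserting a region into a sorted list of non-overlapping regions. It is not the irix_isrr_insert() code: it uses a fixed-size array instead of a linked list, and the names region_model, region_list, region_push() and region_insert() are invented for the example. Existing regions that overlap the new one are trimmed or split so that no address is described twice.

```c
#include <stddef.h>

#define REGION_MAX 16

struct region_model {
	unsigned long start;	/* inclusive */
	unsigned long end;	/* exclusive */
	int is_shared;
};

struct region_list {
	struct region_model r[REGION_MAX];
	size_t count;
};

/* Append [start, end) to the list; empty ranges are silently skipped. */
static int
region_push(struct region_list *l, unsigned long start, unsigned long end,
    int is_shared)
{
	if (start >= end)
		return 0;
	if (l->count == REGION_MAX)
		return -1;
	l->r[l->count].start = start;
	l->r[l->count].end = end;
	l->r[l->count].is_shared = is_shared;
	l->count++;
	return 0;
}

/*
 * Insert [start, end) into a sorted, non-overlapping list. Returns 0 on
 * success, -1 if the result does not fit (the list is then unchanged).
 */
static int
region_insert(struct region_list *list, unsigned long start,
    unsigned long end, int is_shared)
{
	struct region_list out;
	int inserted = 0;
	size_t i;

	out.count = 0;
	for (i = 0; i < list->count; i++) {
		const struct region_model *cur = &list->r[i];

		if (cur->end <= start) {
			/* Entirely before the new region: keep as is. */
			if (region_push(&out, cur->start, cur->end,
			    cur->is_shared) != 0)
				return -1;
			continue;
		}
		/* Keep the part of cur that lies before the new region. */
		if (cur->start < start &&
		    region_push(&out, cur->start, start, cur->is_shared) != 0)
			return -1;
		if (!inserted) {
			if (region_push(&out, start, end, is_shared) != 0)
				return -1;
			inserted = 1;
		}
		if (cur->start >= end) {
			/* Entirely after the new region: keep as is. */
			if (region_push(&out, cur->start, cur->end,
			    cur->is_shared) != 0)
				return -1;
		} else if (region_push(&out, end, cur->end,
		    cur->is_shared) != 0) {
			/* The part after the new region did not fit. */
			return -1;
		}
	}
	if (!inserted && region_push(&out, start, end, is_shared) != 0)
		return -1;
	*list = out;
	return 0;
}
```

Starting from a single shared region and inserting a private page in the middle of it yields three regions: shared, private, shared. This sketch does not merge adjacent regions that carry the same flag.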

There is a debug function in the same file,
irix_isrr_debug(), which is useful for checking what happens
to the list. If the kernel is built with the DEBUG_IRIX
option, this function is called each time the list is modified. Here is
some output from the debug function, which helps to explain what is going
on:

With information from this list, irix_vm_sync() is able to
perform the sync, sharing mappings only for shared regions. The code is
getting quite complicated, but we are getting close to genuine IRIX
behavior. irix_vm_sync() now has to compute the intersection
of shared and private regions across the share group, and synchronize
the memory mappings accordingly.

We end up with quite a horrible piece of code. On a plain page fault, we
have to walk through multiple linked lists, which is a pain on the
performance front. But we have no choice: we want to emulate an odd
feature, so we get odd code.

In the next part, if I am courageous enough to write about it, we will
look at the emulation of an IRIX pseudo-device driver that is used to
implement pollable semaphores. This includes reverse engineering the
driver entry points, since nearly no documentation is available about it,
and, of course, implementing the driver in NetBSD.

Emmanuel Dreyfus
is a system and network administrator in
Paris, France, and is currently a developer for NetBSD.