IRIX Binary Compatibility, Part 5Reverse Engineering Threading

Welcome back to our series on IRIX binary compatibility. In this part, we
will study IRIX and NetBSD threading models. We will also examine how it is
possible to emulate IRIX native threads on NetBSD, though NetBSD does not
support a similar feature for its native binaries.

Introduction to Multithreading

Traditional UNIX systems implement multitasking by using multiple
processes. A process is the unit of execution. It has an execution context and
a virtual memory space, both private and specific to the process.

This scheme has some drawbacks. Consider the need for efficient
interprocess communication. Traditionally, the programmer needs to arrange for
an area of memory to be shared between processes. This makes the program more
complex. Another problem is that switching virtual memory spaces on process
switches can be expensive.

The solution to these problems is multithreading. In UNIX terminology,
multithreading is the ability to run multiple execution contexts within the
same virtual address space. Each execution context is known as a thread. The
set of threads sharing the same address space is usually known as the process.
Since memory is shared among the different threads of a process, communication
between threads is transparent and efficient.

There are two popular ways of implementing threads on a UNIX system: kernel
assisted threads and user threads.

With user threads, the kernel is not aware that there are multiple
execution contexts in the process. Everything is done in userland, by timely
switching between the threads. The advantage of this method is that there is no
need for kernel support, which means that it is easy to implement user threads
on any UNIX system. Additionally, there is an existing standard for the
programming interface: POSIX.1c. Because of these two advantages, user threads
are often referred as portable threads or POSIX threads.

In This Series

IRIX Binary Compatibility, Part 6
With IRIX threads emulated, it's time to emulate share groups, a building block of parallel processing. Emmanuel Dreyfus digs deep into his bag of reverse engineering tricks to demonstrate how headers, documentation, a debugger, and a lot of luck are helping NetBSD build a binary compatibility layer for IRIX.

IRIX Binary Compatibility, Part 2
Emmanual Dreyfus shows us how he implemented the things necessary to start an IRIX binary. These things include the program's arguments, environment, and for dynamic binaries, the ELF auxiliary table, which is used by the dynamic linker to learn how to link the program.

IRIX Binary Compatibility, Part 1
This article details the IRIX binary compatibility
implementation for the NetBSD operating system. It covers creating a new emulation subsystem inside the NetBSD kernel as well as some reverse engineering to understand and reproduce how IRIX internals work.

The big disadvantage of user threads is that because the kernel has no
knowledge of the existence of the threads, when one thread makes a blocking
system call, the process, and therefore all of the threads in that process, are
blocked. Also, because the kernel only sees one process, user threads cannot
take advantage of multiple processors.

Kernel assisted threads are designed to address these two drawbacks. With
this scheme, the kernel is responsible for scheduling the different threads
within a process. Because it is aware of their existence, the kernel can
schedule other threads of a process when a thread makes a blocking system call.
The kernel is also able to balance multiple threads of a single process among
multiple processors. One big disadvantage of kernel assisted threads is that it
requires kernel support, which means that some UNIX kernels will not support
them. Kernel assisted threads are not easily portable. This is why kernel
assisted threads are also known as native threads.

Another big drawback is that there is no standardized programming interface
for handling native threads, and each OS has its own set of system calls syntax
and semantics. For instance, Linux creates native threads using Linux uses clone(2)
and IRIX uses sproc(2).

One side note about kernel processes. Kernel processes are UNIX processes
running entirely in kernel mode and never executing in user mode. Examples of
kernel processes are the pagedaemon(8) on NetBSD, whose job is to
swap out pages that have been unused for some time. Because it only manipulates
kernel structures, pagedaemon does not need to run in user mode, therefore it is named as a kernel process. The problem is that kernel processes
are sometimes referred as kernel threads because they are different execution
contexts all sharing the kernel virtual memory space. This creates confusion
with native threads, which do exactly the same thing though they share a user
virtual memory space. In order to avoid confusion, we will try to speak about
kernel processes and kernel assisted threads (or native threads) using these
names in this document.

Thread Models in IRIX and NetBSD

IRIX supports portable threads, of course, but because this is all
implemented in userland, it is transparent for our IRIX emulation on NetBSD.
We do not have to set up anything to support that.

Native threads are implemented through sproc(2) system calls
and variants: nsproc(2), sprocsp(2),
and so on. Variants differ from the original system call by offering options
such as the possibility to choose where the new thread's stack will be
allocated. IRIX has special terminology for native threads created by
sproc(2): they constitute a share group.

Here is sproc(2)'s prototype:

pid_t sproc (void (*entry)(void *), unsigned inh, void *arg);

The first argument is the thread entry point. Once the kernel has created
the new thread, it returns to userland and resumes execution at the address
pointed to by the entry argument.

inh stands for inherit. This integer value holds a few flags
which specify whether various resources should be shared or not between the
thread and its parent. Sharable resources include: address space, UID/GID,
process limits, file descriptors, umask value, and working directory.

Finally, arg is a pointer to the thread entry function's
arguments. There is no argument counter, as the parent and the child are
supposed to know each other well enough to be aware of what is pointed to by
arg.

Now, what about NetBSD? Of course NetBSD supports user threads, although
there is no built-in user thread library. If you need user threads, you have to
install a third-party thread library. The NetBSD package collection includes
multiple user thread libraries: devel/unproven-pthreads,
devel/pth, and devel/mit-pthreads. Of course all the
packages work. There are three implementations available because none is
perfect, so we leave the choice to the user.

With native threads, things become more difficult since NetBSD does not
support native threads yet. Given this situation, emulating IRIX native threads
may seem an impossible task. Fortunately, it is not, and we will explain why a
bit later. For now, let us explain why NetBSD is so poor regarding threading.

We explained before that both user threads and native threads have their
drawbacks. In an ideal world, we would want both a standardized API, and a
thread model where threads can take advantage of multiple processors and are
not blocked because of a blocking system call issued by another thread.

NetBSD is now close to be that ideal world thanks to Nathan J. Williams, who is working
on an hybrid thread model that maps user threads to kernel threads. This
concept is called Scheduler Activations and was first introduced by Sun
Microsystems in Solaris.

There is no thread capability built into NetBSD, whether user or native,
just because we want the ideal world for NetBSD and nothing else. But before
we enter the perfect world, we want to emulate IRIX native threads.

In fact, NetBSD does support native threads. NetBSD just lacks the API to
spawn native threads from userland. From the kernel, and especially from a
foreign OS emulation subsystem, there is no problem: you can create a native
thread. This has been used for Linux emulation first since a lot of Linux
applications require native threads through the clone(2) system
call. In order to emulate Linux's clone(2), NetBSD needed some
support for native threads. We end up with this strange situation where the
NetBSD kernel is able to emulate features that it does not support natively.

In the NetBSD kernel sources, native threads are supported through the
fork1(9) kernel function, found in sys/kern/kern_fork.c.
fork1(9) has some options, including the possibility to share the
virtual memory space with the parent.

It is worth mentioning that the NetBSD fork(2) system call
implementation is just a wrapper to fork1(9) with appropriate
arguments:

In this file, there is also a sys___clone() function, which is
a Linux clone(2) compatible system call. It is not exported in the
kernel API, but if you ever want to natively build a Linux application that
uses clone(2) on NetBSD, consider adding
sys___clone() to NetBSD native's syscall.master and
rebuild a kernel (note that you also need to patch the libc to make the system
call available).

Having described threads and the NetBSD implementation, let us move on to
something more serious: the actual implementation.

Everything is done in sys/compat/irix/irix_prctl.c.
Here we have the various sproc(2) variant implementations:
irix_sys_pidsprocsp(), irix_sys_sprocsp() and
irix_sys_sproc(), respectively implementing
pidsprocsp(2), sprocsp(2), and sproc(2).
All these functions call irix_sproc() with appropriate arguments.
irix_sproc() does the actual work, which is mostly to call
fork1(9).

Lists, Locks and Reference Counts

But things are not that simple. For various reasons, we need to be able to
walk through the threads list in a share group. Therefore, we need a chained
list of processes in the share group.

There are handy macros in the NetBSD headers
<sys/queue.h> that are really helpful here. Documentation
can be found in the queue(3) manual page. There are macros to
declare the list, to walk through it, to insert elements in it, and so on. All
the pointer arithmetic is done in the macro, and the programmer can focus on the
actual use of the list. This is wonderful.

Creating the list is easy. To store the pointers to next/previous elements
in the list, we can use a field of the emulation specific part of struc proc
(it is pointed to by the p_emuldata field, and we created a struct
irix_emuldata to describe it. Check the previous article about signal emulation if you need details about this). The only problem is
deciding where to store the head of the list. Storing it in the
irix_emuldata structure is not a good idea, since we have no
guarantee at all that the processes holding the head of the queue will not
terminate before the others.

The solution is to allocate a structure, which describes the share group. It
is defined in sys/compat/irix/irix_prctl.h and is called struct
irix_share_group.

Here we have the list head, defined using the LIST_HEAD macro,
one lock on the list and a reference counter.

The List

The list declaration will expand in the declaration of a structure. The
first argument to LIST_HEAD is the structure name. This name does
not really matter; we use the name of the field in struct
irix_share_group, as this is more convenient. The second argument
does matter; it is the type of the list elements. Here we specify that our list
is made of struct irix_emuldata elements. (Note that 'struct' is
omitted. The macro will add it.)

In the struct irix_emuldata, we have one field that holds the
list pointers. The declaration is done with another macro:

LIST_ENTRY(irix_emuldata) ied_sglist;

The Lock

The lock is used because several processes could try to manipulate the list
at once. This can of course happen on a multiprocessor system, where the two
processors may be in kernel mode at the same time, trying to manipulate the
same kernel structure. It can also happen on a uniprocessor system, too.
Imagine we have to walk the share group list to apply some collective
operation. For some reason this operation may cause the process to sleep,
awaiting some particular event. During the sleep, other processes will be
scheduled, and they may try to modify the list too. When our first process
wakes up, it may be working on a list element that no longer exists, which is
quite likely to cause a panic.

The solution to this is to use a lock. Locks are manipulated through the
lockmgr(9) function. Simplest uses are to get a shared or
exclusive lock and to release a lock. Here is an example of
lockmgr(9) use:

We use shared locks when we want to read the list. Several processes may be
doing this at once. They will all ask for a shared lock and get it.

Exclusive locks are requested when we want a process to modify the list.
Upon exclusive lock request, lockmgr(9) will sleep until all
shared locks are released, which guarantees that no reading process is still
reading. Once exclusive lock is acquired by one process, any other process
calling lockmgr(9) for a shared or exlcusive lock will go to
sleep, thus ensuring that no process can read while our process is writing.

The Reference Count

The reference count enables us to track the irix_share_group
structure usage. When a process is added to the list, we increase the count.
When a process is released, we decrease the count. When the count hits zero, we
can free the structure, since nobody is using it anymore.

One of the first jobs of irix_sproc() is to check if the
irix_share_group structure has already been allocated. If this is
not the case, the structure is allocated and the calling process is added to
the list. The child process will be added later, as we cannot add it before it
is actually created.

Parent and Child Life in the Kernel

fork1() has an argument named func which is used
to specify the child entry point. This can be a NULL pointer if we want to
return immediately to userland, or it can be the address of a kernel function
used to perform some custom tasks before returning to userland. Since we need
to arrange the child environment and to add it to the share group list, we use
this feature. The child entry function is called
irix_sproc_child() and is defined in
sys/compat/irix/irix_prctl.c.

It is worth mentioning that the child entry function must be a kernel
function. Passing a userland address here will cause a panic. To return to a
custom userland address after fork1(9), the solution is to specify
a kernel child entry function and to modify the child registers saved onto the
stack in that function so that, on return to userland, the process will execute
our userland function.

While executing irix_sproc_child(), we need some information
from the parent. (The kernel stores all fundamental information about a
process in the proc structure, so we read that.) This can lead to some
problems if we just let the parent exit irix_sproc() while
irix_sproc_child() is executing. If the parent returns to
userland and catches a deadly signal, our child can be left manipulating a
stale proc structure. This is very likely to cause a panic.

This situation can arise easily in a multiprocessor system where the second
CPU will execute the child while the first CPU executes the parent. But it can
happen on a uniprocessor system as well. If the child sleeps awaiting some
resource, the kernel may schedule the parent to run and resume the child later.

The solution is to force the parent to sleep on a flag and make the child
awaken the parent when it will no longer be fatal for the child to be orphaned.
The address of the flag is passed to the child using the arg
argument to fork1(9). This argument is a pointer that gets passed
to the child entry function. For irix_sproc_child(), we defined it
as a pointer to the irix_sproc_child_args structure (defined in sys/compat/irix/irix_prctl.c):

The structure is allocated in irix_sproc() and is used to
transmit a lot of information between the parent and the child: the child proc
structure pointer (isc_proc), the entry,
arg, len, and inh arguments that were
given to the sproc(2) system call (isc_entry,
isc_arg, isc_len, and isc_inh), the
parent proc structure address (isc_parent), the share group
structure address (isc_share_group), and the flag on which the
parent is sleeping, awaiting for the child to finish building itself
(isc_child_done).

Now that you have the whole picture, let us move to something even more
interesting. irix_sproc_child() needs to set up the stack and
registers just like IRIX does. Of course this is not documented anywhere, and
this is why things are getting more interesting. Time has come to reverse
engineer IRIX's sproc(2) implementation. But this will come in the
next part, so stay tuned.

Emmanuel Dreyfus
is a system and network administrator in
Paris, France, and is currently a developer for NetBSD.