IRIX Binary Compatibility, Part 1

Author's Note: This article details the IRIX binary compatibility implementation for the NetBSD operating system. This includes the
creation of a new emulation subsystem inside the NetBSD kernel and
a lot of reverse engineering to understand and reproduce how IRIX
internals work.

Because this article introduces every kernel subsystem involved in IRIX binary compatibility, we assume only that the reader has some experience in user-land Unix programming.

An Introduction to Binary Compatibility

Kernel and User-Mode Overview

Unix systems have two distinct modes of operation, known as user mode and
kernel (or system) mode. In user mode, the operating system (OS) executes
code provided by users. It could be a Web browser, a computer science
student's project, a Web server (in this case, the user running the
program is usually the system administrator), and so on. This code is
run with limited privileges. It has limited access to the computer's
memory, and usually no access at all to the hardware.

When running in kernel mode, the OS is only executing trusted code,
which was loaded at boot time. This code is known as the OS kernel. The
kernel has full access to the memory and hardware. It is here to provide
services to user programs:

It gives user programs access to the hardware. It provides an abstraction layer, presenting files and terminals to user programs where in fact only zeros and ones exist on hard disk and display I/O controllers.

It periodically switches execution between several user programs
(which are called processes), maintaining the illusion of multitasking.

It ensures that a user accesses only the resources that the user's privileges allow.

User processes call kernel code by issuing a trap. A trap is a hardware
or software exception that suspends user process execution and gives
control to kernel code. The kernel handles the exception, after which it
may return to user mode and resume the user process, or it may destroy
the user process. Examples of traps are division by zero, memory faults
(accessing a virtual address where no physical memory is mapped), timer
interrupts (which are used to switch between user processes), and
requests by the user process to access some resource controlled by the
kernel.

These requests can be opening a file, reading from a network
connection, or creating a new process. The process does this by issuing
a system call, like open(2), read(2), or fork(2). The system call is in
fact a CPU instruction that causes a trap.

Here is an example of MIPS assembly to call the fork(2) system call on
NetBSD:

li $v0,2 # 2 is the system call number for fork()
# v0 is the register holding the system call number
syscall # syscall is the CPU instruction to do a system call
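
The same idea can be sketched from C. The example below assumes a system
that provides the syscall(2) library function and the SYS_getpid constant
in <sys/syscall.h>, as NetBSD and Linux do. It uses getpid(2) rather than
fork(2) only because getpid(2) has no side effects. The helper names
raw_getpid() and check_raw_getpid() are made up for this illustration.

```c
#define _DEFAULT_SOURCE      /* glibc: declare syscall() in strict -std modes */
#include <assert.h>
#include <sys/syscall.h>     /* SYS_getpid */
#include <unistd.h>          /* syscall(), getpid() */

/* Enter the kernel by system call number, without the getpid() wrapper. */
long raw_getpid(void)
{
    return (long)syscall(SYS_getpid);
}

/* The raw call and the libc wrapper reach the same kernel function,
 * so both must report the same process ID. */
void check_raw_getpid(void)
{
    assert(raw_getpid() == (long)getpid());
}
```

Calling check_raw_getpid() from a test program should pass silently. The
library wrapper is only a convenient way to load the system call number
and execute the trapping instruction.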

When the syscall instruction executes, the kernel runs a particular trap
handler, which is known as the system call handler. For NetBSD/mips, it
can be found in sys/arch/mips/mips/syscall.c:syscall_plain(). The system
call handler expects an argument, which is the system call number. The
system call handler uses a table, called the system call table, to look
up the kernel function that will be called to complete the system call.
On NetBSD, the system call table for native processes is generated from
sys/kern/syscalls.master.
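
The table lookup can be modeled in a few lines of user-space C. This is
only a sketch of the technique, not the NetBSD code. Every name in it
(toy_sysent, toy_table, toy_syscall() and the three toy functions) is
hypothetical, and real system calls take more than one argument.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical signature shared by every entry in the toy table. */
typedef long (*toy_call_t)(long arg);

struct toy_sysent {
    const char *name;   /* for humans only */
    toy_call_t  call;   /* function that implements the call */
};

static long toy_nosys(long arg)  { (void)arg; return -ENOSYS; }
static long toy_double(long arg) { return 2 * arg; }
static long toy_negate(long arg) { return -arg; }

/* The index in this array is the system call number. */
static const struct toy_sysent toy_table[] = {
    { "nosys",  toy_nosys  },   /* 0: unimplemented slot */
    { "double", toy_double },   /* 1 */
    { "negate", toy_negate },   /* 2 */
};

#define TOY_NSYSENT (sizeof(toy_table) / sizeof(toy_table[0]))

/* Toy system call handler: validate the number, then dispatch. */
long toy_syscall(size_t code, long arg)
{
    if (code >= TOY_NSYSENT)
        return -ENOSYS;             /* number outside the table */
    assert(toy_table[code].call != NULL);
    return toy_table[code].call(arg);
}
```

With this table, toy_syscall(1, 21) dispatches to toy_double(), and any
number past the end of the table fails with -ENOSYS instead of indexing
out of bounds.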

System calls are the way a user process requests action from the
kernel, but there is also a mechanism used by the kernel to notify the
user process of unusual conditions: signals. Signals are issued by
various traps and system calls, to notify the process that it raised an
exception: memory fault (the famous segmentation fault, well known to
students learning C), division by zero and so on.

For each signal, the user process can choose to take the default action
(by default, some signals abort the program, while others are simply
ignored), to ignore the signal, or to execute a function called a signal
handler. This choice is made using the signal(3) library call or the
sigaction(2) system call.
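
As a minimal sketch of that choice, the following C code uses
sigaction(2) to install a handler for SIGUSR1 and then to ignore the same
signal. The helper names (on_signal(), deliver_usr1(), catch_usr1(),
ignore_usr1(), check_dispositions()) are invented for this example. The
default action is not exercised, because for SIGUSR1 it would terminate
the process.

```c
#define _POSIX_C_SOURCE 200809L  /* declare sigaction() in strict -std modes */
#include <assert.h>
#include <signal.h>
#include <string.h>

/* Last signal number seen by the handler, 0 if none. */
static volatile sig_atomic_t last_signal = 0;

/* Signal handler: only records which signal arrived. */
static void on_signal(int signo)
{
    last_signal = signo;
}

/* Set the disposition of SIGUSR1, send it to ourselves, and report
 * the signal number recorded by the handler (0 if it never ran). */
static int deliver_usr1(void (*disposition)(int))
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = disposition;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, NULL) == -1)
        return -1;

    last_signal = 0;
    raise(SIGUSR1);     /* a caught signal is handled before raise() returns */
    return (int)last_signal;
}

int catch_usr1(void)  { return deliver_usr1(on_signal); }
int ignore_usr1(void) { return deliver_usr1(SIG_IGN); }

/* Usage example: the handler runs when installed, not when ignored. */
void check_dispositions(void)
{
    assert(catch_usr1() == SIGUSR1);
    assert(ignore_usr1() == 0);
}
```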

Binary Compatibility at a Glance

There is a clean separation between user mode and kernel mode. User
processes run on top of the kernel with very little knowledge of what
is inside a system call. All they do is issue system calls, expecting
the behavior documented by kernel developers in a set of man pages. Most
programs do not care about kernel internals and will just work if you
change the kernel, as long as the system call behavior is left
unchanged.

This is how NetBSD binary compatibility works. When launching a new
program, the kernel is able to distinguish between native NetBSD
binaries and, for example, Linux or FreeBSD binaries on NetBSD/i386. It
then chooses an alternative system call table for this program, which
contains the appropriate entries for the emulated OS. For instance,
NetBSD/i386 uses sys/compat/linux/arch/i386/syscalls.master to provide
the system call table for Linux binaries.

When a Linux binary running on NetBSD does a system call, the NetBSD
kernel will run the appropriate function in the Linux system call table.
This function emulates the behavior of the Linux system call so that the
user program is fooled into thinking that it is running on the Linux
kernel whereas it is in fact running on the NetBSD kernel.
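
Per-program table selection can be sketched as follows. This is a
user-space model under stated assumptions, not kernel code. The types
emu_emul and emu_proc, the one-character "marker" used to recognize a
foreign binary, and all function names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

typedef long (*emu_call_t)(long arg);

/* One emulation: a name and its own system call table. */
struct emu_emul {
    const char       *name;
    const emu_call_t *table;
    size_t            nsys;
};

/* A process remembers which emulation it was started under. */
struct emu_proc {
    const struct emu_emul *emul;
};

static long emu_twice(long a)  { return 2 * a; }
static long emu_square(long a) { return a * a; }

/* The two systems number the same operations differently. */
static const emu_call_t native_table[]  = { emu_twice,  emu_square };
static const emu_call_t foreign_table[] = { emu_square, emu_twice  };

static const struct emu_emul emul_native  = { "native",  native_table,  2 };
static const struct emu_emul emul_foreign = { "foreign", foreign_table, 2 };

/* At program launch: pick the emulation from a marker in the binary. */
const struct emu_emul *emu_probe(char marker)
{
    return marker == 'F' ? &emul_foreign : &emul_native;
}

/* At system call time: use the table attached to the calling process. */
long emu_dispatch(const struct emu_proc *p, size_t code, long arg)
{
    if (code >= p->emul->nsys)
        return -ENOSYS;
    return p->emul->table[code](arg);
}

/* Usage example: the same number means different things per process. */
void check_emu_dispatch(void)
{
    struct emu_proc native  = { emu_probe('N') };
    struct emu_proc foreign = { emu_probe('F') };

    assert(emu_dispatch(&native, 0, 5) == 10);   /* twice  */
    assert(emu_dispatch(&foreign, 0, 5) == 25);  /* square */
}
```

The design point is that the decision is made once, when the program is
launched. After that, the handler only follows a pointer stored with the
process.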

Some system calls have the same behavior in NetBSD and in the emulated
OS; in this case, the emulation system call table just uses the native
function. Sometimes the behavior is a bit different. For instance, some
flags have different values, or the system call semantics differ. In
this case, the system call table references an emulation function, which
calls the native function after adapting the arguments and/or behavior.
This is done, for instance, in
sys/compat/linux/common/linux_misc.c:linux_sys_uname() for Linux
uname(2) emulation. Last but not least, the emulated system call may
have no native equivalent. The emulation function that implements the
system call must then do all the work itself, or just act as if the work
has been done and return, hoping that the user process will not notice
the broken behavior (yes, sometimes it works).
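
The last two cases can be illustrated with a small sketch. The flag
values below are hypothetical and chosen only so that the two encodings
differ. native_sys_open() is a stand-in that returns the flags it
received so that the translation can be inspected, and the foreign_*
names are invented for this example.

```c
#include <assert.h>

/* Hypothetical flag encodings: same meanings, different bit values. */
#define NATIVE_O_WRONLY   0x0001
#define NATIVE_O_CREAT    0x0200
#define FOREIGN_O_WRONLY  0x0001
#define FOREIGN_O_CREAT   0x0040

/* Stand-in for the native implementation: echoes its flags back. */
static int native_sys_open(int flags)
{
    return flags;
}

/* Emulation function: translate the foreign flags bit by bit, then
 * let the native function do the real work. */
int foreign_sys_open(int fflags)
{
    int nflags = 0;

    if (fflags & FOREIGN_O_WRONLY)
        nflags |= NATIVE_O_WRONLY;
    if (fflags & FOREIGN_O_CREAT)
        nflags |= NATIVE_O_CREAT;
    return native_sys_open(nflags);
}

/* No native equivalent: pretend the work was done and report success. */
int foreign_sys_unsupported(void)
{
    return 0;
}

/* Usage example: foreign bits come out as the matching native bits. */
void check_foreign_open(void)
{
    int got = foreign_sys_open(FOREIGN_O_WRONLY | FOREIGN_O_CREAT);

    assert(got == (NATIVE_O_WRONLY | NATIVE_O_CREAT));
    assert(foreign_sys_unsupported() == 0);
}
```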

The other part of the job is implementing signal emulation. Care must
be taken to ensure that the signal handler is called in the same way the
emulated OS would have called it. This job involves manipulating machine
registers and writing assembly language, and hence it is quite machine
dependent.