contents

meta

A Taste of Computer Security

Sandboxing

Sandboxing is a popular technique for creating confined execution environments, which could be be used for running untrusted programs. A sandbox limits, or reduces, the level of access its applications have — it is a container.

If a process P runs a child process Q in a sandbox, then Q's privileges would typically be restricted to a subset of P's. For example, if P is running on the "real" system (or the "global" sandbox), then P may be able to look at all processes on the system. Q, however, will only be able to look at processes that are in the same sandbox as Q. Barring any vulnerabilities in the sandbox mechanism itself, the scope of potential damage caused by a misbehaving Q is reduced.

Sandboxes have been of interest to systems researchers for a long time. In his 1971 paper titled Protection, Butler Lampson provided an abstract model highlighting properties of several existing protection and access-control enforcement mechanisms, many of which were semantic equivalents (or at least semantic relatives) of sandboxing. Lampson used the word "domain" to refer to the general idea of a protection environment.

There were several existing variants of what Lampson referred to as domains: protection context, environment, ring, sphere, state, capability list, etc.

Historical Example: Hydra

Hydra [1975] was a rather flexible capability-based protection system. One of its fundamental principles was the separation of policy and mechanism. A protection kernel provided mechanisms to implement policies encoded in user-level software that communicated with the kernel. Hydra's philosophy could be described as follows:

It is easier to protect information if it is divided into distinct objects.

Assign a particular type to each object. Some types (such as Procedure and Process) are provided natively by Hydra, but mechanisms exist for creating new types of objects. Hydra provides generic (type-independent) operations to manipulate each object, in addition to mechanisms for defining type-specific operations.

Capabilities are used to determine if access to an object is allowed. Objects do not have "owners".

Each program should have no more access rights than that are absolutely necessary.

Hydra abstracted the representation and implementation of operations for each object type in modules called subsystems. A user could only access an object through the appropriate subsystem procedures. Moreover, Hydra provided rights amplification — wherein a called procedure could have greater rights in the new domain (created for the invocation) than it had in the original (caller's) domain.

Hydra was relevant from several (related) perspectives: protection, sandboxing, security, virtualization, and even quality of service. The Hydra kernel's component for handling process scheduling was called the Kernel Multiprocessing System (KMPS). While parametrized process schedulers existed at that time, Hydra provided more flexibility through user-level schedulers. In fact, multiple schedulers could run concurrently as user-level processes called Policy Modules (PMs). The short-term scheduling decisions made in the kernel (by KMPS) were based on parameters set by the PMs. There was also provision for having a "guarantee algorithm" for allocating a rate guarantee (a fixed percentage of process cycles over a given period) to each PM.

Varieties

Today sandboxes are used in numerous contexts. Sandbox environments range from those that look like a complete operating environment to applications within, to those that provide a minimum level of isolation (to a few system calls, say). Some common sandbox implementation categories include:

Virtual machines (VMs) can be used to provide runtime sandboxes, which are helpful in enhancing users' level of confidence in the associated computing platform. The Java virtual machine is an example, although it can only be used to run Java programs (and is therefore not general purpose).

Virtualization

Applications could contain a sandbox mechanism within themselves. Proof-carrying code (PCC) is a technique used for safe execution of untrusted code. In the PCC system, the recipient of such code has a set of rules guaranteeing safe execution. Proof-carrying code must be in accordance with these safety rules (or safety policies), and carries a formal proof of this compliance. Note that the producer of such code is untrusted, so the proof must be validated by the recipient before code execution. The overall system is such that the programs are tamperproof — any modification will either render the proof invalid, or in case the proof still remains valid, it may not be a safety proof any more. In the case where the proof remains valid and is a safety proof, the code may be executed with no harmful effects.

A layer could be implemented in user-space, perhaps as a set of libraries that intercept library calls (in particular, invocations of system call stubs) and enforce sandboxing. A benefit of this approach is that you do not need to modify the operating system kernel (which might be operationally, technically, or politically disruptive to deployment). However, this approach might not yield a secure, powerful, or flexible enough sandboxing mechanism, and is often frowned upon by "the kernel people".

The sandboxing layer could be implemented within the operating system kernel.

Ideally you would design a system with explicit support for sandboxing, but it is often more practical to retrofit sandboxing into existing systems. Intercepting system calls is a common approach to creating sandboxes.

Intercepting System Calls

A common technique used for retrofitting sandboxing mechanisms is the "taking over", or interception of system calls. While such implementation is a powerful tool and is trivial to implement on most systems (usually without any kernel modifications, such as through loadable modules), it could, depending upon the situation, be extremely difficult to use effectively.

After you intercept a system call, you could do various things, such as:

Deny the system call

Audit the system call's invocation

Pre-process the arguments

Post-process the result

Replace the system call's implementation

There are several problems, however. If you are implementing a sandboxing mechanism (or mechanisms for access control, auditing, capabilities, etc.), you may not have all the context you need at the interception point. In many cases, even the original system call does not have this context, because such context may be a new requirement, introduced by your mechanism. Moreover, you may have to replicate the original system call's implementation because pre- and post-processing do not suffice. In closed source systems, this is a bigger problem. Even with an open source system, a system call could be off-loading most of its work to an internal kernel function, and you may not wish to make kernel source modifications. Chaining of calls (such as calling the original implementation from within the new one) will often not be atomic, so race conditions are possible. You also might need to ensure that semantics are fully preserved.

On many systems, the system call vector (usually an array of structures, each containing one or more function pointers — implementation(s) of that particular system call) is accessible and writable from within a kernel module. If not, there are workarounds. Even internal functions could be overtaken, say, by instruction patching. However, such approaches might turn out to be maintenance nightmares in case you are shipping a product that does this on an operating system that is not your own.

Examples

Let us briefly discuss a few examples that involve sandboxing.

TRON - ULTRIX (1995)

TRON was an implementation of a process-level discretionary access control system for ULTRIX. Using TRON, a user could specify capabilities for a process to access filesystem objects (individual files, directories, and directory trees). Moreover, TRON provided protected domains — restrictive execution environments with a specified set of filesystem access rights.

Rights on files included read, write, execute, delete, and modify (permissions), while rights on directories included creation of new files or links to existing files. TRON enforced capabilities via system call wrappers that were compiled into the kernel. Moreover, the implementation modified the process structure, key system calls such as fork and exit, and the system call vector itself.

LaudIt - Linux (1997)

I was a system administrator during my days at IIT Delhi as an undergraduate student. I had considerable success with a Bastard Operator From Hell (BOFH) facade, but certain mischievous users were quite relentless in pursuing their agenda. It was to foil them that I initially came up with the idea of LaudIt. First implemented in 1997, LaudIt was perhaps one of the earliest user-configurable and programmable system call interception mechanisms for the Linux kernel.

I implemented LaudIt as a loadable kernel module that dynamically modified the system call vector to provide different security policies. A user-interface and an API allowed a privileged user to mark system calls as having alternate implementations. Such a re-routed system call could have a chain of actions associated with the call's invocation. An action could be to log the system call, or consult a set of rules to allow or disallow the invocation to proceed. If no existing rule involved a re-routed system call, its entry in the system call vector was restored to its original value.

LaudIt required no modification (or recompilation) of the kernel itself, and could be toggled on or off.

Thus, using LaudIt, system calls could be audited, denied, or re-routed on a per-process (or per-user) basis. It was also possible to associate user-space programs with certain system call events. For example, you could implement per-file per-user passwords on files.

ECLIPSE/BSD (1997)

I worked on introducing Resource Management for Quality of Service in a custom Operating System derived from FreeBSD. This included work on schedulers for CPU, network, and disk, a pseudo filesystem based resource management API, and a resource management layer to provide seamless quality of service to legacy, unmodified applications.

Ensim Private Servers (1999)

Ensim Corporation did pioneering work in the area of virtualizing operating systems on commodity hardware. Ensim's Virtual Private Server (VPS) technology allows you to securely partition an operating system in software, with quality of service, complete isolation, and manageability. There exist versions for Solaris (see below), Linux, and Windows. Although all solutions are kernel-based, none of these implementations require source code changes to the kernel.

Solaris Virtual Private Server

I researched, designed, and implemented a "virtualized" version of Sun's Solaris operating system. The idea is to divide the operating system (by creating a software layer in the kernel) into multiple virtual environments, where each virtual OS is capable of running arbitrarily complicated existing applications unmodified. Such a complicated application (Oracle, for example) would typically exercise most components/subsystems of the OS.
Resources are made available to a virtual instance with Quality of Service. Moreover, applications in one virtual OS instance are in complete isolation from applications in other instances on the same "real" machine.

Each virtual instance can be managed (administered, configured, rebooted, shutdown etc.) completely independently of the others, and is visible as the "normal" operating system to applications within it. Note that this is different from an emulator: there is only one instance of the OS kernel, but the APIs have been virtualized in the kernel. This results in a much higher virtual instance performance than would be possible with an emulator.

Specific virtualization components include (but are not limited to):

Virtual system calls

Virtual uid 0 (each instance has its own "root" user)

Fair share network scheduler

Per-virtual OS resource limits on memory, CPU, and link

Virtual sockets and TLI (including the sockets port space)

Virtual IP address space

Virtual NFS

Virtual disk driver and enhanced VFS (each instance sees its own physical disk that may be resized dynamically, and partitioned as usual)

Virtual System V IPC layer (each instance gets its own IPC namespace)

Virtual /dev/kmem (each instance can access /dev/kmem appropriately without compromising other instances, or the system in general)

Virtual /proc filesystem (each instance gets its own /proc with only its processes showing up)

Virtual syslog facility

Virtual device filesystem

Overall system management layer

Note that this was product quality software and all work was done without ever having seen the source code for Solaris (which obviously is proprietary to Sun, and their source license had enough "wrong" strings attached from my company's point of view).

Note: I have been asked sometimes how this virtualization project (referred to as "V" from now on) relates to, or is different from the upcoming "Zones" feature in Solaris 10, or the FreeBSD "jail" subsystem. Here are some thoughts on this (assuming the reader is familiar with Solaris Zones and FreeBSD "jail"):

"V"'s goals are different (loftier, in many ways) from the others: it strives to give you the benefits of an OS emulator (or a real OS running on a hardware platform emulator) with far less overhead. Like others, it provides you isolated environments in which you can securely run applications, but unlike others, its isolated environments are very much like the full-blown underlying OS. For example, a virtual environment in "V" comes up just like a normal system (its own init and startup scripts). Having said that, "Zones" does appear to be very similar.

"V" lets you install and run most applications within a virtual instance, except those that access hardware directly. Since there is a virtual disk driver, applications that want to access disk(s) directly are allowed. The capacity of this virtual disk can be altered dynamically, even if there is a filesystem on it.

"V" is implemented as a set of loadable kernel modules, without referring to the kernel source. It can be dynamically introduced into a running system.

The project was started in 1999 (on Solaris 7, later carried over to Solaris 8) and was largely done in 2000. At that time, there was no talk of Solaris "Zones" and "jail" was not as mature as it is today. It makes me happy to see Sun heading in a similar direction today with "Zones" (they do have the kernel source!)

A Solaris Zone cannot be an NFS server, while a "V" instance could.

A Solaris Zone does not allow the mknod system call, while "V" did.

There are most likely features in Zones that "V" did not.

Solaris Zones (2004)

Sun introduced static partitioning in 1996 on its E10K family of servers. The partitions, or domains, were defined by a physical subset of resources - such as a system board with some processors, memory, and I/O buses. A domain could span multiple boards, but could not be smaller than a board. Each domain ran its own copy of Solaris. In 1999, Sun made this partitioning "dynamic" (known as Dynamic System Domains) in the sense that resources could be moved from one domain to another.

By the year 2002, Sun had also introduced Solaris Containers: execution environments with limits on resource consumption, existing within a single copy of Solaris. Sun has been improving and adding functionality to its Resource Manager (SRM) product, which was integrated with the operating system beginning with Solaris 9. SRM is used to do intra-domain management of resources such as CPU usage, virtual memory, maximum number of processes, maximum logins, connect time, disk space, etc.

The newest Sun reincarnation of these concepts is called "Zones": a feature in the upcoming Solaris 10. According to Sun, the concept is derived from the BSD "jail" concept: a Zone (also known as a "trusted container") is an isolated and secure execution environment that appears as a "real machine" to applications. There is only one copy of the Solaris kernel.