Oracle Blog

Blogged by David Comay

These Boots are Made for Walkin'

One of the most gratifying and exciting aspects of the
OpenSolaris
project is a return (for me, at least) to working on operating system
design and research with the larger, open community. In another era
while I was an undergraduate at
Berkeley,
I was fortunate enough to see the 2.x and 4.x BSD development effort up
close and to see the larger community formed between the University and
external organizations that had
UNIX
source licenses. It was not an open source community, of course, but
it was a community nonetheless, and one that shared fixes, ideas and
other software projects built on top of the operating system. Our
hopes for OpenSolaris are that in addition to releasing operating
system source code that can be used for many different purposes, Sun
and the community will innovate together while maintaining the core
values that
Solaris
provides today.

One of the many pieces of OpenSolaris which is of personal interest is
the
Zones
virtualization technology introduced in Solaris 10. Zones provide a
lightweight but very efficient and flexible way of consolidating and
securing potentially complex workloads. There is a wealth of technical
information about Zones in Solaris available at the
OpenSolaris Zones Community
and the
BigAdmin System Administration Portal.

One of the things about Zones that people notice right away is how
quickly they boot. Of course, booting a zone does not cause a system
to run its power-on self-test (POST) or require the same amount of
initialization
that takes place when the hardware itself is booting. However, I
thought it might be useful to do a tour of the dance that takes place
when a non-global zone is booted. I call it a dance since there
is a certain amount of interplay between the primary players -
zoneadm,
zoneadmd
and the kernel itself - that warrants an explanation.

Although the virtualization that Zones provides is spread throughout
the source code, the primary implementation in the kernel can be found
in
zone.c.
As with many OpenSolaris frameworks, there is a big block
comment at the start of the file which is very useful for
understanding the lay of the land with respect to the code. Besides
describing the data structures and locking strategy used for Zones,
there is a description of the states a zone can be in from the kernel's
perspective and at what points a zone may transition from one state to
another. For brevity, only the states covered during a zone boot are
listed here:

...
 *
 * Zone States:
 *
 * The states in which a zone may be in and the transitions are as
 * follows:
 *
 *   ZONE_IS_UNINITIALIZED: primordial state for a zone. The partially
 *       initialized zone is added to the list of active zones on the system but
 *       isn't accessible.
 *
 *   ZONE_IS_READY: zsched (the kernel dummy process for a zone) is
 *       ready. The zone is made visible after the ZSD constructor callbacks are
 *       executed. A zone remains in this state until it transitions into
 *       the ZONE_IS_BOOTING state as a result of a call to zone_boot().
 *
 *   ZONE_IS_BOOTING: in this shortlived-state, zsched attempts to start
 *       init. Should that fail, the zone proceeds to the ZONE_IS_SHUTTING_DOWN
 *       state.
 *
 *   ZONE_IS_RUNNING: The zone is open for business: zsched has
 *       successfully started init. A zone remains in this state until
 *       zone_shutdown() is called.
...
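
To make the excerpt above concrete, here is a minimal sketch of those states as a C enum with a transition check. It is only a model: the `zs_` names are invented for illustration and are not the kernel's definitions, and it covers only the transitions the excerpt mentions, while the full comment in zone.c describes more states and more ways to move between them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative model of the boot-time states quoted above (hypothetical
 * names, not the kernel's own enum). */
typedef enum {
    ZS_UNINITIALIZED,
    ZS_READY,
    ZS_BOOTING,
    ZS_RUNNING,
    ZS_SHUTTING_DOWN
} zs_status_t;

/* Returns true if the excerpt describes a direct move from 'from' to 'to'. */
bool zs_transition_ok(zs_status_t from, zs_status_t to)
{
    switch (from) {
    case ZS_UNINITIALIZED:
        return to == ZS_READY;          /* zsched finished its setup */
    case ZS_READY:
        return to == ZS_BOOTING;        /* zone_boot() was called */
    case ZS_BOOTING:
        /* init started, or starting init failed */
        return to == ZS_RUNNING || to == ZS_SHUTTING_DOWN;
    case ZS_RUNNING:
        return to == ZS_SHUTTING_DOWN;  /* zone_shutdown() was called */
    default:
        return false;
    }
}

/* Applies a transition in place; on a disallowed move it returns false
 * and leaves *cur unchanged. */
bool zs_advance(zs_status_t *cur, zs_status_t next)
{
    assert(cur != NULL);
    if (!zs_transition_ok(*cur, next))
        return false;
    *cur = next;
    return true;
}
```

With this model a zone cannot jump from ready straight to running; it has to pass through the short-lived booting state first.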

It is important to note that a number of zone states are not
represented here - those are for zones which do not (yet) have a kernel
context. An example of such a state is for a zone that is in the
process of being installed. These states are defined in
libzonecfg.h.

One of the players in the zone boot dance is the zoneadmd
process which runs in the global zone and performs a number of critical
tasks. Although much of the virtualization for a zone is implemented
in the kernel, zoneadmd manages a great deal of a zone's
infrastructure as outlined in
zoneadmd.c:

/*
 * zoneadmd manages zones; one zoneadmd process is launched for each
 * non-global zone on the system. This daemon juggles four jobs:
 *
 * - Implement setup and teardown of the zone "virtual platform": mount and
 *   unmount filesystems; create and destroy network interfaces; communicate
 *   with devfsadmd to lay out devices for the zone; instantiate the zone
 *   console device; configure process runtime attributes such as resource
 *   controls, pool bindings, fine-grained privileges.
 *
 * - Launch the zone's init(1M) process.
 *
 * - Implement a door server; clients (like zoneadm) connect to the door
 *   server and request zone state changes. The kernel is also a client of
 *   this door server. A request to halt or reboot the zone which originates
 *   *inside* the zone results in a door upcall from the kernel into zoneadmd.
 *
 *   One minor problem is that messages emitted by zoneadmd need to be passed
 *   back to the zoneadm process making the request. These messages need to
 *   be rendered in the client's locale; so, this is passed in as part of the
 *   request. The exception is the kernel upcall to zoneadmd, in which case
 *   messages are syslog'd.
 *
 *   To make all of this work, the Makefile adds -a to xgettext to extract *all*
 *   strings, and an exclusion file (zoneadmd.xcl) is used to exclude those
 *   strings which do not need to be translated.
 *
 * - Act as a console server for zlogin -C processes; see comments in zcons.c
 *   for more information about the zone console architecture.
 *
 * DESIGN NOTES
 *
 * Restart:
 *   A chief design constraint of zoneadmd is that it should be restartable in
 *   the case that the administrator kills it off, or it suffers a fatal error,
 *   without the running zone being impacted; this is akin to being able to
 *   reboot the service processor of a server without affecting the OS instance.
 */

When a user wishes to boot a zone, zoneadm will attempt to
contact zoneadmd
via a
door
that is used by all three components for a number of things including
coordinating zone state changes. If for some reason zoneadmd is not
running, an attempt will be made to
start it.
Once that has completed, zoneadm tells zoneadmd to
boot the zone
by supplying the appropriate
zone_cmd_arg_t
request via a door call. It is worth noting that the same door is used
by zoneadmd to return messages back to the user executing zoneadm and
also as a way for zoneadm to indicate to zoneadmd the
locale
of the user executing the boot command so that messages are localized
appropriately.

Looking at the
door server
that zoneadmd implements, there is some straightforward sanity checking
that takes place on the argument passed via the door call as well as
the use of some of the technology that came in with the introduction of
discrete privileges in Solaris 10.

Using
door_ucred,
the user credential can be checked to determine whether the request
originated in the global zone,[1] whether the user making the request had
sufficient privilege to do so[2] and whether the request was a result of
an upcall from the kernel. That last piece of information is used,
among other things, to determine whether or not messages should be
localized by
localize_msg.
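
The three checks just described can be pictured with a small model. This is a sketch under stated assumptions rather than the zoneadmd code: the `door_caller_t` structure and the `door_check_caller` function are hypothetical, and the real server derives these facts from the credential returned by door_ucred instead of receiving them as booleans. The order of the checks is also an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of what the door server learns about its caller. */
typedef struct {
    bool from_global_zone;  /* request originated in the global zone */
    bool has_sys_config;    /* caller holds the required privilege */
    bool kernel_upcall;     /* request is an upcall from the kernel */
} door_caller_t;

typedef enum {
    DOOR_REJECT,            /* refuse the request */
    DOOR_ACCEPT_LOCALIZED,  /* accept; return messages in caller's locale */
    DOOR_ACCEPT_SYSLOG      /* accept; log messages, nobody to localize for */
} door_verdict_t;

/* Decides whether to serve a request and how to report its messages. */
door_verdict_t door_check_caller(const door_caller_t *c)
{
    assert(c != NULL);
    if (!c->from_global_zone)
        return DOOR_REJECT;
    if (!c->has_sys_config)
        return DOOR_REJECT;
    /* Kernel upcalls have no user locale, so messages go to syslog. */
    return c->kernel_upcall ? DOOR_ACCEPT_SYSLOG : DOOR_ACCEPT_LOCALIZED;
}
```

The point of the third flag is only to pick the reporting path: a request from zoneadm gets localized messages back over the door, while a kernel upcall has its messages logged.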

It is within the door server implemented by zoneadmd that transitions
from one state to another take place. There are two states from which
a zone boot is permissible, installed and ready. From
the installed state,
zone_ready
is used to create and bring up the zone's
virtual platform
that consists of the zone's kernel context (created using
zone_create)
as well as the zone's specific file systems (including the root file
system) and logical networking interfaces. If a zone is supposed to be
bound to a non-default
resource pool,
then that also takes place as part of this state transition.
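
As a rough illustration of the two permissible starting points, the sketch below models the boot request as a tiny state machine. Everything in it is hypothetical: `adm_ready` and `adm_bootup` are stand-ins that only count their calls and always succeed, where the real zone_ready and zone_bootup do the actual platform setup and can fail.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative user-level states; ADM_OTHER covers every state from
 * which a boot is not permitted. */
typedef enum { ADM_OTHER, ADM_INSTALLED, ADM_READY, ADM_RUNNING } adm_state_t;

typedef struct {
    adm_state_t state;
    int ready_calls;   /* how often the "ready" step ran */
    int bootup_calls;  /* how often the "bootup" step ran */
} adm_zone_t;

/* Stand-in for bringing up the virtual platform. */
static int adm_ready(adm_zone_t *z)
{
    z->ready_calls++;
    z->state = ADM_READY;
    return 0;
}

/* Stand-in for the final boot step. */
static int adm_bootup(adm_zone_t *z)
{
    z->bootup_calls++;
    z->state = ADM_RUNNING;
    return 0;
}

/* Handles a boot request: an installed zone is first made ready, a ready
 * zone goes straight to bootup, and any other state is refused. */
int adm_boot(adm_zone_t *z)
{
    assert(z != NULL);
    if (z->state == ADM_INSTALLED && adm_ready(z) != 0)
        return -1;
    if (z->state != ADM_READY)
        return -1;
    return adm_bootup(z);
}
```

Booting from the installed state therefore runs both steps, while booting an already ready zone skips the platform setup.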

When a zone's kernel context is created using zone_create, a
zone_t
structure is allocated and initialized. At this time, the status
of the zone is set to
ZONE_IS_UNINITIALIZED.
Some of the initialization that takes place is in order to set up the
security boundary which isolates processes running inside a zone. For
example, the
vnode_t
of the zone's
root file system,
the zone's
kernel credentials
and the
privilege sets
of the zone's future processes are all initialized here.

Before returning to zoneadmd,
zone_create adds the primordial zone to a doubly-linked list
and two hash tables,
one hashed by
zone name
and the other by
zone ID.
These data structures are protected by the
zonehash_lock
mutex, which is dropped once the zone has been added. Finally, a
new kernel process,
zsched,
is created; kernel threads for this zone are parented to it. After
calling
newproc
to create this kernel process, zone_create will wait using
zone_status_wait
until the zsched kernel process has completed initializing the
zone and has set its status to
ZONE_IS_READY.
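
The registration step can be sketched in user-level C as follows. This is an illustrative model only: the `reg_` names, the bucket count and the string hash are all made up, a pthread mutex stands in for zonehash_lock, and the doubly-linked list is left out to keep the example short.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <string.h>

#define REG_BUCKETS 8

/* Minimal stand-in for a zone that is indexed two ways. */
typedef struct reg_zone {
    int id;
    const char *name;
    struct reg_zone *next_by_id;    /* chain within an ID bucket */
    struct reg_zone *next_by_name;  /* chain within a name bucket */
} reg_zone_t;

static reg_zone_t *reg_by_id[REG_BUCKETS];
static reg_zone_t *reg_by_name[REG_BUCKETS];
static pthread_mutex_t reg_lock = PTHREAD_MUTEX_INITIALIZER;

/* Simple multiplicative string hash reduced to a bucket index. */
static unsigned reg_hash_name(const char *s)
{
    unsigned h = 0;
    while (*s != '\0')
        h = h * 31u + (unsigned char)*s++;
    return h % REG_BUCKETS;
}

/* Adds the zone to both tables while holding the lock, then drops it. */
void reg_add(reg_zone_t *z)
{
    assert(z != NULL && z->name != NULL);
    unsigned i = (unsigned)z->id % REG_BUCKETS;
    unsigned n = reg_hash_name(z->name);

    pthread_mutex_lock(&reg_lock);
    z->next_by_id = reg_by_id[i];
    reg_by_id[i] = z;
    z->next_by_name = reg_by_name[n];
    reg_by_name[n] = z;
    pthread_mutex_unlock(&reg_lock);
}

/* Looks a zone up by numeric ID; returns NULL if it is not registered. */
reg_zone_t *reg_find_by_id(int id)
{
    pthread_mutex_lock(&reg_lock);
    reg_zone_t *z = reg_by_id[(unsigned)id % REG_BUCKETS];
    while (z != NULL && z->id != id)
        z = z->next_by_id;
    pthread_mutex_unlock(&reg_lock);
    return z;
}

/* Looks a zone up by name; returns NULL if it is not registered. */
reg_zone_t *reg_find_by_name(const char *name)
{
    assert(name != NULL);
    pthread_mutex_lock(&reg_lock);
    reg_zone_t *z = reg_by_name[reg_hash_name(name)];
    while (z != NULL && strcmp(z->name, name) != 0)
        z = z->next_by_name;
    pthread_mutex_unlock(&reg_lock);
    return z;
}
```

Keeping both indexes behind one lock means a lookup by name and a lookup by ID can never disagree about whether a zone exists.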

Since initialization of the new process's user structure has not yet been
completed, the first thing the new zsched process does is
finish that initialization along with reparenting itself to PID 1 (the
global zone's
init
process). And since the future processes to be run within the new zone
may be subject to resource controls, that initialization takes place
here in the context of zsched.

After grabbing the
zone_status_lock
mutex in order to set the status to ZONE_IS_READY,
zsched will then suspend itself, waiting for the zone's status
to be changed to
ZONE_IS_BOOTING.

Once the zone is in the ready state, zone_create
returns control back to zoneadmd and the door server continues
the boot process by calling
zone_bootup.
This initializes the zone's console device, mounts some of the standard
OpenSolaris file systems like /proc and /etc/mnttab
and then uses the
zone_boot
system call to attempt to boot the zone.

As the comment that introduces zone_boot points out, most of
the heavy lifting has already been done either by zoneadmd or
by the work the kernel has done through zone_create. At this
point, zone_boot saves the requested
boot arguments
after grabbing the zonehash_lock mutex and then further grabs
the zone_status_lock mutex in order to set the zone status to
ZONE_IS_BOOTING. After dropping both locks, it
is zone_boot that suspends itself, waiting for the zone status
to be set to
ZONE_IS_RUNNING.

Since the zone's status has now been set to ZONE_IS_BOOTING,
zsched now continues where it left off after it had suspended
itself with its call to
zone_status_wait_cpr.
After checking that the current zone status is indeed ZONE_IS_BOOTING,
a new kernel process is created in order to run init in the
zone. This process calls
zone_icode
which is analogous to the traditional
icode
function that is used to start init in the global zone and in
traditional UNIX environments. After doing some zone-specific
initialization, each of the icode functions ends up calling
exec_init
to actually exec the init process after copying out the path
to the executable, /sbin/init, and the boot arguments. If the
exec is successful, zone_icode will set the zone's status to
ZONE_IS_RUNNING and in the process, zone_boot will
pick up where it had been suspended. At this point, the value of
zone_boot_err
indicates whether the zone boot was successful or not and is used to
set the global errno value for zoneadmd.
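
The back-and-forth between the caller and zsched is easier to follow as a sequence of steps. The sketch below is a single-threaded simulation of that ordering, not the kernel code: the `sim_` names are hypothetical, the blocking waits are modeled as a step function that either makes progress or reports that it is blocked, and `init_exec_ok` stands in for whether the exec of init succeeds. It also assumes the waiter resumes as soon as the status leaves the booting state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum {
    SIM_UNINITIALIZED, SIM_READY, SIM_BOOTING, SIM_RUNNING, SIM_SHUTTING_DOWN
} sim_status_t;

typedef struct {
    sim_status_t status;
    bool init_exec_ok;  /* stand-in for the outcome of exec'ing init */
    int boot_err;       /* stand-in for the zone_boot_err value */
} sim_zone_t;

/* One step of the simulated zsched; returns false when it is blocked
 * waiting for somebody else to change the status. */
static bool sim_zsched_step(sim_zone_t *z)
{
    switch (z->status) {
    case SIM_UNINITIALIZED:
        z->status = SIM_READY;        /* setup finished */
        return true;
    case SIM_BOOTING:
        if (z->init_exec_ok) {
            z->status = SIM_RUNNING;  /* init was started */
        } else {
            z->boot_err = -1;
            z->status = SIM_SHUTTING_DOWN;
        }
        return true;
    default:
        return false;                 /* nothing to do in this state */
    }
}

/* Simulated caller side: wait for ready, request the boot, then wait
 * until the zone has left the booting state. Returns 0 on success. */
int sim_boot(sim_zone_t *z)
{
    assert(z != NULL);
    while (z->status != SIM_READY) {      /* create-side wait */
        if (!sim_zsched_step(z))
            return -1;
    }
    z->status = SIM_BOOTING;              /* boot-side status change */
    while (z->status == SIM_BOOTING) {    /* boot-side wait */
        if (!sim_zsched_step(z))
            return -1;
    }
    return z->boot_err;
}
```

Each side only ever moves the status forward and then waits for the other, which is why the real implementation can get by with a single status field guarded by a lock.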

There are two additional things to note with the zone's transition to
the running state. First of all,
audit_put_record
is called to generate an event for the Solaris auditing system so that
it's known which user executed which command to boot a
zone. In addition, there is an internal zoneadmd event
generated to indicate on the zone's console device that the zone is
booting. This internal stream of
events
is sent by the door server to the zone console subsystem for all state
transitions, so that the console user can see which state the zone is
transitioning to.

[1] This is a bit of defensive programming since unless the global zone
administrator were to make the door in question available through the
non-global zone's own file system, there would be no way for a
privileged user in a non-global zone to access the door used by
zoneadmd.

[2] zoneadm itself checks that the user attempting to boot a zone
has the necessary privilege, but it's possible some other privileged
process in the global zone might have access to the door but
lack the necessary
PRIV_SYS_CONFIG
privilege.