being taken.
The basic strategy is to have node_reboot (when -p option supplied)
invoke a special command on the node that will cause the shutdown
procedure to run prepare as it goes single user, but before the
network is turned off and the machine rebooted. The output of the
prepare run is captured and sent back via the tmcd BOOTLOG command and
stored in the DB, so that create_image can dump it to the logfile
(so that the person taking the image can know for certain that
prepare ran and finished okay).
On linux this is pretty easy to arrange since reboot is actually
shutdown and shutdown runs the K scripts in /etc/rc.d/rc6.d, and at
the end the node is basically single user mode. I just added a new
script to run prepare and send back the output.
On FreeBSD this is a lot harder since there are no decent hooks.
Instead, I had to hack up init (see tmcd/freebsd/init/{4,5,6}) with
some simple code that looks for a command to run instead of going to a
single user shell. The command (script) runs prepare, sends the output
back to tmcd, and then does a real reboot.
Okay, so how to get -p passed to node_reboot? I hacked up the
libadminmfs code slightly to do that, with a new 'prepare' argument
option. This may not be the best approach; might have to do this as a
real state transition if problems develop. I will wait and see.
Also, I changed www/loadimage.php3 to spew the output of
create_image to the browser.

splitting the existing code between a frontend script that parses arguments
and does taint checking, and a backend library where all the work is done
(including permission checks). The interface to the libraries is simple
right now (I didn't want to spend a lot of time designing an interface
without knowing whether the approach would work long term).
use libreboot;
use libosload;
nodereboot(\%reboot_args, \%reboot_results);
osload(\%reload_args, \%reload_results);
Arguments are passed to the libraries in the form of a hash. For example,
in os_setup:
$reload_args{'debug'} = $dbg;
$reload_args{'asyncmode'} = 1;
$reload_args{'imageid'} = $imageid;
$reload_args{'nodelist'} = [ @nodelist ];
Results are passed back both as a return code (-1 means total failure right
away, while a positive value indicates the number of nodes that failed),
and in the results hash which gives the status for each individual node. At
the moment it is just success or failure (0 or 1), but in the future might
be something more meaningful.
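For illustration, here is a minimal Python sketch of this hash-in/hash-out convention (the real libraries are Perl; the node names and the stand-in failure test below are made up):

```python
# A sketch of the hash-in/hash-out convention used by libosload and
# libreboot. The real code is Perl; this Python version, including the
# node names and the stand-in failure check, is purely illustrative.

def osload(args, results):
    """Pretend to load an image on each node in args['nodelist'].

    Returns -1 on total failure, otherwise the number of nodes that
    failed; per-node status (0 = success, 1 = failure) is written
    into the results dict.
    """
    if 'imageid' not in args or 'nodelist' not in args:
        return -1  # total failure right away: bad arguments
    failed = 0
    for node in args['nodelist']:
        ok = (node != 'pc-broken')  # stand-in for the real reload work
        results[node] = 0 if ok else 1
        if not ok:
            failed += 1
    return failed

reload_args = {'debug': 0, 'asyncmode': 1,
               'imageid': 'FBSD-STD', 'nodelist': ['pc1', 'pc2']}
reload_results = {}
rc = osload(reload_args, reload_results)
```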
os_setup can now find out about individual failures, both in reboot and
reload, and alter how it operates afterwards. The main thing is to not wait
for nodes that fail to reboot/reload, and to terminate with no retry when
this happens, since at the moment it indicates an unusual failure, and it
is better to terminate early. In the past an os_load failure would result
in a tbswap retry, and another failure (multiple times). I have already
tested this by trying to load images that have no file on disk; it is nice
to see those failures caught early and the experiment fail much more
quickly!
A note about "asyncmode" above. In order to promote parallelism in
os_setup, asyncmode tells the library to fork off a child and return
immediately. Later, os_setup can block and wait for status by calling
back into the library:
my $foo = nodereboot(\%reboot_args, \%reboot_results);
nodereboot_wait($foo);
If you are wondering how the child reports individual node status back to
the parent (so it can fill in the results hash), Perl really is a kitchen
sink. I create a pipe with Perl's pipe function and then fork a child to do
the work; the child writes the results to the pipe (status for each node),
and the parent reads that back later when nodereboot_wait() is called,
moving the results into the %reboot_results hash. The parent meanwhile can
go on and in the case of os_setup, make more calls to reboot/reload other
nodes, later calling the wait() routines once all have been initiated.
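The pipe/fork trick can be sketched like this (in Python rather than the original Perl; the one-status-line-per-node wire format is an assumption):

```python
import os

def nodereboot_async(nodes):
    """Fork a worker and return immediately with a handle the caller
    can pass to nodereboot_wait() later (the asyncmode idea)."""
    rfd, wfd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: do the work and write one "node status" line per
        # node to the pipe (the line format is an assumption).
        os.close(rfd)
        with os.fdopen(wfd, 'w') as out:
            for node in nodes:
                status = 0  # pretend every reboot succeeded
                out.write('%s %d\n' % (node, status))
        os._exit(0)
    # Parent: close the write end and carry on with other work.
    os.close(wfd)
    return (pid, rfd)

def nodereboot_wait(handle, results):
    """Block until the worker is done, filling in per-node results.
    Returns the number of nodes that failed."""
    pid, rfd = handle
    with os.fdopen(rfd) as inp:
        for line in inp:
            node, status = line.split()
            results[node] = int(status)
    os.waitpid(pid, 0)
    return sum(1 for s in results.values() if s != 0)

results = {}
handle = nodereboot_async(['pc1', 'pc2'])
failed = nodereboot_wait(handle, results)
```

The parent is free to start more workers between the async call and the wait, which is exactly how os_setup overlaps reboot/reload across nodes.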
Also worth noting that in order to make the libraries "reentrant" I had to
do some cleaning up and reorganizing of the code. Nothing too major though,
just removal of lots of global variables. I also did some mild unrelated
cleanup of code that had been run over once too many times with a tank.
So how did this work out? Well, for os_setup/os_load it works rather
nicely!
node_reboot is another story. I probably should have left it alone, but
since I had already climbed the curve on osload, I decided to go ahead and
do reboot. The problem is that node_reboot needs to run as root (it's a
setuid script), which means it can only be used as a library from something
that is already setuid. os_setup and os_load run as the user. However,
having a consistent library interface and the ability to cleanly figure out
which individual nodes failed, is a very nice thing.
So I came up with a suitable approach that is hidden in the library. When the
library is entered without proper privs, it silently execs an instance of
node_reboot (the setuid script), and then uses the same trick mentioned
above to read back individual node status. I create the pipe in the parent
before the exec, and set the no-close-on-exec flag. I pass the fileno along
in an environment variable, and the library uses that to write the
results, just like above. The result is that os_setup sees the same
interface for both os_load and node_reboot, without having to worry that
one or the other needs to be run setuid.
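The same trick across the exec boundary can be sketched in Python (the real code is Perl; the STATUS_FD variable name and the tiny stand-in program are assumptions):

```python
import os
import sys

def reboot_via_exec():
    """Exec a separate program (standing in for the setuid node_reboot)
    and read per-node status back over a pipe whose write end is
    deliberately left open across the exec. The STATUS_FD variable
    name is an assumption, not the real one."""
    rfd, wfd = os.pipe()
    os.set_inheritable(wfd, True)  # i.e. clear the close-on-exec flag
    pid = os.fork()
    if pid == 0:
        os.close(rfd)
        env = dict(os.environ, STATUS_FD=str(wfd))
        # Stand-in for exec'ing the real setuid script: a tiny Python
        # program that writes status lines to the inherited fd.
        os.execve(sys.executable,
                  [sys.executable, '-c',
                   'import os\n'
                   'fd = int(os.environ["STATUS_FD"])\n'
                   'os.write(fd, b"pc1 0\\npc2 0\\n")\n'
                   'os.close(fd)'],
                  env)
    os.close(wfd)
    results = {}
    with os.fdopen(rfd) as inp:
        for line in inp:
            node, status = line.split()
            results[node] = int(status)
    os.waitpid(pid, 0)
    return results

results = reboot_via_exec()
```

The fd number survives the exec unchanged, so passing it as a string in the environment is enough for the exec'd program to find the pipe.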

instead of rebooting. If the reconfig fails, fail through to reboot.
A reconfig will "halt" the current vnodes (or remove ones that are no
longer assigned to the node), reconfig the pnode, and then restart the
vnodes that are still assigned to the node (or create new ones for
newly assigned vnodes). A halt stops the vnode, but leaves the
vnode filesystem intact.
Not bothering to reconfig individual vnodes yet since that's pretty
much like a reboot of a vnode. The difference in time is tiny.

determining if a bootinfo wakeup will suffice. This avoids a flurry
of children trying to connect to the DB. Also, move the TBdbfork()
until after the DoesPing() test, so that the connection flurry is
spaced out a bit more. DoesPing() does not talk to the DB, despite what
its name might imply.
Bump batchcount to 12.

filename to boot, and all local nodes will boot the same pxeboot kernel,
which has been extended to allow for jumping directly into a specific MFS
(in addition to the usual testbed boot into a partition or multiboot
kernel).
Bootinfo and the bootwhat protocol extended to tell the client node what
MFS to jump into directly, without a reboot. pxe_boot_path and
next_pxe_boot_path are now deprecated, with bootinfo used to control which
MFS to boot. Nodes now boot a single pxeboot kernel, and bootinfo tells
them what to do next.
Bootinfo greatly simplified. temp_boot_osid has been added to allow for
temporary booting of different kernels (such as with node_admin or
create_image). Unlike next_boot_osid, which is a one-shot boot,
temp_boot_osid causes the node to boot that OS until told not to.
next_boot_path and def_boot_path in the nodes table are now ignored.
Bootinfo gets path info strictly from the os_info table entry for the osid
given in one of def_boot_osid, temp_boot_osid, or next_boot_osid. This
makes the selection of what to do in bootinfo a lot simpler (and for
TBBootWhat in libdb). The os_info table also modified to include an MFS
flag so that bootinfo knows to tell the client that the path refers to an
MFS and not a multiboot kernel.
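The resulting selection in bootinfo can be sketched like this (a Python sketch; the text only lists the three fields, so the precedence order and the example paths are my assumptions):

```python
# A sketch of the selection logic described above. The precedence
# (one-shot next_boot beats sticky temp_boot beats def_boot) and the
# example paths are assumptions; field names come from the text.

def boot_what(node, os_info):
    """Decide what a node should boot, given its nodes-table osid
    fields and the os_info table (osid -> {'path': ..., 'mfs': bool})."""
    osid = (node.get('next_boot_osid')
            or node.get('temp_boot_osid')
            or node.get('def_boot_osid'))
    if osid is None:
        return None  # nothing selected: wait in pxeboot
    entry = os_info[osid]
    return {'osid': osid,
            'path': entry['path'],
            'what': 'mfs' if entry['mfs'] else 'partition'}

os_info = {'FREEBSD-MFS': {'path': '/tftpboot/frisbee', 'mfs': True},
           'RHL-STD':     {'path': '/boot/kernel',      'mfs': False}}
node = {'def_boot_osid': 'RHL-STD', 'temp_boot_osid': 'FREEBSD-MFS'}
what = boot_what(node, os_info)
```

The point of the change is visible here: the path and MFS flag live only in os_info, so bootinfo never consults per-node path columns.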
Change to boot sequence; free nodes no longer boot into the default OSID.
Instead, they are told to wait in pxeboot until told what to do, which
will typically be when the node is allocated and a specific OSID
picked. If the node needs to be reloaded, then the node is told to jump
directly into the Frisbee MFS, which saves one complete reboot cycle
whether the node has the requested OS installed, or not. New program
added called "bootinfosend" that is used by node_reboot to "wake up"
nodes sitting in pxewait mode, so that they query bootinfo again and boot.
node_reboot changed to look at the event state of a node, and use
bootinfosend to wake up nodes, rather than power cycling them, since pxeboot
does not respond to pings. Retry (if the UDP packet is lost) is handled by
stated.
Event support added to bootinfo, to replace the event generation that was
in proxydhcp. I have not included the caching that Mac had in proxydhcp
since it does not appear that bootinfo packets are lost very
often. Cleaned up all of the event and DB query code to use lib/libtb for
DB access, and moved all of the event code into a separate file. The
event sequence when a node boots now looks like this:
'SHUTDOWN'   --> 'PXEBOOTING' (BootInfo)
'PXEBOOTING' --> 'PXEBOOTING' (BootInfo Retry)
'PXEBOOTING' --> 'BOOTING'    (Node Not Free)
'PXEBOOTING' --> 'PXEWAIT'    (Node is Free)
'PXEWAIT'    --> 'PXEWAKEUP'  (Node Allocated)
'PXEWAKEUP'  --> 'PXEWAKEUP'  (BootInfo Retry)
'PXEWAKEUP'  --> 'PXEBOOTING' (Node Woke Up)
Change stated to support resending PXEWAKEUP events when node times out.
After 3 tries, node is power cycled. Other minor cleanup in stated.
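The timeout handling can be sketched as follows (a Python sketch; the function name and callback style are hypothetical, and only the retry count of 3 comes from the text):

```python
# Hypothetical sketch of stated's PXEWAKEUP timeout policy: resend the
# wakeup up to max_retries times, then fall back to a power cycle.

def handle_pxewakeup_timeout(node, retries, max_retries=3,
                             resend=None, power_cycle=None):
    """Called when a PXEWAKEUP times out. Returns the new retry count
    (reset to 0 after falling back to the power cycle)."""
    if retries < max_retries:
        if resend:
            resend(node)        # send another PXEWAKEUP event
        return retries + 1
    if power_cycle:
        power_cycle(node)       # gave up on wakeups; cycle the node
    return 0

sent, cycled = [], []
r = 0
for _ in range(5):              # simulate five consecutive timeouts
    r = handle_pxewakeup_timeout('pc7', r,
                                 resend=sent.append,
                                 power_cycle=cycled.append)
```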
Clean up and simplify os_select, while adding support for temp_next_boot
and removing all trace of def_boot_path and next_boot_path processing.
Remove all pxe_boot_path and next_pxe_boot_path processing. Changed
command line interface to support "clearing" fields. For example,
node_admin changed to call os_select like this to have the node
temporarily boot the FreeBSD MFS:
os_select -t FREEBSD-MFS pcXXX
which sets temp_boot_osid. To turn admin mode off:
os_select -c -t pcXXX
which says to clear temp_boot_osid.
sql/database-fill-supplemental.sql modified to add os_info table
entries for the FreeBSD, Frisbee, and newnode MFS's.
Be sure to change the dhcpd config, restart dhcp, kill proxydhcp, and
restart bootinfo.

model of waiting for state changes. Before we were watching the database
(which means we can only watch for terminal/stable/long-lived states, and
have to poll the db). Now things that are waiting for states to change
become event listeners, and watch the stream of events flow by, and don't
have to do any polling. They can now watch for any state, and even
sequences of states (i.e. a Shutdown followed by an Isup).
To do this, there is now a cool StateWait.pm library that encapsulates the
functionality needed. To use it, you call initStateWait before you start
the chain of events (i.e. before you call node_reboot). Then do your stuff,
and call waitForState() when you're ready to wait. It can be told to
return periodically with the results so far, and you can cancel waiting
for things. An example program called waitForState is in
testbed/event/stated/ , and can also be used nicely as a command line tool
that wraps up the library functionality.
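The key ordering, register interest before triggering the action, can be sketched in Python (threads and a queue stand in for the event system; the method names mirror the library's but the internals are invented):

```python
import queue
import threading

class StateWait:
    """Toy StateWait: subscribe to the state stream *before* kicking
    off the action, then block for an expected sequence of states."""
    def __init__(self, sequence):
        self.sequence = list(sequence)
        self.events = queue.Queue()  # stand-in for the event stream

    def deliver(self, node, state):
        """Called by the (simulated) event system for every state."""
        self.events.put((node, state))

    def waitForState(self, timeout=10):
        want = list(self.sequence)
        while want:
            node, state = self.events.get(timeout=timeout)
            if state == 'TBFAILED':
                return False  # stated gave up; the node is hosed
            if state == want[0]:
                want.pop(0)   # got the next state we were waiting for
        return True

# Register interest first...
w = StateWait(['SHUTDOWN', 'ISUP'])
# ...then trigger the action (a real caller would run node_reboot here;
# this thread just replays a plausible boot sequence).
t = threading.Thread(target=lambda: [w.deliver('pc1', s)
                                     for s in ('SHUTDOWN', 'BOOTING', 'ISUP')])
t.start()
ok = w.waitForState()
t.join()
```

Registering first is what makes this race-free: if you rebooted before subscribing, the SHUTDOWN event could flow by before anyone was listening.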
This also required the introduction of a TBFAILED event that can be sent
when a node isn't going to make it to the state that someone may be
waiting for. I.e. if it gets wedged coming up, and stated retries but
eventually gives up on it, it sends this to let things know that the node
is hosed and won't ever come up.
Another thing that is part of this is that node_reboot moves (back) to the
fully-event-driven model, where users call node_reboot, and it does some
checks and sends some events. Then stated calls node_reboot in "real mode"
to actually do the work, and handles doing the appropriate retries until
the node either comes up or is deemed "failed" and stated gives up on it.
This means stated is also the gatekeeper of when you can and cannot reboot
a node. (See mail archives for extensive discussions of the details.)
A big part of the motivation for this was to get uninformed timeouts and
retries out of os_load/os_setup and put them in stated where we can make a
wiser choice. So os_load and os_setup now use this new stuff and don't
have to worry about timing out on nodes and rebooting. Stated makes sure
that they either come up, get retried, or fail to boot. tbrestart also
underwent a similar change.

really-reboot-nodes-that-timeout stuff.
NOTE: Until the timeout/retry stuff is gone from os_load/os_setup, it is
disabled in stated. It will still only send email. But all the stuff is
there and has been tested.
NOTE: Until other things don't depend on the old behavior of node_reboot
(when it returns, all nodes are in SHUTDOWN), the event stuff is disabled.
Real mode is the default, and can be run by anyone.
In short, this commit is new versions of stated and node_reboot that act
almost exactly like the old ones. But I wanted to commit them before I go
on making a bunch more changes, to have a checkpoint that I know works.

node_reboot reports node activity into the "last_ext_act" column of
node_activity. (I.e. activity that is external to the node.)
This means that swapin, swapout, reload, etc etc, anything that reboots
the node from boss/ops, will count as activity.

then remove special case for sending REBOOTING event in node_reboot/power
when using NORMAL mode. Now SHUTDOWN is always sent. (Important side note:
SHUTDOWN needs to be a valid state in every machine now.)

local. For local nodes, need to cull out jailed nodes if the phys node
is also going to reboot. Jailed nodes are rebooted serially since they
go down much faster.
Fix up recently added wait mode for jailed nodes. Also, I noticed that
I was having problems with events not filtering through stated before
going into the ISUP wait loop; I was catching the nodes still in ISUP
instead of SHUTDOWN. I added a sleep(2) before going into wait mode,
but this might be something to watch out for elsewhere too.

Changes to watch out for:
- db calls that change boot info in nodes table are now calls to os_select
- whenever you want to change a node's pxe boot info, or def or next boot
osids or paths, use os_select.
- when you need to wait for a node to reach some point in the boot process
(like ISUP), check the state in the database using the lib calls
- Proxydhcp now sends a BOOTING state for each node that it talks to.
- OSs that don't send ISUP will have one generated for them by stated
either when they ping (if they support ping) or immediately after they get
to BOOTING.
- States now have timeouts. Actions aren't currently carried out, but they
will be soon. If you notice problems here, let me know... we're still
tuning it. (Before all timeouts were set to "none" in the db)
One temporary change:
- While I make our new free node manager daemon (freed), all nodes are
forced into reloading when they're nfreed and the calls to reset the os
are disabled (that will move into freed).

Add logging to @prefix@/log/power.log, recording when and how each node was rebooted. This will aid in some debugging tasks, and will be much more important after merging the new stated stuff, when node_reboot will check the state of the node before rebooting it.

Remove the -e flag from calls to power. node_reboot sends an event only when an ssh reboot or ipod successfully reboots the node, and only calls power when those fail. So power should send an event every time node_reboot calls it. This explains some of the problems we were having with tons of email from stated about invalid transitions: since the state changes weren't always happening, nodes appeared to skip over states.

started.) Instead of avoiding sending an event when we think the
node will be sending it, go back to always sending it. Nodes
sending the events on shutdown just isn't reliable enough. Instead,
we'll just make two REBOOT events in a row legal.

script that checks the database to see if the node is local or remote. The
problem with this is that the ssh syntax makes it hard to determine the
host name by inspection. We would need to parse all the ssh args (bad
idea), or work backwards and try to figure out the difference between the
command (which is not a string but a sequence of args), the host, and the
preceding ssh args. Hell with that! Changed sshtb to require a specific
-host argument. Read the args and look for it. Error out if not found, to
catch improper usage.
The moral of this update: "sshtb [ssh args] -host <host> [more args ...]"
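Scanning the args for -host is trivial once the flag is required; a sketch of the idea (in Python; the real sshtb is a wrapper script, and the error messages here are made up):

```python
# A sketch of the -host scan the update describes. The real sshtb is
# a wrapper script; this just illustrates the parsing idea.

def parse_sshtb(argv):
    """Find the required -host argument in an sshtb-style command line:
    sshtb [ssh args] -host <host> [command args ...].
    Returns (host, remaining_args); errors out if -host is missing."""
    for i, arg in enumerate(argv):
        if arg == '-host':
            if i + 1 >= len(argv):
                raise SystemExit('sshtb: -host requires an argument')
            return argv[i + 1], argv[:i] + argv[i + 2:]
    raise SystemExit('sshtb: no -host argument given')

host, rest = parse_sshtb(['-l', 'root', '-host', 'pc3', 'ls', '-la'])
```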