I have been playing with a brand new IBM pSeries 610 that
arrived early this week. It came preinstalled with a minimal
AIX 5.1; they even left out the IDE CD-ROM driver the system
needs to install any further software. As I wanted to know
whether the system would work with AIX from CD-ROM anyway, I
just booted from CD, and this time AIX did have an IDE
CD-ROM driver.

This time I installed most of the software I believed we
would need. I did not see the option in the install menu to
make the 64 bit kernel the default, so I ended up with a 32
bit kernel. I had to test our AppleTalk kernel modules in
both 32 and 64 bit mode, so I started with 32. After first
rebooting into 64 bit mode I had no NFS mounts; the mount
command complained that one of the NFS kernel modules was
using an old, obsolete format. Apparently I had installed
one package too many; after removing the des package NFS
works fine.

The AppleTalk kernel module was an easy port: just Makefile
adaptations to compile a 32 bit as well as a 64 bit module
from the same sources and bundle them together into an ar
archive. The AIX kernel is smart enough to select the proper
version from the archive depending on the mode it is running
in. Pretty nifty.

While testing some stuff in 64 bit mode I noticed that
Apache (as delivered by IBM as the Websphere server) dumped
core upon starting. Dbx told me the core file was invalid;
strange. I started httpd under dbx with the -X option and
dbx hung. I kill -9'ed httpd and could exit dbx. I then
attempted an apachectl start and whoops, I was talking to
the service processor instead of AIX (I was sitting at the
console). I rebooted and looked at the generated vmcore
file; it pointed at the kernel-based linker that AIX uses
for its shared libraries, which appeared to have stumbled
across a NULL pointer while loading an httpd module.

A few of the other subsystems also produce strange failure
messages in 64 bit mode; all in all I am not convinced about
AIX 5.1 in 64 bit mode. AIX has been rock solid for me since
the early beginnings, so this is really disappointing. I
looked at the AIX fixes page and tried the new ordering
system for AIX 5 fixes, as I found out that I did not yet
have the latest components. You click on the packages you
need and they tell you they will process your order and send
a notification with a download URL. After a few hours of
waiting there is still no URL; not encouraging.

In the last diary entry I said MacOS X is not Unix. Today I
have to say that the combination of Solaris threading and
setitimer is really broken. The setitimer man page says that
SIGALRM signals cannot be blocked in threaded code, and that
this is a bug that is not going to be fixed.

In converting an existing event based system to cooperate
with threads I had to provide an efficient API for multiple
millisecond-resolution timers that only fire while the main
app is waiting for file descriptors via poll/select. All
signals are blocked while not waiting for fds, so one can
even call malloc inside signal handlers. This makes for some
really easy event driven programming, and we have used this
framework for a really long time now.

To get around the Solaris setitimer problem I built a
workaround: a really primitive SIGALRM signal handler that
raises a real-time signal, which can be blocked, as a
replacement. This works fine (with some overhead for the
extra signal delivery) and I thought the problem was solved.
Well, after some months of using this on the development
machines I have found that this free-running SIGALRM really
wreaks havoc with one assumption made everywhere in the
code: no signal will arrive unless we are in poll/select,
and thus EINTR is impossible.

Now, with SIGALRM running freely without being blocked, any
slow I/O on pipes, sockets and terminals can cause EINTR,
and strange failures creep into code that has been running
for years. Due to the interaction with timing these bugs are
really difficult to find. We will have to wrap the read,
write, readv, writev and similar calls into safe ones that
retry on EINTR, and change all of the places that need the
wrappers. This really sucks.

Did some debugging yesterday that showed again that MacOS X
is not Unix; it is something else. If you start a background
daemon from a shell window while logged in with the Aqua GUI
and then log out, processes newly forked by the background
daemon will not be able to do any get*ent lookups any more.
The C library on MacOS X attempts on fork to re-establish a
Mach IPC send right for the lookupd cache management server,
and this fails because the Mach bootstrap server has
destroyed the current context on logout. This basically
means you are not able to start background daemons from
shell windows, which really sucks.

The dladdr idea is hopelessly machine dependent and I have
decided not to pursue it further. We thus compile in the
name of the shared library and search for it in the standard
places.

In the meantime the rework of the admin protocol for all the
PC style stuff and the new printer interface types is
progressing well; the server part is done. Heinrich is
working on the client side, which takes longer as it is much
more work.

I have meanwhile started to put the AFP 3.0 extensions into
our afpsrv, although I do not necessarily expect to be
finished in time for the initial MacOS X release. The
important infrastructure changes are already done, namely
the 64 bit file I/O stuff and the new shared arena. AFP 3.0
also allows long UTF8 file names, which we can now handle
easily as we have extended our desktop database format. I
will first implement the 64 bit I/O calls and then the Unix
style permissions, leaving the more complicated UTF8 file
name stuff for later.

After returning from Yellowstone I was busy the last few
days abstracting a few operations we have been doing the
same way for years, even though it is possible to optimize
them. In particular we append resources (the idea is loosely
based on the Mac one) to the end of our executables, for
small information items that should always be in sync with
the compiled code. Under Unix there is no standard way to
open the current process executable, so the original code
searched for argv[0] along the path.

More modern Unix variants have /proc, which allows one to
open the running executable more easily, for example
/proc/self/exe under Linux or /proc/self/object/a.out under
Solaris. A few platforms like Irix or Tru64 make that more
difficult, as one has to open /proc/<pid> first and then use
ioctl(..., PIOCOPENM, 0) to retrieve an open file descriptor
for the zero mapping (the main executable).

Still, some Unix variants like AIX 4 or MacOS X provide none
of this, so there we still have to search along the path; a
bit fragile and ugly.

Apparently it gets even uglier if you want to open a shared
library. The current solution compiles in the name of the
shared library (ugh) and searches according to the OS search
rules for shared libraries. As far as I can tell, one could
either call dladdr to find the name of the shared library a
given function is in, or use the /proc file system mapping
enumeration. I will see which version works best.

Today was the last day before leaving for the CIFS
conference in Bellevue, WA, followed by a week of vacation
in Yellowstone National Park. As it goes with these last
days, the MacOS X beta was supposed to be ready today as
well. Alas, as it turned out, there was just that show
stopper bug that turned up late in the afternoon, after we
had already put a version on our web server (not visible if
you do not know the path). The TNT folks delivered a set of
new CDs with the latest MacOS X 10.1 build, and to our
horror a few programs just dumped core upon starting up.

Examining the core dump showed svc_getreqset as the leaf
function on the stack, which immediately rang a bell with
me. I had seen that problem before, with the change from AIX
4.2 to 4.3. I looked into sys/types.h of the new MacOS X
version and indeed, they increased FD_SETSIZE from 256 to
1024. That is not a bad idea in itself, as 256 is rather
small, but the design of the SUN RPC library is really bad
in this regard, as it passes the address of an fd_set but
not how large it is. From the application side it is also
difficult to guard against this, as there is no way to find
out which value of FD_SETSIZE was used to compile the C
library. As it stands, we simply define FD_SETSIZE to be
1024 even though we are still compiling on the older system;
this way the structure is large enough.

As we had to clean out all object code, the build is still
running and we will have to put it up on the web server on
Monday. One day we will have to get rid of the elaborate
makefile system and use some saner perl scripts to do the
build; that way it would be easier to distribute the build
across multiple systems.

OK, I have got the packaging to work under MacOS X. The
problem was that I set the destination to /usr/local/helios
and put all relative path names into the .pax.gz and .bom
files. This was an attempt to leave open the option of fully
relocatable packages, but that does not work out easily, as
you cannot easily find out the installation directory of a
base package if you have multiple add-on packages. I have
now put root-relative path names (including the
usr/local/helios prefix) into the .bom and .pax.gz files and
set the destination to /. Now the packages install fine even
if I re-install.

The MacOS X package format is driving me crazy. Apparently
if you install the packages I made a second time, the files
end up in the wrong bin directory. I made the packages
non-relocatable with the destination /usr/local/helios, with
subdirectories like bin, sbin, etc and so on. The strange
thing is that on the second install the contents of
/usr/local/helios/bin wind up in /bin, but the same thing
happens with neither sbin nor etc. I am at a loss to explain
that one.

I have got a note from nriley on how to do UFS disk images,
thanks! BTW, I did fill in my email address in the advogato
account form, but this field is not listed anywhere on the
personal page.

Today I chased down a really weird bug. As I am working on
server system software with lots of services, I have lots of
processes listening for incoming sessions: one for AFP file
requests, one for SMB file requests, network print jobs,
mail and so on. One of the servers is a mail server; it
listens for POP, APOP and a custom protocol for our own mail
client, via either ADSP or TCP. The custom protocol also has
provisions for sending mail via the same authenticated
session used to retrieve mail, and there the bug happened.
Upon sending an email message, all listening servers would
die, with the exception of the mail connection itself. So
what does sending an email message have to do with
terminating file service sessions and all that?

The answer is process groups. Previously our software used
daemon programs started individually from shell scripts,
each one daemonizing and backgrounding itself. The
daemonizing includes calling setsid(), which also arranges
for each of the listening servers to be in its own process
group. But this changed recently, in particular to solve the
problem of inter-server dependencies with optional add-on
servers, which was easier to handle with a custom starter
program that topologically sorts the dependencies. This
starter also daemonizes itself and expects that its children
do not daemonize, so that it can monitor them via SIGCHLD
and log failures.

This new scheme (which is similar in design to the AIX
system resource controller or the Windows NT service
controller) thus caused all our servers to be in one process
group. The mail component uses a rather unusual interprocess
communication method for new mail: it listens on the comsat
(biff) service socket in the master listening process and
calls kill(0, SIGUSR1) to notify its children when new mail
is available. The children in turn stat their mailboxes to
find out which one got a new message. This way each user
sees a newly arrived message immediately, without the usual
polling for changes. Unfortunately kill(0, ...) signals
every process in the sender's process group, and the default
disposition for SIGUSR1 is to terminate the process, so all
the servers in the same process group that were not prepared
to handle the signal exited. The solution was simply for the
service starter to call setsid() right after fork, before
execing the programs, so that each of the listening servers
is again in its own process group.

After some fiddling with the MacOS X package format I got
working packages and an umbrella .mpkg to install all of
them in one fell swoop. I also got familiar with hdiutil for
putting it all into one disk image file. Currently I only
have HFS format images working; I was not able to get a UFS
image working like, for example, the ones from Apple. This
is probably not important, but I really would like to know
what I am doing wrong, as the Finder pops up when I insert
one of the images I did a newfs on and offers to reformat
it because the format cannot be recognized.

While experimenting with the packages I ran the GUI
PackageMaker quite often to compare what I did in my shell
script with what PackageMaker generates. Once I forgot to
set the default install dir, which meant it defaulted to /.
The package I was building also had bin, sbin, etc and var
directories like the ones in /, so after installing it for
testing I blew away my MacOS X installation. Oops.