Project Lullaby

A few years before my appointment to the tech lead job, several of us in the org had been complaining about the tools used to build the ON source tree.

During his time as ON gatekeeper, James McPherson had become frustrated about the Makefiles used in the build, which were gradually growing out of control. New Makefiles would get created by copy/pasting older ones, and as a result, errors in poorly written Makefiles would propagate across the source tree, which was obviously a problem.

The ON build at the time also used to deposit built objects within the source tree, rather than to a separate build directory.

While that was convenient for developers, it meant some weird Makefile practices were needed to allow concurrent builds on NFS-mounted workspaces (keeping x86 and SPARC builds from writing to the same filesystem location), and generated sources (from code generators such as lex and yacc) could accidentally be committed to the repository. So another goal of the project was to change the build so that it wouldn’t attempt to modify the source tree.

Along with that, we had some pretty creaky shell scripts driving the build (primarily nightly.sh), which produced a single giant log file.

Worse still, the build was completely monolithic – if one phase of the build failed, you had to restart the entire thing. Also, we were running out of alphabet for the command line flags – seeing this getopts argument made my heart sink:

+ABCcDdFfGIilMmNnOPpRrS:TtUuWwxz

Something had to be done.

So, “Project Lullaby” was formed – to put “nightly.sh” to sleep!

James and Mark Nelson started work on the Makefiles, and I set to work writing build(1).

My targets were nightly.sh, and its configuration file, often called developer.sh, along with a script to produce an interactive build environment, bldenv.sh.

We chose Python as an implementation language, deciding that while we probably could have written another set of shell scripts, a higher-level language was likely required, and that turned out to be a great decision.

This work was begun as essentially a port of nightly.sh, but as the tool progressed, we found lots of ways to make the developer experience better.

First of all, the shell scripts which defined the build environment had to go. “developer.sh” essentially just set a series of UNIX environment variables, but did nothing to clean up the existing environment – this meant that two builds of the same source tree by different engineers could produce different results, and we ran into some nasty bugs that way.

Not being able to easily audit the contents of the developer.sh script was also bad: since the configuration file was essentially executable code, it wasn’t possible to determine what the exact build environment would be without executing it, and that meant that it was difficult to determine exactly what sort of build would be produced by a given configuration file.

I replaced developer.sh with a Python ConfigParser file, bldrc, and made build(1) responsible for generating the config file. This meant the version of build(1) in the workspace could always produce its own config file, so we’d never end up building the workspace with a mismatched version of the tools.

Since all bldrc files were generated the same way, it was easy to compare two files to see how the builds they would produce would differ, and easy to determine whether a build was ready to integrate (that is, all build checks had been run, and the build was clean).
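A minimal sketch of how such a generated config file might work, using Python’s ConfigParser – the section and variable names below are invented for illustration, since the real bldrc layout isn’t shown here:

```python
import configparser
import io

# Hypothetical bldrc defaults -- the real variable names are invented here.
DEFAULTS = {"build": {"debug": "false", "check_elf": "true", "parallel": "8"}}

def generate_bldrc():
    """Produce a default bldrc, the way a 'build generate' command might."""
    cp = configparser.ConfigParser()
    cp.read_dict(DEFAULTS)
    buf = io.StringIO()
    cp.write(buf)
    return buf.getvalue()

def diff_bldrc(text_a, text_b):
    """Return {(section, option): (a_value, b_value)} for options that differ."""
    a = configparser.ConfigParser()
    b = configparser.ConfigParser()
    a.read_string(text_a)
    b.read_string(text_b)
    diffs = {}
    for section in a.sections():
        for option in a.options(section):
            va = a.get(section, option)
            vb = b.get(section, option, fallback=None)
            if va != vb:
                diffs[(section, option)] = (va, vb)
    return diffs
```

Because every file comes from the same generator, a structural comparison like this reliably shows how two builds would differ.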

Early on in the invocation of build(1), we would also verify that the build machine itself was valid: that the correct compilers were being used, that sufficient swap was configured, etc. Of course, we also had packages to pull in the build-time dependencies that ought to be installed on all build machines, but a belt-and-braces approach resulted in fewer surprises – nothing’s worse than getting a few hours into a build only to discover you’re using the wrong compiler!
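The idea can be sketched as a set of pre-flight checks run before any compilation starts; the check names, version strings, and thresholds below are all invented for illustration:

```python
def check_compiler(version_banner, expected):
    """True if the compiler's version banner matches what the build expects."""
    return expected in version_banner

def check_swap(configured_mb, required_mb=2048):
    """True if the machine has enough swap configured for a full build."""
    return configured_mb >= required_mb

def preflight(version_banner, swap_mb, expected_cc="Studio 12.3"):
    """Return a list of problems; an empty list means the machine looks OK."""
    problems = []
    if not check_compiler(version_banner, expected_cc):
        problems.append("wrong compiler: %s" % version_banner)
    if not check_swap(swap_mb):
        problems.append("insufficient swap: %dMB" % swap_mb)
    return problems
```

Failing fast here costs seconds, versus the hours wasted by discovering the problem mid-build.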

Furthermore, we made sure that we’d complain about config files with values we didn’t recognise, and also removed almost all comments from the generated file, instead implementing a build explain command to document what each variable did.
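A sketch of how strict validation and a build explain-style lookup could pair up, driven by one table of known variables – the variable names and documentation strings here are invented:

```python
# Invented documentation table: maps each known variable to its docs.
DOCS = {
    "debug": "Build debug rather than non-debug objects.",
    "check_elf": "Run ELF checks over built objects.",
}

def explain(variable):
    """What a 'build explain <variable>' command might print."""
    return DOCS.get(variable, "unknown variable: %s" % variable)

def unknown_variables(config):
    """Given {variable: value}, return the variables we don't recognise."""
    return sorted(v for v in config if v not in DOCS)
```

One table serves both purposes: anything not documented is, by definition, not a valid config variable.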

Finally, we included a build regenerate command, allowing a new version of build to generate a new config file from any older one, so we could upgrade from older versions of the tool without necessarily needing to version the config file format itself.

For the interactive build environment, we wrote build shell (aliased to build -i), which produced exactly the same UNIX shell environment used by the rest of the build tool (before, nightly.sh and bldenv.sh could end up using different environments!). We made sure to properly sanitize the calling environment, passing through certain harmless but important variables such as $DISPLAY, $HOME, $PWD and $SSH_AGENT_*.
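That kind of allow-list sanitisation can be sketched in a few lines – the pass-through patterns come from the post, but the function itself is invented:

```python
import fnmatch
import os

# Harmless-but-important variables to pass through (patterns from the post).
PASS_THROUGH = ["DISPLAY", "HOME", "PWD", "SSH_AGENT_*"]

def sanitized_env(build_vars, source=None):
    """Build a clean environment: allow-listed caller vars plus the build's own."""
    source = os.environ if source is None else source
    env = {name: value for name, value in source.items()
           if any(fnmatch.fnmatch(name, pattern) for pattern in PASS_THROUGH)}
    env.update(build_vars)  # the build's own variables always win
    return env
```

Starting from an empty dictionary and copying in only what matches means a stray $CFLAGS or $PATH tweak in someone’s shell can never leak into the build.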

Having taken care of the build environment and config files, most of the rest of build(1) defined a series of build ‘tasks’ – some of them composite tasks, so “build nightly” ran “build setup”, “build install”, “build check”, and so on (this was just the Composite design pattern).
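That structure is a textbook Composite; a minimal sketch, with the task names taken from the post but the classes themselves invented:

```python
class Task:
    """A build task: either a single step or a composite of other tasks."""
    def __init__(self, name):
        self.name = name

    def run(self):
        raise NotImplementedError

class Step(Task):
    """A leaf task wrapping a single action."""
    def __init__(self, name, action):
        super().__init__(name)
        self.action = action

    def run(self):
        return [(self.name, self.action())]

class Composite(Task):
    """A task that simply runs its children in order."""
    def __init__(self, name, children):
        super().__init__(name)
        self.children = children

    def run(self):
        results = []
        for child in self.children:
            results.extend(child.run())
        return results

# 'build nightly' composed from the subcommands mentioned above
nightly = Composite("nightly", [
    Step("setup", lambda: "ok"),
    Step("install", lambda: "ok"),
    Step("check", lambda: "ok"),
])
```

The payoff of the pattern is that callers treat “build nightly” and “build setup” identically, and failed phases can be re-run individually instead of restarting the whole build.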

Each build task writes its own log file, and we used Python’s logging framework to produce both plaintext and syntax-highlighted HTML log files, each with useful timestamps, and the latter with href anchors so you could easily point at specific build failures.
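One way to get both formats from a single logging call is a pair of handlers on the same logger, with an HTML formatter that escapes each message and emits an anchor per record. This is an invented sketch, not the real implementation:

```python
import html
import logging

class HTMLFormatter(logging.Formatter):
    """Format each record as an anchored, escaped HTML span."""
    def __init__(self, fmt=None):
        super().__init__(fmt)
        self._count = 0

    def format(self, record):
        self._count += 1
        text = html.escape(super().format(record))
        return '<a id="msg%d"></a><span class="%s">%s</span>' % (
            self._count, record.levelname.lower(), text)

def task_logger(name, plain_path, html_path):
    """One logger writing both a plaintext and an HTML log for a build task."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    plain = logging.FileHandler(plain_path)
    plain.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    hypertext = logging.FileHandler(html_path)
    hypertext.setFormatter(HTMLFormatter("%(asctime)s %(message)s"))
    logger.addHandler(plain)
    logger.addHandler(hypertext)
    return logger
```

With per-record anchors, pointing a colleague at a specific failure is just a matter of sharing a URL like install.html#msg1234.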

We also made sure that, with few exceptions, all build tasks took the same command line arguments, reducing the cognitive load on developers learning how to build the source tree. Instead of adding arguments for slightly different flavours of a given command, we preferred to write a new build task (of course, using class-based inheritance under the hood).

We also had a few “party tricks” we were able to add – build tasks which didn’t produce build artifacts, but instead provided useful features to improve ON developers’ lives. For example, ‘build serve’ starts a simple ephemeral HTTP server pointing at the build logs in a workspace, allowing you to share logs with other engineers who might be able to fix a problem you’re seeing.
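A ‘build serve’-style ephemeral server can be sketched in a few lines of standard-library Python – the port selection and directory handling here are assumptions, not the real tool’s behaviour:

```python
import functools
import http.server
import threading

def serve_logs(log_dir, port=0):
    """Serve log_dir over HTTP on an ephemeral port; returns the server."""
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=log_dir)
    httpd = http.server.ThreadingHTTPServer(("0.0.0.0", port), handler)
    thread = threading.Thread(target=httpd.serve_forever, daemon=True)
    thread.start()
    return httpd  # httpd.server_address[1] is the chosen port
```

Binding to port 0 lets the OS pick a free port, so several workspaces on one machine can each serve their own logs; httpd.shutdown() tears the server down when you’re done.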

Similarly, we have a ‘build pkgserve’ task, which starts up an IPS repository server allowing you to easily install test machines over HTTP with the artifacts from your build.

“build pid” returned the process ID of the build command itself, and since all dmake(1S) invocations were run within a Solaris project(5), we were able to install a signal handler such that stopping an entire build was as easy as:

$ kill -TERM `build pid`
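The plumbing behind that might look roughly like this: record the build’s PID somewhere “build pid” can find it, and install a SIGTERM handler that tears everything down. The pid-file name and cleanup hook here are invented; the real tool signalled the whole Solaris project:

```python
import os
import signal

def install_stop_handler(pidfile, stop_build):
    """Record our PID and stop the whole build cleanly on SIGTERM."""
    with open(pidfile, "w") as f:
        f.write("%d\n" % os.getpid())

    def on_term(signum, frame):
        stop_build()  # e.g. signal every process in the build's project(5)
        raise SystemExit(1)

    signal.signal(signal.SIGTERM, on_term)
```

Because the handler runs cleanup before exiting, one kill -TERM tidily stops every dmake process the build had started.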

Finally, we added ZFS integration, such that before each build task was executed, we’d snapshot the workspace, allowing us to quickly rollback to the previous state of the workspace and fix any problems. This turned out not to be terribly useful by the time we’d shaken the bugs out of build(1) itself, but was incredibly helpful during its development.
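The ZFS side of that is just a couple of commands; sketched here with an injectable command runner (and an invented snapshot-naming scheme) so the logic can be exercised without a real pool:

```python
import subprocess

def snapshot_before(dataset, task, run=subprocess.check_call):
    """Snapshot the workspace dataset before running a build task."""
    snap = "%s@build-%s" % (dataset, task)
    run(["zfs", "snapshot", snap])
    return snap

def rollback_to(snap, run=subprocess.check_call):
    """Roll the workspace back to the pre-task snapshot."""
    run(["zfs", "rollback", "-r", snap])
```

In real use, run defaults to subprocess.check_call so a failed zfs command raises immediately rather than silently continuing the build.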

One more important artifact was the mail notification developers got when the build completed; we spent time improving its format so it was easier to determine which part of the build failed, and excerpted relevant messages from the build logs so users could tell at a glance where the issue was.

At the time of writing, here are all of the build(1) tasks we implemented:

timf@whero[123] build help -v
Usage: build [subcommand] [options]
build -i [options] [commands]
build --help [-v]
Subcommands:
all_tests (a synonym for 'check.all_tests')
archive Archive build products
check Run a series of checks on the source and proto trees
check.all_tests Run all tests known to the workspace
check.cores Look for core files dumped by build processes
check.cstyle Do cstyle and hdrchck across the source tree
check.ctf Check CTF data in the non-debug proto area
check.elf Run a series of checks on built ELF objects
check.elfsigncmp Determines whether elfsigncmp is used to sign binaries
check.findunref Find unused files in the source tree
check.fish Do checks across the fish subrepo
check.install-noise Looks for noise in the install.log file
check.lint Do a 'dmake lint' on the source tree
check.lint-noise Looks for noise in the check.lint.log file
check.parfait Run parfait analysis on a built workspace
check.paths Run checkpaths(1) on a built workspace
check.pmodes Run a pmodes check on a built workspace
check.protocmp Run protolist and protocmp on a built workspace
check.rti_ready Check that this build config can be submitted for RTI
check.splice Compare splice build repositories to baseline
check.tests Run tests for sources changed since the last build
check.uncommitted Look for untracked files in the workspace
check.wsdiff Run wsdiff(1) to compare this and the previous proto area
clobber Do a workspace clobber
closed_tarball Generates tarballs containing closed binaries
cscope (a synonym for 'xref')
explain Print documentation about any configuration variable
fish Build only the fish subrepo
fish.ai_iso Build 'nas' Fish AI iso images only
fish.conf Verify mkak options
fish.destroy_dc Remove datasets tagged 'onbld:dataset' at/under 'dc_dataset'
fish.gk_images Build Fish images appropriate for gk builds
fish.images Build 'nas' Fish upgrade images only
fish.install Build Fish sources, writing to the fish proto area
fish.jds_ai_iso Build 'jds' Fish AI iso images only
fish.jds_all_images Build all 'jds' Fish images
fish.jds_gk_images Build 'jds' Fish images appropriate for gk builds
fish.jds_images Build 'jds' Fish upgrade images only
fish.jds_txt_iso Build 'jds' Fish text iso images only
fish.nas_all_images Build all 'nas' Fish images
fish.packages Build Fish IPS package archives
fish.re_build Runs AK image construction tasks for Release Engineering
fish.save_artifacts Save all build artifacts from the 'dc_dataset' directory
fish.txt_iso Build 'nas' Fish text iso images only
generate Produce a default bldrc configuration file
generate_tpl Generate THIRDPARTYLICENSE files
help Print help text about one or all subcommands
here Runs a 'dmake ...' in the current directory
hgpull Do a simple hg pull for all repositories in this workspace
install Build OS sources, writing to the proto area
kerneltar Create a tarball of the kernel from a proto area.
nightly Do a build, running several other subcommands
packages Publish packages to local pkg(7) repositories
parfait_remind_db Generate a database needed by the parfait_remind pbchk
pid Print the PID of the build task executing for this workspace
pkgdiff Compare reference and resurfaced package repositories
pkgmerge Merge packages from one or more repositories
pkgserve Serve packages built from this workspace over HTTP
pkgsurf Resurface package repositories
pull Do a hg pull and report new changesets/heads
qnightly Do a nightly build only if new hg changesets are available
regenerate Regenerate a bldrc using args stored in a given bldrc
save_packages Move packages to $PKGSURFARCHIVE as a pkgsurf reference
serve Serve the contents of the log directory over HTTP
setup Do a 'dmake setup', required for 'install' and 'here'
test Runs tests matching test.d/*.cfg file or section names
tools Do a 'dmake bldtools', for 'setup', 'install', and 'here'
tstamp Update a build timestamp file
update_diag_db Download a new copy of the stackdb diagnostic database
update_diverge_db Generate AK/Solaris divergence database
update_parent Update a parent ws with data/proto from this workspace
xref Build xref databases for the workspace

I hope to talk about a few of these in more detail in future posts, but feel free to ask if you’ve any questions.

In the end, I’m quite proud of the work we did on Lullaby – the build is significantly easier to use, the results are easier to understand, and since the Lullaby project integrated in 2014, we’ve found it very simple to maintain and extend.

However, after we integrated, I have a feeling the folks looking for a new ON tech lead decided to give me a vested interest in continuing to work on it, and so, “Party Like a Tech Lead” began!