Port CI

Overview

The build application will be the central place for queueing, monitoring, and viewing exp-run results. It will have an HTML and JSON REST interface. It will also display production package build status. This application will be the "master" for exp-runs.

New builds will be queued into the master. They may or may not have patches. Patches can be for the ports tree or src tree. Builds can be configured to do a comparison once completed against a known good build to find new failures.

Client builders will check in to the master and ask for work. They will occasionally report their status as well.

Lost/crashed builds will be retried a few times before being marked failed.

Problems with exp-run/build automation not solved here

A patch against a src or ports tree revision that the nightly reference is not using requires running a specific reference build for that patch. The master could try to find the "closest" reference, but that would produce many false positives. We don't want people looking at the list of failures and misjudging the results. We want confidence in the results so we do not introduce new failures into head.

Related to the previous item, a reference build would need to be run automatically when one is needed. This is left out of the initial design because it complicates things too much. In general, the nightly reference build has worked well, so asking the submitter to rebase the patch against a known reference revision seems reasonable for version 1.

Build types

Exp-run

Patch against the src or ports tree.

All ports will be built.

All existing packages will be deleted.

Completion will compare against a Reference build.

svn patch; bulk -ca

Reference

No Patch.

All ports will be built.

All existing packages will be deleted.

Completion will compare against previous Reference build.

bulk -ca

QAT

No patch.

All ports will be built.

All existing packages will be deleted.

Plist testing will be done.

Completion will compare against previous QAT build.

bulk -cat

Port

Patch against the src or ports tree.

Only the one port will be built.

Existing packages for that one port will be deleted.

Plist testing will be done.

svn patch; testport -n -o

Package

No patch.

All ports will be built.

Existing packages kept.

NO_PACKAGE packages not built, RESTRICTED packages cleaned up after bulk.

Completion will compare against the previous build for new failures and newly skipped ports. The list of built ports is not compared, as it will differ because the build is incremental.

bulk -a
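The build types above can be summarised as a small lookup table. This is only a sketch: the field names and the BuildType class are hypothetical, and the values are transcribed from the lists above. The Port entry has no comparison target because none is listed for it in this section.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BuildType:
    patched: bool                    # a src or ports patch is applied first
    all_ports: bool                  # build every port instead of a single one
    clean_packages: bool             # delete existing packages before building
    plist_testing: bool              # plist testing is done
    compare_against: Optional[str]   # build type used for the completion comparison
    command: str                     # command line as quoted in the text above


# Keys are hypothetical identifiers for the five build types.
BUILD_TYPES = {
    "exp-run":   BuildType(True,  True,  True,  False, "reference", "svn patch; bulk -ca"),
    "reference": BuildType(False, True,  True,  False, "reference", "bulk -ca"),
    "qat":       BuildType(False, True,  True,  True,  "qat",       "bulk -cat"),
    "port":      BuildType(True,  False, True,  True,  None,        "svn patch; testport -n -o"),
    "package":   BuildType(False, True,  False, False, "package",   "bulk -a"),
}


def patched_types():
    """Return the build types that carry a patch, in table order."""
    return [name for name, bt in BUILD_TYPES.items() if bt.patched]
```

A table like this would let the master and the builders agree on what each type implies without repeating the rules in several places.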

Authentication

All connections will require SSL.

The SSL cert for the master's REST port will be signed by our own CA, and the builders will validate the cert exclusively against that CA.

All requests and responses between builder/master will use a shared secret to ensure the commands are coming from known trusted clients.

Builder/Master

Builders will be authenticated with known tokens that can be revoked.

Each builder will be configured (on the builder machine) with the job types and archs it can handle. This will allow the package build machines to claim package build jobs, but never any kind of job that has a patch. This keeps the package builder secure in this system: it only takes an order to start a build and never executes any arbitrary data.
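A minimal sketch of the builder-side filter described above. The class names, field names, and arch values are assumptions made for illustration; the text does not specify how the configuration is stored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BuilderConfig:
    job_types: frozenset   # build types this builder is willing to run
    archs: frozenset       # architectures this builder can build for


@dataclass(frozen=True)
class QueuedBuild:
    build_type: str
    arch: str


def can_claim(cfg: BuilderConfig, build: QueuedBuild) -> bool:
    """Accept a build only if both its type and its arch are configured locally."""
    return build.build_type in cfg.job_types and build.arch in cfg.archs


# Hypothetical package builder: it only ever accepts unpatched package builds.
package_builder = BuilderConfig(
    job_types=frozenset({"package"}),
    archs=frozenset({"amd64", "i386"}),
)
```

Because the check runs on the builder with locally stored configuration, a compromised master cannot talk a package builder into running a patched job type.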

Builder registration

Portmgr will log in to the master and generate a token. This token will be displayed only once and never again. This is the shared secret for authenticating requests and must remain secure.

Portmgr will set up the new builder and specify the token and the master's SSL CA public key in its configuration.

Builder starts up and generates an id from uuidgen.

Builder connects to Master to register its id and hostname with the token.

Possibly uses DH1080 here to insert another shared secret that is never exchanged, for each to use in future communications.

Master registers the builder and associates its id and hostname with the token.

If token is already registered, deny registration.
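The registration steps above could look roughly like this on the master side. It is a sketch under assumptions: the Master class, its method names, and the in-memory dictionary are hypothetical, secrets.token_hex stands in for whatever token generator is used, and uuid.uuid4 stands in for uuidgen on the builder.

```python
import secrets
import uuid


class RegistrationError(Exception):
    """Raised when a builder may not register with the given token."""


class Master:
    def __init__(self):
        # token -> None while unused, or (builder_id, hostname) once registered
        self.tokens = {}

    def issue_token(self) -> str:
        """Generate a token for portmgr; it is returned once and not shown again."""
        token = secrets.token_hex(32)
        self.tokens[token] = None
        return token

    def register(self, token: str, builder_id: str, hostname: str) -> None:
        """Associate a builder id and hostname with a previously issued token."""
        if token not in self.tokens:
            raise RegistrationError("unknown token")
        if self.tokens[token] is not None:
            # The token is already bound to a builder, so deny registration.
            raise RegistrationError("token already registered")
        self.tokens[token] = (builder_id, hostname)


def new_builder_id() -> str:
    """Builder-side id generation; stands in for running uuidgen."""
    return str(uuid.uuid4())
```

Revoking a builder would then amount to deleting its token entry, after which its requests no longer validate.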

Builder requests

All builder requests to master will include its authentication key. An invalid key will be rejected. This is only so a rogue/unapproved builder cannot steal work.

Master responses

Master responses to the builder will include a validation key so that the builder can verify that the master is valid, which prevents the builder from accepting a rogue job. Note that this key is different from the request key, so that it cannot simply be replayed from the request. It does contain the same shared secret token, though. Any response not containing this valid key will be ignored.

response_key: sha256("response", builder hostname, uuid, token)
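A sketch of how the response key above might be computed and checked. The text only lists the fields that go into the hash, so the NUL separator, the UTF-8 encoding, and the hex output are assumptions. The request_key function is also hypothetical: the request key formula is not given, and it is assumed here to differ only in its label.

```python
import hashlib
import hmac


def _derive_key(label: str, hostname: str, builder_uuid: str, token: str) -> str:
    # Assumption: fields are joined with a NUL byte so that field boundaries
    # stay unambiguous; the design does not specify a separator.
    data = "\0".join((label, hostname, builder_uuid, token)).encode("utf-8")
    return hashlib.sha256(data).hexdigest()


def response_key(hostname: str, builder_uuid: str, token: str) -> str:
    """sha256("response", builder hostname, uuid, token) as hex."""
    return _derive_key("response", hostname, builder_uuid, token)


def request_key(hostname: str, builder_uuid: str, token: str) -> str:
    """Hypothetical request key: same fields, different label."""
    return _derive_key("request", hostname, builder_uuid, token)


def verify_response(received: str, hostname: str, builder_uuid: str, token: str) -> bool:
    """Builder-side check; compares in constant time to avoid leaking the key."""
    expected = response_key(hostname, builder_uuid, token)
    return hmac.compare_digest(received.encode("utf-8"), expected.encode("utf-8"))
```

Since the label is part of the hashed data, a key captured from a request does not validate as a response key, which is the replay protection the text describes.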

Notifications

Status updates for unapproved/start/stop/crashed/cancelled/finished/analysed jobs will go to IRC.

IRC

The IRC bot will be read-only. It will not allow queueing or approving of jobs, as there is no way to authenticate securely. Even with SSL, the IRCD is a weak point.

Gnats

Builds that have an associated PR will send an update to gnats when the build crashes, when it finishes, and once it is analysed. There may be a long delay between finished and analysed, which is why there is an extra notice.

Email

Final results for builds will be emailed to appropriate parties according to the build type.

Queue/Build process

Patch is uploaded by committer and configured for a specified build.

Only exp-run and port builds are available to non-portmgr.

Portmgr can queue any build type.

Build is marked unapproved until a portmgr approves it.

Builders check in frequently for work.

Once builder takes a build, a new job is created and assigned to the builder.

The build will be marked running. The builder will provide a Build URL back to the master so that the master can update the job object.

A patch id will be given in the response, along with checksum.

Builder will download associated patch from master if provided and compare checksum.

Build starts.

The builder will check in every 5 minutes and on boot.

If 4 check-ins are missed, the job is considered lost. The build's fail_cnt is incremented and its status is moved to retry so that it is retried.

A builder that sees a crashed or stale build on startup will notify the master of the failure. The build's fail_cnt is incremented and its status is moved to approved so that it is retried.

If a build crashes 3 times it is marked as failed and not retried again. It may be causing panics and should not continue to bring down builders.

A build can be cancelled at any time by a user. When the builder checks in, if the build that its job is running for has been cancelled, it will receive that notice and cancel its work. Once the cancellation is complete it will report back to the master, which will mark the job and build as cancelled.

When a job is completed the builder will notify the master and provide a URL to a tarball of its log files, along with checksum.

If the reported job is successful but the build has already been requeued/reassigned, then the new job should be aborted.

If the reported job is not finished, but is already considered lost, then the job will be aborted.

Master will download the logfiles, compare checksum, and then extract locally for later display and analysis.
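Both the patch download on the builder and the log tarball download on the master end with a checksum comparison. A minimal sketch of that step follows. The text does not name the checksum algorithm, so SHA-256 is an assumption, and the function names are hypothetical.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 65536) -> str:
    """Hash a file in chunks so large log tarballs need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: str, expected_checksum: str) -> bool:
    """Return True only if the downloaded file matches the advertised checksum."""
    return sha256_of(path) == expected_checksum.strip().lower()
```

A builder would refuse to apply a patch, and the master would refuse to extract a tarball, when verify_download returns False.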

Exp-run/Port process

When the build is completed, it will be marked as pending-analysis.

When the master has an adequate reference build available it will compare against pending-analysis builds, update their results, and then mark them as analysed.

Results will be mailed to any associated PR, portmgr, and the person who queued it.
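The comparison against a reference build can be thought of as a set difference over per-port results. The sketch below assumes each build's results are available as a mapping from port origin to a status string; that representation, the status names, and the function name are hypothetical. It also assumes that a port missing from the reference counts as a new failure.

```python
def new_failures(reference: dict, candidate: dict) -> list:
    """Ports that failed in the candidate build but did not fail in the reference.

    Both arguments map a port origin (e.g. "category/name") to a status
    string such as "built", "failed", or "skipped".
    """
    return sorted(
        port
        for port, status in candidate.items()
        if status == "failed" and reference.get(port) != "failed"
    )
```

Ports that already fail in the reference are filtered out, so the mailed results only contain failures that the patch may have introduced.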

QAT process

When the build is completed, it will be marked as pending-analysis.

When the master has an adequate reference build available it will compare against pending-analysis builds, update their results, and then mark them as analysed.

New failures will be mailed to ports@ and portmgr, potentially with a CC to all committers responsible for the commit range.

Master

Periodic checks

Check queue to find timed out jobs. Increment timeout_cnt for the job. Once the timeout count reaches 4, the job will be aborted, build's fail_cnt incremented and the build requeued by changing its status to retry.

Check Exp-run/Port jobs in the pending-analysis state and try to compare them against a Reference build. If a reference is not ready, try again later. If a reference is done, update the status to analysed and send job notifications.
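The timed-out job check above can be sketched as a function that the master runs once per check-in interval. Several details are assumptions: the Job and Build classes, the status strings other than retry and failed, resetting timeout_cnt when a fresh check-in is seen, and applying the three-failure limit to lost jobs as well as crashes.

```python
from dataclasses import dataclass

CHECKIN_INTERVAL = 5 * 60   # seconds; builders check in every 5 minutes
MAX_TIMEOUTS = 4            # missed check-ins before a job is considered lost
MAX_FAILURES = 3            # failures before a build is no longer retried


@dataclass
class Build:
    status: str = "running"
    fail_cnt: int = 0


@dataclass
class Job:
    build: Build
    last_checkin: float
    timeout_cnt: int = 0
    status: str = "running"


def periodic_check(job: Job, now: float) -> None:
    """One pass of the master's timeout check for a single job."""
    if job.status != "running":
        return
    if now - job.last_checkin <= CHECKIN_INTERVAL:
        job.timeout_cnt = 0          # assumption: a fresh check-in resets the count
        return
    job.timeout_cnt += 1
    if job.timeout_cnt >= MAX_TIMEOUTS:
        job.status = "aborted"
        job.build.fail_cnt += 1
        # Requeue the build unless it has already failed too many times.
        job.build.status = "failed" if job.build.fail_cnt >= MAX_FAILURES else "retry"
```

Keeping the counters on separate objects mirrors the text: timeout_cnt belongs to the job, while fail_cnt survives on the build across retried jobs.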