Fortified Bikesheds

Wednesday, May 25, 2016

In the past I've been quite skeptical of the "best practice" of tagging builds. Build tags clutter up your repository and can cause real performance issues in large git repositories with a large number of builds.

I am evolving on this issue though, and have come to appreciate build tags for an unexpected reason: they are a wonderful place to stash build metadata.

Any complex build will have to resolve dependency graphs. Modern build systems seem to encourage the use of "dynamic" dependencies, which have the advantage of requiring less effort to specify, and automatically "upgrade" as new releases become available.

The disadvantage of dynamic dependencies is that they lead to non-reproducible builds, and also often lead to unexpected breakage when some third party produces a buggy release or breaks the implicit contract in the semantic version scheme.

Therefore, it is generally good practice to archive the results of a dependency resolution pass someplace, so that you can always revert to a known good version, and also reliably reproduce a specific build if needed.

Most build systems have some form of support for this. Node, for example, has npm shrinkwrap, and Gradle has a dependency lock plugin. All these mechanisms dump the actual resolution to a file, and can later use that file to impose a predetermined resolution on a new build.

In practice, though, this is hard to do. If you check the file into git, then your build may have to resolve merge conflicts on that file, which can be difficult to automate. You also lose the implicit linkage between the git commit that was built and the contents of the file. In other words, if you change the dynamic dependency specification, then the contents of the persisted file no longer have any relationship to that specification, so the file should really be deleted. Nobody will ever remember to do that...

I think that attaching the resolution file as an annotation to a git tag is the right solution. It has all the correct properties:

It is unique to the git commit used for the build.

There are no merge conflicts, and the tag can easily be applied even if newer git commits exist.

The presence or absence of a build tag on a git commit can be detected automatically and used to decide whether the predetermined resolution should be used (i.e. when we are rebuilding an old build) or whether a new resolution needs to be done.

If the result of the resolution becomes invalid, you can simply delete the build tag.
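A rough sketch of the mechanics, under a few assumptions that are mine and purely illustrative: build tags are annotated tags named build-*, the resolution lives in a single lock file, and the function names record_resolution and stored_resolution are made up. The git options used (--cleanup, -F, --points-at, for-each-ref --format) are standard.

```shell
# Attach a resolution file to a commit as the annotation of a build tag.
# --cleanup=verbatim stops git from stripping '#' lines or blank lines.
record_resolution() {
    commit=$1; tag=$2; lockfile=$3
    git tag -a --cleanup=verbatim -F "$lockfile" "$tag" "$commit"
}

# Print the resolution stored for a commit and return 0, or return 1 when
# no build tag points at it (meaning: run a fresh resolution).
stored_resolution() {
    commit=$1
    # 'build-*' is an assumed naming convention, not something git defines.
    tag=$(git tag --points-at "$commit" --list 'build-*' | head -n 1)
    [ -n "$tag" ] || return 1
    git for-each-ref --format='%(contents)' "refs/tags/$tag"
}
```

A build script could then run something like stored_resolution "$commit" > deps.lock and fall back to a fresh resolution when that returns non-zero. Invalidating a stored resolution is just git tag -d on the build tag.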

So how does one deal with the clutter of many build tags?

One solution is to delete old build tags - but that is kind of sad, as you lose source data for a variety of historical stats. If you care about the stats, you suddenly need to find a different place to store the data.

Fortunately, git is a very versatile system, and there are alternatives. For example, you can stash them in a different ref on your central server. So when you apply a tag, you do this:

% git tag -m 'Annotation' MYTAG

% git push origin refs/tags/MYTAG:refs/builds/tags/MYTAG

Normally, you'd push them into refs/tags/, but by pushing them into refs/builds/tags/, they will not be "seen" by the default refspec, so normal developer git pull commands stay fast and efficient.

If they want to see the build tags, they can pull them in trivially using this:

% git fetch origin 'refs/builds/tags/*:refs/tags/*'

And if they want to remove them again from their local repo, they run:

Friday, June 5, 2015

I'm going to define "positive" build avoidance as avoiding a rebuild when the artifact has already been built. "Negative" build avoidance is avoiding a rebuild of something that has failed to build before.

Positive build avoidance has been around for quite some time, and is usually easy to implement: simply check if the target artifact exists and has been created from your current source set. This can range from make's naive timestamp check (if the target is newer than its sources, it must have been built from them) to more sophisticated checksum or hash signature checks.
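To make the two flavors concrete, here is a minimal sketch of both checks. The function names and the "<target>.srcsum" sidecar file are invented for illustration, and sha1sum from GNU coreutils is assumed to be available; real build tools are considerably more careful than this.

```shell
# make-style check: stale if the target is missing or any source is newer.
stale_by_timestamp() {
    target=$1; shift
    [ -e "$target" ] || return 0
    for src in "$@"; do
        [ "$src" -nt "$target" ] && return 0
    done
    return 1
}

# Checksum-style check: stale unless the digest recorded next to the target
# (in the made-up sidecar file "<target>.srcsum") matches the current sources.
stale_by_checksum() {
    target=$1; shift
    [ -e "$target" ] && [ -e "$target.srcsum" ] || return 0
    current=$(sha1sum "$@" | sha1sum | cut -d' ' -f1)
    [ "$current" != "$(cat "$target.srcsum")" ]
}
```

Both functions return 0 when a rebuild is needed, so they can be used directly in an if statement. The timestamp variant is fooled by a plain touch; the checksum variant is not.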

I don't know of any system that does negative build avoidance, so I'm building one.

I happen to already have a system which does positive build avoidance by storing build artifacts using a version computed from the git tree hashes of all its source components. I recently added an artifact that gets published on every build, no matter whether it failed or succeeded: the build log. Besides impressing auditors, this actually allows me to implement negative build avoidance:

If all artifacts are present, positive build avoidance as usual, no need to rebuild;

If artifacts are missing, but the build log artifact is present, negative build avoidance occurs, and I do not even bother to attempt to redo a build that is known to fail;

If artifacts are missing and the build log is missing, then the build either didn't occur or crashed in the middle. In that case, schedule the build.

The nice thing about implementing negative build avoidance is that it allows me to deal with unstable build farms (and any Jenkins build farm of substantial size is bound to be unstable). Simply keep recomputing the build schedule after every pass until every build has either succeeded or failed with a build log. It makes the build team look good, since all build failures are now clearly the developer's fault.
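The three cases and the retry loop can be sketched roughly as follows. This is not my actual system: the directory layout (one directory per computed version, with the log stored as build.log), the function names, and the cap of five attempts are all assumptions made for the example.

```shell
# Classify one build. $1 is the directory holding everything published for
# one version; the remaining arguments are the artifact names expected there.
build_state() {
    dir=$1; shift
    missing=0
    for artifact in "$@"; do
        [ -e "$dir/$artifact" ] || missing=1
    done
    if [ "$missing" -eq 0 ]; then
        echo built      # positive build avoidance: nothing to do
    elif [ -e "$dir/build.log" ]; then
        echo failed     # negative build avoidance: known to fail, skip it
    else
        echo pending    # never ran, or crashed mid-build: schedule it
    fi
}

# Keep scheduling a build until it settles as "built" or "failed".
# $1 names a command that runs the build (a stand-in for the build farm).
build_until_settled() {
    run_build=$1; dir=$2; shift 2
    attempts=0
    while [ "$(build_state "$dir" "$@")" = pending ]; do
        attempts=$((attempts + 1))
        [ "$attempts" -le 5 ] || return 1   # arbitrary cap, an assumption
        "$run_build" "$dir" || true         # farm hiccups are expected
    done
    build_state "$dir" "$@"
}
```

The important property is that a crash leaves neither artifacts nor a log, so the state stays "pending" and the build is retried, while a genuine compile failure publishes a log and is never retried.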

Sunday, January 4, 2015

In part one, I described how we can and should be computing the version strings for build artifacts from the hashes of the source files used to build them. Now we need to devise a way to implement this strategy.

The maven build system, in spite of its many shortcomings, does have the right idea: the POM (Project Object Model).

It's intended to be declarative. You describe the artifacts and their dependencies.

Sadly, the implementation really does have a lot of shortcomings:

It's in XML, making it very tedious to read and manipulate.

It is too Java-centric, and generally too concerned with Java-specific implementation details.

In spite of initially being declarative, it has too many procedural details, mainly around managing versions, which is the one thing we wish to avoid here.

In the spirit of taking the best parts and leaving behind the bad parts, I decided to implement a POM-like document: the Bill of Materials.

I very explicitly separate out any build procedural details by simply referencing the build scripts explicitly as an artifact property. In other words: "I don't care how you produce the artifact, just tell me where it is when the build script is done".

Doing this opens up more refactoring opportunities: note that the ${Attribute} references are evaluated after the itemized list is constructed, so it is totally OK to reference an ${Attribute} even if it is not defined at that same level:

I have found it convenient to have two ways to declare dependencies between artifacts.

Declare upstream dependencies in the classic maven way, by saying Requires: <artifact>. This method is useful for the classic shared code dependencies.

Declare downstream dependencies using a DeployTo: entry. This method is useful for build flow dependencies, for example to aggregate and validate build results and test results, or to bundle a bunch of individual pieces into an installer.

Example of the first type: refactor code to create a separate artifact for the shared code portion:

Sunday, November 2, 2014

Artifacts belong in an artifact repo, which can be anything from a shared file system to more sophisticated artifact servers, for example Sonatype Nexus or Artifactory.

What Are We Solving?

A big challenge for getting fast build turnaround is how to avoid rebuilding artifacts that have already been built before.

Most modern build systems have a way to define dependencies on prebuilt artifacts. Maven, for example, defines artifacts via a triplet of group-id, artifact-id and version. Simply listing all these GAV triplets (sometimes expanded via classifiers and extensions) in your Maven POM (Project Object Model) file will cause the build to retrieve and use the desired prebuilt artifacts.

The problem with this approach is that in a larger system, developers will usually work on multiple pieces at the same time. They then need to choose whether they want the pieces to all live together in the same source tree as "sub-projects", or whether they wish to "publish" the artifacts in a separate build and then retrieve them in a different build.

Arguably, you want both. If you don't care about your dependencies, you just want to fly with the prebuilt ones and not worry. If you do care, you want to check them out and integrate your changes seamlessly into the build.

Maven's "solution" to this problem is the "SNAPSHOT" dependency. You publish every artifact using a version string ending in "SNAPSHOT". This is a marker where all artifacts published by the same build get stored as the same "version" in your artifact repository. This version is usually "opaque", that is you don't know and shouldn't care which snapshot you are getting as long as it's a consistent set.

Sadly, the SNAPSHOT solution has several problems:

The definition of a "consistent set" is more like a happenstance. If it's the same build invocation, you get them - if your build requires multiple invocations (for example because you need to build on multiple architectures), it stops working right. In other words, there is no way to collect artifacts built from a single set of sources - you may get whatever happens to get uploaded last.

Since artifacts are cached, how the cache is refreshed matters. You cannot rely on using the latest artifacts to resolve your dependencies except if you clear the cache. And even if you get the latest, see above.

When you finalize your release, you need to go through all the version strings and remove the SNAPSHOT keyword, rebuild, publish the release, bump the version and then add the SNAPSHOT keywords back. The infamous maven release plugin will help with that, but in practice this is a lot of churn, and the whole point of using prebuilt artifacts is lost.

Snapshots piggy-back on a semantic version scheme which implies an ordering (v2.3.1 is newer than v2.3.0), so you have no way to support code branches where you don't know or can't commit to any ordering. For example, if you use gerrit as your review system, you have no way of knowing the order in which changes will get merged, but you still need to get efficient validation builds done.

Now, in practice, it often works out fine, but wouldn't it be nice if we had a way to really know what we're getting and why? And wouldn't it be nice if we could just stop fiddling with the version strings?

Why Bother with Prebuilt Artifacts?

Besides saving build time, it soothes the nerves of your QA folks. If they know an artifact hasn't changed (and they know this because it is bitwise identical to their previously tested one), and some problem occurred, then they can start by looking at the artifacts that changed instead of having to examine all of them. Obviously, the bug may have been latent in one of the old artifacts and just triggered by a new usage in the newer ones, but even then looking at the trigger makes troubleshooting a lot simpler.

Finally, rebuilding from the same source never guarantees the same outcome, especially if a lot of time elapsed between now and the previous rebuild.

So, What Can We Do?

Well, we can retrace some of the history of what made git so successful. We have to give up on "readable" version strings. Git replaced the revision number with a SHA1 hash. We can do the same.

A big part of the git magic is achieved by computing SHA1 sums of the file content and accumulating them over the various subdirectories in your source tree. You can easily retrieve those SHA1 sums using the following command:

% git rev-parse "$ref:$path"

Here, $ref is any commit reference - either a commit hash, a branch name, a tag, or just HEAD. $path is simply the relative path from the top of the git repo to your file or directory. Note that the whole tree is represented by the empty string, not ".".

Since git computes and recomputes these hashes on every commit, rev-parse simply retrieves precomputed values and is therefore very fast.

The next step is to understand your build process. Unless you have a very small project, most build artifacts are generated from a subset of your git repository. The trick is to know which subset it is, and to record this information someplace.

In some build systems (most of the "sane" ones at least), this information is (or should be) encoded in the build script itself. You should simply be able to extract the file list from it.

Sadly, very few shops use sane build systems.

The next best thing is to at least figure out a rough partitioning of your source tree, mapping whole directories if needed.

Some folks go all the way and split their git repos into independent projects, thereby trading one complexity (figuring out the subtrees) for a different complexity (figuring out how to manage collections of git repositories consistently), but the principles used here still apply.

Regardless of how you partition your source tree, you can go and list the file and directory locations required to build your artifact. You then get the SHA1s of those locations from git using the command above, sort them and re-hash them into a new SHA1. That is going to be the version of your artifact.
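A minimal sketch of that computation, assuming sha1sum from GNU coreutils is available. The function names combine_hashes and artifact_version are made up for this example, and the exact way the sorted hashes are fed into the final SHA1 (one per line) is just one reasonable choice.

```shell
# Fold a list of object hashes (one per line on stdin) into one version.
# Sorting first makes the result independent of the order of the locations.
combine_hashes() {
    LC_ALL=C sort | sha1sum | cut -d' ' -f1
}

# Version of an artifact built from the given locations at the given ref.
# Pass "" as a location to cover the whole tree.
artifact_version() {
    ref=$1; shift
    hashes=
    for location in "$@"; do
        h=$(git rev-parse --verify "$ref:$location") || return 1
        hashes="$hashes$h
"
    done
    printf '%s' "$hashes" | combine_hashes
}
```

For example, artifact_version HEAD src/core src/common prints a 40-character version that changes only when something under those two directories changes.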

This method is fast, and has some significant advantages:

Assuming you keep the mapping of source tree locations to artifacts in a file under git, you can compute the versions the artifacts would have, given any commit, tag, branch, ref, whatever.

The build system can check if the artifact already is in your artifact repository. If yes, don't rebuild it. If not, go ahead, and publish it to the repository.

A release becomes an atomic non-event. Simply apply a git tag, done. Any deploy script simply performs the computation of the version strings, given the tag, and retrieves the appropriate artifacts from the repository.
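The "check the repository, build only if missing" step might look like the sketch below when the artifact repo is a shared file system. The layout $store/<name>/<version>/ and the function name ensure_artifact are assumptions for illustration; an artifact server such as Nexus or Artifactory would need its own client calls instead.

```shell
# Reuse an artifact if its version is already published, otherwise build it
# into a scratch directory and publish the result.
# Usage: ensure_artifact <store> <name> <version> <build command...>
# The build command is called with the scratch directory as its last argument.
ensure_artifact() {
    store=$1; name=$2; version=$3; shift 3
    dest="$store/$name/$version"
    if [ -d "$dest" ]; then
        echo "reuse $dest"
        return 0
    fi
    staging=$(mktemp -d) || return 1
    "$@" "$staging" || { rm -rf "$staging"; return 1; }
    # Publish only after a successful build, so a half-built directory
    # never appears under its final name.
    mkdir -p "$store/$name" && mv "$staging" "$dest" && echo "built $dest"
}
```

A deploy script would do the lookup half of this only: compute the version for the release tag and copy from the corresponding directory.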

What About Dependencies?

The short answer is to declare the dependencies in the file which maps source locations to artifacts, and when it comes time to compute the version strings, expand those dependencies to produce the transitive closure of the file locations used to build your artifact.
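As a sketch of that expansion, assume a mapping file with one artifact per line in the form name|locations|required artifacts, with the locations and requirements separated by spaces. This format and the function name closure_locations are invented for the example, and artifact names and locations are assumed to contain no spaces or glob characters.

```shell
# Print the sorted, de-duplicated source locations of the given artifacts
# and of everything they require, directly or indirectly.
# Usage: closure_locations <mapping file> <artifact>...
closure_locations() {
    map=$1; shift
    todo="$*"; seen=" "; out=
    while :; do
        set -- $todo                    # split the work list into words
        [ $# -gt 0 ] || break
        name=$1; shift; todo="$*"
        case "$seen" in *" $name "*) continue ;; esac   # already expanded
        seen="$seen$name "
        line=$(awk -F'|' -v n="$name" '$1 == n' "$map")
        [ -n "$line" ] || { echo "unknown artifact: $name" >&2; return 1; }
        for location in $(printf '%s\n' "$line" | cut -d'|' -f2); do
            out="$out$location
"
        done
        todo="$todo $(printf '%s\n' "$line" | cut -d'|' -f3)"
    done
    printf '%s' "$out" | LC_ALL=C sort -u
}
```

The resulting list of locations is what gets hashed to produce the version string, so a change in a shared library's sources changes the version of every artifact that requires it. The "seen" list also keeps the loop from spinning on a dependency cycle.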

Story So Far

The goal of the first batch of posts is to describe a software build and release process which assumes that builds are expensive to perform and not necessarily reproducible, and also assumes a large number of independent development teams all working on some grand piece of software.

Good release management starts at the source, so the first few postings deal with source code control and change management, and how to mine the change data correctly.

Once we can reliably build stuff, we need to manage the build products, the artifacts, so they can be reused, tested and released.