Sunday, May 21, 2017

The Problem

On my current team, we use a common tool stack of Jenkins, Maven, Robot Framework, and Java. Our build system is centralized, with a couple of dozen projects. We have suffered build and developer friction, namely:

Our version of Jenkins is older than the Internet and cannot be easily upgraded

Upgrading a plugin/tool risks incompatibility with other plugins/tools

We have accumulated snippets of shell scripts embedded in POM files

Developers sometimes use our continuous integration (CI) system to compile and test code

We sometimes have builds that pass locally but fail in CI

These and more are the Usual Suspects on almost any build system that has grown over time. Having been on this team for around two years now, I have identified a direction we should take to modernize our build system.

A Solution

Docker and containerized builds have been around since at least 2013. There are some compelling reasons to consider using containerized builds. This is my attempt to clearly enumerate the advantages and to convince my chain of command to make the jump. Hopefully it will help readers do the same.

Moving to containers means we can deploy onto a cloud with minimal work, which can address scaling issues effectively. Note that some builds will still depend on lab access to local devices, and those dependencies will not scale.

Containerizing the build pipeline means easier upgrading. For example, running a component in its own container isolates it, so other containers that depend on it are forced through a standard, explicit interface.
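As a sketch of what that explicit interface could look like in practice - the service names, images, and ports below are made up for illustration, not our actual stack - a compose file forces each dependency through declared ports:

```yaml
# docker-compose.yml - hypothetical services for illustration only
version: "3"
services:
  artifact-repo:
    image: sonatype/nexus3      # an assumed repository manager
    ports:
      - "8081:8081"             # the one sanctioned way in
  build-agent:
    image: maven:3.5-jdk-8      # an assumed toolchain image
    depends_on:
      - artifact-repo
```

Swapping out `artifact-repo` for a newer version then becomes a one-line change, and nothing else can have grown a hidden dependency on its internals.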

Containerizing the build means better partitioning. Instead of making environments that contain all possible tools and libraries, each container carries only those needed for its specific purpose. This has the side effect of reducing defects caused by interacting third-party components.

Containerizing the build means a clean reset state. Instead of writing reset scripts, the container is discarded and resurrected from a fixed image. This is a phoenix (burn-it-down), immutable view of containers, and it forces all build and configuration steps to be explicit rather than accumulating in snowflake instances.

Containerizing means 100% consistency between local development and the build system, which should eliminate the but-it-works-on-my-machine problem.
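For example, a single image definition can pin the whole toolchain. The base tag and extra tools here are hypothetical, not our actual stack:

```dockerfile
# Hypothetical build image - versions and tools are illustrative
FROM maven:3.5-jdk-8

# Bake in exactly the tools the build needs, nothing more
RUN apt-get update \
    && apt-get install -y --no-install-recommends python-pip \
    && pip install robotframework \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /build
CMD ["mvn", "clean", "verify"]
```

A developer and the CI server would then both run something like `docker run --rm -v "$PWD":/build my-build-image`, so the environment is identical in both places - and `--rm` gives you the phoenix reset for free.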

Containerizing the build means effective postmortems for build failures, potentially leaving failed runs in the exact state of failure rather than relying solely on extracted log files.

Containerizing the build means building and installing an intermediate, shared artifact once, instead of 52 times, which can speed up the build cycle.
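One way to get that build-once behavior is a shared base stage that downstream project images start from. This is only a sketch, with made-up project names:

```dockerfile
# Hypothetical layering - project names are invented
FROM maven:3.5-jdk-8 AS shared
COPY shared-lib/ /src/shared-lib/
# Build and install the shared artifact exactly once
RUN mvn -q -f /src/shared-lib/pom.xml install

# Each downstream project starts from the stage that already has the
# artifact in its local repository, so the work is never repeated
FROM shared AS project-a
COPY project-a/ /src/project-a/
RUN mvn -q -f /src/project-a/pom.xml verify
```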

Containerizing the build means that some tests can make checkpoints via a throwaway image/container and roll back to that state rather than tearing down and rebuilding, affording a quicker build.

Judicious use of containers might help with diagnosing hard-to-reproduce issues in the field. I have seen instances of technical support sending and receiving VM images to and from customers; containers would be simpler to exchange and could be a lot smaller.

Containerizing the build is considered a modern best practice and affords access to all kinds of build workflow and management tools.

That's it! Good luck convincing your management when the time comes to modernize your build systems.

Monday, April 3, 2017

Recently I was given the opportunity to work on an open source project. My team had developed a new feature, and we were eager to get it pushed upstream before a code freeze milestone. Our team was closed source and our project was proprietary, so this effort was essentially to push a portion of it upstream as open source.

We decided to do three pushes due to various time and resource constraints: first the basic feature, then an enhancement, then some automated integration tests. Our team was not familiar with the process or requirements of the upstream project, but two-thirds of the code had been written under contract to us by a former major committer to that upstream project.

The push of the basic feature was rougher than I expected, given the authorship of much of the code. Most of the issues were stylistic; our reliance on the contractor to "do the right thing" had been a bit shaky. After some corrective actions, the first merge happened. Based on the code review issues raised on the first push, I revised the second batch of code and submitted a pull request. As expected, a few minor issues were raised, which I corrected.

So this project has checkstyle enabled at warning level. But the build is so full of noise and warnings (that deserves another rant of its own) that developers tend to ignore warnings, especially those involving things like whitespace, which checkstyle flags a lot.

So I submitted the pull request and sort of expected an approval and merge. This was highly desirable because the code freeze deadline was only a couple of days away. Instead what I got was a request to move the checkstyle setting from warning level (the project default) to error level.

This was a stunning request for several reasons.

* First, making the code freeze deadline was now unlikely.

* Second, this would be a new requirement on our code that was not met by any of the hundreds of thousands of lines of code that already existed. Plus our previous commit was accepted without that obligation.

* Third, the time to enable something like checkstyle is day zero of a project. Turn it on then and let issues be fixed as part of every single commit. If you have never turned an analyzer like checkstyle loose on an unsuspecting code base, the results are dismal. Our commit was probably a few thousand lines of code, but checkstyle emitted about 250 "errors". These included such gems as "there is a space between your method name and the parameter list". Manually fixing all of these would be tedious and expensive - probably several days of work.
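For reference, escalating checkstyle from warnings to build-breaking errors in Maven is a small pom.xml change. This fragment is a hypothetical illustration, not the upstream project's actual configuration:

```xml
<!-- Hypothetical pom.xml fragment; the real project's setup may differ -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-checkstyle-plugin</artifactId>
  <configuration>
    <!-- treat warning-level checks as violations that fail the build -->
    <violationSeverity>warning</violationSeverity>
    <failOnViolation>true</failOnViolation>
  </configuration>
  <executions>
    <execution>
      <goals><goal>check</goal></goals>
    </execution>
  </executions>
</plugin>
```

A one-line-ish change for the reviewer to request; 250 manual fixes for us to deliver.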

There were a half-dozen "functional" issues, like "unused import" and "catching naked exceptions", that were entirely reasonable to ask us to fix. Those functional fixes were done, and the new pull request has now been awaiting approval for three days. I guess in open source, not everyone has a sense of urgency. Complicating this, your reviewer might actually work for a competing company, so politics might intrude.

Some observations...

I wanted to point out an issue of open versus closed source. In the past, on smaller closed source teams, we never enforced cosmetic coding standards. Everyone's slightly different coding style acted as a fingerprint, so you could tell without looking who the developer was. But in an open source context, where there can be large numbers of developers, it makes a lot of sense to have strict coding standards, and enforcement of those standards should be enabled by default.

Readability is subjective. The wars over bracket placement and other whitespace issues happen all the time and clearly have no real impact on the business. To me, a perfect solution would be for whitespace to be managed by the tools, locally to each developer. For example, a developer's whitespace style is applied to the code base upon "get", does not show up as a "diff", and is removed as part of a "push".

The imperfect solution is the best we have today. You can get plug-ins for IntelliJ or Eclipse that show you violations in "real time" so that you can take instant corrective action. Unfortunately, there is no comprehensive code reformatting tool to manage style violations; many of them are amenable to automated fixes, but not all of them.

We have some ways to go before we can have our cake and eat it too. Maybe it will happen when we have spent enough expensive developer time chasing down that stray space in our code-under-review.

Monday, March 27, 2017

Recently I had to commit work upstream to the public OpenDaylight project. As part of that effort, I had to learn some new tools, particularly Gerrit, a code review/management tool that works with Git, which I was already familiar with.

Here are some ad hoc things I learned as part of committing work upstream...

Gerrit (via its CI plugin) has some non-obvious controls, namely that you can kick off another build by putting "recheck" as your comment. Other such comment triggers are "rerun" and "verify".

Upstream coders usually add the prefix "WIP: " to their bug report to let other developers know things are not ready for prime-time review yet.

Mechanically, you can review the issues by using the "Diff against:" drop-down list to pick a revision to start at, then going into the first file and using the green upper-right arrows to navigate between files. (It would be nice if Gerrit had a "next feedback item" navigation.)

The master branch on ODL can and does break, causing lots of downtime with broken builds (days, even).

Full builds take 10 hours and verifying-before-pushing is an evolving process.

If you are using "git review" and you forgot the "--amend" option on your commit, you will be prompted to squash and will end up with a new branch. You can recover by using "git rebase -i HEAD~2", then pushing with the force option, and then abandoning the new branch.

Along with master breaking, master can also produce stale artifacts, so don't assume that clearing your m2 cache, creating a subproject repo, and building will give you up-to-date code.

The patch workflow is not very good for having multiple people work on an item; traditional git branches and management are superior in this respect, so expect some pain.

If your patch won't build because master is broken, you can rebase it to pick up a fix temporarily. When you then do a "git review" to push it, you will be prompted because there are two commits (your new patch revision and the fix); you do not want to merge that fix into your patch.

Hopefully, some of these tips will find their way to helping someone else along the way.