I’d like to propose a fundamental law of configuration management: the cost of an integration increases over time. This is similar to the well-known software engineering observation that the cost of fixing a bug increases over time.

Let’s start with a simple example: a single project with just 2 engineers, where each engineer commits a single change once per day. Now suppose that both engineers, for some reason, decide to start committing their code in batches of 5 changes once per week instead. I’m not sure why they would do this; I see large benefits to keeping commits small.

Here are the consequences I would foresee:

A reduction in per-commit overhead by batching up 5 changes into a single larger commit.

Increased communication overhead: a revision control system is a formalized way for engineers to communicate without having to send emails, etc. In particular, the change descriptions, if well-written, help keep the other team members informed about what is going on. Frequent commits also make conversations like “watch out, I’m working on some big changes to file X” less necessary.

Increased redundant work: both engineers might fix the same bug in their own local trees rather than picking up the change from the other engineer.

A larger number of merge conflicts. At the risk of misapplying statistics and making a vast number of simplifying assumptions: if each change touches 5% of the total lines of code, if changes are randomly distributed in the code, and if the chance that two changes conflict is the product of the fractions of the code they each touch, then the probability of a merge conflict was about 1.2% per week before and is about 5.1% per week now.

Incompatible changes: both engineers might choose to rewrite the same block of code in two different and inconsistent ways. This will show up as a merge conflict, but it’s worse than a plain old merge conflict: you’re not just doing a textual merge, you’re trying to reconcile two conflicting visions of how the code should work after the fact, and throwing away a good chunk of the work. Had the first rewrite been committed more promptly, the second rewrite might have been avoided.

New bugs are discovered and fixed later: if the first engineer’s changes introduce a bug that impacts the second engineer’s work, the bug might be discovered a week later rather than a day later. Standard software engineering literature suggests that bugs cost more to fix over time.

Increased probability of losing work: once a change is committed, it’s saved in the repository and won’t be lost to an accidental “rm -rf” or a hardware failure (assuming that the repository itself is being backed up appropriately). Work that sits uncommitted for a week has no such protection.

Unless you’re extremely worried about per-commit overhead (in which case I would suggest that you have bigger process problems you need to address), this is definitely not a good thing.

Merge conflicts in particular are more dangerous than a lot of people realize. In software, it is not necessarily true that the correct way to combine two changes is to perform a textual merge of the source code. It is dangerous to assume that simply because a textual merge did not detect any obvious conflicts, you are all set!

To perform a correct merge, you need to understand what has been changed and why. Many engineers have a bad habit of being careless on merges: they let down their guard. Merges are just as real as any other change, and we cannot assume that two changes that worked independently will also work together.

Of course, if the textual merge does detect a conflict, the risks are far greater. An automated merge won’t get tired or make a typo. A human can and sometimes will. If the conflicts are nontrivial, as in the case of two engineers rewriting the same code, merges can be some of the most dangerous changes of all.

So far I’m not really saying anything new here. It’s pretty standard advice that engineers should commit code no less than once a day, even if only to reduce the risk of losing code by accident. Also, there is a lot of literature on the benefits of “continuous integration” as opposed to “Big Bang integration”, or on releasing your software “early and often.”

At the same time, a lot of supposed proponents of continuous integration seem to talk the talk better than they walk the walk. You will find a lot of these same people advocating such things as:

development branches, where different groups of engineers working on different features commit to different branches/codelines, rather than sharing a single “main” or “trunk” development branch

“distributed version control systems”, which are development branches taken to another level (all changes are effectively developed in a development branch, and no “main” branch even exists except by convention)

branching and releasing each component of your project separately, rather than putting all components under a single “main” or “trunk” and branching them simultaneously

I contend that, by delaying integrations, these practices are steps back in the direction of “Big Bang integration” and that they increase the total cost of integrations.

Consider development branches, where several engineers go off and work on different features in different branches rather than working concurrently in a single main branch. Nearly all the same risks I listed above for committing once a week rather than once a day apply here also: communication overhead, redundant work, merge conflicts, incompatible changes, bugs discovered and fixed later. (On the bright side, losing work by accident should not be an issue here.)

The more development branches you have, the more integrations you will need to do. Someone will need to merge the changes from the main branch into the development branch on a regular basis, and when the development branch is done, or at least has reached a “good” point, it needs to be merged back into the main branch. Together, these typically lead to “mass integrates” in both directions.

As I’ve written before, mass integrates are a “worst practice.” Mass integrates can frequently run into dangerous merge conflicts. Because you are merging two large bodies of changes, the probability of a textual or logical conflict between the two sets of changes can be high. The longer the development branch lives on without being integrated back into the main branch, the greater this risk grows. (If you must, for whatever reason, have a development branch, I recommend integrating in both directions as frequently as possible.)

A development branch can be thought of as an intentional delay in integrating code. This can be tempting: “I get my own sandbox where I can do whatever I want!” But this kind of freedom is dangerous at best. For example, it encourages engineers to break things in the branch expecting that they will be “cleaned up later.” If the feature’s schedule starts to slip, this deferred “cleanup” work may be skipped. All of a sudden the development branch “needs” to be merged back into the main branch “right away” so that it can be in place for the next release branch. (I’ve seen this happen a number of times.)

When you add in the costs of delayed integrations, I recommend against development branches. You are better off doing the development in the main branch. This may require a bit more care on each change (you can’t break stuff like you can off in your own little “sandbox”), but the discipline this requires will pay off later. You won’t waste time integrating changes back and forth between branches, and you will spend a lot less time fiddling around with (textual or logical) merge conflicts.

If the new code isn’t ready to activate right away, you can simply hide it behind an #ifdef or some such. Even if the #ifdef approach proves difficult, it’s likely to still be easier than dealing with merge conflicts: when someone makes an unrelated change that interacts with your changes, there’s a good chance that they will help you out by updating the inactive #ifdef code. And if someone makes a change that truly conflicts with your changes, you’ll know right away.

3 Responses to 'The Cost of Integration'


Mark,

I tend to agree, but I think as teams get large (an agile anti-practice, I’d argue) and you have 100 people committing into the same large code base, the “everyone works on trunk” ideal starts to break down, especially as the rate of commits times the length of the build exceeds the capacity of the system to quickly build each commit.

In these circumstances, my favorite answer is to split the team of 100 up into component teams and build parts of the system separately. When that can’t be done, I think feature or component streams may be the least evil remaining option. I say streams rather than branches, because stream-based SCMs actually allow for automatic merging from the main integration “branch” back into the ones developers are working on.

The big bang integration risk is still present, but builds off the feature streams may be independently verifiable and integrated into the integration stream at least daily. It seems like a reasonable compromise.

That said, it’s still far from ideal and you’re dead right that the cost of integration increases over time.

Hi Eric, I agree that as teams get larger we are stuck making a choice between several less-than-satisfactory alternatives.

However, I’ve worked on projects with well over 100 people regularly committing to a single branch without stability problems. I think it’s a matter of having appropriate tools to (1) detect regressions promptly and (2) prevent regressions from happening in the first place.

My guess is that teams that only focus on “detection after the fact” rather than prevention are the ones whose processes start breaking down around ~100 engineers.

Computers (especially headless desktop PCs) are so cheap these days; compare the cost of buying an appropriately sized build farm to the cost of hiring an engineer. It’s a bargain if it saves you any time at all.