Tuesday, 17 January 2017

The Cost of Long-Lived Feature Branches

Many moons ago I was working at large financial organisation on one of their back office systems. The ever increasing growth of the business meant that our system, whilst mostly distributed, was beginning to creak under the strain. I had already spent a month tracking down some out-of-memory problems in the monolithic orchestration service [1] and a corporate programme to reduce hardware meant we needed to move to a 3rd party compute platform to save costs by sharing hardware.

Branch Per Project

The system was developed by a team (both on-shore and off-shore) numbering around 50 and the branching strategy was based around the many ongoing projects, each of which typically lasted many months. Any BAU work got done on the tip of whatever the last release branch was.

While this allowed the team to hack around to their heart’s content without bumping into any other projects it also meant merge problems where highly likely when the time came. Most of the file merges were trivial (i.e. automatic) but there were more than a few awkward manual ones. Luckily this was also in a time before “relentless refactoring” too so the changes tended to be more surgical in nature.

Breaking up the Monolith

Naturally any project that involved taking one large essential service apart by splitting it into smaller, more distributable components was viewed as being very risky. Even though I’d managed to help get the UAT environment into a state where parallel running meant regressions were usually picked up, the project was still held at arms length like the others. The system had a history of delays in delivering, which was unsurprising given the size of the projects, and so naturally we would be tarred with the same brush.

After spending a bit of time working out architecturally what pieces we needed and how we were going to break them out we then set about splitting the monolith up. The service really had two clearly distinct roles and some common infrastructure code which could be shared. Some of the orchestration logic that monitored outside systems could also be split out and instead of communicating in-process could just as easily spawn other processes to do the heavy lifting.

Decomposition Approach

The use of an enterprise-grade version control system which allows you to keep your changes isolated means you have the luxury of being able to take the engine to pieces, rebuild it differently and then deliver the new version. This was the essence of the project, so why not do that? As long as at the end you don’t have any pieces left over this probably appears to be the most efficient way to do it, and therefore was the method chosen by some of the team.

An alternative approach, and the one I was more familiar with, was to extract components from the monolith and push them “down” the architecture so they turn into library components. This forces you to create abstractions and decouple the internals. You can then wire them back into both the old and new processes to help verify early on that you’ve not broken anything while you also test out your new ideas. This probably appears less efficient as you will be fixing up code you know you’ll eventually delete when the project is finally delivered.

Of course the former approach is somewhat predicated on things never changing during the life of the project…

Man the Pumps!

Like all good stories we never got to the end of our project as a financial crisis hit which, through various business reorganisations, meant we had to drop what we were doing and make immediate plans to remediate the current version of the system. My protestations that the project we were currently doing would be the answer if we could just finish it, or even just pull in parts of it, were met with rejection.

So we just had to drop months of work (i.e. leave it on the branch in stasis) and look for some lower hanging fruit to solve the impending performance problems that would be caused by a potential three times increase in volumes.

When the knee jerk reaction began to subside the project remained shelved for the foreseeable future as a whole host of other requirements came flooding in. There was an increase in data volume but there were now other uncertainties around the existence of the entire system itself which meant it never got resurrected in the subsequent year, or the year after so I’m informed. This of course was still on our current custom platform and therefore no cost savings could be realised from the project work either.

Epilogue

This project was a real eye opener for me around how software is delivered on large legacy systems. Having come from a background where we delivered our systems in small increments as much out of a lack of tooling (a VCS product with no real branching support) as a need to work out what the users really wanted, it felt criminal to just waste all that effort.

In particular I had tried hard to encourage my teammates to keep the code changes in as shippable state as possible. This wasn’t out of any particular foresight I might have had about the impending economic downturn but just out of the discomfort that comes from trying to change too much in one go.

Ultimately if we had made each small refactoring on the branch next being delivered (ideally the trunk [2]) when the project was frozen we would already have been reaping the benefits from the work already done. Then, with most of the work safely delivered to production, the decision to finish it off becomes easier as the risk has largely been mitigated by that point. Even if a short hiatus was required for other concerns, picking the final work up later is still far easier or could itself even be broken down into smaller deliverable chunks.

That said, it’s easy for us developers to criticise the actions of project managers when they don’t go our way. Given the system’s history and delivery record the decision was perfectly understandable and I know I wouldn’t want to be the one making them under those conditions of high uncertainty both inside and outside the business.

Looking back it seems somewhat ridiculous to think that a team would split up and go off in different directions on different branches for many months with little real coordination and just hope that when the time comes we’d merge the changes [3], fix the niggles and release it. But that’s exactly how some larger teams did (and probably even still do) work.

[3] One developer on one project spend most of their time just trying to resolve the bugs that kept showing up due to semantic merge problems. They had no automated tests either to reduce the feedback loop and a build & deployment to the test environment took in the order of hours.