Ensuring system rewrites are truly necessary

I think it’s safe to say that software engineers (hey, I’m one myself!) often dream about what the rewrite of a system would look like if they had the chance to start over. Oh so much more usable. Oh so much better structured. Oh so much faster! We are architects after all.

And it might be true that a rewrite can deliver those benefits. But often enough, we are too quick to let our minds wander into that territory.

As you may already know, rewrites are expensive and can be the mistake that kills a project or a company. It is easy to understand that a rewrite is costly. What’s harder to notice is that, while the rewrite happens, the original version often cannot stop improving. As a result, the rewrite has to chase a moving target, becomes harder over time, and grows the same kind of warts that the original version had. This was all very well covered by Joel Spolsky in his 20-year-old (!!) post about software rewrites so I’m not going to dive deeper into these reasons.

What I want to do in this post, however, is offer a different argument against software rewrites: one that forces a rewrite, should it happen, to deliver breakthrough change and not just incremental change.

The thesis

Let’s say you have a legacy system with some well-known problems. Your whole team is aware that the system has grown architectural issues, and they are also well-aware that the problems the users experience are real. Looking at the system, you realize that you can mitigate some of the problems with boring incremental change, but other problems truly necessitate a redesign. What do you do?

On the one hand, you can take the fun and exciting route: design a new system that hopefully provides room for the pipe dreams, but focus first on fixing the burning issues. Once designed, your company’s organizational complexity may make it easy to justify the rewrite given these promised immediate gains and to abandon the old system while it’s still in use.

On the other hand, you can take the boring route and incrementally fix the previous system. The thinking goes: “if the current system weren’t as bad as it’s perceived today, a future rewritten system would need to focus on dramatic improvements and not be allowed to regress the improvements we have already delivered”.

The regression aspect of this second approach is key. During a rewrite, it’s easy to design for the current burning issues and neglect others. And by neglecting others, you may reach a dead end and not be able to improve on those.

Confused so far? Yes, I know. Let’s illustrate this idea with an example before we look at two real-world case studies.

Thesis exemplified: p50 and p99 latencies

Say you own a backend system that currently offers p50 = 100ms and p99 = 500ms latencies. Both of these are problematic and your users have been asking you for improvements, especially on the p50 front. You know that you can design a new version of this system that delivers p50 = 30ms (an impressive 70% cut) but your new design does nothing for the p99. In fact, the new design will make it harder to address the p99 in the future.

So what if someone came around and iteratively tuned the old system to its absolute best, lowering the latencies to p50 = 70ms and p99 = 300ms? Sure, the gains on p50 are nowhere near as impressive as with your new design (a mere 30% vs. the proposed 70%) but the gains on the p99 are very tangible. If this happened, your proposed redesign would be doomed: proceeding with that design would indeed cut the p50s further, but it would also regress the p99s from the new 300ms baseline back to 500ms, something that would not be acceptable once users got used to the tuned numbers.

| | Legacy system | Tuned legacy system | Proposed system |
|---|---|---|---|
| p50 | 100ms | 70ms | 30ms |
| p99 | 500ms | 300ms | 500ms |

The table summarizes what I mentioned above. Note how the tuned legacy system, while not impressive on the p50 front, may be a better overall answer than the proposed system. The main takeaway thus is:

By improving the old system in dimensions that the redesign didn’t think about, we have made the work of the redesign much harder. And that is good, because then the redesign won’t corner us in a position that is hard to get out of.
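Here is a minimal sketch of that regression check, using the numbers from the table. The `Latency` type and the `regressions` helper are hypothetical names made up for illustration; the only point is that the same proposal passes against the legacy baseline and fails against the tuned one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Latency:
    """Latency profile of a system, in milliseconds (hypothetical type)."""
    p50_ms: float
    p99_ms: float


def regressions(baseline: Latency, candidate: Latency) -> list[str]:
    """Return the metrics on which the candidate is worse than the baseline."""
    worse = []
    if candidate.p50_ms > baseline.p50_ms:
        worse.append("p50")
    if candidate.p99_ms > baseline.p99_ms:
        worse.append("p99")
    return worse


# Numbers taken from the example in this post.
legacy = Latency(p50_ms=100, p99_ms=500)
tuned = Latency(p50_ms=70, p99_ms=300)
proposed = Latency(p50_ms=30, p99_ms=500)

print(regressions(legacy, proposed))  # [] -> looks like a pure win
print(regressions(tuned, proposed))   # ['p99'] -> now it is a regression
```

Against the untuned system the redesign has nothing to lose, so it is easy to approve. Against the tuned system it has to justify giving back 200ms of p99.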

Let’s look at two real-world examples of this taking place.

Case study 1: storage system

This story dates back to my days in Storage SRE, which are long gone. The systems involved have evolved a lot since then and none of this applies any more. But the story still serves as the foundation for what I’m presenting here.

One of the shared storage systems at Google at the time was not initially designed to support interactive workloads: it was designed mostly to support batch workloads very well. As time passed, however, people started relying on it for their interactive needs, and the obvious problem arose: high tail latencies. High tail latencies in a distributed system are very bad because your front-end apps typically compound their effects. They don’t issue a single RPC towards your backend system; they issue hundreds of them, so the probability that at least one of those RPCs lands in the slow tail is very high.
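To put a rough number on that compounding effect: if each RPC independently has a 99% chance of finishing under the p99 latency, then a request that fans out into N RPCs avoids the tail only with probability 0.99^N. The sketch below assumes independent RPCs, which is a simplification (real backends tend to have correlated slowness), and `prob_hits_tail` is a hypothetical helper name.

```python
def prob_hits_tail(num_rpcs: int, fast_fraction: float = 0.99) -> float:
    """Probability that at least one of num_rpcs RPCs is slower than the
    latency percentile given by fast_fraction (0.99 means the p99).

    Assumes every RPC is independent, which is a simplification.
    """
    return 1.0 - fast_fraction ** num_rpcs


print(round(prob_hits_tail(1), 2))    # 0.01
print(round(prob_hits_tail(100), 2))  # 0.63
```

In other words, with a fan-out of 100, roughly two out of three user-visible requests pay the p99 price at least once.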

As you can imagine, the original developers of this system envisioned a replacement. The replacement would magically fix all these issues and be so much better architecturally that it would also allow us to implement other future features.

But… while supporting new use cases was very important and unfeasible in v1 of the product, the improvements to tail latency in v2 didn’t seem revolutionary and weren’t the primary focus in the design. Why couldn’t v1 behave better?

In fact, it could. A peer of mine pushed for this idea: “let’s make v1 so good that it’s hard to prematurely replace it with v2”. And he did. He established SLIs and SLOs, and then iteratively tightened them with targeted improvements throughout the software stack until v1 satisfied the customers’ immediate needs. The case for v2 was then focused on adding support for the new use cases that weren’t possible with v1 and without a rush to mitigate the current fires.
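For readers who haven’t worked with these terms, here is a minimal sketch of what “establish an SLI and SLO, then tighten” can look like. It assumes the SLI is the fraction of requests served within a latency threshold and the SLO is a target for that fraction. The function names, thresholds, and sample latencies are all made up for illustration and are not the ones used in the real system.

```python
def latency_sli(latencies_ms: list[float], threshold_ms: float) -> float:
    """SLI: fraction of requests that completed within threshold_ms."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    good = sum(1 for latency in latencies_ms if latency <= threshold_ms)
    return good / len(latencies_ms)


def meets_slo(latencies_ms: list[float], threshold_ms: float, target: float) -> bool:
    """SLO check: is the SLI at or above the target fraction?"""
    return latency_sli(latencies_ms, threshold_ms) >= target


# Made-up samples before and after a round of targeted improvements.
before = [80, 90, 120, 250, 480]
after = [70, 80, 90, 200, 290]

print(latency_sli(before, threshold_ms=300))             # 0.8
print(meets_slo(before, threshold_ms=300, target=0.99))  # False
print(meets_slo(after, threshold_ms=300, target=0.99))   # True
```

Once the system meets the current objective, you lower `threshold_ms` or raise `target` and go hunting for the next bottleneck.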

Case study 2: from local to remote builds

Let’s look now at another story: one in which I’ve been directly involved as the primary engineer doing the work.

iOS builds at Google had traditionally run locally and our engineers were requesting faster build times. The obvious solution to this problem, based on all of our other builds, was to bring remote/distributed execution to these builds. So we did.

When rolling this feature out, we noticed that clean build times improved by more than 50% but incremental build times worsened and increased by 10x. To make things worse for us, teams were hard at work tuning their local-only builds, improving their clean build times and thus minimizing the gains remote execution provided in this case.

Deploying this feature was therefore not an option, so we had to go back to the drawing board to find an alternative. In the end, we had to come up with a vastly different answer to the problem and combine both local and remote resources during a build. You can read more on this in my Bazel dynamic execution posts.
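As a toy illustration of combining local and remote resources, the sketch below starts the same action on both and keeps whichever result arrives first. This is not Bazel’s actual implementation: `run_locally`, `run_remotely`, and `race` are hypothetical names, and the sleeps stand in for real work.

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def run_locally(action: str, seconds: float) -> str:
    """Pretend to run the action on the local machine."""
    time.sleep(seconds)
    return f"{action}: local"


def run_remotely(action: str, seconds: float) -> str:
    """Pretend to run the action on a remote executor."""
    time.sleep(seconds)
    return f"{action}: remote"


def race(action: str, local_seconds: float, remote_seconds: float) -> str:
    """Run the action both ways and return the result that arrives first."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [
            pool.submit(run_locally, action, local_seconds),
            pool.submit(run_remotely, action, remote_seconds),
        ]
        done, pending = wait(futures, return_when=FIRST_COMPLETED)
        for future in pending:
            # Best effort only: a thread that is already running is not
            # interrupted, and the pool still waits for it on exit.
            future.cancel()
        return next(iter(done)).result()


# Small incremental step: local wins. Big clean step: remote wins.
print(race("incremental", local_seconds=0.01, remote_seconds=0.2))  # incremental: local
print(race("clean", local_seconds=0.2, remote_seconds=0.01))        # clean: remote
```

A real build system also has to cancel the loser promptly and avoid having both branches write the same outputs, which this toy ignores.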

This situation is slightly different than the others as it did not involve redesigning a system. However, the situation is similar enough because the baseline performance targets of our iOS builds were inherently different than those in other builds, and thus the cargo-culted solution was insufficient. If we had started “from scratch” with remote builds only (and never had those local-only builds), we would have been blind to the potential higher targets we could achieve by leveraging local resources.

Parting words

Be very skeptical every time someone proposes to rewrite a piece of software. Often enough, the cheaper and safer answer is to perform iterative improvement to solve most problems.

Save rewrites for the cases where the current system’s architecture truly cannot fulfill new requirements. And if you end up going for a rewrite, try to deliver it iteratively by replacing the previous system in chunks, and not all at once.