I usually try to follow the advice of the book Working Effectively with Legacy Code. I break dependencies, move parts of the code to @VisibleForTesting public static methods and to new classes to make the code (or at least some part of it) testable. And I write tests to make sure that I don't break anything when I'm modifying or adding new functions.
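
For concreteness, here is a minimal sketch of the kind of extraction I mean (the class and method names are invented; @VisibleForTesting is Guava's annotation):

import com.google.common.annotations.VisibleForTesting;

public class InvoicePanel {

    // Before: this branching lived inline in an event handler and could only be
    // exercised by clicking through the GUI. After extraction, a plain unit test
    // can call it directly without constructing any GUI objects.
    @VisibleForTesting
    public static String dueDateLabel(int daysUntilDue) {
        if (daysUntilDue < 0) {
            return "Overdue";
        } else if (daysUntilDue == 0) {
            return "Due today";
        } else {
            return "Due in " + daysUntilDue + " days";
        }
    }
}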

A colleague says that I shouldn't do this. His reasoning:

The original code might not work properly in the first place. And writing tests for it makes future fixes and modifications harder since devs have to understand and modify the tests too.

If it's GUI code with some logic (~12 lines, 2-3 if/else blocks, for example), a test isn't worth the trouble since the code is too trivial to begin with.

Similar bad patterns could exist in other parts of the codebase, too (which I haven't seen yet, I'm rather new); it will be easier to clean them all up in one big refactoring. Extracting out logic could undermine this future possibility.

Should I avoid extracting out testable parts and writing tests if we don't have time for complete refactoring? Is there any disadvantage to this that I should consider?

It’s an illusion

First, sure, the existing code might be wrong. It might also be right. Since the application as a whole seems to have value to you (otherwise you'd simply discard it), in the absence of more specific information you should assume that it is predominantly right. "Writing tests makes things harder because there's more code involved overall" is a simplistic, and very wrong, attitude.

Second, by all means expend your refactoring, testing, and improvement efforts in the places where they add the most value with the least effort. Value-formatting GUI subroutines are often not the first priority. But not testing something because "it's simple" is also a very wrong attitude. Virtually all severe errors are committed because people thought they understood something better than they actually did.

Third, "we will do it all in one big swoop in the future" is a nice thought. Usually the big swoop stays firmly in the future, while in the present nothing happens. Me, I'm firmly of the "slow and steady wins the race" conviction.

Build a foundation

When you're refactoring legacy code, it doesn't matter if some of the tests you write happen to contradict ideal specifications. What matters is that they test the program's current behavior. Refactoring is about taking tiny iso-functional steps to make the code cleaner; you don't want to engage in bugfixing while you're refactoring. Besides, if you spot a blatant bug, it won't be lost. You can always write a regression test for it and temporarily disable it, or insert a bugfix task in your backlog for later. One thing at a time.
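
As a rough sketch of what that looks like in practice (JUnit 5, with invented names), the characterization test pins whatever the code does today, and the bug you spotted gets its own test, disabled until you are ready to fix it:

import static org.junit.jupiter.api.Assertions.assertEquals;

import org.junit.jupiter.api.Disabled;
import org.junit.jupiter.api.Test;

class ShippingCostTest {

    @Test
    void characterization_currentBehaviourForOversizedParcels() {
        // Pins the current behaviour, right or wrong, so each tiny refactoring
        // step can be checked against it.
        assertEquals(42.00, ShippingCost.forWeightKg(31), 0.001);
    }

    @Disabled("known bug: surcharge should apply above 30 kg; enable together with the fix")
    @Test
    void oversizedParcels_shouldIncludeSurcharge() {
        assertEquals(57.00, ShippingCost.forWeightKg(31), 0.001);
    }
}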

I'd agree that pure GUI code is hard to test and perhaps not a good fit for "Working Effectively ..." -style refactoring. However, this doesn't mean you shouldn't extract behavior that doesn't belong in the GUI layer and test the extracted code. And "12 lines, 2-3 if/else blocks" is not trivial. All code with at least a bit of conditional logic should be tested.
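
For example (invented names again), a 12-line handler with 2-3 branches usually hides a decision that can live in a plain method and be pinned with a handful of asserts:

import static org.junit.jupiter.api.Assertions.assertEquals;

import org.junit.jupiter.api.Test;

class StatusLabelTest {

    // Hypothetical logic pulled out of a button handler so it no longer needs
    // a running GUI to be exercised.
    static String statusLabel(boolean connected, int pendingJobs) {
        if (!connected) {
            return "Offline";
        } else if (pendingJobs == 0) {
            return "Idle";
        }
        return pendingJobs + " job(s) queued";
    }

    @Test
    void coversAllThreeBranches() {
        assertEquals("Offline", statusLabel(false, 3));
        assertEquals("Idle", statusLabel(true, 0));
        assertEquals("2 job(s) queued", statusLabel(true, 2));
    }
}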

In my experience, big refactorings are not easy and they rarely work. If you don't set yourself precise, tiny goals, there's a high risk that you embark on a never-ending, hair-pulling rework where you'll never land on your feet in the end. The bigger the change, the more you risk breaking something and the more trouble you will have finding out where you failed.

That said, software suffers from entropy and builds up technical debt. Previous developers short on time (or perhaps just lazy or inexperienced) may have implemented sub-optimal buggy solutions over well designed ones. Whilst it may seem desirable to refactor these, you risk introducing new bugs to what is (to the users anyway) working code.

Some changes are lower risk than others. For example, where I work there tends to be a lot of duplicated code that can safely be farmed out to a subroutine with minimal impact.

Ultimately, you have to make a judgement call as to how far you take the refactoring but there is undeniably value in adding automated tests if they don't exist already.

30 Reader Comments

I suppose it depends on the system and how critical it is. I work for a company that does avionics software and one of the things they told me is don't touch code outside the scope of what you're supposed to do, even if it looks really bad. The reason is because that code, regardless of how horrible it looks, works, and it has years and years of flight time. Refactoring it means throwing all of that confidence away.

Another company I worked for did training systems, and there was a general push for code refactoring which I could see was a good idea since it wasn't a critical system and refactoring did produce more pros than cons.

Ditching an existing system, be it code or actual stations, AFAIK is never welcomed with open arms. Some financial institutions (minor ones, not pointing fingers here) still use Windows 2000 with MS Access for their DB and the only reason they refuse to upgrade is how critical the DB is. Even the slightest chance of failure is akin to heresy to them.

At the same time, though, I can somewhat relate to where they are coming from. I have worked with an accounting data entry program in the past where my new coworker wanted to "improve" it by enabling UTF-8 support. The original was written in-house, and poorly at that, but there was absolutely no use for UTF-8 in that specific task, and I was hesitant since the previous data was already encoded and encrypted based on ANSI. I suggested a local test version - he scoffed and dismissed my worries with a shrug. Let's just say my worries saved us a lot of headache (and possibly my co-op job).

The reason is because that code, regardless of how horrible it looks, works, and it has years and years of flight time. Refactoring it means throwing all of that confidence away.

That's where the tests come in. All the functionality for which you write tests prior to refactoring will be verified to work the same as it did before.

For the whole application it may be impossible to have complete test coverage, but for some modular parts you can cover the code so well that you can pretty much prove your refactoring did not affect its functionality.

The problem is: how do you know how good your tests are at capturing the behaviour (intended and unintended) of the previous implementation? Code coverage is one measure, but nowhere near enough. The classic for me is silently swallowed errors in an old implementation (e.g. overflows), which might cause a segfault or something similar in a new implementation. It's normally data dependent, and you can bet your butt that once in a blue moon something is relying on it without you knowing about it.

If it's GUI code with some logic (~12 lines, 2-3 if/else blocks, for example), a test isn't worth the trouble since the code is too trivial to begin with.

LOL! Code too trivial to test. Sure. Remember that long function spanning two pivoted screens? It's too trivial to test. It only consists of effectively 12 lines and 2 if-else blocks. The rest is just kind of inlined functions, but they, too, are trivial and not worth a test.

About 95% of the programming mistakes I have seen, including my own, were in things "too trivial to test". Just because a function is extremely simple, and short on top of that, doesn't mean the common programmer won't manage to let something slip through his fingers.

The problem is: how do you know how good your tests are at capturing the behaviour (intended and unintended) of the previous implementation? Code coverage is one measure, but nowhere near enough. The classic for me is silently swallowed errors in an old implementation (e.g. overflows), which might cause a segfault or something similar in a new implementation. It's normally data dependent, and you can bet your butt that once in a blue moon something is relying on it without you knowing about it.

Code coverage is not enough, but I wouldn't say it's nowhere near enough. If you cover all the code (and branches!) of a "module" (be it a class, function or file) with no side effects, the only thing left is to test its inputs and outputs.

That's where it gets tricky – for some such modules it is possible to write tests that cover all the inputs and outputs, but those are a minority. When that isn't possible, you can test all the corner cases (nulls, empty strings, '\0', long input, etc.) and some known cases. If even that isn't enough, you can fuzz.
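
A sketch of the fuzzing idea (the parser name is invented): hammer the code with random inputs and assert only the invariants you actually trust, such as "it never throws", because you are probing for swallowed errors rather than exact outputs.

import static org.junit.jupiter.api.Assertions.assertDoesNotThrow;

import java.nio.charset.StandardCharsets;
import java.util.Random;
import org.junit.jupiter.api.Test;

class LegacyParserFuzzTest {

    @Test
    void randomInputsNeverThrow() {
        Random rng = new Random(42); // fixed seed so a failure is reproducible
        for (int i = 0; i < 10_000; i++) {
            byte[] bytes = new byte[rng.nextInt(256)];
            rng.nextBytes(bytes);
            String input = new String(bytes, StandardCharsets.ISO_8859_1);
            // The only claim made here is "no exception escapes"; anything
            // stronger would need a real specification to check against.
            assertDoesNotThrow(() -> LegacyParser.normalize(input));
        }
    }
}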

You may not get to 100% certainty that the code will work as it did before, but then you weren't 100% sure the old code would work correctly in all situations either. And it's not inconceivable that writing all those tests did uncover a case that wasn't correctly handled before.

As someone else said, the legacy code presumably has value since it's still around--users rely on it having the correct behavior, even if the code is a nightmare (in some cases, even fixing issues that are clearly legitimate bugs can disrupt the users). If you can write regression tests for it, and rely on them to maintain consistency as you gradually refactor yourself into a better place, then you can greatly reduce the risk inherent in any such refactoring, especially the mythical one-time overhaul that's always sitting way down there in the backlog.

Obviously, though, this strategy is not always simple to implement, since having tests that work for both old and new code presents its own problems.

Beyond that, 'flight-testing' is a very, very important part of the test suite for avionics software. And the second you touch a single line of code that has been flight tested, you're throwing away all of the confidence that has been built up on it based on flight testing.

So not only do you have an oracle problem with your new tests, but you're not even able to properly replicate the original test cases (flight time).

Life-critical chunks of code like this are definitely the exception to the rule that you should always try to improve the code base. Unless you can actually duplicate all of the real-world testing that went into the system, you absolutely should not touch that code unless you know exactly what you are doing and have a very specific reason to do so.

But I think we've digressed a bit from the original point of the question.

Tests are mostly useful when you are refactoring the code, but you said there's no time for that.

Tests also only work properly when the code was designed with testing in mind — again, that's not the case here.

With those things put together, you're not going to get much real world benefit from adding tests to the code.

Therefore, you should spend your time doing something more useful — such as writing a replacement for the legacy code that is thoroughly tested. Or writing up documentation.

If your boss tells you there is no time for refactoring, that usually means the team of programmers you're working with is not able to keep up with the amount of work that needs to be done. How about doing some work on that problem instead of writing tests for legacy code?

I've recently been given a task to optimise some code developed over a number of years by another developer who has left the company ...

This isn't my first optimisation gig (it's actually one of the things I really enjoy - make it right, then make it fast). So first thing I do is get myself a quick overview of the subsystem, ensure that I can run it and capture some profiling data, and look at the unit tests.

There aren't any. Not a single one.

So I'm currently writing tests for the current behaviour. Which has quite a few ... quirks. Let's just say that there are a number of edge cases that aren't handled very well. Exceptions being swallowed. You get the idea.

So before I can even begin optimising, I need to write unit tests to capture the current behaviour; write more unit tests to verify the correct behaviour (and fix bugs); and then finally I can do a baseline profile and start optimising.

The way I write the unit tests for current behaviour is to write the test that I think should pass, and then see how it fails (as it almost always does). Then I have to dig into the code that's being tested (often with a debugger) to find out how my assumptions differ from reality. Sometimes that results in realising that my assumptions were wrong (e.g. assuming an ordering that doesn't exist), sometimes it means I need two tests - a temporary one for the current behaviour, and a skipped one for the correct behaviour.
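
In JUnit terms the two tests usually end up looking something like this (names and values are invented):

import static org.junit.jupiter.api.Assertions.assertEquals;

import org.junit.jupiter.api.Disabled;
import org.junit.jupiter.api.Test;

class ReportTotalsTest {

    @Test
    void currentBehaviour_negativeLinesAreSilentlyDropped() {
        // Temporary test: documents what the code really does today, so the
        // optimisation work can be verified against unchanged behaviour.
        assertEquals(100, ReportTotals.sum(new int[] {100, -5}));
    }

    @Disabled("correct behaviour; enable together with the bug fix")
    @Test
    void correctBehaviour_negativeLinesAreIncluded() {
        assertEquals(95, ReportTotals.sum(new int[] {100, -5}));
    }
}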

Once I have a test that passes, I then start adding edge cases to the input data. Almost inevitably I'll eventually fail the test, and end up with another skipped test case.

Once I've got what I consider to be a good enough test suite together, I'll start modifying the code to pass the correct test suite, one unit test at a time.

Life-critical chunks of code like this are definitely the exception to the rule that you should always try to improve the code base. Unless you can actually duplicate all of the real-world testing that went into the system, you absolutely should not touch that code unless you know exactly what you are doing and have a very specific reason to do so.

In that case, what is the threshold for determining when it is okay to modify flight-tested code?

Tests should be written based on what the code is supposed to do, not what it currently does. Cargo-cult test writing, while socially expedient amongst young nerds, is extremely dangerous because if there's already a bug in the code the next guy to come along will see the test and think the code must be working as intended. What I mean by this is that useful tests will take almost as much dev time as refactoring, so I wouldn't bother if there's not enough time for either.

In the absence of testing/refactoring/whatever, arcane legacy code is best treated as a sacrosanct artifact of eldritch origin, kept in one spot, and whenever a problem pops up that sounds related to what it does the first step should be to look through the changelog for recent modifications. My experience is that while legacy code may be majorly bad, problems usually arise when an ambitious lad eager to prove himself sets out to "fix" things and then someone else used to taking advantage of certain quirks makes the assumption that it works the same now as it did five years ago.

Tests should be written based on what the code is supposed to do, not what it currently does.

It's important to understand what the current code does before fixing it. Writing tests that capture the current state gives a high confidence level that it is understood. Include comments that this is testing the existing behaviour prior to bugfixes.

Then fix the test, and fix the code. The changes to the test and corresponding changes to the code will be captured in your version control.

I'm not saying you should ensure that every quirk of the existing code be captured in the tests. Often it's useful to just put a comment in the test to the effect that certain edge cases are broken. Then in the changeset that fixes the test, you replace the comment with tests for the edge cases.

That said, software suffers from entropy and builds up technical debt. Previous developers short on time (or perhaps just lazy or inexperienced) may have implemented sub-optimal buggy solutions over well designed ones.

There's another issue I have often run across in my time working with legacy code when trying to get in and add a feature or make a bug fix: legacy code with incorrect comments, unexpected side effects, variables that are changed in unseen ways, or bad-practice code structure (e.g. many uses of "goto"). There are a lot of edge cases to look through, and even if you've waded in the code base for a while, if you were not the original designer you may not know all of the pitfalls or limitations of the system itself (hell, even the original developer may not!).

This goes back to the avionics example, where changing code is dangerous because you've got so many hours of seat time with the existing code base that when you make a change, everything is in flux. Especially when your legacy system has questionable coding styles/standards (or a complete lack of a recognizable style or standard), any change can be a dangerous thing. Doubly so when you don't have any actual code space to spare on the microprocessor, depending on how you want to construct the tests. But unfortunately, this issue is even more compounded when coupled with this:

Quote:

There is a culture in some companies where they're reticent to allow developers any time to enhance code that doesn't directly deliver additional value e.g. new functionality.

If schedule is always job one... you need to weigh the cost of fighting the system against the cost of letting things keep sliding backwards (you also need to try to educate people higher up the food chain, or possibly other developers, on technical debt and how it works -- and how much people don't really understand about how products do and DON'T work under the hood, in spite of how they THINK they work).

Sometimes I advocate for limited cleaning up of legacy code when large changes must be made. Other times, I advocate just leaving it alone and attempting to document the functionality, so that when it has to be re-written as part of a new platform I can argue vehemently that it needs to be re-written rather than "ported forward", to bring it into an actually maintainable style. Early on, I was an advocate of ultimate refactoring and testing and all that stuff... But the realities of business and schedules have kind of made me change my view of how to triage legacy code I am in charge of moving forward.

Quote:

It's important to understand what the current code does before fixing it.

I also highly agree with this. And it actually takes, in my opinion, longer to get into legacy code than many people realize. After a few months I have had a good handle on legacy systems, but 1 or 2 years out I've run across something and gone "oh wow, that last feature upgrade was tenuous at best" because of new knowledge on how the system really works or interacts. Maybe things are somewhat different in the PC world, but from a system level view, it's easy to fall into a trap of thinking you know how stuff works when it really doesn't.

I'm not saying you should ensure that every quirk of the existing code be captured in the tests. Often it's useful to just put a comment in the test to the effect that certain edge cases are broken. Then in the changeset that fixes the test, you replace the comment with tests for the edge cases.

A really good test system has the ability to mark some tests as 'expected to fail': You write the tests to what the behavior should be, and get alerted when the current behavior changes.
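
JUnit, for instance, has no built-in "expected to fail" marker, so a rough way to emulate the idea (names invented) is to assert the intended behaviour, swallow the known failure, and fail loudly the day it starts passing:

import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.fail;

import org.junit.jupiter.api.Test;

class ExpectedFailureTest {

    @Test
    void intendedBehaviour_alertsWhenItStartsPassing() {
        try {
            assertEquals("42.00", PriceFormatter.format(42)); // what the code should do
        } catch (AssertionError knownFailure) {
            return; // expected for now: the current code still gets this wrong
        }
        fail("The intended behaviour now passes - promote this to a regular test");
    }
}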

Also note that there are actually three behaviors to worry about with code: what it does, what it's supposed to do, and what it's documented to do. Ideally at least two of these match, but don't count on it.

Tests are mostly useful when you are refactoring the code, but you said there's no time for that.

I strongly disagree with this. Tests are useful 1) for building confidence that the code works as it's supposed to, and 2) for preventing new bugs whenever you change the code in any way. The former is not that important with a legacy system that's been found to work well in practice, but the latter is useful even when you aren't about to refactor the code. Even legacy systems sometimes need bug fixes or new features, and tests help prevent bugs that could otherwise be introduced as a side effect.

First of all, my preferred testing method is a mixture of unit testing and assertions; I use unit tests to ensure an implementation of an interface behaves as expected, but I'm usually fairly relaxed on just how thoroughly I range test the values (as this can quickly over-complicate the test code). For range testing I then use assertions in the code itself, as they're also a great thing to leave enabled during production, so IMO it makes most sense for the strict range testing to be done this way. Of course some work place environments will put requirements on your test method, which is why I'm lucky I work freelance mostly, as I've found that relaxed unit tests + assertions is a great balance.
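
In Java terms that might look roughly like this (invented class; an explicit check is used rather than the assert keyword, since assert is disabled on production JVMs unless -ea is passed):

public final class Throttle {

    private final double fraction;

    public Throttle(double fraction) {
        // Range check that stays active in production, matching the idea of
        // leaving the strict checks enabled rather than only in test builds.
        if (fraction < 0.0 || fraction > 1.0) {
            throw new IllegalArgumentException("fraction out of range: " + fraction);
        }
        this.fraction = fraction;
    }

    public double apply(double value) {
        return value * fraction;
    }
}

The unit test on top then only needs to check a couple of representative values instead of exhaustively range testing every input.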

The problem with doing all of this retroactively though is that unit tests really should be written in advance to conform to a well documented interface, but the number of times I've encountered classes that don't have interfaces but should, and are badly documented as well, is just disheartening. These are nightmares to test as you have to infer everything about how they work and what their methods are supposed to do (and what values they're supposed to accept), which can be a very difficult and tedious task.

Personally when working with another's code I will do everything I can to avoid touching that code and just hope it works, while being as thorough as I can with testing my own code; if those tests fail because of code I'm relying on, then I'll go ahead and write tests for that too. Basically it's the same rule as profiling; don't try to test everything as you may just be wasting your time, but if something breaks then the first task (if you're still in development) is to write a proper test to make sure you understand the code and can verify that it has been fixed, then you can dive in and try to fix it. Of course if the bug crops up in production you have to prioritise the fix, but if it's non-critical then stopping to write your test first can still be useful, as the worst thing is delving into someone's code, thinking you've fixed the problem, only to discover you've created a new one (or more often more than one)!

It's important to understand what the current code does before fixing it.

I couldn't agree more.

Quote:

Writing tests that capture the current state gives a high confidence level that it is understood.

Understanding the code means knowing why it does what it does. It is possible to write tests that capture the current state of the code in a purely syntax-driven manner, without having any understanding of the code under test. Tests are code-level constructs, and are just as prone as the product code to be obtuse, confused and confusing. In fact, because the structure of the tests necessarily reflects the design of the code, a test suite constructed to accept an existing code base is not likely to do better than that code base in comprehensibility.

Writing tests for the code might, as a side-effect, lead to an understanding of that code (mainly because the test author has to read the code), but analytical methods are a faster and more direct route. If your first goal is to understand the current code (and I agree that it is), then why not tackle that problem directly? The author of the original question should put aside test-writing until he thoroughly understands the software.

My point exactly. A lot of test writers are cargo-cult testers, they don't take the time to understand the how or the why and just blindly translate what the code appears to do into unit tests. It's fast, it's simple, it doesn't require much thinking, and for many developers it's more important to impress their friends with how many tests their team has than have a good working knowledge of the code base.

Consider this simple example. Buried in some code written ten years ago we've got this function:
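
(Something along these lines - reconstructed from the description that follows:)

static double convertToWidgets(double value) {
    return value * 28.56; // hard-coded widget conversion rate
}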

What does it do? Why does it do it? Who uses it? Well, obviously it converts some value to widgets by multiplying by 28.56. And we did a search and it's used to set some obtuse variable in a ten thousand line class, no way I'm reading all that. It needs a test! So we write something like this.

[UnitTest]
AncientKnowledgeTests::testConvertToWidgets()
{
    double myVal = AncientKnowledge.convertToWidgets(2.5);
    assertEqual(71.4, myVal); // I ran the function and this is what I got back
}

Bada bing, bada boom, we have a test and increased code coverage so all is right in the world. Now, six months later another developer needs to convert something to widgets. He finds that the values he's getting don't match what he wants to get, they're off by a small amount. So he goes to the business analyst or systems engineer or whoever sets the requirements and asks WTF is going on. The systems guy looks at the spec and says the problem is that the widget conversion rate is 28.57, not 28.56. Programmer responds that it must be a new change to the spec because we have tests that would have caught this. So systems guy and code guy go at it for a while, each checks their respective changelog and can't resolve the discrepancy until eventually they call the one old guy in the company who actually worked on this stuff and he confirms that yes, the number should be 28.57 and it must have been a typo. Hey, errors happen when you're under pressure to hit internal deadlines.

But why wasn't this a problem earlier? Because in that ten-thousand-line file the original tester couldn't be bothered to analyze, the variable we set is actually cast to an integer, and the numbers were all small enough that the rounding error was eaten; the new coder was using it as a double for a very precise calculation. If the tester had looked at the spec he would have found the error, and if there had been no test the coder would have backed down because the conversion value was hearsay. But there were tests, which must have been validated at the time, so two engineers spent half a day screwing around over a ten-year-old bug that went unnoticed because of questionable code elsewhere in the system.

In that case, what is the threshold for determining when it is okay to modify flight-tested code?

Good question, if a bit unanswerable.

I have designed a system which reads the state of printers, to facilitate their management. A major problem was that test models usually come early in the release cycle. Further down the release cycle, firmware upgrades can cause modified behaviour. At this point it is possible to see what is going on in a customer machine, but not to get one for the necessary extended investigation - even if time were available. Do you attempt to modify the code? Remember that the firmware changes are usually there to address bugs, and you may have a workaround for that bug which is no longer correct.

Not as critical as avionics by a long way, but an example of how software modifications in a part of the system you do not control (and for which you probably don't have a published specification, do you, Hewlett Packard?) can cause serious problems when you update the parts which you do.

Assuming you have a codebase you need to make changes to, you should certainly follow Michael Feathers' advice in Working Effectively With Legacy Code. If you don't need to make any changes (and are still getting paid as a programmer), then this is great for you. Your colleague's advice of waiting for the 'big refactor' is fine if a) waiting instead of delivering features and bug fixes is acceptable and b) someone's going to pay for a 'big refactor' and wait while it plays out. Otherwise, you will need to modify the existing codebase, delivering those new features and bug fixes with the least risk and effort. Michael Feathers' book tells you how to do that.

I'd be wary of any of the comments on this thread advocating informal reasoning ('analysis') as the only means of understanding the codebase. Feathers advocates 'characterisation tests' for good reason; he does not claim they are regular 'unit tests'. They provide objective, albeit limited, data about the behaviour of the existing system; in contrast, informal reasoning is subjective. Reading and reasoning about the code and documentation are of course useful, but they are not objective evidence. When you deploy your code, its behaviour will be an objective fact, and the documentation and your reasoning don't count. Feathers covers all of this and more in the book.

Tests should be written based on what the code is supposed to do, not what it currently does.

If it's been in place for a while, what it currently does is what it's supposed to do. There are likely processes that depend on the behavior that it currently has, and changing that without warning could be disastrous.

No one cares to fix this, i.e. it's a big foobar mess that no one understands but everyone relies on. You are not in a position to change any of that. When you are ready to retire, these same people will be looking to update that same big foobar mess. Except by then you will be one of the few people who can still write code in that language. So you can charge an obscene hourly rate to work for them.