I work on a rather large code base. Hundreds of classes, tons of different files, lots of functionality, takes more than 15 minutes to pull down a fresh copy, etc.

A big problem with such a large code base is that it has accumulated quite a few utility methods that do the same thing, as well as code that reimplements functionality instead of using those utilities. The utility methods also aren't all in one class (that would be a huge jumbled mess), so they're scattered throughout the codebase.

I'm rather new to the code base, but the team lead, who's been working on it for years, appears to have the same problem. It leads to a lot of duplicated code and work, and as a result, when something breaks, it's usually broken in four copies of basically the same code.

How can we curb this pattern? As with most large projects, not all code is documented (though some is) and not all code is... well, clean. But basically, it'd be really nice if we could improve quality in this respect so that in the future we'd have less code duplication, and things like utility functions would be easier to discover.

Also, the utility functions usually live either in some static helper class, in some non-static helper class that works on a single object, or as static methods on the class they mainly "help".

I experimented with adding utility functions as extension methods (I didn't need any internals of the class, and they were only required in very specific scenarios). This kept the primary class uncluttered, but the functions aren't really any more discoverable unless you already know about them.

Tools like this will help you find points in code that do similar things. Write tests to confirm that they really do, then use those same tests while you make the duplicated code simpler to use. This "refactoring" can be done in multiple ways, and a refactoring catalogue can help you determine the correct one.
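For instance, once a detector flags two candidate duplicates, a quick agreement check can confirm they really do behave the same before you merge them. A minimal Java sketch, where `trimToLength` and `truncate` are hypothetical stand-ins for two flagged utility methods:

```java
import java.util.Objects;

public class DuplicateCheck {

    // Hypothetical duplicate #1, e.g. found in a StringHelpers class.
    static String trimToLength(String s, int max) {
        if (s == null) return null;
        return s.length() <= max ? s : s.substring(0, max);
    }

    // Hypothetical duplicate #2, e.g. found in a ReportUtils class.
    static String truncate(String s, int max) {
        if (s == null) return null;
        if (s.length() > max) return s.substring(0, max);
        return s;
    }

    public static void main(String[] args) {
        String[] samples = { null, "", "short", "exactly10!", "definitely longer than ten" };
        for (String s : samples) {
            // If any input diverges, they are not true duplicates.
            assert Objects.equals(trimToLength(s, 10), truncate(s, 10))
                    : "divergence on input: " + s;
        }
        System.out.println("both implementations agree");
    }
}
```

Once such a test passes over representative inputs, you can pick one implementation, redirect all callers to it, and delete the other.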

Furthermore, there is a whole book about this topic by Michael C. Feathers, Working Effectively with Legacy Code. It goes in depth into the different strategies you can take to change the code for the better. He has a "legacy code change algorithm" which is not far off from the two-step process above:

Identify change points

Find test points

Break dependencies

Write tests

Make changes and refactor

The book is a good read if you're dealing with brown-field development, i.e. legacy code that needs to change.
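As a concrete illustration of the "write tests" step, a characterization test pins down what the legacy code currently does, quirks included, before you touch it. This is only a sketch: `LegacyPricing` and its truncation quirk are invented for the example.

```java
public class LegacyPricing {

    // Hypothetical legacy helper, imagined as duplicated across several
    // billing classes. Note the quirk: integer division truncates
    // fractional cents instead of rounding.
    static long priceWithTaxCents(long netCents, int ratePercent) {
        return netCents * (100 + ratePercent) / 100;
    }

    public static void main(String[] args) {
        // Characterization tests: the expected values come from running
        // the existing code, not from deciding what it *should* return.
        assert priceWithTaxCents(1000, 19) == 1190;
        assert priceWithTaxCents(999, 19) == 1188; // 1188.81 truncated
        System.out.println("current behaviour captured");
    }
}
```

With the current behaviour locked down like this, any later refactoring that changes a result (even the quirky truncation) fails a test and forces a deliberate decision.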

In this case

In the OP's case I can imagine the hard-to-test code has grown into a honeypot of "utility methods and tricks" that take several forms.

Take note that there is nothing wrong with these as such, but on the other hand they're usually hard to maintain and change. Extension methods in .NET are static methods, but they are also relatively easy to test.

Before you go through with the refactorings, though, talk with your team about it. They need to be on the same page with you before you proceed with anything, because if you're refactoring something, chances are high you'll be causing merge conflicts. So before reworking something, investigate it and tell your team to touch those code points with caution for a while until you're done.

Since the OP is new to the code, there are some other things to do first:

Take time to learn from the codebase, i.e. break "everything", test "everything", revert.

We actually have quite a bit of unit and integration testing. Not 100% coverage, but some of the things we do are nearly impossible to unit test without radical changes to our code base. I'd never considered using static analysis to find duplication; I'll have to try that next.
– Earlz, Dec 27 '12 at 19:08

@Earlz: Static code analysis is awesome! ;-) Also, whenever you need to make a change, think of solutions that make future changes easier (check the Refactoring to Patterns catalogue for this).
– Spoike, Dec 27 '12 at 19:28

+1 I'd understand if someone put a bounty on this Q to award this answer as "extra helpful". The Refactoring to Patterns catalogue is gold; things like this in the fashion of GuidanceExplorer.codeplex.com are great programming aids.
– Jeremy Thompson, Dec 28 '12 at 4:07

Prevention - Try to have documentation that is as good as possible. Make every function properly documented, and make the whole documentation easy to search. Also, when writing code, make it obvious where new code should go, so it is equally obvious where to look for existing code. Limiting the amount of "utility" code is one of the key points here. Every time I hear "let's make a utility class", my hair stands up and my blood freezes, because it is an obvious problem. Always have a quick and easy way to ask the people who know the codebase whether some feature already exists.

Solution - If prevention fails, you should be able to quickly and efficiently fix the problematic piece of code. Your development process should allow duplicate code to be fixed quickly. Unit testing is perfect for this, because you can modify code without fear of breaking it. So if you find two similar pieces of code, abstracting them into a function or class should be easy with a little bit of refactoring.
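The "abstract them into a function" step can be as small as pulling the varying part out into a parameter. A sketch, where the label-formatting helpers are hypothetical:

```java
public class ExtractDuplicate {

    // Before (hypothetical): formatOrderLabel(id) and formatInvoiceLabel(id)
    // were copy-pasted, differing only in the prefix.
    // After: one shared helper, with the varying part as a parameter.
    static String formatLabel(String prefix, int id) {
        return prefix + "-" + String.format("%06d", id);
    }

    public static void main(String[] args) {
        // Both former call sites now go through the same code path,
        // so a bug fix here fixes every copy at once.
        assert formatLabel("ORD", 42).equals("ORD-000042");
        assert formatLabel("INV", 7).equals("INV-000007");
        System.out.println("duplicates merged into one helper");
    }
}
```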

I personally don't think prevention is fully possible. The more you try, the more problematic it becomes to find already-existing features.

I don't think this kind of problem has a general solution. Duplicate code won't be created if developers are willing enough to look up existing code, and developers can fix the problems on the spot if they want to.

If the language is C/C++, merging duplicates is easier because of the flexibility of linking (one can call any extern function without prior information). For Java or .NET you may need to devise helper classes and/or utility components.

I usually begin removing duplication from existing code only when major errors arise from the duplicated parts.

This is a typical problem in a larger project that has been handled by many programmers, sometimes contributing under a lot of peer pressure. It is very tempting to make a copy of a class and adapt it to a specific need. However, when a problem is found in the originating class, it should also be fixed in its descendants, which is often forgotten.

There is a solution for this, and it is called generics, introduced in Java 5. It is the Java equivalent of C++ templates: code in which the exact type is not yet known when the generic class is written. Search for Java generics and you will find tons of documentation on it.

A good approach is to rewrite code that seems to be copied/pasted in many places, starting with the first copy you need to touch anyway, e.g. to fix a certain bug. Rewrite it to use generics and also write very rigorous testing code.

Make sure that every method of the generic class is invoked. You can also introduce code coverage tools: generic code should have full coverage, because it will be used in several places.

Also write testing code, e.g. using JUnit or similar, for the first designated class that is going to be used in conjunction with the generic piece of code.

Start using the generic code for the second copied version once all the preceding code works and is fully tested. You will see that some lines of code are specific to that designated class. You can move these lines of code into an abstract protected method that is implemented by the derived class using the generic base class.
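The shape described above, a generic base class holding the shared logic with the class-specific lines pushed into an abstract protected method, might look like this. A sketch only: `Importer` and `CustomerImporter` are invented names.

```java
import java.util.ArrayList;
import java.util.List;

abstract class Importer<T> {

    // Shared logic that used to be copy-pasted into every importer.
    final List<T> importAll(List<String> rawRecords) {
        List<T> result = new ArrayList<>();
        for (String raw : rawRecords) {
            if (raw == null || raw.isBlank()) continue; // shared validation
            result.add(parse(raw));                     // class-specific part
        }
        return result;
    }

    // The lines that differed in each copied class go here.
    protected abstract T parse(String raw);
}

class CustomerImporter extends Importer<String> {
    @Override
    protected String parse(String raw) {
        return raw.trim().toUpperCase();
    }
}

public class GenericsDemo {
    public static void main(String[] args) {
        List<String> out = new CustomerImporter()
                .importAll(List.of(" alice ", "", "bob"));
        assert out.equals(List.of("ALICE", "BOB"));
        System.out.println(out);
    }
}
```

Each former copy becomes a small subclass overriding only `parse`, so a fix to the shared loop fixes every importer at once.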

Yes, it is a tedious job, but as you go along it gets easier and easier to rip out similar classes and replace them with something that is very clean, well written, and much easier to maintain.

I have had a similar situation where one generic class eventually replaced something like 6 or 7 almost identical classes that had been copied and pasted by various programmers over a period of time.

And yes, I am very much in favor of automated testing of the code. It costs more in the beginning, but it definitely saves you a tremendous amount of time overall. Try to achieve overall code coverage of at least 80%, and 100% for generic code.

In his book Software Engineering with Reusable Components, Johannes Sametinger describes a set of barriers to code reuse, some conceptual, some technical. For instance:

Conceptual and Technical

Difficulty finding reusable software: software cannot be reused unless it can be found. Reuse is unlikely to happen when a repository does not have sufficient information about components or when components are poorly classified.

Nonreusability of found software: easy access to existing software does not necessarily increase software reuse. Unintentionally, software is seldom written in a way that others can reuse it. Modifying and adapting someone else's software can become even more expensive than programming the needed functionality from scratch.

Legacy components not suitable for reuse: reuse of components is hard or impossible unless they have been designed and developed for reuse. Simply gathering existing components from various legacy software systems and trying to reuse them for new developments is not sufficient for systematic reuse. Re-engineering can help in extracting reusable components; however, the effort might be considerable.

Object-oriented technology: it is widely believed that object-oriented technology has a positive impact on software reuse. Unfortunately and wrongly, many also believe reuse depends on this technology, or that adopting object-oriented technology suffices for software reuse.

Modification: components will not always be exactly the way we want them. If modifications are necessary, we should be able to determine their effects on the component and its previous verification results.

Garbage reuse: certifying reusable components to certain quality levels helps minimize possible defects. Poor quality control is one of the major barriers to reuse. We need some means of judging whether the required functions match the functions provided by a component.

Other basic technical difficulties include

Agreeing on what constitutes a reusable component.

Understanding what a component does and how to use it.

Understanding how to interface reusable components to the rest of a design.

Designing reusable components so that they are easy to adapt and modify in a controlled way.

Organizing a repository so that programmers can find and use what they need.

According to the author, different levels of reusability happen depending on the maturity of an organization.

Ad-hoc reuse among application groups: if there is no explicit commitment to reuse, then reuse can happen in an informal and haphazard way at best. Most of the reuse, if any, will occur within projects. This also leads to code scavenging and ends up in code duplication.

Repository-based reuse among application groups: the situation slightly improves when a component repository is used and can be accessed by various application groups. However, no explicit mechanism exists for putting components into the repository, and no one is responsible for the quality of the components in the repository. This can lead to many problems and hamper software reuse.

Centralized reuse with a component group: in this scenario a component group is explicitly responsible for the repository. The group determines which components are to be stored in the repository, ensures the quality of these components and the availability of the necessary documentation, and helps retrieve suitable components in a particular reuse scenario. Application groups are separated from the component group, which acts as a kind of subcontractor to each application group. An objective of the component group is to minimize redundancy. In some models, the members of this group can also work on specific projects. During project start-ups their knowledge is valuable for fostering reuse, and thanks to their involvement in a particular project they can identify possible candidates for inclusion in the repository.

Domain-based reuse: the specialization of component groups amounts to domain-based reuse. Each domain group is responsible for components in its domain, e.g. network components, user interface components, database components.

So, maybe, besides all the suggestions given in other answers, you could work on designing a reusability program: involve management, form a component group responsible for identifying reusable components through domain analysis, and define a repository of reusable components that other developers can easily query for ready-made solutions to their problems.

Don't forget that code duplication is not always harmful. Imagine: you have some task to be solved in two completely different modules of your project, and just now it happens to be the same task.

There could be three reasons for it:

Some theme behind this task is shared by both modules. In this case the code duplication is bad and should be eliminated. It would be sensible to create a class or a module to support this theme and use its methods in both modules.

The task is theoretical in terms of your project. For example, it is from physics or maths, etc. The task exists independently of your project. In this case the code duplication is bad and should be eliminated, too. I'd create a special class for such functions and use them from any module where needed.

But in other cases the coincidence of tasks is just that: a temporary coincidence and nothing more. It would be dangerous to assume that these tasks will remain the same as the project changes through refactoring and even debugging. In such cases it is better to keep two identical functions/pieces of code in different places, so that future changes in one of them won't touch the other one.

And this third case happens very often. If you duplicate "unknowingly", it is mostly for this very reason: it is not a real duplication!

So, try to keep the code clean where it is really necessary, and don't be afraid of duplication where it is not a must.
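The second case, a task that exists independently of the project, is the classic candidate for a small pure-function utility class. A sketch; `MathUtil` is a hypothetical name:

```java
public final class MathUtil {

    private MathUtil() {} // no instances: pure functions only

    // Plain geometry: the same in every module that needs it,
    // so it belongs in one shared place rather than in N copies.
    public static double distance(double x1, double y1, double x2, double y2) {
        return Math.hypot(x2 - x1, y2 - y1);
    }

    public static void main(String[] args) {
        assert distance(0, 0, 3, 4) == 5.0;
        System.out.println("distance(0,0,3,4) = " + distance(0, 0, 3, 4));
    }
}
```

Because such functions depend on nothing in the project, extracting them carries none of the coupling risk of the third case above.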

"Code duplication is not always harmful" is poor advice.
– user61852, Dec 27 '12 at 23:46

Should I bow to your authority? I have put my reasons here. If I am mistaken, show where the mistake is. As it stands, this rather looks like an unwillingness to hold a discussion.
– Gangnus, Dec 28 '12 at 0:12


Code duplication is one of the core problems in software development, and many computer scientists and theoreticians have developed paradigms and methodologies just to avoid code duplication as a main source of maintainability issues. It's like saying "writing poor code is not always bad"; that way anything can be rhetorically justified. Maybe you are right, but avoiding code duplication is too good a principle to live by to encourage the opposite.
– user61852, Dec 28 '12 at 0:22

I have put arguments here; you haven't. Appeals to authority haven't worked since the 16th century. You can't guarantee that you have understood them correctly, nor that they are authorities for me, too.
– Gangnus, Dec 28 '12 at 13:33

You are right, code duplication is not one of the core problems in software development, and no paradigms and methodologies have been developed to avoid it.
– user61852, Dec 28 '12 at 15:49