Clone Doctor: How to find Clones in a Mountain of Software Code

Basic Ideas

People clone source code all the time. How can we find clones in a huge source code base?
The basic idea is that a block of cloned code shows similarity (by some definition) to some
other block of code. So the easy way is to define some measure of similarity over blocks of
code, and simply compare them all. Those blocks with high similarity may be clones.

While simple, this idea has a number of engineering flaws. First, you have to define what
you mean by blocks of code: entire files? sequences of lines? language atoms such as
identifiers and keywords? You also have to define how these can vary: exact match, arbitrary
holes in the text, or precise holes matching language elements.
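To make the basic idea concrete, here is a minimal sketch of the naive approach in Python: every block is compared against every other block with an off-the-shelf similarity measure. The sample blocks and the 0.6 threshold are illustrative assumptions, not part of any real tool.

```python
# Naive pairwise clone check: compare every pair of blocks with a
# similarity measure and keep pairs above a threshold.
from difflib import SequenceMatcher
from itertools import combinations

def naive_clone_pairs(blocks, threshold=0.6):
    """Return (i, j, ratio) for every pair of blocks whose similarity
    ratio meets the threshold. This performs O(N**2) comparisons --
    fine for a demo, hopeless for a million blocks."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(blocks), 2):
        ratio = SequenceMatcher(None, a, b).ratio()
        if ratio >= threshold:
            pairs.append((i, j, ratio))
    return pairs

blocks = [
    "total = 0\nfor x in items:\n    total += x",
    "sum = 0\nfor y in items:\n    sum += y",   # renamed copy of the first
    "print('unrelated code')",
]
print(naive_clone_pairs(blocks))
```

The renamed copy is reported as similar to the original, while the unrelated block is not; everything else in the article is about doing this job with better precision and without the quadratic blowup.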

Look for Big Matching Chunks? Not much Payoff

If you define things simply, you can build a simple clone detector.
If you stick to finding cloned files with no changes, you can compare files against each other
using any diff tool you can find.
The downside is that you won't find many clones, because most clones are modified
versions of the original text.
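A whole-file detector of this kind can be sketched in a few lines: hash each file's contents and report any group of files with the same hash. The file names and contents are made up for illustration.

```python
# Whole-file clone detection by content hashing: only byte-identical
# files are ever reported, which is exactly this approach's limitation.
import hashlib
from collections import defaultdict

def find_identical_files(files):
    """Group files by a hash of their contents; any bucket holding more
    than one file is a set of exact file-level clones."""
    buckets = defaultdict(list)
    for name, text in files.items():
        digest = hashlib.sha256(text.encode()).hexdigest()
        buckets[digest].append(name)
    return [names for names in buckets.values() if len(names) > 1]

files = {
    "a/util.c":      "int add(int a, int b) { return a + b; }\n",
    "b/util_copy.c": "int add(int a, int b) { return a + b; }\n",
    "c/other.c":     "int sub(int a, int b) { return a - b; }\n",
}
print(find_identical_files(files))  # → [['a/util.c', 'b/util_copy.c']]
```

Change even one character in the copy and it silently drops out of the report, which is why this cheap approach finds so little.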

Grain size for Matching? Lines are Not a Good Idea

If you want a good clone detector, you need a finer grain.
So you need to choose between "lines" and program elements.
You also need to allow variation between the clones to account for modifications people make when copying.
Mostly they make changes within or across line boundaries.
For source text, variation can include reformatting the code layout,
changing variable names, and modifying comments. An ideal detector will
be able to detect clones in spite of all this.

Now you have to decide
how to chop up files into blocks of lines (or program elements).
Lines are easy. Program elements are harder; the tool has to understand
the structure of the language to some degree.

Scale

The next thing to note is scale: we have N blocks, and comparing them all is fundamentally
an N**2 process; a million blocks means a trillion compare operations.
We're going to have to be smart or the analysis will never finish.
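The standard way to be smart is hashing: put each block into a bucket keyed by a hash, and only compare blocks that share a bucket. A minimal sketch of this bucketing idea, with toy one-line "blocks":

```python
# Hash-bucketing sketch: instead of comparing all N blocks pairwise,
# hash each block and compare only blocks landing in the same bucket.
# With a well-distributed hash, total work is close to linear in N.
from collections import defaultdict

def bucketed_candidate_pairs(blocks):
    buckets = defaultdict(list)
    for i, block in enumerate(blocks):
        buckets[block].append(i)   # dict hashing does the bucketing
    # Only within-bucket pairs ever need a detailed comparison.
    pairs = []
    for members in buckets.values():
        for a in range(len(members)):
            for b in range(a + 1, len(members)):
                pairs.append((members[a], members[b]))
    return pairs

blocks = ["x = 1", "y = 2", "x = 1", "z = 3", "y = 2"]
print(bucketed_candidate_pairs(blocks))  # → [(0, 2), (1, 4)]
```

Five blocks would naively need 10 comparisons; bucketing reduces that to the 2 candidate pairs that can actually match. All of the detector families below exploit some variant of this trick.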

Types of Clone Detectors Compared

Dot-plot style. These mostly compare single lines globally for equality and draw a dot-plot graph. They work by computing hash codes for single lines; lines that land in the same bucket are compared for equality, claimed as clones, and produce a dot in the display graph. Benefits: these scale well, produce pretty graphs, and work for any text file.
Problems: while pretty, the dot plots don't really tell you much; they don't show you the code or provide precise location information, and they don't help you form an abstraction to replace the clone.
And exact matching means many clones are missed.
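The dot-plot computation itself is tiny: bucket the lines of one file by content, then emit a dot at (i, j) wherever line i of the first file exactly equals line j of the second. The two three-line files below are invented for illustration.

```python
# Dot-plot sketch: a dot at (i, j) means line i of file A is exactly
# equal to line j of file B. Plotting the points is left to any
# charting tool; here we just compute the coordinates.
from collections import defaultdict

def dot_plot_points(lines_a, lines_b):
    where_in_b = defaultdict(list)
    for j, line in enumerate(lines_b):
        where_in_b[line].append(j)        # bucket B's lines by content
    return [(i, j)
            for i, line in enumerate(lines_a)
            for j in where_in_b.get(line, [])]

a = ["int x = 0;", "x++;", "return x;"]
b = ["int x = 0;", "x--;", "return x;"]
print(dot_plot_points(a, b))  # → [(0, 0), (2, 2)]
```

A cloned region shows up as a diagonal run of dots; note the edited middle line simply produces no dot, illustrating how exact matching loses modified clones.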

Blocks-of-lines matching. These match groups of identical lines, and provide location data as a list of file/line-number/length triples. Benefits: more precise location data. Problems: exact matching again means many clones are missed, and these detectors have a high false positive rate, because matched groups of lines often don't correspond to meaningful program structures. You get some answers, but you don't get a lot of bang for the buck. These are relatively easy to build, so there are lots of them out there.
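A blocks-of-lines matcher is essentially a sliding window plus the bucketing trick. The sketch below uses an arbitrary window of 3 lines and reports file/line-number locations for every window that occurs more than once; the sample files are invented.

```python
# Blocks-of-lines sketch: slide a fixed-size window over each file's
# lines, bucket the windows by content, and report every window text
# that occurs in more than one place, with its locations.
from collections import defaultdict

def matching_line_blocks(files, window=3):
    """Return {block_text: [(filename, start_line), ...]} for every
    `window`-line block appearing at more than one location."""
    seen = defaultdict(list)
    for name, text in files.items():
        lines = text.splitlines()
        for start in range(len(lines) - window + 1):
            block = "\n".join(lines[start:start + window])
            seen[block].append((name, start + 1))  # 1-based line numbers
    return {blk: locs for blk, locs in seen.items() if len(locs) > 1}

files = {
    "a.c": "open();\nread();\nclose();\nlog();\n",
    "b.c": "init();\nopen();\nread();\nclose();\n",
}
for block, locs in matching_line_blocks(files).items():
    print(locs)  # → [('a.c', 1), ('b.c', 2)]
```

This is the whole trick behind many easy-to-build detectors: it delivers exactly the file/line-number/length style of report described above, with exactly the weaknesses described above.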

Token-based matching. These tools break the source code into language tokens for the language of interest, and look for matches over sequences of tokens. Benefits: they are independent of the actual layout of the source text, so they can detect clones that have been reformatted. They can detect matching sequences of tokens where tokens in one sequence are replaced one-for-one with tokens in another; usually this is limited to identifiers being replaced with other identifiers or numbers. Problems: first, you need an accurate lexer for the languages of interest; while it is easy to build sloppy lexers, real languages such as C#, COBOL and C++ have quite complicated lexical syntax, and these detectors tend to be weak in matching the details. This weakness eventually turns into detector confusion about the actual edges of the more exotic lexemes, and that in turn causes the detector to either miss clones it should see, or see clones that aren't there. Additionally, since these tools don't deeply understand language syntax, they don't generally produce clones that match how the copy-and-paste action occurred; programmers copy well-defined blocks of code, not just any sequence of tokens. Thus, these kinds of detectors will claim token sequences such as "... identifier1 } { identifier2 ..." as clones, even though such chunks are clearly not clones. Token-based detectors thus have a high false positive rate. If your software has 5000 real clones, you don't want a tool that reports the 5000 real ones plus 2500 false positives. Better-engineered detectors using this technique spend effort trying to bring in the notion of program boundaries such as begin-end blocks, but when you start with the assumption that you don't have any, this is an uphill fight.
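The core of the token approach, the one-for-one abstraction of identifiers, can be sketched with a deliberately toy regex lexer. A real detector needs an accurate lexer per language (the hard part, as noted above); this regex, which even folds keywords into the ID class, merely stands in for one.

```python
# Token-based sketch: normalize every identifier to ID and every number
# to NUM, so renamed copies produce identical token sequences.
import re

TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")   # toy lexer, NOT production grade

def normalized_tokens(code):
    out = []
    for tok in TOKEN.findall(code):
        if re.fullmatch(r"[A-Za-z_]\w*", tok):
            out.append("ID")                 # one-for-one identifier abstraction
        elif tok.isdigit():
            out.append("NUM")
        else:
            out.append(tok)                  # operators/punctuation kept as-is
    return tuple(out)

a = "total = total + price"
b = "sum = sum + cost"        # same shape, renamed identifiers
c = "total = total * 2"       # different operator and a literal
print(normalized_tokens(a) == normalized_tokens(b))  # → True
print(normalized_tokens(a) == normalized_tokens(c))  # → False
```

Notice the matcher has no idea where statements or blocks begin and end: it would just as happily match a normalized sequence spanning a closing brace and the next opening one, which is exactly the false-positive problem described above.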

Function-based matching. The idea here is to match function bodies as potential clones. This has the big advantage of getting much of the language structure into the analysis process. These detectors use real language parsers to read the source code, find all function bodies, and then compute a hash code over each function body. Function bodies with matching hash codes might be clones. A key problem here is handling variation: how do you get two functions which are almost identical to have the same hash code? These detectors find exact-function clones easily, but are not good at finding function bodies that have changed. And they cannot find clones inside functions. This idea seems to have mostly fallen out of favor.

Abstract Syntax Tree (AST) based matching. Programs have structure induced by the language's grammar rules. One can use this to find clones efficiently by hashing on arbitrary substructures of the program (identifiers, expressions, statements, declarations, function headers, statement sequences, ... encoded as compiler data structures called abstract syntax trees). This hashing process makes an N**2 computation essentially linear in time. Once a small clone has been identified, it and neighboring clones can be merged to form larger ones; the gaps between become parameters to the clone, and can be other arbitrary program structures, including identifiers, expressions, statements, etc. Benefits: detected clones match code chunks that programmers would actually copy; found clones contain parameters for points of variation; and these detectors can find clones regardless of program formatting, because the abstract syntax trees have been produced by a compiler-accurate front-end parser for the language.
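The first step of the AST approach (hashing subtrees so matching structures land in the same bucket) can be sketched for Python with the stdlib parser. This shows only subtree bucketing, not the merging of neighbors into larger parameterized clones that a full detector performs.

```python
# AST-subtree sketch: bucket every statement subtree by its structural
# dump; subtrees appearing in more than one place are clone candidates,
# regardless of how the source text was formatted.
import ast
from collections import defaultdict

def subtree_clone_candidates(source):
    buckets = defaultdict(list)
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.stmt):
            # ast.dump omits line/column info, so formatting is irrelevant.
            buckets[ast.dump(node)].append(node.lineno)
    return [lines for lines in buckets.values() if len(lines) > 1]

source = (
    "if ready:\n"
    "    queue.append(task)\n"         # formatted one way
    "\n"
    "if ready:  queue.append(task)\n"  # reformatted copy, identical AST
)
print(subtree_clone_candidates(source))
```

Both the whole `if` statements and their bodies are reported as matching despite the different layouts, which is exactly the formatting-independence the AST representation buys.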

Data flow based. The conceptual step beyond ASTs is to match what the program computes. While this idea is attractive, attempts to implement it have foundered on the cost of comparing the data flow graphs that compilers use to represent the program's computation. No practical data flow clone detector is presently available. University research on this continues.

CloneDR is an Abstract Syntax Tree based detector. Semantic Designs' CTO, Dr. Ira Baxter, wrote the key paper
on how to detect clones using this technique, and it is the most-referenced paper in the clone literature as of 2010.
SD's CloneDR tool is built on this initial understanding plus a decade of honing the idea, and is based on DMS and DMS's available array of language-precise
parsers for C, C++, C#, Java, COBOL, PHP, Ada and many more. SD's CloneDR has the additional benefit of bringing multicore computing to the problem;
being coded as a DMS application, it can use the processors available in your large server to compute accurate answers as quickly as possible. When you have a million
lines of code to analyze, time matters.

For those of you with a deep technical curiosity, this clone detection tool survey (2009) provides considerable detail about types of clone detectors and how they compare, including a top rating of CloneDR.

Semantic Designs - Our Goal

To enable our customers to produce and maintain timely, robust and economical software by providing world-class Software Engineering tools using deep language and problem knowledge with high degrees of automation.