Abstract

Computer software is, by its very nature, highly complex and invisible, yet subject to near-continual pressure to change. Over time the development process has become more mature and less risky. This is in large part due to the concept of software traceability: the ability to relate software components back to their initial requirements and to each other. Such traceability aids tasks such as maintenance by facilitating the prediction of "ripple effects" that may result from a change, and aids comprehension of software structures in general. Many organisations, however, have large amounts of software for which little or no documentation exists; the original developers are no longer available, and yet this software still underpins critical functions. Such "legacy software" can therefore represent a high risk when changes are required.

Consequently, large amounts of effort go into attempting to comprehend and understand legacy software. The most common way to accomplish this, given that reading the code directly is hugely time-consuming and near-impossible, is to reverse engineer the code, usually to a form of representative projection such as a UML class diagram. Although a wide number of tools and approaches exist, there is no empirical way to compare them or validate new developments. Consequently there was an identified need to define and create the Reverse Engineering to Design Benchmark (RED-BM), which was then applied to a number of industrial tools. The measured performance of these tools varies from 8.8% to 100%, demonstrating both the effectiveness of the benchmark and the questionable performance of several tools.

In addition to the structural relationships detectable through static reverse engineering, other sources of information are available with the potential to reveal other types of relationships, such as semantic links.
One such source is the mining of source code repositories, which can be analysed to find components within a software system that have, historically, commonly been changed together during the evolution of the system, and from the strength of that co-change to infer a semantic link. An approach was implemented to mine such semantic relationships from repositories, and relationships were found beyond those expressed by static reverse engineering. These included groups of relationships potentially suitable for clustering.

To allow for the general use of multiple information sources to build traceability links between software components, a uniform approach was defined and illustrated. This includes rules and formulas to allow the combination of sources. The uniform approach was implemented in the field of predictive change impact analysis, using reverse engineering and repository mining as information sources. This implementation, the Java Code Relationship Analysis (jcRA) package, was then evaluated against an industry-standard tool, JRipples. Depending on the target, the combined approach is able to outperform JRipples in detecting potential impacts, at the risk of over-matching (a high number of false positives and overall class coverage on some targets).