Extracting Group Relationships Within Changing Software Using Text Analysis

Author

Green, Pamela Dilys

Attention

2299/11896

Abstract

This research looks at identifying and classifying changes in evolving software by
making simple textual comparisons between groups of source code files. The two
areas investigated are software origin analysis and collusion detection. Textual
comparison is attractive because it can be used in the same way for many different
programming languages.
The research includes the first major study using machine learning techniques
in the domain of software origin analysis, which looks at the movement of code in
an evolving system. The training set for this study, which focuses on restructured
files, is created by analysing 89 software systems. Novel features, which capture
abstract patterns in the comparisons between source code files, are used to build
models which classify restructured files from unseen systems with a mean accuracy
of over 90%. The unseen code is not only in C, the language of the training set, but
also in Java and Python, which helps to demonstrate the language independence
of the approach.
As well as generating features for the machine learning system, textual comparisons
between groups of files are used in other ways throughout the system:
in filtering to find potentially restructured files, in ranking the possible destinations
of the code moved from the restructured files, and as the basis for a new file
comparison tool. This tool helps in the demanding task of manually labelling the
training data, is valuable to the end user of the system, and is applicable to other
file comparison tasks.
These same techniques are used to create a new text-based visualisation for use
in collusion detection, and to generate a measure which focuses on the unusual
similarity between submissions. This measure helps to overcome problems in
detecting collusion in data where files are of uneven size, where there is high
incidental similarity or where more than one programming language is used. The
visualisation highlights interesting similarities between files, making the task of
inspecting the texts easier for the user.