A Platform for the Integration of Repository Mining and Software Analytics through Big Data Technologies

Abstract

Researchers in various areas (e.g. defect prediction, software evolution, software simulation) analyse software projects to develop new ideas or to test their assumptions through case studies. Analysing a software project requires two distinct steps: (1) mining the project data, which includes pre-processing, calculating metrics, and synthesizing composite results, and (2) performing the analysis on the basis of the mined data. Because the process is divided into these two steps, different tooling is often used for each (e.g. CVSAnaly for mining, WEKA for analysis). Furthermore, the tooling available for these steps is very varied. As a result, published studies are often not replicable, which in turn rules out meta-analyses (analyses of analyses). Additionally, mining and combining information from different data sources (e.g. mailing lists and source code repositories) is a complex and error-prone task. There is also an ever-growing need to analyse large amounts of data efficiently and quickly. Our solution to these problems is a platform that integrates the mining and the analysis of software projects and presents them to the researcher through an easy-to-use web interface. The platform is designed to be general-purpose and uses different big data technologies and frameworks (e.g. Apache Spark, Apache Hadoop) to enable efficient and fast analysis of the mined data. Additionally, the platform can be deployed automatically in a cloud infrastructure. We have used our platform for mining different projects and for conducting defect prediction research on the mined data. The results show that our platform is a useful tool for conducting software project research.