Corrigendum to: Empirical analysis of the relationship between CC and SLOC in a large corpus of Java methods and C functions published on 9 December 2015

Publication

INTRODUCTION

During the preparation of the corresponding chapter in Davy Landman's PhD thesis, some minor graphical and statistical discrepancies were found in the paper “Empirical analysis of the relationship between CC and SLOC in a large corpus of Java methods and C functions.” To support future reproduction and use of this work, we prepared the current erratum, containing several updated figures, a diagnosis of the cause of the errors, and an explanation of the effect on the original paper. None of the issues reported in this erratum influences the conclusions of the original paper.

ISSUES DISCOVERED

The hexagonal scatter plots in Figure lack a more prominent line at CC = 0. This was caused by a bug in ggplot (reported and confirmed: https://github.com/tidyverse/ggplot2/issues/2061), which would filter out data around the limits.

The R² values in the Tables B and B of the C corpus were off by a maximum of 0.01 from the actual result. The cause was that this table was not recalculated after fixing a bug in the “remove out-of-scope code” phase. Note that the impact of this error is scattered throughout the paper, as the correlations of Tables and are often repeated for clarity in the remaining sections (for example, the R² of the linear model for all the C functions is 0.43 instead of 0.44).

Our R code calculating the log-transformed linear fit contained an error. The dashed lines in Figures, and are impacted, as is the shape of the residual plot in Figure 11. The biggest impact is in Figure, where the original fit seemed to miss the data almost entirely. We misinterpreted this phenomenon in the last sentence of the second paragraph of section 4.4.2; it is not caused by the skewness of the distributions of the two metrics, but rather by this bug.

The custom implementation of the log-scaled y-axis of the residual plots in Figure contained two errors:
∘ The labels on the y-axis were off by a factor of 10.
∘ For the negative side of the residual plot, we took the absolute value, calculated the log10 value, and made it negative again. However, for values between 0 and 1 (the values close to the linear fit), the log10 value is itself negative (as log10(1) equals 0). This caused strange outliers in the original plots that were not scrutinized. The fixed residual plots do not have these outliers and look much more like the data in Figure.

We republished the data sets related to the current paper on Zenodo to increase their availability:
∘ Landman, Davy. (2015). A Curated Corpus of Java Source Code based on Sourcerer (2015) [Data set]. Zenodo. http://doi.org/10.5281/zenodo.208213
∘ Landman, Davy. (2015). A Large Corpus of C Source Code based on Gentoo packages [Data set]. Zenodo. http://doi.org/10.5281/zenodo.208215
∘ Landman, Davy. (2015, February 26). cwi-swat/jsep-sloc-versus-cc. Zenodo. http://doi.org/10.5281/zenodo.293795

NEW IMAGES

The remaining part of this erratum contains updated tables and figures as replacements for the original paper.

Figure 8 (Figure presented.) Scatter plots of SLOC vs CC zoomed in on the bottom left quadrant. The solid and dashed lines are the linear regression before and after the log transform. The grayscale gradient of the hexagons is logarithmic.

Table 4 Correlations for part of the tail of the independent variable SLOC. All correlations have a high significance level p ≤ 1×10⁻¹⁶. (b) C functions (Table presented.)

Table 5 Correlations for part of the tail of the independent variable SLOC removed. All correlations have a high significance level p ≤ 1×10⁻¹⁶. (b) C functions (Table presented.)

Figure 9 (Figure presented.) Scatter plots of SLOC vs CC on a log-log scale. The solid and dashed lines are the linear regression before and after the log transform. The grayscale gradient of the hexagons is logarithmic.

Figure 11 (Figure presented.) Residual plot of the linear regressions after the log transform; both axes are on a log scale. The grayscale gradient of the hexagons is logarithmic.

Figure 12 (Figure presented.) Scatter plots of SLOC vs CC for Java and C files. The solid and dashed lines are the linear regression before and after the log transform. The grayscale gradient of the hexagons is logarithmic.
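
To illustrate the second y-axis error listed under ISSUES DISCOVERED, the following minimal Python sketch (not the original R code) reproduces the transform as described in this erratum and contrasts it with one common remedy. The function names are hypothetical. Two assumptions are made: that “made it negative again” means multiplying by −1, and that a shift by 1 is an acceptable remedy. The erratum does not state how the axis was actually fixed.

```python
import math


def signed_log10_as_described(residual):
    """Signed log10 following the description in this erratum.

    Assumption: "made it negative again" means multiplying by -1.
    """
    if residual == 0:
        return 0.0  # assumption: zero is mapped to the axis origin
    if residual > 0:
        return math.log10(residual)
    # Negative side: absolute value, log10, then negate.
    return -math.log10(abs(residual))


def signed_log10_shifted(residual):
    """One common remedy (an assumption, not necessarily the authors' fix):
    shift the magnitude by 1 so log10 is never negative, then restore the sign."""
    return math.copysign(math.log10(1 + abs(residual)), residual)


# Compare both transforms for residuals on either side of the fit.
for r in (-100, -10, -0.1, 0.1, 10, 100):
    print(r, signed_log10_as_described(r), signed_log10_shifted(r))
```

Under these assumptions, a residual of −0.1 is mapped by the first function to roughly the same position as a residual of +10, because log10(0.1) is negative and is then negated. This would make points close to the fit appear as outliers. The shifted variant keeps the sign of every residual and preserves their order.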