followed by a response from some of the original authors. The complete article will appear in the December issue of the journal Empirical Software Engineering [2].

In the meantime, I will try to summarise my arguments and link to an early version of my comments on arXiv.

It’s important to make considered decisions about the types of participant we use and appreciate the potential threats to validity.

We should avoid dichotomising participants as professionals/students since a student might be a part-time or former professional and in any case their experience may be more or less relevant.

It’s important to be explicit about the type of population being investigated.

We also need to consider how we sample tasks, artefacts and settings (as well as participants). Are these representative? Are there potential interactions?

Sometimes pragmatism wins the argument: using students may be the only option, but if that’s the case let’s be honest and say it’s a matter of expediency rather than claim it’s actually better than using professionals (if that is indeed the case).

In terms of advocacy, using professionals is more likely to be persuasive for getting new software engineering techniques adopted in practice.

So whilst I appreciate Falessi et al. [1] initiating a discussion around the choice of participants in our experiments, my fear is that some researchers may use this paper as blanket justification for taking the easier path of using students even when they are not representative of the population of interest. Presently (18.10.2018), the paper has 14 citations. After excluding one duplicate and one paper in Portuguese, 8 of the remaining 12 argue that it’s ok to use students because Falessi et al. say so. A typical example is: “Falessi et al. state that controlled experiments with students are as valid as experiments with experts”. Obviously there are occasions when this may be so, but using the paper as blanket permission worries me. In fact, it worries me a good deal.

Updated:

This blog was updated (25.10.2018) to reflect the new title and authorship of [2] which has been changed at the request of the journal editors Robert Feldt and Tom Zimmerman to better reflect the content.

It was further updated (28.11.2018) to provide a link and full publication details for reference [2].

How effective is the process of replication for adding empirically-derived, software engineering knowledge?

In empirical software engineering, it is more or less a given that replication is the best way to test our confidence in an empirical result. Consequently the number of replication studies has been growing: a mapping study found 16 replications in 2000–2002 and, a decade later in 2010–2012, a total of 63, almost a four-fold increase [3].

Defining replication

However, first things first: what do we mean by replication? In a meticulous review of definitions, Gómez et al. [5] found more than 70 different ones, which they classified into three groups:

Group 1: essentially a faithful replication of the original experiment

Group 2: some variation from the original experiment e.g., measurement instruments, metrics, protocol, populations, experimental design or researchers

Group 3: shares the same constructs and hypotheses

It is not obvious to me how Groups 2 and 3 differ, so it seems simpler to refer to Group 1 as a reproduction (as is common in other scientific disciplines [7]) and to treat Groups 2 and 3 as replications on some continuum.

Q1: How similar must a replication result be to constitute confirmation?

Returning to our first question: how similar must a replication result be to the original experiment to constitute confirmation? Remarkably, we don’t seem to have explicitly addressed this question of #replicability in software engineering. Perhaps the answer is so obvious it doesn’t need addressing; but is it? Clearly we wouldn’t expect identical results, unless we are dealing with #reproducibility. So how much difference might be acceptable?

An obvious and common approach is to use p-values and null hypothesis significance testing (NHST). If the original study’s calculated p-value falls below a threshold (typically α = 0.05, possibly with correction for multiple tests), the effect is deemed “statistically significant”, so one would expect a confirmatory replication also to be significant. Unfortunately this reasoning is mistaken, particularly if the original study is under-powered [1, 2, 4]. Worse, if the null hypothesis is true then p is a random variable following a uniform distribution, which means all values of p are equally likely.
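The uniformity of p under a true null is easy to see by simulation. Below is a minimal sketch (in Python rather than the R used in the paper); to keep it self-contained it uses a two-sample z-test with known unit variances, a simplification of the usual t-test:

```python
import random
from statistics import NormalDist, mean

rng = random.Random(1)
norm = NormalDist()

def z_test_p(n=30):
    """Two-sample z-test p-value when H0 is true (both groups ~ N(0,1))."""
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [rng.gauss(0, 1) for _ in range(n)]
    # With known unit variances, the SE of the mean difference is sqrt(2/n)
    z = (mean(x) - mean(y)) / (2 / n) ** 0.5
    return 2 * (1 - norm.cdf(abs(z)))   # two-sided p-value

ps = [z_test_p() for _ in range(5000)]
# Under a true null, p is uniform on [0, 1]: every decile holds ~10% of values
for lo in [i / 10 for i in range(10)]:
    share = sum(lo <= p < lo + 0.1 for p in ps) / len(ps)
    print(f"p in [{lo:.1f}, {lo + 0.1:.1f}): {share:.1%}")
```

Each decile comes out near 10%, so a “significant” original result tells us nothing about what p-value a replication should produce if the null is true.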

So, even if both studies sample the same population, the intervention is perfectly replicated and the measurement instrument is identical and error-free, sampling error alone can cause differences in results, and these differences can be surprisingly large.

Simulation

In my paper I explore this through simulation.

Suppose we have two treatments, X and Y, that we want to compare experimentally. Each experiment has 30 units, where a unit might be a participant, a data set, and so forth. This seems reasonable given that Jørgensen et al. [6] found in their survey of software engineering experiments that 47% had a sample size of 25 or less. Let’s also suppose the experimental design is extremely simple and that the two samples are independent rather than paired. We also assume the rather unlikely situation of no measurement error and no publication bias.

We investigate two underlying population distributions: (i) normally distributed and (ii) a more realistic mixed-normal distribution (contaminating component with sd = 10, mix probability = 0.2) that yields a heavy-tailed but still symmetric distribution.

I simulate two conditions: (i) no effect, i.e., μ(X) = μ(Y) = 0, and (ii) a small effect (μ(X) – μ(Y) = 0.2). Note that small effect sizes dominate our research [6] and this is exacerbated by the tendency of under-powered studies to over-estimate the true effect, not to mention selective reporting and flexible analysis practices.

Then I simulate the replication process by randomly drawing pairs of studies, without replacement, and observing the difference in results. So for my simulation of 10,000 experiments this gives 5,000 replications.
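The setup can be sketched as follows. This is a simplified Python re-rendering of the idea, not the downloadable R code from the paper, and the pooled-SD Cohen’s d used here is an approximation:

```python
import random
from statistics import mean, stdev

rng = random.Random(42)

def contaminated_normal(mu, p_mix=0.2, sd_wide=10):
    """Heavy-tailed but symmetric: N(mu, 1) contaminated with N(mu, sd_wide)."""
    return rng.gauss(mu, sd_wide if rng.random() < p_mix else 1)

def experiment(effect=0.2, n=30):
    """One two-arm experiment; returns the standardised mean difference (Cohen's d)."""
    x = [contaminated_normal(effect) for _ in range(n)]
    y = [contaminated_normal(0) for _ in range(n)]
    pooled_sd = ((stdev(x) ** 2 + stdev(y) ** 2) / 2) ** 0.5
    return (mean(x) - mean(y)) / pooled_sd

# 10,000 experiments paired off without replacement -> 5,000 'replications'
ds = [experiment() for _ in range(10_000)]
rng.shuffle(ds)
pairs = list(zip(ds[0::2], ds[1::2]))
agree = sum((d1 > 0) == (d2 > 0) for d1, d2 in pairs) / len(pairs)
print(f"replication pairs agreeing in direction: {agree:.1%}")
```

Even with a genuine effect of 0.2, the heavy tails leave each study badly under-powered, so pairs of studies frequently disagree about the direction of the effect.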

Results

The R code, additional figures and associated materials can be downloaded. However, in brief, there are three main findings:

for the no-effect condition we observe a surprisingly wide range of possible effect sizes, with only just over half the experiments finding a negligible or no effect [-0.2, +0.2].

in the face of heavy tails, only ~32% of replication pairs agree in the correct direction, while ~48% disagree and ~19% agree but in the opposite direction to the true effect.

Alternatively, if we ask what variability in results between the original and replication studies we might expect solely due to sampling error, we can construct a prediction interval [9]. In other words, how different can two results be and still be explained by nothing more than random differences between the two samples? Essentially, given the first n1 observations (from the original study), what might we expect from the next n2 (in the replication)? To compute this we need both sample sizes, the statistic of interest from the original study (in our case the standardised mean difference, aka Cohen’s d) and its variance or standard deviation. Unfortunately researchers are seldom in the habit of reporting this information, which greatly reduces the value of published results.
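As a sketch of the idea, one common large-sample approximation combines the standard errors of d from the two studies; the formula and the numbers below are illustrative and not necessarily those used in the paper:

```python
from statistics import NormalDist

def d_standard_error(d, n1, n2):
    """Common large-sample SE of Cohen's d for two independent groups."""
    return ((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2))) ** 0.5

def prediction_interval(d_orig, n1_orig, n2_orig, n1_rep, n2_rep, level=0.95):
    """Range of replication d values explainable by sampling error alone
    (normal approximation; combines the SEs of both studies)."""
    z = NormalDist().inv_cdf((1 + level) / 2)
    se = (d_standard_error(d_orig, n1_orig, n2_orig) ** 2
          + d_standard_error(d_orig, n1_rep, n2_rep) ** 2) ** 0.5
    return d_orig - z * se, d_orig + z * se

# Hypothetical numbers: original d = 0.3 with 30 per arm; replication also 30 per arm
lo, hi = prediction_interval(0.3, 30, 30, 30, 30)
print(f"95% prediction interval: [{lo:.2f}, {hi:.2f}]")
```

With these (invented but realistic) sample sizes the interval spans well over a unit of standardised effect and includes zero, illustrating why almost any replication result would count as a ‘confirmation’.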

The following table shows three examples of studies that have been replicated. They were chosen simply because I have access to the variance of the results. There are two points to make. First, the good news: each study is confirmed by the replication. Second, the bad news: the prediction intervals are all so broad it’s hard to conceive of a replication that wouldn’t ‘confirm’ the original study. Consequently we learn relatively little, particularly when the effect size is small and the variance high. Note that for R1 and R2, a result in either direction would constitute a confirmation! Even for R3 the prediction interval includes no effect.

Q2: How effective is the process of replication?

So for the second question, how effective is the process of replication for adding empirically-derived, software engineering knowledge? My answer is hardly at all, hence the title of the paper and my blog.

To finish on a positive note, my strong recommendation is to use meta-analysis to combine results and so estimate the population effect. To that end, below is a simple forest plot for R3 which, by pooling the results, narrows the confidence interval around the overall estimate.

Our goal was to examine how cognitive biases (specifically the anchoring bias) impact software engineers’ judgement and decision making, and how such biases might be mitigated.

The anchoring bias is powerful and has been widely documented [1]. It arises from being influenced by initial information, even when that information is totally misleading or irrelevant. This strong effect can be very problematic when making decisions or judgements. Even extreme anchors, e.g., that the length of a blue whale is 900m (unreasonably high) or 0.2m (unreasonably low), influence people’s judgements about the length of whales. Jørgensen has been active in demonstrating that software engineering professionals are not immune to this bias (see his new book on time predictions, available as a free download).

Therefore we decided to experimentally investigate whether de-biasing interventions, such as a 2-3 hour workshop, can reduce or even eliminate the anchoring bias. Given the many concerns that have been expressed about under-powered studies and about reliably identifying small effects in the context of noisy data, we made four decisions.

Use professional software engineers. (There is an ongoing debate about the value of student participants, e.g., the article by Falessi et al. [2] in favour, which contrasts with the strong call for realism from Sjøberg et al. [3]. We side with realism.)

In brief, we used a 2×2 experimental design with high and low anchors combined with a de-biasing workshop and control. Participants were randomly exposed to a high or low anchor and then asked to estimate their own productivity on the last software project they had completed (EstProd). Some had previously undertaken our de-biasing workshop while others received no intervention, i.e., the control group.
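The interaction logic of such a 2×2 design can be sketched numerically; the EstProd values below are invented purely for illustration and are not our experimental data:

```python
from statistics import mean

# Hypothetical EstProd responses per cell of the 2x2 design (invented values)
cells = {
    ("high", "control"):  [9.0, 8.5, 9.5, 8.8],
    ("low",  "control"):  [3.0, 3.5, 2.8, 3.2],
    ("high", "workshop"): [7.0, 6.5, 7.2, 6.8],
    ("low",  "workshop"): [3.4, 3.8, 3.1, 3.6],
}
m = {cell: mean(values) for cell, values in cells.items()}

# Anchor effect (high minus low) within each arm of the design
anchor_control  = m[("high", "control")]  - m[("low", "control")]
anchor_workshop = m[("high", "workshop")] - m[("low", "workshop")]

# A positive interaction contrast means the workshop shrank the anchor effect
interaction = anchor_control - anchor_workshop
print(f"anchor effect (control):  {anchor_control:.2f}")
print(f"anchor effect (workshop): {anchor_workshop:.2f}")
print(f"interaction contrast:     {interaction:.2f}")
```

In this toy example the workshop arm shows a smaller, but still positive, anchor effect, mirroring the pattern described in the text: reduced but not eliminated.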

The interaction plot below shows a large effect between the high and low anchors. It also shows that the workshop reduces the effect of the high anchor (the slope of the solid line is less steep) but has far less effect on the low anchor. However, it does not eliminate the bias.

We conclude that:

We show how professionals can easily be misled into making highly distorted judgements.

This matters because despite all our tools and automation, software engineering remains a profession that requires judgement and flair.

So try to avoid anchors.

But it is possible to reduce bias.

We believe there are many other opportunities for refining and improving de-biasing interventions.

Caveats are:

We only considered one type of bias.

We used a relatively simple de-biasing intervention based on a 2-3 hour workshop.

Some years back, my sadly departed colleague Prof. Qinbao Song, Dr Zhongbin Sun, Prof. Carolyn Mair and I wrote a paper [1] on the subtleties and perils of using poor-quality data sets in the domain of software defect prediction. We focused on a collection of widely used data sets known as the NASA data sets, which are demonstrably problematic, e.g., implied relational-integrity constraints are violated, such as LOC TOTAL being less than Commented LOC when the former must subsume the latter. Worse, inconsistent versions of the data sets are in circulation. These kinds of problems have particular impact when research is data-driven.
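In the spirit of those cleaning rules, a check for this kind of relational-integrity violation might look like the following sketch; the field names are assumed for illustration, not the NASA data’s actual schema:

```python
# Toy records standing in for defect-prediction module metrics (names assumed)
rows = [
    {"module": "m1", "loc_total": 120, "commented_loc": 30},
    {"module": "m2", "loc_total": 10,  "commented_loc": 25},  # violates the constraint
    {"module": "m3", "loc_total": 55,  "commented_loc": 0},
]

def violates_loc_constraint(row):
    """Total LOC must subsume commented LOC, so it can never be smaller."""
    return row["loc_total"] < row["commented_loc"]

flagged = [r["module"] for r in rows if violates_loc_constraint(r)]
clean = [r for r in rows if not violates_loc_constraint(r)]
print("flagged for removal:", flagged)
```

A real cleaning pipeline would apply a whole battery of such rules (plus duplicate and inconsistency checks) rather than this single constraint.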

In our paper we proposed two cleaning algorithms that generate data sets D’ and D”; we recommend D”. Unfortunately, although the data sets were hosted on Wikispaces, this became a paid-for service and is now closing down altogether, so the data sets have been inaccessible for some time. Belatedly (sorry!) I’ve now hosted the cleaned data on figshare, which should provide a more permanent solution.

Reflecting on the process I see three lessons.

It’s important to choose a stable and accessible home for shared data sets, otherwise the goal of reproducible research is hindered.

Data cleaning is often a good deal more subtle than we give credit for. Jean Petrić and colleagues have further refined our cleaning rules [2].

I’m still of the view that researchers need to pay more attention to the quality and provenance of their data sets (or at least this process needs to be explicit). Otherwise how can we have confidence in our results?

PS Presently, un-cleaned versions are also available from the PROMISE repository along with additional background information; however, I would caution researchers against using these versions due to their obvious errors and inconsistencies.

On Monday and Tuesday (6-7th Nov, 2017) I was privileged to attend a two day workshop at Facebook, London on Testing and Verification organised by Mark Harman and Peter O’Hearn, along with about 90 other people. Unusually for these kinds of events it was a very balanced mix of industry and academia.

The theme was testing and verification and some kind of rapprochement between the two groups. As it turned out this was hardly contentious as everyone seemed to agree that these two areas are mutually supportive.

As ever it seems a bit invidious to pick out the key lessons, but I was most struck by:

Google has ~2 billion LOC. That is hard to appreciate!

Practitioners strongly dislike false positives (i.e., predicted errors that do not exist); it was suggested that a false-positive rate of more than 10% is problematic.

As both an author and a member of a number of programme committees, I’ve been reflecting on the recent decision of various academic conferences, including ICSE 2018, to opt for double blinding of the review process. Essentially this means both the identity of the reviewer and that of the author are hidden; in the case of triple blinding, which has been mooted, even the identity of one’s fellow reviewers is hidden.

But there remain many potentially revealing factors, e.g., the use of British or US English, the use of a comma or period as the decimal separator, the choice of word processor, and the need to cite and build upon one’s own work. Why not quadruple or even quintuple blinding?! Should the editor or programme chair be known? What about the danger of well-established researchers declining to serve on PCs for second-tier conferences? Perhaps there should just be a pool of anonymous papers randomly assigned to anonymous reviewers who are randomly allocated to conferences?

Personally, I’m strongly opposed to any blinding in the review process. And here’s why.

Instinctively I feel that openness and transparency lead to better outcomes than hiding behind anonymity. Be that as it may, let’s try to be a little more analytical. First, what kinds of bias are we trying to address? There seem to be five types of bias derived from:

– personal animosity
– characteristics of the author e.g. leading to misogynist or racist bias
– the alignment / proximity to the reviewer’s research beliefs and values
– citation(s) of the reviewer’s work
– the reviewer’s narcissism and the need for self-aggrandisement

It seems that only the first two biases could be addressed through blinding, and this of course assumes that the blinding is successful, which in small fields may be difficult. Although I would never seek to actively discover the identity of the authors of a blinded paper, in many cases I am pretty certain as to who the authors are. And it doesn’t matter.

In my opinion, double blinding is a distraction, but one with negative side effects. The blinding process harms the paper: as an author I’m asked to withhold supplementary data and scripts because these might reveal who I am. Furthermore, sections must be written in an extremely convoluted fashion so that I don’t refer to my previous work, reference work under review, or do any of the other perfectly natural parts of positioning a new study. It promotes the idea that each piece of research is in some sense atomic.

Why not do the opposite and make reviewers accountable for their opinions by requiring them to disclose who they are? Journal editors and programme chairs are known, so why not the reviewers too?

Open reviewing would reduce negative and destructive reviews. It might also help deal with the situation where the reviewer demands additional references be added, all of which seem tangential to the paper but, coincidentally, are authored by the reviewer! The only danger I foresee is that reviews might become more anodyne because reviewers do not wish to be publicly controversial. But this supposes that I, as a reviewer, wish to have a reputation as a bland yes-person. I don’t, so I’m unconvinced by this argument.

So whilst I accept that the motivation for double-blind reviewing is good, and that I seem to be in a minority (see the excellent investigation of attitudes in software engineering by Lutz Prechelt, Daniel Graziotin and Daniel Fernández), I think it’s unfortunate.

Almost 20 years ago, Chris Schofield and I published a paper entitled “Estimating software project effort using analogies” [1], which described the idea of using case-based (or analogical) reasoning to predict software project development effort from the outcomes of previous projects. We tested the idea on nine different data sets, used stepwise regression as a benchmark, and reported that in “all cases analogy outperforms” the benchmark.
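The core analogy idea (find the k most similar completed projects in a normalised feature space and average their known efforts) can be sketched as follows. This is a toy illustration with invented features and efforts, not the original ANGEL tool:

```python
from statistics import mean

# Past projects: a feature vector (e.g. size, team experience) plus known effort
history = [
    ([100, 3], 1200.0),
    ([250, 5],  900.0),
    ([120, 2], 1500.0),
    ([300, 6],  800.0),
]

def normalise(vectors):
    """Scale each feature to [0, 1] so no single dimension dominates the distance."""
    cols = list(zip(*vectors))
    spans = [(min(c), (max(c) - min(c)) or 1) for c in cols]
    return [[(v - lo) / span for v, (lo, span) in zip(vec, spans)] for vec in vectors]

def predict_effort(new_project, history, k=2):
    feats = normalise([f for f, _ in history] + [new_project])
    target, past = feats[-1], feats[:-1]
    # Euclidean distance from the new project to each historical analogue
    dists = [(sum((a - b) ** 2 for a, b in zip(target, p)) ** 0.5, e)
             for p, (_, e) in zip(past, history)]
    dists.sort(key=lambda t: t[0])
    return mean(e for _, e in dists[:k])  # mean effort of the k nearest analogies

print(predict_effort([110, 3], history))  # averages the two closest projects
```

As in the paper, k (the number of neighbours) is a tuning parameter; other design choices include the distance metric and whether to weight neighbours by similarity.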

Today (27.10.2017) is a landmark in that not only is our paper 20 years old, it has (according to Google Scholar) 1000 citations. So I thought it appropriate to take stock and offer some reflections.

Why has the paper been widely cited?

I think the citations are for four reasons. First, the paper proposed a relatively new approach to an important but tough problem. Actually, the idea wasn’t new but the application was. Second, we tried to be thorough in the experimental evaluation and provide a meaningful comparator. Third, the publication venue of the IEEE Transactions on Software Engineering is highly visible to the community. Finally, there is an element of luck in the citation ‘game’. Timing is all-important; plus, once a paper becomes well known it garners citations simply because other writers can recall it more easily than alternatives that might be more recent or relevant but less well known.

What ideas have endured?

I see three aspects of our paper that I think remain important. First, we used meaningful benchmarks with which to compare our prediction approach. We chose stepwise regression because it’s well understood, simple and requires little effort. If analogy-based prediction cannot ‘beat’ regression then it’s not a competitive technique. I think having such benchmarks is important; otherwise showing that an elaborate technique is better than a slightly less elaborate technique isn’t that practically useful. At its extreme, Steve MacDonell and I showed [2] that yet another study using regression-to-the-mean and analogy [3] was actually worse than guessing, something I hadn’t realised at the time because I hadn’t used meaningful benchmarks.

Second, we used a cross validation procedure, specifically leave-one-out cross validation (LOOCV). Although cross-validation is a complex topic the underlying idea of trying to simulate predicting unseen cases (or projects in our study) is important.
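The LOOCV idea can be sketched in a few lines: each project is held out once and predicted from the remainder. The placeholder predictor and the effort values here are purely illustrative:

```python
from statistics import mean

# Illustrative project efforts only; any predictor can sit behind predict()
efforts = [1200.0, 900.0, 1500.0, 800.0, 1100.0]

def predict(train):
    """Placeholder predictor: the mean of the training efforts."""
    return mean(train)

# Leave-one-out: each project is held out once and predicted from the rest
absolute_errors = []
for i, actual in enumerate(efforts):
    train = efforts[:i] + efforts[i + 1:]
    absolute_errors.append(abs(predict(train) - actual))

print(f"LOOCV mean absolute error: {mean(absolute_errors):.1f}")
```

Note the summary here uses absolute errors rather than MMRE, given the later criticisms of MMRE as a performance measure.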

Third, in terms of realism we also noted that data sets grow one project at a time so in that sense LOOCV is an unrealistic validation procedure. Unfortunately our data did not include start and end dates so we were unable to properly explore this question except through simple simulation.

What would I do differently if we were to rewrite the paper today?

There are three areas that I would definitely try to improve if I were to re-do this study. The first is — and it’s quite embarrassing — the fact that the results cannot be exactly reproduced. This is mainly because the analogy software was written in Visual Basic and ran on Windows NT. It also used some paid-for VBX components. We no longer have access to this environment and so cannot run exactly the same software. Likewise the exact settings for the stepwise regression modelling are now lost and I can only generate close, but not identical, results. A clear lesson is to properly archive scripts, raw and intermediate results. However this would still not address the problem of no longer being able to execute this early version of our Analogy tool (ANGEL).

Second, the evaluation was biased in that we optimised settings for the analogy-based predictions by exploring different values for k (the number of neighbours) but the regression modelling was taken straight out of the box.

Third and finally, we reported predictive performance in terms of problematic measures such as MMRE and pred(25). We did not consider effect size or the variability of the results. Subsequent development in this area has greatly improved research practice.