From Aristotle to Ringelmann: a large-scale analysis of team productivity and coordination in Open Source Software projects

Abstract

Complex software development projects rely on the contribution of teams of developers, who are required to collaborate and coordinate their efforts. The productivity of such development teams, i.e., how their size is related to the produced output, is an important consideration for project and schedule management as well as for cost estimation. The majority of studies in empirical software engineering suggest that - due to coordination overhead - teams of collaborating developers become less productive as they grow in size. This phenomenon is commonly paraphrased as Brooks’ law of software project management, which states that “adding manpower to a software project makes it later”. Outside software engineering, the non-additive scaling of productivity in teams is often referred to as the Ringelmann effect, which is studied extensively in social psychology and organizational theory. Conversely, a recent study suggested that in Open Source Software (OSS) projects, the productivity of developers increases as the team grows in size. Attributing it to collective synergetic effects, this surprising finding was linked to the Aristotelian quote that “the whole is more than the sum of its parts”. Using a data set of 58 OSS projects with more than 580,000 commits contributed by more than 30,000 developers, in this article we provide a large-scale analysis of the relation between size and productivity of software development teams. Our findings confirm the negative relation between team size and productivity previously suggested by empirical software engineering research, thus providing quantitative evidence for the presence of a strong Ringelmann effect. Using fine-grained data on the association between developers and source code files, we investigate possible explanations for the observed relations between team size and productivity. 
In particular, we take a network perspective on developer-code associations in software development teams and show that the magnitude of the decrease in productivity is likely to be related to the growth dynamics of co-editing networks which can be interpreted as a first-order approximation of coordination requirements.

Acknowledgments

Ingo Scholtes and Frank Schweitzer acknowledge support from the Swiss National Science Foundation (SNF), grant number CR31I1_140644/1.

Appendix A:

A1 Data Set

Table 3 summarises the 58 projects in our data set. For each project we show the project name and programming language, the time span of the data retrieved, as indicated by the times of the first and last retrieved commits, the total number of commits and the total number of unique developers during the analysed time span.

Table 3

Summary of the 58 OSS projects in our data set

| Project | Language | From | To | Commits | Committers |
| --- | --- | --- | --- | --- | --- |
| antirez/redis | C | 2009-03-22 09:30:00 | 2014-10-29 11:48:22 | 4361 | 173 |
| mono/mono | C# | 2001-06-08 18:45:34 | 2014-11-21 14:20:40 | 96688 | 738 |
| xbmc/xbmc | C++ | 2009-09-23 01:49:50 | 2014-10-30 13:39:03 | 25810 | 543 |
| TrinityCore/TrinityCore | C++ | 2008-10-02 21:23:55 | 2014-10-30 06:38:55 | 22275 | 487 |
| cocos2d/cocos2d-x | C++ | 2010-07-06 02:19:51 | 2014-10-30 04:45:22 | 16400 | 448 |
| Itseez/opencv | C++ | 2010-05-11 17:44:00 | 2014-10-28 13:06:36 | 11972 | 383 |
| bitcoin/bitcoin | C++ | 2009-08-30 03:46:39 | 2014-10-30 06:31:34 | 4624 | 314 |
| dogecoin/dogecoin | C++ | 2009-08-30 03:46:39 | 2014-08-24 14:57:20 | 4036 | 269 |
| litecoin-project/litecoin | C++ | 2009-08-30 03:46:39 | 2014-09-16 09:50:58 | 2935 | 190 |
| twbs/bootstrap | CSS | 2011-04-27 20:53:51 | 2014-10-30 16:13:39 | 7461 | 670 |
| zurb/foundation | CSS | 2011-10-13 23:09:47 | 2014-10-28 23:54:09 | 5525 | 676 |
| docker/docker | Go | 2013-01-19 00:13:39 | 2014-10-29 23:43:18 | 7216 | 752 |
| elasticsearch/elasticsearch | Java | 2010-02-08 13:30:06 | 2014-10-30 11:32:53 | 9838 | 378 |
| libgdx/libgdx | Java | 2010-03-06 16:05:53 | 2014-10-30 12:56:30 | 7955 | 332 |
| github/android | Java | 2011-10-12 22:36:58 | 2014-08-08 11:09:08 | 2305 | 67 |
| jquery/jquery-mobile | JavaScript | 2010-09-10 22:23:13 | 2014-10-29 23:13:23 | 10847 | 285 |
| meteor/meteor | JavaScript | 2011-11-18 02:35:20 | 2014-07-25 20:57:47 | 8162 | 146 |
| adobe/brackets | JavaScript | 2011-12-07 21:20:16 | 2014-10-30 15:35:15 | 8068 | 232 |
| mrdoob/three.js | JavaScript | 2010-04-24 03:01:19 | 2014-10-28 22:33:52 | 7557 | 440 |
| joyent/node | JavaScript | 2009-02-16 00:02:00 | 2014-10-02 16:00:40 | 6918 | 510 |
| jquery/jquery-ui | JavaScript | 2008-05-22 15:38:37 | 2014-10-25 16:18:17 | 6146 | 285 |
| angular/angular.js | JavaScript | 2010-01-06 00:36:58 | 2014-10-30 15:04:41 | 5930 | 1147 |
| jquery/jquery | JavaScript | 2006-03-22 03:33:07 | 2014-10-30 13:16:32 | 5376 | 249 |
| emberjs/ember.js | JavaScript | 2011-04-30 22:39:07 | 2014-10-30 12:54:26 | 5078 | 504 |
| ajaxorg/ace | JavaScript | 2010-04-02 13:39:45 | 2014-10-30 00:09:09 | 4695 | 241 |
| mozilla/pdf.js | JavaScript | 2011-04-26 06:33:36 | 2014-10-28 18:56:55 | 4630 | 201 |
| strongloop/express | JavaScript | 2009-06-26 18:56:18 | 2014-10-29 05:15:58 | 4459 | 183 |
| cocos2d/cocos2d-html5 | JavaScript | 2012-01-29 09:14:21 | 2014-12-17 03:33:02 | 3468 | 93 |
| mbostock/d3 | JavaScript | 2010-09-27 17:23:59 | 2014-10-23 17:05:38 | 2640 | 86 |
| tryghost/Ghost | JavaScript | 2013-05-04 11:09:13 | 2014-12-15 17:09:24 | 2529 | 210 |
| jashkenas/backbone | JavaScript | 2010-09-30 19:48:05 | 2014-10-28 01:52:45 | 1938 | 265 |
| tastejs/todomvc | JavaScript | 2011-06-03 20:04:08 | 2014-10-30 00:58:23 | 1491 | 226 |
| ivaynberg/select2 | JavaScript | 2012-03-04 18:58:26 | 2014-10-29 21:36:54 | 972 | 316 |
| AFNetworking/AFNetworking | Objective-C | 2011-05-31 21:27:34 | 2014-10-25 02:34:33 | 1579 | 253 |
| WordPress/WordPress | PHP | 2003-04-01 06:17:43 | 2014-10-16 22:07:20 | 27101 | 55 |
| zendframework/zf2 | PHP | 2009-04-28 10:23:49 | 2015-01-13 10:14:03 | 16810 | 890 |
| symfony/symfony | PHP | 2010-01-04 14:26:20 | 2014-10-27 18:25:03 | 12945 | 1236 |
| cakephp/cakephp | PHP | 2005-05-15 21:41:38 | 2014-10-30 01:43:18 | 12224 | 349 |
| bcit-ci/CodeIgniter | PHP | 2006-08-25 17:25:49 | 2014-10-29 11:18:24 | 5987 | 372 |
| laravel/laravel | PHP | 2011-06-09 04:45:08 | 2014-09-29 14:08:27 | 3155 | 264 |
| yiisoft/yii | PHP | 2008-09-28 12:03:53 | 2014-10-24 15:29:38 | 2415 | 220 |
| django/django | Python | 2005-07-13 01:25:57 | 2014-10-30 12:53:20 | 18266 | 798 |
| ansible/ansible | Python | 2012-02-05 17:48:52 | 2014-10-29 04:59:46 | 8825 | 1131 |
| kennethreitz/requests | Python | 2011-02-13 18:52:37 | 2014-10-26 13:45:12 | 2778 | 371 |
| mitsuhiko/flask | Python | 2010-04-06 11:12:57 | 2014-10-27 10:54:30 | 1629 | 270 |
| rails/rails | Ruby | 2004-11-24 01:04:44 | 2014-10-30 16:01:18 | 39509 | 3046 |
| CocoaPods/Specs | Ruby | 2011-09-08 19:46:11 | 2014-10-30 15:52:28 | 25412 | 4107 |
| rapid7/metasploit-framework | Ruby | 1970-01-01 00:02:02 | 2014-10-28 17:10:19 | 22354 | 387 |
| spree/spree | Ruby | 2008-02-25 18:23:50 | 2015-01-13 09:12:45 | 14171 | 742 |
| diaspora/diaspora | Ruby | 2010-06-11 17:40:49 | 2014-10-23 05:12:26 | 9773 | 360 |
| gitlabhq/gitlabhq | Ruby | 2011-10-08 21:34:49 | 2014-10-30 12:44:29 | 8991 | 690 |
| fog/fog | Ruby | 2009-05-18 07:13:06 | 2014-10-30 12:08:47 | 8548 | 752 |
| discourse/discourse | Ruby | 2011-10-15 18:00:00 | 2014-10-30 16:11:33 | 7026 | 338 |
| mitchellh/vagrant | Ruby | 2010-01-21 08:35:06 | 2014-10-25 14:10:37 | 3210 | 322 |
| activeadmin/activeadmin | Ruby | 2010-04-15 13:23:16 | 2014-12-15 22:00:26 | 2754 | 433 |
| plataformatec/devise | Ruby | 2009-09-16 12:17:43 | 2014-10-29 14:59:33 | 2368 | 414 |
| jekyll/jekyll | Ruby | 2008-10-20 02:07:26 | 2013-06-08 17:47:34 | 1486 | 210 |
| robbyrussell/oh-my-zsh | Shell | 2009-08-28 18:14:03 | 2014-10-22 13:16:15 | 1732 | 796 |

A2 GitHub Data Set

For each project in our data set we queried the GitHub API with the query https://api.github.com/repos/&lt;owner&gt;/&lt;repo&gt;/commits?page=&lt;n&gt;, where &lt;owner&gt; is a GitHub user account and &lt;repo&gt; is the name of a project repository belonging to this user. The query returns a paginated JSON list of the 30 most recent commits in the master branch of the project. By varying the parameter &lt;n&gt;, we control the pagination and can trace back the commit history until the very first commit.
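The pagination walk can be sketched as follows. This is a minimal illustration: the helper names and the stubbed `fetch` are ours, and a real crawl would additionally need authentication and rate-limit handling.

```python
import json
from urllib.request import urlopen, Request

def commits_url(owner, repo, page):
    """Build the paginated commits query described above."""
    return f"https://api.github.com/repos/{owner}/{repo}/commits?page={page}"

def fetch_all_commits(owner, repo, fetch=None):
    """Walk the pagination until an empty page is returned. `fetch` can be
    replaced by a stub for testing; by default it performs a real
    (unauthenticated, rate-limited) API request."""
    if fetch is None:
        def fetch(url):
            req = Request(url, headers={"Accept": "application/vnd.github+json"})
            with urlopen(req) as resp:
                return json.load(resp)
    commits, page = [], 1
    while True:
        batch = fetch(commits_url(owner, repo, page))
        if not batch:
            break
        commits.extend(batch)
        page += 1
    return commits

# Stubbed example: two pages of 30 commits each, then an empty page.
pages = {1: [{"sha": f"a{i}"} for i in range(30)],
         2: [{"sha": f"b{i}"} for i in range(30)]}
stub = lambda url: pages.get(int(url.rsplit("=", 1)[1]), [])
print(len(fetch_all_commits("antirez", "redis", fetch=stub)))  # 60
```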

Each element in the JSON list represents a commit with all Git-relevant information (see Section 3.1). More specifically, it contains the names and email addresses of both the author and the committer. The author is the person who authored the code in the commit, and the committer is the one with write permissions in the repository who merged the commit into the project code base. These two identities may differ when pull requests are involved, as the developers requesting the pull typically do not have write access. Since we quantify contributions in terms of the amount of code written, we take the author email from the commit data as a unique identifier for individual developers. In cases where the author email is empty, we conservatively skip the commit.

The commit SHA contained in the JSON list can be used to execute a commit-specific query in the GitHub API of the form https://api.github.com/repos/&lt;owner&gt;/&lt;repo&gt;/commits/&lt;sha&gt;.

The result is again a JSON list which provides detailed information about all files changed in the commit, including their diffs. We retrieve this additional information and use it to (i) quantify the precise contribution to the source code at the level of individual characters and (ii) construct the time-varying coordination networks of developers who have co-edited files (see Section 5.1).

A2.1 Merged Pull Requests

Upon merging a pull request, typically through the GitHub interface, the commit tree of the project is modified by including a special merge commit. The basic process is illustrated in Fig. 13.

Simplified illustration of merging a pull request. A potential contributor forks the main branch of a project (light blue) into his/her own local repository (green). After some activity in both repositories a pull request is created and merged as indicated by the dashed arrow. This results in two commits - C5’ and C6 - in the main branch. The merge commit C6 (dark blue) has two parent links and should be excluded

In this example, a potential contributor forks the main branch after the second commit. Subsequent local changes are then made to the master branch and to the remote repository, represented by commits C4 and C5 respectively. After C5, the potential contributor creates a pull request asking for the changes in C5 to be incorporated in the main code base. Assuming the pull request is approved and no conflicts exist, C5 is merged by creating two commits - C5’ and C6. C5’ is almost identical to C5 in that it has the same author and committer fields as well as diffs. C6 is a special merge commit that contains the same diffs as C5 and C5’, but differs in the author and committer information. The author and committer of C6 are those of the maintainer who approved and merged the pull request, and not those of the developer who originally wrote the code in C5 and C5’. Including commit C6 in the analysis would therefore wrongly attribute the contained diff to the maintainer and inflate his/her contribution in terms of code written.

We deal with this problem by noting that merge commits always have at least two parent pointers - one to the replicated commit from the forked repository, and one to the last commit in the main branch. In cases where changes are merged from more than one remote branch, the merge commit has a parent pointer to each of these remotes. Since the parent pointers are also available in our data set, we exclude all commits that have two or more parent pointers.

An additional complication is that Git also allows integrating changes by so-called rebasing. Different from pull requests, which generate a merge commit, rebasing applies all changes on top of the last commit of the branch being rebased into. The result is a single commit with only one parent link that is added at the end of the rebased branch and that incorporates these changes. Since we cannot distinguish the developer who rebased from those who authored the changes, we exclude such commits from our analysis. Even though the parent pointer rule cannot be applied here, most well-structured projects use indicative commit messages that can be leveraged to this end. We exclude all commits with commit messages that contain any of the keywords “merge pull request”, “merge remote-tracking”, and “merge branch”, regardless of punctuation.
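The two exclusion rules can be sketched as follows. This is a simplified illustration with hypothetical commit dictionaries; the regular expression is our approximation of the keyword matching, not the exact implementation used in the paper.

```python
import re

# Case-insensitive keyword matching; "." tolerates punctuation variants
# such as "remote-tracking" vs "remote tracking".
MERGE_KEYWORDS = re.compile(
    r"merge\s+pull\s+request|merge\s+remote.tracking|merge\s+branch",
    re.IGNORECASE)

def is_excluded(commit):
    """Exclusion rules for commit statistics:
    (i) merge commits, recognised by two or more parent pointers, and
    (ii) merge/rebase commits recognised by indicative commit messages."""
    if len(commit["parents"]) >= 2:
        return True
    return MERGE_KEYWORDS.search(commit["message"]) is not None

log = [
    {"parents": ["a"], "message": "Fix crash in parser"},
    {"parents": ["a", "b"], "message": "Merge pull request #42"},
    {"parents": ["c"], "message": "Merge remote-tracking branch 'origin'"},
    {"parents": ["d"], "message": "Add unit tests"},
]
kept = [c for c in log if not is_excluded(c)]
print(len(kept))  # 2
```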

We note that all summary statistics regarding the number of commits in this paper (e.g. Table 3) are calculated after applying the above two exclusion methods.

A3 Model Fits for Project-Wise Scaling of Productivity

For each project in our data set, we estimated the model in (2) relating the team size s to the mean team-member contribution 〈c′〉. For a small number of these projects, the team size s varies in a rather narrow range, thus questioning the logarithmic transformation of both s and 〈c′〉 in the linear model of (2). We thus additionally use a model variation with a logarithmic transformation of 〈c′〉 only, while keeping s linear, i.e.:

$$\log \langle c' \rangle = \hat{\beta}_{3} + \hat{\alpha}_{3}\, s + \hat{\epsilon}_{3} $$

We denote this model as Log-Lin, while referring to the original model, in which both parameters are logarithmically transformed, as Log-Log.

For each project, we fit both models and select the one which yields the larger coefficient of determination r2 as the appropriate model for this project. The resulting project-dependent scaling coefficients are summarized in Table 4.
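The model selection step can be illustrated as follows. This is a minimal sketch using ordinary least squares via numpy on synthetic data; the paper itself uses a robust MM estimator rather than plain least squares.

```python
import numpy as np

def fit_r2(x, y):
    """Least-squares fit y = a*x + b; return the slope and r^2."""
    a, b = np.polyfit(x, y, 1)
    resid = y - (a * x + b)
    return a, 1 - resid.var() / y.var()

def select_model(s, c):
    """Fit the Log-Log model log<c'> ~ log s and the Log-Lin variant
    log<c'> ~ s, and keep whichever yields the larger r^2."""
    a_log, r2_log = fit_r2(np.log(s), np.log(c))
    a_lin, r2_lin = fit_r2(s, np.log(c))
    if r2_log >= r2_lin:
        return "Log-Log", a_log, r2_log
    return "Log-Lin", a_lin, r2_lin

# Synthetic project with a true power-law decay <c'> ~ s^-1
rng = np.random.default_rng(1)
s = np.arange(1, 101, dtype=float)
c = s ** -1.0 * np.exp(rng.normal(0, 0.1, s.size))
name, alpha, r2 = select_model(s, c)
print(name, round(alpha, 2))  # Log-Log wins; slope close to -1
```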

Table 4

Project-wise estimates of the scaling coefficients for the Log-Log and Log-Lin models

| Project | α3 (Log-Log) | r2 (Log-Log) | \(\hat{\alpha}_{3}\) (Log-Lin) | \(\hat{r}^{2}\) (Log-Lin) |
| --- | --- | --- | --- | --- |
| ≥ 10,000 commits | | | | |
| zendframework/zf2 | −1.892 (−2.038) | 0.466 (0.462) | −0.006 (−0.008) | 0.416 (0.440) |
| xbmc/xbmc | −1.566 (−1.701) | 0.330 (0.274) | −0.007 (−0.010) | 0.304 (0.255) |
| cakephp/cakephp | −1.392 (−1.430) | 0.717 (0.680) | −0.018 (−0.024) | 0.715 (0.681) |
| spree/spree | −1.326 (−1.351) | 0.504 (0.466) | −0.006 (−0.011) | 0.311 (0.296) |
| TrinityCore/TrinityCore | −1.136 (−1.151) | 0.814 (0.791) | −0.011 (−0.014) | 0.684 (0.641) |
| Itseez/opencv | −0.972 (−0.975) | 0.825 (0.799) | −0.012 (−0.017) | 0.783 (0.763) |
| rails/rails | −0.894 (−0.900) | 0.764 (0.714) | −0.003 (−0.004) | 0.675 (0.620) |
| django/django | −0.856 (−0.850) | 0.475 (0.360) | −0.004 (−0.006) | 0.356 (0.257) |
| cocos2d/cocos2d-x | −0.620 (−0.590) | 0.257 (0.232) | −0.004 (−0.005) | 0.132 (0.132) |
| WordPress/WordPress | −0.459 (−0.459) | 0.042 (0.048) | −0.017 (−0.019) | 0.025 (0.032) |
| jquery/jquery-mobile | −0.352 (0.260) | 0.01 (0.004) | −0.001 (0.002) | −0.004 (−0.002) |
| CocoaPods/Specs | −0.116 (−0.085) | 0.229 (0.117) | 0 (0) | 0.053 (0.006) |
| mono/mono | −2.276 (−1.122) | 0.37 (0.170) | −0.013 (−0.016) | 0.490 (0.284) |
| rapid7/metasploit-framework | −0.511 (−0.48) | 0.263 (0.22) | −0.006 (−0.006) | 0.288 (0.247) |
| symfony/symfony | −1.085 (−1.099) | 0.543 (0.484) | −0.004 (−0.006) | 0.57 (0.508) |
| < 10,000 commits | | | | |
| mitsuhiko/flask | −3.785 (−3.827) | 0.278 (0.181) | −0.051 (−0.093) | 0.268 (0.148) |
| github/android | −2.254 (−2.861) | 0.486 (0.364) | −0.062 (−0.154) | 0.477 (0.353) |
| laravel/laravel | −2.107 (−2.290) | 0.406 (0.374) | −0.012 (−0.028) | 0.196 (0.234) |
| plataformatec/devise | −1.946 (−1.909) | 0.429 (0.269) | −0.024 (−0.035) | 0.344 (0.136) |
| jashkenas/backbone | −1.796 (−1.141) | 0.142 (0.059) | −0.008 (−0.011) | 0.023 (0.014) |
| AFNetworking/AFNetworking | −1.679 (−1.889) | 0.364 (0.264) | −0.018 (−0.051) | 0.264 (0.206) |
| ivaynberg/select2 | −1.554 (−1.887) | 0.476 (0.443) | −0.013 (−0.058) | 0.323 (0.380) |
| discourse/discourse | −1.542 (−1.501) | 0.327 (0.335) | −0.004 (−0.008) | 0.16 (0.187) |
| yiisoft/yii | −1.510 (−1.638) | 0.531 (0.527) | −0.017 (−0.027) | 0.298 (0.316) |
| mozilla/pdf.js | −1.505 (−1.584) | 0.476 (0.386) | −0.022 (−0.04) | 0.428 (0.399) |
| activeadmin/activeadmin | −1.378 (−1.54) | 0.637 (0.538) | −0.016 (−0.031) | 0.633 (0.565) |
| jquery/jquery-ui | −1.340 (−1.87) | 0.267 (0.139) | −0.014 (−0.037) | 0.173 (0.096) |
| emberjs/ember.js | −1.315 (−1.399) | 0.462 (0.440) | −0.004 (−0.006) | 0.189 (0.187) |
| cocos2d/cocos2d-html5 | −1.278 (−1.237) | 0.407 (0.314) | −0.031 (−0.037) | 0.338 (0.238) |
| tastejs/todomvc | −1.205 (−1.114) | 0.129 (0.084) | −0.016 (−0.024) | 0.098 (0.052) |
| mitchellh/vagrant | −1.158 (−1.279) | 0.438 (0.355) | −0.009 (−0.027) | 0.429 (0.418) |
| kennethreitz/requests | −1.125 (−1.212) | 0.245 (0.237) | −0.009 (−0.019) | 0.188 (0.199) |
| libgdx/libgdx | −1.109 (−1.121) | 0.496 (0.451) | −0.011 (−0.016) | 0.391 (0.356) |
| twbs/bootstrap | −1.107 (−1.108) | 0.293 (0.233) | −0.004 (−0.009) | 0.138 (0.126) |
| tryghost/Ghost | −1.106 (−1.136) | 0.549 (0.386) | −0.008 (−0.014) | 0.382 (0.253) |
| adobe/brackets | −1.092 (−1.095) | 0.500 (0.469) | −0.010 (−0.012) | 0.350 (0.330) |
| gitlabhq/gitlabhq | −1.081 (−1.085) | 0.589 (0.488) | −0.004 (−0.009) | 0.523 (0.436) |
| ansible/ansible | −1.062 (−1.079) | 0.667 (0.598) | −0.002 (−0.003) | 0.395 (0.356) |
| litecoin-project/litecoin | −1.009 (−0.972) | 0.270 (0.202) | −0.016 (−0.020) | 0.157 (0.091) |
| elasticsearch/elasticsearch | −0.990 (−0.984) | 0.685 (0.625) | −0.005 (−0.013) | 0.182 (0.203) |
| mrdoob/three.js | −0.971 (−1.009) | 0.418 (0.373) | −0.008 (−0.013) | 0.414 (0.362) |
| jekyll/jekyll | −0.946 (−0.451) | 0.057 (0.009) | −0.001 (0.002) | −0.007 (−0.008) |
| meteor/meteor | −0.930 (−0.869) | 0.447 (0.361) | −0.014 (−0.018) | 0.340 (0.287) |
| zurb/foundation | −0.925 (−0.926) | 0.145 (0.136) | −0.003 (−0.007) | 0.113 (0.116) |
| joyent/node | −0.883 (−0.868) | 0.422 (0.342) | −0.008 (−0.015) | 0.315 (0.263) |
| dogecoin/dogecoin | −0.861 (−0.853) | 0.294 (0.234) | −0.014 (−0.021) | 0.251 (0.191) |
| angular/angular.js | −0.831 (−0.796) | 0.485 (0.291) | −0.002 (−0.006) | 0.352 (0.220) |
| docker/docker | −0.802 (−0.760) | 0.617 (0.493) | −0.002 (−0.004) | 0.423 (0.337) |
| robbyrussell/oh-my-zsh | −0.796 (−0.79) | 0.099 (0.072) | −0.004 (−0.009) | 0.081 (0.048) |
| fog/fog | −0.777 (−0.761) | 0.575 (0.504) | −0.006 (−0.008) | 0.489 (0.426) |
| jquery/jquery | −0.737 (−0.579) | 0.109 (0.041) | −0.011 (−0.012) | 0.080 (0.023) |
| bitcoin/bitcoin | −0.729 (−0.713) | 0.277 (0.216) | −0.007 (−0.012) | 0.131 (0.109) |
| bcit-ci/CodeIgniter | −0.694 (−0.625) | 0.188 (0.128) | −0.006 (−0.006) | 0.089 (0.048) |
| strongloop/express | −2.114 (−1.885) | 0.352 (0.188) | −0.043 (−0.112) | 0.363 (0.180) |
| mbostock/d3 | −1.825 (−1.909) | 0.409 (0.343) | −0.084 (−0.116) | 0.417 (0.337) |
| antirez/redis | −1.542 (−1.619) | 0.367 (0.244) | −0.031 (−0.064) | 0.381 (0.245) |
| ajaxorg/ace | −1.249 (−1.169) | 0.233 (0.215) | −0.027 (−0.037) | 0.247 (0.195) |
| diaspora/diaspora | −0.221 (−0.126) | 0.003 (−0.003) | 0.003 (0.005) | 0.040 (0.035) |

MM-estimation was used to estimate the coefficients of the regressors. Bold values for α3 or \(\hat{\alpha}_{3}\) indicate (i) significance at p < 0.01 and (ii) that the corresponding model has the larger coefficient of determination r2. Only the first 15 projects have at least 10,000 commits. Of those, 11 had the Log-Log model as a significant and most reasonable fit, and were used as a selection pool for building coordination networks in Section 5.2. Bracketed values give the corresponding quantities after the commits of single-commit developers have been removed from the analysis

The result confirms that our finding of decreasing returns to scale at the aggregate level (Section 4.1) also holds for individual projects. Virtually all projects exhibit a negative scaling of the mean team-member contribution with the team size; only for two projects could no significant scaling coefficient be determined. In any case, the absence of significant positive coefficients for any of the projects allows us to conclude that there is no evidence for super-linear scaling in our data set.

A4 Effect of One-Time Contributors

In order to quantify the extent to which our results on team productivity may be influenced by contributors who committed to a project only once, we identified single-commit developers in all of the studied projects. Figure 14 shows the fraction of commits submitted by one-time contributors in each of the studied projects, validating the intuition that such developers comprise a sizable part of the development teams.

Fraction of commits submitted by one-time contributors, i.e., developers who never contributed a second commit

In order to ensure that our results on the scaling of productivity are not qualitatively affected by the large fraction of single-commit developers, we additionally filtered the commit logs of all projects, removing the commits of all developers who committed only once. In this way, we focus on the contributions of a core team that explicitly excludes single-commit developers. Using the filtered commit logs, we then recomputed all model fits in the paper. In Tables 5 and 6, we report the resulting scaling exponents. We observe no qualitative changes regarding our observation of decreasing returns to scale. We additionally reanalyzed all individual projects, again filtering out all contributions by single-commit developers. We report the project-wise scaling exponents as the bracketed values in Table 4, again observing no qualitative changes of our results for individual projects.

Table 5

Estimation of two linear models for Fig. 6 with single-commit developers removed

| β0 | α0 | r2 | β1 | α1 | r2 |
| --- | --- | --- | --- | --- | --- |
| 0.95 ± 0.02 | −0.20 ± 0.01 | 0.10 | 4.21 ± 0.04 | −0.27 ± 0.02 | 0.04 |

MM-estimation was used to estimate the coefficients of the regressors. The coefficients are presented together with their corresponding 95 % confidence intervals and are highly significant at p<0.001. The sample size for both models is 13776

Table 6

Estimation of two linear models for Fig. 7 with single-commit developers removed

| β2 | α2 | R2 | β3 | α3 | R2 |
| --- | --- | --- | --- | --- | --- |
| 0.82 ± 0.03 | −0.64 ± 0.02 | 0.32 | 4.07 ± 0.05 | −0.71 ± 0.03 | 0.22 |

MM-estimation was used to estimate the coefficients of the regressors. The coefficients are presented together with their corresponding 95 % confidence intervals and are highly significant at p<0.001. The sample size for both models is 13776

A5 Inference Versus Prediction from Linear Models

In Section 4.1 we introduced two linear models for each of Figs. 6 and 7 as a means of quantifying the observed negative trends. In particular, we introduced

$$\log \langle n \rangle = \beta_{0} + \alpha_{0} \log s + \epsilon_{0}, \qquad \log \langle c \rangle = \beta_{1} + \alpha_{1} \log s + \epsilon_{1}, $$

$$\log \langle n' \rangle = \beta_{2} + \alpha_{2} \log s + \epsilon_{2}, \qquad \log \langle c' \rangle = \beta_{3} + \alpha_{3} \log s + \epsilon_{3}, \qquad (7)$$

where 〈n〉 is the mean number of commits per active developer, 〈n′〉 is the mean number of commits per team member, 〈c〉 is the mean commit contribution per active developer, 〈c′〉 is the mean contribution per team member and 𝜖0,1,2,3 denote the errors of the models.

We note that for these models to provide reliable predictions, the following conditions must be met: (a) Var(𝜖0,1,2,3 | log s) = σ2 for all s (homoskedasticity), (b) \(\epsilon_{0,1,2,3} \sim \mathcal{N}(0, \sigma^{2})\) (normality of the error distribution) and (c) E(𝜖0,1,2,3 | log s) = 0 (the linear model is correct).

We test for homoskedasticity by running the Koenker studentised version of the Breusch-Pagan test (Koenker 1981). This test regresses the squared residuals on the predictor in (7) and uses the more widely applicable Lagrange Multiplier (LM) statistic instead of the F-statistic. Although more sophisticated procedures, e.g. White's test, would account for a non-linear relation between the residuals and the predictor, we find that the Breusch-Pagan test is sufficient to detect heteroskedasticity in our data. The consequence of violating the homoskedasticity assumption is that the estimated variance of the slopes α0,1,2,3 is biased, hence the statistics used to test hypotheses are invalid. Thus, to account for the presence of heteroskedasticity, we use robust methods to calculate heteroskedasticity-consistent standard errors. More specifically, we use an MM-type robust regression estimator, as provided by the R package robustbase.
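The Koenker studentised Breusch-Pagan statistic is simple to compute: regress the squared OLS residuals on the predictor and take n times the R2 of this auxiliary regression, which is asymptotically χ2 distributed. The following is a minimal numpy sketch on synthetic data; the function and variable names are ours, not the paper's implementation.

```python
import numpy as np

def koenker_bp(x, y):
    """Koenker's studentised Breusch-Pagan test: LM = n * R^2 of the
    auxiliary regression of squared OLS residuals on the predictor.
    With a single predictor, LM is asymptotically chi-squared(1)."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    u2 = (y - X @ beta) ** 2                       # squared residuals
    gamma, *_ = np.linalg.lstsq(X, u2, rcond=None)  # auxiliary regression
    r2 = 1 - np.sum((u2 - X @ gamma) ** 2) / np.sum((u2 - u2.mean()) ** 2)
    return len(x) * r2

rng = np.random.default_rng(42)
x = rng.uniform(1, 10, 1000)
y_hom = 2 - 0.5 * x + rng.normal(0, 1, 1000)       # constant error variance
y_het = 2 - 0.5 * x + x * rng.normal(0, 1, 1000)   # variance grows with x
lm_hom, lm_het = koenker_bp(x, y_hom), koenker_bp(x, y_het)
# Compare against a chi-squared(1) critical value, e.g. 6.63 at the 1% level.
print(round(lm_hom, 1), round(lm_het, 1))
```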

As for the normality of the errors 𝜖0,1,2,3, a violation of this assumption would render exact t and F statistics incorrect. However, our use of a robust MM estimator addresses possible non-normality of the residuals, as it is resistant to the influence of outliers.

The last assumption pertains to the overall feasibility of the linear model. A common way to assess it is to plot the residuals from estimating (7) versus the fitted values, commonly known as a Tukey-Anscombe plot. A strong trend in the plot is evidence that the relationship between the dependent and independent variable is not captured well by a linear model. As a result, predicting the dependent variable from the calculated slope is likely to be unreliable, especially if the relationship between the variables is highly non-linear.

In Fig. 15 we show the Tukey-Anscombe plots for the four regression models in (7). While we cannot readily observe a prominent trend, we nevertheless see two qualitatively different regimes. Specifically, the residuals in the lower ranges are close to zero, while they are relatively symmetrically distributed beyond this range. Looking at the line fits in Figs. 6 and 7, we see that the reason for this discrepancy is the outliers in the region of large team sizes, which fall close to the fitted regression lines. The residuals corresponding to these outliers are therefore close to zero. Investigating these specific data points reveals that they belong exclusively to the CocoaPods/Specs project.

Residuals versus fitted values for (7). The titles above each plot correspond to the respective scatterplots in Figs. 6 and 7

To quantify a possible trend in Fig. 15 we calculate the normalized mutual information (NMI) between the residuals and the fitted values. As expected, the NMI is rather low - 0.04 (top-left), 0.02 (top-right), 0.04 (bottom-left) and 0.03 (bottom-right) - an indication that there is no pronounced systematic error in the linear models. However, even though the NMI values are low, we find that they are all statistically different from zero at p = 0.05.
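A common way to estimate mutual information between two continuous variables is to bin them and use the discrete estimator. The sketch below illustrates this on synthetic data; the binning and the normalisation by the mean of the marginal entropies are our assumptions, as the paper does not specify its NMI estimator.

```python
import numpy as np

def nmi(x, y, bins=20):
    """Binned estimate of normalized mutual information between two
    continuous variables, normalised by the mean marginal entropy."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy /= pxy.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    nz = pxy > 0
    mi = np.sum(pxy[nz] * np.log(pxy[nz] / np.outer(px, py)[nz]))
    hx = -np.sum(px[px > 0] * np.log(px[px > 0]))
    hy = -np.sum(py[py > 0] * np.log(py[py > 0]))
    return mi / ((hx + hy) / 2)

rng = np.random.default_rng(0)
fitted = rng.normal(size=5000)
resid_indep = rng.normal(size=5000)                      # no systematic error
resid_trend = fitted ** 2 + 0.1 * rng.normal(size=5000)  # strong systematic trend
# Independent residuals give an NMI near zero; a trend inflates it.
print(round(nmi(fitted, resid_indep), 2), round(nmi(fitted, resid_trend), 2))
```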

Therefore, despite the evidence against a systematic error in these linear models, assumption (c) is technically not satisfied. We thus conservatively avoid using the linear models for prediction. Since the NMI values are rather low, however, the regression models are sufficient for our purpose of simply quantifying the observed negative trends. We argue that the practical significance of such small effect sizes is negligible with respect to introducing strong systematic errors that could obscure a salient non-linear relationship. Effectively, we can only retrospectively infer a significant negative relationship between team size and productivity, but cannot forecast team production given team size. We caution that such inference is also subject to high variability, as indicated by the low r2 values (see Section 5.2), and is thus valid only on average.

Finally, an argument against the significance of the slopes in (7) is the relatively large sample size of N = 13998. Known as the “p-value problem” (Lin et al. 2013), this issue pertains to applying small-sample statistical inference to large samples. Statistical inference is based on the notion that under a null hypothesis a parameter of interest equals a specific value, typically zero, which represents “no effect”. In our case, we are interested in estimating the slopes α0,1,2,3 with an associated null hypothesis that sets them to zero. It is precisely this representation of “no effect” by a particular number that becomes problematic with large samples: the standard error of the estimated parameter becomes so small that even tiny differences between the estimate and the null value become statistically significant. Hence, unless the estimated parameter equals the null value with infinite precision, there is always a danger that the statistical significance we find is due to random fluctuations in the data. One way to alleviate the issue is to consider the size of the effect (as we did above) and assess whether its practical significance matters for the context at hand, even if it is significant in the strict statistical sense.

Another way is to demonstrate that the size and significance of the effect cannot arise from a random fluctuation. To this end we again resort to a bootstrap approach. For each scatterplot in Figs. 6 and 7, we generate 10,000 bootstrap samples by shuffling the data points. We then estimate the regression models on each bootstrap sample and record the corresponding slope estimate \(\hat{\alpha}_{0,1,2,3}\), regardless of its statistical significance. We find that the slopes of the 10,000 bootstrapped regression models are restricted to the ranges [−0.02, 0.02], [−0.04, 0.04], [−0.04, 0.03] and [−0.04, 0.06] for \(\hat{\alpha}_{0}\), \(\hat{\alpha}_{1}\), \(\hat{\alpha}_{2}\) and \(\hat{\alpha}_{3}\), respectively. Comparing these ranges to the empirical slopes in Tables 1 and 2, we see that once the relationship between team size and productivity is eliminated, the strength of the negative trend found in the data set cannot be reproduced. It is precisely the information destroyed by the shuffling procedure that accounts for the statistical significance of α0,1,2,3. Hence, it is safe to conclude that our analysis does not suffer from spuriously significant results introduced by large samples.
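The shuffling procedure can be sketched as follows. This is a simplified illustration with synthetic data, ordinary least squares instead of MM-estimation, and fewer bootstrap samples than used in the paper.

```python
import numpy as np

def slope(x, y):
    """OLS slope of y on x."""
    return np.polyfit(x, y, 1)[0]

def shuffled_slope_range(x, y, n_boot=1000, seed=0):
    """Destroy the x-y association by shuffling y, refit the regression
    on every shuffled sample, and return the range of null slopes."""
    rng = np.random.default_rng(seed)
    slopes = [slope(x, rng.permutation(y)) for _ in range(n_boot)]
    return min(slopes), max(slopes)

rng = np.random.default_rng(3)
log_s = np.log(rng.integers(1, 50, 2000).astype(float))
log_c = 1.0 - 0.7 * log_s + rng.normal(0, 0.5, 2000)  # built-in negative trend
lo, hi = shuffled_slope_range(log_s, log_c)
emp = slope(log_s, log_c)
print(emp < lo)  # the empirical slope falls outside the shuffled null range
```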

A6 Calculating Overlapping Source Code Regions

Our method for identifying overlapping source code changes between co-edits of the same file is based on the information in the chunk headers of a diff between two versions of a committed file. Such a file diff shows only those portions of the file that were actually modified by a commit. In Git parlance these portions are known as chunks. Each chunk is prepended by one line of header information enclosed between @@…@@. The header indicates the lines which were modified by the given commit to this file. From all chunk headers within a file diff we can therefore obtain the line ranges affected by the commit and eventually calculate the overlapping source code regions between two different commits to the same file.

As a concrete example, assume a productivity time window of 7 days in which the file foo.txt was modified first by developer A and then by developer B in commits CA and CB, respectively. The diff of foo.txt in commit CA may contain the following chunk header:

$$\textsf{@@ -10,15 \qquad+10,12 @@} $$

The content of the header is split into two parts identified by “−” and “+”: −10,15 and +10,12. The two pairs of numbers indicate the line ranges outside of which the two versions (before and after CA) of foo.txt are identical. More specifically, −10,15 means that starting from line 10, CA made changes to the following 15 lines, i.e., it affected the line range [10 - 25]. The result of these changes is given in the second part of the header: +10,12 indicates that starting from line 10 in the new state of the file, the following 12 lines differ from the [10 - 25] line range. Beyond these 12 lines, the old and the new state of foo.txt are identical, provided there are no more chunks in the file diff. Therefore, the line range [10 - 25] in the old state of foo.txt and the line range [10 - 22] in the new state after CA are the only differences introduced by the commit. This could be caused, for example, by the removal of three lines from the line range [10 - 25], together with other modifications in the same range.

Since CA comes prior to CB in our example, we associate the second part of the chunk header, i.e., line range [10 - 22], to CA, as it represents the state of foo.txt after the changes from CA were applied and before those from CB. Now assume that the diff of foo.txt in CB has only one chunk with the following header:

$$\textsf{@@ -10,30 \qquad +10,40 @@} $$

In other words, lines [10 - 40] from the old state of foo.txt were modified by CB, and the changes are reflected in lines [10 - 50] in the new state of foo.txt after CB. Note that lines [10 - 40] represent the state of foo.txt after CA, but before CB. Therefore, to compute the overlapping source code ranges between CB and CA, we need to compare the line ranges [10 - 40] and [10 - 22] and calculate the overlap. In this case, the overlap is 12 lines, which is the weight we attribute to the coordination link from developer B to developer A in this particularly simple example. The procedure described above is applied to all pairs of commits by different developers which have edited a common file within a given productivity time window of 7 days. Processing the chunk information in this way thus allows us to extract line-based, weighted and directed co-editing networks which capture the association between developers and source code regions.
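The chunk-header parsing and overlap computation can be sketched as follows. This is a minimal sketch: line ranges are treated as half-open intervals [start, start+length), real headers may omit the length when it equals 1, and the helper names are ours.

```python
import re

# Matches a simple unified-diff chunk header, e.g. "@@ -10,15 +10,12 @@"
CHUNK_RE = re.compile(r"@@ -(\d+),(\d+) \+(\d+),(\d+) @@")

def chunk_ranges(diff_text, which="new"):
    """Extract the affected line ranges from all chunk headers of a file
    diff. `which='old'` returns the pre-image ranges, 'new' the post-image."""
    ranges = []
    for m in CHUNK_RE.finditer(diff_text):
        if which == "old":
            start, length = int(m.group(1)), int(m.group(2))
        else:
            start, length = int(m.group(3)), int(m.group(4))
        ranges.append((start, start + length))  # half-open [start, start+length)
    return ranges

def overlap(r1, r2):
    """Number of lines shared by two half-open line ranges."""
    return max(0, min(r1[1], r2[1]) - max(r1[0], r2[0]))

# The worked example: C_A's post-image range vs C_B's pre-image range.
ca_new = chunk_ranges("@@ -10,15 +10,12 @@", which="new")[0]  # lines 10-21
cb_old = chunk_ranges("@@ -10,30 +10,40 @@", which="old")[0]  # lines 10-39
print(overlap(ca_new, cb_old))  # 12
```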
