David A. Wheeler's Blog

Sun, 08 Jul 2012

The world of the future belongs to the collaborators.
But how, exactly, can you have a successful project with collaborators?
Can we quantitatively analyze past projects to figure out what works,
instead of just using our best guesses?
The answer, thankfully, is yes.

I just finished reading the amazing
Internet Success: A Study of Open-Source Software Commons.
This landmark book by Charles M. Schweik and Robert C. English of the
University of Massachusetts, Amherst
presents the results of five years of painstaking quantitative
research to answer this question:
“What factors lead some open source software (OSS) commons
(aka projects) to success, and others to abandonment?”

If you’re doing serious research in how
collaborative development projects succeed (or not), you
have to get this book.
If you’re running a project, you should
apply its results, and
frankly, you’d probably get quite a bit of insight about
collaboration from reading it.
The book focuses specifically on the development of OSS,
but as the authors note, many of its lessons probably apply elsewhere.
Here’s a quick review and summary.

Schweik and English examined over 100,000 projects on SourceForge, using
data from SourceForge and developer surveys.
Their approach to data collection and analysis is spelled out in detail
in the book; the key is that they took the time to dig deeply into the data.
Many previous studies have focused on just a few projects, and the authors
summarize those; while such studies are useful, they don’t tell the whole story.
Schweik and English instead cover a broad array of projects, using
quantitative analysis instead of guesswork.

Fair warning: The book is quite technical.
People who are not used to statistical analysis will find some parts
quite mysterious, and the authors answer a lot of questions you might not even
have thought to ask.
Because this is serious scientific research, they carefully define terms,
walk through their analysis, and present an avalanche of data.
The key, though, is that they managed
to find useful answers from the data, and their results
are actually quite understandable.

They spend a whole chapter (chapter 7) defining the terms
“success” and “abandonment”.
The definitions of these terms are key to the whole study, so
it makes sense that they spend time to define them.
Interestingly, they switched to the term “abandonment”
instead of the more common term “failure”;
they found that “many projects that had ceased collaborating
would not be seen as failed projects”, e.g., because
that project code had been absorbed into another project
or the developer had improved their development skills (where this
was their purpose).

They use a very simple project lifecycle model — projects begin in
initiation, and once the project has made its
first software release, it switches to growth.
They also categorized projects as success, abandonment, or indeterminate.
Combining these produces 6 categories of project:
success initiation (SI); abandonment initiation (AI);
success growth (SG); abandonment growth (AG);
indeterminate initiation (II); and indeterminate growth (IG).
Their operational definition of success initiation (SI) is oversimplified
but easy to understand: an SI project has at least one release.
Their operational definition for a success growth (SG) project
is very generous: at least 3 releases,
at least 6 months between releases, and more than 10 downloads.
Chapter 7 gives details on these; I note these here because it’s
hard to follow most of the book without knowing these categories.
I could argue that these definitions of success are really too generous,
but even so, many projects did not meet them,
and it is important to understand why
(so that future projects will be more likely to succeed).
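
To make the two success definitions concrete, here is a minimal Python sketch of them as summarized above. It is my own illustration, not the authors' code: the Project record and its field names are hypothetical, and I am assuming that "at least 6 months between releases" means the span from the first release to the last, approximated as 182 days. Chapter 7 has the real definitions, including the abandonment and indeterminate categories that this sketch does not cover.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class Project:
    """Hypothetical record; field names are illustrative, not from the book."""
    release_dates: List[date]
    downloads: int


def is_success_initiation(p: Project) -> bool:
    # SI as summarized in this review: at least one release.
    return len(p.release_dates) >= 1


def is_success_growth(p: Project, min_span_days: int = 182) -> bool:
    # SG as summarized in this review: at least 3 releases, at least
    # 6 months between releases, and more than 10 downloads.
    # Assumption: the 6 months is measured from first to last release.
    if len(p.release_dates) < 3 or p.downloads <= 10:
        return False
    span = max(p.release_dates) - min(p.release_dates)
    return span.days >= min_span_days


# Example: three releases spread over eight months, 250 downloads.
demo = Project([date(2010, 1, 1), date(2010, 5, 1), date(2010, 9, 1)], downloads=250)
print(is_success_initiation(demo), is_success_growth(demo))
```

Even written out this way, the bar is clearly low, which is what makes the number of projects that miss it so interesting.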

They had so much data that even supercomputers could not directly
process it. Given today’s computing capabilities,
that’s pretty amazing.

So, what did they learn? Quite a bit.
A few specific points are described in chapter 12.
For example,
they had presumed that OSS projects with limited competition would be
more successful, but the effect is actually mildly the other way;
“successful growth (SG) projects are more frequently found in
environments where there is more competition, not less”.
Unsurprisingly,
projects with financial backing are “much more likely to be successful
than those that are not” once they are in growth stage; although financing
had an effect, its effects were not as strong in initiation.

As with any research material, if you don’t have time for
the details, it’s a good idea to jump to the conclusions, which in
this book are in chapter 13.
So what does it say?

One of the key results is that during initiation (before first
release), the following are the most important issues, in order of importance,
for success in an OSS project:

“Put in the hours. Work hard toward creating your first release.”
The details in chapter 11 tell the story:
If the leader put in more than 1.5 hours per week (on average),
the project was successful 73% of the time;
if the leader did not, the project was abandoned 65% of the time.
They are not saying that leaders should put in only that much time each week;
instead, the point is that the leader must consistently put in time
for the project to get to its first release.

“Practice leadership by administering your project well, and
thinking through and articulating your vision as well as goals for
the project. Demonstrate your leadership through hard work…”

“Establish a high-quality Web site to showcase and promote your project.”

“Create good documentation for your (potential) user and
developer community.”

“Advertise and market your project, and communicate your plans and goals
with the hope of getting help from others.”

“Realize that successful projects are found in both GPL-based and
non-GPL-compatible situations.”

“Consider, at the project’s outset, creating software that has the
potential to be useful to a substantial number of users.”
The minimum number of users is surprisingly small;
they estimate that successful growth stage projects typically
have at least 200 users.
In general, the more potential users, the better.

None of these are earth-shattering surprises, but now they
are confirmed by data instead of being merely guessed at.
In particular, some items that people have claimed are important,
such as keeping complexity low, were not really supported as important.
In fact, successful projects tended to have a little more complexity.
That is probably not because a project should strive for complexity.
Instead, I suspect both successful and abandoned projects often strive to
reduce complexity (so it is not really something that distinguishes them),
and I suspect sometimes a project that focuses on user needs has
to have more complexity than one that does not, simply because user needs
can sometimes require some complexity.

Similarly, they had guidance for growth projects, in order of importance:

“Your goal should be to create a virtuous circle where others help to
improve the software, thereby attracting more users and other developers,
which in turn leads to more improvements in the software…”
Do this the same way it is done in initiation:
spend time, maintain goals and plans, communicate the plans,
and maintain a high-quality project web site.
The user community should be actively interacting with the development team.

“Advertise and market your project.”
In particular, successful growth projects are frequently projects that
have added at least one new developer in the growth stage.

Have some small tasks available for contributors with limited time.

Welcome competition.
The authors were surprised, but noted that
“competition seems to favor success”.
Personally, I do not find this surprising at all.
Competition often encourages others to do better; we have an entire
economic system based on that premise.

Consider accepting offers of financing or paid developers
(they can greatly increase success rates).
This one, in particular, should surprise no one — if you want
to increase success, pay someone to do it.

“Keep institutions (rules and project governance) as lean and
informal as possible, but do not be afraid to move toward more
formalization if it appears necessary.”

They also have some hints on how potential OSS users (consumers) can
choose OSS that is more likely to endure.
Successful OSS projects have characteristics like
more than 1000 downloads, users participating in bug tracker and email lists,
goals/plans listed, a development team that responds quickly to questions,
a good web site, good user documentation, and good developer documentation.
A larger development team is a good sign, too.
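
If you want to apply these hints when evaluating a project, they can be sketched as a simple checklist. This is only my own illustration: the ProjectSigns record and its field names are hypothetical, the result is a plain list of the signs that are present, and it is not a scoring method from the book.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProjectSigns:
    """Hypothetical observations a prospective user might record."""
    downloads: int
    users_active_in_tracker_and_lists: bool
    goals_listed: bool
    team_answers_quickly: bool
    good_web_site: bool
    good_user_docs: bool
    good_developer_docs: bool


def endurance_signs(p: ProjectSigns) -> List[str]:
    # Each entry mirrors one characteristic listed in the review.
    checks = {
        "more than 1000 downloads": p.downloads > 1000,
        "users active in bug tracker and email lists": p.users_active_in_tracker_and_lists,
        "goals/plans listed": p.goals_listed,
        "team responds quickly to questions": p.team_answers_quickly,
        "good web site": p.good_web_site,
        "good user documentation": p.good_user_docs,
        "good developer documentation": p.good_developer_docs,
    }
    # Return the names of the signs that are present, in the order above.
    return [name for name, present in checks.items() if present]
```

The more of these signs a project shows, the better, but how to weigh them against each other is left to the reader (and to the book).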

These are just some of the research highlights.
For details, well, get the book!

I want to give some extra kudos to the authors: They have made a vast
amount of their data available so that analysis can be re-done, and so that
additional analysis can be done.
(They held back some survey data due to personally-identifying information
issues, which is reasonable enough.)
Science depends on repeatability, yet much of today’s so-called
“science” does not publish its data or analysis software,
and thus cannot be repeated… and thus is not science.

The book is not perfect.
It’s big and rather technical in some spots, which will make it hard
reading for some.
An unfortunate blot is that, while they’re usually extremely precise,
there are serious ambiguities in their discussion on licensing.
In particular, they have fundamentally inconsistent definitions for
the terms “GPL-compatible” and “GPL-incompatible” throughout the book,
making their license analysis results suspect.
On page 22, they define the term “GPL-incompatible” in an extremely
bizarre and non-standard way; they define “GPL-incompatible” as
software in which “firms can derive new works from OSS, but are not
obliged to license new derivatives under the GPL [and] are not
obligated to expose the code logic in [derivative products].”
In short, they seem to be using the term “GPL-compatible” as a synonym for
what the rest of the world would call a “reciprocal” or “protective” license.
Similarly, they seem to be defining the term “GPL-incompatible” to mean a
“permissive” license.
I don’t like non-standard terminology, but as long as unusual terms
are defined clearly, I can deal with them.
Yet later, on page 157, they define “GPL-compatible” completely
differently, and give it its conventional meaning instead.
That is, they define “GPL-compatible” as software that can be combined
with the GPL (which includes not just the reciprocal GPL license,
but which also includes many permissive licenses like the MIT license).
My initial guess is that the page 22 text is just wrong, but
it’s hard to be sure.
There is another wrinkle, too, presuming that they meant the term
“GPL-compatible” in the usual sense
(and that page 22 is just wrong).
One of the more popular licenses, the
Apache License 2.0, has recently become GPL-compatible
(on release of the GPL version 3),
even though it wasn’t before.
If they actually used the term “GPL-compatible” in its usual sense,
it’s not clear from the book that this change is reflected in their data
(at least I didn’t see it),
and there is enough Apache-licensed software that this would matter.
This may just be a poor explanation of terms, but until this is
cleared up, I would be cautious about its comments on licensing.
Hopefully they will clear this up, and in addition,
it would probably be very useful to re-run the licensing analysis
to examine (1) GPL-compatible vs. GPL-incompatible licenses, and (2)
the typical 3 license categories (permissive,
weakly protective/reciprocal, and strongly protective/reciprocal).
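
To show why these are two different questions, here is a small Python sketch of the two groupings side by side. The license table is my own illustrative sample using the conventional classifications, not data from the book, and the function names are hypothetical. Note how the Apache License 2.0 changes sides depending on which GPL version you ask about.

```python
from collections import Counter
from typing import Iterable, Tuple

# Illustrative sample only; not data from the book.
# name: (category, compatible with GPLv2, compatible with GPLv3)
LICENSES = {
    "MIT": ("permissive", True, True),
    "BSD-3-Clause": ("permissive", True, True),
    "Apache-2.0": ("permissive", False, True),
    "LGPL-2.1": ("weakly protective", True, True),
    "MPL-1.1": ("weakly protective", False, False),
    "GPL-2.0-or-later": ("strongly protective", True, True),
}


def is_gpl_compatible(name: str, gpl_version: int = 3) -> bool:
    # Conventional meaning: can code under this license be combined
    # with code under the given GPL version?
    if gpl_version not in (2, 3):
        raise ValueError("gpl_version must be 2 or 3")
    _category, v2_ok, v3_ok = LICENSES[name]
    return v3_ok if gpl_version == 3 else v2_ok


def tally(project_licenses: Iterable[str], gpl_version: int = 3) -> Tuple[Counter, Counter]:
    # Count the same projects under both groupings.
    names = list(project_licenses)
    by_category = Counter(LICENSES[n][0] for n in names)
    by_compat = Counter(
        "GPL-compatible" if is_gpl_compatible(n, gpl_version) else "GPL-incompatible"
        for n in names
    )
    return by_category, by_compat
```

A permissive license can be GPL-compatible (MIT) and a weakly protective one can be GPL-incompatible (MPL 1.1), so the two groupings can give different answers for the same set of projects.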

What’s especially intriguing is that success is very achievable.
While initiating your project you should keep at it and communicate
(articulate the vision and goals,
have a high-quality web site to showcase/promote
the project, create good documentation, and advertise).
Once it’s growing, work to attract more users and developers.