Wednesday, May 13, 2009

As the occasional thinker about open-source development practices, communities, and issues, I have been wondering for a while: What are the largest open-source projects? What projects have the most code, the most users, and the most issues to deal with, and how do they cope?

The Debian archive should provide some insights into the first one or two questions, as it contains a very large portion of all available and relevant open-source software and exposes them in a fairly standard form. In the old days one might have gotten out grep-dctrl to create some puzzling statistics, but nowadays this information is actually available in an SQL database: the Ultimate Debian Database (UDD). (And it's in PostgreSQL. And it comes with a postgresql_autodoc-generated schema documentation. Excellent.)

So here is a first question. Well, the zeroth question would have been, which source packages have the largest unpacked orig tarball, but that information doesn't seem to be available, either via UDD or via apt. So the first question anyway is, which source packages produce the largest installation size across all their binary packages:

This produces a few well-known packages, but also a number of obscure ones. If you look closer, many of them appear to be themed around scientific, numerical, visualization, Scheme, Lisp, that sort of thing. Hmm.

Here is another idea. Take a package's installation footprint and multiply it by its popularity contest installation count. So you get some kind of maintenance effort score, either because the package is large or because you have a lot of users or both.

I noticed linux-2.6 is suspiciously absent because of a low popcon score (?!?).

I don't want to dump the entire database into this blog post, but if you try this yourself you can look at about the first 200 to 300 places to find reasonably large and complex projects before it gets a bit more obscure. A few highlights:

This is obviously still biased in a lot of ways, but it does show the major projects.

The UDD is also an interesting use case that shows how you can deploy a PostgreSQL database as a semi-public service with direct access. A great tool, and a great tool to build other great tools on top of.