CPAN vs. RAA: costs

I tried to improve my estimate of RAA's cost by running the script shown
below against 11% of the archive (by project count); that subset would cost
around $20M (600,000 lines of code), which puts the total cost of the RAA
under $191 million. I then compared it to a revision of the 2004 CPAN cost
estimate, which lowers the original figure substantially.

The final figure is somewhat biased because I didn't pick the projects
randomly (so the remainder should be smaller on average), but it still
serves as an upper bound.

Comparison with CPAN

The cost of CPAN was estimated to be under $677 million in 2004.
That analysis was faulty because it considered all of CPAN as a single
project with 15.5 million LOCs, which inflates the numbers because the
effort estimate equation of the basic COCOMO model is nonlinear:

effort (person-months) = 2.4 * KLOC^1.05
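To see how much the single-project assumption matters, here is a minimal Python sketch of the basic COCOMO cost calculation. The constants are assumptions on my part: they are the organic-mode coefficients (2.4 and 1.05) and the salary and overhead defaults that SLOCCount is documented to use, not values stated in the 2004 analysis. The 5000 equal-sized projects are purely hypothetical.

```python
# Basic COCOMO, organic mode. All four constants are assumed to match
# SLOCCount's defaults; none of them is taken from the 2004 analysis.
EFFORT_FACTOR = 2.4         # person-months per KSLOC**EFFORT_EXPONENT
EFFORT_EXPONENT = 1.05
SALARY_PER_YEAR = 56286.0   # USD per person-year
OVERHEAD = 2.4              # multiplier applied on top of salary


def effort_person_months(sloc):
    """Estimated effort for ONE project with `sloc` source lines."""
    return EFFORT_FACTOR * (sloc / 1000.0) ** EFFORT_EXPONENT


def cost_usd(sloc):
    """Estimated development cost for ONE project."""
    return effort_person_months(sloc) / 12.0 * SALARY_PER_YEAR * OVERHEAD


def total_cost_usd(project_sizes):
    """Cost of an archive: estimate each project separately, then add up."""
    return sum(cost_usd(size) for size in project_sizes)


if __name__ == "__main__":
    # Hypothetical archive: 5000 equal projects adding up to 15.5M lines.
    sizes = [3100] * 5000
    lumped = cost_usd(sum(sizes))        # everything treated as one project
    separate = total_cost_usd(sizes)     # projects costed one by one
    print(f"lumped:   ${lumped / 1e6:.0f}M")
    print(f"separate: ${separate / 1e6:.0f}M")
    print(f"ratio:    {lumped / separate:.3f}")
```

Because the exponent is larger than 1, the lumped figure always comes out higher than the per-project sum, whatever the real size distribution is.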

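The error introduced will be smaller than a factor of

(P * L)^1.05 / (P * L^1.05) = P^0.05

where P is the number of projects and L the average project size. The bound
is reached when all projects have the same size.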

Unfortunately, I couldn't find any size statistics for the CPAN, so I just
took 5000 as a very conservative estimate of CPAN's size in 2004 (knowing
that it's close to 10000 modules now): the smaller the number of projects,
the less important the bias introduced in the original analysis. Retaining
that number, the 2004 result was bloated by at most a factor of
5000^0.05 (about 1.53), leaving CPAN's cost in 2004 between
$442M and $677M, depending on the size distribution of CPAN's modules.
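The range above can be reproduced with a few lines of Python. This is only a sketch: it assumes the basic COCOMO exponent of 1.05, and the function names are mine, not part of any tool.

```python
def max_inflation_factor(num_projects, exponent=1.05):
    """Upper bound on how much lumping `num_projects` projects into one
    inflates the estimate: (P*L)**e / (P * L**e) = P**(e - 1)."""
    return num_projects ** (exponent - 1.0)


def cost_range(lumped_cost, num_projects, exponent=1.05):
    """Return (lowest, highest) plausible cost given a lumped estimate."""
    lowest = lumped_cost / max_inflation_factor(num_projects, exponent)
    return lowest, lumped_cost


if __name__ == "__main__":
    factor = max_inflation_factor(5000)
    low, high = cost_range(677e6, 5000)
    print(f"inflation factor: at most {factor:.2f}")
    print(f"CPAN cost in 2004: ${low / 1e6:.0f}M to ${high / 1e6:.0f}M")
```

Plugging in 10000 projects instead of 5000 gives a larger factor and therefore a lower floor, which is why 5000 is the conservative choice here.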

RAA's cost in 2006 dollars is under $191M; let's make it $100 million,
assuming that the 89% I didn't analyze is smaller on average than the 11% I
did consider. Inflation is well under the error margin for CPAN's cost, so
there's no need to convert it into 2006 dollars. So the final, quotable
result is:

CPAN would cost around 5 times more than RAA according to the COCOMO basic model.

The cost of the interpreters

Surprisingly, ruby (including its standard lib) costs more than the
corresponding perl distribution: it's $20M vs. $15M, due to Ruby's richer
standard library. By the way, since perl is hosted in CPAN and ruby isn't in
RAA, those $20M could be added to the $100M used in the above analysis...
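For what it's worth, here is a small Python sketch of how the CPAN/RAA ratio moves when those $20M are added. It only reuses the figures quoted in this post ($442M to $677M for CPAN, $100M for RAA, $20M for ruby); the function name is made up for the example.

```python
def ratio_range(cpan_low, cpan_high, raa_cost):
    """Return (low, high): how many times more CPAN costs than RAA."""
    return cpan_low / raa_cost, cpan_high / raa_cost


CPAN_LOW, CPAN_HIGH = 442e6, 677e6   # CPAN's 2004 cost range
RAA = 100e6                          # rounded RAA estimate
RUBY = 20e6                          # ruby interpreter plus standard library

without_ruby = ratio_range(CPAN_LOW, CPAN_HIGH, RAA)
with_ruby = ratio_range(CPAN_LOW, CPAN_HIGH, RAA + RUBY)

if __name__ == "__main__":
    print("RAA alone:     %.1fx to %.1fx" % without_ruby)
    print("RAA plus ruby: %.1fx to %.1fx" % with_ruby)
```

Either way, "around 5 times" stays inside the resulting range.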

Counting lines of code

The original CPAN estimate was done with SLOCCount. I also used it for perl
and ruby (the interpreters themselves plus their stdlibs), but wrote a small
script for the RAA subset: