The Zen of Comprehensive Archive Networks

It seems that there is a lot of interest in having archives for other
languages similar to what CPAN [1] is for Perl. I should know; over
the years people from at least the Python, Ruby, and Java communities
have approached me or other core CPAN people to ask basically "How
did we do it?". Very recently I've
seen even more interest from some people in the Perl community wanting
to actively extend a helping hand to other communities. This
'missive' tries to describe my thinking and help people wanting to
build their own CANs. Since I hope this message will somehow end up
reaching the other language communities I will explicitly include URLs
that are (hopefully) obvious to Perl people. Note that I'm going to
describe what worked for Perl; translate appropriately for other
languages.

I'll start negatively and end with hopefully more constructive notes;
however, these will build on the denials.

In the following, Mumble and mumble stand for any language other than
Perl, or a combination of such languages.

First, the negative statements.

CPAN shall not 'piggyback' other languages.
(In other words, there shall not be a mumble/ top level directory.)

Rationale: CPAN is CPAN is CPAN. CPAN carries Perl.
This implies all kinds of different contracts, explicit
and implicit.

Some people in the Mumble community will take offense to CPAN
carrying Mumble.

Some people in the Perl community will take offense to CPAN
carrying Mumble.

Some CPAN mirrors will take offense to suddenly having to carry
Mumble as well.

Some CPAN mirrors will become resource (bandwidth, disk) constrained
after suddenly having to carry Mumble as well.

CPAN cannot 'piggyback' other languages.

The building blocks or 'plumbing' of CPAN (the basic directory
structure, the PAUSE) are a reasonably good match for Perl.
I'm not so certain that they are for all the other languages.

Now, on to the hopefully more constructive suggestions.

First and foremost-- I'm not against other language communities having
a CPAN. I would love to have such archives. I'm willing to help the
other language communities. I'm only against too straightforward
"let's just slap it on to the side of CPAN" solutions to the problem.
Other languages are not like Perl; they are different, to a smaller or
larger degree. Let's allow them their own degree of dignity and
careful thought.

Then on to the technical questions, also known as "How did you do it?"
Well, people always ask me that and I go speechless... "Errrr,
ummm, I kind of pulled all this stuff together and organized it a bit,
and put it on an ftp server". After this a brooding silence always
falls... "And...?" ... "And what?" ... "That's it?" "That's it."

Components of CPAN

Well, that's not really it, of course. The above is how CPAN started.
How it grew is another story. First, Larry designed Perl to grow by
letting it have modules (in other words, namespaces). Then we had a
couple of wise men (like Tim Bunce) who had the vision of good module
naming guidelines. Finally, we had Andreas König, who single-handedly
wrote PAUSE [2], the module submission machinery, where Perl module
authors can register, submit, and manage their submissions. This
allowed for a rapid but still controlled growth of modules.

Installing modules can be difficult, especially if that involves
having to glue in C and/or external libraries. Andreas and other
people wrote both a frontend and a backend for this: the frontend is
known as the CPAN (shell) [3] and the backend is known as the
MakeMaker [4]. The shell (also known as CPAN.pm) takes care of
downloading the required components, and the backend creates the
required Makefiles (or equivalent build tool control files) and then
invokes the appropriate build tools.

Incidentally, naming the module installation shell identically
with the archive proved to be more than a little bit confusing:
people may talk of "CPAN being broken" and you will have no idea
whether they are talking of a bug in the shell, their favourite CPAN mirror
being down, or whether they are objecting to some design detail of CPAN
in general.

Another variant of confusion is that many people think
CPAN is "just" the PAUSE, in other words, just the modules submitted
by authors using the PAUSE interface. While not wrong (the
overwhelming majority of CPAN content does indeed come from PAUSE),
this is not exactly right, either. Firstly, CPAN does have other
sources than just PAUSE: there are a couple of small sites CPAN merges
into itself, and some files (like some rarer binary distributions of
Perl) are still fetched manually (since they change infrequently).
Secondly, there is the ports page that lists binary distributions
for Perl, some in CPAN, most hyperlinked from elsewhere.

An essential feature for (half)automated installation tools
is easy extraction of module dependencies. Easy documentation
extraction allows for easy online documentation browsing, which
in turn makes it easier for people to decide whether they want
to use a module, and when they use it, to use it better.
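
To illustrate why machine-readable dependencies matter to an
installer, here is a minimal sketch in Python. It is not how the CPAN
shell works internally; it assumes the dependencies have already been
extracted into a plain mapping, and the module names and the
install_order helper are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter


def install_order(dependencies):
    """Return module names ordered so that every module comes after
    the modules it depends on.

    `dependencies` maps a module name to the set of modules it needs.
    Raises ValueError if the dependencies form a cycle.
    """
    try:
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError as err:
        # args[1] holds the list of nodes that make up the cycle.
        raise ValueError(f"circular dependency: {err.args[1]}") from err


# Hypothetical metadata, as an installer might collect it.
deps = {
    "App::Tool": {"Net::Fetch", "Text::Table"},
    "Net::Fetch": {"Core::Util"},
    "Text::Table": {"Core::Util"},
    "Core::Util": set(),
}
order = install_order(deps)
```

With the metadata in this shape the installer can fetch and build
everything in one pass; without it, somebody has to read each README.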

Since the CPAN shell is starting to show signs of its age and because
it doesn't have a good programmable API, a new project called
CPANPLUS [5] has been started. It will hopefully be a drop-in
replacement for the old trusty CPAN shell, but also allow greater
flexibility and extensibility. Similarly, there is a replacement
project for MakeMaker, Module::Build [6].

Note that CPAN.pm and MakeMaker come with every Perl distribution,
but it is possible to write alternative module installation
interfaces: ActiveState has their own interface called ppm
(Programmer's package manager, originally known as Perl package
manager) for their ActivePerl distribution.

Because of the growth of CPAN, it finally became too arduous
to know what was out there, and luckily Graham Barr's scratch for
this itch became large enough to be published as search.cpan.org [7].
There are also alternative search engines for CPAN, Randy Kobes'
search [8] and WAIT [9], but search.cpan.org seems to be the most
popular.

Later backPAN [10] was added by Andreas to hold all the old versions
of submissions deleted by their authors; this ties back into simple
basic things that the core server(s) must have, like good backups.

The cpan-testers is a mailing list
(started by Graham Barr and Chris Nandor) whose subscribers download
recent module uploads, try running the regression suites, and report
the success or failure back to a mailing list, which gets databased,
and of course back to the original author. This has proved to be
invaluable in making the modules more portable between operating
system platforms and different releases and configurations of
those platforms. Also important to notice is that having regression
test suites coming with the modules is essential-- how else can you
know whether the code works at all?

Mirrors

CPAN mirrors [13], then? How did they come about? The original ones,
a dozen or so, were easy: I just asked the maintainers of the original
ftp sites where I had found the seeds of CPAN whether they might be
interested in carrying this slightly bigger amalgamated Perl archive.
Well, they foolishly agreed... I have to remind people once again
that CPAN was conceived as an FTP archive. Not a website. And it
still is that way. search.cpan.org just gives a nice interface. I'm
sorry but I'm a dry CS engineer, not a graphic designer. Information,
not animation.

Oh, back to the CPAN mirrors. After the original ones, we grew slowly
for a while, by word of mouth in the Perl community. However, since
this was the time before billions of dollars' worth of fiber had
been dug into the ground, Internet connections were still a bit dodgy
and spotty.
Therefore I started doing two things: scanning ftp logs for sites that
obviously were mirroring CPAN but were not registered mirrors, and
for sites that were good representatives for their particular top
level domain, especially outside the big seven TLDs. This way I could
track down where Perl was used and, by asking those sites to
participate, push load away from the master site. Later I also filled in
missing countries by going for sites like the sunsites, and other
vendor/public funded sites that had a good chance of having good
connectivity. Usually I could find a sympathetic soul, oftentimes a
system administrator.

The status of the CPAN mirrors is monitored four times a day, from
two different machines on two different continents. A stale mirror is
almost worthless, sometimes even dangerous. Note also that as the
number of mirrors grows, don't expect to be able to check all of them
at each scan: there are always some network or server problems that
will stop you from getting all the status information. Getting the
full status of all the files on all the mirrors is a fantasy unless
the mirrors themselves run integrity checks. CPAN relies on a very
simple trick: the CPAN master site updates a certain file once every
hour, embedding a (UTC) timestamp in that file. By downloading that
file from a mirror and extracting the timestamp we can trivially see
when they last updated.
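
The timestamp trick can be sketched in a few lines of Python. This is
an illustration under stated assumptions, not the actual CPAN script:
the stamp format (epoch seconds as the first whitespace-separated
field), the function names, and the two-day threshold are all
hypothetical.

```python
from datetime import datetime, timedelta, timezone


def parse_stamp(text):
    """Return the UTC time encoded in the first field of a stamp file.

    Assumed format: epoch seconds, optionally followed by a
    human-readable date that is ignored here.
    """
    epoch_seconds = int(text.split()[0])
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)


def mirror_age(stamp_text, now):
    """How far behind the master a mirror is, judged by its stamp."""
    return now - parse_stamp(stamp_text)


def is_stale(stamp_text, now, max_age=timedelta(days=2)):
    """True if the mirror's stamp is older than max_age."""
    return mirror_age(stamp_text, now) > max_age
```

Because the master rewrites the file on a fixed schedule, the age of
the copy on a mirror is a cheap stand-in for the age of the whole tree;
no per-file comparison is needed.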

Summary of the mirror tirade: I went for sites that liked and/or
used Perl. I have no way of knowing off-hand whether they would
like Mumble. The mirrors are donating their network and storage
capacity and some amount of their administrative time for the
Perl community. If we would like to extend that in any way
we would have to ask all of them individually.

You can learn more about CPAN's history from the Perl timeline [14].
Things didn't happen overnight.

Naming

A quite important thing for both the authors and the users is that the
language must get the naming scheme of its modules right, or at least
reasonably close. Perl's/CPAN's is far from perfect, but at least it
was once designed, and it has been enhanced over the years as new
needs have appeared. A good naming scheme allows hierarchical
browsing, gives good hints for search engines (a good name is
effectively a string of uniquely identifying keywords), and
coordinates community efforts. Some sort of conflict resolution
mechanism in case of competing and identically named implementations
is important. Keeping all those guidelines well documented and all
these processes public is important.

One naming issue I think Perl 5 got wrong is that module namespaces
are first-come-first-served: two or more different authors cannot have
an identically named module. This may lead to unintentional or
intentional namespace squatting, and some overly heated exchanges,
none of which is good for the community.

When designing your author/module/whatever hierarchy think
scalability. We originally got it wrong in one spot by having all
authors as subdirectories in one single directory which quickly became
a bottleneck. (The solution to this was simply to 'hash' based on the
leading two characters of the user ids.) Think also of several different
views of your data: by author, by module, by category, by date, by
keywords, and so forth. Don't think only hierarchical views will be
enough: you will need searching capabilities.
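
The directory 'hashing' mentioned above can be sketched as follows, in
Python for illustration. The two-level layout (first character, then
first two characters, then the full id) and the author_path helper are
assumptions made for this example.

```python
def author_path(user_id):
    """Spread author directories over two levels of buckets.

    'jhi' becomes 'J/JH/JHI', so no single directory has to hold
    every author.
    """
    uid = user_id.strip().upper()
    if len(uid) < 2:
        raise ValueError("user id must have at least two characters")
    return "/".join((uid[0], uid[:2], uid))
```

The point of the design is that the bucket is computable from the id
alone: no lookup table is needed to find an author's directory.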

Licensing

Get your license policy clear from day one. No, day minus one.
In this day and age it is very important that every piece of software
gets clearly marked as to what license it carries. Build your module
packaging tools so that they suggest, maybe even demand, that the
author pick a license. This way both the users of modules and
distributors of software wanting to include the module don't have
to keep guessing.
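
A packaging tool that 'demands' a license can be as small as the
following Python sketch. The metadata is assumed to be a plain
dictionary with a "license" key; both the key name and the
check_metadata helper are hypothetical.

```python
def check_metadata(meta):
    """Return a list of problems that should block packaging.

    An empty list means the metadata is acceptable.
    """
    problems = []
    license_name = str(meta.get("license") or "").strip()
    if not license_name:
        problems.append("no license declared; pick one before uploading")
    return problems
```

A stricter tool could also compare the value against a list of known
license names, but even this much removes the guessing.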

Very much related to the licensing is of course commercial use:
CPAN took the easy and clear policy of no commercial software
of any kind, not even share/guilt/donateware would be allowed.
We felt that any other policy would be open to nitpicking, or
maybe even legal challenges, and as a volunteer group
we do not have time or other resources for any of that.

Keep Things Safe

That the servers hosting the archive core services should
be paranoidically maintained and monitored for security goes
without saying, but I'm saying it anyway.

Should you have PGP/GPG keys and triply-written-in-blood signatures?
Maybe. Currently CPAN has only MD5 checksums-- but so far they have
been enough. Then again, given the recent rise in Trojan attacks
against various pieces of open/free software a greater level of trust
may be needed. There are ongoing projects that enable using PGP/GPG keys
for verifying the origin of the software; but as always with PKI
systems, bootstrapping the web of trust is hard, some say not even
worth the trouble. Where should you store the public keys? Obviously
not in the same place as the module distributions themselves. Which
public key servers would you trust? One lightweight way to
do without PKI would be simply to distribute the original checksums to
enough places so that an attacker couldn't feasibly modify all the
copies. But at some point you would very probably be trusting DNS,
anyway.
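
The 'checksums in enough places' idea can be sketched like this, in
Python. It is only an illustration: the quorum rule and the function
names are assumptions, and the sketch uses SHA-256 from the standard
library rather than the MD5 checksums mentioned above.

```python
import hashlib
from collections import Counter


def digest(data):
    """Hex digest of the downloaded bytes (SHA-256 in this sketch)."""
    return hashlib.sha256(data).hexdigest()


def agreed_digest(published, quorum):
    """Pick the checksum that at least `quorum` independent places
    agree on, or None if no value has that much support.

    `published` holds one checksum per location; None marks a
    location that could not be reached.
    """
    counts = Counter(d for d in published if d is not None)
    if not counts:
        return None
    value, votes = counts.most_common(1)[0]
    return value if votes >= quorum else None


def verify(data, published, quorum=2):
    """True only if the data matches the checksum the quorum agrees on."""
    expected = agreed_digest(published, quorum)
    return expected is not None and digest(data) == expected
```

An attacker who alters the file and one copy of the checksum is caught
as long as the untouched copies still reach the quorum.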

Keep Things Open

Code quality? Ratings/reviews? Moderation/metamoderation?
"Approved" SDKs? These are all hotly debated subjects and will not be
addressed here since the CPAN is and will stay an open and free forum,
where the authors decide what they upload. Any further selection
belongs to different fora. Besides, adding any rating or approval
processes creates bottlenecks, and bottlenecks are bad.

Be mindful of platforms other than Intel Linux and Windows.
There's no need to alienate people of rarer tastes. One day
they will help you.

Make your archive accessible via several means. Don't stop
at just HTTP: think FTP and rsync, too. On the other hand,
do get the basic protocols right first-- don't jump off the deep
end and try to create an all-singing all-dancing web service,
or whatever is currently fashionable. This ties back to being
platform agnostic: try to package your modules so that the
maximum number of people can install them.

CPAN Scriptorium

The scripts that maintain the CPAN are dreadfully simple. They are
just simple shell scripts that copy sites A, B, ..., Z to the CPAN
master site at ftp.funet.fi, launched from cron. Many of them use Ye
Olde Original mirror [15], some of them are just rsync [16].
No magic. I really don't have anything to give away, no magic bags
full of powerful CPAN spells. The most complex script in the CPAN
master site is the script [17] that probes the mirror sites for
uptodateness-- and even that is not rocket science, just multiplexing
ftp and http downloads and comparing timestamps.
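
For readers who want a starting point, the multiplexing part can be
sketched in Python as below. This is not the script referred to above.
The fetch_stamp callback (which would do the actual ftp or http
download and return epoch seconds), the function names, and the
tolerance are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def probe_mirrors(mirrors, fetch_stamp, max_workers=8):
    """Fetch the stamp from every mirror concurrently.

    `fetch_stamp(url)` is a caller-supplied function returning the
    mirror's timestamp as epoch seconds. A mirror whose fetch raises
    is recorded as None rather than aborting the whole scan.
    """
    def probe(url):
        try:
            return url, fetch_stamp(url)
        except Exception:
            return url, None

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(probe, mirrors))


def lagging(results, master_stamp, tolerance_seconds):
    """Mirrors that answered but are further behind than the tolerance."""
    return sorted(
        url for url, stamp in results.items()
        if stamp is not None and master_stamp - stamp > tolerance_seconds
    )


def unreachable(results):
    """Mirrors that gave no answer in this scan."""
    return sorted(url for url, stamp in results.items() if stamp is None)
```

Keeping unreachable mirrors separate from lagging ones matters: a
mirror that misses one scan is usually a network hiccup, not a stale
archive.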

Andreas has the webserver code for PAUSE available online [18].
That code is slightly more complex than the core CPAN scripts,
or the scripts supporting the PAUSE; but even here, the code is there.
Again, no tricks up our sleeves.

Conclusions

There is no magic. All it takes is a few people who sit down and first
get something running, a rough cut. Then iteratively enhance it.
Don't try to create a master plan that will get everything right
in one fell swoop. The only one that will get swooped is you.

One way to summarize most of the above is the priceless KISS
principle-- Keep It Simple, Stupid. Avoid too complex setups.
Start simple.

Another important credo is: Avoid bottlenecks and
interdependencies. Decentralize. Create and encourage
alternatives. For example, the most popular search engine of CPAN
isn't actually part of CPAN proper: search.cpan.org just mirrors CPAN
and from the data builds the search indices and searching/browsing
interfaces. That way there can be several search engines of the same
CPAN. Similarly, currently we use CPAN.pm + MakeMaker to install
modules: but we are not committed to either, and the community is
working on replacements. Keep things loosely connected. This
allows for different people to work on their own enhancements without
disturbing the other parts.

Perhaps the most demanding thing is commitment: someone must
keep things running. A slowly decaying and dusty archive is almost
worse (and certainly more sad) than no archive at all.

While writing this article I got valuable feedback from many people:
from the CPAN core people, and from the readers of use.perl.org
(a Perl news and community website).
I have to especially mention Neil Kandalgaonkar, who shared
his war stories from the ActiveState trenches.

This article is free documentation; you can redistribute it
under the same terms as Perl itself. Quoting it, linking to it,
translating it to other languages, or using the illustration(s)
is allowed as long as the URL of the original article
(http://www.cpan.org/misc/ZCAN.html) is included.