open source systems for libraries

roberto writes: "A full-text paper regarding some open source ILS (Integrated Library Systems), already announced to this discussion list some months ago, has just been translated from the original Italian into Spanish by Patricia Russo."

You may reproduce this article in any format and for any purpose, but only in its entirety, including this statement.

One of the world's coolest free software applications is Zope, the web application server cum object database from Digital Creations. The story
of Digital Creations, a consulting company for whom open sourcing the family jewels (Zope) and its development process marked the road to financial stability and continued growth, is well documented (see here
and here for more on the corporate history). And the success of Zope itself is easy to find all over the web, including messages on various lists from strangely over-the-top enthusiasts saying things like "oh my goodness, I get it now, I'm never doing this with anything but Zope again!" (maybe one or two of those from yours truly). By the way, if you've never seen or used Zope, check out Paul Browning's Zope - a Swiss Army Knife for the Web? for a great intro for librarians and university types.

But did you know that some of the key folks behind Zope admit, unabashedly, that they've "always wanted to be a librarian"? In fact, when I met Paul Everitt, CEO of Digital Creations, at LinuxWorld last month, that's exactly what he said. A quick tour of Zope, and especially many of its newest features, reveals that the folks behind Zope (Digital Creations staff and the many contributors to Zope's open source Fishbowl Process for development) think very deeply about many of the same information issues fundamental to librarianship... especially metadata, and the creation, management, and leveraging thereof.

Paul and Ken Manheimer, developer on many Zope projects at Digital Creations (and noted as "Mailman's Savior" in the
acknowledgements of everybody's favorite list manager, among other accomplishments and contributions), graciously agreed to be interviewed about recent developments in Zope for oss4lib.

[Note: Our interview took place over a few rounds of mail
exchanges/responses, so some reordering and very minor
rewording of a few of the questions has been done for flow.
None of the text of responses has been altered, and Paul
and Ken both reviewed the text before it was posted.
Any mistakes or misrepresentations this refactoring might have
caused are the author's fault.]

oss4lib:
You and Ken both confess to being "closet librarians." Tell us about that. (It's okay to come out of the closet on oss4lib. We understand, we've all been through it. :)

Paul Everitt:
Well, I'd put it like this. We both think about information and content in abstract, rich ways. We then obsess over valuable ways to leverage the relationships and organizations within content.

Ken and I have been fond of saying lately that metadata and the
relationships in the "space in between" content are themselves content.

In Zope, everything is a dynamic object in the database, including metadata and indices. Since content is just marketspeak for object, we actually have a platform to do something about these ideas.

Ken Manheimer:
Yes! Computers enable profound increases in the scale of communications, over space and time. Without sufficient measures for organizing all this stuff being cast around, we inundate people as much as - or more than - we help them get the answers they're seeking, when they need them. By establishing the "spaces between" the information - the context, the relationships - as explicit content, the system can take the context into
account, and we can develop strategies and mechanisms that help fit answers into context. We can help fit collections of answers into stories.

oss4lib:
Traditionally, libraries gather metadata and information about content relationships in reference sources (in holdings catalogs as well as acquired encyclopedias, directories, indexes, atlases, union catalogs, etc.), enshrine these in a reference section, and put reference librarians in the "space between" this section and library visitors. Your approach, very cognizant of this
traditional model, creates and organizes as much of this information as possible automatically, and allows users to fill in many of the gaps themselves if they choose to behave certain ways.

Manheimer:
I think that's the gist of it. The idea is to organize the process and interface so that whatever metadata can naturally fall into place does fall into place. (Note that this is different from doing complex inferencing, e.g. AI or iterative collaborative filtering. Using collaborative feedback will have a big place, I think, but I shy away from elaborate inferencing, at least until we've milked the overt stuff that can be gathered in process...)

oss4lib:
In the recent Zope
Directions Roadmap, you're making another push to simplify how people approach Zope (e.g. moving away from DTML), and to target your audience further (e.g. restated focus on developers). There must be a difficult balance to maintain when making things better and easier runs the risk of introducing change for some of Zope's most loyal fans. This issue faces libraries every day, especially with longtime users unfamiliar with new interfaces. We usually struggle with such decisions and end up trying to please everybody. How do you make these decisions?

Everitt:
In the "fishbowl". That is, we make fairly formal proposals and
put them up for a review process. We've struck a nice balance between the power of global collaboration and the coherence of benevolent dictatorship.

People have found the Zope learning curve too steep. In retracing the cause of this problem, we found that Zope tried to be all things to all people. Not knowing your primary audience leads to usability problems. Thus, order number one was to pick an audience for Zope, and then allow Zope extensions to go after other audiences.

This leads to the Zope Directions document you mentioned. Zope is for developers who create useful things for other audiences.

Note that our idea of developer, thanks to Python, differs dramatically
from other systems which use low-level systems programming languages like C++ or Java for extension.

oss4lib:
Clearly, you pay close attention to what your consulting customers want from you, and to what active members of Zope's open source development want from Zope. As each community grows, how do you manage to keep listening closely to both? In what ways do the ideas you hear overlap?

Everitt:
There's a value cycle in our strategy. We tell customers, "We have this Open Source platform with great value being added from developers and companies worldwide that you can tap into." We then have to execute on having a strong, attractive platform for developers to create interesting things like Squishdot, Metapublisher,
etc.

Then we turn it around and explain to the community how
customer engagements are driving things that are clearly important to the platform's viability, such as enterprise scale.

It's worked out very well, although there are times where the choice has to be made, and this almost always means the consulting customer wins. As we've learned from these situations, we've adapted our organizational model to better leverage the synergy. How we're now structuring ourselves is becoming as exciting as the software itself.

oss4lib:
The new Content Management Framework (CMF) should appeal to many libraries, especially those wanting to empower patrons to manage their own content and allow customized content views. One of the most interesting things about the CMF is its deep support for Dublin Core (DC), with every object supporting DC descriptions. What led you in this
direction?

Everitt:
I've been doing this information resource and discovery thing for a while, with Harvest in 1993 and CNIDR and the like. I had followed Dublin Core for some time, plus related initiatives such as IAFA.

However, Mozilla was the first time I had seen DC built into a
platform. Being tied to RDF
nearly put it out of reach for people. But the value of having every object or resource in Mozilla support a standard set of properties was apparent, even for a knucklehead like me. :^)

oss4lib:
A common frustration with Dublin Core is that it would be all
the more powerful in the aggregate if more applications and sites implemented it.

Everitt:
Alas indeed! But it's not hard to see why it hasn't taken off. It's hard to get authors to participate in metadata. And when there's nearly no payoff or visible benefit, the incentive is even lower.

RDF has suffered from this same chicken-and-egg problem. It's needed a killer app that simultaneously sparks both supply and demand.

oss4lib:
Seeing DC in the CMF gives us hope. :) A likely upside is that if more applications and sites use DC, everyone will clamor for more robust metadata. In what ways are you planning for that next level?

Everitt:
I believe Ken would agree that the next area of interest for us over the next six months is the "space in between" content.

Manheimer:
Yes! There's a lot of metadata that can be inferred
on the basis of process and content.

For instance, we can identify the "lineage" of a document according to the document from which it was created. We can harvest the actions of visitors, like site-bookmarking, commenting, and rating documents, to glean orientation info for subsequent visitors. We can infer key concepts from the content, e.g. common names (in the wiki, WikiNames).

Overall, we can reduce the burden on the content author and editors to fill in metadata when it can be inferred from process and content.

Everitt:
To illustrate, I'll go back to one of the eureka moments that Ken and I had several months ago. We've been pretty big consumers and contributors to the Wiki movement, which on the surface is the unapologetic antithesis of librarianship. That is, Wiki really tries to say, "We'll lower the bar so far, you'll always jump over it."

At one point, though, I became concerned that we were building an
alternative CMS with our WikiNG efforts, so Ken and I sat down and tried to plan ways to converge Wiki and CMF. We listed the things we liked about Wiki, what were the real innovations, and discussed ways to converge these innovations into the CMF.

We found out that one of the most attractive areas of Wiki was the way it assembled relationships and meaning from a corpus of
slightly-structured information. For instance, Wikiwords (the automatic hyperlinks generated from CamelCase words) not only give a system for regular web hyperlinks, they also give a system for the reverse (what pages are pointed to by this page).

In fact, the Wikiword system is a self-generating glossary that distills out important concepts in a corpus. And in Zope, these Wikiwords could become objects themselves. That is, they could become content.
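(The Wikiword-as-glossary idea can be sketched in a few lines of Python; the regex convention, page names, and text below are invented for illustration and are not Zope's actual implementation.)

```python
import re
from collections import defaultdict

# CamelCase words of two or more capitalized parts, per classic wiki convention
WIKIWORD = re.compile(r'\b(?:[A-Z][a-z0-9]+){2,}\b')

def link_maps(pages):
    """pages: dict of page name -> raw text.
    Returns (links, backlinks): what each page points to, and the reverse."""
    links = {name: set(WIKIWORD.findall(text)) for name, text in pages.items()}
    backlinks = defaultdict(set)
    for name, targets in links.items():
        for target in targets:
            backlinks[target].add(name)
    return links, backlinks

pages = {
    "FrontPage": "See ZopeBook and WikiName conventions.",
    "ZopeBook": "Back to FrontPage.",
}
links, backlinks = link_maps(pages)
print(sorted(backlinks["FrontPage"]))  # → ['ZopeBook']
```

The forward map is the self-generating glossary; the reverse map falls out of it for free, which is exactly the "space in between" becoming content.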

This applied equally to the "backlinks" idea (or lineage) that Ken added
to our Wiki software. [Manheimer: A small correction - lineage is
actually different than "backlinks", the latter are common to all wikis.
Read on.] If you edit Page A and put in a Wikiword that
leads to the creation of Page B, then you have a relationship: Page A ->
Page B. If you then edit Page B to create Page C: Page A -> Page B ->
Page C. The backlink information itself could become content, thanks to
the relationships.

Everitt:
Neither the Wikiword nor the lineage is part of the content. They
exist in between the content. But they are as powerful as the content,
and in fact, they can be treated with some of the same services you
would apply to content in a CMS.

oss4lib:
At Yale we used Wiki very successfully for
documenting several project discussions, but we also experienced many of
the common problems with wiki (e.g. who wrote what, how do you track
changes, how do you preserve ideas, etc.). What are some other important
improvements we should look for from the WikiForNow effort, and what
else should we look for from WikiNG?

Everitt:
WikiForNow, thanks almost exclusively to Ken's perseverance,
illustrates how a smarter system can address the common problems -
without, hopefully, throwing the baby out with the bathwater. Each of
the three things that you mentioned is in WikiForNow.

However, they get there in WikiForNow by tapping into infrastructure
that is shared amongst all content in Zope or in the CMF. For me,
WikiNG is more about devolution rather than evolution. That is, take
the zen of Wiki and the features of Wiki and make them pervasive beyond
Wiki. That means that all content gains Wiki Zen.

Manheimer:
Classic wiki shows many things well worth doing - e.g., WikiWord vocabulary,
backlinks, recent changes, etc. It also manifests an outstanding *way* of
doing them - low-impedance, low-complication operation and authoring - that
we will only be able to achieve in the general realm if we use a smart,
discerning framework. From my viewpoint, having recently joined the CMF
effort, I think it is becoming just such a framework. I think we will be
able to generalize the classic wiki features, and our organizational
strategies/extensions, more globally and across a richer, more comprehensive
range of content. I'm excited about it!

Everitt:
It remains to be seen whether this model can achieve its goal without
losing the simplicity that makes Wiki so pervasive. But let me describe
a thought scenario and see if it makes sense to you...

Email and news. Lots of content blows by, a continuous flow of wisdom
left almost completely untapped. Email isn't content.

However, let's say that some smart mailing list management software
(such as Mailman) did a little bit more than relay mail and archive a
copy on disk. Let's say it also shoved a copy of the message into a
content management system, which converted relevant RFC822 headers into
Dublin Core, indexed the contents, etc. Just for fun, let's call that
CMS, well, the CMF. :^)

So in real time people could do full-text and parameterized searches. Big
deal.

However, let's say the CMF also applied some of the ideas above to
email. For instance, threading relationships in email could translate
to backlinks/lineage, from which you could make inferences.
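(To make the threading-to-lineage idea concrete, here is a hypothetical Python sketch using the standard library's email objects; the message IDs and the thread itself are invented.)

```python
from email.message import EmailMessage

def lineage(messages):
    """messages: list of EmailMessage.
    Returns a child -> parent map built from In-Reply-To headers."""
    parents = {}
    for msg in messages:
        parent = msg.get("In-Reply-To")
        if parent:
            parents[msg["Message-ID"]] = parent
    return parents

def chain(msg_id, parents):
    """Walk a message's lineage back to the thread root."""
    path = [msg_id]
    while path[-1] in parents:
        path.append(parents[path[-1]])
    return path

# A three-message thread: a <- b <- c
a = EmailMessage(); a["Message-ID"] = "<a@list>"
b = EmailMessage(); b["Message-ID"] = "<b@list>"; b["In-Reply-To"] = "<a@list>"
c = EmailMessage(); c["Message-ID"] = "<c@list>"; c["In-Reply-To"] = "<b@list>"

parents = lineage([a, b, c])
print(chain("<c@list>", parents))  # → ['<c@list>', '<b@list>', '<a@list>']
```

The parent map is exactly the Page A -> Page B -> Page C lineage from the wiki discussion, harvested from headers the mail software already writes.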

But let's take it a *huge* step further. Let's say that a small
portion, perhaps 1%, of the people on the mailing list committed
themselves to being Knowledge Base Contributors. That is, before
sending their email, they observed a couple of small conventions:

a. Using RFC822-style headers at the top, as is done in CMF's structured
text, they add targeted cataloging data.

b. In the text of their message they use Wikiwords.

For instance, an email message in response to a bug report might look
like this:
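[The sample message did not survive in this copy of the article. A reconstruction along the lines Everitt describes - a Wikiword (here, ReindexBug) in a header and in the body, plus a non-Dublin-Core "rating" header - might look like this; all names and values are invented:]

```
From: contributor@example.org
Subject: Re: trouble reindexing changed objects
Description: A workaround for the ReindexBug in CatalogAware objects
Rating: 7

I ran into the ReindexBug too. Until it's fixed upstream, the
workaround is to unindex the object before editing and reindex
it afterwards.
```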

Observe that a Wikiword was used in the headers as well as the body.
There was also an extra header (rating) that isn't part of Dublin Core.

So all we ask is that a very small percentage of people use this system,
and the smart mailing list server will munch the headers at the top of
the email message before relaying them. Not a very high bar to jump over.

Thanks to the threading response relationships, nearly every email
message in the corpus will be within one or two relations from something
manually annotated.

You could then provide tools that treated the relations and the concepts
as content, allowing reparenting and cleaning up the vocabulary.

oss4lib:
Implicit in this is knowing that in a given community a small subset of
folks will self-select into a group of detail-obsessives working to help
the others find and manage context-relevant information. In libraries,
it's the catalogers; in the general public it's folks adding to IMDB or
moderating MusicBrainz; in the hacker community, it's folks writing
How-Tos, guiding free software projects, moderating slashdot, and so on.

Manheimer:
Truly, the power of our species is collaboration. That's why computer
communications are making so many fundamental waves - they enable
quantum leaps in collaboration scale, immediacy, and intricacy. We're
all only gradually learning to harness that potential. I think the
librarian sensibilities are key because they're about systematizing
the advancements so they scale...

oss4lib:
A wiki "feature" that made a few librarian colleagues cringe was that
its mutable, dynamic nature was the only possible state. It was
agreed by all that a great function would be to enable offloading a
static snapshot of a wiki as a set of properly hyperlinked html
pages. Is it possible now to preserve a Wiki this way?

Everitt:
Sure, if that's what people want. wget will easily snarf
down a copy of a site.
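(As an illustration - the URL is a placeholder - wget's mirroring options pull down a browsable static copy:)

```shell
# Mirror the site into a local directory, rewriting links so the
# snapshot browses offline; the URL is a placeholder.
wget --mirror --convert-links --page-requisites --no-parent \
     http://wiki.example.org/
```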

But that's only one solution to the problem. A better solution is a
better system, one in which access control is possible and access to
previous versions (history) is there.

Manheimer:
RegulatingYourPages, mentioned above, details this.

oss4lib:
It sounds like the future of Wiki overlaps in many
ways with your plans for robust metadata support.

Everitt:
As hinted above, we already have RFC822-style headers in Wiki. For
instance, if you edit a Wiki page from an FTP client like
Emacs/ange-ftp, you'll see the Wiki seatbelt inserted at the top of the
page.

More important, the CMF will converge with our Wiki efforts. A Wiki
page will have all the web-based and text-based authoring benefits and
metadata of a CMF document. And hopefully, a CMF document will have all
the sublime interconnecting that you see in Wiki.

oss4lib:
The no-longer-active connection with Fourthought and 4Suite was very
promising in this area; while we can always build tools that use 4Suite's
4RDF in conjunction with Zope, a few of us were hoping for deeper
integration. What's the outlook for general support for RDF in Zope?

Everitt:
Hmm, good question. We have an ongoing dialog with Rael
Dornfest from O'Reilly. I'd be interested as well in some of your
thoughts on the subject.

oss4lib:
Btw, we found some old postings from you at your old .mil
address, in the context of GILS. So it's clear that you've been thinking about
metadata on the web for a long time.

Everitt:
Wow, you librarians
have a long memory. :^) I'm embarrassed to think
what I said back then. Well, you know, I was younger then, it was a
crazy time, I didn't know it was a felony... oh, wrong embarrassment. :^)

oss4lib:
What's your assessment, in early 2001, of how much we've progressed
overall in the metadata area? What are the most important
priorities, and is there a holy grail?

Everitt:
I don't think we've made an inch of progress in the
mainstream, meaning outside the library science discipline of the
already converted.

Nearly everyone I meet (besides programmers) uses Word. They use Word
even when they shouldn't. But almost none of them know that
File->Properties exists.

Metadata, through <meta> tags, has been built into the Web since at
least HTML2. So what percentage of web pages have anything other than
the "generator" metatag spewed automatically and unknowingly by
FrontPage? Essentially zero.
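(For reference, the long-standing convention for embedding Dublin Core in HTML, per RFC 2731, uses exactly these <meta> tags; the values here are invented:)

```html
<!-- Dublin Core in ordinary HTML <meta> tags; values are invented -->
<meta name="DC.title" content="Open Source Systems for Libraries">
<meta name="DC.creator" content="Daniel Chudnov">
<meta name="DC.date" content="2001-02-01">
```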

I have hope for incremental breakthroughs like the CMF, which brings a
"Wow, that's the way _everything_ should be done" response to CMS
cynics. However, it's still trying to transform a mature, continuous
market.

I think this takes a discontinuity, a disruptive breakthrough. Lately
Rael has been
talking about doing P2P for syndication. P2P could be the kind of
transformative breakthrough for DC and RDF. Without a standard
vocabularly across verticals (music, etc.), P2P will be another thousand
islands, which dramatically lowers the utility. Unlike web pages, which
generally wants content to be broadcast and rendered, P2P wants to
content to be exchanged. This model demands interoperable content.

oss4lib:
What more can librarians do to contribute our experience
and insight to the broader software community regarding metadata issues?

Everitt:
Uhh, prevent knuckleheads like me from repeating historical mistakes.
It's doubtful that a disruptive technology for metadata will come out of
the ranks of librarians. However, if librarians keep an open mind and
don't fall prey to sacrificing the larger victory by clinging to a
narrow agenda, then they can spot a winner and help guide it to
adulthood.

Many thanks go to Paul and Ken for their willingness, patience,
and responsiveness during the interview process.

(c) April 2000 by Daniel Chudnov
You may reproduce this article in any format and for any purpose, but only in its entirety, including this statement.
Background: the attack of napster

Have you seen napster yet? If not, take a look. Napster is two things: one part distributed filesystem and one part global music copying tool. It works incredibly efficiently and is very easy for users.

A typical session with napster might go like this:

sit at a fairly new computer with a decent net connection

think of a song

install a napster client (easy to find and do)

connect and search for your song

download your song (probably a 3-5Mb .mp3 file)

install an mp3 player (easy to find and do)

play your song

That's it. Do not go to the record store. Do not respond to the monthly selection. Look for what you want to hear, click, download, listen. And everybody's doing it. So many people are using napster, in fact, that several college campus network administrators are cutting out all napster traffic because the traffic is flooding their internet pipes.

Why is napster so successful? Because it's simple. Behind the scenes, it works quite simply also. You have mp3 files on your machine, and your machine is on the net. When you connect (usually by just starting your client application), the napster server knows what files are on your machine (if you tell it where to look). And the napster server knows what files are on the machines of the other two or three thousand people logged in at the same time. There's a song you want to hear? Search the napster server... it knows who has it, and napster will send you a copy of some other bloke's copy of that song.
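(The architecture just described fits in a few lines of Python; the class, usernames, and filenames below are invented for illustration and are not napster's actual protocol.)

```python
from collections import defaultdict

class IndexServer:
    """Toy sketch of a napster-style central index: the server only
    tracks who has which files; the files themselves never pass
    through it."""
    def __init__(self):
        self.holdings = defaultdict(set)   # filename -> set of users

    def connect(self, user, filenames):
        for name in filenames:
            self.holdings[name].add(user)

    def disconnect(self, user):
        for users in self.holdings.values():
            users.discard(user)

    def search(self, term):
        # Brute-force substring matching, much like the unindexed
        # searching described above
        return {name: sorted(users)
                for name, users in self.holdings.items()
                if term.lower() in name.lower() and users}

server = IndexServer()
server.connect("hongkongfooey", ["la_vida_loca.mp3", "obscure_groove.mp3"])
server.connect("dan", ["la_vida_loca.mp3"])
print(server.search("vida"))  # → {'la_vida_loca.mp3': ['dan', 'hongkongfooey']}
```

The actual file transfer then happens peer to peer: the searcher asks one of the listed users directly for a copy.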

Upon connecting to napster (usually late at night on weekends; the university I'm at doesn't allow napster traffic during business hours, a reasonable restriction), there are normally about 2,500 users logged in, and over 600,000 songs (files) available. Probably 80% of these files are duplicated at least once, and 20% probably account for 80% of the traffic and so on, but I've found some fairly obscure, groovy stuff. The key thing is that if I can think of a song, I can probably find it, even though napster applies no organizational scheme to its catalog of connected songs.

What of it?

The questions any librarian would ask at this point are obvious: "what about copyright?" and "doesn't it need to be organized?" The simple answer to question one is that while the napster folks state that they seek "to comply with applicable laws and regulations governing copyright," napster is widely used for making copies of songs in blatant violation of copyright. I know... I've done it. This doesn't seem to stop thousands of folks from using it; evidently folks aren't losing sleep over it. Napster is being sued, but the service hasn't been slowed at all yet. To put it simply, we all know that it's wrong, but somehow this is still too new for many people to dismiss as morally corrupt, and so plenty of users remain.

(Note (2000-4-14): today the news broke about Metallica suing napster and my employer. Maybe it's time to post that Kirk Hammett pick I got at that 1989-7-4 Pine Knob show on ebay. :)

As for organization, it doesn't seem to matter. Nobody's organized a bit of it. Catalogers should turn red when they see how poor (read: absent) the indexing is: 100% brute force, with not even stop word removal and no clear record editing. Applying a few simple techniques to this problem would make searching for songs more reliable but not really any easier, because people are already mostly finding what they want to find, and that's adequate for most. If you don't believe this, ask your nearest university network administrator.

And napster isn't just about music. As is well stated in this cnet article (and in this week's Nature, 13 April 2000, "Music software to come to genome aid?" by Declan Butler), this is a groundbreaking model of information delivery. It changes things. It's a killer app if ever there was one. The napster model shows that it's simple to share movies, music, anything that can live in a file on a connected box. All you need is a simple protocol and some clients that can speak that protocol. And fast net connections and big, cheap hard drives aren't going away anytime soon.

Some might wonder how napster is different from what the web already provides. The difference might seem minor, but it cuts backwards through everything librarians know about giving people access to information. The difference is that while anybody can put anything up on the web and share it with friends, few people can provide the necessary overhead. Even if you can run a web server, for instance, a certain amount of centralized description or searching is still necessary (directory sites, search engines, etc.) before anyone can find your files.

With napster, however, you only have to be connected and willing to share your files. Napster does the rest by keeping track of what you've got so others can find it. You don't need to do anything to let thousands of other people copy files from your machine via napster.

So put aside security and copyright and organization concerns for a moment and consider... does this remind you of anything? Hmmm... I've got a song/movie/file that I've enjoyed and other people might like it too. Maybe other people have things I would enjoy. I wouldn't mind letting other people have my song/movie/file if I could also use theirs in return. This kind of cooperation could work. But how can I be sure that such cooperation would continue? And how would we organize it all?

Ever hear of a lending library?

Paper shall set you free

Funny how napster doesn't care about Dublin Core or MARC. It doesn't need a circulation module. It doesn't even matter what kind of computer you have, as long as you have a working client and decent bandwidth. Think of the implications of applying this model in our libraries. With all the advances in standardization of e-print archives and such (see the Open Archives initiative), we already have high hopes about the future of online publishing. With that solved, maybe the napster model could help us deal with our favorite legacy format: bound journals.

Have you ever worked in or near a busy InterLibrary Loan office? Do you know that sometimes we libraries photocopy and use Ariel to send the same document ten times in a month for patrons in other libraries? It seems terribly wrong that we've got to do this work over and over when we could just keep a copy on a hard drive, but we know well that the legal precedent today prevents us from creating such centralized storage. Heck, often we can't even fill an ILL request out of our already-digital ejournal collections because we sign restrictive licenses. So we have to go back to the stacks and photocopy and scan it through again instead of clickclickclicking to a lovely pdf for which we've paid so heftily.

But looking at napster, there's a key thing to consider about how its file sharing model might apply to document delivery. In napster, there is no centralized storage. In napster clients (gnapster, at least), search results show listings of other users' copies of the song you want. You click to download. In the background napster echoes to its status line "requesting La Vida Loca from user hongkongfooey" and sometimes hongkongfooey says no... which is okay, too, because you can probably ask somebody else for it. When somebody (more precisely, their napster client, depending on how they've set their preferences, as napster doesn't wait for human approval if the right options are already set) okays a file transfer, they are giving you a copy, not napster.

In walks docster

Imagine all the researchers you know, with a new bibliographic management tool that combined file storage with a napster-like communications protocol -- docster. Instead of just citations, docster also stores the files themselves and retains a connection between the citation metadata and each corresponding file. Somewhere in the ether is a docster server to which those researchers connect. They're reading one of their articles, and they find a new reference they want to pull up. What to do? Just query docster for it. Docster will figure out who else among those connected has a copy of that article, and if it's found, requests and saves a copy for our friendly researcher.

Of course, we cannot do this. Libraries depend too much on copyright to attack the system so directly. But what if we focused instead on altering the napster model enough to make it explicitly copyright-compliant? After all, many cases of one researcher giving another a copy of an article are a fair use of that article. Fair use provides us with this possibility and it's not a giant leap to argue that perhaps coordinated copying through such a centralized server could constitute fair use, especially if docster didn't compete with commercial interests.

Well, it's still a big leap, but think of the benefits. Say there's an article from 1973 that's suddenly all the rage. It doesn't exist online yet, so a patron request comes to you from some other library, and you've got the journal, so you fill the request. But forty-eight other researchers want that article too. If that first patron uses docster, any of those other folks also using docster can just grab the file from the first requestor. If others don't use docster, they can request a copy from their local libraries, who -- I hope -- do use docster. Nobody has to go scan that article again, and suddenly there is redundant digital storage (see also LOCKSS).

Let the librarians librarize

Still, though, we're not doing enough to enforce copyright. Currently, a research library filling tens of thousands of document requests for its patrons per year makes copyright payments to the CCC when their fills demand. This system keeps publishers happy but keeps librarians chasing our collective tails. And even though systems like EFTS have automated even the copyright payment transfers, we're still continuing our massively parallel redundant (wasteful) copying and scanning efforts.

But because we're so good at making sure we make payments, we could leverage that structure within docster. We could federate the docster servers at the institutional (or consortial) level. For the several hundred or thousand researchers in a given field with a departmental library at a big institution, their docster requests go through their library. The requests, that is, not the files. It might look like this:

Researcher A at institution X needs an article

his docster client is configured to use the X University Library docster server

X University Library docster server searches for his article at friendly other University docster servers

Y University Library knows that local Researcher B has it

Y University Library claims the request

Y University Library tells Researcher B's docster client to send a copy to Researcher A

In this transaction (which might take only a few seconds for queries and download time) both libraries know about the article being requested, but X Library can keep A's identity private. Likewise Y Library can keep B's identity private. Thus the transaction might consist of identification at the institutional level, ensuring the privacy of both parties. But if a copyright payment needs to be made, X Library can pass that through to EFTS for clearance and then charge Researcher A's grant number (assuming, of course, that Researcher A knowingly signed up for the service). Y Library didn't have to pull anything from the stacks, and Researcher B might have been cooking dinner through the whole thing. Neither library ever stored or transmitted a copy directly; rather they only determined who had a copy (Researcher B), and had a copy sent to the requestor (Researcher A).
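(A toy Python sketch of this flow, with invented institution names and article identifiers, shows how a request crosses institutions while patron identities stay home:)

```python
class LibraryServer:
    """Toy sketch of the federated docster flow: requests cross the
    wire at the institutional level; patron identities never leave
    their home library."""
    def __init__(self, name):
        self.name = name
        self.local_holdings = {}   # article id -> local patron with a copy
        self.peers = []            # friendly peer library servers

    def register(self, patron, article_id):
        self.local_holdings[article_id] = patron

    def claim(self, article_id):
        # A peer learns only whether we can fill the request,
        # never which patron holds the copy
        return article_id in self.local_holdings

    def request(self, patron, article_id):
        # `patron` stays local; only institution names are exchanged
        for peer in self.peers:
            if peer.claim(article_id):
                return (self.name, peer.name, article_id)
        return None

x = LibraryServer("X University Library")
y = LibraryServer("Y University Library")
x.peers = [y]
y.register("Researcher B", "JAMA-1973-123")
print(x.request("Researcher A", "JAMA-1973-123"))
# → ('X University Library', 'Y University Library', 'JAMA-1973-123')
```

In a fuller version, Y's server would then tell Researcher B's client to send the copy, and X's server would pass any copyright charge through for clearance.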

And the paper publisher gets paid. Everybody's happy.

Concrete steps and benefits

The necessary infrastructure for making this work is mostly in place. There are variants of the napster protocol under development (see the cnet piece now if you didn't read it before ;), none of which would require significant modification for this purpose. Some sort of federation protocol would need to be established, but it wouldn't be any more complicated than the routing cell structure (target library priority lists) implemented for Docline. Docster client software would need to be integrated with bibliographic metadata, but that's an easy hack too.

To address security concerns, it might be necessary to carefully define the protocol so that it would not compromise any individual user's machine. Additionally some sort of basic certification authority might need to be used to verify the identities of source and target institutions. While these are not trivial tasks, there are well-known approaches to each.
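
To make the institutional identity check concrete, here is one way request verification might look, sketched with a shared-secret scheme from Python's standard library. This is purely illustrative -- a real deployment would more likely use certificates issued by a certification authority, and every name below is an assumption:

```python
# Hypothetical institution-level request verification using HMAC shared
# secrets. Illustrative only; a production system would more likely rely
# on a certification authority and public-key certificates.
import hmac
import hashlib

# Secrets the verifying server holds for known peer institutions.
SECRETS = {"X University Library": b"x-secret"}

def sign_request(institution, article_id, secret):
    """Produce a signature binding the institution to this request."""
    msg = f"{institution}|{article_id}".encode()
    return hmac.new(secret, msg, hashlib.sha256).hexdigest()

def verify_request(institution, article_id, signature):
    """Check that the request really came from a known institution."""
    secret = SECRETS.get(institution)
    if secret is None:
        return False        # unknown institution: reject
    expected = sign_request(institution, article_id, secret)
    return hmac.compare_digest(expected, signature)

sig = sign_request("X University Library", "doi:10.1000/xyz123", b"x-secret")
ok = verify_request("X University Library", "doi:10.1000/xyz123", sig)
```

The point of the sketch is simply that identities are verified at the institutional level, never at the patron level, matching the privacy model described above.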

Think of the time we would save, and the speed at which articles would move around. For any article that had ever been filled into the docster environment (and that still lives on a connected machine), there would be no more placing a request, verifying the cite, placing the order, claiming the order, pulling from stacks, copying, Arieling, Prosperoing, emailing, etc., not to mention all the logging and tracking we rekey when moving requests from one system to the next when we don't have an ILL automation system (and even if we do). Your happy researcher would just need to type in a search and -- hopefully -- download what she needs. If not, you or your neighborly peer library make the copy and send. Once. And nobody else has to again.

Indeed it is easy to imagine building some local search functions into docster clients to avoid even making a request whenever possible. A local fulltext holdings database might be queried first (through the help of something like jake), then a local OPAC for print holdings. If these steps fail, a request could be automatically broadcast, and any request that bounces due to bad data or lacking files could be corrected, rebroadcast, or sent through existing systems like OCLC, RLIN, or Docline, and then Ariel'd back and delivered through the local docster server. The next such request would hit the now docster-available file (if the first requestor keeps his machine online).
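
That local lookup chain could be sketched as follows. The data structures and function names are invented for illustration; a real client would query a fulltext holdings database (with the help of something like jake), then the local OPAC, before broadcasting anything:

```python
# Hypothetical fallback chain for a docster client, as described above.
# The lookups are stubbed with dicts; all names are illustrative.

def find_article(article_id, fulltext_db, opac, broadcast):
    # 1. Check local fulltext holdings first: no request needed at all.
    url = fulltext_db.get(article_id)
    if url:
        return ("fulltext", url)
    # 2. Next, check local print holdings via the OPAC.
    call_number = opac.get(article_id)
    if call_number:
        return ("print", call_number)
    # 3. Only if both fail, broadcast a request to peer docster servers.
    return ("request", broadcast(article_id))

fulltext = {"pmid:123": "http://example.edu/fulltext/123"}
opac = {"pmid:456": "QA76 .R39 1999"}

result = find_article("pmid:123", fulltext, opac, lambda a: None)
```

Each step short-circuits the next, so a broadcast only ever goes out for material the local collection truly lacks.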

Why wouldn't researchers create workarounds and bypass the libraries altogether? Well, technically, they already can use napster-like services for this, and obviously there's nothing to stop them from doing so. But libraries would play several vital roles in this equation: first, our patrons trust us with private information because we've safeguarded their privacy carefully and reliably for years; second, libraries can provide many value-added services such as integrating docster searches with additional functions as described above; third, the clientele we serve (at least where I work, a medical library) includes researchers working on deadlines and clinical staff saving patients' lives. Many of these patrons are fortunate enough to not have to care about costs; they wouldn't think twice about paying a reasonable charge for immediate, accurate delivery of crucial information.

Beyond the fact that building EFTS payments into this model would help our accounting procedures become even more automated, consider the copyright question one more time. As docster grows, more and more articles would be fed into the system. Some of those articles will be old. Some will even be old enough to qualify as public domain. As each year passes, a slew of articles will pass into such exalted status. For these public domain articles, nobody has to do any accounting at all.

Put this together with today's increasingly online publishing, and a window begins to close -- the window between today's e-prints (which will increasingly follow the open archives specifications and be accordingly easy to access) and yesterday's older print archives (which will increasingly be public domain). In between is a growing pool of documents available through docster, instantly accessible in complete compliance with (read: payment for) copyright.

It certainly wouldn't take very long to construct and conduct a limited trial. If we approach docster from day one as a good faith effort to comply with copyright while creating efficiencies, we might be challenged by publishers, but we'll at least have a good case going for us that we're not taking any money away. And best of all, we'd certainly have our patrons on our side. In all likelihood, publishers' revenues would increase significantly. Any librarian will tell you that the minute access gets easier, more people want access.

Think it through. I'll say it again: if you still don't believe this can happen, ask your nearest network administrator about napster traffic.

Acknowledgements

I am very grateful to KB, AB, MW, RKM, RMS, MG, AO, and KP for their insight and feedback on early drafts, and in particular to JS for allowing me to test this idea out in public and on an unsuspecting crowd. -dc

note: a significantly edited version of this article appeared in the August 1, 1999 issue of Library Journal. You might prefer their edited version to this version. You are free to reproduce the text of this version for any purpose and in any format, provided that you reproduce it in its entirety (including this notice) and refer to the url from which it is available: http://oss4lib.org/readings/oss4lib-getting-started.php

Introduction

The biggest news in the software industry in recent months is open source. Every week in the technology news we can read about IBM or Oracle or Netscape or Corel announcing plans to release flagship products as open source or a version of these products that runs on an open source operating system such as Linux. In its defense against the Department of Justice, Microsoft has pointed to Linux and its growing market share as evidence that Microsoft cannot exert unfair monopoly power over the software industry. Dozens of new open source products along with regular news of upgrades, bug fixes, and innovative new features for these products are announced every day at web sites followed by thousands.

The vibe these related events and activities send out is one of fundamental change in the software industry, change that alters the rules of how to make software--and how to make money selling software. What is all the noise about, and what does it mean for libraries?

Open Source: What it is and Why it Works

If you've ever used the internet, you've used open source software. Many of the servers and applications running on machines throughout the wired world rely on software created using the open source process. Examples of such software are Apache, the most widely used web server in the world, and sendmail, "the backbone of the Internet's email server hardware." [TOR] Open source means several things:

Open source software is typically created and maintained by developers crossing institutional and national boundaries, collaborating by using internet-based communications and development tools;

Products are typically a certain kind of "free", often through a license that specifies that applications and source code (the programming instructions written to create the applications) are free to use, modify, and redistribute as long as all uses, modifications, and redistributions are similarly licensed; [GPL]

Successful applications tend to be developed more quickly and with better responsiveness to the needs of users who can readily use and evaluate open source applications because they are free;

Quality, not profit, drives open source developers who take personal pride in seeing their working solutions adopted;

Intellectual property rights to open source software belong to everyone who helps build it or simply uses it, not just the vendor or institution who created or sold the software.

More succinctly, from the definition at www.opensource.org:

"Open source promotes software reliability and quality by supporting independent peer review and rapid evolution of source code. To be certified as open source, the license of a program must guarantee the right to read, redistribute, modify, and use it freely." [OSS]

Software peer review is much like the peer review process in research. Peer review bestows a degree of validity upon research, and publications with a high "trust factor" contribute the ideas in the works they publish to the knowledge base of the entire communities they serve.

It is the same for software. As described in the seminal open source work, "The Cathedral and the Bazaar" by Eric Raymond, maintainer of the popular mail retrieval program fetchmail, the debugging process can move faster when more individuals have both access to code and an environment in which constructive criticism is roundly welcomed. [ER] This leads to extremely rapid improvements in software and a growing sense of community ownership of an open source application. The feeling of community ownership strengthens over time because each new participant in the evolution of a particular application--as a programmer, tester, or user--adds their own sense of ownership to the growing community pool, because they are truly owners of the software. This community effect seems similar to the network effect seen across the internet, whereby each additional internet user adds value to all the other users (simply because each new user means there are more people with whom everyone else might communicate). For open source products which grow to be viable alternatives to closed-source vendor offerings, this growing community ownership begins to exert pressure on the vendors to join in. [NYT]

This tendency shares a striking similarity to the economic value of libraries. A library gives any individual member of the community it serves access to a far richer range of materials than what that individual might gather alone. At an extremely low marginal cost to each citizen expensive reference works, new hardcover texts, old journals, historical documents and even meeting rooms might be available through a local library. The library building, its collections, and its staff are infrastructure. This infrastructure serves as a kind of community monopoly in a local market for the provision of information. Instead of reaping monopoly profits for financial gain, however, a library returns the benefits of its monopoly to individual users. The costs of maintaining this monopoly are borne by the very community which holds the monopoly. To the extent which this model works in a given community, a library is a natural yet amenable monopolistic force. If this sounds mistaken, consider whether your community has libraries which compete or cooperate.

Library Software Today

No software is perfect. Office suites and image editors are pretty good; missile defense systems are, for all we know, appropriately effective; search engines could use improvement but usually get the job done. While there is constant innovation in library software, for many of us online catalog systems mean a clunky old text interface that often is less effective than browsing stacks. Often, this is due to the obstacles we face in managing legacy systems; new systems might be vastly improved, but we are slow to upgrade when we consider the costs of migrating data, staff retraining, systems support, and on and on. Sometimes, new versions of systems we currently use are just not good enough to warrant making a switch.

This is not surprising. The library community is largely made up of not-for-profit, publicly funded agencies which hardly command a major voice in today's high tech information industry. As such, there is not an enormous market niche for software vendors to fill our small demand for systems. Indeed the 1997 estimated library systems revenue was only $470 million, with the largest vendor earning $60 million. [BBP] Because even the most successful vendors are very small relative to the Microsofts of this world (and because libraries cannot compete against industry salary levels), there are relatively few software developers available to build library applications, and therefore a relatively small community pool of software talent.

What are we left with? Some good systems, some bad. Few systems truly serve the access needs of all of our users, failing to meet a goal--access for everyone--that most public libraries strive to achieve at more fundamental levels of service. Because libraries are community resources, we tend to be quite liberal about intellectual and physical access issues, including support of freedom of speech and ADA-related physical plant modifications. At the same time, librarians are very conservative about collections and data (remember the difficult issues when you last weeded?). Is it not odd, then, that market forces lead us to be extremely conservative about online systems software? After all, online systems are no less about access to information than having an auto-open front door or an elevator in a library building.

We read of exciting technological innovations in library-related systems. Innovations in advanced user interfaces, metadata-enabled retrieval environments, and other areas have the potential to make online access more and more seamless and easy to use. Our systems, though, are too old--or not standardized enough, or too familiar to change--to take advantage of these advances. And creative ideas from exciting research seem not to make headway in real systems.

Libraries, if they indeed hold the kind of community monopoly described above, might do well to enhance their services by leveraging community-owned information systems--which open source seems to promise.

Open Source and Libraries

How could open source improve library services? First, open source systems, when licensed in the typical "general public license" manner, cost nothing (or next to nothing) to use--whether they have one or one thousand users. Although the costs of implementing and supporting the systems on which software runs might not change, imagine removing the purchase price of a new search interface (or ILL tool, or circulation module, etc.) from your budget for next year. Rather than spending thousands on systems, such funds might be reallocated for training, hiring, or support needs, areas where libraries tend toward chronic shortfalls.

Second, open source product support is not locked in to a single vendor. The community of developers for a particular open source product tends to be a powerful support structure for Linux and other products because of the pride in ownership described above. Also, anyone can go into business to provide support for software for which the very source code is freely available. Thus even if a library buys an open source system from one vendor, it might choose down the road to buy technical support from another company--or to arrange for technical support from a third-party at the time of purchase. On top of this flexibility, any library with technical staff capable of understanding source code might find that its own staff might provide better internal support because the staff could have a better understanding of how the systems work.

Third, the entire library community might share the responsibility of solving information systems accessibility issues. Few systems vendors make a profit by focusing their products on serving the needs of users who cannot operate in the windows/icons/menus/pointer world. If developers building systems for the vision impaired and other user groups requiring alternative access environments were to cooperate on creating a shared base of user interfaces, these shared solutions might be freely built into systems around the world far more rapidly and successfully than ever before.

A Three-Step Process

If you are still reading, you probably suspect something here might be a good idea. You might even want to help make ideas discussed above happen. Where to begin?

Understand the Phenomenon

Axiomatic business notions have shown weaknesses throughout the information age; the utility of the internet for knowledge sharing demanded rethinking of what constitutes an information product. If nothing else, it is important for the international community of librarians to understand the open source phenomenon as part of the technology-driven shift in our understanding of the nature of information. Because the ethos and style of the open source initiative is so akin to the traditions of librarianship we hold at the core of our professionalism, we should find within open source the appropriate points of entry for the similar service and resource-sharing objectives we choose to achieve every day.

The seminal works on open source are mostly technical, but they provide an invigorating view of the current state of software engineering. All are available on the internet, and they form a core of knowledge that might one day be fundamental to our discipline. "The Cathedral and the Bazaar," by Eric Raymond [ER], is widely cited as the pivotal tome describing the technical and social processes open source entails. "The Open-Source Revolution," by Tim O'Reilly [TOR], founder of O'Reilly and Associates, Inc., a highly respected publisher of pragmatic computer-related titles, gives a broader view of the social phenomenon, in particular relating open source software development to the scientific method. Finally, www.opensource.org is a central point of focus for the Open Source Initiative. It is led in part by Mr. Raymond and appeals to both the technical and non-technical sides of the community.

To foster communication regarding open source systems in libraries, we have created a web site, www.med.yale.edu/library/oss4lib, and a listserv, oss4lib@biomed.med.yale.edu. They are intended as forums for announcement, discussion, and sharing of broad information; look for instructions on how to join the list, along with a list of current open source projects for libraries, at the oss4lib site.

Use Open Source Systems Where You Are

Armed with understanding, we can find opportunities to leverage existing open source systems in our own institutions. The Linux operating system [LINUX], Apache web server [APACHE], and MySQL database [MYSQL] form a powerful, free platform for building online systems. Consider the value of these and other open source systems when making design and purchase decisions at your institution; you might find tremendous savings and increased product performance at the same time.

Beyond merely using open source products, however, we must create them. Are you already working on any new applications at your institution? Perhaps you've put a year or two into a homegrown search interface, or an online reference services tool, or a data model and retrieval code for an image archive. Is there a good reason why you wouldn't want to share that work? For those of you who realize that someone else might benefit from what you've done--and that you might benefit from the ability to share in the work of others--consider thoroughly the implications of releasing your code under an open source license. [FH] If the benefits outweigh the negatives, get started sanitizing and documenting your code as well as you can, and set it free.

Another ideal opportunity at this stage is for library and information science researchers to open their projects up for the entire community to review and develop as appropriate. Grant-funded systems builders might find an afterlife for their work by releasing their source. Faculty might design courses around building a retrieval system or improving an existing open source tool. Indeed this model is already widely used by computer science professors--at Yale, for instance, undergraduate students might work on aspects of the Linux kernel in their Operating Systems course.

Grow the Phenomenon

As the library community moves in this direction, there will be many roles for individuals in our profession to fill. Most visible is application development; there is a major need for software engineering resources to be devoted to creating community-owned library systems. This does not in any way marginalize those of us who are not programmers or database administrators. In the open source community there exists a tremendous need for exactly the skills librarians have always used in making information resources truly useful. In particular, systems testing, evaluation, and feedback to open source designers is welcome and even sought after; documentation for open source systems is always in need of improvement; instructional materials for open source products are often lacking. These are all areas in which librarians excel. For the more technically minded among us, www.freshmeat.net provides constant updates and announcements of general open source projects, replete with contact information for those wishing to participate. For all of us, the oss4lib listserv and website will highlight additional library-specific opportunities as they come around.

Playing a role in the larger open source community will strengthen our ability as professionals and service providers to understand how best to shape our own systems. Additionally, it might make significant inroads in demonstrating how the ethics and practice of librarianship are more vital to the movement of information than ever before. As the software industry shifts to appropriately incorporate open source models, systems in other industries might even grow to utilize products the library community creates.

Conclusion

An argument I have already heard against these ideas is based on experience: "We tried building our own OPAC in the eighties--it was an impossible project and we gave it up after a few years because it just cost too much." In 1999, however, we know that the internet has changed the landscape. Because it is so very easy to share ideas and software and code using the internet, software developers have already found that the old way of doing things--particularly building monolithic homegrown systems in our own institutions--makes no sense anymore. As the open source vision and culture continue to mature, librarians would be remiss not to find our profession playing a major role in that culture. For all we have done so far, our online systems are not good enough yet. We can do better.