June 26, 2008

Repositories: What they are, and what we use them for

(Note: This is the second of an ongoing series of posts on repositories. The first post is here.)

The JISC Repositories Support Project defines a digital repository as “a mechanism for managing and storing digital content.” I find this a useful definition, both for what it says and what it doesn’t say. It notes that repositories, as such, focus on content and its management. It doesn’t say anything about the kind of digital content managed by the repository, or about the use this content is put to.

A repository’s focus is related to, but distinct from, the focus of a library or an application. Repositories focus on particular information content. Applications (like Zotero, FeedReader, or Google Docs) focus on particular information tasks, like tracking citations, getting news, or authoring documents. Libraries focus on the information needs of particular communities (which might be towns, schools, peer researchers, or Internet users with particular interests). Applications and libraries may use repositories to support their tasks or communities, and some may be primarily built around one specific repository (as most libraries in the pre-computer age were built around what was in their physical stacks). But they are not identical to their repositories, and it’s often useful to distinguish the functions of a library and the functions of the repositories that it uses.

At the same time, though, you can’t plan the development of a library without thinking about its repositories. Repositories really are essential infrastructure for libraries, but not simply as a place to “capture and preserve the intellectual output of university communities” (as a 2002 SPARC white paper put it), or, more pessimistically, as “a place where you dump stuff and then nothing happens to it” (as a 2005 JISC workshop annex put it). The Penn Libraries today rely on hundreds of digital repositories, mostly run by various publishers. We also manage a few important ones ourselves. Here are a few that we manage, or are considering managing:

A repository providing open access to the scholarly output of our researchers (what is often thought of as the traditional “institutional repository”). For this repository, we manage the content, and contract with an outside company to manage the servers and develop the software. Many faculty cooperate in populating this repository, and some deposit their own work themselves, but librarians do much of the work.

The repository used to store content in our main courseware management system. The server is managed by us, using proprietary software, and is populated by instructors from all over the university. It is largely torn down and built anew every semester (sometimes carrying over material from previous semesters’ incarnations). While this isn’t a permanent repository, it has very strong and definite persistence requirements that we have to take pains to support. And if some of our users just think of this as a place to do their teaching, and the “repository” aspects just come along for the ride, that’s a feature, not a bug.

Repositories for various digital image collections and digitized special collections. Historically these collections have been a mishmash of systems developed ad hoc, involving filesystems, metadata in a database, custom-built websites, backup procedures, and sometimes little else. We’re currently developing a digital library architecture in-house that will unify discovery and usage of many of these collections, and we hope to unify repository management for many of them as well. Traditionally, the content is selected by bibliographers, and the repositories and collection sites are created by techies; we hope that the new architecture will let the bibliographers do more repository management and site design, and let the techies do less site-by-site management and more unified service management.

We have also tested repositories for managing numeric data, which are increasingly important shared research resources in many fields. We do not currently have a repository in production for this, but the repositories developed by projects like this one have important features for data-centric research that are not supported to the same extent by “traditional” repository systems.

As you can see from these examples, libraries like ours have all kinds of different uses for repositories, and various ways we can develop and manage them. We’re not starting repositories because they’re what all the cool Research I libraries are doing this year. We’re managing them because they help us provide what we see as important services to our communities. We recognize that different repositories have different uses, and that it often makes more sense to integrate multiple repositories into a single library than to build One Repository to Rule Them All. Once we have a clear understanding of why we would benefit from a particular repository, and what it would manage, we can consider various options for who would run it, where, and how. (And of course, what its costs would be, and how we can realistically expect those costs to be covered. But that’s a topic for another post.)