
SLA_IR_Paper_Final_Corrected.doc/Tompson p. 1
Institutional Repositories: Beware the “Field of Dreams”1 Fallacy!
A Special Libraries Association Science & Technology Division
2006 Contributed Paper
By Sara R. Tompson, Deborah A. Holmes-Wong and Janis F. Brown2
Introduction
Institutional repositories (IRs) are all the rage right now, particularly with academic libraries. But are they living up to their potential? Not exactly. While the potential exists for IRs to “have an impact on the serials crisis” – one of the questions posed by the call for this contributed papers session – and on other areas of the publication and research cycle, there has generally been more impact in the literature than in the scholarly world thus far. This paper will examine in brief the rise in and use of institutional repositories with an emphasis on science and technology/engineering (scitech) arenas to set the context for a discussion of the work of the University of Southern California’s Institutional Repository Needs Assessment Task Force (IRNA), and close with a look toward the future.
Institutional Repositories
Definitions
Institutional repository (IR) is a phrase that implies both ownership and preservation. IRs in the context of this session, and generally in librarianship today, typically store documents in digital format. The Scholarly Publishing & Academic Resources Coalition (SPARC) has defined IRs as:
“Digital collections capturing and preserving the intellectual output of a single university or a multiple institution community of colleges and universities.” 3
IRs are typically thought of as serving two broad purposes regarding the intellectual output of scholarly institutions:
1. They can bring (back) some local control of scholarly work. This control is ceded to a greater or lesser degree to publishers when subscriptions move to online-only — authoritative versions of articles are not typically retained in accessible format locally, as they are with print journals.
2. IRs, by grouping much of a university’s intellectual output in one place, facilitate interdisciplinary research connections, statistic-keeping, marketing and other institutional efforts.
An IR populated with an academic institution’s faculty and student research has become a key service that member institutions of the Association of Research Libraries (ARL) want to provide to their constituencies. Clifford Lynch’s 2003 article on IRs for ARL both documented and furthered the trend.4 In that paper Lynch addressed the two IR purposes noted above in this way, with an emphasis on technology for the second purpose:
“The development of institutional repositories emerged as a new strategy that allows universities to apply serious, systematic leverage to accelerate changes taking place in scholarship and scholarly communication, both moving beyond their historic relatively passive role of supporting established publishers in modernizing scholarly publishing through the licensing of digital content, and also scaling up beyond ad-hoc alliances, partnerships, and support arrangements with a few select faculty pioneers exploring more transformative new uses of the digital medium.” 5
An ideal IR, according to Lynch, and we agree, is also a dynamic collection, not simply a storage place. Leveraging possibilities for IRs will be discussed briefly later.
The Scholarly Publishing and Academic Resources Coalition (SPARC), in a white paper that makes the case for IRs, describes the purposes as follows, noting that institutional repositories:
• “Provide a critical component in reforming the system of scholarly communication – a component that expands access to research, reasserts control over scholarship by the academy, increases competition and reduces the monopoly power of journals, and brings economic relief and heightened relevance to the institutions and libraries that support them.”
• “Have the potential to serve as tangible indicators of a university's quality and to demonstrate the scientific, societal, and economic relevance of its research activities, thus increasing the institution's visibility, status, and public value.” 6
One important point that could distinguish an IR from a simple preprint or postprint archive is that accompanying or related material – raw data, videos, class lectures, etc. – could also be stored therein. Some scitech journals are doing this now on their websites, particularly in medical and cell biology fields where large genomic data arrays are being analyzed, but not all journals have the storage capacity for this. Also, some journals retain supplementary material only for a limited time7. A key component of the scholarly research process — building upon what has been done before — can be hampered if the data is not available. As a plant biologist at Stanford recently noted:
“Journals are now facing the fact that results of microarray and proteomics experiments do not fit into publishable article pages, similar to the situation regarding publishing of sequence in articles some 15 years ago. Often, these results are archived on individual journals’ Web sites and are not well connected to community resources.”8
However, a broadly defined IR that includes all sorts of supporting material raises some legitimate concerns in faculty researchers’ and university administrators’ minds regarding access — not everyone would want this information publicly accessible, and many definitions of IRs include an open access principle and structure. Already it is clear that if a library or institution builds a repository, not everyone will come, which can only partly be controlled by the type of IR built. Plus, as Rhee, the biologist quoted above, notes, IRs are not (yet?) clearly integrated into the scholarly reward structure:
“… researchers are not accustomed to contributing their data and expertise to community databases. Here lies a conflict: Although there is a well-established reward structure for publishing in scientific journals and public repositories (largely through enforcement either as a condition for publication or for receiving grants), a similar reward system does not yet exist for contributions to community databases.”9
These and other inherent conflicts regarding IR purposes and uses were brought to light in the USC task force interviews of faculty, as will be discussed further below.
Developments in information technology have been the strongest driver in the rise of institutional repositories in two distinct ways: the rise of online journals, and the drop in cost of electronic storage.
Rise of the Web and Online Journals
The rise of stable, online journals has revolutionized collection development, and allowed libraries to provide “just in time” 24x7 online access to journals rather than storing thousands of linear feet of print journals “just in case” someone needs to use a volume. This has been an extremely important paradigm shift that is still underway — many libraries’ journal collections are still hybrids of print and online.
The benefits of online journals far outweigh the costs. But one cost that can be significant is that, with electronic journals, collection ownership is often no longer local.
Scitech libraries were some of the first to move to electronic journals, as our users’ disciplines were some of the first to use the Web. Indeed, high energy physics is credited with the invention of the Web. Tim Berners-Lee, at the particle physics laboratory CERN outside Geneva, Switzerland, wrote the first Web browser in the early 1990s.10 Physicists in particular, but also other scientists and engineers, were already reading and sharing articles via preprint servers (Paul Ginsparg set up the arXiv server at Los Alamos in 1991).11 The Web proved a much easier tool for this interchange than the more linear file transfer protocol (FTP), as it facilitated networking. The “E-Doc” Web platform was developed in the mid-1990s for scientific publishers, especially the American Physical Society, Nature, and Springer Verlag, to be able to utilize the Web as a publishing vehicle12. Electronic journals proliferated. By some counts, over 75% of scholarly journals are now online13.
Some scitech fields, notably physics, have allowed publication and citation in preprint archives to “count” as a measure of scholarly output. Many of these archives are de facto (bad science will get culled out) or in actuality (including editorial board review) peer-reviewed. In the high energy physics field, Stanford Linear Accelerator’s SPIRES bibliographic database, an early, freely accessible Web-based library catalog, has always indexed preprints and has always included a citation analysis tool.14
In the early days of the Web, many scientists expected a large percentage of scholarly journals to be openly accessible on the Web15, and/or to adopt the open approach of physicists. This has not come to pass. Journals from scientific member organizations like the American Physical Society and the Institute of Electrical and Electronics Engineers (IEEE) quickly went online, and while they were, and remain, fairly reasonably priced, they are not freely accessible. Commercial scientific publishers like Springer and Elsevier16 also mounted journals on the Web fairly early on. Many of these journal packages, particularly the commercial ones, are quite expensive, and thus only practically available to larger institutions.
Initially most online journals were provided as an adjunct to the print, with the print often being considered the authoritative version. Publishers handled subscriptions to this new format in various ways, sometimes charging extra for the online version, sometimes giving it away for free with the print subscription. Both models persist into this century.
As the online versions of scholarly journals have become more stable, more and more of them have become recognized as authoritative versions of record. For instance, the American Journal of Respiratory and Critical Care Medicine, which went online in 1997, designated the electronic version as the version of record in 2002.17 However, with the rise in prominence of the online version, some publishers have changed their subscription models, charging more for the online version than the print18 (which seems particularly absurd, as it is less expensive to mount online versions than to print and mail out hard copies) and/or “penalizing” subscribers that drop the print versions. These newer, more punitive subscription models have been one of the drivers for both librarians and some faculty, the creators of the content of scholarly journals, to look at open access models and institutional repositories.
While journals were going online, more and more new journals continued to appear. The increasing quantity and cost of journals in the past two decades has often — including in the call for papers for this session — been termed “the serials crisis.” One aspect of the crisis has been physical storage space. As online journals became more stable, more and more libraries with large journal collections, principally academic and especially scitech libraries, have cancelled print subscriptions in order to free up over-shelved ranges19. Print cancellation decisions are never made lightly, but are often necessary also to regain space for other services requested by users (e.g. information commons) and to free up funds to reallocate to the purchase of more online journals.
However, such decisions to go electronic-only have given up the local, preservation control of the content of the journals, content created in part by the libraries’ users. This need not be a cause for alarm, as some publishers have demonstrated themselves to be trustworthy, and likely to remain in existence for a long time to come. However, some publishers, and some titles, seem less stable, changing platforms, formats, and ownership. Plus, not all publishers provide archival access — access to online journals for the period of the subscription, even if one no longer subscribes. And not all subscribers can purchase the permanent access packages which often have higher price structures. At USC, for example, we have not yet been able to acquire the premium access to the Blackwell Synergy journals. The notion of local institutional archiving of either pre- or post-print scholarly articles begins to look more attractive and sensible given these developments.
Electronic Storage
During the last decade or so, while online journals have been on the rise, the cost of electronic storage has gone down. In 2000, the U.S. Congressional Budget Office analyzed this phenomenon as part of the paper “The Role of Computer Technology in the Growth of Productivity,”20 from which the following chart is extracted.
Figure 1
Note that the figure uses a logarithmic scale in the price axis. As the authors say: “The prices at the beginning of the period are so much higher than those at the end—by close to five orders of magnitude — that presenting the same information using an arithmetic scale would obscure most of the price changes after the mid-1980s.”21
To look at this data in another way, the author reviewing the development of computer storage for the IEEE magazine Computer in 2002 put the dramatic decrease in personal computing terms:
“If you were a personal computing enthusiast in 1984, you could buy an IBM PC for $5,000 to $6,000…For this you would get a 0.004-GHz 8-bit processor with 0.064 Mbytes of RAM, a 12-inch monochrome text-only display, a 0.16-Mbyte floppy drive, and no possibility of a hard drive. Compare that with a PC purchase today: a 2-GHz 32-bit processor with 2,000 Mbytes of RAM, a 128-Mbyte video graphics card, a 700-Mbyte CD-RW (and maybe a DVD-R), a 17-inch 1600 × 1200 display with 32 million colors—all for maybe $1,500. The old machine cost six times as much for 1/2000th of the processing power. The purchasing power gain over 18 years is 1,200,000 percent.”22
Interestingly, now that storage is so inexpensive, organizations are buying so much of it that they cannot effectively manage it. “Storage virtualization” is a new buzzword for newer systems (upon which IRs can be constructed) that manage these large physical arrays. The older computer storage systems, used for some IR architectures, depend heavily upon the physical structure (internal addresses) to locate things, and this is slow across many terabytes of data.
IR Platforms
The continuing, dramatic drop in costs for storing of electronic data is one of the factors that drove the development of software for institutional repositories. The Massachusetts Institute of Technology (MIT) and Hewlett-Packard Laboratories jointly developed DSpace, beginning in early 2000.23 DSpace was developed principally as an IR platform — “the plan was to create an infrastructure for storing the digitally born, intellectual output of the MIT community and to make it accessible over the long term”24 — and a number of universities have implemented and continue to implement it for that purpose. DSpace is freely available as an open source software product. Electronic data in a variety of formats and types — text, video, .RTF, .PDF, etc. — can be stored in a DSpace repository. It is extremely customizable. There is now a DSpace Federation, and members are active in sharing customization solutions and other information as they build and maintain IRs.
Documentum25 is a commercial product varyingly billed as “digital assets management” or an “enterprise content management” solution. In fact, it was purchased by the content management company EMC in 2004. Documentum is somewhat less customizable than DSpace, but it includes a number of automated processes for both ingest and extraction of data. Concordia University in Montreal is one of Documentum’s profiled users; they are using the product to manage faculty dossiers26, a function sometimes included in broadly defined IRs. USC has been using Documentum since 200427 to manage our Digital Archive28.
Other platforms on which IRs can be developed include FEDORA29, ePrints30 (used by the arXiv preprint server, now at Cornell University), Proquest’s Digital Commons31 and others. Indeed there has been more of a rush to develop IR architectures than IRs themselves! In some ways the software is the easy part; getting and sustaining contributions is the hard part.
Many current IR architectures do some things very slowly when you get more than a few thousand items in the repository, due to the storage and search structure limitations touched on above. The fundamental structure of some IRs may need to be changed as they grow. DSpace users are working on scalability solutions, according to their Wiki.32 The Storage Resource Broker (SRB) from the San Diego Supercomputer Center uses a different approach which borrows its federated structure from grid computing.33 Grid computing is, briefly, distributed computing, but the networks of computers and processors are more virtual and larger than in older parallel processing models. With a grid approach, SRB works well for searching large collections across multiple organizations and heterogeneous storage systems, as illustrated by the fact that it is used by the BaBar international high energy physics collaboration34. This approach may prove a better platform for IR development.
IR Usage
For all the talk about, and establishment of, IRs and software architectures for such repositories, their usage has not been as high as many predicted. As noted in a 2004 report in the Chronicle of Higher Education, many universities, including DSpace developer MIT, are finding the populating of their IRs slow going, and now, being committed to the repositories, have had to increase their marketing efforts and/or library staff involvement in the IR ingest processes.35
The Registry of Open Access Repositories (ROAR) developed at the University of Southampton (UK) tracks the growth of institutional archives and makes this data available in their Wiki36. As of December 2005, there were about 250 functional institutional repositories containing about 800,000 items, which averages out to only about 3,200 per archive, even though growth has taken off since 2002. Figure 2 is a graph of archival growth, created using the ROAR eTrac tool.
Figure 2, generated March 25, 2006
A recent case study article by librarians at the Rochester Institute of Technology notes:
“The authors did not anticipate the amount of work involved in marketing the IR and persuading faculty to use it and to deposit materials in it.”37
A growing number of institutions are discovering the serious investments in time, money, and change management that an IR often brings. Fortunately much of this information is being published or publicly shared, so those considering repository construction can find useful precautions and guidelines.
USC’s IR Needs Assessment
USC did not join the early adopters of IRs. This caution has allowed us to examine the state of IRs, as well as of developments in scholarly communication, before simply building an IR. The Dean of USC’s Information Services Division convened an Institutional Repository Needs Assessment (IRNA) Task Force in the Spring of 2005. Coauthor Deborah Holmes-Wong from the library’s Information Development and Management Department chaired the group, which also included representatives from the library’s Interdisciplinary Teams (arts & humanities, social sciences and science & engineering, the last represented by coauthor Tompson), a representative from the Norris Medical Library on the Health Sciences campus (coauthor Brown), the University Archivist, and the Director of New Development from the computing side of the Division.
The IRNA Task Force was inspired in part by Nancy Foster and Susan Gibbons’ D-Lib article on understanding faculty as a prerequisite to developing sustainable institutional repositories. As they note, “Without content [created by faculty], an IR is just a set of ‘empty shelves’.”38
Our IR needs assessment task force took a three-pronged approach to this task of understanding potential IR contributors and users. We:
• Looked at institutional repository needs assessment literature to see what others have done.
• Interviewed fourteen USC research faculty from the arts, humanities, social sciences, medicine, sciences and engineering on the services that they need to enable their research and publications, and
• Conducted four focus groups with research faculty to ascertain the services that were most important to offer in an institutional repository.
The interviews and focus groups provided us an overall view of faculty needs. However, the sample size for the interviews and focus groups was too small to determine how widespread an attitude about IRs expressed by a particular researcher may be on campus. The purpose of the interviews was to identify possible repository services, and the purpose of the focus groups was to determine if there was strong support for these services. The Task Force’s final report was compiled in November 2005 and presented to the library faculty. Holmes-Wong and others are continuing to examine possibilities for next steps regarding an institutional repository at USC.
Interviews
In order to conduct the interviews, the Task Force first identified 90 potential faculty participants, culled from: USC Web sites, USC high performance computing center participants, USC faculty in ISI Web of Knowledge’s highly cited authors39 group, and task force participants’ recommendations.
IRNA task force members selected faculty from the list and scheduled interviews. We decided to schedule ten to fifteen interviews with faculty members from across the university covering the major disciplines. We interviewed twelve faculty members originally. However, given our initial criteria we found that we had mostly older established, tenured faculty. Thus we identified additional faculty members to try to address the gender, age, and ethnicity issues the initial criteria created.
The interviews were conducted by two to three people, utilizing a format developed and practiced at USC40. The coauthors of this paper were the science, engineering and medical faculty interview team. Interviews were scheduled for an hour with each faculty member, but interviewers planned on two hours, so they could stay if the faculty member wanted to talk with them at length.
The interview consisted of several open-ended questions:
• What are your research interests?
• How do you disseminate your research findings?
• How do you incorporate your research in the curriculum that you teach?
• What is the role of graduate students and post docs in your research process?
These conversations were always substantive, sometimes intense, and sometimes illuminating. We learned a great deal more about the mechanics of the faculty members’ research processes, such as formatting and file transfer difficulties they and their students encountered in saving, posting and sharing preprints, articles and postprints. We also heard a great deal of praise and criticism for various library services on both campuses. Establishing a comfort level in which faculty members could ask for help in using library resources led to more tangible outcomes than did their thoughts on IRs, which were decidedly mixed! Following each of the scitech/medical faculty interviews, one or more of us contacted an interviewee or his/her students to provide instructions on accessing a variety of electronic resources they had not previously realized were available.
The interviewers met after each interview to discuss and summarize the responses, paying attention to the unexpected issues that surfaced. Interview summaries were formatted into a document structure that worked well on the Wiki we developed to track the project. The interviewers then sent the documents to the faculty members, asking for each one to “sign-off” on the summary to ensure that it represented their responses accurately.
Focus Groups
From the interviews, we compiled ten use cases for the focus groups to discuss and prioritize. These use cases captured all the likely scenarios discussed with faculty members during the open-ended interviews. Focus group participants were given definitions of institutional repositories, session goals, and use cases. They were asked to sort the use cases into two piles: services that they thought were important and would like to see, and services that they felt were unimportant.
The faculty members were asked to attend a specific session. Thirteen total attended the focus sessions, which were segmented as follows:
• Library faculty on the University Park Campus
• Health science faculty on the Health Science Campus
• Senior faculty on the University Park Campus
• New faculty on the University Park Campus
We believed it was important to query our colleagues, as well as both senior and junior faculty and faculty on each campus (which are ten miles apart).
Results
In the interviews and focus groups, we did not find strong faculty support for an institutional repository for faculty self-archiving of already published articles, the model articulated by most of our peer institutions.
The more broadly applicable results of the focus groups’ discussions of the use cases follow; we also received some very specific comments, such as one school’s need for a system to handle student dossiers. These results capture the comments of the interviewees as well.
• Top priority: Use Case #9 -- Secure persistent storage
The focus groups validated what we had learned through interviews: that although there is little support for archiving already-published materials, researchers need secure long-term storage space for their research data and a way to provide stable, persistent links to their files. Faculty want to control who has access — and mostly don't want anyone else to have access; they want to know that the integrity of the data is maintained. Some faculty members need a repository where they can archive their work including endnotes, image files, data sets, software and multimedia software while writing research papers. Creating and managing a repository of unpublished research data is a very different undertaking than creating and maintaining an open access archive.
• Use Case #1 Automated generation of curriculum vitae & Use Case #2 Faculty research locator
The Health Sciences faculty who work more collaboratively and whose journals are indexed in ISI’s Web of Knowledge favored a system that would generate curriculum vitae for them from that database and other resources and allow them to use that system to find other USC faculty with intersecting research interests. The focus groups containing faculty from the humanities and social sciences whose disciplines are less well-served by Web of Knowledge, and who also tend to work more independently than science and engineering researchers, did not express a need for this service. They believed that this was a service that was more attractive to administrators than faculty.
• Use Case #6 Collaborating on an article in the repository & Use Case #8 Document/data set versioning
While software that supports collaboration and versioning on the Web was seen as important in all focus groups, the focus groups were divided into those who currently had access to software that supported this and felt this was not a feature for the IR, and those who did not have access and felt it was very important for the institutional repository to have this feature.
• Use Case #10 Check for permission to post preprints and post prints
In faculty interviews, several faculty members identified the difficulty in checking for usage permissions as an obstacle keeping them from publishing their research on the Web. However, a service to check for permissions was not rated as important as the other features listed above.
• Use Case # 5 “Automated” ingest of citations and OpenURLs
The ongoing “automated” ingest of citations and OpenURL41 links was not brought up by faculty members in any of the focus groups42. Such a feature would be needed in order to have the automated curriculum vitae service that many ranked fairly high.
• Use Case # 3 User wants to add bibliography of work to system
All focus groups identified the capability of adding lists of works to the system as something that was not needed in the initial implementation of an institutional repository. The faculty that mentioned this use case felt that they would be more likely to contribute single items to an IR.
Two groups of faculty discussed the economics of the institutional repository, noting that it seemed such a system would be very expensive to implement and maintain for the long term. There was a concern that campus administration understand these costs before making a commitment to the faculty, because it could be disastrous for a university to set up a repository that faculty members came to depend upon for long term storage and access, only to have the IR taken down when the costs involved were fully realized.
All of our focus groups voiced the concern that if an institutional repository is built it should support access to data sets, multimedia, audio, and video. There is a growing body of digital “publication” being produced by both faculty and students in these formats. Some of the faculty interviews pointed to a fairly new trend in scholarly publishing regarding other formats, with some publishers offering researchers the opportunity to produce peer-reviewed multimedia research pieces that are more than digital copies of the traditional journal article. In the case of Cambridge journals such a piece is handled as an addition to a published work43; in the case of Optics Express44, it is an entirely separate entity.
The Future of IRs?
USC
USC faculty members do not appear very interested in IRs as repositories of preprints, postprints, or as an open access alternative to print journal publication. Faculty already feel pressed for time, as was mentioned by every interviewee. Margret Branschofsky of MIT, as quoted by the Chronicle, has said: “Professors have a million things to do, and they don't have a lot of resources.”45 USC faculty with whom the IRNA task force met are not interested in any other activity that will require effort by them — and for which they don't have an immediate benefit.
A place to store research data (before publication) was of high interest to USC faculty, as noted above. This is not the same as an institutional repository, and it is less clear to us that the library should provide such storage, but we could provide a means of accessing data within the storage system. Current systems specifically for IRs won't easily meet this need, since they were developed with the idea of making the information openly available.
The USC faculty interviewed and those in focus groups are comfortable with publishers being responsible for the long term preservation of their articles, although some expressed the belief that this will change over time. Interestingly, this is something of a chicken-egg phenomenon, as faculty are already satisfied with their current access to journal information because the libraries are spending a lot of money to license those online resources! Not all faculty members are aware of, and/or concerned with, the lack of local control and potential lack of archival access with online journals.
There is a need to do more about educating them on the problems of the current research publishing conundrum, including how much libraries are paying for the licenses for online journals. Librarians also need to do a better job of making clear to faculty members the simple fact that the only reason they are getting a full-text journal is because the library licensed it. The task force members learned that not all faculty members are aware, or remain aware, of the fact that not everyone can access electronic journals.
The IRNA task force members agree that we should start by responding to the faculty’s perceived scholarly output needs and provide services for which they see an immediate benefit. Some of these services could be accomplished within an IR architecture. Perhaps once we have them using the IR to meet their needs, they would better understand some of the needs driving librarians, including regaining local collection control over their intellectual output. But we must be careful to avoid what we see as the trap into which some other institutions have fallen: quickly building an elegant repository but then spending large amounts of time recruiting content for it. Ex post facto marketing is clearly an ongoing concern for institutions with repositories. As Allard et al. recently noted in a review of the literature on librarians and IRs:
“Encouraging the involvement of the authors of intellectual property was mentioned in 90 percent of the articles, and nearly three-quarters of the articles referred to ideas about actively marketing the IR to authors.”46
Beyond
Storage of related and/or raw research material is a current concern with some USC faculty, and has also been a driver for the development of some IRs as discussed earlier in this paper. Lately more resources are being developed to meet this need. For example, Stanford’s HighWire Press now archives a great deal of supplemental data for many scientific journals, and makes it freely accessible47. Amazon.com is now renting storage space for 15 cents per month per gigabyte via their Amazon S3-Simple Storage Service48.
In terms of local storage, initiatives like LOCKSS (Stanford’s Lots of Copies Keep Stuff Safe49) allow an institution to capture and store data from subscribed online publications locally. However, stored publications cannot be accessed on demand as with an open IR, but only after a “trigger” event. In some cases such a trigger is defined broadly enough to include the cancellation of a subscription for which one still requires back issues not contractually available from the publisher50.
Other new products continue to appear that may allow institutions to obviate their own local implementations and customizations by outsourcing the whole IR. For example:
• Open access journal publisher/aggregator BioMed Central’s Open Repository, built using DSpace software.51
• The University of California system’s California Digital Library consortium has mounted an eScholarship Repository for post-prints52. This resource does not presently archive accompanying material or raw data for the publications, however.
• More and more commercial products to rival Documentum and Digital Commons are coming online, as well as products to manage niche formats like video. Econtent Magazine aims to keep track of such developments53.
At the same time, supporters of open access to information, including a wide array of scientific researchers, are advocating for open IRs in a variety of venues, both formal and informal. These latest efforts to make research accessible are analogous to those that drove the early development of the World Wide Web. See, for instance, librarian Heather Morrison’s recent open letter to the President and members of the American Chemical Society on her open access-themed blog The Imaginary Journal of Poetic Economics, which reads in part:
“Change can be difficult for all of us, and perhaps more so for the privileged, profitable society publisher. However, as the Budapest Open Access Initiative [http://www.soros.org/openaccess/] stated so well, open access makes possible an unprecedented public good:
Removing access barriers to this literature will accelerate research, enrich education, share the learning of the rich with the poor and the poor with the rich, make this literature as useful as it can be, and lay the foundation for uniting humanity in a common intellectual conversation and quest for knowledge…
I invite you personally, along with every member of the American Chemical Society… to engage in the process of transformation, to openly and immediately share not just peer-reviewed postprints, but also preprints, data, conference presentations, and research in progress.”54
Many of these activists are also developing IR-type functions on a variety of platforms, including blogs. Jean-Claude Bradley (Drexel University) has established the Useful Chem Experiments blog to freely share ongoing successes and failures in his lab55. Bradley explicitly advocates “open source science,” working to achieve three objectives via the blog platform: access, transparency and replication56. Such efforts may yet have a large impact on venues for peer-reviewed scholarly output.
The next step facing USC, and many other institutions, is to decide if we will build an institutional repository and, if so, what features it will include. If the funding is made available for an IR project at USC, faculty early adopters will need to be identified and enlisted and the appropriate software will have to be selected as the basis for the system. A key part of the process will be designing the interfaces so that they are useful to the faculty. Policies will also affect the usability of the system and could determine whether it is widely used. As Foster and Gibbons have noted:
“The phrase ‘if you build it, they will come’ does not yet apply to IRs. While their benefits seem to be very persuasive to institutions, IRs fail to appear compelling and useful to the authors and owners of the content. And, without the content, IRs will not succeed, because institutions will sustain IRs for only so long without greater evidence of success.”57
Notes
1 http://us.imdb.com/title/tt0097351/quotes
2 Sara R. Tompson has been a physical sciences and/or engineering librarian for 19 years, and has been Team Leader in the Science & Engineering Library at the University of Southern California (USC) since 2004. Tompson is currently serving as Secretary of SLA’s Physics/Astronomy/Math (PAM) Division. She has been involved with digital library concerns from the first days of Web browsers at the University of Illinois. In addition to one book and numerous articles, she coauthored with Elizabeth Eastwood “Digital Library Services: An Overview of the Hybrid Approach” in the 8th edition (2000) of the ASLIB (UK) Handbook of Information Management.
Deborah A. Holmes-Wong, Project Manager, Information Development and Management, USC, has held various positions at the University since she began her career there in 1987. She has spent the past five years involved in digital library initiatives as a project manager. She participated in ARL's Scholars Portal Project as one of USC's project managers. Her other projects have included planning and implementation of USC's collection information system for digital resources, openURL resolver and electronic resources management systems.
Janis F. Brown, Associate Director, Systems & Information Technology, Norris Medical Library, USC, has held various positions primarily related to technology and education in her 25 years with the University. She is the author of a book and four book chapters, and has presented nearly 40 papers and posters at professional meetings. Throughout her career she has been involved in information technology from the early days of Gopher as a campus wide information system to digital collection projects.
3 SPARC Institutional Repository Checklist & Resource Guide. Prepared by Raym Crow, SPARC Senior Consultant. Washington, DC: Scholarly Publishing & Academic Resources Coalition (SPARC), 2002, http://www.arl.org/sparc/IR/IR_Guide.html.
4 Lynch, Clifford, “Institutional Repositories: Essential Infrastructure for Scholarship in the Digital Age.” ARL Bimonthly (February 2003), http://www.arl.org/newsltr/226/ir.html.
5 IBID.
6 “The Case for Institutional Repositories: A SPARC Position Paper.” Prepared by Raym Crow, SPARC Senior Consultant, http://www.arl.org/sparc/IR/ir.html#exec.
7 For a discussion of this potential loss of information, see this recent editorial: Evangelou, Evangelos, et al. “Unavailability of Online Supplementary Scientific Information from Articles Published in Major Journals.” The FASEB Journal 19 (December 2005), pp. 1943-1944.
8 Rhee, Seung Yon. “Carpe Diem. Retooling the ‘Publish or Perish’ Model into the ‘Share and Survive’ Model.” Plant Physiology 134 (February 2004), p. 543.
9 IBID.
10 See, for just one brief history, the W3C (World Wide Web Consortium) profile of Berners-Lee: http://www.w3.org/People/Berners-Lee/.
11 “The Impact of Paul Ginsparg’s ePrint Archive,” http://library.lanl.gov/libinfo/preprintsbib.htm.
12 W3C talk “The Web as a Unifying Force in Europe,” http://www.w3.org/2005/Talks/w3c10-WebAsUnifyingForce/?n=3.
13 Willinsky, John. “Scholarly Associations and the Economic Viability of Open Access Publishing.” Journal of Digital Information 4:2, Article No. 177, 2003-04-09, http://jodi.tamu.edu/Articles/v04/i02/Willinsky/ [an open access journal].
14 Top Cited HEP Articles from SPIRES-HEP database, SLAC Library, http://www.slac.stanford.edu/library/topcites/.
15 For just one portal into discussions of the early Web, see this CERN page: http://public.web.cern.ch/public/Content/Chapters/AboutCERN/Achievements/WorldWideWeb/WebHistory/WebHistory-en.html.
16 Beginning in 1991, Elsevier staff worked with representatives from eight universities on the project that became Science Direct, see: http://info.sciencedirect.com/about/brochure.pdf.
17 Tobin, Martin J. “The Official Copy of AJRCCM Is Posted but Not Printed.” American Journal of Respiratory and Critical Care Medicine 166 (2002), pp. 905-906.
18 The “surcharge over print” issue the Yale University Libraries address in their “Guidelines for Ejournal Packages”: http://www.library.yale.edu/CDC/public/subcommittees/codger/documents/GuidelinesPackages.pdf.
19 For one discussion, see Carol Hoover’s paper “Cancellation of Print Journal at a National Research Laboratory,” one of the papers contributed to this session in 2001: http://www.sla.org/division/dst/Annual%20Conference%20Contributed%20Papers/2001papers/cancellation.html.
20 The Role of Computer Technology in the Growth of Productivity. Washington, DC: The Congress of the United States, Congressional Budget Office, May 2002, http://www.cbo.gov/ftpdocs/34xx/doc3448/Computer.pdf.
21 IBID.
22 Scheible, John P. “A Survey of Storage Options.” Computer (December 2002), pp. 42-46.
23 Baudoin, Patsy and Branschofsky, Margret. “Implementing an Institutional Repository: The DSpace Experience at MIT.” Science & Technology Libraries 24:1/2 (2003), p. 32.
24 IBID.
25 http://www.documentum.com/
26 http://www.documentum.com/products/collateral/success/success_concordia.pdf
27 “Implementation of the Strategic Plan for ISD. Status Report, June 2004.” http://www.usc.edu/isd/strategicplan/private/doc/SPUpdate200406Projs.htm. Internal report.
28 http://digarc.usc.edu:8089/cispubsearch/
29 Flexible Extensible Digital Object and Repository Architecture: http://www.fedora.info/
30 http://www.eprints.org/
31 http://il.proquest.com/products_umi/digitalcommons/
32 http://wiki.dspace.org/ScalabilityIssues
33 http://www.sdsc.edu/srb/index.php/Main_Page
34 http://www.slac.stanford.edu/BFROOT/
35 Foster, Andrea L. “Papers Wanted.” The Chronicle of Higher Education (June 25, 2004), p. 37.
36 http://archives.eprints.org/?action=analysis
37 Buehler, Marianne A. and Boateng, Adwoa. “The evolving impact of institutional repositories on reference librarians.” Reference Services Review 33:3 (2005), p. 299.
38 Foster, N. F. and Gibbons, S. “Understanding Faculty to Improve Content Recruitment for Institutional Repositories.” D-Lib Magazine 11:1 (January 2005), http://www.dlib.org/dlib/january05/foster/01foster.html.
39 http://isihighlycited.com/
40 Customer Analysis: a Manual of Techniques. Los Angeles, CA: University of Southern California, University Libraries Customer Analysis Team, July 1997. Internal document.
41 http://alcme.oclc.org/openurl/
42 As Foster and Gibbons, and Bell, note in their latest paper, “The features of an IR that are most exciting to librarians, such as persistent URLs and metadata schemas, rarely register the same enthusiasm for faculty.”! (Bell, Suzanne, Foster, Nancy Fried and Gibbons, Susan. “Reference Librarians and the Success of Institutional Repositories.” Reference Services Review 33:3 (2005), p. 287.)
43 http://journals.cambridge.org/action/siteHoldings
44 http://www.opticsexpress.org/journal/oe/about.cfm
45 Foster, Andrea. IBID.
46 Allard, Suzie, Mack, Thura R. and Feltner-Reichert, Melanie. “The Librarian’s Role in Institutional Repositories: A Content Analysis of the Literature.” Reference Services Review 33:3 (2005), p. 331.
47 http://highwire.stanford.edu/lists/freeart.dtl
48 http://www.amazon.com/gp/browse.html/002-1184810-4786432?node=16427261
49 http://www.lockss.org/
50 As discussed at the “Electronic Archiving for Libraries” session at the Statewide California Electronic Library Consortium (SCELC) Colloquium in March 2006: http://scelc.org/meetings/programday/2006/.
51 http://www.openrepository.com/
52 http://repositories.cdlib.org/escholarship/
53 http://www.econtentmag.com/EContent100/
54 http://poeticeconomics.blogspot.com/2006/03/open-access-transformative-change.html
55 http://usefulchem-experiments1.blogspot.com/
56 http://drexel-coas-elearning.blogspot.com/2006/02/blogger-as-lab-notebook.html
57 Foster and Gibbons, IBID.

SLA_IR_Paper_Final_Corrected.doc/Tompson p. 1
Institutional Repositories: Beware the “Field of Dreams”1 Fallacy!
A Special Libraries Association Science & Technology Division
2006 Contributed Paper
By Sara R. Tompson, Deborah A. Holmes-Wong and Janis F. Brown2
Introduction
Institutional repositories (IRs) are all the rage right now, particularly with academic libraries. But are they living up to their potential? Not exactly. While the potential exists for IRs to “have an impact on the serials crisis” – one of the questions posed by the call for this contributed papers session – and on other areas of the publication and research cycle, there has generally been more impact in the literature than in the scholarly world thus far. This paper will examine in brief the rise in and use of institutional repositories with an emphasis on science and technology/engineering (scitech) arenas to set the context for a discussion of the work of the University of Southern California’s Institutional Repository Needs Assessment Task Force (IRNA), and close with a look toward the future.
Institutional Repositories
Definitions
Institutional repository (IR) is a phrase that implies both ownership and preservation. IRs in the context of this session, and generally in librarianship today, typically store documents in digital format. The Scholarly Publishing & Academic Resources Coalition (SPARC) has defined IRs as:
“Digital collections capturing and preserving the intellectual output of a single university or a multiple institution community of colleges and universities.” 3
IRs are typically thought of as serving two broad purposes regarding the intellectual output of scholarly institutions:
1. They can bring (back) some local control of scholarly work. This control is ceded to varying or lesser degrees to publishers when subscriptions move to online-only — authoritative versions of articles are not typically retained in accessible format locally, as they are with print journals.
2. IRs, by grouping much of a university’s intellectual output in one place, facilitate interdisciplinary research connections, statistic-keeping, marketing and other institutional efforts.
An IR populated with an academic institution’s faculty and student research has become a key service that member institutions of the Association of Research Libraries (ARL) want to provide to their constituencies. Clifford Lynch’s 2003 article on IRs for ARL both documented and furthered the trend.4 In that paper Lynch addressed the two IR purposes noted above in this way, with an emphasis on technology for the second purpose:
“The development of institutional repositories emerged as a new strategy that allows universities to apply serious, systematic leverage to accelerate changes taking place in scholarship and scholarly communication, both moving beyond their historic relatively passive role of supporting established publishers in modernizing scholarly publishing through the licensing of digital content, and also scaling up beyond ad-hoc alliances, partnerships, and support arrangements with a few select faculty pioneers exploring more transformative new uses of the digital medium.” 5SLA_IR_Paper_Final_Corrected.doc/Tompson p. 2
An ideal IR, according to Lynch, and we agree, is also a dynamic collection, not simply a storage place. Leveraging possibilities for IRs will be discussed briefly later.
The Scholarly Publishing and Academic Resources Coalition (SPARC), in a white paper that makes the case for IRs, describes the purposes as follows, noting that institutional repositories:
• “Provide a critical component in reforming the system of scholarly communication – a component that expands access to research, reasserts control over scholarship by the academy, increases competition and reduces the monopoly power of journals, and brings economic relief and heightened relevance to the institutions and libraries that support them.”
• “Have the potential to serve as tangible indicators of a university's quality and to demonstrate the scientific, societal, and economic relevance of its research activities, thus increasing the institution's visibility, status, and public value.” 6
One important point that could distinguish an IR from a simple preprint or postprint archive is that accompanying or related material – raw data, videos, class lectures, etc. could also be stored therein. Some scitech journals are doing this now on their websites, particularly in medical and cell biology fields where large genomic data arrays are being analyzed, but not all journals have the storage capacity for this. And, some journals only retain supplementary material for a limited time7. A key component of the scholarly research process — building upon what has been done before — can be hampered if the data is not available. As a plant biologist at Stanford recently noted:
“Journals are now facing the fact that results of microarray and proteomics experiments do not fit into publishable article pages, similar to the situation regarding publishing of sequence in articles some 15 years ago. Often, these results are archived on individual journals’ Web sites and are not well connected to community resources.”8
However, a broadly defined IR that includes all sorts of supporting material raises some legitimate concern in faculty researchers’ and university administrators’ minds regarding access — not everyone would want this information publicly accessible, and many definitions of IRs include an open access principal and structure. Already it is clear that if a library or institution builds a repository, not everyone will come, which can only partly be controlled by the type of IR built. Plus, Rhee, the biologist quoted above, notes, IRs are not (yet?) clearly integrated into the scholarly reward structure:
“… researchers are not accustomed to contributing their data and expertise to community databases. Here lies a conflict: Although there is a well-established reward structure for publishing in scientific journals and public repositories (largely through enforcement either as a condition for publication or for receiving grants), a similar reward system does not yet exist for contributions to community databases.”9
This and other inherent conflicts regarding IR purposes and uses was brought to light in the USC task force interviews of faculty, as will be discussed further below.
Developments in information technology have been the strongest driver in the rise of institutional repositories in two distinct ways: the rise of online journals, and the drop in cost of electronic storage.
Rise of the Web and Online Journals
The rise of stable, online journals has revolutionized collection development, and allowed libraries to provide “just in time” 24x7 online access to journals rather than storing thousands of linear feet of print journals “just in case” someone needs to use a volume. This has been an extremely important paradigm shift that is still underway — many libraries’ journal collections are still hybrids of print and online. SLA_IR_Paper_Final_Corrected.doc/Tompson p. 3
The benefits of online journals far outweigh the costs. But one cost, that can have potential significance, is that collection ownership is often no longer local with electronic journals.
Scitech libraries were some of the first to move to electronic journals, as our users’ disciplines were some of the first to use the Web. Indeed, high energy physics is credited with the invention of the Web. Tim Berners-Lee, at the particle physics laboratory CERN outside Geneva, Switzerland, wrote the first Web browser in the early 1990s.10 Physicists in particular, but also other scientists and engineers, were already reading and sharing articles via preprint servers (Alan Ginsparg set up the arXiv server at Los Alamos in 1991.11 The Web proved a much easier tool for this interchange than the more linear file transfer protocol (FTP), as it facilitated networking. The “E-Doc” Web platform was developed in the mid-1990s for scientific publishers, especially the American Physical Society, Nature, and Springer Verlag, to be able to utilize the Web as a publishing vehicle12. Electronic journals proliferated. By some counts, over 75% of scholarly journals are now online13.
Some scitech fields, notably physics, have allowed publication and citation in preprint archives to “count” as a measure of scholarly output. Many of these archives are de facto (bad science will get culled out) or in actuality (including editorial board review) peer-reviewed. In the high energy physics field, Stanford Linear Accelerator’s SPIRES bibliographic database, an early, freely accessible Web-based library catalog, has always indexed preprints and has always included a citation analysis tool.14
In the early days of the Web, many scientists expected a large percentage of scholarly journals to be openly accessible on the Web15, and/or to adopt the open approach of physicists. This has not come to pass. Journals from scientific member organizations like the American Physical Society and the Institute of Electrical and Electronic Engineers (IEEE) quickly went online, and while they were, and remain, fairly reasonably priced, they are not freely accessible. Commercial scientific publishers like Springer and Elsevier16 also mounted journals on the Web fairly early on. Many of these journal packages, particularly the commercial ones, are quite expensive, and thus only practically available to larger institutions.
Initially most online journals were provided as an adjunct to the print, with the print often being considered the authoritative version. Publishers handled subscriptions to this new format in various ways, sometimes charging extra for the online version, sometimes giving it away for free with the print subscription. Both models persist into this century.
As the online versions of scholarly journals have become more stable, more and more of them have become recognized as authoritative versions of record. For instance, the American Journal of Respiratory and Critical Care Medicine, which went online in 1997, designated the electronic version as the version of record in 2002.17 However, with the rise in prominence of the online version, some publishers have changed their subscription models, either charging more for the online version than the print18 (which seems particularly absurd, as it is less expensive to mount online versions than to print and mail out hard copies) and/or “penalizing” subscribers that drop the print versions. These newer, more punitive subscriptions models have been one of the drivers for both librarians and some faculty, the creators of the content of scholarly journals, to look at open access models and institutional repositories.
While journals were going online, more and more new journals continued to appear. The increasing quantity and cost of journals in the past two decades has often — including in the call for papers for this session — been termed “the serials crisis.” One aspect of the crisis has been physical storage space. As online journals became more stable, more and more libraries with large journal collections, principally academic, and especially scitech, libraries have cancelled print subscriptions in order to free up over-shelved ranges19. Print cancellation decisions are never made lightly, but are often necessary also to regain space for other services requested by users (e.g. information commons) and to free up funds to reallocate to the purchase of more online journals.
However, such decisions to go electronic-only have given up the local, preservation control of the content of the journals, content created in part by the libraries’ users. This need not be a cause for alarm, as some publishers have demonstrated themselves to be trustworthy, and likely to remain in existence for a long time to come. However, some publishers, and some titles, seem less stable, changing platforms, formats, and ownership. Plus, not all publishers provide archival access — access to online journals for the period of the subscription, even if one no longer subscribes. And not all subscribers can purchase the permanent access packages which often have higher price structures. At USC, for example, we have not yet been able to acquire the premium access to the Blackwell Synergy journals. The notion of local institutional archiving of either pre- or post-print scholarly articles begins to look more attractive and sensible given these developments.
Electronic Storage
During the last decade or so, while online journals have been on the rise, the cost of electronic storage has gone down. In 2000, the U.S. Congressional Budget Office analyzed this phenomenon as part of the paper “The Role of Computer Technology in the Growth of Productivity,”20 from which the following chart is extracted.
Figure 1
Note that the figure uses a logarithmic scale in the price axis. As the authors say: “The prices at the beginning of the period are so much higher than those at the end—by close to five orders of magnitude — that presenting the same information using an arithmetic scale would obscure most of the price changes after the mid-1980s.”21
To look at this data in another way, the author reviewing the development of computer storage for the IEEE magazine Computer in 2002 put the dramatic decrease in personal computing terms:
“If you were a personal computing enthusiast in 1984, you could buy an IBM PC for $5,000 to
$6,000…For this you would get a 0.004-GHz 8- bit processor with 0.064 Mbytes of RAM, a 12- inch monochrome text-only display, a 0.16-Mbyte floppy drive, and no possibility of a hard drive. Compare that with a PC purchase today: a 2- GHz 32-bit processor with 2,000 Mbytes of
SLA_IR_Paper_Final_Corrected.doc/Tompson p. 4
SLA_IR_Paper_Final_Corrected.doc/Tompson p. 5
RAM, a 128-Mbyte video graphics card, a 700-Mbyte CD-RW (and maybe a DVD-R), a 17-inch 1600 × 1200 display with 32 million colors—all for maybe $1,500. The old machine cost six times as much for 1/2000th of the processing power. The purchasing power gain over 18 years is 1,200,000 percent.”22
Interestingly, now that storage is so inexpensive, organizations are buying so much of it, they cannot effectively manage it. “Storage virtualization” is a new buzz word to describe newer systems (upon which IRs can be constructed) to manage these large physical arrays. The older computer storage systems, used for some IR architectures, very much depend upon the physical structure (internal addresses) to locate things and this is slow across many terabytes of data.
IR Platforms
The continuing, dramatic drop in costs for storing of electronic data is one of the factors that drove the development of software for institutional repositories. The Massachusetts Institute of Technology (MIT) and Hewlett-Packard Laboratories jointly developed DSpace, beginning in early 2000.23 DSpace was developed principally as an IR platform — “the plan was to create an infrastructure for storing the digitally born, intellectual output of the MIT community and to make it accessible over the long term”24 — and a number of universities have and continue to implement it for that purpose. DSpace is freely available as an open source software product. Electronic data in a variety of formats and types — text, video, .RTF, .PDF, etc. — can be stored in a DSpace repository. It is extremely customizable. There is now a DSpace Federation, and members are active in sharing customization solutions and other information as they build and maintain IRs.
Documentum25 is a commercial product varyingly billed as “digital assets management” or an “enterprise content management” solution. In fact, it was purchased by the content management company EMC in 2004. Documentum is somewhat less customizable than DSpace, but it includes a number of automated processes for both ingest and extraction of data. Concordia University in Montreal is one of Documentum’s profiled users; they are using the product to manage faculty dossiers26, a function sometimes included in broadly defined IRs. USC has been using Documentum since 200427 to manage our Digital Archive28.
Other platforms on which IRs can be developed include FEDORA29, ePrints30 (used by the arXiv preprint server, now at Cornell University), Proquest’s Digital Commons31 and others. Indeed there has been more of a rush to develop IR architectures than IRs themselves! In some ways the software is the easy part; getting and sustaining contributions is the hard part.
Many current IR architectures do some things very slowly when you get more than a few thousand items in the repository, due to the storage and search structure limitations touched on above. The fundamental structure of some IRs may need to be changed as they grow. DSpace users are working on scalability solutions, according to their Wiki.32 The Storage Resource Broker from the San Diego Super Computing Center uses a different approach which borrows its federated structure from grid-computing.33 Grid computing is, briefly, distributed computing, but the networks of computers and processors are more virtual and larger than in older parallel processing models. With a grid approach, SRB works well for searching large collections across multiple organizations and heterogeneous storage systems, illustrated by the fact it is used by the BaBar international high energy physics collaboration34. This approach may prove a better platform for IR development. IR Usage
For all the talk about, and establishment of, IRs and software architectures for such repositories, their usage has not been as high as many predicted. As noted in a 2004 report in the Chronicle of Higher Education, many universities, including DSpace developer MIT, are finding the populating of their IRs slow going, and now, being committed to the repositories, have had to increase their marketing efforts and/or library staff involvement in the IR ingest processes.35
The Registry of Open Access Repositories (ROAR) developed at the University of Southampton (UK) tracks the growth of institutional archives and makes this data available in their Wiki36. As of December 2005, there were about 250 functional institutional repositories containing about 800,000 items, which averages out to only about 3,200 per archive, even though growth has taken off since 2002. Figure 2 is a graph of archival growth, created using the ROAR eTrac tool.
Figure 2, generated March 25, 2006
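The per-archive average quoted above can be checked with a quick calculation. The following is only a minimal sketch: it uses the two rounded ROAR figures given in the text (about 250 repositories and about 800,000 items), and the variable names are ours, chosen for illustration.

```python
# Rounded figures quoted from ROAR as of December 2005; both are "about" values.
repositories = 250
items = 800_000

# Mean number of items held per repository.
average_items = items / repositories
print(f"Average items per repository: {average_items:,.0f}")
```

Because both inputs are approximations, the result is best read as an order of magnitude: a few thousand items per archive, not tens of thousands.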
A recent case study article by librarians at the Rochester Institute of Technology notes:
“The authors did not anticipate the amount of work involved in marketing the IR and persuading faculty to use it and to deposit materials in it.”37
A growing number of institutions are discovering the serious investments in time, money and change management that an IR often brings. Fortunately, much of this information is being published or publicly shared, so those considering repository construction can find useful precautions and guidelines.
USC’s IR Needs Assessment
USC did not join the early adopters of IRs. This caution has allowed us to examine the state of IRs, as well as of developments in scholarly communication, before simply building an IR. The Dean of USC’s Information Services Division convened an Institutional Repository Needs Assessment (IRNA) Task Force in the Spring of 2005. Coauthor Deborah Holmes-Wong from the library’s Information Development and Management Department chaired the group, which also included representatives from the library’s Interdisciplinary Teams (arts & humanities, social sciences, and science & engineering, the last represented by coauthor Tompson), a representative from the Norris Medical Library on the Health Sciences campus (coauthor Brown), the University Archivist, and the Director of New Development from the computing side of the Division.
The IRNA Task Force was inspired in part by Nancy Foster and Susan Gibbons’ D-Lib article on understanding faculty as a prerequisite to developing sustainable institutional repositories. As they note, “Without content [created by faculty], an IR is just a set of ‘empty shelves’.”38
Our IR needs assessment task force took a three-pronged approach to this task of understanding potential IR contributors and users. We:
• Looked at institutional repository needs assessment literature to see what others have done;
• Interviewed fourteen USC research faculty from the arts, humanities, social sciences, medicine, sciences and engineering on the services that they need to enable their research and publications; and
• Conducted four focus groups with research faculty to ascertain the services that were most important to offer in an institutional repository.
The interviews and focus groups provided us an overall view of faculty needs. However, the sample size for the interviews and focus groups was too small to determine how widespread an attitude about IRs expressed by a particular researcher may be on campus. The purpose of the interviews was to identify possible repository services, and the purpose of the focus groups was to determine if there was strong support for these services. The Task Force’s final report was compiled in November 2005 and presented to the library faculty. Holmes-Wong and others are continuing to examine possibilities for next steps regarding an institutional repository at USC.
Interviews
In order to conduct the interviews, the Task Force first identified 90 potential faculty participants, culled from: USC Web sites, USC high performance computing center participants, USC faculty in ISI Web of Knowledge’s highly cited authors39 group, and task force participants’ recommendations.
IRNA task force members selected faculty from the list and scheduled interviews. We decided to schedule ten to fifteen interviews with faculty members from across the university, covering the major disciplines. We originally interviewed twelve faculty members. However, given our initial criteria, we found that we had mostly older, established, tenured faculty. We therefore identified additional faculty members to address the gender, age, and ethnicity imbalances the initial criteria had created.
The interviews were conducted by two to three people, utilizing a format developed and practiced at USC40. The coauthors of this paper were the science, engineering and medical faculty interview team. Interviews were scheduled for an hour with each faculty member, but interviewers planned on two hours, so they could stay if the faculty member wanted to talk with them at length.
The interview consisted of several open-ended questions:
• What are your research interests?
• How do you disseminate your research findings?
• How do you incorporate your research in the curriculum that you teach?
• What is the role of graduate students and post docs in your research process?
These conversations were always substantive, sometimes intense, and sometimes illuminating. We learned a great deal more about the mechanics of the faculty members’ research processes, such as formatting and file transfer difficulties they and their students encountered in saving, posting and sharing preprints, articles and postprints. We also heard a great deal of praise and criticism for various library services on both campuses. Establishing a comfort level in which faculty members could ask for help in using library resources led to more tangible outcomes than did their thoughts on IRs, which were decidedly mixed! Following each of the scitech/medical faculty interviews, one or more of us contacted an interviewee or his/her students to provide instructions on accessing a variety of electronic resources they had not previously realized were available.
The interviewers met after each interview to discuss and summarize the responses, paying attention to the unexpected issues that surfaced. Interview summaries were formatted into a document structure that worked well on the Wiki we developed to track the project. The interviewers then sent the documents to the faculty members, asking for each one to “sign-off” on the summary to ensure that it represented their responses accurately.
Focus Groups
From the interviews, we compiled ten use cases for the focus groups to discuss and prioritize. These use cases captured all the likely scenarios discussed with faculty members during the open-ended interviews. Focus group participants were given definitions of institutional repositories, session goals, and use cases. They were asked to sort the use cases into two piles: services that they thought were important and would like to see, and services that they felt were unimportant.
The faculty members were each asked to attend a specific session. A total of thirteen attended the focus group sessions, which were segmented as follows:
• Library faculty on the University Park Campus
• Health science faculty on the Health Science Campus
• Senior faculty on the University Park Campus
• New faculty on the University Park Campus
We believed it was important to query our colleagues, as well as both senior and junior faculty and faculty on both campuses (which are ten miles apart).
Results
In the interviews and focus groups, we did not find strong faculty support for an institutional repository for faculty self-archiving of already published articles, the model articulated by most of our peer institutions.
The more broadly applicable results of the focus groups’ discussions of the use cases follow. (We also received some very specific comments, such as one school’s need for a system to handle student dossiers.) These results capture the comments of the interviewees as well.
• Top priority: Use Case #9 -- Secure persistent storage
The focus groups validated what we had learned through interviews: that although there is little support for archiving already-published materials, researchers need secure long-term storage space for their research data and a way to provide stable, persistent links to their files. Faculty want to control who has access, and mostly do not want anyone else to have access; they also want to know that the integrity of the data is maintained. Some faculty members need a repository where they can archive their work, including endnotes, image files, data sets, software and multimedia software, while writing research papers. Creating and managing a repository of unpublished research data is a very different undertaking from creating and maintaining an open access archive.
• Use Case #1 Automated generation of curriculum vitae & Use Case #2 Faculty research locator
The Health Sciences faculty, who work more collaboratively and whose journals are indexed in ISI’s Web of Knowledge, favored a system that would generate curricula vitae for them from that database and other resources, and that would allow them to find other USC faculty with intersecting research interests. The focus groups containing faculty from the humanities and social sciences, whose disciplines are less well served by Web of Knowledge and who also tend to work more independently than science and engineering researchers, did not express a need for this service. They believed that this service was more attractive to administrators than to faculty.
• Use Case #6 Collaborating on an article in the repository & Use Case #8 Document/data set versioning
While software that supports collaboration and versioning on the Web was seen as important in all focus groups, participants were divided between those who already had access to such software and felt it was not a feature for the IR, and those who did not have access and felt it was very important for the institutional repository to have this feature.
• Use Case #10 Check for permission to post preprints and postprints
In faculty interviews, several faculty members identified the difficulty in checking for usage permissions as an obstacle keeping them from publishing their research on the Web. However, a service to check for permissions was not rated as important as the other features listed above.
• Use Case #5 “Automated” ingest of citations and OpenURLs
The ongoing “automated” ingest of citations and OpenURL41 links was not brought up by faculty members in any of the focus groups42. Such a feature would be needed in order to have the automated curriculum vitae service that many ranked fairly high.
• Use Case #3 User wants to add bibliography of work to system
All focus groups identified the capability of adding lists of works to the system as something that was not needed in the initial implementation of an institutional repository. The faculty who mentioned this use case felt that they would be more likely to contribute single items to an IR.
Two groups of faculty discussed the economics of the institutional repository, noting that it seemed such a system would be very expensive to implement and maintain for the long term. There was a concern that campus administration understand these costs before making a commitment to the faculty, because it could be disastrous for a university to set up a repository that faculty members came to depend upon for long-term storage and access, only to have the IR taken down when the costs involved were fully realized.
All of our focus groups voiced the concern that if an institutional repository is built it should support access to data sets, multimedia, audio, and video. There is a growing body of digital “publication” being produced by both faculty and students in these formats. Some of the faculty interviews pointed to a fairly new trend in scholarly publishing regarding other formats, with some publishers offering researchers the opportunity to produce peer-reviewed multimedia research pieces that are more than digital copies of the traditional journal article. In the case of Cambridge journals such a piece is handled as an addition to a published work43; in the case of Optics Express44, it is an entirely separate entity.
The Future of IRs?
USC
USC faculty members do not appear very interested in IRs as repositories of preprints or postprints, or as an open access alternative to print journal publication. Faculty already feel pressed for time, as was mentioned by every interviewee. Margret Branschofsky of MIT, as quoted by the Chronicle, has said: “Professors have a million things to do, and they don't have a lot of resources.”45 The USC faculty with whom the IRNA task force met are not interested in any additional activity that requires effort from them and offers them no immediate benefit.
A place to store research data (before publication) was of high interest to USC faculty, as noted above. This is not the same as an institutional repository, and it is less clear to us that the library should provide such storage, but we could provide a means of accessing data within the storage system. Current systems specifically for IRs won't easily meet this need, since they were developed with the idea of making the information openly available.
The USC faculty interviewed and those in focus groups are comfortable with publishers being responsible for the long-term preservation of their articles, although some expressed the belief that this will change over time. Interestingly, this is something of a chicken-and-egg phenomenon: faculty are satisfied with their current access to journal information because the libraries are spending a lot of money to license those online resources! Not all faculty members are aware of, and/or concerned with, the lack of local control and potential lack of archival access with online journals.
There is a need to do more to educate faculty about the problems of the current research publishing conundrum, including how much libraries are paying for the licenses for online journals. Librarians also need to do a better job of making clear to faculty members the simple fact that the only reason they are getting a full-text journal is that the library licensed it. The task force members learned that not all faculty members are aware, or remain aware, that not everyone can access electronic journals.
The IRNA task force members are in agreement that we should start by responding to the faculty’s perceived scholarly output needs and provide services for which they see an immediate benefit. Some of these services could be accomplished within an IR architecture. Perhaps once we have them using the IR to meet their needs, they would better understand some of the needs driving librarians, including regaining local collection control over their intellectual output. But we must be careful to avoid what we see as the trap into which some other institutions have fallen: quickly building an elegant repository but then spending large amounts of time recruiting content for it. Ex post facto marketing is clearly an ongoing concern for institutions with repositories. As Allard et al. recently noted in a review of the literature on librarians and IRs:
“Encouraging the involvement of the authors of intellectual property was mentioned in 90 percent of the articles, and nearly three-quarters of the articles referred to ideas about actively marketing the IR to authors.”46
Beyond
Storage of related and/or raw research material is a current concern for some USC faculty, and has also been a driver for the development of some IRs, as discussed earlier in this paper. Lately more resources are being developed to meet this need. For example, Stanford’s HighWire Press now archives a great deal of supplemental data for many scientific journals, and makes it freely accessible47. Amazon.com is now renting storage space for 15 cents per month per gigabyte via their Amazon S3 (Simple Storage Service)48.
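To put the Amazon rate in perspective, the following is a minimal sketch of an annual cost estimate. Only the 15 cents per gigabyte per month figure comes from the text; the function name and the collection sizes are hypothetical, chosen purely for illustration, and any charges beyond storage itself are not modeled.

```python
# US dollars per gigabyte per month: the Amazon S3 storage rate cited above.
PRICE_PER_GB_MONTH = 0.15


def annual_storage_cost(gigabytes: float,
                        price_per_gb_month: float = PRICE_PER_GB_MONTH) -> float:
    """Return the cost in dollars of storing `gigabytes` for twelve months."""
    return gigabytes * price_per_gb_month * 12


# Hypothetical collection sizes (in GB), for illustration only.
for size_gb in (10, 100, 1000):
    print(f"{size_gb:>5} GB: ${annual_storage_cost(size_gb):,.2f} per year")
```

At this rate a 100 GB body of research data would cost about $180 a year to store, which suggests that raw storage price is a smaller obstacle than the staffing, policy and preservation commitments discussed elsewhere in this paper.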
In terms of local storage, initiatives like LOCKSS (Stanford’s Lots of Copies Keep Stuff Safe49) allow an institution to capture and store data from subscribed online publications locally. However, stored publications cannot be accessed at need, as with an open IR, but only after a “trigger” event. In some cases such a trigger is defined broadly enough to include the cancellation of a subscription for which one still requires back issues that are not contractually available from the publisher50.
Other new products continue to appear that may allow institutions to obviate their own local implementations and customizations by outsourcing the whole IR. For example:
• Open access journal publisher/aggregator BioMed Central’s Open Repository, built using DSpace software.51
• The University of California system’s California Digital Library consortium has mounted an eScholarship Repository for post-prints52. This resource does not presently archive accompanying material or raw data for the publications, however.
• More and more commercial products to rival Documentum and Digital Commons are coming online, as well as products to manage niche formats like video. EContent magazine aims to keep track of such developments53.
At the same time, supporters of open access to information, including a wide array of scientific researchers, are advocating for open IRs in a variety of venues, both formal and informal. These latest efforts to make research accessible are analogous to those that drove the early development of the World Wide Web. See, for instance, librarian Heather Morrison’s recent open letter to the President and members of the American Chemical Society on her open access-themed blog The Imaginary Journal of Poetic Economics, which reads in part:
“Change can be difficult for all of us, and perhaps more so for the privileged, profitable society publisher. However, as the Budapest Open Access Initiative [http://www.soros.org/openaccess/] stated so well, open access makes possible an unprecedented public good:
Removing access barriers to this literature will accelerate research, enrich education, share the learning of the rich with the poor and the poor with the rich, make this literature as useful as it can be, and lay the foundation for uniting humanity in a common intellectual conversation and quest for knowledge…
I invite you personally, along with every member of the American Chemical Society… to engage in the process of transformation, to openly and immediately share not just peer-reviewed postprints, but also preprints, data, conference presentations, and research in progress.”54
Many of these activists are also developing IR-type functions on a variety of platforms, including blogs. Jean-Claude Bradley (Drexel University) has established the Useful Chem Experiments blog to freely share ongoing successes and failures in his lab55. Bradley explicitly advocates “open source science,” working to achieve three objectives via the blog platform: access, transparency and replication56. Such efforts may yet have a large impact on venues for peer-reviewed scholarly output.
The next step facing USC, and many other institutions, is to decide if we will build an institutional repository and, if so, what features it will include. If the funding is made available for an IR project at USC, faculty early adopters will need to be identified and enlisted and the appropriate software will have to be selected as the basis for the system. A key part of the process will be designing the interfaces so that they are useful to the faculty. Policies will also affect the usability of the system and could determine whether it is widely used. As Foster and Gibbons have noted:
“The phrase ‘if you build it, they will come’ does not yet apply to IRs. While their benefits seem to be very persuasive to institutions, IRs fail to appear compelling and useful to the authors and owners of the content. And, without the content, IRs will not succeed, because institutions will sustain IRs for only so long without greater evidence of success.”57
Notes
1 http://us.imdb.com/title/tt0097351/quotes
2 Sara R. Tompson has been a physical sciences and/or engineering librarian for 19 years, and has been Team Leader in the Science & Engineering Library at the University of Southern California (USC) since 2004. Tompson is currently serving as Secretary of SLA’s Physics/Astronomy/Math (PAM) Division. She has been involved with digital library concerns from the first days of Web browsers at the University of Illinois. In addition to one book and numerous articles, she coauthored with Elizabeth Eastwood “Digital Library Services: An Overview of the Hybrid Approach” in the 8th edition (2000) of the ASLIB (UK) Handbook of Information Management.
Deborah A. Holmes-Wong, Project Manager, Information Development and Management, USC, has held various positions at the University since she began her career there in 1987. She has spent the past five years involved in digital library initiatives as a project manager. She participated in ARL's Scholars Portal Project as one of USC's project managers. Her other projects have included planning and implementation of USC's collection information system for digital resources, openURL resolver and electronic resources management systems.
Janis F. Brown, Associate Director, Systems & Information Technology, Norris Medical Library, USC, has held various positions primarily related to technology and education in her 25 years with the University. She is the author of a book and four book chapters, and has presented nearly 40 papers and posters at professional meetings. Throughout her career she has been involved in information technology from the early days of Gopher as a campus wide information system to digital collection projects.
3 SPARC Institutional Repository Checklist & Resource Guide. Prepared by Raym Crow, SPARC Senior Consultant. Washington, DC: Scholarly Publishing & Academic Resources Coalition (SPARC), 2002, http://www.arl.org/sparc/IR/IR_Guide.html.
4 Lynch, Clifford, “Institutional Repositories: Essential Infrastructure for Scholarship in the Digital Age.” ARL Bimonthly (February 2003), http://www.arl.org/newsltr/226/ir.html.
5 IBID.
6 “The Case for Institutional Repositories: A SPARC Position Paper.” Prepared by Raym Crow, SPARC Senior Consultant, http://www.arl.org/sparc/IR/ir.html#exec.
7 For a discussion of this potential loss of information, see this recent editorial: Evangelou, Evangelos, et al. “Unavailability of Online Supplementary Scientific Information from Articles Published in Major Journals.” The FASEB Journal 19 (December 200), pp. 1943-1944.
8 Rhee, Seung Yon. “Carpe Diem. Retooling the ‘Publish or Perish’ Model into the ‘Share and Survive’ Model.” Plant Physiology 134 (February 2004), p. 543.
9 IBID.
10 See, for just one brief history, the World Wide Web Consortium (W3C) profile of Berners-Lee: http://www.w3.org/People/Berners-Lee/.
11 “The Impact of Paul Ginsparg’s ePrint Archive,” http://library.lanl.gov/libinfo/preprintsbib.htm.
12 W3C talk “The Web as a Unifying Force in Europe,” http://www.w3.org/2005/Talks/w3c10-WebAsUnifyingForce/?n=3.
13 Willinsky, John. “Scholarly Associations and the Economic Viability of Open Access Publishing.” Journal of Digital Information 4:2, Article No. 177, 2003-04-09, http://jodi.tamu.edu/Articles/v04/i02/Willinsky/ [an open access journal].
14 Top Cited HEP Articles from SPIRES-HEP database, SLAC Library, http://www.slac.stanford.edu/library/topcites/.
15 For just one portal into discussions of the early Web, see this CERN page: http://public.web.cern.ch/public/Content/Chapters/AboutCERN/Achievements/WorldWideWeb/WebHistory/WebHistory-en.html.
16 Beginning in 1991, Elsevier staff worked with representatives from eight universities on the project that became Science Direct, see: http://info.sciencedirect.com/about/brochure.pdf.
17 Tobin, Martin J. “The Official Copy of AJRCCM Is Posted but Not Printed.” American Journal of Respiratory and Critical Care Medicine 166 (2002), pp. 905-906.
18 The “surcharge over print” issue the Yale University Libraries address in their “Guidelines for Ejournal Packages”: http://www.library.yale.edu/CDC/public/subcommittees/codger/documents/GuidelinesPackages.pdf.
19 For one discussion, see Carol Hoover’s paper “Cancellation of Print Journal at a National Research Laboratory,” one of the papers contributed to this session in 2001: http://www.sla.org/division/dst/Annual%20Conference%20Contributed%20Papers/2001papers/cancellation.html.
20 The Role of Computer Technology in the Growth of Productivity. Washington, DC: The Congress of the United States, Congressional Budget Office, May 2002, http://www.cbo.gov/ftpdocs/34xx/doc3448/Computer.pdf.
21 IBID.
22 Scheible, John P. “A Survey of Storage Options.” Computer (December 2002), pp. 42-46.
23 Baudoin, Patsy and Branschofsky, Margret. “Implementing an Institutional Repository: The DSpace Experience at MIT.” Science & Technology Libraries 24:1/2 (2003), p. 32.
24 IBID.
25 http://www.documentum.com/
26 http://www.documentum.com/products/collateral/success/success_concordia.pdf
27 “Implementation of the Strategic Plan for ISD. Status Report, June 2004.” http://www.usc.edu/isd/strategicplan/private/doc/SPUpdate200406Projs.htm. Internal report.
28 http://digarc.usc.edu:8089/cispubsearch/
29 Flexible Extensible Digital Object and Repository Architecture: http://www.fedora.info/
30 http://www.eprints.org/
31 http://il.proquest.com/products_umi/digitalcommons/
32 http://wiki.dspace.org/ScalabilityIssues
33 http://www.sdsc.edu/srb/index.php/Main_Page
34 http://www.slac.stanford.edu/BFROOT/
35 Foster, Andrea L. “Papers Wanted.” The Chronicle of Higher Education (June 25, 2004), p. 37.
36 http://archives.eprints.org/?action=analysis
37 Buehler, Marianne A. and Boateng, Adwoa. “The evolving impact of institutional repositories on reference librarians.” Reference Services Review 33:3 (2005), p. 299.
38 Foster, N. F. and Gibbons, S. “Understanding Faculty to Improve Content Recruitment for Institutional Repositories.” D-Lib Magazine 11:1 (January 2005), http://www.dlib.org/dlib/january05/foster/01foster.html.
39 http://isihighlycited.com/
40 Customer Analysis: a Manual of Techniques. Los Angeles, CA: University of Southern California, University Libraries Customer Analysis Team, July 1997. Internal document.
41 http://alcme.oclc.org/openurl/
42 As Foster and Gibbons, and Bell, note in their latest paper, “The features of an IR that are most exciting to librarians, such as persistent URLs and metadata schemas, rarely register the same enthusiasm for faculty.”! (Bell, Suzanne, Foster, Nancy Fried and Gibbons, Susan. “Reference Librarians and the Success of Institutional Repositories.” Reference Services Review 33:3 (2005), p. 287.)
43 http://journals.cambridge.org/action/siteHoldings
44 http://www.opticsexpress.org/journal/oe/about.cfm
45 Foster, Andrea. IBID.
46 Allard, Suzie, Mack, Thura R. and Feltner-Reichert, Melanie. “The Librarian’s Role in Institutional Repositories: A Content Analysis of the Literature.” Reference Services Review 33:3 (2005), p. 331.
47 http://highwire.stanford.edu/lists/freeart.dtl
48 http://www.amazon.com/gp/browse.html/002-1184810-4786432?node=16427261
49 http://www.lockss.org/
50 As discussed at the “Electronic Archiving for Libraries” session at the Statewide California Electronic Library Consortium (SCELC) Colloquium in March 2006: http://scelc.org/meetings/programday/2006/.
51 http://www.openrepository.com/
52 http://repositories.cdlib.org/escholarship/
53 http://www.econtentmag.com/EContent100/
54 http://poeticeconomics.blogspot.com/2006/03/open-access-transformative-change.html
55 http://usefulchem-experiments1.blogspot.com/
56 http://drexel-coas-elearning.blogspot.com/2006/02/blogger-as-lab-notebook.html
57 Foster and Gibbons, IBID.