The Unified Computer Science Technical Report Index:

Lessons in indexing diverse resources

Abstract

UCSTRI is a WWW service which provides a searchable index
over thousands of existing technical reports, theses, preprints, and
other documents broadly related to computer science. This service has
been in operation since May of 1993 and has enjoyed significant use;
it received an honorable mention for "Best Professional Service" in
the 1994 Best of the Web awards, and is available at <URL:http://www.cs.indiana.edu/cstr/cover.html>.

The design and philosophy of UCSTRI are presented and compared with
other approaches to indexing technical reports. The lessons learned
about organizing electronic resources are discussed, both regarding
technical publications and network services in general.

Introduction

In the electronic revolution currently underway in the field of
``publishing,'' the academic community has in some ways held the
leading edge. This community was among the first to have access to
global network connectivity, and academic publication greatly
simplifies compensation issues since the author does not generally
expect financial remuneration. Academics have long been exchanging
preprints and technical reports with colleagues relatively freely;
this makes their findings widely available prior to publication in
more conventional channels, such as journals [Odlyzko]. These factors form an ideal
environment for electronic publishing.

In recent years, this informal network has moved onto the Internet, as
many departments, research groups, and other institutions make
publications such as technical reports, preprints, and theses freely
available electronically. The typical arrangement was a set of
documents, usually in PostScript® format, being made available
via FTP. This brought a large amount of information into the realm of
network accessibility, but did little to allow scholars to easily
search for items within this sea of data which might be of interest.
Pointers to items could be passed around through essentially the same
informal network of people, moved online.

The idea of combining archives of academic papers to form a searchable
interface is not a new one; the domain is a standard resource
discovery problem. Many other attempts to index online information
also exist; however, academic papers are particularly suited to
indexing because they typically are made available along with
``metadata'' which provides a concise and manageable description of
their contents (author names, title, and sometimes an abstract). This is
richer indexing information than is found in, say, FTP filenames or
Gopher menu entries.

Enter UCSTRI

The Unified Computer Science Technical Report Index, or
UCSTRI (rhymes with ``Spruce Tree''), is, as its name
suggests, an attempt to unify a wide variety of technical documents
broadly related to computer science as a searchable index. Technical
Reports about computer science form the core of the collection;
theses, preprints, and other papers from CS and related areas are also
included. At UCSTRI's core are some essential ideas:

Limited central resources.
Maintaining a central copy of every document is neither
desirable nor feasible. Maintaining a central copy of citations,
however, is viable as a near-term solution; obviously it cannot scale up
indefinitely.

Full hypertext connectivity.
The system should take the user
all the way to the final document, not to a screen of instructions,
FTP addresses, or a phone number to which a request must be faxed. This
means hypertext links to the remote document servers.

Currency. The system should handle document additions, deletions,
and changes automatically so its references are reasonably accurate
and timely. This is particularly important because most technical
reports and preprints are of interest for a relatively short period of
time (on the order of weeks or months.)

Minimal cooperation from providers.
The system should work without the institutions providing technical
reports needing to install special software, organize their archive in
a special way, or even be aware of the index. (We do try to let sites
know they are included as a courtesy, but this is not required.)

Design

UCSTRI requires two major modules. An index builder
polls numerous FTP sites for item information to construct a master
index file. The list of sites and their characteristics is the only
component of the system's operation that must be maintained by hand.
A search engine then processes queries into that
file to return citations and hypertext links to appropriate items.
The overall structure is similar to other indexing systems (e.g. [Aliweb].)

Figure showing UCSTRI's major components

A major distinction between UCSTRI and ALIWEB lies in the rigidity of
the index file; ALIWEB relies on the provider to have created the file
in a specific format, while UCSTRI assumes the provider probably
already has a file, but assumes as little as possible about the
structure of that file.

The hard part of UCSTRI is in the index builder: it must be general
enough to extract the metadata from index files on remote servers
despite the fact that those files do not necessarily follow any
consistent format. The indexer must find the index file, split it up
into separate records for each different document, and match those
records with the filenames of the items themselves.

Filename extensions are highly standardized and easily removed.
In this particular index, the records contain a space after
TR while the filenames do not; such standard changes can
be accommodated by simple substitution rules.
The textual contents of the index file are simply included blindly
with whitespace folded together. From the example above, the entries
created are:

TR 340 Gregory J. E. Rawlins. The new publishing:
Technology's impact on the publishing industry over the next
decade. (Nov. 1991). 68pgs
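
The matching step described above can be sketched roughly as follows. This is a minimal illustration, not UCSTRI's actual code: the function names, the list of extensions, the assumption that records are separated by blank lines, and the single substitution rule are all hypothetical choices made for the example.

```python
import re

# Hypothetical list of extensions to strip; a real site table would differ.
KNOWN_EXTENSIONS = (".ps.Z", ".ps.gz", ".ps", ".dvi.Z", ".dvi", ".txt")


def strip_extension(filename):
    """Remove a known extension so that TR340.ps.Z becomes TR340."""
    for ext in KNOWN_EXTENSIONS:
        if filename.endswith(ext):
            return filename[: -len(ext)]
    return filename


def fold_whitespace(text):
    """Collapse every run of whitespace (including newlines) to one space."""
    return " ".join(text.split())


def split_records(index_text):
    """Split an index file into records.

    Assumption: records are separated by blank lines. Real index files
    vary, which is why this step needs per-site configuration.
    """
    chunks = re.split(r"\n\s*\n", index_text)
    return [fold_whitespace(chunk) for chunk in chunks if chunk.strip()]


def match_records(index_text, filenames, substitutions):
    """Pair each record with the filename whose stem begins the record.

    `substitutions` is a list of (pattern, replacement) rules applied to a
    copy of the record before comparison; the stored text is left as is.
    """
    stems = {strip_extension(name): name for name in filenames}
    entries = {}
    for record in split_records(index_text):
        key = record
        for pattern, replacement in substitutions:
            key = re.sub(pattern, replacement, key)
        for stem, name in stems.items():
            # The lookahead keeps TR34 from matching a record for TR340.
            if re.match(re.escape(stem) + r"(?![A-Za-z0-9])", key):
                entries[name] = record
                break
    return entries


sample = """TR 340   Gregory J. E. Rawlins. The new publishing:
    Technology's impact on the publishing industry over the next
    decade. (Nov. 1991). 68pgs
"""
# One rule: drop the space between "TR" and the report number.
rules = [(r"^TR\s+(\d+)", r"TR\1")]
entries = match_records(sample, ["TR340.ps.Z"], rules)
```

The record text itself is never interpreted; only the leading identifier is normalized for comparison, which keeps the content description opaque to the indexer.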

When parsing ordinary index files, the content descriptions are opaque
to the indexer (as are the files containing the documents themselves.)
Where sites employ more specific formats, such as the one used
by the UNIX program refer or the one defined by [RFC1357], more structure is
available and the resulting entries can therefore be formatted more nicely. After
culling from the sites, all entries are placed in a master index
file.

The search engine was designed for lightweight simplicity and power;
termed Simple Index Keyword Search or SIKS, it accepts
multiple keywords (actually regular expressions) and returns items
ordered by how many expressions each matched. This search is flexible
for users familiar with regular expressions, but does not employ
pre-constructed tables such as those employed by [Wais]; such an engine could be employed without
altering UCSTRI's essential design.

A sample query might be a search for information about Knuth's work
with sandwich theorems, specified with the keywords sandwich theorem
knuth. The results are shown below.

Figure showing results from a sample query

Results

UCSTRI's structure has permitted it to grow reasonably rapidly; as of
this writing, it indexes 9,766 items at 177 different sites throughout
the network world (although there are inevitably some items which are
duplicates, errors, or otherwise not useful.) The resulting index
file, about 6.2 megabytes in size, is small enough to manage easily.
Although active participation on the part of indexed sites is
unneeded, some sites have become interested in UCSTRI and design their
FTP archive with it in mind.

Between March 18 and September 11, UCSTRI received 119,630 queries
originating from 21,053 distinct IP addresses. The two graphs below
show where these figures came from and when they arrived. These
figures only include actual searches on the database, not connections
to view the search cover page or information about the service. A
number of hosts could not be resolved by the Domain Name System; they
are listed as DNS failures.

Figures showing top-level domains of queries and frequencies of
queries over time
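
A tally of queries by top-level domain, with unresolved hosts counted separately, could be produced along these lines. This is only a sketch: the function name and the input shape (one hostname per query, with None where the lookup failed) are assumptions, and the DNS lookups themselves are left out.

```python
from collections import Counter


def tally_top_level_domains(hostnames):
    """Count queries per top-level domain.

    `hostnames` holds one entry per query; an entry is None (or empty)
    when the address could not be resolved.
    """
    counts = Counter()
    for host in hostnames:
        if not host:
            counts["DNS failure"] += 1
        else:
            # Take the last dot-separated label, ignoring a trailing dot.
            counts[host.rstrip(".").rsplit(".", 1)[-1].lower()] += 1
    return counts
```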

UCSTRI is a reasonably old and mature service as WWW services go; it
first came online in May of 1993. Use is somewhat volatile,
particularly as the large segment of academic users ebbs and flows
during the summer, and the esoteric nature of much technical
information limits the audience of serious users.

Stability is always a problem for network services, and certainly is
for UCSTRI. The service is not formally supported in any way; it is
administered as a hobby and run with machine resources donated by the
Indiana University Computer Science department. We now have a mirror
site in Japan, but in general no provisions exist for effectively
distributing the current system.

The lack of active
participation by information providers causes frequent maintenance
problems due to changing circumstances; frequently a file format will
change, an FTP server will move, or a filename will change from
Index to Index-1994.

The synopsis is that UCSTRI is a hack that works for now. The
maintenance required is relatively high; supporting the service as it
could be supported would probably take 8-10 hours a week (in practice,
the support it gets is somewhat irregular.)

UCSTRI and other indexers

Rik Harris's WAIS index of abstracts [Harris] provided some of the first broad
search functionality. Unfortunately, its interface does not provide
hypertext links to the final content when available online. Its list
of sites, however, was invaluable in constructing UCSTRI.

The Wide Area Technical Report System [Waters] is another attempt at organizing such
information. The National CS TR Library [Dienst, Davis] represents another, more
ambitious approach to a distributed digital library. Both systems
offer more sophisticated functionality and better scalability than
UCSTRI, but both also require sites to use specific software to be
included. The development time is consequently much longer because
consensus must be reached with a large number of participants; each is
still working with only a handful of participating sites in the short
term.

One intriguing recent addition is a broker for CS technical reports as
a demonstration application of Harvest [Harvest]. This system extracts information
from the documents themselves, unlike most other systems which treat
documents (typically in difficult-to-analyze formats like compressed
PostScript) as opaque objects [Essence]. Since files are more standard than
index formats, Harvest is able to function with less intensive
maintenance than UCSTRI. This broker has somewhat broader coverage
than UCSTRI (indeed, UCSTRI is one of the sources from which the
broker builds its list of sites) though, unlike UCSTRI, it includes
many entries for reports not available online. The broker also seems
to have greater problems with duplicate entries (for example, the TRs
of the author's department are all listed three times using different
domain names for the same machine.) The Harvest TR broker also uses
significantly more storage space for its index, though provisions for
distribution make that system more scalable overall.

The various indexing services might be considered along various scales
indicating ease of use for providers of documents, maintainers of the
service itself, and users. For providers, Harvest is easiest (no
effort is required) followed closely by UCSTRI (little or no effort is
required.) For maintainers, UCSTRI is probably the most
time-consuming and tedious to maintain well. For users, ignoring
differences in coverage, Dienst is probably the easiest to use for
generating powerful queries; WATERS is roughly comparable, with UCSTRI
falling significantly below it.

Lessons learned

As the amount of information available has grown, structuring the data
by time has become important. Technical reports and related
information are of most use early in their lifetime, and older reports
are less likely to be of value to users. UCSTRI orders its results by
time using the modification dates on the files obtained from FTP
directory listings, but such a solution is sadly incomplete. It
should be possible to restrict queries to a specific time interval,
such as the past six months.
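
Restricting results to a time interval, as suggested above, might look like the following sketch. The function name and the representation of each result as an (entry, modification date) pair are assumptions; six months is approximated as 183 days.

```python
from datetime import datetime, timedelta


def recent_first(items, now, max_age_days=None):
    """Order (entry, modified) pairs newest first.

    If `max_age_days` is given, pairs older than that many days before
    `now` are dropped before sorting.
    """
    if max_age_days is not None:
        cutoff = now - timedelta(days=max_age_days)
        items = [item for item in items if item[1] >= cutoff]
    return sorted(items, key=lambda item: item[1], reverse=True)
```

Since the dates come from FTP directory listings, they reflect when a file was last modified rather than when the report was written, so any such filter is only as reliable as those timestamps.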

Building an effective index for resource discovery which is smaller
than the space being indexed requires finding characteristics to
concisely describe the content of each item. Some approaches, such as
[Essence], attempt to extract such
information automatically; others, such as [Waters] or [Dienst], make the provider responsible for
supplying it in a specific form. UCSTRI steers a middle course, assuming
that this ``metadata'' often already exists and is maintained but is
not necessarily made available in a standardized format. The former
WAIS archive of FTPable-READMEs also employed such a strategy.

In general, metadata provided by people is likely to do a better job
of concisely expressing the essence of an item to a human reader than
metadata extracted by a program. The result is that search results
from a system based on explicit metadata are likely to be more
intelligible to the user than search results from a system based on
implicit metadata, such as the Harvest TR indexer [Harvest] or Archie [Archie].

Like most other successful network resource indices, UCSTRI is a quick
solution that works. As a more general framework for resource
discovery on the Internet evolves, the need for such solutions tends
to go away. Provider-supplied metadata, however, seems likely to
continue to play an important role in any general solution.

Acknowledgments

UCSTRI is run on facilities made available by the Department of
Computer Science at Indiana University, Bloomington. Thanks to Bill
Dueber and Tom Loos for helping develop this document, and to
Jun-ichiro Itoh for providing a mirror in Japan and enhancing
UCSTRI's formatting to handle Harris's index format.

References

Rawlins, G. ``The new publishing: Technology's impact on the
publishing industry over the next decade.'' Technical Report 340,
Department of Computer Science, Indiana University. November, 1991.
An abbreviated version of this paper appeared in Journal of the
American Society for Information Science 44:474.
<URL:ftp://ftp.cs.indiana.edu/pub/techreports/TR340.ps.Z>

Author's Biography

Marc VanHeyningen is a doctoral student in the Computer Science
Department at Indiana University, Bloomington. His research, with
advisor Gregory J. E. Rawlins, involves evolving index strategies for
large image databases. He is also employed by University Computing
Services to construct a document registry and index for network
resources available throughout the university.

Marc has been actively involved in the WWW community for some time; he
authored the first sophisticated Perl HTTP daemon which, after much
work by others, formed the core of the Plexus server. He still
administers the departmental HTTP server.