This paper describes experiments to extract a set of multimedia documents from
a digital library in response to a user query, and then to present these
documents as a hypermedia application called a "bundle". A visual interface to
a query engine with multiple query tools, using successive refinements, is
described. The principle of automatic generation of hypertexts using the
structure inherent in the library catalogue is explained. Issues arising from
these experiments include the content of digital library catalogues and the
ownership and regulation of these catalogue entries. The paper explores these
issues, and examines the possibility of using the resulting system as a
workbench for investigating agent technology.

The advent of the digital library presents librarians and computer system
builders with new challenges and opportunities.

On a national level in the UK, the Joint Funding Councils, under the
chairmanship of Sir Brian Follett, produced a significant report [13] in 1993
which recommended a considerable investment in IT in British university
libraries. The Multimedia Research Group at the University of Southampton, UK,
has had a proposal accepted by the Follett Implementation Group on Information
Technology (FIGIT) to establish a standard framework for integrating journals
with other networked journals and information resources. The other partners
are the Cognitive Sciences Centre at Southampton, headed by Prof. Stevan
Harnad, who is the founding editor of Psycoloquy, the first peer-reviewed
electronic journal on the Internet; the University of Nottingham, home of the
CAJUN (CD-ROM Acrobat Journals Using Networks) project; the Company of
Biologists; and the British Computer Society, which is Europe's largest
professional computing society. The end product will be an Open Journal
Framework: a combination of document server and hypermedia client technologies
which allow customised access to a range of secondary information sources from
a central primary source.

While academic libraries in the UK are planning ahead to combat ever
decreasing funds with electronic solutions, similar pressures are being
investigated in the public domain. The most extensive review of public
libraries in the UK since 1942 is just being completed. The recommendation
"Infrastructure investments" of the Review of the Public Library Service in
England and Wales [2] puts great emphasis on connecting to the information
superhighway and suggests the establishment of 5 or 6 `hyperlibraries' to
incorporate decentralised collections in specialised subject areas from the
British Library. This would allow the sharing of resources and relieve pressure
on large central area libraries. In these `hyperlibraries' some local
collections could also be developed to the full extent of their national or
international appeal. Improving remote access and where feasible producing
digital libraries would then dramatically increase the number of users able to
exploit these collections. Extracting customised and manageable subsets from
such digital libraries is an important issue, and the principal subject of this
paper.

One of the immediately apparent advantages of maintaining resources digitally
is the ease with which one may make a query, and then retrieve the documents
identified. Another advantage is the ease with which one can browse the
materials, quickly skipping from one document to another. The success of the
World Wide Web is testimony to the importance of these features.

If Digital Libraries are to be more than computerised search engines, which
merely identify the location of the paper document, or allow the user to view
or print an electronic copy of the document, then it is essential that the
digital library adds value to what is currently available. Gladney et al. [9],
in defining a digital library, state that "A full service digital library must
accomplish all essential services of traditional libraries and also exploit the
well-known advantages of digital storage, searching, and communication".
However, the result of a computerised query is generally nothing more than a
list of documents. All one can do is to traverse the list searching for an
appropriate document. What we would like to be able to do is to make a
query, and have the system deliver an article or book, of exactly the correct
length, that was specially written in response to the query and to exactly the
required conceptual depth. Of course, one of the reasons readers browse at
libraries and bookshops is to evaluate books to see if they deliver their
subject at the correct level.

A second advantage of the digital library is that we may store and present
multimedia resources. We believe that multimedia presentations can make
very powerful learning aids, and it would be a very useful facility if we could
build multimedia presentations from the stored information. These presentations
might be requested by some client, built at the library from the latest
materials available, and then delivered to the site of learning. But who is to
build these presentations? Librarians are certainly overworked, and anyway
manual authoring of these presentations is not practical, except in special
cases where the end value of the product is intrinsically high.

Articles, books and multimedia presentations built in response to user queries
would be good examples of the sort of added value that digital libraries can
provide. We refer to them as bundles, and this paper describes various
experiments we are conducting on extracting these bundles from large catalogued
collections of digital resources.

The objective of our experiments is to provide a system which will produce an
appropriate and manageable hypermedia presentation in response to a user's
queries to an interactive library catalogue. The resulting presentation is what
we refer to as a bundle, being a set of multimedia documents, relevant
to a particular topic, that the user has selected as being of interest and that
have been automatically linked together as a hypertext.

There are therefore two stages to the creation of a bundle: the first involves
the user interacting with the library catalogue in order to specify the set of
documents to bundle, and the second, the creation of the hypertext, is
undertaken automatically. These two processes may be seen as simple cases of
the mediator and collection-interface agents under development in the
University of Michigan Digital Library [3]. The following subsections examine
these processes.

In order to help locate suitable documents from the resource catalogue we are
prototyping three tools, as shown in figure 1. The first of these tools is the
classification tool, which allows the user to view the classification
hierarchy (e.g. Dewey Decimal) as a tree, and to select that sub-tree or branch
which contains the set of documents to be placed in the intermediate results
list.
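The sub-tree selection performed by the classification tool can be sketched as
follows. This is only an illustration, not the prototype's code: the ClassNode
structure, the function names and the sample documents are all hypothetical,
and the class codes are merely examples in the style of Dewey Decimal.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ClassNode:
    """One node of a classification hierarchy (hypothetical structure)."""
    code: str
    label: str
    documents: List[str] = field(default_factory=list)
    children: List["ClassNode"] = field(default_factory=list)


def find_node(root: ClassNode, code: str) -> Optional[ClassNode]:
    """Depth-first search for the node carrying the given class code."""
    if root.code == code:
        return root
    for child in root.children:
        found = find_node(child, code)
        if found is not None:
            return found
    return None


def documents_in_subtree(node: ClassNode) -> List[str]:
    """Collect every document filed at this node or anywhere below it."""
    docs = list(node.documents)
    for child in node.children:
        docs.extend(documents_in_subtree(child))
    return docs


# Illustrative data only: the document names are invented.
tree = ClassNode("500", "Science", [], [
    ClassNode("550", "Earth sciences", ["earth-overview.txt"], [
        ClassNode("551", "Geology", ["volcano.avi", "rock-cycle.txt"]),
    ]),
    ClassNode("570", "Life sciences", ["cell.bmp"]),
])

# Selecting the "550" branch places its whole sub-tree in the results list.
intermediate_results = documents_in_subtree(find_node(tree, "550"))
print(intermediate_results)
```

Selecting a branch rather than a single node means that a user who picks a
broad class also receives the documents filed under its narrower sub-classes.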

The second tool is the attribute tool, which is a form based query tool,
which displays all the document attributes from the catalogue, and allows the
user to specify the attributes of the set of documents to be placed in the
intermediate results list. These attributes are such items as the author, the
title and keywords. The classification tool and the attribute tool are intended
to provide similar functionality to Bellcore's Hierarchical On-line Public
Access Catalogue (OPAC) [1]: in their system, the intermediate results list is
known as the bookshelf.

The third tool is an information retrieval tool which allows the user to
enter a free text query. The tool then locates that set of documents having the
best match to the query, using an algorithm developed by Li [17] based upon
that suggested by Salton [20]. The tool compares the frequency of terms in the
query with the frequency of terms in the documents, as held in pre-prepared
indexes. At present this tool only works with text based documents, but we
envisage that at a later date we will incorporate the results of work in the
Multimedia Research Group in the area of media based content retrieval [16].
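As a rough illustration of matching by term frequency, the sketch below ranks
documents by the cosine similarity between the term counts of the query and
those of each document. It is a generic stand-in and should not be read as the
algorithm of Li [17]; the function names and the sample texts are assumptions
made for the example.

```python
import math
import re
from collections import Counter


def term_frequencies(text):
    """Lower-case the text and count each alphabetic term."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(documents):
    """Pre-prepare one term-frequency vector per document."""
    return {name: term_frequencies(text) for name, text in documents.items()}


def best_matches(query, index, limit=3):
    """Return the names of the best-matching documents, best first."""
    q = term_frequencies(query)
    scored = [(cosine_similarity(q, tf), name) for name, tf in index.items()]
    scored = [(score, name) for score, name in scored if score > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:limit]]


# Invented sample texts, for illustration only.
index = build_index({
    "volcano.txt": "volcanoes erupt when molten rock reaches the surface",
    "library.txt": "the library catalogue lists every book and journal",
    "rocks.txt": "rock types include igneous rock and sedimentary rock",
})
print(best_matches("igneous rock", index))  # ['rocks.txt', 'volcano.txt']
```

Building the index once, ahead of any query, mirrors the pre-prepared indexes
mentioned above: only the query itself has to be analysed at search time.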

Figure 1: Querying the Library Catalogue.

Once the initial query has been made, a list of suitable documents will
appear in the intermediate results list box. From this list, the user may elect
to keep all, or a selection, of the documents retrieved. Documents that are
marked for keeping will appear in the final list of results, regardless of any
subsequent queries. The user may now elect to refine the query, or add further
results to the list. This is achieved by running any one of the query tools
again, with a new query. The results of each iteration may be ORed with the
previous results, so that they are added to the previous results, expanding the
set of results in the intermediate result box. Alternatively the new results
may be ANDed with the previous results, so that the new list of results
satisfies all previous queries, thus successively refining the query, until a
suitably small set of results has been collected.
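In set terms, expanding the intermediate list is a union with the new results
and refining it is an intersection. The sketch below shows this bookkeeping
under two assumptions of our own: that only documents currently listed can be
marked for keeping, and that the final list is the kept documents together
with whatever remains in the intermediate list. The class and method names are
hypothetical.

```python
class ResultList:
    """Bookkeeping for successive queries against the catalogue."""

    def __init__(self):
        self.intermediate = set()  # contents of the intermediate results list
        self.kept = set()          # documents the user has marked for keeping

    def expand(self, new_results):
        """Add the new results to the list (set union)."""
        self.intermediate |= set(new_results)

    def refine(self, new_results):
        """Retain only documents that also satisfy the new query (intersection)."""
        self.intermediate &= set(new_results)

    def keep(self, documents):
        """Mark listed documents so that later queries cannot drop them."""
        self.kept |= set(documents) & self.intermediate

    def final(self):
        """Kept documents plus the current intermediate list (an assumption)."""
        return sorted(self.kept | self.intermediate)


results = ResultList()
results.expand({"a", "b", "c"})   # first query
results.keep({"a"})               # the user marks "a" for keeping
results.refine({"b", "c", "d"})   # refining drops "a" from the intermediate list
results.expand({"e"})             # a further query adds "e"
print(results.final())            # ['a', 'b', 'c', 'e']
```

Holding the kept documents in a separate set is what lets them survive any
later refinement, as the text above requires.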

Because we do not yet have a suitably large digital library of multi-subject
on-line resources, for the purpose of prototyping the query tool we created a
catalogue in MS-Access of every document we have ever included in a Microcosm
project within the Multimedia Laboratory. This catalogue refers to around 5000
multimedia documents from 25 different subject areas. The catalogue we created
contained all the usual fields that would be expected in a library catalogue.
However, we have found two further attribute fields to be useful.

We have added a field for document length. Expressing the length of a
multimedia article is quite difficult. The purpose of keeping this attribute is
to allow the user to estimate the time that will be required to view this
document. "Pages" might have been a suitable unit of length for text documents,
but since many documents are held in media other than text, we decided to adopt
the "minute" as an experimental unit of length, being the approximate time that
might be required to peruse and understand the contents. This is inevitably a
subjective unit, but it does have the advantage that the user may refine
queries, for example, by asking only for "documents with length less than 5
minutes".

We have also added a field called reader level. This field is intended
to indicate the type of readership that is intended. We have restricted the
allowable entries to a very short list, including secondary school,
university teaching, research level, and review article.

The purpose of both the above fields is to represent a primitive form of user
profile. User profiling by mediation agents is an active research topic [21].
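A query over these two fields might look like the sketch below. It uses an
in-memory SQLite table purely as a stand-in for the MS-Access catalogue; the
table name, the column names and the sample records are all assumptions made
for the example.

```python
import sqlite3

# In-memory stand-in for the catalogue; schema and rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE catalogue ("
    "title TEXT, media TEXT, length_minutes REAL, reader_level TEXT)"
)
conn.executemany(
    "INSERT INTO catalogue VALUES (?, ?, ?, ?)",
    [
        ("Volcano eruption clip", "video", 3, "secondary school"),
        ("Plate tectonics review", "text", 45, "review article"),
        ("Rock cycle diagram", "image", 2, "secondary school"),
        ("Seismology lecture notes", "text", 4, "university teaching"),
    ],
)


def short_documents(connection, max_minutes, reader_level):
    """Titles shorter than max_minutes and aimed at the given readership."""
    cursor = connection.execute(
        "SELECT title FROM catalogue "
        "WHERE length_minutes < ? AND reader_level = ? "
        "ORDER BY title",
        (max_minutes, reader_level),
    )
    return [row[0] for row in cursor]


print(short_documents(conn, 5, "secondary school"))
```

Because length is held in minutes regardless of medium, the same condition
applies equally to the video, the image and the text documents.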

Automatic construction of hypertexts from a collection of linear documents is
one of the holy grails of the hypermedia research community. However, when a
document (or document set) has explicit structure, it is usually possible to
produce usable hypertexts automatically. Most work in the area [18, 8, 14] has
concentrated on using features such as tables of contents and indexes to
construct the hypermedia links. In this case we have a much coarser grained
table of contents, namely the library catalogue, and in place of indexes, we
have attributes such as keywords. However, our experiments have indicated that
it is still feasible to produce a usable hypertext bundle, given a set of
documents and their library catalogue records.

The hypertext system that we have used for these experiments is Microcosm
[6,5], which has certain key advantages as an experimental workbench: it was
developed at Southampton, so we have access to all its APIs, and it supports
multiple methods of locating information. Microcosm produces hypertext applications
which have a bias towards querying and information retrieval, rather than the
simple "button pushing" hypertext that we have come to expect from some of the
more popular authoring packages. This quality of hypertext lends itself to this
application.

The methods Microcosm supports for locating information are represented in
figure 2. The first method of access to the information uses the
classification information. This is the information that is held in
the Document Management System (DMS) and is almost exactly equivalent to the
information that is held in a library catalogue. It contains the position(s)
within the document hierarchy at which the file is located, and it contains
all the attributes, such as title, author, date of creation, physical media
type and keywords. This information is very high quality, as it provides
specific information about documents, and Microcosm provides tools for
traversing the subject hierarchy, and for querying the documents by attributes
such as keywords.

We were able to create the records for the document management system by
writing a few simple macros to export the information from the MS-Access
version of the library catalogue. Microcosm supports its own "logical
hierarchy", which is similar to a folder structure, and documents may be placed
in one or more branches within this structure.

Our initial reaction was to mirror the librarian's classification hierarchy
onto the Microcosm DMS. However, observation of the ways in which users of
Microsoft's Encarta tend to make queries, for example, by asking to see all
videos about some topic, led us to believe that categorisation by physical
type is just as important as classification by subject, so we have implemented
two hierarchies - one for subject and one for physical types.

The second method of access is hypermedia link following. It is
generally supposed that hypermedia links are manually authored in order to
indicate some relationship between the two items at each end of the link. Such
links are of very high value, but creating them requires considerable manual
effort. However, the Multimedia Research Group at Southampton always considered
that the reduction of this effort, and the automation of link creation, was an
important research issue. One of the earliest results of this research was the
introduction of generic links [7], which are links with a fixed end
point, but which may start at any point where the source object is located.
Typically this means any place where a particular text string is located.

We have taken all the keywords from the library catalogue, and automatically
generated generic links to the top of each document for which this keyword was
used. The result, from the user's point of view, is that all occurrences of
keywords appear as buttons within the text, and can be used to navigate to
documents sharing this keyword.
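The generation step can be pictured as building a table from each keyword to
the documents catalogued under it, and then looking for those keywords in
whatever text the user is viewing. The sketch below is our own illustration
and does not use Microcosm's linkbase format or API; the function names and
sample records are hypothetical, and omitting links back to the document being
viewed is an assumption.

```python
import re
from collections import defaultdict


def build_generic_links(catalogue_keywords):
    """Map each keyword to the documents that carry it in the catalogue."""
    links = defaultdict(list)
    for document, keywords in catalogue_keywords.items():
        for keyword in keywords:
            links[keyword.lower()].append(document)
    return dict(links)


def links_available_in(text, current_document, links):
    """Keywords occurring in the text, with the other documents they lead to."""
    available = {}
    for keyword, destinations in links.items():
        pattern = r"\b" + re.escape(keyword) + r"\b"
        if re.search(pattern, text, re.IGNORECASE):
            others = [d for d in destinations if d != current_document]
            if others:
                available[keyword] = others
    return available


# Invented catalogue records: document name -> keywords.
generic_links = build_generic_links({
    "volcano.txt": ["volcano", "magma"],
    "rocks.txt": ["igneous rock", "magma"],
    "library.txt": ["catalogue"],
})

text = "Magma that cools underground forms igneous rock."
print(links_available_in(text, "volcano.txt", generic_links))
```

Because the table is keyed on the keyword rather than on a position in any one
document, a single entry serves every occurrence of that keyword in the bundle.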

The third method of data access is text based information retrieval. The
interface to this functionality requires the user to make a selection of text
from some document, or else to type in some text, and then choose "compute
links" from the menu. The information retrieval engine will consult its
indexes, and return a list of documents with the most similar vocabulary to the
query. This is exactly the same engine as is used for our information retrieval
tool in the catalogue query engine.

We are in the process of producing our first prototype for the query engine,
and the automatic Microcosm hypertext construction engine, by integrating
various tools that have already been produced. We are testing it using a
"digital library" consisting of a large number of on-line documents available
within the Multimedia Laboratory. The bundles that the system produces clearly
provide for a greater ease of navigation and information location than the
simple raw set of documents would have provided, and subjectively we would
claim that the clearly marked boundaries to the bundle give the user a greater
sense of the scope of the information than would be provided by a number of
references into a very large collection of data.

This system was built as a workbench for experimenting with various ideas
within the digital libraries domain. It is clear that there is much room for
improving our system, and we hope that over time various additions will be
made. We would like to add a synonym generator. This would create further terms
from the document keywords, and create further generic links on these terms. We
would also like to integrate media based content retrieval and navigation [12,
16].

On a more complex level, we see the system as an ideal workbench for the
development and testing of intelligent agent technology. There are many ways
that such technology could be usefully deployed in this environment. Search
agents could track users' interactions and attempt to mine for other relevant
topics using background information retrieval, and link creation agents could
attempt to create further links and trails through the bundle. We are already
working on an agent to produce a "front-end document" as a kind of hypertext
overview [15] of the bundle.

Academic libraries, in particular, are now generally highly computerised in
their basic housekeeping processes and in presenting themselves via on-line
public access catalogues (OPACs).

One of the strengths of the traditional libraries is that many large
collections of books and materials in other media are catalogued with great
care and attention to bibliographic detail and accuracy using standard
cataloguing rules such as Anglo-American Cataloguing Rules (AACR2). As Howard
Rheingold points out, `Librarians and other specialists have a toolkit and
syntax for dealing with well-known problems that people encounter in trying to
make sense of large bodies of information' [19]. However, even key libraries such
as the Library of Congress are constantly having to review their strategies and
approaches to alternative methods of cataloguing such as copy cataloguing and
minimal cataloguing in order to dramatically bring down their backlogs.

On-line journal abstracting databases also traditionally provide very thorough
bibliographic detail created by professional graduate level staff. INSPEC, for
example, can help you refine your search strategy with classification terms
such as C7210L (INSPEC's classification for "library automation"), thesaurus
terms such as "document image processing" and free terms such as "multimedia
databases".

We wish to help maximise the return from this labour intensive work using our
tools for producing and viewing bundles. The combination of a variety of
materials from different sources may also provide a variety of classification
schemes and types of subject headings. These can be added to the bundle's
hierarchies. Although at the document level the classification may not easily
convey a sufficient level of separation, nevertheless it can be used in the way
a library user traditionally browses a library shelf. On the World Wide Web we
see subject catalogues such as the WWW Virtual Library being developed to give
a useful alternative way to access a huge amount of information. At the same
time Internet Yellow Pages directories [10], while they become obsolete fairly
quickly, still sell because they provide an overview of accessible items which
is not yet so easily scanned from today's computer screen.

The options for locating information are varied and different ones will suit
different users at different times and also the same users at different levels
of understanding. The level of an item - who it is intended for - has less
frequently been emphasised in the past. It may perhaps be highlighted by a
pointer such as a treatment code or a readership category, but this may become
more important in the future with the enormous volumes generated by both paper
and electronic publication. This possibility has been initially addressed by
our reader level field.

The catalogue entries produced by the information specialist may give a degree
of rigour to the information retrieval process. However, a multimedia bundle
will frequently contain documents produced by and often further modified by the
authors, particularly in the academic environment, which do not have the
advantage of such in-depth indexing. One way to compensate for this might
simply be to provide or point to a glossary or dictionary which helps the
searchers choose their terms. The linking mechanisms of Microcosm do, in any
case, give considerable help in retrieving relevant items. Free text terms used
by the author are often more up to date in style and therefore more akin to the
searcher's own vocabulary. The art is in balancing the effort required versus
the result, particularly as terminology can change dramatically as one moves
forwards or backwards in time. The effort also depends on how diffuse the
intended audience is - a multimedia database with entirely local users need not
be so concerned with the difference in terminology and synonyms between Britain
and America, for example.

An important feature of the approach described in this paper is that
knowledge may be extracted from a library and handed to a user in some
manageable piece of information: the user should be assured that the material
is all relevant, that the subject matter is at an appropriate level and, most
importantly, the quantity of information is manageable. If a ten year old
school child has made a query on the term "geology" we do not wish to hand them
the entire British Library entry on this topic - we probably need to explain
the term, and give a few examples of the sort of work undertaken by geologists.

One way to help users to comprehend the boundaries of a set of materials is to
extract the relevant materials, or copies of them, and present them to the user
in some form that is manageable and familiar. This is what a book is. In
attempting to simulate this effect within the digital library, we feel that it
is important that the users can visualise the boundaries of their newly created
hypertext material. This is why we produce bundles. In our experiments these
bundles have been created in Microcosm, which provides an ideal environment for
the production of a self contained hypermedia application.

However, there is nothing that makes Microcosm essential to this application.
The essential features are a mechanism for allowing users to search a
hierarchy, search attributes, follow hypermedia links and to carry out
information retrieval. Microcosm's generic links may be simulated within a
finite document set, by scanning the source text for each occurrence of a
string that is the source of a generic link, and marking it as a link in the
format used for the host hypertext system.
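A minimal sketch of this simulation is given below, taking HTML anchors as one
possible host format. The function name, the link table and the target file
names are hypothetical; matching is case-insensitive and on whole words, which
is an assumption rather than a description of Microcosm's behaviour.

```python
import html
import re


def mark_generic_links(text, targets):
    """Wrap each occurrence of a link source string in an HTML anchor.

    `targets` maps a source string to the address of its destination.
    Longer strings are tried first so that a phrase is preferred over a
    shorter string contained within it.
    """
    if not targets:
        return html.escape(text)
    lookup = {source.lower(): href for source, href in targets.items()}
    alternatives = sorted(lookup, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(s) for s in alternatives) + r")\b",
        re.IGNORECASE,
    )
    pieces, position = [], 0
    for match in pattern.finditer(text):
        # Text between links is escaped so that it is safe to embed in HTML.
        pieces.append(html.escape(text[position:match.start()]))
        href = lookup[match.group(0).lower()]
        pieces.append('<a href="%s">%s</a>' % (
            html.escape(href, quote=True), html.escape(match.group(0))))
        position = match.end()
    pieces.append(html.escape(text[position:]))
    return "".join(pieces)


marked = mark_generic_links(
    "Magma cools & forms igneous rock.",
    {"magma": "rocks.html", "igneous rock": "rocks.html"},
)
print(marked)
```

The scan has to be repeated whenever a document or a generic link is added,
which is why the technique is practical only within a finite document set.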

It would be perfectly feasible to produce a World Wide Web (html)
version of this application. There might be some problems in converting data
formats, but this route would have the advantage that its network delivery
would be superior to the current version of Microcosm. However, making new
links in html is rather more difficult than using Microcosm, and links
cannot be made to objects within other media, so if the user is expecting to
personalise and extend the delivered bundle, then perhaps this route is not so
suitable. The development of a tool for building html documents from a
Microcosm application is described in Hill et al. [11].

Another option might be to compile the resulting bundle into Microsoft's
Multimedia Viewer. Again, although there might be some problems automatically
converting data formats, and end user link making will not be possible, the
compensation would be a superior front end to the delivery engine. The
Multimedia Viewer will no doubt emerge as an industry standard for such
presentations, particularly for the less computer literate user.

There are two issues which we have deliberately ignored in creating this
experiment, but which should nevertheless be mentioned.

We have assumed that there is no copyright problem, and that any user may take
an electronic copy of any set of documents away in a bundle. We recognise that
there are problems with this approach, but others are addressing this issue [4]
and, as part of a wider discussion on charging policy, we are intending to
address the issue in our "Open Journal Framework" project for FIGIT. Until such
time as the position is clearer, we do not feel that such matters should be a
restriction to research into creating usable library technology. In the
meantime, it is possible to imagine scenarios where references to the documents
in the bundle are created in a private workspace within the domain of the
digital library server, so that viewing the documents in the bundle is actually
no different (legally or technologically) from viewing them using whatever
software was provided by the server.

The second issue we have chosen to ignore is the price (in terms of both
network transport time and financial cost) of getting a copy of a document.
Paul Evan Peters, in his keynote address to Digital Libraries `94, advised that
we should assume that the cost of networking is free. We have taken his advice,
and being academics, we are, of course, sheltered from knowing the financial
cost of anything. For use in a real library, it would be necessary to make
these matters explicit to users, so that they were aware of the time it would
be likely to take to download all the documents in their bundle, and the cost
of the documents specified.

Initial experiments with our system have indicated that bundles are a
manageable and appropriate method of delivering information from the digital
library. This approach has the advantages that:

Downloading the resources from the server to the bundle is a one-off effort,
thus minimising the strain on the server and communications network.

The hypermedia links and information retrieval enable better browsing than can
be achieved within a simple list of documents.

The boundaries of the information are known to the user, and the information
contained within the bundle has been selected as being of an appropriate
level.

The approach need not be tied to any particular software or hardware delivery
platform.

We have attempted to show that such bundles may be created by using the
information that would be available from a standard OPAC. However, we have
found that the documents described in such catalogues tend to be too large,
such as whole books, and it is necessary to describe smaller chunks of
documents such as chapters and pictures. This has knock-on effects on the
schemes for classifying and keywording information, which we find need to be
finer grained and more diverse than is generally available using standard
schemes. We have found user defined classifications and keywords to be helpful
at the level of the bundle, but acknowledge that there would be difficulties if
such anarchy was allowed at the digital library level.

The framework for creating and delivering bundles is a suitable workbench for
investigating the behaviour of intelligent agents, both in the field of
locating suitable information for placing in the bundle, and for creating links
within the completed bundle.

The Authors

Hugh Davis is a lecturer in Computer Science at the University of
Southampton, UK, and was a founder member of the multimedia research group. He
was one of the inventors of the Microcosm open hypermedia system, and is
manager of the Microcosm research laboratory. His research interests include
data integrity in open hypermedia systems and the application of multimedia
information retrieval techniques to corporate information systems and to
digital libraries.

Jessie Hey is a chartered librarian/information specialist and qualified
teacher who has worked in a variety of library/information roles at California
Institute of Technology, CERN and Southampton Institute of Higher Education.
This was followed by 12 years at IBM's UK Development Laboratory where her jobs
included managing the technical and business information services and setting
up an interactive learning centre. She is now pursuing postgraduate research
with the Multimedia Research Group at the University of Southampton.

We would like to thank Professor Wendy Hall, the director of the Multimedia
Research Group, for her inputs into this project, and the other members of the
group, too numerous to mention, but especially Les Carr, for their help and
ideas.

[11] Hill, G.J., Hall, W., De Roure, D.C. & Carr, L.A. Applying Open
Hypertext Principles to the World Wide Web. To be published in the Proceedings
of the International Workshop on Hypermedia Design `95. Montpellier. Available
from the authors as a Computer Science Technical Report. University of
Southampton, 1995.