Text Encoding Initiative
Tenth Anniversary User Conference

Delivering Electronic Text Over the Web - The Current and Planned Practices
of the Oxford Text Archive

Alan Morrison and Jacob Fix

1 Introduction

In the summer of 1996, the Oxford Text Archive was appointed as one of five
Service Providers for the UK-based, national Arts and Humanities Data
Service (AHDS). The AHDS has a broad remit, requiring it to work
collaboratively on behalf of the academic community to:

Collect, catalogue, and preserve digital research data and distribute
these for use in teaching and research

Document and promote information interchange standards

Facilitate more uniform access to online catalogues, finding aids and
other networked resources

Develop strategies for preserving our digital cultural heritage

Provide guides to good practice in the creation and scholarly use of
digital resources

Encourage the development of high-quality, scholarly digital resources
by brokering creative new partnerships

The Oxford Text Archive specializes in the area of electronic texts, and
strongly advocates the use of TEI-conformant SGML. The majority of the
Archive's collection is now stored as TEI Lite texts, and it is these
materials that we seek to distribute as part of our contribution to the
workings of the AHDS. However, we also need to operate within the framework
of the AHDS, which will mean making our holdings accessible via the AHDS'
integrated catalogue (which aims to integrate the collections held at all
the Service Providers), and catering for the requirements of end users who
may have little knowledge of either SGML or TEI.

The Archive is currently facing the dual challenge of how to make our texts
accessible through the AHDS catalogue (which is likely to offer little or
no support for TEI-conformant texts), and how to deliver our texts to
end-users in a format they will find useful for their purposes. We believe
that the issues faced by the Oxford Text Archive are in many ways more
crucial than those confronting the other Service Providers, as many of
those cater for dedicated, subject-oriented communities (e.g.
archaeologists, historians etc.), whereas electronic texts are frequently
of interest to a broad range of humanities disciplines, and beyond.

2 Current practice - how texts are managed at the Oxford Text Archive

In the past, electronic texts have been deposited with the Oxford Text
Archive on a largely ad hoc basis. However, the formation of the AHDS has
meant that certain types of grant-holder (e.g. recipients of awards from
the British Academy, Leverhulme Trust etc.) are now actively encouraged,
if not obliged, to consider offering any scholarly electronic texts
produced as a direct or indirect result of funded research, for deposit
with the appropriate AHDS Service Provider (thus a report detailing an
archaeological excavation will more properly be offered to the Archaeology
Data Service than the Oxford Text Archive, even if it consists solely of
electronic text). Moreover, the fact that the Archive has been operating
for over twenty years has meant that even though many of our deposits are
offered as TEI Lite-conformant texts, there is notable variation in the
content (both markup and data) of the TEI Headers that have been used.

Over the past twelve months the Archive has been working to develop a
consistent TEI Header structure, which we hope will be appropriate for all
our current and future needs. To this end, the Archive has also organized
a meeting of representatives from some of the major TEI-aware, scholarly
electronic text creators and users (to take place in Oxford, in September
1997) to see if we can come to any sort of consensus regarding the
application of TEI Headers. (A report on this meeting has been submitted
for consideration for this conference.)

Rather than maintain a separate catalogue of the Archive's holdings, our
intention is to use the information stored in the texts' TEI Headers to
assist in the identification and retrieval of resources (an intention which
also implies close control over the data content of each TEI Header through
the use of conventional library authority files, controlled vocabulary
lists etc.). Ultimately, we would like to use the Headers to store
sufficient information to support a complete document management system, to
assist with collections management, document revision, and so on.
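
The kind of control over Header content described above can be illustrated
with a minimal sketch. It assumes that the Header is available in a
well-formed XML serialization (the Archive's texts are SGML, which Python's
standard library does not parse), and that language names are checked
against a controlled list. The list contents and the function name are
hypothetical, chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical controlled vocabulary; a real authority list would be
# maintained separately and would be far longer.
ALLOWED_LANGUAGES = {"English", "French", "Latin"}


def uncontrolled_languages(header_xml, allowed=ALLOWED_LANGUAGES):
    """Return language values in the Header that are not in the list."""
    header = ET.fromstring(header_xml)
    found = [
        " ".join("".join(el.itertext()).split())  # normalize whitespace
        for el in header.iter("language")
    ]
    return [value for value in found if value not in allowed]
```

A Header whose language values all appear in the list yields an empty
result; anything else is returned for correction before the Header is used
for retrieval.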

At the time of writing, the only publicly accessible catalogue of our
holdings is a collection of automatically generated HTML files, available
through the Archive's home page, which are produced from the ageing
database system we currently use to manage our collection. This online
catalogue allows the identification of texts by language, author, and
title, but in an extremely rudimentary fashion; it does not take full
advantage of the facilities now available through most web browsers, nor
does it allow end-users direct access to the wealth of metadata held in the
texts' TEI Headers.

3 Making texts more accessible -- the AHDS integrated catalogue, and
beyond

Each of the five AHDS Service Providers has adopted its own approach to
cataloguing their collections. In part this is a reflection of the varied
nature of the collections concerned (everything from digitized video and
geospatial data, to digitized images and machine-readable population census
data), but it also forms the basis of one of the objectives of the AHDS,
namely to explore the practical issues involved in developing an integrated
catalogue of diverse distributed resources. Whatever the local intentions
of the Oxford Text Archive, it is important that any cataloguing-related
activities that we undertake do not limit our full participation in the
development of the AHDS' integrated catalogue.

Between April and June of this year, under the auspices of the AHDS and
UKOLN (the UK Office for Library and Information Networking), each of the
Service Providers organized a meeting of specialists and end-users with a
particular interest in subject-specific metadata to assist initial resource
discovery. In each case, discussions centred on the usefulness (or
otherwise) of the Dublin Core element set to capture the basic metadata
considered essential to enable an end-user to find and identify a
particular electronic resource as being of possible relevance to his/her
area of concern. The Oxford Text Archive was well-placed in these
discussions, because the Dublin Core is reasonably well-suited to
describing text-like resources (as opposed to, say, digitized sound
recordings), and the mapping of information from a TEI Header to a Dublin
Core record is a straightforward process (assuming that the required data
has actually been recorded in the Header). Work is currently underway to
review the findings of these metadata meetings, to identify the set of
elements within the Dublin Core that can be supported across the domains of
all the Service Providers, and which will thus form the basis of the
minimal set of information that each Service Provider agrees to make
available through the AHDS' distributed catalogue. (At the time of
writing, it is planned that the catalogue will be based upon a network of
Z39.50-compliant clients and servers.)
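
To illustrate why the mapping from a TEI Header to a Dublin Core record is
straightforward, the following is a minimal sketch rather than the mapping
actually used by the Archive. It assumes a well-formed XML serialization of
the Header; the choice of Header paths for each Dublin Core element, the
sample Header, and all names in the code are assumptions made for
illustration. Elements that have not been recorded in the Header are simply
omitted from the resulting record.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from Dublin Core element names to paths inside a
# TEI Header. These paths are illustrative choices, not an agreed standard.
DC_FROM_TEI = {
    "title": "fileDesc/titleStmt/title",
    "creator": "fileDesc/titleStmt/author",
    "publisher": "fileDesc/publicationStmt/publisher",
    "date": "fileDesc/publicationStmt/date",
    "identifier": "fileDesc/publicationStmt/idno",
    "language": "profileDesc/langUsage/language",
    "rights": "fileDesc/publicationStmt/availability/p",
}


def tei_header_to_dc(header_xml):
    """Build a dict of Dublin Core element name -> list of values."""
    header = ET.fromstring(header_xml)
    record = {}
    for dc_name, path in DC_FROM_TEI.items():
        values = [
            " ".join("".join(el.itertext()).split())  # normalize whitespace
            for el in header.findall(path)
        ]
        values = [v for v in values if v]
        if values:  # only data actually recorded in the Header is mapped
            record[dc_name] = values
    return record


# Invented sample Header, for illustration only.
SAMPLE_HEADER = """<teiHeader>
  <fileDesc>
    <titleStmt>
      <title>An example text</title>
      <author>Example, Author</author>
    </titleStmt>
    <publicationStmt>
      <publisher>Example Archive</publisher>
      <idno>X-0001</idno>
    </publicationStmt>
    <sourceDesc><p>Invented sample for illustration.</p></sourceDesc>
  </fileDesc>
  <profileDesc>
    <langUsage><language id="eng">English</language></langUsage>
  </profileDesc>
</teiHeader>"""

record = tei_header_to_dc(SAMPLE_HEADER)
```

Because the sample Header records no date or availability statement, the
resulting record has no "date" or "rights" entry, which reflects the caveat
above about data that has not been recorded.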

As well as offering our collection through the AHDS, the Archive is also
working on the automatic generation of MARC records from TEI Headers.
These records will be loaded into Oxford University's OPAC, so that library
users will be made aware of the Archive's holdings alongside the
conventional resources that are also available. Moreover, because Oxford's
OPAC forms a crucial part of the CURL (Consortium of University Research
Libraries) OPAC, information about the Archive's holdings will be readily
available nation-wide (in addition to the online catalogue available via
our web pages, mentioned above).
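
The general shape of such a conversion can be sketched as follows. This is
not the Archive's conversion: the original does not say which MARC format or
fields are used, so the tags below (100 for the author, 245 for the title,
260 for publisher and date) are an assumption based on commonly used
bibliographic tags. Indicators, leader and the binary exchange format are
omitted, the Header is assumed to be well-formed XML, and all function names
are hypothetical.

```python
import xml.etree.ElementTree as ET


def text_at(header, path):
    """Return the whitespace-normalized text at path, or None if absent."""
    el = header.find(path)
    if el is None:
        return None
    text = " ".join("".join(el.itertext()).split())
    return text or None


def tei_header_to_marc_fields(header_xml):
    """Return a list of (tag, [(subfield code, value), ...]) tuples."""
    header = ET.fromstring(header_xml)
    author = text_at(header, "fileDesc/titleStmt/author")
    title = text_at(header, "fileDesc/titleStmt/title")
    publisher = text_at(header, "fileDesc/publicationStmt/publisher")
    date = text_at(header, "fileDesc/publicationStmt/date")

    fields = []
    if author:
        fields.append(("100", [("a", author)]))
    if title:
        fields.append(("245", [("a", title)]))
    # Assumed tag 260: subfield b for publisher, subfield c for date.
    imprint = [(c, v) for c, v in (("b", publisher), ("c", date)) if v]
    if imprint:
        fields.append(("260", imprint))
    return fields


def format_fields(fields):
    """Render fields in a simple line-per-field display form."""
    return "\n".join(
        tag + " " + " ".join("$" + code + " " + value
                             for code, value in subfields)
        for tag, subfields in fields
    )
```

The display form is only for inspection; loading records into an OPAC would
require whatever exchange format and cataloguing rules that system expects.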

The two approaches outlined above effectively disregard much of the
valuable metadata that is contained within the TEI Headers of
the Archive's collection. With this in mind, we have also been exploring
the development of a PAT/web gateway, which will allow users to make full
use of OpenText's powerful search and retrieval engine (PAT), via an
easy-to-use web front-end. This has the potential to allow users to search
across the collection for any information likely to be stored in a TEI
Header (within the constraints of the documentation practices adopted by
the Archive), and retrieve texts accordingly, which is clearly much more
powerful than the search and retrieval facilities offered by a conventional
library catalogue.
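
The idea of searching on any Header element, rather than on a fixed set of
catalogue fields, can be illustrated with a small stand-in. The sketch below
does not use PAT and makes no claim about how PAT or the gateway works; it
simply scans a set of Headers, assumed to be well-formed XML strings keyed
by a text identifier, for a term inside a named element. The function name
and data layout are hypothetical.

```python
import xml.etree.ElementTree as ET


def search_headers(headers, element_name, term):
    """Return sorted ids of texts whose Header has the term in the element.

    headers: dict mapping a text identifier to a Header XML string.
    The match is a case-insensitive substring test.
    """
    needle = term.lower()
    hits = []
    for text_id, header_xml in headers.items():
        root = ET.fromstring(header_xml)
        for el in root.iter(element_name):
            if needle in "".join(el.itertext()).lower():
                hits.append(text_id)
                break  # one match per text is enough
    return sorted(hits)
```

A query such as search_headers(headers, "publisher", "archive") can only
succeed where the relevant element has actually been filled in, which is the
constraint of documentation practice noted above.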

4 Delivering texts over the web - current and future practice

At present, the vast majority of the Archive's holdings are only made
available via either public or private ftp. A limited number of requests
for materials to be supplied on disk, magnetic tape, or CD-ROM are met each
year, but we intend to phase out this service in the very near future.
There are two distinct problems caused by the Archive's ftp service: first,
the arrangements for distributing texts to which usage restrictions apply
(i.e. those which are not freely available) are cumbersome and
labour-intensive; second, many new internet users are unfamiliar with ftp,
and are unable to configure their web browsers appropriately.

Once a user has identified a text which may be of interest, either via the
AHDS' distributed catalogue, Oxford University's OPAC, or by directly
searching our online resources, there are a variety of ways by which that
resource may be delivered. Under existing practice, a user who retrieves a
text from our public ftp site will be supplied with a copy of the raw ASCII
version of the file, complete with TEI Lite-conformant SGML markup. In
common with a number of other electronic text centres, we have also been
experimenting with delivering accompanying catalog and style files, such
that if an end-user's web browser is configured to launch a suitable
SGML-aware application (such as Panorama or Multidoc Pro), a formatted,
browsable version of the text can be made available.

The above scenario assumes that the end-user has installed an appropriate
SGML-aware application, be it a browser or some other sort of application,
and has configured his/her web browser appropriately. However, it is our
belief that this is still outside the reach of many users of the Archive
(particularly, we suspect, those that come through a less direct route,
such as via the AHDS' catalogue), and so for the foreseeable future we are
likely to offer a number of alternatives. Perhaps the most user-friendly
of these is to support an on-the-fly conversion of TEI Lite to HTML, which
will allow the end-user to browse a reasonably well-formatted version of
the text, and as more web browsers offer support for such features as
cascading stylesheets, we will have even greater control over the final
appearance of the text. We have also experimented with conversion to other
formats, such as RTF, on the grounds that this will at least provide
end-users with something that they find more familiar/acceptable, even
though this is at the high price of discarding all descriptive markup. An
alternative approach is to focus less on the needs of users who want to
read/browse texts, but concentrate instead upon other things that they
might wish to do with a text (which might in fact be the reason why they
sought out an electronic text in the first place), such as some form of
text analysis. To this end, we have been investigating offering a more
sophisticated PAT/web gateway, which will allow users to perform elementary
searches and analyses of texts, without having to download the entire
text(s) to their own machine, install and configure analysis software, and
so on.
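
The on-the-fly conversion of TEI Lite to HTML mentioned above can be
sketched in a few lines. This is an illustration under stated assumptions,
not the Archive's converter: the input is assumed to be a well-formed XML
rendering of the text, and the tag mapping is a small, hypothetical
selection. Elements without a mapping lose their markup but keep their
content, which shows on a small scale how descriptive markup is discarded
when converting to a less expressive format.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical mapping of a few TEI Lite element names to HTML tags.
TAG_MAP = {
    "head": "h2",
    "p": "p",
    "hi": "em",
    "lg": "blockquote",
    "l": "div",
}


def to_html(el):
    """Recursively convert an element and its content to an HTML string."""
    inner = escape(el.text or "")
    for child in el:
        # The tail is the text that follows the child inside this element.
        inner += to_html(child) + escape(child.tail or "")
    html_tag = TAG_MAP.get(el.tag)
    if html_tag is None:
        return inner  # unmapped element: keep content, drop markup
    return "<{0}>{1}</{0}>".format(html_tag, inner)


def tei_to_html(tei_xml):
    """Parse a TEI Lite fragment (as XML) and return an HTML fragment."""
    return to_html(ET.fromstring(tei_xml))
```

A server could call tei_to_html at request time and wrap the result in a
page template, leaving the final appearance to a stylesheet.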

We are closely monitoring the development of XML, as we feel that this is
directly relevant to the work of the Oxford Text Archive. For example, the
widespread use of XML-aware web browsers would certainly encourage us to
offer XML versions of our collection over the web (indeed, it is perhaps
conceivable that if XML lives up to expectations, we may move to using SGML
as purely an archival format, and make available on the web nothing but
valid XML documents). Similarly, we are keen to explore the possibilities
of using the Document Style Semantics and Specification Language (DSSSL) in
the online delivery of our texts, although at the time of writing we have
only carried out some initial work using a DSSSL formatting engine (James
Clark's JADE), to convert files from TEI Lite into other formats.

However, in addition to all the possible web-based delivery mechanisms
outlined above, the Oxford Text Archive will need to conform to AHDS
practice with regard to rights management. Closely coupled with the AHDS'
integrated catalogue will be a user registration and authentication
mechanism, which will enable (and require) users to register to use the
holdings of the various AHDS Service Providers. Although the exact nature
and functionality of this service are, at present, slightly unclear, it is
highly probable that this will have an impact upon the methods used to
deliver electronic texts to the end-users of the Archive.