A new hypertextual interface for thesaurus navigation is described. When the
user enters a subject term, the interface responds with a graphical
representation of that term's relationships to other terms in the thesaurus:
its place in one or more subject hierarchies, if any, as well as its place in a
space of loose associations with other terms. The user can selectively
collapse or expand displayed hierarchies. Every thesaurus term in the display
is hypertextual. The user can click on any term to move to that term's
graphical thesaurus display. The user can also drag terms to other elements of
the overall search and retrieval interface. Keyword and keyword-in-context
tools are available to access multiple-word terms in the thesaurus and aid in
entry into the thesaurus. The INSPEC Thesaurus is the first of several
thesauri we plan to mount in this interface.

One of the research activities of the University of Illinois Digital Library
Initiative (project URL: http://www.grainger.uiuc.edu/dli) is to find ways to
improve subject access through the use of interactive thesaurus displays.
Because the testbed project so far involves retrieval of SGML-encoded computer
science and physics journal articles, the INSPEC Thesaurus was chosen to
provide subject access to these articles, as one way to augment the full-text
access that lies at the heart of the project.

Each entry in the INSPEC Thesaurus has a structure typical of most subject
thesauri. The relationships used in the INSPEC Thesaurus appear in the sample
entry in figure 1.

Figure 1 shows the INSPEC Thesaurus entry for shock waves. The two-letter
codes UF, NT, BT, TT, RT, and DI stand, respectively, for Use For, Narrower
Term(s), Broader Term(s), Top Term(s), Related Term(s), and Date Input. Every
term designated as an NT, BT, TT, or RT in an entry has, in turn, its own
entry, which indicates the UF, NT, BT, TT, and RT relationships for that term.
(In this paper, term is synonymous with descriptor, meaning preferred or
approved INSPEC indexing term. Nonpreferred terms are referred to as access
terms.)

An additional relation, PT, not shown here, indicates a Prior Term. When used
with the DI field, the PT field helps to mark the beginning and end of a
term's lifetime in the thesaurus.
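
The entry structure described above can be pictured as a simple record. The
following is a minimal sketch only, not the optimized database format used by
the interface: the class name ThesaurusEntry and its field names are
assumptions made for illustration. The values come from the prose description
of the shock waves entry, the RT list is partial because only some RTs are
named in the text, and the DI value is left empty because it is not stated.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ThesaurusEntry:
    """One descriptor's entry; field names follow the two-letter codes."""
    term: str
    uf: List[str] = field(default_factory=list)  # Use For (access terms)
    nt: List[str] = field(default_factory=list)  # Narrower Term(s)
    bt: List[str] = field(default_factory=list)  # Broader Term(s)
    tt: List[str] = field(default_factory=list)  # Top Term(s)
    rt: List[str] = field(default_factory=list)  # Related Term(s)
    di: Optional[str] = None                     # Date Input (not stated in the text)
    pt: Optional[str] = None                     # Prior Term, if any


# Hypothetical in-memory form of the shock waves entry (RT list is partial).
shock_waves = ThesaurusEntry(
    term="shock waves",
    uf=["sonic boom"],
    nt=["detonation waves", "plasma shock waves"],
    bt=["acoustic waves"],
    tt=["waves"],
    rt=["Mach number", "seismic waves"],
)
```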

Figure 1: Sample INSPEC Thesaurus entry.

The INSPEC Thesaurus consists of approximately 8,000 descriptors and
another 8,000 or so access terms which lead to the descriptors. All of the
terms in figure 1 except for sonic boom are INSPEC Thesaurus
descriptors; sonic boom is an access term, as is any other term falling
under the UF tracing in an entry. The user who looks up the term sonic
boom, for example, is directed to use shock waves instead.
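
Redirecting an access term to its descriptor amounts to inverting the UF
tracings. The sketch below assumes a hypothetical in-memory mapping from
descriptors to their access terms; the names use_for, use_instead, and resolve
are illustrative and do not come from the interface software. Both example
pairs are taken from this paper.

```python
from typing import Dict, List

# Hypothetical in-memory form: descriptor -> its UF (access) terms.
use_for: Dict[str, List[str]] = {
    "shock waves": ["sonic boom"],
    "DP industry": ["computer industry"],
}

# Invert the UF tracings so each access term points at its descriptor.
use_instead: Dict[str, str] = {
    access: descriptor
    for descriptor, access_terms in use_for.items()
    for access in access_terms
}


def resolve(term: str) -> str:
    """Return the preferred descriptor; descriptors map to themselves."""
    return use_instead.get(term, term)
```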

The INSPEC Thesaurus is particularly well-suited to this project not only
because of its subject scope but also because of its overall structure. In
several important respects it is a "well-behaved" thesaurus and is therefore
amenable to algorithmically-controlled access and display. The INSPEC
Thesaurus consists of several hundred subject hierarchies, ranging in length
from two terms up to several hundred terms, with a maximum hierarchical depth
of about six levels. Subject hierarchies are built by recursively tracing
all the NT (narrower term) relationships from each TT (top term) in the
thesaurus. Orphaned terms do occur in the INSPEC Thesaurus, but the few that
exist are still connected to neighboring hierarchies by RT (related term)
tracings. Its moderate size, varied though consistently applied structure,
rich interconnectivity, and avoidance of inordinately deep hierarchies make
the INSPEC Thesaurus well suited for the kinds of subject access research we
report here and have planned for the future.
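
The recursive tracing of NT relationships from a top term can be sketched as a
depth-first walk. This is an illustration under stated assumptions, not the
off-line processing code: the mapping name narrower and the function
trace_hierarchy are hypothetical, only a fragment of the waves hierarchy is
included, and the direct link from waves to elastic waves is assumed rather
than stated in the text.

```python
from typing import Dict, List, Tuple

# Fragment of NT links. "waves" -> "elastic waves" is an assumption;
# the other links follow the description of the shock waves display.
narrower: Dict[str, List[str]] = {
    "waves": ["elastic waves"],
    "elastic waves": ["acoustic waves", "seismic waves"],
    "acoustic waves": ["shock waves"],
    "shock waves": ["detonation waves", "plasma shock waves"],
}


def trace_hierarchy(term: str, depth: int = 0) -> List[Tuple[int, str]]:
    """Depth-first walk of NT links, returning (depth, term) rows."""
    rows = [(depth, term)]
    for child in narrower.get(term, []):
        rows.extend(trace_hierarchy(child, depth + 1))
    return rows
```

Each row carries its depth, so a display routine can indent a term by its
level in the hierarchy.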

Most IR (information retrieval) systems and their companion OPACs (Online
Public Access Catalogs) in library and information networks have had a simple
design from 1968 onward: create a database of bibliographic records made up of
descriptive elements and subject index terms, possibly an abstract, then invert
several of the data elements creating an index to the document file, thereby
providing keyword access. All these keywords come from the document set as
they appear in each document. Little or no editing occurs to ensure that
synonyms will be collocated or variations of name form will be controlled.
Occasionally, if a thesaurus existed for inter-indexer consistency within the
database, this thesaurus would be mounted as an auxiliary file in the IR system
and be invoked by the user if and when results of a search were disappointing
and the search query had to be enhanced. The onus was on the end user or the
intermediary to remember to use such a file in time of need. Few systems
designed transparent interconnections between the inverted index lists and the
thesaurus file. Most had cryptic codes to indicate when a keyword in the index
list was a thesaurus descriptor. "Expanding" on the keyword to recover the
thesaurus record of relationships (broader, narrower, and related), the scope
note, synonyms, etc. was a separate command and usually interrupted the search
formulation process. This type of system design can be thought of as
replicating the paper lookup process where someone using the card catalog must
refer to the "Red Book," namely the Library of Congress List of Subject
Headings. Balancing the two operations was never easy and searchers chose the
path of least resistance. In on-line searching, especially when the keyword
index (described above) would turn up some but never all the relevant documents
in the file, the searcher just ignored the thesaurus altogether.

The time has now come, thanks to Graphical User Interfaces (GUIs), hypertext,
and better understanding of user requirements during the search process, to
provide a Searcher's Thesaurus at the outset of a search. Marcia Bates [2],
Wilf Lancaster [10], Jean Aitchison [1], Pauline Atherton Cochrane [6], and
Susan Jones [8], among other authors, have expressed a need for such a
thesaurus because the user needs help immediately upon contact with an
information system. Only in this way can the system clearly show a willingness
to help with synonyms, variations in phrasing compound concepts, broader and
narrower terms, and other means of expanding or limiting queries to
bibliographic databases.

Susan Jones et al. [9] describe experiments in interactive thesaurus
navigation with intelligence rules. They review the various attempts at
weighting relationships between terms, processing co-occurrence of terms to
present a concordance to the user, hypertextual thesaurus files, and user
navigation techniques, all in an attempt to provide heuristics for increasing
or decreasing recall and precision via a thesaurus. They end their discussion of
related research by saying that "the thesaurus component was not considered
separately so it is impossible to tell how much it contributed to the overall
success" of the search. In our work, because we consider this component of the
IR system so important, we are studying thesaurus use separately and will
redesign the interface and thesaurus as needed after user evaluation tests.

To quote Jones et al. again: "A thesaurus can be viewed as a bridge
(emphasis in original) between queries phrased in natural language and an
abstract classification structure which constitutes a map of a particular
domain....we can view the thesaurus mainly as a source of natural language
terms for query enhancement in a more general context..." (page 59). These two
statements seem to us to imply the need for a man-made thesaurus and a
machine-made term relationship list (sometimes called an automatic thesaurus)
to be conceived of as a single file for purposes of creating a Searcher's
Thesaurus. This is our intended line of research. This paper represents the
bare beginnings of our attempt to enhance IR system use at the outset and
throughout the searching and retrieval processes by providing easy access to,
and manipulation of, information in a "thesaurus" file.

The thesaurus browser described in this paper is an important component in the
redesign of a user interface for a digital library collection. Its novel
features provide access to a thesaurus descriptor's total hierarchy and the
"cloud" of related terms surrounding it. The display represents an ordered,
hypertextual concept space in which the user can move about at will, selecting
search terms for immediate use or for use during subsequent searches without
leaving the thesaurus or going into a separate search "mode."

The subject access interface takes you directly into the thesaurus when you
enter a word or phrase. The idea is to show you where the word or phrase you
typed lies in the thesaurus' conceptual space and to immediately allow you to
pick any of the other descriptors shown on the screen. In figure 2, the term
shock waves has been typed into the text box at upper left, and the
<Enter> button clicked. (The <Done> button closes
the thesaurus form and initiates a search for bibliographic records containing
the chosen term. The <Cancel> button closes the thesaurus display
without changing the state of any current searches.) Compare the graphical
thesaurus entry shown in figure 2 to the conventional thesaurus entry for the
same term shown in figure 1.

In figure 2, the thesaurus display has two sections. On the left, below where
you enter your initial subject search, it displays the current thesaurus term
in boldface with the related terms (RTs) floating in space around it. The image
conveys the related terms as having no hierarchical relationship to the current
term, but merely near it in as equidistant a way as possible. RTs which appear
closer to the current term are not in any way more "closely related" to it than
those which appear farther away; in the parlance of the crowd, they merely got
there first.
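
One simple way to spread related terms around the current term is to place
them at equal angular steps on a ring. The sketch below is a stand-in for
illustration and is not the placement algorithm used by the interface: the
function name ring_layout and the radius parameter are assumptions, and term
lengths are ignored.

```python
import math
from typing import Dict, List, Tuple


def ring_layout(related: List[str],
                radius: float = 100.0) -> Dict[str, Tuple[float, float]]:
    """Place each related term at an equal angular step around the origin."""
    if not related:
        return {}
    step = 2 * math.pi / len(related)
    return {
        term: (radius * math.cos(i * step), radius * math.sin(i * step))
        for i, term in enumerate(related)
    }
```

As in the display, distance from the center carries no meaning here: every
term sits at the same radius, and only the order of arrival decides the angle.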

Figure 2: INSPEC Thesaurus display with shock waves as the current term.

The current term appears on the right in boldface as well, but in a list
of hierarchical relationships between it and other terms. In figure 2, the
narrower terms (NTs) to shock waves are detonation waves and
plasma shock waves, the broader term (BT) is acoustic waves, and
the top term (TT) is waves. In cases where a term has more than one BT
or TT, the interface can display a polyhierarchy as well (see figure 7).

The right-hand section of the display allows you to discern relationships among
terms other than the current term and its immediate broader and narrower terms.
For example, acoustic waves, the BT for shock waves, has the BT
elastic waves. Elastic waves has the NTs acoustic waves,
Love waves, magnetoelastic waves, etc.

The left-hand section of the display only shows terms listed in the thesaurus
as immediately related to the current term.

The thesaurus display as a whole reveals other interesting "polyrelations"
between terms not discernible from the printed form. In figure 2 again,
seismic waves is shown as an RT to shock waves, although the
hierarchic display on the right also reveals that seismic waves is a
sibling term (has the same BT, elastic waves) with acoustic
waves, the BT of shock waves. Depending on your gender preference,
you might then call seismic waves an "Aunt" or "Uncle" term of shock
waves.
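
The "Aunt" or "Uncle" relationship can be computed from BT links alone: it is
the set of siblings of a term's broader terms. The following sketch uses a
hypothetical mapping named broader containing only the links mentioned above;
the function names are illustrative.

```python
from typing import Dict, List, Set

# BT links for the terms discussed in the text.
broader: Dict[str, List[str]] = {
    "shock waves": ["acoustic waves"],
    "acoustic waves": ["elastic waves"],
    "seismic waves": ["elastic waves"],
}


def siblings(term: str) -> Set[str]:
    """Terms that share at least one BT with the given term."""
    parents = set(broader.get(term, []))
    return {
        other for other, bts in broader.items()
        if other != term and parents & set(bts)
    }


def aunts_and_uncles(term: str) -> Set[str]:
    """Siblings of the term's BTs."""
    result: Set[str] = set()
    for parent in broader.get(term, []):
        result |= siblings(parent)
    return result
```

Intersecting this set with a term's RTs would pick out "polyrelations" such as
the one between seismic waves and shock waves.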

The thesaurus display is completely hypertextual. When you click on any
descriptor on the display, the interface software immediately changes the
thesaurus display to have that descriptor as the current term. For example,
clicking on Mach number on the display as shown in figure 2 (to the
lower right of shock waves in the related term display) changes the
display to that shown in figure 3. In figure 3, Mach number is shown in
boldface in both the related term and hierarchic displays, indicating that it
is now the current term.

Since RTs are symmetrical relationships, shock waves is shown as an RT
to Mach number. An interesting thing about this particular display is
that all the RTs for Mach number were seen in figure 2 as RTs for
shock waves. At the current stage of development of the interface you
have to flip back and forth between displays a number of times to discover this
fact.
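
A comparison of this kind reduces to set operations on the RT lists of the two
terms. The sketch below is illustrative only: the function names are
hypothetical, and the RT sets use invented placeholder names ("rt1" to "rt3")
rather than actual INSPEC terms, since the full RT lists are not given in the
text.

```python
from typing import Dict, Set


def shared_related(rt: Dict[str, Set[str]], a: str, b: str) -> Set[str]:
    """RTs that two terms have in common."""
    return rt.get(a, set()) & rt.get(b, set())


def rts_contained_in(rt: Dict[str, Set[str]], a: str, b: str) -> bool:
    """True if every RT of a, apart from b itself, is also an RT of b."""
    return (rt.get(a, set()) - {b}) <= rt.get(b, set())


# Placeholder RT sets; "rt1".."rt3" are invented names, not INSPEC terms.
rt: Dict[str, Set[str]] = {
    "shock waves": {"Mach number", "rt1", "rt2", "rt3"},
    "Mach number": {"shock waves", "rt1", "rt2"},
}
```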

Continuing with the example display shown in figure 3, the hierarchical display
shows Mach number as an NT for fluid mechanics (in other words,
fluid mechanics is the BT for Mach number), and mechanics
is the TT for the whole hierarchy. Mach number itself, however, has no
NTs.

From the hierarchical display in figure 3 you can tell that Mach number
has no NTs in two ways. First, there are no other terms shown as indented
immediately below it; intermolecular mechanics, though it is immediately
below Mach number, is actually indented relative to mechanics, the top
term of the hierarchy, and therefore an NT of mechanics, not Mach number.

Figure 3: INSPEC Thesaurus display with Mach number as the current
term.

Figure 4: INSPEC Thesaurus display with intermolecular mechanics as
the current term.

The second and more useful way of knowing that Mach number has no
NTs is the absence of either a "+" (plus) or "-" (minus) sign to the immediate
left of it. These signs both mean the same thing: that the term they precede
has NTs. The difference is that when the sign is a "+" the NT hierarchy
beneath the term is collapsed, but when the sign is a "-" the NT hierarchy
beneath the term is expanded. Both figures 2 and 3 show NT hierarchies in
states of expansion and collapse: in figure 2, the NT hierarchies under
elastic waves, acoustic waves, and shock waves are
expanded, with the rest collapsed; in figure 3, the NT hierarchy under fluid
mechanics is expanded, and all the rest are collapsed.

The thesaurus interface software automatically expands or collapses NT
hierarchies to show the BTs and NTs surrounding the current term. Thus, even
when the current term occurs in a long hierarchy, the software can display it
as part of a fairly short and thus readily comprehensible display. The fully
expanded hierarchy under mechanics in figure 3, for example, is one of
the longest in the INSPEC Thesaurus. Yet, because only the immediately broader
parts of the hierarchy are expanded around the current term Mach number,
its relation to the overall body of knowledge to which engineers ascribe the
term "mechanics" is clearly illustrated.
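
The automatic expansion can be sketched as follows: find the path of NT links
from the top term down to the current term, expand only the terms on that
path, and mark every other term that has NTs with a "+". This is a minimal
illustration, not the interface's Visual Basic code. The names narrower,
find_path, and render are hypothetical, the hierarchy is a small fragment,
placing fluid mechanics directly under mechanics is an assumption, and the NTs
of intermolecular forces are omitted.

```python
from typing import Dict, List, Optional

# Simplified fragment of the mechanics hierarchy (see assumptions above).
narrower: Dict[str, List[str]] = {
    "mechanics": ["fluid mechanics", "intermolecular mechanics"],
    "fluid mechanics": ["fluid dynamics", "Mach number"],
    "fluid dynamics": ["surface waves (fluid)"],
    "intermolecular mechanics": ["intermolecular forces"],
}


def find_path(root: str, target: str) -> Optional[List[str]]:
    """First path of NT links from root down to target, or None."""
    if root == target:
        return [root]
    for child in narrower.get(root, []):
        sub = find_path(child, target)
        if sub is not None:
            return [root] + sub
    return None


def render(root: str, current: str) -> List[str]:
    """Indented rows with '+'/'-' markers; only the path to current is expanded."""
    expanded = set(find_path(root, current) or [root])
    rows: List[str] = []

    def walk(term: str, depth: int) -> None:
        has_nts = bool(narrower.get(term))
        marker = ("-" if term in expanded else "+") if has_nts else " "
        rows.append("  " * depth + marker + " " + term)
        if has_nts and term in expanded:
            for child in narrower[term]:
                walk(child, depth + 1)

    walk(root, 0)
    return rows
```

Calling render("mechanics", "Mach number") expands mechanics and fluid
mechanics and leaves fluid dynamics and intermolecular mechanics collapsed,
mirroring the state described for figure 3.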

You can see directly how the thesaurus interface automatically expands and
collapses hierarchies by clicking on different terms in the same hierarchy.
Clicking on intermolecular mechanics as it appears in figure 3 changes
the hierarchy to the one shown in figure 4 (because the thesaurus now has
intermolecular mechanics as its current term, it changes the RTs
displayed as well). Note that it collapsed the NT hierarchy under fluid
mechanics (where Mach number appears) and expanded the NT hierarchy
under intermolecular mechanics. The sole NT of intermolecular
mechanics is intermolecular forces, which has, as indicated by the
"+" sign before it, one or more NTs itself.

Figure 5: INSPEC Thesaurus display with fluid dynamics as the
current term. The scroll bar appears at right because the length of the
expanded parts of the hierarchy exceeds the length of the window.

You can expand and collapse NT hierarchies yourself by respectively
clicking on "+" or "-" signs in the hierarchical display, rather than the terms
they occur next to. You can thus see other hierarchical relationships in the
thesaurus without changing the current term.

The change in the mechanics hierarchy shown between figures 3 and 4 was
a minor one because in each case the immediate hierarchies surrounding the
current terms were small. If, however, in figure 3 you were to click on
fluid dynamics the change in the mechanics hierarchy would appear
drastic and disorienting, as shown in figure 5. The causes of this sudden
increase in the complexity of the display have to do with the structure of the
thesaurus itself and with how the current version of the interface software
displays it.

Figure 6: INSPEC Thesaurus display with fluid dynamics as the
current term. The redundant NT hierarchy under the upper occurrence of
fluid dynamics has been collapsed, revealing more of the overall
hierarchy.

The structure of the thesaurus places fluid dynamics into a
polyhierarchy, which complicates not only the conceptual space surrounding it
but also the way in which the interface software must display it. Fluid
dynamics has two BTs: dynamics and fluid mechanics, meaning
that the entire NT hierarchy under fluid dynamics occurs twice. Since
the displayed NT hierarchy for fluid dynamics is identical in each case,
we can collapse the redundant NT hierarchies by clicking on the "-" signs next
to all but one of the redundant occurrences of the polyhierarchic term. Figure
6 illustrates how collapsing one of the redundant NT hierarchies under fluid
dynamics simplifies the hierarchical display somewhat, though it is still
long enough to require a scrollbar. Automatic collapsing of redundant NT
hierarchies could be added easily enough to the thesaurus interface software.
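
One possible form of such automatic collapsing is to expand a term's NT
hierarchy only at its first occurrence and show later occurrences collapsed.
The sketch below illustrates that idea and is not a feature of the current
software: the names narrower and render_once are hypothetical, and placing
dynamics and fluid mechanics directly under mechanics is an assumption.

```python
from typing import Dict, List, Set

# Fragment in which fluid dynamics has two BTs (see assumptions above).
narrower: Dict[str, List[str]] = {
    "mechanics": ["dynamics", "fluid mechanics"],
    "dynamics": ["fluid dynamics"],
    "fluid mechanics": ["fluid dynamics"],
    "fluid dynamics": ["surface waves (fluid)"],
}


def render_once(root: str) -> List[str]:
    """Expand each term's NTs at its first occurrence only."""
    seen: Set[str] = set()
    rows: List[str] = []

    def walk(term: str, depth: int) -> None:
        children = narrower.get(term, [])
        first_time = term not in seen
        seen.add(term)
        marker = " " if not children else ("-" if first_time else "+")
        rows.append("  " * depth + marker + " " + term)
        if first_time:
            for child in children:
                walk(child, depth + 1)

    walk(root, 0)
    return rows
```

A fuller version would presumably keep expanded whichever occurrence the user
arrived through, rather than simply the first one encountered.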

Polyhierarchy comes in two distinct flavors in the INSPEC Thesaurus. Fluid
dynamics occurs twice under a single TT, mechanics, whereas surface waves
(fluid) occurs twice under the TT mechanics (because it is an NT
of fluid dynamics) and also under a second TT, surface phenomena (which
is not in the mechanics hierarchy). When a term has more than one TT,
the interface software puts the TTs in a pulldown box at the top of the
hierarchical display, allowing you to choose the TT for which you would like to
see the hierarchy. Figure 7 illustrates the term surface waves (fluid)
in the mechanics hierarchy, with the pulldown being used to select the
other TT, surface phenomena. Figure 8 illustrates the term surface
waves (fluid) in the surface phenomena hierarchy.
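
The list of TTs offered in the pulldown can be derived by following BT links
upward until terms with no BT are reached. The sketch below is illustrative:
the names broader and top_terms are hypothetical, and the direct links from
surface waves (fluid) to surface phenomena and from dynamics and fluid
mechanics to mechanics are assumptions, since the text names only the TTs.

```python
from typing import Dict, List, Set

# BT links; the direct links to the two TTs are assumptions (see above).
broader: Dict[str, List[str]] = {
    "surface waves (fluid)": ["fluid dynamics", "surface phenomena"],
    "fluid dynamics": ["dynamics", "fluid mechanics"],
    "dynamics": ["mechanics"],
    "fluid mechanics": ["mechanics"],
}


def top_terms(term: str) -> List[str]:
    """Follow BT links upward; a term with no BT is its own top term."""
    parents = broader.get(term, [])
    if not parents:
        return [term]
    tops: Set[str] = set()
    for parent in parents:
        tops.update(top_terms(parent))
    return sorted(tops)
```

A term reached through several paths to the same TT, such as fluid dynamics,
still yields that TT only once.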

Terms in the thesaurus can appear in more than one hierarchy because they often
fit into more than one conceptual scheme.

Figure 7: The INSPEC descriptor surface waves (fluid) has two
top terms (TTs), mechanics and surface phenomena. Its occurrence
in the mechanics hierarchy is shown, with the surface phenomena
hierarchy about to be selected.

You can navigate through the thesaurus only if you can find a place to enter
it. While the entry vocabulary is sometimes sufficient to provide a launch
point for locating a search term, often you will not find an entry point into
the thesaurus even by typing what seems to be a perfectly reasonable word or
phrase for the subject you seek.

Figure 8: INSPEC Thesaurus display with surface waves (fluid) as the
current term, shown under the surface phenomena hierarchy.

Short of extending the entry vocabulary of the thesaurus, a useful tool
in these circumstances would be a lexical venue into the controlled as well as
entry vocabulary of the thesaurus. KeyWord-Out-of-Context (KWOC, also known as
just "keyword") lists and KeyWord-In-Context (KWIC) lists can be useful for
this purpose.

Figures 9 and 10 illustrate the use of Keyword and Keyword in Context lists to
help find thesaurus terms containing the word stem "computer." The "Keywords"
list continually tries to match the current word that the user is typing with a
word in the keyword database compiled from all INSPEC Thesaurus descriptors and
access vocabulary terms. As such, it can also act as a spell-checker. You can
also use it to transfer a word from the Keywords list to the search entry area
by double-clicking (or dragging-and-dropping). Once you have typed "zeu" for
example, you can transfer the whole word "zeugmatography" into the search form
directly.
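
Matching a partially typed word against a sorted keyword list can be done with
a binary search. The following is a minimal sketch, assuming a lowercase,
in-memory list; the names keywords and first_match are hypothetical, and the
four words are a tiny stand-in for the keyword database.

```python
import bisect
from typing import List, Optional

# Tiny stand-in for the keyword database; all four words appear in the text.
keywords: List[str] = sorted(["computer", "shock", "waves", "zeugmatography"])


def first_match(prefix: str) -> Optional[str]:
    """First keyword, in sorted order, that starts with the typed prefix."""
    i = bisect.bisect_left(keywords, prefix)
    if i < len(keywords) and keywords[i].startswith(prefix):
        return keywords[i]
    return None
```

A result of None for a fully typed word is what lets such a list double as a
spell-checker.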

Returning to the example illustrated in figures 9 and 10, the Keywords in
Context list activates after you type a complete word. In figure 9, the
Keywords in Context list has returned all INSPEC Thesaurus descriptors and
access terms containing the word stem "computer." The list scrolls in case
there are more items in it than can be shown in the list window (both the
Keywords list and Keywords in Context list are in resizable windows).
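
The lookup behind a Keywords in Context list can be sketched as an index from
each word to the terms that contain it. This illustration makes several
assumptions: the names build_word_index and kwic are hypothetical, the term
list is a handful of terms mentioned in this paper, and words are matched
whole rather than by stem.

```python
from collections import defaultdict
from typing import Dict, List

# A few descriptors and access terms mentioned in this paper.
terms: List[str] = [
    "acoustic waves", "computer industry", "elastic waves",
    "Mach number", "shock waves", "surface waves (fluid)",
]


def build_word_index(all_terms: List[str]) -> Dict[str, List[str]]:
    """Map each lowercased word to the terms that contain it."""
    index: Dict[str, List[str]] = defaultdict(list)
    for term in all_terms:
        cleaned = term.lower().replace("(", " ").replace(")", " ")
        for word in cleaned.split():
            if term not in index[word]:
                index[word].append(term)
    return index


def kwic(index: Dict[str, List[str]], word: str) -> List[str]:
    """All terms containing the word, in alphabetical order."""
    return sorted(index.get(word.lower(), []), key=str.lower)


word_index = build_word_index(terms)
```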

In figure 10, computer industry has been selected from the Keywords in
Context list, and the Thesaurus has responded by displaying that term in its
context. In this case computer industry is an access term, the
preferred term being DP industry, as indicated in the note area below
where you would normally enter the term. With the Keywords in Context list in
use, you can scroll through it and click on any number of terms displayed
therein, displaying the thesaurus entry for each in turn.

Figure 9: The full word "computer" has been typed into the text entry box
at upper left. It is displayed in the Keywords list, and all INSPEC
Thesaurus descriptors and access terms containing the word "computer" are shown
in the Keywords in Context list.

Figure 10: The INSPEC access term computer industry has been clicked
on in the Keywords in Context list. The INSPEC Thesaurus display
automatically traces the UF link to the descriptor DP industry.

You can use the Keywords and the Keywords in Context list separately or
together, depending on your situation. If you know of a term in the thesaurus
but aren't sure about how one of its words is spelled, you can use the Keywords
list to check your spelling. If you want to pick out a thesaurus descriptor or
access term from a word contained in it (as in figures 9 and 10), you can use
the Keywords in Context list with or without the Keywords list.

We feel that our current solution to keyword access, with free-floating,
resizable, modelessly accessible Keyword and Keyword in Context lists, offers
the best of both querying and browsing without requiring that each have its own
mode of operation. You can summon or dismiss each list independently at any
point during the typing of your search term, and using them does not complicate
the search form (you access the lists through a floating control palette, not
shown in figures 9 and 10). An added bonus to this generic application of
Keyword and Keyword in Context lists is that they can also be used, with the
same programming code and same interface elements, for other kinds of searching
such as title, author, or full text. The list software merely has to know to
switch keyword lists when you decide to use a different search method. And
once you learn how to use Keyword and Keyword in Context lists in one kind of
search, you know how to use them in any other kind of search as well, because
the lists look and work exactly the same in each case.

Figure 11: Dragging the INSPEC Thesaurus descriptor research
initiatives to the Hold File, and dropping it there.

In earlier designs for the interface, the Keywords and Keywords in
Context lists were a fixed part of the thesaurus interface, with fixed places
on the screen and several modes of display. This produced several problems
that made the use of the interface unsatisfactory.

First, it was difficult, from a design standpoint, to decide where best to
place the respective list displays. Given how we wanted the hierarchical and
related term displays to work, there was no good place to put them to begin
with, and we even considered at one point doing without the lists altogether.
The only solution that kept the lists as part of the thesaurus display form
would have been to elongate the form so much that it could be used only on a
1024x768 or higher resolution screen. That was not in keeping with our
purpose of providing an interface that would work on public terminals with
screen resolutions of only 640x480, as well as on notebook computers.

Second, as useful as Keywords and Keywords in Context lists may be, you don't
always want to use them. This is why almost all OPACs and other information
retrieval systems maintain the distinction in their interfaces between browsing
and querying: you should be able to browse when you want to, but when you know
an item is there, or know the correct subject heading for your search, you just
don't want to be bothered with scrolling through lists of choices [7]. Even
when, as with our system, you can still type your request when the Keywords
and/or the Keywords in Context lists are displayed, they are still a
distraction when you don't want to use them.

Third, realities of server loads and network traffic mean that during periods
of peak use, or when using a keyword server located some distance from your
client machine, the performance of both the Keywords and Keywords in Context
lists can be quite sluggish. This is tolerable if you really need them to help
you spell a word or check for the existence of a heading or phrase before
submitting a query to an even more heavily loaded bibliographic database
server. But putting up with a sluggish keyword list that encumbers your typing
when you don't need it is intolerable.

Thesauri such as the one provided by INSPEC provide a richly interlinked
conceptual environment which, when brought to life by our interface, allows you
to quickly broaden, narrow, or go sideways (via RTs) through a wide range of
concepts as quickly as you can see them on the screen. The large number of
subject descriptors visible at any time, however, engenders the Art Museum
Phenomenon, where peripherally interesting terms can distract you from your
present search goal. Like a visit to an art museum, in this environment you
can easily get sidetracked by the sheer number of possibilities for a search
term, and end up spending a great deal of time wandering around in it without
finding what you came for [3, 5].

To combat this phenomenon without limiting the number of displayed terms or the
means to navigate the thesaurus, we have implemented a generic tool called the
"Hold File," into which you can place thesaurus terms (as well as free text
descriptors, author names, and other bibliographic information objects) for
later use. The metaphor is that of an index card file into which you can place
search items that you might want to try later.

Figure 11 illustrates use of the Hold File. In this example, the descriptor
"research initiatives" is of interest, but not immediately. So as not to lose
track of it while continuing a search for something else, you can use the mouse
to drag it into the Hold File (represented by a card file icon) on the main
search form (where queries and short record lists are displayed). Dragging a
subject descriptor to the Hold File stores a copy of the descriptor there, and
does not remove it from the thesaurus display.

Like the Keyword and Keyword in Context lists, the Hold File is a generic
interface tool. It can hold title and full-text keywords and author names as
well as thesaurus descriptors, and you use it the same way for each. The Hold
File automatically keeps each type of data in a separate list, but also allows
you to move items between lists, as when you want to use a thesaurus descriptor
in a title or full-text search.
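
The behavior of the Hold File described above can be modeled as one list per
kind of item, with operations to drop in a copy and to move an item between
lists. The sketch below is an illustration only: the class name HoldFile, its
method names, and the kind labels "subject" and "title" are assumptions, not
names from the interface software.

```python
from collections import defaultdict
from typing import DefaultDict, List


class HoldFile:
    """Keeps held items in one list per kind (e.g. 'subject', 'title')."""

    def __init__(self) -> None:
        self.lists: DefaultDict[str, List[str]] = defaultdict(list)

    def drop(self, kind: str, item: str) -> None:
        """Store a copy of the item; the source display is left untouched."""
        if item not in self.lists[kind]:
            self.lists[kind].append(item)

    def move(self, item: str, src: str, dst: str) -> None:
        """Move an item between lists, e.g. from 'subject' to 'title'."""
        self.lists[src].remove(item)  # raises ValueError if not held
        self.drop(dst, item)
```

For example, dropping "research initiatives" into the "subject" list and later
moving it to the "title" list corresponds to reusing a thesaurus descriptor in
a title search.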

The INSPEC Thesaurus interface described in this article was implemented in
Microsoft Visual Basic 3.0 and runs under Microsoft Windows. The program which
converts the INSPEC ISO 2709 records (format for bibliographic information
interchange on magnetic tape, based on the Library of Congress MARC format)
into the optimized database format used by the interface was also implemented
in Visual Basic 3.0.

Development of both the thesaurus interface and the off-line processing
software will continue for the length of the Digital Library project. There
are several problems in particular that need further work [4]. The RT
placement algorithm, for instance, does not yet consistently distribute terms
of arbitrary lengths in a way which appears spatially balanced.

Preliminary usability studies have been done to record user problems and
acceptance of display features and tools (e.g., Keyword lists, Keyword in
Context lists, and ways of transferring information objects between interface
forms, such as drag and drop). Findings from these studies have helped, and
will continue to help, us improve the interface. Among other preliminary findings,
users seem to prefer looking at the "cloud" of RTs rather than the subject
hierarchy, even though to us the hierarchy seems to contain more interesting
and useful information. Some users even question what the hierarchy is, and
try to navigate using only the RT cloud ("cloud" is the word users most often
used when asked to describe the RT display). The "Done" button also seems to
need a different label.

The INSPEC Thesaurus with this interface is only a portion of our plan to
create a searcher's thesaurus. We also intend to display an automatic
thesaurus made from term co-occurrences in abstracts and full text documents in
the Digital Library Initiative testbed. We think linkages between these two
types of thesauri will provide more lead-in vocabulary for the user and direct
access to portions of the full-text documents. Navigation through the document
retrieval space will be hypertextual as well.

We also plan to apply this interface to other science and engineering thesauri
besides INSPEC and to thesauri in other disciplines. This will allow us to
evaluate its robustness and general applicability to features of other
thesauri.