ABSTRACT

Document selection is an integral element of every information seeking task. Each act of selection, in turn, implies a preliminary user judgment that the document being retrieved may be relevant to meeting a given information need. The present paper is based on the hypothesis that user retrieval behavior represents a valid measure of those preliminary judgments. It proposes the creation of an adaptive hypertext system that is capable of recording user retrieval behavior and reranking result sets based on prior retrieval history. Analysis of preliminary data indicates that document lists subjected to such reranking are generally judged by users as superior to unranked lists of the same documents. Several suggestions for additional validation, deployment, and future research are included, particularly regarding the system's use as a collaborative tool for homogeneous groups of highly technical users.

Introduction

Within every information retrieval (IR) task, users must select documents for review. Document selection, in turn, is a complex, multideterminate process. Selection is influenced by a wide array of multidimensional, idiosyncratic user judgments regarding the documents under consideration. Indeed, every user judgment regarding document selection is itself situated within a larger context of a dynamic, opportunistic process of information seeking. (See Schamber, 1994, for a detailed discussion of the inherent complexity of user relevance judgments.)

The myriad factors affecting users' judgments about the documents they review make the task of understanding and exploiting those judgments seem impossibly large. But successful document selection is literally an everyday occurrence. Thus for all the complexity and apparent murkiness of the underlying judgments, users are quite capable of performing the task, even if they can't always elucidate it.

One necessary step in the process of document selection is a user's preliminary judgment about each document's likely suitability for addressing a specific information need. Like every other human behavior, the process of document selection is unquestionably imperfect. The process is rife with "false positives." Users select and retrieve a document, only to decide upon examination that it does not contribute in a substantive way to filling their information need. "False negatives," while impossible to measure with any precision, are also an everyday part of the process of document selection.

However, document selection is also unquestionably purposeful, motivated behavior. Implicit in the goal-directed nature of document selection is at least a suggestion of a user's preliminary judgment about the document in question. Indeed, the simple behavioral act of segregating documents into two groups--scrutinize these/don't scrutinize those--implies both the existence and the application of a user's preliminary judgment regarding members of the document pool. By implication, this process of segregation promises observable behavioral correlates to user judgments about the documents under review.

Among possible candidate behaviors, document retrieval is one of the more likely surrogates for capturing user judgment. Precisely because the act of retrieval is undeniably nonrandom behavior, it is reasonable to hypothesize that retrieval is at least partially correlated with users' preliminary judgments regarding candidate documents.

Document retrieval is also a discrete behavior and is thus easily observable and, within certain contexts, equally easy to measure and record. Within the context of hypertext systems, for example, document retrieval can be operationalized with a simple click of a hyperlink. The mere existence of the hyperlink between two documents implies some extant relationship between them, and a user's choice to follow a particular link suggests that he or she believes that that relationship may be relevant to an information need.

Background

Indeed, several connectionist models of semantic relatedness have been proposed. Collins and Loftus' (1975) spreading activation theory proposes a measure of semantic relatedness based on interactions among nodes in network-based knowledge systems. More recently, distributed network theory also represents concepts as patterns of overlapping activity among semantic network elements. (See Plaut, 1995, for a discussion of distinctions between spreading activation and distributed network theories.)

By implication, it should be possible to create a hybrid hypertext/IR system that is capable of monitoring and capturing users' retrieval behavior, simply by recording their use of available hyperlinks. If document retrieval is indeed a valid measure of user judgments, then, by inference, capturing retrieval behavior would allow a system to indirectly capture user judgments about the candidate documents under scrutiny.

Regardless of its specific placement within the larger system, such a tool would constitute what Bruza (1990) calls a hyperindex. That is, it is a hypertext-based intermediary index layer, inserted between a result set and the user. Precisely because it is "middleware," such a tool would be capable of performing its own independent modifications on the result sets presented to users.

Among the more intriguing implications for such a hyperindex is that it could alter its own behavior, over time, based on an accumulated history of recorded user behaviors. It could, for example, rerank query results presented to users, based on the frequency with which various documents have been retrieved by other users. More frequently fetched documents would appear "higher" within the result set returned by the system. If the surest test of usability is use, then a self-modifying hyperindex promises potentially significant improvements in IR system usability.

The growth of the World Wide Web greatly simplifies the task of testing and deploying hypertext-based research tools. Thus it's unsurprising that the Web contains at least one example of an experimental hypertext system aimed at capturing user judgments regarding semantic relatedness. The Adaptive Hypertext Experiment (Bollen and Heylighen, 1996a), found at Principia Cybernetica Web, is a self-modifying hypertext system that monitors and records user judgments about the relatedness of word pairs within a series of 150 nouns (Bollen and Heylighen, 1996b). Within the context of the system, more frequently traversed links are taken as indicative of stronger semantic relationships between any two terms.

The results of this experiment (Bollen and Heylighen, 1996c) are mixed but intriguing. Based on its prior use, the system is often able to recapitulate users' associations among obviously related concepts. For instance, the word research activates such obviously related terms as question, idea, and problem. Even more intriguing are the more subtle semantic relationships identified among various terms: research also activates doubt, experience, and language. These results offer a suggestion that an adaptive hypertext system may indeed be able to capture and model complex subtleties of user judgment, based solely on observed user behavior.

An adaptive hyperindex may be subject to inevitable confounds. Users in a "berrypicking" mode (Bates, 1989) may follow extraneous links. Indeed, as suggested by Harter (1992), users' information needs and therefore their relevance judgments are themselves subject to change, based on the documents reviewed. All such changes are necessarily invisible to the automated components of any IR system.

Ultimately, such confounds are an unavoidable element of any naturalistic capture of user behavior. By definition, a system that alters its operation based on user behavior will incorporate some of the imperfections of that behavior. But the system need not be flawless to be useful and valid. One key method for validating an adaptive hyperindex would be to compare its results to those of IR tools already in common, everyday use.

Currently, many ubiquitously available IR systems (Web search engines, for example) compute "relevance" rankings based primarily on word count, document structure, or both (AltaVista, 1997; Excite, 1997). Yet word count and structure-driven approaches to document ranking are both inherently limited. Rankings based solely on word count are rightly suspect; the ranking of a given document within the result set becomes inevitably correlated with document length. To the extent that document length is a random variable, length-based rankings must inevitably mirror some element of that randomness. Similarly, rankings based on document structure are heuristic in nature, and thus their utility is necessarily limited by the validity of the heuristic(s) employed. In contrast, an adaptive system that dynamically ranks query results based on frequency of prior use would rely solely on user behavior and thus, indirectly, user judgment.

Method

The present paper explores the feasibility of applying principles and methods of adaptive hypertext to IR systems. More specifically, its goal is to illuminate both practical and methodological details of creating and validating an adaptive hypertext intermediary for IR systems.

At its core, such an intermediary would simply monitor and record users' retrieval of documents within a given set of query results. Each new retrieval would increment a query-specific retrieval counter. On subsequent uses, more frequently accessed documents would appear "higher" on the list returned to users. Of course, users frequently execute searches based on multiple search terms. Indeed, the hyperindex experiments of Bruza and Dennis (1996) demonstrate that query reformulation (including addition, deletion, and substitution of search terms) is among the most common behaviors of people using Web search engines.
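
The basic counter-and-rerank mechanism described above can be illustrated with a minimal sketch. The class name RetrievalLog, its method names, and the document identifiers are hypothetical, chosen only for illustration; the sketch also assumes that ties are broken by preserving the original order of the result set, a detail the present proposal does not specify.

```python
from collections import defaultdict


class RetrievalLog:
    """Hypothetical store of query-specific retrieval counters."""

    def __init__(self):
        # counts[query][doc_id] -> number of recorded retrievals
        self.counts = defaultdict(lambda: defaultdict(float))

    def record(self, query, doc_id):
        """Increment the counter each time a user retrieves doc_id for query."""
        self.counts[query][doc_id] += 1

    def rerank(self, query, results):
        """Return results ordered by descending retrieval count.

        sorted() is stable, so documents with equal counts keep the
        order in which the underlying search engine returned them
        (an assumption of this sketch).
        """
        hits = self.counts.get(query, {})
        return sorted(results, key=lambda doc_id: -hits.get(doc_id, 0))


log = RetrievalLog()
log.record("hypertext", "doc_c")
log.record("hypertext", "doc_c")
log.record("hypertext", "doc_b")
print(log.rerank("hypertext", ["doc_a", "doc_b", "doc_c"]))
```

A query with no recorded history is returned unchanged, so the intermediary would degrade gracefully to the search engine's own ordering.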

Thus an adaptive hyperindex would also have to be capable of distributing a single act of retrieval among multiple search terms. One possibility is to adapt the algorithm used by Bollen and Heylighen (1996b). In their Adaptive Hypertext Experiment, when a user follows links from term A, through B, and finally to C, a "full strength" link is created between terms A & B, as well as between terms B & C. An additional "partial-strength" link between terms A and C is also created. Indeed, the use of these intermediate-strength links may be an integral part of their system's apparent ability to capture subtle, complex user judgments.
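
The path-based reinforcement described above might be sketched as follows. This is an illustrative reading, not Bollen and Heylighen's actual implementation: the numeric values of FULL_STRENGTH and PARTIAL_STRENGTH, the treatment of links as directed pairs, and the function name reinforce_path are all assumptions.

```python
from collections import defaultdict

# Both increments are assumed values; the source specifies only that the
# A-C link is weaker than the A-B and B-C links.
FULL_STRENGTH = 1.0
PARTIAL_STRENGTH = 0.5


def reinforce_path(weights, path):
    """Strengthen links along a traversal such as ["A", "B", "C"].

    Adjacent terms receive a full-strength increment; terms two steps
    apart receive a partial-strength increment. Links are stored as
    directed (source, target) pairs, which is an assumption.
    """
    for i in range(len(path) - 1):
        weights[(path[i], path[i + 1])] += FULL_STRENGTH
    for i in range(len(path) - 2):
        weights[(path[i], path[i + 2])] += PARTIAL_STRENGTH
    return weights


weights = reinforce_path(defaultdict(float), ["A", "B", "C"])
print(dict(weights))
```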

The simplest solution, at least initially, would be to allow the adaptive hyperindex to partition the strength of users' judgments equally among the search terms entered. Because every search involves an arbitrary number of search terms (N), the system could simply assign a value of 1/N "hits" to each of the constituent search terms.
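
A minimal sketch of this equal partitioning follows. The function names, the removal of duplicate terms within a query, and the use of a simple sum to score a document against a later multi-term query are assumptions made for illustration; only the 1/N allocation itself comes from the proposal above.

```python
from collections import defaultdict


def record_retrieval(term_hits, search_terms, doc_id):
    """Credit each of the N distinct search terms with 1/N of a hit."""
    terms = list(dict.fromkeys(search_terms))  # drop duplicates, keep order
    if not terms:
        return
    share = 1.0 / len(terms)
    for term in terms:
        term_hits[term][doc_id] += share


def score(term_hits, search_terms, doc_id):
    """Assumed scoring rule: sum the document's hits over the query terms."""
    return sum(
        term_hits[term].get(doc_id, 0.0)
        for term in set(search_terms)
        if term in term_hits
    )


term_hits = defaultdict(lambda: defaultdict(float))
record_retrieval(term_hits, ["adaptive", "hypertext"], "doc1")  # 1/2 each
record_retrieval(term_hits, ["hypertext"], "doc1")              # 1/1
print(score(term_hits, ["adaptive", "hypertext"], "doc1"))
```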

A second set of processing accommodations is likely to be necessary to correct for volatility of information, and to allow the system to correct for user error--instances where a user simply follows the wrong link. This corrective could take the commonly used form of a temporal decay factor, introduced into the processing algorithm. Links would be allowed to weaken with time, resulting in a natural "pruning" effect that would help to screen out false trails, along with outmoded or obsolete documents. Indeed, even the original theory of spreading activation calls for semantic connections to decay over time.
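
One possible form for such a decay factor is sketched below. The proposal does not commit to a particular decay function; exponential decay with a fixed half-life, the 30-day half-life, the pruning threshold of 0.05, and the function names are all assumptions adopted for illustration.

```python
def decayed_weight(weight, elapsed_days, half_life_days=30.0):
    """Halve a link's weight every half_life_days (assumed exponential decay)."""
    return weight * 0.5 ** (elapsed_days / half_life_days)


def prune(links, now, half_life_days=30.0, threshold=0.05):
    """Apply decay to every link and drop those that fall below threshold.

    links maps a link identifier to (weight, day_of_last_update);
    the returned dictionary uses the same layout, stamped with now.
    """
    kept = {}
    for link, (weight, last_update) in links.items():
        current = decayed_weight(weight, now - last_update, half_life_days)
        if current >= threshold:
            kept[link] = (current, now)
    return kept


links = {"query->doc_a": (1.0, 0), "query->doc_b": (0.05, 0)}
print(prune(links, now=30))
```

Under these assumed settings, a link followed once in error fades below the threshold unless later users reinforce it, which is the "pruning" effect described above.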

In a preliminary attempt to validate the technique of the adaptive hyperindex, six information science graduate students were recruited for participation in a pilot study, as part of a classroom demonstration of the technique. Each subject was asked to identify a topic for a mock information retrieval task, which was then submitted to a database of 13 mock documents, indexed by arbitrary keywords.

In the first phase of the study, each subject's keyword was submitted as a query to the document database, and the resulting matching documents were displayed in alphabetical order. Each subject was then asked to select the document most relevant to his or her topic. Once all subjects had completed this first phase, the order of presentation of candidate documents was reranked, based on the accumulated history of document selection. In the second phase of this study, each query was resubmitted to the database, and each subject was asked to judge the ranked query results as either superior to, identical to, or inferior to the unranked list.

Results

Preliminary analysis of these data indicates that most users judged the query results ranked by frequency of selection superior to the unranked list. Four subjects judged the ranked list superior, while two judged the results to be identical. No subjects judged the ranked list to be inferior to the unranked list. While the small N of this pilot study precludes statistical significance, analysis of these results yielded an obtained binomial probability value of .10.

Discussion

There are several additional possibilities for validating the technique of the adaptive hyperindex. All are predicated on the use of a controlled population of documents, and the existence of a set of query results to which the techniques of the adaptive hyperindex have been applied. For example, subjects could be assigned randomly to one of two conditions: control subjects would be presented with the raw, unprocessed result set, while the experimental group would be presented with lists that had previously been subjected to the adaptive hyperindex technique. The actual documents in both lists would be otherwise identical. Subjects could be assigned a time-constrained information-synthesis task, requiring the identification and retrieval of multiple related documents. The resulting protocols could then be analyzed for accuracy and comprehensiveness. If the method of the adaptive hyperindex is valid, then one would predict that experimental subjects' protocols would be both more accurate and more complete.

Subject to empirical validation, the adaptive hyperindex could be deployed in any one of several server-side forms. (As used here, the term "server" is intended to identify a remote repository of information, rather than to describe an actual system architecture.) For example, the hyperindex could be integrated directly into the search engine itself, or inserted as a server-side intermediary. It could also be deployed as true "middleware"--inserted as an intermediary layer between the search engine's server and users' client software (e.g., the Metacrawler search tool).

Once deployed, the system could monitor users' document retrieval in real time, and offer suggestions for new candidate documents, based on the documents already selected by the user. It could also, upon user request, provide a listing of the most frequently retrieved documents for a given topic or search term.
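
The suggestion function described above could be approximated with simple co-retrieval counts, as in the following sketch. The notion of a "session" (the set of documents one user retrieved together), the function names, and the alphabetical tie-breaking rule are assumptions; the proposal itself does not prescribe a particular recommendation method.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_co_retrieval(sessions):
    """Count how often each pair of documents was retrieved in the same session."""
    co = defaultdict(Counter)
    for session in sessions:
        for a, b in combinations(sorted(set(session)), 2):
            co[a][b] += 1
            co[b][a] += 1
    return co


def suggest(co, selected, k=3):
    """Suggest up to k unselected documents most often co-retrieved with selected."""
    scores = Counter()
    for doc in selected:
        for other, n in co.get(doc, {}).items():
            if other not in selected:
                scores[other] += n
    # Highest count first; ties broken alphabetically (an assumption).
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [doc for doc, _ in ranked[:k]]


sessions = [["d1", "d2", "d3"], ["d1", "d2"], ["d2", "d4"]]
co = build_co_retrieval(sessions)
print(suggest(co, ["d1"]))
```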

Once validated, an adaptive hyperindex would be extremely easy to implement at a very low cost. It would also present several immediate and obvious practical benefits. The spiraling growth of the World Wide Web makes document selection an increasingly cumbersome task for users presented with hundreds or even thousands of search hits. But a self-restructuring hypertext intermediary could offer a "discount" means for capturing and recapitulating other users' judgments.

Adaptive hyperindex technologies may prove to be especially effective as collaborative tools within relatively homogeneous user populations, where extensive subject matter expertise improves the statistical reliability of relevance judgments. For example, it may be possible to deploy the technology in conjunction with intranets and with IR systems used primarily or exclusively by specialists in highly technical disciplines, such as law or engineering. In such contexts an adaptive hyperindex may prove to be an invaluable tool for collaborative knowledge development. Indeed, Salton and Buckley (1988) note that interdocument associations are generally more valid when based on locally-defined, context-dependent relationships among specific documents within controlled collections.

Among the more intriguing applications for an adaptive hyperindex is as a tool for relevance feedback, in that it relies on other users' judgments for what constitutes "more like this." Of course, the adaptive hyperindex may prove incapable of capturing and recapitulating collective judgments from large user populations. But even then, it may still be possible to employ the technology on an individual basis, as a personalized search intermediary. Indeed, elementary prototypes of adaptive "bookmark" tools for Web browsers have already been tested (Debevc and Spasovski, 1996).

However, full-scale client-side implementation of adaptive hyperindex technology would require extensive negotiation and communication between the client and server regarding the candidate documents. Therefore, client-side deployment of a fully functional adaptive hyperindex is necessarily contingent upon the creation of a nonproprietary, standardized language for metadata communication. Several draft proposals for metadata communications standards, such as the World Wide Web Consortium's draft Resource Description Framework (W3C, 1997), are currently under active review.

Ultimately, optimally effective and usable IR systems must find a way to mirror users' inherent ability to make judgments about candidate documents. And all such designs must necessarily rely on behavioral surrogates for those user judgments. Thus the surest foundation for the creation of effective IR systems lies with empirically validated methods for capturing and incorporating user behavior into system design.

References

Debevc, M. and Spasovski, J. (1996). Simplified decision algorithm for document filtering on the WWW. In Proceedings of the Fifth International Conference on User Modelling: Workshop on User Modelling for Information Filtering on the World Wide Web [Online]. Available: http://teja.uni-mb.si/personal/matjaz/problem.html [1997, 1 November].