TOIS 2001 Volume 19 Issue 1

Techniques for automatic query expansion from top retrieved documents have
shown promise for improving retrieval effectiveness on large collections;
however, they often rely on empirical grounds, and there is a shortage of
cross-system comparisons. Using ideas from Information Theory, we present a
computationally simple and theoretically justified method for assigning scores
to candidate expansion terms. Such scores are used to select and weight
expansion terms within Rocchio's framework for query reweighting. We compare
ranking with information-theoretic query expansion versus ranking with other
query expansion techniques, showing that the former achieves better retrieval
effectiveness on several performance measures. We also discuss the effect on
retrieval effectiveness of the main parameters involved in automatic query
expansion, such as data sparseness, query difficulty, number of selected
documents, and number of selected terms, pointing out interesting
relationships.
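
A minimal Python sketch of this kind of scoring, not the authors' method as
published: it assumes the information-theoretic score of a term is its
contribution to the Kullback-Leibler divergence between the term distribution
in the top retrieved documents and in the whole collection, and it folds the
selected terms into the query with a simplified Rocchio-style update. The
function names, the `alpha` and `beta` parameters, and the normalization by the
best score are all illustrative assumptions.

```python
import math
from collections import Counter


def kld_scores(top_docs, collection_docs):
    """Score terms of the top documents by their KL-divergence contribution.

    Documents are lists of tokens. Assumes every top document is also part
    of the collection, so each scored term has non-zero collection frequency.
    """
    top = Counter(t for doc in top_docs for t in doc)
    coll = Counter(t for doc in collection_docs for t in doc)
    n_top, n_coll = sum(top.values()), sum(coll.values())
    scores = {}
    for term, freq in top.items():
        p_top = freq / n_top
        p_coll = coll[term] / n_coll
        scores[term] = p_top * math.log(p_top / p_coll)
    return scores


def rocchio_expand(query_weights, scores, n_terms=2, alpha=1.0, beta=0.5):
    """Add the n_terms best-scoring terms to the query (simplified Rocchio)."""
    best = sorted(scores, key=scores.get, reverse=True)[:n_terms]
    expanded = {t: alpha * w for t, w in query_weights.items()}
    top_score = max((scores[t] for t in best), default=0.0)
    if top_score <= 0:
        return expanded  # nothing informative to add
    for t in best:
        # Expansion weights are scaled into (0, beta] by the best score.
        expanded[t] = expanded.get(t, 0.0) + beta * scores[t] / top_score
    return expanded
```

Terms that are frequent in the top documents but rare in the collection get
the highest scores, which is the intuition behind using a divergence measure
for term selection.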

This paper presents a formal definition for HMBS (Hypermedia Model Based on
Statecharts). HMBS uses the structure and execution semantics of statecharts to
specify both the structural organization and the browsing semantics of
hypermedia applications. Statecharts are an extension of finite-state machines
and the model is thus a generalization of hypergraph-based hypertext models.
Some of the most important features of HMBS are its ability to model hierarchy
and synchronization of information, and its provision of mechanisms for
specifying access structures, navigational contexts, access control, multiple
tailored versions, and hierarchical views. Analysis of the underlying statechart machine
allows verification of page reachability, valid paths, and other properties,
thus providing mechanisms to support authors in the development of structured
applications.
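
The kind of reachability check mentioned above can be sketched in Python as
follows. This is a deliberate simplification and not HMBS itself: it treats
the application as a flat finite-state machine, ignoring the hierarchy and
concurrency that statecharts add, and the `transitions` layout and state names
are hypothetical.

```python
from collections import deque


def reachable_states(transitions, start):
    """Return every state reachable from `start` by breadth-first search.

    `transitions` maps a state to a list of (event, target_state) pairs.
    """
    seen = {start}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for _event, target in transitions.get(state, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


def unreachable_pages(transitions, start):
    """List states that appear in the model but cannot be reached."""
    all_states = set(transitions)
    for edges in transitions.values():
        all_states.update(target for _event, target in edges)
    return sorted(all_states - reachable_states(transitions, start))
```

An author could run such a check after editing the model to find pages that no
browsing path leads to.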

This paper proposes a framework for the handling of spatio-temporal queries
with inexact matches, using the concept of relation similarity. We initially
describe a binary string encoding for 1D relations that permits the automatic
derivation of similarity measures. We then extend this model to various
granularity levels and many dimensions, and show that reasoning on
spatio-temporal structure is significantly facilitated in the new framework.
Finally, we provide algorithms and optimization methods for four types of
queries: (i) object retrieval based on some spatio-temporal relations with
respect to a reference object, (ii) spatial joins, i.e., retrieval of object
pairs that satisfy some input relation, (iii) structural queries, which
retrieve configurations matching a particular spatio-temporal structure, and
(iv) special cases of motion queries. Considering the current large
availability of multidimensional data and the increasing need for flexible
query-answering mechanisms, our techniques can be used as the core of
spatio-temporal query processors.

TOIS 2001 Volume 19 Issue 2

The proliferation of searchable text databases on corporate networks and the
Internet causes a database selection problem for many people. Algorithms such
as gGLOSS and CORI can automatically select which text databases to search for
a given information need, but only if given a set of resource descriptions that
accurately represent the contents of each database. The existing techniques for
acquiring resource descriptions have significant limitations when used in
wide-area networks controlled by many parties. This paper presents query-based
sampling, a new technique for acquiring accurate resource descriptions.
Query-based sampling does not require the cooperation of resource providers,
nor does it require that resource providers use a particular search engine or
representation technique. An extensive set of experimental results demonstrates
that accurate resource descriptions are created, that computation and
communication costs are reasonable, and that the resource descriptions do in
fact enable accurate automatic database selection.
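
The general idea of sampling a database through its ordinary search interface
can be sketched in Python as below. This is an illustrative sketch under
stated assumptions, not the procedure evaluated in the paper: `ToyDatabase`
stands in for a remote, uncooperative database, and the choices of single-term
queries, random selection of the next query term from the vocabulary learned
so far, and document frequency as the resource description are assumptions.

```python
import random
from collections import Counter


class ToyDatabase:
    """Hypothetical stand-in for a database reachable only via search."""

    def __init__(self, docs):
        self.docs = docs  # each document is a list of tokens

    def search(self, term, k):
        """Return up to k (doc_id, tokens) pairs, ranked by term frequency."""
        hits = [(doc.count(term), i)
                for i, doc in enumerate(self.docs) if term in doc]
        hits.sort(key=lambda h: (-h[0], h[1]))
        return [(i, self.docs[i]) for _count, i in hits[:k]]


def query_based_sample(db, seed_term, max_docs, docs_per_query=2,
                       max_queries=100, rng=None):
    """Build a document-frequency description from sampled documents."""
    rng = rng or random.Random(0)
    seen, used = set(), set()
    description = Counter()
    term = seed_term
    for _ in range(max_queries):
        if len(seen) >= max_docs:
            break
        used.add(term)
        for doc_id, tokens in db.search(term, docs_per_query):
            if doc_id not in seen and len(seen) < max_docs:
                seen.add(doc_id)
                description.update(set(tokens))  # count each doc once
        # Pick the next query term from what has been learned so far.
        candidates = sorted(set(description) - used)
        if not candidates:
            break
        term = rng.choice(candidates)
    return description, seen
```

Only the public search interface is used, which is why no cooperation from the
resource provider is needed in this sketch.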

Today, when searching for information on the WWW, one usually performs a
query through a term-based search engine. These engines return, as the query's
result, a list of Web pages whose contents match the query. For broad-topic
queries, such searches often result in a huge set of retrieved documents, many
of which are irrelevant to the user. However, much information is contained in
the link-structure of the WWW. Information such as which pages are linked to
others can be used to augment search algorithms. In this context, Jon Kleinberg
introduced the notion of two distinct types of Web pages: hubs and authorities.
Kleinberg argued that hubs and authorities exhibit a mutually reinforcing
relationship: a good hub will point to many authorities, and a good authority
will be pointed at by many hubs. In light of this, he devised an algorithm
aimed at finding authoritative pages. We present SALSA, a new stochastic
approach for link-structure analysis, which examines random walks on graphs
derived from the link-structure. We show that both SALSA and Kleinberg's Mutual
Reinforcement approach employ the same meta-algorithm. We then prove that SALSA
is equivalent to a weighted in-degree analysis of the link-structure of WWW
subgraphs, making it computationally more efficient than the Mutual
Reinforcement approach. We compare the results of applying SALSA to the
results derived through Kleinberg's approach. These comparisons reveal a
topological phenomenon called the TKC effect which, in certain cases, prevents
the Mutual Reinforcement approach from identifying meaningful authorities.
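
To illustrate the two views contrasted above, the following Python sketch runs
a basic power iteration for mutually reinforcing hub and authority scores and,
separately, computes normalized in-degree. It is not the authors' code: the
edge-list representation, function names, and iteration count are assumptions,
and `indegree_authority` omits any per-component weighting, so it reflects the
weighted in-degree view only for a non-empty graph whose authorities form a
single connected component.

```python
def mutual_reinforcement(edges, iterations=50):
    """Hub and authority scores by repeated mutual reinforcement.

    `edges` is a list of (source, target) links.
    """
    nodes = sorted({n for edge in edges for n in edge})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # A good authority is pointed at by good hubs.
        auth = {n: sum(hub[s] for s, t in edges if t == n) for n in nodes}
        # A good hub points to good authorities.
        hub = {n: sum(auth[t] for s, t in edges if s == n) for n in nodes}
        a_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {n: v / a_norm for n, v in auth.items()}
        hub = {n: v / h_norm for n, v in hub.items()}
    return hub, auth


def indegree_authority(edges):
    """Authority score as the fraction of all links pointing at a node."""
    nodes = sorted({n for edge in edges for n in edge})
    total = len(edges)  # assumes at least one link
    return {n: sum(1 for _s, t in edges if t == n) / total for n in nodes}
```

The in-degree scores need a single pass over the links, whereas the mutual
reinforcement scores need repeated passes until they stabilize.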

The use of markup languages like SGML, HTML or XML for encoding the
structure of documents or linguistic data has led to many databases where
entries are adequately described as trees. In this context, querying formalisms
that can refer both to textual content and to logical structure are of
particular interest. We consider models where the structure specified in a query
is not only used as a filter, but also for selecting and presenting different
parts of the data. If answers are formalized as mappings from query nodes to the
database, a simple enumeration of all mappings in the answer set will often
suffer from the effect that many answers have common subparts. From a
theoretical point of view this may lead to an exponential time complexity of
the computation and presentation of all answers. Concentrating on the language
of so-called tree queries -- a variant and extension of Kilpelainen's Tree
Matching formalism -- we introduce the notion of a "complete answer aggregate"
for a given query. This new data structure offers a compact view of the set of
all answers and supports active exploration of the answer space. Since complete
answer aggregates use a powerful structure-sharing mechanism, their maximal size
is of order O(d*h*q), where d and q respectively denote the size of the
database and the query, and h is the maximal depth of a path of the database.
An algorithm is given that computes a complete answer aggregate for a given
tree query in time O(d*log(d)*h*q). For the sublanguage of so-called rigid
tree queries, as well as for so-called "nonrecursive" databases, an improved
bound of O(d*log(d)*q) is obtained. The algorithm is based on a specific
index structure that supports practical efficiency.

TOIS 2001 Volume 19 Issue 3

We identify crucial design issues in building a distributed inverted index
for a large collection of Web pages. We introduce a novel pipelining technique
for structuring the core index-building system that substantially reduces the
index construction time. We also propose a storage scheme for creating and
managing inverted files using an embedded database system. We suggest and
compare different strategies for collecting global statistics from distributed
inverted indexes. Finally, we present performance results from experiments on a
testbed distributed Web indexing system that we have implemented.

The wealth of information on the web makes it an attractive resource for
seeking quick answers to simple, factual questions such as "who was the first
American in space?" or "what is the second tallest mountain in the world?" Yet
today's most advanced web search services (e.g., Google and AskJeeves) make it
surprisingly tedious to locate answers to such questions. In this paper, we
extend question-answering techniques, first studied in the information
retrieval literature, to the web and experimentally evaluate their performance.
First we introduce Mulder, which we believe to be the first general-purpose,
fully-automated question-answering system available on the web. Second, we
describe Mulder's architecture, which relies on multiple search-engine queries,
natural-language parsing, and a novel voting procedure to yield reliable
answers coupled with high recall. Finally, we compare Mulder's performance to
that of Google and AskJeeves on questions drawn from the TREC-8 question
answering track. We find that Mulder's recall is more than a factor of three
higher than that of AskJeeves. In addition, we find that Google requires 6.6
times as much user effort to achieve the same level of recall as Mulder.

WebQuilt is a web logging and visualization system that helps web design
teams run usability tests (both local and remote) and analyze the collected
data. Logging is done through a proxy, overcoming many of the problems with
server-side and client-side logging. Captured usage traces can be aggregated
and visualized in a zooming interface that shows the web pages people viewed.
The visualization also shows the most common paths taken through the web site
for a given task, as well as the optimal path for that task, as designated by
the designer. This paper discusses the architecture of WebQuilt and describes
how it can be extended for new kinds of analyses and visualizations.

In recent years, the World Wide Web has shown enormous growth in size. Vast
repositories of information are available on practically every possible topic.
In such cases, it is valuable to perform topical resource discovery
effectively. Consequently, several new ideas have been proposed in recent
years; among them, a key technique is focused crawling, which is able to crawl
particular topical portions of the World Wide Web quickly, without having to
explore all web pages. In this paper, we propose the novel concept of
intelligent crawling which actually learns characteristics of the linkage
structure of the World Wide Web while performing the crawling. Specifically,
the intelligent crawler uses the inlinking web page content, candidate URL
structure, or other behaviors of the inlinking web pages or siblings in order
to estimate the probability that a candidate is useful for a given crawl. This
is a much more general framework than the focused crawling technique which is
based on a pre-defined understanding of the topical structure of the web. The
techniques discussed in this paper are applicable for crawling web pages which
satisfy arbitrary user-defined predicates such as topical queries, keyword
queries, or any combinations of the above. Unlike focused crawling, it is not
necessary to provide representative topical examples, since the crawler can
learn its way into the appropriate topic. We refer to this technique as
intelligent crawling because of its adaptive nature in adjusting to the web
page linkage structure. We discuss how to intelligently select features which
are most useful for a given crawl. The learning crawler is capable of reusing
the knowledge gained in a given crawl in order to provide more efficient
crawling for closely related predicates.

A metasearch engine is a system that supports unified access to multiple
local search engines. Database selection is one of the main challenges in
building a large-scale metasearch engine. The problem is to efficiently and
accurately determine a small number of potentially useful local search engines
to invoke for each user query. In order to enable accurate selection, metadata
that reflect the contents of each search engine need to be collected and used.
This article proposes a highly scalable and accurate database selection method.
This method has several novel features. First, the metadata for representing
the contents of all search engines are organized into a single integrated
representative. Such a representative yields both computational efficiency and
storage efficiency. Second, the new selection method is based on a theory for
ranking search engines optimally. Experimental results indicate that this new
method is very effective. An operational prototype system has been built based
on the proposed approach.

TOIS 2001 Volume 19 Issue 4

Experimental approaches are widely employed to benchmark the performance of
an information retrieval (IR) system. Measurements in terms of recall and
precision are computed as performance indicators. Although they are good at
assessing the retrieval effectiveness of an IR system, they fail to explore
deeper aspects such as its underlying functionality and explain why the system
shows such performance. Recently, inductive (i.e., theoretical) evaluation of
IR systems has been proposed to circumvent the controversies of the
experimental methods. Several studies have adopted the inductive approach, but
they mostly focus on theoretical modeling of IR properties by using some
metalogic. In this article, we propose to use inductive evaluation for
functional benchmarking of IR models as a complement of the traditional
experiment-based performance benchmarking. We define a functional benchmark
suite in two stages: the evaluation criteria based on the notion of
"aboutness," and the formal evaluation methodology using the criteria. The
proposed benchmark has been successfully applied to evaluate various well-known
classical and logic-based IR models. The functional benchmarking results allow
us to compare and analyze the functionality of the different IR models.

The rapid evolution of XML from a mere data exchange format to a universal
syntax for encoding domain-specific information raises the need for new query
languages specifically conceived to address the characteristics of XML. Such
languages should be able not only to extract information from XML documents,
but also to apply powerful transformation and restructuring operators, based on
a well-defined semantics. Moreover, XML queries should be natural to write and
understand, as nontechnical persons also are expected to access the large XML
information bases supporting their businesses. This article describes XML-GL, a
graphical query language for XML data. XML-GL's uniqueness is in the definition
of a graph-based syntax to express a wide variety of XML queries, ranging from
simple selections to expressive data transformations involving grouping,
aggregation, and arithmetic calculations. XML-GL has an operational semantics
based on the notion of graph matching, which serves as a guideline both for the
implementation of native processors, and for the adoption of XML-GL as a
front-end to any of the XML query languages that are presently under discussion
as the standard paradigm for querying XML data.

We propose a genre taxonomy as a knowledge repository of communicative
structures or "typified actions" enacted by organizational members. The genre
taxonomy is intended to help people make sense of diverse types of
communicative actions and provide ideas for improving work processes that
coordinate the communication of information. It engages several features to
achieve this objective. First, the genre taxonomy represents the elements of
both genres and genre systems as embedded in a social context reflecting the
communicative questions why, what, who, when, where, and how (5W1H). In other
words, the genre taxonomy represents the purpose, content, participants,
timing, location, and form of communicative action. Second, the genre taxonomy
distinguishes between widely recognized genres such as a report and specific
genres such as a particular company's technical report, because the difference
sheds light on the context of genre use. Third, the genre taxonomy represents
use and evolution of a genre over time to help people understand how a genre is
used and changed by a community over time. Fourth, the genre taxonomy
represents aspects of information coordination via genres, thus providing ideas
for improving work processes using genres. We have constructed a prototype of
such a genre taxonomy using the Process Handbook, a process knowledge
repository developed at MIT. We have included both widely recognized genres
such as the memo and specific genres such as those used in the Process Handbook
itself. We suggest that this genre taxonomy may be useful in the innovation of
new document templates or methods for communication because it helps to clarify
different possible uses of similar genres and explicates how genres play a
coordination role among people and between people and their tasks.