Executive Summary

Background and Goals: This workshop brought together people
involved with information server technologies, search technologies, and
directory and online services, to discuss areas of mutual concern where
repository interface standards could provide better approaches to distributed
indexing and searching.

The first day of the workshop consisted of three technical sessions,
each of which began with two invited talks intended to stir controversy
on a particular topic, followed by a breakout session to discuss the topic
in smaller groups. The technical session areas were Distributed Data Collection,
Data Transfer Formats, and Distributed Search Architectures. Slides from
the talks the first day are available (see the links in the Agenda),
as are notes from the plenary sessions.

The morning of the second day was spent in a plenary session summarizing
the preceding discussions and then culling and adjusting the topic list
to provide charters for a final breakout session to write up workshop recommendations.
The workshop turned out to focus more on indexing than on searching, and
in particular on collecting information needed for indexing.

Recommendations: The first
breakout/writeup session focused primarily on determining the set of
servers to which a query should be routed. Most of this discussion centered
on centroids, which the group abstractly characterized as a table used
to determine if a particular query should be sent to a particular service.
The group felt the Whois++ framework specified in RFCs 1913
and 1914 was appropriate,
although there might be shortcomings for particular applications - for
example, there are currently no provisions for indicating that a centroid
was created using stemming or stop lists. Alternatively, the informal
agreement that Stanford is coordinating proposes centroids that include
the agreed-upon content summaries (i.e., the words in the collection
plus their document frequencies), together with information such as stemming
and stop words.
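
The abstract characterization above (a table of the collection's words and their document frequencies) can be sketched in a few lines. The stop list and stemmer below are deliberately trivial stand-ins, since, as noted, the RFCs do not yet specify either:

```python
from collections import Counter

# Illustrative stop list and stemmer -- RFCs 1913/1914 do not specify
# these, which is exactly the shortcoming noted above.
STOP_WORDS = {"the", "a", "of", "and"}

def stem(word):
    # Trivial suffix-stripping stand-in for a real stemming algorithm.
    return word[:-1] if word.endswith("s") else word

def build_centroid(documents):
    """Build a centroid: each word in the collection mapped to its
    document frequency (the number of documents containing it)."""
    df = Counter()
    for doc in documents:
        words = {stem(w) for w in doc.lower().split() if w not in STOP_WORDS}
        df.update(words)
    return dict(df)

docs = ["directory services index", "directory service searching"]
centroid = build_centroid(docs)
```

A query router could then consult such tables to decide whether a collection is worth contacting at all, which is the routing role the group assigned to centroids.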

The group established several short term goals: additions/modifications
to RFC 1913/1914 to specify stop lists, stemming, language, administrative
contacts, and field names/attribute keys; the creation of a mailing list
for interested parties; and tools for creating centroids. For longer term
goals the group recommended creating a centroid standard that interoperates
among search engine vendors, perhaps starting with Bunyip's Digger
software as a reference implementation of the Whois++ protocol; working
out a stemming specification for centroids; measuring the size and computational
costs of using centroids, perhaps as part of a proposed prototype implementation
using the MetaCrawler Web search
service, or as part of the ongoing University
of California Whois++ testbed; considering ways to extend centroids
for use with non-text databases; expanding the header generality in RFCs
1913 and 1914; adding support for comments; and adding data specification
support so that clients can rank services, and to allow per-collection
word frequency counts.

This breakout session also discussed the question of how to conduct
searches. There the group focused on engine identifiers and merging heterogeneous
result sets. They suggested defining a data structure and transport mechanism
to allow clients to formulate queries and interpret results. Some of the
pieces they considered included URIs, collection descriptions, query language
and output formats, and support for active code. The group also felt it
would be important to consider standards for query languages and refinement
and the role of Z39.50, but they ran out of time for those discussions.
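
A minimal sketch of the routing decision discussed above, using centroids as the table that determines whether a query should be sent to a service. The service names are invented, and the centroids are reduced to word sets for brevity (a real centroid would also carry document frequencies, which would additionally let clients rank services):

```python
def route_query(query_terms, centroids):
    """Forward a query to every service whose centroid contains at
    least one of the query terms; skip the rest.
    `centroids` maps a service name to its set of indexed words."""
    return [svc for svc, words in centroids.items()
            if any(term in words for term in query_terms)]

# Hypothetical services and centroids, for illustration only.
centroids = {
    "svc-a": {"whois", "directory"},
    "svc-b": {"robot", "indexing"},
}
```

For example, `route_query(["directory"], centroids)` would contact only `svc-a`, sparing `svc-b` a query its collection cannot answer.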

The second breakout/writeup
session addressed the problem of defining a simple convention for embedding
metadata within HTML documents without requiring additional tags or changes
to browser software, and without unnecessarily compromising current practices
for robot collection of data. The group noted that a registry may be a
necessary feature over time, but suggested that deployment proceed in the
short term without requiring a registry. The group then went on to define
an encoding scheme using META tags, gave examples of how the scheme might
be used, and proposed a convention for linking to a schema's reference
definition. Finally, they suggested that the semantics for metadata elements
be related to existing well known schemas whenever feasible, to promote
consistency among schemas.
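
The general shape of such a scheme can be sketched as follows. This is an illustration of the idea (schema-qualified element names inside standard META tags, plus a link to the schema's reference definition), not the group's normative text; the attribute conventions shown are assumptions:

```python
from html import escape

def meta_tags(schema, elements, schema_url=None):
    """Encode metadata inside an HTML document using only META and LINK
    tags, so no new tags or browser changes are required. Element names
    are qualified with a schema prefix (e.g. "DC.title"); an optional
    LINK points at the schema's reference definition."""
    lines = []
    if schema_url:
        # Convention (illustrative) for linking to the schema definition.
        lines.append('<LINK REL="SCHEMA.%s" HREF="%s">' % (schema, schema_url))
    for name, value in elements.items():
        lines.append('<META NAME="%s.%s" CONTENT="%s">'
                     % (schema, name, escape(value, quote=True)))
    return "\n".join(lines)
```

Because the output is ordinary META tags, existing robots can collect the metadata without any change to current practice, which was one of the group's stated constraints.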

The third breakout/writeup
session focused on mechanisms to allow information servers to notify indexers
when content changes. They separated this issue from the choice of how
bulk data transfer is performed, and noted that there are three ways to
maintain an index: (a) retrieval without prior coordination (e.g., as used
by current robots), (b) retrieval after notification, and (c) notification
followed by a provider push. They suggested five areas where standards are needed:
a bulk collection protocol on top of HTTP, a collection packaging format,
notification and registration protocols, notification event scheduling,
and a protocol for clients and servers to negotiate whether to push or
pull updates. The group then proposed a basic design for this set of standards.
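
Mode (b), retrieval after notification, can be sketched as below. All class and method names are illustrative; the workshop proposed areas for standards rather than this particular design:

```python
class Indexer:
    def __init__(self):
        self.index = {}

    def notify(self, provider, changed_urls):
        # The notification protocol delivers only the URLs that changed;
        # the indexer then pulls just those documents, instead of
        # re-crawling the whole site as a robot would.
        for url in changed_urls:
            self.index[url] = provider.fetch(url)

class Provider:
    def __init__(self, docs):
        self.docs = docs                  # url -> content
        self.subscribers = []

    def register(self, indexer):
        # Registration protocol: the indexer asks to be notified.
        self.subscribers.append(indexer)

    def fetch(self, url):
        return self.docs[url]

    def publish(self, url, content):
        # On a content change, notify every registered indexer.
        self.docs[url] = content
        for indexer in self.subscribers:
            indexer.notify(self, [url])

provider = Provider({})
indexer = Indexer()
provider.register(indexer)
provider.publish("http://example.org/a.html", "new text")
```

Mode (c) would differ only in the last step, with the provider pushing the content along with the notification; which mode to use is exactly what the proposed push/pull negotiation protocol would settle.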

The group expressed some concern about the ability of the transport
layer to handle large-scale notifications, especially in the case of personal
agents requesting notifications. The availability of authentication mechanisms
could reduce this problem by allowing providers to limit notifications
to a specific set of indexing services.

The group discussed the use of Netscape's Resource Description Message
(RDM) extension to the Harvest
SOIF
format for performing incremental updates and bulk transfers. Darren
Hardy has made a preliminary
specification of RDM available.
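
For readers unfamiliar with SOIF, the following sketch emits a record in its general shape: a template type, a URL, and attribute-value pairs where each value is prefixed by its byte length. This is an approximation for illustration, not the normative SOIF (or RDM) grammar:

```python
def soif_record(template, url, attributes):
    """Emit a SOIF-style record. The byte-length prefix on each value
    lets a parser read arbitrary binary content without delimiters --
    a property that matters for bulk transfer of non-text data."""
    lines = ["@%s { %s" % (template, url)]
    for name, value in attributes.items():
        data = value.encode("utf-8")
        lines.append("%s{%d}:\t%s" % (name, len(data), value))
    lines.append("}")
    return "\n".join(lines)
```

A stream of such records is easy to package for the bulk collection and incremental-update roles discussed above, since each record is self-delimiting.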

Finally, the group observed that registration and bulk transfer standards
should be open, to encourage competitive value addition by parties other
than information providers and indexers.

Z39.50: There was a fair amount of discussion
of Z39.50 at the workshop. Some participants felt there should be a
standard information retrieval protocol for queries and that Z39.50 was
a good choice; others felt that Z39.50 is too large and complex, and suggested
that the Z39.50 community develop a lightweight rendering of Z39.50. The
Library of Congress Z39.50 representative at the workshop agreed to work
towards this goal.

BOFs: Two Birds-Of-a-Feather (BOF) sessions were also held at
the workshop. At the first BOF,
the Z39.50 Implementors Group community agreed to help the
Stanford Digital Library Project
produce a Z39.50 profile for the Stanford
informal agreement and two alternative implementations -- one using
an ASCII encoding and the other using BER (Basic Encoding Rules).

While the overall workshop goal was to determine areas where standards
could be pursued, the second BOF
attempted to reach actual standards agreements about some immediate term
issues facing robot-based search services. The agreements fell into four
areas: a ROBOTS meta-tag, meant to provide a per-document mechanism for
users who cannot control the robots.txt file at their sites; a DESCRIPTION
meta-tag, providing text that can be used by a search service when printing
the document summary; a KEYWORDS meta-tag (which some workshop attendees
felt was not appropriate for this BOF to specify without the participation
of other parties that have been working on this meta data issue); and a
list of other issues with robots.txt that should be resolved in future
discussions: ambiguities in
the current specification, a means of canonicalizing sites, ways of supporting
multiple robots.txt files per site, ways of advertising content that should
be indexed (rather than just restricting content that should not be indexed),
and information about the maximum acceptable speed and parallelism when
indexing a site.
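
The ROBOTS meta-tag gives per-document control through directives such as NOINDEX and NOFOLLOW. The interpretation below follows the convention that emerged from these discussions, but the default handling shown is an assumption rather than the BOF's normative text:

```python
def parse_robots_meta(content):
    """Interpret a ROBOTS meta-tag value such as "NOINDEX, FOLLOW".
    Absent directives default to permissive behavior (an assumption);
    NONE is treated as shorthand for NOINDEX plus NOFOLLOW."""
    directives = {d.strip().upper() for d in content.split(",")}
    return {
        "index": "NOINDEX" not in directives and "NONE" not in directives,
        "follow": "NOFOLLOW" not in directives and "NONE" not in directives,
    }
```

A document author who cannot touch the site-wide robots.txt could thus write `<META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW">` to keep a page out of an index while still letting robots traverse its links.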

This page is part of the DISW 96 workshop.
Last modified: Thu Jun 20 18:20:11 EST 1996.