ISSN 1082-9873

The explosion in the amount and variety of information
has made it difficult to know what information exists, where it is located,
and how to retrieve it. The information explosion has further
accelerated with the acceptance of the World Wide Web
[ber92],
causing a universal rush to create Web pages and use them to provide
on-line access to the vast legacy of existing heterogeneous information.
Such information ranges from documents in a variety of proprietary
representation formats to engineering and financial databases, and
is often accessible only through specialized vendor tools and locally
developed applications. Moreover, rapidly increasing sophistication in
presenting information on the Web is already forcing us to treat
ftp and gopher information sources, and even early
HTML pages as parts of the same legacy.

The main focus of this article is the
InfoHarness™
system [shk94, shk95-1],
which is designed to provide Web access to existing heterogeneous
information without any relocation, reformatting and restructuring of data.
InfoHarness
has been productized and is now a part of
Bellcore's ADAPT/X product line.
It has been designed with an open, extensible, and modular architecture.
A prototype extension of InfoHarness, called
GeoHarness (footnote 1),
is being developed by the members of the
USDAC
Consortium (footnote 2) for accessing geospatial data.
InfoHarness is also used to support advanced Web presentation of existing heterogeneous
information in other domains, e.g., at Rutgers University for accessing
judicial opinions from Federal Appeals courts and the
U.S. Supreme Court (see the example in Section 3.3).
The current prototype supports the largely automatic
generation of InfoHarness and
GeoHarness repositories, and provides
access to raw data from Mosaic, Netscape and other Web browsers through
a gateway program.

In Section 2, we discuss
current methods and tools for providing Web access to existing
heterogeneous information.
We see the most promising approach in building logical data
models and using them to support all kinds of sophisticated
presentation of the original information on the World Wide Web.
Section 3 provides a general description of the
InfoHarness system.
In Section 3.1, we briefly describe the object model.
In Section 3.2, we discuss the generation of
InfoHarness repositories, followed
by an example in Section 3.3.
Section 4 provides a brief summary and a discussion
of our current work.

There have been numerous attempts to provide partial remedies for data
heterogeneity by implementing a variety of ever-changing
filters for format conversions. The filters are used to generate
HTML documents either dynamically, using the
Common Gateway Interface
(CGI)
mechanism, or off-line. The off-line approach
requires substantial human and computing resources for the initial conversion
and maintenance of information. Maintaining the repositories presents the
additional dilemma of either creating new and updating existing information
in HTML, or
continuously managing evolving data in multiple formats.
The dynamic approach helps to postpone the conversion until the information
is requested and eliminates problems with the initial processing and
maintenance of information. However, the access-time conversion may not be
appropriate for some rich document formats (FrameMaker, etc.) for the
following reasons:

Using the Multipurpose Internet Mail Extensions
(MIME) [bor93],
supported by most Web browsers, helps to avoid data conversion
through the use of third-party presentation tools. However, it may require
renaming the original files because MIME's
type recognition mechanism relies on file extensions.
Even though the mapping of file extensions into
MIME types is customizable, it is still
fixed for every given server, unless the type
assignment is performed by some specialized gateway program.
Adding support for new MIME
types often requires end users to obtain and install third-party tools.
Further, MIME alone does not provide any
support for logically linking together relevant documents.
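
The dependence on file extensions can be illustrated with a minimal Python sketch that uses the standard mimetypes module as a stand-in for a server's type table. This is not how any particular Web server implements the lookup; the helper names and the FrameMaker type string are assumptions made for the example.

```python
import mimetypes


def build_server_table():
    """One fixed extension-to-type table, as a single server would hold."""
    table = mimetypes.MimeTypes()
    # Hypothetical site-specific entry; the type name is illustrative only.
    table.add_type("application/x-framemaker", ".fm")
    return table


def type_for(path, table):
    """Return the MIME type chosen purely from the file extension, or None."""
    mime_type, _encoding = table.guess_type(path)
    return mime_type


if __name__ == "__main__":
    table = build_server_table()
    for name in ("brief.ps", "manual.fm", "brief"):
        print(name, "->", type_for(name, table))
```

A file whose name carries no recognized extension gets no type at all, which is why existing files may have to be renamed before they can be served this way.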

Nevertheless, many systems rely on
MIME as their primary presentation
mechanism.
The OMNIS system
[cla95] has been
designed to provide access to library information that includes both
catalogs and digitized texts.
The scanned-in documents may contain images, PostScript or other formatted
information, and are stored in a database. At presentation time, the
OMNIS
gateway converts textual information to
HTML, while images are converted
to common MIME types before
being passed to the browser. This is quite feasible because
OMNIS has full
control over the format and representation of information that is stored
in its database.

Harvest [bow94]
provides support for extracting summaries
from distributed heterogeneous information and for executing searches
over these summaries.
Once the resources have been identified, the responsibility of accessing them
is handed over to the Web browsers.
Harvest
provides efficient and flexible methods of indexing widely distributed
information. MIME mappings
are used to provide access to the wide variety of information, so the
problems that were described earlier still persist.

There have been a number of attempts to build logical models of
distributed heterogeneous information and use these models to support
advanced Web presentation.
The Multimedia-Oriented Repository Environment
(MORE) [eic94]
was designed as a set of CGI programs
that operate in conjunction with a stock
httpd server to provide
access to a relational database containing meta-information, which
specifies how to retrieve physical data. The meta-information is
entered into the database off-line by the human librarians.

WebMake [bae95]
introduces methods for building Web structures over existing software,
e.g., source and object code for software systems.
In WebMake, meta-level
structural documents are used to create abstractions by
logically combining software modules or other structural documents.
A set of tools has been developed to provide a distributed
software development environment by utilizing the
CGI mechanism.
A specialized Web client is required to obtain full access to the
WebMake functionality.

HyperG [and95]
uses an object-oriented database layer to provide information modeling
and model maintenance facilities in addition to integrated attribute
and content-based search. The system supports logical grouping of
documents into collections that may span multiple
HyperG servers. Special
cluster collections are used to group together related multimedia
and multi-lingual information. HyperG
uses its own HyperG Text Format
(HTF) that is converted to
HTML by the
HyperG servers when they respond to
HTTP requests.

The objective of the InfoHarness
system [shk94, shk95-1]
is to provide Web access to large amounts of
heterogeneous information in a distributed environment without any
relocation, restructuring, or reformatting of data.
Like MORE and
HyperG,
InfoHarness
uses metadata for search and retrieval of heterogeneous
information (Figure 3).
It provides advanced search and browsing capabilities without
imposing constraints on information suppliers or creators.
InfoHarness utilizes stable abstract class
encapsulation and presentation hierarchies that need not be modified to
add terminal classes that accommodate new kinds of information and new
indexing technologies. InfoHarness
provides tools for the automatic generation of metadata based on user
inputs and the analysis of existing information.

Closely related to this effort is our work on defining an Information
Repository Definition Language (IRDL)
[shk95-2], a high-level language for describing
information resources and the desired logical structure of information
repositories. The language provides high flexibility in imposing
abstractions on heterogeneous information.
Presently, the IRDL
interpreter generates InfoHarness
metadata entities. With the emergence of
Web objects, it should become possible to
generate Web data structures directly.

The InfoHarness system includes the following components:

The InfoHarness server, which uses
metadata to traverse, search, and retrieve the original information.

The CGI gateway, which is used to
pass requests from HTTP clients to
the InfoHarness server
(via an HTTP server) and responses
back to the clients.

The metadata generator, which supports the off-line generation of
the InfoHarness metadata entities
representing the desired logical structure and organization of the
original information. This metadata is used by the
InfoHarness server to support
dynamic search and presentation of raw data.

At access time, Web clients issue query, traversal, or retrieval requests
that are passed on to the gateway, which performs the following operations:

Parses the request, and reads input information when the request is
associated with an HTML form.

Establishes a socket connection with the
InfoHarness server, generates
and sends out a request, and waits for a response.

Parses the response, converts it to a combination of
HTML forms and hyperlinks, adds
an HTTP header, and passes the
transformed response to an HTTP browser.
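
The three gateway operations above can be sketched in Python as follows. The wire format between the gateway and the server is not described in this article, so the one-line request, the tab-separated "id, title" reply, the script path /cgi-bin/ih, and all function names are assumptions made purely for illustration.

```python
import html
import socket
from urllib.parse import parse_qs, quote


def parse_request(query_string):
    """Operation 1: decode a query string or form body into a flat dict."""
    return {key: values[0] for key, values in parse_qs(query_string).items()}


def ask_server(host, port, fields):
    """Operation 2: open a socket, send one request line, read the reply.

    Assumes the server closes the connection when its reply is complete.
    """
    line = "&".join(f"{k}={quote(v)}" for k, v in sorted(fields.items())) + "\n"
    with socket.create_connection((host, port), timeout=10) as conn:
        conn.sendall(line.encode("utf-8"))
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8")


def render_response(reply, script="/cgi-bin/ih"):
    """Operation 3: turn 'id<TAB>title' lines into hyperlinks plus a header."""
    items = []
    for row in reply.splitlines():
        if "\t" not in row:
            continue  # ignore lines that do not follow the assumed format
        oid, title = row.split("\t", 1)
        items.append(
            f'<li><a href="{script}?op=get&amp;id={quote(oid)}">'
            f"{html.escape(title)}</a></li>"
        )
    body = "<html><body><ul>\n" + "\n".join(items) + "\n</ul></body></html>\n"
    return "Content-Type: text/html\r\n\r\n" + body
```

In a real CGI program the query string would come from the QUERY_STRING environment variable or from standard input, and the rendered text would be written to standard output for the HTTP server to relay.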

The InfoHarness architecture is open,
modular, extensible and scalable.
The InfoHarness server implements the
abstract class presentation hierarchy that does not have to be modified
to support a new data type, or a new
indexing technology [shk95-1]. The methods associated
with abstract classes are general enough because they are data-driven and
can invoke independent programs.
The definitions of terminal classes are also data-driven and are not part of
the implementation, which makes the system capable of supporting arbitrary
information access and management tools (e.g., browsers, indexing
technologies, access methods).
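
One way to picture data-driven terminal classes is a table that maps each type name to the external program that presents it, with a single abstract method that consults the table. The sketch below is only an illustration of that idea in Python; the class name, the table layout, and the tool names (ps2html, latex-filter) are hypothetical and do not come from the InfoHarness implementation.

```python
import subprocess


class Presenter:
    """Abstract presentation method; unchanged when new types are added."""

    def __init__(self, type_table):
        # type_table maps a terminal type name to an argv template.
        self.type_table = type_table

    def command_for(self, type_name, path):
        """Fill the '{path}' placeholder in the template for this type."""
        template = self.type_table[type_name]
        return [part.replace("{path}", path) for part in template]

    def present(self, type_name, path):
        """Run the independent program and return what it prints."""
        result = subprocess.run(
            self.command_for(type_name, path),
            capture_output=True, text=True, check=True,
        )
        return result.stdout


# Terminal classes live in data, not code (hypothetical tool names).
TYPES = {"postscript": ["ps2html", "{path}"]}
presenter = Presenter(TYPES)

# Supporting a new data type is a table update, not a code change.
TYPES["latex"] = ["latex-filter", "--html", "{path}"]
```

Calling presenter.present(...) only works where the named tools are actually installed; command_for shows the dispatch without running anything.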

An object that contains references to a set of other
InfoHarness objects may
be either a collection or a composite object. Only composite objects may
contain an encapsulation unit (Figure 2). A sample composite object both
encapsulates an abstract of a paper and contains references to objects that
encapsulate text, HTML, PostScript, and LaTeX versions of the full paper.
Collection objects may contain references to independent indices
that in turn reference their child objects (Figure 3). An index may be created
either from the encapsulated contents of child objects or from the values of
their attributes (an information source of the index). By an abuse of
notation, we will refer to such collection objects as indexed collections,
and say that an InfoHarness object belongs
to an indexed collection if it is a child of a collection object.
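
The distinction between simple, composite, and collection objects can be sketched with a few Python data classes. This is a simplified reading of the object model, not its actual definition; the class names, field names, and file paths are assumptions chosen for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class EncapsulationUnit:
    """Reference to raw data that stays where it is (field names assumed)."""
    location: str   # file path or URL of the original data
    data_type: str  # e.g. "text", "html", "postscript", "latex"


@dataclass
class IHObject:
    oid: str
    kind: str = "simple"  # "simple", "composite", or "collection"
    unit: Optional[EncapsulationUnit] = None
    children: List["IHObject"] = field(default_factory=list)

    def __post_init__(self):
        if self.kind not in ("simple", "composite", "collection"):
            raise ValueError(f"unknown kind: {self.kind}")
        # Only composite objects may combine child references with a unit.
        if self.kind == "collection" and self.unit is not None:
            raise ValueError("a collection cannot hold an encapsulation unit")
        if self.kind == "simple" and self.children:
            raise ValueError("a simple object cannot reference other objects")


def sample_paper() -> IHObject:
    """The composite object from the text: an abstract plus four versions."""
    versions = [
        IHObject(f"paper-{fmt}", unit=EncapsulationUnit(f"/docs/paper.{ext}", fmt))
        for fmt, ext in [("text", "txt"), ("html", "html"),
                         ("postscript", "ps"), ("latex", "tex")]
    ]
    abstract = EncapsulationUnit("/docs/paper.abstract", "text")
    return IHObject("paper", kind="composite", unit=abstract, children=versions)
```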

An indexed collection contains information about the index source, type,
and the location of the associated index structures.
The type ensures proper selection of query and mapping
methods, the latter responsible for mapping selected information into
InfoHarness objects (Figure 3). Consequently,
any indexed collection may make use of external data retrieval methods that
are not parts of InfoHarness,
making it possible to utilize existing heterogeneous index structures.
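
The role of the index type can be shown with a small Python sketch in which the type selects a pair of functions: one that queries the external index and one that maps raw hits back to object identifiers. The in-memory dictionary below merely stands in for an external index, and every name in the sketch is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class IndexInfo:
    source: str    # "content" or "attributes"
    type: str      # selects the query and mapping methods
    location: str  # where the index structures live


# Toy stand-in for an external index: location -> {term: [raw hit keys]}.
FAKE_INDEX_STORE = {
    "/idx/cases": {"privacy": ["file:93-101.txt", "file:93-207.txt"]},
}


def keyword_query(location: str, text: str) -> List[str]:
    """Query method: look the term up in the external structure."""
    return FAKE_INDEX_STORE[location].get(text.lower(), [])


def path_to_oid(hit: str) -> str:
    """Mapping method: turn a raw hit into an object identifier."""
    return "obj:" + hit.split(":", 1)[1].rsplit(".", 1)[0]


# One entry per index type; adding a technology means adding an entry.
METHODS: Dict[str, Tuple[Callable, Callable]] = {
    "keyword": (keyword_query, path_to_oid),
}


def search(index: IndexInfo, text: str) -> List[str]:
    query, to_object = METHODS[index.type]
    return [to_object(hit) for hit in query(index.location, text)]
```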

An InfoHarness repository is a set of
objects that are known to a single
InfoHarness server.
Any object may be a member of an arbitrary number of collections (its parents).
An object that has one or more parents always
contains unique object identifiers of its parent objects. An object that
does not have any parent is unreachable and may only be accessed
if used as an initial starting point (or entry point) in the traversal.
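
The parent and entry-point rules can be checked on a toy repository. The Python sketch below represents a repository as a mapping from each object identifier to the identifiers of its children; this representation and the sample identifiers are assumptions for illustration, not the actual metadata format.

```python
from collections import deque


def parents_of(children_of):
    """Invert the child lists so that each object knows its parents."""
    parents = {oid: [] for oid in children_of}
    for parent, kids in children_of.items():
        for kid in kids:
            parents.setdefault(kid, []).append(parent)
    return parents


def parentless(children_of):
    """Objects that belong to no collection; usable only as entry points."""
    with_parent = {kid for kids in children_of.values() for kid in kids}
    return sorted(set(children_of) - with_parent)


def reachable(children_of, entry_points):
    """Everything a traversal can visit from the given entry points."""
    seen = set(entry_points)
    queue = deque(entry_points)
    while queue:
        oid = queue.popleft()
        for kid in children_of.get(oid, []):
            if kid not in seen:
                seen.add(kid)
                queue.append(kid)
    return seen


REPOSITORY = {
    "root": ["civil", "criminal"],
    "civil": ["case1"],
    "criminal": ["case1", "case2"],  # case1 has two parents
    "case1": [],
    "case2": [],
    "orphan": [],                    # no parent: reachable only if chosen
}
```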

To illustrate the concepts discussed in Sections 3.1
and 3.2, we discuss how to use
InfoHarness for advanced
search and presentation of judicial opinions from the U.S. Supreme Court
that are available as a collection of plain text files at
ftp.cwru.edu
(a live demo
is available).
Here, information related to a single court case may be distributed
between multiple files.

Given the location of the original information, the desired access-time
presentation of individual cases, and the desired full-text indexing
technology, we have implemented a twenty-line (including declarations)
IRDL program that generates the repository of the Supreme Court cases
by performing the following steps:

1. Creates simple objects that encapsulate individual judicial opinions
(one per file). The encapsulation method determines the case
numbers for the opinions and stores them as attributes of the
encapsulating objects.

2. For each object created in step 1, finds other objects related to
the same case, encapsulates them together with the presentation type
Case, and excludes them from any
further consideration. The presentation method for this type is
responsible for generating the internal hyperlinks to individual
opinions and the external hyperlinks to related information
(the Supreme Court photo, bios of the judges, etc.); see
Figure 5.

3. Creates an indexed collection of the objects created in step 2 using
the Latent Semantic Indexing technology
[dum88].
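
The three steps above can be approximated in Python as a rough sketch. It is not the IRDL program, whose syntax is not shown here: the file names, the shape of the case numbers, and the dictionary layout are assumptions, and a plain word-to-case inverted index is substituted for Latent Semantic Indexing to keep the example short.

```python
import re
from collections import defaultdict

CASE_NO = re.compile(r"\b(\d{2}-\d{1,4})\b")  # assumed case-number shape


def encapsulate(files):
    """Step 1: one simple object per file, case number kept as an attribute."""
    objects = []
    for name, text in files.items():
        match = CASE_NO.search(text)
        case = match.group(1) if match else "unknown"
        objects.append({"file": name, "case": case, "text": text})
    return objects


def group_cases(objects):
    """Step 2: one composite object of type Case per case number."""
    by_case = defaultdict(list)
    for obj in objects:
        by_case[obj["case"]].append(obj)
    return [{"type": "Case", "case": case, "opinions": opinions}
            for case, opinions in sorted(by_case.items())]


def build_index(cases):
    """Step 3 (stand-in): inverted index from words to case numbers."""
    index = defaultdict(set)
    for case in cases:
        for opinion in case["opinions"]:
            for word in re.findall(r"[a-z]+", opinion["text"].lower()):
                index[word].add(case["case"])
    return index


FILES = {  # hypothetical file names and contents
    "a.txt": "No. 91-1030 Opinion of the Court on searches",
    "b.txt": "No. 91-1030 Dissenting opinion on privacy",
    "c.txt": "No. 92-515 Opinion of the Court on taxation",
}
```

Grouping before indexing means that a query hit refers to a whole case, from which the individual opinions remain reachable.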

In this example, each indexed object is a composite object
(Section 3.1). Consequently, when presenting
the results of a query (Figure 4), for each case we
see not only a hyperlink for its content but also hyperlinks for the
individual opinions. The internal hyperlinks for individual opinions are
also available when presenting an individual case
(Figure 5).

Advancing Web technology is likely to rapidly antiquate the existing Web
structures, including images, applets, and static and dynamic
HTML pages.
These structures represent a tremendous investment and cannot be
recreated with every new step in the technological advance.
Consequently, modeling methods that support advanced presentation of
existing heterogeneous information have to progress as well.
Similar methods should be applicable to building virtual Webs,
with both navigation and presentation controlled by personalized
meta-information.