Intelligent Search on the Internet - the Vševěd System

During the last few years, the World-Wide Web has become one of the most widespread
technologies of information presentation. Because of its enormous growth, the task of
finding information about a specific topic can be very hard. Search engines were developed
to automate this process. Nevertheless, “classical” search engines can have significant
drawbacks for a common inexperienced user (different locations, different interfaces etc.).

The aim of the Vševěd metasearch system is to simplify the search for information (Web
pages) on the Internet. The system was inspired by some existing metasearch systems,
mainly by AskJeeves. Our system differs mainly in its focus on the Czech environment
(Czech language, Czech search engines).

Search and metasearch

During the last few years, the World-Wide Web has become one of the most widespread
technologies of information presentation. Because of its enormous growth, the task of
finding information about a specific topic can be very hard. Search engines help to
automate this task by looking for Web pages according to words/phrases given by the user.
Search is based on indexes created off-line, either manually (e.g. Yahoo) or automatically
(e.g. AltaVista). The standard way of using a search engine consists of a sequence of
actions:

invoke the search engine

input the keywords

if the search results are not satisfactory, use another search engine

The use of search engines can have significant drawbacks for a common (inexperienced)
user. There are a number of search engines at different locations, each having its own way
of interacting with users. The user must therefore remember their URLs and how to work
with each of them. This is where the idea of “metasearch” emerges, to help users find
relevant information in a more convenient way. The basic principle of all metasearch
engines is to give access to more than one search engine. While sharing this idea, the
abilities of different systems vary:

the system queries all the search engines it can (e.g. MetaCrawler
[Etzioni, 1997])

the system selects only some of the accessible search engines (e.g. SavvySearch [Howe, 1997])

the system queries both a local database of frequent questions and search engines (e.g. AskJeeves)

The advantages of metasearch are:

simultaneous submission of the query to different search engines,

access to search engines unknown to the user,

single interface on user’s side,

sometimes postprocessing of returned information.

The Aim of the Vševěd System

The aim of the Vševěd metasearch system is to simplify the search for information (Web
pages) on the Czech Internet. The system was inspired by AskJeeves, a metasearch engine
that uses both a local database and “standard” search engines (e.g. AltaVista, Lycos,
Infoseek, Excite) to find information relevant to the user’s query. Unlike AskJeeves, our
system is oriented towards the Czech environment (Czech language, Czech search engines).

Our work is focused on pre- and postprocessing. With the exception of the CBR module, we
do not address the problem of searching the Internet itself.

Fig. 1 Overall scheme of the system

The main scheme of Vševěd is shown in Fig. 1. The core modules are working now; the
intelligent modules are being added gradually. A full line denotes already realised
modules, a dashed line denotes modules under development, and a dotted line denotes
planned modules.

Vševěd has the following components:

the Query Preprocessing (QP) module

the Case-Based Reasoning (CBR) module - the database of direct answers

the interface to other searching modules

the Postprocessing module - processes the results from the other search engines

The components of the system

Following the rapid-prototyping paradigm of software development, we deployed a rather
simple first version, which will be gradually extended. This section describes both the
current state of the modules and their further development.

The minimal realised core of Vševěd consists of:

a simple preprocessing of the query

querying other search engines

a simple postprocessing of the found links

a simple presentation of the found links

The “intelligent” part of the system is spread across the following modules:

CBR - Case-Based Reasoning module

WWW ontology based knowledge - a representation of background knowledge about the WWW

LP - Language Preprocessing module

ML - Machine Learning module

Preprocessing of the query

Preprocessing of the query seems crucial for bridging the gap between unskilled users
and search systems. The usual way of querying - input of a sequence of keywords - cannot
benefit from some of the smarter capabilities of the different search engines. At present,
the preprocessing consists of

changing Czech characters to ASCII,

converting all characters to lowercase.
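These two steps can be sketched as follows. This is an illustrative Python
reimplementation (the system itself is written in Perl), and the function name is our
own, not taken from the Vševěd source:

```python
import unicodedata

def normalize_query(query):
    """Strip Czech diacritics and lowercase the query.

    This mirrors the two preprocessing steps described above:
    Czech characters are mapped to plain ASCII and the whole
    query is converted to lowercase.
    """
    # NFKD decomposition separates base characters from combining
    # accents; dropping the non-ASCII combining marks leaves the
    # ASCII skeleton of each word.
    decomposed = unicodedata.normalize("NFKD", query)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return ascii_only.lower()
```

A query such as “Vševěd” then becomes “vseved” before it is forwarded to the engines.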

Further extensions will be

to send the query to the engines in their own language - AND, OR, and space have
different meanings in different engines, etc.,

to try to understand natural language (language preprocessing).
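The first of these extensions can be illustrated by a small sketch. The two dialects
below are simplified assumptions for illustration only, not the actual syntaxes Vševěd
will emit: a ‘plus’ dialect marking required terms with ‘+’, and a ‘boolean’ dialect
joining terms with an explicit AND:

```python
def render_query(terms, dialect):
    """Render a conjunctive keyword query in an engine-specific dialect.

    `dialect` selects the (assumed, simplified) syntax of the target
    engine; unknown dialects fall back to a plain space-separated query.
    """
    if dialect == "plus":
        # every term is required: prefix each with '+'
        return " ".join("+" + t for t in terms)
    if dialect == "boolean":
        # explicit Boolean conjunction between terms
        return " AND ".join(terms)
    return " ".join(terms)
```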

Language preprocessing

The processing of a query given in plain (Czech) language necessarily includes the
solution of several tasks.

Including a thesaurus for expanding the query by equivalent, broader/narrower or
otherwise related terms might certainly be of use. We can benefit from previous work on
the automatic translation of queries to databases [Strossa].

The result of such preprocessing will be a formula using both Boolean operators (AND,
OR, NOT) and proximity operators (NEAR).

Querying other search engines

Standard search engines are the main source of the information the user is looking for.
In the current implementation, the user must select the engines he wants to use from the
given list (see Fig. 2). The user may choose to test the accessibility of the pages, but
this option takes some additional time.

A possible extension of this module is to select (according to the query and Vševěd’s
experience) only the most relevant search engines. Such an approach is used e.g. in the
SavvySearch system.

The result of the search is a list of relevant links. We use the information about the
URLs and ‘names’ (the content of the HTML TITLE tag) for further postprocessing.
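The simultaneous submission of the query to the selected engines can be sketched as
follows. This is an assumed Python reimplementation (the system itself is in Perl); the
actual HTTP requests and result-page parsing are omitted and replaced by caller-supplied
fetcher functions:

```python
from concurrent.futures import ThreadPoolExecutor

def query_engines(fetchers, query):
    """Submit the query to several engines in parallel and collect the links.

    `fetchers` maps an engine name to a function that takes the query
    string and returns a list of (url, title) pairs; real network access
    is left to these functions and is not part of this sketch.
    """
    with ThreadPoolExecutor() as pool:
        # submit the same query to every selected engine at once
        futures = {name: pool.submit(fetch, query)
                   for name, fetch in fetchers.items()}
        links = []
        for name, future in futures.items():
            # gather the (url, title) pairs as the engines finish
            links.extend(future.result())
    return links
```

The merged list of links is then handed over to the postprocessing module.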

Fig. 2 Vševěd input form

Postprocessing

The postprocessing module should better organize the results obtained from the different
search engines. This can be achieved by

removing duplicates and dead links,

rearranging the results according to relevance,

grouping the results (according to their content, location etc.),

giving more information about the target page (e.g. type of page, similar pages, etc.).

Some of these postprocessing steps have already been implemented; others are under
development. At present, the postprocessing is based only on the syntax of the links
(URLs). We decompose each URL returned by the search engines into server, path, owner,
file, first and extension parts. Based on this decomposition, we

remove links to some services of the search systems used (e.g. links to the AltaVista
server),

remove multiple links to the same page (files in the same path with the same first part
of the filename that differ in the rest of the filename),

group several retrieved pages from one directory into a link to this directory.

We also make a URL-based analysis of the type of the page, e.g.:

a personal homepage is indicated by the symbol ‘~’ in the owner part,

the homepage of an organization is indicated if the URL consists only of the server part.
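The decomposition and the two type-detection rules can be sketched as follows. This is
an illustrative Python version; the exact splitting logic of the Perl original is our
guess, and the part names simply follow the text above:

```python
from urllib.parse import urlparse
import posixpath

def decompose_url(url):
    """Split a URL into the server, path, owner, file, first and
    extension parts used by the postprocessing rules above."""
    parsed = urlparse(url)
    server = parsed.netloc
    path, filename = posixpath.split(parsed.path)
    first, extension = posixpath.splitext(filename)
    # an owner segment such as '~novak' marks a personal homepage
    owner = next((seg for seg in path.split("/")
                  if seg.startswith("~")), "")
    return {"server": server, "path": path, "owner": owner,
            "file": filename, "first": first, "extension": extension}

def page_type(parts):
    """URL-based guess at the page type, as in the two rules above."""
    if parts["owner"]:
        return "personal homepage"
    if not parts["path"].strip("/") and not parts["file"]:
        return "organization homepage"
    return "other"
```

For example, `http://www.vse.cz/~novak/index.html` is classified as a personal homepage,
while `http://www.vse.cz/` is classified as the homepage of an organization.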

For each retrieved link (page) we compute the relevance rel as:

rel = rel_max                          if the link points to a directory with several relevant pages,
rel = rel_min                          if the corresponding page has an empty TITLE tag,
rel = w1 (a/(a+b)) + w2 (a/(a+c))      otherwise,

where a is the number of words occurring both in the query and in the page description
(TITLE), a+b is the number of words in the query, and a+c is the number of words in the
page description.

The relevance is used to create a sorted output.

Instead of a symmetric relation between the words in the query and in the document
description (e.g. a/(a+b+c)) we give more weight to the first term a/(a+b). We thus
prefer documents that are described using all the words in the query. The idea behind
this is that a common, “lazy” user gives only the necessary terms in his query.

In the current implementation, w1 = 100 and w2 = 1
(this setting corresponds to sorting using a/(a+b) as the primary key and a/(a+c) as the
secondary key, rel being from the interval (0, 101]), rel_max = 102
(links to a directory obtain the highest relevance) and rel_min = 0 (links with an
empty TITLE obtain the lowest relevance).
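The whole relevance computation can be written down compactly. This is an illustrative
Python reimplementation of the formula and the default weights above; the function name
and interface are ours, not the original Perl code:

```python
def relevance(query_words, title_words, w1=100, w2=1,
              rel_max=102, rel_min=0, is_directory=False):
    """Compute rel for one retrieved link as defined above.

    a     = words common to the query and the page TITLE
    a + b = words in the query
    a + c = words in the page TITLE
    """
    if is_directory:        # link points to a directory with several relevant pages
        return rel_max
    if not title_words:     # empty TITLE tag
        return rel_min
    query = set(query_words)
    title = set(title_words)
    a = len(query & title)
    return w1 * a / len(query) + w2 * a / len(title)
```

With the default weights, a page whose TITLE contains all query words and nothing else
scores 101, the maximum for an ordinary link.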

Further extensions of the postprocessing will include

more syntactical rules

usage of the WWW ontology (to recognize types of pages, or similar pages)

grouping links according to the contents of the documents

WWW Ontology based knowledge

For knowledge-based tasks related to the Internet (and the WWW in particular), knowledge
modelling techniques seem well suited for capturing static domain knowledge describing
e.g. the typical internal structure of, and mutual links among, WWW pages. Heuristics
based on such reusable conceptual knowledge bases (ontologies) will be used for
postprocessing the results of the search (recognizing the type of a page, recommending
similar pages, etc.).

Creating WWW ontologies is a subject of related research [Šimek, Svátek 1998].

The CBR Module

The CBR module should give direct answers to some common questions. Instead of (or in
parallel with) the search on the Internet, the CBR module will search its local case base
of links. The case base will be organized in a tree, in a way similar to Yahoo-like
indexes. The is-a hierarchy will be used to retrieve cases that not only correspond to the
user’s query but are also similar in the sense of the general/specific relation.

The first version of the CBR module is based on the idea that both the query and the case
description consist of a set of words. On this word level (when looking just for word
matches), we can obtain

an exact answer (the terms in the query are the same as the terms in the case description),

a partial answer (only some of the terms match),

no answer (no term matches).

In the case of a partial answer and of no answer, no case will be retrieved from the
case base.

For each retrieved case, we compute the relevance as

a/(a+b+c)

where a is the number of words occurring both in the query and in the case description,
a+b is the number of words in the query, and a+c is the number of words in the case
description.

The prototype consists of two parts:

a MySQL database of direct answers,

a CBR system realised in CLIPS.

The module retrieves from the MySQL database all records that contain some term from the
query, together with their ‘parents’, ‘siblings’ and ‘children’ in the hierarchy. These
records are then processed by CLIPS rules. We do not have a taxonomy of Czech words yet,
so synonyms are stored as related terms.
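The collection of a matching record together with its hierarchy neighbours can be
sketched as follows. The in-memory parent-pointer encoding of the is-a tree is an
assumption for the sketch; the actual MySQL schema may differ:

```python
def neighbours(tree, node):
    """Return a node together with its parent, siblings and children.

    `tree` maps each node to its parent (None for the root); this flat
    parent-pointer representation stands in for the database tables.
    """
    parent = tree.get(node)
    # siblings share the same (non-root) parent but are not the node itself
    siblings = [n for n, p in tree.items()
                if p == parent and n != node and p is not None]
    # children point back to the node as their parent
    children = [n for n, p in tree.items() if p == node]
    result = [node]
    if parent is not None:
        result.append(parent)
    return result + siblings + children
```

All the returned nodes would then be passed on to the CLIPS rules for further processing.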

The further development of the CBR module will be towards a better organization of the
cases (DAGs instead of trees), a better representation of the case descriptions and
better similarity measures.

The case base is created manually for now; in a further step, we plan to use machine
learning methods to update this base automatically.

The presentation of results

The answers from all the engines are shown on one HTML page as an (index) list of links,
together with some additional information obtained during postprocessing (relevance,
types of pages). The links are sorted according to the estimated relevance. The links
followed by the user are stored for further analysis in the (not yet realised) ML module.

Fig. 3 Vševěd results

Possible development

grouping related links according to the context

Implementation notes

Vševěd is written in Perl and uses the MySQL database and CLIPS (C Language Integrated
Production System). Because of its rule-based programming paradigm, CLIPS is suitable for
encoding the postprocessing knowledge. Its drawback is that the computation is rather
slow.