D-Lib Magazine, December 1998

ISSN 1082-9873

The Internet allows for the efficient dissemination of texts, thereby
creating a rich hypertextual environment that is potentially conducive to
stimulating the free exchange of ideas in a manner worthy of the modern
scholar. However, the fact that any user whatsoever may disseminate texts
in this manner presents two distinct problems. First, finding relevant
resources on the Internet may take a fair amount of time and, second, once
resources are found, determining their reliability is often difficult if
the user is not already an expert in the field of the resource under
consideration. These problems -- efficiency in searching and academic
quality-control -- are surmountable with existing technology, and many
laboratories around the world are working hard to shape this technology
into a variety of academic information retrieval services.

Some of these efforts depend on developing a system of meta-tags that
extend the html markup language to communicate effectively with search
engine databases so that no manual data entry is needed. While these tags
will ultimately be necessary to make any wide-spread academic information
retrieval system efficient, their use at this point in history suffers
from a serious setback: until a standardized tagging system becomes
accepted and implemented by a large group of users, any search engine that
uses them will be restricted to a relatively small set of Internet
resources. Also, the use of meta-tags does not solve the problem of
quality-control, so that in addition to meta-tags, some means of
determining which files to include in a search engine index is needed. The
issues of standardized tagging systems and quality-regulating mechanisms
are related but independent problems.

In 1995, a small team at the University of Evansville began to address
these problems, looking for ways to allow cataloging of the Internet
immediately in its current state of disarray while preserving some sort of
quality-control. Since then their efforts have undergone considerable
revision, each time producing a mechanism better than the previous. These
efforts have now been consolidated into the newly-formed Internet
Applications Laboratory (IALab), temporarily funded by the University of
Evansville, with the express mission of providing free access to worthy
academic resources for the global community. The means by which the IALab
seeks to do this is by developing Internet filtering mechanisms that
couple well with search engine technology. One guiding principle of the
IALab is that the Internet allows scholars to disseminate their own
research in an academically meaningful manner by placing it on servers at
their host institutions. Consolidating efforts, such as the filtering
mechanisms just mentioned, act on this research by adding a procedure for
validation or accreditation that ensures a measure of reliability for
users who find those resources through one of these filters.

The Argos Model

The first attempt from what is now the IALab to provide a
quality-regulating mechanism was Argos
(http://argos.evansville.edu), a limited area search engine (LASE)
dedicated to ancient and medieval studies and put on-line in October of
1996. Its design is applicable to other disciplines as well.
Argos uses a very simple crawling procedure to limit the scope of return
sets to collections of resources hand-selected by scholars working in the
field. As an example of its effectiveness, in 1996 AltaVista returned 44,000 hits for a
search of the word "Plato," including references to a few software
packages, an ale, a consulting firm, a small town in Illinois and the
Spanish word for "plate"; Argos returned about 300 hits, all of which were
pertinent to the Plato that lived and worked in ancient Greece.

To determine the scope of Argos, we enlisted the help of the major index
sites in ancient and medieval studies that were already established on the
Internet. These sites consisted primarily of pages of links to
hand-selected academic resources. Then we built a web crawler to search
each of these "associate sites," plus each page to which they link.
Special html extensions were devised for the use of the associate sites
whereby they can instruct the crawler not to follow a link or to follow a
link to a second or third level, provided that these secondary and
tertiary links stay on the same server and are located further down the
directory chain. (Why we did this is a story in itself, but one for
another article.) The net result of the procedure is that it passes
editorial control over the contents of Argos to the editors of the
associate sites. When they add a link to their sites, Argos picks it up,
and when they remove a link, it automatically falls out of the Argos
search window, so that the procedure guarantees the user that any resource
found through Argos was selected by a professional academician.
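
As a rough illustration, the crawl rule just described might be sketched as follows in Python. This is not the Argos crawler itself. The function names, the in-memory `links` dictionary that stands in for fetching pages, and the `depth_for` setting that stands in for the special html extensions are all assumptions made for the example. So is the reading of "further down the directory chain" as "inside the directory of the originally linked page."

```python
from urllib.parse import urlparse


def is_deeper_on_same_server(base_url, child_url):
    """True if child_url is on base_url's server and inside base_url's directory."""
    base, child = urlparse(base_url), urlparse(child_url)
    if base.netloc.lower() != child.netloc.lower():
        return False
    # Directory containing the base page, always ending in "/".
    base_dir = base.path[: base.path.rfind("/") + 1] or "/"
    return child.path.startswith(base_dir) and child.path != base.path


def crawl_scope(associate_page, links, depth_for=None):
    """Return the set of URLs indexed on behalf of one associate page.

    links:     dict mapping a URL to the URLs it links to (stand-in for fetching).
    depth_for: hypothetical editor instructions per linked URL:
               0 = do not follow, 1 = index the linked page only (default),
               2 or 3 = also follow that page's links one or two levels further.
    """
    depth_for = depth_for or {}
    scope = {associate_page}
    for target in links.get(associate_page, []):
        depth = depth_for.get(target, 1)
        if depth == 0:
            continue  # the editor asked the crawler to skip this link
        scope.add(target)
        frontier = [target]
        for _ in range(depth - 1):
            next_frontier = []
            for page in frontier:
                for child in links.get(page, []):
                    # Deeper levels must stay on the same server, below the target.
                    if child not in scope and is_deeper_on_same_server(target, child):
                        scope.add(child)
                        next_frontier.append(child)
            frontier = next_frontier
    return scope
```

Because the scope is recomputed from the associate pages on every run, a link that an editor removes simply fails to appear in the next result, which is the behavior described above.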

The model is variable in ways the IALab has not yet implemented. Voting
mechanisms could easily be added whereby a resource must show up in more
than one associate site before it is included in the search engine, and a
variety of additional html tags could be devised to allow the editors of
the associate sites to classify resources or to direct the crawler with
more control. In the absence of these additions, Argos suffers from a lack
of features. It allows users single word, Boolean and phrase searching,
but always across the entire dataset, and since it pulls database
information off the pages that are actually searched, the entries in a
return set do not share a common format, particularly the title.
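
The voting mechanism mentioned above, which the IALab has not implemented, could be as simple as the following Python sketch. The function name, the `min_votes` threshold and the shape of the input are assumptions for illustration only.

```python
from collections import Counter


def vote_filter(associate_links, min_votes=2):
    """Keep only URLs listed by at least min_votes distinct associate sites.

    associate_links: dict mapping an associate site name to the URLs it links to.
    """
    votes = Counter()
    for site, urls in associate_links.items():
        # Each site gets one vote per URL, however many times it links to it.
        votes.update(set(urls))
    return {url for url, count in votes.items() if count >= min_votes}
```

With `min_votes=1` the filter reduces to the current behavior, in which a single associate's link is enough for inclusion.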

In addition, a lack of a common vision among the associate sites makes
the return sets divergent in their quality. Of course, this may not be a
problem with a different editorial board, though the difficulties in
securing agreement and cooperation across the Internet on issues such as
these should not be underestimated. To this day, one of the associates,
the most scholarly and well-established in fact, continues to deliver a
picture of a dog decked out for Christmas to Argos' quality-controlled
database, and it took us four months to get another associate to block a
hidden link to a triple-x sex site.

Differences of vision can never be entirely eliminated when Argos is working with many independent associates. In another implementation, which we are
calling the single associate model, they disappear. Bernard Hibbitts'
Jurist: The Law Professor's Network (http://jurist.law.pitt.edu)
is based on this single associate idea. In this case, IALab sends a
special LASE crawler over just his site, allowing Professor Hibbitts
complete editorial control over the contents of the database. This has
proven quite effective, though it still has some of the shortcomings
of Argos mentioned above.

The general Argos model has the advantage of being easy to implement in other subject areas. To test this, IALab
built Hippias: Limited Area Search of Philosophy on the Internet
(http://hippias.evansville.edu), edited by Peter Suber, a professor of
philosophy at Earlham College. It was built in one weekend after the initial associate
sites had been selected. We would have made many more LASEs based on the
Argos model, if we had had system resources to support them at the time.

As it turns out, it was for the best that we did not. IALab's next
experiments do not have the limitations of the Argos model; they are based
on a database model that allows for the categorization of links and the
manipulation of the page descriptions in return sets into a standard format
from the search-engine side of the equation. I will say more about this in
a moment. In the meantime, I should point out that the Argos model could
be made to provide these features as well, if there were a standardized system of
meta-tags, provided that the authors of indexed files use them
consistently. If our experiment with Argos has taught us anything,
however, it is that indexing procedures must remain fairly free-form, at
least with current technology. Even in the face of clear instructions, it
is unlikely that a common usage of standards will emerge soon, even if
agreement is ever reached about what the standards should be. The later
IALab models take this problem into account as well.

The Noesis Model and the Encyclopedic Vision

The high quality return sets from Argos compared to the major search
engines led us to start reconceptualizing the search engine. The term
"search engine" is a bit of a misnomer, if the device also pre-filters the
Internet, and calling a list of links to selected pages an "index" really
does a disservice to that enterprise as well. A bibliography, when it
provides ready access to the sources that it lists, is an encyclopedic
collection of content, and when that content is peer-reviewed and rendered
searchable by a LASE, the result is an "encyclopedia" that is collectively
maintained by scholars around the world.

To demonstrate this, we devised a thought-experiment to show that what
was standing in the way of this realization was largely a psychological
phenomenon. Without advocating that this be the case, we started to
imagine how a project such as Argos would appear, if all of the pages in a
return set appeared with standardized cataloguing information and were
formatted in the same page layout. The result would appear to be a unified
effort to disseminate scholarship freely on the part of the scholars
around the world. This still remains the case, even though the pages are,
in fact, formatted differently. What the experiment shows, however, is
that our failure to think along these lines earlier was due to the
psychological expectation of common format only and not to the limitations
of the technology. With this in mind, we started to think in smaller
terms. How many quality links would it take to make even a large
encyclopedia? Nothing on the order of 14,000, then the size of the Argos
dataset. At this point, we turned our sights back to the discipline of
philosophy. Instead of linking to all the biographies on Plato, for
instance, a better service could be provided to users by listing only two or
three of the best.

So, instead of running the crawler across associate sites, we revised
the single-associate model used with the Jurist project and started
cataloguing URLs one at a time in our own database. We track the author
and his or her institutional affiliation, the title, the resource type,
that is, whether the resource is an essay directed at professional
audiences, a lecture for undergraduate students, a book review, an image,
a primary text or a research tool, and a few other pieces of information.
We also added a hierarchical system of classifying resources. Though the
prospects of the one-link-at-a-time approach sounded daunting at first,
the reality of the situation turned out quite the contrary. We wrote a
special user interface to allow a development team to catalogue resources.
It takes less than a minute to catalogue a link, much less than the time
it takes to process a book in a standard library. Furthermore, once that
link is catalogued, it is catalogued for the entire Internet world, and we
don't have to handle any books. Special procedures allow us to edit this
database easily. A robot deals with dead links, and we are writing a
variety of procedures to report conditions that suggest when an entire
website may have moved, when a resource has changed significantly, and so
on.
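
To make the cataloguing step concrete, a record of the kind described above might look like the following Python sketch. The class and field names are assumptions, not the schema of our database, and the reachability check is passed in as a function so that the example does not depend on any particular way of fetching pages.

```python
from dataclasses import dataclass
from enum import Enum


class ResourceType(Enum):
    ESSAY = "essay"
    LECTURE = "lecture"
    BOOK_REVIEW = "book review"
    IMAGE = "image"
    PRIMARY_TEXT = "primary text"
    RESEARCH_TOOL = "research tool"


@dataclass
class CatalogueEntry:
    url: str
    title: str
    author: str
    affiliation: str
    resource_type: ResourceType
    category: tuple      # path in the hierarchy, e.g. ("Ancient Philosophy", "Plato")
    alive: bool = True   # updated by the dead-link robot


def mark_dead_links(entries, is_reachable):
    """Flag entries whose URL fails the supplied check; return the dead URLs.

    is_reachable: any callable taking a URL and returning True or False
    (hypothetical; in practice it would attempt to fetch the page).
    """
    dead = []
    for entry in entries:
        entry.alive = bool(is_reachable(entry.url))
        if not entry.alive:
            dead.append(entry.url)
    return dead
```

A record this small is one reason a link can be catalogued in under a minute: the cataloguer fills in a handful of fields and picks a place in the hierarchy.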

To filter the quality, we began by classifying only resources written
by Ph.D.s in philosophy, though we announced that we would revisit this
policy later. It is a temporary measure that gives us a degree of
quality-control. We are now in the process of adding a professional users'
module that allows professionals to configure their own personal research
link libraries from the Noesis dataset. As they do this, a robot will
evaluate their decisions and automatically accredit resources according to
a variety of variables that can be manipulated by the site editor. In
addition, the personal research modules will be evaluated according to
topic, and another robot will classify resources according to how they are
actually used by professionals rather than by imposing an exterior system,
like that of the Library of Congress, on them.
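
Since the professional users' module is still being built, the following Python sketch is only one way such a robot might work. The function name, the `min_users` threshold (standing in for the variables the site editor can adjust) and the layout of the personal libraries are all assumptions.

```python
from collections import Counter, defaultdict


def evaluate_libraries(libraries, min_users=2):
    """Accredit and classify resources from professionals' personal libraries.

    libraries: dict of the form {user: {topic: [urls]}}.
    Returns (accredited, topics): the URLs saved by at least min_users
    distinct users, and for each URL the topic it is most often filed under.
    """
    users_per_url = defaultdict(set)
    topic_counts = defaultdict(Counter)
    for user, shelves in libraries.items():
        for topic, urls in shelves.items():
            for url in set(urls):
                users_per_url[url].add(user)
                topic_counts[url][topic] += 1
    accredited = {
        url for url, users in users_per_url.items() if len(users) >= min_users
    }
    topics = {
        url: counts.most_common(1)[0][0] for url, counts in topic_counts.items()
    }
    return accredited, topics
```

The classification here comes entirely from how professionals file resources in their own libraries, not from an external scheme.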

The result is what we are calling the Noesis model. Users can see a
manifestation of it at http://noesis.evansville.edu. It allows users to
search only essays, lectures, book reviews, images, primary texts, or
research tools, or any combination of these, starting at any point in our
hierarchical tree downwards, using simple word searching or phrase
searching with or without Boolean criteria. This makes the site effective
for those who are new to philosophy and yet useful for more seasoned
academicians. (Noesis: Philosophical Research On-Line does not yet
include a topic tree, though one is available at another manifestation of
the model, Exploring Plato's Dialogues (see
http://plato.evansville.edu). The Plato site cross-links several text
files directly with the search engine thereby producing what we are
calling a virtual learning environment. To learn more, see the information
file at that site.)
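
The search options described above can be sketched in Python as follows. This is a simplified illustration, not the Noesis search code: the dictionary keys are hypothetical, matching is by plain substring, and multiple words are combined with AND only.

```python
def search(entries, query, types=None, under=(), phrase=False):
    """Return titles of matching entries.

    entries: list of dicts with 'title', 'text', 'type' and 'category'
             (a tuple giving the entry's path in the hierarchy).
    types:   optional set of resource types to include, e.g. {"essay", "lecture"}.
    under:   category path to start from; only entries at or below it match.
    phrase:  if True, match the whole query as one phrase; otherwise require
             every word of the query to appear somewhere in the entry.
    """
    terms = [query.lower()] if phrase else query.lower().split()
    hits = []
    for entry in entries:
        if types and entry["type"] not in types:
            continue
        if tuple(entry["category"][: len(under)]) != tuple(under):
            continue
        haystack = (entry["title"] + " " + entry["text"]).lower()
        if all(term in haystack for term in terms):
            hits.append(entry["title"])
    return hits
```

Restricting by `under` is what lets a newcomer start broadly at the top of the tree while a specialist searches only one branch.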

The Noesis model overcomes many of the limitations of the Argos model.
It is more precise in its filtering mechanism, and it offers a wider array
of search options for the user. It also has a higher degree of
reliability. Its chief disadvantage is that it requires human
intervention to maintain the database. So far, this hasn't been a problem.
Two student workers have done this effectively. Given a standard work
week, they can easily catalogue over 1,000 links a week, if that were
necessary, while maintaining the dataset. As the Internet grows, this may
become a problem, and procedures will be needed so that users can modify
their entries from their end. We are anticipating this with software
innovations for the Goliath Project discussed below. Even so, it is worth
pointing out that we have made significant headway in cataloguing the
portion of the Internet dedicated to professional philosophy long before
any standards have been reached.

David and Goliath

Noesis enacts its quality-control mechanism on the search engine side.
The Goliath Project, a joint venture between the IALab and the
International Consortium for Alternative Academic Publication (ICAAP),
http://www.icaap.org, uses the traditional means of peer-review by
indexing only the independent journals that are springing up on the
Internet.

Its procedures represent a synthesis of the database model used with
Noesis and a meta-tag system developed by an ICAAP team headed by Mike
Sosteric, a sociologist at the Centre for Global and Social Analysis,
Athabasca University. The crawler mechanism used for the Goliath Project
goes by the name of DAVID, a dedicated accrediting variable indexing
device. It is accrediting in that it can promise users that any item
appearing in a return set has undergone a procedure of true peer-review,
and it is variable because it uses a database requiring human intervention
for pages without the standardized tags and automatically defaults to a
meta-tag system for pages with them. It can easily be adapted to
accommodate a variety of meta-tagging systems, thereby allowing
full-coverage cataloging of independent periodicals on the Internet long
before any universal agreement is reached concerning meta-tagging
standards.
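
The "variable" behavior of DAVID can be illustrated with a short Python sketch. The software is still under construction, so nothing here is its actual code. The required tag names in `REQUIRED` are hypothetical and are not the ICAAP meta-tag system. The `manual_catalogue` dictionary stands in for the hand-maintained database.

```python
from html.parser import HTMLParser


class MetaTagReader(HTMLParser):
    """Collect <meta name="..." content="..."> pairs from a page."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attrs = dict(attrs)
            name, content = attrs.get("name"), attrs.get("content")
            if name and content:
                self.meta[name.lower()] = content


# Hypothetical set of tags a page must carry to be indexed automatically.
REQUIRED = ("title", "author", "journal")


def describe(url, html, manual_catalogue):
    """Return (record, source) for a page.

    Uses the page's own meta-tags when all required ones are present,
    and otherwise falls back to the hand-built catalogue entry for the URL.
    """
    reader = MetaTagReader()
    reader.feed(html)
    if all(key in reader.meta for key in REQUIRED):
        return {key: reader.meta[key] for key in REQUIRED}, "meta-tags"
    if url in manual_catalogue:
        return manual_catalogue[url], "database"
    return None, "needs cataloguing"
```

Supporting a different tagging system would mean changing only the list of required names and the mapping from tags to record fields, which is why the approach does not have to wait for a universal standard.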

Goliath will allow a full range of search options, like Noesis, but it
will cut across all the academic disciplines. Though the software is still
under construction, it will be finished by the end of January, at which
point we will begin populating the database. By the end of 1999, the
collection will be sizeable enough to be significantly useful for
professional academicians and student scholars.

The hope of the IALab and the ICAAP is that Goliath will stimulate the
proliferation of independent journals on the Internet that operate without
economic interest. This technology is inexpensive enough to
create an Internet in which quality information is disseminated
efficiently to the global community free of charge. In a matrix where
authors have traditionally not been paid for their contributions to
journals, we hope that authors will respond positively to these
independent journals as well. Goliath means a wider readership, because
access is free and efficient; and because it provides mechanisms for the
validation of resources, Internet publication should start to "count" in
promotion and tenure decisions. Furthermore, Goliath will work to bridge
the gap between the general public and the university, allowing scholars
the more traditional role of informing society rather than being subject
to its economic whims.

Conclusions

What we have learned from these experiments is that, in no uncertain
terms, it is technologically possible and economically feasible to
build a system of dissemination for academic resources that is completely
administrated by the scholarly world without the intervention of economic
interests. If the IALab has not yet demonstrated this fully in the
concrete, this is only because we have been operating on a very small
budget in an inexpensive lab that employs undergraduate interns under the
direction of a single faculty advisor. (This should underline the economic
feasibility of enterprises like the ones discussed above.) It is not
because standards must first be reached for meta-tags, nor is it because
the problem is technologically difficult, though a considerable part of
the paper paradigm must be rethought. We fully believe that the new
Internet technology offers the academic community improvements to the
existing system of dissemination as long as it does not wait for the
corporate sector to solve these problems for it.

Corrected the spelling of Mike Sosteric's name. He is the head of the International Consortium for Alternative Academic Publication (ICAAP) team mentioned in the story. The Editor, January 4, 1999, 8:15 AM.