Subject Categorization of Web Resources

Lois Mai Chan
School of Library and Information Science
University of Kentucky, USA

Expanding Roles of Classification

Traditional classification schemes were developed to handle large quantities of library materials. In the Web environment, we again face a large volume of resources, and classification has the potential to expand beyond its traditional roles; to a certain extent, it has already done so. For a long time, classification schemes have been used in libraries, particularly those in the United States, primarily as a shelf-location device for physical items. In classified catalogs and bibliographies, classification schemes have been used to organize bibliographic records and entries. In the online catalog, class numbers have also been used as access points to cataloging records, and this use is now extending to other types of metadata records. In the networked environment, existing classification schemes, as well as custom-built hierarchical or classification-like devices, have been adopted as subject browsing and navigational tools in portals to electronic resources. In addition, classification has the potential to serve as a switching language between and among different retrieval languages in a multi-lingual environment.

In this presentation, I would like to discuss two aspects of classification: its use in metadata records and its use as an organizing tool for Web resources.

Classification Data in Metadata Records

The Subcommittee on Metadata and Subject Analysis, a subcommittee of the American Library Association/Association for Library Collections and Technical Services, was established in 1997, with the charge to: "Identify and study the major issues surrounding the use of metadata in the subject analysis and classification of digital resources. Provide discussion forums and programs relevant to these issues."

Initial deliberations focused on the Dublin Core Metadata Scheme; later discussions were broadened to include metadata schemes in general. A copy of the complete report (ALCTS/CCS/SAC/Subcommittee on Metadata and Subject Analysis 1999) is available at: http://www.ala.org/alcts/organization/ccs/sac/metarept2.html.

The Subcommittee considered both subject vocabulary and classification data in metadata records. With regard to the use of classification in metadata records, the following questions were considered by the Subcommittee:

Existing schemes vs. new scheme(s)

Depth and breadth of scheme

Notation

The first question regards the choice of scheme. Should we encourage users to adopt, adapt, or modify existing schemes or develop new ones? How suitable are existing schemes for use in metadata records? To these questions, the Subcommittee recommends:

Classification data should be included in the metadata record by those who have the expertise to do so. For those not trained in the use of classification, further development and improvement of mechanisms for automatic assignment of classification data from different schemes and sources should be encouraged. (4.1)

The use of as many existing classification schemes (DDC, LCC, NLM, etc.) as useful and feasible even within a particular implementation should be allowed. Multiple class numbers should be allowed in the same record to bring out different topics and aspects treated provided that they are properly designated and coded. (4.2)

For classification data, the Subcommittee recommends adopting an existing scheme with or without modification. (5.2.1)

Regarding the second issue, depth and breadth of scheme, the question is: Do we need close classification or will broad classification serve the purpose just as well? The Subcommittee recommends:

Criteria for choosing classification schemes should be based on subject domain, the nature and scope of the collection being described, and the user community being served. (5.2.1)

Classification data at the most exhaustive or specific level should be encouraged. (5.2.2)

In order to serve as effective access points to Web resources, certain issues regarding classification need to be re-examined. Adaptability of classification schemes can take the form of flexibility and extensibility in the depth of hierarchy and variability in the collocation of items on a particular level of the hierarchy. The requirement of depth varies from application to application. As a tool for shelf-location and bibliographic arrangement, considerable depth in classification is required, as evidenced in the continuing expansion and growth of the Dewey Decimal Classification (DDC), the Library of Congress Classification (LCC), and the Universal Decimal Classification (UDC). On the other hand, as a browsing and navigating tool, typified by the subject categorizing schemes used in library portals and popular Web directories, broad schemes are often sufficient.
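One way to picture this flexibility in depth: in a scheme with hierarchical decimal notation such as DDC, a broader class can often be approximated by shortening the class number. The sketch below is illustrative only. The function name `broaden` is hypothetical, and the code assumes that every additional digit adds exactly one level of specificity, which real schedules do not guarantee in every case.

```python
def broaden(ddc_number: str, digits: int) -> str:
    """Truncate a DDC-style number to `digits` significant digits.

    Assumption (for illustration only): each digit adds one hierarchical level.
    """
    if digits < 1:
        raise ValueError("digits must be at least 1")
    significant = ddc_number.replace(".", "")[:digits]
    if len(significant) <= 3:
        # DDC numbers are written with at least three digits; pad with zeros.
        return significant.ljust(3, "0")
    # Restore the decimal point after the third digit.
    return significant[:3] + "." + significant[3:]


# The same resource classed closely (for shelving) or broadly (for browsing).
for depth in (6, 4, 3, 2, 1):
    print(depth, broaden("025.431", depth))
```

A portal that needs only broad categories could store the full number and display the truncated form, so that closer classification remains available if the collection grows.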

Another reason for flexibility in depth is to ensure the amenability of existing classification schemes to the creation of classifications focusing on specific subject domains, including specialized subject taxonomies built around specific disciplines (e.g., art, anthropology, education, human environmental sciences, mathematics, engineering), industries (e.g., petroleum, manufacturing, entertainment), consumer-oriented topics (e.g., automobiles, travel, sports), and mission- or problem-oriented topics (e.g., environment, juvenile delinquency). Developing these mini-classifications or taxonomic modules with a view to fitting them as nodes, even on a very broad level, into the overall classification structures of meta-schemes such as DDC and LCC can go a long way toward ensuring their future interoperability.

The third question relates to the use of notation: Should class numbers and/or captions be included in the metadata record? The Subcommittee recommends:

Classification notation should be included. However, item (e.g., non-topical Cutter) numbers are not necessary because classification data are not used as a shelving device in this context.

In the metadata record, captions (i.e., the text accompanying the class numbers) need not be included. If desired, captions could be built in through systems design. (5.2.3)
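These recommendations can be pictured with a small sketch. It is not a Dublin Core serialization or any official encoding: the names `ClassNumber`, `MetadataRecord`, and `display_classifications`, the sample resource, and the tiny caption table are all hypothetical, and the sample captions should be checked against the schedules before any real use. The sketch only shows several class numbers, each designated by scheme, stored without item numbers or captions, with captions supplied by the system at display time.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical caption table (sample values for illustration only);
# a real system would draw captions from the full schedules.
CAPTIONS = {
    ("DDC", "025.4"): "Subject analysis and control",
    ("LCC", "Z696"): "Classification and notation",
}


@dataclass(frozen=True)
class ClassNumber:
    scheme: str    # designation of the scheme, e.g. "DDC" or "LCC"
    notation: str  # class notation only; no item (Cutter) number


@dataclass
class MetadataRecord:
    title: str
    identifier: str
    classifications: List[ClassNumber] = field(default_factory=list)


def display_classifications(record: MetadataRecord) -> List[str]:
    """Attach captions at display time instead of storing them in the record."""
    lines = []
    for c in record.classifications:
        caption = CAPTIONS.get((c.scheme, c.notation), "(caption not available)")
        lines.append(f"{c.scheme} {c.notation}: {caption}")
    return lines


record = MetadataRecord(
    title="Example Web resource",
    identifier="http://example.org/resource",
    classifications=[ClassNumber("DDC", "025.4"), ClassNumber("LCC", "Z696")],
)
for line in display_classifications(record):
    print(line)
```

Keeping captions in a lookup table rather than in each record means that a revised caption needs to be changed in one place only.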

Classification/Subject Categorization on the Web

As classification is used increasingly as a device of subject categorization for organizing and managing Web resources, other issues also need to be considered. These hierarchical schemes are known by various names, such as subject guides, Web guides, subject categories, subject directories, subject hierarchies, pathfinders, and so on. What many of them have in common is that they manifest the traditional classification principles of hierarchical structure, domain partition, subordination of the specific to the general, and array of related subjects.

To study and evaluate the use of classificatory frameworks in organizing Web resources, the ALCTS/CCS/SAC/Subcommittee on Metadata and Classification, a subcommittee parallel to the Subcommittee on Metadata and Subject Analysis, was also established. Its final report (ALCTS/CCS/SAC/Subcommittee on Metadata and Classification 1999) is available at: http://www.ala.org/alcts/organization/ccs/sac/metaclassfinal.pdf.

Functions of classification on the Web have been identified by the Subcommittee as:

browsing

hierarchical movement

identification

retrieval

limiting

partitioning

profiling

A survey of the hierarchical structures now functioning as Web directories shows considerable variation in complexity and sophistication, in subject scope and depth of coverage, and in the number of items they cover. They also vary in the classification patterns on which they are based. In some cases attempts have been made to adapt existing schemes such as DDC, LCC, and UDC to the Web environment. Examples of these structures include:

subject guides or directories devised by popular Web search services such as Google, Yahoo!, Lycos, Infoseek, Excite, and others;

schemes devised by individual libraries to facilitate access to the Web resources they have selected and included in their local systems or portals; and,

Web organizers and directories based on existing schemes, for instance, OCLC's NetFirst, based on DDC, and CyberStacks and Scout Report Signpost, based on LCC.

Advantages of Using Hierarchical Structure on the Web

Using hierarchical or classification-based formats to categorize Web resources could have significant advantages, among which, as Traugott Koch and Michael Day (Koch and Day) have pointed out, are improved subject browsing facilities, potential multi-lingual access, and improved interoperability with other services. A hierarchical structure is like a conceptual map--either of the entire universe of knowledge or of a particular domain therein. Such a map sorts information resources into related groups and their subgroups and thus allows searchers to confine their searches to defined areas where similar material is concentrated.

There are other advantages of using classification in the Web environment, where conditions differ from those of the print environment. In traditional systems, subject data (including classification numbers and indexing terms) are typically embedded in their sources, either in the documents themselves (e.g., call numbers on spines) or in their surrogates (cataloging or other metadata records such as the Dublin Core). In contrast, in the Web environment, subject data are often separate from, or reside outside, the resources themselves. Such information can instead be stored in directories or other types of Web interfaces that link subject data to the resources but do not otherwise affect them; only the links (URLs, etc.) are attached to the scheme. In other words, individual links are made from the subject provisions in the Web directory to the resources through URLs. The advantage of "linking-to" rather than "storing-with" is flexibility. With a linked system, if a classification or other subject organization scheme is revised, only the links have to be changed or moved: the Web pages and sites are not affected in any way. This feature minimizes the need for constant maintenance; reclassification is not a problem. Of course, there is still the problem of persistent URLs, but that problem is not directly related to our discussion.

A further kind of flexibility offered by classification structure on the Web lies in presentation. Different interfaces can be provided for different user groups; for example, Yahoo!'s regional editions (Chan, Lin, and Zeng 2000) show flexible arrangements of categories:

Yahoo! Singapore: Buddhism, Christianity, Hinduism, Islam, Sikhism.

Yahoo! UK & Ireland - Ireland Only: Christianity, Mysticism, Paganism.

Yahoo! Hong Kong (English version): Christianity, Company.

Subordinate topics under the same main category may vary from region to region and from time to time:

Yahoo! (USA or World):
Arts & Humanities
Literature, Photography...

Yahoo! (Canada, HK):
Arts & Humanities
Fashion, Photography, Literature...

Yahoo! Australia & NZ:
Arts and Humanities
Artists, Photography, Literature...

Yahoo! France:
Art et culture
Littérature, Cinéma, Musique, Musées

Yahoo! (USA or World):
Recreation & Sports
Sports, Travel, Autos, Outdoors...

Yahoo! Canada:
Recreation & Sports
Sports, Outdoors, Travel, Autos...

Yahoo! Australia & NZ:
Recreation & Sports
Sport, Travel, Motoring, Outdoors...

Yahoo! France:
Sports et loisirs
Sports, Tourisme, Auto/Moto, Jeux

Yahoo! Germany:
Sport & Freizeit
Autos, F1, Fußball, Spiele, Reisen...

Yahoo! India:
Recreation & Sports
Sport, Cricket, Travel, Hobbies...

Yahoo! Italy:
Sport e tempo libero
Calcio, Sport, Motori, Viaggi...

Yahoo! Mexico:
Deportes y entretenimiento
Futbol, Deportes, Turismo

Furthermore, the scope and the depth of any given scheme can be easily adjusted on the basis of literary warrant, whether the warrant be popular, consumer-oriented, or academic/scientific. For example, common categories found in popular subject guides include automobiles, entertainment, family, sports, and travel, while the most commonly found categories in academic Web guides are humanities, social sciences, science, technology, and law. In addition, the Web guides can also be easily adapted to local or regional needs, or modified for the needs of a specific user community.
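The "linking-to" arrangement discussed in this section can be sketched as a directory that holds only URLs filed under category paths. Everything in the sketch is hypothetical (the `Directory` class, the category headings chosen, and the example.org addresses); it is meant only to show that revising the scheme moves links and leaves the resources themselves untouched.

```python
from collections import defaultdict


class Directory:
    """Maps category paths (tuples of headings) to sets of resource URLs."""

    def __init__(self):
        self._links = defaultdict(set)

    def add(self, path, url):
        """File a URL under a category path."""
        self._links[tuple(path)].add(url)

    def browse(self, path):
        """Return URLs filed at `path` or under any of its subcategories."""
        prefix = tuple(path)
        found = set()
        for category, urls in self._links.items():
            if category[:len(prefix)] == prefix:
                found |= urls
        return sorted(found)

    def move(self, old_path, new_path):
        """Reclassify: relink everything filed exactly at `old_path`."""
        urls = self._links.pop(tuple(old_path), set())
        self._links[tuple(new_path)] |= urls


directory = Directory()
directory.add(("Recreation & Sports", "Autos"), "http://example.org/cars")
directory.add(("Recreation & Sports", "Travel"), "http://example.org/trips")

# A regional edition renames a subcategory; only the links are moved.
directory.move(("Recreation & Sports", "Autos"), ("Recreation & Sports", "Motoring"))
print(directory.browse(("Recreation & Sports",)))
```

Several such directories, one per regional edition or user community, could point at the same resources while arranging them differently.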

Operational Requirements of Web Organizers

As the volume of Web resources continues to grow, one may expect corresponding growth and refinement in ways to organize them. At this point in time, it is perhaps not too early to consider some of the operational requirements of Web organizers. The desirable characteristics may be summarized in this way: a scheme designed for organizing Web resources should be:

Methods for Categorizing Subject Content

In implementing a Web organizer, the first question is whether to adapt an existing classification scheme or whether to begin afresh. Currently, it appears that those who design and build Web organizers lean toward devices that are based on their own understanding of the needs and search habits of their users.

What we see here is the difference between two different methods for categorizing subject content. Familiar classification schemes, which have a long history, typically represent a top-down approach, starting with the whole universe or an entire discipline of knowledge, determining major classes on theoretical grounds, and subdividing them hierarchically into increasingly specific levels. This approach has generally been used whether the resulting scheme is custom-tailored for specialists or designed with a large and diverse population in mind. The alternative approach is a bottom-up operation that begins with specific terms, items, or Web sites, which are then grouped and organized, first into a microcosm, finally, as coverage becomes fuller, into a macrocosm. In the Web environment, where most subject guides have also been designed with the general public in mind, it seems that many recent efforts to categorize Web resources are taking the latter, i.e., bottom-up, approach.

The question of which approach is likely to prove more effective in the Web environment does not have a definite answer. Either approach leads to a system that embodies domain partition, general/specific delineation, and array of related topics -- features that are considered important for effective retrieval from a very large resource collection. What seems likely is that time will show that top-down systems are especially suitable for highly structured established fields; bottom-up systems, on the other hand, may be particularly well suited to the mass of varied and fluctuating material that makes up so much of the Web. It seems likely, also, that the bottom-up approach works especially well for personalized or customized Web organizers. An example is Northern Light's (http://www.northernlight.com/) "Custom Search Folders," a device that categorizes the results of particular searches into broad categories.

Knowledge Class

The remaining part of this presentation consists of a demonstration of a research project on the development of a personalized knowledge organization and access mechanism called Knowledge Class. Its progress has been reported in the literature (Lin and Chan 1999).
{See example of Knowledge Class}

Purpose and Objectives

The purpose of the device called "Knowledge Class" is to provide a customized method for knowledge organization and access, to supplement and complement existing devices for Web retrieval. In a widely cited paper, Clifford Lynch suggests: "Combining the skills of the librarian and the computer scientist may help organize the anarchy of the Internet" (Lynch 1997). In our project, we have been exploring the possibility of combining existing methods of knowledge organization with advanced Web technology to create an easy-to-use framework for individual Web users. Preliminary results have been reported in the literature (Lin and Chan 1997). Here I will briefly summarize the major characteristics of Knowledge Class and report on the latest progress.

Components

Knowledge Class contains two basic components: an organizing framework and an interface for access to and retrieval of Web resources.

The conceptual organizing framework consists of a classified mini-thesaurus, i.e., a hierarchically structured collection of terms on a specific topic (e.g., adaptive technology, investment, etc.) or a particular discipline (e.g., chemistry, physics, etc.) of interest or concern to an individual user.

The interface for access to and retrieval of Web resources serves as an interactive mechanism between the user and the terms in the organized framework as well as between the user and Web resources. Through this device, the user can initiate searches in a chosen search engine by selecting the display terms or by using pre-stored search strategies, which often contain synonyms. The user can also connect to specific sites previously discovered by clicking on links with pre-stored URLs, i.e., a bookmark-like feature.

Conceptual Basis

In Knowledge Class, we try to recapture some of the advantages of traditional methods for efficient and effective information storage and retrieval and apply them to the Web environment. Specifically, three aspects are considered:

controlled vocabulary features, particularly the control of synonyms and homographs for the purpose of improving recall and precision; and,

search strategies formulated and pre-stored for the purpose of optimizing search results and current awareness.

Improving subject browsing and improving precision of retrieval are the two main goals of our research on Knowledge Class. In the first stage of our work, we introduced the mini-thesaurus-like device. We emphasized that:

a knowledge structure can be built on principles of classification and bibliographical organization;

this structure could be seamlessly integrated with search engines for access to Web resources; and,

an easy-to-use graphical interface could be constructed to support user interactions not only with the organizing structure but with the relevant resources discovered and retrieved through search engines.

Conceptual Design of Knowledge Class

We set out to design Knowledge Class in such a way that it:

organizes concepts and terms on a specific subject or topic into a logical structure showing subject relationships;

facilitates browsing of subject terms and their relationships;

stores useful search terms and strategies so they are available for future use;

allows the addition of synonyms for better recall and qualifiers to resolve ambiguities or distinguish among homographs;

initiates searches using pre-stored terms and strategies in a chosen search engine; and,

stores URLs of specific sites for future use.

In other words, we hope to take information service one step further, beyond what has been available so far. In online retrieval, a great deal of emphasis has been put on retrieval results, and rightly so. But, after retrieval, there is also the need for organizing related information and, in a sense, "storing" it for future use and re-use. This can be done by providing the means for re-visiting the sites and, equally important, for retracing the steps used to find the resources in the first place.
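A minimal sketch of the kind of structure these design goals describe follows. It is not the Knowledge Class implementation, which is not reproduced here. The `Term` class, the `outline` helper, the sample terms, the OR-based query syntax, and the search engine address `https://search.example.org/search` with its `q` parameter are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional
from urllib.parse import urlencode

SEARCH_ENGINE = "https://search.example.org/search"  # hypothetical address


@dataclass
class Term:
    label: str
    synonyms: List[str] = field(default_factory=list)   # for better recall
    qualifier: Optional[str] = None                      # to resolve homographs
    urls: List[str] = field(default_factory=list)        # bookmark-like links
    narrower: List["Term"] = field(default_factory=list)

    def query(self) -> str:
        """Pre-stored strategy: OR the label with its synonyms (assumed syntax)."""
        variants = [self.label] + self.synonyms
        ored = " OR ".join(f'"{v}"' for v in variants)
        if self.qualifier:
            return f'({ored}) "{self.qualifier}"'
        return ored

    def search_url(self) -> str:
        """Build the address that would launch the search in the chosen engine."""
        return SEARCH_ENGINE + "?" + urlencode({"q": self.query()})


def outline(term: Term, depth: int = 0) -> List[str]:
    """Indented display of the hierarchy, for browsing terms and relationships."""
    lines = ["  " * depth + term.label]
    for child in term.narrower:
        lines.extend(outline(child, depth + 1))
    return lines


screen_readers = Term(
    "screen readers",
    synonyms=["screen reading software"],
    urls=["http://example.org/screen-readers"],
)
root = Term(
    "adaptive technology",
    synonyms=["assistive technology"],
    narrower=[screen_readers],
)

print("\n".join(outline(root)))
print(root.search_url())
```

Because the query is derived from the stored term rather than typed afresh, the same search can be repeated later, which is one way of "retracing the steps" used to find a resource.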

System and Interface Design

Certain principles underlie the design of the interface:

Maximizing the benefits of both manual and automatic indexing.

Creating an easy-to-use interface for effective retrieval.

Connecting the display terms with search engines automatically and "smartly."