Plotting a new course for metasearch

Computers in Libraries
[Feb 2005]

Breeding, Marshall

Copyright (c) 2005 Information Today

Abstract: Breeding discusses the limitations of distributed search and the advantages of centralized search, the two competing approaches to metasearch. He also weighs in on which approach library-oriented metasearch products should rely on.

Since the launch of Google Scholar on Nov. 19, 2004, the library world has been abuzz with speculation and concern about how it will affect us. I've been thinking a lot about the technologies and products in the metasearch genre lately and can't help but see that Google Scholar (and similar services that will inevitably be offered by its competitors) will have a significant impact on the field. It might be the right time to seriously reconsider how we approach the problem of creating a search environment for library-provided electronic resources.

The recent debut of Google Scholar has convinced me that the architecture that underlies the traditional library approach toward search and retrieval cannot succeed as the sole system that librarians rely on to simultaneously search multiple electronic resources. It now seems clear to me that the current strategy of metasearch that depends on live connections casting queries to multiple remote information sources cannot stand up to search systems based on centralized indexes that were created in advance based on harvested content. I think of these competing approaches as "distributed search" and "centralized search," respectively.

The entry of Google into the realm of scholarly information encroaches deeply into territory that librarians once considered their own. Leveraging the same simple, addictive interface of Google Web search, Google Scholar may well become the default interface that students, faculty, and the public at large embrace when looking for scholarly information. The more that I think about the way that Google works, the more convinced I become that the prevailing design underlying metasearch will not scale to the level needed for a successful library information environment.

Today's world demands an expansive search environment. The universe of information resources is immense and is growing rapidly. The content needed for research and scholarship is dispersed among publishers, aggregators, repositories, library catalogs, e-print servers, and servers throughout the Web. Users do not want to jump from one interface to another as they search for information. They prefer a single access point that takes them to all the resources in their area. Metasearch, the ability to search multiple resources simultaneously, is an essential component of a successful information-seeking environment.

The Limitations of Distributed Search

To their credit, librarians have long seen the need to create interfaces that search multiple resources. The Z39.50 search-and-retrieval protocol emerged largely to allow researchers to search multiple library catalogs simultaneously. In practical experience, Z39.50 works well when the broadcast search targets fewer than a dozen information resources. While theoretically extensible to other kinds of information, it works best with bibliographic or citation data in MARC format. In its more modern instantiation, the Search and Retrieve Web Service (SRW), Z39.50 fits better into the Services-Oriented Architecture that is rapidly becoming a preferred interoperability framework. However, it still suffers from many of the limitations associated with the multicast search model.

In the last few years, a bevy of metasearch products based on the distributed search model have emerged. These products aim to provide a seamless interface for searching a wide range of information resources simultaneously. The current generation of metasearch products can work with a reasonable number of the information resources to which libraries typically subscribe. The targets of these metasearch products include library catalogs, abstract-and-indexing databases, and aggregations of full-text e-journals. In practical terms, the number of resources that these applications can search simultaneously is limited.

The "distributed search" architecture behind the current generation of metasearch products involves real-time queries cast to multiple remote resources. This model depends on live connections to multiple remote resources that use some type of search-and-retrieve protocol. The metasearch application receives the results from each remote resource, parses and processes the records returned, and displays the results to the searcher.
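The flow just described (cast the query, collect each response, merge, display) can be sketched in a few lines. This is only an illustration under stated assumptions: the `TARGETS` dictionary and the `search_target` and `distributed_search` functions are hypothetical in-memory stand-ins, not the API of any metasearch product, and a real system would speak a search-and-retrieve protocol over the network.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-ins for remote targets; a real product would
# reach each one over the network with a search-and-retrieve protocol.
TARGETS = {
    "catalog": ["Metasearch basics", "Library systems today"],
    "abstracts": ["Metasearch in practice", "Indexing at scale"],
    "ejournals": ["Open archives and metasearch"],
}


def search_target(name, query):
    """Return (target name, matching titles) for one simulated target."""
    hits = [title for title in TARGETS[name] if query.lower() in title.lower()]
    return name, hits


def distributed_search(query):
    """Cast the query to every target at once and merge the responses."""
    with ThreadPoolExecutor(max_workers=len(TARGETS)) as pool:
        # pool.map returns results in the same order as the targets.
        responses = pool.map(lambda name: search_target(name, query), TARGETS)
        merged = [(name, title) for name, hits in responses for title in hits]
    return merged


if __name__ == "__main__":
    for target, title in distributed_search("metasearch"):
        print(f"{target}: {title}")
```

Nothing is indexed ahead of time in this sketch; every search repeats the full round trip to every target.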

This real-time distributed search model suffers from a number of inherent limitations. For one thing, the number of live connections that can be sustained simultaneously is limited. Also, the slowest-performing remote service defines the best performance of the overall search transaction.
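A rough model makes both limits concrete. The sketch below is a simplification with made-up numbers: it assumes each target has a fixed response time and that a waiting target starts as soon as any connection frees up. The function name `fanout_elapsed_ms` is hypothetical.

```python
import heapq


def fanout_elapsed_ms(latencies_ms, max_connections):
    """Estimate wall-clock time for a broadcast search.

    Each target is assigned to whichever connection slot frees up first;
    the search is complete only when the last target has answered.
    """
    if max_connections < 1:
        raise ValueError("max_connections must be at least 1")
    slots = [0] * max_connections  # time at which each slot becomes free
    heapq.heapify(slots)
    for latency in latencies_ms:
        start = heapq.heappop(slots)
        heapq.heappush(slots, start + latency)
    return max(slots)


if __name__ == "__main__":
    latencies = [200, 500, 3000]  # illustrative response times in ms
    for connections in (3, 2, 1):
        print(connections, fanout_elapsed_ms(latencies, connections))
```

With these illustrative figures, three parallel connections still take 3000 ms because of the one slow target, and a single connection takes 3700 ms because the targets queue behind one another.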

Large result sets cause problems. A broad search may involve thousands of hits from each target. In order to effectively perform collation, de-duplication, sorting, or relevancy ranking, ideally you want to have complete result sets from each resource. Because of the time that would be needed to pull complete result sets, these functions are typically performed based on a small initial result set from each remote resource.
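A toy example shows what is lost when de-duplication sees only the first few records from each target. The record titles and the `merge_results` function are invented for illustration, and matching on a normalized title is an assumed, deliberately naive duplicate test.

```python
def merge_results(result_sets, per_target_limit=None):
    """Merge result lists, dropping repeats of an already-seen title.

    per_target_limit mimics pulling only a small initial result set from
    each target; None means the complete result set is used.
    """
    seen = set()
    merged = []
    duplicates = 0
    for records in result_sets:
        for title in records[:per_target_limit]:
            key = title.strip().lower()  # naive match key (an assumption)
            if key in seen:
                duplicates += 1
            else:
                seen.add(key)
                merged.append(title)
    return merged, duplicates


if __name__ == "__main__":
    target_a = ["A", "B", "C", "D"]
    target_b = ["C", "E", "A", "F"]
    print(merge_results([target_a, target_b]))
    print(merge_results([target_a, target_b], per_target_limit=2))
```

With complete result sets, the two overlapping records are caught. With only the first two records from each target, no duplicates are detected and half of each result set is never considered for sorting or ranking.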

Though the software may be tweaked to mitigate each of these limitations, I don't believe that solutions exist to address them effectively overall. Simpler, faster search-and-retrieve protocols will help increase the efficiency of the current metasearch products, but I'm convinced that an entirely new approach is needed.

The Advantages of Centralized Search

The centralized search model involves gathering data on the universe of interest in advance and processing it into indexes that can provide instant results to searchers' queries. Web search services prove that this approach can scale to match the largest imaginable data sets. Today, Google's general Web search service indexes more than 8 billion items, sustains thousands of requests per second, and still delivers almost-instant responses.

This model of searching scales well because all of the major work happens in advance of each search request. Harvesting all possible items, or the metadata that describes them, allows a search service to populate a comprehensive index that is able to respond to all possible queries. With all the time-consuming and computationally intensive work done ahead of time, real-time queries receive very rapid responses.
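The idea of doing the work in advance can be illustrated with a minimal inverted index. This is a sketch under simple assumptions (whitespace tokenization, every query term required), not a description of how any particular search engine is built; `build_index` and `search` are hypothetical names.

```python
from collections import defaultdict


def build_index(documents):
    """Done ahead of time: map each term to the IDs of documents using it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Done at query time: intersect the prebuilt sets for each term."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


if __name__ == "__main__":
    harvested = {
        1: "metadata harvesting protocol",
        2: "distributed search protocol",
        3: "centralized search index",
    }
    index = build_index(harvested)
    print(search(index, "search protocol"))
```

The expensive step, reading every harvested item, happens once in `build_index`. Answering a query touches only the prebuilt entries for its terms, no matter how many sources contributed content.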

All of the Web search engines rely on the centralized search model. Software robots systematically request Web pages from any known Web server so that their content can be indexed. (Anyone who runs a Web site will see a significant level of activity recorded in the server logs representing page requests from these Web crawlers.) In earlier forms, the search engines indexed only basic Web pages, but in the last year or so, they have expanded to include all varieties of content: PDF, Word, Excel, PowerPoint, and even dynamically generated database content.

Searching on the scale of the Web could not possibly function in a distributed search model. Could you imagine a search service that depended on a dynamic query sent to all known Web servers? It just wouldn't work. Searching a large number of targets demands pre-built indexes.

In earlier times, placing the onus of the search process entirely on a centralized service was more expensive and technically complex. The cost and capacity of storage and computing power were limiting factors. In today's environment, very-large-scale storage is more affordable than ever; clusters of computers deliver almost limitless processing power. While limitations of hardware in the previous phase of computing history demanded a distributed environment, today's almost-unlimited bounds of hardware and software favor centralized or consolidated services.

Centralized search has gained a presence in the library world through the Open Archives Initiative and its Protocol for Metadata Harvesting (OAI-PMH). This protocol and search model has been applied to a variety of digital library applications and stands as one of the fastest-growing technologies. While I won't go into all the details of the OAI-PMH here, it uses an optimized approach that allows a service provider (the one building the centralized federated search service) to systematically and efficiently harvest metadata from data providers (repositories of content). It supports both a full transfer of metadata and incremental updates of records added or changed since the last harvest. [Editor's Note: For a concise explanation of OAI-PMH, see Marshall's feature "Understanding the Protocol for Metadata Harvesting of the Open Archives Initiative," Sept. 2002, pp. 24-29.]
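As a brief illustration of the full and incremental harvests, the sketch below builds OAI-PMH `ListRecords` request URLs. The `verb`, `metadataPrefix`, `from`, and `resumptionToken` arguments come from the protocol itself; the base URL and the `list_records_url` helper are hypothetical, and the code only constructs URLs without contacting any repository.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None,
                     resumption_token=None):
    """Build a ListRecords request URL for an OAI-PMH data provider."""
    if resumption_token is not None:
        # A resumption token continues an earlier, partial response and is
        # sent without the other arguments.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if from_date is not None:
            # Restrict the harvest to records added or changed since this date.
            params["from"] = from_date
    return f"{base_url}?{urlencode(params)}"


if __name__ == "__main__":
    base = "https://repository.example.org/oai"  # hypothetical data provider
    print(list_records_url(base))                          # full harvest
    print(list_records_url(base, from_date="2005-01-15"))  # incremental
```

A service provider would typically run the full harvest once, record the date, and then repeat only the incremental request on a schedule.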

Making the Switch

So, if centralized search is so superior, why do the library-oriented metasearch products rely on distributed search? The issue lies in the availability of metadata and content needed to build centralized indexes. A centralized search service would require that all the content of the potential resources be exposed to the library's search engine so a comprehensive index that spans all the resources could be created. To date, this just hasn't been possible. Having the publishers of content resources expose their entire collections for metadata harvesting and document indexing just hasn't been practical from a technical or a business perspective.

However, the emergence of Google Scholar demonstrates that creating a centralized search service of library-oriented scholarly resources may be more attainable than previously expected. Libraries may be able to find opportunities of their own through doors initially opened by Google, et al. As the large Web-search players convince scholarly content publishers to expose their documents and metadata for indexing in their search services, similar opportunities should also emerge for library-managed scholarly search services.

While I know that a tremendous amount of effort would be required to build a library-managed search service of content from the suite of publishers and aggregators that supply resources to the library community, the results could be far better than what we are getting from the current suite of distributed search products. It would be an enormous undertaking fraught with challenges that would need to be addressed in many different areas. At the same time, a library-managed comprehensive scholarly search service would provide vast opportunities for libraries, such as more conspicuous branding for libraries as the providers of electronic content and integration into other library-provided services.

One of the main drawbacks of the initial version of Google Scholar is its lack of an effective way to direct a researcher to the appropriate copy of any given item of content. A researcher may be linked to a copy of an article that must be paid for on demand even though it may be available through a subscription that the library has already paid for. Libraries have long been aware of this "appropriate copy" problem and have effectively solved it through OpenURL-based link resolvers. A scholarly search service designed by the library community would naturally integrate this technology and many others that would benefit library users.
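To show roughly what such a link looks like, the sketch below packs a citation into a query string addressed to the library's own resolver, which then decides which copy the user is entitled to. It assumes the simple key names of the early OpenURL convention (`genre`, `aulast`, `title`, `date`, `spage`); the resolver address and the `openurl_link` helper are hypothetical.

```python
from urllib.parse import urlencode


def openurl_link(resolver_base, citation):
    """Append citation metadata to the library resolver's base URL."""
    return f"{resolver_base}?{urlencode(citation)}"


if __name__ == "__main__":
    # Citation drawn from the Editor's Note earlier in this column.
    citation = {
        "genre": "article",
        "aulast": "Breeding",
        "title": "Computers in Libraries",
        "date": "2002",
        "spage": "24",
    }
    # Hypothetical resolver address; each library would supply its own.
    print(openurl_link("https://resolver.example.edu/openurl", citation))
```

Because the link carries a description of the item rather than a fixed publisher address, the same citation can lead users at different libraries to different, locally licensed copies.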

The Stakes Are High

As large forces such as Google begin to step into the arena of scholarly information, it seems important for the library community to be proactive. In my own mind, I've drawn some conclusions about the status quo in the library metasearch realm relative to the changes at hand. My view of the eventual transition of the distributed metasearch model toward services based on centralized indexes may or may not resonate with others. However, now that alternatives outside the library stand ready to claim what we once considered to be our territory, I do see that the stakes are high enough for us to move more aggressively toward providing better search interfaces.

In closing, I don't want to discourage librarians from making good use of the metasearch products currently available. While not perfect, they go a long way toward the goal of providing user-friendly ways to search the electronic resources provided by libraries. My goal is more to think about what better solutions might be developed for the next generation, which seems to be just around the corner.