Growing Up With Federated Search

This is the story of how one organization within the federal government came to recognize the potential of federated search, and then set out to deploy it and encourage its maturation.

Along the way, considerable progress has been made. More science is freely findable on the web today than has ever before been available to the public. Yet, much more progress remains to be made.

Before the Web

Before the web, the Office of Scientific and Technical Information[2] (OSTI) of the Department of Energy[3] used the technology then available to maximize communication about the results of the Department's research and development program. For example, OSTI created microfiche and sent it to hundreds of depository libraries. It also partnered with on-line vendors like Dialog. Hard copies were made available via a partnership with the National Technical Information Service.

Enter the Web

With the advent of the web, it quickly became clear that the new medium offered tremendous potential to communicate science. Thus, OSTI set out to develop cutting-edge web tools to share e-prints, technical reports, conference proceedings, and other forms of scientific and technical information (STI). Each form of STI comes from a distinct source and follows a distinct pathway, and each pathway needed to be accommodated. This naturally led to a separate information product for each form.

The Need to Integrate Web Applications

Within a couple of years, OSTI had developed a suite of web-based databases and was also linking to similar databases offered by other agencies. It was apparent, however, that a suite of tools is not a library. What was needed was a way to integrate all of these databases so that patrons would not have to search them one at a time. Fortunately, the concept of federating separate sources was just then being introduced to the web. It was an affordable alternative to other integration technologies, such as creating a data warehouse.

OSTI set out to federate its web applications so that all the databases could be searched simultaneously via a single query.
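To make the idea concrete, here is a minimal sketch of a single query being sent to several sources at once. It is an illustration only, not OSTI's implementation: the source names, the sample titles, and the functions `search_source` and `federated_search` are all hypothetical, and the in-memory lists stand in for what would really be remote database search interfaces.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory "databases"; real sources would be remote search interfaces.
SOURCES = {
    "eprints": ["Neutron scattering in thin films", "Solar cell efficiency limits"],
    "technical_reports": ["Solar thermal storage report", "Reactor coolant study"],
    "conference_proceedings": ["Proceedings on solar forecasting"],
}

def search_source(name, query):
    """Search one source; return (source name, list of matching titles)."""
    hits = [title for title in SOURCES[name] if query.lower() in title.lower()]
    return name, hits

def federated_search(query):
    """Send one query to every source concurrently and collect the hits per source."""
    with ThreadPoolExecutor(max_workers=len(SOURCES)) as pool:
        results = pool.map(lambda name: search_source(name, query), SOURCES)
        return dict(results)

results = federated_search("solar")
for source, hits in results.items():
    print(source, hits)
```

The patron issues one query and receives hits from every source, rather than visiting each database in turn. Note that the results are still grouped by source at this stage.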

The Power of Relevance Ranking

Along the way, OSTI took every opportunity to encourage the rapid maturation of federated search technology. Most notable was the development of relevance ranking in a federated environment. Before relevance ranking, federated search results were presented in long lists grouped by source: a set of hits from source A would be followed by a set of hits from source B, then from source C, and so on. The patron was soon overwhelmed with sets of hits. Just as it had been for surface web search engines like Google, relevance ranking was a major advance in meeting the needs of patrons.

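The challenge was that the technology behind relevance ranking for Google does not work in a federated environment. A crawler-based engine ranks documents it has already collected and indexed, whereas a federated search application sees only the results that each source returns at query time. So, a new approach to relevance ranking had to be invented.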
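The difference between source-by-source lists and a single ranked list can be sketched as follows. This is not the ranking method that was developed for OSTI's applications, which the text does not describe; it uses a simple stand-in rule (the fraction of query terms that appear in a title), and the names `score`, `merge_ranked`, and the sample sources are hypothetical.

```python
def score(query, title):
    """Stand-in relevance score: fraction of query terms found in the title."""
    terms = query.lower().split()
    if not terms:
        return 0.0
    words = set(title.lower().split())
    return sum(term in words for term in terms) / len(terms)

def merge_ranked(query, hits_by_source):
    """Flatten per-source hit lists into one list ordered by score."""
    merged = [
        (score(query, title), source, title)
        for source, titles in hits_by_source.items()
        for title in titles
    ]
    # Highest score first; the sort is stable, so ties keep their source order.
    return sorted(merged, key=lambda item: item[0], reverse=True)

hits_by_source = {
    "source_a": ["Wind turbine noise", "Solar cell efficiency limits"],
    "source_b": ["Solar cell degradation in space"],
    "source_c": ["Efficiency of solar cell arrays", "Fuel cell catalysts"],
}

ranked = merge_ranked("solar cell efficiency", hits_by_source)
for relevance, source, title in ranked:
    print(f"{relevance:.2f}  {source}  {title}")
```

In the merged list, the best hits from source A and source C appear together at the top, and the weakest hit from source A falls to the bottom, instead of every hit from source A preceding every hit from source B.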

The Current Situation

Today, several federations of web applications are available to everyone with internet access. ScienceAccelerator.gov[4] federates key DOE databases. But OSTI's progress extends beyond DOE to include Science.gov[5], which federates U.S. federal agency science information, and WorldWideScience.org[6], which federates national databases and portals from around the globe. The latter two web applications are actually federations of federations. In addition, some OSTI web applications combine crawling technology, such as that used by Google, with federation of databases in a single web application. See http://www.osti.gov/eprints[7].

While OSTI has successfully advanced and deployed federated search technology, that technology is still new and remains immature.

The Near Future

OSTI has made considerable progress. For example, WorldWideScience, which OSTI conceived, developed, and deployed, makes about as much science findable as Google does. It differs from Google in that its content has been deemed worthy of publishing by a national government, and much of that content is inherently non-Googleable. Such progress would not be possible were it not for federated search.

Progress has been so rapid that it is not feasible to make reliable predictions beyond the near term. One near-term opportunity is for private sector organizations to take advantage of government science federations and integrate them with proprietary content. In this way, a vision of truly enormous science collections, i.e., a billion pages, might become real.