Information retrieval

Information retrieval is the activity of obtaining information resources relevant to an information need from a collection of information resources, and the part of information science that studies this activity. Searches can be based on metadata or on full-text (or other content-based) indexing.
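
As a rough illustration of full-text indexing, the sketch below builds a small inverted index (a map from each word to the documents containing it) and answers Boolean and, or, and not queries against it. The function names, the whitespace tokenizer, and the three sample documents are assumptions made for this example; they are not taken from any particular system.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercased word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Naive whitespace tokenizer (an assumption; real systems do more).
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search_all(index, *terms):
    """Boolean AND: ids of documents containing every term."""
    postings = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*postings) if postings else set()

def search_any(index, *terms):
    """Boolean OR: ids of documents containing at least one term."""
    result = set()
    for t in terms:
        result |= index.get(t.lower(), set())
    return result

def search_not(index, all_ids, term):
    """Boolean NOT: ids of documents that do not contain the term."""
    return set(all_ids) - index.get(term.lower(), set())

# Hypothetical sample collection, used only for illustration.
DOCS = {
    1: "boolean retrieval with edge-notched cards",
    2: "full-text indexing of web pages",
    3: "boolean queries over full-text indexes",
}
INDEX = build_index(DOCS)

if __name__ == "__main__":
    print(search_all(INDEX, "boolean", "full-text"))
    print(search_any(INDEX, "boolean", "web"))
    print(search_not(INDEX, DOCS.keys(), "boolean"))
```

A metadata-based search could follow the same pattern by indexing fields such as author or date instead of the words of the text.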

Calvin Mooers was a participant in early developmental work on digital computers, a researcher, author, and implementer of applications in information retrieval; and a prophet in the 1950s describing the future importance of what is now called computer networks and distributive processing, and daring to predict that machines could simulate thought processes in retrieving computerized information. In 1947, he proposed the Zator, an electronic, film-scanning retrieval machine, and made the first proposal to use the Boolean operations or, and, and not to prescribe selections in retrieval machines. He developed his own Zatocoding System in 1948 using superimposed subject codes on edge-notched cards. He coined the term "Information Retrieval" in 1950, and went on from there to obtain several patents in information retrieval and signaling, produce a text-handling language (TRAC), author some 200 publications, and form one of the first companies whose only concern was information. His thinking has affected all who are in the field of Information and his early ideas are now incorporated into today's reality.

The index of a search engine can be thought of as analogous to the stars in [the] sky. What we see has never existed, as the light has traveled different distances to reach our eye. Similarly, Web pages referenced in an index were also explored at different dates and they may not exist any more.

The WWW project merges the techniques of information retrieval and hypertext to make an easy but powerful global information system. The project started with the philosophy that much academic information should be freely available to anyone. It aims to allow information sharing within internationally dispersed teams, and the dissemination of information by support groups.

Tim Berners-Lee (1991), Usenet article <6487@cernvax.cern.ch> (the first public announcement of CERN's "WorldWideWeb" system, made this day in 1991).

Information retrieval consists of four main stages: Identifying the exact subject of the search; Locating this subject in a guide which refers the searcher to one or more documents; Locating the documents; Locating the required information in the documents.

Whether computers are used for engineering design, medical data processing, composing music, or other purposes, the structure of computing is much the same. We are extremely short of talented people in this field, and so we need departments, curricula, and research and degree programs in computer science... I think of the Computer Science Department as eventually including experts in Programming, Numerical Analysis, Automata Theory, Data Processing, Business Games, Adaptive Systems, Information Theory, Information Retrieval, Recursive Function Theory, Computer Linguistics, etc., as these fields emerge in structure... Universities must respond [to the computer revolution] with far reaching changes in the educational structure.

'Information management' is a term that is preferred to 'information retrieval' by System Development Corporation. Information management is defined as the establishment and utilization of effective procedures for controlling the generation, processing, flow, and use of information.

Government Reports Announcements (1966) Vol 41, Nr 9-12. p. 11.

Brian Campbell Vickery was an enormously influential figure in the field of classification and information retrieval, a powerful force in the development of faceted classification and retrieval theory, and a prolific writer and researcher throughout his life.

It can be useful to distinguish between knowledge and information and data; it is also difficult and contentious. Four points should be made. [First] Knowledge, information and data is what the systems to be discussed are for: by storing it in an organized manner, they are intended to enable it to be found when needed. Secondly, there is a spectrum of increased size and organization between data, where the units are quite small, through to knowledge, where the units are large and distinguished by their complex internal structure and relationships, and overlap with other units... Meunier (1987) presents a typology of levels of representation which is useful for the breadth of its approach and its classification of relationships. Thirdly, "information" in the expression "information retrieval" is generally abused, because what is retrieved is not information, but bibliographic details of sources in which desired information potentially exists. Very many information retrieval systems are at best document retrieval systems, and more usually they are systems which retrieve surrogates for documents... Finally, although the expression knowledge retrieval is particularly associated with artificial intelligence and expert systems, it should not be forgotten that this is what cataloguers, indexers and bibliographers have been doing, and devising systems for, for many years.

Biological classifications have two major objectives: to serve as a basis of biological generalizations in all sorts of comparative studies and to serve as a key to an information storage system... Is the classification that is soundest as a basis of generalizations also most convenient for information retrieval? This, indeed, seems to have been true in most cases I have encountered.

The problem of directing a user to stored information, some of which may be unknown to him, is the problem of "information retrieval"… In information retrieval, the addressee or receiver rather than the sender is the active party. Other differences are that communication is temporal from one epoch to a later epoch in time, though possibly at the same point in space; communication is in all cases unidirectional; the sender cannot know the particular message that will be of later use to the receiver and must send all possible messages; the message is digitally representable; a "channel" is the physical document left in storage which contains the message; and there is no channel noise because all messages are presumed to be completely accessible to the receiver. The technical goal is finding in minimum time those messages of interest to the receiver, where the receiver has available a selective device with a finite digital scanning rate.

An information retrieval system will tend not to be used whenever it is more painful and troublesome for a customer to have information than for him not to have it.
Where an information retrieval system tends not to be used, a more capable information retrieval system may tend to be used even less.

The task of keeping up with scientific literature is becoming an impossible one and is in turn leading to inefficiency and to a certain amount of frustration in scientific research and in the application of science.

Vickery commented: "The crux of the retrieval problem is that selecting documents to read grows ever more difficult, and new techniques are continually needed."

Information retrieval is a wide, often loosely-defined term but in these pages I shall be concerned only with automatic information retrieval systems. Automatic as opposed to manual and information as opposed to data or fact. Unfortunately the word information can be very misleading. In the context of information retrieval (IR), information, in the technical meaning given in Shannon's theory of communication, is not readily measured (Shannon & Weaver). In fact in many cases, one can adequately describe the kind of retrieval by simply substituting "document" for "information". Nevertheless, "information retrieval" has become accepted as a description of the kind of work published by Cleverdon, Salton, Sparck Jones, Lancaster and others. A perfectly straightforward definition along this line is given by Lancaster 2: "Information retrieval is the term conventionally, though somewhat inaccurately, applied to the type of activity discussed in this volume. An information retrieval system does not inform (i.e. change the knowledge of) the user on the subject of his inquiry. It merely informs on the existence (or non-existence) and whereabouts of documents relating to his request". This specifically excludes Question-Answering systems as typified by Winograd 3 and those described by Minsky 4. It also excludes data retrieval systems such as used by, say, the stock exchange for on-line quotations.

An information retrieval system is therefore defined here as any device which aids access to documents specified by subject, and the operations associated with it. The documents can be books, journals, reports, atlases, or other records of thought, or any parts of such records—articles, chapters, sections, tables, diagrams, or even particular words. The retrieval devices can range from a bare list of contents to a large digital computer and its accessories. The operations can range from simple visual scanning to the most detailed programming.

Four years ago, when the first edition of this book was written, information retrieval was beginning to crystallize out as a unified discipline. The process has gone further today. Several other books... have also offered a general survey, although each has contributed its own special emphasis. Many conferences on the subject have been held, and a constant stream of new articles has appeared, both in documentation journals and in those in the data processing field.
Information retrieval is now recognized as a discipline, and further advances in theory are being made. What I described in the first edition as the key operation in retrieval — the subject description of documents — is being explored theoretically and experimentally, although we are still a long way from reducing this operation to rule (Chapter 3). There has been less new work on the design of descriptor languages, although ideas on the display of descriptor relations through thesauri and 'semantic maps' have been developed (Chapter 4). Access to files has been examined, particularly by those experienced in data processing.

Information retrieval is now an accepted part of the new discipline of information science and technology... I have concentrated on the field with which I am most familiar, the problems of bibliographic description and subject analysis.

In 1958 the classification ideas in it were felt to be controversial, needing to be championed. A few years before, the Classification Research Group had issued a memorandum proclaiming "the need for a faceted classification as the basis of all methods of information retrieval". As part-author of this memorandum, I must now judge the claim to have been too bold, even brash.

What surprised me, which Google was part of, is that superficial search techniques over large bodies of stuff could get you what you wanted. I grew up in the AI tradition, where you have a complete conceptual model, and the information retrieval tradition, where you have complex vectors of key terms and Boolean queries. The idea that you can index billions of pages and look for a word and get what you want is quite a trick. To put it in more abstract terms, it's the power of using simple techniques over very large numbers versus doing carefully constructed systematic analysis.