Self-Indexing Inverted Files for Fast Text Retrieval

Status

ACM Trans. Information Systems,
14(4):349-379, October 1996.

Abstract

Query processing costs on large text databases are dominated by the
need to retrieve and scan the inverted list of each query term.
Retrieval time for inverted lists can be greatly reduced by the use of
compression, but this adds to the CPU time required. Here we show that
the CPU component of query response time for conjunctive Boolean
queries and for informal ranked queries can be similarly reduced, at
little cost in terms of storage, by the inclusion of an internal index
in each compressed inverted list. This method has been applied in a
retrieval system for a collection of nearly two million short
documents. Our experimental results show that the self-indexing
strategy adds less than 20% to the size of the compressed inverted
file, which itself occupies less than 10% of the indexed text, yet can
reduce processing time for Boolean queries of 5-10 terms to under one
fifth of the previous cost. Similarly, ranked queries of 40-50 terms
can be evaluated in as little as 25% of the previous time, with little
or no loss of retrieval effectiveness.
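
The idea of an internal index inside an inverted list can be illustrated
with a minimal Python sketch. It is not the paper's implementation: the
class name SkippedList, the block size, and the use of a plain Python list
of d-gaps as a stand-in for the compressed stream are all assumptions made
for illustration. Each block of gaps gets one skip entry holding the first
document number in that block, so a membership test for a conjunctive
query decodes at most one block rather than the whole list.

```python
from bisect import bisect_right


class SkippedList:
    """Sorted document numbers stored as d-gaps, one skip entry per block.

    Hypothetical structure for illustration; a real system would store the
    gaps (and the skip entries) in compressed form.
    """

    def __init__(self, doc_ids, block_size=4):
        self.block_size = block_size
        self.gaps = []    # stand-in for the compressed gap stream
        self.skips = []   # first document number of each block
        self.decoded = 0  # gaps decoded so far, a rough proxy for CPU cost
        prev = 0
        for i, d in enumerate(doc_ids):
            if i % block_size == 0:
                self.skips.append(d)
            self.gaps.append(d - prev)
            prev = d

    def contains(self, target):
        # Use the skip entries to find the only block that could hold target.
        b = bisect_right(self.skips, target) - 1
        if b < 0:
            return False
        start = b * self.block_size
        end = min(start + self.block_size, len(self.gaps))
        doc = self.skips[b]  # first entry of the block needs no decoding
        for i in range(start + 1, end):
            if doc >= target:
                break
            doc += self.gaps[i]
            self.decoded += 1
        return doc == target


def intersect(candidates, long_list):
    """Conjunctive step: keep candidates that also occur in long_list."""
    return [d for d in candidates if long_list.contains(d)]


long_list = SkippedList(list(range(3, 300, 3)), block_size=8)
result = intersect([9, 50, 150, 297, 400], long_list)
print(result, long_list.decoded, len(long_list.gaps))
```

In this toy setting the five lookups decode only a fraction of the 99
stored gaps. The saving grows when the candidate set is small relative to
the list, which is the situation in a conjunctive Boolean query processed
from its rarest term onward.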