Statistical properties of terms in information retrieval

As in the last chapter, we use Reuters-RCV1 as our
model collection (see Table 4.2 ,
page 4.2 ). We give some term and postings
statistics for the collection in Table 5.1 .
``'' indicates the reduction in size from the
previous line.
``T%'' is the cumulative reduction from unfiltered.

The table shows the
number of terms for different levels of preprocessing
(column 2). The number of terms is
the main factor in determining the
size of the dictionary. The number of nonpositional
postings (column 3) is an indicator of the expected size of the
nonpositional index of the collection. The expected size of
a positional index is related to the number of positions it
must encode (column 4).

In general, the
statistics in Table 5.1 show that preprocessing affects the size of the
dictionary and the number of nonpositional postings greatly. Stemming and
case folding reduce the number of (distinct) terms
by 17% each and the number of nonpositional
postings by 4% and 3%, respectively. The treatment of the
most frequent words is also important.
The rule of 30 states that the 30 most common words account for 30% of
the tokens in written text (31% in the table). Eliminating
the 150 most common words from indexing (as stop words;
cf. Section 2.2.2 , page 2.2.2 ) cuts
25% to 30% of the nonpositional postings. But, although a stop
list of 150 words reduces the number of postings by a
quarter or more, this size reduction does not carry over to the size
of the compressed index. As we will see later in this
chapter, the postings lists of frequent words require only a
few bits per posting after compression.

Table 5.1:
The effect of preprocessing on
the number of terms,
nonpositional postings, and tokens for Reuters-RCV1.
``'' indicates the reduction in size from the
previous line, except that ``30 stop words'' and ``150 stop
words'' both use ``case folding'' as their reference
line. ``T%'' is the cumulative (``total'') reduction from unfiltered.
We performed stemming with the Porter stemmer
(Chapter 2 , page 2.2.4 ).

tokens (number of position

(distinct) terms

nonpositional postings

entries in postings)

number

T%

number

T%

number

T%

unfiltered

484,494

109,971,179

197,879,290

no numbers

473,723

2

2

100,680,242

8

8

179,158,204

9

9

case folding

391,523

17

19

96,969,056

3

12

179,158,204

0

9

30 stop words

391,493

0

19

83,390,443

14

24

121,857,825

31

38

150 stop words

391,373

0

19

67,001,847

30

39

94,516,599

47

52

stemming

322,383

17

33

63,812,300

4

42

94,516,599

0

52

The deltas in the table are in a range typical of large
collections. Note, however, that the percentage reductions
can be very different for some text collections. For
example, for a collection of web pages with a high
proportion of French text, a lemmatizer for French
reduces vocabulary size much more than the Porter stemmer
does for an English-only collection because French is a
morphologically richer language than English.

The compression techniques we describe in the remainder of
this chapter are lossless ,
that is, all information is preserved. Better compression
ratios can be achieved with lossy compression , which
discards some information. Case folding, stemming, and stop
word elimination are forms of lossy compression. Similarly,
the vector space model (Chapter 6 ) and dimensionality
reduction techniques like latent semantic indexing
(Chapter 18 ) create compact representations from which we
cannot fully restore the original collection. Lossy
compression makes sense when the ``lost'' information is
unlikely ever to be used by the search system. For example,
web search is characterized by a large number of documents,
short queries, and users who only look at the first few
pages of results. As a consequence, we can discard postings
of documents that would only be used for hits far down the
list. Thus, there are retrieval scenarios where lossy
methods can be used for compression without any reduction in effectiveness.

Before introducing techniques for compressing the
dictionary, we want to estimate the number of distinct terms
in a collection. It is sometimes said that languages have a
vocabulary of a certain size. The second edition of the
Oxford English Dictionary
(OED) defines more than 600,000 words. But the
vocabulary of most large collections is much larger than
the
OED.
The OED does not include most names of people, locations,
products, or scientific entities like genes. These names
need to be included in the inverted index, so our
users can search for them.