We now combine the definitions of term frequency and inverse document frequency to produce a composite weight for each term in each document. The tf-idf weighting scheme assigns to term $t$ a weight in document $d$ given by

$$\text{tf-idf}_{t,d} = \text{tf}_{t,d} \times \text{idf}_t. \tag{22}$$

In other words, $\text{tf-idf}_{t,d}$ assigns to term $t$ a weight in document $d$ that is

highest when $t$ occurs many times within a small number of documents (thus lending high discriminating power to those documents);

lower when the term occurs fewer times in a document, or occurs in many documents (thus offering a less pronounced relevance signal);

lowest when the term occurs in virtually all documents.
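
The three cases above can be checked on a toy collection. The following is a minimal sketch, not a reference implementation: it assumes raw term counts for term frequency and $\text{idf}_t = \log_{10}(N/\text{df}_t)$ for inverse document frequency (the base of the logarithm is an assumption), and the function name `tf_idf_weights` and the four small documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf_weights(docs):
    """Return one {term: tf-idf weight} dict per document.

    docs is a list of token lists. Assumes tf is the raw count and
    idf_t = log10(N / df_t); both choices are illustrative assumptions.
    """
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    idf = {term: math.log10(n_docs / count) for term, count in df.items()}
    weights = []
    for tokens in docs:
        tf = Counter(tokens)
        weights.append({term: tf[term] * idf[term] for term in tf})
    return weights


# Hypothetical toy collection of four tokenized documents.
docs = [
    ["the", "jaguar", "jaguar", "jaguar"],
    ["the", "car"],
    ["the", "car", "road"],
    ["the", "road"],
]
weights = tf_idf_weights(docs)

# "jaguar" occurs often, in a single document: highest weight.
# "car" occurs once, in two documents: lower weight.
# "the" occurs in every document: idf is zero, so the weight is zero.
for term, doc_id in [("jaguar", 0), ("car", 1), ("the", 1)]:
    print(term, doc_id, round(weights[doc_id][term], 3))
```

Under these assumptions, a term that appears in every document receives weight exactly zero, because $\log_{10}(N/N) = 0$.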

At this point, we may view each document as a vector with one component corresponding to each term in the dictionary, together with a weight for each component that is given by (22). For dictionary terms that do not occur in a document, this weight is zero. This vector form will prove to be crucial to scoring and ranking; we will develop these ideas in Section 6.3. As a first step, we introduce the overlap score measure: the score of a document $d$ is the sum, over all query terms, of the number of times each of the query terms occurs in $d$. We can refine this idea so that we add up not the number of occurrences of each query term $t$ in $d$, but instead the tf-idf weight of each term in $d$.
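
To see why the refinement matters, the sketch below compares the two scores on a small example. It is illustrative only: the names `overlap_score`, `tf_idf_score`, and `idf_table` are hypothetical, term frequency is taken as the raw count, and idf is assumed to be $\log_{10}(N/\text{df}_t)$.

```python
import math
from collections import Counter


def idf_table(docs):
    """Assumed idf definition: idf_t = log10(N / df_t)."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    return {term: math.log10(len(docs) / count) for term, count in df.items()}


def overlap_score(query, doc):
    """Sum, over the query terms, of each term's number of occurrences in doc."""
    tf = Counter(doc)
    return sum(tf[term] for term in query)


def tf_idf_score(query, doc, idf):
    """Sum, over the query terms, of each term's tf-idf weight in doc."""
    tf = Counter(doc)
    # Terms missing from the collection get idf 0 here (an assumption).
    return sum(tf[term] * idf.get(term, 0.0) for term in query)


# Hypothetical toy collection and query.
docs = [
    ["the", "the", "the", "the", "car"],
    ["the", "jaguar", "jaguar"],
    ["the", "road"],
    ["the", "car", "road"],
]
idf = idf_table(docs)
query = ["the", "jaguar"]

for doc_id, doc in enumerate(docs):
    print(doc_id, overlap_score(query, doc), round(tf_idf_score(query, doc, idf), 3))
```

In this example the overlap score prefers document 0, which merely repeats a term found in every document, whereas the tf-idf score prefers document 1, the only one containing the rarer query term.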