The vector space model
procedure can be divided into three stages. The first stage is document
indexing, where content-bearing terms are extracted from the document text.
The second stage is the weighting of the indexed terms to enhance retrieval
of documents relevant to the user. The last stage ranks the documents with
respect to the query according to a similarity measure.
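
These three stages can be sketched end to end. The toy code below is
purely illustrative (the function names, the sample stop list, and the
choice of raw term frequency and cosine similarity are assumptions for
the example, not a reference implementation):

```python
import math
from collections import Counter

STOP_WORDS = {"the", "is", "a"}  # tiny illustrative stop list

def tokenize(text):
    return text.lower().split()

def retrieve(documents, query):
    # Stage 1: indexing -- keep only content-bearing terms
    indexed = [[t for t in tokenize(d) if t not in STOP_WORDS]
               for d in documents]
    # Stage 2: weighting -- here simply raw term frequency
    vectors = [Counter(doc) for doc in indexed]
    q = Counter(t for t in tokenize(query) if t not in STOP_WORDS)
    # Stage 3: ranking -- cosine similarity between query and documents
    def cos(a, b):
        dot = sum(w * b.get(t, 0) for t, w in a.items())
        na = math.sqrt(sum(w * w for w in a.values()))
        nb = math.sqrt(sum(w * w for w in b.values()))
        return dot / (na * nb) if na and nb else 0.0
    return sorted(range(len(documents)),
                  key=lambda i: cos(q, vectors[i]), reverse=True)
```

Each stage is discussed in more detail in the following sections.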

The vector space model has been criticized
for being ad hoc. For a more theoretical analysis of the vector space model
see [28].

Document Indexing

It is obvious that many of the words in a
document do not describe the content, words like the and is. By using
automatic document indexing these non-significant words (function words)
are removed from the document vector, so the document is represented
only by content-bearing words [4]. This indexing
can be based on term frequency, where terms with either very high or very
low frequency within a document are considered to be function words [18,4,11].
In practice, term frequency has been difficult to implement in automatic
indexing. Instead, a stop list holding common words is used to remove
high-frequency words (stop words) [11,4],
which makes the indexing method language-dependent. In general, 40-50%
of the total number of words in a document are removed with the help of
a stop list [4].
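
As a sketch, stop-list indexing amounts to a simple filter (the stop
list below is a tiny sample for illustration, not one of the standard
published lists):

```python
STOP_WORDS = {"the", "is", "a", "of", "and", "to", "in"}

def index_document(text):
    """Return only the content-bearing terms of a document."""
    return [t for t in text.lower().split() if t not in STOP_WORDS]

index_document("The cat is in the hat")  # -> ['cat', 'hat']
```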

Non-linguistic methods for indexing have
also been implemented. Probabilistic indexing is based on the assumption
that there is some statistical difference between the distribution of
content-bearing words and function words [11].
Probabilistic indexing ranks the terms in the collection w.r.t. the term
frequency in the whole collection. The function words are modeled by a
Poisson distribution over all documents, whereas content-bearing terms
cannot be modeled this way. The Poisson model has been extended to a
Bernoulli model [24]. Recently, an automatic indexing
method which uses serial clustering of words in text has been introduced
[25]. The value of such clustering
indicates whether a word is content-bearing.
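
The intuition behind the Poisson model can be illustrated as follows:
under a Poisson assumption a term's occurrences are spread evenly over
the collection, so a term whose document frequency falls well below the
Poisson prediction tends to cluster in a few documents and is likely
content-bearing. The measure below is a hedged sketch of this idea, not
the exact formulation of [11] or [24]:

```python
import math

def poisson_fit(collection_freq, doc_freq, n_docs):
    """Ratio of observed to Poisson-predicted document frequency.
    Function words fit the Poisson model (ratio near 1);
    content-bearing words cluster (ratio well below 1)."""
    lam = collection_freq / n_docs               # mean occurrences per doc
    expected_df = n_docs * (1 - math.exp(-lam))  # docs with >= 1 occurrence
    return doc_freq / expected_df

poisson_fit(100, 63, 100)  # evenly spread term: ratio close to 1
poisson_fit(100, 5, 100)   # clustered term: ratio far below 1
```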

Term Weighting

Term weighting has been explained in terms of controlling
the exhaustivity and specificity of the search, where exhaustivity
is related to recall and specificity to precision [11].
Term weighting for the vector space model has been based entirely on
single-term statistics. There are three main factors in term weighting:
the term frequency factor, the collection frequency factor, and the
length normalization factor. These three factors are multiplied together
to produce the resulting term weight.

A common weighting scheme for terms within
a document is to use the frequency of occurrence, as stated by Luhn [18]
and mentioned in the previous section. The term frequency is somewhat
content-descriptive for the documents and is generally used as the basis
of a weighted document vector [17]. It is also possible
to use a binary document vector, but the results have not been as good
as with term frequency in the vector space model [17].

Various weighting schemes are used
to discriminate one document from another. In general this factor is
called the collection frequency factor. Most of them, e.g. the inverse
document frequency, assume that the importance of a term is inversely
proportional to the number of documents the term appears in [4].
Experimentally it has been shown that these document discrimination factors
lead to more effective retrieval, i.e., an improvement in precision and
recall [17].

The third possible weighting factor is
a document length normalization factor. Long documents usually have a much
larger term set than short documents, which makes long documents more likely
to be retrieved than short ones [17].

Different weighting schemes have been investigated,
and the best results, w.r.t. recall and precision, are obtained by using
term frequency combined with inverse document frequency and length
normalization [17,26].
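
A sketch of this combined scheme, using the common idf variant
log(N/df) together with cosine (length) normalization; this is one of
several variants tested in [17,26], not the single definitive form:

```python
import math
from collections import Counter

def tfidf_vector(doc_tokens, doc_freq, n_docs):
    """tf * idf weights, cosine (length) normalized.
    doc_freq maps each term to the number of documents containing it."""
    tf = Counter(doc_tokens)
    w = {t: f * math.log(n_docs / doc_freq[t]) for t, f in tf.items()}
    norm = math.sqrt(sum(v * v for v in w.values()))
    return {t: v / norm for t, v in w.items()} if norm else w
```

A term appearing in every document gets idf = log(1) = 0 and thus no
discriminating weight, while rare terms are weighted up.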

Similarity Coefficients

The similarity in vector space models is determined
by using associative coefficients based on the inner product of the document
vector and the query vector, where word overlap indicates similarity. The
inner product is usually normalized. The most popular similarity measure is
the cosine coefficient, which measures the cosine of the angle between a
document vector and the query vector. Other measures include the Jaccard
and Dice coefficients [27].
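
As a sketch, the three coefficients can be computed as follows. Cosine
operates on weighted vectors; the Jaccard and Dice functions below use
the simple set forms over the terms present (weighted forms also exist):

```python
import math

def cosine(a, b):
    """a, b: dicts mapping term -> weight."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def jaccard(a, b):
    """Set form: overlap over the union of terms present."""
    A, B = set(a), set(b)
    return len(A & B) / len(A | B) if A | B else 0.0

def dice(a, b):
    """Set form: twice the overlap over the total number of terms."""
    A, B = set(a), set(b)
    return 2 * len(A & B) / (len(A) + len(B)) if A or B else 0.0
```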