To understand the new statistics, let's pretend we've indexed the
following two example documents, each with only one field "title":

document 1: The Lion, the Witch, and the Wardrobe

document 2: The Da Vinci Code

Assume we tokenize on whitespace, remove commas, downcase all terms, and
don't discard stop words. Here are the statistics Lucene tracks:

TermsEnum.docFreq()

How many documents contain at least one
occurrence of the term in the field; 3.x
indices also save this (TermEnum.docFreq()). For term
"lion" docFreq is 1, and for term "the" it's 2.

Terms.getSumDocFreq()

Number of postings, i.e. sum
of TermsEnum.docFreq() across all terms in the field.
For our example documents this is 9.

TermsEnum.totalTermFreq()

Number of occurrences of this term
in the field, across all documents. For term "the" it's 4, for term
"vinci" it's 1.

Terms.getSumTotalTermFreq()

Number of term occurrences
in the field, across all documents; this is the sum
of TermsEnum.totalTermFreq() across all unique terms in the
field. For our example documents this is 11.

Terms.getDocCount()

How many documents have at least one
term for this field. In our example documents, this is 2, but if for example one
of the documents was missing the title field, it would be 1.

Terms.getUniqueTermCount()

How many unique terms were seen
in this field. For our example documents this is 8. Note that this
statistic is of limited utility for scoring, because it's only
available per-segment and you cannot (efficiently!) compute this
across all segments in the index (unless there is only one segment).

Fields.getUniqueTermCount()

Number of unique terms
across all fields; this is the sum
of Terms.getUniqueTermCount() across all fields. In
our example documents this is 8. Note that this is also only
available per-segment.

Fields.getUniqueFieldCount()

Number of unique
fields. For our example documents this is 1; if we also had a body field
and an abstract field, it would be 3. Note that this is also only
available per-segment.
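
For example, here's roughly how you'd pull these statistics at search time against our example index. This is only a minimal sketch: reader is an open IndexReader, and the exact method signatures shifted slightly across the 4.x releases (e.g. seekExact's useCache flag was later removed):

    // Per-field statistics for the "title" field (MultiFields aggregates across segments):
    Terms terms = MultiFields.getTerms(reader, "title");
    long sumDocFreq = terms.getSumDocFreq();              // 9 for our two documents
    long sumTotalTermFreq = terms.getSumTotalTermFreq();  // 11
    int docCount = terms.getDocCount();                   // 2

    // Per-term statistics for the term "the":
    TermsEnum termsEnum = terms.iterator(null);
    if (termsEnum.seekExact(new BytesRef("the"), false)) {
      int docFreq = termsEnum.docFreq();                  // 2
      long totalTermFreq = termsEnum.totalTermFreq();     // 4
    }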

3.x indices only store TermsEnum.docFreq(), so if you want
to experiment with the new scoring models in Lucene 4.0, you should
either re-index or upgrade your index
using IndexUpgrader. Note that the new scoring models all
use the same single-byte norms format, so you can freely switch between them
without re-indexing.
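
If you do upgrade, a minimal sketch looks like this (the index path is just a placeholder, and IndexUpgrader's exact constructors vary a bit across releases; it also has a main() so it can be run from the command line):

    // Rewrites every segment of an existing index into the current (4.0) format, in place:
    Directory dir = FSDirectory.open(new File("/path/to/index"));
    new IndexUpgrader(dir, Version.LUCENE_40).upgrade();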

In addition to what's stored in the index, these statistics are also
available per-field, per-document while indexing, in the FieldInvertState passed to the Similarity.computeNorm method in
both 3.x and 4.0:

length

How many tokens in the document. For document 1 it's 7; for document 2 it's 4.

uniqueTermCount

For this field in this document, how many unique terms are there? For document 1, it's 5; for document 2 it's 4.

maxTermFrequency

What was the count for the most frequent term in this document. For document 1 it's 3 ("the" occurs 3 times); for document 2 it's 1.

In 3.x, if you want to consume these indexing-time statistics, you'll
have to save them away yourself (e.g., somehow encoding them into the
single-byte norm value). However, since 4.0
uses doc
values for norms, you have more freedom to encode these statistics
however you'd like. Your custom similarity can then pull from these.
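
For example, here is a minimal sketch of a custom Similarity (the class name is made up) that consumes these indexing-time statistics by overriding DefaultSimilarity's lengthNorm, basing the norm on maxTermFrequency instead of length and ignoring the field boost; it keeps the standard single-byte norm encoding:

    public class MaxTermFreqSimilarity extends DefaultSimilarity {
      @Override
      public float lengthNorm(FieldInvertState state) {
        // Also available here: state.getLength(), state.getUniqueTermCount()
        return (float) (1.0 / Math.sqrt(state.getMaxTermFrequency()));
      }
    }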

From these available statistics you're now free to derive other commonly used statistics:

Average field length across all documents is Terms.getSumTotalTermFreq() divided by maxDoc
(or Terms.getDocCount(), if not all documents have the field).

Average within-document field term frequency is FieldInvertState.length divided by FieldInvertState.uniqueTermCount.

Average number of unique terms per field across all documents is Terms.getSumDocFreq() divided by maxDoc
(or Terms.getDocCount(), if not all documents have the field).
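
In code, again as a sketch against our example "title" field:

    Terms terms = MultiFields.getTerms(reader, "title");
    // Average field length across documents that have the field: 11 / 2 = 5.5
    double avgFieldLength = (double) terms.getSumTotalTermFreq() / terms.getDocCount();
    // Average number of unique terms per document: 9 / 2 = 4.5
    double avgUniqueTermsPerDoc = (double) terms.getSumDocFreq() / terms.getDocCount();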

Remember that the statistics do not reflect deleted documents, until
those documents are merged away; in general this also means that
segment merging will alter scores! Similarly, if the field omits term
frequencies, then the statistics will not be correct (though they will
still be consistent with one another: we will pretend each
term occurred once per document).

You should pull a DocsEnum for that field + term, advance to the docID and call .freq() to get that count. This is also available in term vectors (if you want to get the count for all terms within a single doc).

Hi, I'm trying to figure out in Lucene 4 the document length (total number of keywords it contains) from the norm vector at query run-time. For some (unclear) reason the encode method of the norm values is not public. Any way of doing it?

Which similarity are you using? Just the default (DefaultSimilarity)? If so, then you can get the docLength from the norm by calling DefaultSimilarity.decodeNormValue. Note that if you boosted the field during indexing it will change this document length (higher boost makes doc length smaller). Also note that this is heavily quantized: we use only a single byte to encode the length by default.
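
Roughly like this (a sketch; normByte stands for the norm byte you pulled for this document and field, and it assumes no index-time boost):

    DefaultSimilarity sim = new DefaultSimilarity();
    // DefaultSimilarity encodes boost/sqrt(length) into the byte norm, so with
    // boost == 1.0 the (heavily quantized) length is roughly 1/decoded^2:
    float decoded = sim.decodeNormValue(normByte);
    int approxDocLength = Math.round(1f / (decoded * decoded));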

If the quantization or the mixed-in boost is a problem, you can also make your own Similarity impl and store docLen yourself (e.g. as an int).

I'm experimenting with BM25 and the new language-model based similarities in Lucene 4. It turns out that the method decodeNormValue() is public for TFIDFSimilarity while protected for the rest. Of course I can extend these similarity classes to make this method public, so my question is whether there is a special reason why it shouldn't be public by default.

In general, I really like the new flexibility of the similarity measures provided by Lucene 4. I would, however, wish for a simpler and more intuitive API to extract the (exact) doc length from the index. I think these values should be provided by the API and should not be handled by the application (just wishful thinking).

I think it's actually an artifact from back when TFIDFSim was the only Similarity in Lucene, that its decodeNormValue is still public... really how the sim stores its stats (and what stats it stores) are private implementation details to it.

If you really want the int per field X doc ... you could clone BM25Sim, but then instead of norm.setByte(...) in computeNorm, use norm.setInt (and then remove all the encode/decodeNormValue stuff). Then in the Exact/Sloppy scorers, pull the int[] docLengths instead of byte[] norms. The downside of course is 4X the RAM per field X doc.
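
The indexing-time half of that change would look something like this (a sketch inside such a cloned Similarity; the scoring side must then also be changed to read ints):

    @Override
    public void computeNorm(FieldInvertState state, Norm norm) {
      // Store the raw field length (token count) instead of the encoded byte:
      norm.setInt(state.getLength());
    }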

You need to first get a TermsEnum (Terms.iterator(null)), seek to that term (termsEnum.seekExact: verify you got back "true", indicating that the term exists), then pull a DocsEnum from it and use its .advance method to skip to that docID (verify the returned docID is the one you advanced to), then call freq().
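
In code, those steps look roughly like this (a sketch; reader, docID and the field/term literals are placeholders, and early 4.x releases take an extra useCache flag in seekExact):

    Terms terms = MultiFields.getTerms(reader, "title");
    TermsEnum termsEnum = terms.iterator(null);
    if (termsEnum.seekExact(new BytesRef("the"), false)) {        // true => the term exists
      DocsEnum docsEnum = termsEnum.docs(MultiFields.getLiveDocs(reader), null);
      if (docsEnum.advance(docID) == docID) {                     // verify we landed on our doc
        int freq = docsEnum.freq();                               // occurrences of "the" in this doc
      }
    }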

Hi Mike, I'd like to know if there is any Analyzer that (at indexing time) doesn't tokenize on whitespace, but lets me manage multi-word phrases as single terms, so that the API described in this post will keep working as expected. In my use case I have a set of phrases for which I need TermsEnum.docFreq(). Is something like that possible? If there isn't anything like that, would it be trivial to implement a custom analyzer? Many thanks
