Description

One reason the direct files are so large for our modern corpora is that termids are poorly assigned. In classical indexing, termids are assigned in order of first observation. This works well, because more frequent terms are more likely to be met early in the corpus and so receive small termids. In single-pass inverted indexing (whose output can then be re-inverted into a direct file), on the other hand, termids are assigned in increasing lexicographic order, so a term's id bears no relation to its frequency. This results in inferior compression for the direct file.
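The effect can be illustrated with a toy example. The sketch below is not the indexer's code: it assumes that each document's termids are stored sorted and gap-encoded with Elias gamma, and the corpus, the function names (`gamma_bits`, `direct_bits`) and the use of document frequency as the ordering key are all made up for illustration.

```python
from collections import Counter


def gamma_bits(n: int) -> int:
    """Number of bits needed to Elias-gamma encode a positive integer."""
    if n < 1:
        raise ValueError("Elias gamma encodes positive integers only")
    return 2 * (n.bit_length() - 1) + 1


def direct_bits(docs, termid_of) -> int:
    """Total bits for the termid gaps of every document (assumed layout)."""
    total = 0
    for doc in docs:
        ids = sorted({termid_of[term] for term in doc})
        prev = -1  # so that the first gap is termid + 1, i.e. at least 1
        for tid in ids:
            total += gamma_bits(tid - prev)
            prev = tid
    return total


# Hypothetical four-document corpus.
docs = [
    ["the", "zebra", "ate"],
    ["the", "yak", "ate"],
    ["the", "zebra", "ran"],
    ["the", "apple"],
]

# Single-pass style: ids follow the lexicographic order of the terms.
vocab = sorted({term for doc in docs for term in doc})
lexicographic = {term: i for i, term in enumerate(vocab)}

# Alternative: ids follow descending document frequency (ties by term).
df = Counter(term for doc in docs for term in set(doc))
ranked = sorted(df, key=lambda term: (-df[term], term))
by_frequency = {term: i for i, term in enumerate(ranked)}

lexicographic_bits = direct_bits(docs, lexicographic)
frequency_bits = direct_bits(docs, by_frequency)
print(lexicographic_bits, frequency_bits)  # prints: 27 21
```

On this toy corpus the frequency-ordered assignment needs fewer bits, because the common terms share small ids and the gaps inside each document shrink. The size of the saving on a real corpus depends on the actual encoding and term distribution.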

In this issue, we reassign termids before the inverted2direct processes take place.
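A minimal sketch of such a reassignment step is given below. It assumes the new order is by descending term frequency, which this description implies but does not state, and it models the inverted index as a plain dictionary; `reassign_termids` and `remap_inverted` are hypothetical names, not existing functions.

```python
def reassign_termids(freq_by_old_id):
    """Map each old termid to a new one, most frequent term first.

    Ties are broken by the old termid so the mapping is deterministic.
    """
    order = sorted(freq_by_old_id, key=lambda old: (-freq_by_old_id[old], old))
    return {old: new for new, old in enumerate(order)}


def remap_inverted(inverted, old_to_new):
    """Re-key a {termid: postings} mapping with the new termids."""
    return {old_to_new[old]: postings for old, postings in inverted.items()}


# Hypothetical lexicon statistics: old (lexicographic) termid -> frequency.
freqs = {0: 1, 1: 2, 2: 1, 3: 4, 4: 1, 5: 2}
old_to_new = reassign_termids(freqs)

# Hypothetical postings: old termid -> list of docids.
inverted = {3: [0, 1, 2, 3], 1: [0, 1], 0: [3]}
remapped = remap_inverted(inverted, old_to_new)
```

With these statistics the most frequent term (old termid 3) becomes termid 0, and the direct file would then be built from the remapped ids rather than the lexicographic ones.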