Description

Query normalization is supposed to scale all scores uniformly by a simple
multiplier, but the child nodes in complex queries are presently getting
"extra" normalization applied to them. This has the effect of scaling
different subqueries by different amounts, changing the balance of the
subqueries within a complex query, interfering with IDF weighting and subtly
degrading relevancy.
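
To make the effect concrete, here is a minimal numeric sketch. It is a toy model, not Lucy's actual scoring code: the weights are invented, and the normalization factor is assumed to be the common 1/sqrt(sum of squared weights) form. It compares a single uniform pass against a nested node that normalizes its own children before the top-level pass runs.

```python
import math


def uniform_normalize(weights):
    """Scale every weight by one shared factor, 1/sqrt(sum of squares)."""
    factor = 1.0 / math.sqrt(sum(w * w for w in weights))
    return [w * factor for w in weights]


# Hypothetical raw IDF-derived weights for the query: A OR (B AND C)
a, b, c = 2.0, 1.0, 1.0

# Intended behaviour: one multiplier applied across the whole tree.
good_a, good_b, good_c = uniform_normalize([a, b, c])

# Buggy behaviour in this toy model: the nested (B AND C) node normalizes
# its own children first, then the top-level pass scales everything again.
inner_b, inner_c = uniform_normalize([b, c])
bad_a, bad_b, bad_c = uniform_normalize([a, inner_b, inner_c])

# Relative weight of A versus B under each scheme.
ratio_good = good_a / good_b
ratio_bad = bad_a / bad_b
```

Under these assumptions the uniform pass preserves the original 2:1 balance between A and B, while the doubly normalized tree skews it toward A, which is the kind of imbalance described above.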

Activity

The supplied "normalize.patch" file adds a boolean "subordinate" parameter to
Query_Make_Compiler() which defaults to false and indicates whether a Query is
a child of another. If "subordinate" is indeed false (because the Query is a
top-most node), then Query_Make_Compiler() implementations are now supposed to
invoke Compiler_Normalize() on the newly created Compiler object.
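
The control flow can be sketched roughly as follows. This is a toy Python model rather than the patch itself: the class names mirror Lucy's, but every method body, the `weight` attribute and the 1/sqrt(sum of squares) factor are assumptions made for illustration.

```python
import math


class Compiler:
    """Toy compiled query node: a weight plus optional child compilers."""

    def __init__(self, weight, children=()):
        self.weight = weight
        self.children = list(children)

    def sum_of_squared_weights(self):
        return self.weight ** 2 + sum(
            child.sum_of_squared_weights() for child in self.children
        )

    def apply_norm_factor(self, factor):
        self.weight *= factor
        for child in self.children:
            child.apply_norm_factor(factor)

    def normalize(self):
        # One factor computed over the whole subtree, applied uniformly.
        total = self.sum_of_squared_weights()
        self.apply_norm_factor(1.0 / math.sqrt(total) if total > 0 else 1.0)


class TermQuery:
    def __init__(self, idf):
        self.idf = idf

    def make_compiler(self, subordinate=False):
        compiler = Compiler(weight=self.idf)
        if not subordinate:  # only a top-most node normalizes
            compiler.normalize()
        return compiler


class ORQuery:
    def __init__(self, children):
        self.children = children

    def make_compiler(self, subordinate=False):
        # Children are always compiled as subordinate nodes.
        kids = [q.make_compiler(subordinate=True) for q in self.children]
        compiler = Compiler(weight=0.0, children=kids)
        if not subordinate:
            compiler.normalize()
        return compiler


query = ORQuery([TermQuery(2.0), ORQuery([TermQuery(1.0), TermQuery(1.0)])])
top = query.make_compiler()  # subordinate defaults to False
leaf_a = top.children[0]
leaf_b = top.children[1].children[0]
```

In this sketch normalization runs exactly once, at the root, so the nested ORQuery's leaves keep their original proportion to the sibling TermQuery.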

Giving Make_Compiler() an extra parameter and responsibility for invoking
Normalize() is technically an API change, but it should have little impact on
code in the wild. The only query types with ranking affected by this bug are
those with child nodes, such as ANDQuery, ORQuery and RequiredOptionalQuery,
but I'm unaware of anyone who has subclassed those. Single-node Query
subclasses (e.g. LucyX::Search::WildCardQuery) will need to have their
normalize() calls moved from their Compiler constructors to make_compiler(),
but IDF for a WildCardQuery is an imprecise notion to begin with.

This patch is also safe for any existing Searcher subclasses or other classes
which currently invoke Make_Compiler() – the default value of "false" for
"subordinate" is correct for such situations.

The patch applies cleanly against both trunk and the 0.2 branch; I plan to
commit it against both in a day or so.

Marvin Humphrey
added a comment - 11/Oct/11 23:14

That snippet shows a problem with highlighting, which is almost certainly
related to LUCY-182 rather than this issue.

The test failure in question also includes score differences, which
are indeed likely a result of the changes made in this issue. However,
scores should be more accurate now, so there shouldn't be a need
to reopen. The downstream test failures are regrettable, but the exact
scores generated by Lucy are not part of our public API.

In contrast, it looks like we're going to have to reopen LUCY-182.

Marvin Humphrey
added a comment - 08/Nov/11 17:38