Description

I think we should extend the query language capabilities of Terrier. A specific operator that should be added is the "synonym" operator, which allows the user to state that a given set of terms should be treated as synonyms. This is important for some languages other than English (e.g. Arabic, where there might be several transliterated words for a given term). Supporting wildcards might also be useful for some NLP applications (e.g. identification of relations between entities). Another area of improvement is the proximity operator, which should offer several variants that differ in whether the order of the terms matters and in the distance constraint.
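To make the proximity variants concrete, here is a minimal sketch of an ordered and an unordered window check over term positions. The function name `within_window`, its parameters, and the reading of "distance" as the difference between token positions are assumptions made for illustration; none of this is Terrier's API.

```python
from typing import Sequence


def within_window(pos_a: Sequence[int], pos_b: Sequence[int],
                  max_distance: int, ordered: bool = False) -> bool:
    """Return True if some occurrence of term a and some occurrence of term b
    are at most max_distance token positions apart.

    With ordered=True, b must also occur after a.
    Assumption: distance is the difference between token positions.
    """
    for pa in pos_a:
        for pb in pos_b:
            gap = pb - pa
            if ordered:
                if 0 < gap <= max_distance:
                    return True
            elif 0 < abs(gap) <= max_distance:
                return True
    return False


# Positions of two terms in one document: a at token 1, b at token 4.
print(within_window([1], [4], max_distance=3))                 # unordered
print(within_window([4], [1], max_distance=3, ordered=True))   # b precedes a
```

The nested loop is quadratic in the number of occurrences; it is kept for clarity, and a real implementation would probably walk the two sorted position lists in step.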

Activity

Mega patch committed. This adds the PostingListManager to handle the creation of IterablePostings (including for synonym groups of terms), and changes the Matching classes to use the PostingListManager for creating IterablePostings and scoring. ORIterablePosting and friends handle the combination of postings. Old Matching classes are retained as "FullNoPLM".
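As an illustration of what combining postings for a synonym group can look like, the sketch below merges several docid-sorted posting lists into one and sums the term frequencies per document. It is not the ORIterablePosting code: the `or_merge` name, the `(docid, frequency)` tuple layout, and the choice to sum frequencies are assumptions made for this example.

```python
import heapq
from typing import Iterable, Iterator, List, Tuple

Posting = Tuple[int, int]  # (docid, term frequency) -- hypothetical layout


def or_merge(posting_lists: List[Iterable[Posting]]) -> Iterator[Posting]:
    """Merge docid-sorted posting lists, one per synonym, into a single
    docid-sorted stream, summing frequencies when a document has several
    of the synonyms."""
    current_doc, current_tf = None, 0
    # heapq.merge lazily interleaves the already-sorted inputs.
    for docid, tf in heapq.merge(*posting_lists):
        if docid != current_doc:
            if current_doc is not None:
                yield current_doc, current_tf
            current_doc, current_tf = docid, 0
        current_tf += tf
    if current_doc is not None:
        yield current_doc, current_tf


# Two transliterations of the same word, each with its own posting list.
variant_1 = [(1, 2), (4, 1)]
variant_2 = [(1, 1), (3, 5)]
print(list(or_merge([variant_1, variant_2])))
```

A weighting model can then score the merged stream as if it belonged to a single term.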

Craig Macdonald
added a comment - 30/Mar/11 11:09 PM

Gianni, if we wish to support nested boolean expressions as queries, then surely we need to run multiple matchings for every query, and then combine the results?

The present solution (which is mostly unary expressions) uses multiple TermScoreModifiers and DocumentScoreModifiers, often with second passes of the inverted file to ensure that the correct expressions are matched. For nested operators, that strategy would not be possible.
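The idea of running one matching per sub-expression and then combining the results can be sketched as a recursive evaluation over sets of docids. This is only a toy model under stated assumptions: the tuple-based query representation, the `evaluate` function, and the in-memory `index` dictionary are hypothetical, and nothing here reflects Terrier's Matching classes or score modifiers.

```python
from typing import Dict, Set, Tuple, Union

# A query is a term, or a tuple (operator, operand, ...) -- hypothetical form.
Query = Union[str, Tuple]


def evaluate(query: Query, index: Dict[str, Set[int]],
             all_docs: Set[int]) -> Set[int]:
    """Evaluate a nested boolean query, returning the matching docids."""
    if isinstance(query, str):
        # Leaf: one "matching" for a single term.
        return set(index.get(query, set()))
    op, *operands = query
    # One recursive matching per sub-expression, combined below.
    results = [evaluate(q, index, all_docs) for q in operands]
    if op == "AND":
        return set.intersection(*results)
    if op == "OR":
        return set.union(*results)
    if op == "NOT":
        return all_docs - results[0]
    raise ValueError(f"unknown operator: {op}")


index = {"a": {1, 2, 3}, "b": {2, 3}, "c": {3, 4}}
all_docs = {1, 2, 3, 4}
# a AND NOT (b OR c)
print(evaluate(("AND", "a", ("NOT", ("OR", "b", "c"))), index, all_docs))
```

The sketch only filters documents. Combining it with best-match scoring would still require deciding which of the matched terms contribute to the score.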

Craig Macdonald
added a comment - 17/Feb/09 2:01 PM

It would also be necessary to support efficient boolean retrieval with nested boolean formulas, for when one wants to activate it. I am thinking of Google-style retrieval ( + terms) or of the legal track, where queries are complex boolean queries.

Gianni Amati
added a comment - 17/Feb/09 1:38 PM

Ok, let's use this issue to discuss all of the proposed operators, but implementations are likely to come in other separate issues. The discussion should cover the proposed syntax of the operators and the semantics they encapsulate.

Firstly, it's probably worth reiterating the existing query constructs. The ambiguity here is caused by the use of best match semantics in combination with constructs which suggest filtering of some form.

| syntax | semantics | scoring terms |
| --- | --- | --- |
| a | retrieve documents containing a | a |
| a b | retrieve documents containing a and/or b | a b |
| +a b c | retrieve documents containing a and possibly containing b and/or c | a b c |
| -a b c | retrieve documents containing b and/or c, but no a | b c |
| f1:a | retrieve documents containing a in field f1 | a |
| f1:a b | retrieve documents containing a in field f1, and possibly containing b | a b |
| -f1:a b | retrieve documents containing b, but where a does not occur in field f1 | b |
| "a b" c | retrieve documents containing a and b as an adjacent phrase, which may or may not contain c | a b c |
| f1:"a b" c | retrieve documents containing a and b as an adjacent phrase within field f1, in a document which may or may not contain c | a b c |
| "a b"~10 | retrieve documents which contain a and b within 10 tokens of each other | a b |
| c -"a b" | retrieve documents which contain c, and which do not contain a and b as an adjacent phrase | c |
| c -(a b) | retrieve documents which contain c, but do not contain a or b | c |
| c -f1:(a b) | retrieve documents which contain c, but which do not contain a or b in field f1 | c |

There is also the ^ (hat) operator for controlling the weights on an individual term.
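The interplay between best-match retrieval and the filtering constructs can be illustrated for the simplest cases, plain terms with an optional + or - prefix. The sketch below is a toy model and not Terrier's query parser: `split_query`, `match`, and the in-memory `index` are hypothetical names, and fields, phrases, proximity, parentheses, and the ^ operator are deliberately left out.

```python
from typing import Dict, List, Set, Tuple


def split_query(query: str) -> Tuple[List[str], List[str], List[str]]:
    """Split a query into (required, excluded, scoring) terms."""
    required, excluded, scoring = [], [], []
    for token in query.split():
        if token.startswith("+"):
            required.append(token[1:])
            scoring.append(token[1:])   # required terms are still scored
        elif token.startswith("-"):
            excluded.append(token[1:])  # excluded terms are never scored
        else:
            scoring.append(token)
    return required, excluded, scoring


def match(query: str, index: Dict[str, Set[int]]) -> Set[int]:
    """Return docids retrieved by the query under best-match semantics."""
    required, excluded, scoring = split_query(query)
    # Best match: any document containing at least one scoring term.
    candidates = set().union(*(index.get(t, set()) for t in scoring))
    for term in required:   # filter: must contain every + term
        candidates &= index.get(term, set())
    for term in excluded:   # filter: must not contain any - term
        candidates -= index.get(term, set())
    return candidates


index = {"a": {1, 2}, "b": {2, 3}, "c": {4}}
print(match("+a b c", index))  # documents containing a
print(match("-a b c", index))  # documents with b and/or c, but no a
```

The scoring terms returned by `split_query` are the terms that would be passed to the weighting model for each retrieved document.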

Craig Macdonald
added a comment - 16/Feb/09 11:27 PM