Jan Høydahl
added a comment - 19/Apr/12 00:48 The query is parsed into:
+(id:name:test^10.0 | text:name:test^0.5 | cat:name:test^1.4 | manu:name:test^1.1 | name:name:test^1.2 | features:name:test | sku:nametest^1.5)
suggesting that eDismax falls back to literal matching because the field name (name is not found. Perhaps introduced by SOLR-3026 ?
A temporary workaround seems to be to insert a space before all field names:
q= ( name:test )

A suggestion, how would it be to do a simple "cleanup" as a first step of query string handling by removing the outermost parantheses (if there are any)?
This won't touch the inner edismax logic and produce another bad sideeffect.

Bernd Fehling
added a comment - 19/Apr/12 08:25 A suggestion, how would it be to do a simple "cleanup" as a first step of query string handling by removing the outermost parantheses (if there are any)?
This won't touch the inner edismax logic and produce another bad sideeffect.

Bernd Fehling
added a comment - 27/Apr/12 14:46 You're right.
New patch is doing the job now but I'm not very happy with it.
The patch should be a more general solution but this is currently a specific solution.
At least it does not change the original query as a replacement from "(" to "( + space" would do.

Bernd Fehling
added a comment - 30/Apr/12 09:29 Now this is a general patch which passes all tests.
It can handle one or more parentheses directly before a valid field name.
Please test and give feedback.

Bernd Fehling
added a comment - 10/May/12 13:02 If there are no further complains about the patch can it then be committed to 3.6.1 / trunk and the issue closed?
Who is doing the commit / has the permissions to commit?

I think we need better test coverage before this is ready.
We should add a bunch of tests with queries involving parens, to verify that they behave as expected. Both tests involving parens as intended grouping for boolean precedence as well as parens not intended as boolean sugar but as plain text pasted from somewhere:

q=(foo OR title:bar) AND (title:foo OR bar)
q=Meeting at noon (room:Auditorium)

The first should obey the instructed boolean order, while the last should return docs with the literal token "room:Autirium" in any of the qf fields.

The key goal of dismax is to be very robust so people can paste in all kind of garbage, and get matches. So if the query parses as valid boolean logic, that should be used.

Jan Høydahl
added a comment - 10/May/12 13:55 I think we need better test coverage before this is ready.
We should add a bunch of tests with queries involving parens, to verify that they behave as expected. Both tests involving parens as intended grouping for boolean precedence as well as parens not intended as boolean sugar but as plain text pasted from somewhere:
q=(foo OR title:bar) AND (title:foo OR bar)
q=Meeting at noon (room:Auditorium)
The first should obey the instructed boolean order, while the last should return docs with the literal token "room:Autirium" in any of the qf fields.
The key goal of dismax is to be very robust so people can paste in all kind of garbage, and get matches. So if the query parses as valid boolean logic, that should be used.

Shoot me an enhanced unit test which covers your requests and i will look into this.

But, while looking through all the test cases I think we really need a clear definition of rules and define a BNF syntax description for edismax and then implement the BNF syntax. This has two advantages, the user knows how to construct a valid query and we can clean up the patch work inside edismax. This can also obey fallback mechanism to always return a valid query.
How about that?

Bernd Fehling
added a comment - 11/May/12 08:22 Shoot me an enhanced unit test which covers your requests and i will look into this.
But, while looking through all the test cases I think we really need a clear definition of rules and define a BNF syntax description for edismax and then implement the BNF syntax. This has two advantages, the user knows how to construct a valid query and we can clean up the patch work inside edismax. This can also obey fallback mechanism to always return a valid query.
How about that?

Bernd Fehling
added a comment - 05/Jul/12 07:09 - edited I was willing to supply a final fix to this and was hoping that it will make it to release 4.x.
But unfortunately:
I got no enhanced unit test
noone comitted this/my patch either
the problem is still there
So I said "was willing", thats true, I gave up on this and thinking now about switching to ElasticSearch because they really appreciate any help.

I have followed it through this far but have not yet had the chance to go the last mile until committing, but I am definitely keen to pick it up again after summer holidays and parental leave, hopefully before. The reason I unassigned myself is to signal to the other committers that I'm not actively working on this and let others step in if they wish.

This is the way Apache works - we are all volunteers, and I am sure that with some patience this will make it through in time for 4.0 final. You've done a great job so far with the patch. It may be "final" and good to go, but personally I'd write some more tests since this particular area has been lacking - before committing.

Jan Høydahl
added a comment - 05/Jul/12 10:05 Bernd, I agree that this is a bug that absolutely should be fixed.
I have followed it through this far but have not yet had the chance to go the last mile until committing, but I am definitely keen to pick it up again after summer holidays and parental leave, hopefully before. The reason I unassigned myself is to signal to the other committers that I'm not actively working on this and let others step in if they wish.
This is the way Apache works - we are all volunteers, and I am sure that with some patience this will make it through in time for 4.0 final. You've done a great job so far with the patch. It may be "final" and good to go, but personally I'd write some more tests since this particular area has been lacking - before committing.

I took a quick look at the current edismax code.
The major flaw seems to be the attempt to check user fields in splitIntoClauses(). That method was never meant to be an exact replication of Lucene query parsing. Is there a reason we aren't checking user fields in the parser itself (the ExtendedSolrQueryParser.getFieldQuery)?

edit: thinking a little more about it, I guess one reason is so whitespace or other potentially significant syntax isn't discarded?

Yonik Seeley
added a comment - 11/Jul/12 15:12 - edited I took a quick look at the current edismax code.
The major flaw seems to be the attempt to check user fields in splitIntoClauses(). That method was never meant to be an exact replication of Lucene query parsing. Is there a reason we aren't checking user fields in the parser itself (the ExtendedSolrQueryParser.getFieldQuery)?
edit: thinking a little more about it, I guess one reason is so whitespace or other potentially significant syntax isn't discarded?

Thanks Bernd, this looks like an improvement.
After some ad-hoc testing, it seems we still have problems with q=(+id:42)

Another minor concern: the change to clause.field to exclude things like '(' also means that when it's not a valid lucene query, our reconstructed query will currently drop the paren.

Example: A query of (a:b with a qf=id correctly produces id:"(a:b"
but a query of (id:b produces id:b
That type of thing should really only affect exact match type fields where punctuation isn't dropped - not sure how much of an issue it really is.

Yonik Seeley
added a comment - 12/Jul/12 17:36 Thanks Bernd, this looks like an improvement.
After some ad-hoc testing, it seems we still have problems with q=(+id:42)
Another minor concern: the change to clause.field to exclude things like '(' also means that when it's not a valid lucene query, our reconstructed query will currently drop the paren.
Example: A query of (a:b with a qf=id correctly produces id:"(a:b"
but a query of (id:b produces id:b
That type of thing should really only affect exact match type fields where punctuation isn't dropped - not sure how much of an issue it really is.

Yonik Seeley
added a comment - 13/Jul/12 09:38 I fixed the issue with +,- and committed to trunk and 4x
http://svn.apache.org/viewvc?view=revision&revision=1361091
note: I'll start adding the commit URL since JIRA is not currently linking up the commit automatically.

Checking CHANGES.TXT and the revision history, I don't think this fix got backported to 3.6.1 when it was backported to 4.x. I would lobby for it to be included in 3.6.2 since it is such a serious problem.

Jack Krupansky
added a comment - 14/Aug/12 01:34 Checking CHANGES.TXT and the revision history, I don't think this fix got backported to 3.6.1 when it was backported to 4.x. I would lobby for it to be included in 3.6.2 since it is such a serious problem.

Shawn Heisey
added a comment - 08/Oct/12 18:40 I have the relevant pieces of schema.xml ready to include, but I will wait until I know whether this is the appropriate issue, or a new issue should be filed.

Leonhard, your use case seems rather different from that of this Jira.

I presume that you are referring to the generated phrase query boost being a little odd, or maybe that the phrase boost should not occur when the terms are queried against fields not listed in the "pf" parameter. Feel free to raise that as a separate issue.

You refer to "splitting", but I don't see any term splitting in this example.

Jack Krupansky
added a comment - 28/Nov/12 14:00 Leonhard, your use case seems rather different from that of this Jira.
I presume that you are referring to the generated phrase query boost being a little odd, or maybe that the phrase boost should not occur when the terms are queried against fields not listed in the "pf" parameter. Feel free to raise that as a separate issue.
You refer to "splitting", but I don't see any term splitting in this example.

Ok, I understand.
The phrase boost queries are separated from the normal query expansion via the qf paramter.

But, all terms are (equally) qualified by a field (field sw for the terms a and b, field ti for the terms c and d).
Why do the eDismax handler only use the terms b and d to build the phrase boost query?
Isn't it a bug?

Leonhard Maylein
added a comment - 28/Nov/12 15:35 Ok, I understand.
The phrase boost queries are separated from the normal query expansion via the qf paramter.
But, all terms are (equally) qualified by a field (field sw for the terms a and b, field ti for the terms c and d).
Why do the eDismax handler only use the terms b and d to build the phrase boost query?
Isn't it a bug?

Jack Krupansky
added a comment - 28/Nov/12 16:29 Yes, it looks like a bug, but distinct from this current Jira. Actually, two bugs:
1. Fielded terms should not be used in phrase boost except for the specified field.
2. Some terms appear to have been skipped for phrase boost.