Description

QueryParser cannot handle escaped quotes when inside a phrase. Escaped quotes not in a phrase are not a problem. This can be added to TestQueryParser.testEscaped() to demonstrate the issue - the second assert throws an exception:

thanks for finding this problem in the queryparser. I attach a patch file that fixes this bug. Now the queryparser does not recognize an escaped quote inside a phrase as the phrase terminator anymore.

I also updated the testcase org.apache.lucene.queryParser.TestQueryParser to test escaped quotes within phrases. This testcase fails with the old version of the queryparser and runs successful with the patched version. I added the following three tests to the testEscaped() method:

Please notice that (3) is different from your second suggested assert. You assume that the queryparser unescapes the quotes inside the phrase, but the queryparser does not unescape any escaped characters inside a phrase. You can see that in (2), where the escaped + (plus) character does not become unescaped.

Michael Busch
added a comment - 07/Aug/06 23:21 Tomislav,
thanks for finding this problem in the queryparser. I attach a patch file that fixes this bug. Now the queryparser does not recognize an escaped quote inside a phrase as the phrase terminator anymore.
I also updated the testcase org.apache.lucene.queryParser.TestQueryParser to test escaped quotes within phrases. This testcase fails with the old version of the queryparser and runs successful with the patched version. I added the following three tests to the testEscaped() method:
(1) assertQueryEquals("a \\\"b c\\\" d", a, "a \"b c\" d");
(2) assertQueryEquals("\"a +b c d\"", a, "\"a +b c d\"");
(3) assertQueryEquals("\"a \\\"b c\\\" d\"", a, "\"a \\\"b c\\\" d\"");
Please notice that (3) is different from your second suggested assert. You assume that the queryparser unescapes the quotes inside the phrase, but the queryparser does not unescape any escaped characters inside a phrase. You can see that in (2), where the escaped + (plus) character does not become unescaped.
Michael

> You assume that the queryparser unescapes the quotes inside
> the phrase, but the queryparser does not unescape any
> escaped characters inside a phrase. You can see that in (2),
> where the escaped + (plus) character does not become
> unescaped.

...I'm not sure if this is a fair comaprison. Query parser may not unescape things like the + in this example....
foo "bar +baz"
...but i believe that is because there is no reason for it to do so, if what you intended was to match on the + character without a preceding s\ you could have just written...
foo "bar +baz"
...ie: the plus is already escaped by the nature of being inside the phrase bounding quotes

this same justification does not hold for the double-quote character however ... there is no way to inlcude a " character inside of a phrase without escaping, so it seems like query parser should be unescaping it automatically for you.

Hoss Man
added a comment - 20/Aug/06 20:03 > You assume that the queryparser unescapes the quotes inside
> the phrase, but the queryparser does not unescape any
> escaped characters inside a phrase. You can see that in (2),
> where the escaped + (plus) character does not become
> unescaped.
...I'm not sure if this is a fair comaprison. Query parser may not unescape things like the + in this example....
foo "bar +baz"
...but i believe that is because there is no reason for it to do so, if what you intended was to match on the + character without a preceding s\ you could have just written...
foo "bar +baz"
...ie: the plus is already escaped by the nature of being inside the phrase bounding quotes
this same justification does not hold for the double-quote character however ... there is no way to inlcude a " character inside of a phrase without escaping, so it seems like query parser should be unescaping it automatically for you.
correct?

> it seems like query parser should be unescaping it automatically for you

That's my take. \" should return the user "
To enable representation of a real backslash followed by a quote, or a backslash at the end of a line, backslashes also need escaping (with another backslash... same as C, Java, etc).

Now the question is if + should be returned as + or +
If we were starting from scratch, I'd say remove the backslash (return +) since that's most consistent with escaping mechanisms in other languages, and it does the right thing if a user escapes something they don't need to. So does anyone currently search for backslash in a phrase query???

Yonik Seeley
added a comment - 20/Aug/06 22:02 > it seems like query parser should be unescaping it automatically for you
That's my take. \" should return the user "
To enable representation of a real backslash followed by a quote, or a backslash at the end of a line, backslashes also need escaping (with another backslash... same as C, Java, etc).
Now the question is if + should be returned as + or +
If we were starting from scratch, I'd say remove the backslash (return +) since that's most consistent with escaping mechanisms in other languages, and it does the right thing if a user escapes something they don't need to. So does anyone currently search for backslash in a phrase query???

Michael Krkoska
added a comment - 18/Oct/06 10:47 I just experienced this bug in my search. The patch works for me, though I found it rather cumbersome to build lucene including javacc.
Why is the priority of this bug low? What do we have to do to include tha patch in the release?

You are right, my patch only prevents the ParseException to be thrown. However, it is still not possible to search for a phrase query that contains quotes.

I think the simple solution is to call discardEscapeChar(String) for the phrase string. Then all escaped characters within a phrase become unescaped.

Now the question is again (Yonik raised it already): Do we want this behavior or do we rather want to maintain backward compatibility for already existing phrase queries that include a backslash? If we prefer compatibility over consistency then one solution would be to add a new method discardEscapeChar(String, char[]) that only unescapes certain characters, in this case " and \.

Michael Busch
added a comment - 08/Nov/06 00:40 Sorry, it took me a while to take care of this patch.
You are right, my patch only prevents the ParseException to be thrown. However, it is still not possible to search for a phrase query that contains quotes.
I think the simple solution is to call discardEscapeChar(String) for the phrase string. Then all escaped characters within a phrase become unescaped.
Now the question is again (Yonik raised it already): Do we want this behavior or do we rather want to maintain backward compatibility for already existing phrase queries that include a backslash? If we prefer compatibility over consistency then one solution would be to add a new method discardEscapeChar(String, char[]) that only unescapes certain characters, in this case " and \.
If we can make a decision here I'm happy to provide a new patch.

Yonik Seeley
added a comment - 08/Nov/06 01:56 The number of people searching for '\' must be minuscule (to non-existent), so I'm personally OK with doing what makes the most sense.
In general, I think it's more user friendly and less surprising to allow a backslash escape of something that doesn't need it (while removing the backslash).
So
=> \
\" => "
+ => + // not a necessary escape if in a quoted string, but consistent
Then there are the "control" chars, \r\n\t, etc.
At some point we might want to allow unicode escapes, and keeping backslash as a general escape char will facilitate this.

Michael Busch
added a comment - 17/Nov/06 19:39 I actually found another problem in the current unescaping code:
junit.framework.AssertionFailedError: Query /a\\\+b/ yielded /a +b/, expecting /a+b/
The reason is that the method discardEscapeChar() cannot handle an escaped backslash followed by another escaped character.
I attach a new patch that includes the following:
the QueryParser unescapes any escaped characters inside a phrase and can handle escaped quotes inside phrases
for consistency reasons unescaping now is also being done in quoted range queries
new implementation of discardEscapeChar() that solves the above mentioned problem (escaped backslash followed by another escaped character)
new tests for TestQueryParser that tests the changes
All unit tests pass.

Yonik Seeley
added a comment - 17/Nov/06 21:09 Thanks Michael, I haven't had a chance to look at the code yet, but I agree with your description of all the changes. So unless concerns arise, I'll commit this after review.