Thanks. I got the idea for the non-greedy match (as explained in your example). How will it work when using a greedy match (.*)? Say I have the line

line1 brown fox

and I use perl -ne 'print if /(line.*)(?=.*fox)/'. The first dot-star (line.*) will then match the entire string (line1 brown fox) and should backtrack when the lookahead assertion is checked. But the lookahead also has a dot-star in it. Will that cause backtracking too, and how?

As far as I understand it, it is not really a question of greedy versus non-greedy matching, nor a question of backtracking. My first two examples used a greedy match, and I used a non-greedy match just to show how the regex engine proceeded to get the match.

The really important point was the difference between the first and the second example: can the regex engine find a way to make the negative assertion fail (and thus the overall regex succeed)?

This is an issue of semantics of negative assertions. The regex engine is trying very very hard to get a match, and the negative assertion will make it fail only if it is bound to fail. If the regex engine is able to find a match path in which it can circumvent the negative assertion, it will use it. That's the essence of the difference between my first and second example.

Using the Perl regex debugger, I hope to show why your second regex, /(line.*)(?!.*fox)/, succeeded in matching the first line, line1 brown fox, when you really meant for it to fail. The output from the debugger is kind of dense, but I'll try to describe it so that it makes some sense. I got some explanation of what the numbers in the output mean (they are bytecode numbers for the bytecode tree that the compilation phase prepares for the regex engine). See a good explanation of them in the last post in this web page here. Below is the output from the debugger for your regex - I bolded the start of the compilation phase and then the start of the execution phase.

0 <> <line1 brow> | 3:EXACT <line>(5) made the exact match (<line> in the left bracket below). Note that the unmatched text is in the bracket to the right, <1 brown fo>.

4 <line> <1 brown fo> | 5:STAR(7) REG_ANY can match 11 times out of 2147483647... Now the STAR * quantifier will consume the complete text (next line, <e1 brown fox>). NOTE: all the text has been consumed, so there is no 'fox' left for the negative lookahead to find. Thus, the entire match will succeed.
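A trace like the one above can be reproduced with the debug mode of the re pragma, which ships with Perl. This is only a minimal sketch, assuming the input is the literal string line1 brown fox; the trace goes to STDERR, and its exact layout varies between Perl versions.

```perl
use strict;
use warnings;

my $line = "line1 brown fox";   # assumed sample input

my $matched;
{
    # 'use re "debug"' is lexically scoped: only regexes compiled inside
    # this block dump their compiled program and execution trace to STDERR.
    use re 'debug';
    $matched = ($line =~ /(line.*)(?!.*fox)/) ? 1 : 0;
}

print "matched: $matched\n";
```

From the command line, perl -Mre=debug -ne 'print if /(line.*)(?!.*fox)/' text.txt should give the same kind of trace for each input line.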

This was a learning experience for me, and I now know somewhat how to read the debugger output. :-) (At least for this fairly simple case.)

Hope my explanation makes sense. There was no backtracking and the regex succeeded (where you wanted it to fail), because the first '.*' ate the entire line, leaving no 'fox' for the lookahead to detect and cause it to fail.
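The same conclusion can be checked without the debugger by looking at what the capture group took and what is left after it. A small sketch, assuming the literal sample line from this thread; the variable names are only illustrative.

```perl
use strict;
use warnings;

my $line = "line1 brown fox";   # assumed sample input

my ($captured, $rest);
if ($line =~ /(line.*)(?!.*fox)/) {
    $captured = $1;
    # $+[1] is the offset just past the end of capture group 1,
    # i.e. the position at which the lookahead was tested.
    $rest = substr($line, $+[1]);
}

print "captured: '$captured'\n";
print "left for the lookahead: '$rest'\n";
```

If the second line prints an empty string, the lookahead was tested at the very end of the line, where there is no 'fox' for it to see.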

In the second code, only ?! was replaced with ?= in the second group (?=fox). Now with this code, my first capture group (line.*) is matching up to 12 characters (line1 brown ) instead of all 15 characters (code 1). Shouldn't it match all the way to the end of the line? Why are changes in the second group affecting the regex match of the first group?
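The 15 versus 12 character observation can be reproduced with the sketch below, assuming the two patterns differ only in ?! versus ?= as described. With (?=fox), the greedy .* first takes the whole line, the lookahead fails at the end, and the engine gives characters back one at a time until the lookahead sees 'fox' right after the capture.

```perl
use strict;
use warnings;

my $line = "line1 brown fox";   # assumed sample input

# In list context a match returns the capture groups.
my ($neg) = $line =~ /(line.*)(?!fox)/;   # negative lookahead
my ($pos) = $line =~ /(line.*)(?=fox)/;   # positive lookahead

printf "negative: '%s' (%d chars)\n", $neg, length($neg);
printf "positive: '%s' (%d chars)\n", $pos, length($pos);
```

So the lookahead consumes no text itself, but it does decide which end position of the first group is acceptable.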

In line DB<4>, the regex engine CAN'T avoid matching the negative assertion so that the overall regex fails. In line DB<5>, the regex engine CAN find a way of matching the string and not matching negative assertion.

Line DB<6> shows how it can do it: the first subexpression matches the whole string "tip top" (because of greediness) and, therefore, the negative assertion can no longer match and fails, so that the overall regex succeeds. Line DB<7> shows the same with a non-greedy quantifier: the regex engine actually matches just enough with the first subexpression to prevent the negative assertion from succeeding.

My previous explanations were perhaps not clear enough; I hope this makes things clearer.

I think that Chris's first reply was correct. The /.*/ should be part of the assertion, not the match.

Yes, sure, but why?

As I said, the whole point is whether the regex engine is able to find a matching path where the negative assertion fails. If the /.*/ is part of the match, then the regex engine can find a matching path where the negative assertion fails, and the regex engine will not backtrack to try to make the negative assertion succeed (and thus the overall regex fail).

Just another couple of examples:

Code

DB<1> $_ = "line1 brown fox\n";

DB<2> print if /line\d+\s+\w+\s+(?!fox)/;

DB<3> print if /line\d+\s+\w+.+(?!fox)/;
line1 brown fox

DB<4> print if /line\d+\s+\w+\s+(?!fix)/;
line1 brown fox

If the match subexpression is able to match part of the characters of the negative lookahead subexpression, the assertion will fail and the overall match will be able to succeed. I have shown in my previous post what the regex engine is able to match in such a case, both with a greedy and with a non-greedy quantifier in the match subexpression.
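To see what each of the three regexes above actually consumes, the debugger session can be turned into a small script. A sketch only, using the same input string as in DB<1>; the array names are illustrative.

```perl
use strict;
use warnings;

my $line = "line1 brown fox\n";   # same string as in DB<1>

my @patterns = (
    qr/line\d+\s+\w+\s+(?!fox)/,   # DB<2>
    qr/line\d+\s+\w+.+(?!fox)/,    # DB<3>
    qr/line\d+\s+\w+\s+(?!fix)/,   # DB<4>
);

my @results;
for my $re (@patterns) {
    # $& holds the text consumed by the whole match
    # (a lookahead itself consumes nothing).
    push @results, ($line =~ $re) ? $& : undef;
}

for my $i (0 .. $#patterns) {
    my $shown = defined $results[$i] ? "'$results[$i]'" : 'no match';
    print "$patterns[$i] => $shown\n";
}
```

In the second pattern the .+ swallows ' fox', so the lookahead is tested only after the word it was supposed to reject.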

The original question was "why does one of my regexes fail to get the expected result in one of three test cases?". You have shown that none of the six test cases were doing exactly what was intended. The correct results were something like the broken clock that is right twice a day.

To form a correct regex, we need a reasonably accurate statement of the problem. (How else could we possibly tell if we are right?) I propose: "The regex should match the string if the string contains 'line' and does (or does not in the negative case) contain 'fox' in the remainder of the string." Note that when we translate this sentence into regex symbols, the assertion must apply to the entire remainder, not some substring of it.

This is what I have done. The issue of greediness never arises because there are no greedy (or non-greedy) operators in the match. The following test demonstrates that this pair of regexes passes all six test cases.
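The original test is not reproduced here, so the following is only a sketch of the idea: the patterns /line(?=.*fox)/ and /line(?!.*fox)/ are an assumption based on the problem statement above (the .* sits inside the assertion, and the match itself has no quantifier), and the sample lines are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical sample lines, not the six test cases from the thread.
my @lines = (
    "line1 brown fox",
    "line2 brown dog",
    "fox before line3",
    "brown fox",
);

my (@with_fox, @without_fox);
for my $l (@lines) {
    # 'line' followed somewhere later by 'fox'
    push @with_fox, $l if $l =~ /line(?=.*fox)/;
    # 'line' with no 'fox' anywhere in the remainder
    push @without_fox, $l if $l =~ /line(?!.*fox)/;
}

print "with fox:    $_\n" for @with_fox;
print "without fox: $_\n" for @without_fox;
```

Because the only thing the match can consume is the literal 'line', the engine has no alternative path that moves the assertion past the 'fox'.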

I think there must be a misunderstanding here. Although I was replying to your post, it did not mean in any way that I disagreed with it. I only meant to explain the reason why your code succeeded and the OP's code did not.

Quote

The issue of greediness never arises because there are no greedy (or non-greedy) operators in the match.

Agreed, the issue of greediness never arises in your code, since you have no quantifier at all in your matching part, and that's why it works easily (and, BTW, I fully agree that's the way it should be).

But my whole point since the beginning was that the OP's code:

Code

perl -ne 'print if /(line.*)(?!.*fox)/' text.txt

did not work as expected because of the ".*" within the match subexpression, which gave the regex engine an opportunity to find a match not impaired by the negative assertion (and, BTW, a non-greedy quantifier, ".*?", would have exactly the same result in this specific case).
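The greedy and non-greedy variants can be compared side by side. A minimal sketch, assuming the literal sample line from this thread: both patterns match, and only the amount of text taken by the capture group differs.

```perl
use strict;
use warnings;

my $line = "line1 brown fox";   # assumed sample input

# In list context a match returns the capture groups
# (or an empty list if the match fails).
my ($greedy) = $line =~ /(line.*)(?!.*fox)/;
my ($lazy)   = $line =~ /(line.*?)(?!.*fox)/;

print "greedy captured: '$greedy'\n";
print "lazy captured:   '$lazy'\n";
```

The lazy version stops as soon as the remaining text no longer contains a complete 'fox', which is already enough to satisfy the negative lookahead.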

And my whole point in my various posts was not so much to provide a solution (although I think I actually did), but rather to answer the OP's deeper question and to explain to the OP why his or her code did not work, namely that the ".*" within the match subexpression prevented the negative lookahead assertion from matching anything, thereby allowing the whole regex to match despite the negative assertion, which was obviously not the desired result. And all of my examples were aimed at showing this. The two examples in my very first post already showed how the presence of ".+" within the match completely changed the result (made the negative assertion fail), whereas, without that ".+", the negative assertion succeeded (and the overall regex thus failed).

These things are not very easy to explain, and quite possibly I was not clear enough (maybe it is also due to English not being my mother tongue -- sorry if my English is too poor). Still, I think I showed fairly clearly, several times and already in my very first post and its code examples, that the failure of the negative lookahead assertion was due to the catch-all-the-rest-of-the-line ".*" (or, in this case, the equivalent ".+") within the match subexpression preceding the negative assertion.

I think that I was the one who was not clear. My first paragraph was intended to convey my agreement with you. (You answered the original question very well.) The remainder of my post was to provide information on how to go about writing the correct regex and verifying that it is correct. I suspect that the OP's original mistake had more to do with his failing to understand his own requirement than it did with his understanding of the regex engine.

Never doubt your ability with English. I wish that I could write as well as you. Until you told us otherwise, it never occurred to me that English was not your native language. Good Luck, Bill