How can I use a regular expression to match text that is between two strings, where those two strings are themselves enclosed by two other strings, with any amount of text between the inner and outer enclosing strings?

I got this answer:

/outer-start.*?inner-start(.*?)inner-end.*?outer-end/

I would now like to know how to exclude certain strings from the text between the outer enclosing strings and the inner enclosing strings.

For example, if I have this text:

outer-start some text inner-starttext-that-i-wantinner-end some more text outer-end

I would like 'some text' and 'some more text' not to contain the word 'unwanted'.

In other words, this is OK:

outer-start some wanted text inner-starttext-that-i-wantinner-end some more wanted text outer-end

But this is not OK:

outer-start some unwanted text inner-starttext-that-i-wantinner-end some more unwanted text outer-end

Or, to explain further: the parts of the expression between the outer and inner delimiters in the answer above should exclude the word 'unwanted'.

6 Answers

Replace the first and last (but not the middle) .*? with (?:(?!unwanted).)*?. (Where (?:...) is a non-capturing group, and (?!...) is a negative lookahead.)

However, this quickly runs into corner cases and caveats in any real (rather than example) use. If you ask about what you're really doing, with real examples (even simplified ones) instead of made-up examples, you'll likely get better answers.
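As a rough sketch of that substitution, here is the pattern from the question with its first and last .*? swapped out, run against the two sample lines from the question. Python's re module is an assumption on my part (the question doesn't name a regex flavor), and the variable names are made up for illustration.

```python
import re

# The original pattern, with the first and last .*? replaced by a
# "tempered" version that refuses to step over the word "unwanted".
# The middle (.*?) is left alone, as the answer suggests.
pattern = re.compile(
    r"outer-start(?:(?!unwanted).)*?"
    r"inner-start(.*?)inner-end"
    r"(?:(?!unwanted).)*?outer-end"
)

ok = ("outer-start some wanted text inner-starttext-that-i-want"
      "inner-end some more wanted text outer-end")
bad = ("outer-start some unwanted text inner-starttext-that-i-want"
       "inner-end some more unwanted text outer-end")

ok_match = pattern.search(ok)    # matches; group 1 holds the inner text
bad_match = pattern.search(bad)  # no match, so this is None

if ok_match:
    print(ok_match.group(1))
print(bad_match)
```

Note that . does not match newlines here unless you pass re.DOTALL, which is one of the caveats you would hit with multi-line input.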

Tola, resurrecting this question because it had a fairly simple regex solution that wasn't mentioned. This problem is a classic case of the technique explained in this question to "regex-match a pattern, excluding..."

The idea is to build an alternation (a series of |) where the left sides match what we don't want in order to get it out of the way... then the last side of the | matches what we do want, and captures it to Group 1. If Group 1 is set, you retrieve it and you have a match.

So what do we not want?

First, we want to eliminate the whole outer block if there is unwanted between outer-start and inner-start. You can do it with:

outer-start(?:(?!inner-start).)*?unwanted.*?outer-end

This will be to the left of the first |. It matches a whole outer block.
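Here is a partial sketch of the alternation idea using only the first exclusion branch shown above. The right-hand side of the | is my assumption: I reused the pattern from the question as the "what we do want" branch. Python is also an assumption, and the names exclude_first, keep and wanted_captures are hypothetical. Because only one exclusion branch is included, this does not reject blocks where unwanted appears after inner-end.

```python
import re

# Left side of the |: swallow a whole outer block when "unwanted" appears
# before inner-start (pattern taken from the answer above).
exclude_first = r"outer-start(?:(?!inner-start).)*?unwanted.*?outer-end"

# Right side of the |: ASSUMPTION for this sketch, the pattern from the
# question, which captures the inner text to Group 1.
keep = r"outer-start.*?inner-start(.*?)inner-end.*?outer-end"

pattern = re.compile(exclude_first + "|" + keep)

def wanted_captures(text):
    # Group 1 is only set when the right-hand branch matched.
    return [m.group(1) for m in pattern.finditer(text)
            if m.group(1) is not None]

sample = (
    "outer-start some unwanted text inner-startskip-meinner-end "
    "some more text outer-end "
    "outer-start some wanted text inner-starttext-that-i-want"
    "inner-end some more wanted text outer-end"
)
result = wanted_captures(sample)
print(result)
```

The first block is consumed by the left branch (Group 1 unset) and skipped; the second block falls through to the right branch and yields its inner text.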

Second, we want to eliminate the whole outer block if there is unwanted between inner-end and outer-end. You can do it with:

A better question to ask yourself than "how do I do this with regular expressions?" is "how do I solve this problem?". In other words, don't get hung up on trying to solve a big problem with regular expressions. If you can solve half the problem with regular expressions, do so, then solve the other half with another regular expression or some other technique.

For example, make a pass over your data getting all matches, ignoring the unwanted text (read: get results both with and without the unwanted text). Then, make a pass over the reduced set of data and weed out those results that have the unwanted text. This sort of solution is easier to write, easier to understand and easier to maintain over time. And for any problem you're likely to need to solve with this approach, it will be fast enough.
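A minimal sketch of that two-pass idea, assuming Python and the delimiters from the question. The two extra capture groups and the names block, extract_wanted and banned are hypothetical, added only so the second pass has something to inspect.

```python
import re

# Pass 1: the pattern from the question, plus two extra groups (an
# assumption for this sketch) that capture the text on either side of
# the inner block.
block = re.compile(
    r"outer-start(.*?)inner-start(.*?)inner-end(.*?)outer-end"
)

def extract_wanted(text, banned="unwanted"):
    results = []
    # With three groups, findall returns (before, inner, after) tuples.
    for before, inner, after in block.findall(text):
        # Pass 2: weed out blocks whose surrounding text has the banned word.
        if banned not in before and banned not in after:
            results.append(inner)
    return results

lines = "\n".join([
    "outer-start some wanted text inner-starttext-that-i-want"
    "inner-end some more wanted text outer-end",
    "outer-start some unwanted text inner-starttext-that-i-want"
    "inner-end some more unwanted text outer-end",
])
kept = extract_wanted(lines)
print(kept)
```

The filter is an ordinary substring test, so changing what counts as unwanted later means editing one line rather than reworking the regex.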

If you're unsure (and even if you think you're sure), you should test your pattern locally (or on a site like codepad.org), which is why regex questions need good examples (both passing and failing).
– Roger Pate, Jan 2 '10 at 23:21