David Lees wrote:
> I forget how to find multiple instances of stuff between tags using
> regular expressions. Specifically I want to find all the text between a
> series of begin/end pairs in a multiline file.
>
> I tried:
> >>> p = 'begin(.*)end'
> >>> m = re.search(p,s,re.DOTALL)
>
> and got everything between the first begin and last end. I guess
> because of a greedy match. What I want to do is a list where each
> element is the text between another begin/end pair.

people will tell you to use non-greedy matches, but that's often a
bad idea in cases like this: the RE engine has to store lots of back-
tracking information, and your program will consume a lot more
memory than it has to (and may run out of stack and/or memory).

a better approach is to do two searches: first search for a "begin",
and once you've found that, look for an "end"
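
A minimal sketch of the two-search idea described above, using two compiled patterns and the `pos` argument of `search`. The helper name `between_pairs` and the choice to collect results in a list are assumptions for illustration, not code from the original post.

```python
import re

# The delimiters are the literal words from the question.
BEGIN = re.compile(r"begin")
END = re.compile(r"end")

def between_pairs(text):
    """Return the text between each begin/end pair, scanning left to right."""
    chunks = []
    pos = 0
    while True:
        b = BEGIN.search(text, pos)      # first search: the next "begin"
        if b is None:
            break
        e = END.search(text, b.end())    # second search: the first "end" after it
        if e is None:
            break
        chunks.append(text[b.end():e.start()])
        pos = e.end()                    # resume scanning after this "end"
    return chunks
```

Each search only moves forward from the previous match, so neither pattern needs a wildcard that could backtrack.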

On Thu, 17 Jul 2003 04:27:23 GMT, David Lees wrote:
>I forget how to find multiple instances of stuff between tags using
>regular expressions. Specifically I want to find all the text between a
>series of begin/end pairs in a multiline file.
>
>I tried:
> >>> p = 'begin(.*)end'
> >>> m = re.search(p,s,re.DOTALL)
>
>and got everything between the first begin and last end. I guess
>because of a greedy match. What I want to do is a list where each
>element is the text between another begin/end pair.
>
You were close. For a non-greedy match, add a question mark after the greedy quantifier:
>>> import re
>>> s = """
... begin first end
... begin
... second
... end
... begin problem begin nested end end
... begin last end
... """
>>> p = 'begin(.*?)end'
>>> rx = re.compile(p, re.DOTALL)
>>> rx.findall(s)
[' first ', '\nsecond\n', ' problem begin nested ', ' last ']

Notice what happened with the nested begin/end pairs: the match stops at
the first "end" it finds, so the outer pair is cut short. If you have
nesting, you will need more than a simple regex approach.
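
One possible way to handle nesting, sketched under assumptions: scan for "begin" and "end" tokens and keep a depth counter, so that only the outermost pairs are reported. The function name `outermost_blocks` and the decision to ignore an unmatched "end" are choices made for this example only.

```python
import re

TOKEN = re.compile(r"begin|end")

def outermost_blocks(text):
    """Return the text inside each outermost begin/end pair, inner pairs included."""
    blocks = []
    depth = 0
    start = None
    for m in TOKEN.finditer(text):
        if m.group() == "begin":
            if depth == 0:
                start = m.end()          # remember where the outer block starts
            depth += 1
        elif depth > 0:                  # an "end" that closes an open "begin"
            depth -= 1
            if depth == 0:
                blocks.append(text[start:m.start()])
        # an "end" with no open "begin" is ignored
    return blocks
```

With the sample string above, the third element comes back as ' problem begin nested end ' instead of being cut at the inner "end". Note that the pattern matches "begin" and "end" anywhere, including inside longer words, so real data may need `\b` word boundaries.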

On Thu, 17 Jul 2003 08:44:50 +0200, "Fredrik Lundh" wrote:
>David Lees wrote:
>
>> I forget how to find multiple instances of stuff between tags using
>> regular expressions. Specifically I want to find all the text between a
>> series of begin/end pairs in a multiline file.
>>
>> I tried:
>> >>> p = 'begin(.*)end'
>> >>> m = re.search(p,s,re.DOTALL)
>>
>> and got everything between the first begin and last end. I guess
>> because of a greedy match. What I want to do is a list where each
>> element is the text between another begin/end pair.
>
>people will tell you to use non-greedy matches, but that's often a
>bad idea in cases like this: the RE engine has to store lots of back-
Would you say so for this case? Or for which cases like this one?
>tracking information, and your program will consume a lot more
>memory than it has to (and may run out of stack and/or memory).
For the above case, wouldn't the regex compile to a state machine with
just a few states: consume .* until it sees an e, fall back to .* if the
next character is not n, otherwise look for d in the same way, and then
either fall back to .* again or finish? For a short terminating match,
that would seem relatively cheap.
>at this point, it's also obvious that you don't really have to use
>regular expressions:
>
>    pos = 0
>
>    while 1:
>        start = text.find("begin", pos)
>        if start < 0:
>            break
>        start += 5
>        end = text.find("end", start)
>        if end < 0:
>            break
>        process(text[start:end])
>        pos = end # move forward
>
></F>
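
One way to look at the cost question empirically is to time both approaches on your own data. This is only a rough sketch: the function names are made up for the example, it measures running time only (not the memory or stack use mentioned above), and the numbers will depend on the input and the Python version.

```python
import re
import timeit

NON_GREEDY = re.compile(r"begin(.*?)end", re.DOTALL)

def with_regex(text):
    """Non-greedy regex version, as in the earlier reply."""
    return NON_GREEDY.findall(text)

def with_find(text):
    """str.find version, following the quoted loop but collecting a list."""
    chunks = []
    pos = 0
    while True:
        start = text.find("begin", pos)
        if start < 0:
            break
        start += len("begin")
        end = text.find("end", start)
        if end < 0:
            break
        chunks.append(text[start:end])
        pos = end + len("end")           # move past this "end"
    return chunks

def compare(text, repeat=20):
    """Return (regex_seconds, find_seconds) for `repeat` runs over text."""
    # Both versions should agree before their timings are worth comparing.
    assert with_regex(text) == with_find(text)
    t_regex = timeit.timeit(lambda: with_regex(text), number=repeat)
    t_find = timeit.timeit(lambda: with_find(text), number=repeat)
    return t_regex, t_find
```

Calling `compare(s)` on a string `s` that resembles the real file gives two timings to set side by side.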
