Regex teaser

Paul Makepeace writes:
> On Tue, Dec 3, 2013 at 5:03 PM, Mark Fowler <mark at twoshortplanks.com> wrote:
> > On Tue, Dec 3, 2013 at 6:54 PM, Paul Makepeace <paulm at paulm.com>
> > wrote:
> >
> > > $ perl -le '($a = "aabbb") =~ s/b*$/c/g; print $a'
> >
> > This is where tools like Regexp::Debugger shine. Running
> >
> > perl -le 'use Regexp::Debugger; ($a = "aabbb") =~ s/b*$/c/g; print $a'
> >
> > Shows exactly why it gives the output it does (if you hit "n" for
> > next a lot)
(Enter works too.)
Except that gives 3 matches (on perl v5.14 with Regexp::Debugger
0.001016) — 15 steps to match “bbb”, then 5 steps to match at the end of
the string, then another 5 steps to match at the end of the string a
second time. Yet it only prints 2 “c”s.
Am I interpreting that wrongly? Is this a bug in Regexp::Debugger that
should be reported?
> Can't use an undefined value as an ARRAY reference at
> /Library/Perl/5.16/Regexp/Debugger.pm line 499.
Ditto, after it's finished. I'll report that bug once I've sent this
mail.
> But yeah that's neat.
It is. I hadn't used it before, so thank you Mark for demonstrating it.
> The puzzle comes down to whether the $ is part of the first b*
> capture.
Hmmm. /$/ is a zero-width assertion, so isn't really ‘part’ of a capture
(though of course it has to match); I find it easier to think of
assertions as matching at the positions between characters rather than
on them.
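
As a rough illustration of the "positions between characters" view, here is a small sketch in Python rather than Perl (the variable names are mine, purely for illustration). It compares the span matched with and without the trailing assertion, and then the span of a bare /$/:

```python
import re

text = "aabbb"

# Same characters are matched whether or not the assertion is present.
without_anchor = re.search(r"bbb", text)
with_anchor = re.search(r"bbb$", text)
print(without_anchor.span(), with_anchor.span())  # (2, 5) (2, 5)

# On its own, $ matches at a position and consumes nothing:
# start and end of the match are the same offset.
end_only = re.search(r"$", text)
print(end_only.span())  # (5, 5)
```

The spans are offsets between characters, so (5, 5) is the empty spot just after the final "b".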
> IMO it is (and python seems to agree). Why the engine restarts having
> captured as much as it can to the very end strikes me as counter
> intuitive.
If the pattern had been /bbb/ then it would obviously match the “bbb” in
the input string. The last character matched would be the final “b”, and
so the position to start looking for the next match is the spot just
after the final “b”.
With the pattern /bbb$/ instead, it does ... exactly the same. The
characters from the input string matched are still “bbb”; the assertion
is asserted, but doesn't change which characters are matched. Which
means that the final matched character is still the final “b”, so the
‘next match position’ is still the spot just after the final “b”.
And of course matching at the spot after the final character has to be
allowed so that things like s/$/./ work.
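
The reasoning above can be sketched as an explicit loop. This is a simplified emulation written in Python, not how Perl actually implements s///g: global_sub is a hypothetical helper, the replacement is treated as a literal string, and after an empty match it simply steps one position forward (Perl would first retry for a non-empty match at the same spot).

```python
import re

def global_sub(pattern, repl, text):
    """Simplified s/pattern/repl/g: repl is a literal, no backreferences."""
    pat = re.compile(pattern)
    pieces = []
    pos = 0      # where the next search starts
    copied = 0   # how much of text has been copied to the output
    while pos <= len(text):
        m = pat.search(text, pos)
        if m is None:
            break
        pieces.append(text[copied:m.start()])
        pieces.append(repl)
        copied = m.end()
        if m.end() > m.start():
            # Non-empty match: resume just after the last matched character.
            pos = m.end()
        else:
            # Empty match: step forward so the same spot is not matched again.
            pos = m.end() + 1
    pieces.append(text[copied:])
    return "".join(pieces)

print(global_sub(r"b*$", "c", "aabbb"))  # aacc
print(global_sub(r"$", ".", "ab"))       # ab.
```

The first call matches "bbb" at offsets 2 to 5, then matches the empty string at offset 5, which is where the second "c" comes from.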
> Almost, if not actually, bug-like.
I agree the behaviour isn't immediately obvious. But it does make sense
when thinking about what each component means separately.
So what does seem bug-like to me is the Python behaviour — can anybody
explain that?
Cheers
Smylers