Just another programmer's site.

Regular expressions – Backreferences

I’m going through the SPTK for the 70-536 and stopped a bit at the regular expressions sub chapter. I always avoided allocating some serious time to catching up (yeah baby I used to know this stuff better while at university) with them and I was always satisfied with only a quick fix. Blame me all you want, now this is where I want to redeem myself.

There’s an example in the book and the explanation is “superficial” enough to have let me completely in the dark. I categorized it as superficial then when I didn’t know as much as I do know.

So, how about matching the (?<letter>a)(?<letter>\k<letter>b)*regular expression which is the same thing as (?<1>a)(?<1>\1b)* against the "aababb" string.

The above regular expression "(?<letter>a)(?<letter>\k<letter>b)*" will first try to match a core of three characters: the ones highlighted in red, orange and blue. Where <letter> is in green it means a string will be assigned to it and when in orange we refer to its value.

Note: The * at the end refers to only the group between the last set of parentheses (in pink).

Now, even if we only match three characters we can consider as having a complete match (from the regular expression’s perspective) as the "*" makes it optional to have anything else matching after the first partial three-char match.

However, if there is something after and we want to be a match it better be whatever was lastly stored in <letter> and followed by a "b".

Be aware that while still trying to match the first three characters we assigned the <letter> reference two times: first with a string of one character and then a two characters string.

So, the “*” at the end of the regular expression will match something that first of all has to be three characters long (we will see how this actually changes for the next time we try to match “*”).

You can observe now how the second set of parentheses needs to match only two characters at first, to find that next time (when “*” gets involved) it will be match for a three character string.

Let us see how the actual regex engine works.

Step 1

First it will try to see if we have a match against the first group of three characters: "aab".

Grabbing the "aab" substring we can see that the first letter matches with a. Additionally, at this moment the string "a" is stored to the <letter> reference.

The next two characters substring in "aab", and that is "ab" is compared against \k<letter>bwhich is in fact "ab". Besides the fact it is a match, this string is stored to <letter> overwriting the previous content: "a". So, just to be clear, at this moment <letter> = "ab"

First three letters are a match.

Step 2

Now that we are done with matching the core let’s see what "*" means at this point. It is a repetition of whatever we have in “\k<letter>b” which is "ab" + "b" = "abb" and more it is a match for what’s next in the "aababb" string.

So, our original string is a complete match.

What should the original string be suffixed with to have one more match for the "*" and so a complete match of the whole string. Now this depends very much on the answer to the following question:

While the previous matching against "*" was the <letter> reassigned again (obviously with "abb")?

The answer I found to be: YES, while matching a "*", references assignment is still on. So, the match for the "*" will be at this point "abbb" meaning a complete matching string is: "aababbabbb".