A common source of confusion is which of the possible matches a regular expression will find in any given instance. This gives rise to many queries on SOPW, and to many different attempts to explain what happens.

A common theme in the explanations is to talk about preferences - the regular expression engine "prefers" longer or shorter matches, or leftmost matches.

The reality is however very simple: the regular expression engine will always return the first match it finds, and if you understand the order in which the engine attempts matches, you need never be confused about which match will be found.

A few simple rules (in no particular order) characterise the order in which attempts are made.

1. Leftmost character first

The engine will start trying to match the pattern against the string, starting with the first character of the string. Only if it cannot find any match against the beginning of the string will it start looking for matches starting with the second character of the string, and so on until the string is exhausted.
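The example that originally followed this rule is not present in this copy. A minimal sketch consistent with the description, using a made-up string and pattern, might be:

```perl
use strict;
use warnings;

# Hypothetical input: the first word is shorter than a later one.
my $string = "The longest word";

# \w+ could match "The", "longest" or "word". The attempt that starts
# at the first character of the string succeeds, so no later starting
# position is ever tried.
my ($match) = $string =~ /(\w+)/;
print "$match\n";
```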

.. will print "The", since it matches at the beginning of the string. The fact that there is a longer match to find is irrelevant - a match against the beginning of the string succeeded, so this is the match that is returned.

2. Leftmost alternation first

Given a choice of alternates, the leftmost alternate is tried first. If a match is found against the leftmost alternate, the engine will attempt to match the rest of the pattern against the rest of the string; only if no match can be found will the next alternate be tried.
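The example that originally followed this rule is not present in this copy. A minimal sketch consistent with the description, using a made-up string and pattern, might be:

```perl
use strict;
use warnings;

# Hypothetical input for which both alternates could match.
my $string = "Mrs Smith";

# "Mr" is the leftmost alternate, so it is tried first. It matches,
# and nothing follows it in the pattern, so the match is complete.
my ($title) = $string =~ /(Mr|Mrs)/;
print "$title\n";
```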

.. will print "Mr": the first alternate matched, and thus satisfied the entire pattern, so this is the match returned. The second alternate was never looked at.

3. Maximal (greedy)

With any normal quantifier (ie any of ? * + {1,3}), the engine will first try to match the greatest number of iterations permitted (or the most available if that is fewer than permitted), and try to match the rest of the pattern against the rest of the string. Only if the rest of the pattern fails to match will it try matching one fewer iteration, and repeat until the minimum permitted number is reached.
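The example that originally followed this rule is not present in this copy. A minimal sketch consistent with the description, using a made-up string and pattern, might be:

```perl
use strict;
use warnings;

# Hypothetical input: a path with two components.
my $path = "/foo/bar";

# The leading .* is greedy: it first grabs the whole string, then
# gives back one character at a time until the "/" that follows it
# in the pattern can match. The capture takes whatever is left.
my ($last) = $path =~ m{.*/(.*)};
print "$last\n";
```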

.. prints "bar": the engine tries to match the initial .* against "/foo/bar", then "/foo/ba", then "/foo/b", then "/foo/", then "/foo". At this point the remaining pattern matches, so this is the match returned.

4. Minimal (non-greedy)

Adding a '?' after a quantifier (ie ?? *? +? {1,3}?) requests a minimal match, so the order is reversed: the engine first attempts to match the fewest number of iterations permitted and if it can, attempts to match the rest of the pattern against the rest of the string. Only if that fails does it try matching one more iteration, and continues increasing the number of iterations matched until the upper limit is reached.
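The example that originally followed this rule is not present in this copy. A minimal sketch consistent with the description, using a made-up string and pattern, might be:

```perl
use strict;
use warnings;

# Hypothetical input: a path with three components.
my $path = "foo/bar/baz";

# The leading .*? is minimal: it first tries to match nothing, then
# one character more on each retry, until the "/" that follows it in
# the pattern can match. The greedy capture then takes the rest.
my ($rest) = $path =~ m{.*?/(.*)};
print "$rest\n";
```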

.. prints "bar/baz": the engine tries to match the initial .*? against "", then "f", then "fo", then "foo". At this point the rest of the pattern matches, so this is the match that is returned.

5. Pattern components: left to right

Implicit in some of the above examples is that the components of the pattern are matched from left to right. So in the 'maximal' example above, the engine decides first (based on the rules above) how it is going to try to match the initial .*, and only after that is fixed does it start worrying about the remaining components of the pattern.
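As a small illustration of this left-to-right rule (the string and pattern here are made up for the purpose), two greedy components competing for the same characters are settled in favour of the one that comes first in the pattern:

```perl
use strict;
use warnings;

# Hypothetical input: three identical characters.
my $string = "aaa";

# The first a* is handled first and greedily takes all three
# characters. The second a* then matches the empty string, which is
# enough for the whole pattern to succeed, so nothing is given back.
my ($first, $second) = $string =~ /(a*)(a*)/;
print "first='$first' second='$second'\n";
```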

6. Nesting

This is really just more of the "left to right" rule, best explained by example:
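The example that originally followed is not present in this copy. A minimal sketch consistent with the description, using a made-up string and pattern, might be:

```perl
use strict;
use warnings;

# Hypothetical input: two components separated by a slash.
my $path = "foo/bar";

# Outer quantifier: the + on the (?: ... ) group.
# Nested quantifier: the + on the dot inside the capture.
# The first pass through the group lets .+ take the whole string,
# and the optional slash then matches nothing.
my ($captured) = $path =~ m{(?:(.+)/?)+};
print "$captured\n";
```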

.. prints "foo/bar". The engine reaches the outer quantifier, knows it should match it as many times as possible, and attempts to match its innards. The first iteration of the innards reaches the nested quantifier, and tries to match it as many times as possible, giving "foo/bar". At this point, the outer quantifier cannot be matched a second time but that's fine - the pattern has been fully satisfied, and so this match is returned.

That covers most of the standard cases. There are a few other possibilities to consider (eg anchors, pos(), and the various zero-width assertions). Also, the optimiser may sometimes cause the engine to skip some steps, but only if it is sure that the actual match found will be the same.

I hope this will help you work out which match your regular expression will find.

I think this is a great short explanation of regex engine expectations. The only part I might tweak is the part that addresses the mostly greedy nature of the Perl engine.

Your explanation is right on with regard to the way the engine behaves. You may, however, want to change the term "iterations" to something else. "Iterations" usually refers to the repetition of a series of operations. So, while each attempt by the engine to match against a character could be called an iteration, the characters themselves really could not.

I'm not trying to be overly critical or anything. This tutorial would be great for new (Perl|regex) users to read, and I guess I just think the use of "iterations" might be a bit confusing to them.

"Iterations usually refers to the repetition of a series of operations."

Hmm, I was thinking of the process of matching "the thing quantified" once as the series of operations being repeated. But I don't have a good name for "the thing quantified". I agree that 'iterations' isn't ideal, but I'm not sure what would be better.

I use this a lot, usually with the shorter command-line invocation ('-Dr', available if perl was built with debugging enabled). However the output was designed primarily for internals debugging, and is likely to be more scary than helpful to the casual user.

Various people have taken advantage of the information it provides to offer a friendlier interface to regexp debugging. Though I haven't tried any of them, I'm aware of ReBug (I believe this is open source, but I can't get to the website: the author gave a talk on it at the 2001 OSCON), Komodo (commercial, with "try before you buy"), and The Regex Coach ("donationware", donate if you like it) which was recently mentioned on PM.

Hi
Though not directly related to the explanation here, I think this is the most relevant place.
There are 6 Pattern Matching Rules explained for the Perl engine (Programming Perl by Larry, Tom and Randal). But in Rule 3, what is the meaning of the left-to-right pecking order? As I understand it, each item (assertion or atom) of the current regex alternative is searched for sequentially in the string from the current position onwards.
So if the current item does not match at the current position in the string, what does it mean to try the items next in the pecking order?
Thanx
Manish Sood
manishsood14@yahoo.com
