Regular expressions are simply amazing; they are one of the sexiest aspects of computer science. However, they can get very complicated very fast. As the complexity of a regular expression increases, so does its power. A complex regular expression might be able to do, in a single pass, the work of three smaller regular expressions applied in succession. The problem, however, is that the complexity of a regular expression increases far faster than its power; a pattern that is three times more powerful might very well feel ten times more complex.

In order to find a sweet spot between the ease of simple regular expressions and the power of complex ones, I thought I would try to create a ColdFusion user-defined function that allows multiple patterns to be applied in succession to a target string. The idea here was to present a "single pass" approach to the programmer that is actually defined by several smaller regular expressions. What I came up with was reMultiMatch(). Just as with ColdFusion's reMatch() function, the reMultiMatch() function is designed to extract an array of pattern matches contained within a given string. The difference, of course, is that reMultiMatch() allows more than one regular expression pattern to be passed in:

reMultiMatch( pattern, [pattern,]* string )

Before we dive into how the function works, it might be more helpful to see how it can be used. In the following demo, we have a snippet of HTML that contains several IMG tags. From this HTML, we want to extract the SRC value of each image, but only if that image also has the CSS class "saucy". This kind of extraction could be performed using a wicked complex regular expression; however, as you'll see below, reMultiMatch() allows us to extract those matches using four (relatively) much simpler patterns:

<!--- Create a piece of demo text (some mock HTML). --->
<cfsavecontent variable="demoText">

	<h2>
		Images
	</h2>

	<ul>
		<li>
			<img src="very-sexy-girl.jpg" class="saucy" />
		</li>
		<li>
			<img id="banner" src="coldfusion-ad.jpg" class="saucy" />
		</li>
		<li>
			<img id="footerBanner" src="fbanner.png" class="footer" />
		</li>
	</ul>

</cfsavecontent>

<!---
	Extract all of the IMG SRC values from our demo text, but
	only if the IMG has the class of "saucy". To do this, we are
	going to use a multi-pass regular expression match that matches
	the following patterns:

	1. All IMG tags.
	2. IMG tags that have the "saucy" class.
	3. src="value" pairs.
	4. The quoted SRC value.

	NOTE: Since we are using the Java regular expression engine
	internal to the function, we are able to make use of powerful
	features like positive look-behinds.
--->
<cfset srcValues = reMultiMatch(
	"<img[^>]+>",
	"(?=.+?class\s*=\s*""saucy"").+",
	"src\s*=\s*""[^""]+""",
	"(?<="").+(?="")",
	demoText
	) />

<!--- Output the list of IMG SRC values. --->
<cfdump
	var="#srcValues#"
	label="IMG SRC Values"
	/>

When we run the above code, the CFDump output shows an array of two values: very-sexy-girl.jpg and coldfusion-ad.jpg.

As you can see, we have successfully extracted the SRC values of the two IMG tags that contained the "saucy" class attribute. In order to do this, we applied the following four regular expressions in succession:

"<img[^>]+>"
This extracted all of the individual IMG tags.

"(?=.+?class\s*=\s*""saucy"").+"
This used a positive look-ahead to make sure that collected IMG tags had the class="saucy" name-value pair.

"src\s*=\s*""[^""]+"""
This extracted the SRC name-value attribute from the given IMG tag.

"(?<="").+(?="")"
This used a positive look-ahead and look-behind to extract the quoted value from the SRC name-value pair.

If any of these regular expressions looks complex on its own, just imagine how insanely complex it would be to try to merge these four expressions into a single pattern.
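To make the succession concrete, here is a minimal Java sketch that records what each pass yields. It is not the reMultiMatch() function itself: the class name MultiMatchTrace and the method tracePasses are hypothetical, and the target string comes first only because Java varargs must come last. Java is used because the function relies on the Java regular expression engine; the ColdFusion "" escapes become \" and each backslash is doubled in Java string literals.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of the original post): records the
// matches produced by every pass so the narrowing can be inspected.
final class MultiMatchTrace {

    // Returns one list per pattern; list N holds the matches found by
    // applying pattern N to every match produced by pattern N-1.
    static List<List<String>> tracePasses(String target, String... regexes) {
        List<List<String>> passes = new ArrayList<>();
        List<String> current = Collections.singletonList(target);

        for (String regex : regexes) {
            Pattern pattern = Pattern.compile(regex);
            List<String> next = new ArrayList<>();

            for (String input : current) {
                Matcher matcher = pattern.matcher(input);
                while (matcher.find()) {
                    next.add(matcher.group());
                }
            }

            passes.add(next);
            current = next;
        }

        return passes;
    }
}
```

Called with the demo HTML and the four patterns above, the passes hold, in order: the three IMG tags, the two tags carrying class="saucy", the two src="..." pairs, and finally the two bare SRC values.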

Now that you see how reMultiMatch() might be used, let's take a look at how this ColdFusion user defined function is actually built. Underneath the hood, it compiles each regular expression down into an instance of the Java Pattern class. This gives us a more robust regular expression feature set as well as a quicker execution than you'd find in the standard reMatch() function.

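<cffunction
	name="reMultiMatch"
	access="public"
	returntype="array"
	output="false"
	hint="I return an array of regular expression matches defined by the first N-1 patterns applied in sequence to the given string.">

	<!--- Define arguments. --->
	<!---
		The first N-1 arguments will be regular expressions. The
		last argument will be the target string to which the regular
		expressions will be applied.
	--->

	<!--- Define the local scope. --->
	<cfset var local = {} />

	<!---
		Check to make sure at least two arguments were passed into
		the function. If not, we don't have at least one regular
		expression pattern to apply.
	--->
	<cfif (arrayLen( arguments ) lt 2)>

		<!--- Invalid argument list. --->
		<cfthrow
			type="InvalidArguments"
			message="This function expects at least 2 arguments."
			detail="This function expects one or more regular expressions followed by the target string to which the regular expressions should be applied."
			/>

	</cfif>

	<!---
		NOTE: The listing breaks off here in this copy of the post.
		The rest of the body is a minimal reconstruction from the
		description above (each pattern compiled to a Java Pattern
		and applied to the matches of the previous pass); it is an
		assumption, not the original code.
	--->
	<cfset local.targets = [] />
	<cfset arrayAppend( local.targets, arguments[ arrayLen( arguments ) ] ) />

	<!--- Apply each pattern to the matches collected by the previous pass. --->
	<cfloop index="local.i" from="1" to="#(arrayLen( arguments ) - 1)#">

		<cfset local.pattern = createObject( "java", "java.util.regex.Pattern" ).compile(
			javaCast( "string", arguments[ local.i ] )
			) />
		<cfset local.matches = [] />

		<cfloop index="local.target" array="#local.targets#">
			<cfset local.matcher = local.pattern.matcher( javaCast( "string", local.target ) ) />
			<cfloop condition="local.matcher.find()">
				<cfset arrayAppend( local.matches, local.matcher.group() ) />
			</cfloop>
		</cfloop>

		<cfset local.targets = local.matches />

	</cfloop>

	<!--- Return the matches that survived the final pattern. --->
	<cfreturn local.targets />

</cffunction>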

The underlying code is not too bad; essentially, it's just hiding the grunt work of having to manually apply each regular expression pattern in succession.
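For readers who want to see that grunt work spelled out against the Java regular expression classes directly, the following is a minimal sketch of the same idea in plain Java. It is not a translation of the ColdFusion function above: the class name MultiMatchSketch and the method multiMatch are hypothetical, and the target string is the first parameter only because Java varargs must come last.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical Java counterpart to the multi-pass idea; the names are
// assumptions made for this sketch.
final class MultiMatchSketch {

    // Applies each pattern, in order, to every string that survived the
    // previous pass, starting with the single target string.
    static List<String> multiMatch(String target, String... regexes) {
        if (regexes.length == 0) {
            throw new IllegalArgumentException("At least one pattern is required.");
        }

        List<String> current = new ArrayList<>();
        current.add(target);

        for (String regex : regexes) {
            Pattern pattern = Pattern.compile(regex);
            List<String> next = new ArrayList<>();

            // Collect every match of this pattern in every surviving string.
            for (String input : current) {
                Matcher matcher = pattern.matcher(input);
                while (matcher.find()) {
                    next.add(matcher.group());
                }
            }

            // The matches become the input of the next pattern.
            current = next;
        }

        return current;
    }
}
```

Each pass can only narrow or slice what the previous pass found, which is why every individual pattern can stay small: it never needs to know about the full document, only about the fragments handed to it.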

When it comes to string parsing, regular expressions can feel both like a gift and a curse. Hopefully, with a function like reMultiMatch(), we can keep the complexity of our regular expressions lower while still experiencing the power that a more complex regular expression would provide. And of course, the more straightforward our patterns are, the easier they are to read. And if you've ever had to debug a complex regular expression, the readability of smaller patterns might be reason enough to try this approach.

Reader Comments

Hi Ben

Interesting function. Some observations:

* Wouldn't an array of regexes in the first argument be slightly more logical / predictable / tidy than 1->n string arguments?

* Also, isn't it normal to have the required args first? That is, one always needs the target string, so it would be more natural to have that as an argument before the 2->n regexes (obviously the first regex string is required too). And accordingly, to keep things logically grouped, have the target string argument first, then the regex argument(s) after that? I realise you're trying to match the arguments for reMatch(), but I think what you've ended up with is a bit unnatural given your approach. It'd be less unnatural if you passed an array of regexes rather than individual arguments, though.

Hmmm, this could be a nice shortcut function for progressively narrowing down, but I must object to your example using HTML!

Regex is a great tool, but it is for parsing text, *not* for parsing HTML.

Complexity aside, the problem is that HTML is very flexible - its text representation can change without the HTML itself changing, and the HTML can change in ways that might not matter - and these both cause problems for the relative explicitness of regex. For example:

<img src='very-sexy-girl.jpg' class="saucy girl" />

That causes problems, and requires much more complex regex, even with the reMultiMatch approach.
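A small Java sketch shows the failure and one possible, still heuristic, loosening of the patterns. Everything here is illustrative: the class name, the multiMatch helper, and the TOLERANT patterns are assumptions rather than anything from the post, and the loosened patterns still ignore unquoted attributes and values that contain the other kind of quote.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical comparison of the post's patterns with looser variants.
final class QuoteAndClassVariants {

    // The four patterns from the post, written as Java string literals.
    static final String[] STRICT = {
        "<img[^>]+>",
        "(?=.+?class\\s*=\\s*\"saucy\").+",
        "src\\s*=\\s*\"[^\"]+\"",
        "(?<=\").+(?=\")"
    };

    // Assumed variants: accept either quote style, and accept "saucy" as
    // one class name among several (but not as part of "not-saucy").
    static final String[] TOLERANT = {
        "<img[^>]+>",
        "(?=.+?class\\s*=\\s*[\"'][^\"']*(?<![\\w-])saucy(?![\\w-])[^\"']*[\"']).+",
        "src\\s*=\\s*[\"'][^\"']+[\"']",
        "(?<=[\"']).+(?=[\"'])"
    };

    // Multi-pass match: each pattern runs against the previous matches.
    static List<String> multiMatch(String target, String... regexes) {
        List<String> current = Collections.singletonList(target);
        for (String regex : regexes) {
            Pattern pattern = Pattern.compile(regex);
            List<String> next = new ArrayList<>();
            for (String input : current) {
                Matcher matcher = pattern.matcher(input);
                while (matcher.find()) {
                    next.add(matcher.group());
                }
            }
            current = next;
        }
        return current;
    }
}
```

Run against the tag above, the STRICT patterns return nothing, because the class value is no longer exactly "saucy" and the SRC value is single-quoted. The TOLERANT patterns recover very-sexy-girl.jpg, at the cost of patterns that are noticeably harder to read.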

And that just seems like too much work when there's already a much better way of doing this:

I think absolutely an array of regular expressions would be much nicer. The reason I didn't go that way was because implicit array creation cannot be used directly in function calls:

fn( [ val ] )

... until ColdFusion 9 (finally added, woohooo!!!!). As such, people would need an intermediary array to hold the patterns:

p = [ val ]
fn( p )

... and I was just trying to keep this as streamlined as possible to keep the whole "single pass" feeling.

That said, variable-length arguments always make me feel a bit funny, so I think we're on the same page.

As far as the patterns being first, as you've concluded, the only reason I went that way was to try to keep this somewhat in step with the native reMatch() and reFind() functions, which both take the regular expression as the first argument and the target string as the latter argument. Of course, reReplace() takes the target string as the first argument, so perhaps consistency is a moot point.

All in all, I think the inline, implicit array of patterns would be the nicest approach.

@Peter,

HTML was the only example that I could think of :) Also, someone had sent me a question of a similar nature, which is how I happened to get this idea, so I was starting in a bit of a biased place.

@Peter

How can you use jQuery to parse a document server side? My understanding was it did client side work? The only other way I thought it could be done was to parse it as an XML document, but the same reasons you give for using regex to pattern match HTML apply to parsing HTML as an XML document. At least with regex it won't fail to validate the document and abort before you start; so you start out with a better chance of success.

@Ben

To account for double/single quotes, make this tiny mod to look for " or ': [""|']

Mike, jQuery is a JavaScript library, and JS generally runs inside a browser client, but that's a convention not a restriction. There's a server-side JS project called Rhino which I think can run jQuery.

Also, there are other (non-JS/jQuery) selector tools starting to come out that will work on the server-side, for example:

http://github.com/chrsan/css-selectors/tree

I'm not sure what you're saying with the XML comment. Yes, using regex against XML has similar problems as using regex against HTML.

Using XPath against an HTML DOM is possible (even if the original code isn't valid XHTML), but even so that never seems to work well.

Also important to note that in regex the | character indicates alternation in most cases, but not inside a character class, where it is literal. So ["|'] means " or | or ' rather than just the quotes.

And don't forget that quotes are optional in HTML in many places. :)
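The character class point is easy to check with the Java regular expression engine. The class and constant names in this short sketch are hypothetical; only Pattern.matches() is a real library call.

```java
import java.util.regex.Pattern;

// Hypothetical demo of how | behaves inside and outside a character class.
final class QuoteCharacterClass {

    // Inside a character class, | is a literal pipe character.
    static final String WITH_PIPE = "[\"|']";

    // A class that contains only the two quote characters.
    static final String QUOTES_ONLY = "[\"']";

    // Outside a class, | means alternation.
    static final String ALTERNATION = "(?:\"|')";

    // True when the whole input matches the given pattern.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.matches(regex, input);
    }
}
```

All three patterns accept a lone double or single quote, but only WITH_PIPE also accepts a lone | character, which is the unintended extra match described above.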

This is coolness and very useful. I did something similar in JavaScript for the XRegExp.matchChain method ( http://xregexp.com/api/#matchChain ), with the added ability to pass forward a specific backreference to the next regex. I also used recursion to keep the code nice and lightweight (which of course is less imperative in CF-land than JS). Incidentally, our usage examples are also very similar. :)
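The idea of passing a specific backreference forward can be sketched in Java roughly as follows. This is not XRegExp's implementation or API: the Step class and the matchChain method here are hypothetical names, and the sketch is iterative rather than recursive. Each step names the capture group whose text should feed the next step, with 0 meaning the whole match.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical chain matcher that forwards a chosen capture group.
final class MatchChainSketch {

    // One link in the chain: a compiled pattern plus the group to forward.
    static final class Step {
        final Pattern pattern;
        final int group;

        Step(String regex, int group) {
            this.pattern = Pattern.compile(regex);
            this.group = group;
        }
    }

    // Runs each step against the text forwarded by the previous step.
    static List<String> matchChain(String target, Step... steps) {
        List<String> current = Collections.singletonList(target);
        for (Step step : steps) {
            List<String> next = new ArrayList<>();
            for (String input : current) {
                Matcher matcher = step.pattern.matcher(input);
                while (matcher.find()) {
                    String captured = matcher.group(step.group);
                    // A group that did not participate in the match is null.
                    if (captured != null) {
                        next.add(captured);
                    }
                }
            }
            current = next;
        }
        return current;
    }
}
```

With group forwarding, the last two patterns of the IMG example can collapse into a single step such as src\s*=\s*"([^"]+)" that forwards group 1, so the look-behind and look-ahead pass is no longer needed.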

Also, my idea to use a positive look-ahead to check for a sub-string before I matched something else is specifically taken out of your RegEx Cookbook :) When I saw that recipe, it kind of blew my mind.

Actually, my example maps pretty closely to that. I am checking for a given class - you are checking for an external link. Then, I'm checking for the SRC, you are checking for ALT. We can probably rework the demo to match your situation:

reMultiMatch(
	"<img[^>]+>",
	"(?=.+?src\s*=\s*""http://(?!YOUR_DOMAIN_NAME)).+",
	"alt\s*=\s*""[^""]+""",
	"(?<="").+(?="")",
	demoText
	)

Maybe something like that? You'd have to replace the YOUR_DOMAIN_NAME with your local domain name. The negative look-ahead checks to make sure the src value doesn't start with your domain.

Of course, this could get complicated if your local source values don't have HTTP in them. Perhaps. It depends on what kind of data you're dealing with.
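Here is a rough Java sketch of that reworked demo, for anyone who wants to experiment with the negative look-ahead. The class and method names are hypothetical, example.com in the usage note stands in for YOUR_DOMAIN_NAME, and Pattern.quote() is an addition of this sketch so that the dot in the domain is treated literally.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: ALT values of IMG tags whose SRC is an http:// URL
// that does not start with the given local domain.
final class ExternalImageAlts {

    static List<String> externalAlts(String html, String localDomain) {
        // Negative look-ahead: the host must not begin with the local domain.
        String externalOnly =
            "(?=.+?src\\s*=\\s*\"http://(?!" + Pattern.quote(localDomain) + ")).+";

        return multiMatch(
            html,
            "<img[^>]+>",
            externalOnly,
            "alt\\s*=\\s*\"[^\"]+\"",
            "(?<=\").+(?=\")");
    }

    // Multi-pass match: each pattern runs against the previous matches.
    static List<String> multiMatch(String target, String... regexes) {
        List<String> current = Collections.singletonList(target);
        for (String regex : regexes) {
            Pattern pattern = Pattern.compile(regex);
            List<String> next = new ArrayList<>();
            for (String input : current) {
                Matcher matcher = pattern.matcher(input);
                while (matcher.find()) {
                    next.add(matcher.group());
                }
            }
            current = next;
        }
        return current;
    }
}
```

Calling externalAlts( html, "example.com" ) skips images hosted on example.com. Because the look-ahead requires a literal http:// prefix, images with relative SRC values or https:// URLs are never reported by this sketch either, which is one way the approach depends on the data you're dealing with.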