Regular Expressions in .NET

The purpose of this article is to build upon the existing pool of regular expression articles by
providing an overview of the new regular expression features found in .NET and to offer some guidelines as to
when and how to use them. The reader of this article should be familiar with what regular expressions
are and their base features.

Introduction

Although I was familiar enough with the basic concepts of regular expressions to use them in VBScript
and JScript, I noticed that I was struggling to understand many regular expressions I found in
examples and documentation. Some of the new features such as lookaround and named capturing left me
feeling more than a little overwhelmed. In addition, the documentation for regular expressions
was scant and often included little or no sample code. Because of this, I initially
steered away from using regular expressions in my .NET projects altogether.

In this article I hope to highlight some of these new areas and de-mystify them so that you
won't find yourself in the position that I did.

Matching: Groups and Named Captures

From previous regular expression authoring you will likely be familiar with the concept of referencing
parenthesized captures via the $1...$N notation - these are referred to as
backreferences. To demonstrate this, consider the following VB.NET sample:
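
The original listing is not reproduced in this copy of the article. The following is a minimal sketch of what such a sample might look like, based on the description that follows it. The module name, the function name and the use of Console.WriteLine (rather than writing to the browser) are assumptions made for illustration.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module BackreferenceSample
    ' Reorders "surname, firstname" into "firstname surname".
    Function SwapNames(ByVal input As String) As String
        ' Group 1 captures the surname, group 2 captures the first name.
        Dim re As New Regex("(\w+), (\w+)")
        ' $2 and $1 refer to the second and first parenthesized captures.
        Return re.Replace(input, "$2 $1")
    End Function

    Sub Main()
        Console.WriteLine(SwapNames("Neimke, Darren")) ' Darren Neimke
    End Sub
End Module
```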

The above pattern matches two words separated by a comma and a space, captures the surname and the
firstname of a user and formats them in firstname, surname order. The result is that the value
"Darren Neimke" would be displayed in the browser.

In the Replace statement the $N notation refers to the Nth group of
parentheses (captures). An important point to note is that in .NET the zeroth element ($0)
refers to the entire matched text - "Neimke, Darren" in the case of the above example.

The Regex class now offers some convenient shared (static) members that allow simple statements to
be in-lined, thus reducing the need for unnecessarily bulky code structures such as the one shown
above. The useful static members are: IsMatch, Match, Matches,
Replace and Split. Using this syntax allows for the previous code to be
reduced to:
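
As a rough sketch of the shorter form (the original one-liner is not reproduced in this copy), the shared Regex.Replace method takes the input, the pattern and the replacement in a single call. The module and function names here are illustrative assumptions.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module StaticReplaceSample
    ' Same name swap as before, without creating a Regex instance.
    Function SwapNames(ByVal input As String) As String
        Return Regex.Replace(input, "(\w+), (\w+)", "$2 $1")
    End Function

    Sub Main()
        Console.WriteLine(SwapNames("Neimke, Darren")) ' Darren Neimke
    End Sub
End Module
```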

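The reduced code benefits can be further seen with another example, using IsMatch() to
ensure that a string consists of a decimal number before executing some code. The pattern is
anchored with ^ and $ so that the whole string must match, and the fractional part is optional so
that a single digit such as "5" is accepted:

If Regex.IsMatch( userInputString, "^\d+(\.\d+)?$" ) Then
    ' perform some conversion and math operations here
End If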

Prior to .NET, a regular expression Match object contained a collection of SubMatches. This has
remained the same in .NET, although they are now referred to as Groups. Groups
is a collection property of a Match object and each captured group can be accessed via
its index (remembering that index 0 refers to the entire match), like so:
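
A minimal sketch of indexed group access, assuming the same "surname, firstname" pattern and input used earlier (the original listing is not reproduced in this copy, and the names below are illustrative):

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module GroupsByIndexSample
    ' Returns the text captured by the second set of parentheses.
    Function FirstName(ByVal input As String) As String
        Dim m As Match = Regex.Match(input, "(\w+), (\w+)")
        ' Groups(0) is the whole match, Groups(1) the surname, Groups(2) the first name.
        Return m.Groups(2).Value
    End Function

    Sub Main()
        Console.WriteLine(FirstName("Neimke, Darren")) ' Darren
    End Sub
End Module
```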

This would display the text "Darren" as it is the captured Group at index 2.

Named Captures

Additionally Groups can be assigned names via the new (?<nameOfGroup>...) or
(?'nameOfGroup'...) syntax. For consistency with other flavors of regular expressions -
such as Perl - I prefer the first syntax and it is the one that is most commonly used. Assigning
names to groups helps to make your code more self-describing and can lead to improved maintainability.
Here's an example of naming the two captures:
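
The original listing is not reproduced in this copy, so the following is a sketch under the assumption that the two captures are named surname and firstname. It shows a named group being read through the Groups collection and being referenced in a replacement string with the ${name} syntax.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module NamedCaptureSample
    ' Group names "surname" and "firstname" are chosen for illustration.
    Private Const NamePattern As String = "(?<surname>\w+), (?<firstname>\w+)"

    ' Reads a capture by name instead of by index.
    Function FirstName(ByVal input As String) As String
        Return Regex.Match(input, NamePattern).Groups("firstname").Value
    End Function

    ' References the named captures in the replacement string.
    Function SwapNames(ByVal input As String) As String
        Return Regex.Replace(input, NamePattern, "${firstname} ${surname}")
    End Function

    Sub Main()
        Console.WriteLine(FirstName("Neimke, Darren")) ' Darren
        Console.WriteLine(SwapNames("Neimke, Darren")) ' Darren Neimke
    End Sub
End Module
```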

Non-Capturing

While captures provide a lot of power, they come at a cost, because the engine has to record the text
matched by every capturing group. With regular expressions in VBScript and JScript, capturing occurred
whenever you used parentheses in a regular expression pattern. Sometimes, though, you need to use
parentheses, but you don't need capturing. For example, if you wanted to match either
"Let's go this way" or "Let's go that way" you could use the following regular expression:

Let's go th(is|at) way

The parentheses with the pipe indicate an option. The pattern matches either "is" or "at" after the
"th". Unfortunately, this regular expression incurs an unneeded performance hit because the
captured text (either "is" or "at") is remembered via a backreference.

Fortunately, .NET regular expressions provide the (?:...) syntax, which allows for grouping
to be done without incurring the performance hit of captured text being "remembered" as a backreference.
Using this syntax, the above regular expression could be changed to:

Let's go th(?:is|at) way

That pattern would match either:

"Let's go this way"

"Let's go that way"

But the match would contain only one group, the entire match, referenced as Groups(0). Skipping
captures that are never used can improve performance, especially when complex patterns are applied
to large bodies of text.
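
To see the difference, the number of groups on a successful match can be compared for the two patterns. This is a small sketch with illustrative names, not a performance measurement.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module NonCapturingSample
    ' Returns how many groups (including group 0) a successful match exposes.
    Function GroupCount(ByVal pattern As String, ByVal input As String) As Integer
        Return Regex.Match(input, pattern).Groups.Count
    End Function

    Sub Main()
        ' Capturing parentheses: group 0 plus one captured group.
        Console.WriteLine(GroupCount("Let's go th(is|at) way", "Let's go that way"))   ' 2
        ' Non-capturing parentheses: only group 0, the entire match.
        Console.WriteLine(GroupCount("Let's go th(?:is|at) way", "Let's go that way")) ' 1
    End Sub
End Module
```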

Lookaround

Lookaround is a feature that is partially implemented in JScript but not in VBScript. There are two
directions of lookaround - lookahead and lookbehind - and two flavors of each direction - positive
assertion and negative assertion. The syntax for each is:

(?=...) - Positive lookAHEAD

(?!...) - Negative lookAHEAD

(?<=...) - Positive lookBEHIND

(?<!...) - Negative lookBEHIND

Understanding look(ahead|behind) requires an understanding of the difference between matching text
and matching position. To help with this understanding I should state first that lookaround
assertions are non-consuming. To see what I mean, let's look at the following simple example.

pattern = "test"
text = "testing"

When the above pattern is applied to the text the "context" of the parser sits at a position in the
text between the "t" and the "i" in the word testing. This is because the regular expression parser
bumps along the string as it gets a match, like so:

Start - ^testing

Match "t" - t^esting

Match "e" - te^sting

Match "s" - tes^ting

Match "t" - test^ing

Once the parser has consumed text, that text belongs to the match and matching continues from the
position after it; an ordinary pattern cannot test what follows without consuming it as well.
To understand where this causes difficulty, consider this: what if you needed to match the word
"test", but only when it was contained in the word "tested" and not in any other combination
such as "tester"? With lookahead you can simply assert that condition like so: (?=tested\b)test

This works because, with lookaround, the parser is not bumped along the string. This can be
especially useful for finding a position in a document by combining a lookahead assertion with a
lookbehind assertion. To demonstrate, let's say that we need to match the string "test" when it
is contained within the string "protested" but not "detested". To do this you can use a negative
lookbehind assertion on "de" and a positive lookahead assertion on "tested", like this:
(?<!de)(?=tested\b)test

In other words you are matching a position at which to start matching text. The above pattern would
set the parser at the following position in the string "protested":

Start - pro^tested

Match "t" - prot^ested

Match "e" - prote^sted

Match "s" - protes^ted

Match "t" - protest^ed
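
The two lookaround patterns discussed above can be tried out with a short sketch like the one below. The module and function names are illustrative assumptions, and the inputs are the words used in the text.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module LookaroundSample
    ' Lookahead only: "test" must be the start of "tested" followed by a word boundary.
    Function IsTested(ByVal input As String) As Boolean
        Return Regex.IsMatch(input, "(?=tested\b)test")
    End Function

    ' Lookbehind plus lookahead: as above, but not when preceded by "de".
    Function IsTestedNotDe(ByVal input As String) As Boolean
        Return Regex.IsMatch(input, "(?<!de)(?=tested\b)test")
    End Function

    Sub Main()
        Console.WriteLine(IsTested("tested"))         ' True
        Console.WriteLine(IsTested("tester"))         ' False
        Console.WriteLine(IsTestedNotDe("protested")) ' True
        Console.WriteLine(IsTestedNotDe("detested"))  ' False

        ' The assertions consume nothing, so the match itself is only "test".
        Dim m As Match = Regex.Match("protested", "(?<!de)(?=tested\b)test")
        Console.WriteLine(m.Index & " " & m.Value)    ' 3 test
    End Sub
End Module
```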

Another good example of using lookaround would be to validate "special" password conditions such as:
"Password must be between 8 and 20 characters, must contain
at least 2 letter characters and at least 2 digit characters.
It can only contain either letter or digit characters."

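For such a password constraint, note that \w also matches digits and the underscore, so the letter
checks need an explicit letter class. The following expression would probably do quite nicely:
^(?=.*?\d.*?\d)(?=.*?[a-zA-Z].*?[a-zA-Z])[a-zA-Z\d]{8,20}$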
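
A sketch of how such a check might be wrapped in a function follows. It uses an explicit [a-zA-Z] class for the letter rules, since \w also matches digits and the underscore and would let an all-digit password through. The function name and the sample passwords are assumptions for illustration.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module PasswordSample
    ' Two lookaheads assert "at least 2 digits" and "at least 2 letters"
    ' without consuming text; the final class enforces length and allowed characters.
    Private Const PasswordPattern As String = _
        "^(?=.*?\d.*?\d)(?=.*?[a-zA-Z].*?[a-zA-Z])[a-zA-Z\d]{8,20}$"

    Function IsValidPassword(ByVal candidate As String) As Boolean
        Return Regex.IsMatch(candidate, PasswordPattern)
    End Function

    Sub Main()
        Console.WriteLine(IsValidPassword("abc12345"))  ' True
        Console.WriteLine(IsValidPassword("12345678"))  ' False: no letters
        Console.WriteLine(IsValidPassword("abcdefgh"))  ' False: no digits
        Console.WriteLine(IsValidPassword("ab12"))      ' False: too short
        Console.WriteLine(IsValidPassword("abcd_1234")) ' False: underscore not allowed
    End Sub
End Module
```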

Readability and Maintainability

One of my personal favorite new features is the ability to have embedded comments in regular
expressions. Most of us will have, at one time or another, come across a regular expression that
looks somewhat like this:

If you are lucky you might find a comment that alludes to the purpose of the regular expression, but,
when the time comes to maintain the expression you are undoubtedly left with a sense of anxiety and,
more often than not, a complete re-write is undertaken as opposed to some minor maintenance
operation. .NET allows regular expression patterns to be authored with embedded comments via the
RegexOptions.IgnorePatternWhitespace option, which lets a pattern be spread over several lines,
and the (?#...) syntax embedded within each line of the pattern string.

This allows for pseudo-code-like comments to be embedded in each line and has the following effect on
readability:
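
The original before-and-after listings are not reproduced in this copy. As a stand-in, here is a sketch that applies the technique to the simple name pattern from earlier in the article; the group names and comment text are illustrative. Because IgnorePatternWhitespace discards literal whitespace in the pattern, the space after the comma is written as \s.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module CommentedPatternSample
    ' One pattern element per line, each followed by an inline (?#...) comment.
    Private ReadOnly CommentedPattern As String = _
        "(?<surname>\w+)    (?# the surname)" & Environment.NewLine & _
        ",\s                (?# a comma followed by one whitespace character)" & Environment.NewLine & _
        "(?<firstname>\w+)  (?# the first name)"

    Function SwapNames(ByVal input As String) As String
        Return Regex.Replace(input, CommentedPattern, "${firstname} ${surname}", _
                             RegexOptions.IgnorePatternWhitespace)
    End Function

    Sub Main()
        Console.WriteLine(SwapNames("Neimke, Darren")) ' Darren Neimke
    End Sub
End Module
```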

Delegates

Finally, a really useful addition to the .NET Framework is that the Regex.Replace()
method allows the use of a delegate as the "replacement" argument. To understand what I'm talking
about, consider the following snippet:
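
The original snippet is not reproduced in this copy. The sketch below is a hypothetical reconstruction: the input string and the pattern are invented so that a plain string replacement yields the result described next, and they are not the author's originals.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module SimpleReplaceSample
    ' Replaces every run of three or more word characters with the letter "a".
    ' Both the pattern and the sample input in Main are assumptions.
    Function ReplaceLongWords(ByVal input As String) As String
        Return Regex.Replace(input, "\w{3,}", "a")
    End Function

    Sub Main()
        Dim myString As String = ReplaceLongWords("lots and lots of little words")
        Console.WriteLine(myString) ' a a a of a a
    End Sub
End Module
```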

After the replace operation has occurred, the value of myString will be "a a a of a a" and it's
fairly obvious what happened. Every time the regular expression parser found a match within the
string it replaced it with the letter "a". That's all nice and easy if all you need to do is a
straight replace, but what if you need to implement some sort of business logic in the check,
or you need to "touch" the sub-matches in some way and re-build the replaced string?

A good enough example is converting all words within a body of text to proper case (i.e. first letter
capitalized). To do this your first instincts might be to create a pattern like so: \b(\w)(\w+)?\b.
You could then enumerate the matches, convert the first sub-match to its uppercase version, join the
sub-matches and re-append them to a StringBuilder instance, like so:
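
A sketch of that first-instinct approach might look like this (the original listing is not reproduced in this copy, and the names are illustrative). Note that only the matched words are appended to the StringBuilder, so anything between the matches is lost.

```vbnet
Imports System
Imports System.Text
Imports System.Text.RegularExpressions

Module ProperCaseByHandSample
    ' Rebuilds the string from the matches alone. Text between matches is dropped.
    Function ProperCaseLossy(ByVal input As String) As String
        Dim sb As New StringBuilder()
        For Each m As Match In Regex.Matches(input, "\b(\w)(\w+)?\b")
            sb.Append(m.Groups(1).Value.ToUpper()) ' first letter, upper-cased
            sb.Append(m.Groups(2).Value)           ' rest of the word (may be empty)
        Next
        Return sb.ToString()
    End Function

    Sub Main()
        Console.WriteLine(ProperCaseLossy("~~~ This %%% is ### a chunk of text."))
        ' ThisIsAChunkOfText
    End Sub
End Module
```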

That would work fine if your string contained only word characters, but what if it looked like this:
~~~ This %%% is ### a chunk of text.
After the replacement operation you would end up with the following string, because all of the non-word
characters that didn't participate in the matches were dropped: ThisIsAChunkOfText
There are ways around it, mostly by building bigger, more complex patterns and doing more string
building inside the match collection iteration.

A more elegant solution is to wire up a MatchEvaluator delegate. You can think of a
MatchEvaluator as an event handler that fires when an "OnMatch event"
occurs. You provide the MatchEvaluator with a pointer (reference) to a handler function
and that function will be called each time a match is encountered. The function must take a
Match parameter as its single argument and must return a String back to the regular
expression Replace method that invoked it. This method of replacement allows you the
flexibility to do all sorts of operations transparently to the Replace method itself,
and because it is all handled within the Replace method call, you are not left with
having to re-build a string as in the previous example.

A demonstration is in order - let's re-write our previous failed attempt at converting a string to
proper case using delegates:
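
The original listing is not reproduced in this copy; the following is a minimal sketch of the delegate-based version, with illustrative module and function names. The handler receives each Match and returns its replacement, while Regex.Replace leaves all unmatched text in place.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module ProperCaseWithDelegateSample
    ' Handler invoked once per match; returns the replacement for that match.
    Function CapitalizeWord(ByVal m As Match) As String
        Return m.Groups(1).Value.ToUpper() & m.Groups(2).Value
    End Function

    Function ProperCase(ByVal input As String) As String
        ' AddressOf supplies the reference to the handler function.
        Return Regex.Replace(input, "\b(\w)(\w+)?\b", _
                             New MatchEvaluator(AddressOf CapitalizeWord))
    End Function

    Sub Main()
        Console.WriteLine(ProperCase("~~~ This %%% is ### a chunk of text."))
        ' ~~~ This %%% Is ### A Chunk Of Text.
    End Sub
End Module
```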

As you can see, the separation is much cleaner and having the replacement logic handled in a
separate handler method allows you to implement very complicated operations without affecting
readability, maintainability or - and most importantly - data integrity as a result of missing data
in a string re-building operation.

Conclusion

Novice programmers often tend to rely heavily on inelegant, unwieldy, or slow solutions that focus
heavily on string handling operations; programmers with a higher command of languages are more commonly
turning to regular expressions to manage and manipulate chunks of text.

The .NET flavor of regular expressions allows regular expressions to be written in a more efficient
and maintainable manner. While learning and mastering regular expressions takes time, the ultimate reward
is an increased ability to provide accurate solutions efficiently.

There is a sample ASP.NET Web page that uses many of the advanced features
discussed in this article that you can try out. Specifically, the sample Web page
retrieves the HTML from a remote Web server and then prefixes a URL to all hyperlinks that do not start
with http://.