Using Regular Expressions to Parse for Email Addresses

Advanced Email Regular Expressions Pattern

While the previous email pattern catches most email addresses, it is far from complete. This section shows, one step at a time, how to build a much more robust email pattern that catches just about every valid email address format. To begin with, the following pattern looks for "exact matches". In other words, you shouldn't use it to parse a document, but rather to validate a single email address:

^[^@]+@([-\w]+\.)+[A-Za-z]{2,4}$

Personally, I find it easier to read a pattern by dissecting it into components and then attempting to understand each of the components as they relate to the overall pattern. Having said that, this pattern breaks down to the following parts:

^: The caret character at the beginning of a pattern tells the parser to match from the beginning of the line, since the focus of this pattern is to validate a single email address.

[^@]+: When the caret character is the first character inside brackets, it tells the parser to match any character that is not one of the characters listed after it. Therefore, here the pattern matches text that contains no at-sign (@) character. (The plus sign tells the parser to find one or more of these non-at-sign characters leading up to the next component of the expression.)

@: Matches the at-sign literal.

([-\w]+\.)+: This part of the pattern matches everything from the @ to the top-level domain (e.g., .com, .edu, etc.). Many email addresses have a format like tom.archer@archerconsultinggroup.com, where this part matches "archerconsultinggroup.", and some addresses place further dot-separated labels in front of the top-level domain. The first part, [-\w]+, tells the parser to find one or more word characters or dashes. The \. matches the literal period that follows those characters. Finally, all of that is placed within parentheses and modified with the plus operator to specify one or more instances of the entire group.

[A-Za-z]{2,4}$: Matches the terminating part of the expression, the top-level domain. At this point, reading this part of the pattern should be pretty easy. It simply requires between two and four letters. The $ character tells the parser that these letters should be the end of the input string. (In other words, $ denotes end of input, compared with ^, which denotes beginning of input.)

In order to test for "direct matches", you need a very simple function like the following:
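The original Managed C++ listing is not reproduced in this text, so the following is only a rough Python sketch of what such a function might look like. The function name is_valid_email is hypothetical; in .NET, the comparable check would presumably go through the Regex class (for example, its IsMatch method).

```python
import re

# Pattern from the article, compiled once and reused.
_EMAIL_RE = re.compile(r"^[^@]+@([-\w]+\.)+[A-Za-z]{2,4}$")


def is_valid_email(candidate: str) -> bool:
    """Return True when the anchored pattern matches the string."""
    return _EMAIL_RE.match(candidate) is not None


if __name__ == "__main__":
    for address in (
        "tom.archer@archerconsultinggroup.com",
        "tom@localhost",
        "two@@example.com",
    ):
        print(address, is_valid_email(address))
```

Because the pattern is anchored at both ends, a sentence that merely contains an address followed by more text is rejected, which is why this form suits validation rather than searching.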

A pattern meant for searching an entire string for email addresses, rather than validating a single one, differs from the previous pattern in the following ways:

I removed the beginning-of-line metacharacter (^) since the pattern will be used to search through an entire string for all email addresses (instead of being used to validate the entire string for a single email address).

I used the ?: capture inhibitor operator so that I don't capture unneeded submatches.

As with the beginning-of-line metacharacter, I also removed the end-of-line metacharacter ($).

I implemented additional "grouping" to locate all emails in a provided input string.
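The changes listed above can be sketched in Python as follows. This is not the author's search pattern, which does not appear in this text; it is an assumption built by applying those changes to the validation pattern. Two details are additions of mine and not from the article: \s is excluded from the local part so a match cannot run across whitespace, and a trailing \b keeps the match from stopping in the middle of a longer final label. The names EMAIL_SEARCH and find_emails are hypothetical.

```python
import re
from typing import List

# Assumed search pattern (not from the article):
#   - no ^ or $ anchors, so it can match anywhere in the text
#   - (?: ... ) around the domain labels so they are not captured
#   - one outer capturing group around each whole address
#   - [^@\s] and the trailing \b are this sketch's own additions
EMAIL_SEARCH = re.compile(r"([^@\s]+@(?:[-\w]+\.)+[A-Za-z]{2,4})\b")


def find_emails(text: str) -> List[str]:
    """Return every substring of text that the search pattern matches."""
    return EMAIL_SEARCH.findall(text)


if __name__ == "__main__":
    sample = (
        "Contact tom.archer@archerconsultinggroup.com or "
        "sales@mail.example.org today; ignore user@localhost."
    )
    for address in find_emails(sample):
        print(address)
```

One known weakness of this sketch is that punctuation directly in front of an address, such as an opening angle bracket, would be picked up as part of the local part.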

So the natural question at this point would be "Is this pattern guaranteed to find every single valid email address?" After doing quite a bit of research on this issue, I found that an all-encompassing email regular expression pattern is almost 6,000 bytes in length! However, that pattern would be necessary to catch only a minuscule percentage of email addresses that the patterns illustrated in this article won't. The two patterns that I've covered will catch 99 percent of all email addresses.
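To make the trade-off concrete, here is a small Python sketch of the kinds of addresses that fall outside the validation pattern. The sample addresses are my own illustrations, chosen on the assumption that RFC 5322 is the yardstick for validity; they are not taken from the article.

```python
import re

# Validation pattern from the article.
EMAIL_RE = re.compile(r"^[^@]+@([-\w]+\.)+[A-Za-z]{2,4}$")

# Syntactically valid under RFC 5322, yet rejected by the pattern.
valid_but_rejected = [
    "user@example.museum",      # final label longer than four letters
    '"odd@local"@example.com',  # quoted local part containing an at-sign
    "user@[192.168.0.1]",       # domain given as an IP address literal
]

# Not valid (unquoted space in the local part), yet accepted by the pattern.
invalid_but_accepted = ["two words@example.com"]

for address in valid_but_rejected + invalid_but_accepted:
    print(repr(address), EMAIL_RE.match(address) is not None)
```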

Regular Expressions: A Lot of Ground to Cover

My original intention for a series on using the .NET regular expressions classes from Managed C++ was to simply cover some basic patterns and usages. However, the more I wrote, the more I realized needed to be covered. So it turned out to be a much-longer-than-planned series. It covered splitting strings, finding matches within a string, using regular expression metacharacters, grouping, creating named groups, working with captures, performing advanced search-and-replace functions, and finally writing a complex email pattern.

Hopefully along the way, those of you who are new to regular expressions saw just how powerful they can be. Just think of how much manual text parsing code would be necessary to parse a block of text for (almost) every conceivable email address. Compare that with the single line of code it takes with regular expressions! For those who wish to learn still more about working with the .NET regular expressions classes, my book, Extending MFC Applications with the .NET Framework, provides a full 50-page chapter on the subject and introduces half a dozen demo applications with code that you can easily plug into your own production code.

Acknowledgements

I would like to thank Don J. Plaistow, a Perl and Regular Expressions guru who helped me tremendously when I first started learning regular expressions. Don's guidance was especially valuable with regard to the email patterns in this article.

About the Author

Tom Archer owns his own training company, Archer Consulting Group, which specializes in educating and mentoring .NET programmers and providing project management consulting. If you would like to find out how the Archer Consulting Group can help you reduce development costs, get your software to market faster, and increase product revenue, contact Tom through his Web site.