What’s wrong with this code, part 8 – Email Address Validation

Today’s example is really simple, and hopefully easy. It’s a snippet of code I picked up from the net that’s intended to validate an email address (useful for helping to avoid SQL injection attacks, for example).

I’ve read in the O’Reilly regular expression book (the one with the owl) that it is impossible to do e-mail address validation properly with a regular expression, and for anyone who wants to do it anyway, they printed a regular expression that fills a full page.

.info is actually OK with this RE, as is .name. .museum wouldn’t be, however, because its TLD is longer than four characters. Also, apparently, noc@to, for example, is a valid address for the people who manage certain ccTLDs. Also, + is a valid and fairly common character in the portion before the at sign.
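
To make those cases concrete, here is a minimal Python sketch. The pattern SIMPLE_EMAIL is a simplified stand-in assumed for illustration, not the exact expression from the post; it only keeps the two traits being discussed: a TLD of 2 to 4 letters and a local part limited to letters, digits, "_", "." and "-".

```python
import re

# Hypothetical, simplified pattern (an assumption for illustration only):
#   local part: letters, digits, "_", ".", "-"
#   domain:     one or more dot-terminated labels, then a 2-4 letter TLD
SIMPLE_EMAIL = re.compile(r"[a-zA-Z0-9_.-]+@([a-zA-Z0-9-]+\.)+[a-zA-Z]{2,4}")


def accepts(address: str) -> bool:
    """Return True if the whole address matches the simplified pattern."""
    return SIMPLE_EMAIL.fullmatch(address) is not None


samples = [
    "bob@example.info",        # 4-letter TLD
    "bob@example.name",        # 4-letter TLD
    "curator@example.museum",  # 6-letter TLD
    "noc@to",                  # TLD only, no dot in the domain
    "bob+lists@example.com",   # "+" in the local part
]
results = {address: accepts(address) for address in samples}

for address, ok in results.items():
    print(f"{address:26} {'accepted' if ok else 'rejected'}")
```

Under these assumptions the first two samples are accepted and the last three are rejected, even though all five are plausible real-world addresses.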

Oh, and that split regex is amazingly ugly. It’s much better to split on something closer to atom boundaries (for some reasonable level of "atom"), even though not all chunks are the same length.

The {1,3} portions following the [0-9] are searching for IP address patterns, one to three digits per quad. The {2,4} after the [a-zA-Z] pattern is looking for 2-, 3-, or 4-character TLDs. There are three copies of this [0-9] junk, and then the trailing pattern allows for the final quad.
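
A short Python sketch of what a {1,3} digit count does and does not check. The bracketed form and the names QUAD, IP_LITERAL, looks_like_ip_literal and is_real_ipv4_literal are assumptions made for this illustration, not taken from the original snippet. The point is that counting digits accepts quads such as 999, so a range check is still needed; here the standard ipaddress module does that part.

```python
import ipaddress
import re

# One quad as described above: one to three digits, with no range check.
QUAD = r"[0-9]{1,3}"

# Three "quad plus dot" copies, then the final quad, inside square brackets.
IP_LITERAL = re.compile(rf"\[(?:{QUAD}\.){{3}}{QUAD}\]")


def looks_like_ip_literal(text: str) -> bool:
    """Shape check only: four groups of 1-3 digits inside brackets."""
    return IP_LITERAL.fullmatch(text) is not None


def is_real_ipv4_literal(text: str) -> bool:
    """Shape check plus a range check on every quad via ipaddress."""
    if not looks_like_ip_literal(text):
        return False
    try:
        ipaddress.IPv4Address(text[1:-1])  # strip the surrounding brackets
    except ValueError:
        return False
    return True


for candidate in ["[192.168.0.1]", "[999.999.999.999]", "[1.2.3]"]:
    print(candidate, looks_like_ip_literal(candidate), is_real_ipv4_literal(candidate))
```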

It’s the ([a-zA-Z0-9-]+.)+ pattern that allows for any number (one or more) of dot-separated labels, which covers the domain and any subdomains.
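
One detail worth checking in that pattern is the dot. As quoted it is a bare ".", which matches any character; the sketch below assumes the intended form escapes it as "\." (the backslash may simply have been lost in formatting). Both patterns and the trailing 2-to-4-letter TLD are assumptions for illustration, written in Python.

```python
import re

# As quoted: the "." after each label matches ANY character.
UNESCAPED = re.compile(r"([a-zA-Z0-9-]+.)+[a-zA-Z]{2,4}")

# Assumed intent: each label is followed by a literal dot.
ESCAPED = re.compile(r"([a-zA-Z0-9-]+\.)+[a-zA-Z]{2,4}")


def matches(pattern: re.Pattern, domain: str) -> bool:
    """Return True if the whole domain string matches the pattern."""
    return pattern.fullmatch(domain) is not None


for domain in ["example.com", "mail.corp.example.com", "mail_example!com"]:
    print(f"{domain:24} unescaped={matches(UNESCAPED, domain)} "
          f"escaped={matches(ESCAPED, domain)}")
```

With the escaped dot, the repeated group accepts one label or many, and a string like "mail_example!com" is rejected; with the bare dot it slips through.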

Great resource for testing .NET and client-side RegEx at http://www.regxlib.com. They also list 30 user-submitted variations on email validation patterns.

It was a pain in the ass to do, but I actually wrote an e-mail address format verifier in JavaScript once. I essentially did a line-by-line translation of the BNF in RFC 822 into JavaScript strings containing equivalent regular expressions, then matched whatever I wanted to check against the combined regular expression. What is a pain is that I have to use the backslash character to escape a character in the regular expression, and because the regular expression is in a string, I have to escape every backslash as well. There is definitely a lot more to a correct e-mail address than what that example script uses. You can have extended characters, for example.
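
A rough sketch of that approach, in Python rather than JavaScript. It translates only a small subset of the RFC 822 grammar (atoms joined by dots on both sides of the at sign) and leaves out quoted strings, comments and domain literals; the names ATOM, LOCAL_PART, DOMAIN, ADDR_SPEC and is_addr_spec are my own. It also shows the double-escaping problem: in an ordinary string every regex backslash has to be doubled, which raw strings avoid.

```python
import re

# The same one-character regex (a literal dot) written two ways.
doubled = "\\."  # ordinary string: the regex backslash must itself be escaped
raw = r"\."      # raw string: written exactly as the regex engine sees it
assert doubled == raw

# atom: one or more printable ASCII characters, excluding space and the
# RFC 822 "specials"  ( ) < > @ , ; : \ " . [ ]
ATOM = r"[!#$%&'*+\-/0-9=?A-Z^_`a-z{|}~]+"

# Simplified subset: local-part and domain are both atoms separated by dots.
LOCAL_PART = rf"{ATOM}(?:\.{ATOM})*"
DOMAIN = rf"{ATOM}(?:\.{ATOM})*"
ADDR_SPEC = re.compile(rf"{LOCAL_PART}@{DOMAIN}")


def is_addr_spec(text: str) -> bool:
    """Check text against the simplified addr-spec subset above."""
    return ADDR_SPEC.fullmatch(text) is not None


for candidate in ["bob+lists@example.museum", "noc@to", "a..b@example.com", "bob@"]:
    print(f"{candidate:26} {is_addr_spec(candidate)}")
```

Building the expression from named pieces keeps each fragment close to the grammar rule it came from, which makes the result easier to check than one long literal.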

I tried to write my own validator a couple of weeks ago, but it was too complicated. I didn’t want to cover the whole RFC 822/2822 standard, just a subset of it that can be easily covered by a regular expression.