Validating Email Addresses

I use the + quite a bit to separate the wheat from the chaff. If I'm forced to sign up for something on a website and I have to get the unlock through a valid email address, I just slap the name of the website I used the email address on (myemail+thedailywtf.com@mydomain.com) and watch as they either ignore or honor their privacy policy. I can't tell you how many times I see my email address come in the mail from a website that supposedly never sells email addresses to third parties.
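The tagging trick above can be sketched in a few lines. This is only an illustration: the helper names `tagAddress` and `extractTag` are made up here, and it assumes the receiving mail server treats everything after the first + in the local part as a tag.

```javascript
// Hypothetical helpers for "+" sub-addressing (names invented for illustration).

// Insert "+tag" just before the "@" of an address.
function tagAddress(address, tag) {
  const at = address.lastIndexOf("@");
  if (at <= 0) throw new Error("expected local-part@domain");
  return address.slice(0, at) + "+" + tag + address.slice(at);
}

// Return the tag of a "+"-tagged address, or null if there is none.
function extractTag(address) {
  const at = address.lastIndexOf("@");
  const local = at === -1 ? address : address.slice(0, at);
  const plus = local.indexOf("+");
  return plus === -1 ? null : local.slice(plus + 1);
}

const tagged = tagAddress("myemail@mydomain.com", "thedailywtf.com");
console.log(tagged);             // myemail+thedailywtf.com@mydomain.com
console.log(extractTag(tagged)); // thedailywtf.com
```

When spam arrives, the tag on the recipient address tells you which signup it came from.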

captcha: burned (indeed, indeed)

PJH 2007-02-19 09:56

it's a pet peeve of mine when people "validate" away perfectly valid addresses, for instance: websites that think all domains end in .com, .net, .edu, or .org; and agents that refuse to transfer mail with a + in the local-part.
[...]
And as I promised, here's my own RegExp for you to tear apart. (Yes, I know it doesn't handle a quoted local-part. No, I don't mind. Seriously, who does that?)

Um - the people who might use + in the local part?

PJH 2007-02-19 09:59

Xeronos:

I can't tell you how many times I see my email address come in the mail from a website that supposedly never sells email addresses to third parties.

I can count on the hands of one arm this has happened to me - and judging by the subsequent 'spam' I got, it appears to have been an 'inside job' where someone leaving the company acquired the address. All my other unique addresses have only ever been used by the companies they've been attributed to.

TheJasper 2007-02-19 10:00

The format for e-mail addresses is specified in a number of RFCs; it's a pet peeve of mine when people "validate" away perfectly valid addresses, for instance: websites that think all domains end in .com, .net, .edu, or .org; and agents that refuse to transfer mail with a + in the local-part. To that end, I wrote my own regular expression that (I believe) follows the specification, which I'll share below.

And as I promised, here's my own RegExp for you to tear apart. (Yes, I know it doesn't handle a quoted local-part. No, I don't mind. Seriously, who does that?)

So, you don't like it that valid addresses are invalidated, so you present a regex that follows the spec... except you then admit it doesn't do everything because... well, who uses that anyway? Somewhere there is a person gnashing their teeth because their perfectly valid address isn't being validated by your code.

btw, I don't know the exact spec, and haven't tried to figure out if your regexp follows it. I don't think my boss would appreciate the time spent ;}

MooseBrains 2007-02-19 10:02

Whitespace is allowed in email addresses, as are constructs like:

"Moose Brains !!!" @ (yes, this is my address) spam.la <MooseBrains>

which both would fall over on.

Anonymous Tart 2007-02-19 10:03

use Mail::RFC822::Address qw(valid);

Steve 2007-02-19 10:08

Ooooh, ooh! I got it!

The WTF (apart from this stupid comment box I'm typing in being only 20x2 characters) is that he thinks the 'at' sign is called an ampersand.

PS. That RegEx would fail on email addresses that use an IP address instead of a FQ name.

leeg 2007-02-19 10:10

Your regexp doesn't support valid addresses such as billg@[131.107.115.212]
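For anyone who wants to accept that form, here is a rough sketch of recognising a bracketed IPv4 literal in the domain part. The function name `isIPv4Literal` is invented for this example, and it deliberately ignores other kinds of domain literal.

```javascript
// Hypothetical check: is the domain part a bracketed IPv4 literal like "[131.107.115.212]"?
function isIPv4Literal(domain) {
  const m = /^\[(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})\]$/.exec(domain);
  // Each captured octet must also be in the range 0..255.
  return m !== null && m.slice(1).every((octet) => Number(octet) <= 255);
}

const address = "billg@[131.107.115.212]";
const domain = address.slice(address.lastIndexOf("@") + 1);
console.log(isIPv4Literal(domain));        // true
console.log(isIPv4Literal("example.com")); // false
```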

Dave 2007-02-19 10:15

The regexp for validating all compliant email addresses is too large to fit in this margin.

(Seriously, it's pretty big.)

mol 2007-02-19 10:19

Don't tell me that you think that the ugly regex is more readable than the javascript version. The purpose of email validation is just to check for common errors; it makes no sense to try to validate perfectly, because it won't save you from a valid but nonexistent address (you just have to send the mail there and wait for the response).

regex looks like garbage.
is there a framework somewhere that can validate an email address?

Don't FTP servers have constructs that allow them to verify email addresses as being valid without physically checking them on the internet?

Jeroen 2007-02-19 10:48

my ISP doesn't support the local part!

morry 2007-02-19 10:49

The Regexes are utterly unreadable and therefore unmaintainable. I'd hate to have to fix one of those monsters.

Recently I overheard a colleague's phone conversation. He was babbling on about the email validation not being tight enough. "After the period, it should check for exactly 3 characters. You know: .com .org .net. But it shouldn't just check for those three; we don't want to limit ourselves if they come up with more TLDs. I'll raise a low priority defect for that after the call."

So I shot him off an email (not wanting to interrupt his call) giving him some examples of .info and .co.uk email addresses. I didn't have the heart to show him the RFC.

TheD 2007-02-19 10:52

PJH:

I can count on the hands of one arm...

For some reason, this was hilarious to me. Maybe I need more coffee?

craaazy 2007-02-19 10:59

I believe it's much better to not do any validation, really.
Even better, let e-mail be an optional field, and only if it's filled in, complain about a missing @ - and only if you really need that @ to be there (you might want to accept local accounts/aliases, ugly Exchange/Notes addresses, X.500, whatnot).

Other than that, the syntax is simply too complex. Even if you catch someone forgetting a bit of the FQDN or whatever, you still can't catch them making "valid" typos (yuo@exmapel.com).

If you're going to be using the e-mail address for, well, sending e-mail, you still need to send a confirmation e-mail just so you won't be called a spammer anyway. So let their mail server do the checking for you.

The one exception I can think of, is if your e-mail system itself has some limitation (that isn't in the specs). For example, if your system simply can't handle IP addresses and quoted local parts, validate against those.
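A minimal sketch of that "optional field, complain only about a missing @" policy might look like the following. The function name and the message text are invented for the example.

```javascript
// Hypothetical form check: an empty field is fine, a filled one only needs an "@".
function checkOptionalEmail(value) {
  const trimmed = value.trim();
  if (trimmed === "") {
    return { ok: true, message: null }; // optional field left blank
  }
  if (!trimmed.includes("@")) {
    return { ok: false, message: "That doesn't look like an e-mail address (no @)." };
  }
  return { ok: true, message: null };   // everything else is left to the mail server
}

console.log(checkOptionalEmail("").ok);                 // true
console.log(checkOptionalEmail("yuo@exmapel.com").ok);  // true: "valid" typos still pass
console.log(checkOptionalEmail("Moose Brains").ok);     // false
```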

LizardKing 2007-02-19 10:59

Hmm, email address validation is a nasty one. I remember trying to validate by doing lookup on the hostname portion, only to get scuppered by mail servers that don't resolve but are valid. I forget the details as this was many aeons ago, however a more experienced colleague pointed me at some RFCs (and would have probably submitted my code as a WTF if this site had been around).

The Regexes are utterly unreadable and therefore unmaintainable. I'd hate to have to fix one of those monsters.

I read a quote somewhere that described regex as a "write once, read never" syntax. It's the poster child for the difference between "Clever" and "Wise".

Captcha: tesla - good scientist, BAAAD band...

Tukaro 2007-02-19 11:09

Er... I use a much simpler check than most of you do; perhaps it doesn't cover everything, but this is an internal thing, so it doesn't need to.

/^([a-zA-Z0-9_\.\-])+\@(([a-zA-Z0-9\-])+\.)+([a-zA-Z0-9]{2,4})+$/
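To see what that pattern lets through, it can be run against a few of the addresses mentioned in this thread. This is just a quick experiment; the sample addresses using example.com and example.museum are made up.

```javascript
// The pattern exactly as posted above.
const tukaro = /^([a-zA-Z0-9_\.\-])+\@(([a-zA-Z0-9\-])+\.)+([a-zA-Z0-9]{2,4})+$/;

console.log(tukaro.test("user@example.com"));                     // true
console.log(tukaro.test("i@example.com"));                        // true: one-letter local part is fine
console.log(tukaro.test("myemail+thedailywtf.com@mydomain.com")); // false: "+" is not in the local-part class
console.log(tukaro.test("billg@[131.107.115.212]"));              // false: no IP literals
// The trailing "+" repeats the {2,4} group, so the last label has no real upper bound:
console.log(tukaro.test("user@example.museum"));                  // true
```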

steve 2007-02-19 11:10

I just like how in the code they have the var "ampisthere" for ampersand (&) and they think that's what you call @

MULL 2007-02-19 11:12

The correct regexp for emails can be found there:
http://examples.oreilly.com/regex/

stevekj 2007-02-19 11:13

Steve:

Ooooh, ooh! I got it!

The WTF (apart from this stupid comment box I'm typing in being only 20x2 characters) is that he thinks the 'at' sign is called an ampersand.

I don't think that's it. I think he's using "amp" as a short form for "ampersat", which is indeed a more or less valid reference to "@". The real WTF is that no one besides this particular coder knows what an "ampersat" is.

The other real WTF is that you can also refer to "@" as an "asperand". WTF?

In a Google battle between "asperand" and "ampersat", "ampersat" comes out slightly ahead - but both are practically undefined, by Google standards, at just under 3k references each. So using either one in code that is going to be maintained by someone else is definitely a WTF.

Er... I use a much simpler check than most of you do; perhaps it doesn't cover everything, but this is an internal thing, so it doesn't need to.

/^([a-zA-Z0-9_\.\-])+\@(([a-zA-Z0-9\-])+\.)+([a-zA-Z0-9]{2,4})+$/

Yeah, I use something much simpler even than that in internal code.

!/^$/

If they don't get their email, it is highly likely that they didn't put in the right address :)

Buzz 2007-02-19 11:23

From http://www.eskimo.com/~hottub/software/programming_quotes.html

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

Jamie Zawinski

TSK 2007-02-19 11:27

To add insult to injury, there may in future be e-mail addresses on IDNs (Internationalized Domain Names) with umlauts, ogoneks, cedillas and such stuff... they will be transformed by nameprep and punycode into RFC addresses before use, but that won't help with validating them....

To regexps: They violate the good old KISS principle ("Writing Solid Code"). They are hard to read (both visually and mentally), they cannot be adequately commented (if you have qualms about spreading both comment and regexp over the page) AND they are fragile (you know what I mean if you have accidentally typed one more char than necessary)... sometimes it breaks, sometimes not.
I think some pride is involved in being able to set up a mighty "all-cases-in-one" regexp, but for maintenance the long monsters are garbage.
It's not so much fun, but keep the style boring; write so that you know five pages beforehand what is going to happen. In this case break the mail address into parts and verify them individually (with short regexps, yes) and comment what you are doing.
You will be pleased when you are forced to rewrite, under time pressure, old routines which you haven't seen in a year.
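As a rough sketch of that "boring" style: split the address first, then check each part with a short, commented regexp. All function names here are invented for illustration, the character class follows the regex discussed in this thread, and quoted local parts and IP literals are deliberately left out.

```javascript
// Split at the LAST "@"; returns null when either side would be empty.
function splitAddress(address) {
  const at = address.lastIndexOf("@");
  if (at <= 0 || at === address.length - 1) return null;
  return { local: address.slice(0, at), domain: address.slice(at + 1) };
}

// Unquoted local part: dot-separated, non-empty runs of the allowed characters.
const LOCAL_ATOM = /^[-!#$%&'*+\/0-9=?A-Z^_a-z{|}~]+$/;
function isPlainLocalPart(local) {
  return local.split(".").every((atom) => LOCAL_ATOM.test(atom));
}

// Host name: at least two dot-separated labels; hyphens only inside a label.
const HOST_LABEL = /^[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?$/;
function isHostName(domain) {
  const labels = domain.split(".");
  return labels.length >= 2 && labels.every((label) => HOST_LABEL.test(label));
}

function looksLikeAddress(address) {
  const parts = splitAddress(address);
  return parts !== null && isPlainLocalPart(parts.local) && isHostName(parts.domain);
}

console.log(looksLikeAddress("myemail+thedailywtf.com@mydomain.com")); // true
console.log(looksLikeAddress("a..b@example.com"));                     // false: empty atom between the dots
```

Each check fits on one line and can be changed without touching the others.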

Asd 2007-02-19 11:29

After reading this I had a look at what the jakarta commons validator did. And found a nice mini WTF. The EmailValidator is a singleton despite it not having any state. God I wish they had never invented that pattern.

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
Jamie Zawinski

Some people, when confronted with regular expressions, like to quote jwz. Now they are fools who cannot cope with regular expressions.

mathew 2007-02-19 11:39

The only validation of e-mail addresses I do is to check it matches .+@.+\..+

i.e. there's an @, there are characters on both sides of the @, there's at least one . to the right of the @, there are characters on both sides of the .

Remember, no syntactic check is going to determine whether an e-mail address is actually valid and working. All you're checking for is obvious brokenness like putting in localhost-specific user IDs, or putting their name in the field instead of their e-mail address.

another moron 2007-02-19 11:40

regex sucks. they are a real solo trip, to be trotted out after 15 cups of coffee and a similar number of cigarettes. maintainability? you're joking... just start again!

skington 2007-02-19 11:44

TSK:

To regexps: They violate the good old KISS principle ("Writing Solid Code"). They are hard to read (both visually and mentally), they cannot be adequately commented (if you have qualms about spreading both comment and regexp over the page) AND they are fragile (you know what I mean if you have accidentally typed one more char than necessary)... sometimes it breaks, sometimes not.

Perl has allowed comments and non-meaningful white space in regexes since 1998. That you can write regexes that look like line noise doesn't mean you have to.

That big monster of a regex that validates RFC822 email addresses exists because, if you're going to validate email addresses, you may as well validate them properly - by dint of building up a regex bit by bit, and then, once you're happy it works, compiling it down to one big long humungous lump of code for performance, so other people can just say "use Email::Valid" or whatever other Perl modules use it. In the same way that ages ago people used to shorten variables and eliminate white space to fit more code into 32K, or however much RAM their machine had at the time. It doesn't mean you develop that way.
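JavaScript has no direct equivalent of Perl's commented regex syntax, but the same "build it up bit by bit" idea can be approximated by assembling the pattern from named, commented pieces. The sketch below rebuilds the article's regex that way; the constant names are invented, and the groups are made non-capturing, which does not change what matches.

```javascript
// One allowed character of an unquoted local part.
const ATEXT = "[-!#$%&'*+/0-9=?A-Z^_a-z{|}~]";

// Local part: allowed characters, with at most one dot between any two of them.
const LOCAL_PART = ATEXT + "(?:\\.?" + ATEXT + ")*";

// One host label: starts with a letter, single hyphens only between letters/digits.
const LABEL = "[a-zA-Z](?:-?[a-zA-Z0-9])*";

// Domain: two or more labels separated by dots.
const DOMAIN = LABEL + "(?:\\." + LABEL + ")+";

// The compiled "one big lump", built once from the readable pieces.
const ADDRESS = new RegExp("^" + LOCAL_PART + "@" + DOMAIN + "$");

console.log(ADDRESS.test("myemail+thedailywtf.com@mydomain.com")); // true
console.log(ADDRESS.test("billg@[131.107.115.212]"));              // false
```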

Toby 2007-02-19 11:49

With the help of the wonderful RegExBuddy the splendid regex above can be translated into an approximation of English.
-----------

^[-!#$%&'*+/0-9=?A-Z^_a-z{|}~](\.?[-!#$%&'*+/0-9=?A-Z^_a-z{|}~])
*@[a-zA-Z](-?[a-zA-Z0-9])*(\.[a-zA-Z](-?[a-zA-Z0-9])*)+$
Assert position at the start of the string «^»
Match a single character present in the list below «[-!#$%&'*+/0-9=?A-Z^_a-z{|}~]»
One of the characters "-!#$%&'*+/" «-!#$%&'*+/»
A character in the range between "0" and "9" «0-9»
One of the characters "=?" «=?»
A character in the range between "A" and "Z" «A-Z»
One of the characters "^_" «^_»
A character in the range between "a" and "z" «a-z»
One of the characters "{|}~" «{|}~»
Match the regular expression below and capture its match into backreference
number 1 «(\.?[-!#$%&'*+/0-9=?A-Z^_a-z{|}~])»
Match the character "." literally «\.?»
Between zero and one times, as many times as possible, giving back as
needed (greedy) «?»
Match a single character present in the list below «[-!#$%&'*+/0-9=?A-Z^_a-z{|}~]»
One of the characters "-!#$%&'*+/" «-!#$%&'*+/»
A character in the range between "0" and "9" «0-9»
One of the characters "=?" «=?»
A character in the range between "A" and "Z" «A-Z»
One of the characters "^_" «^_»
A character in the range between "a" and "z" «a-z»
One of the characters "{|}~" «{|}~»
Match the character "
" literally «
»
Match the character "
" literally «
*»
Between zero and unlimited times, as many times as possible, giving back
as needed (greedy) «*»
Match the character "@" literally «@»
Match a single character present in the list below «[a-zA-Z]»
A character in the range between "a" and "z" «a-z»
A character in the range between "A" and "Z" «A-Z»
Match the regular expression below and capture its match into backreference
number 2 «(-?[a-zA-Z0-9])*»
Between zero and unlimited times, as many times as possible, giving back
as needed (greedy) «*»
Note: You repeated the backreference itself. The backreference will
capture only the last iteration. Put the backreference inside a group and
repeat that group to capture all iterations. «*»
Match the character "-" literally «-?»
Between zero and one times, as many times as possible, giving back as
needed (greedy) «?»
Match a single character present in the list below «[a-zA-Z0-9]»
A character in the range between "a" and "z" «a-z»
A character in the range between "A" and "Z" «A-Z»
A character in the range between "0" and "9" «0-9»
Match the regular expression below and capture its match into backreference
number 3 «(\.[a-zA-Z](-?[a-zA-Z0-9])*)+»
Between one and unlimited times, as many times as possible, giving back as
needed (greedy) «+»
Note: You repeated the backreference itself. The backreference will
capture only the last iteration. Put the backreference inside a group and
repeat that group to capture all iterations. «+»
Match the character "." literally «\.»
Match a single character present in the list below «[a-zA-Z]»
A character in the range between "a" and "z" «a-z»
A character in the range between "A" and "Z" «A-Z»
Match the regular expression below and capture its match into
backreference number 4 «(-?[a-zA-Z0-9])*»
Between zero and unlimited times, as many times as possible, giving
back as needed (greedy) «*»
Note: You repeated the backreference itself. The backreference will
capture only the last iteration. Put the backreference inside a group
and repeat that group to capture all iterations. «*»
Match the character "-" literally «-?»
Between zero and one times, as many times as possible, giving back
as needed (greedy) «?»
Match a single character present in the list below «[a-zA-Z0-9]»
A character in the range between "a" and "z" «a-z»
A character in the range between "A" and "Z" «A-Z»
A character in the range between "0" and "9" «0-9»
Assert position at the end of the string (or before the line break at the end
of the string, if any) «$»

Buzz 2007-02-19 11:50

Some people, when defending regular expressions, like to show their arrogance because they know how to write obscure and usually unmaintainable code.

Ölbaum 2007-02-19 11:52

I own domain ölbaum.ch. Isn't there an RFC that allows it in an e-mail address? Then most of these regexps would have to be rewritten. Lucky no e-mail client (that I know of) supports IDNs.
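One way to keep an ASCII-only check usable with such domains is to convert the domain part to its punycode ("xn--") form before validating. The sketch below assumes Node.js, whose built-in `url` module provides `domainToASCII`; the wrapper name `toAsciiAddress` and the local part "me" are invented for the example.

```javascript
// Assumes Node.js: domainToASCII comes from the built-in "url" module.
const { domainToASCII } = require("url");

// Hypothetical helper: convert only the domain part of an address to ASCII.
function toAsciiAddress(address) {
  const at = address.lastIndexOf("@");
  if (at <= 0) return null;
  const asciiDomain = domainToASCII(address.slice(at + 1));
  if (asciiDomain === "") return null; // domainToASCII returns "" for an invalid domain
  return address.slice(0, at + 1) + asciiDomain;
}

// Prints the address with an "xn--" style domain.
console.log(toAsciiAddress("me@\u00f6lbaum.ch"));
```

The converted form contains only letters, digits, hyphens and dots, so the ASCII regexes in this thread can be applied to it unchanged.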

imMute 2007-02-19 11:56

Bill:

morry:

The Regexes are utterly unreadable and therefore unmaintainable. I'd hate to have to fix one of those monsters.

I read a quote somewhere that described regex as a "write once, read never" syntax. It's the poster child for the difference between "Clever" and "Wise".

That regex was not written by a human, it was probably compiled using Parse::RecDescent or some other module

Iff 2007-02-19 12:29

Just a little note to add: I literally go maniacal when a webpage refuses my e-mail address that starts with "i@". One letter local part still makes an e-mail, for the love of Pete.

Regards.

Chris 2007-02-19 12:29

When I've had to use fairly hairy regexes, they are always iteratively designed. I start simple, and add to it until it does what I need.

The trick is I have each iteration as a comment beforehand.

This lets me see what I've done, documents the limits of the regex, and lets me dig in and make changes without having to completely start over.

Sure, that chunk of the code will have a large amount of comments relative to other places... but when you've got a tough chunk of code, isn't that a good thing?

Bill 2007-02-19 12:34

imMute:

Bill:

morry:

The Regexes are utterly unreadable and therefore unmaintainable. I'd hate to have to fix one of those monsters.

I read a quote somewhere that described regex as a "write once, read never" syntax. It's the poster child for the difference between "Clever" and "Wise".

That regex was not written by a human, it was probably compiled using Parse::RecDescent or some other module

Possibly, but matters not. The fact remains that it's unmaintainable as-is. Just because the metadata that "Documents" it might be maintained elsewhere, such as a tool, doesn't mitigate the fact that no one reading the source can be sure of what it does. Also, if the tool were worth a damn, it would also give you comments to embed along with the regex.

Hopefully this WAS simply the output of a builder class, where the method calls used to build it provide adequate documentation. But based on the OP, I doubt it.

Captcha: tacos - with that suggestion, I'm off to lunch

JC 2007-02-19 12:35

I don't see the point of validating email addresses at all... even if it vaguely resembles what an email looks like, there is no guarantee that it is correct.
It will soon enough be validated when you try and send an email to it and at least you won't be alienating those users with non-standard-but-legal addresses.

(And even the argument that it is helping the user by telling them they've typoed... they are probably just as likely to typo the text part of the address as the @ and . parts!)

Josh 2007-02-19 12:36

The real WTF is your comments section not word wrapping. :P

Maybe you should consider a HTML class?

Kalle 2007-02-19 12:38

The easiest and most likely to succeed way to validate an address is to establish an SMTP session to the primary MX of the domain and do an RCPT. If the address is invalid, either you cannot establish a connection or the SMTP server returns an error. Easy :)

[And yes, I do know that the Internet mail doesn't work like that any more, more is the pity.]

... The purpose of email validation is just to check for common errors; it makes no sense to try to validate perfectly, because it won't save you from a valid but nonexistent address (you just have to send the mail there and wait for the response).

Mol is right, validation should eliminate the basic errors in entering an email. If the user does not reply with a confirmation, too bad.

Zygo 2007-02-19 13:26

About the only thing I ever bothered to check for in email validating regexps is that the email address won't trigger some weird kind of non-mailbox delivery on the local host (e.g. "/dev/sda@localhost" or "|/bin/sh@[127.0.0.1]"). This was using /usr/sbin/sendmail to submit mail.

Now I open a SMTP connection to a relay server. Modern SMTP servers are already adequately hardened against malicious or merely uncooperative email addresses, so it's a waste of my time to duplicate these features.

I only check for the absence of the following characters:

CR, LF, NUL - explicitly prohibited by RFC

> - end quote character for RCPT command should not appear in an address, half the MTAs on the planet wouldn't know what to do with it.

% ! - abused by spammers on open relays. Get a real email address if you are living behind one of these, you luddite.

/^[^\r\n\0>!%]+@[^\r\n\0>!%]+$/os
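For readers who don't speak Perl, here is a rough JavaScript rendering of that check (the Perl-only /o and /s flags are dropped). The variable name and the sample addresses are invented for the example.

```javascript
// Reject CR, LF, NUL, ">", "!" and "%" on either side of a required "@".
const forbiddenCharCheck = /^[^\r\n\0>!%]+@[^\r\n\0>!%]+$/;

console.log(forbiddenCharCheck.test("user@example.com"));               // true
console.log(forbiddenCharCheck.test("user%example.com@relay.example")); // false: "%" routing hack
console.log(forbiddenCharCheck.test("host!user@example.com"));          // false: "!" bang path
console.log(forbiddenCharCheck.test("a@b\r\nDATA"));                    // false: embedded CR/LF
// The check is deliberately loose; the hardening is left to the SMTP server:
console.log(forbiddenCharCheck.test("|/bin/sh@[127.0.0.1]"));           // true
```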

Many MTAs don't support anything like full RFC822 email syntax, and only support "RCPT TO: <" string-of-some-characters ">" with various permitted values for the "string-of-some-characters", and various transformations done on the text if special characters like spaces or parens are used.

RFC822 was designed to allow any damn email address from any damn local email system to be encoded into an RFC822 email address. It's not feasible for me to validate addresses from some legacy email system that still runs on a PDP-10 somewhere, so I don't try.

If I'm required to determine the validity of the email address I'll send a token to the address and require the user to enter the token before I talk to them again. That tests not just the validity of the email address, but the reliability and availability of the whole return path to the requesting user and the user's willingness to cooperate with receiving mail at that address--a much more useful assertion.
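The token round trip can be sketched as below. This assumes Node.js for `crypto.randomBytes`; the function names are invented, the in-memory `Map` stands in for whatever storage a real system would use, and actually sending the mail is left out.

```javascript
// Assumes Node.js: random tokens come from the built-in "crypto" module.
const nodeCrypto = require("crypto");

// Pending confirmations: token -> address (a stand-in for real storage).
const pendingConfirmations = new Map();

// Create a one-time token for an address; the caller mails it to that address.
function issueToken(address) {
  const token = nodeCrypto.randomBytes(16).toString("hex");
  pendingConfirmations.set(token, address);
  return token;
}

// Redeem a token: returns the confirmed address once, then null afterwards.
function confirmToken(token) {
  if (!pendingConfirmations.has(token)) return null;
  const address = pendingConfirmations.get(token);
  pendingConfirmations.delete(token);
  return address;
}

const token = issueToken("user@example.com");
console.log(confirmToken(token)); // user@example.com
console.log(confirmToken(token)); // null: a token only works once
```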

Zygo 2007-02-19 13:36

Kalle:

The easiest and most likely to succeed way to validate an address is to establish an SMTP session to the primary MX of the domain and do an RCPT. If the address is invalid, either you cannot establish a connection or the SMTP server returns an error. Easy :)

[And yes, I do know that the Internet mail doesn't work like that any more, more is the pity.]

For those who haven't tried it, there are four cases:

1. It actually works, RCPT returns OK if the address is valid and an error otherwise.

2. It's totally broken, RCPT returns an error if the address is valid and OK otherwise. People with this kind of mail host don't get much mail. This breed is rare but not extinct.

3. The remote host graylists all SMTP hosts that contact it for the first time, in which case you'll get an RCPT temporary error on the first connection, then a correct answer when you retry between two minutes and 24 hours later.

4. The remote host is not the final destination host but a gateway without access to a database of local addresses for validation, so it says OK to all RCPT commands. Some time later a bounce message will be generated for the invalid ones and sent to the envelope sender address. This kind of host is really damn annoying and I get hundreds of messages from them every day bouncing messages containing Windows viruses because the message had my email address as the sender.
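In code, about all a prober can do with the RCPT reply is sort it by the first digit of the SMTP reply code. The sketch below does only that; the function name and the result labels are invented, and as the four cases above show, none of the labels proves anything about the mailbox itself.

```javascript
// Hypothetical classifier for the three-digit reply to an RCPT command.
function classifyRcptReply(code) {
  if (code >= 200 && code < 300) return "accepted";        // cases 1 and 4 look identical here
  if (code >= 400 && code < 500) return "try-again-later"; // e.g. greylisting (case 3)
  if (code >= 500 && code < 600) return "rejected";        // case 1, or the broken host of case 2
  return "unexpected";
}

console.log(classifyRcptReply(250)); // accepted
console.log(classifyRcptReply(450)); // try-again-later
console.log(classifyRcptReply(550)); // rejected
```

An "accepted" from a gateway (case 4) may still turn into a bounce later, so the result is a hint rather than a verdict.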

facetious 2007-02-19 13:39

mathew:

The only validation of e-mail addresses I do is to check it matches .+@.+\..+

It pains me to see you post this so soon after my .@._ comment. Your check would think that @@@.@ is a valid email address.
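The claim about @@@.@ is easy to check. The snippet below anchors mathew's pattern with ^ and $ (an assumption, since the original post gives only the bare pattern); the other sample inputs are made up.

```javascript
// mathew's pattern, anchored to the whole string for this test.
const looseCheck = /^.+@.+\..+$/;

console.log(looseCheck.test("@@@.@"));            // true: "." matches any character, including "@"
console.log(looseCheck.test("user@example.com")); // true
console.log(looseCheck.test("user@localhost"));   // false: no dot after the "@"
console.log(looseCheck.test("Moose Brains"));     // false: no "@" at all
```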

ahigerd@stratitec.com 2007-02-19 13:52

The utterly insane regex listed earlier in this thread is actually NOT 100% RFC822-compliant. It takes a shortcut by placing an arbitrary restriction on the nesting depth of the comments.

A regular expression actually CANNOT validate an E-mail address according to RFC822. The language described in RFC822 is recursive and cannot be normalized to an iterative description. If you can't normalize it like this (that is, if there's no way to write the language in such a way that you never have to refer to a symbol that hasn't been defined yet and you never have a rule that refers to itself) then it is, technically, impossible to construct a regular expression for it.
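The nesting is what a plain regular expression cannot count, but a few lines of ordinary code can. The sketch below tracks comment depth with a counter; the function name is invented, and it assumes the input contains no quoted strings (inside which parentheses would not be comments).

```javascript
// Hypothetical check that RFC822-style "(comments (may (nest)))" are balanced.
function commentsBalanced(text) {
  let depth = 0;
  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    if (ch === "\\") {
      i++;                             // backslash escapes the next character
    } else if (ch === "(") {
      depth++;
    } else if (ch === ")") {
      if (depth === 0) return false;   // closing a comment that was never opened
      depth--;
    }
  }
  return depth === 0;                  // every opened comment was closed
}

console.log(commentsBalanced("(a (b (c)))")); // true
console.log(commentsBalanced("(a (b)"));      // false
```

The counter can go as deep as the input requires, which is exactly what a regex with a fixed nesting limit gives up.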

That said, there's no value in validating an address against the full force of RFC822, as discussed earlier in this thread; not many MTAs -- and even fewer desktop mail applications -- conform to the full scope of the "requirements" and only implement the most commonly used subset.

CDarklock 2007-02-19 13:56

Bill:

The fact remains that it's unmaintainable as-is.

Regular expressions are maintained by throwing them away and writing new ones. You do this when they don't work. When they do, just leave them alone.

If you don't know whether a given regular expression works, it doesn't.

Meulop 2007-02-19 14:21

From the VMWare converter registration page:

function isValidEmail(str)
{
    return (str.indexOf("@") > 0);
}

The frustrating thing is that I sent them an e-mail a while back to complain about their old e-mail 'validator', and they changed it to something more sensible, but have now regressed to this which is even worse than the original.

That regex was not written by a human, it was probably compiled using Parse::RecDescent or some other module

Possibly, but matters not. The fact remains that it's unmaintainable as-is. Just because the metadata that "Documents" it might be maintained elsewhere, such as a tool, doesn't mitigate the fact that no one reading the source can be sure of what it does. Also, if the tool were worth a damn, it would also give you comments to embed along with the regex.

It matters not? If it's used correctly, it matters the world! Having generated parts of code is fine as long as you don't have to modify the generated parts. It's done all the time... witness Yacc and Lex. Heck, witness a compiler. Do you think that convoluted but terribly efficient assembly code produced by a compiler is bad because the fact that it was compiled matters not?

If all it does is plop out a RegEx and wrap "isvalid(email_address)" around it, then this is a perfectly valid approach.

(Of course, if it goes into a file that needs to be modified, then you can no longer regenerate the file if you make changes/fix bugs/find a better way/etc. without losing those changes. Then it's just as bad as if a human wrote it.)

MMMMM...using exceptions to handle expected program flow. Tasty. Not sure if Java would allow you to do that another way without using exceptions, but that is really bad form.

It won't. There's no "TryParse" equivalent to this. You shouldn't design code to use exceptions to handle expected conditions, but occasionally when there is an impedance mismatch like this with a library routine, it is necessary.
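The usual way to contain that mismatch is to wrap the throwing routine once, so callers get a plain result instead of a try/catch at every call site. The thread is about Java; the sketch below is a JavaScript analogue with an invented `tryParse` helper, and `JSON.parse` merely stands in for the library routine that throws.

```javascript
// Hypothetical "TryParse"-style wrapper around any parser that throws on bad input.
function tryParse(parse, input) {
  try {
    return { ok: true, value: parse(input) };
  } catch (error) {
    return { ok: false, error };
  }
}

const good = tryParse(JSON.parse, '{"a": 1}');
const bad = tryParse(JSON.parse, "not json");
console.log(good.ok, good.value.a); // true 1
console.log(bad.ok);                // false
```

The exception is still used internally, but only in one place.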

thrashaholic:

Now if you'll excuse me I'm going to go hang myself.

Good riddance.

darwin 2007-02-19 15:13

CDarklock:

Janek:

This is how I do it in Java

This is one of the many reasons why I believe Java developers are evil and must be stopped.

Oh, did I say "evil"? I meant "stupid".

Yes, using a standard library routine to validate an email address, as opposed to writing your own unreadable, unmaintainable, and broken regular expression to do it, is both evil and stupid. Idiot.

darwin 2007-02-19 15:19

LizardKing:

Hmm, email address validation is a nasty one. I remember trying to validate by doing lookup on the hostname portion, only to get scuppered by mail servers that don't resolve but are valid. I forget the details as this was many aeons ago, however a more experienced colleague pointed me at some RFCs (and would have probably submitted my code as a WTF if this site had been around).

I would guess that the piece you were missing is the idea of an MX (mail exchanger) record. You can have a domain, such as email-handled-elsewhere.com, and it has an MX record in DNS of mail-handler.com (or several records, with several different hosts), and the mail goes not to the server mentioned in the email address, but to (one of) the one(s) in the MX record(s).
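A small sketch of how the delivery hosts follow from the MX records: sort by preference, lowest first, and fall back to the domain itself when there are no MX records at all. The function name is invented, the record shape ({ priority, exchange }) is an assumption borrowed from what Node's DNS lookup returns, and backup.mail-handler.com is a made-up second host.

```javascript
// Hypothetical helper: which hosts should mail for `domain` be offered to, in order?
function deliveryHosts(domain, mxRecords) {
  if (mxRecords.length === 0) {
    return [domain]; // no MX records: the domain itself is tried as the mail host
  }
  return [...mxRecords]
    .sort((a, b) => a.priority - b.priority) // lower preference value is tried first
    .map((record) => record.exchange);
}

const records = [
  { priority: 20, exchange: "backup.mail-handler.com" },
  { priority: 10, exchange: "mail-handler.com" },
];
console.log(deliveryHosts("email-handled-elsewhere.com", records));
// [ 'mail-handler.com', 'backup.mail-handler.com' ]
console.log(deliveryHosts("example.org", []));
// [ 'example.org' ]
```

So a lookup of the host name in the address alone says little; the MX records decide where the mail actually goes.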

kfx 2007-02-19 15:19

darwin:

Yes, using a standard library routine to validate an email address, as opposed to writing your own unreadable, unmaintainable, and broken regular expression to do it, is both evil and stupid. Idiot.

Using anything coded in Java is evil and stupid. You whine because you can't read machine-generated code and then you load an entire virtual machine to make sure an e-mail address meets the approval of a library you've never read.

MMMMM...using exceptions to handle expected program flow. Tasty. Not sure if Java would allow you to do that another way without using exceptions, but that is really bad form.

It won't. There's no "TryParse" equivalent to this. You shouldn't design code to use exceptions to handle expected conditions, but occasionally when there is an impedance mismatch like this with a library routine, it is necessary.

That's what I feared. However I see code like this used far too many times (when it shouldn't) to simply keep quiet about it. I honestly could not care less about the general WTF-ness of Java's standard libs (because I f-in hate Java), but seeing code like that makes me want to vomit. As it should.

darwin:

thrashaholic:

Now if you'll excuse me I'm going to go hang myself.

Good riddance.

Wow, mature there sir, real mature. Your momma's calling, she said you need to clean up the basement after you're done playing WoW.

savar 2007-02-19 15:27

Tukaro:

Er... I use a much simpler check than most of you do; perhaps it doesn't cover everything, but this is an internal thing, so it doesn't need to.

/^([a-zA-Z0-9_\.\-])+\@(([a-zA-Z0-9\-])+\.)+([a-zA-Z0-9]{2,4})+$/

You don't need to escape '.' inside a character class. But yes, there are many different ways to kinda validate an email address.
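A quick way to convince yourself is to compare the pattern as posted with a version that drops the unnecessary escapes. The sample strings are made up, and agreement on a handful of samples is only a spot check.

```javascript
// Tukaro's pattern as posted, and the same pattern without the redundant escapes.
const escaped = /^([a-zA-Z0-9_\.\-])+\@(([a-zA-Z0-9\-])+\.)+([a-zA-Z0-9]{2,4})+$/;
const unescaped = /^([a-zA-Z0-9_.-])+@(([a-zA-Z0-9-])+\.)+([a-zA-Z0-9]{2,4})+$/;

const samples = [
  "user@example.com",
  "first.last@mail.example.org",
  "a-b_c@ex-ample.net",
  "userXexample.com", // no "@": a dot in the class must not act as a wildcard
  "user@example",
];

// Both patterns should give the same verdict on every sample.
console.log(samples.every((s) => escaped.test(s) === unescaped.test(s))); // true

// Inside a character class "." only matches a literal dot.
console.log(/^[.]$/.test("."), /^[.]$/.test("a")); // true false
```

Outside a character class the dot still needs its backslash, which is why the `\.` between labels stays.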

I wish they'd stop posting stories like this on TDWTF because every time they do, the usual furor erupts with 99% of people not being aware how complex the RFC actually is, or what a monstrous regex it takes to meet it.

Then a wise soul points out that validating the formatting of an email address isn't really doing anything in the first place, and we all end up right where we started.

sjs 2007-02-19 15:29

Regexes are not bad. Just because people don't make use of the /x modifier and comments doesn't make regexes themselves evil. They are incredibly useful, and those that think they add an extra problem to the task at hand just haven't bothered to learn how to wield the power of regexes.

If you find regexes confusing or difficult you need to read Friedl's book. If you still have trouble then you are not a geek and should not be programming.

Not sure if Java would allow you to do that another way without using exceptions, but that is really bad form.

Why?

darwin:

You shouldn't design code to use exceptions to handle expected conditions

Why not?

The code is readable, maintainable, and acceptably efficient. And those are the criteria for determining what is good code, not whether or not it meets some arbitrary idea of "purity" that has no useful rationale or general applicability in the real world.

thrashaholic 2007-02-19 16:02

Iago:

thrashaholic:

Not sure if Java would allow you to do that another way without using exceptions, but that is really bad form.

Why?

darwin:

You shouldn't design code to use exceptions to handle expected conditions

Why not?

The code is readable, maintainable, and acceptably efficient. And those are the criteria for determining what is good code, not whether or not it meets some arbitrary idea of "purity" that has no useful rationale or general applicability in the real world.

Try/Catch is expensive. (Severity of the expense depends on language/compiler/platform/VM/etc..) They should not be used in places where a simple If statement would suffice. That's all the reason I need right there.

The other reasons are mostly philosophical, if not accepted "form". Exceptions (as the name implies) shouldn't be used to control the flow of the normal happenings of your code. If you're doing this, you have a fundamental misunderstanding of exception handling. Exceptions are to be used for EXCEPTIONAL CASES that you can not plan for.

If you actually code the logic for something happening, you probably shouldn't use an exception to control the flow of said logic. "Is something null?", "Is this variable actually instantiated?", etc. are all cases where IF (WHATEVER) is a lot cheaper and vastly more "proper" than using exceptions to branch.

This is a more high level blog post by the author of above linked thesis.

Some of the more accepted Exception "Best Practices" include:

"Don't use exceptions to indicate absence of a resource"
"Don't use exception handling as means of returning information from a method"
"Use exceptions for errors that should not be ignored"

Etc., etc., etc. Indeed, the most important rule for exception handling is: Don't do it.

(People who know WTF they're doing will understand that)

hamstray2007-02-19 16:23

stevekj:

... I think he's using "amp" as a short form for "ampersat", which is indeed a more or less valid reference to "@". The real WTF is that no one besides this particular coder knows what an "ampersat" is. ...

"@" better known as: "a human or elf"

EvanED2007-02-19 16:24

thrashaholic:

The other reasons are mostly philosophical, if not accepted "form". Exceptions (as the name implies) shouldn't be used to control the flow of the normal happenings of your code. If you're doing this, you have a fundamental misunderstanding of exception handling. Exceptions are to be used for EXCEPTIONAL CASES that you can not plan for.

So it's the name? What if the language called them something else? For instance, Common Lisp has something similar it calls conditions. Would it be appropriate to signal a condition?

And what's exceptional? Is a network connection going down exceptional and a good place for an exception? Why not malformed user input? Both will happen from time to time, it's just a difference of degree. If something happens 1 in every 100 requests is it exceptional, or does it need to be 1 in 1000?

"Don't use exceptions to indicate absence of a resource"

Hmm, what about C++'s bad_alloc exception, Java's OutOfMemoryError, or .Net's OutOfMemoryException? Are those aspects of those languages poorly designed? Or should memory be treated differently than other resources?

"Don't use exception handling as means of returning information from a method"

Exceptions ALWAYS return information from a method, specifically "this method could not complete normally." If he means as a substitute for return, yes.

I can understand a dislike for exceptions as a whole, but I don't see how you can approve of some uses and yet remain limited enough to think that malformed addresses is an inappropriate use. (At least absent profiling information that tells you so.)

Laie Techie2007-02-19 16:33

Since the RFC allows comments to be nested indefinitely deep, there can't be a single regex that works on every single valid email address. Even the 6598-byte regex in Appendix B of Mastering Regular Expressions assumes zero spaces and no hash marks, and only allows for doubly nested comments.

thrashaholic2007-02-19 16:54

EvanED:

So it's the name? What if the language called them something else? For instance, Common Lisp has something similar it calls conditions. Would it be appropriate to signal a condition?

And what's exceptional? Is a network connection going down exceptional and a good place for an exception? Why not malformed user input? Both will happen from time to time, it's just a difference of degree. If something happens 1 in every 100 requests is it exceptional, or does it need to be 1 in 1000?

The network going down is a prime case for exceptions. Malformed user input is not. IMO.

EvanED:

Hmm, what about C++'s bad_alloc exception, Java's OutOfMemoryError, or .Net's OutOfMemoryException? Are those aspects of those languages poorly designed? Or should memory be treated differently than other resources?

Of course not. Being out of memory is an exceptional condition. However, a variable being null is not. In C#, would you do :

I can understand a dislike for exceptions as a whole, but I don't see how you can approve of some uses and yet remain limited enough to think that malformed addresses is an inappropriate use. (At least absent profiling information that tells you so.)

If it's something that you can check for with a minimal amount of conditional code, then it's not a good case for exceptions. If it's something that no sane amount of conditional branching could ever solve, then a try...catch block is appropriate.

foxyshadis2007-02-19 16:56

MooseBrains:

Whitespace is allowed in email addresses, as are constructs like:

"Moose Brains !!!" @ (yes, this is my address) spam.la <MooseBrains>

which both would fall over on.

That's not an email address, that's an exercise in wankery.

Why do wonks always bring up the RFC's utter insanity whenever email comes up? It's 2007, not 1987. It's hard to find a specific place to draw the line, but it should have been done for RFC 2822. If at least 20% of MTAs online can't actually transfer your message, it's not an email address for all practical purposes, and in your case it's more like 99%.

The whole RFC is just an exercise in what goes wrong when standards are designed around including everyone and not forcing anyone to change, and practices that are long gone or were never used at all are left in because they sound cool.

BTW, why are you all referencing 822? 2822 obsoleted it.

Bat2007-02-19 17:15

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

Some people, when faced with a regular expression, think
"I know, I'll use Jamie Zawinski as an excuse and cherish my ignorance". Now they've got an infinite number of problems.

I believe because all TLDs consist of 2, 3 or 4 characters and so "se" would be considered a TLD. Is it? I dunno. Even if it was one you can't just email that directly, it would be like emailing "me@com"

EvanED2007-02-19 17:46

thrashaholic:

EvanED:

And what's exceptional? Is a network connection going down exceptional and a good place for an exception? Why not malformed user input? Both will happen from time to time, it's just a difference of degree. If something happens 1 in every 100 requests is it exceptional, or does it need to be 1 in 1000?

The network going down is a prime case for exceptions. Malformed user input is not. IMO.

I think you're stricter than I am about when exceptions are appropriate. ;-)

I don't think I'd explicitly throw an exception on malformed input, but it depends on what the context is, and it's not on its face something that I would consider bad style.

EvanED:

Hmm, what about C++'s bad_alloc exception, Java's OutOfMemoryError, or .Net's OutOfMemoryException? Are those aspects of those languages poorly designed? Or should memory be treated differently than other resources?

Of course not. Being out of memory is an exceptional condition. However, a variable being null is not. In C#, would you do :

[snip]

These are the types of instances I'm speaking of. (And I, unfortunately, see a lot) Having a network connection unavailable, or being OOM are exception cases. Examples like above, are not.

Gotcha. But on the other hand, there are plenty of times when not having a resource available IS a fine time for an exception, so I don't think that "Don't use exceptions to indicate absence of a resource" is good advice as a blanket rule.

I had a friend that met the "owner" of the .fr TLD. Despite some pretty consistent cajoling, he could not convince him to let him receive email as reed@fr.

ctmf2007-02-19 18:22

"It's done all the time... witness Yacc and Lex."

Actually, I think that would be the easiest way to accomplish the goal of validating an RFC-compliant address. Maintenance would be fairly easy with the original source files.

As others have pointed out, the goal may be a stupid one, but that's how I'd initially go about tackling it.

Uriah2007-02-19 18:40

Here's an idea...

get them to type it in, and take their word for it, or if that's too trusting for you, send them a confirmation email...

FIXED!

AssimilatedByBorg2007-02-19 18:47

*Sigh*

I wish I had a nickel for every time someone asked me to write code to validate email addresses, and thought it was simple.

I gently try to explain the incredible formats of addresses that are actually valid, and that eventually, they will really annoy somebody by using their restrictive idea of an email address.

The smart ones understand that.

It's the other kind of people that I don't know what to do with. The kind of people who ask, "why did your validation routine let through an email address with a typo in it?" (Seriously. This has happened.)

The answer is rather simple: Your domain name only has one part to it. As I understand RFC 921, domains with only one part are assumed to be in the .arpa TLD.

Anonymous2007-02-19 19:10

Honestly, comments like "regexes are write once/read never" and "regular expressions are an exercise in arrogance on the part of the programmer" really rankle with me. How hard is it to compile the regex with the '/x' modifier (or with RegexOptions.IgnorePatternWhitespace in .NET) and write your regex like this:

^

# the first part of the email, we let it accept ',' because pointy haired
# boss changed his name to 'wile e. coyote, genius' by deed poll and insisted
# that we set 'wile e. coyote, genius@100%paradigmsynergies.com' up despite telling
# him it could never work. the upshot was that he praised us for our initiative in
# reducing the amount of spam he receives.
[0-9A-z, \$\.]+

# fix by Maintenance Q. Programmer, Esq.:
# the cfo demanded we extend the validation so his email
# john.citizen.the.greatest.cfo.ever@100%paradigmsynergies.com would be accepted.
# thank heavens this was so well documented otherwise I might never have had the
# courage to pick up a manual on regular expressions and instead spent my time on
# message boards pissing and moaning about how hard they are to read.
(\.[0-9A-z]+)*

# for all of those lazy bum programmers out there who are too lazy to bother
# learning regular expression syntax, this doesn't do anything fancy at all. it just
# matches an 'at' symbol.
@

# matches the domain name portion of the email address, although not very well.
# Needs to accept the percent sign otherwise our company domain name won't work.
([0-9A-z%]+\.)*

# nobody yet knows what this bit does. if you work it out, drop me a line at
# jed@100%paradigmsynergies.com
[0-9A-z]+

$

So there you have it. It's full of idiotic comments, thinly veiled insults, general silliness and a cameo appearance by an old friend, but as you can clearly see, Mr. Maintenance Q. Programmer, Esq. didn't have too much trouble working out what was going on and successfully made his change.
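For readers who have not used the extended mode, here is a minimal sketch of the same commenting style in Python, where `re.VERBOSE` plays the role of '/x'. The pattern is a hypothetical, deliberately loose one written for this illustration; it is not the pattern above and it is nowhere near RFC-complete. The names `ADDRESS` and `looks_like_address` are made up.

```python
import re

# Hypothetical, deliberately simple pattern for illustration only.
# re.VERBOSE ignores whitespace outside character classes and treats
# '#' to end-of-line as a comment, much like Perl's /x modifier.
ADDRESS = re.compile(
    r"""
    ^
    [^@\s]+            # local part: anything except '@' and whitespace
    @                  # the 'at' symbol
    (?:[^@\s.]+\.)+    # one or more domain labels, each followed by a dot
    [^@\s.]+           # final label (the TLD), of any length
    $
    """,
    re.VERBOSE,
)


def looks_like_address(text):
    """Return True if text passes the loose, illustrative check above."""
    return ADDRESS.match(text) is not None


print(looks_like_address("user+tag@example.museum"))
print(looks_like_address("bla@.com"))
```

The comments live next to the piece of pattern they describe, which is the whole point being made here; whether the pattern itself is a good idea is a separate argument.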

Craig2007-02-19 19:18

Several people have said that as the standard for email addresses is recursive then there is no way to write a regular expression for it. Given that email addresses have a maximum length, can a regexp be used even though the standard is recursive? For example, there can only be a maximum of 127 full stops in the domain part.
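One way to picture the idea of a bounded depth is to generate the pattern rather than write it by hand. The sketch below is an assumption-laden illustration: it only handles parenthesised comments, ignores backslash escapes and quoted strings, and the helper name `nested_comment_pattern` is made up. Each extra level wraps the previous pattern once more, so the pattern grows with the depth you allow.

```python
import re


def nested_comment_pattern(max_depth):
    """Return regex source matching comments nested up to max_depth levels."""
    if max_depth < 1:
        raise ValueError("max_depth must be at least 1")
    # Depth 1: a comment containing no parentheses at all.
    pattern = r"\([^()]*\)"
    # Each further level allows the previous pattern inside the parentheses.
    for _ in range(max_depth - 1):
        pattern = r"\((?:[^()]|" + pattern + r")*\)"
    return pattern


two_levels = re.compile(nested_comment_pattern(2))
print(two_levels.fullmatch("(a(b)c)") is not None)
print(two_levels.fullmatch("(a(b(c)))") is not None)
```

So a fixed maximum length does make a plain regular expression possible in principle; it just has to spell out every level of nesting it is willing to accept.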

David Henderson2007-02-19 19:38

thrashaholic:

Exceptions (as the name implies) shouldn't be used to control the flow of the normal happenings of your code. If you're doing this, you have a fundamental misunderstanding of exception handling. Exceptions are to be used for EXCEPTIONAL CASES that you can not plan for.

A malformed E-mail address IS an exceptional case, albeit one that you can plan for. When asked to provide their E-mail address, I suspect most users will enter it correctly.

You're checking the E-mail address before you send the message, which at first glance seems a logical approach.

But that means that either (A) sendEmailTo() doesn't check the address itself, in which case it's accepting on faith that its input is valid (not generally a safe programming practice), or (B) sendEmailTo() also calls formattedCorrectly(), in which case the address is getting checked twice, which is redundant; this may very well offset any "cost" of try/catch.

there are no classes 'InternetAddress' and 'AddressException' that I know of in the Java standard libraries.

there is a class 'InetAddress' with two subclasses 'Inet4Address' and 'Inet6Address' (for obvious reasons), but these are only usable for IP addresses, not for the full mail address scheme.

if these should be home-grown utility classes (and you do have control over it), it would be preferable to have a boolean 'isValid()' method in lieu of having to use exception handling for the control flow.
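The two styles being argued over can coexist. As a rough sketch (in Python rather than Java, with made-up names `AddressError`, `parse_address` and `is_valid`, and a deliberately naive check), a parser can raise on bad input while a thin boolean wrapper serves callers who would rather branch with a plain if:

```python
class AddressError(ValueError):
    """Hypothetical exception raised when an address cannot be parsed."""


def parse_address(text):
    """Split text into (local, domain); raise AddressError if that fails.

    The check is intentionally naive: it only requires a non-empty part
    on each side of the last '@'.
    """
    local, sep, domain = text.rpartition("@")
    if not sep or not local or not domain:
        raise AddressError("not a usable address: %r" % (text,))
    return local, domain


def is_valid(text):
    """Boolean wrapper for callers that prefer an if over try/except."""
    try:
        parse_address(text)
    except AddressError:
        return False
    return True


print(is_valid("a@b"))
print(is_valid("nope"))
```

With this shape the address is only checked in one place, and the choice between exception and boolean is left to the caller.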

Glen2007-02-19 20:02

.....you know sometimes the reason it's not a big deal is because it isn't.......Mail servers validate email addresses.....users validate addresses by receiving mail....why don't we build regexes to validate users' first names or middle initials....jesus....try to make sure it conforms > (@) TLD (uk.com, ws, biz, com, net, org.uk) but you can never be 100% so move on, it's really not worth the hassle!

AssimilatedByBorg2007-02-19 20:13

woohoo:

there are no classes 'InternetAddress' and 'AddressException' that I know of in the Java standard libraries.

javax.mail.internet.InternetAddress, found in J2EE libraries.

It's not standard in the sense of, "it's not J2 Standard Edition", but otherwise as close to standard as it gets :)

anon2007-02-19 20:19

woohoo:

I beg your pardon?

there are no classes 'InternetAddress' and 'AddressException' that I know of in the Java standard libraries.

there is a class 'InetAddress' with two subclasses 'Inet4Address' and 'Inet6Address' (for obvious reasons), but these are only usable for IP addresses, not for the full mail address scheme.

if these should be home-grown utility classes (and you do have control over it), it would be preferable to have a boolean 'isValid()' method in lieu of having to use exception handling for the control flow.

Hmm, email address validation is a nasty one. I remember trying to validate by doing lookup on the hostname portion, only to get scuppered by mail servers that don't resolve but are valid. I forget the details as this was many aeons ago, however a more experienced colleague pointed me at some RFC's (and would have probably submitted my code as a WTF if this site had been around).

I don't care about stupid regexes. I don't accept mail from hostnames that don't exist. I also reject mail from hosts that don't use fully-qualified domain names with the helo, where the helo FQDN doesn't resolve, and where the sender domain doesn't exist. If they don't want to ensure their mail can be bounced correctly if need be (much less replied to normally), I don't have to accept it. And putting in those simple rules has reduced our spam by 90% (giving the anti-spam engine a bit of a rest).

iw2007-02-19 20:47

Bat:

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

Some people, when faced with a regular expression, think
"I know, I'll use Jamie Zawinski as an excuse and cherish my ignorance". Now they've got an infinite number of problems.

Every time someone thinks about quoting Jamie Zawinski, their computer should generate a cock-shaped sound wave and plunge it repeatedly through their skulls.

(PS Does anyone else find it funny that the quote about skins was never said by jwz?)

iw2007-02-19 20:49

AssimilatedByBorg:

*Sigh*

I wish I had a nickel for every time someone asked me to write code to validate email addresses, and thought it was simple.

I gently try to explain the incredible formats of addresses that are actually valid, and that eventually, they will really annoy somebody by using their restrictive idea of an email address.

The smart ones understand that.

It's the other kind of people that I don't know what to do with. The kind of people who ask, "why did your validation routine let through an email address with a typo in it?" (Seriously. This has happened.)

Oh, please. What kind of idiot can't write a validation routine that sends an email, then checks the user's email account to see if they got it?

Josh2007-02-19 22:20

Who does quoted local parts? Plenty of people. I parsed most of the incoming email for a major tech company's email support once. While most messages didn't use quoted local parts there was still a significant portion that did.

!Z2007-02-19 23:53

mathew:

The only validation of e-mail addresses I do is to check it matches .+@.+\..+

i.e. there's an @, there are characters on both sides of the @, there's at least one . to the right of the @, there are characters on both sides of the .

Remember, no syntactic check is going to determine whether an e-mail address is actually valid and working. All you're checking for is obvious brokenness like putting in localhost-specific user IDs, or putting their name in the field instead of their e-mail address.

Hooray! Someone gets it!
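For anyone who wants to try mathew's check, here is a small sketch in Python. It assumes the pattern is meant to cover the whole input, hence `fullmatch`; the helper name `passes_minimal_check` is made up.

```python
import re

# mathew's pattern: something, an '@', something, a '.', something.
MINIMAL = re.compile(r".+@.+\..+")


def passes_minimal_check(text):
    """Return True if text has the rough shape something@something.something."""
    return MINIMAL.fullmatch(text) is not None


print(passes_minimal_check("me@example.com"))
print(passes_minimal_check("me@com"))
print(passes_minimal_check("Moose Brains"))
```

It says nothing about whether the mailbox exists, which is the point: it only catches input that is obviously not an address.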

operagost2007-02-19 23:58

I understand the complexities involved in validating email addresses. But why in God's name do websites tell me that my phone number is invalid or doesn't match my zip code? My area code has been around for about 8 years, when Verizon decided to inconvenience everyone forever instead of a smaller number of people for a short time by beginning the practice of overlays.

Gabe2007-02-20 00:46

Asd:

After reading this I had a look at what the jakarta commons validator did. And found a nice mini WTF. The EmailValidator is a singleton despite it not having any state. God I wish they had never invented that pattern.

Most likely somebody initially had a singleton because they were compiling the regexes when the class was loaded. If you're validating lots of email addresses, you can save a load of time by compiling (and possibly optimizing) the regexes.

Unfortunately somebody else probably then decided that they should just "fix" it so that it compiles the regexes whenever they're used, but then either didn't think to undo the singleton aspect, or it was too late because changing the interface would break code.

BruteForce2007-02-20 03:40

As far as emails being recursive...
I read somewhere that all recursive functions can be rewritten as iterative functions, and they pointed to a scientific proof thereof.

Kasper2007-02-20 04:19

Um - the people who might use + in the local part?

You don't need to quote an address containing a + character. And in fact the RFC says an email address should not contain characters requiring quoting. And you should not use quoting unless the address contains chars requiring it. Thus the RFC and a bit of logic implies, quoting should not be used.

And considering how tricky it is to correctly parse an SMTP command with a quoted email address, I'd say that is one of the things I'd not have a problem with rejecting. The original RFC even allowed newline and \0 in an email address as long as they were quoted and escaped.

lanzz2007-02-20 04:31

Anonymous Tart:

Yeah, I use something much simpler even than that in internal code.

!/^$/

why not just /./ ? or even better, avoid regex altogether and use the strlen()-equivalent.

lanzz2007-02-20 04:51

Bill:

imMute:

That regex was not written by a human, it was compiled using probably Parser::RecDescent or some other module

Possibly, but matters not. The fact remains that it's unmaintainable as-is. Just because the metadata that "Documents" it might be maintained elsewhere, such as a tool, doesn't mitigate the fact that no one reading the source can be sure of what it does.

so you disregard all code-generating tools (bison/yacc for example), because the code they produce is usually much harder to maintain than the source definitions they take? java bytecode is unmaintainable compared to the source code that produced it, but there are still java VMs instead of interpreters that would run the plain java-language source code.

regular expressions are truly a mess and are not easy to maintain, but their strength is not in writing a 3kb regular expression that you will never be able to change, but in using much shorter regexes (short in metachars, literal matches usually don't degrade readability). for example, if you need to match an identifier, it is usually easiest to write /^[a-z_][a-z0-9_]*$/i, and this regex is much easier to maintain than the code needed to match without regex. parsing the whole RFC definition of an email address purely in regex is a meaningless exercise in cleverness. it is comparable to the IOCCC, not an argument why regular expressions are bad or why the RFC is insane - the complete syntax for email addresses is so hard to parse in regex mainly because it was not designed to be parsed in regex.

regexes are simply over-used for parses too complex for clean implementation in regex.

yeayeah2007-02-20 04:54

TheD:

PJH:

I can count on the hands of one arm...

For some reason, this was hilarious to me. Maybe I need more coffee?

:D
;P

And I have all the coffee I need in my blood

ChrisH2007-02-20 05:02

Email validation Regexes are like arseholes... everybody's got one but nobody wants to see the next man's.

The RFC should come with its own RegEx.

lanzz2007-02-20 05:11

thrashaholic:

Exceptions are to be used for EXCEPTIONAL CASES that you can not plan for.

why then is there a mechanism to catch exceptions? even more, catch SPECIFIC exceptions? unless you want to catch them because you plan for them?

lanzz2007-02-20 05:15

BruteForce:

As far as emails being recursive...
I read somewhere that all recursive functions can be rewritten as iterative functions, and they pointed to a scientific proof thereof.

could be, but regular expressions are not actually functions.

Asd2007-02-20 05:35

Gabe:

Most likely somebody initially had a singleton because they were compiling the regexes when the class was loaded. If you're validating lots of email addresses, you can save a load of time by compiling (and possibly optimizing) the regexes.

Unfortunately somebody else probably then decided that they should just "fix" it so that it compiles the regexes whenever they're used, but then either didn't think to undo the singleton aspect, or it was too late because changing the interface would break code.

I thought that too, but it is like that as far back as it goes in SVN.

woohoo2007-02-20 06:41

anon:

woohoo:

I beg your pardon?

there are no classes 'InternetAddress' and 'AddressException' that I know of in the Java standard libraries.

there is a class 'InetAddress' with two subclasses 'Inet4Address' and 'Inet6Address' (for obvious reasons), but these are only usable for IP addresses, not for the full mail address scheme.

if these should be home-grown utility classes (and you do have control over it), it would be preferable to have a boolean 'isValid()' method in lieu of having to use exception handling for the control flow.

I believe because all TLDs consist of 2, 3 or 4 characters and so "se" would be considered a TLD. Is it? I dunno. Even if it was one you can't just email that directly, it would be like emailing "me@com"

Except that many country level TLDs do have MX records for themselves, normally for domain administrators, obviously.

As far as emails being recursive...
I read somewhere that all recursive functions can be rewritten as iterative functions, and they pointed to a scientific proof thereof.

Yes, this is true. The reverse is also true.

However, regular expressions can't express everything that iterative functions can produce. For instance, it's not hard to produce an iterative function that will figure out if the parens in an expression are balanced, but it's provably impossible to write a regular expression to do so. (Unless your regular expression engine accepts things that aren't really regular expressions.)
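The iterative function in question is only a few lines. A minimal sketch in Python (the name `parens_balanced` is made up, and only round parentheses are considered):

```python
def parens_balanced(text):
    """Return True if every '(' in text has a matching ')' and vice versa."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                # A ')' appeared before its matching '('.
                return False
    # Anything left open means the text is unbalanced.
    return depth == 0


print(parens_balanced("(a(b)c)"))
print(parens_balanced("(()"))
print(parens_balanced(")("))
```

The counter is the part a true regular expression cannot have: it can grow without bound, whereas a finite automaton only has a fixed number of states.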

Dingbat2007-02-20 11:59

Email addresses are _not_ all defined by RFCs. There's a world beyond RFC 2822 you know, and if you have to deal with European government legacies, then you might well encounter it.

This isn't to say that such an address is _reachable_ from teh intawebs, and almost certainly not reachable in that format. They still exist though, and they may still be a personal identifier.

VARCHAR32007-02-20 13:33

skington:

In the same way that ages ago people used to shorten variables and eliminate white space to fit more code into 32K, or however much RAM their machine had at the time. It doesn't mean you develop that way.

Uh, eliminate white space and shorten variables to fit into 32K? Please (oh please), you must be talking about interpreted languages like BASIC.

Thank you. That is possibly the funniest geeks-only reference I have ever read.

Cheers

Toby2007-02-20 13:38

I don't work for them but I do think their software is amazing. For all of you giving out about the use of Regular Expressions and the fact that they are impossible to figure out, take a look at RegExBuddy (http://www.regexbuddy.com/) and you will never look back. It helps you to create Regular Expressions for all of us who think it is a black art. I used to but no longer.

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
Jamie Zawinski

Some people, when confronted with regular expressions, like to quote jwz. Now they are fools who cannot cope with regular expressions.

I wish I could mod your comment up.

Dave C.2007-02-20 15:04

The plus-sign? One-letter names? Please. I'd be happy if web sites would just accept the damn .name TLD. It's been in use for three or four years now. Dammit!

Franz Kafka2007-02-20 16:11

Dingbat:

Email addresses are _not_ all defined by RFCs. There's a world beyond RFC 2822 you know, and if you have to deal with European government legacies, then you might well encounter it.

This isn't to say that such an address is _reachable_ from teh intawebs, and almost certainly not reachable in that format. They still exist though, and they may still be a personal identifier.

Sure they are, at least in the context of the internet. If you want to hook up your legacy network to the internet, it's your problem translating the address, not the internet's.

real_aardvark2007-02-20 18:24

Toby:

I don't work for them but I do think their software is amazing. For all of you giving out about the use of Regular Expressions and the fact that they are impossible to figure out, take a look at RegExBuddy (http://www.regexbuddy.com/) and you will never look back. It helps you to create Regular Expressions for all of us who think it is a black art. I used to but no longer.

Toby: Retreat from the Dark Side.

Just because you no longer think that it is a black art any more, that doesn't make the statement any less true.

Any damn moron can write a regex. Any damn moron armed with RegexBuddy can write one that works, as of yesterday. (What with RegexBuddy being fed yesterday's information.) Today, maybe. Tomorrow, the World! Except not.

Regexes, whilst they have their place, are inherently fragile. They depend upon using strings as your basic data structure, which unless you're a VB programmer, a Java programmer, or an utter moron, is unlikely to be your ideal choice of representation. Other than an ancient IBM mini from somewhere back in the '60s, the name of which I forget, computers do not think in terms of strings.

So, by all means, use regexes for simple tasks. Validating an email address is not a simple task. Nor is it, in fact, very useful. May I quote a wise man from further up this thread:

Zygo:

Kalle:

The easiest and most likely to succeed way to validate an address is to establish an SMTP session to the primary MX of the domain and do an RCPT. If the address is invalid, either you cannot establish a connection or the SMTP server returns an error. Easy :)

[And yes, I do know that the Internet mail doesn't work like that any more, more is the pity.]

For those who haven't tried it, there are four cases:

<snip> ... look up the details for yourself. Taught me a thing or two. <snip/>

And then again:

Bat:

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

Some people, when faced with a regular expression, think
"I know, I'll use Jamie Zawinski as an excuse and cherish my ignorance". Now they've got an infinite number of problems.

Then again, there are idiots like "Bat." Y'know, batso, it is in fact possible to understand regular expressions -- even to use them now and again, as appropriate -- without being ignorant. Or, indeed, using anything as an "excuse." I believe the concept is called "design choice." Obviously either Darwin or God made a bad mistake in your case.

And, to:

richardchaven:

sort it topologically to check for circular dependencies!:

How do we know she's a witch?
Let's build a bridge of her.

Thank you. That is possibly the funniest geeks-only reference I have ever read.

there are no classes 'InternetAddress' and 'AddressException' that I know of in the Java standard libraries.

there is a class 'InetAddress' with two subclasses 'Inet4Address' and 'Inet6Address' (for obvious reasons), but these are only usable for IP addresses, not for the full mail address scheme.

if these should be home-grown utility classes (and you do have control over it), it would be preferable to have a boolean 'isValid()' method in lieu of having to use exception handling for the control flow.

No, but there are in the JavaMail API (http://java.sun.com/products/javamail/).

Of course, you could have used Google...

Paul Warren2007-02-21 09:04

Bill:

imMute:

That regex was not written by a human, it was compiled using probably Parser::RecDescent or some other module

Possibly, but matters not. The fact remains that it's unmaintainable as-is. Just because the metadata that "Documents" it might be maintained elsewhere, such as a tool, doesn't mitigate the fact that no one reading the source can be sure of what it does. Also, if the tool were worth a damn, it would also give you comments to imbed along with the regex.

Hopefully this WAS simply the output of a builder class, where the method calls used to build it provide adequate documentation. But based on the OP, I doubt it.

As the author of the regexp on http://www.ex-parrot.com/pdw/Mail-RFC822-Address.html, a few comments:

I did not write it by hand. Further, it does not appear in the code in that form. I wrote it by translating the components of the syntax in RFC822 into regexp components, producing around 25 lines of code that map very directly to the RFC822 EBNF spec. The full regexp is compiled (through the magic of string interpolation) when the module is run.

The form that it appears in the code (go read it, it's not that bad) is perfectly maintainable by anyone with reasonable knowledge of regexps and a copy of the RFC822 EBNF to hand (and if you don't have the latter, you shouldn't be writing email address validators). None of the component regular expression assignments are longer than an 80 char line, and that's including descriptive variable names drawn from the original grammar.

I wrote it because (a) I can't stand incorrect validation - either get it right or don't do it at all and (b) I found that using regular expressions actually works better in Perl than doing it "properly" using Parse::RecDescent.

The only reason I include it on the web page in its full 4k horror is to make people understand that any significantly shorter regexp is unlikely to be complete.

In response to an earlier comment: The reason that it can't cope with comments is because RFC822 allows comments to be arbitrarily nested and there's simply no way to cope with that in a regexp. The Perl module recursively applies a regular expression in order to strip out comments before validating the remainder.
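The general idea of stripping nested comments by applying a regular expression repeatedly can be sketched as follows. This is not the code from the Perl module; it is a simplified Python illustration with made-up names (`INNERMOST_COMMENT`, `strip_comments`) that ignores quoted strings and backslash escapes, both of which a real implementation would have to respect.

```python
import re

# Matches a comment that contains no further parentheses, i.e. an
# innermost comment. Removing these repeatedly peels nesting from the
# inside out.
INNERMOST_COMMENT = re.compile(r"\([^()]*\)")


def strip_comments(text):
    """Remove parenthesised comments, however deeply nested."""
    previous = None
    while previous != text:
        previous = text
        text = INNERMOST_COMMENT.sub("", text)
    return text


print(strip_comments("user(a (nested) comment)@example.com"))
```

Each pass is an ordinary regular expression match; the loop around it supplies the unbounded nesting that a single pattern cannot.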

If you're interested in proving the shortcomings of some of the shorter regexps, the test script in that module contains a decent set of weird addresses, and could easily be pointed at a different regexp (credit to the author of the RecDescent validator for most of it).

Paul

MrBester2007-02-21 11:01

And the biggest WTF of all the comments is that they've completely forgotten that this was supposed to be a JavaScript solution...

I believe because all TLDs consist of 2, 3 or 4 characters and so "se" would be considered a TLD. Is it? I dunno. Even if it was one, you can't just email that directly; it would be like emailing "me@com"

I think .museum is a valid TLD.

Andy2007-02-21 16:49

I just use 'something @ something dot something' :

/.*@.*\.*/

...or something to that effect.
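Taken literally, "something @ something dot something" needs each "something" to be non-empty and the dot to be mandatory. As typed, `\.*` allows zero dots and `.*` allows empty parts, so the pattern accepts any string containing an @. A minimal Python sketch of both, with the hypothetical names `AS_TYPED`, `LOOSE_EMAIL` and `looks_like_email`:

```python
import re

# The pattern as posted: matches anything that contains an "@".
AS_TYPED = re.compile(r'.*@.*\.*')

# Non-empty parts and a mandatory dot after the "@".
LOOSE_EMAIL = re.compile(r'.+@.+\..+')


def looks_like_email(text: str) -> bool:
    """True if the whole string has the shape something@something.something."""
    return LOOSE_EMAIL.fullmatch(text) is not None
```

This is still far looser than the RFC grammar, and it rejects dotless domains such as "me@se".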

captcha: 'bathe' - spookily, I just have

Sean2007-02-21 17:43

Thank god my computational linguistics teacher didn't make us convert that to an NFA.

*shiver*

Captcha: gotcha (it rhymes)

volodya2007-02-22 06:14

Have I missed something, or would this JavaScript mess up on bla@.com, thinking that it was a valid address?

Timwi2007-02-22 09:25

I don't suppose you're reading the comments because otherwise you would have fixed your regexp by now -- but I find it truly pathetic how you derive so much amusement out of making fun of less informed people while messing it up royally yourself.

It was already mentioned that the local part of an e-mail address can validly end in a ".". I would like to add that it is also perfectly valid to have consecutive dashes in the domain name (read "Internationalized domain name" on Wikipedia: http://en.wikipedia.org/wiki/Internationalized_domain_name).
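As a sketch of a hostname check that does not trip over consecutive dashes, here is a Python version. It assumes the usual letter-digit-hyphen label rule (labels of 1 to 63 characters that neither start nor end with a hyphen). `LABEL`, `HOSTNAME_RE` and `is_hostname` are names invented for this example.

```python
import re

# One label: starts and ends with a letter or digit; hyphens, including
# consecutive ones, are allowed in between. Maximum length is 63.
LABEL = r'[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?'
HOSTNAME_RE = re.compile(rf'{LABEL}(?:\.{LABEL})*')


def is_hostname(text: str) -> bool:
    """True if every dot-separated label follows the rule above."""
    return HOSTNAME_RE.fullmatch(text) is not None
```

An internationalized name in its encoded form, such as "xn--bcher-kva.example", passes this check precisely because the middle of a label may contain "--".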

By posting your broken regexp you are perpetuating the same annoyance that you are ridiculing.

Another regexp lover...2007-02-22 19:24

It was the end of page two of the comments before I found why anyone would WANT to validate an email address beyond two simple requirements:

1. it's not so hosed it does something bizarre and destructive
2. you can get them to send a confirmation mail.
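Those two requirements could look roughly like this in Python. It is only a sketch: the 254-character limit is an assumed upper bound, actually sending the mail is left out, and `is_sane` and `new_confirmation_token` are hypothetical names.

```python
import secrets


def is_sane(address: str) -> bool:
    """Reject only input that could do damage or cannot possibly be mailed."""
    if not 3 <= len(address) <= 254:   # assumed length bound
        return False
    if any(ord(ch) < 32 or ord(ch) == 127 for ch in address):
        return False                   # control characters, e.g. CR/LF header injection
    local, at, domain = address.rpartition('@')
    return bool(local and at and domain)


def new_confirmation_token() -> str:
    """Unguessable token to embed in the confirmation link."""
    return secrets.token_urlsafe(32)
```

Whether the address really works is then decided by the confirmation mail, not by the pattern.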

What are you all up to that it's so important anyway? I saw this one in a forum & think I might keep it myself, and it validates fine.

Exceptions are to be used for EXCEPTIONAL CASES that you can not plan for.

why then is there a mechanism to catch exceptions? even more, catch SPECIFIC exceptions? unless you want to catch them because you plan for them?

Setting language mechanisms aside, there are two kinds of exceptions. The first is the alternative flow in a use case, i.e. well documented and testable, something you are aware of while coding. The second is plain programming errors: situations that leave your program in a mangled state, something you never anticipated.
In the second case, the only reasonable thing to do is to abort the program (or restart it after some error message); continuing a program with unknown state is never a good idea.

So the question of when to use exceptions in a language depends on the support for them. In C++ it's practically impossible to write exception-safe code, for two reasons:
1) It takes a lot of discipline to write exception-safe code, i.e. code that does not leak resources. Even if you can do it, the next guy maintaining your code will probably make some mistakes.
2) C++ has no composite exceptions; if something goes wrong during a stack unwind on a throw, you can only abort. In other words, an exception may never leave a destructor.

So in C++ exceptions can only be used as an abort trap (the programming error case, giving you a chance to log the error before calling abort).

However, in a managed language such as C# or Python, exceptions are an excellent flow control mechanism for handling the alternative flow of a use case. They often really simplify code, and the performance hit is a non-issue because resolving the problem usually requires user input anyway.
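A small Python sketch of the "alternative flow" kind of exception, where the bad input is expected and handled specifically. `InvalidAddress`, `parse_address` and `first_valid` are made-up names, and the check itself is deliberately trivial.

```python
class InvalidAddress(ValueError):
    """Input cannot be used as an address: an expected, alternative flow."""


def parse_address(text):
    """Split text into (local, domain) or raise InvalidAddress."""
    local, at, domain = text.strip().rpartition('@')
    if not (local and at and domain):
        raise InvalidAddress(f'not an address: {text!r}')
    return local, domain


def first_valid(candidates):
    """Return the first parseable candidate, skipping the bad ones."""
    for text in candidates:
        try:
            return parse_address(text)
        except InvalidAddress:
            continue   # planned-for case: try the next input
    return None
```

Only the specific exception is caught; anything else (a genuine programming error) still propagates and stops the program.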

theultramage2007-02-23 09:30

Several people have said that as the standard for email addresses is recursive then there is no way to write a regular expression for it. Given that email addresses have a maximum length, can a regexp be used even though the standard is recursive? For example, there can only be a maximum of 127 full stops in the domain part.
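For what it's worth, the bounded case can be sketched directly. Once the nesting depth has a fixed maximum, the recursive rule can be unrolled into an ordinary regexp. The Python below does this for parenthesised comments only; it ignores escapes and quoted strings, and `nested_comment_pattern` is a hypothetical helper.

```python
import re


def nested_comment_pattern(max_depth: int) -> str:
    """Regexp source for comments nested at most max_depth levels deep."""
    if max_depth < 1:
        raise ValueError('max_depth must be at least 1')
    plain = r'[^()]'
    pattern = rf'\({plain}*\)'                     # depth 1: no inner parentheses
    for _ in range(max_depth - 1):
        pattern = rf'\((?:{plain}|{pattern})*\)'   # wrap one more level
    return pattern
```

The result is a plain regexp with no recursion in it, but it grows with every extra level and simply fails on anything nested deeper than the chosen limit.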

With a pretty full explanation of what I did and did not include from the RFC 2822.

tomten2007-02-25 07:21

I see a problem in these comments (and on the Internets as well), in that people are using the word "valid" so casually that it becomes void of meaning. Before you test an email address for validity, you need to carefully define "validity". The reason why some people in these comments think "me@se" is valid and others do not could only be that the word "valid" means different things to them.

If you're going to use the string entered as "EMAIL" as the recipient in an outgoing email, then the definition of "valid" should surely be "usable as recipient", should it not?

And here's the important part: notice how "usable as recipient" is only LOOSELY related to "strictly follows RFCs 2/822". Before you attempt to validate the email field, ask yourself: are you absolutely 100% certain about what is "usable as recipient" when you're sending out mail? No? Then why pretend you can "validate"?

For comparison: What about the "NAME" field - would you "validate" that according to some "must contain at least two parts, be capitalized" scheme (maybe there's an RFC, even?), or just allow anything that's "usable as name"?

Or the "PHONE NUMBER"? When Geörge Lucäs enters 1-900-STARWARS, wouldn't it be fun if that validated, since your dial-up marketers actually can use that when placing a callback to see if the knitted mittens were to the customer's liking?

Several people have said that as the standard for email addresses is recursive then there is no way to write a regular expression for it. Given that email addresses have a maximum length, can a regexp be used even though the standard is recursive? For example, there can only be a maximum of 127 full stops in the domain part.

However, the most commonly used "regex" engines are not actually true "regular expression" engines, and therefore /CAN/ match recursive patterns. PCRE has had this for some time, and Perl 5 has had it for even longer using "dynamic patterns"; as of Perl 5.9.5 you have "recursive patterns" as well.

A good rule of thumb: if an engine is documented to do "leftmost-first" matching (the first alternative that succeeds wins, as in Perl), then it isn't a true regular expression engine, and therefore hypothetically /can/ match a recursive pattern. Whereas if it is documented as using a DFA, or an NFA simulating a DFA, or as providing leftmost-longest (longest-token) matching semantics, then it will NOT be able to match recursive patterns.

True regular expressions make doing things like backreferences, capturing, lookaround, etc. much more difficult (or perhaps impossible) than doing so with the backtracking engines commonly found in programming languages, although true regular expressions have /much/ better worst case performance than the kind you will find in Perl, Python, Java, PCRE, etc. (OTOH Perl and friends probably have better best cases.) All of these engines use backtracking NFAs rather than a true DFA or DFA simulation. This is for a good reason: in a programming language you can typically avoid the worst case by careful pattern construction, whereas a true regular expression engine offers far less utility than a backtracking implementation can provide.
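Backreferences are the easiest of these to demonstrate. The set of strings "some a's, a b, then the same number of a's again" is not a regular language, so no DFA can recognise it, yet a backtracking engine handles it with `\1`. A minimal Python sketch, with the hypothetical names `SAME_RUN_TWICE` and `is_mirrored`:

```python
import re

# Group 1 captures a run of a's; \1 demands exactly the same run again.
SAME_RUN_TWICE = re.compile(r'(a+)b\1')


def is_mirrored(text: str) -> bool:
    """True for strings of the form a...a b a...a with equal run lengths."""
    return SAME_RUN_TWICE.fullmatch(text) is not None
```

Checking this requires remembering how long the first run was, which is exactly what a finite automaton cannot do for unbounded lengths.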

TCL has a hybrid engine, and other projects are also doing work in implementing hybrid schemes so as to avoid the worst case performance when possible.

Stéphane Bortzmeyer2007-03-01 03:59

I fully agree and I wrote more or less the same rant (without code) in French :

That regex was not written by a human, it was compiled using probably Parser::RecDescent or some other module

Possibly, but matters not. The fact remains that it's unmaintainable as-is. Just because the metadata that "Documents" it might be maintained elsewhere, such as a tool, doesn't mitigate the fact that no one reading the source can be sure of what it does.

so you disregard all code-generating tools (bison/yacc for example), because the code they produce is usually much harder to maintain than the source definitions they take? java bytecode is unmaintainable compared to the source code that produced it, but there are still java VMs instead of interpreters that would run the plain java-language source code.

regular expressions are truly a mess and are not easy to maintain, but their strength is not in writing a 3kb regular expression that you will never be able to change, but in using much shorter regexes (short in metachars, literal matches usually don't degrade readability). for example, if you need to match an identifier, it is usually easiest to write /^[a-z_][a-z0-9_]*$/i, and this regex is much easier to maintain than the code needed to match without regex. parsing the whole RFC definition of an email address purely in regex is a meaningless exercise in cleverness. it is comparable to the IOCCC, not an argument why regular expressions are bad or why the RFC is insane - the complete syntax for email addresses is so hard to parse in regex mainly because it was not designed to be parsed in regex.
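To make the identifier comparison concrete, here is the same check written both ways in Python. The function names are invented for this sketch. `re.ASCII` is added because, in Python, a case-insensitive `[a-z]` would otherwise also match a few non-ASCII letters.

```python
import re
import string

IDENTIFIER = re.compile(r'[a-z_][a-z0-9_]*', re.IGNORECASE | re.ASCII)


def is_identifier_regex(text: str) -> bool:
    """Regex version: one short pattern."""
    return IDENTIFIER.fullmatch(text) is not None


def is_identifier_manual(text: str) -> bool:
    """Hand-written version of the same rule."""
    first = string.ascii_letters + '_'
    rest = first + string.digits
    if not text or text[0] not in first:
        return False
    return all(ch in rest for ch in text[1:])
```

Both accept and reject the same strings; the regex version just states the rule in one line.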

regexes are simply over-used for parses too complex for clean implementation in regex.

anoopa@aztecsoft.com2007-07-02 06:14
