So a friend happened to show me how odd and specific the general email syntax rules are. For instance, email addresses can have "comments": you can put characters in parentheses and they are simply ignored. So not only is email(this seems extremely redundant)@email.com valid, it is the same address as email@email.com.

Now, most email providers have simpler, easier-to-work-with restrictions (such as allowing only ASCII letters, digits, dots and dashes). But I thought it'd be a fun exercise to follow the exact guidelines as best I could. I won't delineate every specific here, as I have (hopefully) made it all clear in the code itself.

I did heavily consult the font of all knowledge, Wikipedia, for its summary of the rules.

I'm particularly interested in feedback on how robust I made this and how I did the testing and separation of functions. In theory this should be a module people could import and call (though I have no idea when someone would actually want to use it), so I'd like reviews to focus on that. Feedback about better or more efficient methods is, of course, welcome.

"""This module will evaluate whether a string is a valid email or not.
It is based on the criteria laid out in RFC documents, summarised here:
https://en.wikipedia.org/wiki/Email_address#Syntax
Many email providers will restrict these further, but this module is primarily
for testing whether an email is syntactically valid or not.
Calling validate() will run all tests in intelligent order.
Any error found will raise an InvalidEmail error, but this also inherits from
ValueError, so errors can be caught with either of them.
If you're using any other functions, note that some of the tests will return
a modified string for the convenience of how the default tests are structured.
Just calling valid_quotes(string) will work fine, just don't use the assigned
value unless you want the quoted sections removed.
Errors will be raised from the function regardless.
>>> validate("local-part@domain")
>>> validate("example@email.com")
>>> validate("John..Doe@example.com")
Traceback (most recent call last):
...
InvalidEmail: Consecutive periods are not permitted.
>>> validate("John.Doe@example.com")
>>> validate("John~.Doe@example.com")
>>> validate("john.smith(comment)@example.com")
>>> validate("(comment)john.smith@example.com")
>>> validate("(comment)john.smith@example(comment).com")
>>> validate('"abcdefghixyz"@example.com')
>>> validate('abc."defghi".@example.com')
Traceback (most recent call last):
...
InvalidEmail: Local may neither start nor end with a period.
>>> validate('abc."def<>ghi"xyz@example.com')
Traceback (most recent call last):
...
InvalidEmail: Incorrect double quotes formatting.
>>> validate('abc."def<>ghi".xyz@example.com')
>>> validate('jsmith@[192.168.2.1]')
>>> validate('jsmith@[192.168.12.2.1]')
Traceback (most recent call last):
...
InvalidEmail: IPv4 domain must have 4 period separated numbers.
>>> validate('jsmith@[IPv6:2001:db8::1]')
>>> validate('john.smith@(comment)example.com')
"""
import re
from string import ascii_letters, digits
HEX_BASE = 16
MAX_ADDRESS_LEN = 256
MAX_LOCAL_LEN = 64
MAX_DOMAIN_LEN = 253
MAX_DOMAIN_SECTION_LEN = 63
MIN_UTF8_CODE = 128
MAX_UTF8_CODE = 65536
MAX_IPV4_NUM = 256
IPV6_PREFIX = 'IPv6:'
VALID_CHARACTERS = ascii_letters + digits + "!#$%&'*+-/=?^_`{|}~"
EXTENDED_CHARACTERS = VALID_CHARACTERS + r' "(),:;<>@[\]'
DOMAIN_CHARACTERS = ascii_letters + digits + '-.'
# Find parenthesised comments (non-greedy, so nesting is not handled).
COMMENT_PATTERN = re.compile(r'\(.*?\)')
# Find quote enclosed sections, but ignore \" patterns.
QUOTE_PATTERN = re.compile(r'(^(?<!\\)".*?(?<!\\)"$|\.(?<!\\)".*?(?<!\\)"\.)')

class InvalidEmail(ValueError):
    """String is not a valid Email."""

def strip_comments(s):
    """Return s with comments removed.

    Comments in an email address are any characters enclosed in parentheses.
    These are essentially ignored, and do not affect what the address is.

    >>> strip_comments('exam(alammma)ple@e(lectronic)mail.com')
    'example@email.com'"""
    return re.sub(COMMENT_PATTERN, "", s)

def valid_quotes(local):
    r"""Parse a section of the local part that's in double quotation marks.

    There's an extended range of characters permitted inside double quotes.
    Including: "(),:;<>@[\] and space.
    However " and \ must be escaped by a backslash to be valid.

    >>> valid_quotes('"any special characters <>"')
    ''
    >>> valid_quotes('this."is".quoted')
    'this.quoted'
    >>> valid_quotes('this"wrongly"quoted')
    Traceback (most recent call last):
    ...
    InvalidEmail: Incorrect double quotes formatting.
    >>> valid_quotes('still."wrong"')
    Traceback (most recent call last):
    ...
    InvalidEmail: Incorrect double quotes formatting."""
    quotes = re.findall(QUOTE_PATTERN, local)
    if not quotes and '"' in local:
        raise InvalidEmail("Incorrect double quotes formatting.")
    for quote in quotes:
        if any(char not in EXTENDED_CHARACTERS for char in quote.strip('.')):
            raise InvalidEmail("Invalid characters used in quotes.")
        # Remove valid escape sequences (\\ and \"), and see if any
        # unpaired backslashes or unescaped quotes remain
        stripped = quote.replace('\\\\', '').replace('\\"', '').strip('".')
        if '\\' in stripped:
            raise InvalidEmail('\\ must be paired with " or another \\.')
        if '"' in stripped:
            raise InvalidEmail('Unescaped " found.')
        # Test if start and end are both periods
        # If so, one of them should be removed to prevent double quote errors
        if quote.endswith('.'):
            quote = quote[:-1]
        local = local.replace(quote, '')
    return local

def valid_period(local):
    """Raise error for invalid period, return local without any periods.

    Raises InvalidEmail if local starts or ends with a period or
    if local has consecutive periods.

    >>> valid_period('example.email')
    'exampleemail'
    >>> valid_period('.example')
    Traceback (most recent call last):
    ...
    InvalidEmail: Local may neither start nor end with a period."""
    if local.startswith('.') or local.endswith('.'):
        raise InvalidEmail("Local may neither start nor end with a period.")
    if '..' in local:
        raise InvalidEmail("Consecutive periods are not permitted.")
    return local.replace('.', '')

def valid_local_characters(local):
    """Raise error if char isn't in VALID_CHARACTERS or the UTF8 code range"""
    if any(not MIN_UTF8_CODE <= ord(char) <= MAX_UTF8_CODE
           and char not in VALID_CHARACTERS for char in local):
        raise InvalidEmail("Invalid character in local.")

def valid_local(local):
    """Raise error if any syntax rules are broken in the local part."""
    local = valid_quotes(local)
    local = valid_period(local)
    valid_local_characters(local)

def valid_domain_lengths(domain):
    """Raise error if the domain or any section of it is too long.

    >>> valid_domain_lengths('long.' * 52)
    Traceback (most recent call last):
    ...
    InvalidEmail: Domain length must not exceed 253 characters.
    >>> valid_domain_lengths('proper.example.com')"""
    if len(domain.rstrip('.')) > MAX_DOMAIN_LEN:
        raise InvalidEmail("Domain length must not exceed {} characters."
                           .format(MAX_DOMAIN_LEN))
    sections = domain.split('.')
    # Each section must be between 1 and MAX_DOMAIN_SECTION_LEN characters.
    # (A chained "1 > len(section) > MAX" comparison can never be true.)
    if any(not 1 <= len(section) <= MAX_DOMAIN_SECTION_LEN
           for section in sections):
        raise InvalidEmail("Invalid section length between domain periods.")

def valid_ipv4(ip):
    """Raise error if ip doesn't match IPv4 syntax rules.

    IPv4 is in the format xxx.xxx.xxx.xxx
    Where each xxx is a number 0 - 255 (with no leading zeroes).

    >>> valid_ipv4('255.12.1.12')
    >>> valid_ipv4('255.12.1.312')
    Traceback (most recent call last):
    ...
    InvalidEmail: IPv4 domain must be numbers 0-255 and periods only"""
    numbers = ip.split('.')
    if len(numbers) != 4:
        raise InvalidEmail("IPv4 domain must have 4 period separated numbers.")
    try:
        # MAX_IPV4_NUM (256) is treated as an exclusive upper bound.
        # InvalidEmail subclasses ValueError, so the bare raise below is
        # caught and re-raised with the full message.
        if any(not 0 <= int(num) < MAX_IPV4_NUM for num in numbers):
            raise InvalidEmail
    except ValueError:
        raise InvalidEmail("IPv4 domain must be numbers 0-255 and periods only")

def valid_ipv6(ip):
    """Raise error if ip doesn't match IPv6 syntax rules.

    IPv6 is in the format xxxx:xxxx::xxxx::xxxx
    Where each xxxx is a hexcode, though they can be 0-4 characters inclusive.
    Additionally there can be empty spaces, and codes can be omitted entirely
    if they are just 0 (or 0000). To accommodate this, validation just checks
    for valid hex codes, and ensures that lengths never exceed max values.
    But no minimums are enforced.

    >>> valid_ipv6('314::ac5:1:bf23:412')
    >>> valid_ipv6('IPv6:314::ac5:1:bf23:412')
    >>> valid_ipv6('314::ac5:1:bf23:412g')
    Traceback (most recent call last):
    ...
    InvalidEmail: Invalid IPv6 domain: '412g' is invalid hex value.
    >>> valid_ipv6('314::ac5:1:bf23:314::ac5:1:bf23:314::ac5:1:bf23:41241')
    Traceback (most recent call last):
    ...
    InvalidEmail: Invalid IPv6 domain"""
    if ip.startswith(IPV6_PREFIX):
        ip = ip.replace(IPV6_PREFIX, '')
    hex_codes = ip.split(':')
    if len(hex_codes) > 8 or any(len(code) > 4 for code in hex_codes):
        raise InvalidEmail("Invalid IPv6 domain")
    for code in hex_codes:
        try:
            if code:
                int(code, HEX_BASE)
        except ValueError:
            raise InvalidEmail("Invalid IPv6 domain: '{}' is invalid hex value.".format(code))

def valid_domain_characters(domain):
    """Raise error if any invalid characters are used in domain."""
    if any(char not in DOMAIN_CHARACTERS for char in domain):
        raise InvalidEmail("Invalid character in domain.")

def valid_domain(domain):
    """Raise error if domain is neither a valid domain nor IP.

    Domains (sections after the @) can be either a traditional domain or an IP
    wrapped in square brackets. The IP can be IPv4 or IPv6.
    All these possibilities are accounted for."""
    # Check if it's an IP literal
    if domain.startswith('[') and domain.endswith(']'):
        ip = domain[1:-1]
        if '.' in ip:
            valid_ipv4(ip)
        elif ':' in ip:
            valid_ipv6(ip)
        else:
            raise InvalidEmail("IP domain not in either IPv4 or IPv6 format.")
    else:
        valid_domain_lengths(domain)

def validate(address):
    """Raises an error if address is an invalid email string."""
    try:
        local, domain = strip_comments(address).split('@')
    except ValueError:
        raise InvalidEmail("Address must have one '@' only.")
    if len(local) > MAX_LOCAL_LEN:
        raise InvalidEmail("Only {} characters allowed before the @"
                           .format(MAX_LOCAL_LEN))
    if len(domain) > MAX_ADDRESS_LEN:
        raise InvalidEmail("Only {} characters allowed in address"
                           .format(MAX_ADDRESS_LEN))
    valid_local(strip_comments(local))
    valid_domain(strip_comments(domain))

if __name__ == "__main__":
    import doctest
    doctest.testmod()
    raw_input('>DONE<')

Unfortunately, I couldn't get your code to work (I get an IndentationError), but I suspect that it might fail even on some of the more simple examples from RFC3696.
– Jörg W Mittag, Jan 22 '16 at 15:40

Your handling of comments isn't strictly correct; quoted-string can only contain FWS between the quotes, not CFWS, so anything that looks like a comment inside a quoted-string isn't a comment, and shouldn't be removed. Something similar is true for domain-literals inside square brackets. Neither is likely to have much real-world impact, but if you want to be absolutely correct you might want to think about how to handle that.
– hobbs, Jan 22 '16 at 17:34

Well I just tried sending an email to somebody(with_a_comment)@gmail.com and my gmail won't even let me send it. It says "somebody" is invalid. The comment is not even mentioned in the error message.
– Octopus, Jan 22 '16 at 21:58

"I did heavily consult the font of all knowledge, Wikipedia for its summary on the rules." - there's your problem. If implementing something technical, you should always get the official spec - which is RFC 2822 (and the updates to it) for your case.
– Bergi, Jan 23 '16 at 21:40

6 Answers

You probably missed the step of confirming your understanding against the relevant RFCs; a conforming implementation should abide by the rules described there. While Wikipedia is quite reliable nowadays, it is by no means a normative source.

FWS means "folding white space": an optional sequence of whitespace characters followed by a single CRLF, and then a mandatory part consisting of at least one whitespace character. While an address's local part can legally begin and end with a space, the two spaces need to be separated by at least one character forming qcontent.
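As a rough illustration of that definition, the non-obsolete form of the grammar rule (FWS = [*WSP CRLF] 1*WSP) can be written as a regular expression. This is only a sketch: the names FWS_PATTERN and is_fws are made up for the example, and the obsolete obs-FWS alternative is ignored.

```python
import re

# Hypothetical name; covers only the non-obsolete rule
#   FWS = [*WSP CRLF] 1*WSP
# where WSP is a space or a horizontal tab.
FWS_PATTERN = re.compile(r'(?:[ \t]*\r\n)?[ \t]+\Z')


def is_fws(text):
    """Return True if text is exactly one folding white space."""
    return FWS_PATTERN.match(text) is not None


print(is_fws(' '))        # True: the mandatory whitespace on its own
print(is_fws(' \r\n\t'))  # True: optional "WSP CRLF", then whitespace
print(is_fws('\r\n'))     # False: a CRLF must be followed by whitespace
```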

This answer precisely describes why validating valid addresses is a mostly futile exercise. It's far easier to get it wrong than it is to get it right. Back in the day, you could just finger addresses to get an approximation of deliverability, but these days you may as well just send it out and hope for the best.
– phyrfox, Jan 23 '16 at 6:15

The only way to validate an email address is to try and send an email to it. If it fails it doesn't necessarily mean it's an invalid address, but it means that your methods of sending email can't send it there, so its validity isn't that important (whether or not this means you want to continue using that email library is up to you..)
– Dannnno, Jan 24 '16 at 18:37

Personally this is a lot of boilerplate if all you wish to know is if it is valid.
For these cases I would recommend that you make an is_valid function.
This would change the above to:

if is_valid(''):
    # Handle valid email
else:
    # Handle non-valid email

It can help readability in cases where you don't want to know the error.
That probably isn't how you want it to be used, given all the helpful errors,
but it is a way I know I would want to use it.
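A minimal sketch of such an is_valid wrapper, assuming the module's validate and InvalidEmail. The validate below is a stand-in that only checks for a single '@' so that the example runs by itself; it is not the question's implementation.

```python
class InvalidEmail(ValueError):
    """Stand-in for the module's exception."""


def validate(address):
    """Stand-in for the real validate(); it only checks for one '@'."""
    if address.count('@') != 1:
        raise InvalidEmail("Address must have one '@' only.")


def is_valid(address):
    """Return True if validate() accepts address, False otherwise."""
    try:
        validate(address)
    except InvalidEmail:
        return False
    return True


if is_valid('example@email.com'):
    print('valid')
else:
    print('not valid')
```

Callers who want the error message can still call validate directly; is_valid just hides the try/except for those who don't.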

All your functions are public which encourages me to do:

import email
email.valid_quotes('joe@domain')

This should be a private function
that I shouldn't be using,
so you should name it _valid_quotes.
Whilst it can still be used the same way,
it's now 'Python private',
which follows how re.py defines its functions.

And, as @Mathias said, you should add __all__ too.

Other than the above three points,
you have a few PEP8 errors you may not have picked up on.
But they're quite petty:

Surround top-level function and class definitions with two blank lines.

You have too much whitespace around your imports;
two blank lines would be enough (which would still go against PEP8).

You seem to have built your module so that validate is its only "public" function. You may want to enforce that by declaring __all__ = ['validate', 'InvalidEmail']. It will affect the way that pydoc and the help builtin display help on your module (they will show only the module docstring, the exception and the validate function) as well as how from the_ultimate_email_validator import * is handled (letting only validate and InvalidEmail leak into the global namespace).
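The effect on a star import can be seen with a throwaway module built in memory. This is only a demonstration: the module name demo_validator and the stub bodies are made up, and only the __all__ line reflects the suggestion above.

```python
import sys
import types

# Source of a hypothetical module with one "private" helper that is
# deliberately left out of __all__.
source = '''
__all__ = ['validate', 'InvalidEmail']

class InvalidEmail(ValueError):
    pass

def valid_quotes(local):
    return local

def validate(address):
    return None
'''

# Create the module object and register it so it can be imported by name.
demo = types.ModuleType('demo_validator')
exec(source, demo.__dict__)
sys.modules['demo_validator'] = demo

# Run a star import inside a fresh namespace and list what came through.
namespace = {}
exec('from demo_validator import *', namespace)
imported = sorted(name for name in namespace if not name.startswith('__'))
print(imported)  # ['InvalidEmail', 'validate']
```

valid_quotes is still reachable as demo_validator.valid_quotes; __all__ only controls what the star import and the generated help expose.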

Other than that, the intended use case of validate closely resembles int or related builtins. As such, it could be useful to rename it to a less passive action (say email) and call it like:

valid_address = email(user_input)

The returned value could be stripped of comments and any parsing issue would raise InvalidEmail the same way guess = int(raw_input()) would raise ValueError. The caller would still be responsible for handling invalid addresses using a try .. except, as in your current version.

Speaking of that return value, I guess it would be something along the lines of

return '{}@{}'.format(local, domain)

at the end of validate because comments are already stripped at the first line of the function. But then, why do you call valid_local(strip_comments(local)) and valid_domain(strip_comments(domain)) instead of valid_local(local) and valid_domain(domain)? There doesn't seem to be any case where comments could be left in either local or domain after stripping the entire address.
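A sketch of that int-like interface, heavily simplified: only the '@' split is checked, comments are stripped with the question's non-nested regex, and the function name email simply follows the suggestion above. The InvalidEmail class here is a stand-in for the module's exception.

```python
import re

COMMENT_PATTERN = re.compile(r'\(.*?\)')


class InvalidEmail(ValueError):
    """Stand-in for the module's exception."""


def email(address):
    """Return address with comments stripped, or raise InvalidEmail.

    Only the '@' split is checked in this sketch; the real local and
    domain checks would run before the return.
    """
    try:
        local, domain = COMMENT_PATTERN.sub('', address).split('@')
    except ValueError:
        raise InvalidEmail("Address must have one '@' only.")
    return '{}@{}'.format(local, domain)


try:
    valid_address = email('john.smith(comment)@example.com')
    print(valid_address)  # john.smith@example.com
except InvalidEmail as error:
    print('Rejected: {}'.format(error))
```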

Very good tip with __all__, I'd previously just used _ but it felt more unwieldy to do here, this is a great solution! Also you're right about the redundant duplicate of strip_comments, I had previously arranged it so that it wouldn't be called so early and didn't update to match the change.
– SuperBiasedMan, Jan 22 '16 at 14:17

Removing nested comments with regular expressions is hard; indeed, from a purist view of regular expressions, it's impossible. Basically, to do this you need a recursive regex, which does not seem to be supported by Python's re engine.

In this particular case, it shouldn't be too hard for you to roll your own comment remover: all you really need to do is iterate over the string, keeping track of the number of brackets currently open. That would be a reasonable goal for a next iteration.
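A minimal sketch of that bracket-counting approach. The function name strip_nested_comments is made up, and quoted strings and backslash escapes are deliberately ignored here.

```python
def strip_nested_comments(address):
    """Return address with parenthesised comments removed, nesting included.

    Quoted strings and backslash escapes are not handled in this sketch.
    """
    depth = 0  # number of parentheses currently open
    kept = []
    for char in address:
        if char == '(':
            depth += 1
        elif char == ')':
            if depth == 0:
                raise ValueError("Unbalanced ')' in address.")
            depth -= 1
        elif depth == 0:
            # Only characters outside every comment are kept.
            kept.append(char)
    if depth:
        raise ValueError("Unclosed '(' in address.")
    return ''.join(kept)


print(strip_nested_comments('john(outer(inner)outer)@example.com'))
# john@example.com
```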

You may find, though, that this gets unmanageable quite quickly when you try and account for quoted strings and escaped characters - how would you write your parser such that it parses foo"\")"("")@example.com down to foo")@example.com? If you really want to hit as many pathological edge cases as possible, I'd suggest learning about formal languages and parsers, then digging out a parser library for Python to help you build your own. The Python Wiki lists several, and this one in particular looks pretty nice, though I haven't tried to use it myself.
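To give a feel for how the state grows, here is a hand-rolled sketch that tracks quoted strings and backslash escapes as well as nesting depth. It is an assumption-laden illustration rather than a full RFC parser: the function name is made up, and it keeps quoted strings exactly as written instead of unquoting them, so the result for the example above still contains the quotes.

```python
def strip_comments_outside_quotes(address):
    """Remove (possibly nested) comments, leaving quoted strings untouched."""
    kept = []
    depth = 0          # open parentheses, only counted outside quotes
    in_quotes = False  # currently inside a double-quoted string?
    chars = iter(address)
    for char in chars:
        if char == '\\' and (in_quotes or depth):
            # A backslash escapes the next character in quotes and comments.
            escaped = next(chars, '')
            if not depth:
                kept.append(char + escaped)
        elif in_quotes:
            # An unescaped double quote closes the quoted string.
            in_quotes = char != '"'
            kept.append(char)
        elif char == '(':
            depth += 1
        elif char == ')':
            if depth == 0:
                raise ValueError("Unbalanced ')' in address.")
            depth -= 1
        elif depth == 0:
            # Outside comments: a double quote opens a quoted string.
            in_quotes = char == '"'
            kept.append(char)
    if depth or in_quotes:
        raise ValueError('Unclosed comment or quoted string.')
    return ''.join(kept)


print(strip_comments_outside_quotes(r'foo"\")"("")@example.com'))
# foo"\")"@example.com
```

Even this small version needs three pieces of state, which is the point at which a proper grammar and a parser library start to pay off.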