Email validation

10

Write a function or program to validate an e-mail address against RFC 5321 (some grammar rules found in 5322) with the relaxation that you can ignore comments and folding whitespace (CFWS) and generalised address literals. This gives the grammar

Mailbox              = Local-part "@" ( Domain / address-literal )

Local-part           = Dot-string / Quoted-string
Dot-string           = Atom *("."  Atom)
Atom                 = 1*atext
atext                = ALPHA / DIGIT /    ; Printable US-ASCII
                       "!" / "#" /        ;  characters not including
                       "$" / "%" /        ;  specials.  Used for atoms.
                       "&" / "'" /
                       "*" / "+" /
                       "-" / "/" /
                       "=" / "?" /
                       "^" / "_" /
                       "`" / "{" /
                       "|" / "}" /
                       "~"
Quoted-string        = DQUOTE *QcontentSMTP DQUOTE
QcontentSMTP         = qtextSMTP / quoted-pairSMTP
qtextSMTP            = %d32-33 / %d35-91 / %d93-126
quoted-pairSMTP      = %d92 %d32-126

Domain               = sub-domain *("." sub-domain)
sub-domain           = Let-dig [Ldh-str]
Let-dig              = ALPHA / DIGIT
Ldh-str              = *( ALPHA / DIGIT / "-" ) Let-dig

address-literal      = "[" ( IPv4-address-literal / IPv6-address-literal ) "]"
IPv4-address-literal = Snum 3("."  Snum)
IPv6-address-literal = "IPv6:" IPv6-addr
Snum                 = 1*3DIGIT
                       ; representing a decimal integer value in the range 0 through 255

Note: I've skipped the definition of IPv6-addr because this particular RFC gets it wrong and disallows e.g. ::1. The correct spec is in RFC 2373.

Restrictions

You may not use any existing e-mail validation library calls. However, you may use existing network libraries to check IP addresses.

If you write a function/method/operator/equivalent it should take a string and return a boolean or truthy/falsy value, as appropriate for your language. If you write a program it should take a single line from stdin and indicate valid or invalid via the exit code.

Test cases

The following test cases are listed in blocks for compactness. The first block are cases which should pass:

email@domain.com
e@domain.com
firstname.lastname@domain.com
email@subdomain.domain.com
firstname+lastname@domain.com
email@123.123.123.123
email@[123.123.123.123]
"email"@domain.com
1234567890@domain.com
email@domain-one.com
_______@domain.com
email@domain.name
email@domain.co.jp
firstname-lastname@domain.com
""@domain.com
"e"@domain.com
"\@"@domain.com
email@domain
"Abc\@def"@example.com
"Fred Bloggs"@example.com
"Joe\\Blow"@example.com
"Abc@def"@example.com
customer/department=shipping@example.com
$A12345@example.com
!def!xyz%abc@example.com
_somename@example.com
_somename@[IPv6:::1]
fred+bloggs@abc.museum
email@d.com
?????@domain.com

The following test cases should not pass:

plainaddress
#@%^%#$@#$@#.com
@domain.com
Joe Smith <email@domain.com>
email.domain.com
email@domain@domain.com
.email@domain.com
email.@domain.com
email.email.@domain.com
email..email@domain.com
email@domain.com (Joe Smith)
email@-domain.com
email@domain..com
email@[IPv6:127.0.0.1]
email@[127.0.0]
email@[.127.0.0.1]
email@[127.0.0.1.]
email@IPv6:::1]
_somename@domain.com]
email@[256.123.123.123]

Peter Taylor

Posted 2013-02-22T21:47:33.797

Reputation: 41 901

since IPv6-addr has been left undefined, and there are test cases that have ipv6 addresses, is there a correct way to validate them? – ardnew – 2013-02-23T04:28:54.883

Why should email@d.com and ?????@domain.com fail? – grc – 2013-02-23T08:04:58.807

1@ardnew, I've added a link to the relevant RFC. I don't want to inline it because the question is already quite long. – Peter Taylor – 2013-02-23T09:16:01.967

@grc, good question. I've checked them, because no-one raised this during the several months that the question was in the sandbox, but I can't see why they should fail so I've moved them to the "Pass" side.

– Peter Taylor – 2013-02-23T09:19:17.860

Are length limits required as well? 254 for entire email address/64 for local-part/63 for each domain label? – MichaelRushton – 2013-03-02T22:30:45.170

Answers

2

Python 3.3, 261

import re,ipaddress
try:v,p=re.match(r'^(?!\.)(((^|\.)[\w!#-\'*+\-/=?^-~]+)+|"([ !#-[\]-~]|\\[ -~])*")@(((?!-)[a-zA-Z\d-]+(?<!-)($|\.))+|\[(IPv6:)?(.*)\])(?<!\.)$',input()).groups()[7:];exec("if p:ipaddress.IPv%dAddress(p)"%(v and 6or 4))
except:v=5
print(v!=5)

Python 3.3 is needed for the ipaddress module, which is used to validate IPv4 and IPv6 addresses.

Less golfed version:

import re, ipaddress

dot_string = r'(?!\.)((^|\.)[\w!#-\'*+\-/=?^-~]+)+'
    # negative lookahead to check that string doesn't start with .
    # each atom must start with a . or the beginning of the string

quoted_string = r'"([ !#-[\]-~]|\\[ -~])*"'
    # - is used for character ranges (also in dot_string)

domain = r'((?!-)[a-zA-Z\d-]+(?<!-)($|\.))+(?<!\.)'
    # negative lookahead/lookbehind to check each subdomain doesn't start/end with -
    # each domain must end with a . or the end of the string
    # negative lookbehind to check that string doesn't end with .

address_literal = r'\[(IPv6:)?(.*)\]'
    # captures the is_IPv6 and ip_address groups

final_regex = r'^(%s|%s)@(%s|%s)$' % (dot_string, quoted_string, domain, address_literal)

try:
    is_IPv6, ip_address = re.match(final_regex, input(), re.VERBOSE).groups()[7:]
        # if input doesn't match, calling .groups() will throw an exception

    if ip_address:
        exec("ipaddress.IPv%dAddress(ip_address)" % (6 if is_IPv6 else 4))
            # IPv4Address or IPv6Address will throw an exception if ip_address isn't valid
except:
    is_IPv6 = 5

print(is_IPv6 != 5)
    # is_IPv6 is used as a flag to tell whether an exception was thrown

grc

Posted 2013-02-22T21:47:33.797

Reputation: 18 565

very nice. i can't immediately find any duplicate patterns (to replace with a shorter variable identifier). but it looks like ALPHA in augmented BNF and the char literals constructing a Quoted-string are all case-insensitive. can you shave a few chars by specifying case-insensitivity and ditching one of those char class ranges? btw, if you're feeling frisky, can you give a short description of how you developed this? – ardnew – 2013-02-24T06:55:08.013

@ardnew: Thanks. I've added a less golfed version with a few comments trying to explain some of the trickier parts. I developed the regex in four individual pieces (dot-string, quoted-string, domain and address-literal), then merged them together and added the ip validation. Needless to say, golfing it got really messy. – grc – 2013-02-24T13:06:25.470

No length limits? – MichaelRushton – 2013-03-02T22:37:48.697

2

PHP 5.4.9, 495

function _($e){return preg_match('/^(?!(?>"?(?>\\\[ -~]|[^"])"?){255,})(?!"?(?>\\\[ -~]|[^"]){65,}"?@)(?>([!#-\'*+\/-9=?^-~-]+)(?>\.(?1))*|"(?>[ !#-\[\]-~]|\\\[ -~])*")@(?!.*[^.]{64,})(?>([a-z0-9](?>[a-z0-9-]*[a-z0-9])?)(?>\.(?2)){0,126}|\[(?:(?>IPv6:(?>([a-f0-9]{1,4})(?>:(?3)){7}|(?!(?:.*[a-f0-9][:\]]){8,})((?3)(?>:(?3)){0,6})?::(?4)?))|(?>(?>IPv6:(?>(?3)(?>:(?3)){5}:|(?!(?:.*[a-f0-9]:){6,})(?5)?::(?>((?3)(?>:(?3)){0,4}):)?))?(25[0-5]|2[0-4]\d|1\d{2}|[1-9]?\d)(?>\.(?6)){3}))\])$/iD', $e);}

And just for further interest, here's one for RFC 5322 grammar which allows for nested CFWS and obsolete local-parts:

(764)

function _($e){return preg_match('/^(?!(?>(?1)"?(?>\\\[ -~]|[^"])"?(?1)){255,})(?!(?>(?1)"?(?>\\\[ -~]|[^"])"?(?1)){65,}@)((?>(?>(?>((?>(?>(?>\x0D\x0A)?[\t ])+|(?>[\t ]*\x0D\x0A)?[\t ]+)?)(\((?>(?2)(?>[\x01-\x08\x0B\x0C\x0E-\'*-\[\]-\x7F]|\\\[\x00-\x7F]|(?3)))*(?2)\)))+(?2))|(?2))?)([!#-\'*+\/-9=?^-~-]+|"(?>(?2)(?>[\x01-\x08\x0B\x0C\x0E-!#-\[\]-\x7F]|\\\[\x00-\x7F]))*(?2)")(?>(?1)\.(?1)(?4))*(?1)@(?!(?1)[a-z\d-]{64,})(?1)(?>([a-z\d](?>[a-z\d-]*[a-z\d])?)(?>(?1)\.(?!(?1)[a-z\d-]{64,})(?1)(?5)){0,126}|\[(?:(?>IPv6:(?>([a-f\d]{1,4})(?>:(?6)){7}|(?!(?:.*[a-f\d][:\]]){8,})((?6)(?>:(?6)){0,6})?::(?7)?))|(?>(?>IPv6:(?>(?6)(?>:(?6)){5}:|(?!(?:.*[a-f\d]:){6,})(?8)?::(?>((?6)(?>:(?6)){0,4}):)?))?(25[0-5]|2[0-4]\d|1\d{2}|[1-9]?\d)(?>\.(?9)){3}))\])(?1)$/isD', $e);}

And if length-limits are not a requirement:

RFC 5321 (414)

function _($e){return preg_match('/^(?>([!#-\'*+\/-9=?^-~-]+)(?>\.(?1))*|"(?>[ !#-\[\]-~]|\\\[ -~])*")@(?>([a-z0-9](?>[a-z0-9-]*[a-z0-9])?)(?>\.(?2)){0,126}|\[(?:(?>IPv6:(?>([a-f0-9]{1,4})(?>:(?3)){7}|(?!(?:.*[a-f0-9][:\]]){8,})((?3)(?>:(?3)){0,6})?::(?4)?))|(?>(?>IPv6:(?>(?3)(?>:(?3)){5}:|(?!(?:.*[a-f0-9]:){6,})(?5)?::(?>((?3)(?>:(?3)){0,4}):)?))?(25[0-5]|2[0-4]\d|1\d{2}|[1-9]?\d)(?>\.(?6)){3}))\])$/iD', $e);}

RFC 5322 (636)

function _($e){return preg_match('/^((?>(?>(?>((?>(?>(?>\x0D\x0A)?[\t ])+|(?>[\t ]*\x0D\x0A)?[\t ]+)?)(\((?>(?2)(?>[\x01-\x08\x0B\x0C\x0E-\'*-\[\]-\x7F]|\\\[\x00-\x7F]|(?3)))*(?2)\)))+(?2))|(?2))?)([!#-\'*+\/-9=?^-~-]+|"(?>(?2)(?>[\x01-\x08\x0B\x0C\x0E-!#-\[\]-\x7F]|\\\[\x00-\x7F]))*(?2)")(?>(?1)\.(?1)(?4))*(?1)@(?1)(?>([a-z\d](?>[a-z\d-]*[a-z\d])?)(?>(?1)\.(?1)(?5)){0,126}|\[(?:(?>IPv6:(?>([a-f\d]{1,4})(?>:(?6)){7}|(?!(?:.*[a-f\d][:\]]){8,})((?6)(?>:(?6)){0,6})?::(?7)?))|(?>(?>IPv6:(?>(?6)(?>:(?6)){5}:|(?!(?:.*[a-f\d]:){6,})(?8)?::(?>((?6)(?>:(?6)){0,4}):)?))?(25[0-5]|2[0-4]\d|1\d{2}|[1-9]?\d)(?>\.(?9)){3}))\])(?1)$/isD', $e);}

MichaelRushton

Posted 2013-02-22T21:47:33.797

Reputation: 244