Safe email validation

Question

I have been using this RFC822-compliant regular expression for email validation. Pen testers on HackerOne have used the following horrendous email addresses which satisfy the regex:

'/**/OR/**/1=1/**/--/**/@a.a
a@a.a&a=////etc/passwd
a@a.com&&a=a
%00%2a@a.a

Are those email addresses valid? How can I do safe email validation?

[Here you go](http://stackoverflow.com/a/1903368/2191572), have fun bleeding from your iris. — MonkeyZeus, Mar 01 '16 at 15:43
@Zymus usability. alphanum + `_-.` is way too restrictive. At the very least you need to allow `+`, because quite a lot of people use that (eg for gmail). But even if you include that, you will exclude users as they will not be able to use their completely valid email addresses. If someone has so little trust in the security of their application that they do think that strict filtering is necessary, I would restrict to something like alphanum + `! # % & * + - = ? ^ _ . | ~`. It takes out most characters used in common attacks such as `< > ' " `` / $ { }`, but still allows most valid addresses — tim, Mar 01 '16 at 21:06
@Zymus gmail allows to use the `+` symbol with any email address (which is also why they likely don't allow it when signing up). So if your email address is `foobar@example.com`, you could use `foobar+spam@example.com` and `foobar+friends@example.com` and thus organize your emails. Other providers may provide similar functionality with different characters which is one of the reasons why limiting valid characters may not be a good idea. — tim, Mar 01 '16 at 22:04
I should run a mail server at some point just so I can have "@ @"@ to confuse non-technical people. — user253751, Mar 02 '16 at 01:26
What ***is*** "safe email validation", anyway? What does that phrase even mean? — user253751, Mar 02 '16 at 01:46
@Zymus: Gmail allows what Gmail allows; if I run a wildly successful webmail which only allows people to register with usernames matching `/^[d-q5-8]{24}_[a-c]{3,5}$/`, I'm not creating a new standard that's mandatory for everyone else to follow, just limiting what the acceptable local part is *at my own servers*. In other words, GMail only allowing you to create a username in a subset of *possible* local parts does NOT mean that this is the *only* possible local part. — Piskvor left the building, Mar 02 '16 at 15:38

score 40 · Accepted Answer · edited Oct 07 '21 at 07:59

Are those email addresses valid?

Yes, they are. See for example here or with a bit more explanation here.

For a nice explanation of what email addresses may look like, see the informational RFC3696. The more technical RFCs are linked there as well.

Attacks possible in the local part of an Email Address

Without quotes, local-parts may consist of any combination of
alphabetic characters, digits, or any of the special characters
  ! # $ % & ' * + - / = ?  ^ _ ` . { | } ~
period (".") may also appear, but may not be used to start or end the local part, nor may two or more consecutive periods appear. Stated differently, any ASCII graphic (printing) character other than the at-sign ("@"), backslash, double quote, comma, or square brackets may appear without quoting. If any of that list of excluded characters are to appear, they must be quoted.

So the rule is more or less: most characters can be part of the local part, except for @\",[], which must appear between double quotes (and " itself has to be escaped inside a quoted string).

There are also rules on where and when to quote and how to handle comments, but that's less relevant to your question.

The point here is that many attacks can be part of the local part of an email address, for example:

'/**/OR/**/1=1/**/--/**/@a.a
"<script>alert(1)</script>"@example.com
" onmouseover=alert(1) foo="@example.com
"../../../../../test%00"@example.com
...
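
To see why such payloads pass, here is a minimal Python sketch of the unquoted (dot-atom) rule quoted above. The names ATEXT, DOT_ATOM and is_unquoted_local_part are made up for this illustration, and the pattern deliberately ignores quoted strings, comments and obsolete syntax, so it is not a full RFC parser.

```python
import re

# Characters allowed in an unquoted local part, per the RFC3696 text quoted above:
# letters, digits and ! # $ % & ' * + - / = ? ^ _ ` { | } ~
ATEXT = r"[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]"

# One or more runs of those characters separated by single dots,
# which rules out leading, trailing and consecutive dots.
DOT_ATOM = re.compile(rf"{ATEXT}+(?:\.{ATEXT}+)*")


def is_unquoted_local_part(local_part: str) -> bool:
    """Return True if local_part is a syntactically valid unquoted local part."""
    return DOT_ATOM.fullmatch(local_part) is not None


# The SQL injection payload from the question is a legal local part.
print(is_unquoted_local_part("'/**/OR/**/1=1/**/--/**/"))  # True
print(is_unquoted_local_part(".leading.dot"))              # False
print(is_unquoted_local_part("a..b"))                      # False
```

The quoted examples in the list above would need a separate quoted-string rule, which this sketch leaves out.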

Attacks possible in the domain part of an Email Address

The exact structure of the domain part can be seen in RFC2822 or RFC5322:

addr-spec       =       local-part "@" domain

local-part      =       dot-atom / quoted-string / obs-local-part

domain          =       dot-atom / domain-literal / obs-domain

domain-literal  =       [CFWS] "[" *([FWS] dcontent) [FWS] "]" [CFWS]

dcontent        =       dtext / quoted-pair

dtext           =       NO-WS-CTL /     ; Non white space controls

                        %d33-90 /       ; The rest of the US-ASCII
                        %d94-126        ;  characters not including "[",
                                        ;  "]", or "\"

Where RFC5322 defines dtext slightly differently:

   dtext           =   %d33-90 /          ; Printable US-ASCII
                       %d94-126 /         ;  characters not including
                       obs-dtext          ;  "[", "]", or "\"

You can see that again, most characters are allowed (including non-printable control characters, via NO-WS-CTL and obs-dtext). Possible attacks would be:

a@a.a&a=////etc/passwd
foo@bar(<script>alert(1)</script>).com
foo@'/**/OR/**/1=1/**/--/**/
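
As a rough illustration, the Python sketch below compares the unquoted dot-atom form of the domain from the grammar above with the much narrower letters-digits-hyphen host name form that a DNS lookup would expect. Both patterns and all names (DOT_ATOM, HOSTNAME, classify_domain) are assumptions made for this example; domain literals, comments and obsolete syntax are left out.

```python
import re

# Unquoted dot-atom: runs of "atext" characters separated by single dots.
ATEXT = r"[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]"
DOT_ATOM = re.compile(rf"{ATEXT}+(?:\.{ATEXT}+)*")

# Host name style: labels of letters, digits and inner hyphens, max 63 chars each.
LABEL = r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?"
HOSTNAME = re.compile(rf"{LABEL}(?:\.{LABEL})*")


def classify_domain(domain: str) -> str:
    """Say which of the two simplified forms the domain part fits."""
    if HOSTNAME.fullmatch(domain):
        return "hostname"
    if DOT_ATOM.fullmatch(domain):
        return "dot-atom only"
    return "neither"


for domain in ("example.com", "a.a&a=////etc/passwd", "'/**/OR/**/1=1/**/--/**/"):
    print(domain, "->", classify_domain(domain))
```

The two attack strings fit the dot-atom form but not the host name form. The comment-based example is reported as "neither" only because this sketch does not parse comments.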

Conclusion

You can't validate email addresses safely.

Instead, you need to make sure to have proper defenses in place (HTML encoding for XSS, prepared statements for SQL injection, etc).
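
As a small sketch of what those defenses can look like in practice, the Python snippet below uses the standard library's sqlite3 placeholders and html.escape. The users table, its email column and the sample addresses are invented for the example; the same idea applies to whatever database driver and templating layer you actually use.

```python
import html
import sqlite3

sql_payload = "'/**/OR/**/1=1/**/--/**/@a.a"
xss_payload = '"<script>alert(1)</script>"@example.com'

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (email TEXT)")
conn.execute("INSERT INTO users (email) VALUES (?)", ("alice@example.com",))

# The address is passed as a bound parameter, never concatenated into the SQL
# text, so the quote and comment characters are treated as plain data.
rows = conn.execute(
    "SELECT email FROM users WHERE email = ?", (sql_payload,)
).fetchall()
print(rows)  # [] : the payload matches no row instead of matching every row

# Escape at output time, right before the value is placed into HTML.
escaped = html.escape(xss_payload)
print(escaped)  # &quot;&lt;script&gt;alert(1)&lt;/script&gt;&quot;@example.com
```

With both in place, the odd-looking addresses are stored and displayed as inert text.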

As defense in depth, you could forbid quoted strings and comments to gain some amount of protection, as these two features allow the most unusual characters and strings. But some attacks are still possible, and you will exclude a small number of users.

If you do need additional input filtering that is stricter than the email format itself, because you do not trust the rest of your application, you should carefully consider what you allow and what you reject. For example, + is used by Gmail to let users filter incoming email, so disallowing it may stop some users from signing up. Other providers may use other characters for similar functionality. A first approach might be to allow only alphanum + ! # % * + - = ? ^ _ . | ~. This would disallow < > ' " ` / $ { } &, which are characters used in common attacks. Depending on your application, you may want to disallow further characters.
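
A minimal Python sketch of that first approach, assuming the allow-list is applied to both sides of exactly one "@". The handling of the separator is an assumption, as are the names ALLOWED, ALLOW_LIST and passes_allow_list.

```python
import re

# alphanum plus ! # % * + - = ? ^ _ . | ~ on each side of exactly one "@"
ALLOWED = r"[A-Za-z0-9!#%*+\-=?^_.|~]+"
ALLOW_LIST = re.compile(rf"{ALLOWED}@{ALLOWED}")


def passes_allow_list(address: str) -> bool:
    """Return True if the address uses only the allow-listed characters."""
    return ALLOW_LIST.fullmatch(address) is not None


for address in (
    "foobar+spam@example.com",
    "'/**/OR/**/1=1/**/--/**/@a.a",
    "a@a.a&a=////etc/passwd",
    "%00%2a@a.a",
):
    print(repr(address), passes_allow_list(address))
```

The percent-encoded address from the question still passes this filter, which is one reason it can only supplement proper output encoding and parameterized queries, not replace them.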

And as you mentioned RFC822: it is a bit outdated (it's from 1982), but even it allows quoted strings and comments, so simply saying that you only accept RFC822-compliant addresses would be neither practical nor effective.

Also, are you checking your emails client-side? The JS code gives that impression. An attacker could just bypass client-side checks.

Ok, but while the mail specification allows lot of things in the domain part, it still must be a valid domain for the address to work, no? And domains are a bit stricter, aren't they? — Jan Hudec, Mar 01 '16 at 15:19
@JanHudec: Well, that depends on the mechanics of your mail delivery agent, but yes, if you know that the agent will perform an MX lookup through DNS, then (for example) all "domain part" values containing forward slashes can be tossed out. — Ben Voigt, Mar 01 '16 at 17:20
"Instead, you need to make sure to have proper defenses in place..." surely having proper defences in place ought to be standard practice for all user input in addition to validation checks. — James Snell, Mar 01 '16 at 20:46
@JamesSnell It should be, but it's really worth mentioning, because all too often, input filtering is the only or main line of defense (and as this question now seems to be on the hot network list, I'm glad I did mention it). — tim, Mar 01 '16 at 20:55
"you will exclude a small amount of users" -- has anyone ever seen a legitimate user that has an email address like these examples? Obviously excluding trolls who are just purposefully trying to raise a complaint about their valid email being rejected. I've never seen one (and think that RFC822 never should have even allowed such craziness). I don't think I even know any webmail hosts that allow you to make such emails. — Kat, Mar 03 '16 at 20:12

score 10 · Answer 2 · edited Mar 02 '16 at 13:57

The simplest way to test this would be to try sending an email to that address, from a send-only address (i.e. from noreply-randomblue@example.com). If it can't be delivered, it's not valid.
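
A confirmation step along those lines might be sketched like this in Python. The sender address, the confirmation URL and the function name build_confirmation are placeholders invented for the example, and the message is only built here, not sent; handing it to a mail server (for instance with smtplib) is left out.

```python
import secrets
from email.message import EmailMessage
from typing import Tuple


def build_confirmation(address: str) -> Tuple[str, EmailMessage]:
    """Create a random token and a confirmation message for the given address."""
    # Unguessable token; store it server-side next to the pending address.
    token = secrets.token_urlsafe(32)

    msg = EmailMessage()
    msg["From"] = "noreply@example.com"  # hypothetical send-only address
    msg["To"] = address
    msg["Subject"] = "Please confirm your email address"
    # Hypothetical confirmation URL; only someone who can read the mailbox sees it.
    msg.set_content(f"Open https://example.com/confirm?token={token} to confirm.")
    return token, msg


token, msg = build_confirmation("alice@example.com")
print(msg["To"], len(token) > 0)
```

The address only counts as verified once the token comes back, which tests deliverability rather than syntax.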

Using a regex to check email addresses is probably best done on the client side, to let users know before they register that they may have a typo in their email address.

answered Mar 01 '16 at 09:21 · Philip Rowlands

Sending an email is great when you want to test if the email address actually exists (and really the only proper solution in that case), but it doesn't prevent payloads in the email address, which is what the OP is worried about. Checking for typos client-side is also great regarding usability, but again not regarding security as it's easily bypassed. – tim Mar 01 '16 at 14:10
@tim I obviously misunderstood the question! Thanks for clearing that up. – Philip Rowlands Mar 01 '16 at 14:25
What's the advantage of using a send-only address? – Randomblue Mar 01 '16 at 15:34
@Randomblue as I understand it, a send-only address can't receive emails. So if somebody tries spamming it or sending it a virus/Trojan/Gremlin nest, that's going to fail. – Philip Rowlands Mar 01 '16 at 16:32
@Randomblue, the advantage of a send-only address is that the justifiably annoyed recipients of your "test" email can't take out their frustration on you quite so easily. Plus, it stops you accidentally discovering that the email could not be delivered or was unwanted. – Toby Speight Mar 02 '16 at 09:24

kubanczyk · Answer 3 · 2016-03-01T21:00:46.403

You say you want to have safe e-mail addresses. I presume this means these are put into your app and you expect some predictable output. The developers who write your app have in their collective head some idea of what to expect inside an e-mail field, and you had better not allow anything else there. Whatever your programmers don't expect is not very safe (even if it's valid according to some horrifying RFCs).

So if your developers are not very much into email-related RFCs, I suggest using "a willful violation of RFC 5322" that happens to exist within a W3C standard for HTML5, and translates to quite a simple regular expression:

^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$

source http://www.w3.org/TR/html5/forms.html#valid-e-mail-address

In case this is too lax (if you think your developers don't expect those strange #$%&| etc.), I suggest tightening it a bit more:

^[a-zA-Z0-9.+/=?^_-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)+$

I think 99.9% of real people's addresses match both of these expressions.
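
To get a feel for what each pattern lets through, here is a quick Python sketch that runs both expressions from this answer against the addresses from the question. The names HTML5_EMAIL, STRICTER_EMAIL and SAMPLES are made up for the example; the patterns themselves are copied from above.

```python
import re

# The HTML5 "valid e-mail address" pattern quoted above.
HTML5_EMAIL = re.compile(
    r"^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?"
    r"(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$"
)

# The stricter variant: fewer local-part characters, at least one dot in the domain.
STRICTER_EMAIL = re.compile(
    r"^[a-zA-Z0-9.+/=?^_-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?"
    r"(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)+$"
)

SAMPLES = [
    "foobar+spam@example.com",
    "'/**/OR/**/1=1/**/--/**/@a.a",
    "a@a.a&a=////etc/passwd",
    "a@a.com&&a=a",
    "%00%2a@a.a",
]

for address in SAMPLES:
    html5_ok = HTML5_EMAIL.fullmatch(address) is not None
    strict_ok = STRICTER_EMAIL.fullmatch(address) is not None
    print(f"{address!r}: html5={html5_ok} stricter={strict_ok}")
```

The HTML5 pattern still accepts the SQL-style and percent-encoded addresses from the question, while the stricter pattern rejects both. Both patterns reject the two addresses with & in the domain.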

Sensible, pragmatic advice – Matt Wilko Mar 02 '16 at 11:08

score 4 · Matt Wilko · Answer 4 · 2016-03-02T14:36:12.690

You can spend too much time worrying about this sort of thing. Why do you really care that much?

There isn't really an unsafe address as such - it's what you do with it / how you process it that counts.

If you process the address in an unsafe way, e.g. by concatenating strings to build SQL instead of using parameters, then you are asking for trouble, not just with email addresses but with every field you allow the user to fill in.

Simply put, provided it has

[>= one char] @ [>= one char] . [>= one char]

or even just:

[>= one char] @ [>= one char]

you should allow it. It doesn't really matter what those chars are.
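
That check is small enough to write without a regex. Here is a minimal Python sketch, where the function name looks_like_email and the require_dot switch are invented for the example:

```python
def looks_like_email(address: str, require_dot: bool = False) -> bool:
    """Shape check only: [>= one char] @ [>= one char], optionally with a dot."""
    # Split on the last "@" so that an "@" inside the local part does not matter.
    local, sep, domain = address.rpartition("@")
    if not (local and sep and domain):
        return False
    if require_dot:
        # [>= one char] . [>= one char] within the part after the "@"
        head, dot, tail = domain.partition(".")
        return bool(head and dot and tail)
    return True


print(looks_like_email("bob@mil"))                    # True
print(looks_like_email("bob@mil", require_dot=True))  # False
print(looks_like_email("no-at-sign"))                 # False
```

Everything beyond the shape check is left to the code that later stores, displays or sends to the address.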

edited Mar 02 '16 at 14:36 · answered Mar 02 '16 at 10:26 · Matt Wilko

Don't require a char after the last dot. Some TLDs like `.mil` had emails like `bob@mil` :) – Navin Mar 02 '16 at 14:01
@Navin - I did think that might be the case when I wrote it. I have updated my answer – Matt Wilko Mar 02 '16 at 14:04

score 0 · Answer 5 · answered Mar 04 '16 at 01:49

The responses which emphasise the need to use a layered approach rather than relying on a single filter or defence are on the right track. There are heaps of articles out there about writing the 'correct' regexp to validate a mail address. The reality is you need to combine a number of checks and cannot just rely on a regexp.

What checks you need will depend on what you're trying to do and what risks you're trying to protect against. If you're just trying to identify spammers, you may also need to look at content, subject lines and originating mail servers. On the other hand, if you are trying to verify a mail address for a registration process, you may want to verify the domain and possibly add a confirmation process that sends a message to the address.

My advice is similar to @MattWilko's: you soon get into diminishing returns when trying to derive the perfect regexp. As your expression becomes more complex, you will catch more bad addresses, but you will almost certainly also increase the number of false positives. The key is to find the right balance, and that balance will depend on your use case and the risks you're trying to protect against.
