1

FindBugs flagged the following email address validation regex as vulnerable to DoS:

^[\w!#$%&'*+/=?`{|}~^-]+(?:\.[\w!#$%&'*+/=?`{|}~^-]+)*@(?:[a-zA-Z0-9-]+\.)+[a-zA-Z]{2,6}$

Here's an easier to read version that substitutes CHARS for \w!#$%&'*+/=?`{|}~^-:

[CHARS]+(?:\.[CHARS]+)*@(?:[a-zA-Z0-9-]+\.)+[a-zA-Z]{2,6}

Based on my reading of the OWASP ReDoS page, this isn't actually vulnerable. I can't see a way to ambiguously apply the pattern, because the repetition inside the groups doesn't apply to the period character, which ought to result in a canonical application of the pattern.

Have I misunderstood the threat here? Can anyone spot a way to abuse this pattern?

Duncan Jones
  • 1,647
  • 1
  • 10
  • 14
  • While not technically off-topic, this is probably more suited for the FindBugs support channel. – Tom K. Mar 01 '19 at 11:10
  • 1
    Side note: regex rejects possibly valid addresses: [Top-level domains are valid for addresses](https://serverfault.com/q/721927/87542). [Addresses are allowed to contain non-ascii characters](https://stackoverflow.com/q/760150/812837), which might include the domain (a [more comprehensive list](https://stackoverflow.com/questions/2049502/what-characters-are-allowed-in-an-email-address)). [Single-character TLDs are valid](https://stackoverflow.com/q/7411255/812837). [More comprehensive regex](https://stackoverflow.com/q/201323/812837), but note it won't cover international addresses... – Clockwork-Muse Mar 01 '19 at 19:02

1 Answers1

1

Tricky one! I can't seem to trigger the exponential complexity this (\.[CHARS]+)* pattern should introduce. I can make it be a little slow with a huge string (a few megabytes), but I guess that is to be expected with any regular expression. In Python:

import time, re
t=time.time()
print(re.match(r'^[\w!#$%&\'*+/=?`{|}~^-]+(\.[\w!#$%&\'*+/=?`{|}~^-]+)*@(?:[a-zA-Z0-9-]+\.)+[a-zA-Z]{2,6}$', 'a.'+'xb'*2500000+'test@example.com'))
print(time.time()-t)
t=time.time()
print(re.match(r'^[\w!#$%&\'*+/=?`{|}~^-]+(\.[\w!#$%&\'*+/=?`{|}~^-]+)*@(?:[a-zA-Z0-9-]+\.)+[a-zA-Z]{2,6}$', 'ax'+'.b'*2500000+'test@example.com'))
print(time.time()-t)

In the first test, we supply a long string where the xb pattern is repeated. This matches just the first part (namely, [CHARS]+) and not the second (namely, (\.[CHARS]+)*)). The second test does the opposite: it repeats the '.b' pattern which matches the second part a lot of times (where the backtracking should be) and not the first.

As output, you can see the difference in time it takes:

0.06872129440307617
0.3483853340148926

Both strings are equally long, so the pattern does take some more time.

Usually, one would use the second pattern and then make the end not match, so remove the .com part or something. The regex engine would start to backtrack and take exponentially longer with each iteration. I'm not sure why that does not happen here.

So far as I can see, you might be safe by just limiting the input length to a few kilobytes. But I am probably just too stupid to trigger it; using a pattern that is known to be vulnerable is just asking for someone to come along and figure it out ;-)

Luc
  • 31,973
  • 8
  • 71
  • 135
  • "*the exponential complexity this `(\.[CHARS]+)*` pattern should introduce*" > Doesn't the period prevent the complexity? The examples given in the OWASP page look more like `([CHARS]+)*`, which is ambiguous. However in this case, because `CHARS` doesn't include a period as a possible character, I think there is no complexity here? – Duncan Jones Mar 01 '19 at 12:06
  • @DuncanJones Good point. If I remove the dot and update the test strings accordingly, I can indeed trigger the vuln. So you might be right, but I'd still avoid the regex if possible. (And that's besides the point that email regexes are always faulty/annoying, but that's a different topic covered in other questions on other stackexchange sites.) – Luc Mar 01 '19 at 12:11