1

OWASP defines "evil regex" (here) as follows:

Evil Regexes A Regex is called "evil" if it can stuck on crafted input. Evil Regex pattern contains: Grouping with repetition Inside the repeated group: Repetition Alternation with overlapping

Examples of Evil Patterns:

    (a+)+
    ([a-zA-Z]+)*
    (a|aa)+
    (a|a?)+
    (.*a){x} | for x > 10

From the perspective of development and pentesting, I'd like to know if this is a complete characterization of "evil regex", or whether other forms exist?

Also, is there are any algorithm or method for constructing an input for a vulnerable regex that may cause it to "hang" arbitrarily long?

Soufiane Tahiri
  • 2,667
  • 12
  • 27
Shuzheng
  • 1,097
  • 4
  • 22
  • 37
  • 1
    If the regex is not written correctly you can create issues for sure, for example a regex expression like "a*b*c*d*e*f*g" will create 7 linear operations on the analyzed data that could potentially boost the cpu. Regex expressions needs to be evaluated and study the impact in order to avoid performance issues or invalid matches. – camp0 Sep 02 '19 at 08:34
  • Thanks, what's wrong with the regex "abcdefg"? It takes a constant amount of time to evaluate at each position in the input string. – Shuzheng Sep 02 '19 at 08:46
  • Pathological cases differ from regex engine to regex engine, so a good list would be by implementation. – Bruno Rohée Sep 02 '19 at 09:12
  • Sorry the edit didn't reflect the regex that I want to put, here is the regex with the issue "a.*b.*c.*d.*e.*f", so if you have 1MB of data with out that characters you will scan 7 times the MB, bear in mind that this is an example to illustrate a performance issue. – camp0 Sep 02 '19 at 09:26
  • 1
    @camp0, if neither of those characters are present int the MB, then the scanning will proceed over the input in one pass, with no match for "a" at any position. – Shuzheng Sep 02 '19 at 10:56
  • @Shuzheng He probably meant that the contents start with abcdef and never again contain those letters. First it looks up to the end to find a `b` and backtracks until it settles on the second character, then it scans everything to find the `c` and backtracks again until it decides to stop at the third character... etc. However a smart implementation could keep track of which characters it has seen (first & last position seen after the already matched text) and avoid these multiple scans&backtracks in this specific instances. – Giacomo Alzetta Sep 02 '19 at 13:27
  • The answer here explains it well https://stackoverflow.com/questions/12841970/how-can-i-recognize-an-evil-regex – Soufiane Tahiri Sep 02 '19 at 16:03

0 Answers0