Is it safe to let a user type a regex as a search input?

Question

I was in a mall a few days ago and I searched for a shop on an indication panel.

Out of curiosity, I tried a search with (.+) and was a bit surprised to get the list of all the shops in the mall.

I've read a bit about evil regexes but it seems that this kind of attack can only happen when the attacker has both control of the entry to search and the search input (the regex).

Can we consider the mall indication panel safe from DOS considering that the attacker only has control of the search input? (Leaving aside the possibility that a shop might be called some weird name like aaaaaaaaaaaa.)

If the user can enter a regex, and there's an interpreted language in use, I wouldn't be worried about DOS; I'd be worried about code injection. — gowenfawr, Aug 06 '18 at 00:08
@gowenfawr Could you elaborate your answer ? I didn't find anything sourcing/explaining this attack. Also, I found about [this question](https://stackoverflow.com/questions/4579497/is-there-any-way-to-put-malicious-code-into-a-regular-expression) on SO (I initially didn't see it before my post) but it doesn't go deep into code injection (barely mention the risk in the sandbox section of the answer). — Xavier59, Aug 06 '18 at 00:25
I would not expect a mall map to be designed for sophisticated users that might use regexes. Therefore, if regexes work, it suggests the application is sort of blindly passing the input string in. That's usually a place to try various forms of code and SQL injection. It's that little voice saying "I bet they didn't do that by design..." that makes the antenna perk up. This is a Comment, not an Answer, because (for me) there's not enough info here to say anything more accurate than that. — gowenfawr, Aug 06 '18 at 00:46
it could be all client-side, in which case there's no real risk. — dandavis, Aug 06 '18 at 02:16
@gowenfawr You're may be right that an application accepting regex in places you wouldn't expect regex to work could signal that there are security issues elsewhere, but I don't think this is really an argument against accepting user regex. Avoiding something because it may pique an attacker's interest is a similar logic to security through obscurity: trying to avoid the attention of attackers is scant defense (even more so than obscurity) and hoping to use it to deter an attacker from uncovering deeper issues seems like a questionable strategy. — Ryan Jenkins, Aug 06 '18 at 05:51
Despite the security concerns I would love to perform RegEx filtration in indication panels of huge shopping malls! — Daniel, Aug 06 '18 at 06:51
Did you test any regex that should get matches to determine it actually used regex? If I were to design a mall search I would list all shops if the search result was empty. Either the user is trying to have fun (like you) and the result would not matter or the user isn't good at using the search functionality and they should see something that might be of use to them. — Bent, Aug 06 '18 at 09:07
It is also possible that the search field just ignores any punctuation, and is programmed to return all shops for an essentially empty query. — jpa, Aug 06 '18 at 09:51
With some engines, no need for fancy regexps. For instance, with GNU `grep` 3.1 here, `grep -E 'a{0,32767}'` uses up over 4GiB of RAM and 100% CPU indefinitely regardless of the input. — Stéphane Chazelas, Aug 06 '18 at 11:09
It's very easy to craft a regex which stalls even on very short input. For example, `/^(((([a-z]*[a-z]*)*([a-z]*[a-z]*)*([a-z]*[a-z]*)*([a-z]*[a-z]*)*)*)*)+h$/.test("kjhsdf")` is enough to make Chrome freeze. For more advanced engines which does all sorts of analysis, it might be harder but still possible to craft an inefficient regex. — nhahtdh, Aug 06 '18 at 11:40
@jpa Yup I thought the same but I did some more testing with others regex and it's not the default design, it actually does execute the regex. — Xavier59, Aug 06 '18 at 13:34
@nhahtdh It depends on the implementation. Regexes can be executed in linear time with a good algorithm. https://swtch.com/~rsc/regexp/regexp1.html — Reinstate Monica, Aug 06 '18 at 13:47
@gowenfawr [Oh yes, little Bobby Tables, we call him.](https://xkcd.com/327/) — knocked loose, Aug 06 '18 at 16:28
@Solomonoff'sSecret: Well, most of the implementations that are used on web servers are not linear. — nhahtdh, Aug 07 '18 at 01:55
@nhahtdh If this is an important feature, the developer can implement a linear algorithm or use a library with one. — Reinstate Monica, Aug 07 '18 at 03:26
Whatever you do end up doing, do **NOT** `eval()` user input (or whatever equivalent your chosen language has) — ggdx, Aug 07 '18 at 12:27
Are you certain it's actually executing the regex? It could just be that the search ignores all special characters to make the search easier. For example, a store name `Spencer's` might also accept `Spencers`. — Matthew, Aug 07 '18 at 13:24

score 85 · Accepted Answer · edited Aug 07 '18 at 15:34

85

I would compare accepting user supplied regular expressions to parsing most sorts of structured user input, such as date strings or markdown, in terms of risk of code execution. Regular expressions are much more complex than date strings or markdown (although safely producing html from untrusted markdown has its own risks) and so represents more room for exploitation, but the basic principle is the same: exploitation involves finding unexpected side effects of the parsing/compilation/matching process.

Most regex libraries are mature and part of the standard library in many languages, which is a pretty good (but not certain) indicator that it's free of major issues leading to code execution.
That is to say, it does increase your attack surface, but it's not unreasonable to make the measured decision to accept that relatively minor risk.

Denial of service attacks are a little trickier. I think most regular expression libraries are designed with performance in mind but do not count mitigation of intentionally slow input among their core design goals. The appropriateness of accepting user supplied regular expressions from the DoS perspective is more library dependent.
For example, the .NET regex library accepts a timeout which could be used to mitigate DoS attacks.
RE2 guarantees execution in time linear to input size which may be acceptable if you know your search corpus falls within some reasonable size limit.

In situations where availability is absolutely critical or you're trying to minimize your attack surface as much as possible it makes sense to avoid accepting user regex, but I think it's a defensible practice.

edited Aug 07 '18 at 15:34

answered Aug 06 '18 at 00:38

Ryan Jenkins

836
7
7

7

Yes, a timeout is the first thing that comes to mind for mitigating a DoS. Even ignoring library support, it's fairly trivial in most languages/frameworks to spin off the search to a background thread, and have a timeout against that thread. – Bob Aug 06 '18 at 02:48
8

@Bob that's trivial yes, but stopping the background task is not. For example in a language like Java there is no way to forcibly terminate a thread, so even if your timeout had expired you would not be able to do anything about it. – Boris the Spider Aug 06 '18 at 06:34
1

Ages ago when I became aware of regex and moved beyond the basics to start getting fancy, I was able to create some really horrifically slow regex patterns. A lot of this depends on the regex engine but if you are working with one that supports backreferences, lookaheads/lookbehind and/or greedy quantifiers, it's not too hard to bog things down. Of course the length of the strings you are searching makes a big difference. Multi-line regex on large documents can really be a dog. – JimmyJames Aug 06 '18 at 14:47
@BoristheSpider [This StackOverflow question](https://stackoverflow.com/questions/2733356/) seems to provide a method for launching tasks with a time-out. Does that not work in this scenario? – Nat Aug 06 '18 at 15:18
3

@Nat it relies on cooperative multitasking - i.e. it will `cancel(true)` the task, which will `interrupt()` the `Thread` - if the task is interruptible then this may work, most likely it won't however. – Boris the Spider Aug 06 '18 at 15:26
5

Here is an example of [a regular expression which takes exponential execution times on Java](https://dzone.com/articles/how-kill-java-regular): `(0*)*A` – Philipp Aug 06 '18 at 15:32
The one "common" case of malicious/slow regex's I'm aware of is usually called [catastrophic backtracking](https://www.regular-expressions.info/catastrophic.html). Notably it doesn't require any "advanced" features, like lookahead/behind. – mbrig Aug 08 '18 at 14:41
1

I remember that [SO once fell due to regex backtracking as well](http://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016). – Alexander Aug 09 '18 at 13:01

score 15 · Answer 2 · answered Aug 06 '18 at 02:31

15

The main threat in accepting regular expressions will be in your regex execution engine rather than accepting regex itself. I'd expect the threat to be very, very low in any well implemented engine. The engine shouldn't need access to any privileged system resources and should only need to run logic on input provided directly to the engine. This means that even if someone finds an exploit in the interpreter, the damage that can be done should be minimal.

Overall, all regex is designed to do is look for patterns within a value. As long as proper security is followed on the values you check against, there is no reason the engine itself should have any access to modify values. I'd classify it as generally pretty safe.

That said, I'd also only provide it in situations where it made reasonable sense to do so. Regex is complex, potentially time consuming to run, and used in the wrong places could have some undesirable impacts on an application outside of a security context, but in the right use case they are hugely powerful and immensely valuable. (I'm a software architect who refactors hundreds of thousands of lines of code regularly using regex.)

answered Aug 06 '18 at 02:31

AJ Henderson

41,816
5
63
110

15

This doesn't cover DoS attacks via, for example, catastrophic backtracking. – Boris the Spider Aug 06 '18 at 06:36
5

@boris I didn't consider that a security threat as expensive regex handling in a non interfering manner is necessary even in normal usage. People are going to make excessively complex regex statements without it being an attack plenty often. Rational timeouts is a necessary design decision for performance reasons, not just security. It would be a bit like saying a security risk of adding a complex report is people may DOS your site by running the report. That's a performance concern, not a security one. – AJ Henderson Aug 06 '18 at 12:49
2

People have crashed servers with regular expressions, and I personally know of one site that had hundreds of thousands of users getting crashed with that kind of a construct. Can't agree with such damage being minimal, as it took them some time to get it back online. – eis Aug 07 '18 at 16:05
@eis did they exploit the regex engine or was performance safe guards not properly configured and a series of run away regex took down the server trying to solve? I said the risk of exploitation of the engine is low. Slow running queries, even in a dos sense, is a performance concern as legitimate queries could also take down the server without proper performance safe guards. – AJ Henderson Aug 07 '18 at 16:52
@AJHenderson you're right in that it's the latter, not about exploiting the engine. However even without any exploit I think the end user impact might be something else than minimal, even if the regex won't modify any values. – eis Aug 07 '18 at 18:36
@els - right, but re-read what I said in the answer. I said that the possible impacts of an exploit are minimal. I later said that they can run a very long time and have potentially undesirable impacts outside the security context. Since valid input can also put a substantial processing load, I consider performance safeguards to be a performance issue rather than a security one. Certainly the impact of not having proper performance safe guards would be catastrophic, even to legitimate use. – AJ Henderson Aug 07 '18 at 18:54

score 8 · Answer 3 · edited Aug 06 '18 at 14:10

As the other answers have pointed out, the attack vector would most possibly be the regex engine.

While you would assume that these engines are quite mature, robust and thoroughly tested, it did happen in the past:

CVE-2010-1792 Arbitrary Code Execution in Apple Safari and iOS. Quote from the Patch notes:

A memory corruption issue exists in WebKit's handling of regular expressions. Visiting a maliciously crafted website may lead to an unexpected application termination or arbitrary code execution.

But of course, the argument of a possibly flawed library holds for everything - even user-provided JPEG files.

The other aspect, albeit not inherently technical, would be the (.+) case you mentioned: Should the product allow arbitrary data retrieval?

score 7 · Answer 4 · answered Aug 06 '18 at 19:40

The problem is that regex engines "backtrack". When you have a reptition operation (e.g. + or * ) in your regex the regex engine will try to match it against as much of the input string as possible. If the match later fails then it will backtrack and try matching your repition against a smaller part of the input string.

Multiple repitition operations can lead to nested backtracking and this can lead to the time to evaluate the regex blowing up massively, especially if the repetition operators are nested.

https://www.regular-expressions.info/catastrophic.html

Not all regex engines backtrack (although, most do). The alternative to backtracking is to consider all the possibilities simultaneously until you've found the one that works. It's harder to implement but avoids the pathological cases. See https://swtch.com/~rsc/regexp/regexp1.html — kbolino, Aug 08 '18 at 22:49

score 5 · Answer 5 · answered Aug 08 '18 at 10:15

5

No, ReDoS does not require the attacker to craft unnatural search results.

The basic idea of ReDoS is that you have a sub-expression that can match in multiple ways and matches almost everywhere in the searched string except the end, and you iterate that sub-expression to get catastrophic backtracking. So for example if your shop description is Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua., you can just use something like ([^q]|[^q][^q])+ (or more complex constructs with e.g. lookaheads).

Whether that's a problem depends - as other answers have explained, you can just limit the time available to the regex engine.

answered Aug 08 '18 at 10:15

Tgr

668
3
11

1

I would mention that there is regexp implementations that does not do backtracking - and those avoids this problem. – Taemyr Aug 08 '18 at 10:47
1

RE2 is already mentioned in another answer. It's not really an implementation though, it's a safe subset of the language - so you'd lose features compared to something like PCRE (arguably features that no one cares about in a product search form). – Tgr Aug 08 '18 at 15:11

score -1 · Answer 6 · answered Aug 09 '18 at 02:18

-1

Short answer... No. Irregardless if it is a regex or not, it is still user supplied data and should NEVER be trusted. Standard practice is to properly validate all user supplied data... ALWAYS!

If you wish to allow use of regex from the user, then the user regex should be compared against a white list of allowable regex's that you wish to make available for the script. This way you never directly use the regex sent by the user, and if it doesn't match a regex in the whitelist, you can exit the script. Only safe way to allow regex as user input that I can think of.

answered Aug 09 '18 at 02:18

Epiphany

115
1

2

Not sure why this was being downvoted. Who here hasn't been touched by the story of little [Bobby Tables](https://www.xkcd.com/327/)? ;) – nick012000 Aug 09 '18 at 14:01
2

If you specifically want to allow the user to input an arbitrary regex (consider a search input with a regex option), then this answer isn't useful. What does it mean to validate a regex? What are the consequences of not doing so? – Macil Aug 10 '18 at 04:59
1

I think it quite unfair to down vote my answer, without showing how you plan to deal with this regex once server-side, as in my perfectly viable solution given in my answer above. If you somehow are suggesting that you pass user regex's (loaded with special characters) on to your server without validation... well, good luck with that! – Epiphany Aug 15 '18 at 08:10

Is it safe to let a user type a regex as a search input?

6 Answers6