Why is it dangerous to allow all characters in a URL?

Question

Reviewing the configuration of CodeIgniter, I saw the following line:

$config['permitted_uri_chars'] = 'a-z 0-9~%.:_\-';

And the documentation says:

/*
|--------------------------------------------------------------------------
| Allowed URL Characters
|--------------------------------------------------------------------------
|
| This lets you specify which characters are permitted within your URLs.
| When someone tries to submit a URL with disallowed characters they will
| get a warning message.
|
| As a security measure you are STRONGLY encouraged to restrict URLs to
| as few characters as possible.  By default only these are allowed: a-z 0-9~%.:_-
|
| Leave blank to allow all characters -- but only if you are insane.
|
| The configured value is actually a regular expression character group
| and it will be executed as: ! preg_match('/^[<permitted_uri_chars>]+$/i
|
| DO NOT CHANGE THIS UNLESS YOU FULLY UNDERSTAND THE REPERCUSSIONS!!
|
*/

However, it is not clear to me what effects or security concerns arise from allowing all characters in a URL.

What problems can this cause?

AFAICT there is no reason to limit this other than band-aiding otherwise flawed code (XSS, SQL injection, arbitrary file access, ...). I'm not posting an answer because I'm curious whether someone comes up with a valid reason for it. — marstato, Aug 10 '16 at 17:08
[This answer](http://stackoverflow.com/questions/4170418/allowing-any-character-in-the-url-in-codeigniter) on Stack Overflow should answer your question. — WhackinMyKeyboard, Aug 10 '16 at 17:12
@DanK The question is asking the same thing - but where is the answer? — marstato, Aug 10 '16 at 17:16
@marstato Thanks for commenting on this question. I found an answer at the following link: https://security.stackexchange.com/questions/11234/how-does-hacking-work — Juan Pinzón, Aug 10 '16 at 17:58

score 3 · Accepted Answer · answered Aug 11 '16 at 15:22

Limiting the character set in this way (also called whitelisting) is one of the recommended methods of input validation. The purpose of input validation is to prevent a program from operating on data that may cause unintended behavior.

There are many successful attacks that have resulted from malformed URLs (these are not actual attack URLs, but representative of attacks):

filepath injection: http://example.com/?C:\documents\top_secrets.txt
buffer overflow: http://example.com/aaaaaaaaaaaaaaaaaaaaaaaaa...aaaEvilShellCode
script injection: http://example.com/?<script>alert("Click me!")</script>
SQL injection: http://example.com/?USER=' or 1=1; select * from users

Initial reactions to these attacks were to prohibit the backslash character, quote marks, asterisks, and the less-than and greater-than symbols. This is called blacklisting; unfortunately, blacklisting is mostly a "patch after learning about the attack" approach. Whitelisting is somewhat more effective than blacklisting. However, limiting the characters that appear in a URL does virtually nothing to prevent many of these attacks if the filter can be bypassed with percent encoding, which lets the attacker express forbidden characters using only characters from the approved whitelist: %2F is the same as a /, etc.

To be effective, the regexp in CodeIgniter needs to be performed after the percent encoding has been decoded. And in order to prevent buffer overflow problems while simply testing the data with regexp, the first step of the validator has to be length checking.
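The order of checks described above can be sketched as follows. This is a minimal Python illustration, not CodeIgniter's actual implementation: the function name `is_segment_allowed`, the 255-character limit, and the rejection of double-encoded input are assumptions made for the example. The character class mirrors the default quoted in the question, minus `%`, because the test runs on text that has already been decoded.

```python
import re
from urllib.parse import unquote

# Hypothetical limit chosen for illustration only.
MAX_SEGMENT_LENGTH = 255

# Explicit A-Za-z instead of a case-insensitive flag, so that no
# non-ASCII letters slip through via Unicode case folding.
PERMITTED = re.compile(r"[A-Za-z 0-9~.:_\-]+")


def is_segment_allowed(raw_segment: str) -> bool:
    """Return True if a URI segment passes the length and whitelist checks."""
    # 1. Length check first, before any other processing.
    if len(raw_segment) > MAX_SEGMENT_LENGTH:
        return False
    # 2. Decode percent-encoding so that %2F is tested as '/'.
    decoded = unquote(raw_segment)
    # 3. If decoding again changes the text, the input was double
    #    encoded (e.g. %252F); reject it rather than guess.
    if unquote(decoded) != decoded:
        return False
    # 4. The whitelist must match the whole decoded string.
    return PERMITTED.fullmatch(decoded) is not None
```

With this ordering, `is_segment_allowed("..%2Fetc")` is rejected because the decoded text contains a `/`, whereas a check run on the raw text would have seen only permitted characters.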

There's another problem that they might be trying to prevent with their whitelist, and that is URL hijacking using Unicode characters to simulate ASCII characters. To a human just clicking a link, the strings "exampleZurichBank.com" and "exampleZuricⱨBank.com" appear similar. Blocking characters outside the ASCII ranges A-Z and a-z does help prevent these; it also disenfranchises a large segment of the planet by blocking URLs in their native alphabets.
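As a rough illustration of how such lookalike characters can be surfaced, here is a small Python sketch. The helper name `non_ascii_characters` is hypothetical, and flagging every non-ASCII character is the same blunt policy as the whitelist, with the same cost for legitimate internationalized names.

```python
import unicodedata
from typing import List, Tuple


def non_ascii_characters(hostname: str) -> List[Tuple[str, str]]:
    """List each non-ASCII character in a hostname with its Unicode name."""
    return [
        (ch, unicodedata.name(ch, "UNKNOWN"))
        for ch in hostname
        if not ch.isascii()
    ]


# The second hostname contains one lookalike character in place of 'h'.
print(non_ascii_characters("exampleZurichBank.com"))
print(non_ascii_characters("exampleZuric\u2c68Bank.com"))
```

Printing the Unicode name alongside the character makes the substitution visible even when the two glyphs look alike on screen.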

Keep in mind that input validation is only one preventative measure out of many that still need to be implemented. Applications still need to defend against other common vulnerabilities, such as XSS, CSRF, SQL injection, session hijacking, etc.

Thanks for the answer. Another question: **can URL vulnerabilities be avoided if the POST method is used?** — Juan Pinzón, Aug 11 '16 at 15:47
@JuanPinzón, that would be a good separate question to ask. — John Deters, Aug 15 '16 at 14:53

score 2 · Answer 2 · answered Aug 11 '16 at 00:30

If you postulate that all the code running on your server properly parses URLs and encodes them when making filesystem lookups, when including them in database queries, when passing them to shell commands and so on, and that the code is fully consistent regarding when it considers strings to be equal (e.g. URL percent-encoding, case sensitivity, non-ASCII encodings, Unicode normalization, …), then there's no harm in allowing any character in URLs.

But how sure are you that all the code you're running is perfectly safe and consistent?

Reducing the set of allowed characters reduces the potential for vulnerabilities. For example, if you have code that constructs SQL queries by directly injecting parts of the URL like sprintf("select where name = '%s'", url.param[1]), but the URLs aren't allowed to contain ' nor %27, then this SQL injection vulnerability can't actually be exploited.
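To make the example concrete, here is a minimal Python sketch using the standard `sqlite3` module. The `users` table, the helper names, and the payload are assumptions for illustration. The first function builds the query by string formatting, as in the pseudocode above; the second uses a bound parameter, which removes the vulnerability regardless of which characters the URL may contain.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical users table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT)")
    conn.executemany("INSERT INTO users VALUES (?)", [("alice",), ("bob",)])
    return conn


def find_user_unsafe(conn: sqlite3.Connection, name: str) -> list:
    # Vulnerable: the URL parameter is pasted directly into the SQL text.
    query = "SELECT name FROM users WHERE name = '%s'" % name
    return conn.execute(query).fetchall()


def find_user_safe(conn: sqlite3.Connection, name: str) -> list:
    # The driver binds the value, so quotes in it are treated as data.
    return conn.execute(
        "SELECT name FROM users WHERE name = ?", (name,)
    ).fetchall()


conn = make_demo_db()
payload = "x' OR '1'='1"  # needs the ' character to break out of the string
print(find_user_unsafe(conn, payload))  # every row matches
print(find_user_safe(conn, payload))    # no row matches
```

The payload only works because it contains `'`. A character whitelist blocks it by accident; the bound parameter blocks it by design.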

I agree, the whitelisting is probably a workaround for poor code quality in CodeIgniter, because there are no generic attacks that depend on any specific characters in URLs. In reality, the important thing is to *correctly encode* each URL for each specific *context*. For example, the encoding is *slightly different* if the URL is supposed to be user-visible text nested in an HTML element vs. being the value of an `href` attribute vs. being a JS string value vs. being nested within a `url()` function in CSS. Well-written code will have *separate* encoding methods with binary input for all those cases. — Mikko Rantalainen, Jun 15 '21 at 08:45
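A minimal Python sketch of the "separate encoder per context" idea from the comment above. The function names are hypothetical, only four contexts are shown (CSS `url()` is omitted), and the functions take text rather than the binary input the comment suggests. A real application would normally rely on its templating library's encoders.

```python
import html
import json
from urllib.parse import quote


def as_html_text(value: str) -> str:
    """Encode for text content inside an HTML element."""
    return html.escape(value, quote=False)


def as_html_attribute(value: str) -> str:
    """Encode for a double-quoted HTML attribute value."""
    return html.escape(value, quote=True)


def as_js_string(value: str) -> str:
    """Encode as a JavaScript string literal, including the quotes."""
    # Escaping '<' keeps a literal "</script>" from ending an inline script.
    return json.dumps(value).replace("<", "\\u003c")


def as_url_component(value: str) -> str:
    """Percent-encode for use as a single URL path or query component."""
    return quote(value, safe="")


sample = 'a"b<c&d'
for encode in (as_html_text, as_html_attribute, as_js_string, as_url_component):
    print(encode.__name__, encode(sample))
```

The same input produces four different outputs, which is why a single character filter on the URL cannot substitute for encoding at the point of use.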
