The problem
Abusing character encodings is a popular trick to get XSS to work even when there are filters in place. There are a number of different situations where it works, but they all share common prerequisites:
- The attacker sends a payload in character encoding A.
- The server doing the filtering or sanitization is working in character encoding B.
- The victim's browser is interpreting the page as if in character encoding A.
Let's look at two examples of how this can happen.
Example #1: No encoding parameter in htmlspecialchars
This is a quite common sight in PHP:
echo htmlspecialchars($_GET["query"], ENT_COMPAT | ENT_HTML401);
The problem here is the default behaviour PHP falls back to when there is no encoding specified. From the manual:
If omitted, the default value of the encoding varies depending on the PHP version in use. In PHP 5.6 and later, the default_charset configuration option is used as the default value. PHP 5.4 and 5.5 will use UTF-8 as the default. Earlier versions of PHP use ISO-8859-1.
So what encoding PHP uses depends on your version and configuration. Great. So now all that stands between you and the abyss is someone making an innocent change in php.ini, or perhaps just something as simple as a server upgrade or reinstall. I too like to live dangerously... but not that dangerously.
Note that this example has nothing to do with the browser. Modern or old, it doesn't matter, because it's the server and not the browser that's the problem here.
The solution, of course, is to specify the correct encoding and to make sure that the same encoding is specified in the HTTP Content-Type header of the response:
echo htmlspecialchars($_GET["query"], ENT_COMPAT | ENT_HTML401, "UTF-8");
Example #2: Browser heuristics biting you
This is a problem if your server does not specify what encoding it is using in the response (or if it only does so in a meta tag that is too far down for the browser to care about it). If you do not tell the browser what encoding to use, it will have to guess. Unfortunately, not all browsers are good at that:
If certain strings of user input -- say, +ADw-script+AD4-alert(1)+ADw-/script+AD4- -- are echoed back early enough in the HTML page, Internet Explorer may incorrectly guess that the page is encoded in UTF-7. Suddenly, the otherwise harmless user input becomes active HTML and will execute.
The payload in the quote is <script>alert(1)</script> encoded in UTF-7. A sanitizer working in UTF-8 would see nothing dangerous in that payload and let it through, but the browser that is tricked into working in UTF-7 would still run it.
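To make the mismatch concrete, here is a small sketch in Python (chosen only for illustration, the same idea applies to PHP). It assumes the server-side filter behaves like Python's `html.escape`, which is a stand-in and not the PHP function discussed above, and it uses Python's built-in `utf-7` codec to mimic a browser that has guessed UTF-7.

```python
import html

# The payload from the quote above, exactly as it travels over the wire (pure ASCII).
payload = "+ADw-script+AD4-alert(1)+ADw-/script+AD4-"

# A filter that reads the input as UTF-8 (or any ASCII-compatible encoding)
# finds none of the characters it escapes, so the payload passes unchanged.
filtered = html.escape(payload)

# A browser that guesses UTF-7 decodes those same bytes differently.
as_seen_by_browser = filtered.encode("ascii").decode("utf-7")

print(filtered == payload)   # True
print(as_seen_by_browser)    # <script>alert(1)</script>
```

The filter did its job correctly for the encoding it assumed; the markup only appears once the bytes are read in a different encoding.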
My understanding is that it is mostly old versions of IE where this is a problem. But I am not sure, so I would be happy to see another answer where it is clarified.
EDIT: See Xavier59's answer for a situation where it works on modern browsers.
The solution
What you need to do on the server is simple in theory. You need to make sure that the following is always true:
- The character encoding of the response is correctly set in the HTTP headers.
- The XSS filter is working in the same encoding as specified above.
In practice, it is surprisingly easy to get that wrong.
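One way to lower the risk of the two drifting apart is to derive both the header and the filter's encoding from a single constant. The following is a minimal Python sketch of that idea, not a complete implementation: `CHARSET` and `build_response` are hypothetical names, and `html.escape` again stands in for whatever filter you actually use.

```python
import html
from typing import Dict, Tuple

# Single source of truth for the encoding (hypothetical name).
CHARSET = "utf-8"

def build_response(raw_query: bytes) -> Tuple[Dict[str, str], bytes]:
    """Hypothetical helper: return (headers, body) for a page echoing the query."""
    # Decode strictly: bytes that are not valid in CHARSET raise
    # UnicodeDecodeError instead of being passed on for the browser to reinterpret.
    query = raw_query.decode(CHARSET, errors="strict")
    # Escape after decoding, so the filter works on the same characters
    # the browser will see once it honours the header below.
    body = "<p>You searched for: " + html.escape(query, quote=True) + "</p>"
    headers = {"Content-Type": "text/html; charset=" + CHARSET}
    return headers, body.encode(CHARSET)

headers, body = build_response(b"<script>alert(1)</script>")
print(headers["Content-Type"])  # text/html; charset=utf-8
print(body)  # b'<p>You searched for: &lt;script&gt;alert(1)&lt;/script&gt;</p>'
```

Because the header states the encoding explicitly, the browser has no reason to guess, and because decoding, escaping and encoding all use the same constant, a configuration change cannot silently alter only one of them.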