How are character encodings used to bypass XSS sanitizers?

Question

I read in different blogs that PHP's htmlspecialchars() function has certain problems when one does not pass the expected charset as the optional parameter.

Can someone explain some basic stuff about XSS exploits that arise from bad usage of sanitize functions with some examples related to character encoding?

Does this affect modern browsers as well?

Anders · Accepted Answer · 2018-02-28T20:12:52.597

The problem

Abusing character encodings is a popular trick to get XSS to work even when there are filters in place. There are a number of different situations where it works, but they all share common prerequisites:

The attacker sends a payload in character encoding A.
The server doing the filtering or sanitization is working in character encoding B.
The victim's browser is interpreting the page as if in character encoding A.

Let's look at two examples of how this can happen.

Example #1: No encoding parameter in htmlspecialchars

This is a quite common sight in PHP:

echo htmlspecialchars($_GET["query"], ENT_COMPAT | ENT_HTML401);

The problem here is the default behaviour PHP falls back to when there is no encoding specified. From the manual:

If omitted, the default value of the encoding varies depending on the PHP version in use. In PHP 5.6 and later, the default_charset configuration option is used as the default value. PHP 5.4 and 5.5 will use UTF-8 as the default. Earlier versions of PHP use ISO-8859-1.

So what encoding PHP uses depends on your version and configuration. Great. So now all that stands between you and the abyss is someone making an innocent change in php.ini, or perhaps just something as simple as a server upgrade or reinstall. I too like to live dangerously... but not that dangerously.

Note that this example has nothing to do with the browser. Modern or old, it doesn't matter, because it's the server and not the browser that's the problem here.

The solution, of course, is to specify the correct encoding and to make sure that the same encoding is specified in the HTTP Content-Type header of the response:

echo htmlspecialchars($_GET["query"], ENT_COMPAT | ENT_HTML401, "UTF-8");

Example #2: Browser heuristics biting you

This is a problem if your server does not specify what encoding it is using in the response (or if it only does so in a meta tag that is too far down for the browser to care about it). If you do not tell the browser what encoding to use, it will have to guess. Unfortunately, not all browsers are good at that:

If certain strings of user input -- say, +ADw-script+AD4-alert(1)+ADw-/script+AD4- -- are echoed back early enough in the HTML page, Internet Explorer may incorrectly guess that the page is encoded in UTF-7. Suddenly, the otherwise harmless user input becomes active HTML and will execute.

The payload in the quote is <script>alert(1)</script> encoded in UTF-7. A sanitizer working in UTF-8 would see nothing dangerous in that payload and let it through, but the browser that is tricked into working in UTF-7 would still run it.
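
As a rough illustration of that mismatch, here is a minimal Python sketch (Python is used only for demonstration; `html.escape` stands in for a sanitizer working in UTF-8 and is an assumption, not what PHP does internally). The payload passes through escaping unchanged, yet the same bytes turn into a script tag when decoded as UTF-7:

```python
import html

# The UTF-7 payload from the quote above, as the server receives it.
payload = "+ADw-script+AD4-alert(1)+ADw-/script+AD4-"

# A sanitizer that reads the input as ASCII/UTF-8 finds nothing to escape:
# there is no <, >, &, or quote character in it.
escaped = html.escape(payload)
print(escaped == payload)

# A browser that guesses UTF-7 decodes the very same bytes into live markup.
as_utf7 = escaped.encode("ascii").decode("utf-7")
print(as_utf7)
```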

My understanding is that it is mostly old versions of IE where this is a problem. But I am not sure, so I would be happy to see another answer where it is clarified.

EDIT: See Xavier59's answer for a situation where it works on modern browsers.

The solution

What you need to do on the server is simple in theory. You need to make sure that the following is always true:

The character encoding of the response is correctly set in the HTTP headers.
The XSS filter is working in the same encoding as specified above.

In practice, it is surprisingly easy to get that wrong.
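
One way to reduce the room for error is to keep a single charset constant that drives input decoding, the response header, and the output encoding. The following is only a sketch under assumptions: it is written in Python rather than PHP, and `CHARSET` and `render` are hypothetical names made up for this illustration.

```python
import html

CHARSET = "utf-8"  # single source of truth for decoding, header, and body


def render(raw_query: bytes) -> tuple[dict[str, str], bytes]:
    """Return (headers, body) with one charset used throughout."""
    # Decode strictly: malformed input raises UnicodeDecodeError instead of
    # being passed through for the browser to reinterpret.
    query = raw_query.decode(CHARSET, errors="strict")
    headers = {"Content-Type": f"text/html; charset={CHARSET}"}
    body = f"<p>You searched for: {html.escape(query)}</p>"
    return headers, body.encode(CHARSET)


headers, body = render(b"<script>alert(1)</script>")
print(headers["Content-Type"])
print(body.decode(CHARSET))
```

Because the escaping and the declared charset both come from the same constant, a configuration change cannot silently make them disagree.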

Xavier59 · Answer 2 · 2018-03-01T16:22:29.950

This comes as an addition to Anders' answer (which is great btw).

My understanding is that it is mostly old versions of IE where this is a problem. But I do not have a source for that, and I am not sure, so I would be happy to see another answer where it is clarified.

Yes, this affects modern browsers.

Let's take the following sanitization :

<?php
    header('Content-Type: text/html;charset=utf-8');
    echo preg_replace('/<\w+/', '', $_GET['name']).", can you p0wn it ?"
?>

This might not seem vulnerable because :

< followed by one or more word characters is removed, so an attacker cannot open a new tag.
The Content-Type header is correctly set to UTF-8.

Now imagine that we send %00%3C%00. The regex will not match, because < (%3C) is followed not by a word character (as defined by \w) but by %00 (the null byte). In UTF-8, the reflected input will not execute anything, but if we can find a way to get it reflected in UTF-16 ...

Here is what we can read from the W3C:

If you have a UTF-8 byte-order mark (BOM) at the start of your file then recent browser versions other than Internet Explorer 10 or 11 will use that to determine that the encoding of your page is UTF-8. It has a higher precedence than any other declaration, including the HTTP header.

You could skip the meta encoding declaration if you have a BOM, but we recommend that you keep it, since it helps people looking at the source code to ascertain what the encoding of the page is.

The BOM in UTF-16 is the Unicode character U+FEFF (the different BOM encodings are best described on Wikipedia). Because our input is reflected at the very beginning of the document, we can switch the charset to UTF-16 and get our code to execute.

Complete payload :

%FE%FF%00%3C%00s%00c%00r%00i%00p%00t%00%3E%00a%00l%00e%00r%00t%00(%00%22%00P%000%00w%00n%00e%00d%00%22%00)%00;%00%3C%00/%00s%00c%00r%00i%00p%00t%00%3E
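
To see why this payload slips through, here is a small Python sketch (Python only for illustration; the byte-level `re.sub` is an assumed stand-in for the PHP `preg_replace` above, not a reproduction of PCRE). It checks that the filter removes nothing and shows what the bytes become once they are read as UTF-16:

```python
import re
from urllib.parse import unquote_to_bytes

payload = ("%FE%FF%00%3C%00s%00c%00r%00i%00p%00t%00%3E%00a%00l%00e%00r%00t"
           "%00(%00%22%00P%000%00w%00n%00e%00d%00%22%00)%00;"
           "%00%3C%00/%00s%00c%00r%00i%00p%00t%00%3E")

raw = unquote_to_bytes(payload)

# Byte-level equivalent of the '<\w+' filter: every '<' (0x3C) is followed
# by 0x00, never by a word character, so nothing is removed.
filtered = re.sub(rb"<\w+", b"", raw)
print(filtered == raw)

# The leading FE FF is the UTF-16BE byte-order mark; the generic "utf-16"
# codec honours it and strips it from the result.
decoded = raw.decode("utf-16")
print(decoded)

# The payload can be rebuilt from the plain script: BOM first, then the text.
script = '<script>alert("P0wned");</script>'
rebuilt_be = ("\ufeff" + script).encode("utf-16-be")  # FE FF 00 3C ...
rebuilt_le = ("\ufeff" + script).encode("utf-16-le")  # FF FE 3C 00 ...
print(rebuilt_be == raw)
```

In the real page the trailing ", can you p0wn it ?" text is also decoded as UTF-16 and turns into unrelated characters, which does not stop the script from running.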

Here is a POC I made. Most XSS auditors will not fall for it, but Firefox will, since its auditor is disabled by default. (Tested on Firefox Nightly 60.0a1, the latest version as of today.)

However, htmlspecialchars and htmlentities will not fall for it. Nonetheless, this shows that there are always tricky edge cases around the corner !

Other attacks on encodings include character mapping, which is also still relevant today.

Thank you very much :) It seems a bit advanced for my knowledge, but thanks a lot!!! Also, why does the payload start with %FE%FF and not %FF%FE as in Little Endian? All modern x86 systems are Little Endian, correct? — XII, Mar 01 '18 at 07:04
You can either start the payload with `%FE%FF`, which is the indicator for `UTF-16BE` (Big Endian), or with `%FF%FE` for `UTF-16LE` (Little Endian). If you choose to start the payload with `%FF%FE`, it also works, but you will have to encode the characters the other way around, e.g. `%3C%00` instead of `%00%3C` — Xavier59, Mar 01 '18 at 15:41

score 1 · Answer 3 · answered Feb 28 '18 at 08:27

From OWASP XSS page:

"Cross-Site Scripting attacks are a type of injection problem, in which malicious scripts are injected into the otherwise benign and trusted web sites. Cross-site scripting (XSS) attacks occur when an attacker uses a web application to send malicious code, generally in the form of a browser side script, to a different end user. Flaws that allow these attacks to succeed are quite widespread and occur anywhere a web application uses input from a user in the output it generates without validating or encoding it.

An attacker can use XSS to send a malicious script to an unsuspecting user. The end user’s browser has no way to know that the script should not be trusted, and will execute the script. Because it thinks the script came from a trusted source, the malicious script can access any cookies, session tokens, or other sensitive information retained by your browser and used with that site. These scripts can even rewrite the content of the HTML page."

This is an example of bad coding practices where you don't sanitize user's input.

Let's imagine you're a web developer and you create this file in your website (name.php):

<form action="" method="GET">
  What is your name: <input type="text" name="username"><br>
  <input type="submit" value="Submit">
</form>

<?php
  print("Entered name is: ".$_GET["username"]);
?>

When opening this page on your browser you're going to see something like this:

Let's enter a name and see how this simple file behaves. Since we're using the GET method, the submitted data is visible in the URL:

But what happens if someone tries to inject some HTML code into this input box, something like

<marquee><h1>Andrew ng</h1></marquee>

See the results in the image below:

The user's input was rendered as if it were part of the original source code of the file.
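
The difference between echoing the input raw and escaping it first can be sketched in a few lines of Python (used here only for illustration; `html.escape` is assumed as a rough counterpart to PHP's htmlspecialchars, not an exact equivalent):

```python
import html

name = "<marquee><h1>Andrew ng</h1></marquee>"

# Raw concatenation, as in name.php: the markup reaches the browser intact.
unsafe = "Entered name is: " + name

# Escaped output: the angle brackets become entities and render as text.
safe = "Entered name is: " + html.escape(name)

print(unsafe)
print(safe)
```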

Now let's try the same thing with JavaScript code and see what happens. We will test two XSS injections in the browsers:

<h1>Andrew</h1><script>alert("XSS");</script>

<META HTTP-EQUIV="refresh" CONTENT="0;url=data:text/html;base64,PHNjcmlwdD5hbGVydCgndGVzdDMnKTwvc2NyaXB0Pg">
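
The second payload hides its script inside a base64 data: URL. A short Python sketch (for illustration only) decodes it; the string in the payload omits the trailing `=` padding, so the sketch adds it back before decoding:

```python
import base64

b64 = "PHNjcmlwdD5hbGVydCgndGVzdDMnKTwvc2NyaXB0Pg"

# b64decode requires the length to be a multiple of 4, so restore padding.
padded = b64 + "=" * (-len(b64) % 4)

decoded = base64.b64decode(padded).decode("ascii")
print(decoded)
```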

In both cases Google Chrome blocked the execution of this script:

But in Mozilla Firefox both scripts are executed successfully:

Hope this can give you a better understanding of XSS and the current situation with modern browsers, this was tested on:

Google Chrome 64.0.3282.119 (Official Build) (64-bit)
Mozilla Firefox Quantum 58.0 (64-bit)

About htmlspecialchars() function you can find more info here.

Another example of XSS that could be of interest to you is this one on my blog.

Hope it helps.

Thanks for the answer, but what about encoding the payload? I read that most sites enforce a UTF-8 encoding in the response headers. What happens if the web app escapes, for instance, < to &lt; and also sets the charset to UTF-8? Are there any attack vectors for that case (for input reflected outside of tags, not inside the value attribute of a tag)? — XII, Feb 28 '18 at 08:40
This is a great explanation of XSS, but it does not address charset encoding issues, which is what the question is about. — Anders, Feb 28 '18 at 10:43
Thanks for the comment @Anders I forgot that point, but you already answered it. =) — galoget, Mar 03 '18 at 00:01