While reading about XSS and its countermeasures at http://www.xssed.com/xssinfo#Avoiding_XSS_vulnerabilities , I came across this statement (in the second-to-last paragraph of the link):

[…] support for Unicode character sets by browsers could leave an application open to XSS attacks if the HTML quoting algorithms only look for known-bad characters.

So, how does one exploit a page which only looks for known-bad Unicode characters? An example would be highly appreciated.

Gumbo
Karan
  • I highly recommend reading the [OWASP XSS Filter Evasion Cheatsheet](https://www.owasp.org/index.php/XSS_Filter_Evasion_Cheat_Sheet#Character_Encoding) – Dr.Ü May 11 '13 at 10:27
  • @Dr.Ü Thanks. This answers my question directly. Upvoted! – Karan May 12 '13 at 14:33

1 Answer

A problem can arise if your filter assumes the input is in a different encoding than the one the browser ends up using. (If you don't tell the browser which encoding to use, most browsers try to guess it.)

This problem has, for example, hit Google's 404 page.
The exploit relied on the fact that IE guesses the encoding of a page as UTF-7 if it finds a valid UTF-7 sequence in the first 4096 bytes of the response. Because Google's 404 page is so small, an attacker could force this.
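
To illustrate why the encoding mismatch matters, here is a minimal Python sketch. The function `naive_blacklist_escape` is a hypothetical stand-in for a filter that only looks for known-bad characters (it is not Google's actual code), and the payload is simply a script tag spelled in UTF-7.

```python
def naive_blacklist_escape(data: bytes) -> bytes:
    """Hypothetical filter that escapes only the known-bad ASCII bytes."""
    # "&" must be handled first so the entities added below are not re-escaped.
    for bad, safe in ((b"&", b"&amp;"), (b"<", b"&lt;"),
                      (b">", b"&gt;"), (b'"', b"&quot;")):
        data = data.replace(bad, safe)
    return data


# UTF-7 spelling of a script tag: "+ADw-" is "<" and "+AD4-" is ">".
payload = b"+ADw-script+AD4-alert(1)+ADw-/script+AD4-"

# The filter finds none of its known-bad bytes, so the payload passes unchanged.
filtered = naive_blacklist_escape(payload)

# What the server assumed the browser would see (harmless text) ...
as_utf8 = filtered.decode("utf-8")
# ... versus what a browser that guessed UTF-7 would actually parse.
as_utf7 = filtered.decode("utf-7")

print(as_utf8)
print(as_utf7)
```

Read as UTF-8 the output is inert text, but read as UTF-7 the same bytes become `<script>alert(1)</script>`. The filter never saw a `<` or `>` because, in the encoding it assumed, there were none.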

So always tell the browser which encoding you use. Also, whitelisting is better than blacklisting.
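
As a rough sketch of both pieces of advice in Python: the header helper and the allowed character set below are assumptions chosen for illustration (the names `html_headers` and `is_allowed` are hypothetical), not a recommendation for any particular application.

```python
import re

# Hypothetical whitelist: ASCII letters, digits, and a few punctuation marks.
# Note that "+" (the UTF-7 shift character) is deliberately not allowed.
ALLOWED = re.compile(r"[A-Za-z0-9 _.,/-]*")


def is_allowed(value: str) -> bool:
    """Accept the value only if every character is on the whitelist."""
    return ALLOWED.fullmatch(value) is not None


def html_headers() -> list:
    """Declare the charset explicitly so the browser has nothing to guess."""
    return [("Content-Type", "text/html; charset=utf-8")]


print(html_headers())
print(is_allowed("some/missing-page.html"))
print(is_allowed("+ADw-script+AD4-alert(1)+ADw-/script+AD4-"))
```

The whitelist rejects the UTF-7 payload without knowing anything about UTF-7, simply because `+`, `(` and `)` were never on the list. A blacklist would have needed to anticipate that encoding in advance.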

Johannes Kuhn
  • I read through your link about Google's XSS based on UTF-7. So if the server does not explicitly specify the encoding it is using and the browser guesses it incorrectly, I think it will not be able to decode the text received from the server properly. In what way does this difference in encoding lead to an XSS? From the example given in the link, all of the text is decoded properly. – Karan May 12 '13 at 12:13
  • @user85030 Because it can decode the attack ` – Johannes Kuhn May 12 '13 at 12:15
  • @user85030 See the [wiki page for UTF-7](http://code.google.com/p/doctype-mirror/wiki/ArticleUtf7). Another problem is that an attacker could send an invalid sequence (`\700\600` instead of `\000`) that another component interprets as the NULL character. Oh, I forgot: many characters can be encoded with different sequences, for example umlauts: as a single code point, or as `a` with two dots over it (2 code points). – Johannes Kuhn May 12 '13 at 12:30
  • Thanks for the quick response. So, in short, if the web server fails to specify an encoding scheme, an attacker can encode a malicious script in UTF-7 and force the browser to use UTF-7 while accessing Google's vulnerable page. Is this correct? What happens when a web server specifies UTF-8 and the browser is forcibly set to use UTF-7? – Karan May 12 '13 at 13:39
  • Setting up the browser that way is a bad idea, but I'd say it is nothing you can prevent (and it is very hard, if not impossible, for an attacker to force it). – Johannes Kuhn May 12 '13 at 13:49
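
The two other pitfalls raised in the comments above (invalid sequences that decode to NULL, and characters with more than one representation) can be sketched in Python as well. The bytes `\xC0\x80` are the well-known overlong two-byte form of NUL and are used here as an assumed illustration; they are not necessarily the exact sequence the comment had in mind. The helper name `is_strict_utf8` is hypothetical.

```python
import unicodedata


def is_strict_utf8(data: bytes) -> bool:
    """Return True only if the bytes are well-formed UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Overlong encoding of NUL: a lenient decoder may map it to "\x00",
# while a strict decoder (such as Python's) rejects it outright.
overlong_nul = b"\xc0\x80"
print(is_strict_utf8(overlong_nul))
print(is_strict_utf8(b"\x00"))

# The same umlaut written two ways.
composed = "\u00e4"      # "ä" as a single code point
decomposed = "a\u0308"   # "a" followed by COMBINING DIAERESIS

# A filter comparing raw code points treats these as different strings ...
print(composed == decomposed)
# ... unless both sides are normalized to the same form first.
print(unicodedata.normalize("NFC", decomposed) == composed)
```

Rejecting malformed input and normalizing before filtering means the filter and the consumer are looking at the same characters, which is the point of the original answer.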