11

Short Question:

Question: Could any security vulnerabilities arise if a server runs htmlentities as UTF-8 but the client views the results as ISO-8859-1?

Assumption: No vulnerabilities exist when one consistent charset is used


Detailed Question:

Question: Could any security vulnerabilities arise if the server runs htmlentities on an ISO-8859-1 string as if it were UTF-8, and the client then interprets the result as ISO-8859-1?

(e.g. $results = htmlentities($iso_8859_1_string, ENT_QUOTES, "UTF-8");)

Assume everything is coded so that no vulnerabilities arise when a single character encoding is used consistently. (Ignore the case where $results is an empty string.)

If $iso_8859_1_string can contain any value, htmlentities would treat it either as invalid UTF-8 (and return "") or as valid UTF-8. For valid UTF-8, the sequences would be escaped as expected, but how would a client interpreting the result as ISO-8859-1 view it? Characters in the 0-127 range would be escaped as expected (same as US-ASCII), and some characters would resolve into HTML entities and could display as expected. Are there valid UTF-8 characters in the 128+ range that do not resolve to HTML entities? Would the client just see garbled text and symbols, but no characters that would cause the web browser to execute code or switch into a code execution context (e.g. no tag characters such as '<' or '>')? (Assume $results is placed in a "content context", not in an "attribute value" or a "script" body.)
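
The scenario above can be modeled at the byte level. The following is a minimal Python sketch, not PHP: `html.escape` is only a stand-in for `htmlentities` (it escapes the five markup characters but does not produce named entities such as `&eacute;`), and the helper names `escape_as_utf8` and `view_as_latin1` are hypothetical. It assumes, as the question does, that invalid UTF-8 input yields an empty string.

```python
import html

def escape_as_utf8(raw: bytes) -> bytes:
    """Hypothetical stand-in for htmlentities($raw, ENT_QUOTES, "UTF-8").

    Assumption: invalid UTF-8 input produces an empty result.
    """
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return b""
    return html.escape(text, quote=True).encode("utf-8")

def view_as_latin1(data: bytes) -> str:
    """Model a client that decodes the response as ISO-8859-1."""
    return data.decode("iso-8859-1")

# Characters that could open a tag or close an attribute value.
MARKUP = set("<>\"'")

samples = [
    b"<b>caf\xe9</b>",         # lone 0xE9 is ISO-8859-1 'e acute' but invalid UTF-8
    b"<i>\xc3\xa9</i>",        # two ISO-8859-1 characters that also form valid UTF-8
    b"\"quoted\" & 'single'",  # pure ASCII input
]
rendered = [view_as_latin1(escape_as_utf8(s)) for s in samples]

# ISO-8859-1 maps every byte to the code point with the same number,
# so a byte >= 0x80 can never be displayed as an ASCII markup character.
latin1_is_identity = all(
    ord(bytes([b]).decode("iso-8859-1")) == b for b in range(256)
)
```

In this model the second sample is displayed as garbled text between escaped tags, which matches the "usability but not security" intuition for a content context.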

Is this the right line of thinking?


Note: I believe I've already worked out the reverse case (i.e. the server runs htmlentities on a UTF-8 string as ISO-8859-1 and the client interprets the result as UTF-8)

(e.g. htmlentities($utf8_string, ENT_QUOTES, "ISO-8859-1"))

Answer: My guess is no security vulnerability on the client (for htmlentities as ISO -> client reads as UTF-8) because:

  • In ISO-8859-1, characters in the range:

    • 0-127 (US-ASCII): are encoded exactly the same way in UTF-8,
    • 160-255: would all be encoded as HTML entities,
    • 128-159: this is the only range left. According to Wikipedia's description of UTF-8, http://en.wikipedia.org/wiki/UTF-8#Description, every UTF-8 byte in the 128+ range belongs to a "multi-byte sequence", which consists of a "leading byte" that is always 192 or higher followed by "continuation bytes" in the 128-191 range. Thus htmlentities($utf8_string, ENT_QUOTES, "ISO-8859-1") could not output any of the "leading bytes" UTF-8 needs to form a valid multi-byte sequence. Any bytes in this range would therefore appear in UTF-8 as a ? (i.e. an invalid character), because no "leading byte" precedes them.
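
The byte-range argument above can be checked with a short Python sketch. It takes the list's premise as given (bytes 160-255 have already been turned into entities, so only 128-159 remain as non-ASCII output) and uses Python's UTF-8 decoder as a model of how a client would treat those bytes; the helper names are hypothetical.

```python
# The only non-ASCII bytes left in the output, per the argument above.
leftover = bytes(range(0x80, 0xA0))  # 128-159

def is_continuation(b: int) -> bool:
    """UTF-8 continuation bytes have the bit pattern 10xxxxxx."""
    return 0x80 <= b <= 0xBF

def is_lead(b: int) -> bool:
    """Lead bytes of valid multi-byte UTF-8 sequences."""
    return 0xC2 <= b <= 0xF4

all_continuation = all(is_continuation(b) for b in leftover)
any_lead = any(is_lead(b) for b in leftover)

# With no lead byte available, a UTF-8 decoder cannot build a character
# from these bytes; with errors="replace" they become U+FFFD.
decoded = leftover.decode("utf-8", errors="replace")
only_replacement_chars = set(decoded) == {"\ufffd"}
```

A browser typically shows U+FFFD as a replacement glyph, which is the "?" described in the last bullet.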

I think this solves my question for the other direction.


Real-world situation: A PHP 5.3.x server with security backports uses ISO-8859-1 as the default encoding. Starting with PHP 5.4, UTF-8 is the default encoding (see http://php.net/htmlentities). I want to determine whether the code works properly in either an all-UTF-8 or an all-ISO-8859-1 environment, and to ensure there are no automatic security holes caused by an encoding mistake or mismatch.

I feel like I can rest assured that only usability is affected, but not security in these specific cases.

dajon
  • The only thing I can find is http://zaynar.co.uk/docs/charset-encoding-xss.html but even then I don't think it is related to your situation. –  Apr 06 '14 at 19:26

2 Answers

5

As far as I'm aware, there's no security issue.

The "dangerous" characters in HTML (less-than, greater-than, ampersand, single quote, double quote) all have identical byte values under UTF-8 and ISO-8859-1 (and virtually every other encoding you're likely to encounter, with the exceptions of UTF-16, UTF-32, and EBCDIC). As a result, escaping them in one encoding will escape them in the other encoding as well.

The reason this holds true is that the vast majority of character encodings, including UTF-8 and ISO-8859-1, are "ASCII plus additional characters", and the structure of an HTML document only uses characters in the ASCII portion of the encoding.
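
As a rough illustration of this point, the Python sketch below compares the byte values of the five characters across encodings. The codec names are Python's; `cp037` is used here as one example of an EBCDIC code page, which is an assumption about which EBCDIC variant is meant.

```python
DANGEROUS = "<>&'\""

# ASCII-compatible encodings: the five characters keep their ASCII bytes.
same_in_utf8_and_latin1 = all(
    ch.encode("utf-8") == ch.encode("iso-8859-1") == ch.encode("ascii")
    for ch in DANGEROUS
)

# The exceptions named above use different byte values or widths.
differs_in_utf16 = all(
    ch.encode("utf-16-le") != ch.encode("ascii") for ch in DANGEROUS
)
differs_in_utf32 = all(
    ch.encode("utf-32-le") != ch.encode("ascii") for ch in DANGEROUS
)
differs_in_ebcdic = all(
    ch.encode("cp037") != ch.encode("ascii") for ch in DANGEROUS
)

# In UTF-8, every byte of a non-ASCII character is >= 0x80, so a
# multi-byte sequence can never contain the byte of '<' or '>'.
# (Checked here for the two-byte range U+0080..U+07FF only.)
no_ascii_bytes_in_multibyte = all(
    b >= 0x80
    for cp in range(0x80, 0x800)
    for b in chr(cp).encode("utf-8")
)
```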

Mark
  • So the only real question is "are there any dangerous non-ASCII characters": if the answer is "no", then any encoding which encodes ASCII verbatim (e.g. the ones you mention) is safe. This question is answered briefly in the last sentence of this answer; is there any need for expanding? – bzlm Jul 16 '14 at 21:04
-2

As far as I know, as long as your PHP scripts (i.e. forms) filter input with htmlspecialchars() and strip things like odd symbols and backslashes, there wouldn't be a security risk, at least from my perspective.

Forcing the client to use a specific charset is an option for us paranoid people, though, along with the basic measures I just named.

Lighty
  • 1
    You're forgetting attacks such as [XSS with UTF-7](http://nedbatchelder.com/blog/200704/xss_with_utf7.html) where sequences such as `+ADw-script+AD4-` will bypass filters and encoding routines that work in UTF-8 and will be rendered as `<script>` – SilverlightFox Apr 23 '14 at 13:07
  • That's why I recommend basic protection against these attacks by filtering those characters out, since that stuff is only lethal on the server side, in PHP – Lighty Apr 23 '14 at 13:25
  • 2
    That's my point, they won't be filtered out if encoded by UTF-7. XSS is client side. – SilverlightFox Apr 23 '14 at 13:28
  • It doesn't get filtered out client-side? Gotta dig into that one :3 – Lighty Apr 23 '14 at 13:30
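
The UTF-7 point raised in these comments can be sketched in Python. This is an illustration only: `html.escape` stands in for an ASCII/UTF-8 oriented escaping routine such as htmlspecialchars(), and Python's `utf-7` codec models a client that interprets the response as UTF-7.

```python
import html

payload = "+ADw-script+AD4-"

# The payload contains none of < > & ' ", so escaping leaves it untouched.
escaped = html.escape(payload, quote=True)
unchanged_by_escaping = escaped == payload

# A client that decodes the same bytes as UTF-7 sees a tag.
as_utf7 = escaped.encode("ascii").decode("utf-7")
```

Here `as_utf7` is the string `<script>`, which is why server-side escaping alone does not help if the client can be made to pick UTF-7, and why declaring the charset explicitly matters.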