PHP: if charset mismatches (htmlentities UTF-8) viewed by client as ISO-8859-1 (or vice versa)

Question

Short Question:

Question: Could any security vulnerabilities arise if a server runs htmlentities as UTF-8 but the client views the results as ISO-8859-1?

Assumption: No vulnerabilities exist when one consistent charset is used

Detailed Question:

Question: Could any security vulnerabilities arise if the server runs htmlentities on an ISO-8859-1 string as UTF-8 (and the client interprets the result as ISO-8859-1)?

(e.g. $results = htmlentities($iso_8859_1_string, ENT_QUOTES, "UTF-8");)

Assume everything is coded so that no vulnerabilities arise when a single character encoding is used consistently (and ignore the case where $results is an empty string).

Perhaps, if $iso_8859_1_string could contain any value, the input would be treated either as invalid UTF-8 (and htmlentities would return "") or as valid UTF-8. For valid UTF-8, the sequences would be escaped as expected, but how would the result look to a client interpreting it as ISO-8859-1? Characters in the 0-127 range would be escaped as expected (same as US-ASCII), and some higher characters would resolve into HTML entities and display as expected.

Are there valid UTF-8 characters in the 128+ range that do not resolve to HTML entities? Would the client just see garbled text and symbols, but no characters that would cause the web browser to execute code or switch into a code execution context (e.g. no tag characters such as '<' or '>')? (Assuming $results is put into a "content context", not into an "attribute value" or a "script" body.)

Is this the right line of thinking?
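A minimal sketch of the three cases described above, assuming PHP 5.3 or later and the default HTML 4.01 entity table. The helper names `escape_as_utf8` and `has_raw_markup` are hypothetical and only serve the illustration. The point it tries to show: every byte of a multi-byte UTF-8 sequence is 0x80 or higher, so a client reading the output as ISO-8859-1 sees characters in the 128-255 range, never an ASCII markup character.

```php
<?php
// Hypothetical helper: escape the way the server in the question does.
function escape_as_utf8($bytes)
{
    return htmlentities($bytes, ENT_QUOTES, 'UTF-8');
}

// Hypothetical helper: true if a raw markup-significant byte is left.
// '&' is not checked because escaped output legitimately contains it
// as the start of an entity.
function has_raw_markup($escaped)
{
    return strpbrk($escaped, "<>\"'") !== false;
}

$samples = array(
    'invalid UTF-8 (lone 0xE9)'  => "\xE9<b>",
    'valid, has a named entity'  => "\xC3\xA9<b>", // U+00E9
    'valid, no HTML 4.01 entity' => "\xC4\x80<b>", // U+0100
);

foreach ($samples as $label => $bytes) {
    $out = escape_as_utf8($bytes);
    // bin2hex() makes the surviving high bytes visible on any terminal.
    printf("%-27s hex=%s raw_markup=%s\n",
        $label, bin2hex($out), has_raw_markup($out) ? 'yes' : 'no');
}
```

In the third case the bytes 0xC4 0x80 pass through unescaped. An ISO-8859-1 client would show them as garbled characters, but the `<` next to them is still escaped.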

Note: I believe I've already worked out the vice versa case (i.e. if the server htmlentities a UTF-8 string as ISO-8859-1 and the client interprets the result as UTF-8)

(e.g. htmlentities($utf8_string, ENT_QUOTES, "ISO-8859-1"))

Answer: My guess is no security vulnerability on the client (for htmlentities as ISO -> client reads as UTF-8) because:

In ISO-8859-1, characters in the range:
- 0-127 (US-ASCII) are encoded exactly the same way in UTF-8,
- 160-255 would all be encoded as HTML entities,
- leaving just the 128-159 range. According to Wikipedia's description of UTF-8, http://en.wikipedia.org/wiki/UTF-8#Description, every UTF-8 byte in the 128+ range is part of a multi-byte sequence, which consists of a "leading byte" that is always 192 or higher followed by "continuation bytes" in the 128-191 range. Thus htmlentities($utf8_string, ENT_QUOTES, "ISO-8859-1") cannot output any of the "leading bytes" UTF-8 needs to form a valid multi-byte sequence, so any bytes in this range would appear to a UTF-8 client as a ? (i.e. an invalid character) because no "leading byte" precedes them.
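The byte-range argument above can be checked empirically. This is a sketch, assuming the default HTML 4.01 entity table. It escapes every possible single byte as ISO-8859-1 and records which bytes of 0x80 or higher survive in the output. The helper name `high_bytes_after_latin1_escape` is hypothetical.

```php
<?php
// Hypothetical helper: escape every possible single byte as ISO-8859-1 and
// collect the distinct byte values >= 0x80 that survive in the output.
function high_bytes_after_latin1_escape()
{
    $seen = array();
    for ($byte = 0; $byte <= 255; $byte++) {
        $out = htmlentities(chr($byte), ENT_QUOTES, 'ISO-8859-1');
        $length = strlen($out);
        for ($i = 0; $i < $length; $i++) {
            $value = ord($out[$i]);
            if ($value >= 0x80) {
                $seen[$value] = true;
            }
        }
    }
    $values = array_keys($seen);
    sort($values);
    return $values;
}

$survivors = high_bytes_after_latin1_escape();

// UTF-8 leading bytes start at 0xC0 (192). If the highest surviving byte is
// below that, the output cannot contain a valid multi-byte sequence.
printf("lowest: 0x%02X, highest: 0x%02X\n", min($survivors), max($survivors));
```

If the highest surviving byte is below 0xC0, a UTF-8 client can only see stray continuation bytes, which it renders as invalid characters.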

I think this solves my question for the other direction.

Real-world situation: A PHP 5.3.x server with security backports uses ISO-8859-1 as the default encoding for htmlentities. Starting with PHP 5.4, UTF-8 is the default encoding (see http://php.net/htmlentities). I want to determine whether the code works properly in either an all-UTF-8 or an all-ISO-8859-1 environment, and to ensure there are no automatic security holes caused by encoding mistakes or mismatches.
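One way to keep the behaviour identical across the two PHP versions is to never rely on the default and always pass the charset explicitly. A minimal sketch; the constant `APP_CHARSET` and the wrapper `escape_html` are hypothetical names, and the charset value is an assumption that should match what the pages are actually served as.

```php
<?php
// Hypothetical application-wide setting; set it to the charset the pages
// are actually served in.
define('APP_CHARSET', 'UTF-8');

// Hypothetical wrapper: always passes the charset explicitly, so the result
// does not depend on the default of the PHP version in use.
function escape_html($text)
{
    return htmlentities($text, ENT_QUOTES, APP_CHARSET);
}

echo escape_html('<a href="x">Tom & Jerry</a>'), "\n";
```

Routing every call through one wrapper also means a later charset change is made in a single place.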

I feel like I can rest assured that only usability is affected, but not security in these specific cases.

The only thing I can find is http://zaynar.co.uk/docs/charset-encoding-xss.html but even then I don't think it is relevant to your situation. — , Apr 06 '14 at 19:26

score 5 · Answer 1 · answered Jun 08 '14 at 21:18

As far as I'm aware, there's no security issue.

The "dangerous" characters in HTML (less-than, greater-than, ampersand, single quote, double quote) all have identical byte values under UTF-8 and ISO-8859-1 (and virtually every other encoding you're likely to encounter, with the exceptions of UTF-16, UTF-32, and EBCDIC). As a result, escaping them in one encoding will escape them in the other encoding as well.

The reason this holds true is that the vast majority of character encodings, including UTF-8 and ISO-8859-1, are "ASCII plus additional characters", and the structure of an HTML document only uses characters in the ASCII portion of the encoding.
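A short sketch of that claim: for input made only of ASCII bytes, escaping as ISO-8859-1 and escaping as UTF-8 should give byte-identical output, while a non-ASCII byte can differ. The helper name `escapes_identically` is hypothetical.

```php
<?php
// Hypothetical helper: true if escaping $text gives byte-identical output
// under both charsets.
function escapes_identically($text)
{
    $latin1 = htmlentities($text, ENT_QUOTES, 'ISO-8859-1');
    $utf8   = htmlentities($text, ENT_QUOTES, 'UTF-8');
    return $latin1 === $utf8;
}

// Build a string containing every ASCII byte (0-127).
$ascii = '';
for ($byte = 0; $byte <= 127; $byte++) {
    $ascii .= chr($byte);
}

var_dump(escapes_identically($ascii));  // all-ASCII input
var_dump(escapes_identically("\xE9"));  // a lone Latin-1 byte, invalid as UTF-8

// The five markup-significant characters, escaped under either charset.
echo htmlentities("<>&'\"", ENT_QUOTES, 'UTF-8'), "\n";
```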

So the only real question is "are there any dangerous non-ASCII characters": if the answer is "no", then any encoding which encodes ASCII verbatim (e.g. the ones you mention) is safe. This question is answered briefly in the last sentence of this answer; is there any need for expanding? — bzlm, Jul 16 '14 at 21:04

score -2 · Answer 2 · answered Apr 23 '14 at 11:49

As far as I know, as long as your PHP scripts (i.e. forms) filter input with htmlspecialchars() and strip things like weird symbols and backslashes, there wouldn't be a security risk, at least from my perspective.

Forcing the client to use a specific charset is an option for us paranoid people though, along with the basic stuff I just named.
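Forcing the charset on the client usually means declaring it in the Content-Type response header. A minimal sketch, assuming UTF-8 is the charset the page is really encoded in; `content_type_header` is a hypothetical helper that only builds the header line so it can be inspected.

```php
<?php
// Hypothetical helper: build the header line that tells the browser which
// charset the page is encoded in.
function content_type_header($charset)
{
    return 'Content-Type: text/html; charset=' . $charset;
}

// In a web request this would be sent before any output:
//     header(content_type_header('UTF-8'));
echo content_type_header('UTF-8'), "\n";
```

With an explicit declaration the browser has no reason to guess the encoding from the page content.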

Lighty

You're forgetting attacks such as [XSS with UTF-7](http://nedbatchelder.com/blog/200704/xss_with_utf7.html) where sequences such as `+ADw-script+AD4-` will bypass filters and encoding routines that work in UTF-8 and will be rendered as `<script>`. – SilverlightFox Apr 23 '14 at 13:07

That's why I recommend the basic protection against these attacks by filtering those characters out, since that stuff is only lethal on the server side, in PHP. – Lighty Apr 23 '14 at 13:25

That's my point, they won't be filtered out if encoded by UTF-7. XSS is client side. – SilverlightFox Apr 23 '14 at 13:28

It doesn't get filtered out client side? Gotta dig into that one :3 – Lighty Apr 23 '14 at 13:30
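The UTF-7 point from the comments can be illustrated with a short sketch. The payload string is an illustrative assumption built from the `+ADw-script+AD4-` sequence quoted above. It contains none of the characters htmlspecialchars() escapes, so it passes through unchanged, and it only becomes markup if a browser decodes the page as UTF-7.

```php
<?php
// Illustrative payload built from the sequence quoted in the comment.
$payload = '+ADw-script+AD4-alert(1)+ADw-/script+AD4-';

// None of < > & ' " appear literally, so there is nothing to escape.
$escaped = htmlspecialchars($payload, ENT_QUOTES, 'UTF-8');
var_dump($escaped === $payload);

// In UTF-7, the text between '+' and '-' is unpadded base64 of UTF-16BE.
// The '=' padding is added back here so strict base64_decode() accepts it.
var_dump(base64_decode('ADw=', true) === "\x00<"); // +ADw- stands for '<'
var_dump(base64_decode('AD4=', true) === "\x00>"); // +AD4- stands for '>'
```

This is why the escaping happens on the server but the damage happens in the browser: the bytes are harmless until the client picks the wrong decoder.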
