Short Question:
Question: Could any security vulnerabilities arise if a server runs htmlentities as UTF-8 but the client views the results as ISO-8859-1?
Assumption: No vulnerabilities exist when one consistent charset is used
Detailed Question:
Question: Could any security vulnerabilities arise if the server runs htmlentities on an ISO-8859-1 string using the UTF-8 charset (and the client interprets the result as ISO-8859-1)?
(e.g. $results = htmlentities($iso_8859_1_string, ENT_QUOTES, "UTF-8");)
Assume everything is coded so that no vulnerabilities arise when a single character encoding is used consistently. (Ignore the case where $results is an empty string.)
If $iso_8859_1_string could contain any byte values, htmlentities would treat it either as invalid UTF-8 (and return "") or as valid UTF-8. For valid UTF-8, the sequences would be escaped as expected, but how would the result look on a client that interprets it as ISO-8859-1? Characters in the 0-127 range would be escaped as expected (same as US-ASCII), and some multi-byte characters would be converted to HTML entities and displayed as expected.
Are there valid UTF-8 characters in the 128+ range that are not converted to HTML entities? For those, would the client just see garbled text and symbols, but no characters that could cause the web browser to execute code or switch into a code execution context (e.g. no tag characters such as '<' or '>')? (Assume $results is placed in a "content context", not in an "attribute value" or a "script" body.)
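A minimal sketch of the cases described above, to make the byte-level behaviour concrete. The sample strings are hypothetical, and the results assume the flags are exactly ENT_QUOTES (no ENT_IGNORE or ENT_SUBSTITUTE) with the default HTML 4.01 entity table:

```php
<?php
// Each sample is a raw ISO-8859-1 byte string (hypothetical inputs).
$lone_e_acute = "caf\xE9";   // "café" in ISO-8859-1; a lone 0xE9 is invalid UTF-8
$a_tilde_copy = "\xC3\xA9";  // "Ã©" in ISO-8859-1; happens to be valid UTF-8 for "é"
$a_uml_c1     = "\xC4\x80";  // "Ä" + a C1 control byte; valid UTF-8 for U+0100

// Invalid UTF-8: htmlentities gives up and returns an empty string.
$r1 = htmlentities($lone_e_acute, ENT_QUOTES, "UTF-8");

// Valid UTF-8 with a named entity: the two bytes collapse into one entity.
$r2 = htmlentities($a_tilde_copy, ENT_QUOTES, "UTF-8");

// Valid UTF-8 without an HTML 4.01 named entity: the bytes pass through unchanged.
$r3 = htmlentities($a_uml_c1, ENT_QUOTES, "UTF-8");

// Markup characters are single bytes below 0x80, so they are escaped either way.
$r4 = htmlentities("<b>\xC3\xA9</b>", ENT_QUOTES, "UTF-8");

var_dump($r1, $r2, bin2hex($r3), $r4);
```

The case that matters for the question is $r3: the bytes that pass through unescaped all belong to a multi-byte UTF-8 sequence, and every byte of such a sequence is 128 or higher. Read as ISO-8859-1, they can only turn into accented letters, symbols, or control characters, never into '<', '>', '&' or a quote.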
Is this the right line of thinking?
Note: I believe I've already worked out the vice versa case (i.e. if the server htmlentities a UTF-8 string as ISO-8859-1 and the client interprets the result as UTF-8)
(e.g. htmlentities($utf8_string, ENT_QUOTES, "ISO-8859-1");)
Answer: My guess is no security vulnerability on the client (for htmlentities as ISO -> client reads as UTF-8) because:
In ISO-8859-1, characters in the range :
- 0-127 (US-ASCII): are encoded exactly the same way in UTF-8,
- 160 -> 255 in ISO-8859-1 would all be encoded as HTML entities,
- leaving just the 128-159 character range..., but according to Wikipedia's UTF-8 specification, http://en.wikipedia.org/wiki/UTF-8#Description, all UTF-8 bytes that are in the 128+ range are all part of "multi-byte sequences" which comprise a "leading byte" which is always 192 or higher, and "continuation bytes" in the 128+ range. Thus, the
htmlentities($utf8_string, ENT_QUOTES, "ISO-8859-1")
could not output any "leading bytes" needed by UTF-8 to generate valid multi-byte sequences. So any characters in this range would appear in UTF-8 as a ? (i.e. an invalid character) due to not seeing any "leading byte".
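The reasoning above can be checked with a short sketch. The input string is a hypothetical sample (a euro sign wrapped in angle brackets), and preg_match with the /u modifier is used here only as a convenient UTF-8 validity test:

```php
<?php
// The euro sign in UTF-8 is the three bytes E2 82 AC.
$utf8_string = "<\xE2\x82\xAC>";

$out = htmlentities($utf8_string, ENT_QUOTES, "ISO-8859-1");

// Collect every non-ASCII byte that survived unescaped (byte mode, no /u).
preg_match_all('/[\x80-\xFF]/', $out, $m);
$raw_high_bytes = array_map('bin2hex', $m[0]);

// With /u, preg_match returns false when the subject is not valid UTF-8.
$is_valid_utf8 = (preg_match('//u', $out) === 1);

var_dump($out, $raw_high_bytes, $is_valid_utf8);
```

Here the leading byte E2 and the trailing byte AC both fall in the 160-255 range and become entities, while 82 is left as a stray continuation byte. The output is therefore not valid UTF-8, but the '<' and '>' are still escaped.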
I think this solves my question for the other direction.
Real-world situation: A PHP 5.3.x server with security backports uses ISO-8859-1 as the default encoding for htmlentities. Starting with PHP 5.4, the default is UTF-8. http://php.net/htmlentities. I want to determine whether the code works properly in either an all-UTF-8 or an all-ISO-8859-1 environment, and to make sure no security holes are introduced automatically by an encoding mistake or mismatch.
I feel like I can rest assured that only usability is affected, not security, in these specific cases.