Are there any security bugs with UTF-8?

Question

I have just recently decided to allow all characters on my website. Are there any common security bugs that I need to deal with? Are there any ways to "inject" using UTF-8? Is it safe to allow users to use passwords with non-English characters? And can PHP's bcrypt handle hashing them?

edit: I have no idea what I'm doing when it comes to things like character sets.

Yes, it can be used for bypassing many things, especially for XSS attacks. Read [this article](https://www.blackhat.com/presentations/bh-usa-09/WEBER/BHUSA09-Weber-UnicodeSecurityPreview-PAPER.pdf); it may help you a bit. And I recommend that you not use Unicode for passwords. — , Aug 12 '16 at 20:39
[IIS Unicode exploit](https://www.giac.org/paper/gcih/115/iis-unicode-exploit/101163) is a very well known historical one — paj28, Aug 12 '16 at 21:12
@FarazX The reason to disallow some UTF-8 characters in passwords is a human usability one (e.g., reliability of being able to enter certain codepoints across multiple devices) and has little to do with security. — Stephen Touset, Aug 16 '16 at 21:17

Macil · Answer 1 · 2016-08-16T21:06:43.740

The security issues inherent in adding Unicode support (not specific to UTF-8) come mainly from the increased potential for visual spoofing and from normalization mismatches.

Visual spoofing: say you have a forum with a user named "admin" that everyone knows to trust. Someone else could register a user account named "аdmin" (the first letter is the Cyrillic letter a), and trick others into thinking they were the site admin. This is mostly a social engineering technique: it's unlikely that any software will mix up the users. (This specific example could be partially addressed by having the site add special formatting or flair near the admin's name, making profile names be links to profile pages which show the user's activity history and join date, etc., so users could identify others in ways besides their visible, forgeable name. This is a more general issue that isn't exclusive to Unicode support: users could also give themselves other misleading names like "<site> Support", "admin " with a space, "admim", etc.)
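As a rough illustration of the spoofed name above, here is a minimal Python sketch that flags usernames mixing letters from more than one script. The helper `scripts_used` is hypothetical, and taking the first word of each character's Unicode name is only an approximation of its script, not a full confusables check.

```python
import unicodedata

def scripts_used(name: str) -> set:
    """Approximate the scripts in a name by the first word of each letter's Unicode name."""
    scripts = set()
    for ch in name:
        if ch.isalpha():
            # e.g. unicodedata.name("a") is "LATIN SMALL LETTER A"
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts

latin_admin = "admin"
spoofed_admin = "\u0430dmin"  # first letter is CYRILLIC SMALL LETTER A

# The two names render almost identically but are different strings.
print(latin_admin == spoofed_admin)          # False
print(scripts_used(latin_admin))             # only LATIN
print(len(scripts_used(spoofed_admin)) > 1)  # True: mixes CYRILLIC and LATIN
```

A check like this could hold mixed-script names for review, but it would not catch "admim" or "admin " with a trailing space, so it complements the flair and profile-link ideas rather than replacing them.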

Normalization: certain characters like "ö" can be represented in multiple ways. It could either be the single character U+00F6 (LATIN SMALL LETTER O WITH DIAERESIS), or the two characters U+006F U+0308 (LATIN SMALL LETTER O + COMBINING DIAERESIS). Normalization is the process of converting all text to either the composed or the decomposed form. If you consistently never normalize or always normalize, then you won't run into issues. However, if you normalize in some places and not in others, you can have security issues:
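To make the two representations concrete, here is a small Python sketch using the standard `unicodedata` module. It shows that the two spellings of "ö" compare unequal as raw strings and become equal once both are normalized to the same form. The variable names are just for illustration.

```python
import unicodedata

composed = "\u00f6"     # LATIN SMALL LETTER O WITH DIAERESIS
decomposed = "o\u0308"  # LATIN SMALL LETTER O followed by COMBINING DIAERESIS

# Same visible text, different code point sequences.
print(composed == decomposed)          # False
print(len(composed), len(decomposed))  # 1 2

# After normalizing both to one form, they compare equal.
same_nfc = unicodedata.normalize("NFC", decomposed) == composed
same_nfd = unicodedata.normalize("NFD", composed) == decomposed
print(same_nfc, same_nfd)              # True True
```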

For example, OS X normalizes Unicode in filenames. Say you had a website without any normalization-related code running on an OS X server where, whenever a user registered, a file was created with their name, and you used a database without any normalization to keep track of usernames that were already registered in order to prevent names being re-registered. If you had a user named "foö" (using U+00F6), then someone else could register an account named "foö" (U+006F U+0308), and the site would allow it but would overwrite the file created by the first "foö" user. To solve this, you would either need to make your application normalize consistently throughout the whole application, or you would need to check for collisions whenever you cross some boundary that does normalization differently (for example, when a user registers and you need to make a file for them, open the file in exclusive mode so that it will fail if the file already exists, and you can block the new user from being registered).
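Both fixes can be combined in a few lines. The following is only a sketch under assumptions: `register_user` and the one-file-per-user layout are hypothetical, NFC is an arbitrary choice of form, and a real application should not use raw usernames as file paths without further validation.

```python
import os
import tempfile
import unicodedata

def register_user(user_dir: str, username: str) -> bool:
    """Hypothetical registration step: returns False if the name is already taken."""
    # Normalize once at the boundary so every later comparison sees one form.
    canonical = unicodedata.normalize("NFC", username)
    path = os.path.join(user_dir, canonical)
    try:
        # Mode "x" is exclusive creation: it fails if the file already exists.
        with open(path, "x", encoding="utf-8") as f:
            f.write("profile placeholder\n")
    except FileExistsError:
        return False
    return True

with tempfile.TemporaryDirectory() as user_dir:
    first = register_user(user_dir, "fo\u00f6")    # composed form
    second = register_user(user_dir, "foo\u0308")  # decomposed form, same visible name
    print(first, second)                           # True False
```

Even if the normalization step were missing, the exclusive open would still refuse to overwrite an existing file on a filesystem that normalizes names itself.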

score 6 · Answer 2 · edited Mar 17 '17 at 13:14

AgentME's answer describes two important classes of Unicode-related vulnerabilities: visual similarity, and normalization. I won't go over them.

There are also vulnerabilities related to UTF-8 specifically. Some byte sequences are invalid in UTF-8, and some applications don't cope with them well, e.g. they may crash or compute invalid lengths. Invalid byte sequences can also cause havoc in parsers. For example, suppose you have code that doubles all single quotes to stuff them into an SQL query:

"Robert'); DROP TABLE Students;--" → "select * where name = '" + "Robert''); DROP TABLE Students;--" + "'"

(Hopefully this isn't done by application code but by a low-level library… but in the real world, there's far too much code that does this and doesn't always get it right.) Now imagine there's an invalid UTF-8 byte sequence after Robert, e.g. "Robert\200'); etc". The quoting library and the database have to agree whether the ' needs to be doubled in that case, and in practice they don't always agree and you get an SQL injection.
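One way to keep the quoting code and the database from disagreeing is to reject invalid UTF-8 at the input boundary, before anything else looks at the bytes. Here is a minimal Python sketch; the helper name `decode_utf8_or_reject` is made up for illustration, and it relies on Python's strict UTF-8 decoder raising `UnicodeDecodeError` on malformed input.

```python
from typing import Optional

def decode_utf8_or_reject(raw: bytes) -> Optional[str]:
    """Return the decoded text, or None if the bytes are not valid UTF-8."""
    try:
        return raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return None

valid = decode_utf8_or_reject("Rob\u00f6rt".encode("utf-8"))
# "\200" is an octal escape for byte 0x80, a continuation byte with no lead byte.
stray_byte = decode_utf8_or_reject(b"Robert\200'); --")
# 0xC0 0xAF is an overlong two-byte encoding of "/", which strict decoders refuse.
overlong = decode_utf8_or_reject(b"\xc0\xaf")

print(valid)       # Robört
print(stray_byte)  # None
print(overlong)    # None
```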

To be clear, while the problem is *exposed* by allowing UTF-8, the solution is to use parameterized queries; disallowing UTF-8 is not necessarily a correct fix. — Stephen Touset, Aug 16 '16 at 21:13
There's a slightly-subtle reference to this xkcd comic on SQL injection in the answer. https://www.xkcd.com/327/ — Cody P, Aug 17 '16 at 21:24
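For reference, the parameterized-query approach mentioned in the comments looks roughly like this with Python's standard `sqlite3` module. The in-memory database and the `students` table are assumptions for the sake of a runnable example; other database drivers use the same idea with their own placeholder syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT)")
conn.execute("INSERT INTO students (name) VALUES (?)", ("Robert",))

hostile = "Robert'); DROP TABLE students;--"

# The value is passed separately from the SQL text, so it is never parsed as SQL
# and no quote doubling is needed.
rows = conn.execute(
    "SELECT name FROM students WHERE name = ?", (hostile,)
).fetchall()
print(rows)    # []

# The table is still there afterwards.
tables = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
).fetchall()
print(tables)  # [('students',)]
```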
