What are best practices for handling user Unicode in a web application?

Question

Lately, the security community has been asking interesting questions about the surprising side effects of raw Unicode formatting characters in source code. That got me thinking about input validation and display in web apps. Normally, I rely on a template library like React to handle most HTML sanitization, and I usually rely on safe string-handling and delimiting methods that cope with any mess of Unicode you throw at them.

But I've also been reading the lovely Secure by Design and rethinking how I do validation and what steps I could be taking to reduce the likelihood of a Unicode-based attack. So what are the best practices for safely handling Unicode in a program, specifically a web application?

mrdecemberist · Accepted Answer · 2021-11-12T20:30:11.537

This is a very broad topic, and there are a lot of resources on the matter:

Unicode TR36: Unicode Security Considerations goes into great detail about visual and non-visual Unicode attacks and recommendations for dealing with them
Unicode TR39: Unicode Security Mechanisms describes problem-detection mechanisms and acceptance criteria for Unicode strings
The OWASP Input Validation Cheat Sheet has a section on Unicode

I've done a base level of research on the topic and have come up with the following list of recommended base practices. This isn't meant to be comprehensive; there are many, many more listed in the resources above. As always, what you need to do depends on your risk profile. For my purposes, these practices seem to fit the bill.

On input, on the server, and in this order:

If accepting UTF-8, raise an error if the input has any illegal byte sequences or non-shortest-form UTF-8 characters.
If converting from another character set, always emit something for every piece of input; no character sequence should be silently ignored. If nothing else, illegal input should be converted to the U+FFFD Replacement Character.
Input strings should be normalized using NFC (which preserves things like ligatures and superscript numerals) or NFKC (which helpfully decomposes ligatures but also flattens superscripts into ordinary digits) as appropriate for the use case. This ensures that strings that should compare equal do, regardless of which comparison algorithm later code uses.
Then check the text for legal syntax or acceptable characters. Raise an error or insert an obvious replacement character if the text is illegal. Do not simply remove the invalid characters—that concatenates things in the string that were not concatenated before, which may defeat IDS-like protections before this point.
- The Identifier_Type categories from TR39 are useful here, especially Not_Character which includes NULs and other invisible control codes.
  - The Default_Ignorable category includes many invisible characters, including the bidi control characters. Again, be careful that any changes to the text do not alter its meaning to any algorithms that follow.
- You can safely detect and remove a leading U+FEFF Byte-Order Mark.
- Consider checking against one of the TR39 restriction levels. For example, if the text should be in one language only, it should match the Single Script restriction level.
- Accepting or rejecting bidi codes is debatable. Some input texts may need them. If you accept them, you should ensure that any overrides or isolates are closed by the end of the string (this is non-trivial, as a meta discussion discovered) or that you will properly isolate the string when displayed (such as always following the string with a paragraph-separator character like a newline).
If the text represents an identifier, such as a domain name or email address, consider using the TR39 detection mechanisms for confusables and identifier restriction levels. These keep a user from being confused by lookalike characters from other scripts. Using the stronger NFKC normalization form can also help.
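The input steps above could be sketched roughly as follows in TypeScript, assuming a runtime with `TextDecoder` and Unicode property escapes in regular expressions (ES2018 or later). The function names are hypothetical, and the rejection policy (no control characters, noncharacters, or bidi controls) is an assumption that suits a single-line field such as a display name; other fields will need a different policy. The Latin-only test is only a rough stand-in for the TR39 Single Script restriction level, not an implementation of it, and confusable detection is not attempted here.

```typescript
// Hypothetical helper names; the rejection policy below is an assumption
// suited to a single-line field such as a display name.

/** Decode bytes as UTF-8, throwing a TypeError on any ill-formed sequence. */
function decodeStrictUtf8(bytes: Uint8Array): string {
  // fatal: true rejects invalid and non-shortest-form sequences instead of
  // substituting U+FFFD. With the default ignoreBOM: false, one leading
  // U+FEFF byte-order mark is dropped from the output.
  return new TextDecoder("utf-8", { fatal: true }).decode(bytes);
}

// Control characters, noncharacters, and bidi controls (assumed policy).
const DISALLOWED = /[\p{Cc}\p{Noncharacter_Code_Point}\p{Bidi_Control}]/u;

/** Decode, normalize to NFC, then reject (never silently strip) bad input. */
function acceptSingleLineText(bytes: Uint8Array): string {
  const text = decodeStrictUtf8(bytes).normalize("NFC");
  const bad = DISALLOWED.exec(text);
  if (bad !== null) {
    const codePoint = bad[0].codePointAt(0) ?? 0;
    throw new RangeError(
      `disallowed code point U+${codePoint.toString(16).toUpperCase()} at index ${bad.index}`
    );
  }
  return text;
}

/** Rough stand-in for a single-script check: Latin plus shared characters. */
function isLatinOnly(text: string): boolean {
  return /^[\p{Script=Latin}\p{Script=Common}\p{Script=Inherited}]*$/u.test(text);
}

/** Identifiers get the stronger NFKC form before the script check. */
function acceptLatinIdentifier(bytes: Uint8Array): string {
  const id = acceptSingleLineText(bytes).normalize("NFKC");
  if (!isLatinOnly(id)) throw new RangeError("identifier mixes scripts");
  return id;
}
```

Throwing on a disallowed character, rather than deleting it, follows the advice above about not concatenating text that was previously separated.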

During processing:

Always treat Unicode strings as Unicode, and be aware of caveats in your language's implementation.
- For example, the astral plane code point U+1F600 is a string of length two in JavaScript because it is stored as a surrogate pair in UTF-16, and JavaScript operates on the 16-bit code units rather than whole characters. Accessing strings with surrogate pairs by index can lead to unprintable half-characters, and RegExps need to be created with the unicode (u) flag enabled to properly count astral characters.
Everyone should know by now: Never ever ever directly concatenate user text into anything. If you're executing SQL, use SQL parameters. If you're writing CSV or XML, use a CSV or XML writer with proper quoting. (etc. etc.) Don't let delimiters in your file/command format get confused with characters in user-controlled text.
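The JavaScript surrogate-pair caveat mentioned above is easy to see in a few lines of TypeScript. This is only an illustration; `truncateCodePoints` is a hypothetical helper, and it works per code point, so it can still split a multi-code-point grapheme such as an emoji with a skin-tone modifier.

```typescript
const grin = "\u{1F600}"; // one astral code point, stored as two UTF-16 code units

const codeUnits = grin.length;              // counts UTF-16 code units
const codePoints = Array.from(grin).length; // string iteration goes by code point

// Indexing by code unit returns only half of the surrogate pair.
const half = grin.charAt(0);

// "." matches one code unit without the u flag and one code point with it.
const matchesWithoutU = /^.$/.test(grin);
const matchesWithU = /^.$/u.test(grin);

/** Keep at most `max` code points without cutting a surrogate pair in half. */
function truncateCodePoints(text: string, max: number): string {
  return Array.from(text).slice(0, max).join("");
}
```

Here `codeUnits` is 2 while `codePoints` is 1, `half` is the lone high surrogate U+D83D, and only the regular expression with the `u` flag matches. By contrast with `truncateCodePoints`, a plain `text.slice(0, max)` counts code units and can leave a dangling surrogate at the end.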

On display:

In HTML, use an appropriate templating library so that text is never confused for HTML structure.
Then do one of the following:
- Ensure the input text does not contain bidi or other non-whitespace control characters, and insert a U+200E Left-to-Right Mark or U+200F Right-to-Left Mark after the user-supplied string to reset the text direction to the default. This is recommended in Unicode TR9.
- Use the <bdi> tag or unicode-bidi: isolate CSS property to isolate the direction of user-provided text from surrounding text in HTML templates.
- Ensure the input text closes any overrides and isolates it introduces, and then surround the text with a U+2068 First Strong Isolate at the start and a closing U+2069 Pop Directional Isolate afterwards.
- Follow the input text with one of the bidi paragraph separator characters (such as line feed, carriage return, record separator, and U+2029 Paragraph Separator), which have the effect of terminating all previous overrides and isolates. You may need to insert an LRM or RLM to set the direction before the next text.
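For the option that wraps the text in U+2068 and U+2069, a minimal sketch might look like the following. The names are hypothetical, and the balance check is deliberately stricter and simpler than the real bidi algorithm in TR9: it rejects any stray closer, requires embeddings and overrides to be closed explicitly, and does not account for paragraph separators inside the text or for nesting-depth limits. Treat it as a starting point rather than a complete implementation.

```typescript
const EMBEDDING_OPENERS = new Set(["\u202A", "\u202B", "\u202D", "\u202E"]); // LRE, RLE, LRO, RLO
const ISOLATE_OPENERS = new Set(["\u2066", "\u2067", "\u2068"]); // LRI, RLI, FSI
const PDF = "\u202C"; // Pop Directional Formatting: closes an embedding or override
const PDI = "\u2069"; // Pop Directional Isolate: closes an isolate
const FSI = "\u2068"; // First Strong Isolate

/** True if every opener has its own matching closer and no closer is stray. */
function hasBalancedBidiControls(text: string): boolean {
  const open: Array<"embedding" | "isolate"> = [];
  for (const ch of text) {
    if (EMBEDDING_OPENERS.has(ch)) {
      open.push("embedding");
    } else if (ISOLATE_OPENERS.has(ch)) {
      open.push("isolate");
    } else if (ch === PDF) {
      // pop() returns undefined on an empty stack, which also fails here.
      if (open.pop() !== "embedding") return false;
    } else if (ch === PDI) {
      if (open.pop() !== "isolate") return false;
    }
  }
  return open.length === 0;
}

/** Wrap user text in FSI ... PDI, refusing text that could escape the wrapper. */
function isolateForDisplay(text: string): string {
  if (!hasBalancedBidiControls(text)) {
    throw new RangeError("unbalanced bidi control characters");
  }
  return FSI + text + PDI;
}
```

The stray-PDI check matters because an unmatched U+2069 inside the user text would otherwise close the wrapper's own isolate early. In HTML output, the `<bdi>` option listed above achieves the same isolation without inserting control characters.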

+1 This is a big answer for unicodes. – Parking Master, Nov 12 '21 at 19:58
