I am currently designing a login system for a web service. I will use a PBKDF2 implementation for hashing the passwords.
However, I intend to allow Unicode in passwords, since I will have international users who might want to use, for example, Cyrillic characters. To avoid issues with Unicode ambiguity, I thought of applying NFC normalization before encoding the password as UTF-8 and passing it on to the hash.
The question now is: is that safe, or does it introduce any unwanted ambiguity into the password validation? It is clear that "a\u0308" (a + combining diaeresis) and "ä" should be treated as the same, but does NFC fold any further differences that users could be relying on?
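A quick check of what NFC does and does not touch, as I understand it: canonical equivalents are composed, but compatibility characters are left alone:

```python
import unicodedata

# Canonical equivalence: the combining sequence composes to the
# precomposed character.
assert unicodedata.normalize("NFC", "a\u0308") == "\u00e4"

# Compatibility differences survive NFC: SUPERSCRIPT TWO (U+00B2)
# is NOT folded to the ASCII digit "2".
assert unicodedata.normalize("NFC", "\u00b2") == "\u00b2"
```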
Edit:
I found that there is a stringprep (RFC 3454) profile called SASLprep (RFC 4013), which is apparently used for passwords and usernames in some protocols. It specifies compatibility normalization (form KC), which I consider a bad idea: it folds differences like "²" and "2", two characters that are commonly available on keyboards, at least in the Western world, and could be used to enrich password entropy. Unfortunately, no rationale is given for that choice.
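To illustrate the kind of folding I object to, here is what compatibility normalization does to two passwords I would consider distinct (a hypothetical example of mine):

```python
import unicodedata

# Under compatibility normalization (form KC/KD), two visually and
# semantically distinct passwords collide:
assert unicodedata.normalize("NFKC", "pass\u00b2word") == "pass2word"

# The superscript digit alone is likewise folded to the ASCII digit.
assert unicodedata.normalize("NFKD", "\u00b2") == "2"
```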