
I am currently designing a login for a web service. I will use a PBKDF2 implementation for hashing the passwords.

However, I intend to allow Unicode in passwords, since I will have international users who might want to use, for example, Cyrillic characters. To avoid any issues with Unicode ambiguity, I thought of applying NFC Unicode normalization before encoding the password as UTF-8 and passing it on to the hash.

The question now is: is that safe, or does it introduce any unwanted ambiguity into the password validation? It is clear that "a\u0308" (a + combining diaeresis) and "ä" should be the same, but does NFC fold any more differences which users could be relying on?
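Roughly, what I have in mind is the following (a Python sketch; the function name and iteration count are just illustrative, not a recommendation):

    import hashlib
    import os
    import unicodedata

    def hash_password(password: str, salt: bytes, iterations: int = 600_000) -> bytes:
        # Normalize to NFC so canonically equivalent inputs hash identically.
        normalized = unicodedata.normalize("NFC", password)
        return hashlib.pbkdf2_hmac("sha256", normalized.encode("utf-8"), salt, iterations)

    salt = os.urandom(16)
    # "a" + combining diaeresis and the precomposed "ä" yield the same hash:
    assert hash_password("a\u0308", salt) == hash_password("\u00e4", salt)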

Edit:

I found that there is a stringprep (RFC 3454) profile called SASLprep (RFC 4013) which is apparently used for passwords and usernames in some protocols. It specifies compatibility (KC) normalization, which I consider a bad idea: it will fold differences like ² and 2, two characters commonly found on keyboards (in the Western world at least) which could be used to enrich password entropy. Unfortunately, no rationale is given for that.
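The folding can be checked directly (Python):

    import unicodedata

    # Compatibility (KC/KD) normalization folds the superscript two ...
    assert unicodedata.normalize("NFKD", "E=mc\u00b2") == "E=mc2"
    # ... while canonical (C/D) normalization leaves it intact.
    assert unicodedata.normalize("NFC", "E=mc\u00b2") == "E=mc\u00b2"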

2 Answers


If you treat 2 and ² as the same character, you're essentially removing one character from the character set. That isn't really so bad if it increases usability, especially if that encourages longer passwords.

Say you take an 8-character password, with each character drawn randomly from a set of 2000 characters. That gives log₂(2000⁸) ≈ 88 bits of entropy. If you instead had a 9-character password drawn from 1000 characters (half as many!), that's log₂(1000⁹) ≈ 90 bits of entropy. In fact:

+-----+------+------+------+------+
|     |    character set size     |
| len |  500 | 1000 | 2000 | 4000 |
+-----+------+------+------+------+
|  6  |  54  |  60  |  66  |  72  |
|  7  |  63  |  70  |  77  |  84  |
|  8  |  72  |  80  |  88  |  96  |
|  9  |  81  |  90  |  99  | 108  |
| 10  |  90  | 100  | 110  | 120  |
| 11  |  99  | 110  | 121  | 132  |
+-----+------+------+------+------+
(entries are entropy in bits)

As you can see, in the normal range of password lengths and Unicode character set sizes, the exact size of the character set isn't that important.
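The table follows directly from entropy = length × log₂(set size); a short Python snippet reproduces it (rounding to whole bits):

    import math

    sizes = [500, 1000, 2000, 4000]
    for length in range(6, 12):
        row = [round(length * math.log2(n)) for n in sizes]
        print(length, row)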

derobert
  • So you are suggesting that users with a password like ``E=mc²`` should also be allowed to log in using ``E=mc2``? – Jonas Schäfer Feb 26 '14 at 09:34
  • @JonasWielicki well, apparently that RFC does. The benefits of doing so are fairly obvious (e.g., can the user manage to type ² on a smartphone?), and it turns out the downside is fairly minor. – derobert Feb 27 '14 at 15:33
  • Sorry for coming back this late. I see your point with respect to the character set size, but in my opinion this does not cover the risk of accepting *multiple distinct* passwords for one account. From my understanding, it would follow that the entropy introduced by two characters is not necessarily equal. In any case, applying these semantics requires informing your users, so that they are aware of the subtleties. One could do that, e.g., when setting the password, by comparing the normalized version against the non-normalized one and showing a warning in case of a mismatch. – Jonas Schäfer May 06 '14 at 11:06
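A minimal sketch of the check proposed in the last comment (Python; the helper name and the choice of NFKC are hypothetical, following SASLprep's compatibility approach):

    import unicodedata

    def check_password_normalization(raw: str) -> str:
        # Warn the user when normalization would change the password they
        # typed (with NFC instead of NFKC, the warning would fire less often).
        normalized = unicodedata.normalize("NFKC", raw)
        if normalized != raw:
            print("Warning: your password will be normalized; "
                  "e.g. \u00b2 is treated the same as 2.")
        return normalized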

The question is whether there is an appropriate trade-off between entropy reduction and user experience. In my answer I want to look at the possible UX benefits of canonical and compatibility normalization rather than at the exact entropy:

Avoiding problems where two possible representations of the same grapheme cluster exist makes a lot of sense to me: these representations mostly exist for legacy reasons (one of Unicode’s goals was to have code points for all characters from legacy encodings) and they look the same (unless there are font rendering issues). Also, as far as I know there is no standard way of choosing which form (single code point or base character plus combining marks) of, say, ä I want to input (speaking from my experience with most desktop and mobile operating systems).

Compatibility normalization, on the other hand, will “merge” characters which are distinctly different, e.g. ſ and s or ² and 2. Choosing between these is usually possible, although some symbols might require an IME, emoji picker or symbol picker. In my opinion there is no usability advantage here, since the user decides which version they want in their password. It can make sense for user names or other identifiers that might be vulnerable to homograph attacks, though only to some extent: mixing similar-looking alphabets is still possible.
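To make the distinction concrete (Python; ſ is U+017F LATIN SMALL LETTER LONG S):

    import unicodedata

    # Canonical normalization keeps visually distinct characters distinct ...
    assert unicodedata.normalize("NFC", "\u017f") == "\u017f"   # ſ stays ſ
    # ... compatibility normalization merges them into s.
    assert unicodedata.normalize("NFKC", "\u017f") == "s"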

As to why the RFC you mentioned prefers NFKC for passwords (in addition to user names, where it might fend off some homograph attacks), I don’t know. But personally I find NFD (or NFC) more reasonable.

dlrlc