Am I wrong? Does hashing passwords with SHA256 before bcrypt not reduce security, even theoretically?
bcrypt produces a 184-bit output hash. Feeding it more bits of entropy as input doesn't change the fact that the number of possible outputs is capped by that size.
256 bits > 184 bits, so I don't see how security would be reduced.
You may ask why the input is wider than the output in the first place (72 bytes vs 23 bytes).
Its length is unrelated to bits of entropy. Words and emoji take up more length/bytes, and when they are used as the units of an "alphabet" to compose a password, you can see how the number of entropy bits isn't tied to single bytes/characters (which is where the misunderstanding seemed to focus).
SHA-256 lets you compress that representation down to an input size bcrypt accepts, while still preserving the entropy of the original input (up to 256 bits).
More details
256 bits is plenty of security / defense, and more than bcrypt's output (192 bits, stored as 184)
I see discussion here about SHA-256 being 32 bytes, or 64 characters as a string (assuming hexadecimal encoding, a 16-value charset of 0-9,a-f). Whichever way you look at it, you still have 256 bits represented.
That's already an impractical amount to attack (not that these hashes would be attacked directly, as it's fairly certain the actual password has less than 256 bits of entropy).
You also get an output hash of 184 bits (8 bits are truncated before Radix-64 encoding to 31 characters), so any concerns about a reduction on the input side are moot; you'd sooner get a collision on the output anyhow.
Also note that while the limit is 72 bytes, some implementations may truncate/limit input to a 55-character string (56 when including the null terminator byte).
So if you're not passing the 32 raw bytes of the SHA-256 hash to bcrypt, but instead feeding it in as a hexadecimal string, you may want to use base64 encoding instead, which represents 32 bytes as 44 characters rather than 64.
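To make the size comparison concrete, here is a minimal sketch using only Python's standard library. The function name `sha256_encodings` and the two constants are my own for illustration, and the result would still need to be passed to whatever bcrypt library you use.

```python
import base64
import hashlib

BCRYPT_MAX_BYTES = 72        # bcrypt's input limit
CONSERVATIVE_MAX_BYTES = 55  # limit some implementations may apply


def sha256_encodings(password: str) -> dict[str, bytes]:
    """Return the SHA-256 digest of a password in three encodings."""
    digest = hashlib.sha256(password.encode("utf-8")).digest()
    return {
        # 32 bytes; can contain NUL bytes, which C-string based
        # implementations may treat as a terminator.
        "raw": digest,
        # 64 ASCII characters
        "hex": digest.hex().encode("ascii"),
        # 44 ASCII characters (including one '=' of padding)
        "base64": base64.b64encode(digest),
    }


encodings = sha256_encodings("correct horse battery staple")
for name, value in encodings.items():
    fits = len(value) <= CONSERVATIVE_MAX_BYTES
    print(f"{name}: {len(value)} bytes, fits in 55: {fits}")
```

The lengths are the same for any input password, which is the point: the hex form (64) overshoots a 55-byte limit, while the base64 form (44) fits under both limits.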
Using a hash to keep the length of the input constrained also avoids implementation bugs (fixed in OpenBSD in 2014 and NodeJS in 2020), where passwords exceeding 255 bytes overflowed an 8-bit string length, so the password could be treated as only a few characters long.
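As a rough illustration of that wraparound (this is not the code from the affected implementations, just the arithmetic of storing a length in an unsigned 8-bit integer; `length_as_uint8` is a hypothetical helper):

```python
def length_as_uint8(password: bytes) -> int:
    """Length as it would appear if stored in an unsigned 8-bit integer."""
    return len(password) % 256


long_password = b"a" * 260
effective = length_as_uint8(long_password)
print(effective)  # 4: a 260-byte password looks like a 4-byte one
```

A 44-byte base64 SHA-256 digest can never get near that boundary, whatever the length of the original password.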
A password's entropy is not restricted to single alphanumeric/ASCII characters for its composition
Password entropy isn't just measured in single characters/bytes, as was a common focus in other answers that assumed a charset of 95 ASCII values. You can have words (e.g. the EFF diceware lists of 7776 words) that take the place of a single ASCII value in the composition of a password, or passphrase in this case.
These of course are longer in length and thus bytes. If each word averages 10 characters, you can only fit 7 words before any entropy from additional words would be lost to bcrypt. That is only about 90 bits (log2(7776^7)).
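The arithmetic behind that figure, as a small sketch. The 10-byte average word length is the assumption from the paragraph above, and separators between words are ignored.

```python
import math

WORDLIST_SIZE = 7776   # EFF diceware list
AVG_WORD_BYTES = 10    # assumed average word length, no separators
BCRYPT_MAX_BYTES = 72

words_that_fit = BCRYPT_MAX_BYTES // AVG_WORD_BYTES
bits_per_word = math.log2(WORDLIST_SIZE)
usable_bits = words_that_fit * bits_per_word

print(words_that_fit)            # 7
print(round(bits_per_word, 2))   # 12.92
print(round(usable_bits, 1))     # 90.5
```

Any words past the seventh add entropy to the passphrase itself, but none of it reaches bcrypt unless the passphrase is hashed first.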
Passwords further don't have to be restricted to words from a limited alphabet. Foreign languages or even emoji can be valid inputs, but these may use more than a single byte for a single visual glyph.
A single glyph ("character") visually can be represented by multiple bytes, especially with emoji
You can have an emoji that uses 17 bytes, for example 🕵🏻‍♀️ (detective + skin tone + gender combined), which is represented by 5 codepoints in Unicode: 0x1f575 0x1f3fb 0x200d 0x2640 0xfe0f. These emoji are composed of a sequence of other base emoji and some invisible modifiers like 0x200d (ZWJ) and 0xfe0f (VS16).
So a single glyph can span multiple codepoints, and the number of bytes per UTF-8 encoded codepoint varies. There are emoji that use more bytes still, yet the entropy per emoji isn't high enough to justify the cost in bytes when input length is limited, as it is with bcrypt. A typical emoji (without any sequence involved) may use 3-4 bytes.
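You can check the byte count of that detective sequence yourself. This sketch builds the string from the five codepoints listed above using escape sequences, so it doesn't depend on how your editor renders emoji:

```python
# detective + skin tone + ZWJ + female sign + VS16
detective = "\U0001F575\U0001F3FB\u200D\u2640\uFE0F"

codepoints = [hex(ord(ch)) for ch in detective]
bytes_per_codepoint = [len(ch.encode("utf-8")) for ch in detective]
total_bytes = len(detective.encode("utf-8"))

print(codepoints)           # ['0x1f575', '0x1f3fb', '0x200d', '0x2640', '0xfe0f']
print(bytes_per_codepoint)  # [4, 4, 3, 3, 3]
print(total_bytes)          # 17
```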
TL;DR: SHA-256 allows for avoiding length constraints where entropy would otherwise be lost
Thus using the SHA-256 hash of a password as input works around the length issue. With about 3,521 emoji currently available (as of Sep 2020, Unicode Emoji 13.1), 21 emoji would fit into 256 bits of entropy (log2(3521^21) = ~247), but could very well take more than 72 bytes, possibly exceeding 500 bytes depending on emoji choice. Using the SHA-256 hash ensures you don't have to worry about the byte length of the user's password.
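A quick sketch of those numbers. The 4-byte and 25-byte sizes are illustrative picks for a plain emoji and a long ZWJ sequence, not limits:

```python
import math

EMOJI_COUNT = 3521
BCRYPT_MAX_BYTES = 72

bits_per_emoji = math.log2(EMOJI_COUNT)
emoji_in_256_bits = math.floor(256 / bits_per_emoji)
total_bits = emoji_in_256_bits * bits_per_emoji

# Byte cost of that many emoji at two illustrative sizes.
bytes_if_plain = emoji_in_256_bits * 4      # single-codepoint emoji
bytes_if_sequences = emoji_in_256_bits * 25  # long ZWJ sequences

print(emoji_in_256_bits)     # 21
print(round(total_bits, 1))  # 247.4
print(bytes_if_plain, bytes_if_sequences)  # 84 525
```

Even at 4 bytes each, 21 emoji already exceed bcrypt's 72-byte input.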
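For example, take two passwords made of the same three family emoji followed by a different fourth glyph: the detective emoji from earlier (92 bytes in total) vs ❤️ (81 bytes in total). The first 3 family emoji used for both take 75 bytes (25 each). If you use bcrypt to output a hash with the same salt, it will ignore the 4th glyph in both cases, resulting in the same hash.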
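Here is a sketch of that collision. It doesn't call a real bcrypt library; the 72-byte truncation is simulated with a slice, and the specific family emoji (man, woman, girl, boy) is my assumption, chosen because it encodes to 25 bytes.

```python
import base64
import hashlib

BCRYPT_MAX_BYTES = 72

# Assumed family sequence: man ZWJ woman ZWJ girl ZWJ boy (25 bytes)
FAMILY = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"
DETECTIVE = "\U0001F575\U0001F3FB\u200D\u2640\uFE0F"  # 17 bytes
HEART = "\u2764\uFE0F"                                # 6 bytes

password_a = (FAMILY * 3 + DETECTIVE).encode("utf-8")
password_b = (FAMILY * 3 + HEART).encode("utf-8")


def truncated(password: bytes) -> bytes:
    """What bcrypt effectively sees when fed the password directly."""
    return password[:BCRYPT_MAX_BYTES]


def prehashed(password: bytes) -> bytes:
    """Base64-encoded SHA-256 digest, always 44 bytes."""
    return base64.b64encode(hashlib.sha256(password).digest())


print(len(password_a), len(password_b))                # 92 81
print(truncated(password_a) == truncated(password_b))  # True
print(prehashed(password_a) == prehashed(password_b))  # False
```

With direct input both passwords look identical to bcrypt, while the pre-hashed inputs differ and stay well within the limit.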