12

I recently became aware of the fact that bcrypt truncates passwords to 72 characters. Practically speaking my intuition is that this does not pose any major security problems. However, I understand that it does mean any software libraries that use bcrypt potentially suffer from a "bug" where two ultra-long passwords that begin with the same 72 characters will be equivalent.

The authors of the Django web framework wrote something called BCryptSHA256PasswordHasher to resolve this issue; what it does is first hash users' passwords with SHA256 before passing them to bcrypt.

Some devs on my team were debating the merits of this approach and I just wanted to collect thoughts from the general community. On the one hand, does this not in some sense reduce security by shrinking the total space of possible values being fed into bcrypt? SHA256 will always output 32 bytes, far short of the 72 (significant) bytes that we would have otherwise. On the other hand, without hashing passwords first we essentially have an unevenly distributed space where for every 72-character prefix, all passwords starting with that prefix collide.

My gut feeling is that the distribution issue past 72 bytes is unimportant and that BCryptSHA256PasswordHasher really isn't useful. That said, I also recognize that, generally speaking, frameworks that are as popular and widely-used as Django tend to take these things seriously and have good reasons behind their decisions. So I don't really have much confidence in my gut on this one, if that makes sense.

Am I wrong? Does hashing passwords with SHA256 before bcrypt not reduce security, even theoretically? Is the distribution problem more severe than I realize? Is there something else that I'm just completely not thinking of?

Anders
  • 64,406
  • 24
  • 178
  • 215
Dan Tao
  • 281
  • 2
  • 6
  • SHA256 outputs are 256-bit (32 bytes like you said). This level of entropy is more than anyone could foreseeably break even with large GPU farms. Even at 1 trillion passwords a second, 2^256 is so large that it would still take 2^216 seconds (that is orders of magnitude longer than the universe has existed) to attempt every password. Just judging on the reduction from 72 bytes of entropy to 32 bytes, this wouldn't increase the likelihood that the password would be brute forced. Also I have to mention that any human generated 72 char password wouldn't really have 72 bytes of entropy anyway. – Owen Jun 22 '15 at 15:15
  • @Owen I understand that. But the question still remains (to me) what the point of hashing with SHA256 is in the first place. Does it solve a real security problem? Also, I share your skepticism regarding human-generated 72+ char passwords; but I am also skeptical that any 72+ char passwords would be human-generated (as opposed to tool-generated) at all. – Dan Tao Jun 22 '15 at 15:20
  • 3
    if we are talking machine generated (pseudo)random passwords longer that 72 characters, then the likelihood that 2 of them would be the same in the first 72 positions is even less likely than a brute force break I mentioned above. While this does reduce the entropy of the input to bcrypt, I wouldn't say it actually reduces security. – Owen Jun 22 '15 at 15:33
  • I agree with @Owen. Maybe they did that just to higher the computation time ? – r00t Jun 22 '15 at 15:37
  • @Owen Right. In practical terms I see no difference worth any consideration between these two approaches (i.e. hashing w/ SHA256 before bcrypt or not). I guess what I'm still failing to see is whether hashing the password first actually accomplishes anything useful. From a simple engineering standpoint, when faced with the choice of either *doing something* or *not* doing it, shouldn't there be a compelling benefit to doing it? – Dan Tao Jun 22 '15 at 15:38
  • @DanTao It may be solely to satisfy someone's concern that longer passwords/passphrases were being truncated. While it probably doesn't have a perceivable impact on security I know the idea of truncation doesn't sit well with some people, and the SHA256 pre-hashing process essentially addresses that. – PwdRsch Jun 22 '15 at 15:41
  • If it was recognized as being better then it would be the state of the art instead of bcrypt alone. – Neil Smithline Jun 22 '15 at 16:25
  • Isn't the point of it twofold: to stretch the average user password before feeding it into Bcrypt (where the average password would be fairly substantially shorter than 32 bytes), and to guard that same password in case Bcrypt was broken at some point? I guess it might also help when tuning BCrypt, since you have a known input size. On a side note, BCrypt's original spec mentions accepting 56 bytes, rather than 72 (https://en.wikipedia.org/wiki/Bcrypt) – iwaseatenbyagrue Mar 07 '17 at 16:54
  • If your password are randomly chosen alphanumeric passwords, then bcrypt+sha256 is no better than just bcrypt. If your password is actually chosen by the user (who might have chosen something like `constant long prefix+short secret code` as their password pattern or if the random string generator doesn't filter/encode null characters, then bcrypt+sha256 makes it much more secure. – Lie Ryan Apr 08 '17 at 02:21

4 Answers4

3

Bcrypt as stated in the Link is limited to 72 Characters.

SHA256 may have an OUTPUT size of only 32 Bytes, It's Message input is ((2^64)-1)\8 or roughly

2305843009213693952 Bytes (assuming a char is 8 bits)

To Bcrypt it's receiving a 32 Byte passphrase to encrypt, To SHA256 that could be a 400 Char data stream (IE password).

So no, you're not losing entropy on this, you're overcoming a limitation in Bcrypts design (If you want to call it a limitation.

Link on SHA2 discussing is sizes: https://en.wikipedia.org/wiki/SHA-2

Shane Andrie
  • 3,780
  • 1
  • 13
  • 16
  • 2
    I can't follow your argument: the key space for SHA-256 is still of (roughly) 32 bytes: beyond that, you're guaranteed to have a collision. This means that, even if you limit your input to 7-bit ASCII and considered that SHA-256 was a perfect hash function, you'd still have 2^248 collisions if you hash the passwords before feeding it to bcrypt. Am I missing something ? – Stephane Jun 23 '15 at 07:44
  • Since I am not really great a explaining the Math behind this all I will provide some links. http://stackoverflow.com/questions/4676828/when-generating-a-sha256-512-hash-is-there-a-minimum-safe-amount-of-data-to At this point were just talking about the SHA256 function ignoring Bcrypts additional layer. – Shane Andrie Jun 23 '15 at 16:34
  • I don't think the existence of a collision in SHA-256 is a problem, the question is, how easy will it be to create one. It is very likely is that two passwords that differ after the 72th character will have different SHA-256 values, so if that is the main concern it will help. If not: The hash-before-bcrypt approach will weaken the overall security against brute-force if the number of bits (here 256) is smaller than the relevant number of bits without. This is 72*8=576 in theory (bcrypt), but if the entropy is 6 bits per character (a good password) then you have 72*6=432 bits which is higher. – Ned64 Apr 02 '17 at 10:00
2

Am I wrong? Does hashing passwords with SHA256 before bcrypt not reduce security, even theoretically?

bcrypt has a 184-bit output hash. Having more entropy bits as input doesn't change that the amount of possible outputs is restricted below that.

256-bits > 184-bits, thus I don't see how security would be reduced.

You may ask why is the input wider in the first place? (72 bytes vs 23 bytes)

It's length is unrelated to bits of entropy, words and emoji can use more length/bytes, when these are used as units of an "alphabet" to compose a password, you can understand how the amount of entropy bits isn't restricted to single bytes/characters (which is where the misunderstanding seemed to focus on).

SHA-256 allows you to compress that representation down to an input size bcrypt accepts, which can still maintain that entropy of the original input.


More details

256-bits is plenty of security / defense, and more than the 192-bits output

I see discussion here about SHA-256 being 32 bytes or 64 characters as a string (assuming hexadecimal encoding, a 16 value charset 0-9,a-f), whichever way you look at it, you have 256-bits represented still.

That's an impractical amount to attack already (not that these hashes would be attacked, as it's fairly certain the actual password would have less than 256-bits of entropy).

You also get an output hash of 184-bits (8 bits are truncated before Radix-64 encoding to 31 characters), so any concerns about reduction for input is moot, you'd sooner get a collision on the output anyhow.

Also note that while the limit is 72 bytes, some implementations may truncate/limit input to a 55 character length string(56 when including the null terminator byte).

So if you're not passing 32 bytes of the SHA-256 hash to bcrypt, but instead feeding it to bcrypt as hexadecimal string, you may want to instead use base64 encoding which represents 32 bytes as 44 characters instead of 64.

Using a hash to keep the length of input constrained also avoids implementation bugs (fixed OpenBSD 2014, NodeJS 2020), where passwords exceeding 255 bytes overflowed an 8-bit string length which could treat the password as only a few characters long instead.

A passwords entropy is not restricted to single alphanumeric/ASCII characters for it's composition

Password entropy isn't just measured by single characters/bytes as was a common focus in other answers regarding 95 ASCII values charset. You can have words (eg EFF diceware lists of 7776 words) that substitute a single ASCII value in the composition of a password, or passphrase in this case.

These of course are longer in length and thus bytes, if each word averages 10 characters, you're only able to fit 7 words before any additional entropy from additional words to bcrypt would be lost. That is only about 90 bits (log2(7776^7)).

Passwords further don't have to be restricted to words from a limited alphabet. Foreign languages or even emoji can be valid inputs, but these may use more than a single byte for a single visual glyph.

A single glyph ("character") visually can be represented by multiple bytes, especially with emoji

You can have an emoji that uses 17 bytes for example: (‍♀️ detective + skin tone + gender combined), that is represented with 5 codepoints in unicode: 0x1f575 0x1f3fb 0x200d 0x2640 0xfe0f. These emoji are composed of a sequence of other base emoji and some invisible modifiers like 0x200d ZWJ and 0xfe0f VS16.

A single glyph, multiple codepoints (of which bytes per UTF-8 encoded codepoint varies). There are emoji that use more bytes still, yet the overall entropy bits for emoji isn't that high to justify the cost in bytes when that's limited like with bcrypt. A typical emoji (without any sequence involved) may use 3-4 bytes.

TL;DR: SHA-256 allows for avoiding length constraints where entropy would otherwise be lost

Thus SHA-256 hash of a password for input works around the length issue. With current emoji being about 3,521 (as of Sep 2020 Unicode 13.1), 21 emoji would fit into 256-bits of entropy (log2(3521^21) = ~247), but could very well use over the 72 bytes in size, possibly exceeding 500 bytes depending on emoji choice. Using the SHA-256 hash ensures you don't have to worry about the byte length of the users password.

‍‍‍‍‍‍‍‍‍‍♀️ (92 bytes) vs ‍‍‍‍‍‍‍‍‍❤️ (81 bytes), the first 3 family emoji used for both use 75 bytes (25 each). If you use bcrypt to output a hash with the same salt, they will both ignore the 4th glyph resulting in the same hash.

1

According to Django's documentation it's because

password are truncated on the first NULL byte (if any) and only the first 72 bytes of a password are hashed... all the rest are ignored.

They acknowledge that it's not a security issue:

While not a security issue per-se, bcrypt does have one major limitation (see previous)

Also, they note that the end of the 72C passwords are not "as important as" the rest:

Furthermore, bytes 55-72 are not fully mixed into the resulting hash (citation needed!).

I could translate it by "the end of the password is easier to crack than the begining".

That's why they do a has first (which as not NULL character, and is less than 72 characters and has no 55-72 characters):

To work around both these issues, many applications first run the password through a message digest such as SHA2-256. Passlib offers the premade passlib.hash.bcrypt_sha256 - BCrypt+SHA256 to take care of this issue.

This improves security if you have long passwords to handle. This may slightly decrease it if you only have short usual (A-Za-z0-9 & special chars from keyboard) passwords because of potential collisions (meh, not that important). So for a generic library, it makes sense to hash. For a lambda website/application managing user password, it seems useless (you have more risks by doing it wrong than by not doing it at all).

Xenos
  • 1,331
  • 8
  • 16
-2

I think the use of hashing the password prior to sending it to bcrypt is to overcome bad password habits. As long as you're generating random passwords then using the hash will be a detriment.

Let's say that your password is . So now I can have my 72 characters saved in a document and copy/paste it then append the name of the computer it's on (or some other calculation that is easy for a human to do). As we see this will result in essentially all of them being the same password. By hashing it we can prevent this, but at the cost of keyspace because the particular algorithm we're using outputs 32 characters.

Now, let's assume that every time you set a password it is randomly generated, the whole thing. A 72 character password has (9572) combinations. Your chance of collision is 1/9572.

If we hash this random value we will get 1 of 9532 values and thusly our chance of collusion is 1/9532. So already we can see that hashing costs us some security.

There's more to it though. If we consider that the hash will always output a 32 character string, even if the input is less than 32 characters, then we have exactly 9532 possible passwords. But bcrypt will take passwords in a range of lengths (31, 55, etc). So you can count those as well (9572+9571+9570+9569...) which quickly add up, resulting in further disparity between the two methods.

That's the theoretical, but what about the practical. How many characters do we need for suitable security? If it would take longer than the amount of time until the heat death of the universe to find a collision in a 32 character hash (I don't know that to be a fact, just making an example) then we have squeezed all the security we can out of that. We gain nothing but warm fuzzy feelings by making passwords longer. However, we can add security via hashing which helps mitigate bad password practices (when they're longer than 72 characters).

tenmiles
  • 97
  • 1
  • I don't think an attacker would try to feed bcrypt 32 byte binary strings - she would feed SHA-256 common passwords. So knowing the lenght of the input isn't really an issue here. – Anders Apr 06 '17 at 14:25
  • Also, I think your second paragraph is incorrect. In this case you are much more likely to get collisions if you don't hash. The risk of getting an SHA-256 collision is effectively zero. – Anders Apr 06 '17 at 14:27
  • @Anders If you take every possible 72 character string and hash it down to 32 characters then you WILL get collisions. There are just not enough characters to represent all possible inputs uniquely. – tenmiles Apr 07 '17 at 13:03
  • Yes, off course. But unless you have an astronomical number of users, the chance of that ever happening is astronomically small. – Anders Apr 07 '17 at 13:05
  • I see a lot of talk (above) about using bcrypt to enhance the security of the password, but I don't buy it. If you have 72 characters to play with then a random 72 character string is the best you're going to do. Hashing it down to 32 characters not only reduces the length of the string (and therefore strength), but also introduces a weakness whereby a large subset of those possibilities can be excluded. On paper the 72 character string is stronger, whether the inferiority of the hashed password has any practical significance in today's world is not what I was trying to address. – tenmiles Apr 07 '17 at 13:10
  • @Anders I'm not sure where you get binary strings from in my answer. If I can attack the data without having to put my attempts through the hashing process first then I only need to try all 32 character strings. However, this is not the same as if the password is limited to 32 characters because that could include 31, 30, etc char passwords whereas the hash output will always be 32 characters. I think this is equivalent to adding an additional possible character value (null) when calculating the probabilities? – tenmiles Apr 07 '17 at 13:18
  • @Anders I made a typo in one of my other comments, I said " I see a lot of talk (above) about using bcrypt to enhance the security of the password" but I should have said "SHA-256" or just "hashing". – tenmiles Apr 07 '17 at 13:21
  • Let us [continue this discussion in chat](http://chat.stackexchange.com/rooms/56723/discussion-between-anders-and-tenmiles). – Anders Apr 07 '17 at 13:26