Why should passwords be compared by means of a byte array?

Question

A best practice in a login process of a web application is to compare hashed passwords as byte arrays. What is the real reason for this best practice? Is it about integrity?

Cert Secure Coding talks about using byte arrays because they are easier to release.

Basically, to prevent accidentally printing the password, and because byte (or char) array can be explicitly deleted, while strings are handled by the garbage collector. See https://stackoverflow.com/questions/8881291/why-is-char-preferred-over-string-for-passwords — Daniel Grover, Oct 31 '17 at 18:31
I'm voting to close this question as off-topic because this already has an answer on SO and is not security but programming related. Migration to SO seems useless though. — Tobi Nary, Oct 31 '17 at 18:35
Could you cite your source? I have never heard this advice. I can guess some reasons why this *might* be, but without knowing what you are being advised to do exactly, I can only guess. — Luc, Oct 31 '17 at 18:53
@SmokeDispenser "How to compare password by byte array" is a question for StackOverflow; "why compare passwords by byte array" is a perfectly valid question here. — Luc, Oct 31 '17 at 18:54
I think it's a security topic, because this is part of best practices to protect sensitive information, as Daniel Grover said using Strings can counduct to leak sensitive information. thanks Daniel Grover for your comment. — hmrojas.p, Oct 31 '17 at 18:54
Right, so your reasoning is this java-specific question is a good fit here even if it has a protected question over at SO? — Tobi Nary, Oct 31 '17 at 18:59
The reason is stated clearly right in the article you linked: "Java String objects are immutable and can be copied and internally stored by the Java Virtual Machine. Consequently, Java lacks a mechanism to securely erase a password once it has been stored in a String. See MSC59-J. Limit the lifetime of sensitive data for more information." — Ajedi32, Oct 31 '17 at 19:37
@Ajedi32 Not really. The article is about passwords, not password hash comparisons. — Xander, Oct 31 '17 at 19:40

Xander · Accepted Answer · 2017-11-01T22:50:43.997

There are a couple of issues at play here. The first, which is referenced in both the Cert Secure Coding example you linked to and the StackOverflow answer from Daniel Grover's comment doesn't have anything specifically to do with comparing the hash of a user-submitted password with a previously stored hash, but with the storage of the input password itself. The issue they describe is one of string immutability in platforms like Java and .NET, which means that if the clear-text password is ever stored as a string, it cannot be removed by the application, but only by the garbage collector, so a memory dump that occurs before that happens can lead to the exposure of a clear-text password.

This, however, is not really an issue when comparing password hashes. The reasons for using byte arrays rather than strings have fewer security implications, but it is still a good idea. First, because the input to, and output from a hash function is generally in the form of bytes. If you put bytes in and get bytes out, converting back to a string to compare makes little sense. Additionally, you eliminate an entire class of bugs (that could included security vulnerabilities) caused by incorrect string encoding. For instance, two byte arrays representing two entirely different passwords might falsely report as identical if incorrectly converted to ASCII before comparison, because in many cases, bytes outside the basics 128 character ASCII range will be converted to the same (3F) ASCII character value, opening the potential for string collisions even when the original byte arrays did not. If you leave the bytes alone, you eliminate this potential vulnerability entirely.

So, thinking of them and working with them as what they are, a hash output of bytes rather than strings leads to simpler, more correct code, and less error prone code. For security, this is a good thing.

I forgot to read this last night, but note that concerns like "For instance, two byte arrays representing two entirely different UTF-8 passwords might falsely report as identical if incorrectly converted to ASCII before comparison." apply to the comparison of passwords before the hash function, not the comparison of hash function output bytes. You eliminate a class of bugs by not trying to do anything to the input data (incoming password) and just feeding it to the hash function. Encoding the hash function output in some other way is unnecessary. — diagprov, Nov 01 '17 at 08:47
@diagprov Yes, I see what you're getting it. It was badly worded in my post because I conflated the two scenarios, but the same issue exists when comparing output. I'll update the post to make it clearer shortly. — Xander, Nov 01 '17 at 13:56
@diagprov Re-worded so it's specifically addressing the issue with outputs. Hopefully it's clear now. — Xander, Nov 01 '17 at 22:44

score 8 · Answer 2 · answered Oct 31 '17 at 19:37

You're confusing two slightly orthogonal issues here with the wording "compare hashed passwords"

Storage of the password (before hashing)

As Jon Skeet says on SO:

Strings are immutable (in Java and many other GC languages). That means once you've created the String, if another process can dump memory, there's no way (aside from reflection) you can get rid of the data before garbage collection kicks in.

If you are handling a user's password, you want to guard against this risk, so, you use a data type you can immediately overwrite once you have used it - an array of bytes fits the bill perfectly. Once you have applied your password hashing algorithm, you can zero fill the array (or use random data, or whatever).

Comparing the hashes (i.e. after hashing, before logging in)

Hash functions output bytes. It is how they are defined and typically how they operate in most programming languages, therefore, it makes most sense simply to compare the two bytestrings as they are, without first running them through any converters.

However, if your password hash happens to be base64 encoded in your database, there is no loss of security from taking the computed password hash and encoding this in base64 and then comparing strings. The hash is already in your database - if an attacker has access to your system sufficient to dump process memory it shouldn't be much of a stretch to recover the password hash from your database - doesn't really matter if they fish it out from there or from an immutable copy kept until your GC runs.

Summary

So to keep things simple:

Process user passwords as byte arrays, not using string types.
Process hashes as byte strings too, as there's no benefit to converting them just for comparing.
Use a secure password hashing scheme. Argon2 if available is currently recommended, otherwise follow the advice here: https://security.stackexchange.com/a/31846/99916

"if an attacker has access to your system sufficient to dump process memory it shouldn't be much of a stretch to recover the password hash from your database" -- there are plenty of attack scenarios where getting access to read memory is possible but performing active database queries is not. For a recent and widespread example, see the "heartbleed" vulnerability. Blanking password hashes is a useful mitigation step in such circumstances. — Jules, Nov 01 '17 at 07:06
I can't think of any other examples, other than heartbleed, that have dumped process memory onto the network - moreover, languages like Java ought to be memory safe and not allow this kind of behaviour. By contrast, there are plenty of SQLi vulns floating around in webapps. I'm honestly about 1000x more concerned about choice of password hash and salting scheme than I am about zeroing about the bytes of a hash in case someone manages to dump it before the GC runs. — diagprov, Nov 01 '17 at 08:42

Why should passwords be compared by means of a byte array?

2 Answers2