Revisiting the Username Hash

Question

There are a few questions which ask for inputs on the wisdom of storing (possibly salted) username hashes for the purpose of authenticating end user access to some information resource. Most of the answers I see are pretty down on the idea, but none of them address what seems to be a vulnerability this approach may mitigate.

The vulnerability I'm referring to is that of username / password reuse. Many of the most recent database compromises resulted in the publication of plaintext usernames and associated passwords (if the developers didn't hash first) or password hashes (which may or may not have been salted).

So my question is: will storing hashed usernames, possibly salted with a site-specific secret value (in addition to properly storing salted and hashed passwords), mitigate this vulnerability of username / password reuse? Certainly this will not protect against a username / password pair reused from a previously compromised site. The question refers mainly to the idea of protecting user information by not being the source of such a compromise.

Update: There seem to be a number of questions about how this would be implemented. This is a brief overview of the approach I took for the MySQL database I constructed. It lacks all of the detail, and there may be some detail which I have inaccurately reconstructed (either for the sake of simplicity or unintended oversights):

User table with the following fields:

Table Name:  USERS

USER_IDENTIFIER:  GUID (doubles as salt value for password hash in this example)
  USERNAME_HASH:  HASH(username + site_salt)
  PASSWORD_HASH:  HASH(password + USER_IDENTIFIER)
       USER_KEY:  ENCRYPT(value: <system generated random number>,
                            key: HASH(USER_IDENTIFIER+username+password))

There can also be an unencrypted public name; as long as it is not the same as the username, this should not be a problem.

A group table, for lack of a better name, with the following fields:

Table Name:  GROUPS

 GROUP_ID:  GUID
 OWNER_ID:  USER_IDENTIFIER of the user which owns the group
GROUP_KEY:  ENCRYPT(value:  <system generated random number>,
                      key:  <unencrypted USER_KEY for the owning user>)

A lookup table allowing a group owner to share the group key with other users as in:

Table Name: SHARED_GROUP_KEYS

  GROUP_ID:  GUID of the group being shared by the user that owns the group
   USER_ID:  GUID of the user gaining access to the group with the GROUP_ID above
SHARED_KEY:  ENCRYPT(value:  Unencrypted GROUP_KEY for the group,
                       key:  Unencrypted USER_KEY for the gaining user)

Information in the database would then be shared within groups. If you have been granted access to a group (by having a valid entry in the SHARED_GROUP_KEYS table for the group) you have the key to see content associated with that group. Otherwise you don't.

In order to authenticate, a user provides only a username and password.
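
A minimal sketch of that login lookup, under stated assumptions: HASH is taken to be SHA-256, a dict stands in for the USERS table, and SITE_SALT, register and authenticate are hypothetical names used only for illustration. It shows that the username hash alone is enough to locate the row; it is not the actual MySQL implementation, and a real system would want a deliberately slow password hash instead of a single SHA-256 pass.

```python
import hashlib
import hmac
import uuid

SITE_SALT = "example-site-salt"  # assumption: one site-wide value for all usernames
USERS = {}  # USERNAME_HASH -> row; stands in for the USERS table


def h(text):
    """Placeholder for HASH(...) in the schema above (SHA-256 is an assumption)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def register(username, password):
    username_hash = h(username + SITE_SALT)
    if username_hash in USERS:
        # Same username (or, improbably, a hash collision): reject.
        return False
    user_identifier = str(uuid.uuid4())  # GUID, doubles as the password salt
    USERS[username_hash] = {
        "USER_IDENTIFIER": user_identifier,
        "PASSWORD_HASH": h(password + user_identifier),
    }
    return True


def authenticate(username, password):
    # The row is located by the hashed username; the plaintext is never stored.
    row = USERS.get(h(username + SITE_SALT))
    if row is None:
        return None
    candidate = h(password + row["USER_IDENTIFIER"])
    if hmac.compare_digest(candidate, row["PASSWORD_HASH"]):
        return row["USER_IDENTIFIER"]
    return None
```

Because the site salt is the same for every user, the lookup is a single indexed equality match rather than a scan over all rows.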

So back to the original question, which is focused only on the USERNAME_HASH field of the USERS table: does it make sense, in order to prevent spillage of a user's username when a compromise occurs, to hash the username instead of storing it as plaintext? In other words, as I mentioned in a comment below, is there any merit in changing the paradigm of treating the username as public information, and instead treating it as another secret (for the purpose of authentication only)? Does it help prevent my site from being the source of a user's credentials which in turn would allow unauthorized access to that user's information on another site?

Hashing usernames doesn't sound like a great idea; you would have no way of locating the user record in your database. Hashing passwords, on the other hand, is standard practice, and should always be done with a suitably-generated unique salt. — , Feb 01 '12 at 18:11
@OliCharlesworth: Edited to be more clear. I assumed storing of salted and hashed passwords in addition to salted and hashed usernames. The question is will the addition of username hashes mitigate being the source of compromise? — andand, Feb 01 '12 at 18:19
How do you intend to uniquely locate a user record in a database if the usernames are hashed (and salted)? How do you intend to work back to username from the hash whenever you need to e.g. display the username? — , Feb 01 '12 at 18:20
@OliCharlesworth: Regarding the search, you use a stored procedure to search for the hashed username, rather than the plaintext username. The account identifier would be some sort of a GUID or other ID number. I've implemented it successfully in a small MySQL database. — andand, Feb 01 '12 at 18:21
Are you a) not displaying usernames anywhere else in the system (unusual), or b) *also* storing the unhashed usernames somewhere else (back to square one, really)? — , Feb 01 '12 at 18:22
How do you avoid collisions? And presumably this requires a single salt for all users? (which massively lowers its usefulness). And what happens when you need to display the username, when all you have is the hash/ID? — , Feb 01 '12 at 18:22
@Oli: If the hash is long enough (say, 128 bits or more), random collisions are a non-issue. (You'd need about 2^64 users before the first collision will happen on average. The current total population of the Earth is under 2^33.) — Ilmari Karonen, Feb 01 '12 at 18:25
The username can be encrypted with a user-specific key which the user may decide to share with others for various purposes. Collisions are prevented by rejecting a new username that hashes to something already in the database. And yes, the site will have to have a single salt value for hashing all of the usernames. — andand, Feb 01 '12 at 18:27
So now the user has to provide their username, user-specific key, *and* password to log in? And sometimes, their chosen combination will be rejected because of a collision? (Remember what people are trying to tell you here - you need the user to provide enough information for you to find their specific records, before authentication can occur) — , Feb 01 '12 at 18:32
@andand: This is starting to sound very complex! The user is going to have to provide a username, password *and* an encryption key in order to log in? — , Feb 01 '12 at 18:34
@Damien_The_Unbeliever: The user will need to remember their username and password only. Recovering a forgotten password or lost username will be a problem. The user-specific (not user-specified) key is generated within the system using a cryptographically secure RNG. Multiple keys can be generated for each user as needed to allow sharing of information. And as for the collision issue, it's not really all that different from sites which right now don't allow two users with the same username. Collisions will be very rare with any decent, cryptographically secure hashing algorithm. — andand, Feb 01 '12 at 18:37
You seem to be missing the point, yet again. If the usernames are somehow protected by a user-specific key, then you have no means to determine *which* key applies to a particular login attempt (since you need the key in order to perform the hash, in order to locate the particular user). You have to attempt the login against *every* user in your system until you find a match. You're actually diluting the hacker vs site owner attack advantage. — , Feb 01 '12 at 18:43
@Damien_The_Unbeliever, OliCharlesworth: See edit above. And like I said, I have successfully implemented something akin to what I describe above. — andand, Feb 01 '12 at 19:24

score 6 · Accepted Answer · answered Feb 02 '12 at 13:21

In a typical system, there are several "usernames". There is the name that the user types to begin the login operation. Then there can be a "display name" (to be added to forum posts, automatic emails...), a contact email address, a billing name, a cardholder name... It makes relatively little sense to protect one of these names without dealing with the others, since they all contain, on average, similar information (it may sound surprising, but, given the choice, many people will prefer to use their own true name for all these purposes).

The "login name", i.e. the username which the user types to begin the login operation, serves the following role: it allows the server to efficiently locate the user-specific data. The server may have thousands of users, in a large database, and it is inconvenient to scan all of them upon each login attempt; you want to quickly find the "user identifier" from which you can get the hashed password for that user.

Login names are not usually considered to be secret data (if they were secret data, we would call them "passwords"), if only because it is normal and expected that the login name is similar to the user's administrative identity. Nevertheless, if you want to somehow hide the list of login names, you must use a deterministic injection: that's a transform, such that given twice the same login name, you get the same output (that's determinism, it is required for the "find the user-specific data" part to work, and it must be understood server-wide), and two distinct login names will yield distinct output (the transform must be injective, or at least should not allow collisions to appear with a non-negligible probability). There are several solutions:

You can use a hash function such as SHA-256. You get worldwide determinism (that's the same SHA-256 for everybody) and almost-injection with high probability.
You can add a server-wide salt if you want to nullify any advantage an attacker may get from precomputations. The salt must still be the same for all login names on your server, because of determinism. The salt is equivalent to choosing your own, server-specific hash function. Note that an attacker who got the list of hashed login names will still be able to do a parallel attack on all the names (for each potential user name, the attacker hashes it -- with the server-wide salt -- and compares the result to all the hashed login names); thus, this specific salt does only a part of the job normally performed by a salt (but you cannot avoid it, because you need determinism).
You could make the salt secret; then, it is no longer a salt, but a key, and the hash function is no longer a hash function, but a Message Authentication Code (such as HMAC). This will do you any good only if the attacker can get access to the database but not to the key (so that's a restrictive attack model). Keys entail key management, something which is never simple, especially in the presence of multiple front-ends (they share the database, thus they must share the key -- and secrecy is not utterly compatible with sharing).
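
The three options above can be sketched as follows, using Python's standard hashlib and hmac modules. The function names and the sample salt and key values are illustrative assumptions, not part of the original scheme; the point is only that each transform is deterministic for a given server.

```python
import hashlib
import hmac


def plain_hash(login):
    # Option 1: plain SHA-256, identical on every site in the world.
    return hashlib.sha256(login.encode("utf-8")).hexdigest()


def salted_hash(login, site_salt):
    # Option 2: a server-wide salt, in effect a server-specific hash function.
    return hashlib.sha256(site_salt + login.encode("utf-8")).hexdigest()


def keyed_mac(login, secret_key):
    # Option 3: the salt is kept secret, so it is a key and this is a MAC (HMAC).
    return hmac.new(secret_key, login.encode("utf-8"), hashlib.sha256).hexdigest()


if __name__ == "__main__":
    salt = b"example-server-salt"   # assumption: stored next to the database
    key = b"example-secret-key"     # assumption: stored away from the database
    for name in ("alice", "bob"):
        print(name, plain_hash(name)[:16], salted_hash(name, salt)[:16],
              keyed_mac(name, key)[:16])
```

In all three cases the same login name always maps to the same value on a given server, which is what allows the output to be used as a lookup key.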

Remember that all of this is about a second-level defense: the model assumes that an illicit read access was achieved by the attacker (and that's already a big worry!). While such read accesses do occur in practice (that's why we hash passwords), we must not forget that this should not be the primary line of defense; on the other hand, every hash or encryption adds complexity, which is the well-known nemesis of security. So there is a trade-off, and the usual wisdom is to avoid complexity if it would serve only for secondary protection of data which is already semi-public anyway (i.e. the login name).

As for the details, the rough outline of the scheme you suggest would work (you are free to consider as "password" the concatenation of the login name and the password typed by the user, if you wish -- it will not make the scheme weaker). However, do yourself a favour: don't invent your own hashing. For the hash of the login name and of the password, rely on bcrypt or PBKDF2. Your user-specific key encrypted with the user password is really the result of a Key Derivation Function; there again, you'd better use a proper KDF like PBKDF2; you would just have to store the user-specific salt. Both the "hashed password" and the "encrypted user-specific key" can be used as the basis for a dictionary attack (aka "trying potential passwords"), so they must include the same level of protection (many iterations to make the processing slow, a salt to thwart parallelism).
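
As a rough illustration of the PBKDF2 suggestion, here is a sketch using hashlib.pbkdf2_hmac from the Python standard library. The iteration count, the salt length, the use of two separate per-user salts, and all function names are assumptions made for this example rather than recommendations taken from the scheme above.

```python
import hashlib
import hmac
import os

ITERATIONS = 200_000  # assumption: tune to your hardware and latency budget


def new_salt():
    # Per-user random salt; stored in the clear next to the user record.
    return os.urandom(16)


def password_verifier(password, verifier_salt):
    # Slow, salted hash stored in place of PASSWORD_HASH.
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"),
                               verifier_salt, ITERATIONS)


def check_password(password, verifier_salt, stored):
    return hmac.compare_digest(password_verifier(password, verifier_salt), stored)


def derive_user_key(password, key_salt, length=32):
    # The user-specific key is derived rather than stored encrypted;
    # only key_salt needs to be kept in the database.
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"),
                               key_salt, ITERATIONS, dklen=length)
```

Using a different salt for the verifier and for the key keeps the stored verifier from revealing the derived key, while both remain equally slow to attack by guessing passwords.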

I would still consider it a poor trade-off: much added complexity, for little extra security.

Thanks. This was along the lines of the reasoning I was looking for. — andand, Feb 02 '12 at 14:46

score 1 · Answer 2 · answered Feb 01 '12 at 18:23

(Rewritten answer, see history for original.)

If you just hash the usernames using a standard hash function and no salt, the same usernames will obviously produce the same hash, so it's trivial for an attacker with access to the user databases of two sites to compare the hashes. However, using a site-specific salt would indeed avoid this issue.

However, for many kinds of sites, collecting a reasonably complete user list will not be difficult. (For example, on a discussion forum site, the username will presumably be shown alongside each post.) Once an attacker has such a list (and access to your user database), they can easily hash each username on the list and then match the hashes to your database. So, for this technique to be of any use at all, you'd have to design your site so that the actual usernames are never shown anywhere. And even then, an attacker could just collect usernames from other sites and hash them, in the hope that they'll find some matching users on your site.
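
To make the matching step concrete, here is a small sketch of what such an attacker would do, assuming the username hash is SHA-256 over the username followed by the site salt and that the attacker knows the salt. The function names and sample data are hypothetical.

```python
import hashlib


def username_hash(username, site_salt):
    # Assumed form of the stored value: SHA-256(username + site_salt).
    return hashlib.sha256((username + site_salt).encode("utf-8")).hexdigest()


def match_usernames(leaked_hashes, candidates, site_salt):
    """Map each leaked hash back to a candidate username, where one matches."""
    leaked = set(leaked_hashes)
    recovered = {}
    for name in candidates:
        digest = username_hash(name, site_salt)  # one hash per candidate
        if digest in leaked:                     # checked against all users at once
            recovered[digest] = name
    return recovered


if __name__ == "__main__":
    salt = "example-site-salt"
    leaked = [username_hash(n, salt) for n in ("alice", "carol")]
    print(match_usernames(leaked, ["alice", "bob", "carol", "dave"], salt))
```

Each candidate costs one hash computation and is tested against every user simultaneously, which is why a shared site-wide salt does not slow this attack down.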

(Things get a little better if you can manage to keep your site-specific salt secret from the attacker, but it's generally not wise to assume that you can do that. Most of the time, if an attacker can obtain a copy of your database, they'll also be able to obtain the salt and any other information about your implementation they might need. That said, there are situations where that might not be the case, such as database leaks via simple SQL injection attacks.)

Canned answer that doesn't speak to the proposal of the OP? The OP is talking about hashing (possibly with salt) the *usernames*. — , Feb 01 '12 at 18:27
@IlmariKaronen: I guess what I'm trying to avoid with this scheme is being the source of compromised user data, to include the username. I can't help it if another site has been compromised and the user has reused their username / password on my site. I don't want to be the other site where that information was the source of a compromise. — andand, Feb 01 '12 at 18:46
@andand: What I was trying to say is that this is kind of hard to do, since usernames are generally treated as public information. Even if _your_ site treats them as secret, most other sites probably won't, which means that any attacker who compromises your site can just collect username lists from elsewhere and match them against your database. — Ilmari Karonen, Feb 01 '12 at 18:53
@IlmariKaronen: Understood, and I acknowledge that limitation. I'm wondering if changing the paradigm of treating usernames as public information makes sense. That is, should all credentialing information be held as secret? — andand, Feb 01 '12 at 19:28
