5

I am thinking about a log-in web service that would store a user's username, password, and (optional) email address.

If they forget their username or password, they can have their username and a password-reset link sent to their email address (if they entered one at sign-up).

I have no wish to contact users at any other time.

I would like to store the email salted and hashed, so it is (hopefully) impossible to retrieve any email addresses from the database.

So, if a user has forgotten their username, or would like to reset their password, they enter their email address. If it matches a hash in the database (when salted and hashed), then it is used to send the username and a password-reset link to the user.

Obviously the unavoidable risk here is a hash collision, which would cause someone to be emailed another person's username and password-reset link.

  • Could this be avoided by using a public PGP key to (salt and) encrypt the email address as a form of hash, and never storing the private PGP key, so that the process is (hopefully) irreversible?

  • Can you see any weaknesses or problems with this approach?

  • Do you think this would be a significant improvement over storing plain email addresses in a database?

Paul
  • 5
    From where do you retrieve the salt if one enters an email address for password reset? You must use the same salt as upon registration or else you will not find the hash in your db, i.e. the salt must be stored somewhere so you can find it for a given email address. – AHalvar Jun 19 '14 at 08:47
  • Yes, the salt will have to be stored unencrypted. I think this is normal practice, as (I believe) salts are used to protect against rainbow table attacks, rather than cracking. – Paul Jun 19 '14 at 09:00
  • How do you handle the situation where the user remembers their username, but not the email address they used to sign up for your service? – user Jun 19 '14 at 09:04
  • Good point! I guess they would have to try every email address they have access to. If it's an address they can't access, no password reset would work anyway. If they have access but can't remember the address then they would lose their account. I don't know how likely this is. I guess if this implementation went ahead, it might be worth encrypting the username and requiring it to be re-entered or changed when resetting the password. – Paul Jun 19 '14 at 09:14
  • Although you didn't ask for this, but do you really need to keep a user name? Can you just keep email and password (like facebook)? I think that might simplify your design, because as of now, the user needs to know a user name for your system, then email they used for your system, and then the password. – Omer Iqbal Jun 19 '14 at 09:18
  • 1
    The only real purpose of username would be for users who didn't want to enter an email address (personally, I'm reluctant to give mine to a website unless I know it well and trust it won't be used for spamming). Obviously they have the risk of forgetting their password forever. Perhaps either-username-or-email would be a better option, though I'm not yet sure how this would be clearly presented to the user. – Paul Jun 19 '14 at 09:31
  • The question says that you are going to store email for recovery, so I'm confused. Either you are storing emails or not? – Omer Iqbal Jun 19 '14 at 09:34
  • Sorry, perhaps I should have simplified the situation to avoid confusion. The email address is optional, and is used for resetting the password. If you don't enter an email address, then you can't reset your password. Some websites already do this as it makes signing up seem risk-free. One example is reddit.com – Paul Jun 19 '14 at 09:55
  • See also [Is it a good idea to store email addresses as hash only](http://security.stackexchange.com/questions/57553/is-it-a-good-idea-to-store-email-addresses-as-hash-only) – AHalvar Jun 19 '14 at 09:56
  • @Paul I wouldn't ever allow someone to create an account without an email address as that is how they'll reset their password anyways. But ultimately email addresses and usernames aren't secrets in the first place, so I wouldn't bother trying to protect them. If you do it may just bite you in the rear when/if you need to query based on a value you hashed. – Andrew Hoffman Jun 19 '14 at 16:08
  • What happens if somebody uses the same email for multiple accounts? Which do you return when you get them all? Or do you only return one? Related to that, what does your system return if it doesn't find an email address? You need to secure signup/login/recovery pages in such a way that returned messages can't be used to harvest such information. – Clockwork-Muse Jun 20 '14 at 05:12
  • @Clockwork-Muse it is very customary to prevent the creation of an account using an e-mail that already exists in the system. – Ohad Schneider Aug 12 '17 at 13:58

4 Answers

5

Hash collisions involving non-hostile users and real-world data are vanishingly rare if you use a cryptographic hash function. For example, if you were to use SHA-1 as your hash function, you would need to have 1,200,000,000,000,000,000,000,000 users before you'd see a 50% chance of two of them having the same email hash. There may be reasons to use a method other than hashing to secure the email addresses, but collisions aren't one of them.
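As a rough check of the order of magnitude quoted above, here is a minimal sketch of the standard birthday approximation. The function name `users_for_collision_probability` is made up for illustration, and the formula is the usual approximation, not an exact count.

```python
import math


def users_for_collision_probability(hash_bits: int, probability: float = 0.5) -> float:
    """Approximate number of random inputs needed before the chance of at
    least one collision reaches `probability` (birthday approximation)."""
    space = 2.0 ** hash_bits  # number of possible hash values
    return math.sqrt(2.0 * space * math.log(1.0 / (1.0 - probability)))


# SHA-1 has a 160-bit output.
sha1_users = users_for_collision_probability(160)
print(f"{sha1_users:.2e}")
```

The result is on the order of 10^24 users, which agrees with the figure above to within a small constant factor.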

Mark
  • Yep and if there is a collision there is no telling the length of text or alphabet in use, so calculating the probability of a collision within the constraints of valid email addresses is probably ridiculously more rare than that. – Andrew Hoffman Jun 19 '14 at 16:17
  • And a hypothetical collision waiting to happen would have to be for a user that has forgotten his password, ruling out the Birthday paradox IMO. – ixe013 Jun 20 '14 at 13:29
2

I'd recommend against hashing e-mails - or at least, being unable to retrieve them. In fact, you might need to know your users' e-mail:

  1. Your website has been hacked and you must tell your users to change their passwords.
  2. Law enforcement comes along with a warrant and asks for your users' information.

As you seem to be concerned with keeping their e-mail "secure", I'd suggest the use of two databases: one for regular usage where e-mails are hashed; and a "secure" one with very limited access to store sensitive data. Either a "dropbox" storage where a process can only write but not retrieve data, or something similar to cryptodb.

That said, I'll leave the other comments to evaluate hashes and collisions, i.e. your original questions.

lorenzog
  • 2
    I'd argue #2 is the reason for NOT having the emails. Less information you have, less you need to disclose to law enforcement. Unless there are some local laws where you NEED to have that information. – domen Jun 19 '14 at 14:16
  • 1
    @domen exactly - I agree on the 'rather not have' but in some legal systems if you can't provide enough information to identify a user to the authority you could become responsible for their content. Unless it's encrypted and you don't have the key... – lorenzog Jun 19 '14 at 15:11
  • 2
    Could you list an example of such legal system? There are plenty of services that would be illegal (think various pastebins). – domen Jun 19 '14 at 15:14
  • @domen IANAL and can't really think of one off the top of my head, sorry. Pastebin is a public service tho, and I suspect different rules apply. The original question talks about usernames and passwords, which led me to believe that part of the content produced might be non-public. But I'm just speculating at this point. – lorenzog Jun 19 '14 at 15:17
1

You should definitely not store emails in plaintext.

I think the best solution is to just hash the emails, but keep your salt a secret (e.g. encrypt it and put it in a credential store). This will decrease the vulnerability of passwords being discovered should the DB be lost.

As for PGP, in PGP, a new symmetric key is generated for every encryption. If you are generating a new symmetric key, then two encryptions of the same email will not result in the same cipher text, so you can't match. You will have to decrypt. And if you do not generate a new symmetric key per encryption, then what advantage do you get by using PGP over just encrypting the email using an asymmetric key? You can just use a single symmetric key with different IVs for each email, and IV can be stored in the DB - symmetric key stays the secret. That symmetric key can be encrypted using an asymmetric key so that you can rotate your asymmetric key. This is the same algorithm as PGP although implemented differently for your scenario.

If you kept encrypted emails, the advantage is that if ever you need to send your customers an email, you can decrypt it and use it. For example, if your site had a security breach, you can contact users to inform them of the issue and ask them to change passwords.

The disadvantage is that you then have to guarantee that no one else (especially employees of the company) can look at the email addresses, and you have to decide how the addresses will be protected if an account with decryption access is hacked.

Omer Iqbal
  • Please explain how are cipher texts prone to rainbow attacks. Ditto for salt where salt is not just static for all records. – domen Jun 19 '14 at 14:18
  • I removed that. I had made some assumptions (which are not necessary) and even then, I think it was erroneous. – Omer Iqbal Jun 20 '14 at 04:15
1

Not so fast. The risk is not, in fact, about collisions. It is more about second preimages.

A collision is when someone can find two distinct inputs for a hash function, such that they hash to the same value. The attacker has control over both inputs. In your case, the attacker would compute two specially crafted email addresses, then register both, and, at a later time, would be able to get data from one account sent to the address of the other. It would not buy anything to the attacker: he already owns both accounts.

A second preimage is when the attacker is shown an input, and is challenged with finding another one which hashes to the same value. This is not the same setup as a collision attack: that time, the attacker has control over only one of the inputs, not both. This makes it more difficult. This maps to your situation: the attacker wants to register an email address which hashes to the same value as the target account's email address, so that the attacker may claim forgetfulness for that target account, and get user name and password reset link mailed back to his address.

If you use a decent hash function, then collisions are not to be feared since they are utterly improbable; and second preimages are even a lot less feasible. For a strong hash function with an output of n bits, the cost of finding a collision (combining luck and raw power) is $2^{n/2}$, and that's already technologically infeasible if n ≥ 200 or so; for second preimages, the cost rises to $2^n$, billions of billions of times higher. See this answer for more on the improbability of collisions.
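To make the gap between the two costs concrete, here is a small sketch that evaluates the generic bounds for a given output size. The helper name `attack_costs` is hypothetical, and the numbers are the generic costs for an ideal hash function, not measurements of any real attack.

```python
from typing import Tuple


def attack_costs(n_bits: int) -> Tuple[int, int]:
    """Return (collision_cost, second_preimage_cost) in hash evaluations
    for an ideal hash function with an n-bit output."""
    collision_cost = 2 ** (n_bits // 2)   # birthday bound
    second_preimage_cost = 2 ** n_bits    # exhaustive search
    return collision_cost, second_preimage_cost


collision, second_preimage = attack_costs(256)
# The second-preimage cost exceeds the collision cost by a factor of 2^(n/2).
print(second_preimage // collision == 2 ** 128)
```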


As you state, you will want your hash function to be a password-hashing function (i.e. something like bcrypt, with salts and many iterations) in order to thwart a completely different kind of attack, i.e. an attacker stealing the email hashes and cracking them to be able to spam them. It must be noted that a given function may theoretically be deemed a "good password hashing function" (i.e. secure when it comes to storing password hashes) without actually being resistant to collisions or even second preimages. Requirements for password hashing don't include all requirements for secure hashing.

A prime example is PBKDF2: as far as password hashing functions go, it is considered reasonably decent. However, it is not resistant to collisions (this is due to the fact that it uses HMAC and HMAC uses a key K which, in PBKDF2, is the password; and when the length of K exceeds that of the "block length" of the underlying hash function, then K is replaced with h(K); so a big K yields the exact same output as h(K)). Fortunately, you don't mind collisions; you just need resistance to second-preimages, and PBKDF2 will be fine.
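The PBKDF2 collision described above can be reproduced with Python's standard `hashlib`. This is a minimal sketch: the salt, iteration count, and sample password are arbitrary values chosen for the demonstration.

```python
import hashlib

SALT = b"example-salt"   # arbitrary demo value
ITERATIONS = 1000        # kept low so the demo runs quickly

# A password longer than SHA-1's 64-byte block length...
long_password = b"x" * 100
# ...is replaced inside HMAC by its SHA-1 digest, so that digest,
# used directly as a password, yields the same HMAC key.
short_password = hashlib.sha1(long_password).digest()

derived_long = hashlib.pbkdf2_hmac("sha1", long_password, SALT, ITERATIONS)
derived_short = hashlib.pbkdf2_hmac("sha1", short_password, SALT, ITERATIONS)

print(derived_long == derived_short)
```

The two distinct inputs should derive the same output, which is a collision but not a second preimage against a password the attacker did not choose.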

This point illustrates the need to use precise terminology when dealing with cryptography. If you did not understand the details, then this illustrates it even more: cryptography is subtle.


Summary: use bcrypt or PBKDF2, and the risk you are fearing is non-existent. It won't happen in practice; attackers won't be able to force it. You should not worry about it, because there are other "risks" which are billions of billions of times more probable, and that you don't worry about (or go buy a shotgun!).

As a side note, you will want to normalize email addresses (e.g. force lowercase) before hashing, because at least part of an email address is case-insensitive, and you cannot expect users to always use the same casing for their own email address.
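A minimal sketch of such a normalization step follows. The function name `normalize_email` is made up here, and lowercasing the whole address is a simplifying assumption: strictly speaking only the domain part is guaranteed to be case-insensitive, though most providers treat the local part the same way.

```python
def normalize_email(address: str) -> str:
    """Trim surrounding whitespace and force lowercase so that the same
    address always produces the same hash input."""
    return address.strip().lower()


print(normalize_email("  Paul@Example.COM "))
```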


As another side note, since bcrypt/PBKDF2 are expensive functions, you will want to hash a submitted email address only once -- meaning that you must know what salt to use; you cannot afford to hash it one thousand times if you have one thousand stored email addresses. Therefore, one has to assume that the user who forgot his password actually remembers his user name, so that your server will compute the proper hash with the correct salt. This is the assumption I have used above.
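Under that assumption, the lookup could be sketched as follows. This is only an illustration: the in-memory `users` dictionary stands in for a database table, the function names are hypothetical, and PBKDF2-HMAC-SHA256 with 100,000 iterations is an arbitrary choice from the standard library rather than a recommendation.

```python
import hashlib
import hmac
import os

ITERATIONS = 100_000   # arbitrary demo value
users = {}             # username -> (salt, email_hash); stands in for a DB table


def hash_email(email: str, salt: bytes) -> bytes:
    """Slow, salted hash of the normalized email address."""
    normalized = email.strip().lower().encode("utf-8")
    return hashlib.pbkdf2_hmac("sha256", normalized, salt, ITERATIONS)


def register(username: str, email: str) -> None:
    salt = os.urandom(16)  # per-user salt, stored in the clear
    users[username] = (salt, hash_email(email, salt))


def email_matches(username: str, email: str) -> bool:
    """Hash the submitted address once, with the salt found via the username."""
    record = users.get(username)
    if record is None:
        return False
    salt, stored_hash = record
    return hmac.compare_digest(stored_hash, hash_email(email, salt))


register("paul", "Paul@example.com")
print(email_matches("paul", "paul@EXAMPLE.com"))
```

Because the username selects the salt, only one expensive hash is computed per reset request, regardless of how many accounts exist.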

Alternatively, don't use salted hashing, so that you may hash the email address generically, and use the resulting hash value as index in your database. However, this weakens the resistance of your hashes against the second attack type: when an attacker manages to steal a copy of your database (SQL injection, lost backup tape...), then he will find it easier to "reverse" the hashes and recover the email addresses. You have to choose your poison...

In that email-hash-as-index case, you again have to worry about collisions, because an attacker finding a collision would be able to force a situation where your server is trying to record an email-hash-indexed entry and finds an existing entry with the same hash -- depending on how you implement it, this may or may not be a problem. In fact, the "shared salt" model implies that password hashing functions are not a good fit. We are back to the "crypto design" phase and this requires a lot more thought. If you really want to go down that road then you can expect difficulties.

(As a starting point, I would envision a custom nested hash as h(h(h(...h(email)...))) with h = SHA-256, and some thousands or millions of iterations, but a lot more thinking time would be needed before deploying it in production.)
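The nested construction mentioned above might look like the sketch below. The function name `iterated_sha256` and the default iteration count are assumptions made for illustration, and as the answer says, this is a starting point rather than a vetted design.

```python
import hashlib


def iterated_sha256(email: str, iterations: int = 100_000) -> str:
    """Unsalted h(h(...h(email)...)) with h = SHA-256, usable as a
    deterministic database index for the normalized address."""
    digest = email.strip().lower().encode("utf-8")
    for _ in range(iterations):
        digest = hashlib.sha256(digest).digest()
    return digest.hex()


index_key = iterated_sha256("Paul@Example.com")
print(len(index_key))  # hex length of a 256-bit digest
```

Since there is no per-user salt, the same address always maps to the same index key, which is what makes the lookup possible and also what makes a stolen database easier to attack.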

Tom Leek
  • "a given function may theoretically be deemed a "good password hashing function" (i.e. secure when it comes to storing password hashes) without actually be resistant to collisions or even second-preimages [...] Fortunately, you don't mind collisions; you just need resistance to second-preimages, and PBKDF2 will be fine" - isn't that a contradiction? AFAIU second-preimages are bad for password hashing, as I could use the second pre-image and it would be accepted as the password... – Ohad Schneider Aug 12 '17 at 14:07
  • What about peppered hashing (i.e. basically use the same secret salt for all addresses)? That way you can still use it as an index (the same e-mail would always map to the same hash) but rainbow tables would still be useless. Actually, bruteforce would be useless as well if all your attacker has is SQL injection and/or lost backup tapes (as he wouldn't have access to the pepper in the source code / binary). You could even place the pepper in an HSM-backed secret store as Omer suggests in his answer, in which case even source control / binary access wouldn't be enough. – Ohad Schneider Aug 12 '17 at 14:12
  • One more thing - isn't key stretching overkill for e-mail addresses? Maybe if the service in question was one people would not want to be associated with AND some PII was stored alongside it. I suppose it would make phishing attacks more potent, as the attacker could pose as a site the user really has an account in (which might be easier to pull than posing as a site the user also has an account in like Facebook). Otherwise attackers presumably would not work so hard for e-mail addresses when e-mail harvesting is so accessible? – Ohad Schneider Aug 12 '17 at 14:21
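The peppered-hash idea raised in the comments could be sketched with a keyed hash such as HMAC-SHA256. This is an assumption about how one might implement it, not something the commenters specified: the function name `peppered_index` is made up, and the pepper below is a placeholder that would in practice be loaded from a secret store outside the database.

```python
import hashlib
import hmac

# Placeholder secret; in practice this would come from a secret store.
PEPPER = b"replace-with-a-long-random-secret"


def peppered_index(email: str, pepper: bytes = PEPPER) -> str:
    """Keyed hash of the normalized address: deterministic, so it can serve
    as a database index, but not computable without the pepper."""
    normalized = email.strip().lower().encode("utf-8")
    return hmac.new(pepper, normalized, hashlib.sha256).hexdigest()


print(peppered_index("Paul@Example.com") == peppered_index("paul@example.com"))
```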