Using a hash in place of user data

Question

I'm storing client data, and the client is sensitive about the privacy and security of that data. In some cases I don't need the actual data and could work with a hash of it. Take a user's email address, for example: our application has no need for the address itself except to compare it for equality to find records about the same person.

So to minimise the exposure of that data, I was thinking of replacing the email with a bcrypt hash of the email before saving it to the database. That way I don't store the address itself, but can still compare like records, and if the client wants to look up a particular email they can type it in and still search for it.

But we will have hundreds of thousands of records, so the computational cost of bcrypt would quickly become a problem when cross-referencing records.

I'm thinking of just using the weaker MD5 instead, since it's faster, but wanted to check my thinking:

Does the reduced difficulty of MD5 vs Bcrypt defeat the purpose of hashing in the first place, or is it a valid trade-off in this case?
Does this approach in general have a security catch or loophole that I may have overlooked?

One minor gotcha: capitalization. My phone's "auto-correct" has a bad habit of capitalizing the first letter of email addresses when I type them in. You would have to be really careful to standardize capitalization and character sets, or bad things will happen. – AstroDan, Jun 28 '16 at 13:53
bcrypt is more of a compare tool than a hash tool; you can feed bcrypt 5 different strings and all will match a sixth. That's fine for validating passwords, but not so much for matching hashes using the literal compares DBs like. Maybe if you can override its salt generation you can get comparable, repeatable values. – dandavis, Jun 29 '16 at 06:01
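
The capitalization point in the comment above can be sketched as a normalization step that runs before any hashing. This is only an illustration: `normalize_email` is a hypothetical helper name, and folding case across the whole address assumes your mail providers treat the local part case-insensitively, which is common but not guaranteed.

```python
import unicodedata


def normalize_email(raw: str) -> str:
    """Return a canonical form of an email address, suitable as hash input."""
    # NFKC folds compatibility characters (e.g. full-width letters) into one form.
    text = unicodedata.normalize("NFKC", raw)
    # Strip stray whitespace, then case-fold so "Foo@..." and "foo@..." match.
    return text.strip().casefold()


print(normalize_email("  Foo@Example.COM "))
```

Every email would pass through the same function both when a record is stored and when a lookup is made; otherwise two spellings of one address produce two unrelated hashes.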

score 1 · Answer 1 · edited Mar 17 '17 at 13:14

MD5 is better than plaintext, but only marginally.

If you use bcrypt with a per-record salt, then to find all records with the email foo@example.com you would need to hash that email once per record, using that record's unique salt. That would quickly get out of hand and, as you note in your question, would not work.
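
A rough sketch of why the per-record salt scales badly. It swaps in PBKDF2 from Python's standard library as a stand-in for bcrypt, with a deliberately low iteration count so it runs quickly; the record layout and the helper names are assumptions made for illustration only.

```python
import hashlib
import os

ITERATIONS = 1_000  # kept low for the demo; a real work factor would be far higher


def slow_hash(email: str, salt: bytes) -> bytes:
    # PBKDF2 stands in for bcrypt here; both are deliberately slow.
    return hashlib.pbkdf2_hmac("sha256", email.encode("utf-8"), salt, ITERATIONS)


def make_record(email: str) -> dict:
    salt = os.urandom(16)  # unique random salt per record
    return {"salt": salt, "hashed_email": slow_hash(email, salt)}


def find_by_email(records: list, email: str) -> tuple:
    """Return (matching records, number of slow-hash calls made)."""
    matches, calls = [], 0
    for record in records:
        calls += 1  # one slow hash per record, because each salt differs
        if slow_hash(email, record["salt"]) == record["hashed_email"]:
            matches.append(record)
    return matches, calls


records = [make_record(f"user{i}@example.com") for i in range(50)]
matches, calls = find_by_email(records, "user7@example.com")
print(len(matches), calls)
```

A single lookup costs one slow hash per stored record, and a join between two such tables would multiply that cost again.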

What you can do instead is use a constant salt that is the same for all records. It is then no longer called a salt, but a pepper. The value of the pepper should be random and treated with the same care as a cryptographic key, because an attacker who does not have it will find brute-forcing the hashes practically impossible.

It is important to understand that a pepper is not as secure as a salt: a brute forcer in possession of it gets the same speed-up you get when searching, since each guess only has to be hashed once rather than once per record. But it is still a lot better than using a fast algorithm like MD5 or SHA-256.

A practical note: I'm not sure whether all bcrypt implementations allow you to specify the salt yourself, and all of them include the salt in their output. You need to cut that part off before you store the hash, since the pepper should not be stored in the database.
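
A minimal sketch of the pepper idea, under some loud assumptions: it uses PBKDF2-HMAC-SHA256 from Python's standard library rather than bcrypt (so it runs without third-party packages), the iteration count is arbitrary, and the pepper shown is a placeholder that would really come from a secrets store.

```python
import hashlib

ITERATIONS = 100_000  # assumed work factor; tune for your hardware


def peppered_hash(email: str, pepper: bytes) -> str:
    """Slow, deterministic hash of an email using one secret pepper for all records."""
    digest = hashlib.pbkdf2_hmac("sha256", email.encode("utf-8"), pepper, ITERATIONS)
    return digest.hex()


# Placeholder only: load the real pepper from configuration outside the database.
pepper = bytes.fromhex("00112233445566778899aabbccddeeff")

a = peppered_hash("foo@example.com", pepper)
b = peppered_hash("foo@example.com", pepper)
c = peppered_hash("bar@example.com", pepper)
print(a == b, a == c)
```

Because the same input always yields the same output, the stored value can be indexed and compared with plain equality. Unlike bcrypt's encoded output, the PBKDF2 digest does not embed the salt, so in this variant there is nothing to cut off before storing.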

score 0 · Accepted Answer · answered Jun 28 '16 at 14:02

There are two things going on in my interpretation of this. You have an identifier and a message. A client may need to recall a message, but to do so they need a replacement identifier that maintains their privacy, something strong enough that it could not be guessed. E.g.:

ORIGINAL MESSAGE

FROM johndoe@somewhere.com
MSG: This product is giving me an issue

I infer you would want your client to be able to bring up this message, or any other, for whatever purpose, but since they don't want their personal data stored, you are seeking to create something like this:

FROM 7d9065d7076298c54b45b2672797cc7b
MSG: This product is giving me an issue

If this is the case, the primary concern would be if someone could figure out the hash 7d9065d7076298c54b45b2672797cc7b (which is the md5 result of johndoe@somewhere.com). E.g.:

$ echo johndoe@somewhere.com | md5
7d9065d7076298c54b45b2672797cc7b

While md5 is considered insecure, this is because a collision can be generated. In this usage, an attacker gains little by creating a collision. They'd need to be able to determine the value behind the hash, not collide with it. In practice you should be ok, but you would be better served by either adding a salt (var result = md5(salt+string);) or just bumping the hash up to SHA512. Remember, the goal is to protect the identifier (the email address), which can be done to a decent degree against a standard/common/simple attack. Whether or not it can withstand someone with resources and intent is a different question.
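
To make "determine the value of the hash" concrete, here is a small sketch of a guessing attack against an unsalted digest, next to the md5(salt+string) variant mentioned above. The candidate list and the salt value are made up for illustration. Note also that `echo` appends a newline, so a digest computed from the bare string need not match the one shown earlier.

```python
import hashlib


def md5_hex(text: str) -> str:
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def salted_md5_hex(salt: str, text: str) -> str:
    # Mirrors the md5(salt+string) form suggested in the answer.
    return hashlib.md5((salt + text).encode("utf-8")).hexdigest()


stored = md5_hex("johndoe@somewhere.com")

# Hypothetical list of addresses an attacker might try (e.g. from a leaked list).
candidates = ["alice@example.com", "johndoe@somewhere.com", "bob@example.org"]

# Unsalted: hashing each guess and comparing recovers the address directly.
recovered = next((c for c in candidates if md5_hex(c) == stored), None)
print(recovered)

# Salted with a value the attacker does not know: the same guesses no longer match.
stored_salted = salted_md5_hex("hypothetical-secret-salt", "johndoe@somewhere.com")
recovered_salted = next((c for c in candidates if md5_hex(c) == stored_salted), None)
print(recovered_salted)
```

The attack needs no collision at all, only a plausible list of addresses and a fast hash. The salted variant only helps for as long as the salt stays secret.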

If you meant a hash on both the message and the sender, that is not doable. You can hash the sender and encrypt the message. Even then, if the system housing this data is not itself protected (least privilege enforced, vulnerabilities patched), it becomes a moot point.


munkeyoto


Suggesting that unsalted MD5 hashes provide any real security is misleading. – Neil Smithline Jun 28 '16 at 15:15
@NeilSmithline would you be kind as to point out where I stated unsalted hashes don't provide real security? I stated: "md5 is considered insecure because collisions can be generated" and then explained why this is NOT an issue in this case. I then provided the additional: you could use md5 with a salt in the event someone brought up: "but md5 is considered insecure" at NO POINT IN TIME did I state unsalted md5 hashes do not provide any real security so please re-read what I wrote – munkeyoto Jun 28 '16 at 15:20
`In practice you should be ok` sounds like a recommendation to me. You also recommend using unsalted SHA512. That would be open to precomputation attacks like rainbow tables. – Neil Smithline Jun 28 '16 at 15:23
@NeilSmithline "In practice you should be ok" was a reference to: "someone may nag about md5 being insecure" as for the rainbow tables comment, any hash is attackable depending on who the attacker is. There is a cost associated with trying to determine the value of the hash. Against common attacks "In practice you should be ok" you infer things how you'd like at this point. – munkeyoto Jun 28 '16 at 15:27
Thank you for the detail in your answer, munkeyoto. Marking this one as accepted because it was the details that led me to realise that in my use case I can still use bcrypt. In my case the main operation of concern will be performing table joins on User.hashed_email. The records will already be hashed, so no additional cost is incurred for the join if I use a pepper as described in Anders' answer. – ChrisJ Jun 28 '16 at 22:18
"While md5 is considered insecure, this is because a collision can be generated. In this usage, an attacker yields little in creating a collision. They'd need to be able to determine the value of the hash, not collide it." what you are describing is a second-preimage collision attack, to which MD5 is not vulnerable in the first place anyway (it is vulnerable against collision attacks, in which _both_ inputs are selected by the attacker). – Ohad Schneider Aug 12 '17 at 17:03
