Hashing to protect privacy

Question

I would like to know whether hashing protects privacy well enough, i.e., whether a hash of PII is still considered to be personally identifying information (PII).

I found this article: https://www.johndcook.com/blog/2019/07/20/hashing-pii-does-not-protect-privacy. It states that hashing does not protect privacy because the input space is small and an attacker can make educated guesses (brute force). I get that.

What I don't get is why using a salt (and not storing it with the data of course) is "useless". From the article:

If you throw the salt away, then you’ve effectively replaced each identifier with a random string. Then you’ve effectively removed these columns since they’re filled with useless random noise. Why not just delete them?

If you just want to check if the provided data was the same as last time (like with passwords) I don't think it's useless and it solves the brute force guessing attack.

Am I missing something or can the stored data be considered secure enough, i.e., it's no longer PII?

If you throw the salt away, you cannot check if it is the same as last time anymore. — nobody, Apr 08 '22 at 09:27
You face the problem of [the short input space of the keyless hash functions](https://crypto.stackexchange.com/a/81652/18298). Consider using HMAC or similar keyed hash functions. — kelalaka, Apr 08 '22 at 16:29
You're right. Throwing it away and keeping it to yourself (i.e. store it offline somewhere) are of course different things. :) — Bart, Apr 11 '22 at 11:17

score 3 · Answer 1 · answered Apr 08 '22 at 09:58

A salt protects against an attacker who uses a rainbow table of pre-computed hashes.

But, if the space of the hashed values is small enough, the hashes can be reversed by brute-force regardless of whether they are salted or not.

For example, take credit card numbers, which are considered to be PII. Credit card numbers are 16 digits in length, so there are (only) 10^16 possible credit card numbers. (It's actually much smaller than this, because of the Luhn checksum in credit card numbers, but let's use 10^16 anyway).
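To illustrate why the Luhn checksum shrinks the space, here is a minimal sketch of the check. The function name `luhn_valid` and the sample prefix are illustrative choices, not part of any particular system. For any fixed 15-digit prefix, exactly one of the ten possible final digits passes, so the checksum alone reduces 10^16 candidates to 10^15.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    # Walk the digits right to left, doubling every second one.
    for position, char in enumerate(reversed(number)):
        digit = int(char)
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as summing the two digits of the product
        total += digit
    return total % 10 == 0


# For one fixed 15-digit prefix, see which final digits give a valid number.
prefix = "510510510510510"
valid_endings = [d for d in "0123456789" if luhn_valid(prefix + d)]
print(valid_endings)  # ['0']
```

An attacker can skip every candidate that fails this check before hashing it.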

Suppose the system is breached, and the value stored in the salted_hashed_credit_card_number field of one of the records in the database is:

8947b8ef2ae54741eed7b359442cafdc:172e90126cccfe4a8117cce66a50aef96bb8f2263b838b175f072045132ab1d2

An attacker looks at the code and sees that ':' is the delimiter, that the salt is the first value and the salted hashed credit card number is the second, and that credit card numbers are salted and hashed using a single round of SHA-256 (PBKDF2-HMAC-SHA256 with one iteration).

Using ASIC technology, an attacker can build a rig that does 100 terahashes per second (10^14 hashes per second) for a cost of a few thousand dollars. At that rate, the attacker's rig can iterate through all 10^16 credit card numbers in the space in about 100 seconds, trying each one using a process like the one in the Python script below, until it finds the one that produces the value above stored in the database. Sooner or later, it will find the correct card number, which is: 5105105105105100. This can be confirmed using the script below:

import hashlib

# Inputs: the candidate card number and the salt read from the stored value.
credit_card_number = '5105105105105100'
salt_hex = '8947b8ef2ae54741eed7b359442cafdc'

credit_card_number_bytes = credit_card_number.encode('utf-8')
salt_bytes = bytes.fromhex(salt_hex)

# One iteration of PBKDF2-HMAC-SHA256, matching what the application does.
salted_hash = hashlib.pbkdf2_hmac('sha256', credit_card_number_bytes, salt_bytes, 1)
salted_hash_hex = salted_hash.hex()

# Rebuild the stored value in the database's "salt:hash" format.
stored_value = salt_hex + ':' + salted_hash_hex
print('stored value', stored_value)
print('salt', salt_hex)
print('credit card number', credit_card_number)

which produces:

stored value 8947b8ef2ae54741eed7b359442cafdc:172e90126cccfe4a8117cce66a50aef96bb8f2263b838b175f072045132ab1d2
salt 8947b8ef2ae54741eed7b359442cafdc
credit card number 5105105105105100
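The search itself can be sketched as a simple loop. This is only an illustration under stated assumptions: so that it finishes instantly, it pretends the attacker already knows the first 12 digits and only searches the last 4, and it computes the target hash itself rather than reading it from a database. The names `salted_hash` and `brute_force` are hypothetical.

```python
import hashlib


def salted_hash(card_number: str, salt: bytes) -> bytes:
    # Same construction as the script above: PBKDF2-HMAC-SHA256, one iteration.
    return hashlib.pbkdf2_hmac('sha256', card_number.encode('utf-8'), salt, 1)


def brute_force(target: bytes, salt: bytes, known_prefix: str, unknown_digits: int):
    """Try every value of the unknown trailing digits until the hash matches."""
    for candidate in range(10 ** unknown_digits):
        guess = known_prefix + str(candidate).zfill(unknown_digits)
        if salted_hash(guess, salt) == target:
            return guess
    return None  # no candidate in the searched space matched


salt = bytes.fromhex('8947b8ef2ae54741eed7b359442cafdc')
target = salted_hash('5105105105105100', salt)  # stands in for the leaked value

# Assumption for the demo: the first 12 digits are known, 4 are searched.
recovered = brute_force(target, salt, known_prefix='510510510510', unknown_digits=4)
print(recovered)  # 5105105105105100
```

The full attack is the same loop with no known prefix; only the number of candidates changes, and the salt does not slow it down because the attacker has it.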

Using a more resource-intensive hash function (such as bcrypt, scrypt, or argon2) and/or using multiple rounds of hashing are effective ways of mitigating brute force attacks against salted hashing.
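As a rough sketch of the "multiple rounds" option, the same `hashlib.pbkdf2_hmac` call can be given a much larger iteration count. The figure of 200,000 iterations and the helper name `timed_hash` are illustrative assumptions, not a recommendation; timings will vary by machine.

```python
import hashlib
import time

salt = bytes.fromhex('8947b8ef2ae54741eed7b359442cafdc')
card = b'5105105105105100'


def timed_hash(iterations: int):
    """Hash the card number and report how long one guess takes."""
    start = time.perf_counter()
    digest = hashlib.pbkdf2_hmac('sha256', card, salt, iterations)
    return digest, time.perf_counter() - start


fast_digest, fast_seconds = timed_hash(1)        # what the example system uses
slow_digest, slow_seconds = timed_hash(200_000)  # illustrative iteration count

print(f'1 iteration:        {fast_seconds:.6f} s per guess')
print(f'200,000 iterations: {slow_seconds:.6f} s per guess')
```

Each guess now costs the attacker roughly 200,000 times more work, and so does each legitimate verification. With an input space as small as card numbers, this raises the cost of the attack but does not make the space any larger.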

Thanks for the answer. Very clear. If I'm correct your answer implies that the salt is stored with the data and can be used by the attacker to brute force. Does it hold if the salt is unknown to the attacker? — Bart, Apr 11 '22 at 11:14
I'm glad it helped. Yes, my answer implies that the salt is stored with the salted-hashed data. And yes, the attacker needs the salt to reverse the hash by brute force. Without the salt, brute forcing is much harder for the attacker. But without the salt, it is also impossible for your system to verify an input (such as a password) against the stored salted-hashed value. — mti2935, Apr 11 '22 at 12:11
So, if verifying that an input (such as `5105105105105100`) matches a stored value (such as `8947b8ef2ae54741eed7b359442cafdc:172e90126cccfe4a8117cce66a50aef96bb8f2263b838b175f072045132ab1d2`) is not a requirement for your system, then storing the salt is not necessary. But, in that case, why bother with salted hashing in the first place; and instead, why not just store random values in these fields? — mti2935, Apr 11 '22 at 13:47
The salt is stored, but not with the data itself. Verifying that the hash of the input matches the stored value will therefore still be possible. If the salt is stored where an attacker cannot get to it, privacy can be considered protected with hashing? — Bart, Apr 14 '22 at 07:20
@Bart, that makes sense. Yes, in that case, verifying that the hash of the input matches the stored value will still be possible *offline*. And if the salt is stored where an attacker cannot get to it, then that protects against brute force attacks on users' PII if the database is leaked. Whether or not that protects users' privacy is another question, because there is much more to protecting users' privacy than just protecting against this one attack vector. — mti2935, Apr 14 '22 at 09:07
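The setup discussed in these comments, a secret value kept away from the database, is essentially a keyed hash, as kelalaka's comment on the question suggests with HMAC. A minimal sketch follows, assuming the key lives somewhere the database attacker cannot reach. The names `secret_key`, `pseudonymize`, and `matches` are hypothetical, and generating the key in the script is only for the demo.

```python
import hashlib
import hmac
import secrets

# Assumption: in a real system this key would be loaded from separate,
# protected storage. It is generated here only so the sketch runs.
secret_key = secrets.token_bytes(32)


def pseudonymize(value: str, key: bytes) -> str:
    """Return the HMAC-SHA256 of the value under the secret key, as hex."""
    return hmac.new(key, value.encode('utf-8'), hashlib.sha256).hexdigest()


def matches(value: str, stored: str, key: bytes) -> bool:
    """Check an input against a stored value using a constant-time compare."""
    return hmac.compare_digest(pseudonymize(value, key), stored)


stored = pseudonymize('5105105105105100', secret_key)

print(matches('5105105105105100', stored, secret_key))  # True
print(matches('4111111111111111', stored, secret_key))  # False
```

Anyone holding only the stored values cannot test guesses without the key, while the system that holds the key can still verify inputs. If the key leaks together with the database, the brute-force attack from the answer applies again.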
