23

UPDATED

We have an unusual scenario: we have several old databases of user accounts, and we'd like a new system to be able to connect these old accounts to new accounts, if the user wishes it.

So for example, on System X you have an old account with an old (let's say) RPG character. On System Y you have another old account, with another RPG character on it.

On our new system, with their new account, we'd like our users to be able to search these old databases and claim their old RPG characters. (Our users want this functionality, too.)

We'd like to keep users' old account PII in our database for the sole purpose of allowing them to reconnect old accounts to their new accounts. This would benefit them and be a cool feature, but under GDPR and our privacy policy we will eventually need to delete this old PII from our databases.

BUT - What if we stored this old PII in such a way that it was irreversible? I.e. only someone who already has the information would ever get a positive match.

I'm not a security expert, but I understand that simple hashing (e.g. MD5) is far too easy to hack (to put it mildly), and (technically) doesn't require "additional information" (i.e. a key).

The good thing about MD5 is that it's fast (in the sense that it's deterministic), meaning we could scan a database of 100,000s of rows very quickly looking for a match.

If MD5 (and SHA) are considered insecure to the point of being pointless, what else can we do to scan a database looking for a match? I'm guessing modern hashing, like bcrypt, would be designed to be slow for this very reason, and the fact that it's not deterministic means that it's unsuitable.

If we merged several aspects of PII into a field (eg. FirstnameLastnameEmailDOB) and then hashed that, it would essentially become heavily salted. Is this a silly solution?

Django Reinhardt
  • 2
    Why do you need to pseudonymize them? You might have specific need to, but it is not a typical thing to need to do in this use case. – schroeder Jan 23 '19 at 12:34
  • @schroeder Sorry I thought I'd explained. Some of this PII is about to expire as per our privacy policy. Pseudonymization would allow us to keep this functionality without keeping their data. – Django Reinhardt Jan 23 '19 at 13:52
  • 6
    Yep, that is a great situation for this use case. Kudos to your team for such great understanding of your policies! – schroeder Jan 23 '19 at 13:54
  • 17
    "The good thing about MD5 is that it's fast, however, meaning we could scan a database of 100,000s rows" - not sure how the speed of MD5 plays a part here, since you are presumably only hashing the email once and searching a database of hashed emails? (And the DB search presumably uses an index...?) – MrWhite Jan 23 '19 at 16:25
  • 3
    Isn't the point of that bit of the GDPR specifically to stop this? If I tell you "delete everything you have on me, GDPR says so", I want that gone from your records and never again relateable to me. I don't want an undo button for that. – Adam Barnes Jan 24 '19 at 14:11
  • @MrWhite It's not the reading 100,000s of rows, it's the looking for a match. MD5 is deterministic, so the hashed email would be the same every time -- extremely fast to find a match. – Django Reinhardt Jan 24 '19 at 14:48
  • @AdamBarnes The GDPR is about having control over your personally identifiable information. The point of this exercise is to remove your PII and replace with something anonymous -- an irreversible hash. I'm hoping that's possible. – Django Reinhardt Jan 24 '19 at 14:49
  • I agree with @AdamBarnes here. Since you own the salt, it's still possible to undo the deletion. It's not too hard to guess an email. –  Jan 24 '19 at 14:49
  • @sboesch So someone with the salt would be able to reveal the raw email addresses? Hmm :( I guess this is why someone recommended [pepper](https://en.wikipedia.org/wiki/Pepper_(cryptography)), but I don't know how this could be implemented in this instance. Maybe this isn't possible at all. – Django Reinhardt Jan 24 '19 at 14:50
  • @DjangoReinhardt "It's not the reading 100,000s of rows, it's the looking for a match" The point is that it doesn't matter which algorithm you use, because you only have to run it once (per time that a user tries to link accounts). (1) User requests link, (2) Hash their email address (once), (3) Check if the hash matches any from your indexed rows. The speed of the algorithm only affects step (2), which is a one-time step, so the effect will be negligible. – Jon Bentley Jan 24 '19 at 16:27
  • @JonBentley No, this is only true if the hashing is deterministic (ie. you get the same hash every time). More secure hashing algorithms, like bcrypt, are non-deterministic. – Django Reinhardt Jan 24 '19 at 16:53
  • Hi Django, what did you end up implementing? – Ama Apr 26 '20 at 20:37

3 Answers

37

MD5 or SHA is not the concern. Hashes can be used for pseudonymization. The problem is that the hash would need to be salted (or peppered) so that data from other sources could not be used to identify the person.

My email is the same everywhere. A hash of it would also be the same. So that means that, in this case, the hash and my email become synonymous. Just like a username and the legal name of a person if paired. If you use a hash in this case, you actually gain nothing in terms of GDPR.

Hashing with a salt (or pepper) makes de-anonymising nearly impossible without knowing the added value. The salt (or pepper) almost becomes the token, in this case.
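
To make the idea concrete, here is a minimal sketch of a peppered, deterministic hash used as a lookup token. It assumes HMAC-SHA256 as the keyed hash; the names `PEPPER`, `normalize_email`, `email_token` and `claim`, and the dictionary standing in for an indexed database column, are all hypothetical. Treat it as an illustration under those assumptions, not a vetted design.

```python
import hashlib
import hmac

# Hypothetical secret. In practice it would live in a secrets store,
# kept separate from the database that holds the tokens.
PEPPER = b"replace-with-a-long-random-secret"


def normalize_email(email: str) -> bytes:
    """Trim and lowercase so the same address always yields the same token."""
    return email.strip().lower().encode("utf-8")


def email_token(email: str, pepper: bytes = PEPPER) -> str:
    """Deterministic keyed hash (HMAC-SHA256) of an email address."""
    return hmac.new(pepper, normalize_email(email), hashlib.sha256).hexdigest()


# Stand-in for an indexed database column: token -> old record.
stored_tokens = {email_token("alice@example.com"): "old-character-42"}


def claim(email: str):
    """Return the old record linked to this email, or None if nothing matches."""
    return stored_tokens.get(email_token(email))
```

Because the token is deterministic, the lookup is a single indexed query. Without the pepper, a hash of the same email from another system would not line up with these tokens.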

As always, check with your DPO.

schroeder
  • 123,438
  • 55
  • 284
  • 319
  • 1
    I get it, the token is the normal case. I think that a salted hash might be the most useful to you and still be searchable and disconnected from the email address. – schroeder Jan 23 '19 at 13:59
  • 2
    You probably should still use a password hash not one designed for speed. Email addresses follow common patterns and may only have very short unique parts; which would leave some of them equivalent to short passwords that can be bruteforced if only protected by a single pass of MD5 or SHA. – Dan Is Fiddling By Firelight Jan 23 '19 at 16:03
  • 5
    "Hashing with a salt makes de-anonymising nearly impossible without knowing the salt." Since the salt is usually stored right next to the hash, shouldn't it be assumed that the salt is known? – kapex Jan 23 '19 at 17:33
  • @kapex absolutely. But it means that the salted hash can be used outside of the system safely. You would need to know the salt for the hash to be tied to the email. – schroeder Jan 23 '19 at 17:38
  • 9
    For efficient database lookups, consider using a [pepper](https://en.wikipedia.org/wiki/Pepper_(cryptography)) instead. – Maya Jan 23 '19 at 20:49
  • 6
    @DanNeely using a password-grade hash and a proper salt (unique for each user) would make the lookups prohibitively expensive; with password verification, you have already selected the user and know which salt to use, but in this case, you don't know which user it is and so have to try *all* of the salts – kbolino Jan 24 '19 at 02:42
  • 2
    @kbolino the lookup should still be fast; as NieDzejkob pointed out, you just can't use a unique salt. Since the actual recovery process should rarely be run, you can compensate for that with much higher difficulty factors than would otherwise be acceptable for a login. 10 or 20 seconds to hash the candidate email is fine, since once you've done it once you can do a fast DB lookup afterward; while the extreme slowness of the hash means that even without the need to do each user separately a brute force attack is prohibitively expensive. Just rent a big cloud VM for the initial seeding. – Dan Is Fiddling By Firelight Jan 24 '19 at 03:09
  • I wonder what kind of information a hashed email would be, technically, according to the GDPR. It's not really pseudonymous, because the process can't be reversed easily (the GDPR doesn't say that pseudonymization shouldn't be reversible, in fact, it says it can be reversed using other info that you must keep separate). It's not really anonymous either, because it could be de-anonymized by bruteforce. And it's not the same as data erasure or data minimization either. – reed Jan 24 '19 at 17:57
  • Salt is not needed and would drastically reduce the performance of the database lookup. Salts only serve as a protection when multiple users have the same password, as it would produce the same hash if not salted, but emails are not shared by multiple users. A pepper makes much more sense in terms of performance and provides the same level of security – Mr. E Jan 24 '19 at 20:23
  • @Mr.E "Salts only serve as a protection when multiple users have the same password as it would produce the same hash if not salted" exactly - and that means that the email could be correlated against other known emails in other potential systems. Hence the need for something to mutate the hash. If you want a pepper, sure, but I'm not sure what your comment adds to the other comment about peppers. – schroeder Jan 24 '19 at 20:26
4

Realistically, pseudonymization is any method of obfuscating someone's PII/NPI so that it can't reasonably be traced back to a specific individual. GDPR doesn't dictate which hashing algorithm you must use to comply with its standard, and to be honest, it's best that it doesn't: if everyone used the exact same method of obfuscation, you'd create a massive single point of failure. Your best bet (as mentioned above) is to use some form of tokenization with a salt, which adds extra randomness so that it can't be easily brute-forced.

  • 8
    From an information security perspective, the idea that it's bad to have a single widely used obfuscation method is dubious (it's either secure or not). However, it *is* accurate that standardizing the method by law could pose a problem, since it could become outdated. – Christoph Burschka Jan 23 '19 at 16:18
  • 1
    The legislation that the GDPR replaced (the data protection directive 95/46/EC) is over 20 years old. IIRC, in the mid-1990s, MD5 was a pretty decent choice, and certainly among the better that were generally available; these days it's considered horribly inadequate, and even SHA-1 (which was designed to replace it) is a bad choice. Who knows what will happen to hash algorithms in the next 20-25 years? I agree, mandating any particular method or algorithm in the regulations themselves would be a bad thing to do. – user Jan 24 '19 at 09:42
0

The problem with hashing emails is that they are usually short and easy to brute-force.

If you use a salt, by definition it is a public "key", so you do not add anything in terms of protection. Because GDPR requires that you yourself be unable to trace back your customers, you are both the defender and the attacker here, so any pepper or password is of little use against yourself.

The real problem is brute-force. I am no expert in security, but the solution we are currently considering for our own issue, which is similar to yours, is the following: for each user email, apply a hashing algorithm N times, where N is a random number between Min and Max. When looking up in your database, take the email provided by your user and hash it Min times, then look up, then hash again, then look up, and so on until you either have a match or reach Max iterations.

The advantage of having N vary for each database entry is that a brute-force attacker needs to hash Max times for every candidate they try, whilst if you have the right email you can expect a lookup hit after about (Min+Max)/2 hash iterations on average. So on average, you make the attacker's life harder than yours. That's assuming your database lookups are faster than each hashing iteration.
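
A rough sketch of the scheme described above, assuming plain SHA-256 as the per-round hash and a Python set standing in for an indexed database column. The bounds `MIN_ROUNDS` and `MAX_ROUNDS` and the function names are hypothetical, and real values would need to be far larger to slow an attacker meaningfully.

```python
import hashlib
import secrets

# Hypothetical bounds for the per-entry iteration count N.
MIN_ROUNDS = 1_000
MAX_ROUNDS = 2_000


def hash_once(data: bytes) -> bytes:
    """One round of the underlying hash."""
    return hashlib.sha256(data).digest()


def store_token(email: str) -> bytes:
    """Hash the email N times, with N drawn at random per entry. N is never stored."""
    rounds = MIN_ROUNDS + secrets.randbelow(MAX_ROUNDS - MIN_ROUNDS + 1)
    digest = email.strip().lower().encode("utf-8")
    for _ in range(rounds):
        digest = hash_once(digest)
    return digest


def find_match(email: str, stored: set) -> bool:
    """Hash up to MAX_ROUNDS times, checking the store from MIN_ROUNDS onward."""
    digest = email.strip().lower().encode("utf-8")
    for i in range(1, MAX_ROUNDS + 1):
        digest = hash_once(digest)
        if i >= MIN_ROUNDS and digest in stored:
            return True
    return False
```

A legitimate lookup stops as soon as it reaches the entry's own N, while a wrong guess has to run all the way to `MAX_ROUNDS` before it can be ruled out.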

Some further food for thought:

  1. Use a time-consuming hashing algorithm
  2. Use a good salt (long and random)
  3. Consider changing the salt for each iteration: salt(n) = f(salt(n-1))
  4. Consider evolving the salt between iterations: salt(n) = f(salt(n-1), hash(n-1))
  5. Do not store N, by the way.
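
Points 1 and 2 could look something like the sketch below. It assumes PBKDF2-HMAC-SHA256 from Python's standard library as the slow hash and a single salt shared by all entries, so that the result stays deterministic and can be matched with one database lookup. `SHARED_SALT`, `ITERATIONS` and `slow_token` are hypothetical names and values.

```python
import hashlib

# Hypothetical values: one long random salt shared by every entry, and an
# iteration count tuned to be as slow as the recovery flow can tolerate.
SHARED_SALT = b"hypothetical-long-random-salt-value"
ITERATIONS = 200_000


def slow_token(email: str, salt: bytes = SHARED_SALT,
               iterations: int = ITERATIONS) -> str:
    """Deliberately slow, deterministic hash of a normalized email address."""
    normalized = email.strip().lower().encode("utf-8")
    return hashlib.pbkdf2_hmac("sha256", normalized, salt, iterations).hex()
```

Each guess costs an attacker the full iteration count, whereas a legitimate user only pays that cost once per recovery attempt.
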
Ama