
In my company, numeric user IDs are considered PII and therefore need to be pseudonymized to be GDPR compliant.

To do so, we populate a lookup table that assigns each ID a monotonically decreasing gdpr_ID. When a user becomes inactive or requests deletion, we simply overwrite the ID column with the gdpr_ID value, making it impossible to know that the gdpr_ID was originally linked to a specific ID.

The downside of this approach is that these IDs appear in several tables, all of which need to be anonymized, and the lookup table keeps growing, which makes our ETL pipelines progressively slower.

The solution we were considering is a hash function, but since the IDs are just numbers, we need to add a salt. How can we generate this salt so that salt(ID) always returns the same value, without creating or using lookup tables, and in such a way that it cannot be reconstructed?
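To make the idea concrete, here is a minimal Python sketch of one possible direction. It swaps in a different technique from the per-ID salt asked about above: a keyed hash (HMAC-SHA256) with a single secret key. The key value, the function names, and the ID range are all assumptions for illustration only. The second function shows why a plain, unkeyed hash of a small numeric ID is easy to reverse by trying every candidate.

```python
import hashlib
import hmac
from typing import Optional

# Hypothetical secret, for illustration only. In practice it would be loaded
# from a secrets store and kept out of the data warehouse.
SECRET_KEY = b"example-key-not-for-production"


def pseudonymize(user_id: int, key: bytes = SECRET_KEY) -> str:
    """Return a deterministic keyed digest (HMAC-SHA256) of a numeric ID."""
    message = str(user_id).encode("ascii")
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def brute_force_plain_sha256(target: str, max_id: int) -> Optional[int]:
    """Recover an ID from an unkeyed SHA-256 digest by trying every candidate."""
    for candidate in range(max_id + 1):
        digest = hashlib.sha256(str(candidate).encode("ascii")).hexdigest()
        if digest == target:
            return candidate
    return None


if __name__ == "__main__":
    # Same input and key always give the same pseudonym, with no lookup table.
    print(pseudonymize(42) == pseudonymize(42))
    # An unkeyed hash of a small numeric ID is recovered by enumeration.
    leaked = hashlib.sha256(b"1234").hexdigest()
    print(brute_force_plain_sha256(leaked, 10_000))
```

With this approach, anyone who holds the key can recompute the mapping for any ID, so the result is pseudonymous rather than anonymous. Whether destroying or rotating the key is enough for GDPR purposes is a separate question that this sketch does not settle.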

schroeder
Vektor88
    Isn't that basically saying that you want `salt(ID)` to be both deterministic (always returns the same value) and non-deterministic (nobody can check whether it returns a specific output for a specific input) at the same time? That's impossible. If you have zillions of users, though, you could make brute forcing by checking all inputs infeasible, provided the function is slow enough and the range of possible inputs is large enough. – Steffen Ullrich Sep 17 '21 at 10:44

0 Answers