
Context

I'm building a data lake from scratch within a small team (3-6 data engineers). I want to mask PII data when copying it from prod to dev/test environments.

I'm particularly interested in the case where ETL joins, surrogate key generation, and deduplication logic depend on PII columns. So I believe I need bijective (no collisions, consistent) data masking.

Question

From the standpoint of data-in-use protection, is there any better approach than deterministic encryption for consistent data masking in my case?

Considerations

In the case of HIPAA, PCI, or other regulations that enforce encryption at rest, I can use both:

  • non-deterministic encryption at rest, managed by the storage layer
  • deterministic encryption for PII columns only, managed by me when moving data from prod

Within the scope of this question, let's forget about subject removal requests (GDPR), so I consider lookup/translate tables a big overhead. Testing should be automated, so format-preserving masking isn't required.

UPD (what is considered "better")

Could you very superficially assess the degree of security of the deterministic encryption option? Would it be "a math PhD with 10 huge servers may brute-force it in 1 month" or "an average business analyst on a personal Mac may brute-force it in a few days"?

"Better" = just enough security (loosen) + less team reasources to implement & support the solution + distinct/cosnsitency properties

"Just enough security" = at first I hoped to achieve resistance to brute force attacks. But if that isn't possible completely...

"Just enough security (loosen)" = brute force risk/impact reduce + column-level restrictions for part of personnel + data download restrictions + full-fedge access to data only through audited notebooks/queries + employee trainings

Performance isn't critical at all (my case is batch processing; the cost is on write, which happens pretty infrequently).

Statistical values of encrypted data also don't matter.

VB_
    How do you define "better"? Better performance? Better resistance against brute force? Better in the sense that some statistical values remain, e.g. frequency of person names or city names? Anything else? – mentallurg Feb 09 '21 at 18:15
  • This is looking very close to https://security.stackexchange.com/questions/213625/whats-the-advantage-in-encrypting-data-for-data-masking – schroeder Feb 09 '21 at 21:38
  • @mentallurg sorry, my fault. "Better" = just enough security (resistance to brute force, rainbow tables, XORs, etc.) + fewer team resources to implement.
    Performance isn't critical at all (my case is batch processing; the cost is on write, which happens pretty infrequently). Whether statistical values remain also doesn't matter; I only need to preserve the consistency and distinct properties for joins/deduplication.
    – VB_ Feb 09 '21 at 21:44
  • @schroeder this question is more about the search for alternatives. The reference you mention is about the pros/cons of encryption, but I don't see alternatives there, so it isn't only about pros/cons. – VB_ Feb 09 '21 at 21:45
  • @mentallurg updated my requirements, see UPD section please – VB_ Feb 10 '21 at 08:38

2 Answers


In general, you don't want to use deterministic encryption for non-uniform data (so cryptographic keys are okay, but other data is not). That's because secure encryption tells us nothing about the contents of the data except possibly its length, so it shouldn't be possible to determine anything about two messages based only on their encrypted versions.

However, with a deterministic encryption method, two identical messages encrypt to the same thing. That's bad because, if we take a large set of city names, the most common encrypted values are generally going to correspond to the largest cities in the region. For example, in the United States, the most common cities will be New York, Los Angeles, Chicago, and Houston. If we also see first and last names encrypted deterministically, then with a large enough data set we can easily deanonymize data. For example, two people in the same city with the same uncommon last name might be related.

That's why this approach isn't secure and shouldn't be used, and it's why ECB mode can't be used for data. Moreover, if you use certain types of encryption, such as CTR mode with a consistent nonce (or another stream cipher), you make it very easy to recover the plaintext via crib-dragging attacks. You need to use a secure mode with a suitable nonce for each different piece of data.
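
To make the frequency leak concrete, here is a minimal sketch (assuming the Python `cryptography` package; AES-SIV is used purely as a convenient deterministic scheme, since it produces the same ciphertext for the same plaintext when no nonce is supplied). The ciphertext histogram mirrors the plaintext histogram even though the attacker never sees the key:

```python
# Minimal sketch: deterministic encryption preserves equality, so ciphertext
# frequencies mirror plaintext frequencies. Assumes the 'cryptography' package.
from collections import Counter
from cryptography.hazmat.primitives.ciphers.aead import AESSIV

siv = AESSIV(AESSIV.generate_key(512))          # AES-SIV keys: 256/384/512 bits

cities = ["New York"] * 5 + ["Chicago"] * 3 + ["Houston"]
masked = [siv.encrypt(c.encode(), None) for c in cities]  # no nonce -> deterministic

# The most common ciphertext corresponds to the most common city.
print(Counter(masked).most_common(1))
```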

So there isn't a secure way to do what you want to do. If you don't have a better suggestion than deterministic encryption for your test environment, I'd suggest generating test data independent of your prod data and using that.

Because modern encryption modes are not designed to encrypt multiple pieces of data with the same nonce and still protect the information, if you want some way of obscuring data which is unique, I would use something like HMAC-SHA-256 with a fixed and secret key. For things like phone numbers, which are unique or nearly so, this provides a secure way to obscure the data and provide a one-to-one correspondence with it, but it isn't invertible. This is secure because it's unlikely that people will be able to deanonymize your data, so it's okay for things like phone numbers and email addresses, but not for data which multiple people will share in common and which differ in frequency, such as names or city locations. For those, any deterministic technique will be a problem. If your data set is large enough and contains enough personal fields, an attacker will almost certainly be able to discover the identity of at least some of your users with at most a couple of days' worth of work.
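
As a minimal sketch of that HMAC approach (Python standard library only; the key name and helper are illustrative, not part of any particular tool):

```python
# Keyed, deterministic, non-invertible pseudonymization with HMAC-SHA-256.
# The key must stay secret and be reused across loads so that the same
# value always maps to the same token (joins and deduplication keep working).
import hmac, hashlib

MASKING_KEY = b"replace-with-32-random-bytes-from-a-secrets-manager"  # illustrative

def mask(value: str) -> str:
    return hmac.new(MASKING_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()

assert mask("+1-202-555-0175") == mask("+1-202-555-0175")   # consistent across runs
```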

I fully agree that using encryption or some other cryptographic method is better than not doing it, but if you're using these techniques to obscure personal data that isn't unique or nearly so, then you need to treat your development data as just as sensitive as your production data because its exposure will expose your users.

bk2204
  • thanks for the details! Could you please look at the UPD section of my question? I still need the distinct & consistency properties, otherwise the duplicate count will grow geometrically with each processing iteration. Also, what is the alternative? Create full-size translate tables (PII value to distinct random mapping) on PROD in a separate secured zone and replace the PII data every time when moving from PROD to DEV/TEST? – VB_ Feb 10 '21 at 08:38
  • also one more request - could you please provide examples of columns for which deterministic mapping is still secure? Could I say that deterministic encryption is brute-force-resistant for all low-duplicate-rate columns like email, IP, phone number, address (street)? Is there any way to test whether my name, surname, country, or other columns are vulnerable to brute force? – VB_ Feb 10 '21 at 08:45
  • *"two identical messages encrypt to the same thing. That's bad, because ..."* - The author say this is **not bad** in this case. See the OP: *"Statistical values of encrypted data also don't matter."* – mentallurg Feb 10 '21 at 22:22
  • The author may say that it's not bad, but if the development environment leaks to unauthorized parties, the data may very well be exposed. I've expanded on what the security level is and where a deterministic approach is appropriate. – bk2204 Feb 11 '21 at 02:27
  • @bk2204 thank you very much for details and your time! It's a brilliant answer :) – VB_ Feb 12 '21 at 11:57

1. Anonymize values

You said:

Statistical values of encrypted data also don't matter.

This means that if there is public information about some facts contained in your database, e.g. which person bought the most expensive car or house in which city, then some persons may be identified. Based on your database, further facts (not publicly known before) about these persons may then be extracted. But you said that is OK for you.

I would consider the following methods:

A) Encryption (even though you say you are looking for options besides encryption). The same values will then be replaced with the same encrypted values, so the joins you want to do on these data will still be possible on the encrypted data. Well-established methods like AES or Threefish are resistant against known-plaintext attacks, so even if somebody can identify a few persons based on statistical data, this will not help them recover the encryption key and decrypt all the other data. One more advantage is that a solution based on encryption needs a relatively small secret. (A small sketch of this option is shown after point C below.)

B) Lookup tables (you said you don't like this approach). Maintaining a lookup table may require permanently extending it whenever you get a new version of the data. Also, the secret is the whole lookup table, which is bigger than a normal key sufficient for reliable encryption (say, a 256-bit key).

C) Other methods of data manipulation would effectively amount to home-grown encryption. In that case there is no guarantee that all your requirements will be fulfilled. That's why I'd suggest not considering any methods other than well-established encryption algorithms or lookup tables.
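
To illustrate option A: one way (not the only one) to get the "same value, same ciphertext" property without inventing a custom mode is a deterministic AEAD such as AES-SIV. A rough sketch, assuming the Python `cryptography` package (the helper names are illustrative):

```python
# Deterministic AES-SIV: equal PII values give equal ciphertexts, so joins
# and deduplication still work, and the key holder can decrypt if needed.
from cryptography.hazmat.primitives.ciphers.aead import AESSIV

key = AESSIV.generate_key(512)     # keep in a secrets manager, not in code
siv = AESSIV(key)

def mask_pii(value: str) -> bytes:
    return siv.encrypt(value.encode("utf-8"), None)   # no nonce -> deterministic

def unmask_pii(token: bytes) -> str:
    return siv.decrypt(token, None).decode("utf-8")

assert mask_pii("alice@example.com") == mask_pii("alice@example.com")
```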

2. Anonymize relations

You said:

Statistical values of encrypted data also don't matter.

But if you want to eliminate some statistical correlations, you can shuffle relations. Suppose you have a persons table that references an addresses table via address IDs. Then you can take all address IDs, shuffle them, and use the shuffled IDs in the persons table. If you have a table with contact data like phone numbers, social network login names, etc., you can shuffle the references there as well. Thus at any moment you will have consistent data, all references will point to really existing rows in other tables, but the combination of these values will not give any benefit to an attacker. For instance, one person living in Los Angeles will get an address in Monterey, and the neighbor of this person will get an address in New York. And they will get the birthdays of some persons from Chicago and from Gettysburg, respectively. Thus many relations between the data will be broken.

Implementing such shuffling may require more effort than encryption. For instance, if you use person IDs as references in 10 tables, then you would need to shuffle the IDs in all these tables using the same substitution table.
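
A rough sketch of what "the same substitution table in all tables" can look like (pandas is just an example stack here; the substitution must be persisted if you want the same result next time):

```python
# Build one substitution table for person IDs and apply it to every table
# that references persons: all foreign keys stay valid, but point to
# shuffled persons.
import numpy as np
import pandas as pd

persons = pd.DataFrame({"person_id": [1, 2, 3, 4], "name": ["A", "B", "C", "D"]})
orders = pd.DataFrame({"order_id": [10, 11, 12], "person_id": [1, 1, 3]})

rng = np.random.default_rng(seed=42)   # fix/persist the permutation for repeatability
substitution = dict(zip(persons["person_id"],
                        rng.permutation(persons["person_id"].to_numpy())))

# Apply the SAME substitution everywhere person_id appears.
orders["person_id"] = orders["person_id"].map(substitution)
```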

Also, depending on the logic of your application, some relations may need to be kept and should not be randomly shuffled. Only you can decide what manipulations are acceptable in your case.

3. Anonymize without encryption

In some cases encryption may not be needed at all. If the fact that some person has any relation to your application is itself sensitive, e.g. if you maintain data about purchases of some weapon or about anonymous alcoholics, then you need some sort of encryption, see part 1 above. But if your application is a usual online shop and a relation to it does not harm anyone, and if the number of entries is relatively big (not 3-5 persons, but say 100,000 persons), then encryption may not be needed at all. Just shuffle all the important relations: between person name and address, person name and contact data, between orders and delivery addresses, etc. Thus every single piece of data will be real and unencrypted, but taken together they will not give any correlation to real persons.

mentallurg
  • thanks for the answer! I didn't know that even with highly repeatable data AES may leak only a few values, not the whole dataset. I found "1. Anonymize values" section very useful, really appreciate that. – VB_ Feb 12 '21 at 12:08
  • about the shuffling of foreign keys (sections 2 and 3) - I don't think it'd work, because usually the logic depends on other columns as well. Suppose you're joining rows by phone_number and then calculating revenue by bundle type (i.e. PAYG, INTERNET_BUNDLE, FIXED_MINUTES_BUNDLE, etc.). Changing the bundle type may break the result or even cause an exception in the ETL. – VB_ Feb 12 '21 at 12:10
  • @VB_: As I said, it depends on the logic in your case. When you use name or phone numbers to join some data entries, you still can shuffle other data. For instance, if persons in your database are connected with some orders, you can shuffle these orders. The volume of purchases will remain the same, but the assignment to persons will differ from the reality and thus there will be no harm if some data becomes known. – mentallurg Feb 12 '21 at 17:08
  • got it, thank you! – VB_ Feb 12 '21 at 20:44
  • oh, maybe I only now got your point. Did you mean that FK shuffling + deterministic encryption may be as secure as non-deterministic encryption? Meaning it protects data from attacks, or reduces the data's value in case of exposure? – VB_ Feb 12 '21 at 21:04
  • @VB_: I don't know how you are using the data. But usually one creates some master data, replaces them with IDs, and joins data by these IDs. For instance, one extracts all names from one dataset and puts them into a names table, extracts the names from another dataset and adds any that are missing, and does the same with addresses, phone numbers, etc. In the end everything is referenced by IDs. For instance, you have a dataset where each order refers to some person via a person ID. You just go through all orders and replace the original person IDs with the IDs of other persons. – mentallurg Feb 13 '21 at 13:13
  • @VB_: If you want to have the same result next time, you need to keep this substitution table somewhere and apply it next time. – mentallurg Feb 13 '21 at 13:14
  • 1
    @VB_: The result is following: If smb. wants to guess a person by some known facts (e.g. this person bought the most expensive car or house in some city, and the person name can be found in the public sources), this will not work. After shuffling of relations your database will show, for instance, that the most expensive house was bought by a student who is working part time at McDonalds.Or the phone number of Bill Gates will be assigned to some person in a small town in Oregon. And so on. – mentallurg Feb 13 '21 at 13:22
  • thank you for details! – VB_ Feb 13 '21 at 14:52