
I'm looking into how to store emails and related data in a way that complies with the GDPR. The reasoning is that it would be beneficial to store users' emails linked to certain data (shop data about purchases and questionnaires). E.g.

  • User u email
  • User u purchased product x
  • User u questionnaire about experience of product x

I've read that hashing the emails could allow for pseudonymized data, but I'm not sure whether this is enough. For example:

"Although you no longer have the email addresses of all your users, you could easily compare your database to a list of known email addresses to identify which of those people use your service."

There will always be some situation in which someone could recover the anonymised data, so my question is: is hashing emails enough for the GDPR? If not, what is the minimum requirement from a cryptographic point of view?
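The concern in the quoted passage can be illustrated with a minimal sketch. It assumes the emails are stored as unsalted SHA-256 hashes; the addresses, the `stored` records, and the helper names are hypothetical and only serve the illustration.

```python
import hashlib


def hash_email(email: str) -> str:
    """Unsalted SHA-256 of a normalized email address (hypothetical scheme)."""
    return hashlib.sha256(email.strip().lower().encode("utf-8")).hexdigest()


# Hypothetical stored records: only the hash is kept, never the address.
stored = {
    hash_email("alice@example.com"): ["purchased product x"],
    hash_email("bob@example.com"): ["questionnaire about product x"],
}


def reidentify(known_emails, records):
    """Return the known addresses whose hash appears in the records."""
    found = {}
    for email in known_emails:
        digest = hash_email(email)
        if digest in records:
            found[email] = records[digest]
    return found


# Anyone holding a list of candidate addresses can test them against the data.
known = ["bob@example.com", "carol@example.com"]
matches = reidentify(known, stored)
```

Here `matches` links bob@example.com back to his questionnaire entry even though the address itself was never stored, which is why I doubt that a plain hash is sufficient.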

user210772
  • `There will always be a situation in which people would be able to recover the anonymised data` Then it would not be *anonymized* but *pseudonymized*. https://www.protegrity.com/pseudonymization-vs-anonymization-help-gdpr/ – Esa Jokinen Jun 24 '19 at 05:41
  • GDPR doesn't say anything about cryptographic requirements, as it's a legal document, not technical. Encryption is only mentioned a couple of times in "such as..." listings, so encryption is not literally required at all. The other requirements might lead to conclusions that encryption is a good idea, though, but that's on the technical implementation level. – Esa Jokinen Jun 24 '19 at 05:52
  • @EsaJokinen I'm also referring to something like brute forcing combinations, it could still be anonymized but theoretically be recovered. So would pseudonymized be enough for GDPR even though it may not be the best solution? – user210772 Jun 24 '19 at 06:25
  • @EsaJokinen Your sentence holds true only if you consider an email address a pseudonym. Often it is not (really). – Marcel Jun 24 '19 at 06:27
  • What do you ***want to do with the data***? Before considering controls you can use, you need to be very clear on what you want to do. Do you need the emails to be stored or do you just need to know the same person did all those things? Is hashing enough *for what*? You do not explain that part. Are you asking how to use cryptographic methods to anonymise the data? – schroeder Jun 24 '19 at 13:51
  • Regardless of laws, hashing isn't great for preserving privacy of [fields like email addresses](https://freedom-to-tinker.com/2018/04/09/four-cents-to-deanonymize-companies-reverse-hashed-email-addresses/) in case of a database breach. Want to find embarrassing information about your friend Bob? Do you have his email address? Then you can find him in a public data breach. Got a list of addresses you bought somewhere? Then you can trivially deanonymize a lot of people. Want to spear phish as many people as possible? Guess addresses. – Future Security Jun 24 '19 at 21:56
  • There is no great software-only solution. (Sorry.) Hashing things is still better than storing things plaintext or encrypted, if you're fine doing that to the data. HMAC is much better than plain hashing, in theory, but only if access to the key is restricted. (It's possible/likely that the key may be just as easy to steal as the DB...) – Future Security Jun 24 '19 at 22:12
  • @schroeder No, the emails don't need to be stored, and there is no identifiable information about a user other than their email. As you mentioned, it would just be helpful to know that two entries belong to the same user u. I'm trying to use the email as a key into some data structure to link duplicate entries; the identity of the person is not important. – user210772 Jun 25 '19 at 02:15

1 Answer


The GDPR does not contain any technical details, or information regarding specific implementations.

Here's recital 26 of the GDPR about pseudonymous and anonymous data:

The principles of data protection should apply to any information concerning an identified or identifiable natural person. Personal data which have undergone pseudonymisation, which could be attributed to a natural person by the use of additional information should be considered to be information on an identifiable natural person. To determine whether a natural person is identifiable, account should be taken of all the means reasonably likely to be used, such as singling out, either by the controller or by another person to identify the natural person directly or indirectly. To ascertain whether means are reasonably likely to be used to identify the natural person, account should be taken of all objective factors, such as the costs of and the amount of time required for identification, taking into consideration the available technology at the time of the processing and technological developments. The principles of data protection should therefore not apply to anonymous information, namely information which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable. This Regulation does not therefore concern the processing of such anonymous information, including for statistical or research purposes.

Pseudonymous data is meant to be recoverable: the people who are authorized to do so can, in some way, link it back to a person and to other personal information.
Anonymous data is not meant to be usable to identify a person, in any reasonable way and by anybody, and the GDPR does not apply to this kind of data.

So the question is: do you want to pseudonymize or anonymize the email addresses? Since you said that you want to "store users' emails linked to certain data like purchases or questionnaires", you actually want to be able to recover the addresses, and therefore you want to pseudonymize them.

I'm not sure a simple hash would work for that. First, a hash is one-way: if you want to recover the address, what are you going to do? Brute-force the hash every time you want to send an email to the user? Second, anybody else would be able to recover it by brute-forcing it too.

A more reasonable pseudonymization would be to encrypt the emails and give the key only to authorized staff, or to map every email to an ID and keep the map somewhere safe (offline and encrypted, for example). This way the email addresses remain recoverable (and therefore pseudonymous), but only a few authorized people are able to link them to other personal information.
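As a rough illustration of the ID-map idea, here is a minimal Python sketch. The `EmailIdMap` class, the `keyed_pseudonym` helper, and the sample addresses are hypothetical names made up for this example. The keyed hash (HMAC) is not one of the two options above; it is the alternative suggested in the comments on the question, included for comparison. It only lets you link entries, not recover the address, and it is only as safe as the key.

```python
import hashlib
import hmac
import secrets


def normalize(email: str) -> str:
    """Trim and lowercase so the same address always yields the same pseudonym."""
    return email.strip().lower()


def keyed_pseudonym(email: str, key: bytes) -> str:
    """HMAC-SHA256 of the email; without `key`, guessed addresses cannot be tested."""
    return hmac.new(key, normalize(email).encode("utf-8"), hashlib.sha256).hexdigest()


class EmailIdMap:
    """Maps each email to a random ID (hypothetical helper).

    The map is the 'additional information' needed for re-identification,
    so it should be stored separately from the purchase/questionnaire data.
    """

    def __init__(self):
        self._email_to_id = {}
        self._id_to_email = {}

    def id_for(self, email: str) -> str:
        email = normalize(email)
        if email not in self._email_to_id:
            new_id = secrets.token_hex(16)  # random, carries no information
            self._email_to_id[email] = new_id
            self._id_to_email[new_id] = email
        return self._email_to_id[email]

    def email_for(self, user_id: str) -> str:
        """Re-identification step, meant for authorized staff only."""
        return self._id_to_email[user_id]


# Assumption: in practice the key would come from a secrets store, not be generated here.
key = secrets.token_bytes(32)
p1 = keyed_pseudonym("User@Example.com", key)
p2 = keyed_pseudonym("user@example.com ", key)  # same pseudonym as p1

id_map = EmailIdMap()
uid = id_map.id_for("user@example.com")
purchases = {uid: ["product x"]}  # the shop data only ever sees the random ID
```

With the map, the shop data contains nothing derived from the address, and recovering the email requires access to the map. With the keyed hash, two entries from the same address can still be linked, but nobody (including you) can get the address back without guessing it and knowing the key.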

As you can see, it all depends on the specific purpose of your processing. Why are you keeping the emails? Who should be able to see them? Once you have defined this, you can implement "appropriate technical and organizational measures" (citing article 32 of the GDPR) to achieve your goal.

reed