32

I'm working on an application that is built entirely around user interaction. In my application logs, I log each interaction and print the email address to uniquely identify which user performed it.

This application log will not be visible to anyone other than:

  • Me
  • The next owner of the application, if I sell the project
  • An administrator I might hire if the workload gets too big

An example of a log record is something like this:

2019-01-24 14:27:20.954 INFO 32256 --- [whatever-info] s.p.s.t.d.m.s.SomeClassThatPrintsTheLog : Registering user with email address EMAIL_ADDRESS_WILL_BE_PRINTED_HERE@email.com.

Is this allowed under GDPR or should I mask the printed email address in any way? Or use another solution?

Peter Mortensen
Titulum
  • 2
    Related question: [How to handle emails as username under GDPR?](https://security.stackexchange.com/questions/184519/how-to-handle-emails-as-usernames-under-gdpr) – Philip Rowlands Jan 28 '19 at 09:13
  • 2
    Side note - if email addresses can be changed in your application, logging the address won't always tell you who did what (data would be stale). You probably want some sort of permanent internal identifier. – Clockwork-Muse Jan 29 '19 at 18:26
  • @Clockwork-Muse that is a really good remark. But does this permanent internal identifier count as an identifier that uniquely identifies a customer? (And thus, fall under GDPR) – Titulum Jan 31 '19 at 12:43
  • Probably. [Because you could use it to get access to the "real" person record](https://law.stackexchange.com/questions/19260/what-counts-as-personal-data-under-gdpr) – Clockwork-Muse Jan 31 '19 at 16:34

4 Answers

41

The goal of the GDPR is to protect personally identifiable information (PII) as much as possible. The interactions of a specific user with your application are almost certainly such PII.

If you really need to log this information, you should inform your users about the process: the purpose of the data collection, how long the information is stored, and who gets access to the data. You, and whoever you sell the application to, should never use the data for any purpose other than the one the user agreed to. And of course you need to properly protect the information against misuse, i.e. use outside of the specified purpose. This includes, but is not limited to, the case where someone hacks into your application or server and steals the data.

Since use of the data is limited and protection (and fines) can be costly, it might be easier not to store this information in the first place. An alternative is to at least pseudonymize the PII as much as possible, i.e. in a way that the logged data are still usable for you but that no association to a specific user can be done even when having all the logged data. But since it is not really clear what you use these logs for, no specific pseudonymization process can be recommended.
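
As a rough illustration only, one common way to pseudonymize an identifier before it reaches a log is a keyed hash. The sketch below is written in Python and uses HMAC-SHA256 from the standard library; the function name `pseudonymize_email`, the key `PSEUDONYM_KEY`, and the `user-` prefix are all hypothetical and not taken from the question. The output is still personal data, because whoever holds the key can recompute the mapping, so this is not a recommendation for any specific use case.

```python
import hashlib
import hmac

# Hypothetical secret key (an assumption for this sketch). In practice it would
# be loaded from a secrets store kept separate from the logs.
PSEUDONYM_KEY = b"replace-with-a-random-secret"


def pseudonymize_email(email: str, key: bytes = PSEUDONYM_KEY) -> str:
    """Return a stable keyed pseudonym for an email address.

    The same address always yields the same pseudonym, so log entries can
    still be correlated, but the address itself is never written to the log.
    """
    # Normalize so that "A@B.com " and "a@b.com" map to the same pseudonym.
    normalized = email.strip().lower()
    digest = hmac.new(key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()
    # Truncate to keep log lines short; 16 hex characters is an arbitrary choice.
    return "user-" + digest[:16]
```

A log call would then write `pseudonymize_email(email)` instead of the raw address. A keyed hash is used here rather than a plain hash because anyone holding a list of candidate email addresses can recompute an unkeyed hash for each of them.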

Be aware, though, that simply replacing each unique email address with another unique identifier might not be sufficient pseudonymization. Depending on the data you log, it might be possible to build user profiles and, based on specific traits in those profiles, associate them with real-world users. See the AOL search data leak for an example of how such a simple pseudonymization attempt went wrong.

Steffen Ullrich
  • 3
    "*An alternative is to at least pseudonymize the PII as much as possible, i.e. in a way that the logged data are still usable for you but that no association to a specific user can be done even when having all the logged data.*" Am I correct in assuming that if you log something like user ID, that is fine? So, if you say `user 42 did X` that is not going to identify that `firstname.lastname@domain.com` did the thing, unless you also have the database information. – VLAZ Jan 28 '19 at 12:10
  • 5
    @vlaz: Just taking the user id is not necessarily a sufficient pseudonymization. Depending on what you log on activity it might be possible to create a profile for this specific user id and based on this unique profile associate the user id with a real world person. This was for example done with pseudonymized search data released by AOL - see [wikipedia](https://en.wikipedia.org/wiki/AOL_search_data_leak) for more. Given that it is not known what you log in detail no specific process of sufficient pseudonymization can be recommended. – Steffen Ullrich Jan 28 '19 at 12:28
  • So hash of PII is not anonymized enough because it could be related to the PII unambiguously? – pabouk - Ukraine stay strong Jan 28 '19 at 12:32
  • @pabouk in light of the comment above, I suppose - yes, it's not enough. There is functionally no difference between PII hash and a user ID as they both could trace and profile a user. – VLAZ Jan 28 '19 at 12:34
  • 5
    PII is a phrase in common usage in the states but it’s not used in the GDPR. There, the term is ‘personal data’ which to my mind is a bit wider in scope. – Robin Whittleton Jan 28 '19 at 17:57
  • 2
    Hm. I am no expert but is it not also necessary to be able to remove a person's data upon request? You should probably add some way of scrubbing the log without making it useless. – Stian Yttervik Jan 29 '19 at 08:39
  • 2
    @StianYttervik depending on the nature of the logs and the reason for processing them, it ***might*** not be necessary to remove the personal data from the logs. the extreme example: if you want to record who submitted a RTBF request (especially if you want to filter backups later), then you need to record the person indefinitely. And that's ok in GDPR. RTBF is not a "magic wand" to scrub all references of a person from a system. – schroeder Jan 29 '19 at 16:49
  • 1
    @StianYttervik Another example of this are backups: As long as they are kept only for a limited time, not removing data from backups is generally viewed as ok. – deviantfan Jan 29 '19 at 23:47
30

Logging data is not the issue under GDPR. The part that matters is what happens to the log, who can see it, how long it is stored, what the log is used for, and whether you can satisfy the rights of the data subject once you process and store the data.

If you need to log the email in order to provide your service, then there is no problem with logging it. But if you do log the data, you need to be very clear from the start, both with yourself and the data subjects, about what will happen to it.

schroeder
  • "Need to" is key here! If there is a way to do it without the data, you are not allowed to use the data. Period. - Thus logging alone might be a violation, if the data isn't needed. – I'm with Monica Jan 29 '19 at 16:46
  • 3
    @AlexanderKosubek that's an extreme application of the regulation. If you notify the data subject of the use of the data, and the processing is in the data subject's interest, then there is wiggle room. Even in GDPR, there is no "period". It's more about "mindfulness of processing" than "thou shalt not". – schroeder Jan 29 '19 at 16:51
  • How does the wording of Art. 5, 1. (c) _not_ imply that only data that is actually needed may be processed? It specifically says "limited to what is necessary"... There might be wiggle room in the form "is this purpose legit?" but not really about "is this data necessary?" – I'm with Monica Jan 29 '19 at 16:58
  • 2
    @AlexanderKosubek you have flipped the logic, though. If I can do it with the data, then I can deem it necessary. You stated "If there is a way to do it without the data, you are not allowed to use the data." And that's not what was intended. Recital 39 adds a "reasonableness" clause, which adds wiggle room. Hence, your interpretation might be in the spirit of GDPR, but an extreme interpretation. – schroeder Jan 29 '19 at 17:02
8

Article 5 of the GDPR specifies the basic principles for processing data.

Article 5 "Principles relating to processing of personal data"

(1) Personal data shall be:

... (b) collected for specified, explicit and legitimate purposes and not further processed in a manner that is incompatible with those purposes; further processing for archiving purposes in the public interest, scientific or historical research purposes or statistical purposes shall, in accordance with Article 89(1), not be considered to be incompatible with the initial purposes (‘purpose limitation’);

Storing personal information in log files for the purpose of diagnosing problems with your application is not incompatible with the original purpose, but do protect the data using "appropriate technical and organisational measures ... according to risk".

But don't store your logs forever. Data subjects (the GDPR term for the people the data relates to) have the right to be forgotten, which means that they should eventually be removed from logs, backups, etc. I believe that keeping only the last 90 days of data should be fine.
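
A retention limit like this is usually enforced automatically rather than by hand. The following Python sketch is only one way to do it, under stated assumptions: logs are plain `*.log` files in a single directory, file modification time is a good enough proxy for age, and the 90-day figure is the one suggested above rather than a legal requirement. The function name `purge_old_logs` and the constant `RETENTION_DAYS` are made up for this example.

```python
import time
from pathlib import Path

RETENTION_DAYS = 90  # assumption: the figure suggested above, not a legal value


def purge_old_logs(log_dir, retention_days=RETENTION_DAYS, now=None):
    """Delete *.log files in log_dir whose modification time is older than
    the retention period. Returns the sorted names of the removed files."""
    now = time.time() if now is None else now
    cutoff = now - retention_days * 86400  # 86400 seconds per day
    removed = []
    for path in Path(log_dir).glob("*.log"):
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path.name)
    return sorted(removed)
```

Something like this could run from a daily scheduled job. Many logging frameworks and tools such as logrotate offer built-in retention settings that achieve the same result without custom code.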

And lastly, if you are building a system that processes personal information about people in the EU, I would strongly recommend that you take a 1-2 day course on the matter, to learn the differences between controller, processor, data subject, personal information vs. sensitive personal information, etc.

Pete
  • 3
    Wait, a RTBF request can obligate a site to dig into and alter their *backups?!?* Does anyone at the EU legal system have any idea how utterly insane that is? – Mason Wheeler Jan 28 '19 at 21:57
  • 4
    @MasonWheeler Wait, what's the point of the right to be forgotten if all deleted data will be restored from backup on next crash? Do you have any idea how utterly insane that is? – Agent_L Jan 28 '19 at 22:35
  • 4
    @Agent_L This is an example of why the passive voice is so insidious and we are supposed to avoid using it. Expressed in the proper active voice, it's a "right to force others to forget about you." Which is insane already; it sounds like something horribly dystopic, straight out of a Philip K. Dick novel. I'm just pointing out yet another point where its requirements are nonsensical. – Mason Wheeler Jan 28 '19 at 22:45
  • 1
    @MasonWheeler I've merely showcased how your argument sounds for the other side and how bad it is to bring up politics unwelcome. No matter how bad you phrase it, in a "proper" or "improper" voice, the "right to force others to stop abusing you" is the cornerstone of most societies including ours. – Agent_L Jan 29 '19 at 08:05
  • 6
    @MasonWheeler When a data subject exercises his right to be forgotten, the Controller has a certain time within which to comply, 90 days if I remember correctly. So if you immediately remove the information from your production system, and keep backups for 90 days - you're fine. Also remember that other legislation may overrule this right to be forgotten, so for example, if local legislation requires you to keep financial records for the last 5 years, then that "wins" over the right to be forgotten. – Pete Jan 29 '19 at 08:19
  • 4
    @MasonWheeler no, that's not what the requirement states. You do not need to dig into your tape backups. What you need to do is to ensure that if you apply your backups, that the data in question is not restored. While this is potentially a new functionality for some, it is not as crazy as it sounds in practice. – schroeder Jan 29 '19 at 09:39
  • 2
    @Agent_L Sure, but *remembering factual information* is not abuse, and this insane "right" is already being heavily abused by sleazy politicians, criminals, and corporate actors to cover up past misdeeds, as everyone with half a brain predicted it would be from day 1. – Mason Wheeler Jan 29 '19 at 12:36
  • @schroeder That's good to hear. Thanks for the clarification. – Mason Wheeler Jan 29 '19 at 12:36
  • 1
    @MasonWheeler Every information worth keeping is factual. And keeping it longer than necessary is abuse, that's the point. Eg ebay banned me personally for life because apparently few years of having account without buying is "suspicious activity". They don't have a record of what part of my activity was suspicious, all they have is the factual information that I was banned. That's exactly the kind of abuse GDPR seeks to eradicate. Nothing but *remembering factual information* here. Let's delete our political rants here, because they don't serve the question. – Agent_L Jan 29 '19 at 16:21
2

Here's a couple of quotes from the GDPR (emphasis added).

Recital 78:

The protection of the rights and freedoms of natural persons with regard to the processing of personal data require that appropriate technical and organisational measures be taken to ensure that the requirements of this Regulation are met. In order to be able to demonstrate compliance with this Regulation, the controller should adopt internal policies and implement measures which meet in particular the principles of data protection by design and data protection by default. Such measures could consist, inter alia, of minimising the processing of personal data, pseudonymising personal data as soon as possible, transparency with regard to the functions and processing of personal data, enabling the data subject to monitor the data processing, enabling the controller to create and improve security features.

Article 25 (Data protection by design and by default), paragraph 1:

Taking into account the state of the art, the cost of implementation and the nature, scope, context and purposes of processing as well as the risks of varying likelihood and severity for rights and freedoms of natural persons posed by the processing, the controller shall, both at the time of the determination of the means for processing and at the time of the processing itself, implement appropriate technical and organisational measures, such as pseudonymisation, which are designed to implement data-protection principles, such as data minimisation, in an effective manner and to integrate the necessary safeguards into the processing in order to meet the requirements of this Regulation and protect the rights of data subjects.

What does this mean? That if you don't have a good reason to include email addresses in logs, then you probably shouldn't do it. You might log the user ID instead, which has a higher level of pseudonymization, and would still allow you to identify the user if you needed to. IDs are probably the right thing to use anyway to uniquely identify a user, regardless of the GDPR, because I suppose you can expect a user to always have the same ID, while the email address can usually be changed.
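
A minimal Python sketch of that idea, under loud assumptions: the `USERS` dictionary is a hypothetical in-memory stand-in for the real user database, `register_user` is an invented function, and a random UUID is just one possible choice of internal ID. The log line carries only the ID, while the mapping from ID to email address lives elsewhere. The ID is still personal data for as long as that mapping exists.

```python
import logging
import uuid

logger = logging.getLogger("registration")

# Hypothetical stand-in for the user database: the only place where the
# internal ID is linked to the email address.
USERS = {}


def register_user(email):
    """Create a user and log the registration by internal ID only."""
    user_id = str(uuid.uuid4())  # random, carries no information about the user
    USERS[user_id] = email       # the association is stored outside the log
    # Pass the ID as a logging argument; the email address is never logged.
    logger.info("Registering user with id %s.", user_id)
    return user_id
```

Resolving a log entry back to a person then requires access to the user table as well as the log, and the ID stays the same if the user later changes their email address.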

That said, even though I'm not a lawyer, I don't think you can get in much trouble for logging email addresses, as long as you are able to demonstrate that everything is stored and processed securely enough. On the other hand, good design choices will definitely help you to demonstrate that you have followed the best practices for security and privacy, and that you haven't put your users' data at risk by unnecessarily processing their personal data.

reed
  • 1
    user ID = email = PII in GDPR, so just switching to ID is not going to help – schroeder Jan 28 '19 at 11:14
  • If you need to log the *email* to verify email setup, then you can't get away from that – schroeder Jan 28 '19 at 11:15
  • 1
    @schroeder, of course it helps, because IDs have a much higher level of pseudonymity (I'd say 100%) than email addresses (which might even be enough to identify a person). And of course pseudonym data is still personal data. Pseudonymous data is not the same as anonymous data. – reed Jan 28 '19 at 11:33
  • 3
    @schroeder, a randomly generated user ID can *help* to safeguard customer data if the rest of the PII is in just one table, because deleting that entry anonymizes the data in logs and similar places if you can no longer match that ID to a person. – o.m. Jan 28 '19 at 11:51
  • @o.m. I was just in the process of writing a similar comment (though you worded it much better than I was going to). It doesn't completely solve the problem, but it makes satisfying a request by a user to remove all of their PII much easier. – Anthony Grist Jan 28 '19 at 11:53
  • @reed I'm afraid GDPR disagrees with you. – schroeder Jan 28 '19 at 12:23
  • @o.m. yes, and that's fine, but that's not what is being proposed by reed. These details are *very* important, and one cannot simply say that userID is sufficient. – schroeder Jan 28 '19 at 12:23
  • 1
    If the userID can be associated with a person, the userID becomes PII. If you can break the association, then that's great. But you *cannot* simply state that userID is sufficient. You need to have a structure that facilitates the disassociation. That means that the *whole* point is not about the userID at all, but rather how you process the data. – schroeder Jan 28 '19 at 12:25
  • Please read the comments and the recent edit to Steffen's answer. – schroeder Jan 28 '19 at 12:43
  • 1
    @schroeder, my answer never suggests that an ID is *sufficient* for anything. It just says that in this specific case (supposedly useless emails in logs) IDs are probably going to be a better option to achieve security and privacy by design, and minimizing the risks. IDs help with pseudonymization in this case because the "additional information" needed to identify the person will not be in the log itself, but somewhere else (probably a database). Email addresses generally provide much less pseudonymity. I did not make any assumptions on anything else. – reed Jan 28 '19 at 13:39
  • "user ID ... which has a higher level of pseudonymization" This is false. You are claiming that the ID is sufficient for this use case. It is not. Even if the userID needs to be correlated with other data, it can still be PII. – schroeder Jan 28 '19 at 15:18
  • @schroeder and here you spelled out the difference: the email address is PII, the userID *can* be PII. It's quite context dependent especially with regard to using appropriate measures to secure user information. The other big issue is that you are typically required to delete PII on request or when a contract ends. Meaning, if you have emails in the logs you technically would need to delete all entries that contain the mail in your log archive (unless the logs are completely deleted within the necessary time frame anyway). If you have IDs in the logs, it might suffice to delete the [cont] – Frank Hopkins Jan 28 '19 at 16:04
  • @schroeder association from ID to PII, e.g from some database. – Frank Hopkins Jan 28 '19 at 16:05
  • @Darkwing I get all that, but the answer, as written, does not include any of these subtleties which makes it incorrect advice. – schroeder Jan 28 '19 at 16:46
  • @schroeder, there must be a misunderstanding about pseudonymity and personal data. Pseudonymization means that a parameter like "John Doe" (name) is replaced with ID123 (ID), and the correlation between name and ID is kept separate (by technical and organizational means). Pseudonymous data **is** always personal data, because it can be used to identify a person. Pseudonymous doesn't mean anonymous. Anonymous data would be something like: "Most people named John are overweight". No way to identify a person from that. – reed Jan 28 '19 at 16:57
  • @schroeder, so pseudonymous data is useful because it can help you with some tasks. For example, it allows you to share the data with somebody else (for reports, statistics, support, etc.), without giving them unnecessary (and excessive) personal data. So it should be clear why using IDs in logs is better than emails. IDs contain less information (ID123 vs name.surname@domain) and are generally harder to link to anything unless you also have access to other information. And I never said an ID is enough for anything, I don't know where you are reading that. – reed Jan 28 '19 at 16:57
  • I'm not equating pseudonymous with anonymous. My entire point is that you cannot swap email with userID and get pseudonymisation. If you want to achieve pseudonymisation, then that's a separate process, which does not depend on userID and can still be done with the email address. It's the *process* that gets you pseudonymisation. Your answer glosses over this rather large aspect of GDPR. To focus on userID is to go astray before you start. – schroeder Jan 28 '19 at 19:49