The EU's General Data Protection Regulation (GDPR), and the German DSGVO implementation, are very strict when it comes to personal data (such as IP addresses). However, this question is not about the GDPR itself, but about how to comply with the regulation in the nginx HTTP access log while keeping the ability to "identify" the anonymous user within a user journey (to separate one user journey from the others).
My current implementation is that I do not record the remote IP and port at all. I purged the environment variables for upstreams/proxies/etc. and simply do not have remote IP and port information in the access logs.
Now I am facing the issue that I need to follow the path of a user journey. I simply do not have any way of "identifying" which requests belong to which user journey. I want to point out that I also do not use cookies, etc.
The legacy approach to "identifying" an "anonymous user" is to look at the remote IP and the date. Within the same day, the same remote IP would most likely be the same user. However, as mentioned above, I do not log remote IP and port information, and I do not want to start doing so now.
My current thought is to hash the remote IP address together with the remote port and the date of the request. I would have the date in the logs but not the remote port, so I could not recover the remote IP, which is personal data, without heavy brute forcing. This approach would give back some level of user journey identification, which would help me quite a bit.
A general workflow to accomplish this approach would be:
- The request is accepted by nginx,
- nginx performs a hash operation on the remote IP, remote port and current date (e.g. `md5_hex("$remote_addr $remote_port $current_date")`) and stores the hash in a new variable (e.g. `$remote_ip_anonymous`),
- the log_format would include the `$remote_ip_anonymous` variable.
The hash would change even if the remote IP and remote port stayed the same, because of the current date salt. And it would change when the remote port changes. So this should be fine under the GDPR, or at least fall into the lowest data security category, whereas the actual remote IP would be in a major data security category under the GDPR.
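To make the intended behaviour concrete, here is a minimal Python sketch of the hash I have in mind. It is not nginx code, only an illustration of the properties described above. The function name `anonymize_remote`, the sample address and ports, and the ISO date format for `$current_date` are all assumptions made up for this example.

```python
import hashlib


def anonymize_remote(remote_addr: str, remote_port: int, current_date: str) -> str:
    """Return the MD5 hex digest of "addr port date".

    Mirrors the idea of md5_hex("$remote_addr $remote_port $current_date").
    """
    material = f"{remote_addr} {remote_port} {current_date}"
    return hashlib.md5(material.encode("utf-8")).hexdigest()


# Hypothetical sample values (address taken from a documentation IP range).
same_a = anonymize_remote("203.0.113.7", 51234, "2024-05-01")
same_b = anonymize_remote("203.0.113.7", 51234, "2024-05-01")
next_day = anonymize_remote("203.0.113.7", 51234, "2024-05-02")
other_port = anonymize_remote("203.0.113.7", 51235, "2024-05-01")

print(same_a == same_b)      # identical input gives an identical hash
print(same_a != next_day)    # the date salt changes the hash
print(same_a != other_port)  # a different remote port changes the hash
```

Only requests that share the same remote IP, remote port and date end up with the same value in the log.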
Enough with the theory... how would I implement such remote IP anonymization? Do I have to use the nginx Perl module or Lua module, or is there another (faster) way of computing that hash and storing it in an nginx variable?