I am working on a data acquisition and preprocessing tool for a brain tumor database. The tool preprocesses the data and harmonizes it so it can be united in one big database. Due to privacy restrictions, data protection laws and our own morals, we are required to generate pseudonyms for the patients. What is the best way to achieve that?
There are multiple brain scans per patient and it is important that they can be linked to each individual so we are able to trace the brain tumor over time. From my limited understanding I cannot use an algorithm like bcrypt for that reason, as it will generate a different hash every time due to the unique salt?
My current plan was to use SHA3-256 on the patient name plus a local salt that is different for every hospital (I think in this context the salt is also called a pepper?). However, as far as I understand it, SHA-3 is susceptible to brute-force attacks?
This approach also has the disadvantage that the same patient will generate different hashes if he or she is scanned at another institution. However, this is acceptable, as patients unfortunately die too fast to migrate between hospitals. Still, it would be nice to avoid this problem in case the research network wants to incorporate less fatal diseases in the future.
I have an Electron + Vue.js app running the UI, connected via WebSockets (socket.io) to a Python Flask server that runs the heavy computations from within a Docker container. As everything is running locally, I guess my setup is not really vulnerable to man-in-the-middle attacks, so it would be acceptable to do the hashing on the Python backend?
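For illustration, the current plan could look roughly like the following in Python. This is only a minimal sketch: the function name `pseudonym`, the placeholder peppers and the name normalization step are assumptions made for the example, not part of the existing tool. It shows the two properties discussed above: the same patient always maps to the same pseudonym within one hospital, and to a different one at another hospital.

```python
import hashlib
import unicodedata


def pseudonym(patient_name: str, hospital_pepper: bytes) -> str:
    """Derive a deterministic pseudonym from a name and a per-hospital secret."""
    # Assumption: normalize Unicode form, case and whitespace so that
    # spelling variants of the same name yield the same pseudonym.
    normalized = " ".join(
        unicodedata.normalize("NFKC", patient_name).casefold().split()
    )
    # SHA3-256 over pepper + name, as described in the plan above.
    digest = hashlib.sha3_256(hospital_pepper + normalized.encode("utf-8"))
    return digest.hexdigest()


# Hypothetical per-hospital secrets; real ones would be long random values.
PEPPER_CLINIC_A = b"clinic-a-secret-placeholder"
PEPPER_CLINIC_B = b"clinic-b-secret-placeholder"

scan_1 = pseudonym("John Doe", PEPPER_CLINIC_A)
scan_2 = pseudonym("john  doe ", PEPPER_CLINIC_A)
other_clinic = pseudonym("John Doe", PEPPER_CLINIC_B)

print(scan_1 == scan_2)        # True: scans at one clinic can be linked
print(scan_1 == other_clinic)  # False: no link across clinics
```

Note that this construction is only as strong as the secrecy of the pepper: anyone who learns it can hash a list of candidate names and compare the results.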
Which approach would you recommend?
Thanks a lot for your time!
PS: Answering questions: To start with, we have hospitals in Germany, Austria and Switzerland participating (this means EU + Swiss data protection law). However, if opportunities arise, we would like to expand internationally, also to the US.
The goal of the platform is to supply researchers with data. Currently every institution is working with its own data, but modern approaches, especially deep learning, would profit greatly from bigger training data sets.
So it is about hiding the identity of the patients from the researchers using the comprehensive database which will be supplied in a curated fashion. The local institutions have access to their raw patient data anyway so there is no need to hide it from them.
The idea of using a local salt per institution was to prevent users of clinic A who know patient "John Doe" from tracing datasets of "John Doe" when he visits clinic B.
I think instead of patient name + local salt it is a better idea to use the local pseudonym provided by the hospital information system. So in case someone manages to decipher the data they just end up with the local pseudonym. Unfortunately these systems are not standardised and I am not sure whether every system supplies such a record.
If possible I would really like to avoid the approach of having a lookup table. The computers processing the data are often offline and there are multiple computers per institution so it will be really hard to take control of it and keep it in sync and safe. Doctors are often stressed, bleary-eyed and have limited motivation to follow data protection protocols so I am sceptical whether they could handle this manual work.
PPS: After more reading into the subject I am wondering whether PGP could be a way out of my dilemma. Users could encrypt their data with my public key. I would then decrypt the information and generate pseudonyms for the patients with my own central lookup table. The pseudonyms would have no link to the patient data, so nobody with access to the normal database could link them back to the patients.
So an attacker would have to intercept the RSA-encrypted communication between our servers and decipher it to get access to the patient data. An attacker who is able to do that could probably just access the raw data on the hospital servers too, which seems to be the far more attractive target: with the raw data he would end up with, for example, the whole CT scans and not only the brains extracted from them.
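To make the central lookup table idea more concrete, here is a rough sketch of what I have in mind. It leaves out the PGP decryption step entirely and keeps the table in memory only; the class name `PseudonymTable`, the method `pseudonym_for` and the use of an (institution, local ID) pair as the key are all assumptions for the example.

```python
import secrets


class PseudonymTable:
    """Central table mapping patients to random, unlinkable pseudonyms."""

    def __init__(self):
        # Assumption: kept in memory here; a real table would need
        # durable, access-controlled storage.
        self._by_patient = {}

    def pseudonym_for(self, institution: str, local_id: str) -> str:
        key = (institution, local_id)
        if key not in self._by_patient:
            # Random value, not derived from any patient data, so it
            # cannot be recomputed or brute-forced from a name.
            self._by_patient[key] = secrets.token_hex(16)
        return self._by_patient[key]


table = PseudonymTable()
first_scan = table.pseudonym_for("clinic-a", "patient-123")
follow_up = table.pseudonym_for("clinic-a", "patient-123")
print(first_scan == follow_up)  # True: repeated scans stay linked
```

The difference to the hashing approach is that the pseudonym here is purely random, so the only way back to the patient is through the table itself.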