How to best hash patient names to generate a pseudonym?

Question

I am working on a data acquisition and preprocessing tool for a brain tumor database. The tool preprocesses the data and harmonizes it so it can be combined into one big database. Due to privacy restrictions, data protection laws and our own morals, we are required to generate pseudonyms for the patients. What is the best way to achieve that?

There are multiple brain scans per patient and it is important that they can be linked to each individual so we are able to trace the brain tumor over time. From my limited understanding I cannot use an algorithm like bcrypt for that reason, as it will generate a different hash every time due to the unique salt?

My current plan was to use SHA3-256 with the patient name + a local salt which is different for every hospital (I think in this context the salt is also called a pepper?). However, as far as I understand it, SHA-3 is susceptible to brute-force attacks?

This approach also has the disadvantage that the same patient will generate different hashes if he or she is scanned at another institution. However, this is acceptable, as patients unfortunately die too fast to migrate between hospitals. It would still be nice to avoid this problem in case the research network wants to incorporate less fatal diseases in the future.

I have an Electron + Vue.js app running the UI, connected via WebSockets (socket.io) to a Python Flask server running the heavy computations from within a Docker container. As everything is running locally, I guess my setup is not really vulnerable to man-in-the-middle attacks, so it would be acceptable to do the hashing on the Python backend?

Which approach would you recommend?

Thanks a lot for your time!

PS: Answering questions: To start with, we have hospitals in Germany, Austria and Switzerland participating (this means EU + Swiss data protection law). However, if opportunities arise we would like to expand internationally, also to the US.

The goal of the platform is to supply researchers with data. Currently every institution is working with its own data, however modern approaches, especially deep learning would highly profit from bigger training data sets.

So it is about hiding the identity of the patients from the researchers using the comprehensive database which will be supplied in a curated fashion. The local institutions have access to their raw patient data anyway so there is no need to hide it from them.

The idea of using a local salt per institution was to prevent users of clinic A who know patient "John Doe" from tracing datasets of "John Doe" when he visits clinic B.

I think instead of patient name + local salt it is a better idea to use the local pseudonym provided by the hospital information system. So in case someone manages to decipher the data they just end up with the local pseudonym. Unfortunately these systems are not standardised and I am not sure whether every system supplies such a record.

If possible I would really like to avoid the approach of having a lookup table. The computers processing the data are often offline and there are multiple computers per institution so it will be really hard to take control of it and keep it in sync and safe. Doctors are often stressed, bleary-eyed and have limited motivation to follow data protection protocols so I am sceptical whether they could handle this manual work.

PPS: After more reading into the subject I am wondering whether PGP could be a way out of my dilemma. Users could encrypt their data with my public key. I would then decrypt the information and generate pseudonyms for the patients with my own central lookup table. The pseudonyms would have no link to the patient data, so nobody with access only to the normal database could link them back to the patients.

So an attacker would have to intercept the RSA-encrypted communication between our servers and decipher it to get access to the patient data. An attacker who is able to do that could probably just access the raw data on the hospital servers too, which seems to be the far more attractive target: with the raw data he would end up with, for example, the whole CT scans and not only the brains extracted from them.

Don't use a hospital-specific salt unless you specifically don't want to match up patients between hospitals. When you say everything is available locally, do you mean the original names too? Who exactly are the original names supposed to be hidden from? — Macil, Apr 06 '18 at 23:51
What legal jurisdiction are the hospitals from? Makes a big difference when it comes to anonymizing techniques and what is allowable. — nbering, Apr 07 '18 at 02:18
If this is in the United States, the governing legislation would be the Health Insurance Portability and Accountability Act (HIPAA), enforcement of which is covered by Health and Human Services (HHS). They have a guide on de-identification here: https://www.hhs.gov/hipaa/for-professionals/privacy/special-topics/de-identification/index.html — nbering, Apr 07 '18 at 02:21
On a somewhat lighter note... I'm pretty sure your patients and their doctors would prefer that their chances of survival not be factored into hospital privacy policies. It's also just a bit grim. — nbering, Apr 07 '18 at 03:37
Thanks for your input, I tried to answer all your questions in the PS section at the end. — florian, Apr 07 '18 at 07:25
Who said the lookup had to be manual? The key is that only an authorized individual should have access to the lookup table, so they should be logged into their hospital's IT infrastructure to access the information. — nbering, Apr 07 '18 at 14:42
Well there is no way to sync a lookup table between the computers in many of the hospitals so doctors would have to do that manually if they want to preprocess data on multiple computers. — florian, Apr 08 '18 at 12:06

nbering · Answer 1 · 2018-04-07T03:45:11.403

I started commenting in regards to U.S. jurisdiction, so I'll just write an answer specific to that region, because this may help others. There are often a lot of questions and confusion about de-identification for releases of medical records for research purposes.

Here's the disclaimer: I work closely with people who handle this kind of information, but I would not consider myself to be an expert on the Health Insurance Portability and Accountability Act (HIPAA) and its regulations. I recommend consulting with an experienced professional when matters of patient privacy are a concern.

All the information I cover below can be found in its official (and more reliable) form on the Health and Human Services (HHS) website:

https://www.hhs.gov/hipaa/for-professionals/privacy/special-topics/de-identification/index.html#standard

On Hashes as Identifiers in Anonymous Data

Hashing the person's name is definitely not de-identifying the record. If someone guessed that a record belonged to a particular patient, they could simply input the person's name, with the local salt, and confirm the patient's identity.
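
To make the point above concrete, here is a minimal Python sketch of that confirmation attack. It assumes the scheme from the question (SHA3-256 over a local salt plus the patient name); the salt value, the candidate names and both helper functions are made up for illustration.

```python
import hashlib


def name_pseudonym(name: str, local_salt: bytes) -> str:
    """Scheme from the question: SHA3-256 over local salt + patient name."""
    return hashlib.sha3_256(local_salt + name.encode("utf-8")).hexdigest()


def confirm_identity(guess: str, pseudonym: str, local_salt: bytes) -> bool:
    """Anyone who knows the salt can test whether a guessed name matches."""
    return name_pseudonym(guess, local_salt) == pseudonym


# Hypothetical values, for illustration only.
HOSPITAL_SALT = b"clinic-a-secret"
released_pseudonym = name_pseudonym("John Doe", HOSPITAL_SALT)

# An insider suspects the record belongs to one of a few people.
candidates = ["Jane Roe", "John Doe", "Max Mustermann"]
matches = [
    name for name in candidates
    if confirm_identity(name, released_pseudonym, HOSPITAL_SALT)
]
print(matches)
```

Because the hash is deterministic, the same check works for any list of candidate names, which is why a name-derived identifier remains linkable to the person.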

The best way to identify the patient's record would be with an identifier that is not generated from the patient's information at all - like a UUID or a cryptographically random string. Then, keep a lookup table that refers back to the patient's identifying information.
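
A rough sketch of that idea in Python, assuming an in-memory table: the class name, its methods and the `patient_key` format are hypothetical, and a real deployment would need the table stored persistently and access-controlled inside the hospital.

```python
import uuid


class PseudonymTable:
    """Maps patients to random pseudonyms that carry no patient information."""

    def __init__(self):
        self._by_patient = {}    # patient key -> pseudonym
        self._by_pseudonym = {}  # pseudonym -> patient key

    def pseudonym_for(self, patient_key: str) -> str:
        """Return the patient's pseudonym, creating a random one on first use."""
        if patient_key not in self._by_patient:
            pseudonym = uuid.uuid4().hex  # random, not derived from the patient
            self._by_patient[patient_key] = pseudonym
            self._by_pseudonym[pseudonym] = patient_key
        return self._by_patient[patient_key]

    def reidentify(self, pseudonym: str) -> str:
        """Reverse lookup; only the holder of the table can do this."""
        return self._by_pseudonym[pseudonym]


table = PseudonymTable()
first_scan = table.pseudonym_for("John Doe|1970-01-01")
second_scan = table.pseudonym_for("John Doe|1970-01-01")
print(first_scan == second_scan)  # same patient, same pseudonym
```

Repeated scans of the same patient map to the same pseudonym, so a tumor can still be traced over time, while the pseudonym itself reveals nothing without the table.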

To quote from the De-Identification guide linked above:

(c) Implementation specifications: re-identification. A covered entity may assign a code or other means of record identification to allow information de-identified under this section to be re-identified by the covered entity, provided that:

(1) Derivation. The code or other means of record identification is not derived from or related to information about the individual and is not otherwise capable of being translated so as to identify the individual; and

(2) Security. The covered entity does not use or disclose the code or other means of record identification for any other purpose, and does not disclose the mechanism for re-identification.

De-Identification Types

There are two types of de-identification described in the Health Insurance Portability and Accountability Act.

Expert Determination, and
Safe Harbor

Expert Determination

In brief, expert determination is the process of having a professional statistician review the material, and remove any information that could be used to determine who the individuals are in the medical records.

This method has the advantage of potentially being able to leave more detail in the records, as long as it couldn't be used to trace to the particular individual who the records belong to.

Just in case you're wondering, if you are asking a question on Security Stack Exchange about de-identification, you are probably not qualified to certify the data in this way yourself, and would need to seek a professional to fill your needs. They would also need to be under a Business Associate Agreement (BAA) to handle the data, as you should be if you are handling medical records for a U.S. hospital (again, I am assuming U.S. for this answer).

Safe Harbor

Safe Harbor de-identification is a little easier to do for a software developer. It involves removing all 18 types of identifiers on the following list from the medical records:

Names
Geographic Information
Dates
Telephone Numbers
Vehicle Identifiers
Fax Numbers
Device Identifiers and Serial Numbers
Email Addresses
URLs
Social Security Numbers
IP Addresses
Medical Record Numbers
Biometric Identifiers (e.g. Fingerprints)
Health Plan Beneficiary Numbers
Full-face photographs and comparable images
Account Numbers
Any unique identifying number or characteristic (with some well-defined exceptions)
Certificate/License Numbers

This list is just a summary; refer to HIPAA for the specific descriptions of these items, especially the geographic information, for which they give very detailed instructions.

Note that information identifying the hospital or its address may be considered geographic information, so the records may need to be stripped of that information to meet Safe Harbor requirements.
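
For a developer, the mechanical part of this can be sketched as dropping known identifier fields from each record. The field names below are hypothetical and stand in for whatever schema the tool actually uses. Removing keys like this does not by itself establish Safe Harbor compliance, since identifiers can also hide in free text, file names or image headers.

```python
# Hypothetical field names; a real schema will differ and must be checked
# against the full HIPAA descriptions of the 18 identifier types.
SAFE_HARBOR_FIELDS = {
    "name", "address", "birth_date", "scan_date", "phone", "fax",
    "email", "ssn", "medical_record_number", "health_plan_number",
    "account_number", "license_number", "vehicle_id", "device_serial",
    "url", "ip_address", "biometric_id", "face_photo", "other_unique_id",
}


def strip_identifiers(record: dict) -> dict:
    """Return a copy of the record without any of the listed identifier fields."""
    return {
        key: value
        for key, value in record.items()
        if key not in SAFE_HARBOR_FIELDS
    }


record = {
    "name": "John Doe",
    "birth_date": "1970-01-01",
    "medical_record_number": "MRN-0001",
    "scan_modality": "MRI",
    "tumor_volume_ml": 12.5,
}
cleaned = strip_identifiers(record)
print(cleaned)
```

A deny-list like this fails open when a new identifying field is added to the schema; an allow-list of fields known to be safe would be the more cautious design.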

Re-Identification

The guide I mentioned above also has some information on HHS's guidelines for matching a record back to the patient should the research reveal information that needs to be relayed back to the patient or their physician.

This process usually involves assigning a unique random ID to each record and storing the mapping with the Covered Entity (usually a hospital), so that they can do a reverse lookup to identify a patient from the de-identified records. That lookup table itself would be classified as PHI, so it could not be released with the records.

While what you wrote is interesting, I don't think it answers the question: `Which approach would you recommend?` — Neil Smithline, Apr 07 '18 at 03:12
You make a very good point. I got a little lost in my head around half way. ;) I'll add more about the specific question. — nbering, Apr 07 '18 at 03:14
Thanks for your input, I tried to answer all your questions in the PS section at the end. — florian, Apr 07 '18 at 07:26
I see your P.S. I can only speak to HIPAA in the US on the research data side, since it’s the only regulation I have training on. It’s also notable that the US actually has special provisions in the law to allow for release of records for medical research without patient consent, I’m not sure if other jurisdictions have such a thing. When you cross boundaries, you may need a consent mechanism to release enough metadata between institutions to connect the patient’s records in the two places. Just a guess. — nbering, Apr 07 '18 at 13:22