
If I am implementing a tokenization system for PII within a database, is it considered bad practice, or riskier, to reuse tokens?

For example, if I am storing the name "Richard" multiple times, and every instance is replaced with the token "Fxyw3Qq5yzXqDoiKqx", does that introduce any additional risk compared to using a unique identifier for each "Richard"?

Marc
  • This is more of a data science question. It will also depend on a lot of other factors concerning your implementation. – schroeder May 22 '18 at 10:39

1 Answer


Yes, it introduces risk, but might be required, depending on what you're doing with the data.

Imagine a database of lots of people. Perhaps it includes names, addresses, and dates of birth, but the dates of birth aren't encrypted or tokenised, to make it easy to send out birthday reminders.

If an attacker who steals the database can identify the name associated with a given date of birth (perhaps they are on the system themselves, so they know their own name and DoB), then they can also identify anyone else with the same name. They still can't identify the associated address, but they may be able to start correlating data - for example, looking for people with the same first-name token whose dates of birth are known, aiming to discover which tokens correspond to common surnames. By repeating this process (which is laborious, depending on the data and whether details can be reliably cross-referenced), they can build up more and more information about the contents of the database.
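To make the cross-referencing concrete, here is a minimal sketch in Python of how an attacker might pivot from one known record to others. The column layout, token strings, and dates are all made up for illustration:

```python
from collections import Counter

# Hypothetical leaked rows: (first_name_token, surname_token, plaintext_dob).
leaked_rows = [
    ("Fxyw3Qq5yzXqDoiKqx", "K9rTbA", "1984-03-12"),
    ("Fxyw3Qq5yzXqDoiKqx", "Vw2pLm", "1990-07-01"),
    ("Hu7sNd41cPqRwzYtEe", "K9rTbA", "1971-11-23"),
]

# The attacker knows their own record: name "Richard", DoB 1984-03-12,
# so the first-name token on that row must mean "Richard".
my_dob = "1984-03-12"
richard_token = next(row[0] for row in leaked_rows if row[2] == my_dob)

# Because the token is reused, every other "Richard" is now identifiable.
other_richards = [row for row in leaked_rows
                  if row[0] == richard_token and row[2] != my_dob]

# Frequency analysis also works: the most common surname tokens will
# correspond to the most common surnames in the population.
surname_frequency = Counter(row[1] for row in leaked_rows).most_common()
print(other_richards, surname_frequency)
```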

If the tokens are actually produced by encryption, an attacker might also be able to find patterns which help reveal the encryption key if the designers haven't implemented the encryption carefully - in some cases this means that by identifying a single longer value, an attacker can decrypt any other shorter value in the system.
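For a sense of what "not implemented carefully" can look like, a deterministic block-cipher scheme such as raw AES in ECB mode leaks equality of inputs directly. A rough sketch, assuming the cryptography package; the key handling and zero-padding here are deliberately naive and purely illustrative:

```python
import os
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

key = os.urandom(32)  # hypothetical tokenisation key

def naive_ecb_token(value: bytes) -> str:
    # Zero-pad to the 16-byte AES block size (deliberately naive).
    padded = value.ljust(-(-len(value) // 16) * 16, b"\x00")
    encryptor = Cipher(algorithms.AES(key), modes.ECB()).encryptor()
    return (encryptor.update(padded) + encryptor.finalize()).hex()

print(naive_ecb_token(b"Richard"))   # same input...
print(naive_ecb_token(b"Richard"))   # ...same ciphertext, so duplicates are visible
print(naive_ecb_token(b"Margaret"))  # different input, different ciphertext
```

In ECB mode each 16-byte block encrypts the same way wherever it appears, which is roughly how identifying one longer value can expose shorter values that share its blocks.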

If, on the other hand, you use a unique token for each instance of the same name, an attacker can't perform that cross-referencing process. However, neither can you - if you wanted to extract all the records for people called "Richard", you would need to be able to recreate every token to compare against your search term, which could be a difficult process, or even impossible if the token generation process involved a one-way hashing step.
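The trade-off between the two approaches can be sketched in a few lines of Python. The HMAC-based deterministic scheme and the vault-style random scheme below are illustrative assumptions (including the names TOKEN_KEY, deterministic_token, and random_token), not a prescription:

```python
import hashlib
import hmac
import secrets

TOKEN_KEY = secrets.token_bytes(32)  # hypothetical secret held by the tokenisation service
vault: dict[str, str] = {}           # token -> original value, needed for random tokens

def deterministic_token(value: str) -> str:
    # Same input always yields the same token, so the column stays searchable,
    # but duplicates in the original data remain visible as duplicates.
    return hmac.new(TOKEN_KEY, value.encode(), hashlib.sha256).hexdigest()[:18]

def random_token(value: str) -> str:
    # Fresh token for every occurrence: no cross-referencing is possible,
    # but the token cannot be recomputed later, so a vault mapping is required
    # if the value ever has to be recovered.
    token = secrets.token_urlsafe(13)
    vault[token] = value
    return token

# Searching with deterministic tokens: recompute the token and compare, e.g.
#   SELECT * FROM people WHERE first_name_token = :needle
needle = deterministic_token("Richard")

# With random tokens the same search would mean detokenising (or scanning the
# vault for) every stored record, and is impossible if a one-way hashing step
# replaced the vault.
```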

Basically, if you want to be able to search the data afterwards and correlate records, you probably need to consistently tokenise the same value to the same token. If you want to fully anonymise the data - perhaps to provide it to a third party for testing, or where analysis is only being performed on non-PII elements - it's safer to ensure each instance of the same value in the original data is distinct once processed.

Matthew