I have encountered the term "sensitive attribute" multiple times while reading about k-anonymity, but the texts never formally define what this term means.

Take this example of a k-anonymized table from Wikipedia:

+------+---------------+--------+-------------------+----------+-------------------+
| Name |      Age      | Gender | State of domicile | Religion |     Disease       |
+------+---------------+--------+-------------------+----------+-------------------+
| *    | 20 < Age ≤ 30 | Female | Tamil Nadu        | *        | Cancer            |
| *    | 20 < Age ≤ 30 | Female | Kerala            | *        | Viral infection   |
| *    | 20 < Age ≤ 30 | Female | Tamil Nadu        | *        | TB                |
| *    | 20 < Age ≤ 30 | Male   | Karnataka         | *        | No illness        |
| *    | 20 < Age ≤ 30 | Female | Kerala            | *        | Heart-related     |
| *    | 20 < Age ≤ 30 | Male   | Karnataka         | *        | TB                |
| *    | Age ≤ 20      | Male   | Kerala            | *        | Cancer            |
| *    | 20 < Age ≤ 30 | Male   | Karnataka         | *        | Heart-related     |
| *    | Age ≤ 20      | Male   | Kerala            | *        | Heart-related     |
| *    | Age ≤ 20      | Male   | Kerala            | *        | Viral infection   |
+------+---------------+--------+-------------------+----------+-------------------+

Here, "Disease" is designated as the sensitive attribute. One can observe that k-anonymity does not hold for any k > 1 once the "Disease" column is taken into account...

Is the sensitive attribute the piece of information which under no circumstances should be mapped to an individual? Or is it the attribute which shall not be generalized/suppressed for the purpose of data mining? Or is it something entirely different?

Denny

1 Answer

In the k-anonymity data model, there are different types of attributes:

  • Identifiers allow direct identification and must be removed.
  • Quasi-identifiers allow linking with other datasets. The model assumes that the data holder knows which attributes are quasi-identifiers. In practice, this can be difficult to determine, since it depends on all the other data available in the universe.
  • Sensitive attributes are those attributes that shouldn't be linkable to an individual. What is sensitive depends on the context. Misjudging sensitivity can lead to weaker anonymization than expected.
  • Other attributes (neither sensitive nor quasi-identifiers) are not relevant.

In the k-anonymized data set you've shown (k=2), the data holder has anonymized attributes like name, age, gender, state, and religion. This indicates that the data holder considers these attributes to be identifiers or quasi-identifiers. However, there is no k-anonymization with regard to the disease attribute. This indicates that the data holder does not consider this attribute to be a quasi-identifier, but more likely a sensitive attribute for which linkage should be prevented.
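To make the k=2 claim concrete, here is a minimal sketch that groups the rows by the assumed quasi-identifiers (age, gender, state) and reports the size of the smallest group. The `ROWS` list is transcribed from the table in the question; the helper name `smallest_group` and the column indices are my own choices for illustration, not part of any library.

```python
from collections import Counter

# Rows transcribed from the table in the question, as (age, gender, state, disease).
# Name and Religion are fully suppressed ("*") and left out; "≤" is written as "<=".
ROWS = [
    ("20 < Age <= 30", "Female", "Tamil Nadu", "Cancer"),
    ("20 < Age <= 30", "Female", "Kerala", "Viral infection"),
    ("20 < Age <= 30", "Female", "Tamil Nadu", "TB"),
    ("20 < Age <= 30", "Male", "Karnataka", "No illness"),
    ("20 < Age <= 30", "Female", "Kerala", "Heart-related"),
    ("20 < Age <= 30", "Male", "Karnataka", "TB"),
    ("Age <= 20", "Male", "Kerala", "Cancer"),
    ("20 < Age <= 30", "Male", "Karnataka", "Heart-related"),
    ("Age <= 20", "Male", "Kerala", "Heart-related"),
    ("Age <= 20", "Male", "Kerala", "Viral infection"),
]


def smallest_group(rows, qi_columns):
    """Return the size of the smallest group of rows that agree on qi_columns."""
    groups = Counter(tuple(row[i] for i in qi_columns) for row in rows)
    return min(groups.values())


# Assumed quasi-identifiers: age (0), gender (1), state (2).
k = smallest_group(ROWS, qi_columns=(0, 1, 2))
print(k)  # 2
```

The table is k-anonymous for the largest k such that every combination of quasi-identifier values occurs at least k times, which is why the minimum group size is the number of interest.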

If I have information like “a female patient from Kerala between 20 and 30 years of age”, then I can't be certain about their disease: it could be either a viral infection or a heart-related illness. The anonymization at k=2 holds.

However, if I have information like “a cancer patient from Kerala”, then I can easily identify the patient in the dataset, because only one row matches. I was able to use the disease attribute as a quasi-identifier, contrary to the assumptions of the data holder. This is an example of a background knowledge attack. Weaker variants of such attacks rely on background knowledge about the correlation or distribution of attributes, allowing probabilistic identification.
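The two lookups can be sketched as simple filters over the table. This is only an illustration: the `candidates` helper and the column names in `COLUMNS` are made up for this example, and the rows are transcribed from the table in the question.

```python
# Column names are my own labels for the four non-suppressed columns.
COLUMNS = ("age", "gender", "state", "disease")

# Rows transcribed from the table in the question; "≤" is written as "<=".
ROWS = [
    ("20 < Age <= 30", "Female", "Tamil Nadu", "Cancer"),
    ("20 < Age <= 30", "Female", "Kerala", "Viral infection"),
    ("20 < Age <= 30", "Female", "Tamil Nadu", "TB"),
    ("20 < Age <= 30", "Male", "Karnataka", "No illness"),
    ("20 < Age <= 30", "Female", "Kerala", "Heart-related"),
    ("20 < Age <= 30", "Male", "Karnataka", "TB"),
    ("Age <= 20", "Male", "Kerala", "Cancer"),
    ("20 < Age <= 30", "Male", "Karnataka", "Heart-related"),
    ("Age <= 20", "Male", "Kerala", "Heart-related"),
    ("Age <= 20", "Male", "Kerala", "Viral infection"),
]


def candidates(rows, **known):
    """Return all rows consistent with what the attacker already knows."""
    return [
        row for row in rows
        if all(row[COLUMNS.index(col)] == value for col, value in known.items())
    ]


# Knowledge the data holder anticipated: only quasi-identifiers.
intended = candidates(ROWS, age="20 < Age <= 30", gender="Female", state="Kerala")
print(len(intended), sorted({row[3] for row in intended}))
# 2 ['Heart-related', 'Viral infection']

# Background knowledge attack: the "sensitive" attribute is used for linking.
attack = candidates(ROWS, state="Kerala", disease="Cancer")
print(attack)
# [('Age <= 20', 'Male', 'Kerala', 'Cancer')]
```

In the first case two candidates remain and the disease stays ambiguous. In the second case a single row remains, so the attacker learns the age bracket and gender of that patient.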

This difficulty in determining the actual quasi-identifiers is a core weakness of the k-anonymity model. To be safe, everything should be considered a quasi-identifier, but this will clearly reduce data quality. From an attacker's perspective, understanding the differences between the data holder's assumptions and the actually available information may enable re-identification. Indeed, Sweeney's motivation for developing k-anonymization and similar methods was that she was able to re-identify records in supposedly anonymized data by linking it with another dataset, since the data holders hadn't considered quasi-identifiers like ZIP code, gender, and date of birth.
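The trade-off can be illustrated on the table from the question. In this rough sketch, every column including the disease is treated as a quasi-identifier. The `smallest_group` helper is a hypothetical name, and fully suppressing the disease column is just one crude way to restore k=2, chosen here to show the loss in data quality.

```python
from collections import Counter

# Rows transcribed from the table in the question, as (age, gender, state, disease);
# "≤" is written as "<=".
ROWS = [
    ("20 < Age <= 30", "Female", "Tamil Nadu", "Cancer"),
    ("20 < Age <= 30", "Female", "Kerala", "Viral infection"),
    ("20 < Age <= 30", "Female", "Tamil Nadu", "TB"),
    ("20 < Age <= 30", "Male", "Karnataka", "No illness"),
    ("20 < Age <= 30", "Female", "Kerala", "Heart-related"),
    ("20 < Age <= 30", "Male", "Karnataka", "TB"),
    ("Age <= 20", "Male", "Kerala", "Cancer"),
    ("20 < Age <= 30", "Male", "Karnataka", "Heart-related"),
    ("Age <= 20", "Male", "Kerala", "Heart-related"),
    ("Age <= 20", "Male", "Kerala", "Viral infection"),
]


def smallest_group(rows):
    """Size of the smallest group of identical rows (all columns as quasi-identifiers)."""
    return min(Counter(rows).values())


# With the disease treated as a quasi-identifier, every row is unique.
k_all_columns = smallest_group(ROWS)
print(k_all_columns)  # 1

# Suppressing the disease restores k=2, but the column is now useless for analysis.
suppressed = [(age, gender, state, "*") for age, gender, state, _ in ROWS]
k_suppressed = smallest_group(suppressed)
print(k_suppressed)  # 2
```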

amon