What is the definition of "sensitive attribute" in the context of k-anonymity?

Question

I have encountered the term "sensitive attribute" multiple times when reading up on the concept of k-anonymity, but the texts never formally define what this term means.

Take this example of a k-anonymized table from Wikipedia:

+------+---------------+--------+-------------------+----------+-------------------+
| Name |      Age      | Gender | State of domicile | Religion |     Disease       |
+------+---------------+--------+-------------------+----------+-------------------+
| *    | 20 < Age ≤ 30 | Female | Tamil Nadu        | *        | Cancer            |
| *    | 20 < Age ≤ 30 | Female | Kerala            | *        | Viral infection   |
| *    | 20 < Age ≤ 30 | Female | Tamil Nadu        | *        | TB                |
| *    | 20 < Age ≤ 30 | Male   | Karnataka         | *        | No illness        |
| *    | 20 < Age ≤ 30 | Female | Kerala            | *        | Heart-related     |
| *    | 20 < Age ≤ 30 | Male   | Karnataka         | *        | TB                |
| *    | Age ≤ 20      | Male   | Kerala            | *        | Cancer            |
| *    | 20 < Age ≤ 30 | Male   | Karnataka         | *        | Heart-related     |
| *    | Age ≤ 20      | Male   | Kerala            | *        | Heart-related     |
| *    | Age ≤ 20      | Male   | Kerala            | *        | Viral infection   |
+------+---------------+--------+-------------------+----------+-------------------+

Here, "Disease" is defined as the sensitive attribute. One can observe that this "sensitive attribute" column is not itself k-anonymous for any k > 1...

Is the sensitive attribute the piece of information which under no circumstances should be mapped to an individual? Or is it the attribute which shall not be generalized/suppressed for the purpose of data mining? Or is it something entirely different?

score 1 · Accepted Answer · answered Mar 01 '21 at 10:45

In the k-Anonymity data model, there are different types of attributes:

Identifiers allow direct identification and must be removed.
Quasi-Identifiers allow linking with other datasets. It is assumed that the data holder knows which attributes are quasi-identifiers. In practice, this can be difficult, since it depends on all the other data available in the universe.
Sensitive attributes are those attributes that shouldn't be linkable to an individual. What is sensitive depends on the context. Misjudging sensitivity can lead to a weaker-than-expected anonymization.
Other attributes (neither sensitive nor quasi-identifiers) are not relevant.

In the k-anonymized data set you've shown (k=2), the data holder has anonymized attributes like name, age, gender, state, and religion. This indicates that the data holder considers these attributes to be identifiers or quasi-identifiers. However, there is no k-anonymization with regard to the disease attribute. This indicates that the data holder does not consider this attribute to be a quasi-identifier, but likely a sensitive attribute for which linkage should be prevented.
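
As a rough check of the k=2 claim, here is a minimal sketch that groups the rows of the table by a chosen set of quasi-identifiers and reports the smallest group size. The helper name `k_of`, the column labels, and the ASCII age labels are my own choices for illustration, not part of any standard library or of the original table.

```python
from collections import Counter

# Rows copied from the table in the question. Name and Religion are
# suppressed ("*") in every row, so they are left out here.
# Age brackets are written in ASCII: "20-30" stands for 20 < Age <= 30.
COLUMNS = ("age", "gender", "state", "disease")
ROWS = [
    ("20-30", "Female", "Tamil Nadu", "Cancer"),
    ("20-30", "Female", "Kerala", "Viral infection"),
    ("20-30", "Female", "Tamil Nadu", "TB"),
    ("20-30", "Male", "Karnataka", "No illness"),
    ("20-30", "Female", "Kerala", "Heart-related"),
    ("20-30", "Male", "Karnataka", "TB"),
    ("<=20", "Male", "Kerala", "Cancer"),
    ("20-30", "Male", "Karnataka", "Heart-related"),
    ("<=20", "Male", "Kerala", "Heart-related"),
    ("<=20", "Male", "Kerala", "Viral infection"),
]


def k_of(rows, quasi_identifiers):
    """Size of the smallest group of rows that share the same values
    on the given quasi-identifier columns (hypothetical helper)."""
    idx = [COLUMNS.index(c) for c in quasi_identifiers]
    groups = Counter(tuple(row[i] for i in idx) for row in rows)
    return min(groups.values())


# The data holder's assumption: only age, gender and state are quasi-identifiers.
k_assumed = k_of(ROWS, ["age", "gender", "state"])

# What happens if disease is treated as a quasi-identifier as well.
k_with_disease = k_of(ROWS, ["age", "gender", "state", "disease"])

print(k_assumed, k_with_disease)  # prints: 2 1
```

Under the data holder's assumption the smallest group has two rows, but once the disease column is counted as a quasi-identifier every row is unique.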

If I have information like “a female patient from Kerala between 20 and 30 years of age”, then I can't be certain about their disease: it could be either a viral infection or a heart-related illness. The anonymization at k=2 holds.

However, if I have information like “a cancer patient from Kerala”, then I can easily identify the patient in the dataset, because only one row matches (a male patient aged 20 or under). This is because I was able to use the disease attribute as a quasi-identifier, contrary to the assumptions of the data holder. This is an example of a background knowledge attack. Weaker variants of such attacks include background knowledge of correlation or distribution of attributes, allowing probabilistic identification.
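
The two lookups can be reproduced with a small filter over the table. This is only an illustrative sketch: the function name `candidates`, the dictionary keys, and the ASCII age labels are assumptions made for this example.

```python
# Rows copied from the table in the question (Name and Religion are
# suppressed there, so they are omitted). "20-30" stands for
# 20 < Age <= 30 and "<=20" for Age <= 20.
KEYS = ("age", "gender", "state", "disease")
ROWS = [dict(zip(KEYS, values)) for values in [
    ("20-30", "Female", "Tamil Nadu", "Cancer"),
    ("20-30", "Female", "Kerala", "Viral infection"),
    ("20-30", "Female", "Tamil Nadu", "TB"),
    ("20-30", "Male", "Karnataka", "No illness"),
    ("20-30", "Female", "Kerala", "Heart-related"),
    ("20-30", "Male", "Karnataka", "TB"),
    ("<=20", "Male", "Kerala", "Cancer"),
    ("20-30", "Male", "Karnataka", "Heart-related"),
    ("<=20", "Male", "Kerala", "Heart-related"),
    ("<=20", "Male", "Kerala", "Viral infection"),
]]


def candidates(rows, **known):
    """Rows consistent with everything the attacker already knows
    (hypothetical helper)."""
    return [r for r in rows if all(r[key] == value for key, value in known.items())]


# Attacker knows only the assumed quasi-identifiers: two rows remain,
# so the disease stays ambiguous.
hidden = candidates(ROWS, age="20-30", gender="Female", state="Kerala")
print(len(hidden), sorted(r["disease"] for r in hidden))
# prints: 2 ['Heart-related', 'Viral infection']

# Attacker knows the disease as background knowledge: one row remains.
exposed = candidates(ROWS, state="Kerala", disease="Cancer")
print(len(exposed), exposed[0]["age"], exposed[0]["gender"])
# prints: 1 <=20 Male
```

In the second lookup the "sensitive" column is the one doing the identifying, which is exactly the mismatch between the data holder's assumptions and the attacker's knowledge.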

This difficulty in determining the actual quasi-identifiers is a core weakness of the k-anonymity model. To be safe, everything should be considered a quasi-identifier, but this will clearly reduce data quality. From an attacker's perspective, understanding the differences between the data holder's assumptions and the actually available information may enable re-identification. Indeed, Sweeney's motivation for developing k-anonymization and similar methods was that she was able to link two supposedly anonymized datasets, since the data holders hadn't considered quasi-identifiers like ZIP code, gender, and date of birth.
