In 1997, a PhD student named Latanya Sweeney heard about an
interesting data release. A health insurance organization from
Massachusetts had compiled a database of hospital visits by state employees, and
had thought that giving it to researchers could encourage innovation and
scientific discovery. Of course, there were privacy considerations: allowing
researchers to look at other citizens' health records seemed pretty creepy. So
they decided to do the obvious thing, and remove all columns that indicated who
a patient was: name, phone number, full address, social security number, etc.

As you can probably guess, this didn't end so well. In this article, I'll
describe and analyze Sweeney's successful reidentification attack, and I'll
explain the privacy definition that Sweeney invented to prevent this type of
attack in the future: \(k\)-anonymity.

What went wrong?

Some demographic information was left in the database, so researchers could
still compile useful stats: ZIP code, date of birth, and gender were all part
of the data. Sweeney realized that the claims of the Massachusetts governor,
who insisted that the privacy of state employees was respected (all identifiers
were removed!), were perhaps a little bit over-optimistic. Since the governor
himself was a state employee, Sweeney decided to do the obvious thing and
reidentify which records of the "anonymized" database were the governor's.

With just $20, Sweeney bought the public voter records from Massachusetts, which
had both full identifiers (names, addresses) and demographic data (ZIP code and
date of birth), and contained the governor's information. Guess how many records
matched the governor's gender, ZIP code, and date of birth inside the hospital
database? Just one. Sweeney could thus tell which prescriptions and hospital
visits in the data were the governor's. She mailed all of it to his office,
demonstrating rather theatrically that the anonymization process wasn't as
solid as it should have been.

Several factors made this attack possible. Some are obvious, but not all:

1. The hospital data contained demographic information that could be used to
   distinguish between different records.

2. A secondary database was available to figure out the demographic
   information about the target.

3. The target was in both datasets.

4. The demographic information of the target (ZIP code, date of birth, and
   gender) was unique within both datasets: only one record had the
   demographic values of the governor.

At first glance, these factors appear to be necessary: remove one of them and
suddenly, the attack no longer works. (Try it! It's a good mental exercise.)

How to prevent this attack?

As per our previous remark, removing one of the factors should be enough to
prevent attacks like these. Which ones can we afford to remove, while making
sure that the data can be used for data analysis tasks?

1. We could remove all demographic information from the data, or even all
   information that might be linked to a person using auxiliary sources.
   Unfortunately, this would also severely hinder the utility of the data:
   correlations based on age, gender, and geographic info are very useful to
   researchers!

2. Society probably should do something about the existence of public (or
   commercially available) data sources that can be used in reidentification
   attacks. However, this is a complex political issue, and a little bit out
   of scope for a data owner who wants to publish or share an anonymized
   version of their data; in practice, there's pretty much nothing we can do
   about it.

3. Again, there's not much we can do: we have no way to remove our target
   from the secondary (public) dataset. We could decrease the probability
   that a random target is in our dataset by sub-sampling it, but all people
   in the sample would still be at risk, so this is obviously not a
   satisfying solution.

4. Now, this is the interesting point. Maybe suppressing all demographic
   values would render the data useless, but there might be a middle ground:
   making sure that the demographic values are no longer unique in the
   dataset.

This last suggestion is the basic idea of \(k\)-anonymity.

Definition of \(k\)-anonymity

A dataset is said to be \(k\)-anonymous if every combination of values for
demographic columns in the dataset appears in at least \(k\) different records.

For example, this dataset is \(2\)-anonymous:

ZIP code | age
---------|----
4217     | 34
4217     | 34
1742     | 77
1742     | 77
4217     | 34

This one isn't:

ZIP code | age
---------|----
4217     | 34
1742     | 77
1743     | 77
4217     | 34

Notice that we need every combination of values to appear at least \(k\) times.
Thus, even if each individual value of each column appears \(2\) times in the
following dataset, it's not \(2\)-anonymous:

ZIP code | age
---------|----
4217     | 34
1742     | 34
4217     | 77
1742     | 77

The intuition is that when a dataset is \(k\)-anonymous for a sufficiently large
\(k\), the last requirement for a successful reidentification attack is broken. An
attacker might find out the demographic information of their target using a
secondary database, but this demographic information will then be shared by at
least \(k\) different records, making it impossible to tell which one belongs
to their target.
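To make the definition concrete, here is a minimal sketch (in Python, with hypothetical column names) of how one could check whether a dataset satisfies \(k\)-anonymity for a given set of demographic columns:

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """Check that every combination of quasi-identifier values
    appears in at least k records."""
    combinations = Counter(
        tuple(record[qi] for qi in quasi_identifiers)
        for record in records
    )
    return all(count >= k for count in combinations.values())

# The 2-anonymous example above:
table = [
    {"zip": "4217", "age": 34},
    {"zip": "4217", "age": 34},
    {"zip": "1742", "age": 77},
    {"zip": "1742", "age": 77},
    {"zip": "4217", "age": 34},
]
print(is_k_anonymous(table, ["zip", "age"], 2))  # True

# Each individual value appears twice, but each *combination* is unique:
table2 = [
    {"zip": "4217", "age": 34},
    {"zip": "1742", "age": 34},
    {"zip": "4217", "age": 77},
    {"zip": "1742", "age": 77},
]
print(is_k_anonymous(table2, ["zip", "age"], 2))  # False
```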

What types of data are reidentifying?

Note that we've only talked about "demographic information", which is pretty
vague. ZIP codes, age, gender are all good candidates for reidentification
attacks, because they're public (or easily findable) information that is also
often found in sensitive datasets (especially medical ones). In general, the
data owner should consider which columns might be used by the attacker they're
concerned about.

These columns, which are not necessarily sensitive themselves but might be used
in a reidentification attack, are called quasi-identifiers (or QIs). There is
no universal list of quasi-identifiers; it depends on the attack model. While
some data types are almost always QIs (ZIP code, age, gender…), many more
depend on the context (like timestamps, medical conditions, physical
characteristics…). The question to ask is: would the person who's trying to
attack our dataset have access to these values through public or commercially
available data?

I'll try to write more about attack modeling and data classification
later. This is not as easily explainable as the various mathematical definitions
of privacy: it has lots of human components and as such, is always a bit fuzzy.
Which makes it even more interesting! :D But I digress.

How to choose \(k\)?

Short answer: ¯\_(ツ)_/¯

Longer answer: nobody knows. In the healthcare world, when medical data is
shared with a small number of people (typically for research purposes), \(k\) is
often chosen between \(5\) and \(15\). This choice is very arbitrary and ad hoc. To
the best of my knowledge, there is no official law or regulation which suggests
a specific value. Some universities, companies or other organizations have
official guidelines, but the vast majority don't.

To pick a parameter for a privacy definition, one needs to understand the link
between the parameter value and the risk of a privacy incident happening. But
this is difficult: while \(k\)-anonymity is relatively easy to understand,
estimating risk quantitatively is extremely tricky. I'm also going to
write a bit about this later on!

Regulators don't want to include specific parameter values in laws or
guidelines, since there is no convincing argument to be made for a given
choice, and the level of risk depends on many fuzzier parameters (how
valuable the data is, how bad a privacy incident would be, etc.).

Data owners don't know how to choose the parameter either, so they usually buy
the services of a privacy consultant to make this choice (and take care of the
anonymization process). This consultant doesn't know what the "good" choice is
either, but they usually have more practical experience of which values are
common in the industry for similar levels of risk.

This is my first "real" blog post, about the most basic anonymity
definition there is, and I've already reached my second digression to say
"notice how it's actually super fuzzy and thus, complicated to apply in
practice?". Isn't privacy fun? :D

How to make a dataset \(k\)-anonymous?

So, suppose we picked our quasi-identifiers and \(k=2\). Even with such a low
value for \(k\), our original dataset will likely not be \(k\)-anonymous: there will
be many records with unique combinations of quasi-identifier values.

The two main building blocks used to transform a dataset into a \(k\)-anonymous
table are generalization and suppression.

Building block 1: generalization

Generalization is the process of making a quasi-identifier value less precise,
so that records with different values are transformed (or generalized) into
records that share the same values. Consider the records in this table:

ZIP code | age
---------|----
4217     | 34
4217     | 39
1742     | 75
1691     | 77

The numerical values of these records can be transformed into ranges, so that
the resulting table satisfies \(2\)-anonymity:

ZIP code | age
---------|------
4217     | 30-39
4217     | 30-39
1***     | 75-79
1***     | 75-79

The idea of generalization is to make demographic information more imprecise to
satisfy our privacy requirements, but still allow useful data analysis to be
done. In our example, changing precise ages into age ranges is probably enough
to analyze whether a disease affects young or old people disproportionately.

Transforming a numerical value into a range is one of the most typical ways of
performing generalization. Other ways include removing a value entirely (e.g.
transforming a gender value into "gender unknown"), or using a generalization
hierarchy (e.g. transforming an ICD-10 diagnosis code into a
truncated code, or the corresponding block).
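As a sketch, these two common generalization primitives (age bucketing and ZIP truncation) might look like this in Python; the bucket width and the number of digits to keep are arbitrary choices for illustration, not part of any definition:

```python
def generalize_age(age, width=5):
    """Generalize an exact age into a `width`-year range like "30-34"."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def generalize_zip(zip_code, keep=1):
    """Truncate a ZIP code, keeping only its first `keep` digits."""
    return zip_code[:keep] + "*" * (len(zip_code) - keep)

print(generalize_age(34, width=10))  # "30-39"
print(generalize_age(77))            # "75-79"
print(generalize_zip("1742"))        # "1***"
```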

Two types of generalization

Generalization strategies can be classified into two categories: global and
local. Consider the following table:

ZIP code | age
---------|----
4217     | 34
4217     | 34
1742     | 34
1742     | 31

Global generalization means that a given value for a given column will always
be generalized in the same way: if you decide to transform age 34 into age range
30-34 for one record, all records that have ages between 30 and 34 will be
transformed into this fixed range of 30-34. Using global generalization, the
example could be transformed into:

ZIP code | age
---------|------
4217     | 30-34
4217     | 30-34
1742     | 30-34
1742     | 30-34

Local generalization doesn't have that constraint: it allows you to pick a
different generalization for each record. A value of 34 in the age column might
stay untouched for one record, and be generalized for another:

ZIP code | age
---------|------
4217     | 34
4217     | 34
1742     | 30-34
1742     | 30-34

Global generalization usually makes it easier to run data analysis on the
generalized data, while local generalization preserves more utility at the cost
of a slightly more complex data representation.
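The difference can be sketched in code. In this illustrative Python snippet (5-year buckets are an arbitrary choice), the global strategy applies the same bucketing rule to every age, while the local strategy only generalizes within groups that need it:

```python
from itertools import groupby

def bucket(age, width=5):
    """Generalize an exact age into a fixed `width`-year range."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

records = [
    {"zip": "4217", "age": 34},
    {"zip": "4217", "age": 34},
    {"zip": "1742", "age": 34},
    {"zip": "1742", "age": 31},
]

# Global generalization: every age goes through the same bucketing rule,
# so all four records end up with age "30-34".
global_table = [{"zip": r["zip"], "age": bucket(r["age"])} for r in records]

# Local generalization (sketch): the records sharing ZIP 4217 already agree
# on age 34, so they keep their exact age; only the 1742 group is bucketed.
local_table = []
by_zip = sorted(records, key=lambda r: r["zip"])
for zip_code, group in groupby(by_zip, key=lambda r: r["zip"]):
    group = list(group)
    ages = {r["age"] for r in group}
    for r in group:
        age = r["age"] if len(ages) == 1 else bucket(r["age"])
        local_table.append({"zip": zip_code, "age": age})
```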

Building block 2: suppression

In our previous example, our records had relatively "close" demographic values,
which allowed generalization to keep reasonably accurate information while still
ensuring \(2\)-anonymity. What if the table is instead:

ZIP code | age
---------|----
4217     | 34
4217     | 39
1742     | 75
1691     | 77
9755     | 13

The first four records can be grouped in two pairs as above, but the last record
is an outlier. Grouping it with one of the pairs above would mean having very
large ranges of values (age between 10 and 39, or ZIP code being completely
removed), which would significantly reduce the utility of the resulting data. A
simple way to deal with such outlier values is to remove them from the data
entirely. Using both generalization and suppression on this example could lead
to the same \(2\)-anonymous table as before:

ZIP code  | age
----------|------
4217      | 30-39
4217      | 30-39
1000-1999 | 75-79
1000-1999 | 75-79

Using this method, the transformed table usually contains strictly fewer
records than the original. On large datasets, allowing a small percentage of
suppressed records typically allows the result to be \(k\)-anonymous without
requiring too much generalization.
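Putting the two building blocks together, here is an illustrative Python sketch of this generalize-then-suppress approach. The generalization rules (10-year age buckets, ZIP codes truncated to one digit) are arbitrary choices for the example, slightly coarser than the hand-built table above:

```python
from collections import Counter

def generalize(record, age_width=10, zip_digits=1):
    """Hypothetical generalization rules: fixed-width age buckets,
    ZIP codes truncated to their first digit(s)."""
    low = (record["age"] // age_width) * age_width
    z = record["zip"]
    return {
        "zip": z[:zip_digits] + "*" * (len(z) - zip_digits),
        "age": f"{low}-{low + age_width - 1}",
    }

def k_anonymize(records, k=2):
    """Generalize every record, then suppress (drop) the records whose
    generalized quasi-identifier combination still appears fewer than
    k times."""
    generalized = [generalize(r) for r in records]
    counts = Counter((r["zip"], r["age"]) for r in generalized)
    return [r for r in generalized if counts[(r["zip"], r["age"])] >= k]

table = [
    {"zip": "4217", "age": 34},
    {"zip": "4217", "age": 39},
    {"zip": "1742", "age": 75},
    {"zip": "1691", "age": 77},
    {"zip": "9755", "age": 13},  # the outlier
]
result = k_anonymize(table, k=2)
# The outlier is suppressed; the four other records form two groups of two.
```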

Algorithms

\(k\)-anonymity is the oldest privacy definition, and it's relatively simple to
understand, so it was quickly adopted by the healthcare community for its
data anonymization needs. As a result, there has been a lot of research on how
to transform a dataset into a \(k\)-anonymous table.

The problem of finding an optimal strategy for \(k\)-anonymity is NP-hard, for
basically any reasonable definition of optimality. This paper
(pdf) presents a few such results, if you're interested in this
kind of thing ^^

A list of approximation algorithms for the optimal \(k\)-anonymization problem can
be found in this paper (pdf) (Table 4, page 11). 18
different algorithms are listed, and I don't even think the list is exhaustive!
The paper contains many links to the original papers, and to some comparisons
between methods. Sadly, there is no unified benchmark to know how all these
algorithms perform on various data analysis tasks.

In practice

Unless you're a PhD student working on your literature review, you're probably
not looking for a bunch of links to research papers about complicated
\(k\)-anonymization algorithms. If you're a data owner trying to transform a
dataset to get a \(k\)-anonymous table, you may be looking for software instead.

As of 2017, the main open-source tool for data anonymization is ARX. Its
interface is a bit difficult to understand at first, but it works fairly well on
small to moderately large datasets, and implements a lot more than just
\(k\)-anonymity algorithms. It used to feature only global generalization
techniques¹, but this apparently changed recently.

There are other tools available online, but none of them is anywhere near as
usable as ARX. Many of them are listed on the Related software page of ARX's
website. I've tried most of them, only to become convinced that none of them
has really reached the point of being a usable product. UTD Anonymization
Toolbox is probably the only one worth a look: it requires using the command
line and impractical configuration files, but it implements a local
generalization algorithm (the first of its kind, named Mondrian
(pdf), a very cool technique with better utility preservation
than global generalization algorithms).
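For the curious, the core idea behind Mondrian can be sketched in a few lines of Python. This is a drastic simplification (numeric quasi-identifiers only, no tie-breaking heuristics), not the actual UTD implementation:

```python
def mondrian(records, qis, k):
    """Simplified Mondrian sketch: recursively split the records at the
    median of the quasi-identifier with the widest range, as long as both
    halves keep at least k records; then generalize each final partition
    to min-max ranges. Assumes all quasi-identifiers are numeric."""
    lo = {q: min(r[q] for r in records) for q in qis}
    hi = {q: max(r[q] for r in records) for q in qis}
    widest = max(qis, key=lambda q: hi[q] - lo[q])
    if hi[widest] > lo[widest]:
        ordered = sorted(records, key=lambda r: r[widest])
        middle = len(ordered) // 2
        left, right = ordered[:middle], ordered[middle:]
        if len(left) >= k and len(right) >= k:
            return mondrian(left, qis, k) + mondrian(right, qis, k)
    # No allowed split: generalize the whole partition to shared ranges.
    return [{q: f"{lo[q]}-{hi[q]}" for q in qis} for _ in records]

data = [
    {"zip": 4217, "age": 34},
    {"zip": 4217, "age": 39},
    {"zip": 1742, "age": 75},
    {"zip": 1691, "age": 77},
]
result = mondrian(data, ["zip", "age"], 2)
# Two groups of two records: one around ZIP 4217, one around ZIPs 1691-1742.
```

Because the split only happens when both halves keep at least \(k\) records, every final partition has at least \(k\) members sharing identical generalized values, which is exactly the \(k\)-anonymity guarantee.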

On the commercial side, I've only heard of a toolkit developed by the consulting
company Privacy Analytics. The intended audience seems to be people who know
little about privacy: it looks very shiny, but I didn't manage to understand
which anonymity property or algorithms they were using ^^ You can get a free
trial by filling out a form on their website, but I can only assume the real
version is very expensive, since there is no mention of price anywhere.

How convincing really is \(k\)-anonymity?

\(k\)-anonymity is simple to understand, and it seems intuitively obvious that
reidentification attacks are well mitigated when a dataset is transformed to
become \(k\)-anonymous. However, it only mitigates this particular kind of attack.
We assumed that all that the attacker wanted was to select a target, point at a
record, and say "this record corresponds to my target" with a high certainty.
This matches Sweeney's original attack, but how realistic is this?

When an attacker successfully reidentifies someone in a dataset, it's not
necessarily a privacy issue. Consider the voter files from earlier. By law, this
data is public, and contains full names. It's very easy for an attacker to
point at a random record and shout "hey, I reidentified this person!": the
identification is right there in the dataset. This "attack" always succeeds,
but it's not really interesting, nor particularly creepy… Why is that?

In Sweeney's example, the creepy thing isn't just finding the data subject
associated with a given record. The sensitive information linked with the
record (in our leading example, diagnoses and drug prescriptions) is where the
creepiness comes from! The leak of sensitive information associated to one
given individual is the problem, not the reidentification itself.

\(k\)-anonymity doesn't really capture this idea. The definition just prevents you
from knowing the real identity of an anonymized record. But maybe there are
other attacks that allow you to find out sensitive information about someone,
without finding with absolute certainty which record is theirs?

As I'll explain in future articles, other types of attacks do exist, and many
other definitions have been proposed in order to mitigate them too. Nonetheless,
\(k\)-anonymity is still used in the healthcare world, in large part because of
its simplicity and utility preservation compared to other definitions.

¹ A previous version of this post claimed that only global generalization
was available in ARX. Sorry for the factual mistake! I should have read the
docs more closely =)