Suppose that you're a doctor who specializes in treating a rare genetic
disease. You want to run a study with all the patients that you can find, but
there are not many of them. You end up with only about 40 subjects.

After you've done experiments and collected clinical data, you want to share
this data with other researchers. You look at the attributes, and deduce that
ZIP code and age are likely to be used in reidentification attacks. To share it
in a safe way, you're thinking of \(k\)-anonymity.

When trying to find a strategy to obtain \(k\)-anonymity, you find out that you
would have to lose a lot of information. For \(k=10\), a rather small value, you
end up with buckets like \(20\le age\lt 50\). That makes sense: you have only a
few people in your database, so you have to bundle together very different age
values.
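To make this concrete, here is a minimal sketch (with made-up ages, and a small list standing in for the 40 subjects) of how one could check whether a given age bucketing satisfies \(k\)-anonymity:

```python
from collections import Counter

def satisfies_k_anonymity(values, k):
    """True if every distinct value appears at least k times."""
    return min(Counter(values).values()) >= k

def bucketize(age, width):
    """Generalize an age into a bucket of the given width, e.g. 20 <= age < 50."""
    low = (age // width) * width
    return (low, low + width)

# Hypothetical patient ages (a real study would have ~40 of them).
ages = [22, 23, 25, 34, 41, 47, 58, 63, 71, 79]

print(satisfies_k_anonymity(ages, k=3))  # False: every exact age is unique
# With wide 30-year buckets, each bucket holds at least 3 people.
print(satisfies_k_anonymity([bucketize(a, 30) for a in ages], k=3))  # True
```

With few records, only very wide buckets make every group large enough, which is exactly the information loss described above.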

But when you think about it, you start questioning whether you really need
\(k\)-anonymity. Maybe the main sensitive information is whether a given
individual belongs to the database. If you already know that the person is in
there, finding the exact row that belongs to this person might not give you much
more information…

Let's look at two different rows in this database.

| ZIP code | age |
|----------|-----|
| 85535    | 79  |
| 60629    | 42  |
At first glance, the amount of information for these two individuals seems to be
the same. But let's take a look at the values…

85535 corresponds to a place in Arizona named Eden. Approximately 20 people
live in this ZIP code. How many people do you think are exactly 79 years old
in this particular ZIP code? Probably only one.

60629 corresponds to a part of the Chicago metropolitan area. More than
100,000 people live there. How many of them are 42 years old? A thousand, at
least, and probably more!

It seems that it would be very easy to reidentify the first row, but that we
don't have enough information to reidentify the second row. But according to
\(k\)-anonymity, both rows might be completely unique in the dataset.

Obviously, \(k\)-anonymity doesn't fit this use case. We need a different
definition: that's where \(k\)-map comes in.

Definition

Just like \(k\)-anonymity, \(k\)-map requires you to determine
which columns of your database are quasi-identifiers. This answers the
question: what can your attacker use to reidentify their target?

But this information alone is not enough to compute \(k\)-map. In the example
above, we assumed that the attacker doesn't know whether their target is in the
dataset. So what are they comparing a given row with? With all other individuals
sharing the same values in a larger, sometimes implicit, dataset. For the
previous example, this could be "everybody living in the US", if you assume the
attacker has no idea who could have this genetic disease. Let's call this larger
table the reidentification dataset.

Once you've picked the quasi-identifiers and the reidentification dataset, the
definition is straightforward. Your data satisfies \(k\)-map if every combination
of values for the quasi-identifiers appears at least \(k\) times in the
reidentification dataset.

In our example, this corresponds to counting the number of people in the US who
share the quasi-identifier values of each row in your dataset. Consider our tiny
dataset above:

| ZIP code | age |
|----------|-----|
| 85535    | 79  |
| 60629    | 42  |
We said earlier that the values of the first row matched only one person in the
US. Thus, this dataset does not satisfy \(k\)-map for any value of \(k\ge 2\).

How do we get a larger \(k\)? We could generalize the first value like this:

| ZIP code | age |
|----------|-----|
| 85***    | 79  |
| 60629    | 42  |

ZIP codes between 85000 and 85999 include the entire city of Phoenix. There
are 36,000+ people between 75 and 84 years old in Phoenix, according to some
old stats. It's probably safe to assume that there are more than 1,000 people
who match the quasi-identifier values of the first row. We saw earlier that the
second row also matched 1,000+ people. So this generalized dataset satisfies
1000-map.
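That generalization step can be sketched as follows (with a tiny stand-in population, since the counting would really run against the full reidentification dataset):

```python
from collections import Counter

def generalize_zip(zip_code, digits_kept):
    """Keep the first digits of a ZIP code and mask the rest: 85535 -> 85***."""
    return zip_code[:digits_kept] + "*" * (len(zip_code) - digits_kept)

def k_map_after_generalization(dataset, population, digits_kept):
    """k-map value using (generalized ZIP code, age) as quasi-identifiers."""
    qi = lambda row: (generalize_zip(row["zip"], digits_kept), row["age"])
    counts = Counter(qi(row) for row in population)
    return min(counts[qi(row)] for row in dataset)

# Stand-in population: two 79-year-olds in two different 85xxx ZIP codes.
population = [{"zip": "85535", "age": 79}, {"zip": "85001", "age": 79}]
dataset = [{"zip": "85535", "age": 79}]

print(k_map_after_generalization(dataset, population, 5))  # → 1: full ZIP is unique
print(k_map_after_generalization(dataset, population, 2))  # → 2: 85*** matches both
```

Masking more digits merges more people into each quasi-identifier combination, so the \(k\) of \(k\)-map can only grow.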

Attack model considerations

Wait a second, why does this feel like cheating? What happened there, to give us
such a generous number so easily? This comes from the generous assumptions we
made in our attack model. We assumed that the attacker had zero information on
their target, except that they live in the US (which is implied by the presence
of ZIP codes). And with only the information (ZIP code, age), you don't need a
lot of generalization to make each row of your dataset blend in a large crowd.

To make this attack model stronger, you could assume that the attacker will use
a smaller reidentification database. For example, suppose that your genetic
disease you're studying requires regular hospital check-ups. The attacker could
restrict their search only to people who have visited a hospital in the last
year. The number of possible "suspects" for each value tuple gets smaller, so
the \(k\) of \(k\)-map decreases too¹.

\(k\)-map is inherently a weak model. So when choosing the quasi-identifiers and
reidentification dataset, you have to think hard about what an attacker could do.
If your attacker doesn't have lots of resources, it can be reasonable to assume
that they won't get more data than, say, the voter files from your state. But if
they can figure out more about your users, and you don't really know which
reidentification dataset they could use, maybe \(k\)-anonymity is a safer
bet².

And now, some practice

OK, enough theory. Let's learn how to compute \(k\)-map in practice, and anonymize
your datasets to make them satisfy the definition!

… There's one slight problem, though.

It's usually impossible.

Choosing the reidentification dataset is already a difficult exercise. Maybe you
can afford to make generous assumptions, and assume the attacker doesn't know
much. At best, you think, they'll buy voter files, or a commercial database,
which contains everyone in your state, or in the US. But… then what?

To compute the maximum \(k\) such that your dataset satisfies \(k\)-map, you would
first need to get the reidentification dataset yourself. But commercial
databases are expensive. Voter files might not be legal for you to obtain (even
though an evil attacker could break the law to get them).

So, most of the time, you can't actually check whether your data satisfies
\(k\)-map. And if it's impossible to check, it's also impossible to know exactly
which strategy to adopt to make your dataset satisfy the definition.

Exception 1: secret sample

Suppose you're not releasing all your data, but only a subset (or sample) of
a bigger dataset that you own. Then, you can compute the \(k\)-map value of the
sample with regard to the original, bigger dataset. In this case, choosing
\(k\)-map over \(k\)-anonymity is relatively safe.

Indeed, your original dataset is certainly smaller than the reidentification
dataset used by the attacker. Using the same argument as above, this means that
you will obtain a lower bound on the value of \(k\). Essentially, you're being
pessimistic, which means that you're on the safe side.

Even if the attacker has access to the original dataset, they won't know which
records are in the sample. So if the original dataset is secret, or if you've
chosen the sample in a secret way, \(k\)-map is a reasonable definition to use,
and you can compute a pessimistic approximation of it.
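A sketch of this pessimistic computation, using the full dataset you own as a stand-in for the attacker's reidentification dataset (all data here is hypothetical):

```python
from collections import Counter

def k_map_of_sample(sample, full_dataset, quasi_identifiers):
    """Pessimistic k-map value of a released sample, computed against the
    full dataset you own instead of the attacker's (larger) one."""
    project = lambda row: tuple(row[qi] for qi in quasi_identifiers)
    counts = Counter(project(row) for row in full_dataset)
    return min(counts[project(row)] for row in sample)

# The full dataset stays secret; only a sample of it gets released.
full = [
    {"zip": "60629", "age": 42},
    {"zip": "60629", "age": 42},
    {"zip": "60629", "age": 42},
    {"zip": "60629", "age": 79},
]
sample = [full[0], full[1]]  # the rows chosen (secretly) for release

# The attacker's reidentification dataset contains at least these rows,
# so the true k-map value can only be larger than what we compute here.
print(k_map_of_sample(sample, full, ["zip", "age"]))  # → 3
```

Since every record of the full dataset also appears in any reasonable reidentification dataset, the number computed this way is a lower bound on the true \(k\).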

Exception 2: representative distribution

This case is slightly different. Suppose that you can make the assumption that
your data is a representative (or unbiased) sample of a larger
dataset. This might be a good approximation if you selected people (uniformly)
at random to build your dataset, or if it was gathered by a polling
organization.

In this case, you can compute an estimate of the \(k\)-map value for your data,
even without the reidentification dataset. The statistical properties which
enable this, and the methods you can use, are pretty complicated: I won't
explain them in detail here. They are mentioned and compared in this
paper, which has references to the original versions of each of them.

Exception 3: using humans

For the case of our doctor earlier, if the dataset is small enough, a motivated
data owner could actually do the job of an attacker "by hand". Go through each
record, and try to map it to a real person, or estimate the chances of it being
possible. We pretty much did that in this article!

This is very approximate, and obviously not scalable. But for our imaginary
doctor, it might be a reasonable solution!

Implementations

ARX implements the methods from exceptions 1 and 2. Documentation for the
first one can be found here. Instructions to estimate the number of
unique values assuming uniformity can be found here. Originally,
μ-ARGUS was the first software with this feature, but I couldn't run
it on my machine, so I can't say much about it.

Conclusion

You might wonder why I wrote an entire article on a definition that is hardly
used because of how impractical it is. In addition to the unique problems that
we talked about in this article, the limitations of \(k\)-anonymity also apply.
It's difficult to choose \(k\), non-trivial to pick the quasi-identifiers, and
even trickier to model the reidentification database.

The definition also didn't get a lot of attention from academics. Historically,
\(k\)-anonymity came first⁴. Then, people showed that \(k\)-anonymity was
sometimes not sufficient to protect sensitive data, and tried to find stronger
definitions to fix it. Weaker definitions were, of course, less interesting.

Nonetheless, I find that it's an interesting relaxation of \(k\)-anonymity. It
shows one of its implicit assumptions: the attacker knows that their target
belongs to the dataset. This assumption is sometimes too pessimistic: it might
be worth considering alternate definitions.

Choosing a privacy model is all about modeling the attacker correctly. Learning
to question implicit assumptions can only help!

1. There is a generic version of this argument. Let's call your database \(D\),
and suppose \(R\) and \(R^\prime\) are two possible reidentification databases.
Suppose that \(R^\prime\) is "larger" than \(R\) (each element of \(R\) appears
in \(R^\prime\)). Then if \(D\) satisfies \(k\)-map with regard to \(R\), it
also satisfies \(k\)-map with regard to \(R^\prime\). The reverse is not true.

2. One simple consequence of the previous footnote is that if a dataset \(D\)
satisfies \(k\)-anonymity, then it automatically satisfies \(k\)-map for any
reidentification dataset³.

3. I didn't say this explicitly, but the reidentification dataset is always
assumed to contain all rows from your dataset. This is usually not the case in
practice because data is messy, but it's a safe assumption: hoping that your
attacker will just ignore some records in your data would be a bit overly
optimistic.

4. Latanya Sweeney first mentioned the idea behind \(k\)-map in this 2002 paper
(pdf), several years after the introduction of \(k\)-anonymity.