k-map, the weird cousin of k-anonymity
2017-10-10, Damien Desfontaines
<p>Weakening <span class="math">\(k\)</span>-anonymity, really? This sounds weird, but it can actually be quite reasonable. Let's learn why!</p>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.0/MathJax.js?config=TeX-AMS-MML_HTMLorMML';
mathjaxscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'AMS' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script><p><strong>Suppose</strong> that you're a doctor who specializes in treating a rare genetic
disease. You want to run a study with all the patients that you can find, but
there are not many of them. You end up with only about 40 subjects.</p>
<p>After you've done experiments and collected clinical data, you want to share
this data with other researchers. You look at the attributes, and deduce that
ZIP code and age are likely to be used in reidentification attacks. To share it
in a safe way, you're thinking of <a href="k-anonymity.html"><span class="math">\(k\)</span>-anonymity</a>.</p>
<p>When trying to find a strategy to obtain <span class="math">\(k\)</span>-anonymity, you find out that you
would have to lose a lot of information. For <span class="math">\(k=10\)</span>, a rather small value, you
end up with buckets like <span class="math">\(20\le age\lt 50\)</span>. That makes sense: you have only a few
people in your database, so you have to bundle together very different age
values.</p>
<p>But when you think about it, you start questioning whether you really need
<span class="math">\(k\)</span>-anonymity. Maybe the main sensitive information is whether a given
individual belongs to the database. If you already know that the person is in
there, finding the exact row that belongs to this person might not give you much
more information…</p>
<p>Let's look at two different rows in this database.</p>
<table>
<thead>
<tr>
<th align="center">ZIP code</th>
<th align="center">age</th>
</tr>
</thead>
<tbody>
<tr>
<td align="center">85535</td>
<td align="center">79</td>
</tr>
<tr>
<td align="center">60629</td>
<td align="center">42</td>
</tr>
</tbody>
</table>
<p>At first glance, the amount of information for these two individuals seems to be
the same. But let's take a look at the values…</p>
<ul>
<li><a href="https://www.unitedstateszipcodes.org/85535/">85535</a> corresponds to a place in Arizona named Eden. Approximately 20 people
live in this ZIP code. How many people do you think are exactly 79 years old
in this particular ZIP code? Probably only one.</li>
<li><a href="https://www.unitedstateszipcodes.org/60629/">60629</a> corresponds to a part of the Chicago metropolitan area. More than
100,000 people live there. How many of them are 42 years old? A thousand, at
least, and probably more!</li>
</ul>
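<p>A quick back-of-the-envelope computation supports these guesses. Assuming, purely for illustration, that ages are spread uniformly over an 80-year range:</p>

```python
def expected_matches(population, age_range_years=80):
    """Expected number of people of one exact age in an area,
    assuming ages are spread uniformly over the given range."""
    return population / age_range_years

expected_matches(20)       # 0.25: likely zero or one match in Eden, AZ
expected_matches(100_000)  # 1250.0: over a thousand candidates in 60629
```

<p>Real age distributions are not uniform, of course, but the orders of magnitude are what matter here.</p>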
<p>It seems that it would be very easy to reidentify the first row, but that we
don't have enough information to reidentify the second row. Yet as far as
<span class="math">\(k\)</span>-anonymity is concerned, the two rows are equally problematic: both might be completely unique in the dataset.</p>
<p>Obviously, <span class="math">\(k\)</span>-anonymity doesn't fit this use case. We need a different
definition: that's where <span class="math">\(k\)</span>-map comes in.</p>
<h1 id="definition">Definition</h1>
<p>Just like <a href="k-anonymity.html"><span class="math">\(k\)</span>-anonymity</a>, <span class="math">\(k\)</span>-map requires you to determine
which columns of your database are <em>quasi-identifiers</em>. This answers the
question: what can your attacker use to reidentify their target?</p>
<p>But this information alone is not enough to compute <span class="math">\(k\)</span>-map. In the example
above, we assumed that the attacker doesn't know whether their target is in the
dataset. So what are they comparing a given row with? With all other individuals
sharing the same values in a larger, sometimes implicit, dataset. For the
previous example, this could be "everybody living in the US", if you assume the
attacker has no idea who could have this genetic disease. Let's call this larger
table the <em>reidentification dataset</em>.</p>
<p>Once you've picked the quasi-identifiers and the reidentification dataset, the
definition is straightforward. Your data satisfies <span class="math">\(k\)</span>-map if every combination
of values for the quasi-identifiers appears at least <span class="math">\(k\)</span> times <em>in the
reidentification dataset</em>.</p>
<p>In our example, this corresponds to counting the number of people in the US who
share the quasi-identifier values of each row in your dataset. Consider our tiny
dataset above:</p>
<table>
<thead>
<tr>
<th align="center">ZIP code</th>
<th align="center">age</th>
</tr>
</thead>
<tbody>
<tr>
<td align="center">85535</td>
<td align="center">79</td>
</tr>
<tr>
<td align="center">60629</td>
<td align="center">42</td>
</tr>
</tbody>
</table>
<p>We said earlier that the values of the first row matched only one person in the
US. Thus, this dataset does not satisfy <span class="math">\(k\)</span>-map for any value of <span class="math">\(k\ge 2\)</span>.</p>
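<p>Since the definition is just counting, computing the <span class="math">\(k\)</span>-map value takes only a few lines of code, once you have the reidentification dataset in hand. A minimal sketch in plain Python, with toy records standing in for the real data:</p>

```python
from collections import Counter

def kmap_value(dataset, reid_dataset, quasi_identifiers):
    """Largest k such that `dataset` satisfies k-map with regard to
    `reid_dataset`: the minimum, over the rows of `dataset`, of the number
    of reidentification rows sharing their quasi-identifier values."""
    key = lambda row: tuple(row[qi] for qi in quasi_identifiers)
    counts = Counter(key(row) for row in reid_dataset)
    return min(counts[key(row)] for row in dataset)

# Toy stand-in for "everybody in the US": one person matches the first
# row's values, three match the second row's.
reid = [{"zip": "85535", "age": 79}] + [{"zip": "60629", "age": 42}] * 3
data = [{"zip": "85535", "age": 79}, {"zip": "60629", "age": 42}]
kmap_value(data, reid, ["zip", "age"])  # 1: limited by the unique first row
```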
<p>How do we get a larger <span class="math">\(k\)</span>? We could generalize the ZIP code of the first row like this:</p>
<table>
<thead>
<tr>
<th align="center">ZIP code</th>
<th align="center">age</th>
</tr>
</thead>
<tbody>
<tr>
<td align="center">85***</td>
<td align="center">79</td>
</tr>
<tr>
<td align="center">60629</td>
<td align="center">42</td>
</tr>
</tbody>
</table>
<p>ZIP codes between 85000 and 85999 include the entire city of <a href="https://en.wikipedia.org/wiki/Phoenix,_Arizona">Phoenix</a>. There
are 36,000+ people between 75 and 84 years old in Phoenix, according to some
<a href="http://phoenix.areaconnect.com/statistics.htm">old stats</a>. It's probably safe to assume that there are more than 1,000 people
who match the quasi-identifier values of the first row. We saw earlier that the
second row also matched 1,000+ people. So this generalized dataset satisfies
1000-map.</p>
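<p>The generalization step above is easy to sketch: mask trailing ZIP digits until each row blends into a large enough crowd. A toy helper (the function name is mine, not from any particular library):</p>

```python
def generalize_zip(zip_code, digits_kept=2):
    """Coarsen a ZIP code by replacing its trailing digits with '*'."""
    return zip_code[:digits_kept] + "*" * (len(zip_code) - digits_kept)

generalize_zip("85535", 2)  # '85***': the whole 85xxx area, including Phoenix
generalize_zip("85535", 3)  # '855**': a finer bucket, if '85***' loses too much
```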
<h1 id="attack-model-considerations">Attack model considerations</h1>
<p>Wait a second, why does this feel like cheating? What happened there, to give us
such a generous number so easily? This comes from the generous assumptions we
made in our attack model. We assumed that the attacker had <em>zero</em> information on
their target, except that they live in the US (which is implied by the presence
of ZIP codes). And with only the information (ZIP code, age), you don't need a
lot of generalization to make each row of your dataset blend in a large crowd.</p>
<p>To make this attack model stronger, you could assume that the attacker will use
a <em>smaller</em> reidentification database. For example, suppose that the genetic
disease you're studying requires regular hospital check-ups. The attacker could
restrict their search only to people who have visited a hospital in the last
year. The number of possible "suspects" for each value tuple gets smaller, so
the <span class="math">\(k\)</span> of <span class="math">\(k\)</span>-map decreases too<sup id="fnref-generic"><a class="footnote-ref" href="#fn-generic">1</a></sup>.</p>
<p><span class="math">\(k\)</span>-map is inherently a <em>weak</em> model. So when choosing the quasi-identifiers and
reidentification dataset, you have to think hard about what an attacker could do.
If your attacker doesn't have lots of resources, it can be reasonable to assume
that they won't get more data than, say, the voter files from your state. But if
they can figure out more about your users, and you don't really know which
reidentification dataset they could use, maybe <span class="math">\(k\)</span>-anonymity is a safer
bet<sup id="fnref-safer"><a class="footnote-ref" href="#fn-safer">2</a></sup>.</p>
<h1 id="and-now-some-practice">And now, some practice</h1>
<p>OK, enough theory. Let's learn how to compute <span class="math">\(k\)</span>-map in practice, and anonymize
your datasets to make them satisfy the definition!</p>
<p>… There's one slight problem, though.</p>
<p>It's usually impossible.</p>
<p>Choosing the reidentification dataset is already a difficult exercise. Maybe you
can afford to make generous assumptions, and assume the attacker doesn't know
much. At best, you think, they'll buy voter files, or a commercial database,
which contains everyone in your state, or in the US. But… then what?</p>
<p>To compute the maximum <span class="math">\(k\)</span> such that your dataset satisfies <span class="math">\(k\)</span>-map, you would
first need to get the reidentification dataset yourself. But commercial
databases are expensive. Voter files might not be legal for you to obtain (even
though an evil attacker could break the law to get them).</p>
<p>So, most of the time, you can't actually check whether your data satisfies
<span class="math">\(k\)</span>-map. If it's impossible to check, it's also impossible to know exactly which
strategy to adopt to make your dataset satisfy the definition.</p>
<h4 id="exception-1-secret-sample">Exception 1: secret sample</h4>
<p>Suppose you're not releasing all your data, but only a <em>subset</em> (or <em>sample</em>) of
a bigger dataset that you own. Then, you can compute the <span class="math">\(k\)</span>-map value of the
sample with regard to the original, bigger dataset. In this case, choosing
<span class="math">\(k\)</span>-map over <span class="math">\(k\)</span>-anonymity is relatively safe.</p>
<p>Indeed, your original dataset is certainly <em>smaller</em> than the reidentification
dataset used by the attacker. Using the same argument as above, this means that
you will obtain a <em>lower bound</em> on the value of <span class="math">\(k\)</span>. Essentially, you're being
pessimistic, which means that you're on the safe side.</p>
<p>Even if the attacker has access to the original dataset, they won't know which
records are in the sample. So if the original dataset is secret, or if you've
chosen the sample in a secret way, <span class="math">\(k\)</span>-map is a reasonable definition to use,
and you can compute a pessimistic approximation of it.</p>
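<p>In this setting, the counting is straightforward, because the "reidentification dataset" is simply the full dataset you already own. A self-contained sketch with made-up records:</p>

```python
from collections import Counter

def sample_kmap(sample, full_dataset, quasi_identifiers):
    """k-map value of a released sample with regard to the full (secret)
    dataset it was drawn from. Since the attacker's reidentification
    dataset is larger, this is a pessimistic lower bound on their k."""
    key = lambda row: tuple(row[qi] for qi in quasi_identifiers)
    counts = Counter(key(row) for row in full_dataset)
    return min(counts[key(row)] for row in sample)

full = [{"zip": "60629", "age": 42}] * 5
released = full[:2]  # the secretly chosen sample
sample_kmap(released, full, ["zip", "age"])  # 5
```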
<h4 id="exception-2-representative-distribution">Exception 2: representative distribution</h4>
<p>This case is slightly different. Suppose that you can make the assumption that
your data is a <a href="http://arx.deidentifier.org/anonymization-tool/configuration/#a27"><em>representative</em></a> (or <em>unbiased</em>) sample of a larger
dataset. This might be a good approximation if you selected people (uniformly)
at random to build your dataset, or if it was gathered by a polling
organization.</p>
<p>In this case, you can compute an estimate of the <span class="math">\(k\)</span>-map value for your data,
even without the reidentification dataset. The statistical properties which
enable this, and the methods you can use, are pretty complicated: I won't
explain them in detail here. They are mentioned and compared in <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2528029/">this
paper</a>, which has references to the original versions of each of them.</p>
<h4 id="exception-3-using-humans">Exception 3: using humans</h4>
<p>For the case of our doctor earlier, if the dataset is small enough, a motivated
data owner could actually do the job of an attacker "by hand". Go through each
record, and try to map it to a real person, or estimate the chances of it being
possible. We pretty much did that in this article!</p>
<p>This is very approximate, and obviously doesn't scale. But for our imaginary
doctor, it might be a reasonable solution!</p>
<h4 id="implementations">Implementations</h4>
<p><a href="http://arx.deidentifier.org/">ARX</a> implements the methods from exceptions 1 and 2. Documentation for the
first one can be found <a href="http://arx.deidentifier.org/anonymization-tool/configuration/#a27">here</a>. Instructions to estimate the number of
<em>unique</em> values assuming uniformity can be found <a href="http://arx.deidentifier.org/anonymization-tool/risk-analysis/#a56">here</a>. Originally,
<a href="http://neon.vb.cbs.nl/casc/..%5Ccasc%5Cmu.htm">μ-ARGUS</a> was the first software with this feature, but I couldn't run
it on my machine, so I can't say much about it.</p>
<h1 id="conclusion">Conclusion</h1>
<p>You might wonder why I wrote an entire article on a definition that is hardly
used because of how impractical it is. In addition to the unique problems that
we talked about in this article, the limitations of <span class="math">\(k\)</span>-anonymity also apply.
It's difficult to choose <span class="math">\(k\)</span>, non-trivial to pick the quasi-identifiers, and
even trickier to model the reidentification database.</p>
<p>The definition also didn't get a lot of attention from academics. Historically,
<span class="math">\(k\)</span>-anonymity came first<sup id="fnref-history"><a class="footnote-ref" href="#fn-history">4</a></sup>. Then, people showed that <span class="math">\(k\)</span>-anonymity was
sometimes not sufficient to protect sensitive data, and tried to find <em>stronger</em>
definitions to fix it. Weaker definitions were, of course, less interesting.</p>
<p>Nonetheless, I find that it's an interesting relaxation of <span class="math">\(k\)</span>-anonymity. It
shows one of its implicit assumptions: the attacker knows that their target
belongs to the dataset. This assumption is sometimes too pessimistic: it might
be worth considering alternate definitions.</p>
<p>Choosing a privacy model is all about modeling the attacker correctly. Learning
to question implicit assumptions can only help!</p>
<div class="footnote">
<hr>
<ol>
<li id="fn-generic">
<p>There is a generic version of this argument. Let's call your
database <span class="math">\(D\)</span>, and suppose <span class="math">\(R\)</span> and <span class="math">\(R^\prime\)</span> are two possible reidentification
databases. Suppose that <span class="math">\(R^\prime\)</span> is "larger" than <span class="math">\(R\)</span> (each element of <span class="math">\(R\)</span>
appears in <span class="math">\(R^\prime\)</span>). Then if <span class="math">\(D\)</span> satisfies <span class="math">\(k\)</span>-map with regard to <span class="math">\(R\)</span>, it
also satisfies <span class="math">\(k\)</span>-map with regard to <span class="math">\(R^\prime\)</span>. The reverse is not true.&#160;<a class="footnote-backref" href="#fnref-generic" title="Jump back to footnote 1 in the text">&#8617;</a></p>
</li>
<li id="fn-safer">
<p>One simple consequence of the previous footnote is that if a dataset
<span class="math">\(D\)</span> satisfies <span class="math">\(k\)</span>-anonymity, then it automatically satisfies <span class="math">\(k\)</span>-map for any
reidentification dataset<sup id="fnref-assumption"><a class="footnote-ref" href="#fn-assumption">3</a></sup>.&#160;<a class="footnote-backref" href="#fnref-safer" title="Jump back to footnote 2 in the text">&#8617;</a></p>
</li>
<li id="fn-assumption">
<p>I didn't say this explicitly, but the reidentification dataset is
always assumed to contain all rows from your dataset. It's usually not the
case in practice because data is messy, but it's a safe assumption. Hoping
that your attacker will just ignore some records in your data would be a bit
overly optimistic.&#160;<a class="footnote-backref" href="#fnref-assumption" title="Jump back to footnote 3 in the text">&#8617;</a></p>
</li>
<li id="fn-history">
<p>Latanya Sweeney first mentioned the idea behind <span class="math">\(k\)</span>-map in
<a href="https://desfontain.es/PDFs/PhD/AchievingKAnonymityPrivacyProtectionUsingGeneralizationAndSuppression.pdf">this 2002 paper<sup> (pdf)</sup></a>, several years after the
introduction of <span class="math">\(k\)</span>-anonymity.&#160;<a class="footnote-backref" href="#fnref-history" title="Jump back to footnote 4 in the text">&#8617;</a></p>
</li>
</ol>
</div>
Book review: Twitter and Tear Gas
2017-10-09, Damien Desfontaines
<p>A short review of <em>Twitter and Tear Gas: The Power and Fragility of Networked Protest</em>, by Zeynep Tufekci. tl;dr: you should read it.</p><p><strong>I recently finished</strong> reading <em>Twitter and Tear Gas: The Power and Fragility
of Networked Protest</em>, by <a href="https://en.wikipedia.org/wiki/Zeynep_Tufekci">Zeynep Tufekci</a>. It's a long yet dense essay
on how modern protests work, and why they sometimes don't. Tufekci has a long
experience as an activist in many different protests around the world. She also
has a strong education in technology and in social sciences, and her work
focuses on the intersection between the two. In short, she is the perfect person
to write a book conceptualizing modern protests and their use of technology.
Unsurprisingly, the essay makes for a fascinating and enlightening read.</p>
<p>Here's an example. What does it mean when many people march in the streets? It
displays power: the power to send the word out, to convince people to join, to
organize logistics. But the <em>actual march</em> isn't scary to people in power: the
<em>implications</em> are. If an organization is able to gather many people for a
march, then this movement is capable of other things. Boycotts, strikes,
fundraisers for your political opponents, influence in the media…</p>
<p>All those things <em>actually</em> cause headaches to politicians, and make change more
likely. The protest itself merely serves as a signal. A few decades ago, it was
a <em>strong</em> signal: only very powerful movements could put a large number of
people on the streets. So if you could pull off a large protest, it meant that
your movement could do all those other annoying things. Social media and
technological tools change this. With them, it's much easier to plan an event,
get the word out, and have many people rally around a cause for an afternoon.
This should be good news for protesters… Except it also means that large
protests are no longer such a show of strength. "Easier" also means "less
impressive". And the people in power have understood this.</p>
<p>Consider movements such as the anti-war demonstrations of the Bush era, Occupy,
or the more recent Women's March. Politicians were able to pretty much ignore
protesters: once everyone gets home, nothing happens. Only the most motivated of
political opponents might cause <em>actual</em> issues later on. Worse, their number is
not directly related to the size of the protest itself. So the protest can be
very impressive (especially when comparing it with historical protests), and
still not scare anyone in power.</p>
<p>I picked this particular insight to try and convince you to read the book… But
that's obviously only a tiny part of what is there. Tufekci provides simple
concepts to understand how tech interacts with social movements. It's rigorous,
detailed, and illustrated with plenty of historical examples. The author doesn't
assume you know these examples already (even for "famous" events, like civil
rights movements in the US). This is great for people like me with a limited
knowledge of history ^^</p>
<p><em>Twitter and Tear Gas</em> doesn't only evoke protests. At the intersection between
tech and social movements, there are also misinformation campaigns, online
harassment, social network policies and their consequences… Each of those is
discussed in the book, always with the same academic rigor, lively examples, and
clear writing.</p>
<p>The book is an excellent read from an intellectual perspective: it made many
ideas <em>clearer</em> and <em>simpler to understand</em> for me. This feeling is the best
indicator I know of good science! But you can also read this book as an
instruction manual. How to build "muscle" for a movement, how to orient it
towards the most efficient means of action, how to deal with misinformation and
censorship… Using the technological tools that were developed in the last few
decades.</p>
<p>Everyone working in tech could probably benefit from reading <em>Twitter and Tear
Gas</em>. If you're an activist, I'd say it's pretty much required reading. Go buy
it <a href="https://www.amazon.com/Twitter-Tear-Gas-Fragility-Networked/dp/0300215126/">there</a> or <a href="http://yalebooks.com/book/9780300215120/twitter-and-tear-gas">there</a>, or if you can't afford it, <a href="https://www.twitterandteargas.org/downloads/twitter-and-tear-gas-by-zeynep-tufekci.pdf">download it for
free</a>: it's licensed under <a href="http://technosociology.org/?p=1751">Creative Commons</a>! (This excellent model of
publishing alone is a good reason to buy the book.)</p>
Biometrics: authentication or identification?
2017-09-27, Damien Desfontaines
<p>Know the difference. It probably can't save your life, but it can certainly keep you from saying nonsensical things on the Internet.</p><p><strong>Earlier this month</strong>, there was lots of chatter online about the new iPhone's
FaceID feature: it allows you to unlock your device just by looking at it.
Behind the scenes are some hardware and algorithms which create a 3D map of your
face, and determine whether you're the phone's rightful owner.</p>
<p>Many people seemed to not understand the difference between <em>authentication</em> and
<em>identification</em>. Both authentication and identification can use biometric data,
like facial recognition. Nonetheless, these use cases are fundamentally
different. I'll try to explain why — I hope this can enlighten the debate around
features like this a little bit.</p>
<h1 id="authentication">Authentication</h1>
<p>Authentication is what you do when you log in to some Internet service, or when
you unlock your phone. First, you <em>announce your identity</em> to the authentication
system (e.g. a log-in page or lock screen). Then, you try to <em>prove</em> to the
system that you're indeed who you pretend to be. For an Internet service,
identity can mean your e-mail address. For a phone, it's more implicit: you're
trying to prove you're the owner of the phone.</p>
<p>The attack model is the following: some evil person <em>pretends to be you</em>, and
tries to prove it to the authentication system to get access to your data. This
attacker can be of various types:</p>
<ul>
<li>an abusive partner who wants to look into your phone,</li>
<li>a scammer who wants to steal your identity,</li>
<li>a spy who wants to penetrate a company's network…</li>
</ul>
<p>Fundamentally, authentication protects against <em>unauthorized access to data</em>.</p>
<h1 id="identification">Identification</h1>
<p>Identification is trying to figure out <em>who someone is</em> based on some
characteristics they have, or data they produced. It's what the police do when
running a fingerprint against a database of suspects. It's what privacy
researchers do when they try to show that a data release has not been <a href="k-anonymity.html">properly
anonymized</a>.</p>
<p>The attack model here is that somebody tries to <em>find your identity</em>. To
succeed, an attacker needs to have a <em>list of suspects</em>, and enough information
to <em>distinguish</em> who you are among all possibilities.</p>
<h1 id="good-authentication-vector-vs-good-identification-vector">Good authentication vector vs. good identification vector</h1>
<p>From these different attack models, a first distinction emerges.</p>
<ul>
<li>If a piece of data is <em>secret</em>, it will work well as an authentication vector.
Passwords, codes embedded in security keys, or one-time SMS codes, are classic
examples.</li>
<li>If a piece of data is <em>public</em>, or at least known to the attacker, it can work
as an identification vector. Names, dates of birth or phone numbers are good
candidates.</li>
</ul>
<p>A second distinction is on the <em>amount</em> of information present in the data.</p>
<ul>
<li>To authenticate someone, you don't always need lots of info. For example, a
4-digit PIN code is enough to get decent security on a phone, provided only
a few retries are allowed.</li>
<li>To identify someone, you need more than this. Even if you somehow get your
hands on a database which contains everyone's PIN code, each one would
correspond to many people. A PIN code alone wouldn't be enough: you need some
context or more data to reliably identify someone. <br><br></li>
</ul>
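<p>The PIN point can be made concrete with one line of arithmetic. Assuming the PIN is chosen uniformly at random and each attempt is a distinct guess:</p>

```python
def pin_guess_probability(retries, digits=4):
    """Probability of guessing a uniformly random numeric PIN of the
    given length within `retries` distinct attempts."""
    return min(retries / 10 ** digits, 1.0)

pin_guess_probability(5)             # 0.0005: five tries on a 4-digit PIN
pin_guess_probability(10, digits=6)  # 1e-05: ten tries on a 6-digit PIN
```

<p>Little information, yet decent protection, as long as the retry limit is enforced.</p>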
<h1 id="biometrics-for-authentication">Biometrics for authentication</h1>
<p>Biometrics seem to be blurring the line. Fingerprints are not exactly secret,
right? Your face is probably also all over social media. So how come they are
more and more used as authentication methods?</p>
<p>It turns out that the <em>secrecy</em> of authentication data is not a required
property. All we need is <em>unforgeability</em>: an attacker must not be able to
impersonate you. If a secret is well-protected, it's difficult to falsify: the
attacker can't imitate what they don't know. But biometric info can be quite
unforgeable, even if it's not technically secret. It's easy to find what
someone's face or fingerprint looks like, but it's hard to create a fake version
of it.</p>
<p>Some folks have written excellent articles on the difficulty of bypassing
biometric authentication. So, instead of diving into the details, I'll simply
recommend <a href="https://www.troyhunt.com/face-id-touch-id-pins-no-id-and-pragmatic-security/">this excellent
post</a>
from Troy Hunt's blog.</p>
<h1 id="biometrics-for-identification">Biometrics for identification</h1>
<p>Information being <em>public</em> doesn't mean that there exists a central database
containing everyone's data. This is especially true for biometric info. Most
attackers don't have access to global fingerprint or facial recognition
databases (yet)… But when they do, it definitely raises serious privacy
concerns. </p>
<p>Classic identification attacks focus on finding the person behind a pseudonym or
identifier. Identifiers can be phone numbers, e-mail addresses… Over time, you
can change pseudonyms and identifiers<sup id="fnref-changed"><a class="footnote-ref" href="#fn-changed">1</a></sup>. You can also maintain separate
identities, for example when you use a different email address for services you
don't trust.</p>
<p>Biometric identification doesn't have these nice properties. You can't change
your face or your fingerprints! And you can't use a different right thumb with
border agents of different countries, either.</p>
<p>Furthermore, you also have less <em>control</em> over your biometric information. You
can decide not to interact with a given online service if you don't trust it.
But if you're living a "normal" life in a Western city, your face will most
certainly be caught and recorded by many surveillance cameras.</p>
<p>Creating a facial recognition database is becoming simpler and cheaper. In
Russia, pro-Putin activists identified anti-government protestors using pictures
gathered from social media<sup id="fnref-findface"><a class="footnote-ref" href="#fn-findface">2</a></sup>. "Researchers" are creating algorithms to
detect sexual orientation<sup id="fnref-orientation"><a class="footnote-ref" href="#fn-orientation">3</a></sup> or gender identity<sup id="fnref-gender"><a class="footnote-ref" href="#fn-gender">4</a></sup>. They used
data from dating apps or video sharing services, and didn't ask anyone for
consent.</p>
<p>Using biometric data for identification is not inherently problematic. For
example, it helps catch violent criminals. Yet, the privacy concerns are most
definitely justified.</p>
<h1 id="are-those-really-distinct-problems">Are those really distinct problems?</h1>
<p>So, biometric identification is creepy, but biometric authentication isn't
always problematic. But wait. If people build biometric authentication systems…
How do they recognize someone's face or fingerprint if they don't store it
somewhere? Didn't the engineers behind FaceID have to build a biometric database?
Couldn't evil people use this for identification? </p>
<p>Not necessarily. For many of those tools, it is a specific design goal to <em>not</em>
make biometric identification easier. This is achieved through a series of risk
mitigation mechanisms<sup id="fnref-faceid"><a class="footnote-ref" href="#fn-faceid">5</a></sup>:</p>
<ul>
<li>The biometric data exists only on the user's phone, not in a central place.
The phone vendor doesn't need to unlock your phone! So it doesn't need this
information. The database doesn't exist in the first place.</li>
<li>The data lives in a specific piece of hardware called "Secure Enclave". This
chip encrypts and stores secrets <em>independently</em> of other parts of the phone.
Even if a hacker takes control of your phone, or a thief steals it, they can't
read the biometric data stored on it. Building a biometric database from
hacked iPhones is near-impossible.</li>
<li>Pictures taken during authentication are immediately discarded. Only the
pictures used for <em>enrollment</em> (when you set up FaceID) are stored. This way,
you know what is stored on your phone, and the chip doesn't keep
pictures that you wouldn't want stored there.</li>
</ul>
<p>In addition, some fingerprint systems store only <em>partial</em> information on their
users. Remember how a 4-digit PIN was enough for certain authentication systems?
Similarly, partial biometric data can be a good enough authentication
vector. So even if the data leaks, it might not be enough to uniquely
identify someone.</p>
<p>Authentication is a <em>different problem</em> from identification. Thus, a system
designed for the former can also mitigate the risks of the latter.</p>
<p>Does this mean we shouldn't worry about biometric authentication systems? Ha!
No.</p>
<h4 id="point-of-failure-1-the-tech">Point of failure 1: the tech</h4>
<p>I'm quite confident that Apple's new FaceID system is reasonably secure.
Zero-day vulnerabilities for iOS are worth
<a href="https://www.zerodium.com/program.html">millions</a>. That's a good sign that Apple
has a strong security team who know what they're doing.</p>
<p>But there should be a <em>lot</em> of healthy skepticism when anyone introduces a new
system like this. Data breaches happen all the time. If a biometric
authentication system is badly designed, the potential consequences are
catastrophic.</p>
<h4 id="point-of-failure-2-the-people">Point of failure 2: the people</h4>
<p>Did I convince you that authentication and identification are not the same
thing? Excellent. Will most people understand the distinction? I'm not exactly
optimistic =(</p>
<p>FaceID will probably make people more comfortable with facial recognition
itself. And if the technology gets normalized, this will lead to more
problematic uses being more easily accepted.</p>
<p>This week, I heard about future plans for the London public transportation
system. They are considering <a href="https://www.wired.co.uk/article/train-station-face-recognition-gateless-gate-technology">facial recognition</a> as a replacement for
magnetic cards containing tickets. Have your face recognized when you enter and
leave the subway, get charged later. This is an <em>identification</em> system. The
privacy implications are vastly different, and the consequences of security
incidents could be catastrophic.</p>
<p>Will people understand the difference?</p>
<div class="footnote">
<hr>
<ol>
<li id="fn-changed">
<p>Are you thinking "wait a second, I can't change my social security
number…"? Excellent point! This is one of the many reasons why SSNs make such
terrible identifiers.&#160;<a class="footnote-backref" href="#fnref-changed" title="Jump back to footnote 1 in the text">&#8617;</a></p>
</li>
<li id="fn-findface">
<p>Here's a scary
<a href="https://www.theguardian.com/technology/2016/may/17/findface-face-recognition-app-end-public-anonymity-vkontakte">article</a>
about this thing. Their success rate was pretty terrible, but this didn't stop
them. And the tech is getting better fast.&#160;<a class="footnote-backref" href="#fnref-findface" title="Jump back to footnote 2 in the text">&#8617;</a></p>
</li>
<li id="fn-orientation">
<p>Here's an
<a href="http://mashable.com/2017/09/11/artificial-intelligence-ai-lgbtq-gay-straight/">article</a>
that does a good job at explaining why this is terrible science (and ethics).&#160;<a class="footnote-backref" href="#fnref-orientation" title="Jump back to footnote 3 in the text">&#8617;</a></p>
</li>
<li id="fn-gender">
<p>Example <a href="https://www.theverge.com/2017/8/22/16180080/transgender-youtubers-ai-facial-recognition-dataset">press
coverage</a>.
I think I've seen good criticism of it at the time but I can't find it
anymore =(&#160;<a class="footnote-backref" href="#fnref-gender" title="Jump back to footnote 4 in the text">&#8617;</a></p>
</li>
<li id="fn-faceid">
<p>From Apple's <a href="https://images.apple.com/business/docs/FaceID_Security_Guide.pdf">FaceID Security
Guide</a>
(PDF).&#160;<a class="footnote-backref" href="#fnref-faceid" title="Jump back to footnote 5 in the text">&#8617;</a></p>
</li>
</ol>
</div>k-anonymity, the parent of all privacy definitions2017-08-14T00:00:00+02:002017-10-01T00:00:00+02:00Damien Desfontainestag:desfontain.es,2017-08-14:/privacy/k-anonymity.html<p>How a privacy researcher proved a politician wrong, and how she created the first ever definition of anonymity in the process.</p><p><strong>In 1997</strong>, a PhD student named <a href="https://en.wikipedia.org/wiki/Latanya_Sweeney">Latanya Sweeney</a> heard about an
interesting data release. A <a href="http://www.mass.gov/anf/employee-insurance-and-retirement-benefits/oversight-agencies/gic/">health insurance organization</a> from
Massachusetts had compiled a database of hospital visits by state employees, and
had thought that giving it to researchers could encourage innovation and
scientific discovery. Of course, there were privacy considerations: allowing
researchers to look at other citizens' health records seemed pretty creepy. So
they decided to do the obvious thing, and remove all columns that indicated who
a patient was: name, phone number, full address, social security number, etc.</p>
<p>As you can probably guess, this didn't end so well. In this article, I'll
describe and analyze Sweeney's successful reidentification attack, and I'll
explain the privacy definition that Sweeney invented to prevent this type of
attack in the future: <span class="math">\(k\)</span>-anonymity.</p>
<div class="toc">
<ul>
<li><a href="#what-went-wrong">What went wrong?</a></li>
<li><a href="#how-to-prevent-this-attack">How to prevent this attack?</a><ul>
<li><a href="#definition-of-k-anonymity">Definition of \(k\)-anonymity</a><ul>
<li><a href="#what-types-of-data-are-reidentifying">What types of data are reidentifying?</a></li>
<li><a href="#how-to-choose-k">How to choose \(k\)?</a></li>
</ul>
</li>
<li><a href="#how-to-make-a-dataset-k-anonymous">How to make a dataset \(k\)-anonymous?</a><ul>
<li><a href="#building-block-1-generalization">Building block 1: generalization</a></li>
<li><a href="#two-types-of-generalization">Two types of generalization</a></li>
<li><a href="#building-block-2-suppression">Building block 2: suppression</a></li>
<li><a href="#algorithms">Algorithms</a></li>
<li><a href="#in-practice">In practice</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#how-convincing-really-is-k-anonymity">How convincing really is \(k\)-anonymity?</a></li>
</ul>
</div>
<p></p>
<h1 id="what-went-wrong">What went wrong?</h1>
<p>Some demographic information was left in the database, so researchers could
still compile useful stats: ZIP code, date of birth, and gender were all part
of the data. Sweeney realized that the claims of the Massachusetts governor,
who insisted that the privacy of state employees was respected (all identifiers
were removed!), were perhaps a little bit over-optimistic. Since the governor
himself was a state employee, Sweeney decided to do the obvious thing and
reidentify which records of the "anonymized" database were the governor's.</p>
<p>With just $20, Sweeney bought the public voter records from Massachusetts, which
had both full identifiers (names, addresses) and demographic data (ZIP code and
date of birth), and contained the governor's information. Guess how many records
matched the governor's gender, ZIP code, and date of birth inside the hospital
database? Only one, and thus, Sweeney was able to know which prescriptions and
visits in the data were the governor's. She mailed all of it to his office,
demonstrating theatrically that the anonymization process wasn't as solid as it
should have been.</p>
<p>Several factors made this attack possible. Some are obvious, but not all:</p>
<ol>
<li>
<p>The hospital data contained demographic information that could be used to
distinguish between different records.</p>
</li>
<li>
<p>A secondary database was available to figure out the demographic information
about the target.</p>
</li>
<li>
<p>The target was in both datasets.</p>
</li>
<li>
<p>And the demographic information of the target (ZIP code, date of birth, and
gender) was unique within both datasets: only one record had the demographic
values of the governor.</p>
</li>
</ol>
<p>At first glance, these factors appear to be <em>necessary</em>: remove one of them and
suddenly, the attack no longer works. (Try it! It's a good mental exercise.)</p>
<h1 id="how-to-prevent-this-attack">How to prevent this attack?</h1>
<p>As per our previous remark, removing one of the factors should be enough to
prevent attacks like these. Which ones can we afford to remove, while making
sure that the data can be used for data analysis tasks?</p>
<ol>
<li>
<p>We could remove all demographic information from the data, or even all
information that might be linked to a person using auxiliary sources.
Unfortunately, this would also severely hinder the utility of the data:
correlations based on age, gender, and geographic info are very useful to
researchers!</p>
</li>
<li>
<p>Society probably <em>should</em> do something about the existence of public (or
commercially available) data sources that can be used in reidentification
attacks. However, this is a complex political issue, so a little bit out of
scope for a data owner who wants to publish or share an anonymized version of
their data — in practice, there's pretty much nothing we can do about it.</p>
</li>
<li>
<p>Again, there's not much we can do. We have no way to modify the secondary
(public) dataset. We could decrease the probability that a random target is
in our dataset by sub-sampling it, but all people in the sample would still
be at risk, so this is obviously not a satisfying solution.</p>
</li>
<li>
<p>Now, this is the interesting point. Maybe suppressing all demographic values
would render the data useless, but there might be a middle ground to make
sure that the demographic values are no longer unique in the dataset.</p>
</li>
</ol>
<p>This last suggestion is the basic idea of <span class="math">\(k\)</span>-anonymity. </p>
<h2 id="definition-of-k-anonymity">Definition of <span class="math">\(k\)</span>-anonymity</h2>
<p>A dataset is said to <em>be <span class="math">\(k\)</span>-anonymous</em> if every combination of values for
demographic columns in the dataset appears for at least <span class="math">\(k\)</span> different records.</p>
<p>For example, this dataset is <span class="math">\(2\)</span>-anonymous:</p>
<table>
<thead>
<tr>
<th align="center">ZIP code</th>
<th align="center">age</th>
</tr>
</thead>
<tbody>
<tr>
<td align="center">4217</td>
<td align="center">34</td>
</tr>
<tr>
<td align="center">4217</td>
<td align="center">34</td>
</tr>
<tr>
<td align="center">1742</td>
<td align="center">77</td>
</tr>
<tr>
<td align="center">1742</td>
<td align="center">77</td>
</tr>
<tr>
<td align="center">4217</td>
<td align="center">34</td>
</tr>
</tbody>
</table>
<p>This one isn't:</p>
<table>
<thead>
<tr>
<th align="center">ZIP code</th>
<th align="center">age</th>
</tr>
</thead>
<tbody>
<tr>
<td align="center">4217</td>
<td align="center">34</td>
</tr>
<tr>
<td align="center">1742</td>
<td align="center">77</td>
</tr>
<tr>
<td align="center">1743</td>
<td align="center">77</td>
</tr>
<tr>
<td align="center">4217</td>
<td align="center">34</td>
</tr>
</tbody>
</table>
<p>Notice that we need every <em>combination</em> of values to appear at least <span class="math">\(k\)</span> times.
Thus, even if each individual value of each column appears <span class="math">\(2\)</span> times in the
following dataset, it's not <span class="math">\(2\)</span>-anonymous:</p>
<table>
<thead>
<tr>
<th align="center">ZIP code</th>
<th align="center">age</th>
</tr>
</thead>
<tbody>
<tr>
<td align="center">4217</td>
<td align="center">34</td>
</tr>
<tr>
<td align="center">1742</td>
<td align="center">34</td>
</tr>
<tr>
<td align="center">4217</td>
<td align="center">77</td>
</tr>
<tr>
<td align="center">1742</td>
<td align="center">77</td>
</tr>
</tbody>
</table>
<p>The intuition is that when a dataset is <span class="math">\(k\)</span>-anonymous for a sufficiently large
<span class="math">\(k\)</span>, the last requirement for a successful reidentification attack is broken. An
attacker might find out the demographic information of their target using a
secondary database, but then this demographic information will be linked to <span class="math">\(k\)</span>
different individuals, so it will be impossible to know which one is their info.</p>
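This definition is mechanical enough to check in a few lines of code. Here is a minimal Python sketch (records represented as tuples of quasi-identifier values — an illustrative encoding, not part of the original definition), run on the three example tables above:

```python
from collections import Counter

def is_k_anonymous(records, k):
    # A dataset is k-anonymous if every combination of
    # quasi-identifier values appears at least k times.
    counts = Counter(tuple(r) for r in records)
    return all(c >= k for c in counts.values())

# The three example tables above, as (ZIP code, age) pairs:
table1 = [(4217, 34), (4217, 34), (1742, 77), (1742, 77), (4217, 34)]
table2 = [(4217, 34), (1742, 77), (1743, 77), (4217, 34)]
table3 = [(4217, 34), (1742, 34), (4217, 77), (1742, 77)]

print(is_k_anonymous(table1, 2))  # True
print(is_k_anonymous(table2, 2))  # False: (1743, 77) appears only once
print(is_k_anonymous(table3, 2))  # False: no combination appears twice
```

Note that the check counts whole tuples, which is exactly why the third table fails even though every individual value appears twice.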
<h4 id="what-types-of-data-are-reidentifying">What types of data are reidentifying?</h4>
<p>Note that we've only talked about "demographic information", which is pretty
vague. ZIP codes, age, gender are all good candidates for reidentification
attacks, because they're public (or easily findable) information that is also
often found in sensitive datasets (especially medical ones). In general, the
data owner should consider which columns might be used by the attacker they're
concerned about.</p>
<p>These columns, not necessarily sensitive themselves but which might be used in a
reidentification attack, are called <em>quasi-identifiers</em> (or <em>QIs</em>). There is no
universal list of quasi-identifiers: it depends on the attack model. While some
data types are almost always QIs (ZIP code, age, gender…), many more depend on
the context (like timestamps, medical conditions, physical characteristics…).
The question to ask is: would the person who's trying to attack our dataset have
access to these values through public or commercially available data?</p>
<p><small> I'll try to write more about attack modeling and data classification
later. This is not as easily explainable as the various mathematical definitions
of privacy: it has lots of human components and as such, is always a bit fuzzy.
Which makes it even more interesting! :D But I digress. </small></p>
<h4 id="how-to-choose-k">How to choose <span class="math">\(k\)</span>?</h4>
<p>Short answer: ¯\_(ツ)_/¯</p>
<p>Longer answer: nobody knows. In the healthcare world, when medical data is
shared with a small number of people (typically for research purposes), <span class="math">\(k\)</span> is
often chosen between <span class="math">\(5\)</span> and <span class="math">\(15\)</span>. This choice is very arbitrary and ad hoc. To
the best of my knowledge, there is no official law or regulation which suggests
a specific value. Some universities, companies or other organizations have
official guidelines, but the vast majority don't.</p>
<p>To pick a parameter for a privacy definition, one needs to understand the
link between the parameter value and the risk of a privacy incident happening.
But this is difficult: while <span class="math">\(k\)</span>-anonymity is relatively easy to understand,
estimating risk quantitatively is extremely tricky. <small> I'm also going to
write a bit about this later on! </small></p>
<ul>
<li>Regulators don't want to include specific parameter values in laws or
guidelines, since there is no convincing argument to be made for a given
choice, and the level of risk depends on many other, fuzzier parameters (how
valuable the data is, how bad a privacy incident would be, etc.).</li>
<li>Data owners don't know how to choose the parameter either, so they usually buy
the services of a privacy consultant to make this choice (and take care of the
anonymization process). This consultant doesn't know what the "good" choice is
either, but they usually have more practical experience of which values are
common in the industry for similar levels of risk.</li>
</ul>
<p><small> This is my first "real" blog post, about the most basic anonymity
definition there is, and I've already reached my second digression to say
"notice how it's actually super fuzzy and thus, complicated to apply in
practice?". Isn't privacy fun? :D </small></p>
<h2 id="how-to-make-a-dataset-k-anonymous">How to make a dataset <span class="math">\(k\)</span>-anonymous?</h2>
<p>So, suppose we picked our quasi-identifiers and <span class="math">\(k=2\)</span>. Even with such a low
value for <span class="math">\(k\)</span>, our original dataset will likely not be <span class="math">\(k\)</span>-anonymous: there will
be many records with unique combinations of quasi-identifier values.</p>
<p>The two main building blocks used to transform a dataset into a <span class="math">\(k\)</span>-anonymous
table are <em>generalization</em> and <em>suppression</em>.</p>
<h4 id="building-block-1-generalization">Building block 1: generalization</h4>
<p>Generalization is the process of making a quasi-identifier value less precise,
so that records with different values are transformed (or <em>generalized</em>) into
records that share the same values. Consider the records in this table:</p>
<table>
<thead>
<tr>
<th align="center">ZIP code</th>
<th align="center">age</th>
</tr>
</thead>
<tbody>
<tr>
<td align="center">4217</td>
<td align="center">34</td>
</tr>
<tr>
<td align="center">4217</td>
<td align="center">39</td>
</tr>
<tr>
<td align="center">1742</td>
<td align="center">75</td>
</tr>
<tr>
<td align="center">1691</td>
<td align="center">77</td>
</tr>
</tbody>
</table>
<p>The numerical values of these records can be transformed into <em>numerical
ranges</em> (or partially masked values), so that the resulting table satisfies <span class="math">\(2\)</span>-anonymity:</p>
<table>
<thead>
<tr>
<th align="center">ZIP code</th>
<th align="center">age</th>
</tr>
</thead>
<tbody>
<tr>
<td align="center">4217</td>
<td align="center">30-39</td>
</tr>
<tr>
<td align="center">4217</td>
<td align="center">30-39</td>
</tr>
<tr>
<td align="center">1***</td>
<td align="center">75-79</td>
</tr>
<tr>
<td align="center">1***</td>
<td align="center">75-79</td>
</tr>
</tbody>
</table>
<p>The idea of generalization is to make demographic information more imprecise to
satisfy our privacy requirements, but still allow useful data analysis to be
done. In our example, changing precise ages into age ranges is probably enough
to analyze whether a disease affects young or old people disproportionately.</p>
<p>Transforming a numerical value into a range is one of the most typical ways of
performing generalization. Other ways include removing a value entirely (e.g.
transforming a gender value into "gender unknown"), or using a <em>generalization
hierarchy</em> (e.g. transforming an <a href="https://en.wikipedia.org/wiki/ICD-10">ICD-10 diagnosis code</a> into a
truncated code, or the corresponding <a href="https://en.wikipedia.org/wiki/ICD-10_Chapter_I:_Certain_infectious_and_parasitic_diseases">block</a>).</p>
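The two generalization examples above — age ranges and truncated ZIP codes — can be sketched in a few lines of Python. The bin width and the number of digits kept are illustrative parameters of my own choosing, not prescribed values:

```python
def generalize_age(age, width=10):
    # Map an exact age to a fixed-width range, e.g. 34 -> "30-39".
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def generalize_zip(zip_code, keep_digits=1):
    # Keep the first digits and mask the rest, e.g. 1742 -> "1***".
    s = str(zip_code)
    return s[:keep_digits] + "*" * (len(s) - keep_digits)

print(generalize_age(34))           # "30-39"
print(generalize_age(77, width=5))  # "75-79"
print(generalize_zip(1742))         # "1***"
```

Picking the bin width is exactly the utility trade-off discussed above: wider bins make records merge more easily, but destroy more information.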
<h4 id="two-types-of-generalization">Two types of generalization</h4>
<p>Generalization strategies can be classified into two categories: <em>global</em> and
<em>local</em>. Consider the following table:</p>
<table>
<thead>
<tr>
<th align="center">ZIP code</th>
<th align="center">age</th>
</tr>
</thead>
<tbody>
<tr>
<td align="center">4217</td>
<td align="center">34</td>
</tr>
<tr>
<td align="center">4217</td>
<td align="center">34</td>
</tr>
<tr>
<td align="center">1742</td>
<td align="center">34</td>
</tr>
<tr>
<td align="center">1742</td>
<td align="center">31</td>
</tr>
</tbody>
</table>
<p>Global generalization means that a given value for a given column will <em>always</em>
be generalized in the same way: if you decide to transform age 34 into age range
30-34 for one record, all records that have ages between 30 and 34 will be
transformed into this fixed range of 30-34. Using global generalization, the
example could be transformed into:</p>
<table>
<thead>
<tr>
<th align="center">ZIP code</th>
<th align="center">age</th>
</tr>
</thead>
<tbody>
<tr>
<td align="center">4217</td>
<td align="center">30-34</td>
</tr>
<tr>
<td align="center">4217</td>
<td align="center">30-34</td>
</tr>
<tr>
<td align="center">1742</td>
<td align="center">30-34</td>
</tr>
<tr>
<td align="center">1742</td>
<td align="center">30-34</td>
</tr>
</tbody>
</table>
<p>Local generalization doesn't have that constraint: it allows you to pick a
different generalization for each record. A value of 34 in the age column might
stay untouched for one record, and be generalized for another:</p>
<table>
<thead>
<tr>
<th align="center">ZIP code</th>
<th align="center">age</th>
</tr>
</thead>
<tbody>
<tr>
<td align="center">4217</td>
<td align="center">34</td>
</tr>
<tr>
<td align="center">4217</td>
<td align="center">34</td>
</tr>
<tr>
<td align="center">1742</td>
<td align="center">30-34</td>
</tr>
<tr>
<td align="center">1742</td>
<td align="center">30-34</td>
</tr>
</tbody>
</table>
<p>Global generalization usually makes it easier to do data analysis on generalized
data, while local generalization preserves more utility at the cost of a
slightly more complex data representation.</p>
<h4 id="building-block-2-suppression">Building block 2: suppression</h4>
<p>In our previous example, our records had relatively "close" demographic values,
which allowed generalization to keep reasonably accurate information while still
ensuring <span class="math">\(2\)</span>-anonymity. What if the table is instead:</p>
<table>
<thead>
<tr>
<th align="center">ZIP code</th>
<th align="center">age</th>
</tr>
</thead>
<tbody>
<tr>
<td align="center">4217</td>
<td align="center">34</td>
</tr>
<tr>
<td align="center">4217</td>
<td align="center">39</td>
</tr>
<tr>
<td align="center">1742</td>
<td align="center">75</td>
</tr>
<tr>
<td align="center">1691</td>
<td align="center">77</td>
</tr>
<tr>
<td align="center">9755</td>
<td align="center">13</td>
</tr>
</tbody>
</table>
<p>The first four records can be grouped in two pairs as above, but the last record
is an outlier. Grouping it with one of the pairs above would mean having very
large ranges of values (age between 10 and 39, or ZIP code being completely
removed), which would significantly reduce the utility of the resulting data. So
a simple solution to deal with such outlier values is to remove them from
the data. Using both generalization and suppression on this example could lead
to the same <span class="math">\(2\)</span>-anonymous table as before:</p>
<table>
<thead>
<tr>
<th align="center">ZIP code</th>
<th align="center">age</th>
</tr>
</thead>
<tbody>
<tr>
<td align="center">4217</td>
<td align="center">30-39</td>
</tr>
<tr>
<td align="center">4217</td>
<td align="center">30-39</td>
</tr>
<tr>
<td align="center">1000-1999</td>
<td align="center">75-79</td>
</tr>
<tr>
<td align="center">1000-1999</td>
<td align="center">75-79</td>
</tr>
</tbody>
</table>
<p>Using this method, there are usually strictly fewer records in the transformed
table than in the original. On large datasets, allowing a small percentage of
suppressed records typically allows the result to be <span class="math">\(k\)</span>-anonymous without
requiring too much generalization.</p>
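Combining the two building blocks, the suppression step simply drops whatever still falls short of k after generalization. A minimal sketch, with the groups mirroring the generalized example table:

```python
from collections import Counter

def suppress_small_groups(records, k):
    # After generalization, drop records whose quasi-identifier
    # combination still appears fewer than k times (the outliers).
    counts = Counter(records)
    return [r for r in records if counts[r] >= k]

generalized = [
    ("4217", "30-39"), ("4217", "30-39"),
    ("1000-1999", "75-79"), ("1000-1999", "75-79"),
    ("9755", "10-19"),  # the outlier record from the example
]
print(suppress_small_groups(generalized, 2))
# The two pairs survive; the outlier record is suppressed.
```

This is where the "small percentage of suppressed records" trade-off shows up: each suppressed record is utility lost, but it spares the surviving records from coarser generalization.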
<h4 id="algorithms">Algorithms</h4>
<p><span class="math">\(k\)</span>-anonymity is the oldest privacy definition, and it's relatively simple to
understand, so it was quickly adopted by the healthcare community for their
data anonymization needs. As a result, there has been a <em>lot</em> of research on how
to transform a dataset into a <span class="math">\(k\)</span>-anonymous table.</p>
<p>The problem of finding an <em>optimal</em> strategy for <span class="math">\(k\)</span>-anonymity is <a href="https://en.wikipedia.org/wiki/NP-hardness">NP-hard</a>, for
basically any reasonable definition of optimality. <a href="https://desfontain.es/PDFs/PhD/OnTheComplexityOfOptimalKAnonymity.pdf">This paper<sup>
(pdf)</sup></a> presents a few such results, if you're interested in this
kind of thing ^^</p>
<p>A list of approximation algorithms for the optimal <span class="math">\(k\)</span>-anonymization problem can
be found in <a href="https://desfontain.es/PDFs/PhD/PublishingDataFromElectronicHealthRecordsWhilePreservingPrivacyASurveyOfAlgorithms.pdf">this paper<sup> (pdf)</sup></a> (Table 4, page 11). 18
different algorithms are listed, and I don't even think the list is exhaustive!
The paper contains many links to the original papers, and to some comparisons
between methods. Sadly, there is no unified benchmark to know how all these
algorithms perform on various data analysis tasks.</p>
<h4 id="in-practice">In practice</h4>
<p>Unless you're a PhD student working on your literature review, you're probably
not looking for a bunch of links to research papers about complicated
<span class="math">\(k\)</span>-anonymization algorithms. If you're a data owner trying to transform a
dataset to get a <span class="math">\(k\)</span>-anonymous table, you may be looking for software instead.</p>
<p>As of 2017, the main open-source tool for data anonymization is <a href="http://arx.deidentifier.org/">ARX</a>. Its
interface is a bit difficult to understand at first, but it works fairly well on
small to moderately large datasets, and implements a lot more than just
<span class="math">\(k\)</span>-anonymity algorithms. It used to feature only global generalization
techniques<sup id="fnref-edit"><a class="footnote-ref" href="#fn-edit">1</a></sup>, but this apparently <a href="http://arx.deidentifier.org/anonymization-tool/analysis/#a50">changed recently</a>.</p>
<p>There are other tools available online, but none of them is anywhere near as usable
as ARX. Many of them are listed in the <a href="http://arx.deidentifier.org/overview/related-software/">Related software</a> page of ARX's
website. I've tried most of them, only to get convinced that none of them really
reached the point of being a usable product. <a href="http://cs.utdallas.edu/dspl/cgi-bin/toolbox/index.php">UTD Anonymization Toolbox</a> is
probably the only one worth a look: it requires using the command line and
impractical configuration files, but it implements a local
generalization algorithm (the first of its kind, named <a href="https://desfontain.es/PDFs/PhD/MondrianMultidimensionalKANonymity.pdf">Mondrian<sup>
(pdf)</sup></a>, a very cool technique with better utility preservation
than global generalization algorithms).</p>
<p>On the commercial side, I've only heard of a toolkit developed by the consulting
company <a href="https://privacy-analytics.com/">Privacy Analytics</a>. The intended audience seems to be people who know
little about privacy: it looks very shiny, but I didn't manage to understand
which anonymity property or algorithms they were using ^^ You can get a free
trial by filling out a form on their website, but I can only assume the real
version is very expensive, since there is no mention of price anywhere.</p>
<h1 id="how-convincing-really-is-k-anonymity">How convincing really is <span class="math">\(k\)</span>-anonymity?</h1>
<p><span class="math">\(k\)</span>-anonymity is simple to understand, and it seems intuitively obvious that
reidentification attacks are well mitigated when a dataset is transformed to
become <span class="math">\(k\)</span>-anonymous. However, it only mitigates this particular kind of attack.
We assumed that all the attacker wanted was to select a target, point at a
record, and say "this record corresponds to my target" with high certainty.
This matches Sweeney's original attack, but how realistic is this?</p>
<p>When an attacker successfully reidentifies someone in a dataset, it's not
necessarily a privacy issue. Consider the voter files from earlier. By law, this
data is public, and contains full names. It's very easy for an attacker to
point at a random record and shout "hey, I reidentified this person!": the
identification is <em>right there</em> in the dataset. This "attack" <em>always succeeds</em>,
but it's not really interesting, nor particularly creepy… Why is that?</p>
<p>In Sweeney's example, the creepy thing isn't just finding the data subject
associated with a given record. The <em>sensitive</em> information linked with the
record (in our leading example, diagnoses and drug prescriptions) is where the
creepiness comes from! The leak of <em>sensitive</em> information associated with a
given individual is the problem, not the reidentification itself.</p>
<p><span class="math">\(k\)</span>-anonymity doesn't really capture this idea. The definition just prevents you
from knowing the real identity of an anonymized record. But maybe there are
other attacks that allow you to find out sensitive information about someone,
without finding with absolute certainty which record is theirs?</p>
<p>As I'll explain in future articles, other types of attacks do exist, and many
other definitions have been proposed in order to mitigate them too. Nonetheless,
<span class="math">\(k\)</span>-anonymity is still used in the healthcare world, in large part because of
its simplicity and utility preservation compared to other definitions.</p>
<div class="footnote">
<hr>
<ol>
<li id="fn-edit">
<p>A previous version of this post claimed that only global generalization
was available in ARX. Sorry for the factual mistake! I should have read the
docs more closely =)&#160;<a class="footnote-backref" href="#fnref-edit" title="Jump back to footnote 1 in the text">&#8617;</a></p>
</li>
</ol>
</div>
Beginnings2017-07-17T00:00:00+02:002017-07-17T00:00:00+02:00Damien Desfontainestag:desfontain.es,2017-07-17:/privacy/beginnings.html<p>Blog intro. What's going to be there?</p><p><strong>Hi there!</strong> I'm <a href="../serious.html">Damien</a>. I have no idea how people usually
start blogs, so bear with me while I figure this out.</p>
<p>I see this place as a way to publish things that are too long for Twitter, too
opinionated for Wikipedia, and not pretentious enough for Medium. I'm going to
try and keep it to three themes: <em>privacy</em>, <em>research</em>, and <em>privacy research</em>.
I'm not (yet) a specialist in any of these. Hopefully, thanks to my job and
personal interest in those topics, I can add something valuable to what's
written online about them.</p>
<p>The following is the vision I have of these three themes. This should give an
idea of what I intend to talk about in this blog =)</p>
<h2 id="privacy">Privacy</h2>
<p>It's difficult to define what privacy encompasses. It's easier to notice when
you don't have enough privacy — through bad surprises, uneasy feelings of
creepiness, or real risks to your safety.</p>
<p>When a parent or a partner installs stealthy software on your phone to spy on
your texts and calls, that's an invasion of your privacy. When a company sells
your name, address and purchase history to some sketchy third-party that sends
you targeted ads, the uneasy feeling you get comes from a lack of privacy.
Full-body scanners in certain airports are an attack on one's bodily privacy.
Data leaks are a risk to users' privacy.</p>
<p>Privacy issues usually come from a lack of <em>transparency</em>, of <em>control</em>, or
both. In an ideal world, everybody would know exactly who has access to which
data about them and why. Personal data collection would not happen without
informed consent, and people would have a right to access, modify and delete
data that other people or organizations hold about them.</p>
<p>The fuzziness, and the complexity of the issues in this space, are part of what
I find interesting about them. I have done many privacy reviews for Google
products, and there is always something interesting and new with each of them.
Would users expect this behavior? Is this deletion action clear enough? Could
someone re-identify this aggregated data?</p>
<p>Like security, privacy is of particular importance for marginalized communities.
Having your phone number leaked online is much more problematic if you're a
high-profile political activist, or a closeted LGBTQ+ blogger. Harassment of
folks who belong to minorities is a major problem, and badly-designed sharing
interfaces or insufficient anti-abuse tools can lead to dramatic consequences.
Designing tools that deal with potentially sensitive data, and failing to
consider these specific risks, is highly irresponsible. And you can easily guess
what I think of compliance-based privacy programs…</p>
<p>I also try to avoid absolutist viewpoints. They are hardly ever constructive,
and they are often dangerous. I know people who refuse to use Signal because
it's not available without Google Play Services, while continuing to communicate
via cleartext SMS messages. For most practical problems, there is no perfect
solution. Focusing on defending against a hypothetical all-powerful targeted
attacker is usually pointless. Instead, I try to focus on realistic threat
models, usable tools, and risk mitigation.</p>
<h2 id="research">Research</h2>
<p>I started a part-time PhD after two years of software engineering at Google.</p>
<p>To solve an engineering problem, the path is quite straightforward. Grasp the
scope of the problem, design a solution, validate the design with coworkers and
stakeholders, write code, verify that the solution is "good enough", then
productionize it. Once the problem has disappeared, there's no time to think
about it further: there are other problems to solve, other fires to put out.</p>
<p>The whole process is fun and rewarding, but I'm frustrated by the ending. What
if we could design a simpler or more efficient solution? Prove that it works in
a wider range of situations? Share the idea behind it with more people, and see
whether they get inspired and solve other problems? Doing all of this is not
immediately rewarding, but I think it can have a deeper and longer-lasting
impact than core engineering work.</p>
<p>I optimistically think that academia is the place to do that. Compare the
solution to what's out there already, run more experiments, write proofs,
figure out what additional impact it could have. Share the results with as many
people as possible. It might not be worth the time, but I think it's worthwhile
to give it a try. There are certainly interesting things to learn along the way.</p>
<p>The one thing that I'm afraid of is spending time solving the wrong problems.
Finding a "good problem" is not easy: a good problem must be difficult enough
not to have been solved already, but simple enough that I have a chance of
tackling it. Identifying practical problems and their precise constraints is
also hard when the main source of inspiration is other academics' work.</p>
<p>I'm frustrated about the lack of incentives to do research work as a software
engineer, but the incentives of academia are even more broken. Publication
metrics are a bad way to estimate one's impact, especially in the short term.
The peer review process is terribly implemented in practice. The whole system
makes it painfully slow to gather feedback, and the little feedback you get is
imprecise. The idea of having my work praised only to realize much later that it
didn't make a difference in practice… It's even scarier to me than the idea of
not finding joy and impact in my research, and deciding to quit.</p>
<p>But I'm not exactly pessimistic :D I feel lucky and enthusiastic about this
part-time project. Continuing to do engineering work for Google gives me an
endless input of complicated real-world problems to tackle, many of which seem
to be good candidates for research projects. I am surrounded by impressively
smart and passionate coworkers on both sides, whose feedback is invaluable. And
I don't feel extremely attached to the idea of having an academic career or even
getting the title at the end of my PhD, so I don't really feel the pressure to
publish everything and anything just to increment some counters.</p>
<p>All in all, this sounds like a fun and challenging adventure. I'm excited to see
what I'll learn along the way!</p>
<h2 id="privacy-research">Privacy research</h2>
<p>My research, like my engineering job at Google, will focus on privacy. This is a
field whose boundaries are not very well-defined, and that has very distinct
sub-fields. Some researchers focus on user research to understand the
perceptions of real people with regard to their personal data (there are a bunch
of them at Google). Very little math is involved. Some are designing algorithms
that have provable privacy-related properties, like private set intersection or
differentially private surveys. Lots of math there! ^^ Some study the problem of
<em>anonymizing</em> (or <em>de-identifying</em>) a dataset, so it can be used by more people
or shared with third parties. Some focus on onion routing, on online tracking,
on cryptocurrency, on privacy policies, on genetic privacy, on social networks,
and the list is far from exhaustive. So… what am I doing exactly?</p>
<p>My PhD project is about <em>making it easier for data owners to understand and
protect the personal information contained in their databases</em>. I see this goal
as having two main subcomponents.</p>
<ol>
<li>
<p><em>Risk analysis</em>. There are lots of organizations, companies or governments
which sit on large databases with personal information, and it's difficult
for them to realize how sensitive it is. Leaking your users' country of
origin is intuitively less of a problem than leaking their e-mail addresses,
which in turn is not as big a deal as leaking their credit card information.
Sadly, doing this type of inventory and risk analysis is currently pretty
difficult: it requires time, investment, and specific expertise. It
shouldn't have to be this way, so I'm working towards building tools that
make this easier.</p>
</li>
<li>
<p><em>Anonymization</em>. Once you've realized how sensitive your data is, you hopefully
will want to take steps to protect it. There are many ways to lower the risk
of bad people having access to your database: encryption, access controls,
or many other security techniques. Another option is to modify the database,
in a way that makes sure that somebody with access to it can't deduce creepy
things about the individuals whose data is in the database. I'm working
towards making this process easier and more understandable for data owners.</p>
</li>
</ol>
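<p>To make the anonymization point a bit more concrete: one classic property a
modified database can aim for is <span class="math">\(k\)</span>-anonymity, where every
combination of quasi-identifier values appears at least <span class="math">\(k\)</span> times.
Here is a minimal sketch of checking that property — the column names and toy
records are hypothetical, purely for illustration:</p>

```python
from collections import Counter

def is_k_anonymous(rows, quasi_identifiers, k):
    """Return True if every combination of quasi-identifier values
    appears in at least k rows of the table."""
    counts = Counter(
        tuple(row[q] for q in quasi_identifiers) for row in rows
    )
    return all(count >= k for count in counts.values())

# Toy records: ZIP code and age range are the quasi-identifiers,
# already generalized (e.g. "481**" instead of a full ZIP code).
records = [
    {"zip": "481**", "age": "20-30", "diagnosis": "flu"},
    {"zip": "481**", "age": "20-30", "diagnosis": "cold"},
    {"zip": "481**", "age": "30-40", "diagnosis": "flu"},
]

# The ("481**", "30-40") group contains a single row, so the
# table is not 2-anonymous.
print(is_k_anonymous(records, ["zip", "age"], 2))  # prints False
```

<p>A real anonymization tool would go further and choose <em>how</em> to generalize
or suppress values until the check passes, but the check itself is this simple.</p>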
<p>I could (and hopefully, I will!) talk at length about these two things. They
have already been studied by many people over the past ~15 years (especially
anonymization), but I think that there is a lot of room for more accessible
writing on the topic, and significant improvements to make on the research
side. On the anonymization topic in particular, I feel it is urgent to work
towards bridging the gap between research advances and concrete use cases.</p>
<p>Maybe I'll realize along the way that I'm looking at the wrong problems, or that
it proves more difficult than I thought to improve the state of the art. But as
I've been told, that's part of what makes it challenging and fun ^^</p>