Abstract:
In this paper, we study the privacy-preservation properties of a
specific technique for query log anonymization: token-based hashing.
In this approach, each query is tokenized, and then a secure hash
function is applied to each token. We show that statistical
linguistic techniques may be applied to partially compromise the
anonymization. We then analyze the specific risks that arise from
these partial compromises, focusing on the revelation of identity from
unambiguous names, addresses, and so forth, and on the revelation of
facts associated with an identity that are deemed to be highly
sensitive. Our goal in this work is twofold: to show that token-based
hashing is unsuitable for anonymization, and to present a concrete
risk analysis framework for evaluating other proposals.
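
As an illustration, token-based hashing and the statistical leak it leaves behind can be sketched as follows. This is a minimal sketch under stated assumptions, not the scheme of any particular log release: the whitespace tokenizer, the choice of SHA-256, the salt value, the toy queries, and the names hash_token and anonymize_query are all hypothetical.

```python
import hashlib
from collections import Counter
from typing import List

# Hypothetical secret salt; a real deployment would keep this private.
SALT = b"hypothetical-secret-salt"


def hash_token(token: str) -> str:
    """Map a token to a hex digest using a salted SHA-256 (assumed hash)."""
    return hashlib.sha256(SALT + token.encode("utf-8")).hexdigest()


def anonymize_query(query: str) -> List[str]:
    """Tokenize on whitespace (an assumption) and hash each token."""
    return [hash_token(tok) for tok in query.lower().split()]


# Toy query log, invented for illustration only.
queries = ["cheap flights to paris", "hotels in paris", "flights to rome"]
hashed_log = [anonymize_query(q) for q in queries]

# Hashing is deterministic, so equal tokens yield equal digests and the
# token frequency profile of the log survives anonymization.
plain_counts = Counter(tok for q in queries for tok in q.lower().split())
hashed_counts = Counter(h for q in hashed_log for h in q)

print(sorted(plain_counts.values(), reverse=True))
print(sorted(hashed_counts.values(), reverse=True))
```

Both printed frequency lists are identical, since each distinct token maps to exactly one digest. An attacker holding a reference corpus with a similar token distribution could use such preserved frequencies and co-occurrences as a starting point for matching digests back to words, which is the kind of partial compromise analyzed in this paper.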