Pervasive mobile communications make it easy to track individuals, a practice that both fosters new knowledge and raises privacy concerns. The uniqueness of human mobility patterns is critical to the latter, as it facilitates user re-identification in naively anonymised datasets. We propose a solution that guarantees the indistinguishability of spatiotemporal trajectories, an important step towards open access to privacy-preserving datasets.

Collecting data generated by widespread digital transactions is an increasingly common practice. Telecommunication network operators, mobile service providers, app developers and financial companies can track the movements, preferences, activities and habits of large populations of individuals. Mining such high-dimensional big data paves the way to new, compelling models across economic and scientific domains that could not have been foreseen until a few years ago, and that are in some cases becoming part of our everyday life. The other side of the coin is the emergence of novel privacy issues related to the collection, storage and exploitation of such sensitive information.

A prominent case study is that of datasets of spatiotemporal trajectories collected, for example, via mobile network records available to telecommunication operators or geo-referenced, time-stamped check-ins recorded by mobile applications. They have become an important instrument in large-scale analyses across a number of disciplines, including physics, sociology, demography, epidemiology, transportation and computer science; a recent survey is available in [1]. These datasets are commonly anonymised by replacing identifiers (e.g., name, phone number, account number) with random strings or non-reversible hashes.

However, this simple solution does not provide protection against attacks on individual privacy. Specifically, datasets of spatiotemporal trajectories suffer from elevated uniqueness: the distinctive patterns of each user allow him or her to be pinpointed among millions of other individuals with minimal knowledge, e.g., where he or she was at any five time instants during one year [2]. Uniqueness does not imply re-identification on its own; yet, it can pave the way to cross-database linkage.
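To make the notion of uniqueness concrete, the following is a minimal sketch of how it could be estimated on a toy dataset. It is an illustration only, not the measurement procedure of [2]: the function names (`matching_users`, `uniqueness`), the representation of a sample as a `(cell_id, hour)` pair and the toy data are all assumptions made for this example.

```python
import random

def matching_users(trajectories, known_points):
    """Return the users whose trajectory contains every known point."""
    known = set(known_points)
    return [user for user, samples in trajectories.items()
            if known <= set(samples)]

def uniqueness(trajectories, p, seed=0):
    """Fraction of users singled out by p points drawn from their own trajectory."""
    rng = random.Random(seed)
    unique = 0
    for user, samples in trajectories.items():
        pool = sorted(set(samples))
        # The adversary is assumed to know up to p samples of this user.
        known = rng.sample(pool, min(p, len(pool)))
        if matching_users(trajectories, known) == [user]:
            unique += 1
    return unique / len(trajectories)

# Hypothetical toy data: each sample is a (cell_id, hour) pair.
trajectories = {
    "a": [("c1", 8), ("c2", 9), ("c3", 18)],
    "b": [("c1", 8), ("c2", 9), ("c4", 18)],
    "c": [("c1", 8), ("c2", 9), ("c4", 18)],
}

print(matching_users(trajectories, [("c3", 18)]))  # only user "a" matches
print(uniqueness(trajectories, p=3))               # one user out of three is unique
```

In this toy case, users "b" and "c" hide each other, while a single distinctive evening sample is enough to single out user "a".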

Mitigating the uniqueness of spatiotemporal trajectories is therefore a highly desirable step towards robust (and open) datasets. However, attempts at ensuring the indistinguishability of spatiotemporal trajectories through legacy techniques have failed. The typical approach is generalisation: precision in space and time is reduced for all data up to the point where no individual trajectory is uniquely distinguishable in the dataset. Yet, the high dimensionality of the data (i.e., the large number of spatiotemporal samples recorded for each user) makes generalisation ineffective: uniqueness is not removed even under very coarse spatial (i.e., tens of km) and temporal (i.e., days) granularities that disrupt data utility [2].
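The effect of uniform generalisation can be sketched on toy data as follows. This is a simplified illustration under stated assumptions, not the procedure evaluated in [2]: samples are assumed to be `(x, y, t)` triples in metres and minutes, generalisation is modelled as snapping every sample to a fixed grid, and the names `generalise` and `unique_fraction` are hypothetical.

```python
from collections import Counter

def generalise(sample, space_step, time_step):
    """Snap an (x, y, t) sample to a grid cell of the given size."""
    x, y, t = sample
    return (x // space_step, y // space_step, t // time_step)

def unique_fraction(trajectories, space_step, time_step):
    """Fraction of users whose generalised trajectory is shared with nobody."""
    coarse = {
        user: frozenset(generalise(s, space_step, time_step) for s in samples)
        for user, samples in trajectories.items()
    }
    counts = Counter(coarse.values())
    return sum(1 for c in coarse.values() if counts[c] == 1) / len(coarse)

# Hypothetical toy data: x and y in metres, t in minutes since midnight.
# User "c" follows the same commute as "a" and "b", plus one unusual sample.
trajectories = {
    "a": [(100, 200, 480), (5100, 5200, 1080)],
    "b": [(150, 250, 490), (5150, 5250, 1085)],
    "c": [(120, 220, 485), (5120, 5220, 1090), (20500, 300, 1260)],
}

for space_step, time_step in [(100, 10), (1000, 60), (30000, 1440)]:
    print(space_step, time_step,
          unique_fraction(trajectories, space_step, time_step))
```

On this toy input, a 1 km and 1 hour grid already hides "a" and "b" behind each other, but the single unusual sample of "c" is only absorbed once the whole dataset is coarsened to 30 km and one day, which is the kind of accuracy loss described above.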

Figure 1: Spatial accuracy in a dataset 2-anonymised with GLOVE.

Figure 2: Temporal accuracy in a dataset 2-anonymised with GLOVE.

We perform an extensive analysis of the root causes behind the high uniqueness and poor anonymisability of datasets of spatiotemporal trajectories. By studying real-world datasets, we observe that typical human movement patterns are easily anonymised for most of their span (e.g., consider the mass of commuters sharing the same route on trains running between two cities in the morning and afternoon, every day). However, each individual might also feature a small but not negligible number of ‘peculiar’ movements (e.g., one commuter goes to play a five-a-side football game on a pitch near his workplace on Tuesdays, delaying his trip back home). These latter movements result in spatiotemporal samples that are extremely hard to hide, and doom all other samples in the dataset to undergo a very high loss of accuracy if they are to be anonymised.

Building on these findings, we have developed GLOVE, an algorithm that runs classical generalisation on a per-sample basis, i.e., it enforces the minimum reduction of granularity required to hide each sample separately, so that each complete trajectory is indistinguishable from k-1 other trajectories in the same dataset. GLOVE thus implements the ‘k-anonymity’ privacy criterion. Our approach proves very effective: when run on large-scale datasets describing weeks of mobility of tens of thousands of users, GLOVE completely removes data uniqueness. More importantly, it does so while retaining a substantial level of accuracy. Figures 1 and 2 show the accuracy in space and time (x axes) of the GLOVE-generalised samples in a 2-anonymised dataset, expressed as the average, median, and first and third quartiles of the overall distribution. Different points on the curves map to thresholds (tags along curves) beyond which the generalisation cost is considered excessive, and samples are discarded: clearly, lower thresholds yield better precision but discard more samples (y axis).
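The idea of per-sample generalisation with suppression thresholds can be sketched as follows for k = 2. This is not the GLOVE algorithm, whose actual design is described in [3]; it is a toy illustration under strong simplifying assumptions: each sample is a box `(x0, x1, y0, y1, t0, t1)` in metres and minutes, the two trajectories have the same number of samples and are paired in time order, and all names (`merge_samples`, `extent`, `merge_trajectories`, `point`) are hypothetical.

```python
def merge_samples(a, b):
    """Smallest box covering two samples; a box is (x0, x1, y0, y1, t0, t1)."""
    return (min(a[0], b[0]), max(a[1], b[1]),
            min(a[2], b[2]), max(a[3], b[3]),
            min(a[4], b[4]), max(a[5], b[5]))

def extent(box):
    """Spatial extent (largest side, metres) and temporal extent (minutes)."""
    return max(box[1] - box[0], box[3] - box[2]), box[5] - box[4]

def merge_trajectories(traj_a, traj_b, max_space, max_time):
    """Generalise two time-sorted trajectories into one shared trajectory.

    Simplifying assumption: equal lengths, samples paired in time order.
    Pairs whose merged box exceeds either threshold are suppressed.
    """
    kept, suppressed = [], 0
    for a, b in zip(traj_a, traj_b):
        box = merge_samples(a, b)
        space, duration = extent(box)
        if space <= max_space and duration <= max_time:
            kept.append(box)   # generalised just enough to cover both samples
        else:
            suppressed += 1    # too costly to hide: discard instead
    return kept, suppressed

def point(x, y, t):
    """A sample at full accuracy is a degenerate box."""
    return (x, x, y, y, t, t)

# Hypothetical toy data: a shared commute plus one diverging evening sample.
alice = [point(100, 200, 480), point(5100, 5200, 1080), point(5300, 5000, 1140)]
bob = [point(150, 250, 490), point(5150, 5250, 1085), point(20500, 300, 1260)]

kept, suppressed = merge_trajectories(alice, bob, max_space=500, max_time=30)
print(kept)        # two boxes, each about 50 m wide and at most 10 min long
print(suppressed)  # the diverging pair is dropped
```

If both users are published with the same `kept` list, they are indistinguishable from each other, and the shared commute samples lose only tens of metres and minutes of accuracy. The diverging pair would have required a box of about 15 km and two hours, so it is suppressed instead of degrading the rest of the data; raising the thresholds would keep it at the price of lower precision.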

Overall, one can expect good precision (in the order of hundreds of metres in space and of tens of minutes in time) in the anonymised data; fine-tuning is then possible by using the suppression thresholds above.

For a detailed description of our analysis, the GLOVE algorithm, and the performance evaluation, we refer the reader to [3].