Apart from the actual message, each tweet carries about 144 fields of metadata that are freely accessible via the Twitter API. According to the researchers, “Each of these fields provides additional information about: the account from which it was posted; the post (e.g. time, number of views); other tweets contained within the message; various entities (e.g. hashtags, URLs, etc); and the information of any users directly mentioned in it.” No matter how anonymous Twitter users think they are, a glance at the metadata is usually enough to trace their tweets back to them and reconstruct their everyday lives.

Anonymization of data is virtually useless

For their study, the researchers trained three different machine learning systems using data from 5 million Twitter users. The systems analyzed 14 metadata fields from these users' tweets, including the time the account was created, the time a tweet was published, and the number of favorites, followers and accounts followed.

This data enabled the systems to identify 1 user in a group of 10,000 with an accuracy of 96.7 %. Even when 60 % of the dataset was obfuscated, the accuracy rate was still 95 %. This shows that subsequent anonymization is virtually useless once personally identifiable information has been collected.
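
To give a feel for why a handful of metadata fields can act like a fingerprint, here is a minimal sketch in Python. It is not the researchers' method: it uses invented, synthetic accounts, hypothetical field names (account_created, followers, following, favorites) and a simple matching rule that counts how many fields of a tweet agree with a known profile. The obfuscation step, which overwrites one field with a random value, is likewise only an assumption made for illustration.

```python
import random

# Hypothetical field names, loosely modelled on the counts named in the article.
COUNT_FIELDS = ("followers", "following", "favorites")


def make_profile(rng):
    """Invent one account's metadata (synthetic values, for illustration only)."""
    profile = {"account_created": rng.randint(2007, 2018)}
    for field in COUNT_FIELDS:
        profile[field] = rng.randint(0, 5000)
    return profile


def tweet_metadata(profile, rng, obfuscate=False):
    """Metadata attached to one tweet: the counts drift slightly between tweets.

    With obfuscate=True, one count field is replaced by a random value.
    """
    meta = dict(profile)
    for field in COUNT_FIELDS:
        meta[field] = max(0, meta[field] + rng.randint(-5, 5))
    if obfuscate:
        meta[rng.choice(COUNT_FIELDS)] = rng.randint(0, 5000)
    return meta


def agreement(meta, profile, tolerance=10):
    """Number of fields on which a tweet agrees with a known profile."""
    score = int(meta["account_created"] == profile["account_created"])
    score += sum(abs(meta[f] - profile[f]) <= tolerance for f in COUNT_FIELDS)
    return score


def identify(meta, profiles):
    """Index of the known profile that agrees with the tweet on the most fields."""
    return max(range(len(profiles)), key=lambda i: agreement(meta, profiles[i]))


def accuracy(profiles, rng, obfuscate):
    """Share of accounts correctly re-identified from one fresh tweet each."""
    hits = sum(
        identify(tweet_metadata(p, rng, obfuscate), profiles) == i
        for i, p in enumerate(profiles)
    )
    return hits / len(profiles)


rng = random.Random(0)
profiles = [make_profile(rng) for _ in range(200)]
clean_accuracy = accuracy(profiles, rng, obfuscate=False)
obfuscated_accuracy = accuracy(profiles, rng, obfuscate=True)
print(clean_accuracy, obfuscated_accuracy)
```

In this toy setup, scrambling a single field rarely hides the author, because the remaining fields still agree with only one profile. The numbers it prints describe the synthetic data alone and say nothing about the accuracy figures reported in the study.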

Metadata reveals more about you than you might think

“People wrongly assume that because the data is online, they aren’t vulnerable to identification,” Beatrice Perez of University College London, a co-author of the research paper, told Wired. No right-thinking person would tell a total stranger what their address is if approached on the street. But they might tell them how often they turn their bedroom light on and off. “That’s the mentality with metadata,” says Perez. “People think it’s not a big deal. But couple it with another piece of information and I know when you’re home or not.” Most people are simply not aware that they can be easily identified by their metadata.

The researchers hope that their work will help to raise awareness of the privacy risks associated with metadata. Their methods can be applied not only to Twitter, but to a vast class of platforms and systems that generate metadata with similar characteristics. The problem becomes particularly relevant when such metadata is publicly accessible via APIs, because then anyone can, in theory, misuse it to identify people.

Unlike most other Internet companies, Cliqz does not store any data that could be used to identify users and create user profiles. This is ensured by our Privacy by Design architecture. We strongly believe that data avoidance is the best protection.