The ability to thoroughly anonymize data has been suspect for years. In 2008, Arvind Narayanan and colleague Vitaly Shmatikov from the University of Texas at Austin discovered how easy it was to unmask customer data from supposedly anonymized Netflix databases (PDF).

Since 2006, other forms of data anonymization (removing Personally Identifiable Information, or PII, from databases) have been found lacking as well. Anonymized web-browsing histories, however, had until now remained unscathed.

No longer anonymous

Narayanan, now an assistant professor of computer science at Princeton, and Stanford researchers Sharad Goel, Ansh Shukla, and Jessica Su set out to determine whether anonymized web-browsing histories are actually anonymous, as ensuring online anonymity is crucial. "Online anonymity protects civil liberties," write the authors. "Users who have their anonymity compromised may suffer harms ranging from persecution by governments to targeted frauds that threaten public exposure of online activities."

"We show—theoretically, via simulation, and through experiments on real user data—that de-identified (anonymous) web-browsing histories can be linked to social media profiles using only publicly available data."

The research team determined that anonymous web-browsing histories can be de-anonymized by linking web-browsing activity to social media profiles. The researchers came to that conclusion after observing that most users follow a distinctive set of other accounts on services such as Twitter, Facebook, or Reddit. "Since users are more likely to click on links posted by accounts that they follow, these distinctive patterns persist in their browsing history," explain the paper's authors. "An adversary can thus de-anonymize a given browsing history by finding the social media profile whose 'feed' shares the history's idiosyncratic characteristics."
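The core idea can be sketched in a few lines of Python. This is a minimal illustration, not the researchers' implementation: the feeds, history, and similarity measure (Jaccard overlap between link sets) below are all hypothetical, chosen only to show how a distinctive feed can single out one profile.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of links."""
    return len(a & b) / len(a | b) if a | b else 0.0

def best_matching_profile(history, feeds):
    """Return the profile whose feed overlaps the browsing history most."""
    return max(feeds, key=lambda user: jaccard(history, feeds[user]))

# Each user's "feed": links posted by the accounts they follow (made-up data).
feeds = {
    "alice": {"nyt.com/a", "arxiv.org/x", "blog.example/post1"},
    "bob":   {"espn.com/b", "reddit.com/r1", "nyt.com/a"},
}

# An anonymous browsing history to de-anonymize.
history = {"arxiv.org/x", "blog.example/post1", "weather.com"}

print(best_matching_profile(history, feeds))  # alice: her feed shares 2 links
```

Even though the history contains links outside any feed, the profile whose feed shares the history's idiosyncratic links scores highest.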

"Such an attack is feasible for any adversary with access to browsing histories," continue Narayanan, Goel, Shukla, and Su. "This includes third-party trackers and others with access to their data (either via intrusion or a lawful request)."

How the attack works

Drawing on prior de-anonymization linkage attacks (PDF) against transactional records, location traces, credit-card metadata, and writing style, the research team created the de-anonymizing attack platform architecture depicted in Figure A. The attack proceeds in three steps:

1. Posit a simple model of web-browsing behavior in which a user's likelihood of visiting a URL is governed by the URL's overall popularity and whether the URL appeared in the user's Twitter feed.

2. Compute the likelihood (under the model) of generating a given anonymous browsing history.

3. Identify the user most likely to have generated that history.
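The steps above can be sketched as a maximum-likelihood search. In this hedged toy version, a URL's visit probability is its overall popularity, multiplied by an assumed boost factor when the URL appeared in a candidate's feed; every number and name below is illustrative, not taken from the paper.

```python
import math

BOOST = 50.0  # assumed multiplier for URLs seen in a feed (illustrative)

def log_likelihood(history, feed, popularity):
    """Log-likelihood that a given user generated this browsing history."""
    score = 0.0
    for url in history:
        p = popularity.get(url, 1e-6)  # base rate of any user visiting the URL
        if url in feed:
            p *= BOOST                 # feed membership raises the odds
        score += math.log(p)
    return score

def most_likely_user(history, feeds, popularity):
    """Step 3: return the candidate maximizing the history's likelihood."""
    return max(feeds, key=lambda u: log_likelihood(history, feeds[u], popularity))

# Hypothetical data: overall URL popularity and per-user feeds.
popularity = {"news.example/1": 0.01, "niche.example/2": 0.0001}
feeds = {
    "carol": {"news.example/1", "niche.example/2"},
    "dave":  {"news.example/1"},
}
history = ["news.example/1", "niche.example/2"]

print(most_likely_user(history, feeds, popularity))
```

Because the rare, niche URL appears only in carol's feed, her likelihood dominates; this mirrors why idiosyncratic links carry most of the de-anonymizing signal.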

The research team's conclusions

Narayanan, Goel, Shukla, and Su state there are many ways in which browsing histories may be de-anonymized online, but most methods are target-specific, adding, "Our attack is significant for its broad applicability. The technique is available to all trackers, including those with whom the user has no first-party relationship."

The researchers then offer examples of where their attack model works:

De-anonymizing a movie rental record based on reviews posted on the web

Long-term intersection attack against an anonymity system based on the timing of a user's tweets or blog posts

"These can be seen as behavioral fingerprints of a user, and our analysis helps explain why such fingerprints tend to be unique and linkable," adds Narayanan.