Cookies that give you away: The surveillance implications of web tracking

[Today we have another announcement of an exciting new research paper. Undergraduate Dillon Reisman, for his senior thesis, applied our web measurement platform to study some timely questions. -Arvind Narayanan]

Over the past three months we’ve learnt that the NSA uses third-party tracking cookies for surveillance (1, 2). These cookies, provided by a third-party advertising or analytics network (e.g. doubleclick.com, scorecardresearch.com), are ubiquitous on the web, and tag users’ browsers with unique pseudonymous IDs. In a new paper, we study just how big a privacy problem this is. We quantify what an observer can learn about a user’s web traffic by purely passive eavesdropping on the network, and arrive at surprising answers.
At first sight it doesn’t seem possible that eavesdropping alone can reveal much. First, the eavesdropper on the Internet backbone sees millions of HTTP requests and responses. How can he associate the third-party HTTP request containing a user’s cookie with the request to the first-party web page that the browser visited, which doesn’t contain the cookie? Second, how can visits to different first parties be linked to each other? And finally, even if all the web traffic for a single user can be linked together, how can the adversary go from a set of pseudonymous cookies to the user’s real-world identity?

The diagram illustrates how the eavesdropper can use multiple third-party cookies to link traffic. When a user visits ‘www.exampleA.com,’ the response contains the embedded tracker X, with an ID cookie ‘xxx’. The visits to exampleA and to X are tied together by IP address, which typically doesn’t change within a single page visit [1]. Another page visited by the same user might embed tracker Y bearing the pseudonymous cookie ‘yyy’. If the two page visits were made from different IP addresses, an eavesdropper seeing these cookies can’t tell that the same browser made both visits. If a third page embeds both trackers X and Y, however, the eavesdropper will know that IDs ‘xxx’ and ‘yyy’ belong to the same user. Applied iteratively, this method has the potential to tie together much of a single user’s traffic.
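
The linking step can be illustrated with a small sketch. This is not the code used in the paper; it assumes the eavesdropper has already grouped the ID cookies seen alongside each page visit (by IP address and timing), and it uses a union-find structure as one plausible way to merge IDs that co-occur. The function name `link_cookies` and the cookie strings are made up for illustration.

```python
from collections import defaultdict


def link_cookies(observations):
    """Cluster cookie IDs that were seen together in the same page visit.

    observations: iterable of sets, one per page visit, each holding the
    tracker ID cookies observed with that visit (hypothetical input format).
    Returns a list of sets; each set is one group of linked IDs.
    """
    parent = {}

    def find(x):
        # Register unseen IDs as their own root, then walk up to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for cookies in observations:
        cookies = list(cookies)
        for cookie in cookies:
            find(cookie)  # make sure singletons are recorded too
        for other in cookies[1:]:
            union(cookies[0], other)  # IDs on one page belong together

    clusters = defaultdict(set)
    for cookie in list(parent):
        clusters[find(cookie)].add(cookie)
    return list(clusters.values())


visits = [
    {"X=xxx"},           # exampleA embeds tracker X
    {"Y=yyy"},           # another page embeds tracker Y, from a different IP
    {"X=xxx", "Y=yyy"},  # a third page embeds both trackers
    {"X=zzz"},           # some other browser
]
clusters = link_cookies(visits)
```

In this toy input, the third visit is what merges ‘xxx’ and ‘yyy’ into one cluster, while ‘zzz’ stays separate because it never co-occurs with either.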

Once we had this idea, we wanted to test if it would actually work in practice. Everything depends on just how densely third-party trackers are actually embedded on sites. We conducted automated web crawls of 65 simulated users’ web browsing over three months, and found that unique cookies are so prevalent that the eavesdropper can reliably link 90% of a user’s web page visits to the same pseudonymous ID. (We omitted pages that embed no ID cookies at all, but those are a minority.)
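
To make the “fraction of visits linked” measure concrete, here is a minimal sketch of how such a number could be computed for one simulated user. It is an assumption about the shape of the calculation, not the paper’s measurement code: two visits are treated as linked if they share any ID cookie, directly or through a chain of other visits, and pages with no ID cookies are dropped first, as in the parenthetical above. The function name and the toy history are invented.

```python
from collections import defaultdict, deque


def largest_linked_fraction(visits):
    """Share of cookie-bearing visits that fall in the largest linked cluster.

    visits: list of sets of ID cookies, one set per page visit
    (hypothetical input format).
    """
    visits = [v for v in visits if v]  # omit pages with no ID cookies
    if not visits:
        return 0.0

    # Index: cookie ID -> positions of the visits that carried it.
    by_cookie = defaultdict(list)
    for i, cookies in enumerate(visits):
        for cookie in cookies:
            by_cookie[cookie].append(i)

    seen = set()
    best = 0
    for start in range(len(visits)):
        if start in seen:
            continue
        # Breadth-first search over visits connected by shared cookies.
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            i = queue.popleft()
            size += 1
            for cookie in visits[i]:
                for j in by_cookie[cookie]:
                    if j not in seen:
                        seen.add(j)
                        queue.append(j)
        best = max(best, size)
    return best / len(visits)


# Toy history: three visits chain together, one is isolated, one has no cookies.
history = [{"x1"}, {"x1", "y1"}, {"y1"}, {"z9"}, set()]
fraction = largest_linked_fraction(history)
```

For this made-up history, three of the four cookie-bearing visits end up in one cluster. The numbers are purely illustrative and say nothing about the crawl results reported above.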

We also found that the cookie linking method is extremely robust and succeeds under a variety of conditions (Section 4.1). We considered how variations in cookie expiration dates, the size of the user’s history (i.e., the number of pages visited), and the types of pages visited affect the eavesdropper’s chances, and found the impact to be minimal. Perhaps most significantly, however, we found that this surveillance method can still link about 50% of a user’s history to the same pseudonymous ID even with just 25% of the current density of trackers on the web. This means that even if 75% of sites or trackers adopt mitigation strategies (such as deploying HTTPS), the eavesdropper still learns a lot.

Finally, we studied how an eavesdropper might learn the real-world identity behind a cluster of web pages associated with a pseudonymous ID. It turns out that this is surprisingly easy: many sites display real-world attributes such as real name, username, or email on unencrypted pages to logged-in users, which means that the eavesdropper gets to see these identifiers. We conducted a survey of such leakage on popular sites, and found that over half of popular sites with account creation leak some form of real-world identity (Section 4.2).
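
As a rough illustration of why this leakage is easy to exploit, the sketch below scans an unencrypted response body for identifier-like strings. It is a simplified assumption of what an eavesdropper’s rule might look like, not the survey tooling from the paper: the two regular expressions (a loose email pattern and a “Welcome, NAME” greeting pattern) and the sample page are invented, and real sites would each need their own patterns.

```python
import re

# Loose email pattern; good enough for a sketch, not a full address grammar.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Hypothetical greeting pattern: "Welcome, First Last" and similar.
GREETING_RE = re.compile(
    r"(?:Welcome|Hello|Hi),?\s+([A-Z][\w.-]*(?:\s+[A-Z][\w.-]*)?)"
)


def find_identifiers(body):
    """Return identifier-like strings found in a plaintext HTTP body."""
    return {
        "emails": sorted(set(EMAIL_RE.findall(body))),
        "names": sorted(set(GREETING_RE.findall(body))),
    }


# Made-up fragment of a page served over plain HTTP to a logged-in user.
page = '<div id="hdr">Welcome, Alice Example <span>alice@example.org</span></div>'
found = find_identifiers(page)
```

Any identifier found this way arrives in the same traffic as the page’s tracker cookies, so it can be attached to the pseudonymous ID cluster that those cookies belong to.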

While it’s no surprise that web traffic contains sensitive information about individuals, what we’ve shown is just how complete a profile can be extracted even if the user’s traffic is mixed with that of millions of other users. Further, an eavesdropper can connect these profiles to real-world identities without needing the co-operation of any websites. While HTTPS deployment by trackers can help, the only practical solution at the current time seems to be for users to install anti-tracking and anonymity tools.

[1] An exception is if the user routes traffic through Tor. Different requests can take different paths and the exit node IPs will be different. Thus, use of Tor with application-layer anonymization (e.g., Tor browser bundle) defeats our attack.

Comments

Efficiently extracting user identity information for unencrypted sites (e.g. Youtube, many Yahoo properties, LinkedIn) is a ~200 LOC script in the Bro IDS (I wrote it to evaluate the difficulty), and each additional site is really just 4 regular expressions to create a new rule.

Extracting the cookie information onto the existing logging infrastructure is only a couple of lines (the log already includes IP, time, page viewed, and referrer), which is sufficient to do the identity chaining.

Having already assumed the conclusion of this study to be true some time ago, I have been using Ghostery for some time along with Disconnect. However, I think this study may show such mitigation efforts are not sufficient.

Even while using Ghostery, I sometimes choose to log on to a site, perhaps to purchase something or consume firewalled data. In this scenario my browser could be communicating my username and password in the open, depending on how the target site implements cookies, as well as an IP address and perhaps a unique browser identifier like a MAC address.

Even with Ghostery installed, wouldn’t this identifying data for the sites I choose to log on to be enough for the eavesdropper to accomplish his goal of identifying me and linking me to a specific data consumption pattern?

Not sure I understand the attack you’re supposing. If you have Ghostery turned on, your browser won’t be talking to anyone other than the first-party site. There won’t be any third-party cookies to make the connections.

I’m surprised there was really no mention here of NoScript (noscript.net), which is fairly useful. I also recommend you take a look at the most recent posts at https://odinn.cyberguerrilla.org/ which may be helpful to you. Cheers.

NoScript breaks half the internet, doesn’t break all trackers, and is a bit like bringing a flamethrower to a fist fight. You’re effectively disabling an entire technology to kill a tiny part of the services using that technology. Plugins such as Ghostery, Disconnect and DoNotTrackMe are a much better choice for this kind of problem.

How likely is it that observers are listening in on backbone junction points (routers)?

From what I understand intelligence agencies typically request compliance from corporations (service providers) which should indicate that they are having a hard time getting the data themselves.

I actually like the idea of being tracked. It should give the collectors a false sense of who I am. To make any use of it, it has to be interpreted, and interpretation by, for instance, law enforcement or intelligence will focus on well-known indicators such as would identify “terrorists” or whatever. Since these indicators are necessarily superficial, their dragnet can never capture the “real” fish, since real people don’t look like stereotypes. Consequently the people who get “caught” are of no real interest and the people who are of real interest don’t get “caught”. Hence, I feel rather safe.

It gives me a warm fuzzy feeling to be tracked by clueless security personnel and systems lol. And the commercial profiles that are being constructed should provide an even better guise :p :p

NoScript just takes a while to train, and you can block GA and other JavaScript immediately by choosing not to trust it.

Disconnect is pretty good, and helps with Google, Facebook, LinkedIn and other widgets.
But DoNotTrack actually ALLOWS tracking by companies that have signed its TOE.

AdBlock does the same.

You need SelfDestructing Cookies, LSO Cleaner, and CleanPlaces for Firefox. One of the biggest dangers is the browser signature, along with the SQLite databases in Firefox, which hold a complete download and website visit history.

On my home network I have my own DNS server, which was set up mainly as a primitive (but nevertheless effective) ad blocker.
All requests going to ad-serving networks are redirected to an in-house dummy web server.
Also, all requests to ‘doubleclick.net’ (and its ad-serving subdomains) are redirected to the trash bin.
I never had any trouble surfing without doubleclick.net cookies.
Could this be a working security measure? Consistently dumping all requests to cookie-serving sites?
What does not leave the house cannot be traced, right?

I do something similar, using a host block list to build my list of hosts to redirect to nowhere. It’s a good privacy measure and, used in combination with NoScript and the like, should be helpful.

Freedom to Tinker is hosted by Princeton's Center for Information Technology Policy, a research center that studies digital technologies in public life. Here you'll find comment and analysis from the digital frontier, written by the Center's faculty, students, and friends.