[Editor’s note: The Center for Information Technology Policy (CITP) is delighted to welcome Arvind Narayanan as an Assistant Professor in Computer Science, and an affiliated faculty member in CITP. Narayanan is a leading researcher in digital privacy, data anonymization, and technology policy. His work has been widely published, and includes a paper with CITP co-authors Ed Felten and Joseph Calandrino. In addition to his core technical research, Professor Narayanan will be engaged in active public policy topics through projects such as DoNotTrack.us, and is sought as an expert in the increasingly complex domain of privacy and technology. He was recently profiled on Wired.com as the “World’s Most Wired Computer Scientist.”]

I’ve had a wonderful first month at Princeton as an assistant professor in Computer Science and CITP. Let me take a quick moment to introduce myself.

I’m a computer scientist by training; I study information privacy and security, and in the last few years have developed a strong side-interest in tech policy. I did my Ph.D. at UT Austin and more recently I was a post-doctoral researcher at Stanford and a Junior Affiliate Fellow at the Stanford Law School Center for Internet and Society.

Yesterday, I described a simple scenario where a plaintiff, who is having difficulty identifying an alleged online defamer, could benefit from subpoenaing data held by a third party web service provider. Some third parties—like Facebook in yesterday’s example—know exactly who I am and know whenever I visit or post on other sites. But even when no third party has the whole picture, it may still be possible to identify me indirectly, by combining data from different third parties. This is possible because loading one webpage can potentially trigger dozens of nearly simultaneous web connections to various third party service providers, whose records can then be subpoenaed and correlated.

Suppose that I post an anonymous and potentially defamatory comment on a Boing Boing article, but Boing Boing for some reason is unable to supply the plaintiff with any hints about who I am—not even my IP address. The plaintiff will only know that my comment was posted publicly at “9:42am on Fri. Feb 5.” But as I mentioned yesterday, Boing Boing—like almost every other site on the web—takes advantage of a handful of useful third party web services.

For example, one of these services—for an article that happens to feature video—is an embedded streaming media service that hosts the video that the article refers to. The plaintiff could issue a subpoena to the video service and ask for information about any user that loaded that particular embedded video via Boing Boing around “9:42am on Fri. Feb 5.” There might be one user match or a few user matches, depending on the site’s traffic at the time, but for simplicity, say there is only one match—me. Because the video service tracks each user with a unique persistent cookie, the service can and probably does keep a log of all videos that I have ever loaded from their service, whether or not I actually watched them. The subpoena could give the plaintiff a copy of this log.

In perusing my video logs, the plaintiff may see that I loaded a different video, earlier that week, embedded into an article on TechCrunch. He may notice further that TechCrunch uses Google Analytics. With two more subpoenas—one to TechCrunch and one to Google—and some simple matching up of dates and times from the different logs, the plaintiff can likely rebuild a list of all the other Analytics-enabled websites that I’ve visited, since these will likely be noted in the records tied to my Analytics cookie.

The bottom line: From the moment I first load that video on Boing Boing, the plaintiff gains the power to traverse multiple silos of data, held by independent third party entities, to trace my activities and link my anonymous comment to my web browsing history. Given how heavily I use the web, my browsing history will tell the plaintiff a lot about me, and it will probably be enough to uniquely identify who I am.

But this is just one example of many potential paths that a plaintiff could take to identify me. Recall from yesterday that when I visit Boing Boing, the site quietly forwards my information to the servers of at least 17 other parties. Each one of these 17 is a potential subpoena target in the first round of discovery. The information culled from this first round—most importantly, what other websites I’ve visited and at what times—could inform a second round of subpoenas, targeted to these other now-relevant websites and third parties. From there, as you might already be able to tell, the plaintiff can repeat this data linking process and expand the circle of potentially identifying information.

A recent privacy study from Berkeley shows how far such a strategy might reach. The Berkeley researchers found that nearly all of the top 100 sites on the web contain some sort of “web bug,” another term for the hidden web connection that allows a third party to automatically track a user on the site. Some of these sites will load dozens of web bugs on each page visit, which will litter user data far and wide on third party servers. Moreover, the study found that Google Analytics—by far the most popular website statistics service—was used by more than 70% of all sites they surveyed in March 2009. Once they add other Google-run services like Doubleclick and Adsense into the calculation, this figure rises to 88% of all sites that use some Google service—an astonishingly broad and dominant ability to follow users as they browse the web. But even other smaller, but still popular, third party entities have significant reach across thousands of sites across the web.

The traceability of any given site visitor will still depend on context: the number of third party services used by the site, the popularity of each third party service across the web, the types of identifying data that these parties collect and store, whether the speaker used any online anonymity tools, and many other site-specific factors.

Despite the variability in third party tracing capabilities, the nearly simultaneous connections to a few third party services means that the results of tracing can be combined. By sleuthing through information held in third party dossiers, logs and databases, plaintiffs in John Doe lawsuits will have many more discovery options than they had ever previously imagined.

As David mentioned in his previous post, plaintiffs’ lawyers in online defamation suits will typically issue a sequence of two “John Doe” subpoenas to try to unmask the identity of anonymous online speakers. The first subpoena goes to the website or content provider where the allegedly defamatory remarks were posted, and the second subpoena is sent to the speaker’s ISP. Both entities—the content provider and the ISP—are natural targets for civil discovery. Their logs together will often contain enough information to trace the remarks back to the speaker’s real identity. But when this isn’t enough to identify the speaker, the discovery process traditionally fails.

Are plaintiffs in these cases out of luck? Not if their lawyers know where else to look.

There are numerous third party web services that may hold just enough clues to reidentify the speaker, even without the help of the content provider or the ISP. The vast majority of websites today depend on third parties to deliver valuable services that would otherwise be too expensive or time-consuming to develop in-house. Services such as online advertising, content distribution and web analytics are almost always handled by specialized servers from third party businesses. As such, a third party can embed its service into a wide variety of sites across the web, allowing it to track users across all the sites where it maintains a presence.

Take for example the popular online blog Boing Boing. Upon loading its main page while recording the HTTP session, I noticed that my browser is automatically redirected to domains owned by no fewer than 17 distinct third party entities: 10 services that engage in advertising or marketing, five that embed media or integrate social networking functionality, and two that provide web analytics. By visiting this single webpage, my digital footprints have been scattered to and collected by at least 17 other online entities that I made no deliberate attempt to contact. And each of these entities will likely have stored a cookie on my web browser, allowing it to identify me uniquely later when I browse to one of its other partner sites. I don’t mean to pick on Boing Boing specifically—taking advantage of third party services is a nearly universal practice on the web today, but it’s exactly this pervasiveness that makes it so likely, if not probable, that all of my digital footprints together could link much of my online activities back to my actual identity.

To make this point concrete, let’s say I post a potentially defamatory remark about someone using a pseudonym in the comments section of a Boing Boing article. It happens that for each article, Boing Boing displays the number of times that the article has been shared on Facebook. In order to fetch the current number, Boing Boing redirects my browser to api.facebook.com to make a real-time query to the Facebook API. Since I happen to be logged in to Facebook at the time of the request, my browser forwards with the query my unique Facebook cookie, which includes information that explicitly identifies me—namely, my e-mail address that doubles as my Facebook username.

In order to integrate a bit of useful social networking functionality, Boing Boing enables Facebook, a third party in this situation, to learn which articles I visit on Boing Boing and the dates and times of my visits. The same is true for Tweetmeme, which can now positively link my Twitter account—which I’m also logged in to—with my Boing Boing visits. Even without an authenticated login, the 15 other third parties present on Boing Boing could track me using any number of different methods, including browser fingerprinting, to build detailed dossiers that slowly begin to piece together who I am.

From the perspective of a plaintiff’s lawyer, even if Boing Boing is unwilling or unable to produce any useful information, these third parties might be able to uniquely identify me as the likely defamer, or at least narrow the list of possible speakers down to a handful of users. But tracing speech is not always this easy. Tomorrow, I’ll discuss more complicated discovery strategies and the extent to which they are technically feasible.

Freedom to Tinker is hosted by Princeton's Center for Information Technology Policy, a research center that studies digital technologies in public life. Here you'll find comment and analysis from the digital frontier, written by the Center's faculty, students, and friends.