Tracking the Trackers: Where Everybody Knows Your Username

Click the local Home Depot ad and your email address gets handed to a dozen companies monitoring you. Your web browsing, past, present, and future, is now associated with your identity. Swap photos with friends on Photobucket and clue a couple dozen more into your username. Keep tabs on your favorite teams with Bleacher Report and you pass your full name to a dozen again. This isn't a 1984-esque scaremongering hypothetical. This is what's happening today.

[Update 10/11: Since several readers have asked – this study was funded exclusively by Stanford University and research grants to the Stanford Security Lab. It was not supported by any advocacy organization.]

In the language of computer science, clickstreams – browsing histories that companies collect – are not anonymous at all; rather, they are pseudonymous. The latter term is not only more technically appropriate, it is much more reflective of the fact that at any point after the data has been collected, the tracking company might try to attach an identity to the pseudonym (unique ID) that your data is labeled with. Thus, identification of a user affects not only future tracking, but also retroactively affects the data that's already been collected. Identification needs to happen only once, ever, per user.

Arvind noted five ways in which a user's identity may be associated with third-party web tracking data.

A third party is also a first party, e.g. Facebook, Twitter, or Google+.

A first party hands off ("leaks") identifying information to a third party.

A third party buys identifying information from a "matching service."

A third party exploits a security vulnerability to learn a user's identity.

A third party "deanonymizes" its data by matching it against identified data.

This post is an empirical study of identifying information leakage from first-party websites to third-party websites.[1]

Web Information Leakage

Leakage most often occurs when a first-party website stuffs information into a URL. For example, suppose Example Website sends users after they register to:

Third parties embedded in the page will receive the URL in a referrer header or equivalent[2] – and therefore Leland Stanford's username, name, and email.
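To make the mechanism concrete, here is a minimal sketch of how a third party could pull identifying values out of the referrer it receives. The URL, the parameter names, and the persona values below are hypothetical and chosen only for illustration; they are not the example URL from this post or any real site's URL scheme.

```python
from urllib.parse import parse_qsl, urlsplit

# Hypothetical post-registration URL (an assumption for illustration only).
PAGE_URL = (
    "http://example.com/welcome"
    "?username=lstanford&name=Leland+Stanford&email=leland%40example.com"
)

# Hypothetical known values for a test persona.
PERSONA = {
    "username": "lstanford",
    "name": "Leland Stanford",
    "email": "leland@example.com",
}


def leaked_fields(referrer, persona):
    """Return the persona fields whose values appear in the referrer's query string."""
    # parse_qsl percent-decodes values and turns '+' into a space.
    query_values = [value.lower() for _, value in parse_qsl(urlsplit(referrer).query)]
    return {
        field
        for field, value in persona.items()
        if any(value.lower() in query_value for query_value in query_values)
    }
```

Any embedded resource requested from this page would carry the full URL as its referrer, so calling `leaked_fields(PAGE_URL, PERSONA)` on the receiving side recovers all three fields with nothing more than standard query-string parsing.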

Another common form of leakage is through the page title. Suppose a website's landing page includes a title tag of:

Welcome, Leland Stanford!

Embedded third-party scripts often report back with the page title; in this case, they'd include Leland Stanford's name.

[Update 10/11: The original version of this post conflated the information OkCupid provides to Lotame and BlueKai. In the interest of complete accuracy, and in response to both a deluge of questions on OkCupid's intentional leakage and a note from BlueKai seeking clarification, I have updated this section with per-company intentional leakage. I have also included the results of a leakage test (with the methodology described below) on OkCupid. My apologies to BlueKai for the incorrect implication that it collects the same sensitive profile data that Lotame does. The ambiguous discussion was solely my error.]

Leakage, in common parlance, implies unintentionality. In computer security, leakage is a term of art for an information flow – some instances of leakage are entirely intentional. For example, OkCupid, a free online dating website, appears to sell user profile information to the data providers BlueKai and Lotame. To learn which profile information OkCupid leaks, I modified each field of a profile and observed how values sent to the two companies changed. Here's what the companies appeared to receive:

(I also ran the leakage test described below on OkCupid. The username was sent to 27 third-party PS+1s (defined below), including crwdcntrl.net (Lotame) and bluekai.com (BlueKai). Since OkCupid does not limit who can see a profile – a user can only require that visitors be logged in – a username provides access to a user's entire profile.)

In a series of groundbreaking studies, Balachander Krishnamurthy, Craig Wills, and Konstantin Naryshkin have demonstrated that information leakage is a pervasive problem (1, 2, 3). In their most recent paper, the authors examined signup and interaction with 120 popular sites for information leakage to third parties. They found that 56% leaked some form of private information, and 48% leaked a user identifier.

We roughly followed the same methodology as Krishnamurthy, Wills, and Naryshkin, with 1) a focus on identifying information leakage, 2) a greater number of sites, and 3) a public dataset.

Usernames as Identifying Information

Given the sizeable role usernames play in web information leakage, it's worth taking a moment to note how a username is identifying information. In some cases a username is just a user's name – for example, @jonathanmayer on Twitter. Even when it isn't the user's name, a username is often more than adequate for identifying a user.

Second, combining data from multiple accounts often provides a sufficiently comprehensive mosaic to identify an individual.[4] Arvind, for example, usually goes by the username "randomwalker." The first page of a Google search turned up his yCombinator Hacker News account, which includes his job and links to his personal website, blog, and Twitter account.

Some websites (e.g. Quantcast) have responsibly recognized that a username is identifying information and have included username in their legal definition of "personally identifiable information" (PII).

For each of the 185 websites that met all three criteria, we used the FourthParty web measurement platform to create an account and interact with the site.[5] We emphasized exploring content that dealt with a user's identity, such as profile and settings pages. After collecting data, we searched Request-URIs and Referrer headers for known personal information. We treated each public suffix + 1 (PS+1) as an independent entity, and we considered any PS+1 different from a first party's to be a third party.[6]
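The search step can be sketched roughly as follows. This is an illustrative approximation, not the FourthParty analysis code: the request-record shape (`url` and `referer` keys), the host names, and the persona are hypothetical, and the three-entry suffix set is a stand-in for the full Public Suffix List that a real measurement would need.

```python
from urllib.parse import unquote_plus, urlsplit

# Assumption: a tiny stand-in for the Public Suffix List.
PUBLIC_SUFFIXES = {"com", "net", "org", "co.uk"}


def ps_plus_one(host):
    """Return the public suffix plus one label, e.g. 'a.example.co.uk' -> 'example.co.uk'."""
    labels = host.lower().rstrip(".").split(".")
    # Scan from the longest candidate suffix to the shortest.
    for i in range(len(labels)):
        if ".".join(labels[i:]) in PUBLIC_SUFFIXES:
            return ".".join(labels[max(i - 1, 0):])
    return host.lower()


def find_leaks(first_party_host, requests, persona):
    """Map each third-party PS+1 to the persona fields seen in its Request-URI or referrer."""
    first_party = ps_plus_one(first_party_host)
    leaks = {}
    for req in requests:
        recipient = ps_plus_one(urlsplit(req["url"]).hostname or "")
        if recipient == first_party:
            continue  # same PS+1 as the visited site, so not a third party
        # Decode percent-encoding so values nested inside parameters still match.
        haystack = unquote_plus(req["url"] + " " + req.get("referer", "")).lower()
        for field, value in persona.items():
            if value.lower() in haystack:
                leaks.setdefault(recipient, set()).add(field)
    return leaks


# Hypothetical persona and request log, for illustration only.
PERSONA = {"username": "lstanford", "name": "Leland Stanford"}
REQUESTS = [
    {"url": "http://tracker.example.net/b?ref=http%3A%2F%2Fphotos.example.com%2Fuser%2Flstanford",
     "referer": "http://photos.example.com/user/lstanford"},
    {"url": "http://static.example.com/app.js",
     "referer": "http://photos.example.com/user/lstanford"},
    {"url": "http://ads.example.org/pixel.gif?title=Welcome%2C+Leland+Stanford%21",
     "referer": "http://photos.example.com/"},
]
```

Plain substring matching like this is simple but blunt. It only decodes one layer of URL encoding and will not see values that are hashed, obfuscated, or sent in POST bodies or cookies.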

Results

A complete spreadsheet of results is available in Excel format. We encourage interested readers to examine the results for themselves. [Update 10/22: Before consulting the spreadsheet, please be sure to read Footnote 6 to understand the limitations of our methodology.] Please email if you would like FourthParty logs for a specific site.

The most frequent type of leakage was a username or user ID.[7] We identified username or user ID leakage to a third party on 113 websites, 61% of the websites in our sample. The top five PS+1 recipients of username and user ID leakage were:

scorecardresearch.com (comScore), on 81 (44%) of the websites in our sample

google-analytics.com (Google Analytics), on 78 (42%) of the websites in our sample

quantserve.com (Quantcast), on 63 (34%) of the websites in our sample

doubleclick.net (Google Advertising), on 62 (34%) of the websites in our sample

facebook.com (Facebook), on 45 (24%) of the websites in our sample
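For readers who want to check the percentages, the arithmetic is simple: each count is divided by the 185 websites in the sample and rounded to the nearest whole percent. The sketch below reproduces it; the variable and function names are ours, and the counts are copied from the list above.

```python
SAMPLE_SIZE = 185  # websites in the sample

# Number of sampled websites that leaked a username or user ID to each PS+1.
RECIPIENT_COUNTS = {
    "scorecardresearch.com": 81,
    "google-analytics.com": 78,
    "quantserve.com": 63,
    "doubleclick.net": 62,
    "facebook.com": 45,
}


def share(count, total=SAMPLE_SIZE):
    """Percentage of sampled websites, rounded to the nearest whole percent."""
    return round(100 * count / total)


RECIPIENT_SHARES = {domain: share(count) for domain, count in RECIPIENT_COUNTS.items()}
```

The same function gives 61% for the 113 websites that leaked a username or user ID to at least one third party.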

Some websites leaked the username or user ID to dozens of third parties. For example, popular photo sharing website Photobucket embeds username in many of its URLs, and includes advertising on most of its pages; we observed the username get sent to 31 third-party PS+1s.

Other identifying information leaked in a number of instances. A sample:

Viewing a local ad on the Home Depot website sent the user's first name and email address to 13 companies.

Entering the wrong password on the Wall Street Journal website sent the user's email address to 7 companies. [Update 10/11: A number of readers have written in noting that the Wall Street Journal leak is not in our spreadsheet. We identified the Wall Street Journal leak in a different browsing session from the one reported in the spreadsheet – and by accident. In the interest of consistency – we did not test logging out and logging back in on other sites, nor logging in with the wrong password – we decided to discuss the leak in our post but not our spreadsheet.]

Changing user settings on the video sharing site Metacafe sent first name, last name, birthday, email address, physical address, and phone numbers to 2 companies.

Signing up on the NBC website sent the user's email address to 7 companies.

The mandatory mailing list page during CNBC signup sent the user's email address to 2 companies.

Clicking the validation link in the Reuters signup email sent the user's email address to 5 companies.

Interacting with Bleacher Report sent the user's first and last names to 15 companies.

Interacting with classmates.com sent the user's first and last names to 22 companies.

Implications

From a legal perspective, identifying information leakage is a debacle. Many first-party websites make what would appear to be incorrect, or at minimum misleading, representations about not sharing PII. Here are some examples.

Personal Information Disclosure: The Home Depot will not trade, rent or sell your personal information, without your prior consent, except as otherwise set out herein. [Does not describe sharing with third-parties for advertising or analytics.]

We will not sell, rent, or share your Personal Information with these third parties for such parties' own marketing purposes, unless you choose in advance to have your Personal Information shared for this purpose. Information about your activities on our Online Services and other non-personally identifiable information about you may be used to limit the online ads you encounter to those we believe are consistent with your interests. Third-party advertising networks and advertisers may also use cookies and similar technologies to collect and track non-personally identifiable information such as demographic information, aggregated information, and Internet activity to assist them in delivering advertising on our Online Services that is more relevant to your interests.

We do not tie the information gathered by Quantcast Tags to the personally identifiable information of visitors to a Web site. . . . We do not link Log Data to any other Personally Identifiable Information about you or otherwise attempt to discover your identity.

We don't collect or serve ads based on personally identifying information without your permission.

The better practice for all first-party and third-party websites would be to acknowledge that identifying information leakage is a fact of life on the web, and that identifying information may be shared with third parties.

As for policy, some strands of the Do Not Track debate echo a sentiment of "it's all anonymous," and so, "where's the harm?" We believe there is now overwhelming evidence that third-party web tracking is not anonymous. It is a legitimate policy question whether, on balance, Do Not Track should be enforced by law. But the difficult weighing of competing privacy risks and economics can't be short-circuited by claims of anonymity.

Thanks to Arvind Narayanan for comments on a draft.

[1] For purposes of this post, "identifying information" is information that with moderate probability and moderate effort can be used to identify a user. This post does not use a formulaic legal definition of "personally identifiable information" (PII), an approach that has been discredited by a growing body of computer science research. The Federal Trade Commission staff notably rejected the notion of PII in its draft privacy report last year.

[2] Some third parties encode the referring URL into their Request-URI.

[3] A username isn't, of course, all a third party has to go on. IP geolocation is another trivial source of information, and can help disambiguate when several individuals use similar usernames. How many Jonathan Mayers are there in Palo Alto, CA? Using the Stanford University network? This is a possible area for future research.

[4] While it is quite clear that in practice a username can often be used to discern a user's identity, confirmatory empirical research would be valuable.

[5] We used a fictional persona with unique biographical traits to minimize false positives.

[6] For readers who engage in detail with our data, we wish to emphasize several caveats to our methodology.

We did not study – and cannot study – what companies do when they receive personal information. It is likely that many of the information leaks we identified were logged. Some third parties may take precautions to prevent logging of identifying information, and we certainly laud such efforts. But for policy purposes, there is a tremendous difference between a tracking ecosystem that is anonymous and a tracking ecosystem that is suffused with identity but promises to ignore it.

Since some websites host content from multiple PS+1s (e.g. amazon.com and amazonaws.com), our definition of a third party introduces some false positives. That said, our findings appear to be quite robust. For example, thresholding for leakage at more than three third parties still leaves 84 websites (45%) leaking a username or user ID.

We did not examine POST request bodies or cookies, nor did we attempt to identify obfuscated or encrypted personal information.

Our interaction with websites was neither comprehensive nor representative of what the average user might do. We may have missed information leaks, and some of the information leaks we identified may have affected only a minority of users.

In the course of a user's browsing, identifying information for other users might leak. We did not gauge how easily a third party could identify which information was the user's. In most cases it appeared such a determination would be straightforward.

The regular expressions we used for matching birth year, birthday, gender, and last name had a not insignificant number of false positives. We recommend against relying solely upon those fields.

We did not explicitly take note of which stage of signup a leak occurred at.

We did not use a single sign-on (SSO) provider unless required. Where an SSO was mandatory, we manually labeled PS+1s associated with the SSO provider as first-party. Measuring information leakage when SSOs are used is a promising avenue for future research.

We did not attempt to discover third parties that have been CNAMEd into a first-party PS+1 (dubbed "hidden third-parties" in some papers).

[7] User IDs were, in our testing, almost always sufficient to locate at least a username, and sometimes additional identifying information. For example, with a Causes.com user ID, anyone can attain a link to a user's Facebook profile – which in turn provides a name, photo, and possibly more.

[8] Please note: we are not claiming any company has breached its self-regulatory commitments. The Digital Advertising Alliance (DAA) online advertising self-regulation imposes lax restrictions on personally identifiable information. First, personally identifiable information is defined to only include information that is used to identify a user.

Personally Identifiable Information is information about a specific individual including name, address, telephone number, and email address—when used to identify a particular individual.

Second, the DAA principles only require noting the use of PII in a privacy policy and getting consent to retroactively use PII before the privacy policy change.

PII is a term used primarily in two areas in the Principles and Commentary. First, PII is used in the Transparency principle so that consumers are informed specifically about the collection and use of PII for Online Behavioral Advertising purposes. Second, PII is used in this Commentary to describe a specific example of a "material" change that would require Consent from the consumer under Principle V.

Comments

You can disable your cookies and remove connections like twitter and facebook. Some sites will not allow you access if you disable your cookies.

Coming from a marketer's POV, we strive to make sure that data used for ad tracking, campaign management, and other purposes is used internally only.

Although what Jonathan has revealed here is surprising, it should be remembered that the technology is still pretty new, and hopefully there will be some regulations that hinder websites from infringing on their visitors' right to privacy. The UK is already passing laws that prohibit sites from collecting information without stating their purpose for collecting it.

The only way to protect yourself 100% is get off the internet. No wait, that isn't full protection either :(

What you can do is create false information to confuse the databases. Different user names and logins, different birthdays etc. Have multiple throw away email addresses. Block cookies ...and if you have to accept them to use a service, mess with the data in the cookies :) and then delete the cookies when you're done. Use proxies to hide your IP when browsing.

And for goodness sake don't use sites like Facebook where you have to disclose REAL personal information. That's just dumb.

Remember that offline data goes online too. So don't use loyalty cards (or better still use other people's cards!). Don't deal with companies that ask for too much personal information.

But I'm not an expert in this. If the author of the article above wants to put together a more detailed guide I'd be delighted to read it.

I was combing your spreadsheet for more on that juicy nugget you dropped in your blog post about okcupid. I cannot find okcupid (or lotame) in your spreadsheet. am I just missing something? where did that particular data point come from?

Home Depot said it was still “researching carefully to determine if anything unusual occurred” but it believed it had not contravened its stated privacy policy, which was designed “to improve our product and service offerings and to enhance and personalise our customers’ shopping experiences.”

Even though you might only be sending the referring URL to these companies, there is nothing stopping them from indexing the page at a later time and learning what I'm interested in.

It's how the web works. Nothing is "free". If you're not paying for something then you're the something being sold.

However, it is important for sites to make sure they don't expose usernames and email addresses, as this is a security flaw. All reputable data collection companies don't collect or store this data. I wish this discussion would focus on this type of data leak instead.

Researcher Mayer writes this report with a foregone conclusion. This conclusion is evident from his use of terms like "lax restrictions" and "'identifying information' is information that with moderate probability and moderate effort can be used to identify a user."

I find it shocking that a school like Stanford would allow such research and conclusions without fully defining terms and methodologies. Failures include the reason for the exclusion of sites because "did not include so many features as to be impractical for study" (what does this mean?).

Also lacking is whether or not, upon sign-up at the sites, Mr. Mayer agreed to terms of use on each site stating that the information would be shared.

Did Mr. Mayer go directly to the site or did he click on ads from different locations? If so did he take into account cookie-stuffing at the ad level?

Also as in all of Mr Mayer's published "research" what was the impact if the user deleted cookies, browser history or cache?

Seems Mr Mayer is trying to make himself relevant on faulty research that Stanford would be wise to fully review before allowing their name to be associated with.

I love the excel spreadsheet that was shared. And, I am glad websites are sharing my personal information with each other. You know why? Because now I can stuff them with junk data. All you have to do is every time you create an account with a first-party like cbsnews.com, enter bogus information about birthdate, address etc but keep your name unique. Let them share it with some advertiser. Now, go to another site like nba.com and do the same - all bogus info except name. Again, they share it with the advertiser. Now the advertiser has two profiles for the same unique name. You can see where this is going. The beauty of automated data collection and sharing is you can turn it against them easily.

Remember, unless the site/service needs to verify your personal info against some government-issued ID, always enter false information but never the same in two places. And make sure you always log in to these first-party sites and enable all their bling - cookies/whatever. The more false information they collect and share, the more messed up your profile will get in their databases.

Really interesting read Jonathan. I will have to start visiting more websites on llama farming and watermelon seed spitting to really help all of these companies watching us build a hilariously inaccurate dossier on me.
