Wednesday, October 23, 2013

Understanding cookies through user studies

A cookie is a small piece of data stored in the browser by websites. Although cookies are mostly invisible, they serve many purposes such as saving items in shopping carts, authenticating to websites, and displaying targeted ads or other personalized content. Understanding more about how websites use cookies allows us to write tools that manage cookies effectively.

In June 2013, the Mozilla User Research team ran a paid study of 573 Firefox users that included data on cookie and browsing events. The user population was census-balanced and included only US users. The study ran for a median of 18.8 days, during which time we observed 12.4 million attempts to set cookies by examining HTTP Set-Cookie headers. Each Set-Cookie header counts as a single event, even though it may contain multiple cookies. Storing multiple pieces of information across separate cookies and combining them into a single cookie are equally powerful. Set-Cookie headers are not the only method for setting cookies, but they are sufficiently prevalent to be representative. We did not observe read events due to volume constraints. We observed 2.84 million pages loaded, measured by counting tab-ready events.

| N = 573 | Tab-ready events | Set-Cookie events | Tab-ready events/day |
|---|---|---|---|
| Median | 3552 | 12297 | 189 |
| Total | 2842270 | 12439439 | |

Counting origins

Throughout this post we use top-level domains (from the Public Suffix list) plus one component to count origins. For example, we consider foo.example.com and bar.example.com to represent the same origin. The public suffix mechanism is not perfect, because a single organization may own many origins (e.g., doubleclick.net and google.com both belong to Google).
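
The origin rule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used in the study: `PUBLIC_SUFFIXES` is a hypothetical, hand-picked stand-in for the real Public Suffix list (which is far larger and has wildcard and exception rules), and `origin_of` is a name invented for this example.

```python
# Hypothetical, tiny stand-in for the Public Suffix list. The real list has
# thousands of entries plus wildcard and exception rules not handled here.
PUBLIC_SUFFIXES = {"com", "net", "org", "co.uk"}


def origin_of(hostname: str) -> str:
    """Return the public suffix plus one component, e.g. 'example.com'."""
    host = hostname.lower().rstrip(".")
    labels = host.split(".")
    # Candidates are tried from longest to shortest, so "co.uk" is matched
    # before a shorter suffix could be.
    for i in range(len(labels)):
        suffix = ".".join(labels[i:])
        if suffix in PUBLIC_SUFFIXES:
            if i == 0:
                return host  # the hostname is itself a bare public suffix
            return ".".join(labels[i - 1:])
    # Unknown suffix: fall back to the last two labels (an assumption).
    return ".".join(labels[-2:])


same = origin_of("foo.example.com") == origin_of("bar.example.com")
```

With this rule, foo.example.com and bar.example.com both collapse to example.com, while doubleclick.net and google.com remain separate origins, which is the limitation noted above.
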
In total, study users visited 40682 unique origins (counted by tab-ready events) and received set-cookie events from 32786 unique origins. Below is the distribution of cookie events per tab event.

Who uses cookies?

Building effective cookie management tools requires understanding who sets cookies. Cookie activity is difficult to characterize because sites vary highly in both the number of cookies they set and the amount of third-party content (which may set cookies on behalf of the third-party site) that they include. Although each page load event incurs on average around 4.4 Set-Cookie events (12.4 million Set-Cookie events over 2.84 million page loads), many sites incur an order of magnitude more.

The graph below shows the 20 origins responsible for the most set-cookie events. These origins represent 0.05% of unique cookie-setting origins and are responsible for 42.7% of set-cookie events seen in the study data. Set-cookie attempts are either first-party, where the origin of the cookie being set is the same as the one in the location bar, or third-party, where the origins don't match.
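
As a rough sketch of the first-party/third-party distinction, assume each set-cookie event has already been reduced to a pair of origins: the origin setting the cookie and the origin shown in the location bar. The function name `classify` and the sample events are hypothetical and are not taken from the study data.

```python
from collections import Counter


def classify(cookie_origin: str, location_bar_origin: str) -> str:
    """Label a set-cookie event by comparing the two origins."""
    if cookie_origin == location_bar_origin:
        return "first-party"
    return "third-party"


# Hypothetical events: (origin setting the cookie, origin in the location bar)
events = [
    ("example.com", "example.com"),
    ("adnxs.com", "example.com"),
    ("facebook.com", "facebook.com"),
    ("facebook.com", "example.org"),
]

counts = Counter(classify(cookie, bar) for cookie, bar in events)
third_party_share = counts["third-party"] / len(events)
```

Note that the same origin (facebook.com here) can land in either bucket depending on which page the user is viewing.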

Who uses third-party cookies?

Third-party cookies have many purposes. For example, social widget implementations usually rely on third-party cookies to display personalized content, and inline ads rely on third-party cookies to provide targeted ads and perform frequency capping. Of the 12.4 million set-cookie events in the study, 50.4% are for third-party cookies (shown in red in the graph above).

The graph below shows the top 20 origins setting third-party cookies, responsible for 41.1% of third-party set-cookie events. adnxs.com belongs to AppNexus, an ad exchange. Facebook sets mostly first-party cookies, but because Facebook's social widgets are included on many sites, Facebook sets many third-party cookies (which may have originally been created in a first-party context). Of the top 20 origins, 18 primarily offer advertising services.
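
A "top N origins account for X% of events" figure like the one above can be computed with a simple tally. The sketch below assumes the input is a list with one origin per set-cookie event; `top_n_share` and the sample data are hypothetical.

```python
from collections import Counter


def top_n_share(origins, n):
    """Return the n most frequent origins and their share of all events."""
    counts = Counter(origins)
    if not counts:
        return [], 0.0
    top = counts.most_common(n)
    share = sum(count for _, count in top) / sum(counts.values())
    return top, share


# Hypothetical sample: one entry per third-party set-cookie event.
sample = ["adnxs.com"] * 5 + ["doubleclick.net"] * 3 + ["example.com"] * 2
top, share = top_n_share(sample, 2)
```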

It is interesting to compare this data to Table IV from Eubank et al.'s survey on third-party cookies. In the Eubank survey, the authors used simulated data from crawling Alexa's top 500 websites, included all types of third-party embedded data, and did not canonicalize domains using the public suffix list. Even though the methodology is different, many origins in the top 20 overlap.

How many third-party cookies are from origins the user knows?

One interesting question is whether or not users intentionally accept cookies, especially in the case of third-party cookies. We examine two possible heuristics for estimating whether a user interaction with an origin is intentional:

The user has already accepted cookies from the origin (pre-existing cookie condition)

The user has visited the origin by entering it into the location bar (simulated history condition)

Both of these conditions rely on previous interactions. Any potential changes to the way browsers handle third-party cookies must consider what to do with previous interactions (in this case, existing cookies and location bar history).

Pre-existing cookie condition

We did not ask study participants to clear cookies before beginning the study. Of the third-party set-cookie events, 90.8% were sent to users who had already accepted cookies from that origin. The graph below shows this percentage for the top 20 origins that set third-party cookies. In this graph, nearly all origins are above 75%, with the exception of doubleclick.net. This dip can be explained by a handful of users who have a particular security add-on installed.

Simulated history condition

Another heuristic for evaluating if a user has interacted with a site is whether that origin has appeared in the location bar, as measured by tab-ready events. This lets us count third-party origins that have previously appeared in a first-party context.

For each user, we take the entire set of origins extracted from tab-ready events to simulate that user’s history, then count whether the origins in the Set-Cookie events appear in the simulated history. The graph below shows this percentage for the top 20 origins of third-party cookies.
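
The simulated history computation described above might look like the following sketch. It assumes both event streams are available as `(user, origin)` pairs; the function name `simulated_history_share` and the sample data are hypothetical. As in the description, each user's history is built from the entire set of tab-ready events, regardless of when they occurred relative to the Set-Cookie event.

```python
from collections import defaultdict


def simulated_history_share(tab_ready_events, third_party_cookie_events):
    """Fraction of third-party set-cookie events whose origin appears in
    the same user's simulated history (origins from tab-ready events)."""
    history = defaultdict(set)
    for user, origin in tab_ready_events:
        history[user].add(origin)
    if not third_party_cookie_events:
        return 0.0
    hits = sum(
        1
        for user, origin in third_party_cookie_events
        if origin in history.get(user, set())
    )
    return hits / len(third_party_cookie_events)


# Hypothetical sample data: (user, origin) pairs.
tab_ready = [("u1", "facebook.com"), ("u1", "example.com"), ("u2", "example.com")]
cookies = [
    ("u1", "facebook.com"),
    ("u1", "adnxs.com"),
    ("u2", "facebook.com"),
    ("u2", "adnxs.com"),
]
fraction = simulated_history_share(tab_ready, cookies)
```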

Overall, 19.6% of third-party cookie events came from origins in users’ simulated history in the course of the study. Not surprisingly, nearly all users had visited facebook.com and youtube.com, which are currently ranked 2nd and 3rd most visited sites according to Alexa. Interestingly, adnxs.com also appeared much of the time in simulated histories, even though the rank of adnxs.com is currently 576 in the US according to Alexa. Looking through tab-ready events, we found that adnxs.com appeared in redirects and popups.

How long do cookies live?

The Set-Cookie HTTP header has an optional expiration time that tells the browser how long to keep the cookie. As the graph below shows, many cookies are long-lived, possibly longer-lived than the installation of the operating system or browser. Of third-party cookie expiration times, 20% were one week or less and 51% were longer than 6 months.
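
To make the expiration time concrete, here is a minimal sketch of deriving a cookie's lifetime from a Set-Cookie header value. It is a simplified parser written for illustration, not the study's code: `cookie_lifetime` is a hypothetical name, malformed attribute values are not handled, and `received_at` is assumed to be a timezone-aware timestamp. Following RFC 6265, Max-Age takes precedence over Expires, and a cookie with neither attribute lasts only for the session.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def cookie_lifetime(set_cookie_value, received_at):
    """Return the cookie's lifetime as a timedelta, or None for a
    session cookie (no Max-Age and no Expires attribute)."""
    attrs = {}
    # The first ';'-separated part is name=value; the rest are attributes.
    for part in set_cookie_value.split(";")[1:]:
        name, _, value = part.strip().partition("=")
        attrs[name.lower()] = value.strip()
    if "max-age" in attrs:
        return timedelta(seconds=int(attrs["max-age"]))
    if "expires" in attrs:
        return parsedate_to_datetime(attrs["expires"]) - received_at
    return None


received = datetime(2013, 6, 1, tzinfo=timezone.utc)
one_week = cookie_lifetime("id=abc; Max-Age=604800; Path=/", received)
long_lived = cookie_lifetime(
    "id=abc; Expires=Sun, 01 Dec 2013 00:00:00 GMT; Path=/", received
)
session = cookie_lifetime("id=abc; Path=/", received)
```

Bucketing these lifetimes (for example, one week or less versus longer than 6 months) would give a distribution of the kind summarized above.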

What's next?

Data from real users is crucial to understanding how websites use cookies and therefore what kind of technical solutions to cookie management make sense (or if indeed we should be concentrating on cookies at all). We hope that this is just the start of using data to shape our technologies and policies. Please join dev-privacy to continue the discussion.

Many thanks to Gregg Lind for deploying the study and to Jonathan Mayer, Alex Fowler, John Jensen, and Chris Karlof for reviewing this post.