You’re Being Tracked (and Tracked and Tracked) on the Web

The Wayback Machine reveals two decades of Web tracking and third-party requests

The number of third parties sending information to and receiving data from popular websites each time you visit them has increased dramatically in the past 20 years, which means that visitors to those sites may be more closely watched by major corporations and advertisers than ever before, according to a new analysis of Web tracking.

A team from the University of Washington reviewed two decades of third-party requests using the Internet Archive’s Wayback Machine. They found a four-fold increase in the number of requests logged on the average website from 1996 to 2016, and say that companies may be using these requests to track the behavior of individual users more frequently. They presented their findings at the USENIX Security Symposium in Austin, Texas, earlier this month.

The authors—Adam Lerner and Anna Kornfeld Simpson, both PhD candidates, along with collaborators Tadayoshi Kohno and Franziska Roesner—found that popular websites made an average of four third-party requests in 2016, up from less than one in 1996. However, those figures likely underestimate the prevalence of such requests because of limitations in the data contained within the Wayback Machine. Roesner calls their findings “conservative.”

For comparison, a study by Princeton computer science researcher Arvind Narayanan and colleagues that was released in January looked at one million websites and found that top websites host an average of 25 to 30 third parties. Chris Jay Hoofnagle, a privacy and law scholar at UC Berkeley, says his own research has found that 36 of the 100 most popular sites send more than 150 requests each, with one site logging more than 300. The definition of a tracker or a third-party request, and the methods used to identify them, may also vary between analyses.

“It’s not so much that I would invest a lot of confidence in the idea that there were X number of trackers on any given site,” Hoofnagle says of the University of Washington team’s results. “Rather, it’s the trend that’s important.”

Most third-party tracking is done through cookies, snippets of information that are stored in a user’s browser. Those snippets enable users to log in automatically or add items to a virtual shopping cart, but they can also be recognized by a third party as the user navigates to other sites.

For example, a national news site called todaysnews.com might send a request to a local realtor to load an advertisement on its home page. Along with the ad, the realtor can send a cookie with a unique identifier for that user, and then read that cookie from the user’s browser when the user navigates to another site where the realtor also advertises.
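The cross-site recognition described above can be sketched in a few lines. This is a toy simulation, not code from the study: the `TrackerServer` class, the site names, and the cookie handling are all illustrative assumptions standing in for a real ad server and browser.

```python
import secrets

class TrackerServer:
    """Toy third-party ad server: it assigns a unique ID cookie on first
    contact and recognizes that ID on later requests, no matter which
    publisher site embedded the ad."""
    def __init__(self):
        self.visits = {}  # user ID -> list of first-party sites seen

    def serve_ad(self, first_party_site, cookie=None):
        # If the browser sent no cookie, mint a fresh identifier.
        user_id = cookie or secrets.token_hex(8)
        self.visits.setdefault(user_id, []).append(first_party_site)
        return user_id  # the Set-Cookie value the browser will store

tracker = TrackerServer()
# The browser visits two different sites that both load the tracker's ad.
uid = tracker.serve_ad("todaysnews.com")                 # cookie set here
uid2 = tracker.serve_ad("localrealtor.example", uid)     # cookie sent back
assert uid == uid2
# tracker.visits[uid] now records both sites the user visited.
```

The key point is that the tracker never needs the user’s name; a stable identifier plus the list of referring sites is enough to build a browsing history.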

In addition to following the evolution of third-party requests, the team also revealed the dominance of players such as Google Analytics, which was present on nearly one-third of the sites analyzed in the University of Washington study. In the early 2000s, no third party appeared on more than 10 percent of sites. And back then, only about 5 percent of sites sent five or more third-party requests. Today, nearly 40 percent do. But there’s good news, too: pop-up browser windows seem to have peaked in the mid-2000s.

Narayanan says he has noticed another trend in his own work: consolidation within the tracking industry, with only a few entities such as Facebook or Google’s DoubleClick advertising service appearing across a high percentage of sites. “Maybe the world we’re heading toward is that there’s a relatively small number of trackers that are present on a majority of sites, and then a long tail,” he says.

Many privacy experts consider Web tracking problematic because trackers can monitor a user’s behavior as they move from site to site. Combined with publicly available information from personal websites or social media profiles, this behavior can enable retailers or other entities to build identity profiles without a user’s permission.

“Because we don’t know what companies are doing on the server side with that information, for any entity that your browser talks to that you didn’t specifically ask it to talk to, you should be asking, ‘What are they doing?’” Roesner says.

But while every Web tracker requires a third-party request, not every third-party request is a tracker. Sites that use Google Analytics (including IEEE Spectrum) make third-party requests to monitor how content is being used. Other news sites send requests to Facebook so the social media site can display its “Like” button next to articles and permit users to comment with their accounts. That means it’s hard to tell from this study whether tracking itself has increased, or if the number of third-party requests has simply gone up.

Modern ad blockers can prevent sites from installing cookies and have become popular with users in recent years. Perhaps due in part to this shift, the authors also found that the behaviors third parties exhibit have become more sophisticated and wider in scope. For example, one newer tactic avoids cookies altogether by recording a user’s device fingerprint: identifiable characteristics such as the screen size of their smartphone, laptop, or tablet.
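Fingerprinting works by folding many small, observable device traits into one stable identifier. A minimal sketch of the idea follows; the attribute names and hashing scheme are assumptions for illustration (real fingerprinters gather these traits via JavaScript APIs in the browser and use many more signals).

```python
import hashlib

def device_fingerprint(attrs):
    """Combine observable device traits into a single stable identifier.
    Sorting the keys makes the result independent of the order in which
    traits were collected."""
    canonical = "|".join(f"{key}={attrs[key]}" for key in sorted(attrs))
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]

# The same device yields the same ID regardless of collection order,
# with no cookie stored on the machine at all.
fp1 = device_fingerprint({"screen": "1440x900", "tz": "UTC-5", "ua": "Chrome/52"})
fp2 = device_fingerprint({"ua": "Chrome/52", "tz": "UTC-5", "screen": "1440x900"})
assert fp1 == fp2
```

Because nothing is written to the user’s browser, clearing cookies or blocking them does not reset this identifier, which is what makes the technique attractive to trackers.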

When they began their analysis, the University of Washington researchers were pleased to find that the Wayback Machine could be used to track cookies and device fingerprinting, because the archive stores each site’s original JavaScript code, allowing them to determine which JavaScript APIs are called on each website. As a result, a user perusing the archived version of a site in the Wayback Machine winds up making all the same requests that the site was programmed to make at the time.

The researchers embedded their tool, which they call TrackingExcavator, in a Chrome browser extension and configured it to allow pop-ups and cookies. They instructed the tool to inspect the 500 most popular sites, as ranked by Amazon’s Web analytics subsidiary Alexa, for each year of the analysis. As it browsed the sites, the system recorded third-party requests and cookies, and the use of particular JavaScript APIs known to assist with device fingerprinting. The tool visited each site twice, once to “prime” the site and again to analyze whether requests were sent.
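A core step in any analysis like this is deciding whether a recorded request is first-party or third-party. The sketch below shows one common heuristic; it is not the TrackingExcavator implementation, and the last-two-labels domain comparison is a simplifying assumption (production tools consult the Public Suffix List to handle domains like `.co.uk` correctly).

```python
from urllib.parse import urlparse

def is_third_party(page_url, request_url):
    """Classify a request as third-party when its registrable domain
    differs from the page's. Naive heuristic: compare the last two
    labels of each hostname."""
    def site(url):
        return ".".join(urlparse(url).hostname.split(".")[-2:])
    return site(page_url) != site(request_url)

# An ad script from another domain counts; the site's own CDN subdomain does not.
assert is_third_party("http://todaysnews.com/", "http://ads.tracker.example/ad.js")
assert not is_third_party("http://todaysnews.com/", "http://img.todaysnews.com/logo.png")
```

Counting how many distinct third-party domains a page contacts, year by year, is what produces trend figures like the four-fold increase reported in the study.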

Until now, the team says, academic researchers hadn’t found a way to study Web tracking as it existed before 2005. Hoofnagle of UC Berkeley says that using the Wayback Machine was a clever approach and could inspire other scholars to mine archival sites for other reasons. “I wish I had thought of this,” he says. “I’m totally kicking myself.”

Still, there are plenty of holes in the archive that limit its usefulness. For example, some sites prohibit automated bots such as those used by the Wayback Machine from perusing them.

