Wednesday, June 10, 2015

Back when the web was new, it was fun to watch a website monitor and see the hits come in. The IP address told you the location of the user, and if you turned on the referer header display, you could see what the user had been reading just before. There was a group of scientists in Poland who'd be on my site regularly; I reported the latest news on nitride semiconductors, and my site was free. Every day around the same time, one of the Poles would check my site, and I could tell he had a bunch of sites he'd look at in order. My site came right after a Russian web site devoted to photographs of unclothed women.

The original idea behind the HTTP referer header (yes, that's how the header is spelled) was that webmasters like me needed it to help other webmasters fix hyperlinks. Or at least that was the rationalization. The real reason for sending the referer was to feed webmaster narcissism. We wanted to know who was linking to our site, because those links were our pats on the back. They told us about other sites that liked us. That was fun. (Still true today!)

Twenty years later, the referer header seems like a complete privacy disaster. Modern web sites use resources from all over the web, and a referer header, including the complete URL of the referring web page, is sent with every request for those resources. The referer header can send your complete web browsing log to websites that you didn't know existed.

Privacy leakage via the referer header plagues even websites that ostensibly believe in protecting user privacy, such as those produced by or serving libraries. For example, a request to the WorldCat page for "What you can expect when you're expecting" results in the transmission of referer headers containing the user's request to the following hosts:

It turns out there's an easy way to plug this privacy leak in HTML5. It's called the referrer meta tag. (Yes, that's also spelled correctly.)

The referrer meta tag is put in the head section of an HTML5 web page. It allows the web page to control the referer headers sent by the user's browser. It looks like this:

<meta name="referrer" content="origin" />

If this one line were used on WorldCat, only the fact that the user is looking at a WorldCat page would be sent to Google, AddThis, and BibTip. This is reasonable: library patrons typically don't expect their visits to a library to be private; they do expect that what they read there should be private.
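To check whether a page already declares a referrer policy, something like the following minimal sketch could be used. It relies only on Python's standard `html.parser` module; the names `ReferrerMetaFinder` and `find_referrer_policies` are made up for illustration, and the sketch looks only at the HTML itself, not at any HTTP response headers.

```python
from html.parser import HTMLParser


class ReferrerMetaFinder(HTMLParser):
    """Collects the content value of every <meta name="referrer"> tag."""

    def __init__(self):
        super().__init__()
        self.policies = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag and attribute
        # names arrive already lowercased. Self-closing tags such as
        # <meta ... /> are routed here as well by the base class.
        if tag != "meta":
            return
        attr_map = dict(attrs)
        if (attr_map.get("name") or "").lower() == "referrer":
            content = (attr_map.get("content") or "").strip().lower()
            self.policies.append(content)


def find_referrer_policies(html_text):
    """Return the list of referrer policies declared in html_text."""
    finder = ReferrerMetaFinder()
    finder.feed(html_text)
    finder.close()
    return finder.policies


if __name__ == "__main__":
    page = '<html><head><meta name="referrer" content="origin" /></head></html>'
    print(find_referrer_policies(page))
```

An empty list means the page declares no policy, so the browser's default applies.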

Because use of third party resources is often necessary, most library websites leak lots of privacy in referer headers. The meta referrer policy is a simple way to stop it. You may well ask why this isn't already standard practice. I think it's mostly lack of awareness. Until very recently, I had no idea that this worked so well. That's because it's taken a long time for browser vendors to add support. Although Chrome and Safari have been supporting the referrer meta tag for more than two years, Firefox only added it in January of 2015. Internet Explorer will support it with the Windows 10 release this summer. Privacy will still leak for users with older browser software, but this problem will gradually go away.

There are four other options for the meta referrer tag, in addition to the "origin" policy. The origin policy sends only the origin of the originating page (scheme, host name, and port), not its path or query string.

For the strictest privacy, use

<meta name="referrer" content="no-referrer" />

If you use this setting, other websites won't know you're linking to them, which can be a disadvantage in some situations. If the web page links to resources that still use the archaic "referer authentication", they'll break.

The prevailing default policy for most browsers is equivalent to

<meta name="referrer" content="no-referrer-when-downgrade" />

"downgrade" here refers to http links in https pages.

If you need the full referer for your own website but don't want other sites to see more than your origin, you can use

<meta name="referrer" content="origin-when-cross-origin" />

Finally, if you want the user's browser to send the full referrer, no matter what, and experience the thrills of privacy brinksmanship, you can set

<meta name="referrer" content="unsafe-url" />
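To make the five policies concrete, here is a rough sketch of the referer value a browser would send under each one. It is a simplification of the behavior described above, not the specification's algorithm. It treats "downgrade" as any request from an https page to a non-https URL, compares origins as plain strings without normalizing default ports, and does not strip credentials from the full URL. The function names and the example.org and example.com URLs are hypothetical.

```python
from urllib.parse import urlsplit


def origin_of(url):
    """Return scheme://host[:port]/ for a URL, dropping everything else."""
    parts = urlsplit(url)
    # Drop any "user:password@" prefix from the network location.
    host_port = parts.netloc.rpartition("@")[2]
    return f"{parts.scheme}://{host_port}/"


def referer_for(policy, page_url, target_url):
    """Return the Referer value sent for a request, or None for no header.

    page_url is the page containing the link or resource reference;
    target_url is the URL being requested.
    """
    # The fragment (#...) is never included in a Referer header.
    full = urlsplit(page_url)._replace(fragment="").geturl()
    same_origin = origin_of(page_url) == origin_of(target_url)
    # Simplified notion of downgrade: https page requesting a non-https URL.
    downgrade = (urlsplit(page_url).scheme == "https"
                 and urlsplit(target_url).scheme != "https")

    if policy == "no-referrer":
        return None
    if policy == "no-referrer-when-downgrade":
        return None if downgrade else full
    if policy == "origin":
        return origin_of(page_url)
    if policy == "origin-when-cross-origin":
        return full if same_origin else origin_of(page_url)
    if policy == "unsafe-url":
        return full
    raise ValueError(f"unknown referrer policy: {policy!r}")


if __name__ == "__main__":
    page = "https://catalog.example.org/search?q=pregnancy#results"
    widget = "https://widgets.example.com/share.js"
    for name in ("no-referrer", "no-referrer-when-downgrade", "origin",
                 "origin-when-cross-origin", "unsafe-url"):
        print(name, "->", referer_for(name, page, widget))
```

With the "origin" policy, the third-party widget learns only that the request came from the catalog site, not what was searched for.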

Widespread deployment of the referrer meta tag would be a big boost for reader privacy all over the web. It's easy to implement, has little downside, and is widely deployable. So let's get started!

(June 15) W3C Public Draft. Though this is labeled the "latest", the link above points to a more recent draft. As pointed out in the comments, there has been a change in the name of the "no-referrer" and "no-referrer-when-downgrade" policies.

I'm reprinting the article here so as to have a good place for discussion.

Alice, a 17-year-old high school student, goes to her local public library and reads everything she can find about pregnancy. Noticing this, a librarian calls up some local merchants and tells them that Alice might be pregnant. When Alice visits her local bookstore, the staff has some great suggestions about newborn care for her. The local drugstore sends her some coupons for scent-free skin lotion. She reads "what you can expect..." at the library and a few months later she starts getting mail about diaper services.

Unthinkable? In the physical library, I hope this never happens. It would be too creepy!

In the digital library, this future could be happening now. Libraries and their patrons are awash in data that really isn't sensitive until aggregated, and the data is getting digested by advertising networks and flowing into "big data" archives. The scenario in which advertisers exploit Alice's library usage is not only thinkable, it needs to be defended against. It's a "threat model" that's mostly unfamiliar to libraries.

Recently, I read a book called Half Life. Uranium theft, firearms technology and computer hacking are important plot elements, but I'm not worried about people knowing that I loved it. The National Security Agency (NSA) is not going to identify me as a potential terrorist because I'm reading Half Life. On the contrary, I'd love for my reading behavior to be broadcast to the entire world, because maybe more people would discover what a wonderful writer S.L. Huang is. A lot of a library user's digital usage data is like that. It's not particularly private, and most would gladly trade usage information for convenience or to help improve the services they rely on. It would be a waste of time and energy for a library to worry much about keeping that information secret. Quite the opposite: libraries are helping users share their behavior with things like Facebook Like buttons and social media widgets.

Which is why Alice should be very worried and why it's important for libraries to understand new threat models. What breaches of user privacy are most likely to occur and which are most likely to cause harm?

A 2012 article in the New York Times Magazine described a real situation involving Target (the retailer). Target's "big data" analytics team developed a customer model that identified pregnant women based on shopping behavior. Purchases of scent-free skin lotion, vitamin supplements, and cotton balls turned out to be highly predictive of subsequent purchases of baby diapers. Using the model, Target sent ads for baby-oriented products to the customers their algorithm had identified. In one case, an irate father whose daughter had received ads for baby clothes and cribs accused the store of encouraging his daughter to get pregnant. When a manager called to apologize, the father was somewhat abashed. “I had a talk with my daughter,” he said. “It turns out there’s been some activities in my house I haven’t been completely aware of. She’s due in August. I owe you an apology.”

Among the companies collecting "big data" about users are the advertising networks, companies that sit in between advertisers and websites. They use their data to decide which ad from a huge inventory is most likely to result in a user response. If I were Alice, I don't think I would want my search for pregnancy books broadcast to advertising networks. Yet that's precisely what happens when I do a search on my local public library's online catalog. I very much doubt that many advertisements are being targeted based on that searching ... yet. But the digital advertising industry is extremely competitive, and unless libraries shift their practices, it's only a matter of time before library searches get factored into advanced customer models.

But it doesn't have to happen that way. Libraries have a strong tradition of protecting user privacy. Once all the "threat models" associated with the digital environment are considered, practices will certainly change.

So let's get started. In the rest of this article, I'll examine the process of borrowing and reading an ebook, and identify privacy weaknesses in the processes that advertisers and their predictive analytics modeling could exploit.

Most library catalogs allow non-encrypted searches. This exposes Alice's ebook searches to internet providers between Alice and the library's server. The X-UIDH header has been used by providers such as Verizon and AT&T to help advertisers target mobile users. By using HTTPS for their catalogs, libraries can limit this intrusion. This is relatively easy and cheap, and there's no good excuse in 2015 for libraries not to make the switch.

Some library catalogs use social widgets such as AddThis or ShareThis that broadcast a user's search activity to advertising networks. Similarly, Facebook "Like" buttons send a user's search activity to Facebook whether or not the user is on Facebook. Libraries need to carefully evaluate the benefits of these widgets against the possibility that advertising networks will use Alice's search history inappropriately.
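As a starting point for that evaluation, a library could list the outside hosts that a catalog page pulls resources from. The sketch below is one rough way to do it with Python's standard library. It only sees resources named directly in the HTML (anything a script loads later is missed), and it may over-report `link` tags that browsers never fetch. The names `ResourceCollector` and `third_party_hosts` and the example hosts are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class ResourceCollector(HTMLParser):
    """Collects URLs that a browser usually fetches without any click."""

    # (tag, attribute) pairs treated as automatic fetches; <link href>
    # is an over-approximation, since not every rel value is fetched.
    FETCHED = {("script", "src"), ("img", "src"),
               ("iframe", "src"), ("link", "href")}

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if value and (tag, name) in self.FETCHED:
                self.urls.append(value)


def third_party_hosts(page_url, html_text):
    """Return a sorted list of hosts, other than the page's own, that
    the page's embedded resources are loaded from."""
    collector = ResourceCollector()
    collector.feed(html_text)
    collector.close()
    page_host = urlsplit(page_url).hostname
    hosts = set()
    for raw in collector.urls:
        # Resolve relative and protocol-relative URLs against the page.
        host = urlsplit(urljoin(page_url, raw)).hostname
        if host and host != page_host:
            hosts.add(host)
    return sorted(hosts)


if __name__ == "__main__":
    url = "https://catalog.example.org/search?q=pregnancy"
    html = ('<script src="//widgets.example.com/share.js"></script>'
            '<img src="/logo.png">'
            '<iframe src="https://social.example.net/like"></iframe>')
    print(third_party_hosts(url, html))
```

Each host in the result receives a request, and by default a referer header, every time the page loads.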

Statistics and optimization services like Google Analytics and NewRelic don't currently share Alice's search history with advertising networks, but libraries should evaluate the privacy assurances from these services to see if they are consistent with their own policies and local privacy laws.

When Alice borrows a book from a vendor such as OverDrive or 3M, the vendor monitors Alice's reading behavior, albeit anonymously. Although many have criticized the use of Adobe digital rights management (DRM) in these apps, both 3M and OverDrive use the "vendorID" method, which avoids the disclosure of user data to Adobe. At this date, there is no practical way for an advertising network to exploit Alice's use of these reading apps. Here again, libraries should review their vendor contracts to make sure that can't change.

If Alice reads her ebook using a third-party application such as Adobe Digital Editions (ADE), the privacy behavior of the third party comes into play. Last year, ADE was found to be sending user reading data back to Adobe without encryption; even today, it's known to phone home with encrypted reading data. Other applications, such as Bluefire Reader, have a better reputation for privacy, but as they say, "past performance is no guarantee of future returns".

If Alice wants to read her borrowed ebook on a Kindle (via OverDrive), it's very likely that Amazon will be able to exploit her reading behavior for marketing purposes. To avoid it, Alice would need to create an anonymous account on Amazon for reading her library books. Most people will just use their own (non-anonymous) accounts for convenience. If Alice shares her Amazon account with others, they'll know what she reads.

This is a classic example of the privacy vs. convenience tradeoff that libraries need to consider. A Kindle user trusts that Amazon will not do anything too creepy, and Amazon has every incentive to make that user comfortable with their data use. Libraries need to let users make their own privacy decisions, but at the same time libraries need to make sure that users understand the privacy implications of what they do.

The library's own records are also a potential source of a privacy breach. This "small-data" threat model is perhaps more familiar to librarians. Alice's parents could come in and demand to know what she's been reading. A schoolmate might hack into the library's lightly defended databases looking for ways to embarrass Alice. A staff member might be a friend of Alice's family. Libraries need clear policies and robust processes to be worthy of Alice's trust.

In the digital environment, it's easy for libraries to be unduly afraid of using the data from Alice's searches and reading to improve her experience and make the library a more powerful source of information. Social networks are changing the way we think about our privacy, and often the expectation is that services will make use of personal information that's been shared. Technologies exist to protect the user's control over that data, but advertising networks have no incentive to employ them. I want my library to track me, not advertising networks! I want great books to read, and no, I'm not in the market for uranium-238!