Monday, June 12, 2017

I've written a chapter for a book, edited by Peter Fernandez and Kelly Tilton, to be published by ACRL. The book is tentatively titled Applying Library Values to Emerging Technology: Tips and Techniques for Advancing within Your Mission.

Digital Advertising in Libraries: or... How Libraries are Assisting the Ecosystem that Pays for Fake News

To understand the danger that digital advertising poses to user privacy in libraries, you first have to understand how websites of all stripes make money. And to understand that, you have to understand how advertising works on the Internet today.

The
goal of advertising is simple and is quite similar to that of libraries.
Advertisers want to provide information, narratives, and motivations to
potential customers, in the hope that business and revenue will result. The
challenge for advertisers has always been to figure out how to present the
right information to the right reader at the right time. Since libraries are
popular sources of information, they have long provided a useful context for
many types of ads. Where better to place an ad for a new romance novel than at
the end of a similar romance novel? Where better to advertise a new industrial
vacuum pump than in the Journal of Vacuum
Science and Technology? These types of ads have long existed without
problems in printed library resources. In many cases the advertising, archived
in libraries, provides a unique view into cultural history. In theory at least,
the advertising revenue lowers the acquisition costs for resources that include
the advertising.

On
the Internet, advertising has evolved into a powerful revenue engine for free
resources because of digital systems that efficiently match advertising to
readers. Google's Adwords service is an example of such a system. Advertisers
can target text-based ads to users based on their search terms, and they only
have to pay if the user clicks on their ad. Google decides which ad to show by
optimizing revenue—the price that the advertiser has bid times the rate at
which the ad is clicked on. In 2016, Search Engine Watch reported that some
search terms were selling for almost a thousand dollars per click. [Chris Lake, “The most expensive 100 Google Adwords keywords in the US,” Search Engine Watch (May 31, 2016).] Other
types of advertising, such as display ads, video ads, and content ads, are
placed by online advertising networks. In 2016, advertisers were projected to
spend almost $75 billion on display ads; [Ingrid Lunden, “Internet Ad Spend To Reach $121B In 2014, 23% Of $537B Total Ad Spend, Ad Tech Boosts Display,” TechCrunch, (April 27, 2014).] Google's Doubleclick network alone is found on over a million websites. [“DoubleClick.Net Usage Statistics,” BuiltWith (accessed May 12, 2017). ]
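As a rough illustration of the revenue rule described above (bid price times click rate), here is a minimal Python sketch. The Ad class, the sample bids, and the click rates are invented for illustration; real ad auctions weigh additional factors that are not modeled here.

```python
from dataclasses import dataclass


@dataclass
class Ad:
    name: str
    bid_per_click: float  # dollars the advertiser has bid for one click
    click_rate: float     # estimated fraction of viewers who click


def expected_revenue(ad: Ad) -> float:
    """Expected revenue per showing: bid times click rate."""
    return ad.bid_per_click * ad.click_rate


def choose_ad(ads: list[Ad]) -> Ad:
    """Pick the ad with the highest expected revenue per showing."""
    return max(ads, key=expected_revenue)


# Hypothetical candidates competing for the same search term.
ads = [
    Ad("high bid, rarely clicked", bid_per_click=50.0, click_rate=0.001),
    Ad("low bid, often clicked", bid_per_click=2.0, click_rate=0.05),
]

print(choose_ad(ads).name)  # prints: low bid, often clicked
```

The point of the example is that the highest bidder does not automatically win: an ad that is clicked fifty times as often can earn more per showing even at a much lower bid.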

Matching
a user to a display ad is more difficult than matching a user to a search-driven ad. Without a
search term to indicate what the user wants, the ad networks need demographic
information about the user. Different ads (at different prices) can be shown to
an eighteen-year-old white male resident of Tennessee interested in sports and
a sixty-year-old black woman from Chicago interested in fashion, or a pregnant thirty-year-old
woman anywhere. To earn a premium price on ad placements, the ad networks need
to know as much as possible about the users: age, race, sex, ethnicity, where
they live, what they read, what they buy, who they voted for. Luckily for the
ad networks, this sort of demographic information is readily available, thanks
to user tracking.

Internet
users are tracked using cookies. Typically, an invisible image element,
sometimes called a "web bug," is placed on the web page. When the page
is loaded, the user's web browser requests the web bug from the tracking
company. The first time the tracking company sees a user, a cookie with a
unique ID is set. From then on, the tracking company can record the user's web
usage for every website that is cooperating with the tracking company. This
record of website visits can be mined to extract demographic information about
the user. A weather website can tell the tracking company where the user is. A
visit to a fashion blog can indicate a user's gender and age. A purchase of
scent-free lotion can indicate a user's pregnancy. [Charles Duhigg, “How Companies Learn Your Secrets,” The New York Times Magazine, (February 16, 2012).] The more information collected about a user, the more valuable a tracking
company's data will be to an ad network.
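The cookie mechanism described above can be sketched in a few lines of Python. This is a simulation only, with no real web server: the Tracker class, its method name, and the example URLs are all hypothetical, and a real tracking service would set the ID through an HTTP cookie header instead of a return value.

```python
import uuid
from collections import defaultdict


class Tracker:
    """Simulates a tracking company that serves web bugs."""

    def __init__(self):
        # Maps each tracking ID to the pages on which the web bug was loaded.
        self.visits = defaultdict(list)

    def handle_web_bug_request(self, cookie_id, referring_page):
        """Record one web bug request and return the ID to store as a cookie."""
        if cookie_id is None:
            # First time this browser is seen: assign a unique ID.
            cookie_id = uuid.uuid4().hex
        self.visits[cookie_id].append(referring_page)
        return cookie_id


tracker = Tracker()

# First visit: the browser has no cookie yet, so the tracker issues one.
cookie = tracker.handle_web_bug_request(None, "https://weather.example/nashville")

# Later visits to other cooperating sites send the same cookie back.
cookie = tracker.handle_web_bug_request(cookie, "https://fashion.example/blog")

print(tracker.visits[cookie])
```

After two page loads on unrelated sites, the tracker already holds a single browsing record keyed to one ID, which is the raw material for the demographic inferences described above.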

Many
websites unknowingly place web bugs from tracking companies on their websites,
even when they don't place advertising themselves. Companies active in the
tracking business include AddThis, ShareThis, and Disqus, who provide
functionality to websites in exchange for website placement. Other companies,
such as Facebook, Twitter, and Google, similarly track users to benefit their
own advertising networks. Services provided by these companies are often placed
on library websites. For example, Facebook’s “like” button is a tracker that
records user visits to pages offering users the opportunity to “like” a
webpage. Google’s “Analytics” service helps many libraries understand the usage
of their websites, but is often configured to collect demographic information
using web bugs from Google’s DoubleClick service. [“How to Enable/Disable Privacy Protection in Google Analytics (It's Easy to Get Wrong!)” Go To Hellman (February 2, 2017).]

Cookies
are not the only way that users are tracked. One problem that advertisers have
with cookies is that they are restricted to a single browser. If a user has an
iPhone, the ID cookie on the iPhone will be different from the cookie on the
user's laptop, and the user will look like two separate users. Advanced
tracking networks are able to connect these two cookies by matching browsing
patterns. For example, if two different cookies track their users to a few
low-traffic websites, chances are that the two cookies are tracking the same
user. Another problem for advertisers occurs when a user flushes their cookies.
The dead tracking ID can be revived by using "fingerprinting"
techniques that depend on the details of browser configurations. [Gunes Acar, Christian Eubank, Steven Englehardt, Marc Juarez, Arvind Narayanan, and Claudia Diaz, “The Web Never Forgets: Persistent Tracking Mechanisms in the Wild.” In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security (CCS '14). ACM, New York, NY, USA, 674-689. DOI] Websites like Google, Facebook, and Twitter are able to connect tracking IDs
across devices based on logins.
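To make the cookie-matching idea concrete, here is a minimal sketch of linking two cookies by the low-traffic sites they share. The function names, the thresholds, and the sample sites are assumptions chosen for illustration; production tracking networks presumably use far more elaborate statistical models.

```python
def shared_rare_sites(history_a, history_b, monthly_visitors, rare_below=1000):
    """Return sites seen under both cookies that few people visit.

    Sites missing from monthly_visitors are treated as not rare, so unknown
    sites never count as evidence of a match.
    """
    return {
        site
        for site in set(history_a) & set(history_b)
        if monthly_visitors.get(site, rare_below) < rare_below
    }


def probably_same_user(history_a, history_b, monthly_visitors, min_shared=2):
    """Guess that two cookies belong to one user if they share enough rare sites."""
    return len(shared_rare_sites(history_a, history_b, monthly_visitors)) >= min_shared


# Hypothetical traffic figures and browsing histories.
monthly_visitors = {
    "news.example": 5_000_000,
    "knitting-club.example": 300,
    "town-choir.example": 120,
}
phone_cookie = ["news.example", "knitting-club.example", "town-choir.example"]
laptop_cookie = ["town-choir.example", "knitting-club.example", "news.example"]
stranger_cookie = ["news.example"]

print(probably_same_user(phone_cookie, laptop_cookie, monthly_visitors))    # prints: True
print(probably_same_user(phone_cookie, stranger_cookie, monthly_visitors))  # prints: False
```

Sharing a popular news site says almost nothing, while sharing two obscure sites is strong evidence, which is why the sketch ignores high-traffic sites entirely.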

Once
a demographic profile for a user has been built up, the tracking profile can be
used for a variety of ad-targeting strategies. One very visible strategy is
"remarketing." If you've ever visited a product page on an e-commerce
site, only to be followed around the Internet by advertising for that product,
you've been the target of cookie-based remarketing.

Ad
targeting is generally tolerated because it personalizes the user's experience
of the web. Men, for the most part, prefer not to be targeted with ads for women’s
products. An ad for a local merchant in New Jersey is wasted on a user in
California. Prices in pounds sterling don't make sense to users in Nevada. Most
advertisers and advertising networks take care not to base their ad targeting
on sensitive demographic attributes such as race, religion, or sexual
orientation, or at least they try not to be too noticeable when they do it.

The
advertising network ecosystem is a huge benefit to content publishers. A
high-traffic website has no need of a sales staff; all it needs to do is be accepted
by the ad networks and draw users who either have favorable demographics or who
click on a lot of ads. The advertisers often don't care about what websites
their advertising dollars support. Advertisers also don't really care about the
identity of the users, as long as they can target ads to them. The ad networks
don't want information that can be traced to a particular user, such as email address,
name or home address. This type of information is often subject to legal
regulations that would prevent exchange or retention of the information they
gather, and the terms of use and so-called privacy policies of the tracking
companies are careful to specify that they do not capture personally
identifiable information. Nonetheless, in the hands of law enforcement, an
espionage agency, or a criminal enterprise, the barrier against linking a
tracking ID to the real-world identity of a user is almost non-existent.

The
amount of information exposed to advertising networks by tracking bugs is
staggering. When a user activates a web tracker, the full URL of the referring
page is typically revealed. The user's IP address, operating system, and
browser type are sent along with a simple tracker; the JavaScript trackers that
place ads typically send more detailed information. It should be noted
that any advertising enterprise requires a significant amount of user
information collection; ad networks must guard against click-jacking,
artificial users, botnet activity and other types of fraud. [Samuel Scott, “The Alleged $7.5 Billion Fraud in Online Advertising,” Moz, (June 22, 2015).]
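A short sketch shows how much a simple tracker can learn from a single request. It assumes the browser sends the full referring URL in the Referer header, as described above; the function name, the header values, the catalog URL, and the IP address are made up for the example.

```python
from urllib.parse import parse_qs, urlsplit


def what_the_tracker_sees(headers, client_ip):
    """Summarize what one web bug request reveals to the tracking company."""
    referring = urlsplit(headers.get("Referer", ""))
    return {
        "ip_address": client_ip,
        "browser_and_os": headers.get("User-Agent", ""),
        "site": referring.netloc,
        "page": referring.path,
        # Query strings often carry search terms typed by the user.
        "query": parse_qs(referring.query),
    }


# Hypothetical request triggered by a library catalog search results page.
request_headers = {
    "Referer": "https://catalog.library.example/search?q=living+with+hepatitis",
    "User-Agent": "ExampleBrowser/1.0 (ExampleOS 10.2)",
}

seen = what_the_tracker_sees(request_headers, "203.0.113.7")
print(seen["site"])        # prints: catalog.library.example
print(seen["query"]["q"])  # prints: ['living with hepatitis']
```

Nothing in this sketch requires JavaScript or any cooperation from the user: the search terms arrive with the request for the invisible image.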

Breitbart.com is a good example of a content site supported by
advertising placed through advertising networks. A recent visit to the
Breitbart home page turned up 19 advertising trackers, as characterized by
Ghostery: [Ghostery is a browser plugin that can identify and block the trackers on a webpage.]

33Across

[x+1]

AddThis

adsnative

Amazon Associates

DoubleClick

eXelate

Facebook Custom Audience

Google Adsense

Google Publisher Tags

LiveRamp

Lotame

Perfect Market

PulsePoint

Quantcast

Rocket Fuel

ScoreCard Research Beacon

Taboola

Tynt

While
some of these will be familiar to library professionals, most of them are
probably completely unknown, or at least their role in the advertising industry
may be unknown. Amazon, Facebook and Google are the recognizable names on this
list; each of them gathers demographic and transactional data about users of
libraries and publishers. AddThis,
for example, is a widget provider often found on library and publishing sites.
They don't place ads themselves, but rather, they help to collect demographic
data about users. When a library or publisher places the AddThis widget on
their website, they allow AddThis to collect demographic information that benefits
the entire advertising ecosystem. For example, a visitor to a medical journal might
be marked as a target for particularly lucrative pharmaceutical advertising.

Another
tracker found on Breitbart is Taboola. Taboola is responsible for the
"sponsored content" links found even on reputable websites like Slate
or 538.com.
Taboola links go to content that is charitably described as clickbait and is
often disparaged as "fake news." The reason is that the sites behind these
links, having paid for advertising, have to sell even more click-driven
advertising. Because of its links to the Trump Administration, Breitbart
has been the subject of attempts to pressure advertisers to stop putting
advertising on the site. A Twitter account for "Sleeping Giants" has been encouraging
activists to ask businesses to block Breitbart from placing their ads. [Osita Nwanevu, “‘Sleeping Giants’ Is Borrowing Gamergate’s Tactics to Attack Breitbart,” Slate, December 14, 2016.] While
several companies have blocked Breitbart in response to this pressure, most
companies remain unaware of how their advertising gets placed, or that they can
block such advertising. [Pagan Kennedy, “How to Destroy the Business Model of Breitbart and Fake News,” The New York Times (January 7, 2017).]

I'm
particularly concerned about the medical journals that participate in
advertising networks. Imagine that someone is researching clinical trials for a
deadly disease. A smart insurance company could target such users with ads that
mark them for higher premiums. A pharmaceutical company could use advertising
targeting researchers at competing companies to find clues about their research
directions. Most journal users (and probably most journal publishers) don't
realize how easily online ads can be used to gather intelligence as well as to
sell products.

It's
important to note that reputable advertising networks take user privacy very
seriously, as their businesses depend on user acquiescence. Google offers users
a variety of tools to "personalize their ad experience." [If you’re logged into Google, the advertising settings applied when you browse can be viewed and modified.] Many of the advertising networks pledge to adhere to the guidance of the
"Network Advertising Initiative" [“The NAI Code and Enforcement Program: An Overview,”], an industry group. However, the competition in the web-advertising
ecosystem is intense, and there is little transparency about enforcement of the
guidance. Advertising networks have been shown to spread security vulnerabilities and
other types of malware when they allow JavaScript in advertising payloads. [Randy Westergren, “Widespread XSS Vulnerabilities in Ad Network Code Affecting Top Tier Publishers, Retailers,” (March 2, 2016).]

Given
the current environment, it's incumbent on libraries and the publishing
industry to understand and evaluate their participation in the advertising
network ecosystem. In the following sections, I discuss the extent of current
participation in the advertising ecosystem by libraries, publishers, and
aggregators serving the library industry.

Publishers

Advertising
is a significant income stream for many publishers providing content to
libraries. For example, the Massachusetts Medical Society, publisher of the New England Journal of Medicine, takes
in about $25 million per year in advertising revenue. Outside of medical and
pharmaceutical publishing, advertising is much less common. However,
advertising networks are pervasive in research journals.

Recently,
I revisited the twenty journals to see if there had been any improvement. Most
of the journals I examined had added tracking on their websites. The New England Journal of Medicine,
which employed the most intense reader tracking of the twenty, is now even more
intense, with nineteen trackers on a web page that had "only" fourteen
trackers two years ago. A page from Elsevier's Cell went from nine to sixteen
trackers. [“Reader Privacy for Research Journals is Getting Worse,” Go To Hellman (March 22, 2017). ] Intense
tracking is not confined to subscription-based health science journals; I have
found trackers on open access journals, economics journals, even on journals
covering library science and literary studies.

It's
not entirely clear why some of these publishers allow advertising trackers on
their websites, because in many cases, there is no advertising. Perhaps they don’t
realize the impact of tracking on reader privacy. Certainly, publishers that
rely on advertising revenue need to carefully audit their advertising networks
and the sorts of advertising that come through them. The privacy commitments
these partners make need to be consistent with the privacy assurances made by
the publishers themselves. For publishers who value reader privacy and don't
earn significant amounts from advertising, there's simply no good reason for
them to continue to allow tracking by ad networks.

Vendors

The
library automation industry has slowly become aware of how the systems it
provides can be misused to compromise library patron privacy. For example, I
have pointed out that cover images presented by catalog systems were leaking
search data to Amazon, which has resulted in software changes by at least one
systems vendor. [“How to Check if Your Library is Leaking Catalog Searches to Amazon,” Go To Hellman (December 22, 2016).] These
systems are technically complex, and systems managers in libraries are rarely
trained in web privacy assessment. Development processes need to include
privacy assessments at both component and system levels.

Libraries

There
is a mismatch between what libraries want to do to protect patron privacy and
what they are able to do. Even when large amounts of money are at stake, there
is often little leverage for a library to change the way a publisher delivers
advertising-bearing content. Nonetheless, together with cooperating IT and
legal services, libraries have many privacy-protecting options at their
disposal.

Use aggregators for journal content rather than the publisher sites. Many
journals are available on multiple platforms, and platforms marketed to
libraries often strip advertising and advertising trackers from the journal
content. Reader privacy should be an important consideration in selecting
platforms and platform content.

Promote the use of privacy technologies. Privacy Badger is an open-source browser plugin that detects trackers and blocks them from
following users. Similar tools include uBlock Origin and the aforementioned Ghostery.

Use proxy servers. Rewriting proxy servers such as EZProxy are typically deployed to serve content to remote users, but they can also be
configured to remove trackers, or to forcibly expire tracking cookies. This is
rarely done, as far as I am aware.

Strip advertising and trackers at the network level. A more aggressive approach
is to enforce privacy by blocking tracker websites at the network level.
Because this can be intrusive (it affects subscribed content and unsubscribed
content equally), it's appropriate mostly for corporate environments where competitive-intelligence
espionage is a concern.

Ask for disclosure and notification. During licensing negotiations, ask the
vendor or publisher to provide a list of all third parties who might have
access to patron clickstream data. Ask to be notified if the list changes. Put
these requests into requests for proposals. Sunlight is a good disinfectant.

Join together with others in the library and publishing industry to set out
best practices for advertising in web resources.
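The proxy and network-level options above both come down to deciding, request by request, whether a hostname belongs to a known tracker. The following is a minimal sketch of that check, not a description of how EZProxy or any particular firewall is configured; the blocklist entries and the function name are hypothetical.

```python
from urllib.parse import urlsplit

# Hypothetical blocklist; a real deployment would load a maintained list.
BLOCKED_DOMAINS = {"tracker.example", "ads.example"}


def is_blocked(url, blocked=BLOCKED_DOMAINS):
    """Return True if the URL's host is a blocked domain or one of its subdomains."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == domain or host.endswith("." + domain) for domain in blocked)


print(is_blocked("https://cdn.tracker.example/bug.gif"))  # prints: True
print(is_blocked("https://journal.example/article/1"))    # prints: False
```

Matching on the domain suffix with a leading dot catches subdomains such as cdn.tracker.example without accidentally blocking an unrelated host whose name merely ends in the same letters.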

Conclusion

The
widespread infusion of the digital advertising ecosystem into library
environments presents a new set of challenges to the values that have been at
the core of the library profession. Advertising trackers introduce privacy
breaches into the library environment and help to sustain an
information-delivery channel that operates without the values grounding that
has earned libraries and librarians a deep reserve of trust from users. The
infusion has come about through a combination of commercial interest in user
demographics, consumer apathy about privacy, and general lack of understanding
of a complex technology environment. The entire information industry needs to
develop understanding of that environment so that it can grow and evolve to
serve users first, not the advertisers.