Analysis and opinion by Christopher Soghoian, security and privacy researcher.

Thursday, October 07, 2010

My FTC complaint about Google's private search query leakage

Today, the Wall Street Journal published an article about a complaint I submitted to the FTC last month regarding Google's intentional leakage of individuals' search queries to third-party sites.

The complaint is 29 pages long, and so I want to try to explain it to those of you who don't have the time or desire to read through the whole complaint.

The complaint centers around an obscure feature in web browsers, known as the HTTP referrer header. Danny Sullivan, a widely respected search engine industry analyst, has written that the HTTP referrer header is "effectively the Caller ID of the internet. It allows web site owners and marketers to know where visitors came from." However, while practically everyone with a telephone knows about the existence of caller ID, as Danny also notes, the existence of the referrer header is "little known to most web surfers."

This header reveals to the websites you visit the URL of the page you were viewing before you visited that site. When you visit a site after clicking on a link in a search engine results page, that site learns the terms you searched for (because Google and the other search engines include your search terms in the URL).
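To make this concrete, here is a minimal sketch of what a destination site could do with the header. The function name is my own invention for illustration, and the sample URL follows the form of a Google results page address; in the HTTP specification itself the header is spelled "Referer".

```python
from urllib.parse import urlsplit, parse_qs


def search_terms_from_referrer(referrer, param="q"):
    """Return the search terms embedded in a referrer URL, or None.

    `param` is the query-string key that carries the terms; "q" is an
    assumption based on the Google results URL used in the example below.
    """
    query = urlsplit(referrer).query
    values = parse_qs(query).get(param)
    return values[0] if values else None


# Hypothetical value of the Referer header sent by the visitor's browser
referrer = "http://www.google.com/search?q=my+search+terms"
print(search_terms_from_referrer(referrer))  # my search terms
```

Any site that logs this header alongside the visitor's IP address ends up with a record of who searched for what before arriving.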

Google does not dispute that it is leaking users' search queries to third parties. A Google spokesperson told the Wall Street Journal today that its passing of search-query data to third parties "is a standard practice across all search engines" and that "webmasters use this to see what searches bring visitors to their websites."

Thus, we move on to the main point of my complaint, which is that the company does not disclose this "common practice" to its customers, and in fact, promises its customers that it will not share their search data with others.

For example, of the 49 videos in Google's YouTube privacy channel, not one single video describes referrer headers, or provides users with tips on how to protect themselves from such disclosure. On the other hand, the first video that plays when you visit the privacy channel tells the visitor that "at Google, we make privacy a priority in everything we do." Indeed.

Google's privacy policy states:

Google only shares personal information with other companies or individuals outside of Google in the following limited circumstances:

* We have your consent. We require opt-in consent for the sharing of any sensitive personal information.

* We provide such information to our subsidiaries, affiliated companies or other trusted businesses or persons for the purpose of processing personal information on our behalf . . .

* We have a good faith belief that access, use, preservation or disclosure of such information is reasonably necessary to (a) satisfy any applicable law, regulation, legal process or enforceable governmental request, (b) enforce applicable Terms of Service, including investigation of potential violations thereof, (c) detect, prevent, or otherwise address fraud, security or technical issues, or (d) protect against harm to the rights, property or safety of Google, its users or the public as required or permitted by law.

The widespread leakage of search queries doesn't appear to fall into these three "limited circumstances." Perhaps Google doesn't consider search query data to be "personal information"? However, at least four years ago, it did. When fighting a much-publicized request from the Department of Justice for its customers' search queries, the company argued that:

"[S]earch query content can disclose identities and personally identifiable information such as user‐initiated searches for their own social security or credit card numbers, or their mistakenly pasted but revealing text."

Until October 3, the company's privacy policy also included the following statement:

We may share with third parties certain pieces of aggregated, non-personal information, such as the number of users who searched for a particular term, for example, or how many users clicked on a particular advertisement. Such information does not identify you individually.

I don't think that it is possible to reasonably claim that millions of individual search queries associated with particular IP addresses are "aggregated, non-personal information".

Google's customers expect their search queries to stay private

In its brief opposing DOJ's request, Google also argued that it has an obligation to protect the privacy of its customers' search queries:

"Google users trust that when they enter a search query into a Google search box, not only will they receive back the most relevant results, but that Google will keep private whatever information users communicate absent a compelling reason . . .

The privacy and anonymity of the service are major factors in the attraction of users – that is, users trust Google to do right by their personal information and to provide them with the best search results. If users believe that the text of their search queries into Google's search engine may become public knowledge, it only logically follows that they will be less likely to use the service."

"Google does not publicly disclose the searches (sic) queries entered into its search engine. If users believe that the text of their search queries could become public knowledge, they may be less likely to use the search engine for fear of disclosure of their sensitive or private searches for information or websites."

Google already protects some of its users' search queries

Since May of this year, Google has offered an encrypted search service, available at encrypted.google.com (in fact, it is currently the only search engine to offer such a service). In addition to protecting users from network snooping, the service automatically protects users' query data from leaking via referrer headers.
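One likely reason an encrypted results page has this effect, offered here as my reading of the HTTP specification rather than a description of Google's implementation: RFC 2616 (section 15.1.3) says that clients should not include a Referer header in a non-secure HTTP request if the referring page was transferred with a secure protocol. A simplified sketch of that rule follows; the function name is hypothetical, and real browsers may behave differently in the details.

```python
from urllib.parse import urlsplit


def referrer_is_sent(from_url, to_url):
    """Simplified version of the RFC 2616 section 15.1.3 guidance.

    Returns False when the user moves from a secure (https) page to a
    non-secure (http) one, and True otherwise. This is an illustrative
    assumption, not a model of any particular browser.
    """
    from_scheme = urlsplit(from_url).scheme.lower()
    to_scheme = urlsplit(to_url).scheme.lower()
    return not (from_scheme == "https" and to_scheme == "http")


# Clicking a result on the regular site versus the encrypted one
# (example.com stands in for any non-secure destination site)
print(referrer_is_sent("http://www.google.com/search?q=my+search+terms", "http://example.com/"))
print(referrer_is_sent("https://encrypted.google.com/search?q=my+search+terms", "http://example.com/"))
```

Under this rule the first click would carry the search terms to the destination site and the second would not. Note that a click from one secure page to another is not covered by the rule, so the header may still be sent in that case.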

However, Google has done a poor job of advertising the existence of its encrypted search website, and an even worse job of letting users know about the existence of search query referrer leakage. If users don't know that their queries are being shared with third parties, why would they bother to use the encrypted search service in the first place?

The remedy I seek

If Google wants to share its users' search query data with third parties, there is nothing I can do to stop it. That practice, alone, isn't currently illegal. However, the company should not be permitted to lie about its practices. If it wants to share its customers' search queries with third parties, it should disclose that it is doing so. Even more so, it shouldn't be able to loudly and falsely proclaim that it is protecting its users' search data.

However, since the company has for years bragged about the extent to which it protects its customers' data, I think that it should be forced to stand by its marketing claims. Thus, I have petitioned the FTC to compel the company to begin scrubbing this data, and to take appropriate steps to inform its existing customers about the fact that it has intentionally shared their historical search data with third parties. This, I think, is the right thing to do.

9 comments:

While I agree with your point that Google shouldn't be allowed to be disingenuous, I feel like your complaint is directed at the wrong place. Google isn't the one sharing Referer data, it's the specification that does so. While users really have no idea that one website talks to another, I don't think we can just blame Google for that. After all, it is actually your browser that's handing the Referer data to each site, not Google or any other website.

To claim otherwise would be to do one of two things. One, force web designers to use worse URLs that don't have useful information in them. Or two, do what Google is /proactively/ doing on their encrypted site, which is adding a layer of misdirection in between so that even if browsers follow the Referer standard, the website you browse to from Google will be obscured. Personally, I'd rather a world in which URLs look like /query/term than one in which they look like /9872387897823asdlc8a7, and the layer of misdirection might be a partial solution, but every website shouldn't have to do this on their server.

I don't have a strong opinion on whether the Referer data should be passed around, but you can't go attacking Google for something they've done on their encrypted site that they haven't done elsewhere, and which nobody else has done anywhere (that I'm aware of).

It's curious that you don't quote from the government's reply brief of 2/24/06:

Google plainly does not consider the content of search terms to be “personal information” for the purpose of this privacy policy. To the contrary, queries that are entered into Google’s search engine are routinely revealed to other websites, and Google makes no efforts to prevent this. This disclosure occurs as follows. First, when a user runs a Google search, Google returns a page of results with an address that includes the entered search terms (e.g., http://www.google.com/search?q=my+search+terms). The user may then click on a link as displayed on this results page. When the user does so, under the specification for the Hypertext Transfer Protocol (RFC 2616), his or her web browser will pass several pieces of information to the new website that he or she is visiting; one of these fields, known as the “referer,” or HTTP_REFERER, specifies the address of the previous web page that directed the user to the current website. As a result, when a user clicks on a link in a Google search results page, the address of that page – including the search terms embedded in the address (e.g., my+search+terms) – is disclosed to the operator of the linked-to website. And this occurs despite the fact that the operator of that website will also receive information regarding the user’s IP address, which may be associated with the search terms. The government’s request for production here, in contrast, seeks no such identifying information.

Google itself does not transmit this search information to other websites; instead, the individual user’s browser does so, in accordance with the official HTTP specification. However, Google could easily prevent users’ search terms from leaking out in this fashion, but it chooses not to do so, and thus tacitly allow user search queries to be disclosed to websites visited by Google search users. [FN: Google could construct its search input form to use the HTTP POST method instead of the GET method. See generally Shishir Gundavaram, CGI Programming on the World Wide Web, ¶ 4.2 (O’Reilly 1996) (available at http://www.oreilly.com/openbook/cgi/ch04_02.html).] Moreover, Google affirmatively encourages its advertisers to use referrer logging to track the traffic on their websites. (Supp’l Stark Decl., ¶ 13.) This is, of course, inconsistent with Google’s present assertion as to the value it places on the confidentiality of the text of queries on its search engine.
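The GET versus POST distinction in that footnote can be illustrated with a short sketch. It only builds two request objects and sends nothing over the network. The POST variant is hypothetical: it shows where the terms would travel, not that Google's search endpoint accepts such a request.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Encode the search terms the way an HTML form would: "q=my+search+terms"
params = urlencode({"q": "my search terms"})

# GET: the terms become part of the results page's address, which is
# what a browser would later pass along in the Referer header.
get_request = Request("http://www.google.com/search?" + params)

# POST (hypothetical): the terms travel in the request body, so the
# address of the results page stays generic.
post_request = Request("http://www.google.com/search", data=params.encode("ascii"))

print(get_request.get_method(), get_request.full_url)
print(post_request.get_method(), post_request.full_url)
```

With the POST form, the address of the results page no longer contains the query, so there is nothing query-specific for the Referer header to reveal.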

It's worth restating the main point: Google is not accurately representing its behavior to consumers and has taken repeated actions to make sure that this disclosure happens, even when the new techniques it uses would prevent this. As the post notes, this has even included adding information to the referrer data that would not appear in the URL (link order).

This was trenchant and outstanding research. I'm not sure the first two commentators read far enough to see that the issue is that Google is deceiving people, due to the gap between its 'stated privacy policies' and how it designs search so it passes on info about the Web users. The question is whether the FTC will do anything, given Google's power, wealth and political connections. But meanwhile, salute to Chris for exposing this. Heck, Google didn't even deny it.

It isn't really Google which is passing the search terms on - it is your own browser.

At least if you don't use the Firefox RefControl plug-in or other methods mentioned in the Complaint.

One thing missing from your Complaint was that queries are tied to a specific IP address, which can usually be tied to a specific location (such as a residence) and often to a specific ISP client.

In the end, it seems to me that Google has designed the way the search works (GET vs. POST, what sort of links are returned, etc. etc.), and thus excuses about common browser behaviour in particular circumstances don't hold water.

If Google gave back URLs in search results that included query parameters with all the information Google knew about the searcher (such as Google ID, e-mail address, sex, age, and so on), you could just as well argue that it was "the browser" that passed this information on. Yet it was Google that "programmed" the browser to pass on this information (or not) when the user clicked the link.

The question we should focus on here, I think, is, when Google gives me a page of search results, how has it set up the links on this page to pass (or not) information to the sites I view when I click the links using common browsers?

(For Firefox users worried about this thing, by the way, RefControl (https://addons.mozilla.org/en-US/firefox/addon/953/) is a useful plugin.)

Can you really fault Google? I agree that most consumers have no idea how much data is out there on them; they don't even have a basic idea how the internet works. People surfing Facebook don't have the first idea how that website is built and operates, and most who surf the web via browsers don't even know how a browser translates code into what is rendered. You want Google to educate consumers about privacy online, but the consumer has already spoken and they don't care. Ignorance is bliss...

I got into this through the concern that if I have a really neat idea and I want to search to see if anyone else has had it before me, can I be sure that sending the idea into the Internet is private, and that someone will not see it and maybe take business advantage of it before me?

Christopher Soghoian, Ph.D. is a Washington, DC based privacy and security researcher. He is the Principal Technologist in the Speech, Privacy and Technology Project at the American Civil Liberties Union.