Understanding the latest Facebook privacy train wreck

Revelations that Facebook and a number of other sites have been passing user …

Facebook's latest privacy problem should raise a few eyebrows. Facebook sharing "private" data in unexpected (and occasionally unwelcome) ways is nothing new, but this newest problem is unusual, in that it does something that Facebook's lengthy (and oft-updated) privacy policy explicitly says should not happen: it shares private user information with advertisers.

The research originally describing the problem looked at more than just Facebook. Many social networking sites, including LinkedIn, Digg, and Twitter, suffered the same leakage of personal data. Of the twelve sites looked at by the researchers, the only one that didn't leak data to advertisers was the one already owned by an advertising company—Orkut.

The private information was leaked to advertisers in a couple of ways. The most widespread problem was the HTTP Referer (sic). In general, every time a Web browser requests a page from a Web server by following a link or loading an embedded resource, it sends the Web server the URL of the current webpage (the "referer") in addition to the name of the new page.
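As a rough illustration of the mechanism, the sketch below builds (but never sends) the kind of request a browser would make for an ad embedded in a profile page. The hostnames, the profile URL, and the `build_request` helper are all hypothetical, chosen only to show where the current page's URL travels.

```python
from urllib.request import Request


def build_request(target_url: str, current_page_url: str) -> Request:
    """Build, without sending, a request that carries the current page as its Referer."""
    # A browser attaches this header automatically; here it is set by hand.
    return Request(target_url, headers={"Referer": current_page_url})


# Hypothetical URLs: a profile page that embeds a third-party ad image.
profile_page = "http://social.example.com/profile.php?id=12345"
ad_request = build_request("http://ads.example.net/banner.gif", profile_page)

# The ad server receives the full profile URL, user ID included.
print(ad_request.get_header("Referer"))
```

Whatever identifier sits in the page URL is handed to the server named in the request, which is the whole of the problem described here.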

This information is used for a number of purposes. It's useful to website owners as it allows them to know, for example, which search engines are providing most of their traffic, and which internal links within the site are being followed (and by implication, which aren't). This lets site owners know which parts of their design are useful, the kind of communities that are interested in their content, the search terms that people are using to find their pages, and so on.
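A minimal sketch of that kind of analysis, assuming the search engine puts the query in a `q` parameter (as Google's results URLs do); the `summarize_referer` helper is a made-up name for illustration.

```python
from urllib.parse import parse_qs, urlsplit


def summarize_referer(referer: str) -> dict:
    """Pull out the referring host and, if present, the search terms."""
    parts = urlsplit(referer)
    query = parse_qs(parts.query)
    return {
        "host": parts.hostname,
        # Assumption: the search terms live in the "q" parameter.
        "search_terms": query.get("q", [None])[0],
    }


info = summarize_referer("http://www.google.com/search?q=facebook+privacy")
print(info)
```

A site owner would typically run something like this over the server logs and tally the results by host and search term.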

Another popular usage scenario is to block certain requests if the referral comes from a different domain, to prevent bandwidth leeching. If a Web server sees a request for an image but the HTTP Referer refers to a different site, indicating that the image has been linked or embedded on someone else's webpage, it can block the request.
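The check itself can be sketched in a few lines. This is a simplified illustration rather than any particular server's configuration: `OWN_HOST` and `allow_image_request` are hypothetical names, and letting through requests with no Referer at all is an assumed policy (many clients and proxies omit the header).

```python
from typing import Optional
from urllib.parse import urlsplit

OWN_HOST = "www.example.com"  # hypothetical site that hosts the images


def allow_image_request(referer: Optional[str]) -> bool:
    """Return True if the image should be served, False if it looks hotlinked."""
    if not referer:
        # Assumed policy: a missing Referer is allowed through.
        return True
    # Serve the image only when the referring page is on our own host.
    return urlsplit(referer).hostname == OWN_HOST


print(allow_image_request("http://www.example.com/gallery.html"))
print(allow_image_request("http://forum.example.net/thread/42"))
```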

Generally, this information is harmless, because usually the referring URL is not personally identifying. That's because most URLs aren't customized for individual users. Everyone performing the same search in Google will get the same URL for the results page; everyone visiting a page here on Ars will visit the same URL, and so on.

The problem this causes with social networking sites is that often, their referring URLs do contain identifying information, such as the ID or username. This makes sense—the entire point of these sites is that they are personalized and contain user-specific data—but it means that referring URLs become a whole lot more valuable.

To respond to this, Facebook uses, and has used for a long time, a redirect system for any link that takes users away from Facebook. Instead of going directly to the page you want (and hence using your profile URL as the referer), you are first directed to Facebook's redirect page, with the URL http://www.facebook.com/l.php?u=address-of-page-you're-going-to. The redirect page then immediately forwards you to the intended page. By doing this, Facebook ensures that the target site only sees the URL of the redirect page, and not the URL of your profile. In this way, private data is stripped out.
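The link-wrapping half of such a scheme might look like the sketch below. Only the `l.php?u=` URL shape comes from the description above; the helper names and the sample target are hypothetical, and Facebook's actual implementation is not public in this article.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

REDIRECT_ENDPOINT = "http://www.facebook.com/l.php"  # URL shape as described above


def wrap_external_link(target_url: str) -> str:
    """Rewrite an outbound link so it passes through the redirect page first."""
    return REDIRECT_ENDPOINT + "?" + urlencode({"u": target_url})


def unwrap_redirect(redirect_url: str) -> str:
    """What the redirect page does: recover the real destination from 'u'."""
    return parse_qs(urlsplit(redirect_url).query)["u"][0]


target = "http://example.org/article?id=7"  # hypothetical external page
wrapped = wrap_external_link(target)
print(wrapped)
print(unwrap_redirect(wrapped))
```

When the browser is forwarded on from the redirect page, the external site sees only the `l.php` address as the referer, which names the destination but not the profile the click came from.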

Unfortunately, though Facebook does this for external links, the company wasn't doing anything equivalent for ads embedded within Facebook pages. Facebook contains third-party advertising, and each request for these ads included with it the referring URL of the page containing the ad.

The company has since modified the way ads are embedded and linked within Facebook, so they should no longer see a meaningful referer. Facebook's URLs also appear to be constructed differently than was the case in the past; they place most information after a hash (#) symbol. The portion of the URL after the # is meant to indicate an internal reference on the page, and these internal references are not sent as part of an HTTP Referer in any case. This restructuring of URLs should provide a robust defense against data leakage through the HTTP Referer.
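The effect of moving information behind the hash can be shown with a short sketch. The two profile URLs below are hypothetical examples of the old and new styles, and `referer_for` is an illustrative helper that mimics what a browser does when it drops the fragment.

```python
from urllib.parse import urldefrag


def referer_for(page_url: str) -> str:
    """Return the URL as it would appear in a Referer header (fragment removed)."""
    url_without_fragment, _fragment = urldefrag(page_url)
    return url_without_fragment


# Hypothetical old-style URL: the user ID sits in the query string.
old_style = "http://www.facebook.com/profile.php?id=12345"
# Hypothetical new-style URL: the same information sits after the hash.
new_style = "http://www.facebook.com/#!/profile.php?id=12345"

print(referer_for(old_style))
print(referer_for(new_style))
```

With the old style the ID survives into the referer; with the new style only the bare site address is left.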

Twitter and Digg

The report described a similar problem on Twitter, but this one seems less clear-cut, and is arguably not a problem at all. Twitter uses Google Analytics for tracking usage data, and as part of this, Twitter embeds an image hosted by Google Analytics, hence providing to Google the address of each page used. However, Twitter URLs tend not to be personally identifying anyway.

Normally, the URL Twitter users look at is just http://twitter.com/. Though Twitter does have profile pages, these are usually visited by other people who are looking at what the profile page's owner has tweeted. In other words, I don't normally look at (or click links on) http://twitter.com/DrPizza. Other people might, but their name doesn't get put into the URL, only mine.

As such, though the report criticized this as a source of data leakage, it's hard to see why. The only slight issue is that any links from private Twitter streams will include the Twitter username in the referral; the privacy leakage in this seems to be negligible. That said, Google Analytics is widely used, and it's possible that other sites do use it in such a way as to cause a problem—but no problematic usage was actually described in the report.

The third problem concerned Digg. Again, this was an area where there was the potential for a problem, but where the example provided in the report failed to actually demonstrate this. The concern here was the way Digg used cookies. Digg's cookies store, among other things, the username of the logged-in user. Any request made to digg.com will include that username cookie. For digg.com, that's just fine: we want the site we're visiting to know who we are, and if we didn't, we wouldn't register or log in at all.

Cookies are tied to a particular domain name. This can be done in two ways; they can be linked to a single, specific domain name (by specifying the full domain name in the cookie), or they can be linked to a domain and any subdomains. Digg uses both types of cookie; some tied to "digg.com" specifically, and some tied to ".digg.com," meaning that any sub-domain can read the cookie.

This was a concern, because Digg uses a digg.com subdomain for certain tracking features. z.digg.com is a third-party tracking server, but because it uses a subdomain of digg.com, it can see any cookies tied to ".digg.com."

However, the cookie containing the username is specific to digg.com. The z.digg.com tracking server hence has no access to it, so there is no information leakage.
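The two kinds of scoping can be modelled with a small function. This is a deliberately simplified sketch of the rule as described above, not a full implementation of browser cookie matching, and `cookie_sent_to` is a hypothetical helper.

```python
def cookie_sent_to(cookie_domain: str, request_host: str) -> bool:
    """Decide whether a cookie scoped to cookie_domain accompanies a request to request_host."""
    if cookie_domain.startswith("."):
        # Leading dot: the domain itself and any subdomain may read the cookie.
        bare_domain = cookie_domain[1:]
        return request_host == bare_domain or request_host.endswith(cookie_domain)
    # No leading dot: only that exact host receives the cookie.
    return request_host == cookie_domain


# The username cookie is tied to "digg.com" specifically.
print(cookie_sent_to("digg.com", "z.digg.com"))
# A cookie tied to ".digg.com" is visible to the tracking subdomain.
print(cookie_sent_to(".digg.com", "z.digg.com"))
```

Under this model the tracking server at z.digg.com sees the subdomain-wide cookies but never the one holding the username.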

The report appeared to be misleading on this point. The example cited does have a URL that includes a Digg username, but it got there through visiting a user page—it's there not because it's leaking private details of the current user, but rather it's reflecting the page they're visiting, and as such mirrors the information found in the HTTP Referer anyway.

Again, there is the potential for a mistake to be made—if the digg.com cookie allowed subdomain access then the identity of the viewer would indeed be leaked—but that's not a mistake that Digg actually makes.

So, the initial claims of the paper seem somewhat overblown. The situation may be muddied somewhat by third-party applications (which could potentially do things "wrong" even if Facebook et al. get them right), but it appears that in general, the social networking sites aren't routinely leaking personally identifying information to advertisers.

A number of sites are reported to be looking into ways to stop the more minor leakage—the data that, for example, the personal page being visited contains a link to a site—but this is a minor issue compared to the ability to track users directly.

One claim does look to have had some merit: Facebook probably was leaking information to advertisers, but doesn't seem to be any more. The Wall Street Journal's writers contacted Facebook, and state that the company acknowledged the potential for leakage and has taken protective measures in response.

What makes this disquieting is that the way the HTTP Referer works—and the data leakage it can cause—is well-known. Facebook has used redirections for off-site links for a long time, so it was certainly aware of the problem; it just failed to apply a solution across the board. Unlike other decisions the company has made, this almost certainly wasn't deliberate, but it definitely shows a kind of recklessness.

Given the amount of data Facebook and other social networking sites contain about us, this kind of reckless behavior is distressing, and will no doubt fuel the small—but growing—trend of deleting your profile to put an end to privacy concerns. Privacy is about more than having a bunch of checkboxes you can tick: it needs to be considered for every piece of development, or else these accidental problems will continue to arise.

61 Reader Comments

Sounds like this sort of HTTP_REFERER concern is going to be a problem for any site that wants to implement a RESTful API with meaningful URLs. Maybe a solution would be to include a JavaScript that traverses the DOM and changes any cross-domain a href URLs to an onclick that will open a new, unnamed window and set its location.

Edit: forgot about devices that can't open multiple windows. Guess you'll need to point everything through a generic redirect page.

Would I be right in thinking that this is something that your browser does? If so it sounds like it might be a good thing for a firefox plugin to disable, so the more paranoid among us can rest in peace.

It seems like generally when I'm on FaceBook and viewing pages that have ads the URL is pretty benign, along the lines of http://www.facebook.com/#!/home.php. And then if I click around I can get pages that do contain my username. It seems rather similar to the Twitter scenario in the article.

I think sites like Facebook should be held to a higher standard than run of the mill sites, but this also seems like the kind of error that is more likely to be accidental than nefarious.

But it was the final straw for me. If this had been the first or even just the third issue involving FB privacy, it probably wouldn't have been a big deal. But it just broke my tolerance for putting up with them. It was the proverbial straw.

This. I commented in one of the other FB threads this week that my account is currently in the deletion process, to be completed next Wednesday (it takes 14 days).

This was it for me, as well. Even if they "fix it" or "make it better" I can no longer trust them to "keep" things that way.

Likewise, except on the subsequent Wednesday.

I do not support websites with out-of-the-blue, opt-out privacy failings.

I think it's funny how ars concludes this article commenting on how 'reckless' FB and the other sites were, but not even bothering to comment on the obvious recklessness with which their original article was posted less than 24 hours ago. So much drama stirred up all because they insisted on posting the first article quickly instead of waiting and doing the thoughtful analysis found in this one. Sad, such a waste of time.

From what I can tell, the complaints levelled at Facebook are accurate. WSJ is maintaining that they are, and that portion of the AT&T paper looks sound.

What makes this disquieting is that the way the HTTP Referer works—and the data leakage it can cause—is well-known. Facebook has used redirections for off-site links for a long time, so it was certainly aware of the problem; it just failed to apply a solution across the board. Unlike other decisions the company has made, this almost certainly wasn't deliberate, but it definitely shows a kind of recklessness.

I'm sorry. But I can't understand this reasoning. I am not, nor have I ever been, a Facebook user. So I hope you take this commentary as not being fueled by some kind of prejudice towards the company.

But how exactly can you conclude this behavior wasn't deliberate? This is exactly ad clicking we are talking about here. Not just some hidden link on an obscure portion of users' webpage. It's virtually impossible for a business the size of Facebook and with the added years of experience and expertise, to not be aware of how their redirects are operating on such a sensitive area of their business. One of the areas that has had the most worldwide coverage in news and technological open forums. Debated to death for the past 20 years by renowned experts. Subject to the foundation of governmental departments and non governmental agencies. One of the biggest targets of awareness campaigns in internet since its inception; User Privacy. And exactly on Ad serving, the one most sensitive area to user privacy.

But even if that was the case, and Facebook wasn't aware, what does that tell us exactly?

That we are in deeper trouble! That in fact none of these social platforms can in fact offer any kind of sense of protection to their users because... well, because they may be missing something. Despite all the claims on their Privacy Policy pages. Which can rapidly become useless documents if legal action is not taken.

But worse. That when it finally happens, we have a community at large, including news agents, exploring and helping expanding the notion that this is ok, after all. That the poor fellas, they just missed it. It was not deliberate. Even when there was no evidence whatsoever it wasn't deliberate.

And to add to that we also gain the knowledge that these companies don't read their logs and don't check their privacy infrastructure periodically like they certainly do their security infrastructure. Because at the end of the day, when they get caught because they were being deliberate, or exposed because they indeed made a mistake, all goes back to how it was before.

Excuse me, Peter. But I honestly prefer to think this was deliberate. Because the alternative is a lot scarier. This is not an isolated event. Stories like this have been piling up on websites like Ars at an alarming rate. And, my guess, because in fact there has been no proper treatment given to them by legal enforcement agencies and the news media in general explores and helps expand the notion that this is all just one big Oops. The poor saps won't do it again. Move on.

But how exactly can you conclude this behavior wasn't deliberate? This is exactly ad clicking we are talking about here. Not just some hidden link on an obscure portion of users' webpage. It's virtually impossible for a business the size of Facebook and with the added years of experience and expertise, to not be aware of how their redirects are operating on such a sensitive area of their business. One of the areas that has had the most worldwide coverage in news and technological open forums. Debated to death for the past 20 years by renowned experts. Subject to the foundation of governmental departments and non governmental agencies. One of the biggest targets of awareness campaigns in internet since its inception; User Privacy. And exactly on Ad serving, the one most sensitive area to user privacy.

Well, no, actually, it hasn't been "debated to death".

In general, the problem hasn't really existed. In general--and still the case with most websites--the HTTP Referer contains no identifying information. That's not to say that there were no sites that ever did it, but sites where the very URLs that people used identified them were exceptional. If you click an ad on Ars, the advertisers will know exactly what page you were on when you clicked. That's how most sites operate, and that's because it's harmless. Passing this data is the web's default mode of operation, and it takes some effort to avoid it.

Historic privacy concerns have been concerned with such issues as cookies (which allow ad servers to identify that the same computer is connecting to their ad servers from disparate websites). The personal identity-in-URL scenario just wasn't something anyone had to deal with ten years ago; even five years ago it was pretty exceptional. So, no, not "debated to death" for the past 20 years. Shit, the WWW hasn't even existed for 20 years (it wasn't until 1991 that there was a public WWW).

There's not a shred of evidence that it was deliberate, and the fact that Facebook explicitly stated that they wouldn't pass such information to advertisers reinforces that. You don't say you won't do something and then deliberately do the opposite unless you want to get sued.

>> There's not a shred of evidence that it was deliberate, and the fact that Facebook explicitly stated that they wouldn't pass such information to advertisers reinforces that. You don't say you won't do something and then deliberately do the opposite unless you want to get sued.

This is totally not up for you or me to decide. It's quite irrelevant what you think on the matter. I'm not trying to be offensive. I'm trying to be factual. These actions should deserve proper investigation before anyone can actually know what happened. Besides willingly or unwillingly, there is a heavy dose of responsibility here that a company cannot just shrug off with an oops. Not to mention we don't know exactly for how long this has been going. For how long facebook users privacy has been violated.

>> In general, the problem hasn't really existed. In general--and still the case with most websites--the HTTP Referer contains no identifying information.

The problems didn't exist? That referer urls can contain such information is widely known by both users (who make a case of staying informed), web programmers and business players in the industry. Referer URLs are a well known potential problem concerning user privacy. You deny this? I don't want to go down peck on links about user privacy lists of dangers, things to do, things not to do and things to watch out for. I don't think I have to.

>> So, no, not "debated to death" for the past 20 years. Shit, the WWW hasn't even existed for 20 years (it wasn't until 1991 that there was a public WWW).

Allow me the faux pas. I'm not counting with my fingers. But you know exactly what I mean, so no need to be snarky. If it serves you of any consolation, I've been around since 93 and user privacy has always been an openly debated issue.

In any case, it is also quite irrelevant if this particular aspect of user privacy was a relevant aspect or not of the debate. It is known. Passing user sensitive information through a URL referrers is easily seen as a privacy problem. Doesn't take a genius and you don't need 19 (happy now?) years to finally find that out after Facebook shown it to you.

You simply cannot convince me, or anyone with forehead taller than 1 inch, that all this was just one big surprise. That the people involved in creating the web content for a business like Facebook never thought of that danger. They obviously did when they took care of regular links redirect page.

I am no Facebook apologist, though this is exactly the kind of alarmist material that gets the torches and pitchforks out and dumb, irrevocable things happen as a result. So from what I can tell the Referer header is doing its job, it's the content of the URIs that's in question.

So, just curious, how many social-media-whatevers don't mint URIs containing unique keys from their user database? Like, isn't that exactly what they show you to do in every make-your-own-social-network-in-10-minutes-in-Rails screencast? When did the idea of masking the content of URIs from referrers filter into the zeitgeist to the point we're all freaking out if web properties fail to implement it?

That said, I think we should be caring a lot about the content of URIs, and not just for privacy purposes. I see this event as more of a chin-scratching one than a cause for hyperventilation.

This is totally not up for you or me to decide. It's quite irrelevant what you think on the matter. I'm not trying to be offensive. I'm trying to be factual. These actions should deserve proper investigation before anyone can actually know what happened. Besides willingly or unwillingly, there is a heavy dose of responsibility here that a company cannot just shrug off with an oops. Not to mention we don't know exactly for how long this has been going. For how long facebook users privacy has been violated.

Of course they can shrug it off with an "oops". Facebook users already agreed as much when they signed up in the first place.

Quote:

The problems didn't exist? That referer urls can contain such information is widely known by both users (who make a case of staying informed), web programmers and business players in the industry. Referer URLs are a well known potential problem concerning user privacy. You deny this? I don't want to go down peck on links about user privacy lists of dangers, things to do, things not to do and things to watch out for. I don't think I have to.

I deny that there was any substantial issue of referer URLs containing personally identifiable data, yes.

Quote:

In any case, it is also quite irrelevant if this particular aspect of user privacy was a relevant aspect or not of the debate. It is known. Passing user sensitive information through a URL referrers is easily seen as a privacy problem. Doesn't take a genius and you don't need 19 (happy now?) years to finally find that out after Facebook shown it to you.

All this fun people are having makes me ever so mindful of the day that I lost my FACEbook in an accident years ago, and ever since, my doctors have warned me never--never--to DIGG at my FACE with my hands all a'TWITTER! I hear, and I obey.

The uproar over facebook privacy settings is proof of the devolution of the human race. when did everyone forget that this is the internet and everything is available. it's like spray painting your social security number on the side of a bridge and then telling everybody not to look at it.

Yes, or like having your phone number and address listed in the white pages but completely forgetting what's there for all the world to see; or discovering a bogus charge on your credit card bill and scratching your head because you know you haven't used your computer to charge anything in ages--but completely forgetting the smarmy waiter at dinner the other night who personally insisted on taking your CC to the register and paying your bill, and whom you also forgot was gone just a bit longer than you thought was appropriate before he returned and handed you back your card and receipts... Ah, you'd think from reading some of these stories that the only thing most people ever do in life, after birth, is to sit at home writing their autobiographies on FACEBOOK!...

fyi the new style of facebook urls was in place at least by april 16 (to match google's proposal for crawling ajax sites, I assume, but I never saw any formal announcement of the fact). that would mean this hasn't been an issue with them since before the whole "open graph" launch.

Between that fact and what DrPizza investigated, the WSJ report seems late for the facts but well timed for the controversy. I can't find a date yet for the actual research paper.

Maybe I'm just confused, but I don't understand the difference between the Facebook and Twitter scenarios. In neither case do the URLs contain information identifying the user who is viewing the page, just the page that is viewed. At best, all they get are the ad preferences of all of the people viewing a person's profile, in aggregate.

Now, I suppose you could go through some incredibly complicated sociological models to derive data from the users who are friends with the user whose page is being viewed, but I can't envision any scenario in which that would be worth the investment for an advertising company.

Do we care to understand, at that point? The moral of these stories is that 3rd parties (Facebook in particular, but google/MS... neither) cannot be trusted to manage your privacy, either because they're greedy, or incompetent, or subcontract to who knows whom. Or all of the above.

In this "social Web 2.0 cloud" age, that's very unsettling. These guys will sell, lose, corrupt, misuse my data with no qualms, and no consequences to fear. And once they got something, they'll never, ever give it back.

Although I disagree with the conclusion in the second to last paragraph that says this "certainly wasn't deliberate". I can't imagine it not being deliberate. Ads are how they make money. The more likely scenario is that they know exactly what is going on with their primary revenue stream. They have people who spend all day thinking about nothing else.

A safer conclusion would have been: "it possibly wasn't deliberate" or even "it likely wasn't deliberate". To claim certainty is just plain wrong in my opinion.

What? Discussing User Privacy issues and concerns? You are, in fact, very wrong.

If you mean the specific case of referers, you are right of course. But then I never implied they were. If you go back to my original post, and read more carefully perhaps, you will see I was referring to User Privacy when I mentioned the 20 years. Not this particular issue.

HTTP Referers were first introduced, 5 years later, in 1996. And if you go to the original RFC (http://www.ietf.org/rfc/rfc1945.txt) and read 10.13, you will see that even there there's a note concerning privacy issues. I'm quoting:

Quote:

Note: Because the source of a link may be private information or may reveal an otherwise private information source, it is strongly recommended that the user be able to select whether or not the Referer field is sent. For example, a browser client could have a toggle switch for browsing openly/anonymously, which would respectively enable/disable the sending of Referer and From information.

But again, it's quite irrelevant when this particular issue of HTTP referers was made known to developers, privacy advocates, and people in the industry in general. It is a well known problem. It's simply not possible to build a case around the idea that this is just a hidden feature that a business with the size and expertise of Facebook wasn't aware of during development.

The proof that they are in fact aware is that they (like many others in this industry) apply special redirect pages to their other urls.

I get it - you can sell someone's private information behind their back but you can't share an mp3 with your family member that you have already paid for? Oh the inequities of this technical world we live in. It just isn't stacked in our favor.