Me and my metadata – thoughts on online surveillance

The NSA documents Edward Snowden leaked have sparked a debate within the US about surveillance. While Americans understood that the US government was likely intercepting telephone and social media data from terrorism suspects, it’s been an uncomfortable discovery that the US collected massive sets of email and telephone data from Americans and non-Americans who aren’t suspected of any crimes. These revelations add context to other discoveries of surveillance in post 9-11 America, including the Mail Isolation Control and Tracking program, which scans the outside of all paper mail sent in the US and stores it for later analysis. (The Smoking Gun reported on the program early last month – I hadn’t heard of it until the Times report today.)

The Obama administration and supporters have responded to criticism of these programs by assuring Americans that the information collected is “metadata”, information on who is talking to whom, not the substance of conversations. As Senator Dianne Feinstein put it, “This is just metadata. There is no content involved.” By analyzing the metadata, officials claim, they can identify potential suspects then seek judicial permission to access the content directly. Nothing to worry about. You’re not being spied on by your government – they’re just monitoring the metadata.

Of course, that’s a naïve and oversimplified view of metadata, which turns out to be a surprisingly rich source of information on who people are, who they know and what they do. Congress has historically recognized that metadata is important and deserves protection – while the Supreme Court ruled in Smith vs. Maryland that phone numbers dialed should not be expected to be private information, as they are exposed to the phone company, Congress put restrictions on the use of “pen registers”, devices that can track what calls are made and received by a phone, requiring law enforcement to go to court to institute such tracking. The same logic in Smith vs. Maryland applies to the Mail Isolation Control and Tracking program – since information on envelopes is visible to the public, or at least to mail carriers, it’s monitorable and storable, even without “mail covers”, US Postal Service administrative orders used to trace mail coming to criminal suspects. And, perhaps, the policymakers who approved NSA’s surveillance projects would argue that the logic applies to email headers as well.

Put aside for the moment the question of whether monitoring metadata is reading public information or is more analogous to a pen register. There’s a scale issue that comes into play here. One major constraint on pen registers and mail covers historically has been the sheer amount of data they generate. Potential overreach by law enforcement is held in check by two factors – the need to get court or administrative approval to trace metadata, and the ability to process said metadata. As a result, USPS insiders report that it processes about 15,000 – 20,000 mail covers a year related to crime, and as security researcher Chris Soghoian discovered, internet and telecommunications companies charge law enforcement agencies for pen registers, putting some practical limits on their use.

But the NSA surveillance of email and phone networks, and the Mail Isolation Control and Tracking program have no such limits. While it’s likely quite expensive to scan all US mail, once you’ve committed to doing so, it’s comparatively cheap to store that information and analyze it at later dates, as investigators evidently did to arrest Shannon Richardson for sending ricin to President Obama and New York City mayor Bloomberg. And, since the costs of NSA surveillance are evidently borne primarily by internet and telephony companies, it’s downright cheap to keep metadata on email and phone calls. All the postal mail, email and phone calls.

It’s also much, much cheaper to analyze this data than in years past. The current frenzy for “big data” and “data science” has called attention to techniques that allow analysts to pull subtle patterns out of data – a New York Times story that suggests that retailer Target was able to identify pregnant customers based on their purchasing behavior (unscented lotion!) and target ad flyers to them gives a sense for the commercial applications of these techniques.

Sociologist Kieran Healy shows another set of applications of these techniques, using a much smaller, historical data set. He looks at a small number of 18th century colonists and the societies in Boston they were members of to identify Paul Revere as a key bridge tie between different organizations. In Healy’s brilliant piece, he writes in the voice of a junior analyst reporting his findings to superiors in the British government, and suggests that his superiors consider investigating Revere as a traitor. He closes with this winning line: “…if a mere scribe such as I — one who knows nearly nothing — can use the very simplest of these methods to pick the name of a traitor like Paul Revere from those of two hundred and fifty four other men, using nothing but a list of memberships and a portable calculating engine, then just think what weapons we might wield in the defense of liberty one or two centuries from now.”
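To give a rough sense of how little machinery this kind of analysis needs, here is a minimal sketch in Python. It is not Healy's code or data: the people, clubs and function names are made up for illustration, and it only counts shared memberships and ties, a cruder measure than a full network analysis would use.

```python
from collections import defaultdict
from itertools import combinations

# Toy, made-up membership lists (hypothetical; not Healy's historical data).
memberships = {
    "Person A": {"Club 1", "Club 2", "Club 3"},
    "Person B": {"Club 1"},
    "Person C": {"Club 2"},
    "Person D": {"Club 3"},
    "Person E": {"Club 3"},
}


def co_membership(memberships):
    """Count the groups shared by every pair of people."""
    shared = {}
    for a, b in combinations(sorted(memberships), 2):
        n = len(memberships[a] & memberships[b])
        if n:
            shared[(a, b)] = n
    return shared


def tie_counts(shared):
    """Count how many distinct people each person is tied to."""
    ties = defaultdict(int)
    for a, b in shared:
        ties[a] += 1
        ties[b] += 1
    return dict(ties)


shared = co_membership(memberships)
ties = tie_counts(shared)
most_connected = max(ties, key=ties.get)
print(most_connected, ties[most_connected])
```

In this toy data, the person who belongs to all three clubs ends up tied to everyone else, even though nothing about what was said at any meeting has been recorded.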

If you are a member of a secret organization planning overthrow of the government, you’ve probably already thought hard about what your metadata might reveal. But if you’re an average citizen with “nothing to hide”, it may be less obvious why your metadata may not be something you are comfortable sharing. After all, Frank Rich recently proclaimed that “privacy jumped the shark in America long ago” and that we are all members of “the America that prefers to be out there, prizing networking, exhibitionism, and fame more than privacy, introspection, and solitude.” Lured by reality television and social networks, we all want to be watched and have therefore given up our distaste for surveillance.

I think it’s possible to be both a heavy user of social media, and concerned about the security of your metadata. It simply requires understanding that, for many of us, social media is a performance. When I share links on Twitter, I’m aware that I’m constructing an image to my followers as someone who’s interested in certain topics and uninterested in others. I don’t share every article that I read, both because I suspect not all are interesting to my followers and also because I don’t really want my professional community to know just how much mental energy I spend worrying about who the Green Bay Packers will field at running back in the coming season.

This may not be how you use social media, but it probably should be. As danah boyd and others have pointed out, youth have had to figure out how to navigate a world in which their interpersonal and social interactions are archived, searchable and persist long enough to present a problem in adulthood – as a result, they’re continually engaged in “identity performance”, as well as in developing codes and other ways to speak on social networks to defy monitoring.

By contrast, most of us aren’t maintaining a persistent, public performance when we’re using telephones or email. (For an example of what this might feel like, consider this story from This American Life, where lawyers who work with Guantanamo detainees talk about how having the US government monitor their personal phone calls changes their behavior.) Our metadata can reveal things we may not want to share with others, or may not know ourselves.

As it happens, I have a pretty good sense for what my email metadata might tell an investigator. This fall, I co-taught a class with Cesar Hidalgo, Catherine Havasi and Sep Kamvar at the Media Lab titled “Big Data”. Two of the students who took the class, Daniel Smilkov and Deepak Jagdish, worked on a project called Immersion which uses Gmail metadata to map someone’s social network. I’m one of about 500 alpha testers of the software, developed by Cesar, Daniel and Deepak, and have been one of the poster boys for the project as it’s been on display at the Media Lab, as I’ve got the largest network of Gmail contacts of anyone who’s used the system. (This isn’t because I’m especially popular, I suspect. Most of my MIT colleagues use mit.edu addresses. As someone new to MIT, who maintains a number of different affiliations, I have been a heavy Gmail user.)

Here’s what my metadata looks like:

The largest node in the graph, the person I exchange the most email with, is my wife, Rachel. I find this reassuring, but Daniel and Deepak have told me that people’s romantic partners are rarely their largest node. Because I travel a lot, Rachel and I have a heavily email-dependent relationship, but many people’s romantic relationships are conducted mostly face to face and don’t show up clearly in metadata. But the prominence of Rachel in the graph is, for me, a reminder that one of the reasons we might be concerned about metadata is that it shows strong relationships, whether those relationships are widely known or are secret.

The other large nodes on the graph are associated with specific clusters. Rebecca is my co-founder at Global Voices and Ivan and Georgia run the organization day-to-day – they dominate the green cluster, which includes key people in that organization. Hal is my chief collaborator at the Berkman Center, and Colin is my boss – they dominate the orange cluster, which includes fellow Berkman folks as well as a number of prominent internet law and policy folks who work closely with the Center. Lorrie is assistant director at Center for Civic Media and is the person I work with most closely at MIT – the red cluster represents the people I work with at the Media Lab.

Anyone who knows me reasonably well could have guessed at the existence of these ties. But there’s other information in the graph that’s more complicated and potentially more sensitive. My primary Media Lab collaborators are my students and staff – Cesar is the only Media Lab node who’s not affiliated with Civic who shows up on my network, which suggests that I’m collaborating less with my Media Lab colleagues than I might hope to be. One might read into my relationships with the students I advise based on the email volume I exchange with them – I’d suggest that the patterns have something to do with our preferred channels of communication, but it certainly shows who’s demanding and receiving attention via email. In other words, absence from a social network map is at least as revealing as presence on it.

Another sensitive piece of information comes from how Immersion draws and codes clusters. Immersion’s algorithm is sensitive to who you include on the same email. Global Voices emails include Ivan, Georgia, Rebecca and others – people who I email when I email those three get placed in the same cluster. People who exist as bridges between clusters are particularly interesting, as they are people who appear in multiple roles in your social network. Joi Ito appears on my graph twice (as “Joi” and “Joichi”) because he uses multiple email addresses, but in either role, he’s a bridge between my MIT existence, my Global Voices existence and my Berkman life, which reflects my long and multi-faceted relationship with him. But he’s colored red, as a Media Lab person, whereas other bridge figures like danah boyd show up as blue, as they have close relationships with Rachel as well. In other words, I have important, long-standing, multifaceted relationships with both danah and Joi, but danah is part of my family life as well, while Joi is not.
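For the curious, here is a minimal sketch in Python of how a map like this could be assembled from nothing but message headers. It is not Immersion's actual algorithm, which I have only described loosely above. The addresses, the `messages` list and the `contact_graph` function are all hypothetical, and the sketch simply counts how often each contact appears and how often two contacts appear on the same message.

```python
from collections import Counter
from itertools import combinations

ME = "me@example.org"  # hypothetical mailbox owner

# Made-up header records: sender and recipients only, no subject or body.
messages = [
    {"from": ME, "to": ["ann@example.org", "bob@example.org"]},
    {"from": "ann@example.org", "to": [ME, "bob@example.org"]},
    {"from": ME, "to": ["cat@example.org"]},
    {"from": "cat@example.org", "to": [ME, "dan@example.org"]},
    {"from": ME, "to": ["ann@example.org"]},
]


def contact_graph(messages, me):
    """Return (volume, together) counters built from headers alone.

    volume[contact]  = number of messages involving that contact (node size)
    together[(a, b)] = number of messages naming both a and b (edge weight)
    """
    volume = Counter()
    together = Counter()
    for msg in messages:
        # Everyone on the message except the mailbox owner.
        others = sorted({msg["from"], *msg["to"]} - {me})
        volume.update(others)
        together.update(combinations(others, 2))
    return volume, together


volume, together = contact_graph(messages, ME)
print(volume.most_common(1))
print(together.most_common())
```

Contacts who keep turning up on the same messages end up strongly linked, which is the raw material a clustering step would then group and color. A contact linked to people in more than one group is a candidate bridge.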

My point here isn’t to elucidate all the peculiarities of my social network (indeed, analyzing these diagrams is a bit like analyzing your dreams – fascinating to you, but off-putting to everyone else). It’s to make the case that this metadata paints a very revealing portrait of oneself. And while there’s currently a waiting list to use Immersion, this is data that’s accessible to NSA analysts and to the marketing teams at Google. That makes me uncomfortable, and it makes me want to have a public conversation about what’s okay and what’s not okay to track.

While popular outcry over revelations about the NSA has been somewhat muted so far, it’s possible that widespread protests planned for July 4th will spark more dialog about what represents unconstitutional surveillance. Here’s hoping that conversation will take a close look at metadata and ask hard questions about whether or not this is information we are willing to share with governments and corporations, or whether we need to regulate and limit this power to monitor as we’ve historically done in the United States. Restore the Fourth.

39 Responses to Me and my metadata – thoughts on online surveillance

For me, the classic paper in this area is Paul Ohm’s analysis of why anonymization doesn’t work. He shows that small amounts of metadata, and a modicum of known facts, will reveal big amounts of private information (Ohm, 2010).

For example:
In 2007, two students at Massachusetts Institute of Technology (MIT) analyzed the Facebook profiles of 6,000 past and present MIT students. They demonstrated that they were able to predict, with a very high degree of certainty, whether someone was gay or not, based on their friendship group (Jernigan & Mistree, 2009).

In 2007 & 2008, Narayanan and Shmatikov used data from Flickr, Twitter and LiveJournal to show that they could automatically identify anonymous users on Twitter a third of the time, with a 12% error rate, by comparing their friend groups on the three networks (Narayanan and Shmatikov, 2009).

In 2009, Acquisti and Gross demonstrated that they could ‘guess’ a large number of American social security numbers using just the birth date and place of a person (Acquisti and Gross, 2009).

In 2009, Zheleva and Getoor demonstrated that friendship and group affiliation on social networks could be used to recover the information of private-profile users. They found that they could predict (with reasonable degrees of success) country of residence (Flickr), gender (Facebook), breed of dog (Dogster) and whether someone was a spammer (BibSonomy), even when 50% of the sample group were private-profile users (Zheleva and Getoor, 2009).

In 2011, Calandrino and others demonstrated that you could use the “You might also like” feature on Hunch, Last.fm, LibraryThing, and Amazon to predict individual purchasing, listening and reading habits of users of these systems. As long as you knew a small number of items that were true about a person, you could use the system to investigate their private behaviour on these sites (Calandrino et al, 2011).

I’m sure you are aware of similar examples.

I’m pretty sure that these techniques can be chained, so that if you are a prolific user of social networks, people can tell your gender, sexual orientation, country of residence, breed of dog, purchasing, listening, reading and spamming activities, your social security number and your name, even if you were anonymous.

Elena Zheleva and Lise Getoor, “To join or not to join: the illusion of privacy in social networks with mixed public and private user profiles,” in Proceedings of the 18th international conference on World Wide Web, WWW ’09 (New York, NY, USA: ACM, 2009), 531–540.

If enough metadata tends to expose what is hidden, and if many entities have the ability to collect metadata, and tools to correlate this data are easily obtained, then the implication is that we may no longer be able to hide what is hidden. It’s simply not possible. We can try to suppress the collection of metadata, or we can suppress the tools. The tools we can’t suppress because, well, that simply doesn’t work. We can try to suppress collection – but that collection – by government at all levels, public services of many kinds, network service providers, commercial entities of all kinds – is ongoing as a matter of business. I want to think we can undo this state of affairs, but I can’t see a way it can be done.

“I want to think we can undo this state of affairs, but I can’t see a way it can be done.”

My belief is that our best hope is policy backed by law. I’ll cite two examples as models. First, health data is considered very private and its dissemination is heavily regulated, which demonstrates that privacy by policy is not impossible. Second, the judicial system has evidentiary requirements, which make certain types of evidence inadmissible if law enforcement violated due process in obtaining it. The reasoning here is that letting some guilty people go free is preferable to allowing state force to do whatever it wants. This is the problem I am most concerned with in the case of the NSA: they are allowed to pass your information to law enforcement if it contains evidence of a crime, even though they obtained that information without probable cause.

The problem, as always, with making it illegal for government to do anything is that the government is then charged with policing itself. This never works, as proven by previous government actions, and now NSA spying. Just like when a cop shoots you and claims self defence, who investigates the cop? Other cops. And if you file suit to get redress from the government doing something illegal, who do you go to? A government judge. Short of some sort of Luddite revolution which removes this type of technology, I don’t see any way of reversing this trend.

Mr Stray,
Do you really think the NSA etc would not look at medical records? I can think of no reason why they wouldn’t. Secondly I don’t think the law is much of a barrier to them when terrorism and national security are used as the justification for their actions.