Data and its Discontents – notes and reflections from a panel at Microsoft Research Social Computing Symposium

As I’ve mentioned in years past, Microsoft Research’s Social Computing Symposium is my favorite conference to attend, mostly because it’s a chance to catch up with dozens of people I love and don’t get to see every day. I wasn’t able to blog the whole conference, in part because I was moderating a session, but I wanted to post my notes on the event to share these conversations more widely. I’ve added some of my thoughts at the end as well. Many thanks to Microsoft Research for running this event and to all participants in the panel.

The session is titled “Data and its Discontents”, and it was curated by RIT’s Liz Lawley and MSR/NYU’s danah boyd. They decided not to focus on “big data” – the theme of virtually every conference these days – but on data seen through different lenses: art and creative practice, an ethical perspective, a rights perspective and a speculative perspective.

The opening speaker is professor and artist Golan Levin (@golan), who’s based at CMU. He’s spent the last year working on an open hardware project, so he’s exploring other work, not his own. His exploration is motivated by a tweet from @danmcquillan: “in the longer term, the snowden revelations will counter the breathless enthusiasm for #bigdata in gov, academia, humanitarian NGOs by showing that massive, passive data collection inevitably feeds the predictive algorithms of cybernetic social control”

Levin offers the idea of “the quantified selfie” and suggests we consider it as a form of post-Snowden portraiture. In a new landscape defined by drones, data centers and secret rendition, can these portraits jolt us into new understanding, or give us some comfort by letting us laugh at the situation we are encountering? He shows us John Lennon’s FBI file and a self-portrait Lennon drew, and argues that they are the same thing, “two different GUIs for a single database query.”

Artist Nick Felton is blurring the line between data portrait and portrait by offering data-driven annual reports of his life, analyzing his personal data for the year: every street he walked down in NYC, every plant killed. In honor of the Snowden revelations, he is preparing a 2014 edition that examines the uneasy relationship between data and metadata.

A more confrontational artwork comes from Julian Oliver and Danja Vasiliev, called Men in Grey. Two figures in grey suits carry mirrored briefcases. The briefcases are “man in the middle” devices, sniffing packets from local wireless networks and displaying what they find on the briefcases’ monitors. The artwork makes visible a form of surveillance that’s possible (and, as Kate Crawford will later explain, commercializable).

If the ethical issues associated with street-based surveillance don’t give you some pause, consider Kyle McDonald, a Brooklyn-based new media artist who pushes the legal questions around these issues even further. He became interested in the inadvertent expressions he made when he used the computer. Seeking more imagery, he installed monitoring software on all computers in the Apple stores he could reach in New York City, and captured a single frame each minute (only when someone was staring at the screen), uploading it to Tumblr. The images reveal some of the stress and anxiety many of us face when we stare into the screen of a computer – McDonald’s photos reveal expressions from empty to confused, unhappy and unsure.

Apple was pretty unhappy with McDonald’s project. He was forced to uninstall the software and is not able to show the photos he captured – instead, he shows watercolor versions of the images. But Levin notes that such surveillance isn’t hard to accomplish, and that the project “pushed the legal boundaries of public photography”.

A piece that pushes those boundaries even further is Heather Dewey-Hagborg’s “Stranger Visions”. The artist collects detritus from public places that could contain traces of DNA – cigarette stubs, chewing gum, pubic hairs from the seats of public toilets – and scans the DNA to measure 50 markers associated with physical appearance. Based on these markers, she constructs 3D models of the people she’s “encountered” this way. The portraits are less literal than McDonald’s, but transgressive in their own way, built from information inadvertently left behind.

And that’s the point, Levin argues – “These inadvertent, careless biometric traces and our constructed identities are creating entries in a database whose scope is breathtaking.” None of the art Levin features in his talk was made post-Snowden – surveillance is a theme many artists engage with – but these works take on an especially sinister character when we consider the mass surveillance that’s become routine in America, as revealed by Edward Snowden.

Kate Crawford (@katecrawford) is a professor based at Microsoft Research and MIT’s Center for Civic Media. She’s a media theorist who’s written provocatively about changing notions of adulthood, about gender and mobile technologies, about media and social change, and she’s now working on an examination of the promises, problems and ethics around “big data”. She notes that danah asked speakers on the panel to be provocative, so she offers a barnburner of a talk, titled “Big Data and the City: Ethics, Resistance and Desire”.

Her tour of big data starts in Andorra, a tiny nation in the Pyrenees that’s been facing hard times in the European economic crisis. The government decided to try a novel approach to economic recovery: gathering and selling the data of its citizens, including bus and taxi data, credit card usage data and anonymized telephony metadata. The package of data and the opportunity to study Andorrans is being marketed as a “real-world, living lab”, opening the possibility of a “smart nation” that’s even more ambitious than plans for smart cities.

These labs, Kate tells us, are being established around the world, and according to their marketing brochures, they look remarkably similar no matter where they are located. “There’s always a glowing city skyline, then shots of attractive urbanites making coffee and riding bikes.” But behind the scenes, there’s a different image: a dashboard, usually a map, that’s a metaphor for the central controller – a government agency? a retailer? – to examine the data. You leave a data trail, and someone else gathers and analyzes it. What we’re seeing, Kate offers, is the wholesale selling of data-based city management.

This form of pervasive data collection raises questions about the line between stalking and marketing. Turnstyle, a corporation that has set up hundreds of sensors in Toronto, gathers the wifi signals of passing devices, mostly laptops and phones. If you have wifi enabled on your phone, you are traceable by a unique identifier, and if you sign onto Turnstyle’s free wifi access points, the system will link your device to your real-world ID via social media, if possible. You don’t agree to this release of data – Turnstyle simply collects it. They’re using it to provide behavioral data to customers – an Asian restaurant discovers that many of its customers like to go to the gym, so it creates a workout t-shirt to market to them. This leads Kate to offer a slide of a man wearing a t-shirt that reads “My life is tracked 24/7 by marketers and all I got was this lousy t-shirt.”

Often this pervasive tracking is justified in terms of predictive policing, improving traffic flow, and generally improving life in cities. But she wonders what kind of ethical framework comes with these designs. What happens if we can be tracked offline as easily as we are online? How do we choose to opt out of this pervasive tracking? She notes that the shift towards pervasive tracking is happening out of sight of the less-privileged – some of the people affected by these shifts may be wholly unaware they are taking place.

Behind these systems is the belief that more data leads us to more control. She notes that Adam Greenfield, author of “Against the Smart City”, argues that the idea of the smart city is a manifestation of nervousness about the unpredictability of urbanity itself. The big data city is, ultimately, afraid of risk and afraid of cities.

When people react to these shifts by arguing for rights to privacy, Kate warns that we need to move beyond an analysis that’s so individualistic. The effects are systemic and societal, not just personal, and we need to consider implications for the broader systems. Not only do these systems violate reasonable expectations of privacy and control of personal data – “this would never get past an IRB – human data is taken without consent, with no sense of how long it will be held and no information on how to control your data” – they also have a deeper, more corrosive effect on societies.

She quotes James Bridle, creator of the site-specific artwork “Under the Shadow of the Drone”, who notes one difficulty of combatting surveillance: “Those who cannot perceive the network cannot act effectively within it and are powerless to change it”. Quoting De Certeau’s “Walking in the City”, she sees the “transparency” of big data as “an implacable light that produces this urban text without obscurities…”

Faced with this implacable light, we can design technologies to minimize our exposure. We can use pervasive, strong cryptography; we can design geolocation blockers. We can opt out or, as Evgeny Morozov suggests, participate in “information boycotts”. But while this is fine for certain elites, Kate postulates, it’s not possible for everyone, all the time. In the smart city, you are still being tracked and observed unless you are taking extraordinary measures.

What does resistance look like to these systems when opt-in and opt-out blur? Citing Bruce Schneier, Kate suggests that we need to analyze these systems not in terms of individual technologies, but in terms of their synergistic effects. It’s not Facebook ad targeting or facial recognition or drones we need to worry about – it’s the behaviors that emerge when those technologies can work together.

What do we lose when we lose a space without surveillance? Hannah Arendt warned of the danger to the human condition from the illumination of private space, noting “there are a great many things which cannot withstand the implacable, bright light of the constant presence of others on the public scene.”

Kate offers desire lines, the unpredictable shortcuts that emerge in public spaces, as a challenge to the smart city. We need a reflective urban unplanning, an understanding of the organic ways in which cities work, the anarchy of the everyday. This is a vision of cities that values improvisation over rigidity, communities over institutions. In the process, we need to imagine a different ethical model of the urban, a model that allows us to change our minds and opt for something different altogether. We need a model that allows us to reshape, to make shortcuts and desire lines. We need a city that lets us choose, or we will be forever followed by whoever is most powerful.

—-

Mark Latonero of USC Annenberg offers a possible counterweight and challenge to Kate’s concerns about big data. Latonero works at the intersection of data, tech and human rights, focusing on human trafficking. Human trafficking is common, and in severe cases, is a gross violation of human rights, sometimes involving indentured servitude or forced sex. It doesn’t have to involve transportation – he reminds us that human trafficking happens if someone is held against their will in Manhattan – and involves men, women, girls and boys.

His work has focused on the trafficking of girls and boys under 18 in the sex trade, a space where intervention is especially important as victims often experience severe psychological and physical trauma. (The children involved are also below the age of consent, which makes it easier in ethical terms – there are no considerations of whether a victim voluntarily chose to become a sex worker.)

Both victims and exploiters are using digital media, Mark tells us, if only mobile phones to stay in touch with family members. As a result, there are digital traces of trafficking behavior. Mark and colleagues are working to collect and analyze this data, including facial recognition as well as algorithmic pattern identification that could indicate situations of abuse. “It’s hard not to feel optimistic that this work could save a human life.”

But this work forces us to consider not only the promises of data and human rights, but the quagmires. This sort of work draws upon a kind of surveillance, and even watching that’s intended for a social good raises concerns about trust and control. “Gathering data in aggregate helps us monitor for human rights abuses, but intervention involves identifying and locating someone – a victim, or a perpetrator,” he explains. “Inevitably, there is a point where someone’s identity is revealed.” The question the human rights community has to constantly ask is “Is this worth it?”

Human rights work always involves data: data about humans, both about individual humans and aggregate data and statistics about groups of humans. At best, it’s a careful process relying on judgement calls made by human rights professionals. It’s worth asking whether it’s a process big data companies could help with. As we ask about the involvement of big data companies, we should ask about the balance between civil liberties risks and human rights benefits.

Despite those questions, the human rights community is moving head first into these spaces. Google Ideas, Palantir and Salesforce are assisting international human trafficking hotlines, analyzing massive data sets for patterns of behavior, hot spots where trafficking may be common. But all the questions we wrestle with when we consider big data – what are the biases in the data set? Whose privacy are we compromising and what are the consequences? – need to be considered in this space as well.

“Big data can provide answers, but not always the right ones,” Mark offers. One of the major issues for the collaboration between data scientists and human rights professionals is the need to work through issues of false positives and false negatives. Until we have a clearer sense of how we navigate these practical and ethical issues, it’s hard to know how to value initiatives like “data philanthropy”, where the private sector offers to share data for development or for protection of human rights.

There’s a growing community of data researchers who are able to bear witness to human rights violations. He shares Kate’s desire for an ethical framework, a way of balancing the risks and benefits. Is the appropriate model adopted from corporate social responsibility, which is primarily self-regulatory? Is it a more traditionally regulated model, based on pressure from NGOs, consumers and others? He references the “Necessary and Proportionate” document drafted by activists to demand limits to surveillance. If we could move towards an aspirational set of international principles on the use of big data to help human rights, we’d find ourselves in a proactive space, not playing catch up.

The session’s final speaker is Ramez Naam, a former Microsoft engineer who’s become a science fiction author. His talk, “Big Data 7000”, offers two predictions: big data will be big, and will cause big problems. The net effect is about the who, not the what, he offers. It’s about who has access to these technologies and who sets the policies for their use.

Ramez shows a snippet of DNA base pairs, a string of ATCGs on a screen. “This is someone’s genome, probably Craig Venter’s, and as promised, once we sequenced the genome, we ended all health problems, cracked ageing and conquered disease.” It turns out that genes are absurdly complex – they turn each other on and off in complex and unpredictable ways. “We can barely grok the behavior of half a dozen genes as a network.” To really understand the linkages between genes and disease, we’d need to collect lots more genetic data. Fortunately, the cost of gene sequencing is dropping much faster than Moore’s law, and there’s now the long-promised $1000 gene sequencer. But to really understand genes and disease, we’d need to collect behavioral and trait data about people whose genomes were sequenced – what was the person like, what diseases did they suffer, did they have high blood pressure, what was their IQ?

Personal monitoring tools like Fitbit generate lots of individual value, and potentially lots of societal value, by helping us understand what behavioral and diet interventions are most helpful. Will you get fitter on the paleo diet? Or will red meat kill you? Our data about behavior and health is so sparse that we don’t know which is true, despite one third of health spending going to weight loss and fitness programs and tools.

Is Nest a $3 billion distraction for Google? Or the first step towards the Google-powered smart electrical grid? Enormous financial and environmental benefits could come from a smart grid – if we could manipulate electrical usage we might be able to take thousands of “peaker” plants, plants that run for only a few hours a day, offline.

In almost any field, we can imagine situations where more data would be helpful. Education? Sure – if we had more rigorous understandings of what teaching techniques work and fail, what makes a good teacher and a poor one, could we potentially transform that critical field?

Ramez pivots to the problems. There will be accidental disclosures of data. He suggests we look at two stories involving Target, one where they accidentally revealed a daughter’s pregnancy to a distraught father by sending her coupons for baby supplies, and the recent leak where Target lost 70 million credit card numbers (including mine). It could have been worse, and it probably will be, Ramez argues. It could have been data about where you go, your SMS messages, your email – they will inevitably be released.

“The NSA is not the worst abuse of surveillance we’ve seen,” he points out. J. Edgar Hoover bugged Martin Luther King Jr’s hotel rooms with the approval of JFK and RFK, who were worried that MLK was a communist sympathizer. In the process, Hoover discovered that MLK was having an affair, and sent threatening letters to him promising to reveal the secret if MLK didn’t commit suicide. This is heinous abuse, of a severity not seen in the recent revelations. But if the current abuses are significantly more minor, their scale is massive, with millions of individuals potentially at risk of blackmail.

Still, what’s critical to consider is not the what, but the who. There are checks and balances between we the people, corporations and government, and there are conflicts between all of these. We vote within a democracy, Ramez argues, and we can vote with our feet and with our dollars. Sometimes corporations and governments are in collusion – sometimes they’re in conflict. Sometimes government does the right thing, as with the Church Committee, which investigated intelligence activities and helped curb abuses. We may need to consider the legacy of the Committee closely as we examine the current situation with the NSA.

There’s some hope. Ramez reminds us that “leaking is asymmetric.” As a result, conspiracies are hard, because it’s hard to keep secrets. “If you’re doing something heinous, it’s going to get out,” he says, and that’s a check.

His talk is called Big Data 7000 and he closes by imagining big data 7 millennia ago, showing an image of a clay tablet covered with cuneiform. “When the Sumerians began writing in linear A – that was a dystopian period of big data.” Writing wasn’t empowering to the little people, Ramez tells us – the use of written language created top-heavy, oppressive civilizations. It’s the model Orwell had in mind when he wrote 1984. That image of the control of technology in one mighty hand, not distributed, is at the root of our technological fears.

But technology can be liberating – the rise of the printing press put technology into many hands, allowing for the spread of subversive ideas including civil rights. The future of the net, he hopes, lies in the move from big data as something in the hands of the very few to data in the hands of the very many.

Hi, Ethan here again.

What I really appreciated about this panel was a move beyond rhetoric about big data that is purely at the extremes: Big data is the solution to all of life’s mysteries! Big data is an inevitable path to totalitarianism! What’s complicated about big data is that there’s both hype and hope, reasons to fear and reasons to celebrate.

The tensions Mark Latonero identifies between wanting surveillance to protect against human rights abuses, and wanting to protect human rights from surveillance, are ones that every responsible big data scientist needs to be exploring. I was surprised to find, both at this event and in a recent series of conversations at the Open Society Foundations, that these are tensions the human rights community is addressing head on, in part due to enthusiasm for the idea that better documentation of human rights abuses could lead to better interventions and prosecutions.

The smartest phrase I’ve heard about big data and ethics comes from my friend Sunil Abraham of the Centre for Internet and Society in Bangalore, who was involved with those conversations at OSF. He offers this formulation: “The more powerful you are, the more surveillance you should be subject to. The less powerful you are, the more surveillance you should be protected from.” In other words, it’s reasonable to demand transparency from elected officials and financial institutions while working to protect ordinary consumers or, especially, the vulnerable poor. Kate Crawford echoed this concern, tweeting a story by Virginia Eubanks that makes the case that surveillance is currently separate and unequal, more focused on welfare recipients and the working poor than on more privileged Americans.

There’s no shortcut to the hard conversation we need to have about big data and ethics, but the insights of these four scholars and those they cite are a great first step towards a richer, more nuanced and smarter conversation.

4 Responses to Data and its Discontents – notes and reflections from a panel at Microsoft Research Social Computing Symposium

“The more powerful you are, the more surveillance you should be subject to. The less powerful you are, the more surveillance you should be protected from.”

One question: when did ‘big data’ become synonymous with surveillance? The big data hype is mostly about the efficiency of translation, taking data that is problematic in one form, and transforming it to another in ways that save humans time. Surveillance is only one, admittedly popular, use case. The NSA is using many of the same tools for surveillance that the World Bank and UN are. Who’s more wrong for using them?

The point you make above about big data representing both ‘hype and hope’ alludes to the real problem here, which is conflation. To say surveillance and the manipulation of data on behalf of the poor and vulnerable is bad is to denounce not just the last 50 years of spying in the US, but also the last 50 years of philanthropic/humanitarian activity in developing countries, when the activities of humanitarian organizations couldn’t have been more removed from the local populations they were conducted for.

For them ‘big data’ might have been just trying to figure out what the heck all these foreigners in Range Rovers were doing on their farmlands. Or why all of a sudden their governments were no longer beholden to their local authorities but to foreign ones with whom they couldn’t converse. If the ‘big’ in big data is essentially meant to be analogous with ‘inaccessible to most’, then, if you think about it, for most of human history data (information) was inaccessible to most.

From that perspective, the data collection technologies in use today are exponentially more transparent and accessible to poor & local populations than those of the past. There are simply more ways for them to participate. There are more ways to learn about what is being used to monitor you, to monitor back, to evade, to disrupt.

Collecting data via short codes to be placed on crisis maps that were then analyzed without the permission of the poor has often been called a violation of privacy. But the code is often open source, the information collected is often released as open data, and the people collecting data are often members of the community the projects are meant to serve.

By no means does that make these new technologies accessible to all, but it is an improvement because it makes the conversation accessible to *more*.

This is a dramatic shift from how this same work was done not even 15 years ago. So, for me, focusing on ‘big data’ as a means for surveillance and violation of privacy is a privileged point of view. Privileged in that we get to pick and choose what we like about innovations that ultimately serve us without considering how they might serve others. Privileged in that it ignores the real disruption that has happened for people who were previously shut out of conversations that defined their individual lives and collective ways of life.

Ultimately, it’s impossible to blame tools themselves without invalidating uses of that tool both good and bad. As you point out, the good data professionals wrestle with such questions all the time.