Big Data is our generation’s civil rights issue, and we don’t know it

Data doesn’t invade people’s lives. Lack of control over how it’s used does.

What’s really driving so-called Big Data isn’t the volume of information. It turns out Big Data doesn’t have to be all that big. Rather, it’s about a reconsideration of the fundamental economics of analyzing data.

For decades, there’s been a fundamental tension between three attributes of databases. You can have the data fast; you can have it big; or you can have it varied. The catch is, you can’t have all three at once.

I’d first heard this as the “three V’s of data”: Volume, Variety, and Velocity. Traditionally, getting two was easy but getting three was very, very, very expensive.

The advent of clouds, platforms like Hadoop, and the inexorable march of Moore’s Law mean that now, analyzing data is trivially inexpensive. And when things become so cheap that they’re practically free, big changes happen—just look at the advent of steam power, or the copying of digital music, or the rise of home printing. Abundance replaces scarcity, and we invent new business models.

In the old, data-is-scarce model, companies had to decide what to collect first, and then collect it. A traditional enterprise data warehouse might have tracked sales of widgets by color, region, and size. This act of deciding what to store and how to store it is called designing the schema, and in many ways, it’s the moment where someone decides what the data is about. It’s the instant of context.
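That “instant of context” is easy to see in code. Here’s a minimal, hypothetical sketch of the widget example above as a schema-on-write table (the table and column names are my own invention, not from any real warehouse):

```python
import sqlite3

# Schema-on-write: the structure is decided before any data arrives.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE widget_sales (
        color  TEXT,
        region TEXT,
        size   TEXT,
        units  INTEGER
    )
""")

# Anything that doesn't fit these columns is simply never recorded --
# the schema has already decided what the data is "about".
conn.execute("INSERT INTO widget_sales VALUES ('red', 'EMEA', 'large', 42)")
total = conn.execute("SELECT SUM(units) FROM widget_sales").fetchone()[0]
print(total)  # 42
```

The point is that the decision about meaning happens at `CREATE TABLE` time, long before any question is asked.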

That needs repeating:

You decide what data is about the moment you define its schema.

With the new, data-is-abundant model, we collect first and ask questions later. The schema comes after the collection. Indeed, Big Data success stories like Splunk, Palantir, and others are prized because of their ability to make sense of content well after it’s been collected—sometimes called a schema-less query. This means we collect information long before we decide what it’s for.
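By contrast, here’s a toy sketch of collect-first, ask-later: raw events are stored with no agreed structure, and the “schema” exists only inside a query written long after collection (the event shapes and field names are invented for illustration):

```python
import json

# Collect first: raw events are stored as-is, with no agreed structure.
raw_events = [
    '{"user": "a", "song": "x", "ms_played": 30000}',
    '{"user": "b", "page": "/pricing"}',            # a different shape entirely
    '{"user": "a", "song": "y", "ms_played": 5000}',
]

# Ask questions later: only now do we decide that lines with a "song"
# field mean "listening". That decision is the schema.
plays_per_user = {}
for line in raw_events:
    event = json.loads(line)
    if "song" in event:
        plays_per_user[event["user"]] = plays_per_user.get(event["user"], 0) + 1

print(plays_per_user)  # {'a': 2}
```

Nothing stops a later query from reinterpreting the same raw events for an entirely different purpose, which is exactly the danger the next paragraph raises.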

And this is a dangerous thing.

When bank managers tried to restrict loans to residents of certain areas (a practice known as redlining), Congress stepped in to stop it with the Fair Housing Act of 1968. Legislators were able to outlaw the discrimination, making it illegal to change loan policy based on someone’s race.

“Personalization” is another word for discrimination. We’re not discriminating if we tailor things to you based on what we know about you—right? That’s just better service.

In one case, American Express used purchase history to adjust a customer’s credit limit based on where he shopped, despite his excellent credit history.

Johnson says his jaw dropped when he read one of the reasons American Express gave for lowering his credit limit: “Other customers who have used their card at establishments where you recently shopped have a poor repayment history with American Express.”

We’re great at using taste to predict things about people. OkCupid’s 2010 blog post “The Real Stuff White People Like” showed just how easily we can use information to guess at race. It’s a real eye-opener (and the guys who wrote it didn’t include everything they learned; some of it was a bit too controversial). They simply looked at the words one group used which others didn’t often use. The result was a list of “trigger” words for a particular race or gender.

Now run this backwards. If I know you like these things, or see you mention them in blog posts, on Facebook, or in tweets, then there’s a good chance I know your gender and your race, and maybe even your religion and your sexual orientation. And that I can personalize my marketing efforts towards you.
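To make the mechanics concrete, here’s a toy sketch of the trigger-word idea, with invented words and groups rather than anything from the actual OkCupid data: find the words one group uses that the other doesn’t, then “run it backwards” on an unlabeled post:

```python
from collections import Counter

# Toy corpora standing in for the real analysis; the words and groups
# here are made up for illustration only.
group_a_posts = ["camping hiking craft beer", "hiking banjo camping"]
group_b_posts = ["soul food gospel", "gospel soul mixtape"]

def word_counts(posts):
    return Counter(w for p in posts for w in p.split())

a, b = word_counts(group_a_posts), word_counts(group_b_posts)

# "Trigger" words: used by one group and never by the other.
triggers_a = {w for w in a if b[w] == 0}
triggers_b = {w for w in b if a[w] == 0}

def guess_group(text):
    """Run the analysis backwards: infer a group from an unlabeled post."""
    words = set(text.split())
    score = len(words & triggers_a) - len(words & triggers_b)
    return "A" if score > 0 else "B" if score < 0 else "unknown"

print(guess_group("went camping with my banjo"))  # A
```

The real analysis used far more data and a more careful relative-frequency measure, but even this crude version shows how little text is needed before taste starts leaking demographics.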

That makes it a civil rights issue.

If I collect information on the music you listen to, you might assume I will use that data in order to suggest new songs, or share it with your friends. But instead, I could use it to guess at your racial background. And then I could use that data to deny you a loan.

Want another example? Check out Private Data In Public Ways, something I wrote a few months ago after seeing a talk at Big Data London, which discusses how publicly available last-name information can be used to generate racial boundary maps.

This TED talk by Malte Spitz does a great job of explaining the challenges of tracking citizens today, and he speculates about whether the Berlin Wall would ever have come down if the Stasi had access to phone records in the way today’s governments do.

So how do we regulate the way data is used?

The only way to deal with this properly is to somehow link what the data is with how it can be used. I might, for example, say that my musical tastes should be used for song recommendation, but not for banking decisions.
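One way to picture that link is data that carries its own permitted uses, so every read must declare a purpose. This is only a sketch of the idea, not a real mechanism; the class and purpose names are hypothetical:

```python
# Purpose-bound data: every record carries the uses its owner consented
# to, and every read must declare what it's for.

class PurposeError(PermissionError):
    pass

class TaggedRecord:
    def __init__(self, value, allowed_purposes):
        self._value = value
        self._allowed = set(allowed_purposes)

    def read(self, purpose):
        # Reads for an unconsented purpose fail loudly.
        if purpose not in self._allowed:
            raise PurposeError(f"'{purpose}' is not a permitted use")
        return self._value

# My musical tastes, tagged for recommendation only.
music_tastes = TaggedRecord(["jazz", "soul"], allowed_purposes={"recommendation"})

print(music_tastes.read("recommendation"))  # ['jazz', 'soul']
try:
    music_tastes.read("credit_scoring")
except PurposeError as e:
    print(e)  # 'credit_scoring' is not a permitted use
```

Of course, nothing in plain code stops a holder from ignoring the tag; enforcing it is exactly the encryption-or-legislation problem the next paragraph describes.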

Tying data to permissions can be done through encryption, which is slow, riddled with DRM, burdensome, hard to implement, and bad for innovation. Or it can be done through legislation, which has about as much chance of success as regulating spam: it feels great, but it’s damned hard to enforce.

But governments need to balance reliance on data with checks and balances about how this reliance erodes privacy and creates civil and moral issues we haven’t thought through. It’s something that most of the electorate isn’t thinking about, and yet it affects every purchase they make.

This should be fun.

47 Comments

Very good work. Your analysis of the unforeseen risks of Big Data is insightful and thorough.

However, what would you say to the criticism that you are seeing lions in the darkness? In other words, the risk of abuse certainly exists, but until we see a clear case of Big Data enabling and fueling discrimination, how do we know there is a real threat worth fighting? Your argument could be seen as the Techno-Moral equivalent to Iraq’s WMDs.

I am just curious what your response is. What is your argument that we really ought to be worried enough to act now?

[…] Alistair Croll recently argued that Big Data is this generation’s civil rights issue. He explains, “In the old, data-is-scarce model, companies had to decide what to collect first, and then collect it. A traditional enterprise data warehouse might have tracked sales of widgets by color, region, and size. This act of deciding what to store and how to store it is called designing the schema, and in many ways, it’s the moment where someone decides what the data is about. It’s the instant of context. That needs repeating: You decide what data is about the moment you define its schema.” […]

Excellent points here. Treating this as a civil (and moral) issue is an inspired way to build safeguards against the types of inferences that can come out of poor use of big data.

To Harry: The benefit I see here is raising awareness–both to help individuals better understand their digital footprints, but also to share ideas to help shape societal norms about use of data. Big data can give fantastic insight with proper methodology and good motivations. It can also further fuel confirmation bias about questionable motivations. Personally identifiable information can be abused–whether identity theft, or inferences drawn about people based on activity on Facebook, etc. Big data just accelerates it, and enables additional indirect inferences.

The indirect part reminds me of some of the underlying aspects of the financial issues caused by Bernie Madoff, or the housing/financial crisis enabled through predatory mortgages that became toxic investments. It’s easy to think everything is legit from what you see one or two steps from you…and also easy to assume everything beyond that is operating with the same expectations you have, even if it may not be.

While I agree with much of the article, I do take issue with the point made right at the start, i.e. “You can have the data fast; you can have it big; or you can have it varied. The catch is, you can’t have all three at once.” Yes you can. And you’ve been able to for really quite a long time. That’s what Teradata is for. And has been for 30 years.

My point was that these three things equal a constant, and it is the constant that’s changing dramatically (a few sentences later I used the three Vs again, this time as “very, very, very expensive”).

The point here is that the upfront investment in, say, a Hadoop cluster as a service, paid for by the drink, significantly lowers the barriers to entry when compared to the data warehouses and BI tools of the past. And this very accessibility is what makes plenty of groups who might not otherwise expend the effort suddenly leverage all the spare data lying around to make decisions—sometimes not fair ones.

[…] Big Data is Our Generation’s Civil Rights Issue, and We Don’t Know It (Solve for Interesting) – Every new technology can be a force for good AND a force for evil – this article discusses some old civil rights issues through the lens of the data explosion… […]

[…] say data, and how its aggregated and used to shape opportunities and experiences differentially is the civil rights issue of our day. Education research is an example of an area where the consequences of this can be disastrous if all […]

Time is a big cure for many things. Computing is growing so fast that the three ‘V’s will eventually be conquered. As for the quantification of people as data, and the use of that data to limit their abilities, that is already being done. People over 50 can no longer enter the fast track to advancement, and medical conditions blindly limit your prospects further. Interest rates are applied at too high a rate, with penalties that stifle all possible advancement. The data collected by the credit companies (basically your credit worthiness) is used only to make more profit for the credit cards, not to gauge your ability or likelihood of paying them back. Insurance is based blindly on age and past accidents, not on the ability of the driver, just incidence. But the future will soon offer information to the common man that will eliminate the advantages of many a company. Unfair tactics will be broadcast so that the common man can avoid those companies, severely handicapping them. Information on gas stations is already being published so that consumers can avoid the high-priced ones. This will escalate, and companies will have to watch their p’s and q’s. Make no bones about it: with information, it is a two-way street. Right now the consumer has the disadvantage, but with tight enough communications, a network of citizens can put any single business out of business.

[…] You can read Fertik’s piece at Scientific American and McDonald’s piece at PBS Idea Lab — they are this week’s recommended reads. You might also be interested in Alistair Croll’s related posts: “Thin walls and traffic cameras,” “New ethics for a new world” and “Big data is our generation’s civil rights issue, and we don’t know it.” […]

I think one thing that bears mentioning is the potential harm the abuses could do to the legitimate and useful uses for Big Data. It won’t be long before tools for defeating these types of data collection come along, and there will be the inevitable tug-of-war between companies that want to use/sell their data stores, and people that don’t want to be profiled based on the internet and social media history. That battle will muddy the waters for more principled and altruistic applications of big data (e.g. city planning, demographic science, journalism) and in some cases, thwart their development.
I think a relatively simple (if simplistic) patch is to make a major push to anonymize the data points during collection. Sure, with effort, someone can reconstruct the identity of a person from the fragmentary records, but anonymization makes it much harder to use the data against someone personally. Meanwhile, the valuable aspect of big data, the correlations and relationships stemming from a single node, can be kept intact; you just won’t know the identity of the node.

[…] The tougher question is what we do about predictive analytics…the kind that show that when X and Y happen, Z is n% likely to occur as well. That’s fine when it delivers security or health, but what about when it indicates that someone is likely to do something bad but hasn’t yet? Is that fair? How far is far enough when the issue is the individual versus the collective good? Big Data will certainly redefine long-held expectations about civil rights. […]

[…] out of a demographic. And this goes back to another particularly brilliant point that Croll makes: the use of Big Data to define our target audience creates a Civil Rights issue. Because there’s a very thin line between offering something to people who are statistically […]

Big data was and is used by certain US state governments to draw voting districts, resulting in clear violations of the principle of “one man, one vote” and effectively nullifying the right to vote for millions. If that is not a civil rights issue, I do not know what is.

[…] on medication side effects How Open Data Can Reveal—And Correct—The Faults In Our Health System Big Data is our Generation’s Civil Rights Issue, and We Don’t Know It. From the Forum Creating Scales for Quantifying Action Sharing Anonymized […]

Related Posts

Last night, I attended a fascinating series of presentations on data visualization as part of London’s Big Data Week. Put on by a number of tech evangelists and companies around the world, it’s part of a global series of talks and hackathons focusing on data-driven innovation. In the final session, Dr. James Cheshire talked about… Read more