Sharing personal data anonymously with Crowd Blending Privacy

In our upcoming report on the Big Social we talk about giant stockpiles of personal data containing browsing logs, location data, purchase patterns, and social media data, and how combining these data sets can boost actionable analytics and maybe even predict future events. With all of these data sets containing personal information, the issue of personal privacy arises. A new mathematical technique developed at Cornell University could offer a way for large sets of personal data to be shared and analyzed while guaranteeing that no individual’s privacy will be compromised.

It’s all about data anonymity in this case. However, this is a sensitive issue. Remember Netflix and AOL, which both released supposedly “anonymized” data so that anyone could analyze it? Researchers found out pretty quickly that the data sets could be de-anonymized by cross-referencing them with data available elsewhere. One established way to address this is known as differential privacy. It typically requires adding noise to a data set, which makes that data set less useful.
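To make the noise idea concrete, here is a minimal sketch of the standard Laplace mechanism often used for differential privacy. The function name and parameters are illustrative, not from any of the systems mentioned above; it assumes a simple counting query, whose sensitivity is 1.

```python
import math
import random

def noisy_count(true_count, epsilon):
    """Return a differentially private count via the Laplace mechanism.

    A counting query changes by at most 1 when one record is added or
    removed (sensitivity 1), so adding noise drawn from Laplace(0, 1/epsilon)
    yields epsilon-differential privacy for that query.
    """
    scale = 1.0 / epsilon
    # Sample Laplace(0, scale) via the inverse-CDF method.
    u = random.random() - 0.5
    noise = -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return true_count + noise
```

The smaller `epsilon` is, the stronger the privacy guarantee and the larger the noise, which is exactly the utility trade-off the article refers to.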

The Cornell group proposes an alternative approach called crowd-blending privacy. This method limits how a data set can be analyzed, ensuring that any individual record is indistinguishable from a sizeable crowd of other records, and removes a record from the analysis if this cannot be guaranteed.
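A rough intuition for “blending with a crowd” can be sketched with a k-anonymity-style filter: drop any record whose quasi-identifying attributes are shared by fewer than k records. This is a simplification for illustration only, not the actual mechanism from the Cornell paper; the field names and threshold are hypothetical.

```python
from collections import Counter

def crowd_blend(records, quasi_ids, k):
    """Keep only records that blend with at least k records (including
    themselves) sharing the same quasi-identifier values; suppress the rest.

    records:   list of dicts
    quasi_ids: tuple of field names treated as quasi-identifiers
    k:         minimum crowd size required for a record to be released
    """
    key = lambda r: tuple(r[f] for f in quasi_ids)
    counts = Counter(key(r) for r in records)
    return [r for r in records if counts[key(r)] >= k]

# Hypothetical example: the lone record from zip 10001 cannot blend
# with a crowd of 3, so it is suppressed from the analysis.
people = [
    {"zip": "14850", "age_band": "30-39", "purchase": "book"},
    {"zip": "14850", "age_band": "30-39", "purchase": "music"},
    {"zip": "14850", "age_band": "30-39", "purchase": "film"},
    {"zip": "10001", "age_band": "50-59", "purchase": "book"},
]
released = crowd_blend(people, ("zip", "age_band"), k=3)
```

Note that the suppressed record is removed from the analysis output, not from the underlying data set, which is why the approach still depends on trusting whoever holds the raw data.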

“We want to make it possible for Facebook or the U.S. Census Bureau to analyze sensitive data without leaking information about individuals. We also have this other goal of utility; we want the analyst to learn something. (…) The hope is that because crowd-blending is a less strict privacy standard it will be possible to write algorithms that will satisfy it and it could open up new uses for data”, says Michael Hay, who was involved with creating the technique while a research fellow at Cornell.

Analysts might favor this new approach because it does not require adding noise to a data set. The researchers also showed that crowd-blending comes close to matching the statistical guarantees of differential privacy. Consumer privacy benefits too: successfully anonymized data would let Facebook and other data brokers act on (sell, buy, share) data sets without putting our personal data at risk.

In the next few weeks we are going to further explore this ‘intersection’ of big data and privacy. If you have any thoughts on the subject, please leave a comment.

I think this technique is similar to the one applied to company employee surveys: results are only reported over groups of at least x people, never for an individual. Still, if everyone in a certain group/department rates their manager a 1, the manager still knows how you voted. It’s only when at least one person votes differently that you have some anonymity. Crowd blending sounds cool, but it assumes a certain trust in the technology and in how people will use it (since the real data is still in the data set). Also, I got lost in the statistical reasoning, but I’m not yet convinced that a truly evil attacker couldn’t run many different queries and patch together the original data set, or something very close to it.