How to Reconcile Big Data and Privacy

From the editors and reporters of Scientific American , this blog delivers commentary, opinion and analysis on the latest developments in science and technology and their influence on society and policy. From reasoned arguments and cultural critiques to personal and skeptical takes on interesting science news, you'll find a wide range of scientifically relevant insights here. Follow on Twitter @sciam.

Larry Greenemeier is the associate editor of technology for Scientific American, covering a variety of tech-related topics, including biotech, computers, military tech, nanotech and robots. Follow on Twitter @lggreenemeier.

In many ways “big data” and “encryption” are antithetical. The former involves harvesting, storing and analyzing information to reveal patterns that researchers, law enforcement and industry can use to their benefit. The goal of the latter is to obscure that data from prying eyes. That tension was at the core of a conference this week co-hosted by the White House Office of Science & Technology Policy and the Massachusetts Institute of Technology (M.I.T.), at which more than a dozen experts from academia, politics and industry explored ways encryption and other privacy-oriented technologies might protect the information involved in big data efforts.

Functional encryption is the way to go, said M.I.T. CSAIL professor Shafi Goldwasser during the panel. Alternatives such as anonymizing data records don’t work, she added. With so much information about people freely available on social networks and other public sites, anyone looking to do harm can build a profile of a target by cross-referencing any number of online resources.

If data is simply being stored, encryption works wonderfully, said Nickolai Zeldovich, an associate professor at the M.I.T. Computer Science and Artificial Intelligence Lab (CSAIL), during Monday’s “Privacy Enhancing Technologies” panel. The trouble comes when you actually need to process and analyze that data. That’s why there is a need for systems that can do practical processing of encrypted data, he added.

Such practical efforts generally refer to so-called “homomorphic” encryption, which makes it possible to perform computations on encrypted data without decrypting it first. Since the late 1970s, researchers have speculated that fully homomorphic encryption—in which protected information could be sliced and diced any number of ways without revealing the actual data—might be possible. Such systems would be a boon to cloud computing, providing a way to analyze information with minimal privacy risks to the people supplying that information.
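The basic property is easier to see in a partially homomorphic scheme than in a fully homomorphic one. As a minimal sketch, textbook RSA (with deliberately tiny, insecure parameters chosen here purely for illustration) is multiplicatively homomorphic: multiplying two ciphertexts yields the ciphertext of the product, so a server could compute that product without ever decrypting the inputs.

```python
# Toy illustration of a *partially* homomorphic property: textbook RSA
# ciphertexts can be multiplied without decryption. Tiny, insecure
# parameters for demonstration only -- not the fully homomorphic
# schemes discussed above, which also support addition.

p, q = 61, 53                       # tiny primes (insecure; illustration only)
n = p * q                           # public modulus
e = 17                              # public exponent
d = pow(e, -1, (p - 1) * (q - 1))   # private exponent (Python 3.8+)

def encrypt(m: int) -> int:
    return pow(m, e, n)

def decrypt(c: int) -> int:
    return pow(c, d, n)

a, b = 7, 6
product_cipher = (encrypt(a) * encrypt(b)) % n  # computed on ciphertexts only
assert decrypt(product_cipher) == a * b         # decrypts to 42
```

A fully homomorphic scheme, by contrast, supports both addition and multiplication on ciphertexts, which is enough to evaluate arbitrary computations.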

In practice, however, computer scientists have been unable to develop a way to perform more than a handful of meaningful operations on encrypted data. IBM claimed that computer scientist Craig Gentry had developed a practical, fully homomorphic system in 2009, but critics said the technology was too complex, slow and impractical for actual use in the cloud. IBM has since patented Gentry’s work and continues to develop it.

Efforts are underway to develop systems that, if not fully homomorphic, can at least work with encrypted data in novel ways.

One such project is CryptDB, a system that enables analysis of encrypted data by placing a proxy server between the software requesting the data and the database storing that encrypted data. The proxy uses algorithms designed to compare and analyze encrypted information. In some cases the proxy has to remove different layers of encryption to better analyze the data, but the idea is that it would never fully decrypt the data into plaintext. Despite the limited types of queries that CryptDB can perform, Google is a big supporter of the technology and uses it to provide encrypted queries in its cloud-based BigQuery service for searching massive datasets. Adding an extra piece of equipment like a proxy server in the search and retrieval process typically slows things down, but Zeldovich and his colleagues say they are making strides in alleviating that problem (pdf).
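One layer CryptDB can peel down to is deterministic encryption, under which equal plaintexts produce equal ciphertexts, so the database can answer equality queries (a `WHERE name = ?` clause, say) on ciphertexts alone. The sketch below illustrates that idea using a keyed hash (HMAC) as a stand-in for a deterministic cipher; the key name and data are hypothetical.

```python
import hashlib
import hmac

# Sketch of the idea behind a deterministic encryption layer: the proxy
# replaces values with keyed, deterministic tokens, so the database can
# match tokens for equality without ever seeing plaintext. HMAC here is
# a stand-in for a real deterministic cipher.

PROXY_KEY = b"proxy-secret-key"  # held by the proxy, never by the database

def det_token(value: str) -> str:
    return hmac.new(PROXY_KEY, value.encode(), hashlib.sha256).hexdigest()

# The proxy tokenizes rows before they reach the database.
stored = [det_token(name) for name in ["alice", "bob", "alice"]]

# The proxy rewrites "WHERE name = 'alice'" into a token comparison.
query_token = det_token("alice")
matches = sum(1 for t in stored if t == query_token)
assert matches == 2  # equality works on ciphertexts; values stay hidden
```

The trade-off is that deterministic tokens leak which rows are equal, which is why CryptDB keeps data under stronger encryption by default and only peels to this layer when a query requires it.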

The M.I.T. panelists also proposed security measures that don’t rely on encryption. Differential privacy, for example, is an alternative to anonymizing data. This approach uses an automated data curator that can protect the privacy of the individuals in a data set while providing useful information to the person requesting the data, said Salil Vadhan, Vicky Joseph Professor of Computer Science and Applied Mathematics at Harvard University. As noted in a December 2012 article on Scientific American’s Web site, “A differentially private data release algorithm allows researchers to ask practically any question about a database of sensitive information and provides answers that have been ‘blurred’ so that they reveal virtually nothing about any individual’s data—not even whether the individual was in the database in the first place.”
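The canonical way to produce such a “blurred” answer for a count query is the Laplace mechanism: add random noise scaled to how much one person can change the result. The following is a minimal sketch with made-up data; the function names and the choice of epsilon are illustrative, not taken from any particular system.

```python
import random

# Sketch of the Laplace mechanism: answer a count query ("how many
# records satisfy property X?") with noise scaled to the query's
# sensitivity, so the result reveals almost nothing about any one
# individual.

def laplace_noise(scale: float) -> float:
    # The difference of two independent exponentials is Laplace-distributed.
    return random.expovariate(1 / scale) - random.expovariate(1 / scale)

def private_count(records, predicate, epsilon: float = 0.5) -> float:
    true_count = sum(1 for r in records if predicate(r))
    sensitivity = 1  # adding/removing one person changes a count by at most 1
    return true_count + laplace_noise(sensitivity / epsilon)

ages = [34, 29, 41, 52, 60, 47]  # hypothetical sensitive records
noisy = private_count(ages, lambda a: a >= 40, epsilon=0.5)
# Typically close to the true count of 4, but randomized enough that the
# answer barely depends on whether any single person is in the data.
```

Smaller values of epsilon mean more noise and stronger privacy; larger values mean more accurate answers. That tuning knob is exactly the privacy-versus-utility trade-off the panel described.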

Another option is for engineers to code privacy-policy requirements directly into the software that collects, stores and analyzes data. Such “accountable systems” would be written to automatically analyze whether a particular use of data violates a law, said Daniel Weitzner, M.I.T. CSAIL principal research scientist. “Using an analogy, we can operate economies all over the world with a reasonably high degree of public trust,” he added. “We do this because we have a set of consistent rules applied in a consistent way. I think we ought to have a similar goal for the way information is used.”
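In spirit, an accountable system turns a usage rule into machine-checkable code and records a judgment for every data use. The sketch below is a deliberately simplified illustration of that idea; the rule table, data categories and purposes are all hypothetical, not drawn from any real policy engine.

```python
# Sketch of the "accountable systems" idea: a usage rule expressed as
# code, so each data use is automatically checked and logged. The rules,
# categories and purposes below are hypothetical illustrations.

RULES = {
    # data category -> purposes it may permissibly be used for (illustrative)
    "medical_history": {"treatment", "billing"},
    "location_trace": {"navigation"},
}

audit_log = []  # every use is recorded, allowed or not

def use_data(category: str, purpose: str) -> bool:
    allowed = purpose in RULES.get(category, set())
    audit_log.append((category, purpose, "allowed" if allowed else "violation"))
    return allowed

assert use_data("medical_history", "treatment") is True
assert use_data("medical_history", "marketing") is False  # flagged for audit
assert audit_log[-1][2] == "violation"
```

The audit log is the point: rather than trying to make misuse impossible, the system makes every use inspectable after the fact, mirroring the consistent, enforceable rules Weitzner described.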

Monday’s conference stemmed from Pres. Barack Obama’s call earlier this year for a comprehensive review of big data’s impact on Americans’ lives, livelihoods and relationship with the government. Obama tasked White House Counselor John Podesta with leading the review process, which will culminate a few months from now in a report expected to influence policy, funding and research related to big data.
