The technical aspects of privacy

The first of three public workshops kicked off a conversation with the federal government on data privacy in the US.

Thrust into controversy by Edward Snowden’s first revelations last year, President Obama belatedly welcomed a “conversation” about privacy. As cynical as you may feel about US spying, that conversation with the federal government has now begun. In particular, the first of three public workshops took place Monday at MIT.

Given the locale, a focus on the technical aspects of privacy was appropriate for this discussion. Speakers cheered about the value of data (invoking the “big data” buzzword often), delineated the trade-offs between accumulating useful data and preserving privacy, and introduced technologies that could analyze encrypted data without revealing facts about individuals. Two more workshops will be held in other cities, one focusing on ethics and the other on law.

A narrow horizon for privacy

Having a foot in the hacker community and hearing news all the time about new technical assaults on individual autonomy, I found the circumscribed scope of the conference disappointing. The consensus on stage was that the collection of personal information was toothpaste out of the tube, and that all we could do in response was promote oral hygiene. Much of the discussion accepted the conventional view that deriving value from data has to play tug of war with privacy protection. But some speakers countered with the hope that technology could produce a happy marriage between the rivals of data analysis and personal data protection.

No one recognized that people might manage their own data and share it at their discretion, an ideal pursued by the Vendor Relationship Management movement and many health care reformers. As an audience member pointed out, no one on stage addressed technologies that prevent the collection of personal data, such as Tor onion routing (which was sponsored by the US Navy).

Although speakers recognized that data analysis could disadvantage individuals, either through errors or through efforts to control us, they barely touched on the effects of analysis on groups.

Finally, while the Internet of Things was mentioned in passing and the difficulty of preserving privacy in an age of social networking was mentioned, speakers did not emphasize the explosion of information that will flood the Internet over the upcoming few years. This changes the context for personal data, both in its power to improve life and its power to hurt us.

One panelist warned that the data being collected about us increasingly doesn’t come directly from us. I think that’s not yet true, but soon it may be. The Boston Globe just reported that a vast network of vehicle surveillance is run by private industry, unfettered by the Fourth Amendment or discrimination laws (and providing police with their data). If people can be identified by the way they walk, privacy may well become an obsolete notion. But I’m not ready to give up yet on data collection.

In any case, I felt honored to hear and interact with the impressive roster of experts and the well-informed audience members who showed up on Monday. Just seeing Carol Rose of the Massachusetts ACLU sit next to John DeLong of the NSA would be worth a trip downtown. A full house was expected, but a winter storm kept many potential attendees stuck in Washington, DC, or other points south of Boston.

Questions the government is asking itself, and us

John Podesta, a key adviser to the Clinton and Obama administrations, addressed us by phone after the winter storm grounded his flight. He referred to the major speech delivered by President Obama on January 17 of this year, and said that he was leading a working group formed afterward to promote an “open, interoperable, secure, and reliable Internet.”

It would be simplistic, however, to attribute administration interest in privacy to the flak emerging from the Snowden revelations. The government has been trying to cajole industries to upgrade security for years, and launched a cybersecurity plan at the same time as Podesta’s group. Federal agencies have also been concerned for some time with promoting more online collaboration and protecting the privacy of participants, notably in the National Strategy for Trusted Identities in Cyberspace (NSTIC) run by the National Institute of Standards and Technology (NIST). (Readers interested in the national approach to identity can find Alexander Howard’s analysis on Radar.)

Yes, I know, these were the same folks who passed NSA mischief on to standards committees, seriously weakening some encryption mechanisms. These incidents can remind us that the government is a large institution pursuing different and sometimes conflicting goals. We don’t have to withdraw from engagement with it on that account and stop pressing our values and issues.

The relationship between privacy and identity may not be immediately clear, but a serious look at one must involve the other. This understanding underlies a series I wrote on identity.

Threats to our autonomy don’t end with government snooping. Industries want to know our buying habits and insurers want to know our hazards. MIT professor Sam Madden said that data from the sensors on cell phones can reveal when automobile drivers make dangerous maneuvers. He also said that the riskiest group of drivers (young males) reduce risky maneuvers up to 78% if they know they’re being monitored. How do you feel about this? Are you viscerally repelled by such move-by-move snooping? What if your own insurance costs went down and there were fewer fatalities on the highways?

But there is no bright line dividing government from business. Many commenters complained that large Internet businesses shared user data they had collected with the NSA. I have pointed out that the concentration of Internet infrastructure made government surveillance possible.

Revelations that the NSA collected data related to international trade, even though there’s no current evidence it is affecting negotiations, make one wonder whether government spies have cited terrorism as an excuse for pursuing other goals of interest to businesses, particularly given that we were tapping the phone calls of leaders of allies such as Germany and Brazil.

Podesta said it might be time to revisit the Fair Information Practices that have guided laws in both the US and many other countries for decades. (The Electronic Privacy Information Center has a nice summary of these principles.)

Podesta also identified a major challenge to our current legal understanding of privacy: the shift from predicated searching to non-predicated or pattern searching. This jargon can be understood as follows: searching for a predicate can be a simple database query to verify a relationship you expect to find, such as whether people who reserve hotel rooms also reserve rental cars. A non-predicated search would turn up totally unanticipated relationships, such as the famous incident where a retailer revealed a customer’s pregnancy.
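The distinction can be made concrete with a toy example (the table and values here are invented for illustration, not from the workshop): a predicated search verifies one hypothesized relationship, while a non-predicated search scans for whatever patterns happen to exist.

```python
# Toy contrast between predicated and non-predicated searching, using an
# in-memory SQLite table of invented customer reservations.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reservations (customer TEXT, item TEXT)")
conn.executemany("INSERT INTO reservations VALUES (?, ?)", [
    ("alice", "hotel"), ("alice", "rental_car"),
    ("bob",   "hotel"), ("bob",   "rental_car"),
    ("carol", "hotel"),
    ("dan",   "flight"), ("dan",   "rental_car"),
])

# Predicated search: verify a relationship we already hypothesize --
# do customers who reserve hotels also reserve rental cars?
both = conn.execute("""
    SELECT COUNT(DISTINCT a.customer)
    FROM reservations a JOIN reservations b ON a.customer = b.customer
    WHERE a.item = 'hotel' AND b.item = 'rental_car'
""").fetchone()[0]

# Non-predicated (pattern) search: scan for *any* co-occurring pair of
# items, with no hypothesis stated in advance.
pairs = conn.execute("""
    SELECT a.item, b.item, COUNT(*) AS n
    FROM reservations a JOIN reservations b
      ON a.customer = b.customer AND a.item < b.item
    GROUP BY a.item, b.item ORDER BY n DESC
""").fetchall()

print(both)   # number of customers matching the predicate
print(pairs)  # every co-occurring pair, including unanticipated ones
```

The second query is how unanticipated relationships, like the retailer’s pregnancy inference, surface: nobody asked a specific question; the patterns simply fell out of the data.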

Podesta asked us to consider what’s different about big data, what business models are based on big data, what uses there are for big data, and whether we need research on privacy protection during analytics. Finally, he promised a report on law enforcement about three months from now.

Later in the day, US Secretary of Commerce Penny Pritzker offered some further questions: What principles of trust do businesses have to adopt? How can privacy in data be improved? How can we be more accountable and transparent? How can consumers understand what they are sharing and with whom? How can government and business reduce the unanticipated harm caused by big data?

Incentives and temptations

The morning panel trumpeted the value of data analysis, while acknowledging privacy concerns. Panelists came from medicine, genetic research, the field of transportation, and education. Their excitement over the value of data was so infectious that Shafi Goldwasser of the MIT Computer Science and Artificial Intelligence Laboratory later joked that it made her want to say, “Take my data!”

I think an agenda lay behind the choice of a panel dangling before us an appealing future in which we can avoid cruising for parking spots, make better use of college courses, and even cure disease through data sharing. In contrast, the people who snoop on social networking sites in order to withdraw insurance coverage were not on the panel, and would have had a harder time justifying their use of data. Their presence would have highlighted the deceptive enticements of data snooping.

Big data offers amazing possibilities in the aggregate. Statistics can establish relationships among large populations that unveil useful advice to individuals. But judging each individual by principles established through data analysis is pure prejudice. It leads to such abuses as labeling a student dissolute because he posts a picture of himself at a party, or withdrawing disability insurance from someone who dares to boast of his capabilities on a social network.

Having our cake

Can technology save us from a world where our most intimate secrets are laid at the feet of large businesses? A panel on privacy-enhancing techniques suggested it might.

Data analysis without personal revelations is the goal; the core techniques behind it are algorithms that compute useful results from encrypted data. In principle, encrypted data should be indistinguishable from random noise; traditionally, it would defeat the point of encryption if any information at all could be derived from it. But the new technologies relax this absolute randomness to allow someone to search for values, compute a sum, or do more complex calculations on encrypted values.

Goldwasser characterized this goal as extracting data without seeing it. For instance, what if we could determine whether any faces in a surveillance photo match suspects in a database, without identifying the innocent people in the photo? What if we could uncover evidence of financial turmoil from the portfolios of stockholders without knowing what is held by each stockholder?

Nickolai Zeldovich introduced his CryptDB research, which is used by Google for encrypted queries in BigQuery. CryptDB ensures that any value will be represented by the same encrypted value everywhere it appears in a field, and can also support some aggregate functions. This means you can request the sum of values in a field and get the right answer without having access to any individual values. Different layers of protection can be chosen, each trading off functionality for security to a different degree.
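As a rough illustration of the equality-preserving layer, consider this sketch; it uses an HMAC as a stand-in keyed deterministic function, not CryptDB’s actual onion encryption, and omits key management entirely.

```python
# Deterministic "encryption" demo: the same plaintext always maps to the
# same tag, so an untrusted server can match, filter, and group values it
# cannot read. HMAC-SHA256 with a client-held key is a stand-in here for
# CryptDB's real deterministic-encryption layer.
import hashlib
import hmac
from collections import Counter

KEY = b"client-side secret"  # stays on the trusted client

def det_encrypt(value: str) -> str:
    return hmac.new(KEY, value.encode(), hashlib.sha256).hexdigest()

# The client encrypts a column before handing it to the server.
cities = ["Boston", "Cambridge", "Boston", "Somerville", "Boston"]
encrypted_column = [det_encrypt(c) for c in cities]

# The server can count duplicates without learning a single city name.
counts = Counter(encrypted_column)
assert counts[det_encrypt("Boston")] == 3
```

Note the trade-off: revealing which rows are equal is itself a small leak, which is one reason CryptDB offers layers that trade functionality for security to different degrees.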

MIT professor Vinod Vaikuntanathan introduced homomorphic encryption, which produces an encrypted result from encrypted data, allowing the user to get the result without seeing any of the input data. This is one of the few cutting-edge ideas introduced at the workshop. Although homomorphic encryption was suggested back in 1978, no one could figure out how to make it work until 2009, and viable implementations such as HElib and HCrypt emerged only recently.
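The additive flavor of the idea can be demonstrated with a minimal Paillier cryptosystem, in which multiplying two ciphertexts yields an encryption of the sum of the plaintexts. (Paillier is additively homomorphic rather than fully homomorphic, and the tiny primes below are demo-only.)

```python
# Minimal Paillier cryptosystem: multiplying ciphertexts adds plaintexts,
# so a server can total encrypted values without decrypting any of them.
# Demo-sized primes; real use needs ~2048-bit primes and a vetted library.
import math
import random

p, q = 293, 433                 # toy primes
n, n2 = p * q, (p * q) ** 2
g = n + 1
lam = math.lcm(p - 1, q - 1)
mu = pow(lam, -1, n)            # valid because g = n + 1

def encrypt(m: int) -> int:
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c: int) -> int:
    x = pow(c, lam, n2)
    return ((x - 1) // n) * mu % n

c1, c2 = encrypt(40), encrypt(2)
assert decrypt((c1 * c2) % n2) == 42   # the sum, computed under encryption
```

Fully homomorphic schemes go further, supporting both addition and multiplication on ciphertexts, which is what makes arbitrary computation on encrypted data possible.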

The white horse that most speakers wanted to ride is “differential privacy,” an unintuitive term that comes from a formal definition of privacy protection: any result returned from a query would be substantially the same whether or not you were represented by a record in that data. When differential privacy is in place, nobody can re-identify your record or even know whether you exist in the database, no matter how much prior knowledge they have about you. A related term is “synthetic data sets,” which refers to the practice of offering data sets that are scrambled and muddied by random noise. These data sets are carefully designed so that queries can produce the right answer (for instance, “how many members are male and smoke but don’t have cancer?”), but no row of data corresponds to a real person.
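The standard mechanism behind differential privacy is simple enough to sketch: add noise drawn from a Laplace distribution, scaled to the query’s sensitivity, to the true answer. The records and epsilon value below are invented for illustration.

```python
# Laplace mechanism sketch: a counting query has sensitivity 1 (one
# person's presence changes the count by at most 1), so adding Laplace
# noise of scale 1/epsilon makes the released answer nearly as likely
# with your record as without it.
import math
import random

def laplace_noise(scale: float) -> float:
    # Inverse-CDF sampling from Laplace(0, scale).
    u = random.random() - 0.5
    sign = 1.0 if u >= 0 else -1.0
    return -scale * sign * math.log(1.0 - 2.0 * abs(u))

def private_count(records, predicate, epsilon=0.5):
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)

people = ([{"male": True, "smokes": True, "cancer": False}] * 40 +
          [{"male": True, "smokes": True, "cancer": True}] * 10)

# "How many members are male and smoke but don't have cancer?"
answer = private_count(people,
                       lambda r: r["male"] and r["smokes"] and not r["cancer"])
# `answer` hovers near 40 but is randomized, so nobody can tell whether
# any particular person's record was present.
```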

Cynthia Dwork, a distinguished scientist at Microsoft Research and one of the innovators in differential privacy, presented an overview that was fleshed out by Harvard professor Salil Vadhan. He pointed out that such databases make it unnecessary for a privacy expert to approve each release of data because even a user with special knowledge of a person can’t re-identify him.

These secure database queries offer another level of protection: checking the exact queries that people run. Vaikuntanathan indicated that homomorphic encryption would be complemented by a functional certification service, which is a kind of mediator that accepts queries from users. The server would check a certificate to ensure the user has the right to issue that particular query before carrying it out on the database.
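One hypothetical way such a mediator could work (the design below is my own sketch, not Vaikuntanathan’s proposal, with an HMAC standing in for a real signature scheme and all names invented):

```python
# Hypothetical query-certification mediator: before running a query on
# the encrypted database, it checks a certificate proving the user was
# granted that exact query.
import hashlib
import hmac

AUTHORITY_KEY = b"certification-authority-secret"

def issue_certificate(user: str, query: str) -> str:
    # Done once by the certifying authority, off the query path.
    return hmac.new(AUTHORITY_KEY, f"{user}|{query}".encode(),
                    hashlib.sha256).hexdigest()

def mediator_run(user, query, certificate, run_query):
    expected = issue_certificate(user, query)
    if not hmac.compare_digest(expected, certificate):
        raise PermissionError(f"{user} is not certified to run this query")
    return run_query(query)   # only certified queries reach the database

cert = issue_certificate("analyst", "SUM(salary)")
result = mediator_run("analyst", "SUM(salary)", cert,
                      run_query=lambda q: 123)  # stand-in encrypted backend
```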

The ongoing threat to these technologies is the possibility of chipping away at privacy by submitting many queries, possibly on multiple data sets, that could cumulatively isolate the information on a particular person. Other challenges include:

They depend on data sets big enough to hide individual differences. The bigger the data, the less noise has to be introduced to hide differences. In contrast, small data sets can’t be protected well.

They don’t protect the rights of a whole group.

Because they hide individuals, they can’t be used by law enforcement or similar users to target those individuals.

The use of these techniques will also require changes to laws and regulations that make assumptions based on current encryption methods.

Technology lawyer Daniel Weitzner wrapped up the panel on technologies by describing technologies that promote information accountability: determining through computational monitoring how data is used and whether a use of data complies with laws and regulations.

There are several steps to information accountability:

First, a law or regulation has to be represented by a “policy language” that a program can interpret.

The program has to run over logs of data accesses and check each one against the policy language.

Finally, the program must present results with messages a user can understand. Weitzner pointed out that most users want to do the right thing and want to comply with the law, so the message must help them do that.
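The three steps above can be sketched in miniature; the policy “language” here is just declarative rules as data, and every rule, role, and field name is illustrative.

```python
# Miniature information-accountability checker: rules-as-data stand in
# for a policy language, a checker runs over an access log, and verdicts
# are reported in messages a user can act on.
POLICY = [
    {"role": "researcher", "purpose": "treatment-study", "allowed": True},
    {"role": "marketing",  "purpose": "*",               "allowed": False},
]

ACCESS_LOG = [
    {"user": "dr_lee", "role": "researcher", "purpose": "treatment-study"},
    {"user": "promo1", "role": "marketing",  "purpose": "ad-targeting"},
    {"user": "aud_03", "role": "auditor",    "purpose": "annual-audit"},
]

def check(entry):
    for rule in POLICY:
        if (rule["role"] == entry["role"]
                and rule["purpose"] in ("*", entry["purpose"])):
            return rule["allowed"]
    return None  # the policy is incomplete: no rule covers this access

for entry in ACCESS_LOG:
    verdict = check(entry)
    if verdict is True:
        print(f"{entry['user']}: access complies with policy")
    elif verdict is False:
        print(f"{entry['user']}: access violates policy; please review")
    else:
        print(f"{entry['user']}: no applicable rule; needs human judgment")
```

The `None` branch reflects the point that the language must tolerate incompleteness: the law doesn’t always yield a yes-or-no answer.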

Challenges include making a policy language expressive enough to represent the law without becoming too complex to compute over. The language must also tolerate incompleteness and inconsistency, because laws don’t always provide complete answers.

The last panel of the day considered some amusing and thought-provoking hypothetical cases in data mining. Several panelists dismissed the possibility of restricting data collection but called for more transparency in its use: we should know what data is being collected and who is getting it. One panelist mentioned Deborah Estrin, who calls for companies to give us access to “data about me.” Discarding data after a fixed period of time can also protect us, and is particularly appealing because old data is often of no use in new environments.

Weitzner held out hope on the legal front. He suggested that when President Obama announced a review of the much-criticized Section 215 of the Patriot Act, he was issuing a subtle message that the Fourth Amendment would get more consideration. Rose said that revelations about the power of metadata prove that it’s time to strengthen legal protections and force law enforcement and judges to treat metadata like data.

Privacy and dignity

To me, Weitzner validated his role as conference organizer by grounding discussion on basic principles. He asserted that privacy means letting certain people handle data without allowing other people to do so.

I interpret that statement as a protest against notorious court rulings on “expectations of privacy.” According to US legal doctrine, we cannot put any limits on government access to our email messages or to data about whom we phoned because we shared that data with the companies handling our email and phone calls. This is like people who hear that a woman was assaulted and say, “The way she dresses, she was asking for it.”

I recognize that open data can feed wonderful, innovative discoveries and applications. We don’t want a regime where someone needs permission for every data use, but we do need ways for the public to express their concerns about their data.

It would be great to have a kind of Kickstarter or Indiegogo for data, where companies asked not for funds but for our data. However, companies could not sign up as many people this way as they can get now by surfing Twitter or buying data sets. It looks like data use cannot avoid becoming an issue for policy, whoever sets and administers it. Perhaps subsequent workshops will push the boundaries of discussion farther and help us form a doctrine for our decade.


Private companies, as you noted, have been found giving information to the police and security services that would have constituted “unreasonable search and seizure” if the police had collected it directly.

In the EU, and to a distinctly lesser degree in Canada, those private companies would be breaking the law by collecting the data, and breaking it again if they released it without a warrant.

This echoes something the library community settled long ago: you keep “patron X borrowed book Y” only until the book is returned. That’s a principle every computerized library system has to meet (e.g., GEAC), or it could not be sold in privacy-protective countries.

The moral? Software companies should write software that meets the highest common standards for privacy, because that’s a business advantage. Poorer privacy can keep you from selling your products anywhere in Europe.

Andy Oram

Dave, you make a good technical point: the principle you espouse is commonly called “privacy by design.”

mitch696969

Many a problem originates in the different values espoused in the hardware and software produced. It is fair to consider the values of your products and services. It is not fair to dictate your values without respect for the values of everyone else, though. Should people randomly peek into your phone because you stepped on their property? Isn’t the phone your property? Should your TV viewing habits be publicly known as Google spits back your preferences on a public computer in a public setting? Have the things we previously settled on as individuals been thrown to the wind in the interest of computer science?

Indeed! In this case, countries placing appropriate limitations on what companies can do with information that they collect but which is not their property can act as an “inoculant” against the problem spreading. If I’m a vendor and want to keep extra information, I have to program it on a country-by-country basis and make no errors. If I err, I can be hauled into court and banged on with a clue-stick.

Transparency is the key to ending “icky” data collection. If a company has to publicly disclose what it did with your data, it will think twice. Random behavior in large data sets definitely works to hobble secret communication channels that are woven into everything. The Internet was built for war, which is why many of its principles are inside out. Better information leads to better discrimination, and we need to end the “age of digital discrimination.” There is a limit to how much people, and especially machines, should know. We all made do just fine before by asking and observing, like the normal people our maker intended us to be. It is alarming that hackers in a rogue country can identify future inventors and hack everything they touch to further their own aims before that inventor even gets a start.