Posts Tagged ‘PINQ’

Editor’s Note: Tony Gibbon is developing a datatrust demo as an independent contractor for Shan Gao Ma, a consulting company started by Alex Selkirk, President of the Board of the Common Data Project. Tony’s work, like Grant’s, could have interesting implications for CDP’s mission, as it would use technologies that could enable more disclosure of personal data for public re-use. We’re happy to have him guest blogging about the demo here.

Back in August, Alex wrote about the PINQ privacy technology and noted that we would be trying to figure out what role it could play in the datatrust. The goal was to build a demo of PINQ in action and get a better understanding of PINQ and its challenges and quirks in the process. We settled on a quick-and-dirty interactive demo to try to demonstrate the answers to the following.

What does PINQ bring to the table?

Before we look at the benefits of PINQ, let’s first take a look at the shortcomings of one of the ways data is often released with an example taken from the CDC website.

This probably isn’t the best example of a compelling dataset, but it is a good example of the lack of flexibility of many datasets that are available—namely that the data is pre-bucketed and there is a limit to how far you are able to drill down on the data.

On one hand, the limitation makes sense: If the CDC allowed you (or your prospective insurance company) to view disease information at street level, the potential consequences are quite frightening. On the other hand, they are also potentially limiting the value of the data. For example, each county is not necessarily homogenous. Depending on the dataset, a researcher may legitimately wish to drill down without wanting to invade anyone’s privacy—for example to compare urban vs. suburban incidence.

This is where PINQ shines—it works in both these cases. PINQ allows you to execute an arbitrary aggregate query (meaning I can ask how many people are wearing pink, but I can’t ask PINQ to list the names of people wearing pink) while still protecting privacy.

Let’s turn to the demo. (Note: the data points in the demo were generated randomly and do not actually indicate people or residences, much less anything about their health.) The quickest, most visual arbitrary query we came up with is drawing a rectangle on a map and counting each data point that falls inside, so we placed hundreds of “sick” people on a map to let users count them. (Keep in mind that the arbitrariness of a PINQ query need not be limited to location on a map. It could be numerical like age, textual like name, include multiple fields etc.)

Now let’s attempt to answer the researcher’s question. Is there a higher incidence of this mysterious disease in urban or suburban areas? For the sake of simplicity, we’ll pretend he’s particularly interested in two similarly populated, conveniently rectangular areas: one in Seattle and the other in a nearby suburb as shown below:

An arbitrary query such as this one is clearly not possible with data that is pre-bucketed such as the diabetes by county. Let’s take a look at what PINQ spits out.

We get an “answer” and a likely range. (The likely range is actually an input to the query, but that’s a topic for another post.) So what does this mean? Are there really 311.3 people in Seattle with the mysterious disease? Why are there partial people?

PINQ adds a random amount of noise to each answer, which prevents us from being able to measure the impact of a single record in the dataset. The PINQ answer indicates that about 311 people (plus or minus noise) in Seattle have the disease. The noise, though randomly generated, is likely to fall within a particular range, in this case 30. So the actual number is likely to be within 30 of 311, while the actual number of those in the nearby suburb with the disease is likely to be within 30 of 177.

Given these numbers (and ignoring the oversimplification and silliness of his question), the researcher could conclude that the incidence in the urban area is higher than the suburban area. As a bonus, since this is a demo and no one’s privacy is at stake, we can look at the actual data and real numbers:

The answers from PINQ were in fact pretty close to the real answer. We got a little unlucky with the Seattle answer as the actual random noise for that query was slightly greater than the likely range, but our conclusion was the same as if we had been given the real data.

But what about the evil insurance company/ employer/ neighbor?

By now, you’re hopefully starting to see potential value of allowing people to execute arbitrary queries rather than relying on pre-bucketed data, but what about the potential harm? Let’s imagine there’s a high correlation between having this disease and having high medical costs. While you might want your data included in this dataset so it could be studied by someone researching a cure, you probably don’t want it used to discriminate against you.

To examine this further, let’s zoom in and ask about the disease at my house. PINQ only allows questions with aggregate answers, so instead of asking “does Tony have the disease?” we’ll ask, “how many people at Tony’s house have the disease?”

You’ll notice, unlike the CDC map, PINQ doesn’t try to stop me from asking this potentially harmful, privacy-infringing question. (I don’t actually live there.) PINQ doesn’t care if the actual answer is big or small, or if I ask about a large or small area, it just adds enough noise to ensure the presence or absence of a single record (in this case person) doesn’t have an effect on your answers.

PINQ’s answer was “about 2.4, with likely noise within +/- 5” (I dialed down the likely noise to +/-5 for this example). As with all PINQ answers, we have to interpret this answer in the context of my initial question: “Does Tony have the disease?” Since the noise added is likely to be within 5 and -5, the real answer is likely to be between 0 and 7, inclusive, and we can’t draw any strong conclusions about my health because the noise overwhelms the real answer.

Another way of looking at this is that we get similarly inconclusive answers when we try to attack the privacy of both the infected and the healthy. Below I’ve made the diseased areas visible on the map and we can compare the results of querying me and my neighbor, only one of whom is infected:

Keep in mind that my address may not be in the dataset because I’m healthy or because I chose not to submit my information. In either case, the noise causes the answer at my house to be indistinguishable from the answer at my neighbor’s address, and our decisions to be included or excluded from the dataset do not affect our privacy. Of equal importance from the first example, the addition of this privacy preserving noise does not preclude the extraction of potentially useful answers from the dataset.

At the Symposium in November we spent quite a bit of time trying to wrap our collective brain around the PINQ privacy technology, what it actually guarantees and how it does so.

I’ve attempted to condense our several hours of discussion into a series of blog posts.

We began our discussion with the question: What does a privacy guarantee mean to you?

There was a range of answers to this question. They all boiled down to one of the following two:

Nothing bad will come of this information I give you: It won’t be used against me (discrimination, fraud investigations, psychological warfare). It won’t be used to harass me (spam).

The absence or presence of a single record cannot be discerned.

Let’s just say definition 1 is the “layperson” definition, which is more focused on the consequences of giving up personal data.

And definition 2 is the “technologist'” definition, which is more focused on the mechanism behind how to actually fulfill the layperson’s guarantee in a meaningful, calculable way.

Q. What does PINQ guarantee?

Some context: PINQ is a layer of code that sits between data and anyone trying to ask questions of that data that guarantees privacy in a measurable way to the individuals represented in the data.

The privacy PINQ guarantees is broader than the layperson’s understanding of privacy. Not only does PINQ guard against re-identification, targeting, and in short, any kind of harm resulting from exposing your data, it prevents any and all things in the universe from changing as a direct result of your individual data contribution.

If you want to guarantee that nothing in the world will change as a result of someone contributing their data to a data set, then you simply need to make sure that no one asking questions of that data set will get answers that are discernibly affected by the presence or absence of any one person.

Therefore, if you define privacy guarantee as “the absence or presence of a single record cannot be discerned,” meaning the inclusion of your data in a data set will have no discernible impact on the answers people get out of that data set, you also end up guaranteeing that nothing bad can ever happen to you if you contribute your data because in fact, absolutely nothing (good or bad) will happen to anyone as a direct result of you contributing your data, because with PINQ as the gatekeeper, your particular data record might as well not be there!

What is the practical fallout of such a guarantee?

Not only will you not be targeted to receive SPAM as a result of contributing your data to a dataset, no one else will be targeted to receive SPAM as a result of you contributing your data to a data set.

Not only will you not be discriminated against by future employers or insurance companies as a result of contributing your data to a dataset, no one else will be discriminated against as a result of contributing your data to a dataset.

Does this mean that my data doesn’t matter? Why then would I bother to contribute?

Now is a good time to point out that PINQ’s privacy guarantee is expansive, but in a very specific way. Nothing in the universe will change as a result of any one person’s data. However, the aggregate effect of everyone’s data will absolutely make a difference. It’s the same logic behind avoiding life’s little vices like telling white lies, littering or chewing gum in class. One person littering isn’t such a big deal. But what if everyone littered?

Still, is such an expansive privacy guarantee necessary?

It turns out, it’s incredibly hard to narrow a privacy guarantee to prevent just “harm,” because harm is a subjective concept with cultural and social overtones. One person’s spam is another’s helpful notification.

All privacy guarantees today which largely focus on preventing harm are by and large based on “good intentions” and “theoretical best practices,” not technical mechanisms that are measurable and “provable.”

However, change, as a function of how much any one person’s data is discernibly affecting answers to questions asked of a data set is readily measurable.

Up to this point, we’ve been engaged in a simple thought experiment that requires nothing more than a few turns of logic. How exactly PINQ keeps track of whether the absence or presence of a single record is discernible in the answers it gives out and the extent to which it’s discernible is a different matter and requires actual “innovation” and “technology.” Stay tuned for more on that.

Editor’s Note: Grant Baillie is developing a datatrust prototype as an independent contractor for Shan Gao Ma, a consulting company started by Alex Selkirk, President of the Board of the Common Data Project. Grant’s work could have interesting implications for CDP’s mission, as it would use technologies that could enable more disclosure of personal data for public re-use. We’re glad to have him guest-blogging about the prototype on our blog, and we’re looking forward to hearing more as he moves forward.

This post is mostly the contents of a short talk I gave at the CDP Symposium last month. In a way, it was a little like the qualifying oral you have to give in some Ph.D. programs, where you stand up in front of a few professors who know way more than you do, and tell them what you think your research project is going to be.

That is the point we’re at with the Datatrust Prototype: We are ready to move forward with an actual, real, project. This proposal is the result of discussions Mimi, Alex and I have been having over the course of the past couple month or so, with questions and insights on PINQ thrown in by Tony, and answers to some of those questions from Frank.

Not Potential features of a Datatrust that are out of scope for this first prototype.

Why?

We need a (concrete) thing

Partly this is to have something to demo to future clients/partner organizations, of course. However, we also need the beginnings of a real datatrust so that different people’s abstract concepts of what a datatrust is can begin to converge.

We need understanding of some technical issues

1. Understanding Privacy: People have been looking at this problem (i.e. releasing aggregated data in such a way that individuals’ privacy isn’t compromised) for over 40 years. After some well-publicized disasters involving ad-hoc approaches (e.g. “Just add some noise to the data”, or “remove all identifying data like names or social security numbers”) a bunch of researchers (like our friends at research.microsoft.com) came up with a mathematical model where there is a measure of privacy, ε (epsilon).

In the model, there’s a clear message: you have to pay (in privacy) if you want greater accuracy (i.e. less noise) in your answers. In particular, the PINQ C# API can
calculate the privacy cost ε of each query on a dataset. So, one can imagine having different users of a datatrust using up allocations of privacy they have been assigned, hence the term “Privacy Budget”. (Frank dislikes this term because there are many privacy strategies possible other than a simple, fixed budget). In any case, by creating the prototype, we are hoping to gain an intuitive understanding of this mathematical concept of privacy, as well as obtain insight on more practical matters like how to allocate privacy so that the datatrust is still useful.

One way of understanding privacy is to think of it as a probability, (i.e. of leaking data) or measure of risk. You could even imagine an organization buying insurance against loss of individual data, based on the mathematical bounds supplied by PINQ. The downside of this approach is that we humans don’t seem to have a good intuitive grasp of things like probability and risk (writ large, for example, in the financial meltdown last year).

Another approach that might be helpful is to notice that privacy behaves in the same way as a currency (for example, it is additive). Here, you can imagine people earning or trading currency, for example. With actual money, we have a couple of thousands of years worth of experience built into evaluations like a house being worth a million Snickers bars: How long will it take us to have similar intuition with a privacy currency?

2. PINQ vs SQL: Here, by “SQL” I’m talking of traditional persistent data storage mechanisms in general. In most specific cases we are talking about SQL-based databases (although in the data analysis world there are other possibilities, like SAS).

SQL has been around for over 35 years, and is based on a mathematical model of its own. It basically provides a set of building blocks for querying and updating a database.

PINQ is a wrapper that basically protects the privacy of an underlying SQL database. It allows you to run SQL-like queries to get result sets, but then only lets you see statistical information about these sets. Even this information will come at some privacy cost, depending on how accurate you want the answer to be. PINQ will add random noise to any answer it gives you; if you want to ask for more accurate answers, i.e. less noise added (on average), you have to pay more privacy currency.

Note: At this point in the talk, I went in to a detailed discussion of a particular case of how PINQ privacy protection is supposed to work. However, I’m going to defer this to an upcoming post.

PINQ provides building blocks that are similar to SQL’s, with the caveat that the only data you can get out is aggregated (i.e. numbers, averages, and other statistical information). Also, some SQL operations cannot be supported by PINQ because they cannot be privacy protected at all.

In any case, both PINQ and SQL support an infinite number of questions, since you can ask about arbitrary functions of the input records. However, because they have somewhat different query building blocks, it is at least theoretically possible that there are real-world data analyses that cannot be replicated exactly in PINQ, or can only be done in a cumbersome or (privacy) expensive way. So, it will be good to focus on more concrete uses cases, in order to see whether this is the case or not.

3. Efficent Queries: It’s not uncommon for database-based software projects to grind to a halt at some point when it becomes clear that the database isn’t handling the full data set as well as is needed. Then various experts are rushed in to tune and rewrite the queries so that they perform better. In the case of PINQ, there is an additional measure of query performance, that of privacy used. Frank’s PINQ tutorial already has one example of a query that can be tuned to use privacy budget more efficiently. Hopefully, by working through specific use cases, CDP can start building expertise in query optimization.

What?

Target: A researcher working for a single organization. We’re going to imagine that some organization has a dataset containing personal information, but they want to be able to do some data analysis and release statistical aggregates to the public without compromising any individual’s privacy.

A Mockup of a Rich Dataset: Hopefully, I’ve given enough incentive for why we want a reasonably “real-world” structure to our data. I’m proposing that we choose a subset of the schema available as part of the National Health and Nutrition Examination Survey (NHANES):

This certainly satisfies the “rich” requirement: NHANES combines an interesting and extensive mix of medical and sociological information (The above cover page image comes from the description of the data collected, a 12-page PDF file). Clearly, we wouldn’t want to mock up the entire dataset, but even a small subset should make for some reasonably complex analyses.

Queries: We will supply at least a canned set of queries over the sample data. A scenario I have in mind is being able to have something like Tony’s demo, but with a more complex data set. A core requirement of the prototype is to be able to reproduce the published aggregations done with the real NHANES dataset. Some kind of geographical layout, like the demo, would be compelling, too.

Not

Account management: This includes issues of tracking privacy allocation and expenditures on a per-user basis, possibly having some measure of trust to allow this. There may be some infrastructure for different users in the prototype, but for the most part we’ll be assuming a single, global user.

Collaborative queries: In the future, we could imagine having users contribute to a library of well-known queries for a given data set. The problem with public access like this is that it basically means that all privacy budget is effectively shared, since query results are shared, so for this first cut at the problem we are not going to tackle this.

Multiple Datasets, Updates: For now, we will assume a single data set, with no updates. (The former can raise security concerns, especially if data sets aren’t co-hosted, while the latter is an area where I’m not sure what the mathematical contraints are).

Sneaky code (though maybe we have a service): There is a known issue at the moment with having PINQ executing arbitrary C# code to do queries. At the moment, it is possible to have your code save all the records it sees to a file on disk. We may work around this by having the datatrust be a service (i.e. effectively restricting the allowed queries so no user-supplied code is run).

Deployment issues (e.g. who owns the data): Our prototype will just have PINQ and the simulated database running on the same machine, even though more general configurations are possible. We also explicitly don’t tackle whether the database is running on a CDP server or the organization that owns the data.

Open Source Ideological Purity: While it would be nice for CDP to be able to deploy on an open source platform, it is clear that serious issues might lie in wait for deploying on Mono (the open source C# environment). In that case, it is quite possible to switch to running PINQ on top of, say, Microsoft SQL Server.

We’ve been silent for a while on the blog, but that’s because we’ve been distracted by actual work building out the datatrust (both the technology and the organization).

Here’s a brief rundown of what we’re doing.

Grace is multi-tasking on 3 papers.

Personal Data License We’re conducting a thought experiment to think through what the world might look like if there was an easy way for individuals to release personal information on their own terms.

Organizational Structures We’ve conducted a brief survey of a few organizational structures we think are interesting models for the datatrust “trusted” entities from Banks to Public Libraries and “member-based” organizations from Credit Unions to Wikipedia. We tried to answer the question: What institutional structures can be practical defenses against abuses of power as the datatrust becomes a significant repository of highly sensitive personal information?

Snapshot of Publicly Available Data Sources A cursory overview of some of the more interesting data sets that are available to the public from government agencies to answer the question: How is the datatrust going to be different / better than the myriad data sources we already have access to today?

We also now have 2 new contributors to CDP: Tony Gibbon and Grant Baillie.

A couple of months ago, Alex wrote about a new anonymization technology coming out of Microsoft Research: PINQ. It’s an elegant, simple solution, but perhaps not the most intuitive way for most people to think about guaranteeing privacy.

Tony is working on a demonstration of PINQ in action so that you and I can see how our privacy is protected and therefore believe *that* it works. Along the way, we’re figuring out what makes intuitive sense about the way PINQ works and what doesn’t and what we’ll need to extend so that researchers using the datatrust will be able to do their work in a way that makes sense.

Grant is working on a prototype of the datatrust itself which involves working out such issues as:

What data schemas will we support? We think this one to begin with: Star Schema.

How broadly do we support query structures?

Managing anonymizing noise levels.

To help us answer some of these questions, we’ve gathered a list of data sources we think we’d like to support in this first iteration. (e.g. IRS tax data, Census data) (More to come on that.)

We will be blogging about all of these projects in the coming week, so stay tuned!

For those following this blog closely, you know CDP is developing the concept of a datatrust. Slightly less obvious may be the fact that we’re actually planning on building a datatrust too. As such, we have been have been followinginterestingprivacy ideas and technologies for a few years now, by attending conferences, reading papers and talking to interesting folks.

PINQ is a layer between the analyst and the datastore

Privacy Challenges with Aggregates

One of the key realizations that lead to the creation of CDP was that the data that is valuable for analysis (generally aggregate statistical data) is not, in principle, the data that concerns privacy advocates (identifiable, personal information). While that is true in many cases, the details are a bit more complicated.

This reality poses a potentially showstopping problem for the datatrust: sure, aggregates are probably fine most of the time, but if we want to store and allow analysis of highly sensitive data, how can we be sure identities won’t be derived from even the aggregates?

Wait, one better: It guarantees no privacy disclosures. How could that be?

Here’s an example: Imagine you record the heights of each of the people in your subway car in the morning, and calculate the average height. Then imagine that you also recorded each person’s zip code, gender and birth date. According to Sweeney above, if you calculated the “average” height of each combination of zip code, gender and birth date, you would not only know the exact height of 87% of the people on the car, but, with the help of some other public databases, you’d also know who they were.

Here’s where differential privacy helps. Take the height data you recorded (zip codes and all) and put it behind a differential privacy software wall. By adding just the right amount of noise to the results, you the analyst can query for statistical representations of all different combinations of the characteristics, and you’ll get an answer and a measure of the accuracy of the response. Instead of finding out that the average height was 5′ 5.23″, you might find out that the average height was 5′ 4″ +/- 0.75″. (I’m making these numbers up and over-simplifying.)

A Programmatic Privacy Guarantee

The guarantee of differential privacy in the above example is that if you remove any one person from the subway car dataset, and asked PINQ again for the average height, the answer would be the same (same level of accuracy) as the answer when they were included in the set.

For the analyst trying to understand the big picture, PINQ offers accurate answers and privacy. For the attacker (to use the security-lingo) seeking an individual’s data, PINQ offers answers so inaccurate they are useless.

What every prototype needs: Road Miles

Does it work? I’ve chatted with Frank a lot, and there seems to be a growing consensus in the research communities that it does; based on what I have seen I am very optimistic. However, at least right now, the guarantee is less of a concern than usability: How hard is it to understand a dataset when all you can extract from it are noisy aggregates? We’re hoping that it is more useful than not having any access to certain sensitive datasets, but we don’t know yet.

PINQ Workflow: Query, Execution, Noise, Results

So what will we be doing for the next few months? Taking PINQ out on the highway. And trying to figure out what role it can play in the datatrust. We’ll keep you posted!