Abstract:

Given that a differentially private mechanism has a known conditional
distribution, probabilistic inference techniques may be used along with
the known conditional distribution, and generated results from previously
computed queries on private data, to generate a posterior distribution
for the differentially private mechanism used by the system. The
generated posterior distribution may be used to describe the probability
of every possible result being the correct result. The probability may
then be used to qualify conclusions or calculations that may depend on
the returned result.

Claims:

1. A method comprising: generating a result using a differentially
private computation from a private data set by a computing device;
retrieving a posterior distribution for the differentially private
computation by the computing device; and providing the posterior
distribution by the computing device.

2. The method of claim 1, further comprising providing the result.

3. The method of claim 1, further comprising: determining a conditional
distribution of the differentially private computation; and inferring the
posterior distribution using the conditional distribution.

4. The method of claim 3, wherein inferring the posterior distribution
using the conditional distribution comprises: retrieving a plurality of
results from previous executions of the differentially private
computation; and inferring the posterior distribution using the
conditional distribution and the plurality of results using probabilistic
inference.

10. A method comprising: receiving a first result at a computing device
through a network, wherein the first result is generated from a second
result of a private data set using a differentially private computation;
determining a conditional distribution of the differentially private
computation by the computing device; retrieving a plurality of results
from previous executions of the differentially private computation;
probabilistically inferring a posterior distribution of the
differentially private computation using the conditional distribution and
the plurality of results by the computing device; and providing the
posterior distribution by the computing device through the network.

11. The method of claim 10, further comprising providing the first result
through the network.

17. A system comprising: a computing device; a privacy integrated
platform that generates a first result from a second result using a
differentially private computation; and an inference engine that:
generates a posterior distribution for the differentially private
computation; receives the generated first result; and provides the
generated first result and the generated posterior distribution.

18. The system of claim 17, wherein the differentially private
computation is an exponential mechanism.

19. The system of claim 17, wherein the inference engine further
determines a conditional distribution of the differentially private
computation, and generates the posterior distribution for the
differentially private computation using the determined conditional
distribution.

20. The system of claim 19, wherein the conditional distribution is one
of a Laplacian distribution or a Gaussian distribution.

Description:

BACKGROUND

[0001] A system is said to provide differential privacy if the presence or
absence of a particular record or value cannot be determined based on an
output of the system, or can only be determined with a very low
probability. For example, in the case of a website that allows users to
rate movies, a curious user may attempt to make inferences about the
movies a particular user has rated by creating multiple accounts,
repeatedly changing the movie ratings submitted, and observing the
changes to the movies that are recommended by the system. Such a system
may not provide differential privacy because the presence or absence of a
rating by a user (i.e., a record) may be inferred from the movies that
are recommended (i.e., output).

[0002] Typically, systems provide differential privacy by introducing some
amount of noise to the data or to the results of operations or queries
performed on the data. While the addition of noise to the results of
operations may not be problematic for systems such as the rating system
described above, for some systems such noise may be problematic. For
example, in a system of medical records that provides differential
privacy, users may want a probability distribution of the noise that is
added to the results.

SUMMARY

[0003] In order to provide differential privacy protection to a private
data set, a system may add noise to the results of queries performed on
the private data set. The system may add the noise using a differentially
private mechanism with a known conditional distribution. In making
queries from the data set, a user may wish to infer some information from
that data, for example the average of some quantity. Given that the
differentially private mechanism has a known conditional distribution,
probabilistic inference techniques may be used along with the known
conditional distribution, and generated results from previously computed
queries on the private data, to generate a posterior distribution over
the unknown quantity of interest. The generated posterior distribution
may be used to describe the probability of any value being the correct
value for the quantity of interest. The probability may then be used to
qualify conclusions or calculations that may depend on the returned
result.

[0004] In an implementation, a result is generated by a differentially
private computation from a private data set. A posterior distribution for
the result given the differentially private computation is retrieved, and
the posterior distribution is provided to a user.

[0005] Implementations may include some or all of the following features.
The result may be provided to the user. A conditional distribution of the
differentially private computation may be determined. The posterior
distribution may be inferred using the conditional distribution.
Inferring the posterior distribution using the conditional distribution
may include retrieving results from previous executions of the
differentially private computation, and inferring the posterior
distribution using the conditional distribution and the results using
probabilistic inference. Using probabilistic inference may include using
Markov Chain Monte Carlo methods. The conditional distribution may be a
Laplacian distribution or a Gaussian distribution. The differentially
private computation may be an exponential mechanism. The private data set
may comprise census data. The private data set may comprise medical data.

[0006] In an implementation, a first result is received at a computing
device through a network. The first result is generated from a second
result of a private data set using a differentially private computation.
A conditional distribution of the differentially private computation is
determined. A plurality of results from previous executions of the
differentially private computation is retrieved. A posterior distribution
of the differentially private computation is probabilistically inferred
using the conditional distribution and the plurality of results. The
posterior distribution is provided by the computing device.

[0007] Implementations may include some or all of the following features.
The first result may be provided through the network. The differentially
private computation may include an exponential mechanism.
Probabilistically inferring the posterior distribution may include
probabilistically inferring an approximate posterior distribution using
Markov Chain Monte Carlo methods. The conditional distribution may be a
Laplacian distribution or a Gaussian distribution. The private data set
may include census data. The private data set may include medical data.

[0008] This summary is provided to introduce a selection of concepts in a
simplified form that are further described below in the detailed
description. This summary is not intended to identify key features or
essential features of the claimed subject matter, nor is it intended to
be used to limit the scope of the claimed subject matter.

BRIEF DESCRIPTION OF THE DRAWINGS

[0009] The foregoing summary, as well as the following detailed
description of illustrative embodiments, is better understood when read
in conjunction with the appended drawings. For the purpose of
illustrating the embodiments, there are shown in the drawings example
constructions of the embodiments; however, the embodiments are not
limited to the specific methods and instrumentalities disclosed. In the
drawings:

[0010] FIG. 1 is a block diagram of an implementation of a system that may
be used to provide probabilistic inference for differentially private
computations;

[0011] FIG. 2 is an operational flow of an implementation of a method for
determining a posterior distribution;

[0012] FIG. 3 is an operational flow of an implementation of a method for
probabilistically inferring the posterior distribution of a
differentially private computation;

[0013] FIG. 4 is an operational flow of an implementation of a method for
inferring a posterior distribution for a differentially private
computation; and

[0014] FIG. 5 shows an exemplary computing environment.

DETAILED DESCRIPTION

[0015] FIG. 1 is a block diagram of an implementation of a system 100 that
may be used to provide probabilistic inference for differentially private
computations. As illustrated, the system 100 includes a privacy integrated
platform 130. In some implementations, the privacy integrated platform
130 may receive one or more queries from users of a client device 110.
The one or more queries may be received from the client device 110
through a network 120. The network 120 may be a variety of network types
including the public switched telephone network (PSTN), a cellular
telephone network, and a packet switched network (e.g., the Internet).
The client device 110 may comprise one or more general purpose computers
such as the computing device 500 described with respect to FIG. 5, for
example.

[0016] The privacy integrated platform 130 may receive the one or more
queries and satisfy the received queries from a private data set 137 by
generating data or results in response to the queries. The privacy
integrated platform 130 may satisfy the queries while providing
differential privacy to the private data set 137. Example queries may be
for a count of the number of records of the private data set 137 that
satisfy or meet specified conditions, or for the value(s) associated with
a specified record of the private data set 137. Any type of data queries
may be supported by the privacy integrated platform 130. The private data
set 137 may be implemented using a database or other data structure and
may include a variety of private data and private data sets including
medical data, census data, and financial data, for example.

[0017] As described above, a system is said to provide differential
privacy to a private data set if an output of the system does not
disclose the presence or absence of a record in the private data set, or
the presence or absence of a record can only be determined with a low
probability. Accordingly, a user of the client device 110 may not be able
to tell the presence or absence of a record in the private data set 137
based on a response to a query generated by the privacy integrated
platform 130. The amount of differential privacy that is provided by the
privacy integrated platform 130 is referred to herein as ε.
Generally, the greater the value of ε used by the privacy integrated
platform 130, the less the amount of differential privacy provided to the
private data set 137.

[0018] More specifically, with respect to Equation (1), a result or output
z generated by the privacy integrated platform 130, where z ∈ Z,
in response to a query against the private data set 137 from a class of
data sets X, may provide ε-differential privacy if and only if,
for all data sets A, B ∈ X with symmetric difference one:

p(z|A) ≤ p(z|B)×exp(ε) (1).
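
Laplace noise is one common way of meeting this bound. As a minimal
illustrative sketch (not part of the claims), the following Python
fragment numerically checks Equation (1) for Laplace noise added to a
hypothetical counting query whose true value differs by one between
neighboring data sets:

```python
import numpy as np

# Numeric check of Equation (1), assuming Laplace noise on a count query.
# Neighboring data sets A and B (symmetric difference one) have true
# counts of 10 and 11, respectively; these values are illustrative.
epsilon = 0.5
b = 1.0 / epsilon                                    # Laplace scale for a sensitivity-1 query
z = np.linspace(-20.0, 40.0, 2001)
p_z_given_A = np.exp(-np.abs(z - 10) / b) / (2 * b)  # density of output z under A
p_z_given_B = np.exp(-np.abs(z - 11) / b) / (2 * b)  # density of output z under B

# Equation (1): p(z|A) <= p(z|B) * exp(epsilon) for every z.
assert np.all(p_z_given_A <= p_z_given_B * np.exp(epsilon) + 1e-12)
```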

[0019] For example, if the set A contains the records of all individuals,
and the set B contains the records of all individuals except one user
(i.e., symmetric difference one), a result z having
ε-differential privacy means that the log of the likelihood ratio
that the one user is present in or absent from the private data set 137
given the result z is bounded in magnitude by ε, as shown in
Equation (2):

|ln(p(z|A)/p(z|B))| ≤ ε (2).

[0020] In some implementations, the privacy integrated platform 130 may
provide differential privacy to a record or result generated in response
to a received query through the addition of noise. For example, the
privacy integrated platform 130 may retrieve a record from the private
data set 137 in response to a query and add noise to value(s) associated
with the retrieved record. Alternatively, the privacy integrated platform
130 may perform a query on the private data set 137 to generate a result.
Noise may then be added to the result before it is provided to the
requesting user.

[0021] In some implementations, the noise may be added by a noise engine
135. The noise may be Laplacian or Gaussian noise for example; however,
other types of noise may be used. By adding noise to a result before
providing the result to a user, the differential privacy of the private
data set 137 is protected because the true response to the query is
obscured, thereby preventing a user from making any inferences about the
private data set 137 with complete certainty.
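
As an illustrative sketch of this kind of noise addition, the following
Python fragment shows a minimal Laplace mechanism of the sort the noise
engine 135 might use; the function name and parameters are hypothetical,
and the sketch assumes a query whose result changes by at most
sensitivity when one record is added or removed:

```python
import numpy as np

def laplace_mechanism(true_result, sensitivity, epsilon, rng=None):
    """Return a noisy query result intended to provide epsilon-differential privacy.

    Adds Laplace noise with scale sensitivity/epsilon, so a smaller
    epsilon (more privacy) means more noise is added.
    """
    rng = rng or np.random.default_rng()
    return true_result + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# Example: a counting query (sensitivity 1) answered with epsilon = 0.5.
noisy_count = laplace_mechanism(true_result=42, sensitivity=1.0, epsilon=0.5)
```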

[0022] In some implementations, the noise engine 135 may add noise to the
result of a query to generate a result z using a differentially private
computation. One such computation is known as an exponential mechanism.
An exponential mechanism is a function φ: X×Z→R (where R
is the set of real numbers) such that, as shown by Equation (3), for any
input data sets A, B ∈ X,

|φ(A,z)-φ(B,z)| ≤ 1 (3).

[0023] The exponential mechanism function φ may return, given a true
data set X (i.e., the private data set 137), a value z drawn from the
conditional distribution of Equation (4):

p(z|X,φ,ε) ∝ exp(ε×φ(X,z)) (4).
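
A minimal sketch of sampling from the conditional distribution of
Equation (4), assuming a discrete set of candidate outputs, might look as
follows; the helper name is hypothetical, and φ is assumed to satisfy
the sensitivity bound of Equation (3):

```python
import numpy as np

def exponential_mechanism(data, candidates, phi, epsilon, rng=None):
    """Sample z with probability proportional to exp(epsilon * phi(data, z)).

    phi is assumed to satisfy |phi(A, z) - phi(B, z)| <= 1 for data sets
    A and B with symmetric difference one, as in Equation (3).
    """
    rng = rng or np.random.default_rng()
    scores = np.array([phi(data, z) for z in candidates], dtype=float)
    # Subtract the maximum score before exponentiating for numerical
    # stability; this does not change the normalized probabilities.
    weights = np.exp(epsilon * (scores - scores.max()))
    return rng.choice(candidates, p=weights / weights.sum())
```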

[0024] In some implementations, the privacy integrated platform 130 may
provide an indicator of the conditional distribution (i.e., p(z|X, φ,
ε)) for the exponential mechanism (or other differentially
private computation) used by the noise engine 135 of the privacy
integrated platform 130 to generate a noisy result z from the private
data set 137 (i.e., X). The conditional distribution may describe the
probability distribution for the noisy result z generated by the privacy
integrated platform 130. For example, an exponential mechanism whose
conditional distribution is supported on the interval between -1 and 1
may generate noise values between -1 and 1.

[0025] The probability that any possible data set X is the true data set
(i.e., the data set X without noise), given noisy results of queries
against the true data set, is referred to herein as the posterior
distribution over data sets given noisy observations. The probability
that any subsequent query f against the true data set has value y, given
noisy results of other stored queries against the data set, is referred
to herein as the posterior distribution over query results given noisy
observations. The conditional distribution of the exponential mechanism
used by the privacy integrated platform 130 may be used to determine the
posterior distribution over data sets given noisy observations and the
posterior distribution over query results given noisy observations.

[0026] In some implementations, the privacy integrated platform 130 may
also allow a user of the client device 110 to specify the level of
differential privacy provided (i.e., the value of ε). As the
value of ε specified by the user decreases, the amount of
differential privacy protection afforded to the private data set 137
increases. However, the lower the value of ε, the greater the
amount of noise that is added to a result.

[0027] The system 100 may further include an inference engine 140. The
inference engine 140 may determine (or approximate) the posterior
distribution over data sets given noisy observations and the posterior
distribution over query results given noisy observations. The inference
engine 140 may be implemented using a general purpose computer such as
the computing device 500 described with respect to FIG. 5, for example.
While the inference engine 140 is illustrated as being separate from the
privacy integrated platform 130, this is for illustrative purposes only.
In some implementations, the inference engine 140 may be a component of
the privacy integrated platform 130.

[0028] For example, as described above, the privacy integrated platform
130 may return noisy results to provide differential privacy protection
to the private data set 137. While these noisy results may provide
differential privacy, they may make the values less useful for certain
applications such as medicine. The inference engine 140 may make the
noisy results more useful in certain circumstances by determining and
providing the posterior distribution for the exponential mechanism used
to generate the noisy results. The inference engine 140 may calculate the
posterior distribution for each exponential mechanism used by the privacy
integrated platform 130.

[0029] A model may be used for the relationship between the private data
set X and the quantity of interest θ:

p(X|θ)

[0030] In some implementations, the posterior distribution may be
determined by the inference engine 140 using Equation (5) to compute the
marginal likelihood, where X represents the private data set 137,
multiplied by a prior over θ:

p(θ|z,ε) ∝ p(θ)∫p(z|X,ε)p(X|θ)dX (5).

[0031] Thus, as illustrated in Equation (5), the posterior distribution
over the unknown quantity of interest θ is proportional to the prior
for θ multiplied by the integral, over all possible data sets X, of
the conditional distribution p(z|X, ε) multiplied by the
probability of the data set X given θ.
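
For a simple special case, Equation (5) can be evaluated directly on a
grid. The sketch below assumes the quantity of interest θ is itself
the true answer to a sensitivity-1 query, so the integral over data sets
collapses to a Laplace conditional distribution p(z|θ, ε); the
function name, grid, and prior are illustrative assumptions:

```python
import numpy as np

def grid_posterior(z, epsilon, theta_grid, prior):
    """Evaluate p(theta | z, epsilon) on a grid, per Equation (5).

    Assumes the noisy result z equals theta plus Laplace(1/epsilon)
    noise, so the likelihood is the Laplace density centered at theta.
    """
    scale = 1.0 / epsilon
    likelihood = np.exp(-np.abs(z - theta_grid) / scale) / (2 * scale)
    unnormalized = prior * likelihood
    return unnormalized / np.trapz(unnormalized, theta_grid)

theta_grid = np.linspace(0.0, 100.0, 1001)
prior = np.full_like(theta_grid, 1.0 / 100.0)  # flat prior over [0, 100]
posterior = grid_posterior(z=42.7, epsilon=0.5,
                           theta_grid=theta_grid, prior=prior)
```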

[0032] Moreover, in some implementations, additional data or prior
information may be known. This preexisting information may be
incorporated into the calculation of the posterior distribution by the
inference engine 140. For example, a user may have preexisting knowledge
about an individual whose data is part of the private data set 137. Other
preexisting knowledge may also be incorporated, such as the number of
records in the data or any other information about the data. This
preexisting knowledge may be referred to as α, and the probability
of X given θ and α may be represented by p(X|θ,α).
There may also be prior knowledge about the quantity of interest θ,
represented by p(θ|α). The equation for the posterior
distribution, p(θ|z, ε, α), used by the inference engine
140 may incorporate this preexisting knowledge, becoming Equation (6):

p(θ|z,ε,α) ∝ p(θ|α)∫p(z|X,ε)p(X|θ,α)dX (6).

[0033] The inference engine 140 may approximate the posterior distribution
from the above formula for the exponential mechanism used by the privacy
integrated platform 130 using probabilistic inference and the results of
previous executions of the exponential mechanism performed in response to
previously received user queries. For example, after each execution of
the exponential mechanism, the generated results may be stored by the
privacy integrated platform 130 for later use in calculating the
posterior distribution of the exponential mechanism by the inference
engine 140.

[0034] The inference engine 140 may approximate the posterior distribution
using probabilistic inference methods such as Markov Chain Monte Carlo
methods using the results of previous executions of the exponential
mechanism along with the conditional distribution of the exponential
mechanism. The approximated posterior distribution may be stored by the
inference engine 140 and returned to a user along with a generated noisy
result.
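
One way such an approximation might be carried out is with a simple
Metropolis-Hastings sampler over the quantity of interest θ, using
the stored noisy results. Everything in the sketch below (the names, the
random-walk proposal, and the Laplace form of the conditional
distribution) is an illustrative assumption rather than a prescribed
implementation:

```python
import numpy as np

def mh_posterior_samples(noisy_results, epsilon, log_prior,
                         n_samples=5000, step=1.0, rng=None):
    """Draw approximate samples from p(theta | results, epsilon) via
    Metropolis-Hastings, assuming Laplace(1/epsilon) noise."""
    rng = rng or np.random.default_rng()
    noisy_results = np.asarray(noisy_results, dtype=float)

    def log_post(theta):
        # Laplace log-likelihood of all stored noisy results, up to an
        # additive constant, plus the log prior over theta.
        return log_prior(theta) - epsilon * np.sum(np.abs(noisy_results - theta))

    theta = float(noisy_results.mean())  # start the chain at the sample mean
    samples = np.empty(n_samples)
    for i in range(n_samples):
        proposal = theta + rng.normal(scale=step)  # symmetric random walk
        if np.log(rng.uniform()) < log_post(proposal) - log_post(theta):
            theta = proposal
        samples[i] = theta
    return samples

# Example with a flat (improper) prior over theta:
samples = mh_posterior_samples([41.2, 43.9, 42.7], epsilon=0.5,
                               log_prior=lambda theta: 0.0)
```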

[0035] In some implementations, the inference engine 140 may determine or
approximate the posterior distribution for a variety of different
exponential mechanisms or other differentially private computations used
by the privacy integrated platform 130 at a variety of different values
of ε. When a subsequent query is received by the privacy
integrated platform 130 from a user, the posterior distribution of the
exponential mechanism used to calculate a result in response to the query
may be retrieved by the privacy integrated platform 130. The posterior
distribution and generated result may then be returned to the user who
provided the query, for example.

[0036] FIG. 2 is an operational flow of a method 200 for determining a
posterior distribution. The method 200 may be implemented by the privacy
integrated platform 130 and the inference engine 140, for example.

[0037] A query is received (201). The query may be received by the privacy
integrated platform 130. The query may be received from a user and may be
a request for information from a private data set such as the private
data set 137. For example, the private data set 137 may be medical or
census records.

[0038] A first result is generated in response to the query (203). The
first result may be generated by the privacy integrated platform 130. The
first result may be generated by fulfilling the query from the private
data set 137.

[0039] Noise is added to the generated first result using a differentially
private computation (205). The noise may be added to the first result to
generate a second result by the noise engine 135 of the privacy
integrated platform 130. In some implementations, the differentially
private computation may be an exponential mechanism. The noise may be
added to the first result to provide differential privacy protection to
the private data set 137. Other methods for providing differential
privacy protection may also be used.

[0040] A posterior distribution for the differentially private computation
is retrieved (207). The posterior distribution may be retrieved by the
privacy integrated platform 130 from the inference engine 140. The
retrieved posterior distribution may have been pre-generated for the
differentially private computation used to generate the noise that was
added to the first result. The posterior distribution may have been
generated for the differentially private computation using a conditional
distribution associated with the differentially private computation and
the results of one or more previous executions of the differentially
private computation. The conditional distribution may be a Laplacian
distribution or a Gaussian distribution, for example.

[0041] The posterior distribution and generated second result may be
provided to a user (209). The posterior distribution and generated second
result may be provided by the privacy integrated platform 130 to a user
through a network. The user may be the same user who provided the query
to the privacy integrated platform 130, for example. As described above,
the generated second result may be generated from the first result by the
addition of noise to the first result. The addition of noise provides
differential privacy protection to the private data set 137, but obscures
the true result of the query. Accordingly, by providing the posterior
distribution that describes the probability that any generated result is
the true result, the user may be able to incorporate the probability into
any subsequent calculations or conclusions that depend on the second
result.
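
As one hypothetical illustration of how a user might qualify downstream
conclusions, a credible interval can be read off a gridded posterior such
as the one produced by the earlier grid_posterior sketch:

```python
import numpy as np

def credible_interval(theta_grid, posterior, mass=0.95):
    """Return an equal-tailed credible interval from a posterior on a
    uniformly spaced grid."""
    cdf = np.cumsum(posterior)
    cdf = cdf / cdf[-1]  # normalize so the CDF ends at 1
    lo = theta_grid[np.searchsorted(cdf, (1.0 - mass) / 2.0)]
    hi = theta_grid[np.searchsorted(cdf, 1.0 - (1.0 - mass) / 2.0)]
    return lo, hi

# For example, with theta_grid and posterior from the earlier sketch:
# low, high = credible_interval(theta_grid, posterior, mass=0.95)
```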

[0042] FIG. 3 is an operational flow of a method 300 for probabilistically
inferring the posterior distribution of a differentially private
computation. The method 300 may be implemented by the inference engine
140.

[0043] A plurality of differentially private computations is performed
(301). The differentially private computations may be performed by the
inference engine 140. The differentially private computation may be the
computation used by the privacy integrated platform 130 to generate noise
that is added to results generated in response to queries on the private
data set 137. In some implementations,
the differentially private computation is an exponential mechanism. Other
differentially private computations may also be used. The results of the
differentially private computations may be stored for later use in
calculating or approximating the posterior distribution of the
differentially private computation.

[0044] The conditional distribution of the differentially private
computation is determined (303). The determination may be made by the
inference engine 140. The conditional distribution describes the
distribution of results that are returned by the differentially private
computation given the private data set 137 and the level of differential
privacy provided by the differentially private computation (i.e., ε). The
conditional distribution may be provided to the inference engine 140 by
the privacy integrated platform 130.

[0045] A posterior distribution of the differentially private computation
is probabilistically inferred (305). The inference may be made by the
inference engine 140. The posterior distribution may be probabilistically
inferred from the stored results of the differentially private
computations and the conditional distribution of the differentially
private computation. In some implementations, the inference may be made
by using Markov Chain Monte Carlo methods, for example; however, other
methods of probabilistic inference may also be used.

[0046] FIG. 4 is an operational flow of a method 400 for inferring a
posterior distribution for a differentially private computation. The
method 400 may be implemented by the inference engine 140 and the privacy
integrated platform 130.

[0047] A first result is received (401). The first result may be received
by the inference engine 140 from the privacy integrated platform 130. The
first result may have been generated from a second result using a
differentially private computation in response to a request or a query
received by the privacy integrated platform 130 from a user at the client
device 110. For example, the second result may be a result generated from
the private data set 137 and may include a result generated from private
data such as medical data. In order to provide differential privacy to
the records in the private data set, noise may be added to the results
before they are released to a user or other member of the public. In some
implementations, the noise may be Laplacian or Gaussian noise and may be
generated by a differentially private computation such as an exponential
mechanism. Accordingly, the first result may have been generated from the
second result by the privacy integrated platform 130 using a
differentially private computation and may differ from the second result
by some amount
of generated noise. Other methods for differential privacy may also be
used.

[0048] A conditional distribution of the differentially private
computation is determined (403). The conditional distribution may be
determined by the inference engine 140, or obtained from the privacy
integrated platform 130. The conditional distribution may describe the
probability distribution of the noise added to records by the
differentially private computation. In some implementations, the
conditional distribution may be a Gaussian or Laplacian distribution. The
conditional distribution may be a function of the amount of differential
privacy provided by the differentially private computation (i.e., ε).

[0049] A plurality of results from previous executions of the
differentially private computation is retrieved (405). The results may be
retrieved by the inference engine 140. In some implementations, the
results were generated in response to previously received queries.

[0050] A posterior distribution of the differentially private computation
is probabilistically inferred (407). The posterior distribution may be
inferred by the inference engine 140. The posterior distribution may be
inferred by the inference engine 140 using the conditional distribution
and the retrieved results of the differentially private computation. In
some implementations, the posterior distribution may be inferred using
Markov Chain Monte Carlo methods. Other methods may also be used.

[0051] The second result and the inferred posterior distribution are
provided (409). The second result and the inferred posterior distribution
may be provided by the inference engine 140. As described with respect to
401, the second result may have been generated in response to a query
received from a user of the client device 110. Accordingly, the second
result and the inferred posterior distribution may be returned to the
user at the client device 110.

[0052] FIG. 5 shows an exemplary computing environment in which example
implementations and aspects may be implemented. The computing system
environment is only one example of a suitable computing environment and
is not intended to suggest any limitation as to the scope of use or
functionality.

[0053] Numerous other general purpose or special purpose computing system
environments or configurations may be used. Examples of well known
computing systems, environments, and/or configurations that may be
suitable for use include, but are not limited to, personal computers
(PCs), server computers, handheld or laptop devices, multiprocessor
systems, microprocessor-based systems, network PCs, minicomputers,
mainframe computers, embedded systems, distributed computing environments
that include any of the above systems or devices, and the like.

[0054] Computer-executable instructions, such as program modules, being
executed by a computer may be used. Generally, program modules include
routines, programs, objects, components, data structures, etc. that
perform particular tasks or implement particular abstract data types.
Distributed computing environments may be used where tasks are performed
by remote processing devices that are linked through a communications
network or other data transmission medium. In a distributed computing
environment, program modules and other data may be located in both local
and remote computer storage media including memory storage devices.

[0055] With reference to FIG. 5, an exemplary system for implementing
aspects described herein includes a computing device, such as computing
device 500. In its most basic configuration, computing device 500
typically includes at least one processing unit 502 and memory 504.
Depending on the exact configuration and type of computing device, memory
504 may be volatile (such as random access memory (RAM)), non-volatile
(such as read-only memory (ROM), flash memory, etc.), or some combination
of the two. This most basic configuration is illustrated in FIG. 5 by
dashed line 506.

[0057] Computing device 500 typically includes a variety of computer
readable media. Computer readable media can be any available media that
can be accessed by device 500 and include both volatile and non-volatile
media, and removable and non-removable media.

[0058] Computer storage media include volatile and non-volatile, and
removable and non-removable media implemented in any method or technology
for storage of information such as computer readable instructions, data
structures, program modules or other data. Memory 504, removable storage
508, and non-removable storage 510 are all examples of computer storage
media. Computer storage media include, but are not limited to, RAM, ROM,
electrically erasable programmable read-only memory (EEPROM), flash memory or
other memory technology, CD-ROM, digital versatile disks (DVD) or other
optical storage, magnetic cassettes, magnetic tape, magnetic disk storage
or other magnetic storage devices, or any other medium which can be used
to store the desired information and which can be accessed by computing
device 500. Any such computer storage media may be part of computing
device 500.

[0059] Computing device 500 may contain communications connection(s) 512
that allow the device to communicate with other devices. Computing device
500 may also have input device(s) 514 such as a keyboard, mouse, pen,
voice input device, touch input device, etc. Output device(s) 516 such as
a display, speakers, printer, etc. may also be included. All these
devices are well known in the art and need not be discussed at length
here.

[0060] It should be understood that the various techniques described
herein may be implemented in connection with hardware or software or,
where appropriate, with a combination of both. Thus, the processes and
apparatus of the presently disclosed subject matter, or certain aspects
or portions thereof, may take the form of program code (i.e.,
instructions) embodied in tangible media, such as floppy diskettes,
CD-ROMs, hard drives, or any other machine-readable storage medium where,
when the program code is loaded into and executed by a machine, such as a
computer, the machine becomes an apparatus for practicing the presently
disclosed subject matter.

[0061] Although exemplary implementations may refer to utilizing aspects
of the presently disclosed subject matter in the context of one or more
stand-alone computer systems, the subject matter is not so limited, but
rather may be implemented in connection with any computing environment,
such as a network or distributed computing environment. Still further,
aspects of the presently disclosed subject matter may be implemented in
or across a plurality of processing chips or devices, and storage may
similarly be effected across a plurality of devices. Such devices might
include PCs, network servers, and handheld devices, for example.

[0062] Although the subject matter has been described in language specific
to structural features and/or methodological acts, it is to be understood
that the subject matter defined in the appended claims is not necessarily
limited to the specific features or acts described above. Rather, the
specific features and acts described above are disclosed as example forms
of implementing the claims.