Wednesday, 6 August 2014

Definition of privacy : an attackers prior belief of tuple $t$ belonging to private instance $I$ divided by his posterior belief of tuple $t$ belonging to $I$ given the published instance $V$.

Prior is given by $Pr[t] \leq k \frac{n}{m}$ where $n = |I|$, $m = |D|$ ($D$ refers to the domain of the attribute values of $I$), and $k$ is some constant. This is a reasonable definition because $I$ contains $n$ tuples out of $m$ possible tuples.

Posterior is given by $Pr[t | V] \leq \gamma$ for some $\gamma$.

Definition of utility : Utility is the ability to estimate counting queries (example of a counting query- SELECT count(*) WHERE country=``Canada''). Utility is conveyed by the formula : $Q(I) - \widetilde{Q}(V) \leq \rho \sqrt{n}$ where $n = |I|$, $Q(I)$ refers to a counting query over the instance $I$, $\widetilde{Q}(V)$ refers to a counting query estimate over the published instance $V$ and $\rho$ is some number.

One result of the paper is that if $k$ is $\Omega (\sqrt{m})$ (hence $Pr[t] = \Omega(\frac{n}{\sqrt{m}})$, which represents a very powerful attacker with a large prior) then no algorithm can achieve both utility and privacy.

If $k$ is $O(1)$ (i.e., $Pr[t] = O(\frac{n}{m})$, which means that the attacker is weaker) then we describe an algorithm that achieves a privacy/utility tradeoff given by the formula $\rho = \sqrt{\frac{k}{\gamma}}$. The idea is that if $\rho$ is small ($\rho$ represents query estimation error, shown in the RHS of the utility formula), then utility is high because query estimation error is low. $k$ conveys prior while $\gamma$ conveys the posterior (definition of privacy).

The described anonymization algorithm is a randomized algorithm st for any $Q$ (query), the probability that the algorithm outputs $V$ st $Q(I) - \widetilde{Q}(V) \leq \rho \sqrt{n}$ is $1-\epsilon$ for some $\epsilon$.

Classification of adversaries

For every tuple, the attackers prior is either small or equal to 1.

(Defn 2.1) Let $d \in [0,1]$. A d-bounded adversary is one for which $\forall t \in D$ either $Pr[t] \leq d$ or $Pr[t] = 1$. If $Pr$ is tuple independent, it is called a d-independent adversary.

For a d-independent adversary, there is no correlation amongst tuples though there may be correlation among attributes.

(Defn 2.3) An algorithm is (d-$\gamma$)-private for all d-independent adversaries if (i) $\frac{d}{\gamma} \leq \frac{Pr[t|V]}{Pr[t]}$ and (ii) $Pr[t|V] \leq \gamma$

If $t$ fails condition (i), then there is a negative leakage (i.e., adversary learns more about the fact that tuple $t$ is not in $I$), while if it fails condition (ii), this is a positive leakage (i.e., adversary learns more about the fact that the tuple $t$ is in $I$).

(Defn 2.5) A randomized algorithm is called $(\rho, \epsilon)$-useful if it has an estimator (i.e., $\widetilde{Q}$) st for any $Q$ and $I$: $Pr[|Q(I) - \widetilde{Q}(V)| \geq \rho \sqrt{n}] \leq \epsilon$.

Impossibility result

Prove that if the attackers prior is large (i.e., $k=\Omega(\sqrt{m})$ thus, $Pr[t] = \Omega(\frac{n}{\sqrt{m}})$ thus $d= \Omega(\frac{n}{\sqrt{m}})$ by defn 2.1), then no algorithm can achieve both privacy and utility.

We first show that if $d = \Omega(\frac{n}{\sqrt{m}})$, no $(d, \gamma)$-private algorithm can achieve even little utility.

Let $Q$ be a large query (i.e., returns a sizable fraction if executed on the domain).

When we execute $\widetilde{Q}$, we get errors which depend on $n=|I|$.

If there is utility in algorithm $A$, the users should be able to distinguish between when $Q(I) = n$ and when $Q(I) = 0$. This means we should be able to differentiate between $Pr_{A}(V) = Pr[V | E_{Q}]$ and $Pr_{B}(V) = Pr[V | E_{Q}']$, where $E_{Q}$ refers to the event where $Q(I) = n$ while $E_{Q}'$ refers to the event where $Q(I) = 0$.

Intuitively, $SD(Pr_{A}, Pr_{B})$ should be large in order to differentiate $Pr_{A}$ and $Pr_{B}$. This would mean that utility exists in algorithm A, hence we can get a reasonable estimate for $Q$ by running $\widetilde{Q}$ on $V$.

An algorithm is considered meaningless if $SD(Pr_{A}, Pr_{B})$ is smaller than 0.5 for 2/3 of the queries that a user executes.

But then absence of $(d, \gamma)$-privacy means that there exists some d-independent adversary for which the following happens for some tuple $t$ : either (i) positive leakage (i.e., $Pr[t] \leq d$ but $Pr[t|V] \geq \gamma$) or (ii) negative leakage (i.e., $Pr[t|V] << Pr[t]$). Hence, we have proven the impossibility result.

Algorithm

Let us assume that $k=O(1)$. Now we can develop an algorithm to publish $V$ from $I$ so that $V$ has some utility.