Sunday, 2 November 2014

My free app for Scrabble players, which had about 35,000 downloads was recently removed from the Google play store. Initially, Google sent me the following warning email :

This is a notification that your application, Scrabble Game, with package ID com.dhruvg.apps.bingledemo, is currently in violation of our developer terms.REASON FOR WARNING: Violation of the spam provisions of the Content Policy. Please refer to the spam policy help article for more information.

Your title and/or description attempts to impersonate or leverage another popular product without permission. Please remove all such references. Do not use irrelevant, misleading, or excessive keywords in apps descriptions, titles, or metadata.

Please refer to the keyword spam policy help article for more information.
Your application will be removed if you do not make modifications to your application’s description to bring it into compliance within 7 days of the issuance of this notification. If you have additional applications in your catalog, please also review them for compliance. Note that any remaining applications found to be in violation will be removed from the Google Play Store.
Please also consult the Policy and Best Practices and the Developer Distribution Agreement as you bring your applications into compliance. You can also review this Google Play Help Center article for more information on this warning.
All violations are tracked. Serious or repeated violations of any nature will result in the termination of your developer account, and investigation and possible termination of related Google accounts.

I asked them for an explanation of the violating keywords. I had an idea that this might be the "Scrabble" keyword. However, my app is built only for tournament level Scrabble players. It is very specialized, and removing that keyword would probably result in very few downloads. Regardless, I removed all but one reference to the "Scrabble" keyword. However, this wasn't enough :

This is a notification that your application, Scrabble Bingo Game, with package ID com.dhruvg.apps.bingledemo, has been removed from the Google Play Store.REASON FOR REMOVAL: Violation of the spam provisions of the Content Policy. Please refer to the keyword spam policy help article for more information.

Your title and/or description attempts to impersonate or leverage another popular product without permission. Please remove all such references. Do not use irrelevant, misleading, or excessive keywords in apps descriptions, titles, or metadata.

This particular app has been disabled as a policy strike. If your developer account is still in good standing, you may revise and upload a policy compliant version of this application as a new package name.
This notification also serves as notice for remaining, unsuspended violations in your catalog, and you may avoid further app suspensions by immediately unpublishing any apps in violation of (but not limited to) the above policy. Once you have resolved any existing violations, you may republish the app(s) at will. Before publishing applications, please ensure your apps’ compliance with the Developer Distribution Agreement and Content Policy.
All violations are tracked. Serious or repeated violations of any nature will result in the termination of your developer account, and investigation and possible termination of related Google accounts. If your account is terminated, payments will cease and Google may recover the proceeds of any past sales and the cost of any associated fees (such as chargebacks and payment transaction fees) from you.
If you feel we have made this determination in error, you can visit the Google Play Help Center article for additional information regarding this removal.

My appeal :

My app has been in the store for 2 years now, and I have so far complied
with all requirements. It is a free app (I earn no money from adds) and the
app was built by me alone. I only made it to help competitive
tournament-level Scrabble players improve their game. The app is innately
tied to the game of Scrabble, hence the "Scrabble" keyword is necessary to
promote the game, because it is only directed at improving a players
Scrabble vocabulary via 7-word scores.
I implore you to reconsider reinstating my app on the play store. At least
advise me on the specific keyword violation so I can remove it. I have no
intention of spamming users. My game is only built for Scrabble players, so
how can I remove the Scrabble keyword from the game title? I would
be grateful for any help and guidance and am willing to cooperate fully.

And the automated response :

Hi,
We have reviewed your appeal and will not be reinstating your app. This decision is final and we will not be responding to any additional emails regarding this removal.
If your account is still in good standing and the nature of your app allows for republishing you may consider releasing a new, policy compliant version of your app to Google Play under a new package name. We are unable to comment further on the specific policy basis for this removal or provide guidance on bringing future versions of your app into policy compliance. Instead, please reference the REASON FOR REMOVAL in the initial notification email from Google Play.
Please note that additional violations may result in a suspension of your Google Play Developer account.

Lessons learnt :

- Google has terrible customer service. There is no way for me to talk to a human being about my problem.
- All information about downloads, ratings, etc is lost and apparently, this decision is irreversible. I will avoid developing apps for Android (though I highly doubt iOS is any any better in these types of situations).
- This is my last blog post on Google's blogger platform. I think it's time for me to avoid relying on just a single platform/provider/company. Mostly because of this line -
"Serious or repeated violations of any nature will result in the termination of your developer account, and investigation and possible termination of related Google accounts."
- All future blog posts will be hosted on Github pages instead.

Friday, 15 August 2014

In this paper, we will introduce a formal approach to measuring utility. Using this measure, we will show how to inject utility into k-anonymous and l-diverse tables, while maintaining the same level of privacy.

Introduction

k-anonymity and l-diversity rely on generalizations to preserve privacy.

In the real-world, many attributes often need to be suppressed in order to guarantee privacy, which is bad for utility no matter what operations are being performed on the data.

One solution is to publish marginals (i.e., contingency tables for a subset of the attributes) along with the original anonymized data. This would require anonymizing the marginals too (also via generalizations) in order to preserve privacy.

However, there are many possible subsets of attributes (marginals) for which contingency tables can be built. How do we decide which particular collection of marginals to publish?

Preliminaries

(Defn 2.3) K-anonymity : Table $D$ satisfies k-anonymity if $\forall t \in D$, there are at least $k-1$ other tuples that have the same values as $t$ for every QI (quasi identifier attribute). Note that we assume that the set of all non-sensitive attributes form the QI.

(Defn 2.4) Anonymized group : An anonymized group is a setwise maximal set of tuples that have the same (generalized) value for each non-sensitive attribute.

(Defn 2.5) (c,l)-diversity : Let $c>0$ be a constant and $q$ be an anonymized group. Let $S$ be a sensitive attribute and $s_{1},..., s_{m}$ be the values of $S$ that appear in $q$ and let $r_{1},..., r_{m}$ be their frequency counts. Let $r_{(1)},...,r_{(m)}$ be those counts sorted in a descending order. We say $q$ satisfies (c,l)-diversity wrt $S$ if $r_{(1)} \leq c \sum_{i=l}^{m}r_{(i)}$.

Existing utility measures

Generalization height is one utility measure.

Another measure is discernability, which assigns a cost to each tuple based on how many other tuples are indistinguishable from it. It is the sum of squares of anonymized groups, plus $|D|$ times the number of suppressed tuples.

Both of the above measures do not consider the distributions of the tuples.

A third measure is the classification metric, appropriate when one wants to train a classifier over the anonymized data. Thus, one attribute is treated as a class label. This metric assigns a penalty of 1 for every tuple suppressed. If a tuple $t$ is not suppressed, it looks at the majority class label of $t$'s anonymized group. If the class label of tuple $t$ is different from the majortiy, assign a penalty of 1. This metric the sum of all penalties. But it is not clear what happens if one wants to build classifiers for several different attributes.

A fourth measure is the information to privacy loss ratio, also designed for classification. However, it suffers from the same weakness as the classification metric.

Proposed utility measure

We view the data as an iid sample generated from some distribution $F$.

Suppose tuples in our table have (discrete valued) attributes $U_{1},..., U_{n}$. Then we can estimate $F$ using $\hat{F_{1}}$ where $\hat{F_{1}}$ corresponds to $P(t.U_{1}=u_{1},...,t.U_{n} = u_{n})$ where $t.U_{1}$ refers to the attribute value $U_{1}$ for tuple $t$.

Now suppose we are given anonymized marginals (e.g., 23% of tuples have the age attribute appearing between [46-50] years old while 77% appear between [50-55]). We can view 23% and 77% as constraints i.e., the marginals represent constraints. We can compute the (maximum entropy) probability distribution that corresponds to these constraints, $\hat{F_{2}}$.

(It turns out that the maximum entropy is also the maximum likelihood estimate associated with log linear models.)

We now have $\hat{F_{1}}$ associated with the original data and $\hat{F_{2}}$ associated with the anonymized marginals. We can compare them using the standard KL (kullback-liebler) divergence, which is minimized with $\hat{F_{1}} = \hat{F_{2}}$.

Since our goal is to determine which anonymized marginals to publish, we want to minimize KL-divergence between the various possible $\hat{F_{2}}$ and a fixed $\hat{F_{1}}$.

Extending privacy definitions

We can extend k-anonymity and l-diversity to collections of anonymized marginals.

(Defn 4.1) k-link anonymity : A collection of anonymized marginals $M_{1},...,M_{r}$ satisfies k-link anonymity if for all i = $1,...,r$ and for all tuples $t \in NonSensitive(M_{i})$, either $M_{i}(t) = 0$ or $M_{i}(t) \geq k$. $NonSensitive(M_{i})$ refers to the non sensitive attributes which $M_{i}$ is comprised of while $M_{i}(t)$ refers to the number of the tuples which have the same attribute values as $M_{i}$.

We must also be sure that an adversary cannot use combinatorial techniques to determine that a tuple with a certain value for its quasi-identifiers exists in the original table and that the number of such tuples is less than $k$.

(Defn 4.2) k-combinatorial anonymity : Let $D$ be the domain of the nonsensitive attributes. A collection of anonymized marginals $M_{1},... ,M_{r}$ satisfies k-combinatorial anonymity if for all $t \in D$ one of the following holds:

Thursday, 7 August 2014

Association of QIs (quasi identifier attributes) with sensitive attributes is known as sensitive attribute disclosure (when breached) c.f., membership disclosure which is concerned with identifying whether an individual is in a table.

Goal of the paper is to evaluate the tradeoff bet. incremental gain (by generalization and suppression) in data mining utility and the degradation in privacy caused by publishing QI attributes with sensitive attributes (by the very same generalization and suppression).

Privacy loss is the increase in adversary's ability to learn sensitive attributes corresponding to a given identity.

Utility gain is the increase in accuracy of machine learning (ml) tasks evaluated on the sanitized dataset.

Notations

Tuples $t_{i}$ and $t_{j}$ are quasi-equivalent if $t_{i}[Q] = t_{j}[Q]$ for the QI attribute $Q$. We can thus partition the original table $T$ into QI classes (equivalence classes), denoted as $<t_{j}>$.

Let $\epsilon_{Q} \subseteq T$ be a set of representative records for each equivalence class in $T$.

Semantic privacy is concerned with prior/posterior knowledge of adversary. Anonymization is usually performed by randomly perturbing attribute values.

In contrast, syntactic privacy (microdata privacy) is only concerned about the distribution of attribute values in the sanitized table without any care about the adversary's knowledge. Privacy here is performed by generalization and suppression.

This paper is only concerned with semantic privacy because of the numerous weaknesses of syntactic privacy (weaknesses of k-anonymity, l-diversity, t-closeness were presented by the authors).

Attack model

Adversary only sees $T'$ (sanitized table).

We assume that generalization and suppression is only carried out to QI attibutes, not to sensitive attributes. Why? To keep sanitized data ``truthful'' [31,34] (this reasoning doesn't make much sense to me).

We can now define sensitive attribute disclosure : $A_{diff}(<t>) = \frac{1}{2} \sum_{i=1}^{l}|p(T, s_{i}) - p(<t>, s_{i})|$ $A_{quot}(<t>) = |\frac{p(<t>, s)}{p(T, s)}|$ Basically, these values define how much more the adversary learns from observing santizied QI attributes that he would have learnt from trivial sanitization.

Semantic Privacy

(Defn 1)$\delta$-disclosure privacy : $<t>$ is $\delta$-disclosure private wrt $S$ (sensitive attribute) for all $s \in S$ if $A_{quot}(<t>) = |\log \frac{p(<t>, s)}{p(T,s)}| < \delta$. A table is $\delta$-disclosure private if $\forall t \in \epsilon_{Q}, <t>$ is $\delta$-disclosure private (intuitively, this means that the distribution of sensitive attribute values within each QI class is roughly the same as the others in $T$).

Defn 1 allows us to relate $\delta$ with the information gain parameter (defined as $GAIN(S,Q) = H(S) - H(S|Q)$ used by decision tree classifiers such as ID3 and C4.5). Refer to lemma 1 below.

These privacy measures refer to k-anonymity (i.e., $\forall t_{j} \in \epsilon_{Q}, |<t_{j}>| \geq k$), l-diversity, (c, l)-diversity and t-closeness. These measures have many weaknesses. Ultimately, the authors are only concerned with sematic privacy and not with syntactic privacy.

Measuring privacy

We have a bound on disclosure based on definition 1. However, defintion 1 is insufficient for expressing privacy because actual table instances may have less sensitive attribute disclosure (thus more privacy) than permitted by the definition.

$A_{acc}$ underestimates the amount of info leaked by $T'$ because it doesn't consider the shifts in the probabilty of non-majority sensitive attribute values.

Utility

One approach to measuring utility is to minimize the amount of generalization and suppression to QI attributes [10].

Utility measurement is innately tied to the computations that are to be performed on $T'$. We are concerned with measuring the utility of classification tasks.

Intuitively, the utility of $T'$ should be measured by how well cross attribute correlations are preserved after sanitization.

In our context, we are using the semantic definition of privacy and semantic definition of utility and wish to evaluate the tradeoff between them.

For some workload $w$, we measure workload specific utility of trivially sanitized datasets (where privacy is maximum), denoted by $U_{base}^{w}$. One example for utility is the accuracy of the classifier.

We then compute $U_{san}^{w}$ for $T'$.

$U_{san}^{w} - U_{base}^{w}$ is the utility gain. If this value is close to 0, sanitization is pointless (i.e. trivial sanitization provides as much utility as any sophisticated sanitization algorithm while providing as much privacy as possible).

Another utility measure is $U_{max}^{w}$ on the original dataset $T$. If this value is low, the workload is inappropriate for the data regardless of sanitization because even if users were given $T$, utility would have been low.

Authors used implementation of generalization and suppression from LeFerre[20]. Weka was used for classification.

Of course if sanitization has been correctly performed, it is impossible to build a good classifier for $S$ (sensitive attribute) based only on $Q$ (QI attribute) because good sanitization must destroy any correlation between $S$ and $Q$.

Perhaps it is not surprising that sanitization makes it difficult to build an accurate classifier for $S$. But experiments on the UCI dataset show that sanitization also destroys correlation between $S$ and neutral attributes.

Experiments indicate that it is difficult to find DBs on which sanitization permits both privacy and utility.

However, we can construct artifical DBs for specific scenarios that permits both privacy and accuracy (example provided is very contrived).

Wednesday, 6 August 2014

Definition of privacy : an attackers prior belief of tuple $t$ belonging to private instance $I$ divided by his posterior belief of tuple $t$ belonging to $I$ given the published instance $V$.

Prior is given by $Pr[t] \leq k \frac{n}{m}$ where $n = |I|$, $m = |D|$ ($D$ refers to the domain of the attribute values of $I$), and $k$ is some constant. This is a reasonable definition because $I$ contains $n$ tuples out of $m$ possible tuples.

Posterior is given by $Pr[t | V] \leq \gamma$ for some $\gamma$.

Definition of utility : Utility is the ability to estimate counting queries (example of a counting query- SELECT count(*) WHERE country=``Canada''). Utility is conveyed by the formula : $Q(I) - \widetilde{Q}(V) \leq \rho \sqrt{n}$ where $n = |I|$, $Q(I)$ refers to a counting query over the instance $I$, $\widetilde{Q}(V)$ refers to a counting query estimate over the published instance $V$ and $\rho$ is some number.

One result of the paper is that if $k$ is $\Omega (\sqrt{m})$ (hence $Pr[t] = \Omega(\frac{n}{\sqrt{m}})$, which represents a very powerful attacker with a large prior) then no algorithm can achieve both utility and privacy.

If $k$ is $O(1)$ (i.e., $Pr[t] = O(\frac{n}{m})$, which means that the attacker is weaker) then we describe an algorithm that achieves a privacy/utility tradeoff given by the formula $\rho = \sqrt{\frac{k}{\gamma}}$. The idea is that if $\rho$ is small ($\rho$ represents query estimation error, shown in the RHS of the utility formula), then utility is high because query estimation error is low. $k$ conveys prior while $\gamma$ conveys the posterior (definition of privacy).

The described anonymization algorithm is a randomized algorithm st for any $Q$ (query), the probability that the algorithm outputs $V$ st $Q(I) - \widetilde{Q}(V) \leq \rho \sqrt{n}$ is $1-\epsilon$ for some $\epsilon$.

Classification of adversaries

For every tuple, the attackers prior is either small or equal to 1.

(Defn 2.1) Let $d \in [0,1]$. A d-bounded adversary is one for which $\forall t \in D$ either $Pr[t] \leq d$ or $Pr[t] = 1$. If $Pr$ is tuple independent, it is called a d-independent adversary.

For a d-independent adversary, there is no correlation amongst tuples though there may be correlation among attributes.

(Defn 2.3) An algorithm is (d-$\gamma$)-private for all d-independent adversaries if (i) $\frac{d}{\gamma} \leq \frac{Pr[t|V]}{Pr[t]}$ and (ii) $Pr[t|V] \leq \gamma$

If $t$ fails condition (i), then there is a negative leakage (i.e., adversary learns more about the fact that tuple $t$ is not in $I$), while if it fails condition (ii), this is a positive leakage (i.e., adversary learns more about the fact that the tuple $t$ is in $I$).

(Defn 2.5) A randomized algorithm is called $(\rho, \epsilon)$-useful if it has an estimator (i.e., $\widetilde{Q}$) st for any $Q$ and $I$: $Pr[|Q(I) - \widetilde{Q}(V)| \geq \rho \sqrt{n}] \leq \epsilon$.

Impossibility result

Prove that if the attackers prior is large (i.e., $k=\Omega(\sqrt{m})$ thus, $Pr[t] = \Omega(\frac{n}{\sqrt{m}})$ thus $d= \Omega(\frac{n}{\sqrt{m}})$ by defn 2.1), then no algorithm can achieve both privacy and utility.

We first show that if $d = \Omega(\frac{n}{\sqrt{m}})$, no $(d, \gamma)$-private algorithm can achieve even little utility.

Let $Q$ be a large query (i.e., returns a sizable fraction if executed on the domain).

When we execute $\widetilde{Q}$, we get errors which depend on $n=|I|$.

If there is utility in algorithm $A$, the users should be able to distinguish between when $Q(I) = n$ and when $Q(I) = 0$. This means we should be able to differentiate between $Pr_{A}(V) = Pr[V | E_{Q}]$ and $Pr_{B}(V) = Pr[V | E_{Q}']$, where $E_{Q}$ refers to the event where $Q(I) = n$ while $E_{Q}'$ refers to the event where $Q(I) = 0$.

Intuitively, $SD(Pr_{A}, Pr_{B})$ should be large in order to differentiate $Pr_{A}$ and $Pr_{B}$. This would mean that utility exists in algorithm A, hence we can get a reasonable estimate for $Q$ by running $\widetilde{Q}$ on $V$.

An algorithm is considered meaningless if $SD(Pr_{A}, Pr_{B})$ is smaller than 0.5 for 2/3 of the queries that a user executes.

But then absence of $(d, \gamma)$-privacy means that there exists some d-independent adversary for which the following happens for some tuple $t$ : either (i) positive leakage (i.e., $Pr[t] \leq d$ but $Pr[t|V] \geq \gamma$) or (ii) negative leakage (i.e., $Pr[t|V] << Pr[t]$). Hence, we have proven the impossibility result.

Algorithm

Let us assume that $k=O(1)$. Now we can develop an algorithm to publish $V$ from $I$ so that $V$ has some utility.

Query is locally valid for dataset $D$ and set of rules $R$ if it conceals all relationships of $R$. Query is globally valid for any $D$ which conforms to some schema $S$ if $R$ is concealed for all $D$.

Document model $D = (X, E_{D}, root_{D}, labelOf_{D}, valueOf_{D})$.$X$ refers to nodes (nodes contain metadata like ids), $E_{D}$ refers to elements of the XML (i.e., attribute values, this is, in reality, a metadata of a node). $root_{D}$ is the root node, $labelOf_{D}$ is a function $f: X \to E_{D}$ and $valueOf_{D}$ is a function $f: X \to A$ where $A$ is a set of element values of atomic nodes (nodes with no outgoing edges).

k-concealment of relationship $(A,B)$ for some elements $A, B$ in $D$: given $b \in B$, we want a subset $A_{k} \subset A$ st (i) no element in $A_{k}$ is ancestor of another (ii) user cannot infer which among $k$ is an ancestor of $B$.

Expansion of $D$ is $D''$, created by replacing $E_{D}$ with $E'$ (children) and $E''_{D}$ (descendents).

A transitive closure $\bar{D}$ is an expansion of $D$ st there is an edge bet. every 2 nodes connected by direct path in $D$ and an edge from every node to itself.

An XPath query is equivalent to evaluating the query over the transitive closure of the document.

To prune a document expansion is to remove from $D''$ all edges that connect restricted nodes.

Universe of expansions of $D$ is the set of all $D''$ st $prune(\bar{D}$) = prune($D'')$. Denote this by $U_{R}(D)$.

(Global validity) $Q$ is globally valid for $R$ give schema $S$ if $Q$ is locally valid $\forall D$ that conform to $S$. This is more restrictive than local validity but offers benefits (no need to check $Q$ over all $D$ if $Q$ is globally valid, saving computation).

$R$ is coherent if it is impossible to infer any hidden relationships from the relationships not pruned by the rules.

$R$ has incomplete concealment in $D$ if either (i) $D$ has 3 elements $e1,e2,e3$ st $prune(\bar{D})$ has an edge from $e1$ to $e2$ and $e2$ to $e3$ but none from $e1$ to $e3$ or (ii) $D$ has 3 elements $e1,e2,e3$ st $prune(\bar{D})$ has an edge from $e1$ to $e3$ and $e2$ to $e3$ but none from $e1$ to $e2$.

(Coherent) R is coherent if incomplete concealment does not occur in $D$.

Encapsulating edge in $\bar{D}$: $(e1, e2)$ encapsulates $(e1' e2')$ if there is a path running through all of these edges and $e1$ appears before (or at the same spot) as $e1'$ or if $e2$ appears after (or at the same spot) as $e2'$.

If $R$ is coherent, $\forall (e1,e2)$ in $\bar{D}$ which are removed by pruning, if there is an edge $(e1',e2')$ which is encapsulated by $(e1,e2)$, then $(e1',e2')$ should be removed too.

It had become really taxing for me to work on simultaneous applications on my 4GB macbook pro. Eclipse especially consumes an inordinate amount on memory. Hence I decided to purchase a 16GB RAM card on Amazon.

The innards of my Macbook Pro (late 2011)!

The difference between 4GB and 16GB is truly astounding. Now I can actually open up RStudio and Eclipse simultaneously. What a luxury :|

As an aside, it is ridiculous how shipping a RAM card from the US to Canada even with the shipping charges saves you a significant amount compared to buying the RAM card here. This is true for almost all products. Why Canada, WHY?!