Friday, 15 November 2013

Having installed PostgreSQL 9.1 on my EC2 instance (Amazon Linux) following these instructions, I wanted to experiment with a few contrib modules. However, this process was not trivial, mainly because the installed Postgres distribution did not include the contrib modules by default. Here's what I did:

Monday, 11 November 2013

Extracting useful information from large datasets has become an increasingly common endeavour. However, if datasets are not privacy protected, data mining can lead to a breach of individual privacy. One privacy-preserving technique, k-anonymity, is examined. The authors describe k-anonymity, its enforcement, threats to k-anonymity, and ways of addressing those threats (in the context of data mining).

k-anonymity requires that the size of a QI-group (equivalence class) is at least k. QI attributes are those that can serve as an identifier for an individual. The aim is for tuples in a QI-group to be indistinguishable from each other. QI-groups can be built by generalization (replacing specific values with generalized versions from a hierarchy) and suppression (changing a specific attribute value to its most generalized form). These techniques result in information loss, which algorithms such as Samarati’s seek to minimize.
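As a toy illustration (not from the paper), generalization over a single hypothetical QI attribute (ZIP code) and the k-anonymity check can be sketched as:

```python
from collections import Counter

def generalize_zip(zip_code, level):
    """One step up a generalization hierarchy: mask the last `level` digits."""
    return zip_code[:len(zip_code) - level] + "*" * level

def is_k_anonymous(qi_values, k):
    """Every QI-group (records with identical QI values) must have size >= k."""
    groups = Counter(qi_values)
    return all(count >= k for count in groups.values())

# Hypothetical QI attribute: ZIP code of each record.
zips = ["53715", "53710", "53718", "02138", "02139", "02141"]

assert not is_k_anonymous(zips, 3)        # raw values are all distinct
generalized = [generalize_zip(z, 2) for z in zips]
assert is_k_anonymous(generalized, 3)     # "537**" x3 and "021**" x3
```

The masked values lose information (the exact ZIP), which is precisely the loss the anonymization algorithms try to minimize.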

k-anonymity can be violated by data mining techniques (e.g., association rule mining, classification, etc.) since they can be used to draw inferences and breach privacy. To address this, one could either anonymize-and-mine (AM) or mine-and-anonymize (MA). The AM approach decouples anonymization from mining but reduces the utility of the data, while the MA approach extracts more utility from the data and is more efficient but restricts data mining to the data owner.

Record matching is the process of identifying identical entities across different data sources (or identifying duplicates in a single source). When record matching is performed across autonomous businesses, privacy preservation becomes important (e.g., businesses might not want to share schema or tuple information). The authors thus present an efficient privacy-preserving protocol for record matching across data sources, assuming semi-honest parties. The protocol provides both secure data matching and secure schema matching.

Secure data matching is carried out in 3 phases. In phase 1, the two parties, P and Q, decide on a set of random strings and a comparison function. The SparseMap algorithm is used with this set and the comparison function to build an embedding space. In phase 2, P and Q both apply SparseMap to embed Rp and Rq into the embedding space. Two structures, Pstr and Qstr, are also created in this process (these keep track of vectors in the Euclidean space; a vector is an attribute value of a tuple). In phase 3, a third party, W, compares each record in Pstr and Qstr. If the comparison distance is below a threshold, those records are considered the same (approximate matching, c.f. [Agrawal, SIGMOD ‘03], which uses exact matching). Thus, all matching records can be collected and returned to P and Q.
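A much-simplified sketch of the three phases, using distances to a set of shared reference strings as a stand-in for the actual SparseMap embedding (all names, strings and the threshold are hypothetical):

```python
import math

def edit_distance(a, b):
    """Standard dynamic-programming Levenshtein distance."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

def embed(value, references):
    """Phase 2 (simplified): map a string to its vector of distances
    to the reference strings P and Q agreed on in phase 1."""
    return [edit_distance(value, r) for r in references]

def match(p_vectors, q_vectors, threshold):
    """Phase 3: third party W flags pairs whose Euclidean distance
    in the embedded space falls below the threshold."""
    pairs = []
    for i, u in enumerate(p_vectors):
        for j, v in enumerate(q_vectors):
            if math.dist(u, v) < threshold:
                pairs.append((i, j))
    return pairs

refs = ["kwv", "qzx"]                       # random strings agreed on in phase 1
Rp = ["john smith", "alice jones"]
Rq = ["jon smith", "bob brown"]
print(match([embed(s, refs) for s in Rp],
            [embed(s, refs) for s in Rq], threshold=1.5))
```

W only ever sees vectors, not the original attribute values, which is the point of working in the embedded space.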

In secure schema matching, P and Q map their local schemas to a global schema owned by W in a series of 4 phases. In phase 1, P and Q decide on a secret key k (not known to W). In phase 2, P and Q map their local schemas to the global schema using mapping expressions, generating matched schemas. In phase 3, these matched schemas are encrypted by P and Q respectively, using a hash function (provided by W) and the key k. In the final phase, W computes the intersection of these encrypted sets. The intersection can be sent to P and Q, who can then decrypt it using key k.
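Assuming an HMAC-style keyed hash plays the role of "hash function plus key k" (the paper's exact construction may differ), the four phases can be sketched as:

```python
import hmac
import hashlib

def encrypt_schema(attributes, key):
    """Phase 3: encrypt matched (global-schema) attribute names with a keyed
    hash; W sees only digests and never learns the key k."""
    return {hmac.new(key, a.encode(), hashlib.sha256).hexdigest(): a
            for a in attributes}

def w_intersect(p_digests, q_digests):
    """Phase 4: third party W intersects the two encrypted sets."""
    return set(p_digests) & set(q_digests)

key = b"shared-secret-k"                 # phase 1: secret known only to P and Q
p_schema = {"name", "address", "phone"}  # phase 2 output for P
q_schema = {"name", "phone", "email"}    # phase 2 output for Q

p_enc = encrypt_schema(p_schema, key)
q_enc = encrypt_schema(q_schema, key)
common = w_intersect(p_enc, q_enc)
# P and Q recover attribute names via their own digest-to-name tables.
print(sorted(p_enc[d] for d in common))  # -> ['name', 'phone']
```

Since HMAC is deterministic for a fixed key, equal attribute names produce equal digests, so W can intersect without learning the names themselves.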

The overall protocol uses both the secure data matching step and (a slightly modified version of) the secure schema matching step. Proofs are provided to corroborate the security properties of the overall protocol. Experiments suggest that (i) high quality embedding is possible by tuning the embedded space, (ii) the protocol is effective, but there is a tradeoff between quality and efficiency wrt the heuristics used, and (iii) the protocol is roughly twice as efficient as [Agrawal, SIGMOD ‘03] and comparably efficient to record matching in the original space.

Monday, 4 November 2013

Recent trends (e.g., outsourcing) in the field of information integration highlight the importance of security when sharing information across autonomous DBs. It is essential that queries (which span multiple DBs) be answered in a minimal manner, i.e., no additional information apart from the query results is leaked. Present solutions are inadequate (e.g., using third-party middlemen, which requires a high degree of trust in these middlemen). Instead, the authors propose cryptographic protocols which simulate trusted third parties. Protocols for 4 operations are explored. These protocols permit only the minimal necessary information to be shared for queries, assuming semi-honest parties.

The protocols support 4 operations: (i) intersection, (ii) equijoin, (iii) intersection size and (iv) equijoin size. For (i), a naïve protocol is proposed, which has the disadvantage that R (the receiver) can guess all additional information sent by S (the sender) if the domain is small. This can be fixed by using a pair of commutative encryption functions (assuming ideal hashing). For (ii), another naïve protocol is proposed, which encrypts extra information insecurely and shares the same weakness as the naïve protocol in (i). To overcome this, the extra information is encrypted with a key, which can be obtained by inverting a secure function. For (iii), the intersection protocol in (i) is modified so that R learns only the intersection size, and not the intersection set itself (which would also reveal the size). For (iv), the protocol is similar to that in (iii). All 4 protocols are proven correct and secure assuming semi-honest parties (i.e., they disclose minimal information apart from the query result).
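A minimal sketch of a commutative-encryption intersection in the spirit of (i), using modular exponentiation f_e(x) = x^e mod p, which commutes because (x^e1)^e2 = (x^e2)^e1 mod p. The prime, keys and sets are illustrative only, not the paper's parameters:

```python
import hashlib
import random

P = (1 << 127) - 1   # a Mersenne prime modulus; real use needs a vetted large prime

def h(value):
    """Hash a value into the multiplicative group mod P (ideal-hash assumption)."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest(), "big") % P or 1

def commute(x, e):
    """Commutative 'encryption' f_e(x) = x^e mod P."""
    return pow(x, e, P)

random.seed(0)
e_s = random.randrange(2, P - 1)    # sender's secret exponent
e_r = random.randrange(2, P - 1)    # receiver's secret exponent

Vs = {"alice", "bob"}               # sender's set
Vr = {"bob", "carol"}               # receiver's set

# Each party encrypts its own hashed values, then the other party re-encrypts
# them; the doubly encrypted values are comparable because the functions commute.
double_s = {commute(commute(h(v), e_s), e_r) for v in Vs}
double_r = {commute(commute(h(v), e_r), e_s) for v in Vr}
print(len(double_s & double_r))     # intersection size -> 1 ("bob")
```

Neither party sees the other's plaintext values; only equality of doubly encrypted values is revealed.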

The authors acknowledge certain limitations of their protocols. Their protocols do not make any guarantees for multiple queries (which can be handled in other ways e.g. maintaining audit trails). They also assume that database schemas are known, and don't account for schema heterogeneity.

Finally, the authors present the theoretical efficiency of the proposed protocols and also perform estimated cost analysis for certain scenarios.

Monday, 28 October 2013

In the data cleaning process, it is often unclear whether the data is dirty or the data constraints are outdated (or both). [Chiang, ICDE ‘11] proposed a novel approach involving both data and FD repairs in a unified repair model. In this paper, the authors explore a similar idea by assigning relative trust values to data and FD repairs in a combined model. The proposed algorithms generate multiple data and FD repairs corresponding to specific relative trust values.

The key algorithm, Repair_Data_FDs, produces (Σ’, I’) repairs, i.e., τ-constrained repairs (where τ is the relative trust value). These τ-constrained repairs are essentially minimal repairs. The two important steps of Repair_Data_FDs are (i) modifying Σ to get Σ’ (as close to Σ as possible) while guaranteeing that there exists an instance I’ s.t. I’ |= Σ’ with dist(I’, I) <= τ, and (ii) modifying I wrt Σ’ to get a minimal I’.

In step (i), a key requirement is computing the minimum number of cell changes in I needed to satisfy Σ’, called δopt(Σ’, I). This is NP-hard. An approximation, called a P-approximate τ-constrained repair, can be used to represent an upper bound on δopt(Σ’, I). This upper bound can also be computed as a 2-approximate vertex cover of the conflict graph (given I and Σ) below a threshold. Step (i) can now be formalized. One crucial procedure is finding goal states in a state space of FD modifications, which can be done via an optimized BFS or A* search. The output of step (i), Σ’, is then processed by step (ii) to find data repairs wrt Σ’. The output of step (ii) is a V-instance (which is used to represent multiple I’s), obtained by finding vertex covers of conflict graphs (similar to finding the upper bound of δopt(Σ’, I) in step (i)).
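The conflict graph and a 2-approximate vertex cover (via the standard greedy maximal-matching heuristic) can be sketched as follows; the instance and FD are hypothetical:

```python
def fd_conflict_edges(tuples, fd):
    """Edges between tuple pairs that jointly violate FD X -> Y
    (same X value but different Y values)."""
    X, Y = fd
    edges = []
    for i in range(len(tuples)):
        for j in range(i + 1, len(tuples)):
            if tuples[i][X] == tuples[j][X] and tuples[i][Y] != tuples[j][Y]:
                edges.append((i, j))
    return edges

def vertex_cover_2approx(edges):
    """Maximal-matching 2-approximation: take both endpoints of every edge
    not yet covered. The cover size upper-bounds the number of tuples
    that must change to resolve all conflicts."""
    cover = set()
    for u, v in edges:
        if u not in cover and v not in cover:
            cover.update((u, v))
    return cover

# Hypothetical instance over (zip, city) with FD zip -> city.
I = [{"zip": "53715", "city": "Madison"},
     {"zip": "53715", "city": "Milwaukee"},
     {"zip": "02138", "city": "Cambridge"}]
cover = vertex_cover_2approx(fd_conflict_edges(I, ("zip", "city")))
print(sorted(cover))   # -> [0, 1]
```

Here tuples 0 and 1 conflict on FD zip -> city, so both land in the cover; changing a cell in either tuple resolves the conflict.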

Another algorithm, Find_Repairs_FDs (a modification of Repair_Data_FDs) performs efficiently on multiple τ values. However, unlike Repair_Data_FDs, Find_Repairs_FDs only generates FD repairs. Hence, another algorithm, Repair_Data, is used to find data modifications for each FD repair. Experiments indicate that the A* algorithm scales better than BFS for most conditions.

Repairing dirty data is an important process since unclean data leads to erroneous decision making and economic loss. Most current repair techniques only focus on the repair of data inconsistencies (wrt a set of constraints). However, in many modern scenarios, the constraints need repair too (e.g., evolution of business rules, etc.). In this paper, the authors present a novel cost model which accounts for both data and FD constraint repair. In addition, two algorithms are examined (individually and together): one for data repair and one for constraint repair.

Data repair is based on the concept of tuple patterns. A core tuple pattern is a frequently occurring tuple in the relation, and a sequence of these forms the model. A deviant tuple pattern is one which falls below the frequency threshold but is very similar to at least one core pattern. The intuition is that deviants are candidates for data repair.
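A rough sketch of core/deviant pattern detection, with made-up thresholds and a toy (zip, city) relation; the actual model in the paper is richer:

```python
from collections import Counter

def find_patterns(tuples, support_threshold, similarity_threshold=1):
    """Core patterns occur at least `support_threshold` times; deviants fall
    below that but differ from some core pattern in at most
    `similarity_threshold` attribute positions."""
    counts = Counter(tuples)
    cores = {t for t, c in counts.items() if c >= support_threshold}
    deviants = {t for t, c in counts.items()
                if c < support_threshold
                and any(sum(a != b for a, b in zip(t, core)) <= similarity_threshold
                        for core in cores)}
    return cores, deviants

# The lone misspelled "Madson" row is flagged as a candidate for data repair.
rows = [("53715", "Madison")] * 4 + [("53715", "Madson")]
cores, deviants = find_patterns(rows, support_threshold=3)
print(cores, deviants)
```

The deviant differs from the core pattern in exactly one attribute, so repairing it toward the core pattern is the natural suggestion.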

Constraint repair is based on the concept of VI (variation of information). Tuples are first partitioned into classes, where each class in a specific partition contains all tuples sharing the same X value wrt an FD, F: X -> Y. Classes can be further refined so that each class shares both the X and Y values. Constraint repair consists of enumerating candidate attribute partitions and calculating VI scores (a technique used for comparing clusterings) against the ground truth partition (the original partition based on F, since F is a validated application constraint). The attribute partition with the lowest VI wrt the ground truth is selected as the constraint repair.
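A minimal sketch of variation of information between a candidate partition and the ground truth, both encoded as per-tuple class labels (the encoding is an assumption for illustration):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H of a partition given as class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def mutual_information(u, v):
    """Mutual information I(U; V) between two partitions of the same tuples."""
    n = len(u)
    joint, pu, pv = Counter(zip(u, v)), Counter(u), Counter(v)
    return sum((c / n) * math.log2((c * n) / (pu[a] * pv[b]))
               for (a, b), c in joint.items())

def variation_of_information(u, v):
    """VI(U, V) = H(U) + H(V) - 2 I(U; V); it is 0 iff the partitions coincide."""
    return entropy(u) + entropy(v) - 2 * mutual_information(u, v)

# Ground-truth partition induced by F: X -> Y vs. two candidate partitions.
truth = [0, 0, 1, 1]
print(variation_of_information(truth, [0, 0, 1, 1]))  # identical -> 0.0
print(variation_of_information(truth, [0, 1, 0, 1]))  # independent -> 2.0
```

Among the enumerated candidates, the one minimizing this score against the ground truth would be picked as the constraint repair.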

Combining the data and constraint repair techniques results in the unified cost model. The model is guided by 3 properties (accuracy, redundancy and conciseness), which are encapsulated by a DL (description length) score. The implementation of the cost model is the MDL repair algorithm. The repair algorithm first orders all constraints and then loops over them. Within the loop, the cost of data repair is calculated first, then the cost of constraint repair, and the repair type with the lower cost is recommended. Extensive experiments indicate that the unified model scales well (given the parameters) and produces high-quality repairs (based on the Cora dataset).
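A skeletal sketch of that per-constraint loop, with hypothetical cost functions standing in for the DL-based data- and constraint-repair costs:

```python
def mdl_repair(fds, data_repair_cost, constraint_repair_cost):
    """Sketch of the unified loop: for each ordered constraint, recommend
    whichever repair type has the lower (DL-style) cost. The cost functions
    here are placeholders, not the paper's actual scoring."""
    plan = []
    for fd in fds:
        dc = data_repair_cost(fd)        # cost of fixing the data to satisfy fd
        cc = constraint_repair_cost(fd)  # cost of modifying fd itself
        plan.append((fd, "data" if dc <= cc else "constraint", min(dc, cc)))
    return plan

# Toy costs: pretend "zip -> city" is cheaper to keep by fixing data,
# while "name -> phone" is cheaper to relax as a constraint.
costs = {"zip -> city": (2.0, 5.0), "name -> phone": (9.0, 3.0)}
plan = mdl_repair(sorted(costs),
                  lambda fd: costs[fd][0],
                  lambda fd: costs[fd][1])
print(plan)
```

The output pairs each FD with the recommended repair type and its cost, mirroring the "lower cost wins" rule described above.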