PASMAC

Privacy and Security for MapReduce Clouds

People

PhD Students

Faculty

Despite the hype, only a few large enterprises or governmental organizations outsource their services to public clouds. On the one hand, cloud computing has been identified as one of the top ten business strategies as it offers many advantages such as greater flexibility and reduced costs. However, on the other hand, outsourcing and, therewith, relinquishing control of services implies many security and privacy problems. Cloud providers often place their data centers in foreign countries where security policies are difficult to enforce, cloud infrastructures are threatened by hackers, insiders, and even malicious customers trying to peek into or tamper with outsourced data. Consequently, the cloud cannot be trusted. As of today, lack of security and privacy guarantees are major adoption obstacles for both, large enterprises and governmental organizations.

Our research, PASMAC, targets the design and evaluation of protocols for secure and privacy- preserving “data analysis” in an untrusted cloud. With PASMAC, the user can store and query data in the cloud, preserving privacy and integrity of outsourced data and queries. PASMAC specifically addresses a real-world cloud framework: Google’s prominent MapReduce paradigm. PASMAC will design and prototype new protocols based on highly parallelizable, efficient privacy-preserving techniques, such as efficient private information retrieval, encrypted Bloom filters, and additive homomorphic encryption.

Projects

Research in PASMAC is supported by NSF grant 1218197.

Efficient Private Information Retrieval for MapReduce

One often overlooked problem with outsourcing data to the cloud is privacy of access patterns. There are many exisiting solutions for encrypting outsourced data, but it is usually accepted that the cloud will be able to see when, where and how often you access it. It is often difficult to quantify the information this access pattern leaks to an interested party. For instance, a cloud storing medical data may not be able to read individual patient records, but it could infer that a person has cancer by the fact that an oncologist is accessing his/her records.

Existing techniques that mitigate this problem (oblivious RAM and private information retrieval protocols) currently require too much computation or bandwidth to be usable in cloud situations. PIR techniques can have low bandwidth requirements, but must inherently perform some computation over the entire database for each query. ORAM, in contrast, has low computational cost but must occasionally transfer large portions of the database back to the client to be processed. Since outsourcing data to the cloud only makes sense if you have a lot of it, these techniques are not currently practical for the large database sizes that would occur in such a situation.

Our approach to this problem is to take advantage of the large computational resources that are available in cloud settings to make fast PIR queries possible for large sets of data. We have designed a version of PIR that can run efficiently on MapReduce and scale to a large number of compute nodes. This allows the client to decide how fast they would like the query to return and assign resources accordingly.

Privacy-Preserving Search in MapReduce

We present PRISM, a privacy-preserving scheme for word search in
cloud computing. In the face of a curious cloud provider, the main challenge is
to design a scheme that achieves privacy while preserving the efﬁciency of cloud
computing.
Solutions from related research, like encrypted keyword search or
Private Information Retrieval (PIR), fall short of meeting real-world cloud requirements and are impractical.

PRISM’s idea is to transform the problem of
word search into a set of parallel instances of PIR on small datasets. Each PIR
instance on a small dataset is efﬁciently solved by a node in the cloud during the
“Map” phase of MapReduce. Outcomes of map computations are then aggregated
during the “Reduce” phase. Due to the linearity of PRISM, the simple aggregation of map results yields the ﬁnal output of the word search operation. We have
implemented PRISM on Hadoop MapReduce and evaluated its efﬁciency using
real-world DNS logs.

PRISM’s overhead over non-private search is only 11%.
Thus, PRISM offers privacy-preserving search that meets cloud computing
efficiency requirements. Moreover, PRISM is compatible with standard MapReduce,
not requiring any change to the interface or infrastructure.

Efficient Privacy-Preserving Counting for MapReduce

In the face of an untrusted cloud infrastructure, outsourced data
needs to be protected. Fully homomorphic encryption is one solution
that also allows performing operations on outsourced data. However,
the involved high overhead of today's fully homomorphic encryption
techniques outweigh cloud cost saving advantages, rendering it
impractical. We present EPiC, a practical, efficient protocol for
the privacy-preserving evaluation of a fundamental operation on data
sets: frequency counting. In an IND-CPA encrypted outsourced data
set, a cloud user can specify a pattern, and the cloud will count
the number of occurrences of this pattern in a completely oblivious
manner. A pattern is expressed as a boolean formula on the fields
of the records and can specify values counting, range counting, and
conjunctions/disjunctions of field values.

EPiC's main idea is,
first, to reduce the problem of counting to a summation of polynomial evaluations. Second, to efficiently evaluate
the summation of polynomial evaluations in a privacy-preserving manner, we extend previous work
on the Hidden Modular Group Order assumption and design a new
somewhat homomorphic encryption scheme. We show how a
general pattern, defined by a boolean formula, is arithmetized
into a multivariate polynomial over GF(2) and used in EPiC. This scheme is highly
efficient in our particular counting scenario.

Besides a formal analysis where
we prove EPiC's privacy, we also present implementation and
evaluation results. We specifically target Google's prominent
MapReduce paradigm as offered by major cloud providers. Our
evaluation performed both locally and in Amazon's public cloud with
data sets sizes of up to 1 TByte shows only modest overhead compared
to non-private counting, attesting to EPiC's efficiency.

Efficient Private File Retrieval by Combining ORAM and PIR

Recent research results on tree-based Oblivious RAM by Shi et al. [14] obtain communication complexity of O(l · log3(N)) in the worst-case for an N-capacity storage with blocks size l. The individual nodes in the tree, however, are constructed using traditional ORAMs which have worst-case communication complexity linear in their capacity and block size. PIR protocols are able to provide better worst-case bounds (decoupling capacity from block size), but have traditionally been less practical than ORAM due to the fact that they require O(N) computational complexity on the server. This paper presents Path-PIR, a hybrid ORAM construction, using techniques from PIR, that overcomes the individual weaknesses of each. Path-PIR significantly reduces communication complexity when the block size of the ORAM is large. Compared to existing work, this leads to smaller data transfer costs by orders of magnitude for practical sized databases and achieves worst-case communication complexity of O(l · log2 (N)) for large block sizes. Additionally, the typically high computational cost of PIR is negated by the tree structure of the ORAM, which requires only a small fraction of the database to be operated on for each query. We also investigate the concept of an ORAM’s latency, which is the amount of communication required before users receive the result of their query. We show that Path-PIR achieves lower latency than any existing scheme, only about four times the block size. Using Amazon EC2 as an example, we demonstrate that even with the additional cost of PIR computation, Path-PIR provides a significant monetary saving compared to related work.

We present RASP, a new protocol for privacy-preserving range
search and sort queries on encrypted data in the face of an
untrusted data store. The contribution of RASP over related work
is twofold: first, RASP improves privacy guarantees by ensuring
that after a query for range [a,b] any new record added to the
data store is indistinguishable from random, even if the new
record falls within range [a,b]. Second, RASP is highly
practical, abstaining from expensive asymmetric cryptography and
bilinear pairings. Instead, RASP only relies on hash and block
cipher operations. The main idea of RASP is to build upon a new
update-oblivious bucket-based data structure. We allow for data
to be added to buckets without leaking into which bucket it has
been added. As long as a bucket is not explicitly queried, the
data store does not learn anything about bucket
contents. Furthermore, no information is leaked about data
additions following a query. Besides formally proving RASP's
privacy, we also present a practical evaluation of RASP using
Amazon Dynamo.