An Efficient Privacy-Preserving Ranked Keyword Search Method

Cloud data owners prefer to outsource documents in encrypted form to preserve privacy, so it is essential to develop efficient and reliable ciphertext search techniques. One challenge is that the relationships between documents are normally concealed by encryption, which significantly degrades search accuracy. In addition, the volume of data in data centers has grown dramatically, which makes it even more challenging to design ciphertext search schemes that provide efficient and reliable online information retrieval over large volumes of encrypted data. In this paper, a hierarchical clustering method is proposed to support more search semantics and to meet the demand for fast ciphertext search in a big data environment. The proposed hierarchical approach clusters the documents based on a minimum relevance threshold, and then partitions the resulting clusters into sub-clusters until the constraint on the maximum cluster size is met. In the search phase, this approach achieves linear computational complexity even as the document collection grows exponentially. To verify the authenticity of search results, a structure called the minimum hash sub-tree is designed. Experiments have been conducted on a collection built from IEEE Xplore. The results show that, as the number of documents in the dataset increases sharply, the search time of the proposed method increases linearly, whereas the search time of the traditional method increases exponentially. Furthermore, the proposed method has an advantage over the traditional method in rank privacy and in the relevance of retrieved documents.
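The two-stage clustering described above can be illustrated on plaintext term vectors. The following is only a minimal sketch under stated assumptions, not the construction used in the paper: relevance is taken to be the dot product of normalized vectors, the first stage is a greedy assignment to the first sufficiently relevant cluster, and the second stage bisects oversized clusters around two seed documents. The names cluster_by_threshold, split_until_small, min_relevance and max_size are hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else list(v)

def relevance(a, b):
    """Assumed relevance score: dot product of two normalized vectors."""
    return sum(x * y for x, y in zip(a, b))

def centroid(vectors):
    dim = len(vectors[0])
    return normalize([sum(v[i] for v in vectors) / len(vectors) for i in range(dim)])

def cluster_by_threshold(vectors, min_relevance):
    """Stage 1 (assumed greedy rule): a document joins the first cluster whose
    centroid is at least min_relevance relevant to it, else starts a new one."""
    clusters = []  # each cluster is a list of document indices
    for idx, v in enumerate(vectors):
        for members in clusters:
            if relevance(v, centroid([vectors[i] for i in members])) >= min_relevance:
                members.append(idx)
                break
        else:
            clusters.append([idx])
    return clusters

def split_until_small(members, vectors, max_size):
    """Stage 2 (assumed bisecting rule): split until each piece has <= max_size docs."""
    if len(members) <= max_size:
        return [members]
    a = members[0]
    # Second seed: the member least relevant to the first seed.
    b = min(members[1:], key=lambda i: relevance(vectors[a], vectors[i]))
    left = [i for i in members
            if relevance(vectors[i], vectors[a]) >= relevance(vectors[i], vectors[b])]
    left_set = set(left)
    right = [i for i in members if i not in left_set]
    if not right:  # all documents look identical: fall back to halving
        half = len(members) // 2
        left, right = members[:half], members[half:]
    return (split_until_small(left, vectors, max_size)
            + split_until_small(right, vectors, max_size))

# Toy two-term collection; the values are illustrative only.
docs = [normalize(v) for v in [[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9], [1, 0.05]]]
top_level = cluster_by_threshold(docs, min_relevance=0.8)
result = [piece for c in top_level for piece in split_until_small(c, docs, max_size=2)]
print(top_level)
print(result)
```

In this toy run the threshold groups the documents into two topical clusters, and the size constraint then forces the larger one to be split. In the scheme itself the corresponding computations would be carried out over the encrypted index rather than over plaintext vectors.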

Privacy requirements. We adopt the following privacy requirements, which are the ones current research mostly focuses on.

1) Data privacy. Data privacy refers to the confidentiality and privacy of documents. If data privacy is guaranteed, the adversary cannot obtain the plaintext of documents stored on the cloud server. Symmetric cryptography is a conventional way to achieve data privacy.

2) Index privacy. Index privacy means the ability to frustrate the adversary's attempts to steal the information stored in the index. Such information includes keywords, the TF (Term Frequency) of keywords in documents, the topics of documents, and so on.

3) Keyword privacy. It is important to protect users' query keywords. A secure query generation algorithm should output trapdoors that leak no information about the query keywords.

4) Trapdoor unlinkability. Trapdoor unlinkability means that each trapdoor generated from a query is different, even for the same query. It can be realized by integrating a random function into the trapdoor generation process. If the adversary can identify a set of trapdoors that all correspond to the same keyword, he or she can calculate the frequency of this keyword in search requests over a certain period. Combined with the document frequency of keywords in a known background model, the adversary can then mount a statistical attack to identify the plaintext keyword behind these trapdoors.

5) Rank privacy. The rank order of search results should be well protected. If the rank order remains unchanged, the adversary can compare the rank orders of different search results and thereby identify the search keyword.
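The frequency attack mentioned under trapdoor unlinkability (requirement 4) can be made concrete with a small sketch. It is an illustration under stated assumptions, not the trapdoor construction of the scheme: a deterministic trapdoor is modeled as an HMAC of the keyword, and a randomized one mixes a fresh nonce into the HMAC input. The names KEY, deterministic_trapdoor and randomized_trapdoor are hypothetical, and the sketch only shows what an observer can link, not how the server would evaluate a randomized trapdoor.

```python
import hashlib
import hmac
import secrets
from collections import Counter

# Hypothetical secret key held by authorized users, never by the server.
KEY = secrets.token_bytes(32)

def deterministic_trapdoor(keyword):
    """Same keyword -> same trapdoor, so repeated queries are linkable."""
    return hmac.new(KEY, keyword.encode(), hashlib.sha256).digest()

def randomized_trapdoor(keyword):
    """A fresh random nonce makes every trapdoor different, even for one keyword."""
    nonce = secrets.token_bytes(16)
    tag = hmac.new(KEY, nonce + keyword.encode(), hashlib.sha256).digest()
    return nonce, tag

# Query log over some period, as issued by users (hidden from the adversary).
queries = ["cloud"] * 5 + ["privacy"] * 2 + ["index"]

# What the adversary sees: only trapdoors, which can be counted by exact match.
observed_det = Counter(deterministic_trapdoor(q) for q in queries)
observed_rand = Counter(randomized_trapdoor(q) for q in queries)

det_freqs = sorted(observed_det.values(), reverse=True)
rand_freqs = sorted(observed_rand.values(), reverse=True)

print(det_freqs)   # per-keyword query frequencies leak through
print(rand_freqs)  # every trapdoor appears once; no frequency profile
```

With deterministic trapdoors the observed counts reproduce the keyword frequency distribution, which is exactly what the adversary would match against a background model. With randomized trapdoors each observation is unique, so exact-match counting yields nothing.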