This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TDSC.2016.2593444, IEEE Transactions on Dependable and Secure Computing (Volume: 15, Issue: 3, May-June 2018)
Searchable Encryption over Feature-Rich Data

Qian Wang, Member, IEEE, Meiqi He, Minxin Du, Sherman S. M. Chow, Member, IEEE, Russell W. F. Lai, and Qin Zou, Member, IEEE

Abstract—Storage services allow data owners to store their huge amounts of potentially sensitive data, such as audio, images, and videos, on remote cloud servers in encrypted form. To enable retrieval of encrypted files of interest, many searchable symmetric encryption (SSE) schemes have been proposed. However, most existing SSE solutions construct indexes based on keyword-file pairs and focus on boolean expressions of exact keyword matches. Moreover, most dynamic SSE solutions cannot achieve forward privacy and reveal unnecessary information when updating the encrypted databases. We tackle the challenge of supporting large-scale similarity search over encrypted feature-rich multimedia data by considering the search criterion as a high-dimensional feature vector instead of a keyword. Our solutions are built on carefully designed fuzzy Bloom filters which utilize locality sensitive hashing (LSH) to encode an index associating the file identifiers and feature vectors. Our schemes are proven to be secure against adaptively chosen query attacks and forward private in the standard model. We have evaluated the performance of our schemes on various real-world high-dimensional datasets, and achieved a search quality of 99% recall with only a small number of hash tables for LSH. This shows that our index is compact and searching is not only efficient but also accurate.

Index Terms—Cloud storage, searchable encryption, homomorphic encryption, similarity search, proximity search
1 Introduction

Storage services have motivated data users, including both individuals and enterprises, to outsource their (potentially huge amounts of) data to remote servers (e.g., the cloud) to save expensive local storage and management costs. However, outsourced data is no longer under the direct physical control of the data owner. Sensitive data, such as personal files, commercial secrets, and healthcare records, should be encrypted locally before outsourcing. Data encryption, however, renders data utilization a challenging task. In particular, it becomes difficult to retrieve data of interest based on their content, as in plaintext search. To enable users to efficiently retrieve encrypted data of interest from a remote storage server, the notion of searchable symmetric encryption (SSE) was proposed [1], and the design of index-based SSE schemes has received much attention recently. The secret key in SSE can generate search tokens to perform search queries over encrypted data. Different formal security notions have been proposed (e.g., see [2]). By sacrificing the privacy of the access pattern and/or search pattern, most SSE schemes achieve highly efficient searches while maintaining privacy guarantees from a practical perspective. Most existing SSE solutions [2], [3], [4], [5] construct keyword-based search indexes (e.g., linked lists based on keyword-file pairs), supporting efficient exact searches but not similarity searches. Moreover, most dynamic schemes cannot guarantee forward privacy, which requires that the database server cannot learn whether a newly added data file contains a keyword w that has been searched in the past.
• Q. Wang, M. Du, and Q. Zou are with the State Key Lab of Software Engineering, School of Computer Science, Wuhan University, China. E-mail: {qianwang,duminxin,qzou}@whu.edu.cn
• M. He is with the Department of Computer Science, The University of Hong Kong, Hong Kong. E-mail: [email protected]
• S. Chow and R. Lai are with the Department of Information Engineering, The Chinese University of Hong Kong, Hong Kong. E-mail: {sherman,wflai}@ie.cuhk.edu.hk
1.1 Related Work
The first full-text SSE scheme was proposed by Song et al. [1], with search time linear in the length of the file collection. Curtmola et al. [3] generalized the security definitions of SSE and proposed a construction based on the inverted index. Its search time is linear in the number of files that contain the keyword w, which is considered optimal. Kamara et al. [2] extended it to dynamic data, supporting both addition and removal of files. Kamara et al. [6] proposed a parallelizable and dynamic SSE scheme using red-black trees. The above dynamic SSE schemes cannot achieve forward privacy, which means a search token allows linking to files added in the future. The scheme of Stefanov et al. [7] is forward private. Although its theoretical cost is sub-linear, the actual search time will be large when the number of document-keyword pairs is large. Solutions [4], [8], [9], [10] supporting boolean expressions on keywords have also been proposed. Performing conjunctive queries over encrypted data was first considered by Golle et al. [8]. Their approach works by testing each encoded file against a set of tokens, so the complexity grows linearly with the number of files. This is also a general problem of public-key schemes [9]. To support generic boolean search over encrypted data, Moataz and Shikfa [10] proposed a new SSE scheme which is also based on vectors (but not for similarity search). By transforming keywords into vectors and using the Gram-Schmidt process to orthonormalize them, their scheme leverages the inner product of vectors to execute the encrypted search. The work of Cash et al. [4] first retrieves the results for one of the query keywords, and then filters the results according to the given boolean query against each of the remaining keywords. The above solutions only support keyword-based exact searches and do not allow similarity search over encrypted feature-rich data. As a generalization of SSE, structured encryption was
1545-5971 (c) 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
proposed by Chase and Kamara [11], which supports matrices and graphs as the underlying data structures. This was further generalized by Lai and Chow [12] to support any abstract data type in the form of query-response pairs. However, utilizing their scheme in our setting means one needs to pre-compute all possible queries, which does not satisfy our goal of supporting unforeseen queries.

1.2 Similarity Search
Similarity search has been widely used in the information retrieval community for content-based search of feature-rich data (e.g., image, audio, and video files), and for other queries such as finding close/similar points in certain metric spaces [13]. The feature-rich data objects are usually represented as high-dimensional vectors in a metric space. Distances between objects are measured by Euclidean distance or other distance metrics. In contrast to exact search, similarity search can relax constraints on users' requests to retrieve a list of desirable results based on likely relevance. Similarity search also tolerates uncertain and inaccurate search inputs, thus improving usability and the search experience in general. With the increasing popularity of sharing multimedia data in social networks, the demand for similarity search over encrypted multimedia data is also increasing. For example, users of social networks may want to identify and view multimedia files shared by their friends who enforce access control. Yet, similarity search has received attention mostly in the plaintext domain [14], [15]. Below we discuss three recent approaches to establishing similarity indexes for privacy-preserving content-based search. Earlier approaches can be found in the references therein. Kuzu et al. [16] investigated the use of a locality sensitive hashing (LSH) family for supporting similarity search based on Jaccard distance, and used computationally expensive methods to encrypt the index. The main problem of their scheme stems from the embedding technique for transforming the metric space on edit distance to the Euclidean space. It is proven that existing embedding approaches cannot provide sufficient distance preservation after the space transformation, and thus lead to intolerable search errors [17]. Another limitation is the large number of hash tables required to cover the nearest objects.
When this space requirement exceeds the main memory size, looking up the hash tables may cause substantial delay due to disk I/O access [18]. Moreover, their work lacks a rigorous performance analysis, i.e., there is no theoretical characterization of the search quality. Yuan and Yu [19] proposed a privacy-preserving biometric identification scheme for large-scale encrypted databases. The search complexity, however, is approximately O(mn^2) for the exact search, where m denotes the dataset size and n is the vector dimension. The security of their scheme relies on the heuristic argument that security comes from the "encryption" of feature vectors by randomly generated matrices. It is possible that the introduced randomness can be removed in the presence of collusion. In addition, Moataz et al. [20] made the first attempt at performing similarity search over text keywords by combining searchable encryption with a stemming algorithm to achieve fuzzy search over encrypted data. Applying their technique to our setting
means that one needs to tag the data with many possible keywords, which may not be practical. Based on similar approaches, Wang et al. [21] proposed a multi-keyword fuzzy search scheme over encrypted data using LSH and secure k-nearest-neighbor (kNN) techniques [22]. Still, this scheme lacks a formal and thorough analysis under a well-defined and rigorous security model, as previously pointed out [22]. In contrast, most existing SSE schemes from the cryptography community have their security rigorously analyzed [1], [2].
1.3 Our Contributions
This paper tackles the problem of constructing a secure similarity index for efficient approximate membership search over large-scale encrypted feature-rich data. The key idea is to transform the data into a set of feature vectors, which are further mapped by LSHs to array positions. We call each such entry a bucket, which points to the possible matching files. Since an LSH maps similar inputs to the same output, together with our special treatment of propagation (to be explained), it creates a cluster of buckets which contains files approximately close to the query object with high probability. To represent the pointers to the files, we store inverted file identifier vectors (IFVs) in each of these buckets. An IFV is a vector representing the files that fall into a given bucket for one hash function (among the set of hash functions of the LSH). As a trade-off between computational efficiency and bandwidth, we have two options for encrypting the resulting IFVs: additively homomorphic encryption (for saving bandwidth) or pseudo-random functions/permutations (for more efficient computation). The resulting schemes enable the client to perform privacy-preserving similarity search by checking only a small number of hash tables (or buckets). The response set consists of approximate matches which can be readily used for privacy-preserving kNN or approximate nearest neighbor (ANN) search. Another important and desirable property of our index constructions is that they are highly compact and efficient, and they can also support secure dynamic updates of files and index with minimal operations. Our main contributions are summarized as follows. 1) We study the problem of similarity search over encrypted feature-rich data outsourced to remote servers. We characterize the unique security requirements and formally define the security notions for encrypted similarity search.
We then present the design of our encrypted indexing schemes with two instantiations supporting privacy-preserving high-dimensional similarity search and data dynamics. 2) We present a thorough theoretical analysis and characterize the false positive and false negative rates of our constructions. We prove the semantic security of our schemes under the adaptively chosen query attack (CQA2) model with minimal leakage, using simulation-based definitions and proofs. We also show that they achieve forward privacy. 3) We evaluate our constructions against three real-world datasets of different sizes and qualities, showing that they are both time- and space-efficient and provide high search quality in terms of recall and error ratio.
2 Problem Statement and Preliminaries

2.1 Similarity Search over Encrypted Feature-rich Data
We consider a data owner, also known as the client, with some large-scale and sensitive multimedia data to be outsourced to a cloud-based storage system. The client encrypts the data locally before outsourcing. The collection of multimedia data is denoted by f = (f1, . . . , f#f), and its corresponding encrypted form is denoted by c = (c1, . . . , c#c). To enable similarity search over c for effective information retrieval, the data owner builds an encrypted index I based on f and encrypts f to obtain c, then outsources the encrypted data γ = (I, c) to the remote cloud server. Later, the client issues a query in the form of a search object q. This object q will be represented by a feature vector throughout the rest of the paper, which is a multi-dimensional vector of numerical features representing some object (e.g., some selected key scene of a video file, color histograms for image data [23], etc.). The query may be merely near some of the files without exactly matching any of them. To this end, the client generates a search token τq for q and sends it to the cloud server. After receiving τq, the cloud uses it to query I and returns all the encrypted files cq whose corresponding plaintexts have characteristics similar to the search object q. The system is dynamic, which means the client may add or remove multiple files u = f′ to or from the server at any time. To do so, the client generates an update token τu. The server can then use τu to securely update c and I. Formally, the core functionalities are defined below.

Definition 1. An encrypted (multimedia) data storage system supporting similarity search and dynamic update consists of the following polynomial-time algorithms/protocols:
sk ← Gen(1λ): a probabilistic key generation algorithm run by the client. It takes as input a security parameter λ and outputs the secret key sk.

γ ← Enc(sk, f): a probabilistic algorithm run by the client. It takes as input a secret key sk and a (multimedia) data collection f, and outputs an encrypted database γ. An encrypted database is defined by the tuple (c, I), where c is a vector of ciphertexts, each encrypting the corresponding plaintext in f, and I is an encrypted index, which supports querying the ciphertexts in c via the Search protocol. Conceptually, this algorithm consists of two disjoint operations: index generation and encryption of the plaintexts.

f ← Dec(sk, c): a deterministic algorithm run by the client. It takes as input a secret key sk and a ciphertext c, and outputs a file f.

(cq; ⊥) ← Search(sk, q; γ): a (possibly interactive and probabilistic) protocol run between the client and the server¹. The client side takes as input a secret key sk and a search object q, while the server side takes as input the encrypted database γ. Upon completion of the protocol, the client obtains a sequence of ciphertexts cq while the server has no local output.

(⊥; γ′) ← Update(sk, u; γ): a (possibly interactive and probabilistic) protocol run between the client and the server. The client side takes as input a secret key sk and an update object u, while the server side takes as input the encrypted database γ. Upon completion of the protocol, the client obtains nothing, while the server obtains an updated encrypted database γ′.

¹ A protocol P run between the client and the server is denoted by (u; v) ← P(x; y), where x and y are the respective inputs, and u and v are the respective outputs, of the client and the server.

2.2 Security Definitions
Similarity search over encrypted feature-rich data should provide the following security guarantees. First, an adversary cannot generate search tokens for feature vectors without the secret key. Second, the search tokens generated for the similarity indexes do not reveal any information about the original search objects (i.e., feature vectors of files of interest) beyond what is implied by the search result. Third, before and after similarity searches, the server learns only what is allowed to be leaked by the data owner, i.e., the search pattern and access pattern. We follow the widely accepted framework to analyze the security of SSE [2], [3]. The security of an SSE scheme can be characterized via the notions of history, search pattern, and access pattern [3]. Below we give their definitions with respect to the SSE schemes we are going to propose.

Definition 2. (History Q). An s-query history over f is a vector of s queries Q = {q1, . . . , qs}.

Definition 3. (Search Pattern π). Given a search object q at time t, the search pattern π(f, q) is defined as a binary vector of length t with a '1' at location i if the search at time i ≤ t was for q, and '0' otherwise. The search pattern reveals whether the same search was performed in the past or not.

Definition 4. (Access Pattern Ap). Given a search object q at time t, the access pattern is defined by Ap(f, q) = {id(cq)}.

Now we come to the main definition asserting the security of an SSE scheme. Our definition follows the simulation-based framework [3], parameterized by the leakage required for the ideal-world simulation [11].

Definition 5. (CQA2-security for SSE). Let L1, L2, and L3 be leakage functions for setup, search, and update respectively, which serve as the parameters for the definition [11].
A searchable encryption scheme SSE is said to be secure against adaptively chosen query attack (CQA2) if, for any PPT stateful adversary A, there exists a PPT stateful simulator S such that for the following probabilistic experiments Real_A^SSE(λ) and Ideal_{A,S}^SSE(λ), we have

|Pr[Real_A^SSE(λ) = 1] − Pr[Ideal_{A,S}^SSE(λ) = 1]| ≤ negl(λ).

Real_A^SSE(λ): The challenger runs Gen(1λ) to generate the secret key sk. A outputs f and receives γ ← Enc(sk, f) from the challenger. A makes a polynomial number of adaptive queries q or updates u. For each query q, the challenger acts as a client and runs Search with A acting as a server. For each update u, the challenger acts as a client and runs Update with A acting as a server. Finally, A returns a bit b, which is the output of the experiment.

Ideal_{A,S}^SSE(λ): A outputs f and sends it to S. Given L1(f), S generates and sends γ to A. A makes a polynomial number
of adaptive queries q or updates u. For each query q, S is given L2(f, q) and simulates a client who runs Search with A acting as an honest server. For each update u, S is given L3(f, u) and simulates a client who runs Update with A acting as an honest server. Finally, A returns a bit b, which is the output of the experiment.

The bit b output by both the real and ideal experiments, which originates from the adversary A, can be seen as the decision of A in distinguishing whether it is in the real world or the ideal world. The intuition is that, since the leakages alone suffice to simulate the ideal world, any scheme satisfying this definition lets a computationally bounded adversary learn only the leakage, and no other information, in the real world. The above definition ensures adaptive security even when the query objects are chosen as a function of the encrypted index and the results of previous queries. The leakage functions L1, L2, and L3 will be defined for each specific scheme.

We proceed to define forward privacy. Although some existing schemes (e.g., [7]) are forward private, to the best of our knowledge, a formal definition of the notion has never been explicitly given. We therefore give our own definition below. Informally, forward privacy means that, for any search object q, the knowledge the server learned from previous executions of the search protocol on q does not help the server decide whether a newly added ciphertext c relates to q or not. Our definition requires that any two "add" updates of the same size are indistinguishable.

Definition 6. (Forward Privacy). A searchable encryption scheme SSE is said to be forward private if, for any PPT stateful adversary A, we have
Pr[FwdPrv_A^SSE(λ) = 1] ≤ 1/2 + negl(λ).

FwdPrv_A^SSE(λ): The challenger runs Gen(1λ) to generate the key sk. A outputs f and receives γ ← Enc(sk, f) from the challenger. A makes a polynomial number of adaptive queries q or updates u. For each query q, the challenger acts as a client and runs Search with A acting as a server. For each update u, the challenger acts as a client and runs Update with A acting as a server. Eventually, A outputs two "add" updates ui = ("Add", fui) such that |f0| = |f1| for any fi ∈ fui, i = 0, 1. The challenger picks a random bit b and runs Update with update object ub. Finally, A returns a bit b′ and the experiment returns |b − b′|.

2.3 Preliminaries
Our constructions use the following basic primitives.

Locality sensitive hashing (LSH).

Definition 7. [24] An LSH family H = {hj : {0,1}^d → {0,1}^t | j = 1, . . . , M} is called (R, cR, P1, P2)-sensitive for a distance function ||·|| if, for any p, q ∈ {0,1}^d and any j ∈ {1, . . . , M}:
• if ||q − p|| ≤ R, then Pr[hj(q) = hj(p)] ≥ P1;
• if ||q − p|| ≥ cR, then Pr[hj(q) = hj(p)] ≤ P2.
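To make the two conditions concrete, here is a small empirical check using the Euclidean (2-stable) hash family h(q) = ⌊(a·q + b)/W⌋ that this section adopts; the dimension, W, trial count, and distances below are arbitrary illustrative choices, not parameters from the paper:

```python
import numpy as np

def make_lsh(d, W, rng):
    """One hash from the Euclidean (2-stable) LSH family:
    h(q) = floor((a.q + b) / W), with a ~ N(0,1)^d and b ~ U[0, W]."""
    a = rng.standard_normal(d)
    b = rng.uniform(0, W)
    return lambda q: int(np.floor((a @ q + b) / W))

rng = np.random.default_rng(0)
d, W, trials = 16, 4.0, 2000
near = far = 0
for _ in range(trials):
    h = make_lsh(d, W, rng)
    p = rng.standard_normal(d)
    q_near = p + 0.1 * rng.standard_normal(d)   # ||q - p|| small (about 0.4)
    q_far = p + 10.0 * rng.standard_normal(d)   # ||q - p|| large (about 40)
    near += h(p) == h(q_near)
    far += h(p) == h(q_far)
print(near / trials, far / trials)  # close pairs collide far more often
```

The near-pair collision rate comes out well above the far-pair rate, which is exactly the P1 > P2 gap the definition asks for.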
To use LSH for similarity search, we should at least ensure that c > 1 and P1 > P2 . To reduce the false positive rate,
we need to enlarge the gap between the two probabilities, i.e., make P1 ≫ P2. In practice, R can be pre-set by users in the construction of similarity indexes to meet the requirements of different applications. In Section 5, we carefully choose R based on the characteristics of each dataset.

In this paper, we choose the l2 norm, i.e., the Euclidean norm ||x||_2 = (Σ_{i=1}^d x_i^2)^{1/2}, for LSH using the 2-stable distribution in our construction. Specifically, each hash function is defined by

h_{a,b}(q) = ⌊(a · q + b) / W⌋,

where a is a d-dimensional vector with entries chosen independently from a 2-stable distribution, namely the normal distribution g(x) = (1/√(2π)) e^{−x²/2}, and b is chosen uniformly from [0, W]. Note that W should not be too large, otherwise P1 and P2 will both go to 1. In practice, W should be chosen (not too small) to enlarge the gap between P1 and P2 given a fixed c. More details can be found in the literature [24]. We remark that the LSH family aims at mapping similar objects into the same or neighboring buckets. It is not meant to be a cryptographic hash family, which would aim at, for instance, collision resistance.

Pseudo-random function (PRF) and permutation (PRP). Let G : {0,1}^λ × {0,1}^poly(λ) → {0,1}^poly(λ) be a PRF, i.e., a polynomial-time computable function that cannot be distinguished from a random function by any polynomial-time adversary [25], and let F be a PRP, i.e., a PRF that is also bijective. Looking ahead, in our construction, for a given key k, the range of Fk(·) will be as large as the encrypted index; and Gk(·) takes as input a bit-string which can be as long as twice the output length of Fk(·), and outputs a bit-string of length #f.

Homomorphic encryption. Homomorphic encryption allows certain computations to be carried out on ciphertexts to generate an encrypted result which, when decrypted, matches the result of the same operations performed on the plaintexts.
Although homomorphic encryption for general operations is prohibitively slow, homomorphic encryption for only summation is efficient; an example is the Paillier cryptosystem [26], which is additively homomorphic. It is secure against chosen plaintext attack (CPA) assuming the decisional composite residuosity problem is hard [26]. In the Paillier cryptosystem, the public (encryption) key is pk_p = (n = pq, g), where g ∈ Z*_{n²}, and p and q are two large prime numbers (of equivalent length) chosen randomly and independently. The private (decryption) key is sk_p = (φ(n), φ(n)^{−1} mod n). Given a message a, we write the encryption of a as ⟦a⟧_{pk_p}, or simply ⟦a⟧. The encryption of a message x ∈ Z_n is ⟦x⟧ = g^x · r^n mod n², for some random r ∈ Z*_n. The decryption of the ciphertext is x = L(⟦x⟧^{φ(n)} mod n²) · φ(n)^{−1} mod n, where L(u) = (u − 1)/n. The homomorphic property of the Paillier cryptosystem is given by ⟦x1⟧ · ⟦x2⟧ = (g^{x1} · r1^n) · (g^{x2} · r2^n) = g^{x1+x2} · (r1 r2)^n mod n² = ⟦x1 + x2⟧.

2.4 Notations

Standard bitwise Boolean operations are defined on binary vectors: bitwise OR (union) "∨" and bitwise AND (intersec-
tion) "∧", which take the same notations as single-bit OR and AND operations, respectively. Feature-rich data objects are typically represented as high-dimensional vectors of numerical features. We use pi to denote the d-dimensional feature vector extracted from data file i. When representing images, the feature values might correspond to the pixels of an image; when representing text, they might be term occurrence frequencies. [n] denotes the set of positive integers less than or equal to n, i.e., [n] = {1, 2, . . . , n}.

2.5 Inverted File Vectors

Given a vector v, we refer to its i-th element as v[i] or vi. We use ei to denote a unit basis vector, i.e., a bit-vector of length #f whose i-th entry is 1 and whose other entries are 0. We use fidj ∈ {0,1}^{#f} to denote the inverted file identifier vector (IFV) placed at the j-th bucket. In fidj, the i-th entry of the IFV, i.e., fidj[i], is 1 if ei has been mapped into the j-th bucket, and 0 otherwise. We use fid̂ to denote a joint IFV computed from the "union" of multiple IFVs, to be explained in the next section. The IFV is a cornerstone of our constructions.
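As a minimal illustration (the file count and bucket assignments below are made up), IFVs and their unions can be handled as Python integers used as bit masks:

```python
NUM_FILES = 6  # #f; files f1..f6 (toy example)

def e(i):
    """Unit basis vector e_i as a bit mask: bit i-1 set (files are 1-indexed)."""
    return 1 << (i - 1)

# Suppose files f1 and f4 hash to bucket 3, and f6 to bucket 4.
buckets = {3: e(1) | e(4), 4: e(6)}

# Joint IFV for bucket 3: union of bucket 3 and its two neighbors.
joint = buckets.get(2, 0) | buckets.get(3, 0) | buckets.get(4, 0)

# Decode which files the joint IFV points to.
hits = [i for i in range(1, NUM_FILES + 1) if joint >> (i - 1) & 1]
print(hits)  # -> [1, 4, 6]
```

Probing the joint IFV thus answers "which files landed in this bucket or its neighbors" in a single lookup.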
3 Our Constructions

This section presents our index data structure for similarity search over feature-rich encrypted multimedia data.

3.1 Overview

Our key idea is an encoding approach based on inverted file identifier vectors (IFVs), bit-vectors representing a set of files, which allow identification of the files with features most similar to the query feature vector. We map high-dimensional points onto a line by using an LSH as a basis to put similar objects into the same or adjacent buckets. We aim at minimizing the space overhead. At the same time, we also aim at maintaining an acceptable level of collision probability of nearby objects and increasing the chance of finding similar data objects close to the query. To do so, we generate a joint IFV fid̂ by calculating the "union" of each bucket and its two neighboring buckets; probing fid̂ is thus equivalent to probing three adjacent buckets in one shot. For security, the buckets will be permuted through a pseudo-random permutation, as will be shown in our construction.

We propose two schemes based on the above ideas, encrypting fid̂ either with the Paillier homomorphic cryptosystem (with ciphertext packing) or by XOR-ing bit-vectors with the outputs of a PRF. Intuitively, Scheme-I has a lower communication cost, where the search result consists of only one Paillier ciphertext aggregating M encrypted IFVs. Scheme-II is more computationally efficient due to the use of symmetric encryption, but it requires the server to return M encrypted IFVs. More details can be found in the experimental results in Table 5.

3.2 Basic Construction without Index Privacy

We first present our basic index construction, which does not protect the privacy of IFVs, as follows.

3.2.1 Initialization and Index Construction

Gen(1λ): Given a security parameter λ, choose the following uniformly at random from the corresponding domains:
• a PRP key k1 ← {0,1}^λ for Fk1(·),
• a symmetric key k2 ← SKE.Gen(1λ) for a CPA-secure symmetric encryption scheme SKE,
• functions (h1, h2, . . . , hM) from the LSH family H.
The M hash functions will be used to generate M hash tables, each consisting of m buckets. The output of this algorithm is sk = (k1, k2, m, h1, h2, . . . , hM).

Enc(sk, f):
1) [Initialize a temporary array of buckets] Let x1, . . . , xMm be the positions of all buckets. For each xj, store 0 (a bit-vector of length #f having all 0 bits) in I[xj].
2) [Prepare the feature-to-file index according to the buckets given by LSH] For each file fi ∈ f,
(a) Obtain its corresponding d-dimensional feature vector pi using the feature extraction algorithm for the specific data type (e.g., color histograms for image data [23]).
(b) Compute the M hash bucket positions of the LSH (x1 = g1(pi), . . . , xM = gM(pi)), where gj(pi) = hj(pi) mod m + m(j − 1) for j ∈ [M].
(c) For each bucket position xj, update the bucket by storing the bitwise OR of the unit basis vector ei and the IFV previously stored in the bucket, denoted by I[xj] ∨ ei.
3) [Propagate the IFVs to the neighbors] For each bucket xj, let fidj be the IFV stored in I[xj]. Compute the "union" fid−j ∨ fidj ∨ fid+j as a joint IFV fid̂j, where fid−j and fid+j denote the IFVs stored in the neighboring left and right buckets, respectively. Store fid̂j in I[xj].
4) [Permute the buckets] Permute the array I according to the mapping from xj to the randomized position Fk1(xj) for each position xj.
5) For 1 ≤ i ≤ #f, let ci ← SKE.Enck2(fi), where SKE can be any secure encryption scheme.
6) Output γ = (I, c), where c = (c1, . . . , c#f).
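The Enc index-building steps can be sketched as follows. The toy LSH functions and the random permutation standing in for the PRP Fk1 are illustrative stand-ins, not the paper's concrete instantiation, and the clamping at array edges is our simplification (the paper leaves boundary handling implicit):

```python
import numpy as np

def build_index(vectors, lsh, m, perm):
    """Steps 1-4 of Enc: bucket IFVs (integers as bit masks), neighbor
    propagation, and bucket permutation. perm[x] stands in for F_{k1}(x)."""
    M = len(lsh)
    I = [0] * (M * m)                              # Step 1: all-zero IFVs
    for i, p in enumerate(vectors):                # Step 2: OR e_i into hit buckets
        for j in range(M):
            x = lsh[j](p) % m + m * j              # g_j(p) = h_j(p) mod m + m(j-1)
            I[x] |= 1 << i
    joint = [I[max(x - 1, 0)] | I[x] | I[min(x + 1, len(I) - 1)]
             for x in range(len(I))]               # Step 3: union with neighbors
    out = [0] * len(I)                             # Step 4: bucket x moves to perm[x]
    for x in range(len(I)):
        out[perm[x]] = joint[x]
    return out

rng = np.random.default_rng(1)
d, m, M, W = 8, 10, 3, 4.0
a, b = rng.standard_normal((M, d)), rng.uniform(0, W, M)
lsh = [lambda p, j=j: int(np.floor((a[j] @ p + b[j]) / W)) for j in range(M)]
perm = list(rng.permutation(M * m))                # stand-in for the PRP F_{k1}
vectors = [rng.standard_normal(d) for _ in range(4)]
I = build_index(vectors, lsh, m, perm)
```

By construction, looking up bucket perm[g_j(p_i)] in the permuted index always finds file i's bit set, which is exactly what the Search protocol exploits.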
Dec(sk, c): Return f ← SKE.Deck2(c).

3.2.2 Search Over the Similarity Index
The client searches by first generating a search token τq for the search object q. The server then uses the token to locate the buckets in the similarity index I and extract the corresponding joint IFVs. Finally, the server decodes the IDs of the files that are approximate members of the query q, and returns the corresponding ciphertexts to the client.
Search(sk, q; I):
Client Side: Extract the d-dimensional feature vector q based on the interest of the client. Compute (x1 = g1(q), ..., xM = gM(q)) and the search token τq = (Fk1(x1), ..., Fk1(xM)). Send the search token τq to the server.
Server Side: Let y1 = Fk1(x1), ..., yM = Fk1(xM). Extract the corresponding joint IFVs fid̂i from I[yi] for each i ∈ [M]. Compute d = fid̂1 ∧ ... ∧ fid̂M. Return cq = {ci ∈ c : d[i] = 1}.

1545-5971 (c) 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
3.3 Scheme-I: Using Homomorphic Encryption
The above basic construction supports efficient similarity search, but without privacy. Based on it, we propose Scheme-I, in which each fid̂ is encrypted using Paillier encryption so that the server can still aggregate the M intermediate IFVs on behalf of the client as in the basic construction,
3.2.4 Illustrating Example

Suppose the index is built on six files f1, ..., f6, and three LSHs h1, h2, h3 are randomly chosen, i.e., M = 3, m = 6. We have g1 = h1, g2 = h2 + 6, and g3 = h3 + 12. In the index construction, f1, ..., f6 are processed to generate feature vectors pi, and their corresponding initial IFVs (i.e., e1, ..., e6) are stored in the index array at the buckets mapped to by the LSHs. Joint IFVs are then computed for each bucket. Finally, all bucket positions are further permuted randomly using the PRP, i.e., Fk1[g1(pi)], Fk1[g2(pi)], Fk1[g3(pi)] (i ∈ {1, 2, 3, 4, 5, 6}). All the joint IFVs form the similarity index I to be stored on the server. To search, the client first (processes the file of interest to) creates a d-dimensional vector q. Suppose f1, f4 and f6 are approximate near neighbors of q, with g1(q) = g1(p1) = g1(p4) = g1(p6) = 3 and g3(q) = g3(p1) = g3(p4) = g3(p6) = 15, but g2(q) = g2(p4) = g2(p6) = 10 ≠ g2(p1) = 9. As described, IFVs in the neighboring buckets are also explored through the bitwise union operations. With k1, the client prepares the search token τq = (Fk1[g1(q)] = 6, Fk1[g2(q)] = 3, Fk1[g3(q)] = 18), which enables the server to locate the IFVs corresponding to the approximate near neighbors of the query object q in I. Suppose further that Fk1[g2(p1)] = 13. The original bucket positions (i.e., 3, 9, 10, 15) are thus randomized by the PRP into new ones (i.e., 6, 13, 3, 18). The server then extracts the joint IFVs in the hit buckets, i.e., I[3] = (111101), I[6] = (100101) and I[18] = (100101), performs an intersection operation, and gets (100101), which means f1, f4 and f6 are indeed approximate members of q.
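A minimal replay of this example: the server extracts the joint IFVs in the hit buckets and intersects them bitwise. Bucket values are copied from the text; the IFV strings are read left to right as f1..f6.

```python
index = {3: "111101", 6: "100101", 18: "100101"}   # permuted position -> joint IFV
token = [6, 3, 18]                                  # tau_q sent by the client

d = int(index[token[0]], 2)
for y in token[1:]:
    d &= int(index[y], 2)                           # bitwise intersection

result = format(d, "06b")
files = [i + 1 for i, b in enumerate(result) if b == "1"]
print(result, files)    # 100101 [1, 4, 6]
```

The intersection reproduces (100101), i.e., files f1, f4 and f6, as in the text.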
3.2.3 Supporting Efficient Index Dynamics

Update(sk, f′; γ):
Client Side: To add new files f′, the client locally encrypts the files as cf′ and builds the sub-index If′ for f′ (which has the same structure as the similarity index on the server). To do so, the client simply runs (If′, cf′) ← Enc(sk, f′) and sets the update token τu = (If′, cf′). To delete existing files f′, first search for their locations, then generate τu = {i : fi ∈ f′}. In either case, send the update token τu to the server.
Server Side: For addition, merge If′ into I and cf′ into c, respectively. For deletion, to delete fi ∈ f′, set the i-th entry of each row of I to 0, and delete cf′ from c.
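On the unencrypted bitmask representation used in the earlier sketch, the two update operations reduce to a bucket-wise OR (addition) and clearing one bit in every bucket (deletion). Names are illustrative; bit i-1 (LSB-first) stands for file fi.

```python
def merge(I, I_new):
    """Addition: merge a sub-index built over new files into I, bucket-wise OR."""
    return [a | b for a, b in zip(I, I_new)]

def delete(I, fid):
    """Deletion: clear file fid's bit in every bucket of I (files are 1-indexed)."""
    mask = ~(1 << (fid - 1))
    return [v & mask for v in I]

I = [0b100101, 0b111101]       # two buckets over 6 files (LSB = f1)
I = delete(I, 4)               # remove f4: clears bit 3 in every bucket
print([format(v, '06b') for v in I])
```

Note that on the server the deletion touches every bucket, since a file's bit may have been propagated into neighboring joint IFVs.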
Fig. 1: An Illustration of the packing technique

although in an encrypted form. This preserves the low communication cost (independent of M) during the search.

Saving Space by Packing. Typical parameters for Paillier encryption support large (e.g., 1024-bit) integers as the plaintext space, but we only want to encrypt bit vectors of length #f. To pack a bit vector into multiple 1024-bit integers while leaving room for the addition of individual bits, we use a block of ⌈log2 3M⌉ bits in the plaintext space to represent each bit of the bit vector, as shown in Fig. 1. The remaining bits in the plaintext space, which are insufficient to form a block, are set to 0. The packed integers are then encrypted by Paillier encryption. It is easy to see that one needs to allocate more than ⌈log2 M⌉ bits per block, to ensure that the summation of the bit values in one row of the IFVs across M different buckets will not affect the aggregated values of other rows during the search process (refer to Steps 1 to 3 at the client side in Section 3.3.2). The reason for choosing a block size of specifically ⌈log2 3M⌉ bits lies in our deletion mechanism. For deletion, we set the second-to-last entry of the block corresponding to the file fi to be deleted to 1 in a 1024-bit zero vector, and then generate its encrypted sub-index Ifi (as presented in Step 2 at the client side in Section 3.3.3). The obtained sub-index Ifi will be used to perform additively homomorphic operations on all the IFVs stored in the buckets (as presented in Step 2 at the server side). That is, upon finishing the delete updates, the last two entries of the block corresponding to fi in all the buckets are changed to either "11" or "10". Hence, for any future search, the decimal value of the entries corresponding to fi in the aggregated form becomes 3M or 2M, which will not incur a "collision" (i.e., it will not affect the correctness of checking the equality of the corresponding entries for a new search).
This is why we pad (⌈log2 3M⌉ − 1) zero bits before each bit of the IFV to obtain a plaintext block.

3.3.1 Initialization and Index Construction

Gen(1λ): Additionally set up (skp, pkp) for the Paillier encryption. Include (skp, pkp) in sk.
Enc(sk, f ): Based on the basic index construction, additional steps are performed as follows: For each IFV stored in the
index I[Fk1(xi)] (i ∈ [Mm]), use the Paillier cryptosystem to encrypt fid̂i under the public key pkp according to the packing method presented above.

3.3.2 Search Over the Similarity Index

After locating the hit encrypted IFVs, the server homomorphically sums them up and returns the resulting ciphertext to the client. Now, for each query object q, the server only needs to transfer one ciphertext corresponding to the sum of the plaintext values (the hit IFVs). The client then decrypts the ciphertext and unpacks the blocks according to the alignment set by the packing method above, i.e., by interpreting the decryption result as a concatenation of bit-strings and separating them. In this way, the client effectively delegates the public-key computation burden to the server (the server aggregates all encrypted IFVs into one, while the client only performs one decryption).
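The packing, aggregation, and unpacking can be sketched as follows. This is a plaintext simulation: plain integer addition stands in for Paillier's additive homomorphism (a real deployment would encrypt each packed integer and multiply ciphertexts). Helper names are illustrative; the IFV values come from the toy example of Section 3.2.4.

```python
import math

M = 3                                   # number of LSH tables / hit buckets
B = math.ceil(math.log2(3 * M))         # bits per block: ceil(log2 3M) = 4
PLAINTEXT_BITS = 1024

def pack_ifv(ifv):
    """Pack a bit vector into integers, one B-bit block per file bit,
    at most 1024 bits per integer (leftover bits stay 0)."""
    per_int = PLAINTEXT_BITS // B
    out = []
    for start in range(0, len(ifv), per_int):
        val = 0
        for j, bit in enumerate(ifv[start:start + per_int]):
            val |= bit << (j * B)
        out.append(val)
    return out

def unpack_sums(packed, n):
    """Recover the per-file block values from an aggregated packed vector."""
    per_int = PLAINTEXT_BITS // B
    return [(packed[i // per_int] >> ((i % per_int) * B)) & ((1 << B) - 1)
            for i in range(n)]

# three joint IFVs over 6 files (values from the example in Sec. 3.2.4)
ifvs = [[1, 1, 1, 1, 0, 1], [1, 0, 0, 1, 0, 1], [1, 0, 0, 1, 0, 1]]
packed = [pack_ifv(v) for v in ifvs]
# "homomorphic" aggregation: add the packed integers component-wise
agg = [sum(col) for col in zip(*packed)]
sums = unpack_sums(agg, 6)
hits = [i + 1 for i, s in enumerate(sums) if s == M]   # blocks that sum to M
print(sums, hits)   # [3, 1, 1, 3, 0, 3] [1, 4, 6]
```

Because each block has ⌈log2 3M⌉ bits, sums up to 3M never overflow into the neighboring block, which is exactly the alignment argument made above.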
Search(sk, q; γ):
Client Side: Identical to that in the basic construction.
Server Side: Let y1 = Fk1(x1), ..., yM = Fk1(xM). Extract the corresponding encrypted fid̂i from I[yi] for each i ∈ [M]. Compute the intermediate encrypted result θq = ∏_{i=1}^{M} ⟦fid̂i⟧ and return it to the client.
Client Side:
1) Decrypt θq using the private key skp.
2) Unpack the decrypted message to obtain b = (b1, ..., b#f), where bi is the binary representation of a ⌈log2 3M⌉-bit integer.
3) Retrieve the ciphertexts corresponding to i = {i ∈ [#f] : bi = M} from the server.
Server Side: Return cq = {ci : i ∈ i} to the client. We remark that the request for the ciphertexts can instead be made using a private information retrieval (PIR) protocol, which allows the transfer of ciphertexts without the server learning the locations i.

3.3.3 Supporting Efficient Index Dynamics

Next, we show how efficient file updates can be realized with our encrypted similarity index.
Update(sk, f′; γ):
Client Side:
1) Procedures for adding a new file set f′ are the same as in the basic construction.
2) To delete existing files f′: for each file fi ∈ f′, initialize a 1024-bit zero vector and set its ((i mod (1024/⌈log2 3M⌉)) · ⌈log2 3M⌉ − 1)-th entry to 1, then generate the corresponding encrypted sub-index Ifi by homomorphic encryption. Finally, send τu = (If′, i) to the server, where If′ = {Ifi : fi ∈ f′}.
Server Side:
1) For file addition, the procedures are the same as in the basic construction.
2) For each file fi ∈ f′ to be deleted, for the ⌈i · ⌈log2 3M⌉ / 1024⌉-th block in all the buckets I[x], with Ifi extracted from τu, compute I[x] ← I[x] · Ifi (by leveraging the additively homomorphic property), and delete ci for all i ∈ i, where i = {i : fi ∈ f′}. This works since the decimal value of the entries corresponding to fi ∈ f′ in I is changed from M to 3M, while the others become 2M.

If the last "blocks" of the original index I do not encode the information of 1024/⌈log2 3M⌉ files, the client selects some files from f′ to construct the sub-index I′f′ and merges it into the last "blocks" of the index I. This is to make sure that the last "blocks" of the index I are encoded with 1024/⌈log2 3M⌉ files.
3.4 Scheme-II: Using Pseudorandom Padding
To further improve computation efficiency, we next introduce Scheme-II, which leverages a "one-time-pad" encryption instantiated by a PRF. It is CPA-secure due to the pseudorandomness of the PRF, which in turn relies on the existence of one-way functions [25]. To avoid reiterating the same steps, we only present the key differences and modifications.

3.4.1 Initialization and Index Construction

Gen(1λ): Besides the basic parameters such as m and those for LSH, we need k1, k2, and an additional bit string k3 ←$ {0,1}λ chosen uniformly at random to serve as the key of the PRF Gk3(·). Output sk = (k1, k2, k3, m, h1, h2, ..., hM).

Enc(sk, f): Based on the basic similarity index construction, the following additional step is added:
3) For each IFV stored in the index I[Fk1(xi)] (i ∈ [Mm]), encrypt fid̂i as I[Fk1(xi)] := fid̂i ⊕ Gk3(IndexID||Fk1(xi)), where || denotes string concatenation and G(·) is a PRF. Here, we use IndexID² to uniquely identify the encrypted similarity index I.

3.4.2 Search Over the Similarity Index
Now, for each query object, different from Scheme-I, the server needs to locate and return all hit encrypted IFVs (i.e., M IFVs) to the client during the search process. After receiving the encrypted results, the client uses the secret key to decrypt them and then decodes the IDs of the files that are similar to the query object q.
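The pad-based encryption and decryption round-trip can be sketched as follows, instantiating the PRF G_k3 with HMAC-SHA256 stretched in counter mode. This is an assumed instantiation for illustration (the paper only requires some PRF); all names are illustrative.

```python
import hmac, hashlib

def G(k3, index_id, bucket, nbytes):
    """PRF output stretched to nbytes via counter-mode HMAC-SHA256."""
    out = b""
    ctr = 0
    while len(out) < nbytes:
        out += hmac.new(k3, index_id + bucket + ctr.to_bytes(4, "big"),
                        hashlib.sha256).digest()
        ctr += 1
    return out[:nbytes]

def xor(xs, ys):
    return bytes(a ^ b for a, b in zip(xs, ys))

k3 = b"\x01" * 32                       # PRF key (toy value)
ifv = bytes([0b10010100])               # toy 8-file IFV packed into one byte
pad = G(k3, b"index-1", b"bucket-6", len(ifv))
ct = xor(ifv, pad)                      # encrypt: I[F_k1(x)] := fid ^ pad
pt = xor(ct, pad)                       # decrypt: client recomputes the same pad
assert pt == ifv
```

Since the pad depends on both the index identifier and the (permuted) bucket position, each bucket of each index gets an independent-looking mask.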
Search(sk, q; γ):
Client Side: Identical to that in the basic construction.
Server Side: Let y1 = Fk1(x1), ..., yM = Fk1(xM). Extract the corresponding IFVs from I[yi] for each i ∈ [M]. Return r = (r1 = I[y1], ..., rM = I[yM]) to the client.
Client Side:
1) Compute Gk3(IndexID||Fk1(xi)) for i ∈ [M].
2. The notion of IndexID is reasonable. In practice, if the data owner wants to build indexes for different combinations of multimedia data files, then an IndexID can be used to uniquely identify each index.
2) Decrypt r using Gk3(IndexID||Fk1(xi)) for i ∈ [M] to obtain di = ri ⊕ Gk3(IndexID||Fk1(xi)) = I[yi] ⊕ Gk3(IndexID||Fk1(xi)) = fid̂i.
3) Compute d = fid̂1 ∧ ... ∧ fid̂M.
4) Send i = {i : d[i] = 1} to the server, or use PIR.
Server Side: Return cq = {ci : i ∈ i} to the client.

3.4.3 Supporting Efficient Index Dynamics

Update(sk, f′; γ):
Client Side: Based on the basic similarity index construction, the following modifications should be made for addition.
1) For each row vector, denoted by AddVec, in the sub-index If′, set AddVec[j] = AddVec[j] ⊕ Gk3(IndexID||j)[#f + i], where j ∈ [Mm].
2) Output the encrypted sub-index If′ and c′ as τu.
For deletion using τu, the following adaptations are made.
1) For each file fi to be deleted, generate an (Mm)-dimensional row vector DelVec, with each entry initialized to 0. Then encrypt it as DelVec[j] = DelVec[j] ⊕ Gk3(IndexID||j)[i], where j ∈ [Mm].
2) Output the encrypted sub-index If′ as τu.
Server Side:
1) For file addition, append the sub-index If′ extracted from τu to I, and merge c′ into c, respectively.
2) For file deletion, to delete fi ∈ f′, set the i-th row of I to the encrypted DelVec of fi extracted from If′, and delete c′ from c accordingly.
It is easy to see the correctness of the update.

3.5 Efficiency Analysis
For search time efficiency, M hash operations are used to generate the search token at the client side for both schemes. At the server side, Scheme-I applies #f·⌈log2(3M)⌉·(M−1)/1024 additively homomorphic operations to aggregate the M encrypted IFVs, while Scheme-II only needs to locate the buckets using the search token. For the client to decrypt the (intermediate) search results, #f·⌈log2(3M)⌉/1024 (Paillier) decryption operations are needed for Scheme-I, or M XOR operations over bit vectors of length #f for Scheme-II. Besides, #f checking operations are involved to evaluate the equality of each decrypted entry to 1 or M. Although the search time complexity is theoretically O(#f), we still achieve practical efficiency. The computations involved in Scheme-II are XOR and equality checking, which are extremely efficient. For Scheme-I, due to the packing technique, only a fraction of the #f ciphertexts is processed instead of all of them. Our experiments also confirm that the involved computation, namely the public homomorphic aggregation of ciphertexts at the server side and the decryption of the aggregated ciphertexts at the client side, attains practical efficiency.
While the de facto standard of SSE efficiency is sub-linearity in #f, those schemes only work for simple keyword search or boolean expressions based on keyword comparison. Conceptually, those schemes process a selected set of keywords instead of the files themselves. Since our aim is to support fuzzy search across many large files, it is practically infeasible to adopt this approach, which requires preparing every possible "keyword" across a large number of files and tagging them accordingly. One could alternatively consider verifying the similarity between each data file and the query object file. However, this either nullifies the advantages of cloud computing for the client (if the processing is done at the client side) or puts an extremely high burden on the cloud (if the cloud must go through each of the large files to check whether it matches the query, in the encrypted domain). In our schemes, O(#f) computation is needed due to the functionality we aim to support: answering unforeseen queries and returning results based on similarity. Under this setting, a sub-optimal solution may miss some of the candidates. The elegance of our design is that, while still requiring O(#f) operations, we restrict them to the basic primitive operations discussed previously, and we leverage the cloud to do most of the computation. Compared with the (trivial) solution of returning O(#f) "objects" (anything after post-processing by the cloud on encrypted data, say, via fully homomorphic encryption; or, in the trivial case, the ciphertexts themselves) for the client to process, we obtain #f/φ(d) times savings in communication and storage costs, where φ(d) denotes the size of the returned candidate set. These significant savings are also demonstrated numerically in Tables 2, 3, and 4.
3.6 Security Analysis
For both of our constructions, we define the leakage functions as follows.

Definition 8. (Leakage function L1 for setup). Given a multimedia data collection f, L1(f) = {|fi|}_{i=1}^{#f}, where |·| denotes the length of a file.

Definition 9. (Leakage function L2 for search). Given a multimedia data collection f and a search object q, L2(f, q) = {π(f, q), Ap(f, q), E(q)}, where E(q) is the set of identifiers of the buckets corresponding to q, i.e., E(q) = {id(g1(q)), ..., id(gM(q))}.

Specifically, in the above definition, the access pattern implicitly includes intermediate results, e.g., the encrypted IFVs extracted from the hit buckets. We emphasize that the encrypted IFVs and their encrypted aggregation do not reveal additional information, since these intermediate results are encrypted before the recovery of id(cq).

Definition 10. (Leakage function L3 for update). Given a multimedia data collection f and an update object u, L3(f, u) = {id(fi), |fi|}_{i=#f+1}^{#f+#f′} for add updates, and L3(f, u) = {id(fi)}_{fi∈f′} for delete updates.

Theorem 1. Assuming the decisional composite residuosity problem is hard (resp. the existence of one-way functions), Scheme-I (resp. Scheme-II) is CQA2-secure in the standard model.
Proof. The only difference between Scheme-I and Scheme-II lies in how the IFVs are encrypted. The following proof only uses the fact that both methods are CPA-secure, without utilizing the homomorphic property; thus, the proofs for both schemes are identical. For either scheme, we argue its CQA2-security by describing a polynomial-time simulator S such that for any PPT adversary A, the outputs of Real_A^SSE(λ) and Ideal_{A,S}^SSE(λ) are indistinguishable, as follows:

[Setup] S is given L1(f) = {|fi|}_{i=1}^{#f}. Let I be a dictionary of size Mm mapping λ-bit random strings to encryptions of zeros under the underlying encryption scheme. The number of zeros encrypted is determined by #f (and the packing method described in the construction for Scheme-I). Generate k2′ ← SKE.Gen(1λ) and ci ← SKE.Enc_{k2′}({0,1}^{|fi|}) for i ∈ [#f]. Output γ = (I, c), where c = {c1, ..., c#f}. S maintains internally a mapping from the Mm identifiers of the buckets to the λ-bit random strings generated above.

[Simulating Queries] S is given L2(f, q) = {π(f, q), Ap(f, q), E(q)}. Consider the M identifiers of the buckets corresponding to the query object q. S sends the random strings corresponding to these identifiers to A. Upon verifying the response from A against the index I, S sends the file identifiers given in Ap(f, q) to A, who then returns the corresponding ciphertexts.

[Simulating Updates] For add updates, S is given L3(f, u) = {id(fi), |fi|}_{i=#f+1}^{#f+#f′}, which is simulated as in the setup phase. For delete updates, S is given L3(f, u) = {id(fi)}_{fi∈f′}. S simulates a fresh If′ as in the setup phase and sends (If′, {id(fi)}_{fi∈f′}) to A.

The indistinguishability between Real_A^{Scheme-I}(λ) and Ideal_{A,S}^{Scheme-I}(λ) follows from the CPA-security of SKE or the XOR-based encryption scheme, and the indistinguishability of the PRF from random functions. Note that the simulation above is quite simple. Specifically, our schemes only return an aggregated encrypted IFV (for Scheme-I) or a number of encrypted IFVs (for Scheme-II), instead of directly retrieving the encrypted results for the client. That is why the simulator can just return a (set of) random string(s) of appropriate length: without the corresponding secret key for decryption, the adversary simply cannot distinguish them.

Also, in retrieving the encrypted results for the client in the final stage, the simulator does not need to work directly with the encrypted index; the information available from the leakage function already suffices. From another perspective, the adaptive security of our scheme can be achieved in the standard model due to the non-trivial communication overhead, i.e., the encrypted IFV is of length O(#f). For forward privacy, observe that adding new files to the database introduces new columns in encrypted form, and the decryption of IFVs is performed at the client side during queries. Thus, even equipped with previous queries, the server is unable to decrypt the newly added portions of the IFVs.
Theorem 2. Assuming the decisional composite residuosity problem is hard (resp. the existence of one-way functions), Scheme-I (resp. Scheme-II) is forward private in the standard model.

Proof. By the same argument as in the proof of Theorem 1, we can assume that the encryption schemes in both Scheme-I and Scheme-II are CPA-secure. In the forward privacy game, upon receiving the challenge updates u0 and u1, the challenger simply encrypts zeros to produce dummy ciphertexts c0 and a dummy index If′. If the adversary can distinguish the dummy update from the real ones, we can construct an adversary which breaks the CPA-security of the encryption scheme, which happens with probability less than negl(λ). Suppose the adversary cannot distinguish the above. The simulated ciphertexts and index contain no information about u0 and u1, so the success probability of the adversary is exactly 1/2. Thus, the overall success probability of the adversary is bounded above by 1/2 + negl(λ).
4 THEORETICAL ANALYSIS OF SEARCH QUALITY
For the search quality of our system, the computation of the union of IFVs in consecutive neighboring buckets can greatly reduce the false negative rate, and the computation of the bitwise intersection of IFVs in multiple hashing-hit buckets during search can eliminate a large number of false positives.

Based on Definition 7, each LSH hi(q) = ⌊(a·q + b)/W⌋ maps the d-dimensional point q onto intervals of length W. After mapping all feature vectors p extracted from f, the probability that a query q falls into the same bucket as some point p depends on the interval length W and the distance between q and p (i.e., ‖p − q‖2). Fig. 2 shows the probability that q falls into the same bucket as p and the probability that q falls into the left or right neighboring bucket of p.

Consider an LSH hi(·). For a fixed p, let fp(q) be the probability that q and p fall into the same interval (i.e., collide with each other under ha,b(·)). In addition, we use fp−(q) and fp+(q) to denote the probabilities that q falls into the neighboring left and right intervals, respectively. For ease of analysis, we let x(δ) (δ ∈ {−1, 1}) be the absolute distance of p from the boundary of the interval ha,b(p) + δ. According to the definition of ha,b(q) = ⌊(a·q + b)/W⌋, the difference between the projections of q and p onto the line is (a·q + b) − (a·p + b) = a·(q − p), which is distributed as ‖p − q‖2 · X, where X follows a Gaussian distribution. When W is large enough, q can fall into the same bucket as p or into its left/right neighboring slots. Formally:

Lemma 3. The probability that two items p and q collide under an LSH is

fp(q) = p(u) = Pr[ha,b(q) = ha,b(p)] = ∫₀^W (1/u)·g(t/u)·(1 − t/W) dt,

where p(u) is the collision probability as a function of u, u = ‖p − q‖2, and g(t) is the probability density function of the absolute value of the 2-stable distribution, namely g(t) = (1/√(2π))·e^(−t²/2).

Proof. Set t = |(a·p + b) − (a·q + b)| = |a·(p − q)| and x = t/u, so that x is distributed as X. As shown in Fig. 3, the probability that (a·q + b)/u falls into the same slot as (a·p + b)/u is ∫₀^(W/u) g(x)·((W/u − x)/(W/u)) dx. According to the substitution
rule for the definite integral, ∫_a^b f(x) dx = ∫_α^β f(ψ(t))·ψ′(t) dt, we have

∫₀^(W/u) g(x)·((W/u − x)/(W/u)) dx = ∫₀^W g(t/u)·((W/u − t/u)/(W/u))·(1/u) dt = ∫₀^W (1/u)·g(t/u)·(1 − t/W) dt.

Fig. 2: Probability of query q and its approximate neighbors falling into different intervals

Fig. 3: Probability of query q and its approximate neighbors falling into the same intervals

Because the variance of uX is u², an increase of u leads to a decrease of the collision probabilities. According to Definition 7, we have P1 = p(1) and P2 = p(c) for R = 1 (readers can refer to [27] for more details). Further, we have

p(1) = ∫₀^W g(t)·(1 − t/W) dt = ∫₀^W g(t) dt − ∫₀^W g(t)·(t/W) dt
     = ∫₀^W g(t) dt + (1/(W√(2π)))·(e^(−W²/2) − 1)
     = normcdf(W) − 1/2 + (1/(W√(2π)))·(e^(−W²/2) − 1),

p(c) = ∫₀^W (1/c)·g(t/c)·(1 − t/W) dt = ∫₀^W (1/c)·g(t/c) dt + (c/(W√(2π)))·(e^(−W²/(2c²)) − 1)
     = normcdf(W/c) − 1/2 + (c/(W√(2π)))·(e^(−W²/(2c²)) − 1),

where normcdf(·) denotes the distribution function of the 2-stable distribution.

Theorem 4. Given an (R, cR, P1, P2)-sensitive LSH family for the distance function ‖·‖2, the false negative rate, the false positive rate, and the recall (defined in Section 5.2) of the encrypted similarity index are

fnegi = 1 − ∏_{j=1}^{M} (fp−ij + fpij + fp+ij),   (1a)

together with Eqs. (1b) and (1c) below, where S denotes the collection of the approximate members of q in the dataset, pi ∈ S and u ≤ R in Eq. (1a), and pi ∈ f \ S and u > R in Eq. (1b).

Proof. The probability that item q falls into the slot to the right (left) of p depends on how close q is to the right (left) boundary of its slot. We have

fp−(q) = Pr[ha,b(q) = ha,b(p) − 1] = ∫_{x(−1)}^{x(−1)+W} (1/u)·g(t/u) dt ≈ e^(−A·x(−1)²),

fp+(q) = Pr[ha,b(q) = ha,b(p) + 1] = ∫_{x(+1)}^{x(+1)+W} (1/u)·g(t/u) dt ≈ e^(−A·x(+1)²).

The probability density function of a Gaussian random variable is e^(−x²/(2σ²)) (scaled by a normalizing constant). Thus, according to [18], the probability that point q falls into a neighboring slot of p can be estimated by the equations above, where A is a constant depending on ‖p − q‖2, and x(+1) + x(−1) = W.

Given an (R, cR, P1, P2)-sensitive LSH family, we thus have fp−(q) and fp+(q). In our similarity search index, probing one bucket is equivalent to probing three neighboring buckets at a time. Suppose the set of approximate members of the query q is S. After hashing and randomization, the probability of mapping q into bucket k (which contains IFVs of the approximate members of q) is (fp−i + fpi + fp+i) for each pi ∈ S. With M LSH functions, the probability that q is successfully mapped into all M target buckets is fsuccessi = ∏_{j=1}^{M} (fp−ij + fpij + fp+ij). Thus, for each member of S, the probability of a false negative, caused by at least one missed bucket hit, is given by Eq. (1a). For an item pi in f \ S, the probability that q collides with it under LSH function j is (fp−ij + fpij + fp+ij); thus, if pi is not an approximate neighbor of q, the false positive probability, i.e., the probability that a query collides with such an item under all M hash functions, is given by Eq. (1b). After computing fnegi for each pi ∈ S, the expected number of correctly returned files is Σ_{i=1}^{#S} (1 − fnegi). We can estimate the recall of our schemes as in Eq. (1c).
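The closed form for p(1) derived above can be sanity-checked numerically against a midpoint-rule evaluation of the integral ∫₀^W g(t)·(1 − t/W) dt, with g the density used in the derivation. Function names are illustrative.

```python
import math

def closed_p1(W):
    # p(1) = normcdf(W) - 1/2 + (exp(-W^2/2) - 1)/(W*sqrt(2*pi))
    normcdf_W = 0.5 * (1 + math.erf(W / math.sqrt(2)))
    return normcdf_W - 0.5 + (math.exp(-W * W / 2) - 1) / (W * math.sqrt(2 * math.pi))

def numeric_p1(W, n=200_000):
    # midpoint-rule quadrature of g(t)*(1 - t/W) over [0, W]
    g = lambda t: math.exp(-t * t / 2) / math.sqrt(2 * math.pi)
    h = W / n
    return sum(g((k + 0.5) * h) * (1 - (k + 0.5) * h / W) for k in range(n)) * h

for W in (1.0, 2.0, 4.0):
    assert abs(closed_p1(W) - numeric_p1(W)) < 1e-6
```

The agreement for several interval widths W confirms the algebraic steps in the derivation of p(1); p(c) follows by the same substitution with t/c.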
fposi = ∏_{j=1}^{M} (fp−ij + fpij + fp+ij),   (1b)

recall = (1/#S) · Σ_{i=1}^{#S} (1 − fnegi).   (1c)

5 EXPERIMENT EVALUATION

We evaluate by thorough experiments the efficiency and effectiveness of our constructions. The experiments are performed on a machine with an Intel Core i5-4460S 2.90 GHz processor, 12 GB of DRAM, and a 500 GB 5400 RPM SATA disk. We report space cost, search time, recall, and error ratio.

5.1 Evaluation Datasets

We choose the following three representative datasets, summarized in Table 1, showing that our encrypted index for different dataset sizes can easily fit into main memory.
5.1.1 Forest Covertype Trace

The actual forest cover type for a given observation (30 × 30 meter cell) was determined from US Forest Service (USFS) Region 2 Resource Information System (RIS) data. Independent variables were derived from data originally obtained from US Geological Survey (USGS) and USFS data. Data
TABLE 1: Characteristics of experimental datasets

Dataset type       # Files    Feature dimension    Dataset size
Forest Covertype   581,012    54                   119.9 MB
Image Data I       59,500     400                  22.7 MB
Image Data II      14,950     512                  27 GB
use I(q) to denote the set of ideal answers. Assume A(q) is the set of actual answers returned, Recall is defined by
Recall = TABLE 1: Characteristics of experimental datasets is in raw form (not scaled) and contains binary (0 or 1) columns of data for independent qualitative variables (wilderness areas and soil types). This data set3 contains 581, 012 data points. Each has 54-dimensional attributes, which include 10 quantitative variables, 4 binary wilderness areas, and 40 binary soil type variables. It is commonly used to predict forest cover type from cartographic variables only. 5.1.2 Image Data I The image data set is obtained from a project of Shakhnarovich4 . It contains 59, 500 of 20 × 20 grey-level image patches taken from a bunch of motorcycle images. We reshaped each image into a 400-dimensional column vector. 5.1.3 Image Data II The image data set consists of 14,950 high quality pictures and the total size is about 27GB. For each image, we use the color histogram extraction tool from FIRE5 to extract a 512-dimensional color histogram. 5.2
Evaluation Metrics
Our experiment randomly generates queries and extracts the corresponding feature vectors as the search objects for each dataset. All measures are averaged over these queries. Recall that our goal is to generate candidate set of approximate files for refinement such as privacy-preserving approximate nearest neighbors (ANN) search or k -nearest neighbor (k NN). We not only performed experiment on different types of data, but also showed the efficiency and search quality under different constraint (i.e., distance threshold R for ANN and parameter k for k NN). For each search feature vector in the Forest Trace dataset and image dataset I, the ideal answer is defined to be the most similar files whose feature vectors are the nearest neighbors of the search one, under a pre-defined distance constraint R using the Euclidean distance measure. For the image dataset II, the ideal answer is defined to be the k nearest neighbors of the query object. More specifically, we measure the performance of our encrypted similarity search scheme in four aspects: space cost, search time, recall and error ratio. Space cost is to evaluate the space efficiency of the index data structure. The space cost of the index should be practical compared to the original dataset size. Search time is to measure the search speed of answering one search query over the encrypted similarity index. It includes the times of locating the hit bucket positions in the index, decrypting of returned encrypted IFVs, and computing the intersection of IFVs. Recall is a typical metric in evaluation of similarity search over plaintext data. Given a search feature vector q , we 3. https://archive.ics.uci.edu/ml/datasets/Covertype 4. http://ttic.uchicago.edu/∼gregory/download.html 5. Flexible image retrieval engine http://thomas.deselaers.de/fire
|A(q) ∩ I(q)| . |I(q)|
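For concreteness, recall (and the error ratio defined next) can be computed over plaintext results as follows. This is a minimal sketch of the evaluation logic only; the function and variable names are ours, not part of the scheme:

```python
def recall(actual_ids, ideal_ids):
    """Recall = |A(q) ∩ I(q)| / |I(q)| for one query q."""
    actual, ideal = set(actual_ids), set(ideal_ids)
    return len(actual & ideal) / len(ideal)

def error_ratio(per_query_pairs):
    """error ratio = (1 / (|Q| N)) * sum over q and k of d_LSH_k / d*_k.

    per_query_pairs: for each query q, the list of (d_LSH_k, d*_k)
    distance pairs over the N = |A(q) ∩ I(q)| matched neighbors."""
    ratios = [d_lsh / d_true
              for pairs in per_query_pairs
              for d_lsh, d_true in pairs]
    return sum(ratios) / len(ratios)

# Example: 3 of the 4 ideal neighbors were returned for this query.
print(recall([1, 2, 3, 9], [1, 2, 3, 4]))        # 0.75
print(error_ratio([[(1.0, 1.0), (2.2, 2.0)]]))   # 1.05
```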
Our goal is to generate a fine candidate set to be used in privacy-preserving kNN or ANN searches for the query object, where only the top-k candidates or the candidates within a pre-defined distance constraint R (i.e., the ideal answer I(q)) are returned. We therefore do not directly consider the precision metric used in the plaintext similarity search domain [18], [28], [29]. Error ratio is commonly used to evaluate approximate nearest neighbor search. It measures how close the distances of the most similar files found by our index scheme are to the distances of the truly most similar files (both computed using the corresponding feature vectors). For a set of queries Q, the error ratio is defined by

error ratio = (1 / (|Q| N)) · Σ_{q ∈ Q} Σ_{k=1}^{N} d_{LSH,k} / d*_k,

where N = |A(q) ∩ I(q)|, d_{LSH,k} is the distance of the k-th most similar file found by our scheme, and d*_k is the distance of the true k-th most similar file. Note that all distances are computed using the feature vectors corresponding to the multimedia data files. In the ideal case, recall and error ratio both equal 1.0.

5.3 Experimental Results
Tables 2, 3 and 4 show the main results. For each dataset, the tables report the search time, the bucket length, and the space cost at different recall values, the error ratios, and the ratio #f/φ(d) of the original dataset size #f to the size φ(d) of the returned candidate set. As the results show, for the roughly 0.6 million items of the Forest Trace dataset, the search times of Scheme-I and Scheme-II are 49.94 s and 0.1561 s, respectively. For the actual communication cost, the returned intermediate encrypted results are also very small compared to the size of the original data files; as illustrated in Table 5, the communication costs of both schemes are practically acceptable. We have demonstrated that our schemes achieve a high recall (i.e., a large fraction of the relevant instances can be retrieved). Note also that, as we will show, our system speeds up search by tens to hundreds of times (i.e., by a factor of #f/φ(d)) compared with the naïve approach of scanning the whole database. The client thus obtains a highly relevant and much smaller candidate set on which to perform a refinement such as a kNN or ANN search. This is why we do not explicitly consider precision, the fraction of retrieved instances that are relevant.

The results show that our encrypted similarity index is highly space-efficient and time-efficient. In all cases, the number of hash tables required is only 5, and each hash table has a small number of buckets m. Even for a very large dataset, the hash tables occupy only a small portion of main memory (without incurring any disk I/O).

TABLE 2: Search performance of Forest Trace dataset with R = 150

Recall   Error ratio   Search time (s)        M   m     W     Space cost (MB)       #f/φ(d)
                       Scheme-I   Scheme-II                   Scheme-I   Scheme-II
0.9878   1.0006        49.94      0.1561      5   37    300   102.53     12.81      112
0.9527   1.0036        49.94      0.1561      5   49    240   135.78     16.97      236
0.9132   1.0086        49.94      0.1561      5   53    210   146.88     18.35      373

TABLE 3: Search performance of image dataset I with R = 500

Recall   Error ratio   Search time (s)        M   m     W     Space cost (MB)       #f/φ(d)
                       Scheme-I   Scheme-II                   Scheme-I   Scheme-II
0.9915   1.0008        5.12       0.0164      5   16    800   4.54       0.57       16
0.9418   1.0041        5.12       0.0164      5   24    550   6.81       0.85       43
0.9141   1.0075        5.12       0.0164      5   26    500   7.38       0.92       54

TABLE 4: Search performance of image dataset II with k = 10

Recall   Error ratio   Search time (s)        M   m     W     Space cost (MB)       #f/φ(d)
                       Scheme-I   Scheme-II                   Scheme-I   Scheme-II
0.9942   1.0042        1.29       0.0111      5   65    10    4.61       0.58       406
0.9494   1.0073        1.29       0.0111      5   81    6     5.83       0.72       719
0.9118   1.0286        1.29       0.0111      5   112   5     8.06       1.00       940

Fig. 4: Search quality for different W (R = 150 for Forest Trace dataset, R = 500 for image dataset I, k = 10 for image dataset II); panels (a)-(c) plot recall against W for M = 5, 10, and 15.

Fig. 5: False positive rate: (a) varying M (W = 250); (b) varying W (M = 5).

Based on the theoretical analysis of search quality in Section 4, Fig. 5(a) and Fig. 5(b) show the false positive rates as M and W vary, respectively, with the other parameters fixed appropriately (i.e., c = 5 and R = 1). Given a fixed dataset, our search time is related only to the number of hash tables; even for a large-scale dataset, a desirable recall and error ratio can be achieved at high search speed. As W increases, better search quality can be obtained with fewer buckets in each hash table. Fig. 4 explicitly shows the relationship between search quality and bucket length W. As discussed before, W should be carefully chosen to guarantee good search quality while keeping the false positive rate low.
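For intuition, the interplay between the bucket length W and the number of hash tables M can be sketched with a plain p-stable LSH function in the style of [27]. This toy code is only illustrative and is not the encrypted index itself:

```python
import random

def make_lsh(dim, W, seed):
    """One p-stable LSH function h(v) = floor((a . v + b) / W) [27].
    A larger bucket length W makes nearby vectors more likely to
    fall into the same bucket."""
    rng = random.Random(seed)
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # 2-stable (Gaussian)
    b = rng.uniform(0.0, W)
    return lambda v: int((sum(x * y for x, y in zip(a, v)) + b) // W)

# M independent hash tables: a file becomes a candidate when it collides
# with the query in some table, so increasing M raises recall but also
# enlarges the candidate set (and the index).
dim, W, M = 4, 4.0, 5
tables = [make_lsh(dim, W, seed=s) for s in range(M)]
query = [0.1, 0.2, 0.3, 0.4]
near = [0.1, 0.2, 0.3, 0.5]
print(sum(h(query) == h(near) for h in tables))  # collides in most tables
```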
When the false positive rate increases, we can use more hash tables to obtain more accurate results. As shown in Fig. 6, using more hash functions reduces the number of falsely returned files while the recall decreases only slightly. It is also worth noting that how the recall changes (as M increases) depends highly on the characteristics of the dataset. Thus, for a specific dataset, by choosing W and M carefully we can strike a balance between false positive rate and search quality. From a practical point of view, we can rely on a larger M to achieve a lower false positive rate while still maintaining a high level of search quality. To understand Fig. 7, we first define the file return probability (FRP). We compute the distance r between the feature vector of a multimedia file and the query feature vector. Then, given a distance range to the query, the FRP is the ratio of the number of files in that distance range among the returned candidate set to the number of files in that distance range among the whole file collection.
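Under this definition, the FRP for one distance range can be computed from plaintext distances as follows (an illustrative helper of our own, not part of the scheme):

```python
def file_return_probability(all_dists, returned_dists, r_lo, r_hi):
    """FRP for the range [r_lo, r_hi): the number of returned candidates
    whose distance r to the query lies in the range, divided by the
    number of files in the whole collection whose distance lies in it."""
    total = sum(1 for r in all_dists if r_lo <= r < r_hi)
    returned = sum(1 for r in returned_dists if r_lo <= r < r_hi)
    return returned / total if total else None

# Example: 3 files of the collection lie within distance [0, 100) of the
# query and 2 of them are in the candidate set, so FRP = 2/3.
print(file_return_probability([50, 80, 90, 150], [50, 90], 0, 100))
```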
TABLE 5: The communication costs of Scheme-I and Scheme-II are #f · ⌈log2(3M)⌉ bits and #f · M bits, respectively.

                         M = 5 (KB)             M = 10 (KB)            M = 15 (KB)
Dataset                  Scheme-I   Scheme-II   Scheme-I   Scheme-II   Scheme-I   Scheme-II
Forest Covertype Trace   567.39     709.24      712.02     1418.48     854.43     2127.72
Image Data I             58.11      72.63       72.91      145.26      87.50      217.89
Image Data II            14.60      18.25       18.32      36.50       21.99      54.75
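The per-query costs in Table 5 follow the caption's formulas. The absolute figures depend on the exact encoding, but the growth with M can be checked directly (#f = 581,012 is taken from Table 1 for the Forest Covertype trace):

```python
import math

def comm_bits(num_files, M):
    """Per the Table 5 caption: Scheme-I transfers #f * ceil(log2(3M))
    bits and Scheme-II transfers #f * M bits per query."""
    scheme1 = num_files * math.ceil(math.log2(3 * M))
    scheme2 = num_files * M
    return scheme1, scheme2

# Scheme-I grows with ceil(log2(3M)) = 4, 5, 6 for M = 5, 10, 15 (hence
# the ratio 712.02/567.39 ≈ 5/4 in Table 5), while Scheme-II doubles from
# M = 5 to M = 10 (1418.48/709.24 = 2).
for M in (5, 10, 15):
    print(M, comm_bits(581_012, M))
```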
Fig. 7: Recall of the files at different distances to the query: (a) Forest Trace dataset (M = 5, R = 150, W = 300); (b) image dataset I (M = 5, R = 500, W = 800); (c) image dataset II (M = 5, k = 10 (R ≈ 3.3277), W = 10).

For a fixed search query, W, and M, the closer a file's feature vector is to the search query, the higher the probability that the file is returned. Our scheme returns almost all of the closest files (r < 100 for the forest dataset, r < 300 for image dataset I, and r < 5 for image dataset II). The results are consistent with Definition 7: the larger c is, the less probable it is that dissimilar files are returned.

Table 5 gives the communication costs of running Scheme-I and Scheme-II on the different datasets. The experimental results show that, while the communication cost of Scheme-II is slightly larger than that of Scheme-I, the communication efficiency of both schemes is practically high. As for search efficiency, the search time of Scheme-II is two orders of magnitude smaller than that of Scheme-I.

Fig. 6: Influence of the number of hash tables: (a)-(b) Forest Trace dataset (W = 300); (c)-(d) image dataset I (W = 800); (e)-(f) image dataset II (W = 10).

6 CONCLUSION

We investigated the problem of privacy-preserving similarity search over encrypted feature-rich data. We proposed a high-speed and compact similarity search index supporting efficient file and index updates. Based on well-defined security models with leakages, we proved that our index constructions are semantically secure against adaptively chosen query attack. A theoretical performance analysis was also presented to carefully characterize our index designs. Using three representative real-world datasets, we showed that our index constructions are highly compact and practically secure, enabling efficient similarity search over encrypted multimedia data with high search quality.
7 ACKNOWLEDGMENTS

Qian's research is supported in part by the National Natural Science Foundation of China under Grant No. 61373167 and the National Basic Research Program of China (973 Program) under Grant No. 2014CB340600. Sherman Chow is supported by General Research Fund Grant No. 14201914 and the Early Career Award from the Research Grants Council, Hong Kong.
REFERENCES

[1] D. X. Song, D. Wagner, and A. Perrig, "Practical techniques for searches on encrypted data," in S&P '00, 2000, pp. 44–55.
[2] S. Kamara, C. Papamanthou, and T. Roeder, "Dynamic searchable symmetric encryption," in CCS '12, 2012, pp. 965–976.
[3] R. Curtmola, J. A. Garay, S. Kamara, and R. Ostrovsky, "Searchable symmetric encryption: Improved definitions and efficient constructions," in CCS '06, 2006, pp. 79–88.
[4] D. Cash, S. Jarecki, C. S. Jutla, H. Krawczyk, M.-C. Rosu, and M. Steiner, "Highly-scalable searchable symmetric encryption with support for boolean queries," in CRYPTO '13, 2013, pp. 353–373.
[5] F. Hahn and F. Kerschbaum, "Searchable encryption with secure and efficient updates," in CCS '14, 2014, pp. 310–320.
[6] S. Kamara and C. Papamanthou, "Parallel and dynamic searchable symmetric encryption," in Financial Cryptography '13, 2013, pp. 258–274.
[7] E. Stefanov, C. Papamanthou, and E. Shi, "Practical dynamic searchable encryption with small leakage," in NDSS '14, 2014.
[8] P. Golle, J. Staddon, and B. Waters, "Secure conjunctive keyword search over encrypted data," in ACNS '04, 2004, pp. 31–45.
[9] D. Boneh and B. Waters, "Conjunctive, subset, and range queries on encrypted data," in TCC '07, 2007, pp. 535–554.
[10] T. Moataz and A. Shikfa, "Boolean symmetric searchable encryption," in ASIACCS '13, 2013, pp. 265–276.
[11] M. Chase and S. Kamara, "Structured encryption and controlled disclosure," in ASIACRYPT '10, 2010, pp. 577–594.
[12] R. W. F. Lai and S. S. M. Chow, "Structured encryption with non-interactive updates and parallel traversal," in ICDCS '15, 2015, pp. 776–777.
[13] R. Weber, H.-J. Schek, and S. Blott, "A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces," in VLDB '98, 1998, pp. 194–205.
[14] Q. Lv, M. Charikar, and K. Li, "Image similarity search with compact data structures," in CIKM '04, 2004, pp. 208–217.
[15] I. Giangreco, I. Al Kabary, and H. Schuldt, "ADAM — database and information retrieval system for big multimedia collections," in Intl. Congress on Big Data, 2014.
[16] M. Kuzu, M. S. Islam, and M. Kantarcioglu, "Efficient similarity search over encrypted data," in ICDE '12, 2012, pp. 1156–1167.
[17] G. R. Hjaltason and H. Samet, "Properties of embedding methods for similarity searching in metric spaces," IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, no. 5, pp. 530–549, 2003.
[18] Q. Lv, W. Josephson, Z. Wang, M. Charikar, and K. Li, "Multi-probe LSH: Efficient indexing for high-dimensional similarity search," in VLDB '07, 2007, pp. 950–961.
[19] J. Yuan and S. Yu, "Efficient privacy-preserving biometric identification in cloud computing," in INFOCOM '13, 2013, pp. 2652–2660.
[20] T. Moataz, A. Shikfa, N. Cuppens-Boulahia, and F. Cuppens, "Semantic search over encrypted data," in ICT '13, 2013, pp. 1–5.
[21] B. Wang, S. Yu, W. Lou, and Y. T. Hou, "Privacy-preserving multi-keyword fuzzy search over encrypted data in the cloud," in INFOCOM '14, 2014.
[22] W. K. Wong, D. W.-L. Cheung, B. Kao, and N. Mamoulis, "Secure kNN computation on encrypted databases," in SIGMOD '09, 2009, pp. 139–152.
[23] T. Deselaers, D. Keysers, and H. Ney, "Features for image retrieval: An experimental comparison," Information Retrieval, vol. 11, no. 2, pp. 77–107, 2008.
[24] S. Har-Peled, P. Indyk, and R. Motwani, "Approximate nearest neighbor: Towards removing the curse of dimensionality," Theory of Computing, vol. 8, no. 1, pp. 321–350, 2012.
[25] J. Katz and Y. Lindell, Introduction to Modern Cryptography: Principles and Protocols. CRC Press, 2007.
[26] P. Paillier, "Public-key cryptosystems based on composite degree residuosity classes," in EUROCRYPT '99, 1999, pp. 223–238.
[27] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni, "Locality-sensitive hashing scheme based on p-stable distributions," in SCG '04, 2004, pp. 253–262.
[28] A. Joly and O. Buisson, "A posteriori multi-probe locality sensitive hashing," in MM '08, 2008, pp. 209–218.
[29] B. Kulis and K. Grauman, "Kernelized locality-sensitive hashing for scalable image search," in ICCV '09, 2009, pp. 2130–2137.
Qian Wang received the B.S. degree from Wuhan University, China, in 2003, the M.S. degree from Shanghai Institute of Microsystem and Information Technology, CAS, China, in 2006, and the Ph.D. degree from Illinois Institute of Technology, USA, in 2012, all in Electrical Engineering. He is a Professor with the School of Computer Science, Wuhan University. His research interests include wireless network security and privacy, cloud computing security, and applied cryptography. Meiqi He received the B.S. degree in Information Security from Wuhan University, China, in 2015. She is working towards the Ph.D. degree in the Computer Science Department at The University of Hong Kong. Her research interests include cloud computing, information security, applied cryptography and bioinformatics.
Minxin Du received the B.S. degree in Computer Science and Technology from Wuhan University, China, in 2015. He is working towards the master's degree in the Computer School at Wuhan University. His research interests include cloud computing, information security, and applied cryptography.
Sherman S. M. Chow is an assistant professor at The Chinese University of Hong Kong. He was a research fellow at the University of Waterloo after receiving his Ph.D. degree from New York University. His main interest is in applied cryptography. He has published in CCS, Eurocrypt, ITCS, and NDSS. He has served on the program committees of Asiacrypt, CCS, ESORICS, ICDCS, INFOCOM, PKC, and SACMAT. He is a program co-chair of AsiaCCS-SCC '15, ISC '14, and ProvSec '14, and an editor of IEEE Transactions on Information Forensics and Security, the International Journal of Information Security, and the Journal of Information Security and Applications. He received the Early Career Award 2013/14 from the Hong Kong Research Grants Council.

Russell W. F. Lai received the B.S. degree in Mathematics and the B.Eng. degree in Information Engineering from The Chinese University of Hong Kong. He is working towards the M.Phil. degree in the Department of Information Engineering at The Chinese University of Hong Kong. His research interest is in applied cryptography, with a focus on searchable encryption, obfuscation, and lattice-based cryptography.

Qin Zou received the B.E. degree in information science in 2004 and the Ph.D. degree in photogrammetry and remote sensing (computer vision) in 2012, both from Wuhan University. He is an assistant professor with the School of Computer Science, Wuhan University. His research activities involve computer vision, machine learning, ubiquitous computing, and intelligent transportation systems.
1545-5971 (c) 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.