IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 24, NO. 2, FEBRUARY 2012, p. 353
Privacy Preserving Decision Tree Learning Using Unrealized Data Sets

Pui K. Fong and Jens H. Weber-Jahnke, Senior Member, IEEE Computer Society

Abstract—Privacy preservation is important for machine learning and data mining, but measures designed to protect private information often result in a trade-off: reduced utility of the training samples. This paper introduces a privacy preserving approach that can be applied to decision tree learning without concomitant loss of accuracy. It describes an approach to preserving the privacy of collected data samples in cases where information from the sample database has been partially lost. This approach converts the original sample data sets into a group of unreal data sets, from which the original samples cannot be reconstructed without the entire group of unreal data sets. Meanwhile, an accurate decision tree can be built directly from those unreal data sets. This novel approach can be applied directly to the data storage as soon as the first sample is collected. The approach is compatible with other privacy preserving approaches, such as cryptography, for extra protection.

Index Terms—Classification, data mining, machine learning, security and privacy protection.
P.K. Fong is with the Department of Computer Science, University of Victoria, 1105-250 Ross Drive, New Westminster, BC V3L 0C2, Canada. E-mail: [email protected].
J.H. Weber-Jahnke is with the Department of Software Engineering, Engineering Lab Wing B210, University of Victoria, 3800 Finnerty Rd., Victoria, BC V8P 5C2, Canada. E-mail: [email protected].
Manuscript received 30 Oct. 2009; revised 26 Apr. 2010; accepted 2 July 2010; published online 28 Oct. 2010. Recommended for acceptance by E. Ferrari. Digital Object Identifier no. 10.1109/TKDE.2010.226.

1 INTRODUCTION

Data mining is widely used by researchers for science and business purposes. Data collected (referred to as "sample data sets" or "samples" in this paper) from individuals (referred to in this paper as "information providers") are important for decision making or pattern recognition. Therefore, privacy-preserving processes have been developed to sanitize private information from the samples while keeping their utility. A large body of research has been devoted to the protection of sensitive information when samples are given to third parties for processing or computing [1], [2], [3], [4], [5]. It is in the interest of research to disseminate samples to a wide audience of researchers, without making strong assumptions about their trustworthiness. Even if information collectors ensure that data are released only to third parties with nonmalicious intent (or if a privacy preserving approach can be applied before the data are released; see Fig. 1a), there is always the possibility that the information collectors may inadvertently disclose samples to malicious parties or that the samples are actively stolen from the collectors (see Fig. 1b). Samples may be leaked or stolen anytime during the storing process [6], [7] or while residing in storage [8], [9]. This paper focuses on preventing such attacks on third parties for the whole lifetime of the samples.

Contemporary research in privacy preserving data mining mainly falls into one of two categories: 1) perturbation and randomization-based approaches, and 2) secure multiparty computation (SMC)-based approaches [10]. SMC approaches employ cryptographic tools for collaborative data mining computation by multiple parties. Samples are distributed among the different parties, which take part in the information computation and communication process. SMC research focuses on protocol development [11] for protecting privacy among the involved parties [12] or on computation efficiency [13]; however, centralized processing of samples and storage privacy are out of the scope of SMC.

We introduce a new perturbation and randomization-based approach that protects centralized sample data sets utilized for decision tree data mining. Privacy preservation is applied to sanitize the samples prior to their release to third parties in order to mitigate the threat of their inadvertent disclosure or theft. In contrast to other sanitization methods, our approach does not affect the accuracy of data mining results. The decision tree can be built directly from the sanitized data sets, such that the originals do not need to be reconstructed. Moreover, this approach can be applied at any time during the data collection process, so that privacy protection is in effect even while samples are still being collected.

The following assumptions are made for the scope of this paper. First, as is the norm in data collection processes, a sufficiently large number of sample data sets have been collected to achieve significant data mining results covering the whole research target. Second, the number of data sets leaked to potential attackers constitutes a small portion of the entire sample database. Third, identity attributes (e.g., social insurance number) are not considered for the data mining process because such attributes are not meaningful for decision making. Fourth, all data collected are discretized; continuous values can be represented via ranged-value attributes for decision tree data mining.
The rest of this paper is structured as follows: the next section describes privacy preserving approaches that safeguard samples in storage. Section 3 introduces our new privacy preservation approach via data set complementation. Section 4 provides the decision-tree building process applied for the new approach. Sections 5 and 6 describe the evaluation and experimental results of this approach, and Section 7 discusses future research directions.

Fig. 1. Two forms of information release to a third party: (a) the data collector sends the preprocessed information (which was sanitized through extra techniques, such as cryptographic approaches or a statistical database) at will, or (b) hackers steal the original samples in storage without notifying the data collector.
2 RELATED WORK
In Privacy Preserving Data Mining: Models and Algorithms [14], Aggarwal and Yu classify privacy preserving data mining techniques into data modification, cryptographic, statistical, query auditing, and perturbation-based strategies. Statistical, query auditing, and most cryptographic techniques are beyond the focus of this paper. In this section, we explore privacy preservation techniques against storage privacy attacks.

Data modification techniques maintain privacy by modifying attribute values of the sample data sets. Essentially, data sets are modified by eliminating or unifying uncommon elements among all data sets. These similar data sets act as masks for the others within the group because they cannot be distinguished from the others; every data set is loosely linked with a certain number of information providers. k-anonymity [15] is a data modification approach that aims to protect the private information of the samples by generalizing attributes. k-anonymity trades privacy for utility. Further, this approach can be applied only after the entire data collection process has been completed.

Perturbation-based approaches attempt to achieve privacy protection by distorting information from the original data sets. The perturbed data sets still retain features of the originals, so that they can be used to perform data mining directly or indirectly via data reconstruction. Random substitution [16] is a perturbation approach that randomly substitutes the values of selected attributes to achieve privacy protection for those attributes, and then applies data reconstruction when these data sets are needed for data mining. Even though the privacy of the selected attributes can be protected, the utility is not recoverable, because the reconstructed data sets are random estimations of the originals.

Most cryptographic techniques are designed for secure multiparty computation, but only some of them are applicable to our scenario.
To preserve private information, samples are encrypted by a function, $f$ (or a set of functions), with a key, $k$ (or a set of keys); meanwhile, the original information can be reconstructed by applying a decryption function, $f^{-1}$ (or a set of functions), with the key, $k$, which raises security issues regarding the decryption function(s) and the key(s). Building meaningful decision trees requires the encrypted data either to be decrypted or to be interpreted in encrypted form. The (anti)monotone framework [17] is designed to preserve both the privacy and the utility of the sample data sets used for decision tree data mining. This method applies a series of encrypting functions to sanitize the samples and decrypts them correspondingly for building the decision tree. However, this approach raises security concerns about the encrypting and decrypting functions. In addition to protecting the input data of the data mining process, this approach also protects the output data, i.e., the generated decision tree. Still, this output data can normally be considered sanitized, because it constitutes an aggregated result and does not belong to any individual information provider. In addition, this approach does not work well for discrete-valued attributes.
3 DATA SET COMPLEMENTATION APPROACH
In the following sections, we will work with sets that can contain multiple instances of the same element, i.e., with multisets (bags) rather than with sets as defined in the classical set theory. We begin this section by defining fundamental concepts (Section 3.1). We then introduce our data unrealization algorithm in Section 3.2.
3.1 Universal Set and Data Set Complement

Definition 1. $T^U$, the universal set of data table $T$, is a set containing a single instance of all possible data sets in data table $T$.

Example 1. If data table $T$ associates with a tuple of attributes $\langle Wind, Play\rangle$ where $Wind = \{Strong, Weak\}$ and $Play = \{Yes, No\}$, then $T^U = \{\langle Strong, Yes\rangle, \langle Strong, No\rangle, \langle Weak, Yes\rangle, \langle Weak, No\rangle\}$.

Remark 1. If data table $T$ associates with a tuple of $m$ attributes $\langle a_1, a_2, \ldots, a_m\rangle$ where $a_i$ has $n_i$ possible values and $1 \le i \le m$, then $|T^U| = n_1 n_2 \cdots n_m$.

Definition 2. If $T_D$ is a subset of $T$ and $q$ is a positive integer, then the q-multiple-of $T_D$, denoted $qT_D$, is a set of data sets containing $q$ instances of each data set in $T_D$.

Remark 2. $|qT^U| = q|T^U|$.

Definition 3. If $k$ is a possible value of attribute $a$ and $l$ is a possible value of attribute $b$ in $T$, then $T_{(a=k)}$ denotes the subset of $T$ that contains all data sets whose attribute $a$ equals $k$. Similarly, $T_{(a=k)\wedge(b=l)}$ denotes the subset of $T$ that contains all data sets whose attribute $a$ equals $k$ and whose attribute $b$ equals $l$.

Theorem 1. If $k_i$ is a possible value of attribute $a_i$ in $T$, then $|qT^U_{(a_i=k_i)}| = (q \cdot n_1 n_2 \cdots n_m)/n_i$.

Proof. $|qT^U_{(a_i=k_i)}| = q \cdot |T^U_{(a_i=k_i)}| = q \cdot n_1 n_2 \cdots n_{i-1} n_{i+1} \cdots n_m = (q \cdot n_1 n_2 \cdots n_m)/n_i$. □

Corollary 1. $|qT^U_{(a_i=k_i)\wedge\cdots\wedge(a_j=k_j)}| = (q \cdot n_1 n_2 \cdots n_m)/(n_i n_j)$.

Definition 4. If $T_D$ is a subset of $T$, then the absolute complement of $T_D$, denoted $T_D^c$, is equal to $T^U - T_D$, and a q-absolute-complement of $T_D$, denoted $qT_D^c$, is equal to $qT^U - T_D$.
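Definitions 1-4 are plain multiset operations, so they can be sketched directly with Python's `collections.Counter`. The function names below are ours, for illustration only; the attribute domains follow Example 1:

```python
from collections import Counter
from itertools import product

def universal_set(domains):
    """T^U: one instance of every possible data set (Definition 1)."""
    return Counter(product(*domains))

def q_multiple(t_d, q):
    """qT_D: q instances of each data set in T_D (Definition 2)."""
    return Counter({t: q * c for t, c in t_d.items()})

def q_absolute_complement(t_d, domains, q=1):
    """qT_D^c = qT^U - T_D (Definition 4); q = 1 gives T_D^c."""
    return q_multiple(universal_set(domains), q) - t_d

domains = [["Strong", "Weak"], ["Yes", "No"]]   # Wind, Play (Example 1)
t_d = Counter([("Strong", "Yes"), ("Weak", "Yes"), ("Weak", "No")])

print(q_absolute_complement(t_d, domains))               # T_D^c = {<Strong, No>}
print(sum(q_absolute_complement(t_d, domains, 2).values()))  # 2T_D^c holds 5 data sets
```

`Counter` subtraction discards nonpositive counts, which is exactly the multiset difference these definitions require; the two printed results match Example 2 below.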
Example 2. With the conditions of Example 1, if $T_D = \{\langle Strong, Yes\rangle, \langle Weak, Yes\rangle, \langle Weak, No\rangle\}$, then $T_D^c = \{\langle Strong, No\rangle\}$ and $2T_D^c = \{\langle Strong, Yes\rangle, \langle Strong, No\rangle, \langle Weak, Yes\rangle, \langle Weak, No\rangle, \langle Strong, No\rangle\}$.

Lemma 1. If $T_{D1}$ and $T_{D2}$ are two sets of data sets that are subsets of $T$ and $T_{D2} \subseteq T_{D1}$, then any element contained in $T_{D2}$ is also contained in $T_{D1}$, such that $|T_{D1} - T_{D2}| = |T_{D1}| - |T_{D2}|$.

Corollary 2. If $k_i$ and $k_j$ are possible values of attributes $a_i$ and $a_j$ in $T$, respectively, then $|[T_{D1} - T_{D2}]_{(a_i=k_i)}| = |T_{D1(a_i=k_i)}| - |T_{D2(a_i=k_i)}|$ and $|[T_{D1} - T_{D2}]_{(a_i=k_i)\wedge\cdots\wedge(a_j=k_j)}| = |T_{D1(a_i=k_i)\wedge\cdots\wedge(a_j=k_j)}| - |T_{D2(a_i=k_i)\wedge\cdots\wedge(a_j=k_j)}|$.

3.2 Unrealized Training Set

Traditionally, a training set, $T_S$, is constructed by inserting sample data sets into a data table. However, the data set complementation approach presented in this paper requires an extra data table, $T^P$. $T^P$ is a perturbing set that generates unreal data sets, which are used for converting the sample data into an unrealized training set, $T'$. The algorithm for unrealizing the training set, $T_S$, is as follows:

Algorithm Unrealize-Training-Set($T_S$, $T^U$, $T'$, $T^P$)
Input: $T_S$, a set of input sample data sets; $T^U$, a universal set; $T'$, a set of output training data sets; $T^P$, a perturbing set
Output: $\langle T', T^P\rangle$
1. if $T_S$ is empty then return $\langle T', T^P\rangle$
2. $t \leftarrow$ a data set in $T_S$
3. if $t$ is not an element of $T^P$ or $T^P = \{t\}$ then
4.   $T^P \leftarrow T^P + T^U$
5. $T^P \leftarrow T^P - \{t\}$
6. $t' \leftarrow$ the most frequent data set in $T^P$
7. return Unrealize-Training-Set($T_S - \{t\}$, $T^U$, $T' + \{t'\}$, $T^P - \{t'\}$)

To unrealize the samples, $T_S$, we initialize both $T'$ and $T^P$ as empty sets, i.e., we invoke the above algorithm as Unrealize-Training-Set($T_S$, $T^U$, $\{\}$, $\{\}$). Figs. 2b and 2c show the tables that result from the unrealizing process of the samples in Fig. 2a. The resulting unrealized training set contains some dummy data sets excepting the ones in $T_S$. The elements of the resulting data sets are unreal individually, but meaningful when they are used together to calculate the information required by a modified ID3 algorithm, which will be covered in Section 4.

Fig. 2. Unrealizing the training samples in (a) by calling Unrealize-Training-Set($T_S$, $T^U$, $\{\}$, $\{\}$). The resulting tables $T^P$ and $T'$ are given in (b) and (c).

4 DECISION TREE GENERATION

In Section 3, we discussed an algorithm that generates an unrealized training set, $T'$, and a perturbing set, $T^P$, from the samples in $T_S$. In this section, we use data tables $T'$ and $T^P$ as a means to calculate the information content and information gain of $T_S$, such that a decision tree of the original data sets can be generated based on $T'$ and $T^P$.

Algorithm Generate-Tree($T_S$, attribs, default)
Input: $T_S$, the set of training data sets; attribs, the set of attributes; default, the default value for the goal predicate
Output: tree, a decision tree
1. if $T_S$ is empty then return default
2. default $\leftarrow$ Majority-Value($T_S$)
3. if $H_{a_i}(T_S) = 0$ then return default
4. else if attribs is empty then return default
5. else
6.   best $\leftarrow$ Choose-Attribute(attribs, $T_S$)
7.   tree $\leftarrow$ a new decision tree with root attribute best
8.   for each value $k_i$ of best do
9.     $T_{Si} \leftarrow \{$data sets in $T_S$ with best $= k_i\}$
10.    subtree $\leftarrow$ Generate-Tree($T_{Si}$, attribs $-$ best, default)
11.    connect tree and subtree with a branch labelled $k_i$
12. return tree

The well-known ID3 algorithm [18] shown above builds a decision tree by calling algorithm Choose-Attribute recursively. This algorithm selects a test attribute (with the smallest entropy) according to the information content of the training set $T_S$. The information entropy functions are given as

$H_{a_i}(T_S) = \sum_{e \in K_i} -(|T_{S(a_i=e)}|/|T_S|)\log_2(|T_{S(a_i=e)}|/|T_S|)$,   (1)

and

$H_{a_i}(T_S|a_j) = \sum_{f \in K_j} (|T_{S(a_j=f)}|/|T_S|)\, H_{a_i}(T_{S(a_j=f)})$,   (2)

where $K_i$ and $K_j$ are the sets of possible values for the decision attribute, $a_i$, and the test attribute, $a_j$, in $T_S$, respectively, and the algorithm Majority-Value retrieves the most frequent value of the decision attribute of $T_S$.

4.1 Information Entropy Determination

From the algorithm Unrealize-Training-Set, it is obvious that the size of $T_S$ is the same as the size of $T'$. Furthermore, all data sets in $(T' + T^P)$ are based on the data sets in $T^U$, excepting the ones in $T_S$, i.e., $T_S$ is the q-absolute-complement of $(T' + T^P)$ for some positive integer $q$. According to Theorem 2, the size of $qT^U$ can be computed from the sizes of $T'$ and $T^P$, with $|qT^U| = 2|T'| + |T^P|$. Therefore, the entropies of the original data sets, $T_S$, with any decision attribute and any test attribute, can be determined from the unreal training set, $T'$, and the perturbing set, $T^P$, as we will show with Theorems 3 and 4 below.

Definition 5. $G(x) = -x \log_2 x$.

Theorem 2. If $T_S = q[T' + T^P]^c$ and $|T'| = |T_S|$ for some positive integer $q$, then $|qT^U| = 2|T'| + |T^P|$.
Proof.

$T_S = q[T' + T^P]^c$
$\Rightarrow T_S = qT^U - (T' + T^P)$
$\Rightarrow |T_S| = |qT^U - (T' + T^P)|$
$\Rightarrow |T_S| = |qT^U| - |(T' + T^P)|$, since $(T' + T^P) \subseteq qT^U$
$\Rightarrow |T_S| = |qT^U| - |T'| - |T^P|$
$\Rightarrow |T'| = |qT^U| - |T'| - |T^P|$, since $|T'| = |T_S|$
$\Rightarrow |qT^U| = 2|T'| + |T^P|$. □

Corollary 3. If $k_i$ and $k_j$ are possible values of attributes $a_i$ and $a_j$ in $T$, respectively, then $|qT^U_{(a_i=k_i)}| = (2|T'| + |T^P|)/n_i$ and $|qT^U_{(a_i=k_i)\wedge\cdots\wedge(a_j=k_j)}| = (2|T'| + |T^P|)/(n_i n_j)$.

Theorem 3. If $T_S = q[T' + T^P]^c$ for some positive integer $q$ and $K_i$ is the set of possible values of attribute $a_i$ in $T$, then $H_{a_i}(T_S) = \sum_{e \in K_i} G(x/y)$ where $x = |qT^U_{(a_i=e)}| - |T'_{(a_i=e)}| - |T^P_{(a_i=e)}|$ and $y = |qT^U| - |T'| - |T^P|$.

Proof.

$H_{a_i}(T_S) = H_{a_i}(q[T' + T^P]^c)$
$= \sum_{e \in K_i} G\big(|q[T' + T^P]^c_{(a_i=e)}| / |q[T' + T^P]^c|\big)$
$= \sum_{e \in K_i} G\big((|qT^U_{(a_i=e)}| - |[T' + T^P]_{(a_i=e)}|) / (|qT^U| - |[T' + T^P]|)\big)$
$= \sum_{e \in K_i} G\big((|qT^U_{(a_i=e)}| - |T'_{(a_i=e)}| - |T^P_{(a_i=e)}|) / (|qT^U| - |T'| - |T^P|)\big)$. □

Corollary 4. If $k_j$ and $k_l$ are possible values of attributes $a_j$ and $a_l$ in $T$, respectively, then $H_{a_i}(T_{S(a_j=k_j)}) = \sum_{e \in K_i} G(x/y)$ and $H_{a_i}(T_{S(a_j=k_j)\wedge\cdots\wedge(a_l=k_l)}) = \sum_{e \in K_i} G(w/z)$ where

$x = |qT^U_{(a_i=e)\wedge(a_j=k_j)}| - |T'_{(a_i=e)\wedge(a_j=k_j)}| - |T^P_{(a_i=e)\wedge(a_j=k_j)}|$,
$y = |qT^U_{(a_j=k_j)}| - |T'_{(a_j=k_j)}| - |T^P_{(a_j=k_j)}|$,
$w = |qT^U_{(a_i=e)\wedge(a_j=k_j)\wedge\cdots\wedge(a_l=k_l)}| - |T'_{(a_i=e)\wedge(a_j=k_j)\wedge\cdots\wedge(a_l=k_l)}| - |T^P_{(a_i=e)\wedge(a_j=k_j)\wedge\cdots\wedge(a_l=k_l)}|$, and
$z = |qT^U_{(a_j=k_j)\wedge\cdots\wedge(a_l=k_l)}| - |T'_{(a_j=k_j)\wedge\cdots\wedge(a_l=k_l)}| - |T^P_{(a_j=k_j)\wedge\cdots\wedge(a_l=k_l)}|$.

Theorem 4. If $T_S = q[T' + T^P]^c$ for some positive integer $q$ and $K_j$ is the set of possible values of attribute $a_j$ in $T$, then $H_{a_i}(T_S|a_j) = \sum_{f \in K_j} \sum_{e \in K_i} (x/y)\,G(z/x)$ where

$x = |qT^U_{(a_j=f)}| - |T'_{(a_j=f)}| - |T^P_{(a_j=f)}|$,
$y = |qT^U| - |T'| - |T^P|$, and
$z = |qT^U_{(a_i=e)\wedge(a_j=f)}| - |T'_{(a_i=e)\wedge(a_j=f)}| - |T^P_{(a_i=e)\wedge(a_j=f)}|$.

Proof.

$H_{a_i}(T_S|a_j) = H_{a_i}(q[T' + T^P]^c | a_j)$
$= \sum_{f \in K_j} \dfrac{|q[T' + T^P]^c_{(a_j=f)}|}{|q[T' + T^P]^c|} \, H_{a_i}(q[T' + T^P]^c_{(a_j=f)})$
$= \sum_{f \in K_j} \dfrac{|qT^U_{(a_j=f)}| - |[T' + T^P]_{(a_j=f)}|}{|qT^U| - |[T' + T^P]|} \, H_{a_i}(q[T' + T^P]^c_{(a_j=f)})$
$= \sum_{f \in K_j} \dfrac{|qT^U_{(a_j=f)}| - |T'_{(a_j=f)}| - |T^P_{(a_j=f)}|}{|qT^U| - |T'| - |T^P|} \, H_{a_i}(T_{S(a_j=f)})$
$= \sum_{f \in K_j} \sum_{e \in K_i} \dfrac{|qT^U_{(a_j=f)}| - |T'_{(a_j=f)}| - |T^P_{(a_j=f)}|}{|qT^U| - |T'| - |T^P|} \; G\!\left(\dfrac{|qT^U_{(a_i=e)\wedge(a_j=f)}| - |T'_{(a_i=e)\wedge(a_j=f)}| - |T^P_{(a_i=e)\wedge(a_j=f)}|}{|qT^U_{(a_j=f)}| - |T'_{(a_j=f)}| - |T^P_{(a_j=f)}|}\right)$. □

Corollary 5. If $k_l$ and $k_m$ are possible values of attributes $a_l$ and $a_m$ in $T$, respectively, then $H_{a_i}(T_{S(a_l=k_l)}|a_j) = \sum_{f \in K_j} \sum_{e \in K_i} (x/y)\,G(z/x)$ and $H_{a_i}(T_{S(a_l=k_l)\wedge\cdots\wedge(a_m=k_m)}|a_j) = \sum_{f \in K_j} \sum_{e \in K_i} (u/v)\,G(w/u)$ where

$x = |qT^U_{(a_j=f)\wedge(a_l=k_l)}| - |T'_{(a_j=f)\wedge(a_l=k_l)}| - |T^P_{(a_j=f)\wedge(a_l=k_l)}|$,
$y = |qT^U_{(a_l=k_l)}| - |T'_{(a_l=k_l)}| - |T^P_{(a_l=k_l)}|$,
$z = |qT^U_{(a_i=e)\wedge(a_j=f)\wedge(a_l=k_l)}| - |T'_{(a_i=e)\wedge(a_j=f)\wedge(a_l=k_l)}| - |T^P_{(a_i=e)\wedge(a_j=f)\wedge(a_l=k_l)}|$,
$u = |qT^U_{(a_j=f)\wedge(a_l=k_l)\wedge\cdots\wedge(a_m=k_m)}| - |T'_{(a_j=f)\wedge(a_l=k_l)\wedge\cdots\wedge(a_m=k_m)}| - |T^P_{(a_j=f)\wedge(a_l=k_l)\wedge\cdots\wedge(a_m=k_m)}|$,
$v = |qT^U_{(a_l=k_l)\wedge\cdots\wedge(a_m=k_m)}| - |T'_{(a_l=k_l)\wedge\cdots\wedge(a_m=k_m)}| - |T^P_{(a_l=k_l)\wedge\cdots\wedge(a_m=k_m)}|$, and
$w = |qT^U_{(a_i=e)\wedge(a_j=f)\wedge(a_l=k_l)\wedge\cdots\wedge(a_m=k_m)}| - |T'_{(a_i=e)\wedge(a_j=f)\wedge(a_l=k_l)\wedge\cdots\wedge(a_m=k_m)}| - |T^P_{(a_i=e)\wedge(a_j=f)\wedge(a_l=k_l)\wedge\cdots\wedge(a_m=k_m)}|$.
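The chain from Theorems 2 and 3 can be exercised numerically: unrealize a tiny sample table with the algorithm of Section 3.2, then compute an entropy of $T_S$ using only $T'$ and $T^P$, and compare it against the entropy computed directly from $T_S$. This is an illustrative sketch (multisets as `Counter`s; the tie-breaking choice of "the most frequent data set" is arbitrary, which the derivations permit):

```python
from collections import Counter
from itertools import product
from math import log2

def unrealize(ts, tu):
    """Unrealize-Training-Set (Section 3.2), iteratively: returns (T', T_P)."""
    t0, tp = Counter(), Counter()
    for t in ts:
        if tp[t] == 0 or sum(tp.values()) == 1:  # t absent from T_P, or T_P = {t}
            tp.update(tu)                        # T_P <- T_P + T^U
        tp[t] -= 1                               # T_P <- T_P - {t}
        tp = +tp                                 # drop zero counts
        t_prime = tp.most_common(1)[0][0]        # most frequent data set in T_P
        t0[t_prime] += 1                         # T' <- T' + {t'}
        tp[t_prime] -= 1                         # T_P <- T_P - {t'}
        tp = +tp
    return t0, tp

def G(x):
    """Definition 5: G(x) = -x log2 x."""
    return -x * log2(x) if x > 0 else 0.0

def entropy_unrealized(t0, tp, domains, a):
    """H_a(T_S) from T' and T_P alone (Theorems 2 and 3, Corollary 3)."""
    q_tu = 2 * sum(t0.values()) + sum(tp.values())     # |qT^U|, by Theorem 2
    y = q_tu - sum(t0.values()) - sum(tp.values())
    h = 0.0
    for v in domains[a]:
        x = q_tu // len(domains[a])                    # |qT^U_(a=v)|, Corollary 3
        x -= sum(c for t, c in t0.items() if t[a] == v)
        x -= sum(c for t, c in tp.items() if t[a] == v)
        h += G(x / y)
    return h

def entropy_direct(ts, a):
    """Equation (1), computed straight from T_S."""
    counts = Counter(t[a] for t in ts)
    return sum(G(c / len(ts)) for c in counts.values())

domains = [["Strong", "Weak"], ["Yes", "No"]]          # Wind, Play
tu = Counter(product(*domains))
ts = [("Strong", "Yes"), ("Strong", "Yes"), ("Weak", "No")]
t0, tp = unrealize(ts, tu)

# T_S + T' + T_P reassembles qT^U, with |qT^U| = 2|T'| + |T_P|
q = (2 * sum(t0.values()) + sum(tp.values())) // sum(tu.values())
assert Counter(ts) + t0 + tp == Counter({t: q * c for t, c in tu.items()})
# Entropy of the decision attribute Play, with and without access to T_S
print(entropy_unrealized(t0, tp, domains, 1), entropy_direct(ts, 1))
```

The two printed entropies agree; that equality is what lets Generate-Tree′ below reproduce the tree of a conventional ID3 run without ever touching $T_S$.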
4.2 Modified Decision Tree Generation Algorithm

As the entropies of the original data sets, $T_S$, can be determined from the retrievable information (the contents of the unrealized training set, $T'$, and the perturbing set, $T^P$), the decision tree of $T_S$ can be generated by the following algorithm.

Algorithm Generate-Tree$'$(size, $T'$, $T^P$, attribs, default)
Input: size, the size of $qT^U$; $T'$, the set of unreal training data sets; $T^P$, the set of perturbing data sets; attribs, the set of attributes; default, the default value for the goal predicate
Output: tree, a decision tree
1. if $(T', T^P)$ is empty then return default
2. default $\leftarrow$ Minority-Value($T' + T^P$)
3. if $H_{a_i}(q[T' + T^P]^c) = 0$ then return default
4. else if attribs is empty then return default
5. else
6.   best $\leftarrow$ Choose-Attribute$'$(attribs, size, $(T', T^P)$)
7.   tree $\leftarrow$ a new decision tree with root attribute best
8.   size $\leftarrow$ size / number of possible values $k_i$ in best
9.   for each value $k_i$ of best do
10.    $T'_i \leftarrow \{$data sets in $T'$ with best $= k_i\}$
11.    $T^P_i \leftarrow \{$data sets in $T^P$ with best $= k_i\}$
12.    subtree $\leftarrow$ Generate-Tree$'$(size, $T'_i$, $T^P_i$, attribs $-$ best, default)
13.    connect tree and subtree with a branch labelled $k_i$
14. return tree

Similar to the traditional ID3 approach, algorithm Choose-Attribute$'$ selects the test attribute using the ID3 criteria based on the information entropies, i.e., it selects the attribute with the greatest information gain. Algorithm Minority-Value retrieves the least frequent value of the decision attribute of $(T' + T^P)$; this performs the same function as algorithm Majority-Value in the traditional ID3 approach, namely retrieving the most frequent value of the decision attribute of $T_S$. To generate the decision tree with $T'$, $T^P$, and $|qT^U|$ (which equals $2|T'| + |T^P|$), a possible value, $k_d$, of the decision attribute, $a_d$ (an element of $A$, the set of attributes in $T$), should be arbitrarily chosen, i.e., we call the algorithm Generate-Tree$'$($2|T'| + |T^P|$, $T'$, $T^P$, $A - a_d$, $k_d$). Fig. 4 shows the resulting decision tree of our new ID3 algorithm with the unrealized sample inputs shown in Figs. 2b and 2c. This decision tree is the same as the tree shown in Fig. 3, which was

Fig. 3. Illustration of the Generate-Tree process, applying the conventional ID3 approach to the original samples $T_S$.

Fig. 4. Illustration of the Generate-Tree$'$ process, applying the modified ID3 approach to the unrealized samples $(T' + T^P)$. At each step, the entropy values and resulting subtrees are exactly the same as the results of the traditional approach.

Fig. 5. Data sets in $qT^U$, where the data sets contained in the rectangles belong to $T_S$ and the rest belong to $[T' + T^P]$.

Fig. 6. Unrealizing the training samples in (a) with a dummy value Dummy added to the domain of attribute Wind. The resulting tables $T^P$ and $T'$ are shown in (b) and (c).

Fig. 7. (a) The case with the lowest variance distribution: $P_{loss}(T_L|T_S, T_S) = P_{loss}(T_L|[T' + T^P], T_S) = |T_L|(|T_S|/|T^U|)$. (b) The case with the highest variance distribution: $P_{loss}(T_L|T_S, T_S) = |T_L||T_S|$ and $P_{loss}(T_L|[T' + T^P], T_S) = 0$.
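A small but easily missed detail is the Minority-Value step: because $T_S$ is the q-absolute-complement of $(T' + T^P)$, the count of a decision value $v$ in $(T' + T^P)$ is $|qT^U|/n - count_{T_S}(v)$, so the least frequent value there is the most frequent one in $T_S$. A self-contained check (the concrete $T'$ and $T^P$ below are one valid unrealization of the three listed samples, worked out by hand; the helper names are ours):

```python
from collections import Counter

def majority_value(ts, a):
    """Most frequent value of decision attribute a in T_S (classic ID3)."""
    return Counter(t[a] for t in ts).most_common(1)[0][0]

def minority_value(t0, tp, a, values):
    """Least frequent value of decision attribute a in (T' + T_P)."""
    counts = Counter({v: 0 for v in values})
    for t, c in (t0 + tp).items():
        counts[t[a]] += c
    return min(counts, key=counts.get)

# Samples over (Wind, Play); the decision attribute Play is index 1.
ts = [("Strong", "Yes"), ("Strong", "Yes"), ("Weak", "No")]
t0 = Counter({("Weak", "Yes"): 2, ("Strong", "No"): 1})   # one valid T'
tp = Counter({("Weak", "No"): 1, ("Strong", "No"): 1})    # the matching T_P

print(majority_value(ts, 1))                       # most frequent Play in T_S
print(minority_value(t0, tp, 1, ["Yes", "No"]))    # same value, from T'+T_P only
```

Both calls return the same value, which is why line 2 of Generate-Tree′ can substitute Minority-Value for Majority-Value without changing the resulting tree.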
4.3 Data Set Reconstruction

Section 4.2 introduced a modified decision tree learning algorithm using the unrealized training set, $T'$, and the perturbing set, $T^P$. Alternatively, we could have reconstructed the original sample data sets, $T_S$, from $T'$ and $T^P$ (shown in Fig. 5), followed by an application of the conventional ID3 algorithm to generate the decision tree from $T_S$. The reconstruction process depends on the full information of $T'$ and $T^P$ (where $q = (2|T'| + |T^P|)/|T^U|$); reconstruction of parts of $T_S$ from parts of $T'$ and $T^P$ is not possible.

Fig. 8. Analytical results of experiments when the data set complementation approach is applied to 1) normally distributed samples, 2) evenly distributed samples, and 3) extremely unevenly distributed samples, in the cases of (i) without creating any dummy attribute values and (ii) when the dummy attribute technique is used to double the size of the sample domain.
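The reconstruction in Fig. 5 is a single multiset subtraction once the full tables are in hand; a minimal sketch (helper names are ours):

```python
from collections import Counter
from itertools import product

def reconstruct(t0, tp, domains):
    """Rebuild T_S from the complete T' and T_P: T_S = qT^U - (T' + T_P)."""
    tu = Counter(product(*domains))
    total = 2 * sum(t0.values()) + sum(tp.values())    # |qT^U|, by Theorem 2
    q, r = divmod(total, sum(tu.values()))
    assert r == 0, "sizes inconsistent with |qT^U| = 2|T'| + |T_P|"
    q_tu = Counter({t: q * c for t, c in tu.items()})
    return q_tu - (t0 + tp)

domains = [["Strong", "Weak"], ["Yes", "No"]]          # Wind, Play
t0 = Counter({("Weak", "Yes"): 2, ("Strong", "No"): 1})
tp = Counter({("Weak", "No"): 1, ("Strong", "No"): 1})
print(reconstruct(t0, tp, domains))                    # the original multiset T_S
```

As the section stresses, this only works with the complete $T'$ and $T^P$; subtracting a partial leak yields an unrelated multiset, which is exactly the privacy argument.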
4.4 Enhanced Protection with Dummy Values Dummy values can be added for any attribute such that the domain of the perturbed sample data sets will be expanded while the addition of dummy values will have no impact on TS . For example, we can expand the possible values of attribute Wind from {Strong, Weak} to {Dummy, Strong, Weak} where Dummy represents a dummy attribute value that plays no role in the data collection process (Fig. 6 shows the results of data set unrealization of TS with the dummy attribute value Dummy). In this way we can keep the same resulting decision tree (because the entropy of TS does not change) while arbitrarily expanding the size of T U . Meanwhile, all data sets in T 0 and T P , including the ones with a dummy attribute value, are needed for determining the entropies of q½T 0 þ T P c during the decision tree generation process.
5 THEORETICAL EVALUATION
This section provides a concise theoretical evaluation of our approach. For full details on our evaluation process, we refer to [19].
Fig. 9. Analytical results of experiments when the data set complementation approach is applied to some randomly picked samples, in the cases of (i) without creating any dummy attribute values and (ii) creating dummy attribute values in order to double the size of the sample domain.

Fig. 10. Samples with normal distribution. This set of samples is used in many data mining teaching materials.

5.1 Privacy Issues

Private information could potentially be disclosed by the leaking of some sanitized data sets, $T_L$ (a subset of the entire collected data table, $T_D$), to an unauthorized party if

1. the attacker is able to reconstruct an original sample, $t_S$, from $T_L$, or
2. $t_L$ (a data set in $T_L$) matches $t_S$ (a data set in $T_S$) by chance.

In the scope of this paper, $t_S$ is nonreconstructable because $|T_L|$ is much smaller than $|T' + T^P|$. Hence, we focus on the privacy loss via matching. The possible privacy loss of $T_S$ via $T_L \subseteq T_D$, denoted $P_{loss}(T_L|T_D, T_S)$, is defined as
$P_{loss}(T_L|T_D, T_S) = (|T_L|/|T_D|) \sum_{t_D \in T_D} \sum_{t_S \in T_S} M(t_D, t_S)$,   (3)
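Formula (3) is a plain matching count scaled by the leaked fraction; a direct transcription (helper name ours):

```python
def privacy_loss(tl, td, ts):
    """P_loss(T_L | T_D, T_S) as in (3); M(t1, t2) is 1 iff t1 == t2."""
    matches = sum(1 for t_d in td for t_s in ts if t_d == t_s)
    return (len(tl) / len(td)) * matches

# Boundary cases of (4), with T_D = T_S, |T^U| = 4, and |T_L| = 1:
even = [("Strong", "Yes"), ("Strong", "No"), ("Weak", "Yes"), ("Weak", "No")]
print(privacy_loss(even[:1], even, even))        # lower bound |T_L||T_S|/|T^U| = 1.0
skewed = [("Strong", "Yes")] * 4
print(privacy_loss(skewed[:1], skewed, skewed))  # upper bound |T_L||T_S| = 4.0
```

The two runs reproduce the extremes of (4): an even sample table sits at the lower bound, a fully repeated one at the upper bound.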
where $M(t_1, t_2)$ returns 1 if $t_1 = t_2$ and 0 otherwise. Without privacy preservation, the collected data sets are the original samples. Samples with a more even distribution (low variance) have less privacy loss, while data sets with high frequencies are at risk. The privacy loss of the original samples through matching ranges over

$|T_L|(|T_S|/|T^U|) \le P_{loss}(T_L|T_S, T_S) \le |T_L||T_S|$.   (4)

The boundary cases are shown in Fig. 7. The data set complementation approach solves the privacy issues of those uneven samples. This approach converts the original samples into some unrealized data sets $[T' + T^P]$, such that the range of privacy loss is decreased to

$0 \le P_{loss}(T_L|[T' + T^P], T_S) \le |T_L|(|T_S|/|T^U|)$.   (5)

Data set complementation favors samples with a high-variance distribution, especially when some data sets have zero counts. However, it does not provide significant improvement for the even cases. In Section 4.4, we showed that we can generate zero-count data sets and "uneven" the data set distribution by adding dummy attribute values. If we expand the size of the possible sample domain from $|T^U|$ to $R|T^U|$ where $R \ge 2$, then the range of privacy loss will be decreased to
Fig. 11. Samples with even distribution. All data sets have the same counts.

Fig. 12. Samples with extremely uneven distribution. All data sets have 0 counts except one data set, which has $|T_S|$ counts.
$0 \le P_{loss}(T_L|[T' + T^P], T_S) \le |T_L|\,\dfrac{|T_S|}{|T^U|} \cdot \dfrac{R|T^U| - 2\sqrt{(R|T^U| - 1)(R - 1)} + R - 2}{R^2(|T^U| - 1)}$.   (6)

For example, if we expand the possible values of attributes Outlook and Wind to {Dummy1, Sunny, Overcast, Rain} and {Dummy2, Strong, Weak}, respectively, then the size of the sample domain will be doubled ($R = 2$) and the maximum privacy loss will be decreased from $0.0833\,|T_L||T_S|$ to $0.0273\,|T_L||T_S|$. Adding dummy attribute values thus effectively improves the data set complementation approach; however, this technique requires a storage size of $cR|T^U| - |T_S|$, where $c$ is the count of the most frequent data set in $T_S$. The worst-case storage requirement equals $(R|T^U| - 1)|T_S|$.
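The worked example can be checked numerically against (6). Assuming our reading of the bound, the function below returns the coefficient of $|T_L||T_S|$:

```python
from math import sqrt

def max_matching_loss_factor(tu_size, r):
    """Upper-bound coefficient of |T_L||T_S| in (6), domain expanded R times."""
    return (1 / tu_size) * (
        r * tu_size - 2 * sqrt((r * tu_size - 1) * (r - 1)) + r - 2
    ) / (r ** 2 * (tu_size - 1))

# R = 1 (no dummies) recovers the bound of (5): 1/|T^U| = 0.0833... for |T^U| = 12
print(round(max_matching_loss_factor(12, 1), 4))   # 0.0833
# Doubling the domain (R = 2) as in the Outlook/Wind example
print(round(max_matching_loss_factor(12, 2), 4))   # 0.0273
```

That the formula collapses to $1/|T^U|$ at $R = 1$ is a useful sanity check: without dummy values, (6) agrees with (5).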
6 EXPERIMENTS
This section shows the experimental results of applying the data set complementation approach to

1. normally distributed samples (shown in Fig. 10),
2. evenly distributed samples (shown in Fig. 11),
3. extremely unevenly distributed samples (shown in Fig. 12), and
4. six sets of randomly picked samples,

where (i) was generated without creating any dummy attribute values and (ii) was generated by applying the dummy attribute technique to double the size of the sample domain. For the artificial samples (Tests 1-3), we study the output accuracy (the similarity between the decision tree generated by the regular method and by the new approach), the storage complexity (the space required to store the unrealized samples relative to the size of the original samples), and the privacy risk (the maximum, minimum, and average privacy loss if one unrealized data set is leaked). The unrealized samples, $T'$ and $T^P$, are shown in Figs. 13, 14, 15, 16, 17, and 18, and the analytical results are shown in Fig. 8. For the random samples, we study the storage complexity and the privacy risk of the loss of one data set and of some portions (10, 20, 30, 40, and 50 percent of randomly picked data sets) of the data sets. The analytical results are shown in Fig. 9.
Fig. 13. Unrealized samples $T'$ (left) and $T^P$ (right) derived from the samples in Fig. 10, without the dummy attribute values technique.
Fig. 15. Unrealized samples $T'$ (left) and $T^P$ (right) derived from the samples in Fig. 11, without the dummy attribute values technique.
6.1 Output Accuracy

In all cases, the decision tree(s) generated from the unrealized samples (by algorithm Generate-Tree$'$ described in Section 4.2) is the same as the decision tree(s), $Tree_{T_S}$, generated from the original samples by the regular method. This result agrees with the theoretical discussion in Section 4.
Fig. 14. Unrealized samples $T'$ (left) and $T^P$ (right) derived from the samples in Fig. 10, by inserting dummy attribute values into the attributes Temperature and Windy.
6.2 Storage Complexity

From the experiments, the storage requirement for the data set complementation approach increases from $|T_S|$ to $(2|T^U| - 1)|T_S|$, while the required storage may be doubled if the dummy attribute values technique is applied to double the sample domain. The best case happens when the samples are evenly distributed, as the storage requirement is then the same as for the originals. The worst case happens when
Fig. 17. Unrealized samples $T'$ (left) and $T^P$ (right) derived from the samples in Fig. 12, without the dummy attribute values technique. The attribute Count is added only to save printing space.

Fig. 16. Unrealized samples $T'$ (left) and $T^P$ (right) derived from the samples in Fig. 11, by inserting dummy attribute values into the attributes Temperature and Windy.
the samples are distributed extremely unevenly. Based on the randomly picked tests, the storage requirement for our approach is less than five times (without dummy values) and eight times (with dummy values, doubling the sample domain) that of the original samples.
6.3 Privacy Risk

Without the dummy attribute values technique, the average privacy loss per leaked unrealized data set is small, except in the even-distribution case (in which the unrealized samples are the same as the originals). By doubling the sample domain, the average privacy loss for a single leaked data set is zero, as the unrealized samples are no longer linked to any information provider. The randomly picked tests show that the data set complementation approach eliminates the privacy risk in most cases and always improves privacy protection significantly when dummy values are used.
7 CONCLUSION

We introduced a new privacy preserving approach, data set complementation, which preserves the utility of training data sets for decision tree learning. This approach converts the sample data sets T_S into a group of unreal data sets (T' + T^P) such that no original data set can be reconstructed if an unauthorized party steals some portion of (T' + T^P). Meanwhile, there remains only a low probability of randomly matching any original data set to the stolen data sets, T_L. The data set complementation approach ensures that the privacy loss via matching ranges from 0 to |T_L|(|T_S|/|T^U|), where T^U is the set of possible sample data sets. By creating dummy attribute values and expanding the sample domain to R times its size, where R >= 2, the privacy loss via matching decreases to the range between 0 and |T_L||T_S|[R|T^U| - 2[(R|T^U| - 1)(R - 1)]^{1/2} + R - 2]/[R^2 |T^U|(|T^U| - 1)].

Privacy preservation via data set complementation fails if all training data sets are leaked, because the data set reconstruction algorithm is generic. Further research is therefore required to overcome this limitation. As it is straightforward to apply a cryptographic privacy preserving approach, such as the (anti)monotone framework, alongside data set complementation, this direction for future research could address the above limitation.

This paper covers the application of the new privacy preserving approach to the ID3 decision tree learning algorithm and discrete-valued attributes only. Future research should extend its scope to other algorithms, such as C4.5 and C5.0, and to data mining methods with mixed discrete- and continuous-valued attributes. Furthermore, the data set complementation approach expands the sample storage size (in the worst case, the storage size equals (2|T^U| - 1)|T_S|); further studies are therefore needed into optimizing 1) the storage size of the unrealized samples and 2) the processing time when generating a decision tree from those samples.
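The two privacy-loss bounds stated in the conclusion can be written out as a small numeric sketch. The function names and example sizes are ours; the closed-form expressions are reconstructed from the paper's bounds and should be read under that assumption.

```python
# Hedged sketch of the privacy-loss upper bounds via matching.
# Without dummy values:  |T_L| * |T_S| / |T^U|.
# With dummy values scaling the domain R times (R >= 2):
#   |T_L||T_S| * [R|T^U| - 2*sqrt((R|T^U| - 1)(R - 1)) + R - 2]
#              / (R^2 * |T^U| * (|T^U| - 1)).
# Function names and the example values are illustrative assumptions.
from math import sqrt

def loss_bound(t_l: int, t_s: int, t_u: int) -> float:
    """Upper bound on privacy loss from |T_L| leaked data sets."""
    return t_l * t_s / t_u

def loss_bound_dummy(t_l: int, t_s: int, t_u: int, r: int) -> float:
    """Upper bound when dummy attribute values expand the domain R times."""
    assert r >= 2, "the dummy-value bound assumes R >= 2"
    numerator = r * t_u - 2 * sqrt((r * t_u - 1) * (r - 1)) + r - 2
    return t_l * t_s * numerator / (r * r * t_u * (t_u - 1))

# One leaked data set, 24 samples, domain of 36 possible data sets:
plain = loss_bound(1, 24, 36)
scaled = loss_bound_dummy(1, 24, 36, r=2)
print(plain, scaled)  # doubling the domain shrinks the bound
```

Evaluating the sketch for a sample configuration shows the qualitative claim of Section 6.3: the dummy-value bound is strictly smaller than the plain bound.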
Fig. 18. Unrealized samples T' (left) and T^P (right) derived from the samples in Fig. 11, by inserting dummy attribute values into the attributes Temperature and Windy. The attribute Count is added only to save printing space.

REFERENCES
[1] S. Ajmani, R. Morris, and B. Liskov, "A Trusted Third-Party Computation Service," Technical Report MIT-LCS-TR-847, MIT, 2001.
[2] S.L. Wang and A. Jafari, "Hiding Sensitive Predictive Association Rules," Proc. IEEE Int'l Conf. Systems, Man and Cybernetics, pp. 164-169, 2005.
[3] R. Agrawal and R. Srikant, "Privacy Preserving Data Mining," Proc. ACM SIGMOD Conf. Management of Data (SIGMOD '00), pp. 439-450, May 2000.
[4] Q. Ma and P. Deng, "Secure Multi-Party Protocols for Privacy Preserving Data Mining," Proc. Third Int'l Conf. Wireless Algorithms, Systems, and Applications (WASA '08), pp. 526-537, 2008.
[5] J. Gitanjali, J. Indumathi, N.C. Iyengar, and N. Sriman, "A Pristine Clean Cabalistic Foruity Strategize Based Approach for Incremental Data Stream Privacy Preserving Data Mining," Proc. IEEE Second Int'l Advance Computing Conf. (IACC), pp. 410-415, 2010.
[6] N. Lomas, "Data on 84,000 United Kingdom Prisoners Is Lost," http://news.cnet.com/8301-1009_3-10024550-83.html, Aug. 2008.
[7] BBC News, "Brown Apologises for Records Loss," http://news.bbc.co.uk/2/hi/uk_news/politics/7104945.stm, Nov. 2007.
[8] D. Kaplan, "Hackers Steal 22,000 Social Security Numbers from Univ. of Missouri Database," http://www.scmagazineus.com/Hackers-steal-22000-Social-Security-numbers-fromUniv.-of-Missouri-database/article/34964/, May 2007.
[9] D. Goodin, "Hackers Infiltrate TD Ameritrade Client Database," http://www.channelregister.co.uk/2007/09/15/ameritrade_database_burgled/, Sept. 2007.
[10] L. Liu, M. Kantarcioglu, and B. Thuraisingham, "Privacy Preserving Decision Tree Mining from Perturbed Data," Proc. 42nd Hawaii Int'l Conf. System Sciences (HICSS '09), 2009.
[11] Y. Zhu, L. Huang, W. Yang, D. Li, Y. Luo, and F. Dong, "Three New Approaches to Privacy-Preserving Add to Multiply Protocol and Its Application," Proc. Second Int'l Workshop Knowledge Discovery and Data Mining (WKDD '09), pp. 554-558, 2009.
[12] J. Vaidya and C. Clifton, "Privacy Preserving Association Rule Mining in Vertically Partitioned Data," Proc. Eighth ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD '02), pp. 23-26, July 2002.
[13] M. Shaneck and Y. Kim, "Efficient Cryptographic Primitives for Private Data Mining," Proc. 43rd Hawaii Int'l Conf. System Sciences (HICSS), pp. 1-9, 2010.
[14] C. Aggarwal and P. Yu, Privacy-Preserving Data Mining: Models and Algorithms. Springer, 2008.
[15] L. Sweeney, "k-Anonymity: A Model for Protecting Privacy," Int'l J. Uncertainty, Fuzziness and Knowledge-Based Systems, vol. 10, pp. 557-570, May 2002.
[16] J. Dowd, S. Xu, and W. Zhang, "Privacy-Preserving Decision Tree Mining Based on Random Substitutions," Proc. Int'l Conf. Emerging Trends in Information and Comm. Security (ETRICS '06), pp. 145-159, 2006.
[17] S. Bu, L. Lakshmanan, R. Ng, and G. Ramesh, "Preservation of Patterns and Input-Output Privacy," Proc. IEEE 23rd Int'l Conf. Data Eng., pp. 696-705, Apr. 2007.
[18] S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach, second ed. Prentice-Hall, 2002.
[19] P.K. Fong, "Privacy Preservation for Training Data Sets in Database: Application to Decision Tree Learning," master's thesis, Dept. of Computer Science, Univ. of Victoria, 2008.

Pui K. Fong received the master's degree in computer science from the University of Victoria, Canada, in 2008. He is currently working toward the PhD degree in the Department of Computer Science, University of Victoria, Canada. His research interests include machine learning, data mining, data security, and natural language processing. He is experienced in developing machine learning, data mining, and data analysis applications in areas such as astronomy, chemistry, biochemistry, medicine, natural language processing, and engineering.
Jens H. Weber-Jahnke received the MSc degree in software engineering from the University of Dortmund, Germany, and the PhD degree in computer science from the University of Paderborn, Germany. He is an associate professor in the Department of Computer Science at the University of Victoria, Canada. His research interests include security engineering, software and data engineering, and health informatics. He is a senior member of the IEEE Computer Society and licensed as a professional engineer in the Province of British Columbia.