Privacy Preservation Using Data Mining

Abstract: Many organizations collect large amounts of data about their clients and customers. Historically, such data was used only for record keeping, but data mining can extract valuable knowledge from it, and organizations can obtain better results by pooling their data together. However, the collected data may contain sensitive or private information about the organizations or their customers, and privacy concerns are exacerbated when data is shared between multiple organizations. Distributed data mining is concerned with the computation of models from data that is distributed among multiple participants. Privacy-preserving distributed data mining seeks to allow the cooperative computation of such models without the cooperating parties revealing any of their individual data items. Our project makes two contributions to privacy-preserving data mining. First, we introduce the concept of arbitrarily partitioned data, which is a generalization of both horizontally and vertically partitioned data. Second, we provide an efficient privacy-preserving protocol for k-means clustering in the setting of arbitrarily partitioned data. We present a simple I/O-efficient k-clustering algorithm that was designed with the goal of enabling a privacy-preserving version of the algorithm. Our experiments show that this algorithm produces cluster centers that are, on average, more accurate than those produced by the well-known iterative k-means algorithm. We use our new algorithm as the basis for a communication-efficient privacy-preserving k-clustering protocol for databases that are horizontally partitioned between two parties. Unlike existing privacy-preserving protocols based on the k-means algorithm, this protocol does not reveal intermediate candidate cluster centers. In this work, we also propose methods for constructing the dissimilarity matrix of objects.

Existing System: The earlier techniques are data perturbation techniques for privacy-preserving classification model construction on centralized data, and techniques for privacy-preserving association rule mining in distributed environments. Techniques from secure multiparty computation form one approach to privacy-preserving data mining. Yao's general protocol for secure circuit evaluation can be used to solve any two-party privacy-preserving distributed data mining problem. However, since data mining usually involves millions or billions of data items, the communication cost of this protocol renders it impractical for these purposes. The k-means and Recluster algorithms are used in turn for efficiency.

Proposed System: In this project, we propose a privacy-preserving clustering technique on horizontally partitioned data. Our method is based on constructing the dissimilarity matrix of objects from different sites in a privacy-preserving manner, which can be used for privacy-preserving clustering as well as for database joins, record linkage and other operations that require pair-wise comparison of individual private data objects distributed over multiple sites. We handle both numeric and alphanumeric valued attributes. Our main contributions can be summarized as follows:

• Introduction of a protocol for secure multi-party computation of a dissimilarity matrix over horizontally partitioned data,

• Proof of security of the protocol and analysis of communication costs,

• Secure comparison methods for numeric, alphanumeric and categorical attributes,

• Application to privacy-preserving clustering.
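To make the dissimilarity matrix concrete before the protocol itself is described, the following is a minimal, non-private sketch (in Java, the project's implementation language) that computes pairwise Euclidean distances between numeric records held at a single site. The class and method names are illustrative and not taken from the project code; the proposed protocol computes the same kind of matrix without the sites revealing their records to one another.

// Illustrative sketch: plain (non-private) dissimilarity matrix for numeric records.
public class DissimilarityMatrix {
    // Euclidean distance between two records with the same attributes.
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Builds the symmetric n x n matrix of pairwise dissimilarities.
    static double[][] build(double[][] records) {
        int n = records.length;
        double[][] m = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                m[i][j] = distance(records[i], records[j]);
                m[j][i] = m[i][j];
            }
        }
        return m;
    }

    public static void main(String[] args) {
        double[][] data = { {1.0, 2.0}, {1.5, 1.8}, {8.0, 8.0} };
        double[][] m = build(data);
        System.out.println("d(0,2) = " + m[0][2]);
    }
}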

Data Flow Diagram:

K-Means flow: Dataset → Clustering (K-Means) → Clustered dataset.

Privacy-preserving flow: Dataset → Clustering → Privacy preservation → Preserved dataset → Reclustering using the privacy-preserving protocol.

Modules:
1. Data collection
2. Formation of clusters
3. Privacy preservation
4. Comparison evaluation of algorithms

1. Data Collection and Mining: Data collection is an important part of the project. The dataset is prepared by integrating the organizations' details. Such data was initially used only for record keeping. These large collections of data can be "mined" for knowledge that could improve the performance of the organization. While much data mining occurs on data within a single organization, it is quite common to use data from multiple sources in order to yield more precise or useful knowledge. However, privacy and secrecy considerations can prohibit organizations from being willing or able to share their data with each other. Therefore we apply a privacy-preserving data mining technique.

2. Formation of Clusters: The k-means clustering algorithm is used for cluster formation. K-means clustering and fuzzy clustering differ from hierarchical clustering and diversity selection in that the number of clusters, K, needs to be determined at the outset. The goal is to divide the objects into K clusters such that some metric relative to the centroids of the clusters is minimized. If the clustering is over binary objects, medoids need to be used instead of centroids. A medoid is simply a bit string that minimizes the sum of distances to all objects in the cluster.

Various metrics relative to the centroids that can be minimized include:

• Maximum distance to its centroid for any object.

• Sum of the average distance to the centroids over all clusters.

• Sum of the variance over all clusters.

• Total distance between all objects and their centroids.
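As an illustration of the iterative step described in this module, here is a minimal single-machine sketch of one k-means pass: each object is assigned to its nearest centroid and each centroid is then recomputed as the mean of its cluster. This is not the privacy-preserving version, and the names are illustrative rather than taken from the project code.

// Illustrative sketch: one assignment/update pass of k-means on numeric data.
public class KMeansStep {
    static double dist2(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; s += d * d; }
        return s;
    }

    // Assigns each point to its nearest centroid and returns the updated centroids.
    static double[][] step(double[][] points, double[][] centroids) {
        int k = centroids.length, dim = points[0].length;
        double[][] sums = new double[k][dim];
        int[] counts = new int[k];
        for (double[] p : points) {
            int best = 0;
            for (int c = 1; c < k; c++)
                if (dist2(p, centroids[c]) < dist2(p, centroids[best])) best = c;
            counts[best]++;
            for (int d = 0; d < dim; d++) sums[best][d] += p[d];
        }
        double[][] updated = new double[k][dim];
        for (int c = 0; c < k; c++)
            for (int d = 0; d < dim; d++)
                // Keep the old centroid if a cluster received no points.
                updated[c][d] = counts[c] == 0 ? centroids[c][d] : sums[c][d] / counts[c];
        return updated;
    }

    public static void main(String[] args) {
        double[][] pts = { {1, 1}, {1.2, 0.8}, {9, 9}, {8.8, 9.1} };
        double[][] cents = { {0, 0}, {10, 10} };
        double[][] next = step(pts, cents);
        System.out.println(next[0][0] + "," + next[0][1] + " / " + next[1][0] + "," + next[1][1]);
    }
}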

3. Privacy Preservation: Privacy-preserving data mining solutions have been presented with respect to both horizontally and vertically partitioned databases, in which either different data objects with the same attributes are owned by each party, or different attributes of the same data objects are owned by each party, respectively. We introduce the notion of arbitrarily partitioned data, which generalizes both horizontally and vertically partitioned data. In arbitrarily partitioned data, different attributes of different items can be owned by either party. Although extremely "patchworked" data is unlikely in practice, one advantage of considering arbitrarily partitioned data is that protocols in this model apply both to horizontally and vertically partitioned data, as well as to hybrid partitionings that are mostly, but not completely, vertical or horizontal.

4. Comparison Evaluation of Algorithms: In this project, we present a simple deterministic algorithm, Recluster, for I/O-efficient k-clustering. This algorithm, which was explicitly designed with conversion to a privacy-preserving version in mind, examines each data item only once and uses only sequential access to the data. For fixed k, Recluster runs in O(n) time and uses O(log n) space. Our experimental results show that Recluster is, on average, more accurate in identifying cluster centers than the k-means clustering algorithm. Although there are other clustering algorithms that improve on the k-means algorithm, this is the first for which an efficient cryptographic privacy-preserving version has been demonstrated.
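Returning to the partitioning models described under module 3, the difference between horizontal, vertical and arbitrary partitioning can be pictured with a simple ownership map. The sketch below is purely illustrative (the names are assumptions, not project code): it marks which of the two parties holds each attribute of each record and checks for the two special cases.

// Illustrative sketch: recording which party owns each (record, attribute) cell.
public class PartitionMap {
    // owner[i][j] == 1 or 2: the party holding attribute j of record i.
    int[][] owner;

    PartitionMap(int[][] owner) { this.owner = owner; }

    // Horizontal partitioning: every attribute of a record belongs to the same party.
    boolean isHorizontal() {
        for (int[] row : owner)
            for (int cell : row)
                if (cell != row[0]) return false;
        return true;
    }

    // Vertical partitioning: every record's value of an attribute belongs to the same party.
    boolean isVertical() {
        for (int j = 0; j < owner[0].length; j++)
            for (int[] row : owner)
                if (row[j] != owner[0][j]) return false;
        return true;
    }

    public static void main(String[] args) {
        int[][] arbitrary = { {1, 2, 1}, {2, 2, 1} };  // "patchworked" ownership
        PartitionMap m = new PartitionMap(arbitrary);
        System.out.println(m.isHorizontal() + " " + m.isVertical()); // prints: false false
    }
}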

Hardware Constraints

Processor Type : Pentium IV
Speed          : 2.4 GHz
RAM            : 256 MB
Hard Disk      : 80 GB

Software Constraints

Operating System    : Windows XP
Programming Package : Java, Swing
Tools               : Eclipse, Weka (machine learning software)
SDK                 : JDK 1.5.0

System Architecture: The architecture diagram shows two data mining systems. Data Mining System 1 performs data collection and clustering using the K-means algorithm (choosing the number of clusters, computing centroids, measuring the distance from each object to the centroids, and grouping objects based on minimum distance) to produce the clustered data set. The clustered data set is then reclustered with privacy preserved using the protocol, in cooperation with Data Mining System 2.

Problem Definition: Data mining usually involves millions or billions of data items, so the communication cost of a general secure-computation protocol such as Yao's renders it impractical for these purposes. The k-means and Recluster algorithms are therefore used in turn for efficiency. In this project, we propose a privacy-preserving clustering technique on horizontally partitioned data. Our method is based on constructing the dissimilarity matrix of objects from different sites in a privacy-preserving manner, which can be used for privacy-preserving clustering as well as for database joins, record linkage and other operations that require pair-wise comparison of individual private data objects distributed over multiple sites.

Level 0: System 1 → Data Collection.

Level 1: Data Set → Data Collection → Clustering.

Level 2: Data Set → Clustering → Number of Clusters → Finding Centroid → Grouping by minimum distance.

Level 3: Data Set → Clustering → Centroid → Collection of clustered data → Applying privacy to the clustered data sets.

Level 4: Data Set → Clustered data → Collection of clustered data → Privacy Preservation → Encryption & Decryption → System 2.

Data Mining Concepts: Generally, data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information - information that can be used to increase revenue, cut costs, or both. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases.

Although data mining is a relatively new term, the technology is not. Companies have used powerful computers to sift through volumes of supermarket scanner data and analyze market research reports for years. However, continuous innovations in computer processing power, disk storage, and statistical software are dramatically increasing the accuracy of analysis while driving down the cost.

For example, one Midwest grocery chain used the data mining capacity of Oracle software to analyze local buying patterns. They discovered that when men bought diapers on Thursdays and Saturdays, they also tended to buy beer. Further analysis showed that these shoppers typically did their weekly grocery shopping on Saturdays; on Thursdays, however, they bought only a few items. The retailer concluded that they purchased the beer to have it available for the upcoming weekend. The grocery chain could use this newly discovered information in various ways to increase revenue, for example by moving the beer display closer to the diaper display.

Data, Information, and Knowledge

Data: Data are any facts, numbers, or text that can be processed by a computer. Today, organizations are accumulating vast and growing amounts of data in different formats and different databases. This includes:

• operational or transactional data, such as sales, cost, inventory, payroll, and accounting

• nonoperational data, such as industry sales, forecast data, and macroeconomic data

• meta data - data about the data itself, such as logical database design or data dictionary definitions

Information: The patterns, associations, or relationships among all this data can provide information. For example, analysis of retail point-of-sale transaction data can yield information on which products are selling and when.

Knowledge: Information can be converted into knowledge about historical patterns and future trends. For example, summary information on retail supermarket sales can be analyzed in light of promotional efforts to provide knowledge of consumer buying behavior. Thus, a manufacturer or retailer could determine which items are most susceptible to promotional efforts.

How does data mining work? While large-scale information technology has been evolving separate transaction and analytical systems, data mining provides the link between the two. Data mining software analyzes relationships and patterns in stored transaction data based on open-ended user queries. Several types of analytical software are available: statistical, machine learning, and neural networks. Generally, any of four types of relationships are sought:

• Classes: Stored data is used to locate data in predetermined groups. For example, a restaurant chain could mine customer purchase data to determine when customers visit and what they typically order. This information could be used to increase traffic by having daily specials.

• Clusters: Data items are grouped according to logical relationships or consumer preferences. For example, data can be mined to identify market segments or consumer affinities.

• Associations: Data can be mined to identify associations. The beer-diaper example is an example of associative mining.

• Sequential patterns: Data is mined to anticipate behavior patterns and trends. For example, an outdoor equipment retailer could predict the likelihood of a backpack being purchased based on a consumer's purchase of sleeping bags and hiking shoes.

Data mining consists of five major elements:

• Extract, transform, and load transaction data onto the data warehouse system.

• Store and manage the data in a multidimensional database system.

• Provide data access to business analysts and information technology professionals.

• Analyze the data by application software.

• Present the data in a useful format, such as a graph or table.

Different levels of analysis are available:

• Artificial neural networks: Non-linear predictive models that learn through training and resemble biological neural networks in structure.

• Genetic algorithms: Optimization techniques that use processes such as genetic combination, mutation, and natural selection in a design based on the concepts of natural evolution.

• Decision trees: Tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset. Specific decision tree methods include Classification and Regression Trees (CART) and Chi Square Automatic Interaction Detection (CHAID). CART and CHAID are decision tree techniques used for classification of a dataset. They provide a set of rules that can be applied to a new (unclassified) dataset to predict which records will have a given outcome. CART segments a dataset by creating 2-way splits, while CHAID segments using chi-square tests to create multi-way splits. CART typically requires less data preparation than CHAID.

• Nearest neighbor method: A technique that classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset (where k ≥ 1). Sometimes called the k-nearest neighbor technique; a small sketch follows this list.

• Rule induction: The extraction of useful if-then rules from data based on statistical significance.

• Data visualization: The visual interpretation of complex relationships in multidimensional data. Graphics tools are used to illustrate data relationships.
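As the small sketch promised in the nearest-neighbor item above, the following illustrative code classifies a record by majority vote among the k most similar records in a historical dataset, assuming numeric attributes, Euclidean distance as the similarity measure, and binary class labels; none of these choices are prescribed by the text.

// Illustrative sketch: k-nearest-neighbor classification with Euclidean distance.
public class NearestNeighbor {
    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; s += d * d; }
        return Math.sqrt(s);
    }

    // Returns the majority class (0 or 1) among the k records closest to query.
    static int classify(double[][] data, int[] labels, double[] query, int k) {
        boolean[] used = new boolean[data.length];
        int votesFor1 = 0;
        for (int picked = 0; picked < k; picked++) {       // select the k nearest, one at a time
            int best = -1;
            for (int i = 0; i < data.length; i++)
                if (!used[i] && (best < 0 || dist(data[i], query) < dist(data[best], query)))
                    best = i;
            used[best] = true;
            if (labels[best] == 1) votesFor1++;
        }
        return votesFor1 * 2 > k ? 1 : 0;
    }

    public static void main(String[] args) {
        double[][] data = { {1, 1}, {1, 2}, {8, 8}, {9, 8} };
        int[] labels = { 0, 0, 1, 1 };
        System.out.println(classify(data, labels, new double[] {7.5, 8.2}, 3)); // prints 1
    }
}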

JAVA: Java is a platform-independent, object-oriented programming language developed initially by James Gosling and colleagues at Sun Microsystems. The language, initially called Oak (named after the oak trees outside Gosling's office), was intended to replace C++, although its feature set better resembles that of Objective-C.

Software Used: Java began as a client-side, platform-independent programming language that enabled stand-alone Java applications and applets. The numerous benefits of Java resulted in an explosion in its usage in back-end, server-side enterprise systems. The Java Development Kit (JDK), which was the original standard platform defined by Sun, was soon supplemented by a collection of enterprise APIs. The proliferation of enterprise APIs, often developed by several different groups, resulted in divergence of APIs and caused concern among the Java developer community. Java byte code can execute on the server instead of, or in addition to, the client, enabling you to build traditional client/server applications and modern thin-client Web applications. Two key server-side Java technologies are servlets and JavaServer Pages. Servlets are protocol- and platform-independent server-side components which extend the functionality of a Web server. JavaServer Pages (JSPs) extend the functionality of servlets by allowing Java servlet code to be embedded in an HTML file.

Features of Java:

• Platform Independence
o The Write-Once-Run-Anywhere ideal has not been fully achieved (tuning for different platforms is usually required), but Java comes closer to it than other languages.

• Object Oriented
o Object oriented throughout - no coding outside of class definitions, including main().
o An extensive class library is available in the core language packages.

• Compiler/Interpreter Combo
o Code is compiled to bytecodes that are interpreted by a Java virtual machine (JVM).
o This provides portability to any machine for which a virtual machine has been written.
o The two steps of compilation and interpretation allow for extensive code checking and improved security.

• Robust
o Exception handling is built in, type checking is strong (all data must be declared with an explicit type), and local variables must be initialized.
o Several dangerous features of C and C++ are eliminated: no memory pointers, no preprocessor, and array index limit checking.

• Automatic Memory Management
o Automatic garbage collection - memory management is handled by the JVM.

• Security
o No memory pointers.
o Programs run inside the virtual machine sandbox.
o Array index limit checking.
o Code pathologies are reduced by:
 Byte code verifier - checks classes after loading.
 Class loader - confines objects to unique namespaces; prevents loading a hacked "java.lang.SecurityManager" class, for example.
 Security manager - determines what resources a class can access, such as reading and writing to the local disk.

• Dynamic Binding
o The linking of data and methods to where they are located is done at runtime.
o New classes can be loaded while a program is running; linking is done on the fly.
o Even if libraries are recompiled, there is no need to recompile code that uses classes in those libraries. This differs from C++, which uses static binding; static binding can result in fragile classes when linked code is changed and memory pointers then point to the wrong addresses.

• Good Performance
o Interpretation of byte codes slowed performance in early versions, but advanced virtual machines with adaptive and just-in-time compilation and other techniques now typically provide performance from 50% to 100% of the speed of C++ programs.

• Threading
o Lightweight processes, called threads, can easily be spun off to perform multiprocessing.
o Can take advantage of multiprocessors where available.
o Great for multimedia displays.

• Built-in Networking
o Java was designed with networking in mind and comes with many classes to develop sophisticated Internet communications (a small socket sketch follows this list).
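The networking support referred to in the last item is provided by classes such as java.net.ServerSocket and java.net.Socket. The fragment below is a minimal, illustrative example of sending one line of text over a TCP connection; it is not the project's FTPServer/FTPClient code, and the port number is arbitrary.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;

// Illustrative sketch: a one-shot TCP server and client using java.net sockets.
public class TinyTransfer {
    public static void main(String[] args) throws Exception {
        final ServerSocket server = new ServerSocket(5050);   // illustrative port

        // Server side: accept one connection and print the line it receives.
        Thread serverThread = new Thread() {
            public void run() {
                try {
                    Socket s = server.accept();
                    BufferedReader in = new BufferedReader(new InputStreamReader(s.getInputStream()));
                    System.out.println("server received: " + in.readLine());
                    s.close();
                } catch (Exception e) { e.printStackTrace(); }
            }
        };
        serverThread.start();

        // Client side: connect and send one line.
        Socket client = new Socket("localhost", 5050);
        PrintWriter out = new PrintWriter(client.getOutputStream(), true);
        out.println("clustered data would be streamed here");
        client.close();

        serverThread.join();
        server.close();
    }
}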

TEST PROCEDURE

SYSTEM TESTING: Testing is performed to identify errors and is used for quality assurance. Testing is an integral part of the entire development and maintenance process. The goal of testing during this phase is to verify that the specification has been accurately and completely incorporated into the design, as well as to ensure the correctness of the design itself. For example, any logic fault in the design must be detected before coding commences; otherwise the cost of fixing the fault will be considerably higher. Detection of design faults can be achieved by means of inspections as well as walkthroughs. Testing is one of the important steps in the software development phase. Testing checks for errors; for the project as a whole, testing involves the following test cases:

 Static analysis is used to investigate the structural properties of the source code.
 Dynamic testing is used to investigate the behavior of the source code by executing the program on test data.

TEST DATA AND OUTPUT

1. UNIT TESTING: Unit testing is conducted to verify the functional performance of each modular component of the software. Unit testing focuses on the smallest unit of the software design, i.e., the module. The white-box testing techniques were heavily employed for unit testing.
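As an illustration of unit testing a single module, the following sketch exercises a hypothetical Euclidean distance routine with a nominal input and a boundary input. The routine and the plain-assertion style are assumptions for illustration only; the document does not name a test framework.

// Illustrative unit-test sketch for a hypothetical distance routine.
public class DistanceTest {
    static double distance(double[] a, double[] b) {       // unit under test (hypothetical)
        double s = 0;
        for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; s += d * d; }
        return Math.sqrt(s);
    }

    public static void main(String[] args) {
        // Nominal input: the classic 3-4-5 triangle.
        assertClose(5.0, distance(new double[] {0, 0}, new double[] {3, 4}));
        // Boundary input: identical points must have zero distance.
        assertClose(0.0, distance(new double[] {1, 1}, new double[] {1, 1}));
        System.out.println("All unit tests passed.");
    }

    static void assertClose(double expected, double actual) {
        if (Math.abs(expected - actual) > 1e-9)
            throw new AssertionError("expected " + expected + " but got " + actual);
    }
}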

FUNCTIONAL TESTS: Functional test cases involve exercising the code with nominal input values for which the expected results are known, as well as boundary values and special values, such as logically related inputs, files of identical elements, and empty files. Three types of tests are included in functional testing:
 Performance Test
 Stress Test
 Structure Test

PERFORMANCE TEST: It determines the amount of execution time spent in various parts of the unit, program throughput, response time, and device utilization by the program unit.

STRESS TEST:

Stress tests are designed to intentionally break the unit. A great deal can be learned about the strengths and limitations of a program by examining the manner in which a program unit breaks.

STRUCTURE TEST: Structure tests are concerned with exercising the internal logic of a program and traversing particular execution paths. The white-box test strategy was employed to ensure that the test cases guarantee that all independent paths within a module are exercised at least once, and to:
 Exercise all logical decisions on their true or false sides.
 Execute all loops at their boundaries and within their operational bounds.
 Exercise internal data structures to assure their validity.
 Check attributes for their correctness.
 Handle end-of-file conditions, I/O errors, buffer problems and textual errors in output information.

2. INTEGRATION TESTING: Integration testing is a systematic technique for constructing the program structure while at the same time conducting tests to uncover errors associated with interfacing; i.e., integration testing is the complete testing of the set of modules which makes up the product. The objective is to take unit-tested modules and build a program structure; the tester should identify critical modules, and critical modules should be tested as early as possible. One approach is to wait until all the units have passed testing, then combine them and test them together; this approach evolved from unstructured testing of small programs. Another strategy is to construct the product in increments of tested units: a small set of modules is integrated together and tested, another module is added and tested in combination, and so on. The advantage of this approach is that interface discrepancies can be easily found and corrected.

The major error faced during the project was a linking error: when all the modules were combined, the links to the supporting files were not set properly. We then checked the interconnections and the links. Errors are localized to the new module and its intercommunications. Product development can be staged, and modules integrated as they complete unit testing. Testing is complete when the last module has been integrated and tested.

TESTING TECHNIQUES / TESTING STRATEGIES:

TESTING: Testing is a process of executing a program with the intent of finding an error. A good test case is one that has a high probability of finding an as-yet-undiscovered error, and a successful test is one that uncovers such an error. System testing is the stage of implementation aimed at ensuring that the system works accurately and efficiently as expected before live operation commences; it verifies that the whole set of programs hangs together. System testing consists of several key activities and steps (program, string and system testing) and is important in adopting a successful new system. This is the last chance to detect and correct errors before the system is installed for user acceptance testing. The software testing process commences once the program is created and the documentation and related data structures are designed. Software testing is essential for correcting errors; otherwise the program or the project is not said to be complete. Software testing is a critical element of software quality assurance and represents the ultimate review of specification, design and coding. Any engineering product can be tested in one of two ways:

White box testing: This testing is also called glass box testing. Knowing the internal operation of a product, tests can be conducted to ensure that "all gears mesh", that is, that the internal operation performs according to specification and all internal components have been adequately exercised. It is a test case design method that uses the control structure of the procedural design to derive test cases. Basis path testing is a white box technique.

Basis path testing:
 Flow graph notation
 Cyclomatic complexity
 Deriving test cases
 Graph matrices

Black box testing: Knowing the specific functions that a product has been designed to perform, tests can be conducted that demonstrate each function is fully operational while at the same time searching for errors in each function. Black box testing fundamentally focuses on the functional requirements of the software.

The steps involved in black box test case design are:
 Graph-based testing methods
 Equivalence partitioning
 Boundary value analysis
 Comparison testing

SOFTWARE TESTING STRATEGIES: A software testing strategy provides a road map for the software developer. Testing is a set of activities that can be planned in advance and conducted systematically. For this reason, a template for software testing (a set of steps into which specific test case design methods can be placed) should be defined for the software process. A testing strategy should have the following characteristics:

 Testing begins at the module level and works "outward" toward the integration of the entire computer-based system.
 Different testing techniques are appropriate at different points in time.
 The developer of the software and an independent test group conduct the testing.
 Testing and debugging are different activities, but debugging must be accommodated in any testing strategy.

INTEGRATION TESTING: Integration testing is a systematic technique for constructing the program structure while at the same time conducting tests to uncover errors associated with interfacing. Individual modules, which are highly prone to interface errors, should not be assumed to work correctly as soon as we put them together. The problem, of course, is "putting them together" - interfacing. Data can be lost across an interface; one module can have an inadvertent adverse effect on another; sub-functions, when combined, may not produce the desired major function; individually acceptable imprecision may be magnified to unacceptable levels; and global data structures can present problems.

VALIDATION TESTING: Software validation is achieved through a series of tests that demonstrate conformity with requirements. A test plan outlines the classes of tests to be conducted, and a test procedure defines specific test cases that will be used to demonstrate conformity with requirements. The proposed system under consideration has been tested by validation and found to be working satisfactorily.

PROGRAM TESTING: The logical and syntax errors have been pointed out by program testing. A syntax error is an error in a program statement that violates one or more rules of the language in which it is written. An improperly defined field dimension or omitted keywords are

common syntax errors. These errors are shown through error messages generated by the computer. A logic error, on the other hand, deals with incorrect data fields, out-of-range items and invalid combinations. Since the compiler will not detect logical errors, the programmer must examine the output. Condition testing exercises the logical conditions contained in a module. The possible types of elements in a condition include a Boolean operator, a Boolean variable, a pair of Boolean parentheses, a relational operator, or an arithmetic expression. The condition testing method focuses on testing each condition in the program; its purpose is to detect not only errors in the conditions of a program but also other errors in the program.

SECURITY TESTING: Security testing attempts to verify that the protection mechanisms built into a system will, in fact, protect it from improper penetration. The system security must be tested for invulnerability from frontal attack and must also be tested for invulnerability from rear attack. During security testing, the tester plays the role of an individual who desires to penetrate the system.

References:
1. J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2001.
2. C. Clifton et al., "Tools for Privacy Preserving Distributed Data Mining," SIGKDD Explorations, vol. 4, no. 2, 2003.
3. V.S. Verykios et al., "State-of-the-Art in Privacy Preserving Data Mining," SIGMOD Record, vol. 33, no. 1.
4. L. Wang, S. Jajodia, and D. Wijesekera, "Securing OLAP Data Cubes against Privacy Breaches," Proc. 25th IEEE Symp. Security and Privacy, IEEE Press, 2004.
5. R. Agrawal and R. Srikant, "Privacy-Preserving Data Mining," Proc. 19th ACM SIGMOD Int'l Conf. Management of Data, ACM Press, 2000.
6. R.J. Bayardo and R. Agrawal, "Data Privacy through Optimal k-Anonymization," Proc. 21st Int'l Conf. Data Eng., IEEE Press, 2005.
7. N. Zhang, S. Wang, and W. Zhao, "A New Scheme on Privacy-Preserving Data Classification," Proc. 11th ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining, ACM Press, 2005.
8. Z. Huang, W. Du, and B. Chen, "Deriving Private Information from Randomized Data," Proc. 24th ACM SIGMOD Int'l Conf. Management of Data, ACM Press, 2005.
9. R. Agrawal, R. Srikant, and D. Thomas, "Privacy-Preserving OLAP," Proc. 25th ACM SIGMOD Int'l Conf. Management of Data, ACM Press, 2005, pp. 251-262.
10. R. Agrawal, A. Evfimievski, and R. Srikant, "Information Sharing across Private Databases," Proc. 22nd ACM SIGMOD Int'l Conf. Management of Data, ACM Press, 2003.

Class Diagram:

System 1:
• ReclusterFrame: openFile(), partyOneClustered(), partyTwoClustered(), showClusteredItem(), exit()
• RecursiveCluster: Split(), RecursiveClusterSub()
• CheckAttributes
• MainCluster: mainCheckAttributes(), mainRecluster(), hashTable()
• FTPServer: sendFile(), recieveFile(), run()

System 2:
• One: One(), Connection(), Exit()
• Start: initCompenents(), connectionPerformed()
• FTPClient: setFile(), sendFile(), recieveFile(), displayMenu()
• DataEncryption: Encrypt()
• DataDecryption: Decrypt()
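The class diagram lists DataEncryption and DataDecryption with Encrypt() and Decrypt() methods but does not state which cipher is used. The sketch below is an illustrative stand-in using the standard javax.crypto API with AES, not the project's actual implementation.

import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;

// Illustrative sketch: encrypting and decrypting clustered data with AES (javax.crypto).
public class CryptoSketch {
    public static void main(String[] args) throws Exception {
        KeyGenerator keyGen = KeyGenerator.getInstance("AES");
        keyGen.init(128);                             // 128-bit key
        SecretKey key = keyGen.generateKey();

        byte[] plain = "clustered record: 4.2, 7.9, groupA".getBytes("UTF-8");

        Cipher cipher = Cipher.getInstance("AES");
        cipher.init(Cipher.ENCRYPT_MODE, key);
        byte[] encrypted = cipher.doFinal(plain);     // what an Encrypt() method might produce

        cipher.init(Cipher.DECRYPT_MODE, key);
        byte[] decrypted = cipher.doFinal(encrypted); // what a Decrypt() method might recover

        System.out.println(new String(decrypted, "UTF-8"));
    }
}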

Sequence Diagram: The user chooses a file; the selected file is clustered; the FTP server is started; the FTP client on System 2 responds; the clustered file is encrypted; after decryption the file is sent back to the user. Participants: User, File Selection, Clustering, FTP Server, FTP Client, Encryption, Decryption.

Use Case Diagram: The use cases for System 1 are: choosing a file, clustering the selected file, reclustering the clustered file, starting the FTP server, encryption, and decryption.

Literature Survey: Privacy-preserving mining of distributed data has numerous applications. Each application poses different constraints: what is meant by privacy, what are the desired results, how is the data distributed, what are the constraints on collaboration and cooperative computing, and so on. We suggest that the solution is a toolkit of components that can be combined for specific privacy-preserving data mining applications. We also propose a classification hierarchy that sets the basis for analyzing the work which has been performed in this context. A detailed review of the work accomplished in this area is also given, along with the coordinates of each work within the classification hierarchy.

We address issues related to sharing information in a distributed system consisting of autonomous entities, each of which holds a private database. Semi-honest behavior has been widely adopted as the model for adversarial threats. However, it substantially underestimates the capability of adversaries in reality. We classify malicious adversaries into two widely existing subclasses, called weakly malicious and strongly malicious adversaries, respectively. We define a measure of privacy leakage for information sharing systems and propose protocols that can effectively and efficiently protect privacy against different kinds of malicious adversaries.

We consider a scenario in which two parties owning confidential databases wish to run a data mining algorithm on the union of their databases without revealing any unnecessary information. Our work is motivated by the need both to protect privileged information and to enable its use for research or other purposes.

Future Enhancement: Most of the previous studies investigated the problem and proposed solutions based on the assumption that all parties are honest or semi-honest. While this is sometimes useful, the assumption substantially underestimates the capability of adversaries and thus does not always hold in practical situations. We considered a space of more powerful adversaries which includes not only honest and semi-honest adversaries but also those who are weakly malicious and strongly malicious. Many extensions to our work exist, including (1) extending the information sharing function from intersection to other operations, and (2) dealing with multiple parties in the system, including dealing with correlated attacks from multiple adversaries. Our results can be readily applied to some information sharing functions, including equijoin (V0 ⋈ V1) and scalar product (V0 · V1). We are currently investigating privacy-preserving protocols for sum, union, and other information sharing functions.
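For reference, the scalar product V0 · V1 mentioned above is, when computed in the clear, simply the sum of component-wise products; the privacy-preserving protocols aim to produce this value without either party revealing its vector. A minimal non-private sketch with illustrative names:

// Illustrative sketch: the (non-private) scalar product that the protocols protect.
public class ScalarProduct {
    static double scalarProduct(double[] v0, double[] v1) {
        double sum = 0;
        for (int i = 0; i < v0.length; i++) sum += v0[i] * v1[i];
        return sum;
    }

    public static void main(String[] args) {
        double[] v0 = { 1, 0, 1, 1 };   // party 0's private vector
        double[] v1 = { 0, 1, 1, 1 };   // party 1's private vector
        System.out.println(scalarProduct(v0, v1));   // prints 2.0
    }
}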
