An Adaptive Algorithm For Detection Of Duplicate Records

  • Uploaded by: lipika008
  • June 2020

Technical Seminar 2004

Presented by:

Rama kanta Behera

IT200127207

Under the guidance of:

Miss Ipsita Mishra

INTRODUCTION

  • A “records set” is a list of prior distinct records.
  • A new record is to be verified for a duplicate against the records set.
  • A database is a collection of related data.
  • Various algorithms address this problem, such as:
      • Machine learning algorithms
      • Learnable string similarity measures
      • Adaptive algorithms

OBJECTIVES

  • Reduce the cost of duplicate record detection.
  • Make the detection procedure scalable.
  • Cache information about prior distinct records, so that the prior records themselves need not be retained for further searches.
  • Keep the algorithm adaptive.

PREVALENT METHODS

The Brute Force Method
For each new record, this method has a cost of the order of the number of records in the records set, and it requires all prior records to be stored.

Method by Rail et al.
The comparison of a new record against the records set is reduced from a full-text match to a comparison of two integers.
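The difference between the two prevalent methods can be sketched as follows. This is only an illustration: the slides do not say how the integer for a record is computed, so `to_integer` (a CRC-32 of the record text) is a hypothetical stand-in, and distinct records could in principle collide under it.

```python
import zlib

def is_duplicate_brute_force(new_record, records):
    """Compare the full text of the new record with every stored record."""
    return any(new_record == stored for stored in records)

def to_integer(record):
    """Hypothetical record-to-integer mapping (CRC-32 of the UTF-8 text)."""
    return zlib.crc32(record.encode("utf-8"))

def is_duplicate_by_integer(new_record, stored_integers):
    """Compare one integer per stored record instead of the full text."""
    new_integer = to_integer(new_record)
    return any(new_integer == stored for stored in stored_integers)

records = ["alice,30", "bob,25"]
stored_integers = [to_integer(r) for r in records]

print(is_duplicate_brute_force("bob,25", records))         # True
print(is_duplicate_by_integer("bob,25", stored_integers))  # True
print(is_duplicate_brute_force("carol,41", records))       # False
```

Both variants still scan one entry per prior record; the second only makes each comparison cheaper.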

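OUTLINE OF THE PROPOSED SOLUTION

The central idea behind the present algorithm is based on the fundamental property of primality of numbers. A prime divides a product of primes exactly when it is one of the factors, so divisibility can serve as a membership test.

Fig: Hashing. Record set → f(x) → integer number space (I)

Fig: Extended hashing into prime space. Record set → f(x) → integer number (I) → g(x) → prime number (P)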

Fig: The complete algorithm. Records r1, r2, …, rn are mapped by f(x) to integers I1, I2, …, In, which g(x) maps to primes P1, P2, …, Pn. Their product is Pprior = P1 * P2 * … * Pn.

REALIZATION OF THE ALGORITHM

Two functions, f(x) and g(x), are to be realized for the implementation of the algorithm:

  • Realizing f(x)
  • Realizing g(x)
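The slides do not say how f(x) and g(x) are realized. As a minimal sketch, assuming a small hypothetical integer space of size `SPACE`, f could be a truncated SHA-256 hash and g a lookup into a table of the first `SPACE` primes. The names `SPACE`, `first_primes` and `PRIMES` are illustrative and not taken from the source. In this toy version two different records can collide in f, which would be reported as a false duplicate.

```python
import hashlib

SPACE = 1000  # hypothetical size of the integer number space I

def f(record):
    """Hash a record into the integer space [0, SPACE)."""
    digest = hashlib.sha256(record.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % SPACE

def first_primes(count):
    """Return the first `count` primes, testing each candidate
    against the smaller primes up to its square root."""
    primes = []
    candidate = 2
    while len(primes) < count:
        if all(candidate % p for p in primes if p * p <= candidate):
            primes.append(candidate)
        candidate += 1
    return primes

PRIMES = first_primes(SPACE)

def g(i):
    """Map integer i to the i-th prime (0-based), so distinct
    integers always receive distinct primes."""
    return PRIMES[i]

p_new = g(f("alice,30"))  # the prime that stands for this record
```

Because g is one-to-one on the integer space, any ambiguity in this sketch comes from f alone.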

STEPS OF THE ALGORITHM

Step 1: Each new record is hashed, giving a hash value (Hnew) that is unique to each distinct record.
Step 2: Hnew is mapped to its corresponding unique prime (Pnew).
Step 3: Pprior is divided by Pnew. If Pnew divides Pprior exactly, the record corresponding to Pnew is a duplicate and is already represented in Pprior. Otherwise, it is a distinct record.
Step 4: If the record is distinct, Pprior is multiplied by Pnew and the result is stored back in Pprior. Updating Pprior in this way renders the algorithm adaptive.
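The four steps can be sketched in a few lines. This is an illustration under stated assumptions, not the presenter's implementation: `DuplicateDetector` and `check_and_add` are hypothetical names, and the record-to-prime mapping of Steps 1 and 2 is replaced by a hand-written table (`TOY_PRIMES`) so that the example stays small.

```python
class DuplicateDetector:
    """Keeps only the running product Pprior, not the records themselves."""

    def __init__(self, record_to_prime):
        self.record_to_prime = record_to_prime  # stands in for g(f(x))
        self.p_prior = 1                        # empty product: no records yet

    def check_and_add(self, record):
        """Return True if the record is a duplicate, else remember it."""
        p_new = self.record_to_prime(record)    # Steps 1 and 2
        if self.p_prior % p_new == 0:           # Step 3: exact division
            return True
        self.p_prior *= p_new                   # Step 4: update Pprior
        return False

# Hypothetical mapping used only for this demonstration.
TOY_PRIMES = {"alice": 2, "bob": 3, "carol": 5}

detector = DuplicateDetector(TOY_PRIMES.__getitem__)
results = [detector.check_and_add(r)
           for r in ["alice", "bob", "alice", "carol", "bob"]]

print(results)           # [False, False, True, False, True]
print(detector.p_prior)  # 30
```

Python integers have arbitrary precision, so the product never overflows here, although it keeps growing with every distinct record.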

Fig: Flowchart

IMPLEMENTATIONS

There are three important implementation details that need to be discussed:

  • Size of the records set
  • Use of logarithms
  • Subsets of the records set
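The slides only name these three topics, so the following is one possible reading rather than the presenter's design. It shows how the product of primes grows with the number of distinct records, how a sum of logarithms estimates that size without forming the product, and how splitting the records set into subsets keeps each product smaller. `BucketedDetector` and its bucketing rule (prime modulo bucket count) are assumptions made for illustration.

```python
import math

PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]  # primes of ten distinct records

# Size: the product gains bits with every distinct record.
p_prior = math.prod(PRIMES)
exact_bits = p_prior.bit_length()

# Logarithms: log2 of a product is the sum of the log2 of its factors,
# so the size can be estimated without multiplying.
estimated_bits = math.floor(sum(math.log2(p) for p in PRIMES)) + 1
print(exact_bits, estimated_bits)  # 33 33

class BucketedDetector:
    """Keeps one smaller product per subset instead of a single large one."""

    def __init__(self, bucket_count):
        self.products = [1] * bucket_count

    def check_and_add(self, p_new):
        # A given prime always lands in the same bucket, so the
        # divisibility test only needs that bucket's product.
        bucket = p_new % len(self.products)
        if self.products[bucket] % p_new == 0:
            return True
        self.products[bucket] *= p_new
        return False

detector = BucketedDetector(bucket_count=3)
results = [detector.check_and_add(p) for p in [2, 3, 5, 7, 2, 7]]
print(results)            # [False, False, False, False, True, True]
print(detector.products)  # [3, 7, 10]
```

The floating-point logarithm is used here only to estimate size; the duplicate test itself still relies on exact integer division.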

CONCLUSION

  • A new approach to handling duplicate records is presented.
  • This approach combines concepts from number theory and algorithmics to solve the frequently encountered problem of “duplicate record detection”.

THANK YOU !!!
