Presentation on Clustering
Submitted To: Dr. V.K. Pathak
Submitted By: Saurabh Jain, III B.Tech CSE
Topics Covered:
1. Clustering – Basic Idea
2. Problem
3. Solution Based on a Greedy Algorithm
4. Example of Implementation of the Algorithm
5. Applications of Clustering
WHEN DOES CLUSTERING ARISE?
• A set of photographs
• A set of documents
• A set of microorganisms
What Is Clustering? Clustering is the process of grouping a set of physical or abstract objects into classes of similar objects. A cluster is therefore a collection of objects that are "similar" to one another and "dissimilar" to the objects belonging to other clusters.
Motivation for Clustering
Input: a set of objects and the distances between them. Objects can be images, web pages, people, species, etc.
Distance function: a numeric value specifying the "closeness" of two objects; increasing distance corresponds to decreasing similarity.
Goal: group the objects into clusters, where each cluster is a set of similar objects.
Formalising the Clustering Problem
Let U be the set of n objects labelled p1, p2, p3, …, pn. For every pair pi and pj, we have a distance d(pi, pj). We require d(pi, pi) = 0, d(pi, pj) > 0 if i ≠ j, and d(pi, pj) = d(pj, pi). Given a positive integer k, a k-clustering of U is a partition of U into k non-empty subsets or "clusters" C1, C2, …, Ck. The spacing of a clustering is the smallest distance between any pair of points lying in different clusters.
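In symbols, the spacing just defined can be written as

\[ \mathrm{spacing}(C_1, \dots, C_k) \;=\; \min_{i \neq j} \; \min_{p \in C_i,\; q \in C_j} d(p, q) \]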
[Figure: the n objects p1, …, pn partitioned into clusters, e.g. C1 = {p1, p2}, C2 = {p3, p4, p5}, …, Ck = {pn}]
How hard is clustering?
• Suppose we are given n points and would like to cluster them into k clusters. How many possible clusterings are there? Roughly k^n / k!.
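For intuition (a standard counting fact, not spelled out on the original slide): the exact number of ways to partition n points into k non-empty clusters is the Stirling number of the second kind S(n, k), and k^n / k! is a good approximation when n is large relative to k. For example, S(4, 2) = 7, close to 2^4 / 2! = 8. Even modest n and k give far too many clusterings to enumerate, which is why a greedy approach is attractive.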
Clustering Criteria: Maximum Spacing
• The spacing between clusters is defined as the minimum distance between any pair of points in different clusters.
• Clustering of maximum spacing: given an integer k, find a k-clustering of maximum spacing.

[Figure: example point set partitioned into k = 4 clusters, with the spacing marked]
Greedy Clustering Algorithm
Intuition: greedily cluster objects in increasing order of distance.
Let C be a set of n clusters, with each object in U in its own cluster.
Calculate the distance between each pair of objects.
Process pairs of objects in increasing order of distance:
  Let (p, q) be the next pair, with p ∈ Cp and q ∈ Cq.
  If Cp ≠ Cq, add the new cluster Cp ∪ Cq to C and delete Cp and Cq from C.
Stop when there are k clusters in C.
This is the same as Kruskal's algorithm, except that we do not add the last k−1 edges of the MST.
Key observation: this procedure is precisely Kruskal's algorithm for the minimum-cost spanning tree, except that we stop when there are k connected components, i.e. we stop the algorithm before it adds its last k−1 edges.
Remark: this is equivalent to finding the full minimum spanning tree and deleting the k−1 most expensive edges.
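The procedure maps directly onto Kruskal's algorithm with a union-find structure. Below is a minimal Python sketch of the idea (not from the original slides; the function name and interface are illustrative assumptions):

# Greedy k-clustering (single-link), a minimal sketch.
# `points` is a list of objects, `dist` a symmetric distance function, k <= len(points).
def greedy_k_clustering(points, dist, k):
    n = len(points)
    parent = list(range(n))              # union-find: each point starts in its own cluster

    def find(i):                         # find cluster representative (with path compression)
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairs, processed in increasing order of distance (as in Kruskal's algorithm).
    pairs = sorted(
        (dist(points[i], points[j]), i, j)
        for i in range(n) for j in range(i + 1, n)
    )

    clusters = n
    for d, i, j in pairs:
        if clusters == k:                # stop before adding the last k-1 MST edges
            break
        ri, rj = find(i), find(j)
        if ri != rj:                     # merge only if p and q lie in different clusters
            parent[ri] = rj
            clusters -= 1

    groups = {}                          # collect the k clusters
    for i in range(n):
        groups.setdefault(find(i), []).append(points[i])
    return list(groups.values())

With a Euclidean distance function, this returns the same partition as building the full MST and deleting the k−1 most expensive edges.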
Undirected Graph G = (V, E)

[Figure: undirected weighted graph on vertices A, B, C, D, E, F with edge weights between 1 and 6]
Minimum Spanning Tree

[Figure: MST of G with edge weights 1, 2, 2, 3, 3; the two most expensive edges (weight 3) are marked as "expensive edges"]

[Figure: for K = 3, the K−1 = 2 most expensive edges are removed from the MST]

3 clusters are formed after removing 3−1 = 2 expensive edges.

[Figure: the resulting 3 clusters over vertices A–F]
EXAMPLE Problem: Assume that the database D is given by the table below. Use the single-link technique to find clusters in D, with the Euclidean distance measure.
      p1     p2     p3     p4     p5     p6
 x    0.40   0.22   0.35   0.26   0.08   0.45
 y    0.53   0.38   0.32   0.19   0.41   0.30
Solution:
Step 1. Plot the objects in n-dimensional space (where n is the number of attributes). In our case we have 2 attributes, x and y, so we plot the objects p1, p2, …, p6 in 2-dimensional space.

[Figure: scatter plot of points p1–p6 in the x–y plane]
Step 2. Calculate the distance from each object (point) to all other points, using the Euclidean distance measure, and place the numbers in a distance matrix. For example,

d(p1, p2) = ( |xp1 − xp2|^2 + |yp1 − yp2|^2 )^(1/2) = 0.24

Distance matrix:

       p1     p2     p3     p4     p5     p6
  p1   0      0.24   0.22   0.37   0.34   0.23
  p2          0      0.15   0.20   0.14   0.25
  p3                 0      0.15   0.28   0.11
  p4                        0      0.29   0.22
  p5                               0      0.39
  p6                                      0
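These numbers are easy to reproduce. A minimal Python sketch (not part of the original slides; entries may differ from the table by ±0.01 depending on rounding):

import math

# Coordinates of p1..p6 from the table above.
xs = [0.40, 0.22, 0.35, 0.26, 0.08, 0.45]
ys = [0.53, 0.38, 0.32, 0.19, 0.41, 0.30]

# Pairwise Euclidean distances, rounded to two decimals.
for i in range(6):
    row = [round(math.hypot(xs[i] - xs[j], ys[i] - ys[j]), 2) for j in range(6)]
    print("p%d:" % (i + 1), row)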
Step 3. Identify the two clusters with the shortest distance in the matrix and merge them. Re-compute the distance matrix, since those two clusters are now a single cluster and no longer exist by themselves. Looking at the distance matrix above, p3 and p6 have the smallest distance of all: 0.11. So we merge those two into a single cluster (p3, p6) and re-compute the distance matrix.

[Figure: scatter plot with p3 and p6 merged into one cluster]

            p1     p2     (p3,p6)  p4     p5
  p1        0
  p2        0.24   0
  (p3,p6)   0.22   0.15   0
  p4        0.37   0.20   0.15     0
  p5        0.34   0.14   0.28     0.29   0
dist( (p3, p6), p1 ) = MIN( dist(p3, p1), dist(p6, p1) ) = MIN( 0.22, 0.23 )   // from the original matrix
= 0.22
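This MIN rule is exactly the single-link distance between clusters. As a generic sketch (a hypothetical helper, not from the slides):

# Single-link distance between two clusters:
# the minimum over all cross-cluster point pairs.
def single_link(cluster_a, cluster_b, dist):
    return min(dist(p, q) for p in cluster_a for q in cluster_b)

# e.g. single_link([p3, p6], [p1], d) reproduces the 0.22 computed above.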
Step 4. Repeat Step 3 until all clusters are merged (or the desired number of clusters is reached).
a. Looking at the last distance matrix above, p2 and p5 have the smallest distance of all: 0.14. So we merge those two into a single cluster (p2, p5) and re-compute the distance matrix.

[Figure: scatter plot with clusters (p2, p5) and (p3, p6) marked]

            p1     (p2,p5)  (p3,p6)  p4
  p1        0
  (p2,p5)   0.24   0
  (p3,p6)   0.22   0.15     0
  p4        0.37   0.20     0.15     0
dist( (p3, p6), (p2, p5) ) = MIN( dist(p3, p2), dist(p6, p2), dist(p3, p5), dist(p6, p5) )
= MIN( 0.15, 0.25, 0.28, 0.39 )   // from the original matrix
= 0.15
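The whole merge sequence can also be checked against an off-the-shelf implementation. A sketch using SciPy (assuming SciPy is available; this library call is not part of the original slides):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# The six points (x, y) from the example table.
X = np.array([[0.40, 0.53], [0.22, 0.38], [0.35, 0.32],
              [0.26, 0.19], [0.08, 0.41], [0.45, 0.30]])

Z = linkage(X, method="single")                   # single-link agglomerative clustering
labels = fcluster(Z, t=3, criterion="maxclust")   # cut the dendrogram into 3 clusters
print(labels)                                     # cluster label for each of p1..p6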
[Figure: the points grouped into K = 3 clusters (K = number of clusters)]
STOPPING CONDITION IN THE ABOVE ALGORITHM
We stop the procedure once we obtain k connected components, where k is the number of clusters the user would like to have. In the above example, k = 3.
Applications
• Routing in mobile ad hoc networks.
• Identifying patterns in gene expression.
• Document categorization for web search.
• Similarity searching in medical image databases.
• SkyCat: cluster 10^9 sky objects into stars, quasars, galaxies.
ANY QUESTIONS?