Presentation on Clustering
Submitted To: Dr. V.K. Pathak
Submitted By: Saurabh Jain, III B.Tech CSE
Topics Covered:
1. Clustering – Basic Idea
2. Problem
3. Solution Based on a Greedy Algorithm
4. Example of Implementation of the Algorithm
5. Applications of Clustering
WHEN DOES CLUSTERING ARISE?
• A set of photographs
• A set of documents
• A set of microorganisms
What Is Clustering? Clustering is the process of grouping a set of physical or abstract objects into classes of similar objects. A cluster is therefore a collection of objects that are "similar" to one another and "dissimilar" to the objects belonging to other clusters.
Motivation for Clustering
Input: a set of objects and the distances between them. Objects can be images, web pages, people, species, etc.
Distance function: a numeric value specifying the "closeness" of two objects; increasing distance corresponds to decreasing similarity.
Goal: group the objects into clusters, where each cluster is a set of similar objects.
Formalising the Clustering Problem
Let U be the set of n objects labelled p1, p2, p3, …, pn. For every pair pi and pj, we have a distance d(pi, pj). We require d(pi, pi) = 0, d(pi, pj) > 0 if i ≠ j, and d(pi, pj) = d(pj, pi). Given a positive integer k, a k-clustering of U is a partition of U into k non-empty subsets or "clusters" C1, C2, …, Ck. The spacing of a clustering is the smallest distance between any pair of points lying in different clusters.
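In symbols, the spacing just defined can be written as

\[ \mathrm{spacing}(C_1, \dots, C_k) \;=\; \min_{i \neq j} \; \min_{p \in C_i,\; q \in C_j} d(p, q) \]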
[Figure: the n objects p1, …, pn partitioned into clusters, e.g. C1 = {p1, p2}, C2 = {p3, p4, p5}, …, Ck = {pn}]
How hard is clustering?
• Suppose we are given n points and would like to cluster them into k clusters. How many possible clusterings are there? Roughly k^n / k!.
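For intuition (a standard counting fact, not spelled out on the original slide): the exact number of ways to partition n points into k non-empty clusters is the Stirling number of the second kind S(n, k), and k^n / k! is a good approximation when n is large relative to k. For example, S(4, 2) = 7, close to 2^4 / 2! = 8. Even modest n and k give far too many clusterings to enumerate, which is why a greedy approach is attractive.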
Clustering Criteria: Maximum Spacing
• The spacing between clusters is defined as the minimum distance between any pair of points in different clusters.
• Clustering of maximum spacing: given an integer k, find a k-clustering of maximum spacing.

[Figure: example point set partitioned into k = 4 clusters, with the spacing marked]
Greedy Clustering Algorithm
Intuition: greedily cluster objects in increasing order of distance.
Let C be a set of n clusters, with each object in U in its own cluster.
Calculate the distance between each pair of objects.
Process pairs of objects in increasing order of distance:
  Let (p, q) be the next pair, with p ∈ Cp and q ∈ Cq.
  If Cp ≠ Cq, add the new cluster Cp ∪ Cq to C and delete Cp and Cq from C.
Stop when there are k clusters in C.
This is the same as Kruskal's algorithm, except that we do not add the last k−1 edges of the MST.
Key observation: this procedure is precisely Kruskal's algorithm for the minimum-cost spanning tree, except that we stop when there are k connected components, i.e. we stop the algorithm before it adds its last k−1 edges.
Remark: this is equivalent to finding the full minimum spanning tree and deleting the k−1 most expensive edges.
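The procedure maps directly onto Kruskal's algorithm with a union-find structure. Below is a minimal Python sketch of the idea (not from the original slides; the function name and interface are illustrative assumptions):

# Greedy k-clustering (single-link), a minimal sketch.
# `points` is a list of objects, `dist` a symmetric distance function, k <= len(points).
def greedy_k_clustering(points, dist, k):
    n = len(points)
    parent = list(range(n))              # union-find: each point starts in its own cluster

    def find(i):                         # find cluster representative (with path compression)
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairs, processed in increasing order of distance (as in Kruskal's algorithm).
    pairs = sorted(
        (dist(points[i], points[j]), i, j)
        for i in range(n) for j in range(i + 1, n)
    )

    clusters = n
    for d, i, j in pairs:
        if clusters == k:                # stop before adding the last k-1 MST edges
            break
        ri, rj = find(i), find(j)
        if ri != rj:                     # merge only if p and q lie in different clusters
            parent[ri] = rj
            clusters -= 1

    groups = {}                          # collect the k clusters
    for i in range(n):
        groups.setdefault(find(i), []).append(points[i])
    return list(groups.values())

With a Euclidean distance function, this returns the same partition as building the full MST and deleting the k−1 most expensive edges.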
Undirected Graph G = (V, E)

[Figure: undirected weighted graph on vertices A, B, C, D, E, F with edge weights between 1 and 6]
Minimum Spanning Tree

[Figure: MST of G with edge weights 1, 2, 2, 3, 3; the two most expensive edges (weight 3) are marked as "expensive edges"]

[Figure: for K = 3, the K−1 = 2 most expensive edges are removed from the MST]

3 clusters are formed after removing 3−1 = 2 expensive edges.

[Figure: the resulting 3 clusters over vertices A–F]
EXAMPLE Problem: Assume that the database D is given by the table below. Use the single-link technique to find clusters in D, with the Euclidean distance measure.
      p1     p2     p3     p4     p5     p6
 x    0.40   0.22   0.35   0.26   0.08   0.45
 y    0.53   0.38   0.32   0.19   0.41   0.30
Solution:
Step 1. Plot the objects in n-dimensional space (where n is the number of attributes). In our case we have 2 attributes, x and y, so we plot the objects p1, p2, …, p6 in 2-dimensional space.

[Figure: scatter plot of points p1–p6 in the x–y plane]
Step 2. Calculate the distance from each object (point) to all other points, using the Euclidean distance measure, and place the numbers in a distance matrix. For example,

d(p1, p2) = ( |xp1 − xp2|^2 + |yp1 − yp2|^2 )^(1/2) = 0.24

Distance matrix:

       p1     p2     p3     p4     p5     p6
  p1   0      0.24   0.22   0.37   0.34   0.23
  p2          0      0.15   0.20   0.14   0.25
  p3                 0      0.15   0.28   0.11
  p4                        0      0.29   0.22
  p5                               0      0.39
  p6                                      0
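These numbers are easy to reproduce. A minimal Python sketch (not part of the original slides; entries may differ from the table by ±0.01 depending on rounding):

import math

# Coordinates of p1..p6 from the table above.
xs = [0.40, 0.22, 0.35, 0.26, 0.08, 0.45]
ys = [0.53, 0.38, 0.32, 0.19, 0.41, 0.30]

# Pairwise Euclidean distances, rounded to two decimals.
for i in range(6):
    row = [round(math.hypot(xs[i] - xs[j], ys[i] - ys[j]), 2) for j in range(6)]
    print("p%d:" % (i + 1), row)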
Step 3. Identify the two clusters with the shortest distance in the matrix and merge them. Re-compute the distance matrix, since those two clusters are now a single cluster and no longer exist by themselves. Looking at the distance matrix above, p3 and p6 have the smallest distance of all: 0.11. So we merge those two into a single cluster (p3, p6) and re-compute the distance matrix.

[Figure: scatter plot with p3 and p6 merged into one cluster]

            p1     p2     (p3,p6)  p4     p5
  p1        0
  p2        0.24   0
  (p3,p6)   0.22   0.15   0
  p4        0.37   0.20   0.15     0
  p5        0.34   0.14   0.28     0.29   0
dist( (p3, p6), p1 ) = MIN( dist(p3, p1), dist(p6, p1) ) = MIN( 0.22, 0.23 )   // from the original matrix
= 0.22
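This MIN rule is exactly the single-link distance between clusters. As a generic sketch (a hypothetical helper, not from the slides):

# Single-link distance between two clusters:
# the minimum over all cross-cluster point pairs.
def single_link(cluster_a, cluster_b, dist):
    return min(dist(p, q) for p in cluster_a for q in cluster_b)

# e.g. single_link([p3, p6], [p1], d) reproduces the 0.22 computed above.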
Step 4. Repeat Step 3 until all clusters are merged (or the desired number of clusters is reached).
a. Looking at the last distance matrix above, p2 and p5 have the smallest distance of all: 0.14. So we merge those two into a single cluster (p2, p5) and re-compute the distance matrix.

[Figure: scatter plot with clusters (p2, p5) and (p3, p6) marked]

            p1     (p2,p5)  (p3,p6)  p4
  p1        0
  (p2,p5)   0.24   0
  (p3,p6)   0.22   0.15     0
  p4        0.37   0.20     0.15     0
dist( (p3, p6), (p2, p5) ) = MIN( dist(p3, p2), dist(p6, p2), dist(p3, p5), dist(p6, p5) )
= MIN( 0.15, 0.25, 0.28, 0.39 )   // from the original matrix
= 0.15
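The whole merge sequence can also be checked against an off-the-shelf implementation. A sketch using SciPy (assuming SciPy is available; this library call is not part of the original slides):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# The six points (x, y) from the example table.
X = np.array([[0.40, 0.53], [0.22, 0.38], [0.35, 0.32],
              [0.26, 0.19], [0.08, 0.41], [0.45, 0.30]])

Z = linkage(X, method="single")                   # single-link agglomerative clustering
labels = fcluster(Z, t=3, criterion="maxclust")   # cut the dendrogram into 3 clusters
print(labels)                                     # cluster label for each of p1..p6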
[Figure: the points grouped into K = 3 clusters (K = number of clusters)]
STOPPING CONDITION IN THE ABOVE ALGORITHM
We stop the procedure once we obtain k connected components, where k is the number of clusters the user would like to have. In the above example, k = 3.
Applications
• Routing in mobile ad hoc networks.
• Identifying patterns in gene expression.
• Document categorization for web search.
• Similarity searching in medical image databases.
• SkyCat: cluster 10^9 sky objects into stars, quasars, galaxies.
ANY QUESTIONS?