Similarity and Dissimilarity

Lect 09/10-08-09


• Similarity: a numerical measure of the degree to which two objects are alike.
• The value of a similarity measure is high if the two objects are very similar.
• If the scale is [0, 1], then 0 indicates no similarity and 1 indicates complete similarity.


• Dissimilarity:
  – a numerical measure of the degree to which two objects are different.
  – Dissimilarity is lower for a similar pair of objects.

• Transformations:
  – Suppose similarities between two objects range from 1 (not at all similar) to 10 (completely similar).
  – They can be transformed into the range [0, 1] as follows:
    S′ = (S − 1) / 9, where S and S′ are the original and transformed similarity values.


• This can be generalized as:
  – S′ = (s − min_s) / (max_s − min_s)
  – Similarly, dissimilarity can be mapped onto the interval [0, 1]:
  – D′ = (d − min_d) / (max_d − min_d)
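The min-max transformation above can be sketched in a few lines of Python (the function name `rescale` and the sample values are illustrative, not from the slides):

```python
# Min-max rescaling of similarity (or dissimilarity) scores onto [0, 1]:
# S' = (s - min_s) / (max_s - min_s)
def rescale(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

sims = [1, 4, 10]      # similarities on the 1..10 scale from the slide
print(rescale(sims))   # smallest maps to 0.0, largest to 1.0
```

For the 1..10 scale this reduces to the slide's S′ = (S − 1) / 9, since min_s = 1 and max_s − min_s = 9.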


Similarity and Dissimilarity Between Simple Attributes

• Consider objects having a single attribute.
• One nominal attribute:
  – In the context of such an attribute, similarity only means whether the two objects have the same value or not.
  – S = 1 if the attribute values match; S = 0 if they do not.
  – Dissimilarity can be defined similarly, in the opposite way.

• One ordinal attribute:
  – Order is important and should be taken into account.
  – Suppose an attribute measures the quality of a product as {poor, fair, OK, good, wonderful}.
  – Product P1 = wonderful, product P2 = good, P3 = OK, and so on.
  – To make this quantitative, map the values to integers: {poor = 0, fair = 1, OK = 2, good = 3, wonderful = 4}.
  – Then d(P1, P2) = 4 − 3 = 1, or, normalized, d(P1, P2) = (4 − 3) / 4 = 0.25.
  – A similarity for ordinal attributes can then be defined as s = 1 − d.
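The ordinal mapping above can be sketched as follows (the helper names are illustrative; the normalization divides by the number of levels minus 1, matching the slide's (4 − 3) / 4):

```python
# Dissimilarity for one ordinal attribute: map ranks to integers,
# then normalise by (number of levels - 1).
levels = ["poor", "fair", "OK", "good", "wonderful"]
rank = {v: i for i, v in enumerate(levels)}   # poor=0 ... wonderful=4

def ordinal_dissim(a, b):
    return abs(rank[a] - rank[b]) / (len(levels) - 1)

def ordinal_sim(a, b):
    return 1 - ordinal_dissim(a, b)

print(ordinal_dissim("wonderful", "good"))  # 0.25
print(ordinal_sim("wonderful", "good"))     # 0.75
```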


• For interval or ratio attributes, the dissimilarity between two objects is the absolute difference of their attribute values.


Similarity/Dissimilarity for Simple Attributes

• p and q are the attribute values for two data objects.


Dissimilarity Between Data Objects

• Distances:
• The Euclidean distance between two data objects (points) x and y is

  d(x, y) = sqrt( Σ k=1..n (xk − yk)² )

  where n is the number of dimensions and xk and yk are the kth attributes of x and y.
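The formula above translates directly into code; this minimal sketch uses the points p1 = (0, 2) and p2 = (2, 0) from the worked example:

```python
import math

# Euclidean distance: square root of the sum of squared
# per-dimension differences.
def euclidean(x, y):
    return math.sqrt(sum((xk - yk) ** 2 for xk, yk in zip(x, y)))

print(round(euclidean((0, 2), (2, 0)), 3))  # 2.828
```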


Euclidean Distance

(Figure: the four points plotted in the plane.)

point   x   y
p1      0   2
p2      2   0
p3      3   1
p4      5   1

Distance matrix:

        p1      p2      p3      p4
p1      0       2.828   3.162   5.099
p2      2.828   0       1.414   3.162
p3      3.162   1.414   0       2
p4      5.099   3.162   2       0

Minkowski Distance

  dist(p, q) = ( Σ k=1..n |pk − qk|^r )^(1/r)

Minkowski Distance: Examples

• r = 1: City block (Manhattan, taxicab, L1 norm) distance.
  – A common example of this is the Hamming distance, which is just the number of bits that differ between two binary vectors.
• r = 2: Euclidean distance.
• r → ∞: "supremum" (Lmax norm, L∞ norm) distance.
  – This is the maximum difference between any components of the vectors.
• Do not confuse r with n; all of these distances are defined for any number of dimensions.
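The three special cases can be handled by one function; a sketch (the helper name `minkowski` is illustrative):

```python
import math

# Minkowski distance for r = 1 (city block), r = 2 (Euclidean),
# and r = infinity (supremum / L_max).
def minkowski(p, q, r):
    if r == math.inf:
        return max(abs(a - b) for a, b in zip(p, q))
    return sum(abs(a - b) ** r for a, b in zip(p, q)) ** (1 / r)

p1, p2 = (0, 2), (2, 0)
print(minkowski(p1, p2, 1))            # 4.0 (L1)
print(round(minkowski(p1, p2, 2), 3))  # 2.828 (L2)
print(minkowski(p1, p2, math.inf))     # 2 (L_inf)
```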


Minkowski Distance

point   x   y
p1      0   2
p2      2   0
p3      3   1
p4      5   1

L1 distance matrix:

        p1      p2      p3      p4
p1      0       4       4       6
p2      4       0       2       4
p3      4       2       0       2
p4      6       4       2       0

L2 distance matrix:

        p1      p2      p3      p4
p1      0       2.828   3.162   5.099
p2      2.828   0       1.414   3.162
p3      3.162   1.414   0       2
p4      5.099   3.162   2       0

L∞ distance matrix:

        p1      p2      p3      p4
p1      0       2       3       5
p2      2       0       1       3
p3      3       1       0       2
p4      5       3       2       0

Common Properties of a Distance

• Distances, such as the Euclidean distance, have some well-known properties:
  1. d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 only if p = q. (Positive definiteness)
  2. d(p, q) = d(q, p) for all p and q. (Symmetry)
  3. d(p, r) ≤ d(p, q) + d(q, r) for all points p, q, and r. (Triangle inequality)
  where d(p, q) is the distance (dissimilarity) between points (data objects) p and q.
• A distance that satisfies these properties is a metric.

Example of a Non-Metric Dissimilarity

• Two sets A = {1, 2, 3, 4} and B = {2, 3, 4}.
• A − B = {1}, B − A = { }.
• Define the distance between A and B as d(A, B) = size(A − B), where size is a function returning the number of elements in a set.
• This distance measure violates positivity (d(B, A) = 0 although A ≠ B) and symmetry (d(A, B) = 1 but d(B, A) = 0).
• These properties hold if the dissimilarity measure is modified as D(A, B) = size(A − B) + size(B − A).
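The set example above can be checked directly with Python's built-in set difference (the function names `d` and `D` mirror the slide's notation):

```python
# Non-metric set "distance" and its symmetric fix.
def d(A, B):                 # size of A - B: not symmetric
    return len(A - B)

def D(A, B):                 # sum of both differences: symmetric
    return len(A - B) + len(B - A)

A, B = {1, 2, 3, 4}, {2, 3, 4}
print(d(A, B), d(B, A))  # 1 0  -> symmetry fails for d
print(D(A, B), D(B, A))  # 1 1  -> D is symmetric
```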

Common Properties of a Similarity

• Similarities also have some well-known properties:
  1. s(p, q) = 1 (or maximum similarity) only if p = q.
  2. s(p, q) = s(q, p) for all p and q. (Symmetry)
  where s(p, q) is the similarity between points (data objects) p and q.

Similarity Between Binary Vectors

• A common situation is that two objects, p and q, have only n binary attributes.
• p and q become binary vectors, which leads to the following four quantities:

  M01 = the number of attributes where p was 0 and q was 1
  M10 = the number of attributes where p was 1 and q was 0
  M00 = the number of attributes where p was 0 and q was 0
  M11 = the number of attributes where p was 1 and q was 1

• Simple Matching Coefficient (a similarity coefficient):

  SMC = number of matches / number of attributes
      = (M11 + M00) / (M01 + M10 + M11 + M00)

SMC versus Jaccard: Example

  J = number of 11 matches / number of not-both-zero attribute values
    = M11 / (M01 + M10 + M11)

Example:

  p = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
  q = (0, 0, 0, 0, 0, 0, 1, 0, 0, 1)

  M01 = 2 (the number of attributes where p was 0 and q was 1)
  M10 = 1 (the number of attributes where p was 1 and q was 0)
  M00 = 7 (the number of attributes where p was 0 and q was 0)
  M11 = 0 (the number of attributes where p was 1 and q was 1)

  SMC = (M11 + M00) / (M01 + M10 + M11 + M00) = (0 + 7) / (2 + 1 + 0 + 7) = 0.7
  J = M11 / (M01 + M10 + M11) = 0 / (2 + 1 + 0) = 0
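Both coefficients can be computed from the four match counts; a sketch reproducing the example above:

```python
# Count the four binary-match quantities, then compute SMC and Jaccard.
def match_counts(p, q):
    m01 = sum(a == 0 and b == 1 for a, b in zip(p, q))
    m10 = sum(a == 1 and b == 0 for a, b in zip(p, q))
    m00 = sum(a == 0 and b == 0 for a, b in zip(p, q))
    m11 = sum(a == 1 and b == 1 for a, b in zip(p, q))
    return m01, m10, m00, m11

def smc(p, q):
    m01, m10, m00, m11 = match_counts(p, q)
    return (m11 + m00) / (m01 + m10 + m11 + m00)

def jaccard(p, q):
    m01, m10, m00, m11 = match_counts(p, q)
    return m11 / (m01 + m10 + m11)

p = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
q = (0, 0, 0, 0, 0, 0, 1, 0, 0, 1)
print(smc(p, q))      # 0.7
print(jaccard(p, q))  # 0.0
```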

Cosine Similarity

• This similarity measure is commonly used for documents.
• Documents are often represented as vectors, where each attribute records the frequency of a particular word.
• Each document may have thousands or tens of thousands of attributes.
• Cosine similarity is one of the most common measures of document similarity.

Cosine Similarity

• If d1 and d2 are two document vectors, then

  cos(d1, d2) = (d1 • d2) / (||d1|| ||d2||)

  where • indicates the vector dot product and ||d|| is the length of vector d.

• Example: consider two document vectors:

  d1 = (3, 2, 0, 5, 0, 0, 0, 2, 0, 0)
  d2 = (1, 0, 0, 0, 0, 0, 0, 1, 0, 2)

  d1 • d2 = 3·1 + 2·0 + 0·0 + 5·0 + 0·0 + 0·0 + 0·0 + 2·1 + 0·0 + 0·2 = 5
  ||d1|| = (3² + 2² + 0² + 5² + 0² + 0² + 0² + 2² + 0² + 0²)^0.5 = 42^0.5 ≈ 6.481
  ||d2|| = (1² + 0² + 0² + 0² + 0² + 0² + 0² + 1² + 0² + 2²)^0.5 = 6^0.5 ≈ 2.449

  cos(d1, d2) ≈ 0.315
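The cosine computation above can be sketched directly:

```python
import math

# Cosine similarity: dot product divided by the product of the
# vector lengths (Euclidean norms).
def cosine(d1, d2):
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)

d1 = (3, 2, 0, 5, 0, 0, 0, 2, 0, 0)
d2 = (1, 0, 0, 0, 0, 0, 0, 1, 0, 2)
print(round(cosine(d1, d2), 4))  # 0.315
```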

• If the cosine similarity is 1, the angle between x and y is 0° and x and y are the same except for magnitude; if it is 0, the angle between them is 90°.

(Figure: the angle θ between two vectors x and y.)

Extended Jaccard Coefficient (Tanimoto)

• Can be used for document data.
• Reduces to the Jaccard coefficient for binary attributes.
• T(p, q) = (p • q) / (||p||² + ||q||² − p • q)

Correlation

• Correlation measures the linear relationship between objects.
• Example (perfect correlation): the following two pairs of vectors x and y are cases where the correlation is −1 and +1, respectively.
• x = (−3, 6, 0, 3, −6) and y = (1, −2, 0, −1, 2) [COMPUTE]
• x = (3, 6, 0, 3, 6) and y = (1, 2, 0, 1, 2) [COMPUTE]

• To compute correlation, we standardize the data objects p and q and then take the dot product of the standardized vectors, divided by n − 1:

  p′k = (pk − mean(p)) / std(p)
  q′k = (qk − mean(q)) / std(q)

  correlation(p, q) = (p′ • q′) / (n − 1)
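A sketch of this standardize-then-dot-product computation, assuming the sample standard deviation (division by n − 1); it reproduces the ±1 results for the two example vector pairs above:

```python
import math

# Pearson correlation via standardisation: subtract the mean, divide
# by the (sample) standard deviation, dot product, divide by n - 1.
def correlation(p, q):
    n = len(p)
    mp, mq = sum(p) / n, sum(q) / n
    sp = math.sqrt(sum((v - mp) ** 2 for v in p) / (n - 1))
    sq = math.sqrt(sum((v - mq) ** 2 for v in q) / (n - 1))
    ps = [(v - mp) / sp for v in p]
    qs = [(v - mq) / sq for v in q]
    return sum(a * b for a, b in zip(ps, qs)) / (n - 1)

print(round(correlation((-3, 6, 0, 3, -6), (1, -2, 0, -1, 2)), 6))  # -1.0
print(round(correlation((3, 6, 0, 3, 6), (1, 2, 0, 1, 2)), 6))      # 1.0
```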

Visually Evaluating Correlation

• Scatter plots showing correlations from −1 to 1.
• It is easy to visualize the correlation between two data objects x and y by plotting pairs of corresponding attribute values.

(Figure: such plots for x and y with 30 attributes; each circle represents one of the 30 attributes, its x-coordinate being the value of that attribute for x and its y-coordinate the value for y.)

Issues in Computing Proximity Measures

• 1. Handling cases where attributes have different scales and/or are correlated.
• 2. Computing proximity between objects when they have different types of attributes (qualitative/quantitative).
• 3. Handling attributes with different weights, i.e., when not all attributes contribute equally to the proximity of objects.

Mahalanobis Distance

  mahalanobis(p, q) = (p − q) Σ⁻¹ (p − q)ᵀ

  where Σ is the covariance matrix of the input data X.

• The Mahalanobis distance is useful when attributes are correlated, have different ranges of values, and the distribution is approximately Gaussian.
• For the red points in the figure, the Euclidean distance is 14.7 and the Mahalanobis distance is 6.

Mahalanobis Distance: Example

Covariance matrix:

  Σ = | 0.3  0.2 |
      | 0.2  0.3 |

  A = (0.5, 0.5)
  B = (0, 1)
  C = (1.5, 1.5)

  Mahal(A, B) = 5
  Mahal(A, C) = 4
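The example above can be verified in a few lines; this sketch hand-codes the 2×2 matrix inverse and computes the quadratic form (p − q) Σ⁻¹ (p − q)ᵀ exactly as the slide's formula states (note this is the squared form, which matches the values 5 and 4):

```python
# Mahalanobis distance (squared form, as in the slide's formula) for
# 2-D points, with a hand-coded 2x2 covariance-matrix inverse.
def mahalanobis_sq(p, q, cov):
    (a, b), (c, d) = cov
    det = a * d - b * c
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx = (p[0] - q[0], p[1] - q[1])
    # (p - q) * cov^-1 * (p - q)^T
    return (dx[0] * (inv[0][0] * dx[0] + inv[0][1] * dx[1])
            + dx[1] * (inv[1][0] * dx[0] + inv[1][1] * dx[1]))

cov = ((0.3, 0.2), (0.2, 0.3))
A, B, C = (0.5, 0.5), (0, 1), (1.5, 1.5)
print(round(mahalanobis_sq(A, B, cov), 6))  # 5.0
print(round(mahalanobis_sq(A, C, cov), 6))  # 4.0
```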

Combining Similarities for Heterogeneous Attributes

• Sometimes attributes are of many different types, but an overall similarity is needed.
• Compute a similarity sk(p, q) in [0, 1] for each attribute k separately, then combine them, e.g. as the average over the n attributes:

  similarity(p, q) = ( Σk sk(p, q) ) / n

Using Weights to Combine Similarities

• When some attributes are more important for calculating proximity than others, we may not want to treat all attributes the same.
• Use weights wk that are between 0 and 1 and sum to 1; the previous equation becomes

  similarity(p, q) = Σk wk sk(p, q)

• The Minkowski distance becomes:

  d(p, q) = ( Σk wk |pk − qk|^r )^(1/r)

Density

• Density-based clustering requires a notion of density.
• Examples:
  – Euclidean density: number of points per unit volume
  – Probability density
  – Graph-based density

Euclidean Density – Cell-Based

• The simplest approach is to divide the region into a number of rectangular cells of equal volume and define density as the number of points each cell contains.
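The cell-based approach above can be sketched as follows (the function name `cell_density` and the sample points are illustrative):

```python
from collections import Counter

# Cell-based Euclidean density: divide the plane into square cells
# of a given size and count the points falling in each cell.
def cell_density(points, cell_size=1.0):
    return Counter(
        (int(x // cell_size), int(y // cell_size)) for x, y in points
    )

pts = [(0.2, 0.3), (0.7, 0.9), (0.1, 0.5), (2.5, 2.5)]
dens = cell_density(pts)
print(dens[(0, 0)])  # 3 -> three points fall in the cell [0,1) x [0,1)
print(dens[(2, 2)])  # 1
```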

Euclidean Density – Center-Based

• Euclidean density is the number of points within a specified radius of the point.

Review Questions (Unit 2)

1. What are the issues related to data quality? Discuss.
2. What is the difference between noise and an outlier?
3. Data quality issues can be considered from an application point of view. Discuss.
4. Write a short note on data preprocessing.
5. Discuss the difference between similarity and dissimilarity.
6. State the 3 axioms that make a distance a metric.
7. Discuss similarity measures for binary variables.
8. What is SMC?
9. How is cosine similarity used for measuring the similarity between 2 document vectors?
10. Discuss issues related to the computation of proximity measures.
11. For the following vectors x and y, calculate the indicated similarity or distance measures:
    (a) x = (1, 1, 1, 1), y = (2, 2, 2, 2): cosine, correlation, Euclidean.
    (b) x = (0, 1, 0, 1), y = (1, 0, 1, 0): cosine, correlation, Euclidean, Jaccard, Hamming distance (for binary data the L1 distance corresponds to the Hamming distance).
