RESEARCH AND SURVEY STATISTICS – STA3022F
SOLUTION TO TUTORIAL #4 – Week 5, 2007
CLUSTER ANALYSIS

QUESTION 1: HOTEL PROFILE ANALYSIS STUDY

ITERATION 1: Merge C and D at a distance of 1.21. Revised distance matrix:

       A     B     CD    E     F     G     H     I     J     K
A    0.00
B    3.97  0.00
CD   3.82  1.39  0.00
E    6.83  5.33  5.08  0.00
F    4.07  3.49  3.11  4.44  0.00
G    3.52  2.07  1.73  6.48  3.36  0.00
H    4.40  2.45  3.01  7.10  4.65  2.37  0.00
I    2.99  2.59  1.85  5.82  2.77  1.50  2.91  0.00
J    4.30  2.32  2.78  6.09  3.51  2.50  1.62  2.68  0.00
K    4.75  2.83  3.54  5.33  3.59  3.82  2.90  3.62  1.88  0.00

ITERATION 2: Merge CD and B at a distance of 1.39. Revised distance matrix:

       A     CDB   E     F     G     H     I     J     K
A    0.00
CDB  3.82  0.00
E    6.83  5.08  0.00
F    4.07  3.11  4.44  0.00
G    3.52  1.73  6.48  3.36  0.00
H    4.40  2.45  7.10  4.65  2.37  0.00
I    2.99  1.85  5.82  2.77  1.50  2.91  0.00
J    4.30  2.32  6.09  3.51  2.50  1.62  2.68  0.00
K    4.75  2.83  5.33  3.59  3.82  2.90  3.62  1.88  0.00

ITERATION 3: Merge G and I at a distance of 1.50. Revised distance matrix:

       A     CDB   E     F     GI    H     J     K
A    0.00
CDB  3.82  0.00
E    6.83  5.08  0.00
F    4.07  3.11  4.44  0.00
GI   2.99  1.73  5.82  2.77  0.00
H    4.40  2.45  7.10  4.65  2.37  0.00
J    4.30  2.32  6.09  3.51  2.50  1.62  0.00
K    4.75  2.83  5.33  3.59  3.62  2.90  1.88  0.00

ITERATION 4: Merge H and J at a distance of 1.62. Revised distance matrix:

       A     CDB   E     F     GI    HJ    K
A    0.00
CDB  3.82  0.00
E    6.83  5.08  0.00
F    4.07  3.11  4.44  0.00
GI   2.99  1.73  5.82  2.77  0.00
HJ   4.30  2.32  6.09  3.51  2.37  0.00
K    4.75  2.83  5.33  3.59  3.62  2.90  0.00

ITERATION 5: Merge CDB and GI at a distance of 1.73. Revised distance matrix:

         A     CDBGI  E     F     HJ    K
A      0.00
CDBGI  2.99  0.00
E      6.83  5.08  0.00
F      4.07  2.77  4.44  0.00
HJ     4.30  2.32  6.09  3.51  0.00
K      4.75  2.83  5.33  3.59  2.90  0.00

ITERATION 6: Merge HJ and K at a distance of 1.88. Revised distance matrix:

         A     CDBGI  E     F     HJK
A      0.00
CDBGI  2.99  0.00
E      6.83  5.08  0.00
F      4.07  2.77  4.44  0.00
HJK    4.30  2.32  5.33  3.51  0.00

ITERATION 7: Merge CDBGI and HJK at a distance of 2.32. Revised distance matrix:

            A     CDBGIHJK  E     F
A         0.00
CDBGIHJK  2.99  0.00
E         6.83  5.08  0.00
F         4.07  2.77  4.44  0.00

ITERATION 8: Merge CDBGIHJK and F at a distance of 2.77. Revised distance matrix:

             A     CDBGIHJKF  E
A          0.00
CDBGIHJKF  2.99  0.00
E          6.83  4.44  0.00


ITERATION 9: Merge CDBGIHJKF and A at a distance of 2.99. Revised distance matrix (the merged cluster's distance to E is min(4.44, 6.83) = 4.44):

              CDBGIHJKFA  E
CDBGIHJKFA  0.00
E           4.44  0.00

ITERATION 10: Merge CDBGIHJKFA and E at a distance of 4.44.

Amalgamation schedule:

Iteration   Merged objects     Distance
1           C, D               1.21
2           CD, B              1.39
3           G, I               1.50
4           H, J               1.62
5           CDB, GI            1.73
6           HJ, K              1.88
7           CDBGI, HJK         2.32
8           CDBGIHJK, F        2.77
9           CDBGIHJKF, A       2.99
10          CDBGIHJKFA, E      4.44

Dendrogram:

Brief comment: There appear to be 5 clusters of hotels that emerge (using a cut-off of around 2.00). Hotels B, C, D, G, and I form cluster 1, and hotels H, J, and K form cluster 2. Of the remaining hotels, hotels F and A are somewhat dissimilar to the rest, and form their own clusters. Hotel E is completely dissimilar to the rest, and also forms its own cluster. You need to examine attribute evaluations (cluster profiles) to establish reasons for similarities and differences.
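The hand calculation above can be cross-checked with SciPy. This is only a sketch: it starts from the Iteration 1 matrix (C and D already merged into "CD") and uses single (nearest-neighbour) linkage, which is what the revised matrices above correspond to, since each revised distance is the minimum of the two merged objects' distances.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from scipy.spatial.distance import squareform

labels = ["A", "B", "CD", "E", "F", "G", "H", "I", "J", "K"]
D = np.array([  # symmetric version of the Iteration 1 distance matrix above
    [0.00, 3.97, 3.82, 6.83, 4.07, 3.52, 4.40, 2.99, 4.30, 4.75],
    [3.97, 0.00, 1.39, 5.33, 3.49, 2.07, 2.45, 2.59, 2.32, 2.83],
    [3.82, 1.39, 0.00, 5.08, 3.11, 1.73, 3.01, 1.85, 2.78, 3.54],
    [6.83, 5.33, 5.08, 0.00, 4.44, 6.48, 7.10, 5.82, 6.09, 5.33],
    [4.07, 3.49, 3.11, 4.44, 0.00, 3.36, 4.65, 2.77, 3.51, 3.59],
    [3.52, 2.07, 1.73, 6.48, 3.36, 0.00, 2.37, 1.50, 2.50, 3.82],
    [4.40, 2.45, 3.01, 7.10, 4.65, 2.37, 0.00, 2.91, 1.62, 2.90],
    [2.99, 2.59, 1.85, 5.82, 2.77, 1.50, 2.91, 0.00, 2.68, 3.62],
    [4.30, 2.32, 2.78, 6.09, 3.51, 2.50, 1.62, 2.68, 0.00, 1.88],
    [4.75, 2.83, 3.54, 5.33, 3.59, 3.82, 2.90, 3.62, 1.88, 0.00],
])

# Single-linkage agglomeration; each row of Z is (object i, object j, distance, cluster size)
Z = linkage(squareform(D), method="single")
print(np.round(Z, 2))                       # should mirror iterations 2-10 above

dendrogram(Z, labels=labels, no_plot=True)  # drop no_plot=True to draw the tree (needs matplotlib)

# Cutting at a distance of 2.0 gives the five groups discussed in the brief comment above
print(fcluster(Z, t=2.0, criterion="distance"))
```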


QUESTION 2: NEWSPAPERS SEGMENTATION ANALYSIS STUDY

Key point: in this example you MUST standardise the data before doing the cluster analysis, otherwise the READERSHIP variable (which is of a much higher order of magnitude) will dominate the analysis unfairly.

Unstandardised data:

            A      B      C      D      E      Mean    Std. Dev.
READER    3000   2000   6000   2000   5000    3600     1816.59
ARTICLE      2      7      7      4      7       5.4      2.30
ADVERT      10      2      9      4      7       6.4      3.36
COMMUN       6      4      2      4      3       3.8      1.48

Standardised data:

            A        B        C        D        E
READER   -0.3303  -0.8808   1.3212  -0.8808   0.7707
ARTICLE  -1.4769   0.6950   0.6950  -0.6081   0.6950
ADVERT    1.0709  -1.3089   0.7735  -0.7140   0.1785
COMMUN    1.4832   0.1348  -1.2136   0.1348  -0.5394
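The standardisation can be reproduced with a few lines of code (a minimal sketch; columns follow the order READER, ARTICLE, ADVERT, COMMUN used above):

```python
import numpy as np

# Raw data for newspapers A to E: READER, ARTICLE, ADVERT, COMMUN
X = np.array([
    [3000, 2, 10, 6],   # A
    [2000, 7,  2, 4],   # B
    [6000, 7,  9, 2],   # C
    [2000, 4,  4, 4],   # D
    [5000, 7,  7, 3],   # E
], dtype=float)

# z-scores using the sample standard deviation (ddof=1), which is what
# gives the 1816.59 reported for READER above
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
print(np.round(Z, 4))
```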

ITERATION #0: Initial distance matrix

         A        B        C        D        E
A      0.0000   3.5358   3.8478   2.4621   3.2888
B      3.5358   0.0000   3.3171   1.4325   2.3225
C      3.8478   3.3171   0.0000   3.2523   1.0543
D      2.4621   1.4325   3.2523   0.0000   2.3825
E      3.2888   2.3225   1.0543   2.3825   0.0000
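A quick way to confirm the whole matrix at once (a sketch using SciPy's pairwise Euclidean distances on the standardised scores from the table above):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Standardised scores for A to E: READER, ARTICLE, ADVERT, COMMUN
Zs = np.array([
    [-0.3303, -1.4769,  1.0709,  1.4832],   # A
    [-0.8808,  0.6950, -1.3089,  0.1348],   # B
    [ 1.3212,  0.6950,  0.7735, -1.2136],   # C
    [-0.8808, -0.6081, -0.7140,  0.1348],   # D
    [ 0.7707,  0.6950,  0.1785, -0.5394],   # E
])

D0 = squareform(pdist(Zs))      # Euclidean by default; gives the 5 x 5 matrix above
print(np.round(D0, 4))          # D0[0, 1] is the A-B distance, 3.5358
```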

e.g. the distance between A and B:

D = √[(−0.3303 + 0.8808)² + (−1.4769 − 0.6950)² + (1.0709 + 1.3089)² + (1.4832 − 0.1348)²] = 3.5358

ITERATION #1

STEP 1: Merge C and E at distance 1.05.
STEP 2: Revise the distance matrix.

2.1) Compute the average attribute profile for the two merged objects:

           C        E        Average
READER    1.3212   0.7707   1.0459
ARTICLE   0.6950   0.6950   0.6950
ADVERT    0.7735   0.1785   0.4760
COMMUN   -1.2136  -0.5394  -0.8765

2.2) Revise the data matrix:

           A        B        CE       D
READER   -0.3303  -0.8808   1.0459  -0.8808
ARTICLE  -1.4769   0.6950   0.6950  -0.6081
ADVERT    1.0709  -1.3089   0.4760  -0.7140
COMMUN    1.4832   0.1348  -0.8765   0.1348

2.3) Recompute the distance matrix:

         A        B        CE       D
A      0.0000   3.5358   3.5402   2.4621
B      3.5358   0.0000   2.8144   1.4325
CE     3.5402   2.8144   0.0000   2.8016
D      2.4621   1.4325   2.8016   0.0000
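The same iteration can be sketched by hand in code (this mirrors steps 2.1–2.3 above rather than calling a clustering library):

```python
import numpy as np

# Standardised profiles (READER, ARTICLE, ADVERT, COMMUN)
profiles = {
    "A": np.array([-0.3303, -1.4769,  1.0709,  1.4832]),
    "B": np.array([-0.8808,  0.6950, -1.3089,  0.1348]),
    "C": np.array([ 1.3212,  0.6950,  0.7735, -1.2136]),
    "D": np.array([-0.8808, -0.6081, -0.7140,  0.1348]),
    "E": np.array([ 0.7707,  0.6950,  0.1785, -0.5394]),
}

# 2.1) average attribute profile of the merged pair C and E
CE = (profiles["C"] + profiles["E"]) / 2

# 2.3) Euclidean distances from the new object CE to the remaining objects
for name in ["A", "B", "D"]:
    d = np.sqrt(((CE - profiles[name]) ** 2).sum())
    # roughly 3.5402, 2.8144, 2.8016 (the last digit may differ by one
    # because the tutorial rounds the averaged profile before squaring)
    print(name, round(d, 4))
```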

ITERATION #2

STEP 1: Merge B and D at distance 1.4325.
STEP 2: Revise the distance matrix.

2.1) Compute the average attribute profile for the two merged objects:

           B        D        Average
READER   -0.8808  -0.8808   -0.8808
ARTICLE   0.6950  -0.6081    0.0434
ADVERT   -1.3089  -0.7140   -1.0114
COMMUN    0.1348   0.1348    0.1348

2.2) Revise the data matrix:

           A        BD       CE
READER   -0.3303  -0.8808   1.0459
ARTICLE  -1.4769   0.0434   0.6950
ADVERT    1.0709  -1.0114   0.4760
COMMUN    1.4832   0.1348  -0.8765

2.3) Recompute the distance matrix:

         A        BD       CE
A      0.0000   2.9612   3.5402
BD     2.9612   0.0000   2.7151
CE     3.5402   2.7151   0.0000


ITERATION #3

STEP 1: Merge BD and CE at distance 2.7151.
STEP 2: Revise the distance matrix.

2.1) Compute the average attribute profile for the two merged objects:

           BD       CE       Average
READER   -0.8808   1.0459    0.0826
ARTICLE   0.0434   0.6950    0.3692
ADVERT   -1.0114   0.4760   -0.2677
COMMUN    0.1348  -0.8765   -0.3708

2.2) Revise the data matrix:

           A        BDCE
READER   -0.3303   0.0826
ARTICLE  -1.4769   0.3692
ADVERT    1.0709  -0.2677
COMMUN    1.4832  -0.3708

2.3) Recompute the distance matrix:

         A        BDCE
A      0.0000   2.9678
BDCE   2.9678   0.0000

ITERATION #4

At the final step, merge A and BDCE at a distance of 2.9678.

AMALGAMATION SCHEDULE:

Iteration   Merged objects   Distance
1           C, E             1.0543
2           B, D             1.4325
3           BD, CE           2.7151
4           A, BDCE          2.9678


DENDROGRAM:

There appear to be 3 distinct clusters (A, BD and CE), suggesting the cut-off point should be around 2. To profile the clusters, average the attribute values within each cluster. In this case it is possible to use either the RAW or the STANDARDISED data. Using the standardised data is suggested, as you've already worked out the relevant numbers.

Profiles:

          Cluster A   Cluster BD   Cluster CE
READER     -0.3303     -0.8808      1.0459
ARTICLE    -1.4769      0.0434      0.6950
ADVERT      1.0709     -1.0114      0.4760
COMMUN      1.4832      0.1348     -0.8765

Brief suggested Interpretation: Cluster BD is characterised by a small readership and very few adverts. It offers a moderate degree of quality in its articles and community news. Cluster CE is characterised by a large readership, high standard of articles, but poor community news. It has an average amount of advertising. Cluster A has moderate-to-low readership, many adverts, poor articles, but excellent community news.
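The profiles above are just per-cluster means of the standardised scores, so they can be reproduced directly (a small sketch reusing the standardised values):

```python
import numpy as np

Z = {  # standardised scores: READER, ARTICLE, ADVERT, COMMUN
    "A": [-0.3303, -1.4769,  1.0709,  1.4832],
    "B": [-0.8808,  0.6950, -1.3089,  0.1348],
    "C": [ 1.3212,  0.6950,  0.7735, -1.2136],
    "D": [-0.8808, -0.6081, -0.7140,  0.1348],
    "E": [ 0.7707,  0.6950,  0.1785, -0.5394],
}
clusters = {"A": ["A"], "BD": ["B", "D"], "CE": ["C", "E"]}

for name, members in clusters.items():
    profile = np.mean([Z[m] for m in members], axis=0)  # per-cluster mean profile
    print(name, np.round(profile, 4))
```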


QUESTION 3: AGRICULTURAL CONTROL BOARDS STUDY

1. Completed Euclidean distance matrix (Cluster Analysis Q4):

           Meat   Maize  Dec    Citrus  Sugar  Wheat
Meat       0.00   2.13   1.53   2.27    0.65   1.99
Maize      2.13   0.00   1.74   2.46    1.62   0.38
Dec        1.53   1.74   0.00   0.88    1.09   1.62
Citrus     2.27   2.46   0.88   0.00    1.94   2.32
Sugar      0.65   1.62   1.09   1.94    0.00   1.54
Wheat      1.99   0.38   1.62   2.32    1.54   0.00

2. Dendrogram

Amalgamation schedule (Cluster Analysis Q4), single linkage, Euclidean distances:

Step   Linkage distance   Objects merged
1      0.382              Maize, Wheat
2      0.648              Meat, Sugar
3      0.880              Dec, Citrus
4      1.094              Meat Sugar, Dec Citrus
5      1.543              Meat Sugar Dec Citrus, Maize Wheat

[Tree diagram for 6 cases, single linkage, Euclidean distances: leaves ordered Meat, Sugar, Dec, Citrus, Maize, Wheat; vertical axis shows linkage distance from 0.2 to 1.6.]
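The tree can be reproduced from the completed distance matrix in part 1 (a sketch with SciPy; because the matrix above is rounded to two decimals, the printed merge heights match the schedule only up to rounding):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from scipy.spatial.distance import squareform

labels = ["Meat", "Maize", "Dec", "Citrus", "Sugar", "Wheat"]
D = np.array([
    [0.00, 2.13, 1.53, 2.27, 0.65, 1.99],
    [2.13, 0.00, 1.74, 2.46, 1.62, 0.38],
    [1.53, 1.74, 0.00, 0.88, 1.09, 1.62],
    [2.27, 2.46, 0.88, 0.00, 1.94, 2.32],
    [0.65, 1.62, 1.09, 1.94, 0.00, 1.54],
    [1.99, 0.38, 1.62, 2.32, 1.54, 0.00],
])

Z = linkage(squareform(D), method="single")
print(np.round(Z, 3))                       # single-linkage amalgamation schedule
dendrogram(Z, labels=labels, no_plot=True)  # drop no_plot=True to draw it (needs matplotlib)

# A cut-off of 1.0 gives the three clusters discussed in part 3 below
print(fcluster(Z, t=1.0, criterion="distance"))
```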

3. Discussion of findings: if the cut-off is set at 1.0, then three clusters emerge.

Cluster 1:

              Meat   Sugar   Cluster mean
Budget        1.52   1.45    1.485
Net export    1.38   1.90    1.64
Membership    0.68   1.06    0.87

Description: These control boards have a low budget, modest exports and very few members.

Cluster 2:

              Deciduous fruit   Citrus fruit   Cluster mean
Budget        1.79              1.95           1.87
Net export    1.89              1.56           1.725
Membership    2.10              2.90           2.50

Description: These control boards have a very limited budget and modest exports, but a large membership base.

Cluster 3:

              Maize   Wheat   Cluster mean
Budget        2.22    2.48    2.35
Net export    3.32    3.04    3.18
Membership    1.20    1.20    1.20

Description: These control boards have large budgets and undertake large-scale exports, but have few members.

If the cut-off were set at 1.2, then clusters 1 and 2 would merge. Profile-wise, they are both "small" in terms of budget size and level of exports, but they differ on membership size: the Deciduous/Citrus boards have a larger number of members than the Meat/Sugar boards.

4. Standardisation issues

Standardisation removes the influence of scale on the calculation of the distance matrix. This is important because a failure to standardise can result in variables measured in a small natural unit (e.g. grams) dominating the distance calculations at the expense of variables measured in a large natural unit (e.g. tons). In this case, all variables have already been informally standardised so that the numbers in the table are of similar magnitudes (around 1). No statistical standardisation (subtract the mean, divide by the standard deviation) is required. If we want to find out what the standardised profiles would be for each cluster, we need to know the mean and standard deviation of each variable. These are (you need to know how to calculate these – see Stats 1!):

              Mean   Std dev
Budget        1.90   0.40
Net export    2.18   0.80
Membership    1.52   0.82

Standardised scores can then be calculated in the usual way: Std score = (Unstd score – Mean)/StdDev

              Cluster 1                 Cluster 2                 Cluster 3
              Unstd mean   Std mean     Unstd mean   Std mean     Unstd mean   Std mean
Budget        1.49         -1.04        1.87         -0.08        2.35          1.12
Net export    1.64         -0.67        1.73         -0.57        3.18          1.24
Membership    0.87         -0.80        2.50          1.19        1.20         -0.39
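These standardised cluster means are easy to check (a sketch using the means and standard deviations above; small differences such as −0.79 versus −0.80 are just rounding in the table):

```python
import numpy as np

means = np.array([1.90, 2.18, 1.52])     # Budget, Net export, Membership
stds  = np.array([0.40, 0.80, 0.82])

cluster_means = {                        # unstandardised cluster means from the tables above
    "Cluster 1": [1.485, 1.64, 0.87],
    "Cluster 2": [1.87, 1.725, 2.50],
    "Cluster 3": [2.35, 3.18, 1.20],
}

for name, m in cluster_means.items():
    print(name, np.round((np.array(m) - means) / stds, 2))
```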


QUESTION 4: MAGAZINE CONTENT AND PREFERENCE STUDY

(1)

H0: µCluster1(Décor) = µCluster2(Décor)
H1: µCluster1(Décor) ≠ µCluster2(Décor)

MSTreatment = Between SS / df = 8.06 / 1   = 8.06
MSError     = Within SS / df  = 50.93 / 58 = 0.87
F     = MST / MSE = 8.06 / 0.87 = 9.26
Fcrit = F(1,58)(0.05) ≈ 4.00

Fstat > Fcrit, so reject H0 and conclude that the DECOR attribute average value is not the same between clusters. DECOR is therefore able to significantly discriminate between the two clusters.

(2)

H0: µCluster1(Garden) = µCluster2(Garden) = µCluster3(Garden)
H1: At least one mean differs from the rest

MST   = 12.54 / 2     = 6.27
MSE   = 46.45 / 57    = 0.81
F     = 6.27 / 0.81   = 7.74
Fcrit = F(2,57)(0.05) ≈ 3.15

Fstat > Fcrit, so reject H0 and conclude that the GARDEN attribute average value is not the same across the 3 clusters. GARDEN is therefore able to significantly discriminate between the three clusters.

(3)

For the BUY attribute:

MST   = 33.16 / 2     = 16.58
MSE   = 25.83 / 57    = 0.45
F     = 16.58 / 0.45  = 36.85
Fcrit = F(2,57)(0.05) ≈ 3.15

Fstat > Fcrit, so BUY is also clearly significant. The significant predictor variables are therefore BUY, DÉCOR, GARDEN and PAY.
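The critical values quoted above (≈ 4.00 and ≈ 3.15) and the corresponding p-values can also be obtained from the F distribution; a sketch with scipy.stats, using the F statistics computed above:

```python
from scipy import stats

tests = {  # attribute: (F statistic, numerator df, denominator df)
    "DECOR":  (9.26, 1, 58),
    "GARDEN": (7.74, 2, 57),
    "BUY":    (36.85, 2, 57),
}

for name, (F, dfn, dfd) in tests.items():
    Fcrit = stats.f.ppf(0.95, dfn, dfd)   # 5% critical value
    p = stats.f.sf(F, dfn, dfd)           # p-value of the observed F
    print(f"{name}: F = {F}, Fcrit = {Fcrit:.2f}, p = {p:.4f}")
```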

(iv)

Cluster profile for Cluster 1 on the significant variables only (from TABLE 10):

            BUY    DECOR   GARDEN   PAY
Cluster 1   1.07   0.23    -0.37    0.78

Respondents in Cluster 1 buy considerably more lifestyle magazines than those in other clusters, and are also willing to pay more for a magazine than those in other clusters. They have a slightly above-average interest in décor and a slightly below-average interest in gardening features, but these are secondary effects. Essentially, this is a cluster of strong fans of lifestyle magazines.


(v)

Noting from Figures 4.5 and 4.6 that each cluster has 30 observations in it, one can complete the table:

             NEED = Yes   NEED = No   Row totals
Cluster 1        19           11          30
Cluster 2        16           14          30
Totals           35           25          60

CLUSTER 1: 19 "yes" responses out of 30 = 63%
CLUSTER 2: 16 "yes" responses out of 30 = 53%

H0: There is no significant association between clusters and the perceived need for a new lifestyle magazine.
H1: There is a significant association between clusters and the perceived need for a new lifestyle magazine.

The Pearson chi-squared statistic given in Figure 4.9 is 0.617; compare this to the critical chi-squared value (at the 5% level, with 1 degree of freedom) of 3.84. Since the test statistic is less than the critical value, one cannot reject the null hypothesis of no association at the 5% level. Cluster membership appears to have no significant association with the perceived need for a new magazine.
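The statistic and critical value can be checked from the completed table (a sketch with scipy.stats; correction=False switches off the Yates continuity correction so the plain Pearson statistic of 0.617 is reproduced):

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

table = np.array([[19, 11],    # Cluster 1: NEED = Yes, NEED = No
                  [16, 14]])   # Cluster 2

stat, p, dof, expected = chi2_contingency(table, correction=False)
print(round(stat, 3), round(p, 3), dof)    # 0.617, about 0.43, 1
print(round(chi2.ppf(0.95, df=1), 2))      # 3.84 critical value at the 5% level
```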

(vi)

Up to you to provide your own interpretation. Both the 2-cluster and 3-cluster models are able to pick up the cluster of disinterested readers (Cluster 2 in both solutions). The 3-cluster model has the apparent advantage of distinguishing between readers who want more décor and more gardening articles (Cluster 3) and readers who are more or less happy with the current format (Cluster 1). Fortunately for the magazine publishers, those who are more or less happy with the current format are the ones who tend to buy more magazines and pay a premium. Those who desire a different format (one with more décor and gardening features) tend not to buy many magazines, and the ones they do buy, they buy at an average price. Because the 3-cluster solution picks up these important groupings, it seems preferable to the 2-cluster solution.

(vii)

Question doesn’t specify whether you should use the 2-cluster or 3-cluster solution, so we’ll look at both. For the 2-cluster solution Euclidean distance of respondent 10 to Cluster 1 = 1.55 Euclidean distance of respondent 10 to Cluster 2 = 3.51 So, respondent 10 is closest to cluster 1 and should be included in that group Euclidean distance of respondent 10 to Cluster 1 = 1.73 Euclidean distance of respondent 10 to Cluster 2 = 3.80 Euclidean distance of respondent 10 to Cluster 3 = 2.26 So, respondent 10 is closest to cluster 1 and should be included in that group

(viii)

Again, the question hasn't specified whether you should use the 2-cluster or the 3-cluster solution. This solution is just for the 2-cluster solution (the 3-cluster solution is left to you).


Euclidean distance of the new respondent to Cluster 1 = 1.10
Euclidean distance of the new respondent to Cluster 2 = 1.28

So the new respondent is closer to Cluster 1 than to Cluster 2 and should be included in Cluster 1. To compute how this changes the centroid, just note that:
• The previous Cluster 1 centroid (without the new respondent) was (0.55, 0.36, 0.20, 0.81) and had 30 members.
• An average is calculated as the sum of all observations divided by the number of observations.
• The old sum for each attribute was therefore 30 × (0.55, 0.36, 0.20, 0.81) = (16.58, 11.00, 6.12, 24.31) (the sums use the unrounded centroid values, hence the small differences from multiplying the rounded figures).
• The new sum is (16.58 + 0.2, 11.00 − 0.4, 6.12 + 0.3, 24.31 + 0.1) = (16.77, 10.60, 6.43, 24.41).
• Finally, the new average/centroid is (1/31) × (16.77, 10.60, 6.43, 24.41) = (0.54, 0.34, 0.21, 0.78).
• Note how the new Cluster 1 centroid is very close to the old one. This is expected: one new observation shouldn't change a centroid very much.
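The update itself is a one-liner. This sketch uses the rounded centroid (0.55, 0.36, 0.20, 0.81) and the new respondent's values implied by the sums above, so the last decimal can differ slightly from the figures quoted, which were based on unrounded sums:

```python
import numpy as np

old_centroid = np.array([0.55, 0.36, 0.20, 0.81])   # Cluster 1 centroid over 30 members
n = 30
new_obs = np.array([0.2, -0.4, 0.3, 0.1])           # the new respondent's standardised values

new_centroid = (n * old_centroid + new_obs) / (n + 1)
print(np.round(new_centroid, 2))                    # approximately (0.54, 0.34, 0.20, 0.79)
```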
