Cluster Analysis: Analysis and Output Interpretation using Hierarchical Cluster Technique & SPSS 6.0
Dr. Rohit Vishal Kumar
Reader, Department of Marketing Xavier Institute of Social Service
PO Box No 7, Purulia Road Ranchi - 834001 Email:
[email protected] All trademarks & Copyrights Acknowledged. Presentation Copyright Rohit Vishal Kumar 2002
Cluster Analysis - Introduction
• Cluster analysis is a multivariate analysis technique that seeks to organize information about variables so that relatively homogeneous groups, or "clusters," can be formed. The clusters formed with this family of methods should be highly internally homogeneous (members are similar to one another) and highly externally heterogeneous (members are not like members of other clusters).
• Although cluster analysis is relatively simple and can use a variety of input data, it is a relatively new technique and is not supported by a comprehensive body of statistical literature. Most of the guidelines for using cluster analysis are therefore rules of thumb, and some authors caution that researchers should use cluster analysis
Cluster Analysis - Key Features
• Cluster analysis is not so much a typical statistical test as a "collection" of different algorithms that "put objects into clusters."
• Cluster analysis methods are mostly used when we do not have any a priori hypotheses and are still in the exploratory phase of research. In a sense, cluster analysis finds the "most significant solution possible." Therefore, statistical significance testing is really not appropriate here.
Cluster Analysis - Applications
• Medicine: clustering diseases, cures for diseases, or symptoms of diseases can lead to very useful classifications and better diagnosis.
• Psychiatry: the correct diagnosis of clusters of symptoms such as paranoia, schizophrenia, etc. is essential for successful therapy.
• Archeology: researchers have attempted to establish taxonomies of stone tools, funeral objects, etc. by applying cluster analytic techniques.
• Marketing: researchers have used cluster analysis to identify the closeness or difference (real or perceived) between brand images, to identify relatively homogeneous market segments, to identify similarities in communication ideas, etc.
In general, whenever one needs to classify a "mountain" of information into manageable, meaningful piles, cluster analysis is of great utility.
Four Common Distance Measures
• Euclidean distance. This is probably the most commonly chosen type of distance. It is simply the geometric distance in the multidimensional space. It is computed as:
  $\mathrm{distance}(x, y) = \left\{ \sum_i (x_i - y_i)^2 \right\}^{1/2}$
• Note: Euclidean (and squared Euclidean) distances are usually computed from raw data, and not from standardized data.
• Advantage: the distance between any two objects is not affected by the addition of new objects to the analysis, which may be outliers.
• Disadvantage: the distances can be greatly affected by differences in scale among the dimensions from which the distances are computed.
For example, if one of the dimensions denotes a measured length in centimeters, and you then convert it to millimeters (by multiplying the values by 10), the resulting Euclidean or squared Euclidean distances (computed from multiple dimensions) can be greatly affected, and consequently the results of cluster analyses may be very different.
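The scale effect described above can be checked numerically. The following is a minimal sketch in Python (not SPSS); the three objects and their (length, weight) values are hypothetical and chosen only for illustration.

```python
import math

def euclidean(x, y):
    """Geometric distance between two equal-length numeric sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Hypothetical objects measured as (length in cm, weight in kg).
objects_cm = {"a": (10.0, 3.0), "b": (12.0, 7.0), "c": (10.5, 8.0)}

# Same objects with the length converted to millimeters (multiplied by 10).
objects_mm = {name: (length * 10, weight)
              for name, (length, weight) in objects_cm.items()}

def nearest_to_a(objects):
    """Return the name of the object closest to 'a' and both distances."""
    distances = {name: euclidean(objects["a"], objects[name]) for name in ("b", "c")}
    return min(distances, key=distances.get), distances

nearest_cm, dist_cm = nearest_to_a(objects_cm)
nearest_mm, dist_mm = nearest_to_a(objects_mm)
print("length in cm:", nearest_cm, dist_cm)
print("length in mm:", nearest_mm, dist_mm)
```

With the length in centimeters, object b is the nearest neighbour of a; after the conversion to millimeters, object c is. Nothing about the objects has changed, only the unit of one dimension, which is why the choice of scale (or standardization) deserves attention before clustering.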
Four Common Distance Measures
• Squared Euclidean distance. One may want to square the standard Euclidean distance in order to place progressively greater weight on objects that are further apart. This distance is computed as:
  $\mathrm{distance}(x, y) = \sum_i (x_i - y_i)^2$
• City-block (Manhattan) distance. This distance is simply the sum of the absolute differences across dimensions. In most cases, this distance measure yields results similar to the simple Euclidean distance. However, note that in this measure the effect of single large differences (outliers) is dampened (since they are not squared). The city-block distance is computed as:
  $\mathrm{distance}(x, y) = \sum_i |x_i - y_i|$
• Chebychev distance. This distance measure may be appropriate in cases when one wants to define two objects as "different" if they are different on any one of the dimensions. The Chebychev distance is computed as:
  $\mathrm{distance}(x, y) = \max_i |x_i - y_i|$
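The four measures can be compared side by side on a single pair of cases. The sketch below is plain Python (not SPSS) and uses the V1 to V6 ratings of respondents 1 and 2 from the raw data of the example in this deck; the function names are illustrative.

```python
import math

def squared_euclidean(x, y):
    """Sum of squared differences across dimensions."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def euclidean(x, y):
    """Square root of the squared Euclidean distance."""
    return math.sqrt(squared_euclidean(x, y))

def city_block(x, y):
    """Sum of absolute differences across dimensions."""
    return sum(abs(a - b) for a, b in zip(x, y))

def chebychev(x, y):
    """Largest absolute difference on any single dimension."""
    return max(abs(a - b) for a, b in zip(x, y))

# Ratings on V1..V6 for respondents 1 and 2 (from the raw data table).
respondent_1 = [6, 4, 7, 3, 2, 3]
respondent_2 = [2, 3, 1, 4, 5, 4]

print("Euclidean        :", euclidean(respondent_1, respondent_2))          # 8.0
print("Squared Euclidean:", squared_euclidean(respondent_1, respondent_2))  # 64
print("City-block       :", city_block(respondent_1, respondent_2))         # 16
print("Chebychev        :", chebychev(respondent_1, respondent_2))          # 6
```

The absolute differences on the six variables are 4, 1, 6, 1, 3 and 1. The squared measure is dominated by the two largest differences (16 + 36 out of 64), the city-block measure weights all six equally, and the Chebychev measure looks only at the largest one.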
Cluster Analysis
The Example and SPSS Procedure
The Raw Data
Respondent   V1   V2   V3   V4   V5   V6
     1        6    4    7    3    2    3
     2        2    3    1    4    5    4
     3        7    2    6    4    1    3
     4        4    6    4    5    3    6
     5        1    3    2    2    6    4
     6        6    4    6    3    3    4
     7        5    3    6    3    3    4
     8        7    3    7    4    1    4
     9        2    4    3    3    6    3
    10        3    5    3    6    4    6
    11        1    3    2    3    5    3
    12        5    4    5    4    2    4
    13        2    2    1    5    4    4
    14        4    6    4    6    4    7
    15        6    5    4    2    1    4
    16        3    5    4    6    4    7
    17        4    4    7    2    2    5
    18        3    7    2    6    4    3
    19        4    6    3    7    2    7
    20        2    3    2    4    7    2
The above data were collected from 20 respondents. The respondents were asked to rate the following statements on a 7-point scale:
V1 : Shopping is fun
V2 : Shopping is bad for your budget
V3 : I combine shopping with eating out
V4 : I try to get the best buys while shopping
V5 : I don’t care about shopping
V6 : You can save money by comparing prices
SCALE USED
1 = Completely Disagree
4 = Neither Agree Nor Disagree
7 = Completely Agree
SPSS Screen 1 The data entry screen in SPSS
SPSS Screen 2 : Hierarchical Cluster
Choose Statistics -> Data Reduction -> Hierarchical Cluster. We are shown the Hierarchical Cluster screen as follows:
1. Select all six variables (V1-V6) and transfer them to the Variable(s) box
2. Select Cluster “Cases”
3. Select Display “Statistics” and “Plots”
4. Press the “Statistics” button
SPSS Screen 3 : Hierarchical Cluster
On pressing the “Statistics” button we are shown the following screen:
1. “Agglomeration Schedule” and “Cluster Membership -> None” should be checked by default. If not, select these options
2. Press “Continue”
3. Select “Plots” from Screen 2
SPSS Screen 4 : Hierarchical Cluster
On pressing the “Plots” button we are shown the following screen:
1. Select “Dendrogram”
2. Select “All Icicles”
3. Select Orientation “Vertical”
4. Select “Methods” from Screen 2
SPSS Screen 5 : Hierarchical Cluster
On pressing the “Methods” button we are shown the following screen:
1. Choose in Cluster Method: “Between Group Linkage”
2. Select in Measure “Interval” and select “Squared Euclidean Distances”
3. Select in “Transform Values” “none” in the Standardize dropdown list
4. Select “Continue”
5. In Screen 2 select “OK”
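For readers without SPSS, the choices made in the screens above (cluster cases, between-groups average linkage, squared Euclidean distance, no standardization) can be sketched in plain Python. This is not the SPSS implementation: the function names are illustrative, and ties between equally close clusters are resolved in whatever order the loop meets them. Several early distances in this data set are tied, so the order of stages and some intermediate coefficients may differ from the SPSS listing.

```python
from itertools import combinations

# Ratings V1..V6 for respondents 1..20, transcribed from the raw data table.
DATA = [
    [6, 4, 7, 3, 2, 3], [2, 3, 1, 4, 5, 4], [7, 2, 6, 4, 1, 3], [4, 6, 4, 5, 3, 6],
    [1, 3, 2, 2, 6, 4], [6, 4, 6, 3, 3, 4], [5, 3, 6, 3, 3, 4], [7, 3, 7, 4, 1, 4],
    [2, 4, 3, 3, 6, 3], [3, 5, 3, 6, 4, 6], [1, 3, 2, 3, 5, 3], [5, 4, 5, 4, 2, 4],
    [2, 2, 1, 5, 4, 4], [4, 6, 4, 6, 4, 7], [6, 5, 4, 2, 1, 4], [3, 5, 4, 6, 4, 7],
    [4, 4, 7, 2, 2, 5], [3, 7, 2, 6, 4, 3], [4, 6, 3, 7, 2, 7], [2, 3, 2, 4, 7, 2],
]

def sq_euclidean(x, y):
    """Squared Euclidean distance between two cases."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def average_linkage(rows):
    """Agglomerate cases; return a list of (cluster_1, cluster_2, coefficient)."""
    # Each cluster is labelled by its lowest case number (cases are 1-based).
    clusters = {i + 1: [i + 1] for i in range(len(rows))}
    schedule = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(sorted(clusters), 2):
            # Between-groups linkage: mean distance over all cross-cluster pairs.
            total = sum(sq_euclidean(rows[p - 1], rows[q - 1])
                        for p in clusters[a] for q in clusters[b])
            avg = total / (len(clusters[a]) * len(clusters[b]))
            if best is None or avg < best[0]:
                best = (avg, a, b)
        coef, a, b = best
        schedule.append((a, b, coef))
        clusters[a] = clusters[a] + clusters.pop(b)  # keep the lower label
    return schedule

schedule = average_linkage(DATA)
for stage, (a, b, coef) in enumerate(schedule, start=1):
    print(f"{stage:>2}  {a:>2}  {b:>2}  {coef:10.6f}")
```

Each printed row corresponds to one stage of an agglomeration schedule: the two cluster labels that were combined and the average squared Euclidean distance at which they were joined.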
Cluster Analysis
The SPSS Output
SPSS Output 1 : Hierarchical Cluster
The following output “Proximities” is displayed by SPSS

Data Information
20 unweighted cases accepted.
0 cases rejected because of missing value.
Squared Euclidean measure used.

* * * * * * * * * * * * * * P R O X I M I T I E S * * * * * * * * * * * * * *

Agglomeration Schedule using Average Linkage (Between Groups)

         Clusters Combined                    Stage Cluster 1st Appears     Next
Stage    Cluster 1   Cluster 2   Coefficient    Cluster 1   Cluster 2      Stage
   1        14          16         2.000000         0           0             3
   2         6           7         2.000000         0           0             7
   3        10          14         3.000000         0           1             8
   4         2          13         3.000000         0           0            14
   5         5          11         3.000000         0           0             9
   6         3           8         3.000000         0           0            15
   7         6          12         4.000000         2           0            10
   8         4          10         4.333333         0           3            11
   9         5           9         4.500000         5           0            12
  10         1           6         5.000000         0           7            13
  11         4          19         7.250000         8           0            17
  12         5          20         7.333333         9           0            14
  13         1          17         8.250000        10           0            15
  14         2           5        10.750000         4          12            18
  15         1           3        11.300000        13           6            16
  16         1          15        14.000000        15           0            19
  17         4          18        20.200001        11           0            18
  18         2           4        38.611111        14          17            19
  19         1           2        48.291668        16          18             0
SPSS Output 1 : Hierarchical Cluster
The Analysis : Proximities
• The "average linkage (between groups)" clustering method was used.
• There were a total of 20 data points. In the first stage two data points (14 and 16) were combined. This information is provided under "Clusters Combined" in the columns "Cluster 1" and "Cluster 2".
• The squared Euclidean distance between data points 14 and 16 is equal to 2.00. This is shown in the column "Coefficient". In later stages, where a cluster has more than one member, the coefficient is the average squared Euclidean distance between the members of the two clusters being combined.
• The column entitled "Stage Cluster 1st Appears" indicates the stage at which each of the two clusters being combined was itself formed. An entry of 0 means that the item is a single case that has not yet been part of any cluster. At stage 1 both entries are 0 because cases 14 and 16 are single cases. At stage 3 the entries are 0 and 1: case 10 is a single case, while cluster 14 was formed at stage 1.
• The "Next Stage" column gives the stage at which the cluster just formed is next combined with another case or cluster. For stage 1 the entry is 3. If we look at stage 3, we find that case 10 was combined with the cluster containing cases 14 and 16.
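The coefficients in the schedule can be verified by hand from the raw data. The sketch below (plain Python, illustrative function names) recomputes three of them: stage 1 (cases 14 and 16), stage 3 (case 10 joining the cluster of 14 and 16) and stage 8 (case 4 joining the cluster of 10, 14 and 16).

```python
def sq_euclidean(x, y):
    """Squared Euclidean distance between two cases."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

# V1..V6 ratings of the four respondents involved (from the raw data table).
ratings = {
    4:  [4, 6, 4, 5, 3, 6],
    10: [3, 5, 3, 6, 4, 6],
    14: [4, 6, 4, 6, 4, 7],
    16: [3, 5, 4, 6, 4, 7],
}

def average_between(group_a, group_b):
    """Between-groups average linkage: mean distance over all cross pairs."""
    pairs = [(p, q) for p in group_a for q in group_b]
    return sum(sq_euclidean(ratings[p], ratings[q]) for p, q in pairs) / len(pairs)

stage_1 = average_between([14], [16])         # single pair: 2
stage_3 = average_between([10], [14, 16])     # (4 + 2) / 2
stage_8 = average_between([4], [10, 14, 16])  # (5 + 3 + 5) / 3

print(stage_1, stage_3, round(stage_8, 6))    # 2.0 3.0 4.333333
```

These values agree with the coefficients printed by SPSS for stages 1, 3 and 8.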
SPSS Output 2 : Hierarchical Cluster
The following output “Icicle Plot” is displayed by SPSS

Vertical Icicle Plot using Average Linkage (Between Groups)
(Down) Number of Clusters   (Across) Case Label and number

    1  1  1  1  1     2     1     1     1        1  1
    8  9  6  4  0  4  0  9  1  5  3  2  5  8  3  7  2  7  6  1
 1 +XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
 2 +XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX  XXXXXXXXXXXXXXXXXXXXXX
 3 +XXXXXXXXXXXXXXXX  XXXXXXXXXXXXXXXX  XXXXXXXXXXXXXXXXXXXXXX
 4 +X  XXXXXXXXXXXXX  XXXXXXXXXXXXXXXX  XXXXXXXXXXXXXXXXXXXXXX
 5 +X  XXXXXXXXXXXXX  XXXXXXXXXXXXXXXX  X  XXXXXXXXXXXXXXXXXXX
 6 +X  XXXXXXXXXXXXX  XXXXXXXXXXXXXXXX  X  XXXX  XXXXXXXXXXXXX
 7 +X  XXXXXXXXXXXXX  XXXXXXXXXX  XXXX  X  XXXX  XXXXXXXXXXXXX
 8 +X  XXXXXXXXXXXXX  XXXXXXXXXX  XXXX  X  XXXX  X  XXXXXXXXXX
 9 +X  XXXXXXXXXXXXX  X  XXXXXXX  XXXX  X  XXXX  X  XXXXXXXXXX
10 +X  X  XXXXXXXXXX  X  XXXXXXX  XXXX  X  XXXX  X  XXXXXXXXXX
11 +X  X  XXXXXXXXXX  X  XXXXXXX  XXXX  X  XXXX  X  XXXXXXX  X
12 +X  X  XXXXXXXXXX  X  X  XXXX  XXXX  X  XXXX  X  XXXXXXX  X
13 +X  X  XXXXXXX  X  X  X  XXXX  XXXX  X  XXXX  X  XXXXXXX  X
14 +X  X  XXXXXXX  X  X  X  XXXX  XXXX  X  XXXX  X  X  XXXX  X
15 +X  X  XXXXXXX  X  X  X  XXXX  XXXX  X  X  X  X  X  XXXX  X
16 +X  X  XXXXXXX  X  X  X  X  X  XXXX  X  X  X  X  X  XXXX  X
17 +X  X  XXXXXXX  X  X  X  X  X  X  X  X  X  X  X  X  XXXX  X
18 +X  X  XXXX  X  X  X  X  X  X  X  X  X  X  X  X  X  XXXX  X
19 +X  X  XXXX  X  X  X  X  X  X  X  X  X  X  X  X  X  X  X  X
SPSS Output 2 : Hierarchical Cluster
The Analysis : Icicle Plot
• The icicle plot shows the cluster combinations. It is read from bottom to top.
• Initially each of the 20 cases is treated as a cluster of its own. In the row labelled 19, one combination has been made, leaving 19 clusters.
• The icicle plot represents the whole process of cluster formation in pictorial form. For example, in the row labelled 7 there are 7 clusters, each denoted by an unbroken run of X's:
  X  XXXXXXXXXXXXX  XXXXXXXXXX  XXXX  X  XXXX  XXXXXXXXXXXXX
• Each subsequent step leads to the formation of a new cluster in one of the following three (3) ways:
  – Two individual cases are grouped together
  – A case is joined to an already existing cluster
  – Two clusters are grouped together
SPSS Output 3 : Hierarchical Cluster
The following output “Dendrogram” is displayed by SPSS

Dendrogram using Average Linkage (Between Groups)
[Dendrogram figure. Horizontal axis: Rescaled Distance Cluster Combine, scaled from 0 to 25. Cases listed from top to bottom (C A S E Label and Num): 14, 16, 10, 4, 19, 18, 2, 13, 5, 11, 9, 20, 3, 8, 6, 7, 12, 1, 17, 15.]
SPSS Output 3 : Hierarchical Cluster
The Analysis : Dendrogram
• The dendrogram is a graphical output which is useful in identifying the clusters. It is read from left to right.
• Vertical lines represent the clusters that are joined together. The position of the vertical line on the scale indicates the distance at which the clusters were joined. Because many of the distances in the early stages are of similar magnitude, it is difficult to tell the sequence in which some of the early clusters were formed. However, it is clear that in the last two stages the distances at which the clusters are combined are large. This information is useful in deciding the number of clusters to retain.
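One way to make the "large distances in the last stages" observation concrete is to look at how much the coefficient rises from one stage to the next. The sketch below (plain Python) uses the 19 coefficients from the agglomeration schedule printed by SPSS. Treating the largest single rise as the stopping point is only a rule of thumb assumed here for illustration, not a procedure prescribed in this deck.

```python
# Coefficients for stages 1..19, copied from the SPSS agglomeration schedule.
coefficients = [
    2.000000, 2.000000, 3.000000, 3.000000, 3.000000, 3.000000, 4.000000,
    4.333333, 4.500000, 5.000000, 7.250000, 7.333333, 8.250000, 10.750000,
    11.300000, 14.000000, 20.200001, 38.611111, 48.291668,
]
n_cases = 20

# Rise in the coefficient on entering each stage (stages 2..19).
increase = {
    stage: coefficients[stage - 1] - coefficients[stage - 2]
    for stage in range(2, len(coefficients) + 1)
}

# Stage whose merge was much more "expensive" than the one before it.
costly_stage = max(increase, key=increase.get)

# Stopping just before that stage leaves this many clusters.
clusters_kept = n_cases - (costly_stage - 1)

print("largest rise at stage", costly_stage, "=", round(increase[costly_stage], 3))
print("clusters retained:", clusters_kept)
```

On these coefficients the largest rise occurs on entering stage 18 (from about 20.2 to about 38.6), so stopping after stage 17 leaves three clusters. In the dendrogram's case order these are cases 14, 16, 10, 4, 19, 18; cases 2, 13, 5, 11, 9, 20; and cases 3, 8, 6, 7, 12, 1, 17, 15.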
Cluster Analysis
Exercises and Final Notes
Practice Example
• The following data was collected for US basketball champions:
  – Height : Height in inches
  – Weight : Weight in pounds
  – FGPct : Field goal percentage
  – Points : Average points per game
  – Rebounds : Average rebounds per game

Champion        Height   Weight   FGPct   Points   Rebounds
Jabbar K.A.       86       230     55.9    24.6      11.2
Barry R           79       205     44.9    23.2       6.7
Baylor E          77       225     43.1    27.4      13.5
Bird L            81       220     50.3    25.0      10.2
Chamberlain W     85       275     54.0    30.1      22.9
Cousy B           73       175     37.5    18.4       5.2
Erving J          79       200     50.6    24.2       8.5
Johnson M         81       215     53.0    19.5       7.4
Jordan M          78       195     51.3    32.6       6.2
Robertson O       77       210     48.5    25.7       7.5
Russell B         82       220     44.0    15.1      22.6
West J            75       180     47.4    27.0       5.8

• Conduct a hierarchical cluster analysis using
  a) Height, Weight, FGPct, Points and Rebounds
  b) Height, FGPct, Points and Rebounds
  c) FGPct, Points and Rebounds
  Analyse the dendrograms to identify how the clusters change between (a), (b) and (c).
Warning
• We have only shown the output of a hierarchical cluster analysis.
• Similar interpretations may or may not be applicable to non-hierarchical cluster analysis.
• The analysis software used was SPSS® 6.0. The output may vary with the type of analysis tool selected.
• Cluster analysis should be run more than once using different distance measures, and the results compared, before a final interpretation is attempted.
Thank You
Feel free to revert with your comments and suggestions