SAS Workshop - Multivariate Procedures Handout # 4
Statistical Programs College of Agriculture
PROC CLUSTER

The objective in cluster analysis is to group “like” observations together when the underlying structure is unknown. This is carried out through a variety of methods, all of which use some measure of distance between data points as a basis for creating groups. Typically this distance is the standard Euclidean distance, i.e. the straight-line distance between two points, but the exact definition of distance is determined by the user. Essentially, the data points with the smallest distances between them are grouped together first. The data with the next smallest distances are then added to each group, and so on, until all observations end up together in one large group. The clustering is interpreted by examining the grouping history, or pattern, produced as the procedure was carried out. If the analysis works, distinct groups or clusters will stand out. These may have some practical meaning in terms of the research problem.

The general SAS code for performing a cluster analysis is:

    PROC CLUSTER options;
        VAR var1 var2 var3 ... varn;
    RUN;
Here the options control the printing, computation, and output of the procedure. Some examples are:

    NOPRINT   - suppresses all printed output
    NOEIGEN   - suppresses printing of the eigenvalues
    SIMPLE    - produces simple summary statistics for each variable
    METHOD=   - specifies the clustering method (required option)
    STANDARD  - standardizes the variables, so the computations use the correlation matrix
    OUTTREE=  - creates an output dataset for cluster diagrams

The VAR statement, as before, lists the variables to be considered as responses. For the flour example, the SAS program would be:

    PROC CLUSTER METHOD=AVERAGE OUTTREE=TREE;
        VAR PEAK_VISC TROUGH_VISC FINAL_VISC BREAKDOWN TOTAL_SETBACK TIMEPEAK_VISC;
    RUN;
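As a rough sketch of how the options listed above combine, the flour step can be rerun on standardized variables with summary statistics requested. The input dataset name FLOUR and the output name TREE_STD are assumptions made for illustration; the original program does not name its input dataset.

```sas
/* Sketch only: FLOUR and TREE_STD are hypothetical dataset names. */
PROC CLUSTER DATA=FLOUR          /* input data (assumed name)             */
             METHOD=AVERAGE      /* average linkage, as in the example    */
             STANDARD            /* standardize variables to mean 0, sd 1 */
             SIMPLE              /* print simple summary statistics       */
             OUTTREE=TREE_STD;   /* tree dataset for later plotting       */
    VAR PEAK_VISC TROUGH_VISC FINAL_VISC BREAKDOWN TOTAL_SETBACK TIMEPEAK_VISC;
RUN;
```

With STANDARD in effect the eigenvalues are computed from the correlation matrix, so they will differ from those of an unstandardized run on the same data.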
The method selected in this example is AVERAGE, which bases clustering decisions on the average distance (linkage) between points or clusters. Some other possibilities include CENTROID, which uses the distance between the geometric centers of the clusters; MEDIAN, which is similar to CENTROID but places the center of a merged cluster midway between the centers of the two clusters joined; and SINGLE, which uses a nearest neighbor approach. The computed clusters will be saved in a dataset called TREE for plotting purposes. The printed output for PROC CLUSTER is quite large (one line for every observation), but a sample is shown below:
The CLUSTER Procedure
Average Linkage Cluster Analysis

Eigenvalues of the Covariance Matrix

         Eigenvalue    Difference    Proportion    Cumulative
    1    101826.399     68286.241        0.7476        0.7476
    2     33540.157     32702.073        0.2462        0.9938
    3       838.084       837.287        0.0062        1.0000
    4         0.797         0.744        0.0000        1.0000
    5         0.053         0.053        0.0000        1.0000
    6         0.000                      0.0000        1.0000
This first section displays the eigenvalues in a manner similar to PROC PRINCOMP. Note that the values are different here because I chose not to use the STANDARD option, i.e. the output is based on the covariance matrix, not the correlations. As before, two axes define the data well.

Cluster History

    NCL    --Clusters Joined--    FREQ    Norm RMS Dist    Tie
     74    OB17        OB18          6           0.0149
     73    OB6         OB7           6           0.0237
     72    OB3         OB4           6           0.0238
     71    OB12        OB13          6           0.0256
     70    OB9         OB10          6           0.0262
     69    OB28        OB29          6           0.0264
     68    CL72        OB8           9           0.0349
     67    CL69        OB30          9           0.0374
     66    OB32        OB33          6           0.0396
     65    OB22        OB23          6           0.0404
     64    OB61        OB62          6           0.0408
     63    CL71        OB14          9           0.0411
     62    OB19        OB20          6           0.0427
     61    OB42        OB43          6           0.0441
     60    OB2         CL68         12           0.0449
     59    OB36        OB37          6           0.0469
     58    OB1         CL73          9           0.0512
     57    OB5         CL70          9           0.0514
     56    OB16        CL74          9           0.0516
     55    OB46        OB47          6           0.0543
The second section gives the clustering “history”, starting with the smallest distance (normalized RMS distance). The first line shows that a cluster, #74, was created from observations 17 and 18. Similar clusters were created from pairs of single observations to make cluster numbers 73, 72, 71, 70, and 69. At cluster 68, observation number 8 was added to cluster number 72 (obs 3 & 4). This process continues until all observations are included in one cluster. While this process may be interesting, it is hard to follow on the printout. For this reason, cluster analyses are usually reported using plots of the clustering history, referred to as tree diagrams or dendrograms. In SAS, there is a procedure to create such plots called PROC TREE. This procedure uses the output dataset from PROC CLUSTER. The code is simply:

    proc tree data=tree;
    run;
PROC TREE has options and statements available to “dress up” the plot by altering its shape and labeling. The details relating to these options will be left to the reader. The default plot is given below:

[Figure: default PROC TREE dendrogram. Vertical axis: Average Distance Between Clusters, 0.00 to 1.50. Horizontal axis: Name of Observation or Cluster, with individual observations labeled OBn.]
I have added shading to indicate three large clusters, which correspond to the three flour concentration levels. Within each of these are five subclusters corresponding to the peak temperature levels, and these can be further broken down into the five heating rates. Thus, PROC CLUSTER has correctly identified the treatment structure of our example. As with PCA and factor analysis, these results are subjective and depend on the user's interpretation. The procedures are simply descriptive and should be considered from an exploratory point of view rather than an inferential one.
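One way to check the three-cluster reading described above is to cut the tree at three clusters and count the members of each. This is a sketch rather than part of the original analysis: the output dataset name CLUS3 is an assumption, while NCLUSTERS=, OUT=, and NOPRINT are standard PROC TREE options.

```sas
/* Sketch only: CLUS3 is a hypothetical output dataset name. */
PROC TREE DATA=TREE        /* tree dataset written by PROC CLUSTER     */
          NCLUSTERS=3      /* cut the tree where three clusters remain */
          OUT=CLUS3        /* one row per observation with its cluster */
          NOPRINT;         /* skip the tree diagram                    */
RUN;

/* Count how many observations fall in each of the three clusters. */
PROC FREQ DATA=CLUS3;
    TABLES CLUSTER;
RUN;
```

The CLUSTER variable in the output dataset can then be compared against the known treatment levels, which keeps the interpretation exploratory while making the grouping easier to inspect than the printed history.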