Title: Cluster Analysis
Cluster analysis is an exploratory procedure used
to identify groups of similar objects (e.g.,
people, stimuli, books, singers) in a large
collection of objects. The members of an
identified group are similar to each other and
different from the members of other groups. The
approach is similar in spirit to multidimensional
scaling (MDS), but it produces discrete groups
without any spatial representation. The
identified groups can be used in subsequent
analyses.
The approach is decidedly discovery oriented:
there is usually no prior knowledge of how or why
objects might be distributed into particular
groups.
The most common clustering methods are
hierarchical and agglomerative: they form
clusters by joining nearby objects or clusters,
beginning with as many clusters as there are
objects and ending with a single cluster.
Somewhere in between there may be a number of
clusters that provides a convenient
simplification of the data.
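
The joining process described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the five points are made up, the joining rule is nearest neighbor (single linkage), and the function name `agglomerate` is hypothetical rather than part of any package.

```python
from itertools import combinations
from math import dist


def agglomerate(points):
    """Join the two nearest clusters repeatedly; return the partition at each step."""
    clusters = [[i] for i in range(len(points))]  # start: one cluster per object
    history = [[c[:] for c in clusters]]
    while len(clusters) > 1:
        # Pick the pair of clusters whose closest members are nearest (single linkage).
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda ab: min(
                dist(points[i], points[j])
                for i in clusters[ab[0]]
                for j in clusters[ab[1]]
            ),
        )
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
        history.append([sorted(c) for c in clusters])
    return history


# Made-up example data: two tight pairs and one isolated point.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
history = agglomerate(points)
for step in history:
    print(len(step), "clusters:", step)
```

The printed history runs from five clusters down to one; the analyst's job is to decide which intermediate step, if any, is a useful simplification.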
The clustering solution depends on:
- Method of joining nearby objects
- Type of distance or similarity measure used
  (e.g., Euclidean distance, squared Euclidean
  distance, Minkowski metric, correlation, binary
  matches)
- The information contained in the distance or
  similarity measure
- Nature of the data (standardized or
  unstandardized, by case or by variable)
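
To make a few of the listed measures concrete, here is a minimal Python sketch. The function names and the example vectors are illustrative assumptions; correlation is omitted for brevity, and "binary matches" is shown as the simple proportion of matching positions, which is only one of several matching coefficients.

```python
def minkowski(x, y, p):
    """Minkowski metric of order p (p=1 city-block, p=2 Euclidean)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)


def euclidean(x, y):
    return minkowski(x, y, 2)


def squared_euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def simple_matching(x, y):
    """Proportion of positions where two binary profiles agree (a similarity)."""
    return sum(a == b for a, b in zip(x, y)) / len(x)


x, y = (0, 0), (3, 4)
print(euclidean(x, y))          # straight-line distance
print(squared_euclidean(x, y))  # penalizes large differences more heavily
print(minkowski(x, y, 1))       # city-block distance
print(simple_matching((1, 0, 1, 1), (1, 1, 1, 0)))
```

Because these measures weight differences differently, the same data can yield different clusters depending on which one is chosen.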
There are a sizeable number of ways to define
clusters, reflecting the different desirable
properties that the clusters might have. Five
methods in particular are fairly common and
usually available in software:
- Single linkage (nearest neighbor)
- Complete linkage (farthest neighbor)
- Average linkage
- Centroid method
- Ward's method
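
The five methods differ only in how the distance between two clusters is defined. The sketch below writes each definition as a small Python function, assuming Euclidean distance between points; the function names and the two example clusters are hypothetical. Average linkage is shown in its between-groups form, and Ward's criterion is written as the increase in the within-cluster sum of squares that a merge would cause. Software packages may apply these rules to squared distances instead.

```python
from math import dist


def centroid(cluster):
    """Mean point of a cluster."""
    n = len(cluster)
    return tuple(sum(p[d] for p in cluster) / n for d in range(len(cluster[0])))


def single(a, b):
    """Nearest neighbor: smallest distance between members."""
    return min(dist(p, q) for p in a for q in b)


def complete(a, b):
    """Farthest neighbor: largest distance between members."""
    return max(dist(p, q) for p in a for q in b)


def average(a, b):
    """Mean of all between-cluster pairwise distances."""
    return sum(dist(p, q) for p in a for q in b) / (len(a) * len(b))


def centroid_distance(a, b):
    """Distance between the two cluster centroids."""
    return dist(centroid(a), centroid(b))


def ward_increase(a, b):
    """Increase in within-cluster sum of squares if a and b were merged."""
    na, nb = len(a), len(b)
    return na * nb / (na + nb) * dist(centroid(a), centroid(b)) ** 2


# Two made-up clusters of two points each.
a = [(0, 0), (0, 2)]
b = [(3, 0), (3, 2)]
for rule in (single, complete, average, centroid_distance, ward_increase):
    print(rule.__name__, round(rule(a, b), 3))
```

At each step of a hierarchical analysis, the pair of clusters with the smallest value under the chosen rule is joined.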
[Figures: Single Linkage, Complete Linkage, Average Linkage, Centroid Method, Ward's Method]
An important question is how well the different
clustering methods can recover a group structure
when it is known in advance. That can lend
insight into the ability of the methods to
identify any group structure when that structure
is not known in advance.
An initial sample of 50 cases was generated from
a bivariate normal population, with a correlation
of 0, means of 100, and standard deviations of
10. To form weak to strong group membership,
different constants were added to or subtracted
from cases. Group 1 had 10 cases, Group 2 had 15
cases, and Group 3 had 25 cases.
The constants were either 5, 10, 15, or 20 (0.5
to 2 SD adjustments). These were added or
subtracted in the following way:
The result of the adjustments was to create five
sets of data, ranging from no group structure (no
adjustments) to very strong group structure. The
ability of the different clustering methods to
recover group structure when it existed and to
not identify a clear group structure when none
existed was tested.
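
A simulation of this kind can be sketched as follows. The sample size, group sizes, means, standard deviations, and constants come from the description above; the direction in which each group is shifted is NOT recoverable from the transcript, so the `directions` table below is purely a hypothetical placeholder, as are the function name and the random seed.

```python
import random


def simulate(shift, seed=0):
    """Return 50 (group, x, y) cases; the same seed gives the same base sample."""
    rng = random.Random(seed)
    sizes = {1: 10, 2: 15, 3: 25}
    # HYPOTHETICAL shift directions per group; the original table is missing.
    directions = {1: (+1, +1), 2: (-1, +1), 3: (0, -1)}
    cases = []
    for group, n in sizes.items():
        dx, dy = directions[group]
        for _ in range(n):
            # Independent normal draws, so the population correlation is 0.
            x = rng.gauss(100, 10) + dx * shift
            y = rng.gauss(100, 10) + dy * shift
            cases.append((group, x, y))
    return cases


# Five data sets: no structure (0) through very strong structure (20 = 2 SD).
datasets = {shift: simulate(shift) for shift in (0, 5, 10, 15, 20)}
for shift, cases in datasets.items():
    print(shift, len(cases))
```

Reusing one seed keeps the underlying 50 cases identical across the five data sets, so only the strength of the group structure changes.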
To provide a means of checking the quality of the
clusters in relation to the known structure, the
actual and identified group memberships were
cross-tabulated.
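
A cross-tabulation of this kind, along with a count of correct assignments, can be sketched in Python. Cluster numbers are arbitrary labels, so the sketch scores agreement under the best one-to-one pairing of clusters with groups; that scoring rule, the function names, and the tiny example vectors are assumptions made for illustration, not the procedure used in the slides.

```python
from collections import Counter
from itertools import permutations


def crosstab(actual, assigned):
    """Count cases in each (actual group, assigned cluster) cell."""
    return Counter(zip(actual, assigned))


def best_correct(actual, assigned):
    """Largest number of agreeing cases over one-to-one label pairings."""
    table = crosstab(actual, assigned)
    groups = sorted(set(actual))
    clusters = sorted(set(assigned))
    if len(clusters) >= len(groups):
        pairings = (zip(groups, perm) for perm in permutations(clusters, len(groups)))
    else:
        pairings = (zip(perm, clusters) for perm in permutations(groups, len(clusters)))
    return max(sum(table[cell] for cell in pairing) for pairing in pairings)


# Made-up memberships for eight cases.
actual = [1, 1, 1, 2, 2, 3, 3, 3]
assigned = [2, 2, 3, 1, 1, 3, 3, 3]
table = crosstab(actual, assigned)
for (group, cluster), count in sorted(table.items()):
    print(f"group {group}, cluster {cluster}: {count}")
print("correct assignments:", best_correct(actual, assigned))
```

Trying every pairing is only practical for a handful of clusters, which is the situation in these examples.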
[Table: Actual Group Membership by Assigned Group Membership, with Correct Assignments]
Clustering methods can be applied to the same
kind of data that are examined using MDS. A
proximity matrix can be used as input and the
clusters identified using any of the methods.
The similarity ratings of female pop singers were
examined, using the average of the four raters'
proximity matrices.

MATRIX DATA VARIABLES=rowtype_ crow shakira clarkson aguilera spears
    madonna twain lavigne alanis cher.
BEGIN DATA
prox .00
prox 4.25 .00
prox 3.00 4.00 .00
prox 4.50 2.50 3.25 .00
prox 4.50 3.00 3.25 2.25 .00
prox 3.75 3.00 4.25 1.75 2.75 .00
prox 2.50 4.25 3.50 4.25 3.75 4.50 .00
prox 4.25 3.75 4.00 4.00 4.00 3.75 3.50 .00
prox 3.25 3.25 4.00 3.50 4.25 3.75 3.75 3.25 .00
prox 4.50 3.25 5.00 2.75 3.25 2.00 4.75 4.50 4.00 .00
END DATA.
CLUSTER crow shakira clarkson aguilera spears madonna twain lavigne
    alanis cher
  /MATRIX=IN(*)
  /METHOD=BAVERAGE SINGLE COMPLETE CENTROID WARD
  /ID=varname_
  /PRINT SCHEDULE
  /PRINT DISTANCE
  /PLOT DENDROGRAM VICICLE.
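
As a quick check on what the first agglomeration step would do with these data, the lower-triangular matrix can be read in Python and searched for the closest pair. The sketch assumes the entries behave as dissimilarities (larger values mean less alike), which is consistent with the zero diagonal; the variable names are illustrative.

```python
names = ("crow shakira clarkson aguilera spears "
         "madonna twain lavigne alanis cher").split()

lower = """
.00
4.25 .00
3.00 4.00 .00
4.50 2.50 3.25 .00
4.50 3.00 3.25 2.25 .00
3.75 3.00 4.25 1.75 2.75 .00
2.50 4.25 3.50 4.25 3.75 4.50 .00
4.25 3.75 4.00 4.00 4.00 3.75 3.50 .00
3.25 3.25 4.00 3.50 4.25 3.75 3.75 3.25 .00
4.50 3.25 5.00 2.75 3.25 2.00 4.75 4.50 4.00 .00
"""

rows = [[float(v) for v in line.split()] for line in lower.strip().splitlines()]

# One entry per unordered pair of singers (below the diagonal).
pairs = {
    (names[j], names[i]): rows[i][j]
    for i in range(len(rows))
    for j in range(i)
}

closest = min(pairs, key=pairs.get)
print(closest, pairs[closest])
```

Every hierarchical method starts by joining this pair; the methods diverge only once clusters of more than one object have to be compared.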
[Plots: Dimension 2 by Dimension 1; Dimension 3 by Dimension 1; Dimension 2 by Dimension 3]
Cluster analysis does not necessarily identify
natural clusters. It is heavily dependent on
the data from which distance between objects is
determined. If a key feature of the objects is
not part of the distance calculations, that
aspect of group definition will be missing.
An alternative to the agglomerative, hierarchical
approach to clustering more closely resembles the
spirit of analysis of variance. The partitioning
procedure known as K-means clustering attempts to
form clusters that have the smallest possible
within-cluster variances.
The partitioning approach to finding clusters
begins with specification of the number of
clusters desired (K) and seed values for the
initial cluster centroids. Then, cases are
assigned to clusters so that the sum of the
squared distances from cases to cluster centroids
is minimized. Cases are reassigned until no
further reduction in the sum of squared
deviations is found.
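
The assign-and-update cycle just described can be sketched in Python. This is a generic batch version written for illustration, not a reproduction of any package's routine: the function name `kmeans`, the example points, and the seed values are all assumptions.

```python
from math import dist


def kmeans(points, seeds, max_iter=10):
    """Assign cases to the nearest centroid, update centroids, repeat."""
    centroids = [tuple(s) for s in seeds]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centroids)), key=lambda k: dist(p, centroids[k]))
            for p in points
        ]
        if new_labels == labels:  # no case changed cluster: stop
            break
        labels = new_labels
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = tuple(sum(c) / len(members) for c in zip(*members))
    # Sum of squared distances from cases to their cluster centroids.
    sse = sum(dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))
    return labels, centroids, sse


# Made-up data with two obvious groups, and K = 2 seed values.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels, centroids, sse = kmeans(points, seeds=[(1, 1), (9, 8)])
print(labels)
print(centroids)
print(round(sse, 4))
```

The returned sum of squares is the quantity the procedure tries to make as small as possible for the chosen K.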
The K-means clustering procedure is similar to
Ward's method, but it is not a hierarchical
approach and may not produce the same clusters.
The nature of the final clusters can be heavily
dependent on the seed values that are used. By
default, most software chooses an initial set of
cases as the seed values, chosen to be relatively
far apart from each other.
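
The dependence on seed values is easy to demonstrate with a deliberately simple case. In the Python sketch below, four made-up points sit at the corners of a long rectangle; the `kmeans` function is a generic batch version written for this illustration, and both seed sets are hypothetical.

```python
from math import dist


def kmeans(points, seeds, max_iter=10):
    """Generic batch K-means: assign to nearest centroid, then update."""
    centroids = [tuple(s) for s in seeds]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centroids)), key=lambda k: dist(p, centroids[k]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centroids[k] = tuple(sum(c) / len(members) for c in zip(*members))
    sse = sum(dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))
    return labels, sse


# Corners of a 4-by-1 rectangle: the natural split is left versus right.
points = [(0, 0), (0, 1), (4, 0), (4, 1)]

labels_a, sse_a = kmeans(points, seeds=[(0, 0), (4, 0)])  # one seed per side
labels_b, sse_b = kmeans(points, seeds=[(0, 0), (0, 1)])  # both seeds on one side

print(labels_a, sse_a)
print(labels_b, sse_b)
```

With the second seed set the procedure settles on a top-versus-bottom split and stops, even though its sum of squares is far larger: no single reassignment improves it. This is why well-separated seeds are preferred.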
The adequacy of the K-means approach can be
tested in the same way as the hierarchical
methods: by examining how well it recovers a
known structure.
An initial sample of 50 cases was generated from
a bivariate normal population, with a correlation
of 0, means of 100, and standard deviations of
10. To form weak to strong group membership,
different constants were added to or subtracted
from cases. Group 1 had 10 cases, Group 2 had 15
cases, and Group 3 had 25 cases.
The constants were either 5, 10, 15, or 20 (0.5
to 2 SD adjustments). These were added or
subtracted in the following way:
The result of the adjustments was to create five
sets of data, ranging from no group structure (no
adjustments) to very strong group structure. The
ability of the K-means approach to recover group
structure when it existed was tested. Note that
the K-means approach will always identify
precisely K clusters, whether or not any group
structure exists.

GET FILE='C:\Courses\psy516\Cluster\xy.sav'.
QUICK CLUSTER x y
  /MISSING=LISTWISE
  /CRITERIA=CLUSTER(2) MXITER(10) CONVERGE(0)
  /METHOD=KMEANS(NOUPDATE)
  /SAVE CLUSTER DISTANCE
  /PRINT ID(group) INITIAL ANOVA CLUSTER DISTAN
  /OUTFILE='C:\Courses\psy516\Cluster\cluster centers1.sav'.

CROSSTABS
  /TABLES=qcl_1 BY group
  /FORMAT=AVALUE TABLES
  /STATISTIC=CHISQ
  /CELLS=COUNT
  /BARCHART.
GRAPH
  /BAR(SIMPLE)=MEAN(qcl_2) BY qcl_1
  /MISSING=REPORT.
One potential use for cluster analysis is to
simplify a sample of data, perhaps when initial
analyses suggest a discontinuous nature.
A sample of 150 people was surveyed concerning
their opinions about four controversial issues.
On a 10-point rating scale, ranging from
Completely Disapprove (1) to Completely Approve
(10), the respondents rated their opinions of:
- Gun Control
- Prayer in the Schools
- Death Penalty
- Same Sex Marriage
The respondents also reported their annual income
and their number of years of education. The goal
of the study was to examine the role of
socioeconomic status in shaping opinions on
controversial topics. An examination of the
relationship between education and income
revealed an unusual pattern.
A hierarchical cluster analysis using Ward's
method suggested from 3 to 6 clusters in the
sample.
To determine the appropriate number of clusters,
the K-means approach was run sequentially,
testing from 2 to 8 clusters. The pseudo-F
statistic, R², and R²/(1 - R²) were calculated
for each solution.
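
These indices can be computed directly from a cluster solution. The Python sketch below assumes the usual definitions, which the slides do not spell out: R² is the proportion of the total sum of squares that lies between clusters, and pseudo-F is [R²/(K - 1)] / [(1 - R²)/(N - K)]. The function names and the example data are illustrative.

```python
def sum_of_squares(pts):
    """Sum of squared deviations of points from their own mean, over all variables."""
    means = [sum(col) / len(pts) for col in zip(*pts)]
    return sum((v - m) ** 2 for p in pts for v, m in zip(p, means))


def r_squared(points, labels):
    """Proportion of total variation accounted for by cluster membership."""
    total = sum_of_squares(points)
    within = sum(
        sum_of_squares([p for p, lab in zip(points, labels) if lab == k])
        for k in set(labels)
    )
    return 1 - within / total


def pseudo_f(points, labels):
    """Assumed definition: [R2/(K-1)] / [(1-R2)/(N-K)]; undefined if R2 == 1."""
    n, k = len(points), len(set(labels))
    r2 = r_squared(points, labels)
    return (r2 / (k - 1)) / ((1 - r2) / (n - k))


# Made-up data with a two-cluster solution.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels = [0, 0, 0, 1, 1, 1]
r2 = r_squared(points, labels)
print(round(r2, 4), round(r2 / (1 - r2), 3), round(pseudo_f(points, labels), 3))
```

Computing the three values for each K from 2 to 8 and looking for a peak or an elbow is one common way to choose the number of clusters.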

QUICK CLUSTER income educate
  /MISSING=LISTWISE
  /CRITERIA=CLUSTER(6) MXITER(10) CONVERGE(0)
  /METHOD=KMEANS(NOUPDATE)
  /SAVE CLUSTER
  /PRINT INITIAL ANOVA CLUSTER DISTAN.

SAVE OUTFILE='C:\Courses\psy516\Cluster\example2.sav'
  /COMPRESSED.
ONEWAY income educate gun prayer death samesex BY qcl_7
  /STATISTICS DESCRIPTIVES HOMOGENEITY
  /PLOT MEANS
  /MISSING ANALYSIS
  /POSTHOC=BONFERRONI ALPHA(.05).
How much do the seed values matter? SPSS default

QUICK CLUSTER educate income
  /MISSING=LISTWISE
  /CRITERIA=CLUSTER(5) MXITER(10) CONVERGE(0) NOINITIAL
  /METHOD=KMEANS(NOUPDATE)
  /PRINT INITIAL.

The NOINITIAL option takes the first K cases
that are not missing and uses them as the seed
values.

QUICK CLUSTER educate income
  /MISSING=LISTWISE
  /CRITERIA=CLUSTER(6) MXITER(10) CONVERGE(0)
  /METHOD=KMEANS(NOUPDATE)
  /PRINT INITIAL
  /INITIAL=(4 20000  6 30000  8 40000  10 50000  12 70000  16 90000).

Seed values can also be specified.

QUICK CLUSTER educate income
  /MISSING=LISTWISE
  /CRITERIA=CLUSTER(6) MXITER(10) CONVERGE(0)
  /METHOD=KMEANS(NOUPDATE)
  /PRINT INITIAL
  /INITIAL=(4 120000  6 3000  18 40000  10 5000  16 7000  18 9000).

QUICK CLUSTER educate income
  /MISSING=LISTWISE
  /CRITERIA=CLUSTER(6) MXITER(10) CONVERGE(0)
  /METHOD=KMEANS(NOUPDATE)
  /PRINT INITIAL
  /INITIAL=(4 120000  6 113000  18 40000  10 25000  16 37000  20 100000).