Cluster Analysis - PowerPoint PPT Presentation

Slides: 200

Provided by: michael1175

Transcript and Presenter's Notes

Title: Cluster Analysis

1
Cluster Analysis
2
Cluster analysis is an exploratory procedure that
is used to identify groups of similar objects
(e.g., people, stimuli, books, singers, etc.) in
a large collection of objects. The identified
groups have members that are similar to each
other and different from the members in other
groups. The approach is similar in spirit to
MDS, but produces discrete groups without any
spatial representation. The identified groups
can be used in subsequent analyses.
3
The approach is decidedly discovery oriented:
there is usually no prior knowledge of how or why
objects might be distributed into particular
groups.
4
The most common clustering methods are
hierarchical and agglomerative, forming clusters
by joining nearby objects or clusters, beginning
with as many clusters as there are objects and
ending with a single cluster. In between there
may be a number of clusters that provides a
convenient simplification of the data.
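
The agglomerative procedure just described can be sketched in a few lines of Python. This is only a minimal illustration, not the routine any statistics package uses: the function name `agglomerate`, the toy points, and the choice of nearest-neighbor joining are all assumptions made for the example.

```python
import math

def agglomerate(points):
    """Naive nearest-neighbor agglomeration; returns the merge schedule."""
    # Begin with as many clusters as there are objects.
    clusters = [[i] for i in range(len(points))]
    schedule = []
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters whose closest members are nearest.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        schedule.append((sorted(clusters[a]), sorted(clusters[b]), d))
        # Join the two clusters; one fewer cluster remains.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return schedule

# Toy data (illustrative only): two tight pairs and one outlying point.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.5), (10.0, 10.0)]
schedule = agglomerate(points)
for left, right, d in schedule:
    print(left, "+", right, "at distance", round(d, 3))
```

With n objects there are always n - 1 joins. Reading the schedule from the top shows where the joining distance jumps, which is one informal way to pick an intermediate number of clusters.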
5

The clustering solution depends on:
- Method of joining nearby objects
- Type of distance or similarity measure used (e.g., Euclidean distance, squared Euclidean distance, Minkowski metric, correlation, binary matches)
- The information contained in the distance or similarity measure
- Nature of the data (standardized or unstandardized, by case or by variable)
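
Three of the distance measures named above can be written out directly. The sketch below is in Python; the function names are illustrative choices, not taken from any package. It treats Euclidean distance as the Minkowski metric of order 2.

```python
def minkowski(x, y, p):
    """Minkowski distance of order p between two equal-length profiles."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def euclidean(x, y):
    """Euclidean distance is the Minkowski metric with p = 2."""
    return minkowski(x, y, 2)

def squared_euclidean(x, y):
    """Squared Euclidean distance weights large differences more heavily."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

x = (0.0, 0.0)
y = (3.0, 4.0)
print(euclidean(x, y))          # 5.0
print(squared_euclidean(x, y))  # 25.0
print(minkowski(x, y, 1))       # 7.0 (city-block distance)
```

Because every measure sums differences across variables, a variable on a large scale (income in dollars, say) dominates one on a small scale (years of education) unless the data are standardized first.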

6
There are a sizeable number of ways to define
clusters, reflecting the different desirable
properties that the clusters might have. Five
methods in particular are fairly common and
usually available in software:
- Single linkage (nearest neighbor)
- Complete linkage (farthest neighbor)
- Average linkage
- Centroid method
- Ward's method
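
The five joining rules differ only in how the distance between two clusters is scored. The Python sketch below writes each one as a function of two clusters of points. The function names and toy clusters are assumptions for illustration, average linkage is shown in its between-groups form, and Ward's criterion is expressed as the increase in the within-cluster sum of squares that a merge would cause.

```python
import math

def centroid(cluster):
    """Mean point of a cluster."""
    n = len(cluster)
    return tuple(sum(p[k] for p in cluster) / n
                 for k in range(len(cluster[0])))

def pairwise(a, b):
    """All distances between a member of a and a member of b."""
    return [math.dist(p, q) for p in a for q in b]

def single_linkage(a, b):      # nearest neighbor
    return min(pairwise(a, b))

def complete_linkage(a, b):    # farthest neighbor
    return max(pairwise(a, b))

def average_linkage(a, b):     # mean of all between-cluster pairs
    d = pairwise(a, b)
    return sum(d) / len(d)

def centroid_distance(a, b):   # distance between cluster means
    return math.dist(centroid(a), centroid(b))

def ward_increase(a, b):
    """Increase in within-cluster sum of squares if a and b are merged."""
    na, nb = len(a), len(b)
    return na * nb / (na + nb) * math.dist(centroid(a), centroid(b)) ** 2

# Toy clusters (illustrative only).
a = [(0.0, 0.0), (0.0, 2.0)]
b = [(4.0, 0.0), (4.0, 2.0)]
for rule in (single_linkage, complete_linkage, average_linkage,
             centroid_distance, ward_increase):
    print(rule.__name__, round(rule(a, b), 3))
```

At each step an agglomerative method joins the pair of clusters with the smallest score under its rule, so the same data can yield different hierarchies under different rules.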

8
Single Linkage
10
Complete Linkage
12
Average Linkage
14
Centroid Method
16
Ward's Method
18
An important question is how well the different
clustering methods can recover a group structure
when it is known in advance. That can lend
insight into the ability of the methods to
identify any group structure when that structure
is not known in advance.
19
An initial sample of 50 cases was generated from
a bivariate normal population, with a correlation
of 0, means of 100, and standard deviations of
10. To form weak to strong group membership,
different constants were added to or subtracted
from cases. Group 1 had 10 cases, Group 2 had
15 cases, and Group 3 had 25 cases.
20
The constants were either 5, 10, 15, or 20 (.5 to
2 SD adjustments). These were added or subtracted
in the following way
21
The result of the adjustments was to create five
sets of data, ranging from no group structure (no
adjustments) to very strong group structure. The
ability of the different clustering methods to
recover group structure when it existed and to
not identify a clear group structure when none
existed was tested.
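
A simulation of this kind might be set up as follows in Python. The group sizes, means, standard deviations, and shift constants come from the slides. The table showing which constants were added to or subtracted from which groups did not survive in the transcript, so the `directions` mapping below is purely hypothetical, as are the function name and the random seed.

```python
import random

def simulate(shift, seed=516):
    """Return 50 (group, x, y) cases; shift = 0 gives no group structure."""
    rng = random.Random(seed)   # same seed, so every data set shares one base sample
    sizes = {1: 10, 2: 15, 3: 25}
    # HYPOTHETICAL pattern of additions (+1) and subtractions (-1);
    # the pattern actually used in the slides is not available.
    directions = {1: (+1, +1), 2: (-1, +1), 3: (0, -1)}
    cases = []
    for group, n in sizes.items():
        dx, dy = directions[group]
        for _ in range(n):
            # Independent normal draws give a population correlation of 0.
            x = rng.gauss(100, 10) + dx * shift
            y = rng.gauss(100, 10) + dy * shift
            cases.append((group, x, y))
    return cases

# Five data sets: no adjustment, then .5, 1, 1.5, and 2 SD adjustments.
datasets = {shift: simulate(shift) for shift in (0, 5, 10, 15, 20)}
print(len(datasets[0]), "cases per data set")
```

Reusing one base sample across all five data sets means that any difference in recovery is due to the size of the adjustment rather than to sampling noise.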
29
To provide a means of checking the quality of the
clusters in relation to the known structure, the
actual and identified group memberships were
cross-tabulated.
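
The cross-tabulation itself is a simple count of (actual, identified) pairs. The Python sketch below uses made-up membership vectors purely for illustration; the function name `crosstab` is likewise an assumption.

```python
from collections import Counter

def crosstab(actual, assigned):
    """Count how many cases fall in each (actual, assigned) cell."""
    return Counter(zip(actual, assigned))

# Illustrative memberships only, not data from the presentation.
actual   = [1, 1, 1, 2, 2, 2, 2, 3, 3, 3]
assigned = [1, 1, 2, 2, 2, 2, 3, 3, 3, 3]

table = crosstab(actual, assigned)
rows = sorted(set(actual))
cols = sorted(set(assigned))
for r in rows:
    print("actual", r, [table[(r, c)] for c in cols])

# Proportion of cases on the diagonal (correct assignments).
agreement = sum(table[(g, g)] for g in rows) / len(actual)
print("agreement:", agreement)
```

Cluster numbers are arbitrary labels, so the diagonal counts as "correct" only after each identified cluster has been matched to the actual group it mostly contains.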
40
(Cross-tabulation: Assigned Group Membership by Actual Group Membership, with Correct Assignments marked)
94
Clustering methods can be applied to the same
kind of data that are examined using MDS. A
proximity matrix can be used as input and the
clusters identified using any of the
methods. The similarity ratings of female pop
singers were examined, using the average of the
four raters' proximity matrices.
95
MATRIX DATA VARIABLES=rowtype_ crow shakira clarkson aguilera spears
    madonna twain lavigne alanis cher.
BEGIN DATA
prox .00
prox 4.25 .00
prox 3.00 4.00 .00
prox 4.50 2.50 3.25 .00
prox 4.50 3.00 3.25 2.25 .00
prox 3.75 3.00 4.25 1.75 2.75 .00
prox 2.50 4.25 3.50 4.25 3.75 4.50 .00
prox 4.25 3.75 4.00 4.00 4.00 3.75 3.50 .00
prox 3.25 3.25 4.00 3.50 4.25 3.75 3.75 3.25 .00
prox 4.50 3.25 5.00 2.75 3.25 2.00 4.75 4.50 4.00 .00
END DATA.
CLUSTER crow shakira clarkson aguilera spears madonna twain lavigne
    alanis cher
  /MATRIX=IN(*)
  /METHOD=BAVERAGE SINGLE COMPLETE CENTROID WARD
  /ID=varname_
  /PRINT SCHEDULE
  /PRINT DISTANCE
  /PLOT DENDROGRAM VICICLE.
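
As a quick check on what any of the agglomerative methods will do first, the lower-triangular matrix above can be expanded to a full symmetric matrix and searched for its smallest off-diagonal entry. The Python sketch below assumes the entries are dissimilarities (zero on the diagonal, larger meaning less alike); the variable names are illustrative.

```python
names = ["crow", "shakira", "clarkson", "aguilera", "spears",
         "madonna", "twain", "lavigne", "alanis", "cher"]

# Lower triangle, row by row, exactly as entered in the matrix data.
lower = [
    [0.00],
    [4.25, 0.00],
    [3.00, 4.00, 0.00],
    [4.50, 2.50, 3.25, 0.00],
    [4.50, 3.00, 3.25, 2.25, 0.00],
    [3.75, 3.00, 4.25, 1.75, 2.75, 0.00],
    [2.50, 4.25, 3.50, 4.25, 3.75, 4.50, 0.00],
    [4.25, 3.75, 4.00, 4.00, 4.00, 3.75, 3.50, 0.00],
    [3.25, 3.25, 4.00, 3.50, 4.25, 3.75, 3.75, 3.25, 0.00],
    [4.50, 3.25, 5.00, 2.75, 3.25, 2.00, 4.75, 4.50, 4.00, 0.00],
]

# Expand to a full symmetric matrix.
n = len(names)
full = [[0.0] * n for _ in range(n)]
for i, row in enumerate(lower):
    for j, value in enumerate(row):
        full[i][j] = full[j][i] = value

# The first join is the pair with the smallest dissimilarity.
closest = min(((full[i][j], names[j], names[i])
               for i in range(n) for j in range(i)),
              key=lambda t: t[0])
print(closest)  # (1.75, 'aguilera', 'madonna')
```

All five methods agree on this first join, because each starts from single objects; they can diverge from the second step onward, once distances to a merged cluster have to be defined.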
96
(Plot axes: Dimension 2, Dimension 1)
97
(Plot axes: Dimension 3, Dimension 1)
98
(Plot axes: Dimension 2, Dimension 3)
107
Cluster analysis does not necessarily identify
natural clusters. It is heavily dependent on
the data from which distance between objects is
determined. If a key feature of the objects is
not part of the distance calculations, that
aspect of group definition will be missing.
108
An alternative to the agglomerative, hierarchical
approach to clustering more closely resembles the
spirit of analysis of variance. The partitioning
procedure known as K-means clustering attempts to
form clusters that have the smallest possible
within-cluster variances.
109
The partitioning approach to finding clusters
begins with specification of the number of
clusters desired (K) and seed values for the
initial cluster centroids. Then, cases are
assigned to clusters so that the sum of the
squared distances from cases to cluster centroids
is minimized. Cases are reassigned until no
further reduction in the sum of squared
deviations is found.
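
The partitioning steps can be sketched as a short Python function. This is a generic batch version written for illustration, not a reproduction of any package's routine; the function name `kmeans`, its `max_iter` argument, and the toy points are assumptions.

```python
import math

def kmeans(points, seeds, max_iter=10):
    """Assign cases to the nearest centroid, update centroids, repeat."""
    centroids = [tuple(s) for s in seeds]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each case goes to its nearest centroid.
        new_labels = [min(range(len(centroids)),
                          key=lambda k: math.dist(p, centroids[k]))
                      for p in points]
        if new_labels == labels:      # no case changed cluster: stop
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centroids[k] = tuple(sum(c) / len(members)
                                     for c in zip(*members))
    # Criterion: sum of squared distances from cases to their centroids.
    sse = sum(math.dist(p, centroids[lab]) ** 2
              for p, lab in zip(points, labels))
    return labels, centroids, sse

# Toy data (illustrative only) with K = 2 and two cases used as seeds.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels, centroids, sse = kmeans(points, seeds=[(1, 1), (8, 8)])
print(labels, round(sse, 4))
```

Each pass can only lower the sum of squared distances or leave it unchanged, which is why the procedure stops once no case is reassigned.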
110
The K-means clustering procedure is similar to
Ward's method, but is not a hierarchical approach
and may not produce the same clusters. The nature
of the final clusters can be heavily dependent on
the seed values that are used. By default, most
software chooses an initial set of cases as the
seed values, chosen to be relatively far apart
from each other.
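
A four-point example is enough to show the dependence on seed values. In the Python sketch below (a compact batch K-means written for illustration; the function name `lloyd` and the data are assumptions), the same cases are clustered twice with K = 2, starting from two different pairs of seeds.

```python
import math

def lloyd(points, seeds, max_iter=10):
    """Compact batch K-means; returns (labels, sum of squared distances)."""
    centroids = [tuple(s) for s in seeds]
    labels = None
    for _ in range(max_iter):
        new = [min(range(len(centroids)),
                   key=lambda k: math.dist(p, centroids[k]))
               for p in points]
        if new == labels:
            break
        labels = new
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centroids[k] = tuple(sum(c) / len(members)
                                     for c in zip(*members))
    sse = sum(math.dist(p, centroids[lab]) ** 2
              for p, lab in zip(points, labels))
    return labels, sse

# Four cases at the corners of a long, thin rectangle (illustrative only).
points = [(0, 0), (0, 1), (4, 0), (4, 1)]

far_apart = lloyd(points, seeds=[(0, 0), (4, 0)])   # seeds at opposite ends
close     = lloyd(points, seeds=[(0, 0), (0, 1)])   # seeds side by side

print(far_apart)  # ([0, 0, 1, 1], 1.0)
print(close)      # ([0, 1, 0, 1], 16.0)
```

Both runs converge, but the second settles on a much poorer partition, which illustrates why seeds that are relatively far apart are the usual default.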
111
The adequacy of the K-means approach can be
tested in the same way as the hierarchical
methods: by examining how well it recovers a known
structure.
112
An initial sample of 50 cases was generated from
a bivariate normal population, with a correlation
of 0, means of 100, and standard deviations of
10. To form weak to strong group membership,
different constants were added to or subtracted
from cases. Group 1 had 10 cases, Group 2 had
15 cases, and Group 3 had 25 cases.
113
The constants were either 5, 10, 15, or 20 (.5 to
2 SD adjustments). These were added or subtracted
in the following way
114
The result of the adjustments was to create five
sets of data, ranging from no group structure (no
adjustments) to very strong group structure. The
ability of the K-means approach to recover group
structure when it existed was tested. The K-means
approach will always identify precisely K
clusters.
121
GET FILE='C:\Courses\psy516\Cluster\xy.sav'.
QUICK CLUSTER x y
  /MISSING=LISTWISE
  /CRITERIA=CLUSTER(2) MXITER(10) CONVERGE(0)
  /METHOD=KMEANS(NOUPDATE)
  /SAVE CLUSTER DISTANCE
  /PRINT ID(group) INITIAL ANOVA CLUSTER DISTAN
  /OUTFILE='C:\Courses\psy516\Cluster\cluster centers1.sav'.
122
CROSSTABS
  /TABLES=qcl_1 BY group
  /FORMAT=AVALUE TABLES
  /STATISTIC=CHISQ
  /CELLS=COUNT
  /BARCHART.
GRAPH
  /BAR(SIMPLE)=MEAN(qcl_2) BY qcl_1
  /MISSING=REPORT.
159
One potential use for cluster analysis is to
simplify a sample of data, perhaps when initial
analyses suggest a discontinuous nature.
160

A sample of 150 people was surveyed concerning
their opinions about four controversial issues.
On a 10-point rating scale, ranging from
Completely Disapprove (1) to Completely Approve
(10), the respondents rated their opinions of:
- Gun Control
- Prayer in the Schools
- Death Penalty
- Same Sex Marriage

161
The respondents also reported their annual income
and their number of years of education. The goal
of the study was to examine the role of
socioeconomic status in shaping opinions on
controversial topics. An examination of the
relationship between education and income
revealed an unusual pattern.
163
A hierarchical cluster analysis using Ward's
method suggested from 3 to 6 clusters in the
sample.
164
To determine the appropriate number of clusters,
the K-means approach was run sequentially,
testing from 2 to 8 clusters. The pseudo-F
statistic, R², and R²/(1-R²) were calculated for
each solution.
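
These three fit statistics can be computed from any cluster solution as sketched below in Python. The slides do not give the formulas, so the usual definitions are assumed here: R² is the proportion of the total sum of squares that lies between clusters, and the pseudo-F is [R²/(K-1)] / [(1-R²)/(N-K)]. The function name and toy data are illustrative.

```python
def fit_statistics(points, labels):
    """Return (R2, R2/(1-R2), pseudo-F) for a given cluster solution."""
    n = len(points)
    dims = range(len(points[0]))
    # Total sum of squares around the grand mean.
    grand = [sum(p[d] for p in points) / n for d in dims]
    total_ss = sum((p[d] - grand[d]) ** 2 for p in points for d in dims)
    # Within-cluster sum of squares around each cluster mean.
    within_ss = 0.0
    groups = sorted(set(labels))
    for g in groups:
        members = [p for p, lab in zip(points, labels) if lab == g]
        center = [sum(p[d] for p in members) / len(members) for d in dims]
        within_ss += sum((p[d] - center[d]) ** 2
                         for p in members for d in dims)
    k = len(groups)
    r2 = 1 - within_ss / total_ss
    ratio = r2 / (1 - r2)
    pseudo_f = (r2 / (k - 1)) / ((1 - r2) / (n - k))
    return r2, ratio, pseudo_f

# Toy two-cluster solution (illustrative only).
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels = [0, 0, 0, 1, 1, 1]
r2, ratio, pseudo_f = fit_statistics(points, labels)
print(round(r2, 4), round(ratio, 3), round(pseudo_f, 1))
```

Running this for each candidate K and comparing the values across solutions is one way to choose among them; R² alone always rises as K grows, which is why the pseudo-F, with its penalty for the number of clusters, is examined as well.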
165
QUICK CLUSTER income educate
  /MISSING=LISTWISE
  /CRITERIA=CLUSTER(6) MXITER(10) CONVERGE(0)
  /METHOD=KMEANS(NOUPDATE)
  /SAVE CLUSTER
  /PRINT INITIAL ANOVA CLUSTER DISTAN.
185
SAVE OUTFILE='C:\Courses\psy516\Cluster\example2.sav'
  /COMPRESSED.
ONEWAY income educate gun prayer death samesex BY qcl_7
  /STATISTICS DESCRIPTIVES HOMOGENEITY
  /PLOT MEANS
  /MISSING ANALYSIS
  /POSTHOC=BONFERRONI ALPHA(.05).
189
How much do the seed values matter? SPSS default
190
QUICK CLUSTER educate income
  /MISSING=LISTWISE
  /CRITERIA=CLUSTER(5) MXITER(10) CONVERGE(0) NOINITIAL
  /METHOD=KMEANS(NOUPDATE)
  /PRINT INITIAL.
The NOINITIAL option takes the first k cases
that are not missing and uses them as the seed
values.
192
QUICK CLUSTER educate income
  /MISSING=LISTWISE
  /CRITERIA=CLUSTER(6) MXITER(10) CONVERGE(0)
  /METHOD=KMEANS(NOUPDATE)
  /PRINT INITIAL
  /INITIAL=(4 20000 6 30000 8 40000 10 50000 12 70000 16 90000).
Seed values can also be specified.
194
QUICK CLUSTER educate income
  /MISSING=LISTWISE
  /CRITERIA=CLUSTER(6) MXITER(10) CONVERGE(0)
  /METHOD=KMEANS(NOUPDATE)
  /PRINT INITIAL
  /INITIAL=(4 120000 6 3000 18 40000 10 5000 16 7000 18 9000).
196
QUICK CLUSTER educate income
  /MISSING=LISTWISE
  /CRITERIA=CLUSTER(6) MXITER(10) CONVERGE(0)
  /METHOD=KMEANS(NOUPDATE)
  /PRINT INITIAL
  /INITIAL=(4 120000 6 113000 18 40000 10 25000 16 37000 20 100000).
