Title: Clustering of Non-Quantitative Data
1 Nanjing University of Science and Technology
Clustering of Non-Quantitative Data
Lonnie C. Ludeman
New Mexico State University, Las Cruces, New Mexico, USA
Nov 23, 2005
2 Like to Thank
Nanjing University of Science and Technology
Department of Computer Science
especially
Lu Jian-feng, Yang Jing-yu, Wang Han
for inviting me to come to NUST
3 Also like to Thank two special students
Wang Qiong, Wang Huan
for their helpfulness and kindness during my
tenure at NUST
4 Consider the problem of clustering a standard
American deck of playing cards
Two Possible Solutions
5 Two More Possible Solutions
6 Clustering is the art of grouping together
pattern vectors that in some sense belong
together because they have similar
characteristics and are somehow different from
other pattern vectors.
In the most general problem, the number of
clusters or subgroups is unknown, as are the
properties that make them similar.
7 Lecture Topics
- General Concept of Clustering
- Define Types of Data
- Review the K-Means Clustering Algorithm
- Describe Clustering Method for non-Quantitative data
- Present two examples illustrating the method
- Discuss advantages and disadvantages of the method
8 Mathematical Formalization of the Clustering Problem
Given a set S of $N_S$ n-dimensional pattern vectors
$$S = \{\, x_j \;;\; j = 1, 2, \ldots, N_S \,\}$$
Clustering is the process of partitioning S into
M subsets, $Cl_k$, $k = 1, 2, \ldots, M$, called
clusters, that satisfy the following conditions.
9 Properties of a Clustering Partition
1. The members in each subset are in some sense
similar and not similar to members in the other
subsets.
2. $Cl_k \neq \Phi$ for every k (Not empty)
3. $Cl_k \cap Cl_j = \Phi$ for $k \neq j$ (Pairwise disjoint)
4. $\bigcup_{k=1}^{M} Cl_k = S$ (Exhaustive)
$\Phi$ is the Null Set
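Properties 2 to 4 can be checked mechanically. The sketch below is only an illustration: the function name `is_valid_partition` is hypothetical, and it does not test property 1, which depends on the chosen similarity measure.

```python
def is_valid_partition(data, clusters):
    """Check that `clusters` is a non-empty, disjoint, exhaustive split of `data`."""
    data = set(data)
    clusters = [set(c) for c in clusters]
    union = set().union(*clusters)
    # Property 2: no cluster is empty.
    non_empty = all(len(c) > 0 for c in clusters)
    # Property 3: pairwise disjoint, i.e. no element is counted twice.
    pairwise_disjoint = sum(len(c) for c in clusters) == len(union)
    # Property 4: the clusters together cover the whole data set.
    exhaustive = union == data
    return non_empty and pairwise_disjoint and exhaustive
```

For example, splitting x1, x2, x3 into {x1} and {x2, x3} passes, while a split that repeats x2 in both subsets fails the disjointness test.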
10 Illustration of Clusters and Cluster centers
11 For quantitative data we use measures of
similarity and dissimilarity between pattern
samples and clusters
Euclidean Distance between two pattern vectors x
and y
$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
The smaller the distance, the larger the similarity
12 A few Methods for Clustering of Quantitative Data are
1. K-Means Clustering Algorithm
2. Hierarchical Clustering Algorithm
3. ISODATA Clustering Algorithm
4. Fuzzy Clustering Algorithm
13 K-Means Clustering Algorithm
Basic Procedure
- Randomly select K cluster centers from the pattern space or data set
- Distribute the set of patterns to the cluster centers using minimum distance
- Compute new cluster centers for each cluster
- Continue this process until the cluster centers do not change.
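The basic procedure can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the names `euclidean` and `k_means`, the `seed` and `max_iter` parameters, and the choice to keep the old center of an empty cluster are all assumptions made here.

```python
import math
import random

def euclidean(x, y):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def k_means(points, k, seed=0, max_iter=100):
    """Basic K-means on a list of numeric tuples; returns (centers, labels)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)      # initial centers drawn from the data set
    labels = []
    for _ in range(max_iter):
        # Assign each pattern to its nearest center (minimum distance).
        labels = [min(range(k), key=lambda c: euclidean(p, centers[c]))
                  for p in points]
        # Recompute each center as the mean of the patterns assigned to it.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(
                    tuple(sum(v) / len(members) for v in zip(*members)))
            else:
                new_centers.append(centers[c])   # assumption: keep old center
        if new_centers == centers:       # stop when the centers do not change
            break
        centers = new_centers
    return centers, labels
```

Note that the recomputed centers are averages and so need not coincide with any pattern in the data set.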
14 Three Main Types of Data
1. Quantitative
2. Qualitative or categorical
3. Mixed Quantitative and Qualitative
The first two types can be further broken down
into special categories as follows
15 Quantitative Data
16 Qualitative or Categorical Data
17 Nominal Non-Binary Examples
Color of eyes: blue, brown, black, green, gray
Car owned: Ford, Toyota, Kia, Renault, Mercedes
Nominal Binary Examples
Answer to true/false question: True, False
Position of a switch: on, off
18 Linearly Ordered (Ordinal) Qualitative
Answer to a person's health: excellent, very good, average, fair, poor
Hierarchical or Structurally Ordered Qualitative
Answer to type of figure: rectangle, triangle, hexagon, circle, ellipse
19 Hierarchical Structured Qualitative Data - Answer
to type of figure
20 Lattice Structurally Ordered Qualitative
Answer to type of education: Elementary
School, High school, Apprenticeship,
Undergraduate school, on the job training,
Graduate school, post graduate school
21 Measure of Performance for Clustering of Categorical Data
Overall performance measure J for a given set of
clusters $Cl_k$ for $k = 1, 2, \ldots, K$
$$J = \sum_{k=1}^{K} \sum_{x_i^k \in Cl_k} d(x_i^k, M_k)$$
$M_k$ is the kth cluster Representative Vector
where $d(x_i^k, M_k)$ is the measure of distance for data
vectors
22 We wish to minimize J by the selection of the
representative elements of each cluster and the
elements of each cluster (the partition of the
data set). This overall performance measure J
can be minimized by a two-stage iterative process
similar to the steps given in the standard
K-Means algorithm.
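As a rough illustration, J can be evaluated for any candidate partition once a distance function is available. The sketch assumes J is the sum, over all clusters, of the distances between each member and its cluster representative. The names `performance_measure` and `mismatch_count` are hypothetical, and the toy distance is only a stand-in for a problem-specific measure.

```python
def mismatch_count(a, b):
    """Toy distance (an assumption): differing positions plus length difference."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))

def performance_measure(clusters, representatives, distance):
    """Sum over clusters of the distances from each member to its representative."""
    return sum(
        distance(x, rep)
        for members, rep in zip(clusters, representatives)
        for x in members
    )

# Two small clusters of symbol sequences with one representative each.
clusters = [["NB", "NBT"], ["MBT", "MBTD"]]
representatives = ["NB", "MBT"]
J = performance_measure(clusters, representatives, mismatch_count)
```

A smaller J means the members sit closer to their representatives, so competing partitions of the same data set can be compared by this single number.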
23 Proposed Modified K-Means Clustering
for Qualitative Data
Basic Procedure
- Randomly select K cluster centers from the pattern space or data set
- Distribute the set of patterns to the cluster centers using minimum distance
- Compute new cluster centers for each cluster
- Continue this process until the cluster centers do not change.
24 Proposed Modified K-Means Algorithm
(Step 0) Selection of Initial Cluster Centers.
There are many ways to select the initial
cluster centers. Perhaps the simplest way is to
select a set of sequences in the data set randomly.
25 (Step 1) Redistribution of Sequences to Cluster Centers
We have chosen to redistribute each
sequence to the cluster center that is its
nearest neighbor. Thus, each vector is assigned
to the closest cluster center, where closest is
with respect to some predefined distance measure.
26 (Step 2) Selection of Cluster Centers
Choose the sequence in the cluster that has
the smallest sum of distances from the
sequence to all other sequences in the
cluster. Resolve ties randomly.
The fact that the cluster center selected in
this way is always a member of the data set
contrasts with the standard K-means algorithm for
numerical data, where the cluster center is not
necessarily a member of the original data set
because it represents the average of the points
in each cluster.
Steps (1) and (2) are repeated until the cluster
centers do not change.
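Steps 0 to 2 can be sketched as follows, working only from a precomputed table of pairwise distances. This is one possible reading of the procedure rather than the lecture's own code. The function name `modified_k_means`, the `seed` and `max_iter` parameters, and the handling of an empty cluster are assumptions.

```python
import random

def modified_k_means(dist, k, seed=0, max_iter=100):
    """Cluster items 0..n-1 from a distance table `dist` (list of rows).

    Returns (centers, clusters); every center is an index into the data set.
    """
    rng = random.Random(seed)
    n = len(dist)
    centers = rng.sample(range(n), k)        # Step 0: random members as centers
    clusters = []
    for _ in range(max_iter):
        # Step 1: assign every item to its nearest cluster center.
        clusters = [[] for _ in range(k)]
        for i in range(n):
            nearest = min(range(k), key=lambda c: dist[i][centers[c]])
            clusters[nearest].append(i)
        # Step 2: the new center is the member with the smallest sum of
        # distances to the other members; ties are resolved randomly.
        new_centers = []
        for c, members in enumerate(clusters):
            if not members:                  # assumption: keep the old center
                new_centers.append(centers[c])
                continue
            sums = {m: sum(dist[m][o] for o in members) for m in members}
            best = min(sums.values())
            new_centers.append(
                rng.choice([m for m in members if sums[m] == best]))
        if new_centers == centers:           # centers unchanged: stop
            break
        centers = new_centers
    return centers, clusters

# Toy usage: six items known to the algorithm only through their distance table.
positions = [0, 1, 2, 10, 11, 13]            # used here only to build the table
dist = [[abs(a - b) for b in positions] for a in positions]
centers, clusters = modified_k_means(dist, k=2)
```

Because the algorithm never averages anything, the same function applies unchanged to any data for which a table of pairwise distances can be written down.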
27 Examples Using Proposed Clustering Algorithm
Example 1 Structural data with missing
components
Example 2 Archaeological Sequential Data
28 Example 1 Missing data sequential
Letting b equal an unknown symbol, the above can
be written as
Use the modified K-Means clustering algorithm to
obtain two clusters of the data set.
29 Solution
First define a measure of distance
between members of the data sets as
Subjective assignment
Next randomly select cluster centers
30 Redistribute the samples to the cluster centers
Using the defined distance measure, assign to the
nearest cluster center
This yields the following new trial clusters
31 Determine new cluster center for Cluster 1 using
minimum row sum
Therefore, since row three has the smallest sum,
the new Cluster 1 center becomes
32 Determine new cluster center for Cluster 2 using
minimum row sum
Therefore, since row two has the smallest sum, the
new Cluster 2 center becomes
33 Redistribute to obtain
34 Determine new cluster center for Cluster 1 using
minimum row sum
Therefore, since row three has the smallest sum,
the new Cluster 1 center becomes
35 Determine new cluster center for Cluster 2 using
minimum row sum
These are the same cluster centers as in the previous
iteration; thus the final clustering becomes
36 Tree of possible sequences for Example 1
37 Example 2 Clustering of Archaeological data
Depositional Sequence
Sample 1          Sample 2
sherds            general fill
general fill      sherds
whole pots        roof fall
roof fall         wall fall
wall fall         whole pots
floor artifacts   floor artifacts
38 Archaeological Categorical Sequential Data
Euclidean Distance is no longer a meaningful or,
for that matter, a computable measure of distance.
Thus, new intra-set distance measures must be
defined as well as different methods for
selecting representative elements or cluster
centers. Techniques for clustering different
size vectors or vectors containing sequential
relationships have not received the attention of
researchers, perhaps because little software is
available to handle this type of problem conveniently.
39 The following code was set up for this example
N  Floors with few or no artifacts (<10)
M  Floors with many artifacts (>10)
U  Layer of unburned roofing material
B  Layer of burned roofing
T  Refuse
S  Deposits of aeolian sand
D  Detritus from the cave roof
?  Unknown deposits
40 Given the Broken Flute Cave Strata Data Set
ID    Seq     ID    Seq     ID    Seq
x1    NB      x5    NSBT    x8a   NSB
x2    NB      x6    MBT     x9    MBTD
x3    NBT     x7    MBT     x11   NSUT
x4    NBT     x8    MBT     x12   NBT
Find four clusters that characterize the data
41 Solution
Define a distance measure as the minimum weighted
number of changes or steps required to transform
one stratigraphic sequence into another.
Transformation rules: (1) addition or
deletion of a stratum, (2) changes in
kind of stratum, and (3) reversal of the order of strata.
Use differentially weighted
measures for transformations as follows.
42 Weights for various transformations
- 1) The addition or deletion of a stratum, e.g.,
adding sand (S) or deleting trash (T), was
weighted the least. Such transformations were
assigned a distance of 1 unit.
- 2) Changes in kind, e.g., burned roofing (B) vs.
unburned roofing (U) strata, were more heavily
weighted. Each was assigned a distance of 1.5
units.
43 - 3) Had we encountered reversals of order, e.g.,
burned roofing over sand (B S) vs. sand deposited
over burned roofing (S B), we would have weighted
them heaviest, assigning each a distance of 2
units.
- The reason for weighting reversals highly is
that they represent a significantly different
behavioral and depositional sequence.
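One way to mechanize a "minimum weighted number of changes" is a weighted edit distance with an adjacent-reversal step. The sketch below is an assumption, not the procedure used to build the lecture's distance table, and its output may differ from some of the tabulated entries. The function name `weighted_edit_distance` and its parameter names are hypothetical; the default weights are the 1, 1.5 and 2 units quoted above.

```python
def weighted_edit_distance(a, b, indel=1.0, substitute=1.5, reverse=2.0):
    """Minimum weighted cost of turning sequence `a` into sequence `b`.

    `a` and `b` are strings of stratum codes such as "NBT".
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = i * indel                  # delete every stratum of a
    for j in range(1, cols):
        d[0][j] = j * indel                  # add every stratum of b
    for i in range(1, rows):
        for j in range(1, cols):
            change = 0.0 if a[i - 1] == b[j - 1] else substitute
            d[i][j] = min(
                d[i - 1][j] + indel,         # deletion of a stratum
                d[i][j - 1] + indel,         # addition of a stratum
                d[i - 1][j - 1] + change,    # same stratum or change in kind
            )
            # Reversal of two adjacent strata (e.g. "BS" vs "SB").
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + reverse)
    return d[rows - 1][cols - 1]
```

With the default weights a reversal costs the same as one deletion plus one addition, so the reversal branch only changes the result if the weights are altered.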
44 Distances Between Stratigraphic Sequences at
Broken Flute Cave
      x1   x2   x3   x4   x5   x6   x7   x8   x8a  x9   x11  x12  sum
x1    0    0    1    1    3    2.5  2.5  2.5  2    3.5  3    1    21.5
x2    0    0    1    1    3    2.5  2.5  2.5  2    3.5  3    1    21.5
x3    1    1    0    0    3    1.5  1.5  1.5  3    2.5  3    0    18
x4    1    1    0    0    3    1.5  1.5  1.5  3    2.5  3    0    18
x5    3    3    3    3    0    4.5  4.5  4.5  1    4.5  1    3    35
x6    2.5  2.5  1.5  1.5  4.5  0    0    0    3.5  1    4.5  1.5  22
x7    2.5  2.5  1.5  1.5  4.5  0    0    0    3.5  1    4.5  1.5  22
x8    2.5  2.5  1.5  1.5  4.5  0    0    0    3.5  1    4.5  1.5  22
x8a   2    2    3    2    1    3.5  3.5  3.5  0    4.5  2.5  2    29.5
x9    3.5  3.5  2.5  2.5  4.5  1    1    1    4.5  0    4.5  2.5  31
x11   3    3    3    3    1    4.5  4.5  4.5  2.5  4.5  0    3    35.5
x12   1    1    0    0    3    1.5  1.5  1.5  3    2.5  3    0    18
45 Selection of Initial Cluster Centers
We chose the four initial cluster centers randomly as x11, x5, x9, x8a.
      cc1(1)=x11  cc2(1)=x5  cc3(1)=x9  cc4(1)=x8a  Assign
x1    3           3          3.5        2           cl4(1)
x2    3           3          3.5        2           cl4(1)
x3    3           3          2.5        3           cl3(1)
x4    3           3          2.5        3           cl3(1)
x5    1           0          4.5        1           cl2(1)
x6    4.5         4.5        1          3.5         cl3(1)
x7    4.5         4.5        1          3.5         cl3(1)
x8    4.5         4.5        1          3.5         cl3(1)
x8a   2.5         1          4.5        0           cl4(1)
x9    4.5         4.5        0          4.5         cl3(1)
x11   0           1          4.5        2.5         cl1(1)
x12   3           3          2.5        3           cl3(1)
46 Continuing with the iterations until convergence
gives the final four clusters as
Cl1 = {x11} = {NSUT}
Cl2 = {x5, x8a} = {NSBT, NSB}
Cl3 = {x6, x7, x8, x9} = {MBT, MBT, MBT, MBTD}
Cl4 = {x1, x2, x3, x4, x12} = {NB, NB, NBT, NBT, NBT}
We will not take time to discuss the
archaeological significance of this result.
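The iterations can be replayed from the tabulated distances. The sketch below is an illustration under stated assumptions: the rows of `DIST` are transcribed from the distance table exactly as printed (sum column omitted), the function name `cluster` is hypothetical, and ties are broken by taking the first candidate in table order instead of randomly so that the run is repeatable.

```python
IDS = ["x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8", "x8a", "x9", "x11", "x12"]
DIST = [
    [0, 0, 1, 1, 3, 2.5, 2.5, 2.5, 2, 3.5, 3, 1],
    [0, 0, 1, 1, 3, 2.5, 2.5, 2.5, 2, 3.5, 3, 1],
    [1, 1, 0, 0, 3, 1.5, 1.5, 1.5, 3, 2.5, 3, 0],
    [1, 1, 0, 0, 3, 1.5, 1.5, 1.5, 3, 2.5, 3, 0],
    [3, 3, 3, 3, 0, 4.5, 4.5, 4.5, 1, 4.5, 1, 3],
    [2.5, 2.5, 1.5, 1.5, 4.5, 0, 0, 0, 3.5, 1, 4.5, 1.5],
    [2.5, 2.5, 1.5, 1.5, 4.5, 0, 0, 0, 3.5, 1, 4.5, 1.5],
    [2.5, 2.5, 1.5, 1.5, 4.5, 0, 0, 0, 3.5, 1, 4.5, 1.5],
    [2, 2, 3, 2, 1, 3.5, 3.5, 3.5, 0, 4.5, 2.5, 2],
    [3.5, 3.5, 2.5, 2.5, 4.5, 1, 1, 1, 4.5, 0, 4.5, 2.5],
    [3, 3, 3, 3, 1, 4.5, 4.5, 4.5, 2.5, 4.5, 0, 3],
    [1, 1, 0, 0, 3, 1.5, 1.5, 1.5, 3, 2.5, 3, 0],
]

def cluster(dist, centers, max_iter=100):
    """Modified K-means from given initial centers (indices into the data set)."""
    k = len(centers)
    clusters = []
    for _ in range(max_iter):
        # Step 1: assign each sequence to its nearest cluster center.
        clusters = [[] for _ in range(k)]
        for i in range(len(dist)):
            nearest = min(range(k), key=lambda c: dist[i][centers[c]])
            clusters[nearest].append(i)
        # Step 2: new center = member with the minimum row sum in its cluster
        # (assumption: first candidate wins a tie, instead of a random pick).
        new_centers = [
            min(members, key=lambda m: sum(dist[m][o] for o in members))
            if members else centers[c]
            for c, members in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

start = [IDS.index(name) for name in ("x11", "x5", "x9", "x8a")]
final_centers, final_clusters = cluster(DIST, start)
named = [[IDS[i] for i in members] for members in final_clusters]
print(named)
```

Under these assumptions the run stops after three passes with the same four clusters listed above. A different tie-breaking rule could change which member is reported as each center.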
47 Advantages of the proposed method
1. Can obtain clusters for non-quantitative data
typical of archaeological data obtained in field work
2. Since the new cluster center of each
cluster is always a member of the data set, distances
between samples need only be computed once and
recalled from memory when needed, reducing
computation.
3. Rerunning the algorithm provides
different interpretations of the data.
4. Some structural information is provided by the
resulting cluster centers.
48 Disadvantages of the proposed method
1. Can converge to a local minimum
2. Must be run several times with different random
initial cluster centers
3. Results are dependent on the subjective distance
measures used.
4. Will always find clusters whether or not any
physical clusters exist.
5. With a very large data set (N data points),
requires storage of N(N-1)/2 distances or
recalculation of distances.
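The N(N-1)/2 figure corresponds to storing each unordered pair of samples once. A small sketch of such flat storage follows; it assumes the distance measure is symmetric, and the function name `condensed_index` is hypothetical.

```python
def condensed_index(i, j, n):
    """Position of the (i, j) distance in a flat list holding only pairs i < j."""
    if i == j:
        raise ValueError("the distance from an item to itself is not stored")
    if i > j:
        i, j = j, i                      # assumes d(i, j) == d(j, i)
    # Pairs are stored row by row: (0,1), (0,2), ..., (0,n-1), (1,2), ...
    return i * n - i * (i + 1) // 2 + (j - i - 1)

n = 12                                   # e.g. twelve stratigraphic sequences
stored = n * (n - 1) // 2                # number of distances kept in memory
```

For the twelve sequences of Example 2 this amounts to 66 stored distances instead of the 144 entries of the full table.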
49 Lecture Summary
- Discussed the General Concept of Clustering
- Presented definitions of Types of Data
- Reviewed the K-Means Clustering Algorithm
- Described Clustering Method for non-Quantitative data
- Presented two examples illustrating the method
- Discussed advantages and disadvantages of the
proposed clustering method
50 Thank you for your attention and I am happy to
answer any questions you might have regarding
this presentation.
51 End of Lecture