Title: CIS732-Lecture-36-20070418
1. Lecture 21 of 42
Partitioning-Based Clustering and Expectation Maximization (EM)
Monday, 10 March 2008
William H. Hsu, Department of Computing and Information Sciences, KSU
http://www.cis.ksu.edu/Courses/Spring-2008/CIS732
Readings: Sections 7.1-7.3, Han & Kamber 2e
2. EM Algorithm: Example 3
3. EM for Unsupervised Learning
- Unsupervised Learning Problem
- Objective: estimate a probability distribution with unobserved variables
- Use EM to estimate mixture policy (more on this later; see 6.12, Mitchell)
- Pattern Recognition Examples
- Human-computer intelligent interaction (HCII)
- Detecting facial features in emotion recognition
- Gesture recognition in virtual environments
- Computational medicine [Frey, 1998]
- Determining morphology (shapes) of bacteria, viruses in microscopy
- Identifying cell structures (e.g., nucleus) and shapes in microscopy
- Other image processing
- Many other examples (audio, speech, signal processing; motor control; etc.)
- Inference Examples
- Plan recognition: mapping from (observed) actions to agents' (hidden) plans
- Hidden changes in context: e.g., aviation; computer security; MUDs
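As a concrete instance of estimating a mixture distribution with unobserved (cluster) variables, here is a minimal EM loop for a two-component 1-D Gaussian mixture. All names, the initialization, and the data are illustrative, not from the lecture:

```python
import math

def em_gmm_1d(xs, iters=50):
    """EM for a two-component 1-D Gaussian mixture.
    E-step: soft class memberships; M-step: re-fit parameters to them."""
    # Illustrative initialization: put the two means at the data extremes.
    mu = [min(xs), max(xs)]
    var = [1.0, 1.0]
    pi = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility resp[i][j] = P(component j | x_i)
        resp = []
        for x in xs:
            p = [pi[j] / math.sqrt(2 * math.pi * var[j])
                 * math.exp(-(x - mu[j]) ** 2 / (2 * var[j])) for j in range(2)]
            s = sum(p)
            resp.append([pj / s for pj in p])
        # M-step: update mixing weights, means, variances from the soft counts
        for j in range(2):
            nj = sum(r[j] for r in resp)
            pi[j] = nj / len(xs)
            mu[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var[j] = sum(r[j] * (x - mu[j]) ** 2 for r, x in zip(resp, xs)) / nj + 1e-6
    return pi, mu, var

# two well-separated groups of points, around 0 and around 10
data = [-0.5, 0.0, 0.4, 0.1, 9.6, 10.0, 10.3, 9.9]
pi, mu, var = em_gmm_1d(data)
```

Each E-step yields a probability distribution over class membership (compare the AutoClass discussion below); the M-step re-fits the parameters to those soft assignments.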
4. Unsupervised Learning: AutoClass 1
5. Unsupervised Learning: AutoClass 2
- AutoClass Algorithm [Cheeseman et al., 1988]
- Based on maximizing P(x | θj, yj, J)
- θj: class (cluster) parameters (e.g., mean and variance)
- yj: synthetic classes (can estimate marginal P(yj) any time)
- Apply Bayes' Theorem, use numerical BOC estimation techniques (cf. Gibbs)
- Search objectives
- Find best J (ideally integrate out θj, yj; really, start with big J and decrease)
- Find θj, yj: use MAP estimation, then integrate in the neighborhood of yMAP
- EM: Find MAP Estimate for P(x | θj, yj, J) by Iterative Refinement
- Advantages over Symbolic (Non-Numerical) Methods
- Returns probability distribution over class membership
- More robust than best yj
- Compare: fuzzy set membership (similar, but probabilistically motivated)
- Can deal with continuous as well as discrete data
6. Unsupervised Learning: AutoClass 3
- AutoClass Resources
- Beginning tutorial (AutoClass II): Cheeseman et al., 4.2.2 Buchanan and Wilkins
- Project page: http://ic-www.arc.nasa.gov/ic/projects/bayes-group/autoclass/
- Applications
- Knowledge discovery in databases (KDD) and data mining
- Infrared astronomical satellite (IRAS) spectral atlas (sky survey)
- Molecular biology: pre-clustering DNA acceptor, donor sites (mouse, human)
- LandSat data from Kansas (30 km2 region, 1024 x 1024 pixels, 7 channels)
- Positive findings: see book chapter by Cheeseman and Stutz, online
- Other typical applications: see KD Nuggets (http://www.kdnuggets.com)
- Implementations
- Obtaining source code from project page
- AutoClass III: Lisp implementation [Cheeseman, Stutz, Taylor, 1992]
- AutoClass C: C implementation [Cheeseman, Stutz, Taylor, 1998]
- These and others at http://www.recursive-partitioning.com/cluster.html
7. Unsupervised Learning: Competitive Learning for Feature Discovery
- Intuitive Idea: Competitive Mechanisms for Unsupervised Learning
- Global organization from local, competitive weight update
- Basic principle expressed by von der Malsburg
- Guiding examples from (neuro)biology: lateral inhibition
- Previous work: [Hebb, 1949; Rosenblatt, 1959; von der Malsburg, 1973; Fukushima, 1975; Grossberg, 1976; Kohonen, 1982]
- A Procedural Framework for Unsupervised Connectionist Learning
- Start with identical (neural) processing units, with random initial parameters
- Set limit on activation strength of each unit
- Allow units to compete for right to respond to a set of inputs
- Feature Discovery
- Identifying (or constructing) new features relevant to supervised learning
- Examples: finding distinguishable letter characteristics in handwritten character recognition (HCR), optical character recognition (OCR)
- Competitive learning: transform X into X'; train units in X' closest to x
8. Unsupervised Learning: Kohonen's Self-Organizing Map (SOM) 1
- Another Clustering Algorithm
- aka Self-Organizing Feature Map (SOFM)
- Given: vectors of attribute values (x1, x2, ..., xn)
- Returns: vectors of attribute values (x1', x2', ..., xk')
- Typically, n >> k (n is high, k = 1, 2, or 3; hence dimensionality reducing)
- Output vectors x', the projections of input points x; also get P(xj' | xi)
- Mapping from x to x' is topology preserving
- Topology Preserving Networks
- Intuitive idea: similar input vectors will map to similar clusters
- Recall informal definition of cluster (isolated set of mutually similar entities)
- Restatement: clusters of X (high-D) will still be clusters of X' (low-D)
- Representation of Node Clusters
- Group of neighboring artificial neural network units (neighborhood of nodes)
- SOMs combine ideas of topology-preserving networks, unsupervised learning
- Implementation: http://www.cis.hut.fi/nnrc/ and MATLAB NN Toolkit
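A minimal sketch of the competitive update behind a SOM, assuming a 1-D chain of units over 2-D inputs. Parameter choices and names are illustrative; real SOM implementations such as those at the URLs above differ in many details:

```python
import math, random

def train_som(data, k=3, epochs=30, lr=0.5, radius=1.0):
    """SOM sketch: a 1-D chain of k units competitively mapping 2-D inputs.
    The winning unit and its grid neighbors move toward each input."""
    random.seed(0)
    units = [list(random.choice(data)) for _ in range(k)]  # init at random data points
    for _ in range(epochs):
        for x in data:
            # competition: find the best-matching unit (BMU)
            bmu = min(range(k),
                      key=lambda j: sum((units[j][d] - x[d]) ** 2 for d in range(2)))
            # cooperation: neighbors on the 1-D grid also update, with Gaussian falloff
            for j in range(k):
                h = math.exp(-((j - bmu) ** 2) / (2 * radius ** 2))
                for d in range(2):
                    units[j][d] += lr * h * (x[d] - units[j][d])
        lr *= 0.9  # decay the learning rate each epoch
    return units

# three small groups of 2-D points; units should settle inside the data region
data = [(0, 0), (0.2, 0.1), (5, 5), (5.1, 4.9), (10, 0), (9.8, 0.2)]
units = train_som(data)
```

Because neighboring grid units update together, nearby inputs end up mapped to nearby units, which is the topology-preserving property described above.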
9. Unsupervised Learning: Kohonen's Self-Organizing Map (SOM) 2
10. Unsupervised Learning: Kohonen's Self-Organizing Map (SOM) 3
11. Unsupervised Learning: SOM and Other Projections for Clustering
Cluster Formation and Segmentation Algorithm (Sketch)
12. Unsupervised Learning: Other Algorithms (PCA, Factor Analysis)
- Intuitive Idea
- Q: Why are dimensionality-reducing transforms good for supervised learning?
- A: There may be many attributes with undesirable properties, e.g.,
- Irrelevance: xi has little discriminatory power over c(x) = yi
- Sparseness of information: feature of interest spread out over many xi's (e.g., text document categorization, where xi is a word position)
- We want to increase the information density by "squeezing" X down
- Principal Components Analysis (PCA)
- Combining redundant variables into a single variable (aka component, or factor)
- Example: ratings (e.g., Nielsen) and polls (e.g., Gallup); responses to certain questions may be correlated (e.g., "like fishing?" and "time spent boating")
- Factor Analysis (FA)
- General term for a class of algorithms that includes PCA
- Tutorial: http://www.statsoft.com/textbook/stfacan.html
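As a sketch of the PCA idea (redundant, correlated variables collapsing into one component), the first principal component can be found by power iteration on the covariance matrix. This toy version handles only 2-D data, and the names are ours:

```python
import math

def first_principal_component(data, iters=100):
    """First principal component of 2-D data via power iteration
    on the 2x2 covariance matrix."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    centered = [(x - mx, y - my) for x, y in data]
    # entries of the 2x2 covariance matrix
    cxx = sum(x * x for x, _ in centered) / n
    cyy = sum(y * y for _, y in centered) / n
    cxy = sum(x * y for x, y in centered) / n
    v = (1.0, 0.0)  # arbitrary starting vector
    for _ in range(iters):
        # multiply by the covariance matrix, then renormalize
        w = (cxx * v[0] + cxy * v[1], cxy * v[0] + cyy * v[1])
        norm = math.hypot(*w)
        v = (w[0] / norm, w[1] / norm)
    return v

# two perfectly correlated ("redundant") variables:
# the component should point along the diagonal, (1/sqrt(2), 1/sqrt(2))
data = [(1, 1), (2, 2), (3, 3), (4, 4)]
v = first_principal_component(data)
```

Projecting each point onto v replaces the two correlated attributes with a single component, exactly the "combine redundant variables" idea above.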
13. Clustering Methods: Design Choices
14. Clustering Applications
Information Retrieval: Text Document Categorization
15. Unsupervised Learning and Constructive Induction
- Unsupervised Learning in Support of Supervised Learning
- Given: D ≡ labeled vectors (x, y)
- Return: D' ≡ transformed training examples (x', y')
- Solution approach: constructive induction
- Feature construction: generic term
- Cluster definition
- Feature Construction (Front End)
- Synthesizing new attributes
- Logical: x1 ∧ ¬x2, arithmetic: x1 + x5 / x2
- Other synthetic attributes: f(x1, x2, ..., xn), etc.
- Dimensionality-reducing projection, feature extraction
- Subset selection: finding relevant attributes for a given target y
- Partitioning: finding relevant attributes for given targets y1, y2, ..., yp
- Cluster Definition (Back End)
- Form, segment, and label clusters to get intermediate targets y'
- Change of representation: find an (x', y') that is good for learning target y
x' ≡ (x1', ..., xp')
16. Clustering: Relation to Constructive Induction
- Clustering versus Cluster Definition
- Clustering: 3-step process
- Cluster definition: back end for feature construction
- Clustering: 3-Step Process
- Form
- (x1', ..., xk') in terms of (x1, ..., xn)
- NB: typically part of construction step, sometimes integrates both
- Segment
- (y1, ..., yJ) in terms of (x1', ..., xk')
- NB: number of clusters J not necessarily same as number of dimensions k
- Label
- Assign names (discrete/symbolic labels (v1, ..., vJ)) to (y1, ..., yJ)
- Important in document categorization (e.g., clustering text for info retrieval)
- Hierarchical Clustering: Applying Clustering Recursively
17. CLUTO
- Clustering Algorithms
- High-performance, high-quality partitional clustering
- High-quality agglomerative clustering
- High-quality graph-partitioning-based clustering
- Hybrid partitional/agglomerative algorithms for building trees for very large datasets
- Cluster Analysis Tools
- Cluster signature identification
- Cluster organization identification
- Visualization Tools
- Hierarchical trees
- High-dimensional datasets
- Cluster relations
- Interfaces
- Stand-alone programs
- Library with a fully published API
- Available on Windows, Sun, and Linux
http://www.cs.umn.edu/cluto
18. Today
- Clustering
- Distance Measures
- Graph-based Techniques
- K-Means Clustering
- Tools and Software for Clustering
19. Prediction, Clustering, Classification
- What is Prediction?
- The goal of prediction is to forecast or deduce the value of an attribute based on values of other attributes
- A model is first created based on the data distribution
- The model is then used to predict future or unknown values
- Supervised vs. Unsupervised Classification
- Supervised Classification = Classification
- We know the class labels and the number of classes
- Unsupervised Classification = Clustering
- We do not know the class labels and may not know the number of classes
20. What is Clustering in Data Mining?
Clustering is a process of partitioning a set of data (or objects) into a set of meaningful sub-classes, called clusters
Helps users understand the natural grouping or structure in a data set
- Cluster
- a collection of data objects that are similar to one another and thus can be treated collectively as one group
- but, as a collection, they are sufficiently different from other groups
- Clustering
- unsupervised classification
- no predefined classes
21. Requirements of Clustering Methods
- Scalability
- Dealing with different types of attributes
- Discovery of clusters with arbitrary shape
- Minimal requirements for domain knowledge to determine input parameters
- Able to deal with noise and outliers
- Insensitive to order of input records
- The curse of dimensionality
- Interpretability and usability
22. Applications of Clustering
- Clustering has wide applications in:
- Pattern Recognition
- Spatial Data Analysis
- create thematic maps in GIS by clustering feature spaces
- detect spatial clusters and explain them in spatial data mining
- Image Processing
- Market Research
- Information Retrieval
- Document or term categorization
- Information visualization and IR interfaces
- Web Mining
- Cluster Web usage data to discover groups of similar access patterns
- Web Personalization
23. Clustering Methodologies
- Two general methodologies
- Partitioning-Based Algorithms
- Hierarchical Algorithms
- Partitioning-Based
- divide a set of N items into K clusters (top-down)
- Hierarchical
- agglomerative: pairs of items or clusters are successively linked to produce larger clusters
- divisive: start with the whole set as a cluster and successively divide sets into smaller partitions
24. Distance or Similarity Measures
- Measuring Distance
- In order to group similar items, we need a way to measure the distance between objects (e.g., records)
- Note: distance = inverse of similarity
- Often based on the representation of objects as feature vectors
Term Frequencies for Documents
An Employee DB
Which objects are more similar?
25. Distance or Similarity Measures
- Properties of Distance Measures
- for all objects A and B, dist(A, B) ≥ 0, and dist(A, B) = dist(B, A)
- for any object A, dist(A, A) = 0
- dist(A, C) ≤ dist(A, B) + dist(B, C)
- Common Distance Measures
- Manhattan distance
- Euclidean distance
- Cosine similarity
Can be normalized to make values fall between 0
and 1.
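The three common measures translate directly into code. The term-frequency vectors below are hypothetical, and note that cosine is a similarity (higher = more alike), not a distance:

```python
import math

def manhattan(a, b):
    """Sum of absolute coordinate differences (city-block distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_sim(a, b):
    """Cosine of the angle between the two vectors: dot product
    divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# hypothetical term-frequency vectors for two documents
d1, d2 = [3, 0, 2, 1], [1, 1, 2, 0]
```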
26. Distance or Similarity Measures
- Weighting Attributes
- in some cases we want some attributes to count more than others
- associate a weight with each of the attributes in calculating distance
- Nominal (categorical) Attributes
- can use simple matching: distance = 1 if values match, 0 otherwise
- or convert each nominal attribute to a set of binary attributes, then use the usual distance measure
- if all attributes are nominal, we can normalize by dividing the number of matches by the total number of attributes
- Normalization
- want values to fall between 0 and 1
- other variations possible
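One way to combine per-attribute weights, normalized numeric attributes, and simple matching for nominal attributes into a single distance. The function, the record layout, and the values are our illustration, not from the slides:

```python
import math

def weighted_mixed_distance(a, b, kinds, weights):
    """Weighted Euclidean-style distance over mixed attributes.
    kinds[i] is 'num' (values assumed pre-normalized to [0, 1]) or
    'nom' (mismatch contributes 1, match contributes 0);
    weights[i] scales attribute i's contribution."""
    total = 0.0
    for ai, bi, kind, w in zip(a, b, kinds, weights):
        diff = abs(ai - bi) if kind == 'num' else (0.0 if ai == bi else 1.0)
        total += w * diff ** 2
    return math.sqrt(total)

# hypothetical records: (gender, normalized salary, normalized age)
r1 = ('M', 0.30, 0.40)
r2 = ('F', 0.34, 0.40)
d = weighted_mixed_distance(r1, r2, kinds=('nom', 'num', 'num'), weights=(1, 1, 1))
```

Raising a weight above 1 makes that attribute dominate the distance; setting it to 0 removes the attribute entirely, which is the feature-selection connection noted later.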
27. Distance or Similarity Measures
- Example
- max distance for salary: 100000 - 19000 = 79000
- max distance for age: 52 - 27 = 25
- dist(ID2, ID3) = SQRT( 0 + (0.04)^2 + (0.44)^2 ) = 0.44
- dist(ID2, ID4) = SQRT( 1 + (0.72)^2 + (0.12)^2 ) = 1.24
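Checking the slide's arithmetic, using the per-attribute differences it states (the underlying employee records themselves are not reproduced here):

```python
import math

def dist(diffs):
    """Euclidean distance from a list of per-attribute differences."""
    return math.sqrt(sum(d * d for d in diffs))

# per-attribute differences from the slide's example:
# (ID2, ID3): nominal attribute matches (0), salary diff 0.04, age diff 0.44
# (ID2, ID4): nominal attribute differs (1), salary diff 0.72, age diff 0.12
d23 = dist([0, 0.04, 0.44])
d24 = dist([1, 0.72, 0.12])
```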
28. Domain Specific Distance Functions
- For some data sets, we may need to use specialized functions
- we may want a single or a selected group of attributes to be used in the computation of distance - same problem as feature selection
- may want to use special properties of one or more attributes in the data
- natural distance functions may exist in the data

Example: Zip Codes
distzip(A, B) = 0, if zip codes are identical
distzip(A, B) = 0.1, if first 3 digits are identical
distzip(A, B) = 0.5, if first digits are identical
distzip(A, B) = 1, if first digits are different

Example: Customer Solicitation
distsolicit(A, B) = 0, if both A and B responded
distsolicit(A, B) = 0.1, if both A and B were chosen but did not respond
distsolicit(A, B) = 0.5, if both A and B were chosen, but only one responded
distsolicit(A, B) = 1, if one was chosen, but the other was not
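The zip-code tiers translate directly into code (assuming zip codes are given as 5-digit strings; the example codes are made up):

```python
def dist_zip(a, b):
    """Domain-specific distance for 5-digit zip-code strings,
    following the tiered definition above."""
    if a == b:
        return 0.0          # identical zip codes
    if a[:3] == b[:3]:
        return 0.1          # same first 3 digits (same sectional center)
    if a[0] == b[0]:
        return 0.5          # same first digit (same broad region)
    return 1.0              # different regions
```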
29. Distance (Similarity) Matrix
- Similarity (Distance) Matrix
- based on the distance or similarity measure, we can construct a symmetric matrix of distance (or similarity) values
- the (i, j) entry in the matrix is the distance (similarity) between items i and j

Note that dij = dji (i.e., the matrix is symmetric), so we only need the lower triangle part of the matrix. The diagonal is all 1's (similarity) or all 0's (distance).
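A sketch of building such a matrix (Euclidean distance, so the diagonal is all 0's): only the lower triangle is computed, then mirrored to fill the symmetric upper half.

```python
import math

def distance_matrix(items):
    """Symmetric Euclidean distance matrix for a list of feature vectors.
    d[i][j] == d[j][i]; the diagonal stays 0."""
    n = len(items)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i):  # lower triangle only
            d[i][j] = math.sqrt(sum((a - b) ** 2
                                    for a, b in zip(items[i], items[j])))
            d[j][i] = d[i][j]  # mirror: d_ij = d_ji
    return d

m = distance_matrix([(0, 0), (3, 4), (0, 1)])
```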
30. Example: Term Similarities in Documents
Term-Term Similarity Matrix
31. Similarity (Distance) Thresholds
- A similarity (distance) threshold may be used to
mark pairs that are sufficiently similar
Using a threshold value of 10 in the previous
example
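Thresholding a similarity matrix into a 0/1 "sufficiently similar" matrix; the 3x3 matrix below is a made-up stand-in for the slide's larger term-term example, with the same threshold of 10:

```python
def threshold_matrix(sim, t):
    """Mark pairs whose similarity is at least t (1 = sufficiently similar).
    The diagonal is left 0 since an item's similarity to itself is not a pair."""
    n = len(sim)
    return [[1 if i != j and sim[i][j] >= t else 0 for j in range(n)]
            for i in range(n)]

# toy symmetric term-term similarity matrix
sim = [[0, 12, 4],
       [12, 0, 11],
       [4, 11, 0]]
adj = threshold_matrix(sim, 10)
```

The resulting 0/1 matrix is exactly the adjacency matrix of the undirected graph described on the next slide.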
32. Graph Representation
- The similarity matrix can be visualized as an undirected graph
- each item is represented by a node, and edges represent the fact that two items are similar (a one in the similarity threshold matrix)

If no threshold is used, then the matrix can be represented as a weighted graph
33. Simple Clustering Algorithms
- If we are interested only in the threshold (and not the degree of similarity or distance), we can use the graph directly for clustering
- Clique Method (complete link)
- all items within a cluster must be within the similarity threshold of all other items in that cluster
- clusters may overlap
- generally produces small but very tight clusters
- Single Link Method
- any item in a cluster must be within the similarity threshold of at least one other item in that cluster
- produces larger but weaker clusters
- Other methods
- star method: start with an item and place all related items in that cluster
- string method: start with an item; place one related item in that cluster; then place another item related to the last item entered, and so on
34. Simple Clustering Algorithms
- Clique Method
- a clique is a completely connected subgraph of a graph
- in the clique method, each maximal clique in the graph becomes a cluster

[Figure: threshold graph on nodes T1-T8]

Maximal cliques (and therefore the clusters) in the previous example are: {T1, T3, T4, T6}, {T2, T4, T6}, {T2, T6, T8}, {T1, T5}, {T7}. Note that, for example, {T1, T3, T4} is also a clique, but is not maximal.
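A brute-force maximal-clique finder, fine for tiny graphs like this one. The edge list below is our reconstruction, chosen to be consistent with the cliques listed in the example:

```python
from itertools import combinations

def maximal_cliques(nodes, edges):
    """All maximal cliques of a small undirected graph, by checking
    candidate subsets from largest to smallest (exponential; toy use only)."""
    adj = {u: set() for u in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    cliques = []
    for r in range(len(nodes), 0, -1):
        for sub in combinations(nodes, r):
            s = set(sub)
            # a clique: every pair of members is connected
            is_clique = all(v in adj[u] for u, v in combinations(sub, 2))
            # maximal: not strictly contained in a clique already found
            if is_clique and not any(s < c for c in cliques):
                cliques.append(s)
    return cliques

# reconstructed threshold graph consistent with the slide's clique example
edges = [("T1", "T3"), ("T1", "T4"), ("T1", "T6"), ("T3", "T4"), ("T3", "T6"),
         ("T4", "T6"), ("T2", "T4"), ("T2", "T6"), ("T2", "T8"), ("T6", "T8"),
         ("T1", "T5")]
nodes = ["T1", "T2", "T3", "T4", "T5", "T6", "T7", "T8"]
cliques = maximal_cliques(nodes, edges)
```

Since each maximal clique becomes a cluster, items such as T2 and T6 appear in more than one cluster, illustrating that the clique method allows overlap.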
35. Simple Clustering Algorithms
- Single Link Method
- select an item not in a cluster and place it in a new cluster
- place all other similar items in that cluster
- repeat step 2 for each item in the cluster until nothing more can be added
- repeat steps 1-3 for each item that remains unclustered

[Figure: threshold graph on nodes T1-T8]

In this case the single link method produces only two clusters: {T1, T3, T4, T5, T6, T2, T8} and {T7}. Note that the single link method does not allow overlapping clusters, thus partitioning the set of items.
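On the threshold graph, single link clustering amounts to finding connected components; the numbered steps above map directly onto this sketch (edge list reconstructed to match the example):

```python
def single_link_clusters(nodes, edges):
    """Single link clustering on a threshold graph = connected components:
    grow each cluster by repeatedly adding any item similar to a member."""
    adj = {u: set() for u in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    unseen, clusters = set(nodes), []
    while unseen:
        frontier = [unseen.pop()]       # step 1: new cluster from an unclustered item
        cluster = set(frontier)
        while frontier:                 # steps 2-3: add similar items until none fit
            for v in adj[frontier.pop()]:
                if v in unseen:
                    unseen.remove(v)
                    cluster.add(v)
                    frontier.append(v)
        clusters.append(cluster)        # step 4: repeat for remaining items
    return clusters

# same reconstructed threshold graph as in the clique example
edges = [("T1", "T3"), ("T1", "T4"), ("T1", "T6"), ("T3", "T4"), ("T3", "T6"),
         ("T4", "T6"), ("T2", "T4"), ("T2", "T6"), ("T2", "T8"), ("T6", "T8"),
         ("T1", "T5")]
clusters = single_link_clusters(["T1", "T2", "T3", "T4", "T5", "T6", "T7", "T8"],
                                edges)
```

Because every item lands in exactly one component, the result is a partition, unlike the overlapping clusters of the clique method.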
36. Clustering with Existing Clusters
- The notion of comparing item similarities can be extended to clusters themselves, by focusing on a representative vector for each cluster
- cluster representatives can be actual items in the cluster or other "virtual" representatives such as the centroid
- this methodology reduces the number of similarity computations in clustering
- clusters are revised successively until a stopping condition is satisfied, or until no more changes to clusters can be made
- Partitioning Methods
- reallocation method: start with an initial assignment of items to clusters and then move items from cluster to cluster to obtain an improved partitioning
- single pass method: simple and efficient, but produces large clusters, and depends on the order in which items are processed
- Hierarchical Agglomerative Methods
- start with individual items and combine them into clusters
- then successively combine smaller clusters to form larger ones
- grouping of individual items can be based on any of the methods discussed earlier
37. K-Means Algorithm
- The basic algorithm (based on the reallocation method)
- 1. select K data points as the initial representatives
- 2. for i = 1 to N, assign item xi to the most similar centroid (this gives K clusters)
- 3. for j = 1 to K, recalculate the cluster centroid Cj
- 4. repeat steps 2 and 3 until there is (little or) no change in clusters
- Example: Clustering Terms

Initial (arbitrary) assignment: C1 = {T1, T2}, C2 = {T3, T4}, C3 = {T5, T6}
Cluster Centroids
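Steps 1-4 can be sketched as follows, using 2-D points instead of the term vectors in the example, and a fixed iteration count rather than a convergence test:

```python
def kmeans(points, k, iters=20):
    """Reallocation-style k-means: assign each point to its nearest
    centroid, recompute centroids, and repeat."""
    centroids = [list(p) for p in points[:k]]  # step 1: first K points as representatives
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                       # step 2: most similar (closest) centroid
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[c])))
            clusters[j].append(p)
        for j in range(k):                     # step 3: recompute each centroid
            if clusters[j]:
                centroids[j] = [sum(coord) / len(clusters[j])
                                for coord in zip(*clusters[j])]
    return centroids, clusters                 # step 4: loop handled by iters above

# two obvious groups, near the origin and near (10, 10)
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(pts, 2)
```

Note the guard for an empty cluster: with unlucky initial representatives a centroid can lose all its points, one reason the choice of initial k means matters (see the variations below).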
38. Example: K-Means
Now, using the simple similarity measure, compute the new cluster-term similarity matrix
Now compute new cluster centroids using the
original document-term matrix
The process is repeated until no further changes
are made to the clusters
39. K-Means Algorithm
- Strengths of k-means
- Relatively efficient: O(tkn), where n is the number of objects, k is the number of clusters, and t is the number of iterations. Normally, k, t << n
- Often terminates at a local optimum
- Weaknesses of k-means
- Applicable only when the mean is defined; what about categorical data?
- Need to specify k, the number of clusters, in advance
- Unable to handle noisy data and outliers
- Variations of k-means usually differ in
- Selection of the initial k means
- Dissimilarity calculations
- Strategies to calculate cluster means
40. Terminology
- Expectation-Maximization (EM) Algorithm
- Iterative refinement: repeat until convergence to a locally optimal label
- Expectation step: estimate parameters with which to simulate data
- Maximization step: use simulated ("fictitious") data to update parameters
- Unsupervised Learning and Clustering
- Constructive induction: using unsupervised learning for supervised learning
- Feature construction: front end; constructs new x' values
- Cluster definition: back end; uses these to reformulate y
- Clustering problems: formation, segmentation, labeling
- Key criterion: distance metric (points closer intra-cluster than inter-cluster)
- Algorithms
- AutoClass: Bayesian clustering
- Principal Components Analysis (PCA), factor analysis (FA)
- Self-Organizing Maps (SOM): topology-preserving transform (dimensionality reduction) for competitive unsupervised learning
41. Summary Points
- Expectation-Maximization (EM) Algorithm
- Unsupervised Learning and Clustering
- Types of unsupervised learning
- Clustering, vector quantization
- Feature extraction (typically, dimensionality reduction)
- Constructive induction: unsupervised learning in support of supervised learning
- Feature construction (aka feature extraction)
- Cluster definition
- Algorithms
- EM: mixture parameter estimation (e.g., for AutoClass)
- AutoClass: Bayesian clustering
- Principal Components Analysis (PCA), factor analysis (FA)
- Self-Organizing Maps (SOM): projection of data; competitive algorithm
- Clustering problems: formation, segmentation, labeling
- Next Lecture: Time Series Learning and Characterization