Title: CIS 732 Lecture 37 (2007-04-19)
1. Lecture 37 of 42
Unsupervised Learning: AutoClass, SOM, EM, and Hierarchical Mixtures of Experts
Thursday, 19 April 2007
William H. Hsu
Department of Computing and Information Sciences, KSU
http://www.cis.ksu.edu/Courses/Spring-2007/CIS732
Readings: "Bagging, Boosting, and C4.5", Quinlan; Section 5, "MLC++ Utilities 2.0", Kohavi and Sommerfield
2. Mixture Models: Review
- Intuitive Idea
- Integrate knowledge from multiple experts (or data from multiple sensors)
- Collection of inducers organized into a committee machine (e.g., modular ANN)
- Dynamic structure: takes the input signal into account
- References
- Bishop, 1995 (Sections 2.7, 9.7)
- Haykin, 1999 (Section 7.6)
- Problem Definition
- Given: collection of inducers (experts) L, data set D
- Perform supervised learning using inducers and self-organization of experts
- Return: committee machine with trained gating network (combiner inducer)
- Solution Approach
- Let the combiner inducer be a generalized linear model (e.g., threshold gate)
- Activation functions: linear combination, vote, smoothed vote (softmax)
3. Mixture Models: Procedure
- Algorithm Combiner-Mixture-Model (D, L, Activation, k)
- m ← D.size
- FOR j ← 1 TO k DO  // initialization
-   wj ← 1
- UNTIL the termination condition is met, DO
-   FOR j ← 1 TO k DO
-     Pj ← Lj.Update-Inducer (D)  // single training step for Lj
-   FOR i ← 1 TO m DO
-     Sumi ← 0
-     FOR j ← 1 TO k DO Sumi += Pj(Di)
-     Neti ← Compute-Activation (Sumi)  // compute gj ← Netij
-     FOR j ← 1 TO k DO wj ← Update-Weights (wj, Neti, Di)
- RETURN (Make-Predictor (P, w))
- Update-Weights: Single Training Step for Mixing Coefficients (a concrete sketch follows below)
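As one concrete (non-authoritative) reading of the loop above, the NumPy sketch below instantiates Compute-Activation as a softmax over the mixing weights and Update-Weights as a stochastic gradient step on the mixture log-likelihood. The function names, array shapes, and learning rate are illustrative assumptions, not part of the original pseudocode.

    import numpy as np

    def softmax(z):
        z = z - np.max(z)                  # shift for numerical stability
        e = np.exp(z)
        return e / e.sum()

    def update_mixing_weights(expert_probs, w, lr=0.1):
        # expert_probs: (k, m) array; entry [j, i] is P_j(D_i), the probability
        # that expert j assigns to the observed label of example i
        # w: length-k vector of gating logits; mixing coefficients g = softmax(w)
        for i in range(expert_probs.shape[1]):
            g = softmax(w)                 # current mixing coefficients (cf. Compute-Activation)
            h = g * expert_probs[:, i]
            h = h / h.sum()                # responsibility of each expert for example i
            w = w + lr * (h - g)           # cf. Update-Weights: gradient ascent on log-likelihood
        return w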
4. Mixture Models: Properties
5. Generalized Linear Models (GLIMs)
- Recall: Perceptron (Linear Threshold Gate) Model
- Generalization of the LTG Model [McCullagh and Nelder, 1989]
- Model parameters: connection weights, as for LTG
- Representational power depends on the transfer (activation) function
- Activation Function
- Type of mixture model depends (in part) on this definition
- e.g., o(x) could be a softmax over the net inputs x · wj [Bridle, 1990] (written out after this list)
- NB: softmax is computed across j = 1, 2, …, k (cf. hard max)
- Defines a (multinomial) pdf over experts [Jordan and Jacobs, 1995]
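Written out in the notation used informally above (wj the weight vector of gating output j, x the input), the softmax activation is:

\[
g_j(\mathbf{x}) \;=\; \frac{\exp(\mathbf{w}_j^{\top}\mathbf{x})}{\sum_{l=1}^{k}\exp(\mathbf{w}_l^{\top}\mathbf{x})},
\qquad j = 1, \ldots, k
\]

Since the gj(x) are nonnegative and sum to 1, they define the multinomial distribution over experts referred to in the last bullet.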
6. Hierarchical Mixture of Experts (HME): Idea
- Hierarchical Model
- Compare: stacked generalization network
- Difference: trained in multiple passes
- Dynamic Network of GLIMs
All examples x and targets y = c(x) are identical across the expert and gating networks
7. Hierarchical Mixture of Experts (HME): Procedure
- Algorithm Combiner-HME (D, L, Activation, Level, k, Classes)  // prediction with the trained hierarchy is sketched after this procedure
- m ← D.size
- FOR j ← 1 TO k DO wj ← 1  // initialization
- UNTIL the termination condition is met, DO
-   IF Level > 1 THEN
-     FOR j ← 1 TO k DO
-       Pj ← Combiner-HME (D, Lj, Activation, Level - 1, k, Classes)
-   ELSE
-     FOR j ← 1 TO k DO Pj ← Lj.Update-Inducer (D)
-   FOR i ← 1 TO m DO
-     Sumi ← 0
-     FOR j ← 1 TO k DO
-       Sumi += Pj(Di)
-     Neti ← Compute-Activation (Sumi)
-     FOR l ← 1 TO Classes DO wl ← Update-Weights (wl, Neti, Di)
- RETURN (Make-Predictor (P, w))
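To complement the training procedure, the sketch below shows the recursive soft combination an HME computes at prediction time: each internal gating node blends its children's class distributions with input-dependent softmax weights. It assumes each leaf expert exposes a predict_proba(x) method returning a NumPy class-distribution vector; the class names and the gating parameterization V are illustrative assumptions.

    import numpy as np

    def softmax(z):
        e = np.exp(z - np.max(z))
        return e / e.sum()

    class ExpertNode:
        # Leaf: wraps a trained expert exposing predict_proba(x) -> class distribution
        def __init__(self, expert):
            self.expert = expert
        def predict_proba(self, x):
            return self.expert.predict_proba(x)

    class GatingNode:
        # Internal node: a gating GLIM over its k children (experts or sub-gates)
        def __init__(self, children, V):
            self.children = children      # list of ExpertNode / GatingNode
            self.V = V                    # (k, n) gating weight matrix
        def predict_proba(self, x):
            g = softmax(self.V @ x)       # g_j(x): soft credit assigned to child j
            return sum(gj * child.predict_proba(x)
                       for gj, child in zip(g, self.children))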
8. Hierarchical Mixture of Experts (HME): Properties
- Advantages
- Benefits of ME: the base case is a single level of expert and gating networks
- More combiner inducers → more capability to decompose complex problems
- Views of HME
- Expresses a divide-and-conquer strategy
- Problem is distributed across subtrees on the fly by combiner inducers
- Duality: data fusion ↔ problem redistribution
- Recursive decomposition until a good fit to the local structure of D is found
- Implements a soft decision tree
- Mixture of experts: 1-level decision tree (decision stump)
- Information preservation compared to a traditional (hard) decision tree
- Dynamics of HME improve on the greedy (high-commitment) strategy of decision tree induction
9. EM Algorithm: Example 1
- Experiment
- Two coins: P(Head on Coin 1) = p, P(Head on Coin 2) = q
- Experimenter first selects a coin: P(Coin 1) = α
- The chosen coin is tossed 3 times (per experimental run)
- Observe: D = {(1, H H T), (1, H T T), (2, T H T)}
- Want to predict: α, p, q
- How to model the problem?
- Simple Bayesian network
- Now can find the most likely values of the parameters α, p, q given the data D
- Parameter Estimation
- Fully observable case: easy to estimate p, q, and α (worked out below)
- Suppose k heads are observed out of n coin flips
- Maximum likelihood estimate vML for Flipi: p = k/n
- Partially observable case
- Don't know which coin the experimenter chose
- Observe: D = {(H H T), (H T T), (T H T)} ≡ {(?, H H T), (?, H T T), (?, T H T)}
P(Coin = 1) = α
P(Flipi = 1 | Coin = 1) = p,  P(Flipi = 1 | Coin = 2) = q
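For the fully observable data listed above, the k/n rule gives the maximum likelihood estimates directly: Coin 1 is chosen in 2 of the 3 runs, yields 3 heads in its 6 flips, and Coin 2 yields 1 head in its 3 flips:

\[
\hat{\alpha} = \frac{2}{3}, \qquad
\hat{p} = \frac{3}{6} = \frac{1}{2}, \qquad
\hat{q} = \frac{1}{3}
\]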
10. EM Algorithm: Example 2
- Problem
- When we knew whether Coin 1 or Coin 2 was used, there was no problem
- No known analytical solution to the partially observable problem
- i.e., it is not known how to compute the estimates of p, q, and α that attain vML
- Moreover, the computational complexity is not known
- Solution Approach: Iterative Parameter Estimation (a sketch follows below)
- Given: a guess of P(Coin = 1 | x), P(Coin = 2 | x)
- Generate fictional data points, weighted according to this probability
- P(Coin = 1 | x) = P(x | Coin = 1) P(Coin = 1) / P(x), based on our guess of α, p, q
- Expectation step (the "E" in EM)
- Now can find the most likely values of the parameters α, p, q given the fictional data
- Use gradient descent to update our guess of α, p, q
- Maximization step (the "M" in EM)
- Repeat until a termination condition is met (e.g., a stopping criterion on a validation set)
- EM Converges to Local Maxima of the Likelihood Function P(D | Θ)
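A minimal NumPy sketch of this E/M loop for the two-coin problem is shown below. It uses the standard closed-form M-step re-estimates rather than the gradient update mentioned above; the initial guesses, iteration count, and function name are illustrative assumptions.

    import numpy as np

    def em_two_coins(runs, alpha, p, q, n_iter=50):
        # runs: list of 0/1 sequences (1 = heads) with the coin identity hidden
        # alpha, p, q: initial guesses for the parameters
        runs = [np.asarray(r) for r in runs]
        heads = np.array([r.sum() for r in runs])
        flips = np.array([len(r) for r in runs])
        for _ in range(n_iter):
            # E-step: posterior responsibility P(Coin = 1 | run) for each run
            l1 = alpha * p**heads * (1 - p)**(flips - heads)
            l2 = (1 - alpha) * q**heads * (1 - q)**(flips - heads)
            resp = l1 / (l1 + l2)
            # M-step: re-estimate parameters from the fractionally labeled runs
            alpha = resp.mean()
            p = (resp * heads).sum() / (resp * flips).sum()
            q = ((1 - resp) * heads).sum() / ((1 - resp) * flips).sum()
        return alpha, p, q

    # D = (H H T), (H T T), (T H T), coin identities unobserved
    print(em_two_coins([[1, 1, 0], [1, 0, 0], [0, 1, 0]], alpha=0.6, p=0.7, q=0.3))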
11. EM Algorithm: Example 3
12. EM for Unsupervised Learning
- Unsupervised Learning Problem
- Objective: estimate a probability distribution with unobserved variables
- Use EM to estimate the mixture policy (more on this later; see Section 6.12, Mitchell)
- Pattern Recognition Examples
- Human-computer intelligent interaction (HCII)
- Detecting facial features in emotion recognition
- Gesture recognition in virtual environments
- Computational medicine [Frey, 1998]
- Determining morphology (shapes) of bacteria and viruses in microscopy
- Identifying cell structures (e.g., nucleus) and shapes in microscopy
- Other image processing
- Many other examples (audio, speech, and signal processing; motor control; etc.)
- Inference Examples
- Plan recognition: mapping from (observed) actions to agents' (hidden) plans
- Hidden changes in context: e.g., aviation, computer security, MUDs
13. Unsupervised Learning: AutoClass 1
14. Unsupervised Learning: AutoClass 2
- AutoClass Algorithm [Cheeseman et al., 1988]
- Based on maximizing P(x | θj, yj, J)
- θj: class (cluster) parameters (e.g., mean and variance)
- yj: synthetic classes (can estimate the marginal P(yj) at any time)
- Apply Bayes's Theorem; use numerical BOC estimation techniques (cf. Gibbs)
- Search objectives
- Find the best J (ideally, integrate out θj, yj; in practice, start with a large J and decrease it)
- Find θj, yj: use MAP estimation, then integrate in the neighborhood of yMAP
- EM: Find the MAP Estimate for P(x | θj, yj, J) by Iterative Refinement
- Advantages over Symbolic (Non-Numerical) Methods
- Returns a probability distribution over class membership (written out after this list)
- More robust than the single best yj
- Compare: fuzzy set membership (similar, but probabilistically motivated)
- Can deal with continuous as well as discrete data
15. Unsupervised Learning: AutoClass 3
- AutoClass Resources
- Beginning tutorial (AutoClass II): Cheeseman et al., 4.2.2 in Buchanan and Wilkins
- Project page: http://ic-www.arc.nasa.gov/ic/projects/bayes-group/autoclass/
- Applications
- Knowledge discovery in databases (KDD) and data mining
- Infrared Astronomical Satellite (IRAS) spectral atlas (sky survey)
- Molecular biology: pre-clustering DNA acceptor and donor sites (mouse, human)
- LandSat data from Kansas (30 km2 region, 1024 x 1024 pixels, 7 channels)
- Positive findings: see the book chapter by Cheeseman and Stutz, online
- Other typical applications: see KD Nuggets (http://www.kdnuggets.com)
- Implementations
- Obtaining source code from the project page
- AutoClass III: Lisp implementation [Cheeseman, Stutz, Taylor, 1992]
- AutoClass C: C implementation [Cheeseman, Stutz, Taylor, 1998]
- These and others at http://www.recursive-partitioning.com/cluster.html
16. Unsupervised Learning: Competitive Learning for Feature Discovery
- Intuitive Idea: Competitive Mechanisms for Unsupervised Learning
- Global organization from local, competitive weight updates
- Basic principle expressed by Von der Malsburg
- Guiding examples from (neuro)biology: lateral inhibition
- Previous work: Hebb, 1949; Rosenblatt, 1959; Von der Malsburg, 1973; Fukushima, 1975; Grossberg, 1976; Kohonen, 1982
- A Procedural Framework for Unsupervised Connectionist Learning
- Start with identical (neural) processing units with random initial parameters
- Set a limit on the activation strength of each unit
- Allow units to compete for the right to respond to a set of inputs
- Feature Discovery
- Identifying (or constructing) new features relevant to supervised learning
- Examples: finding distinguishable letter characteristics in handwritten character recognition (HCR) and optical character recognition (OCR)
- Competitive learning: transform X into X′; train the units in X′ closest to x (a single update step is sketched after this list)
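The following is a minimal sketch of one winner-take-all update consistent with the framework above: the unit whose weight vector is closest to the input wins the competition and moves toward that input. The hard winner-take-all rule, Euclidean distance, and learning rate are illustrative choices.

    import numpy as np

    def competitive_step(units, x, lr=0.05):
        # units: (k, n) float array of unit weight vectors; x: length-n input
        winner = np.argmin(np.linalg.norm(units - x, axis=1))   # competition for x
        units[winner] += lr * (x - units[winner])                # winner moves toward x
        return winner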
17. Unsupervised Learning: Kohonen's Self-Organizing Map (SOM) 1
- Another Clustering Algorithm
- aka Self-Organizing Feature Map (SOFM)
- Given: vectors of attribute values (x1, x2, …, xn)
- Returns: vectors of attribute values (x1′, x2′, …, xk′)
- Typically, n >> k (n is high; k = 1, 2, or 3; hence dimensionality-reducing)
- Output vectors x′: the projections of input points x; also get P(xj′ | xi)
- Mapping from x to x′ is topology preserving
- Topology-Preserving Networks
- Intuitive idea: similar input vectors will map to similar clusters
- Recall the informal definition of a cluster (isolated set of mutually similar entities)
- Restatement: clusters of X (high-dimensional) will still be clusters of X′ (low-dimensional)
- Representation of Node Clusters
- Group of neighboring artificial neural network units (neighborhood of nodes)
- SOMs combine the ideas of topology-preserving networks and unsupervised learning (a single update step is sketched after this list)
- Implementation: http://www.cis.hut.fi/nnrc/ and the MATLAB NN Toolkit
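Below is a minimal NumPy sketch of one standard SOM update, assuming the usual formulation (best-matching unit plus a Gaussian neighborhood on the low-dimensional grid); the neighborhood width, learning rate, and their decay schedules are illustrative assumptions and are omitted here.

    import numpy as np

    def som_step(weights, grid, x, lr=0.1, sigma=1.0):
        # weights: (k, n) float array of unit weight vectors in input space
        # grid:    (k, d) coordinates of the units on the output grid (d = 1, 2, or 3)
        bmu = np.argmin(np.linalg.norm(weights - x, axis=1))    # best-matching unit
        grid_dist = np.linalg.norm(grid - grid[bmu], axis=1)    # distance on the grid
        h = np.exp(-grid_dist**2 / (2 * sigma**2))              # Gaussian neighborhood
        weights += lr * h[:, None] * (x - weights)              # topology-preserving update
        return bmu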
18. Unsupervised Learning: Kohonen's Self-Organizing Map (SOM) 2
19. Unsupervised Learning: Kohonen's Self-Organizing Map (SOM) 3
20. Unsupervised Learning: SOM and Other Projections for Clustering
Cluster Formation and Segmentation Algorithm (Sketch)
21. Unsupervised Learning: Other Algorithms (PCA, Factor Analysis)
- Intuitive Idea
- Q: Why are dimensionality-reducing transforms good for supervised learning?
- A: There may be many attributes with undesirable properties, e.g.,
- Irrelevance: xi has little discriminatory power over c(x) = yi
- Sparseness of information: the feature of interest is spread out over many xi's (e.g., text document categorization, where xi is a word position)
- We want to increase the information density by squeezing X down
- Principal Components Analysis (PCA) (a minimal projection sketch follows this list)
- Combining redundant variables into a single variable (aka component, or factor)
- Example: ratings (e.g., Nielsen) and polls (e.g., Gallup); responses to certain questions may be correlated (e.g., "like fishing?" and "time spent boating")
- Factor Analysis (FA)
- General term for a class of algorithms that includes PCA
- Tutorial: http://www.statsoft.com/textbook/stfacan.html
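As a concrete illustration of the dimensionality-reducing projection discussed above, the sketch below computes PCA by eigendecomposition of the sample covariance matrix; the function name and the choice of NumPy routines are illustrative, not taken from the lecture.

    import numpy as np

    def pca_project(X, k):
        # X: (m, n) data matrix; returns the (m, k) projection onto the top-k components
        Xc = X - X.mean(axis=0)                    # center the data
        cov = np.cov(Xc, rowvar=False)             # (n, n) sample covariance matrix
        eigvals, eigvecs = np.linalg.eigh(cov)     # eigh: for symmetric matrices
        top = np.argsort(eigvals)[::-1][:k]        # indices of the k largest eigenvalues
        return Xc @ eigvecs[:, top]                # dimensionality-reducing projection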
22. Clustering Methods: Design Choices
23. Clustering Applications
Information Retrieval: Text Document Categorization
24. Unsupervised Learning and Constructive Induction
- Unsupervised Learning in Support of Supervised Learning
- Given: D ≡ labeled vectors (x, y)
- Return: D′ ≡ transformed training examples (x′, y′)
- Solution approach: constructive induction
- Feature construction: generic term
- Cluster definition
- Feature Construction (Front End)
- Synthesizing new attributes
- Logical: x1 ∧ ¬x2; arithmetic: x1 + x5 / x2 (see the sketch at the end of this slide)
- Other synthetic attributes: f(x1, x2, …, xn), etc.
- Dimensionality-reducing projection, feature extraction
- Subset selection: finding relevant attributes for a given target y
- Partitioning: finding relevant attributes for given targets y1, y2, …, yp
- Cluster Definition (Back End)
- Form, segment, and label clusters to get intermediate targets y′
- Change of representation: find an (x′, y′) that is good for learning the target y
x′ ≡ (x1′, …, xp′)
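A tiny sketch of the feature-construction front end described above; the particular logical and arithmetic combinations, and the assumption that X has at least five numeric columns, are hypothetical illustrations only.

    import numpy as np

    def construct_features(X):
        # X: (m, n) numeric data matrix with n >= 5 (assumed for illustration)
        x1, x2, x5 = X[:, 0], X[:, 1], X[:, 4]
        logical = ((x1 > 0) & ~(x2 > 0)).astype(float)    # cf. x1 AND NOT x2
        arithmetic = x1 + x5 / (np.abs(x2) + 1e-9)        # cf. x1 + x5 / x2, guarded against /0
        return np.column_stack([X, logical, arithmetic])  # original plus synthetic attributes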
25. Clustering: Relation to Constructive Induction
- Clustering versus Cluster Definition
- Clustering: a 3-step process
- Cluster definition: the back end of feature construction
- Clustering: 3-Step Process
- Form
- (x1′, …, xk′) in terms of (x1, …, xn)
- NB: typically part of the construction step; sometimes integrates both
- Segment
- (y1, …, yJ) in terms of (x1′, …, xk′)
- NB: the number of clusters J is not necessarily the same as the number of dimensions k
- Label
- Assign names (discrete/symbolic labels (v1, …, vJ)) to (y1, …, yJ)
- Important in document categorization (e.g., clustering text for information retrieval)
- Hierarchical Clustering: Applying Clustering Recursively
26. Training Methods for Hierarchical Mixture of Experts (HME)
27. Terminology
- Committee Machines, aka Combiners
- Static Structures
- Ensemble averaging
- Single-pass, separately trained inducers, common input
- Individual outputs combined to get a scalar output (e.g., linear combination)
- Boosting the margin: separately trained inducers, different input distributions
- Filtering: feed examples to trained inducers (weak classifiers), pass them on to the next classifier iff a conflict is encountered (consensus model)
- Resampling: aka subsampling (Si of fixed size m resampled from D)
- Reweighting: fixed-size Si containing weighted examples for the inducer
- Dynamic Structures
- Mixture of experts: training in the combiner inducer (aka gating network)
- Hierarchical mixtures of experts: hierarchy of inducers and combiners
- Mixture Model, aka Mixture of Experts (ME)
- Expert (classification) and gating (combiner) inducers (modules, networks)
- Hierarchical Mixtures of Experts (HME): multiple combiner (gating) levels
28. Summary Points
- Committee Machines, aka Combiners
- Static Structures (Single-Pass)
- Ensemble averaging
- For improving weak (especially unstable) classifiers
- e.g., weighted majority, bagging, stacking
- Boosting the margin
- Improves the performance of any inducer: weight examples to emphasize errors
- Variants: filtering (aka consensus), resampling (aka subsampling), reweighting
- Dynamic Structures (Multi-Pass)
- Mixture of experts: training in the combiner inducer (aka gating network)
- Hierarchical mixtures of experts: hierarchy of inducers and combiners
- Mixture Model (aka Mixture of Experts)
- Estimation of mixture coefficients (i.e., weights)
- Hierarchical Mixtures of Experts (HME): multiple combiner (gating) levels
- Next Week: Intro to GAs and GP (9.1-9.4, Mitchell; 1, 6.1-6.5, Goldberg)