Title: CIS 732 Lecture 37 (2007-04-19)
1. Lecture 37 of 42
Unsupervised Learning: AutoClass, SOM, EM, and Hierarchical Mixtures of Experts
Thursday, 19 April 2007
William H. Hsu
Department of Computing and Information Sciences, KSU
http://www.cis.ksu.edu/Courses/Spring-2007/CIS732
Readings: "Bagging, Boosting, and C4.5", Quinlan; Section 5, "MLC++ Utilities 2.0", Kohavi and Sommerfield
2. Mixture Models: Review
- Intuitive Idea
- Integrate knowledge from multiple experts (or data from multiple sensors)
- Collection of inducers organized into a committee machine (e.g., modular ANN)
- Dynamic structure: takes the input signal into account
- References
- Bishop, 1995 (Sections 2.7, 9.7)
- Haykin, 1999 (Section 7.6)
- Problem Definition
- Given: collection of inducers (experts) L, data set D
- Perform supervised learning using inducers and self-organization of experts
- Return: committee machine with trained gating network (combiner inducer)
- Solution Approach
- Let the combiner inducer be a generalized linear model (e.g., threshold gate)
- Activation functions: linear combination, vote, smoothed vote (softmax)
3. Mixture Models: Procedure
- Algorithm Combiner-Mixture-Model (D, L, Activation, k)
- m ← D.size
- FOR j ← 1 TO k DO  // initialization
-   wj ← 1
- UNTIL the termination condition is met, DO
-   FOR j ← 1 TO k DO
-     Pj ← Lj.Update-Inducer (D)  // single training step for Lj
-   FOR i ← 1 TO m DO
-     Sumi ← 0
-     FOR j ← 1 TO k DO Sumi += Pj(Di)
-     Neti ← Compute-Activation (Sumi)  // compute gj ← Netij
-     FOR j ← 1 TO k DO wj ← Update-Weights (wj, Neti, Di)
- RETURN (Make-Predictor (P, w))
- Update-Weights: Single Training Step for Mixing Coefficients (a concrete sketch follows below)
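As one concrete (non-authoritative) reading of the loop above, the NumPy sketch below instantiates Compute-Activation as a softmax over the mixing weights and Update-Weights as a stochastic gradient step on the mixture log-likelihood. The function names, array shapes, and learning rate are illustrative assumptions, not part of the original pseudocode.

    import numpy as np

    def softmax(z):
        z = z - np.max(z)                  # shift for numerical stability
        e = np.exp(z)
        return e / e.sum()

    def update_mixing_weights(expert_probs, w, lr=0.1):
        # expert_probs: (k, m) array; entry [j, i] is P_j(D_i), the probability
        # that expert j assigns to the observed label of example i
        # w: length-k vector of gating logits; mixing coefficients g = softmax(w)
        for i in range(expert_probs.shape[1]):
            g = softmax(w)                 # current mixing coefficients (cf. Compute-Activation)
            h = g * expert_probs[:, i]
            h = h / h.sum()                # responsibility of each expert for example i
            w = w + lr * (h - g)           # cf. Update-Weights: gradient ascent on log-likelihood
        return w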
4. Mixture Models: Properties
5. Generalized Linear Models (GLIMs)
- Recall: Perceptron (Linear Threshold Gate) Model
- Generalization of the LTG Model [McCullagh and Nelder, 1989]
- Model parameters: connection weights, as for LTG
- Representational power depends on the transfer (activation) function
- Activation Function
- Type of mixture model depends (in part) on this definition
- e.g., o(x) could be a softmax over the net inputs x · wj [Bridle, 1990] (written out after this list)
- NB: softmax is computed across j = 1, 2, …, k (cf. hard max)
- Defines a (multinomial) pdf over experts [Jordan and Jacobs, 1995]
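Written out in the notation used informally above (wj the weight vector of gating output j, x the input), the softmax activation is:

\[
g_j(\mathbf{x}) \;=\; \frac{\exp(\mathbf{w}_j^{\top}\mathbf{x})}{\sum_{l=1}^{k}\exp(\mathbf{w}_l^{\top}\mathbf{x})},
\qquad j = 1, \ldots, k
\]

Since the gj(x) are nonnegative and sum to 1, they define the multinomial distribution over experts referred to in the last bullet.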
6. Hierarchical Mixture of Experts (HME): Idea
- Hierarchical Model
- Compare: stacked generalization network
- Difference: trained in multiple passes
- Dynamic Network of GLIMs
All examples x and targets y = c(x) are identical across the expert and gating networks
7. Hierarchical Mixture of Experts (HME): Procedure
- Algorithm Combiner-HME (D, L, Activation, Level, k, Classes)  // prediction with the trained hierarchy is sketched after this procedure
- m ← D.size
- FOR j ← 1 TO k DO wj ← 1  // initialization
- UNTIL the termination condition is met, DO
-   IF Level > 1 THEN
-     FOR j ← 1 TO k DO
-       Pj ← Combiner-HME (D, Lj, Activation, Level - 1, k, Classes)
-   ELSE
-     FOR j ← 1 TO k DO Pj ← Lj.Update-Inducer (D)
-   FOR i ← 1 TO m DO
-     Sumi ← 0
-     FOR j ← 1 TO k DO
-       Sumi += Pj(Di)
-     Neti ← Compute-Activation (Sumi)
-     FOR l ← 1 TO Classes DO wl ← Update-Weights (wl, Neti, Di)
- RETURN (Make-Predictor (P, w))
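To complement the training procedure, the sketch below shows the recursive soft combination an HME computes at prediction time: each internal gating node blends its children's class distributions with input-dependent softmax weights. It assumes each leaf expert exposes a predict_proba(x) method returning a NumPy class-distribution vector; the class names and the gating parameterization V are illustrative assumptions.

    import numpy as np

    def softmax(z):
        e = np.exp(z - np.max(z))
        return e / e.sum()

    class ExpertNode:
        # Leaf: wraps a trained expert exposing predict_proba(x) -> class distribution
        def __init__(self, expert):
            self.expert = expert
        def predict_proba(self, x):
            return self.expert.predict_proba(x)

    class GatingNode:
        # Internal node: a gating GLIM over its k children (experts or sub-gates)
        def __init__(self, children, V):
            self.children = children      # list of ExpertNode / GatingNode
            self.V = V                    # (k, n) gating weight matrix
        def predict_proba(self, x):
            g = softmax(self.V @ x)       # g_j(x): soft credit assigned to child j
            return sum(gj * child.predict_proba(x)
                       for gj, child in zip(g, self.children))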
8. Hierarchical Mixture of Experts (HME): Properties
- Advantages
- Benefits of ME: the base case is a single level of expert and gating networks
- More combiner inducers → more capability to decompose complex problems
- Views of HME
- Expresses a divide-and-conquer strategy
- Problem is distributed across subtrees on the fly by combiner inducers
- Duality: data fusion ↔ problem redistribution
- Recursive decomposition until a good fit to the local structure of D is found
- Implements a soft decision tree
- Mixture of experts: 1-level decision tree (decision stump)
- Information preservation compared to a traditional (hard) decision tree
- Dynamics of HME improve on the greedy (high-commitment) strategy of decision tree induction
9. EM Algorithm: Example 1
- Experiment
- Two coins: P(Head on Coin 1) = p, P(Head on Coin 2) = q
- Experimenter first selects a coin: P(Coin 1) = α
- The chosen coin is tossed 3 times (per experimental run)
- Observe: D = {(1, H H T), (1, H T T), (2, T H T)}
- Want to predict: α, p, q
- How to model the problem?
- Simple Bayesian network
- Now can find the most likely values of the parameters α, p, q given the data D
- Parameter Estimation
- Fully observable case: easy to estimate p, q, and α (worked out below)
- Suppose k heads are observed out of n coin flips
- Maximum likelihood estimate vML for Flipi: p = k/n
- Partially observable case
- Don't know which coin the experimenter chose
- Observe: D = {(H H T), (H T T), (T H T)} ≡ {(?, H H T), (?, H T T), (?, T H T)}
P(Coin = 1) = α
P(Flipi = 1 | Coin = 1) = p,  P(Flipi = 1 | Coin = 2) = q
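For the fully observable data listed above, the k/n rule gives the maximum likelihood estimates directly: Coin 1 is chosen in 2 of the 3 runs, yields 3 heads in its 6 flips, and Coin 2 yields 1 head in its 3 flips:

\[
\hat{\alpha} = \frac{2}{3}, \qquad
\hat{p} = \frac{3}{6} = \frac{1}{2}, \qquad
\hat{q} = \frac{1}{3}
\]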
10. EM Algorithm: Example 2
- Problem
- When we knew whether Coin 1 or Coin 2 was used, there was no problem
- No known analytical solution to the partially observable problem
- i.e., it is not known how to compute the estimates of p, q, and α that attain vML
- Moreover, the computational complexity is not known
- Solution Approach: Iterative Parameter Estimation (a sketch follows below)
- Given: a guess of P(Coin = 1 | x), P(Coin = 2 | x)
- Generate fictional data points, weighted according to this probability
- P(Coin = 1 | x) = P(x | Coin = 1) P(Coin = 1) / P(x), based on our guess of α, p, q
- Expectation step (the "E" in EM)
- Now can find the most likely values of the parameters α, p, q given the fictional data
- Use gradient descent to update our guess of α, p, q
- Maximization step (the "M" in EM)
- Repeat until a termination condition is met (e.g., a stopping criterion on a validation set)
- EM Converges to Local Maxima of the Likelihood Function P(D | Θ)
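A minimal NumPy sketch of this E/M loop for the two-coin problem is shown below. It uses the standard closed-form M-step re-estimates rather than the gradient update mentioned above; the initial guesses, iteration count, and function name are illustrative assumptions.

    import numpy as np

    def em_two_coins(runs, alpha, p, q, n_iter=50):
        # runs: list of 0/1 sequences (1 = heads) with the coin identity hidden
        # alpha, p, q: initial guesses for the parameters
        runs = [np.asarray(r) for r in runs]
        heads = np.array([r.sum() for r in runs])
        flips = np.array([len(r) for r in runs])
        for _ in range(n_iter):
            # E-step: posterior responsibility P(Coin = 1 | run) for each run
            l1 = alpha * p**heads * (1 - p)**(flips - heads)
            l2 = (1 - alpha) * q**heads * (1 - q)**(flips - heads)
            resp = l1 / (l1 + l2)
            # M-step: re-estimate parameters from the fractionally labeled runs
            alpha = resp.mean()
            p = (resp * heads).sum() / (resp * flips).sum()
            q = ((1 - resp) * heads).sum() / ((1 - resp) * flips).sum()
        return alpha, p, q

    # D = (H H T), (H T T), (T H T), coin identities unobserved
    print(em_two_coins([[1, 1, 0], [1, 0, 0], [0, 1, 0]], alpha=0.6, p=0.7, q=0.3))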
11. EM Algorithm: Example 3
12. EM for Unsupervised Learning
- Unsupervised Learning Problem
- Objective: estimate a probability distribution with unobserved variables
- Use EM to estimate the mixture policy (more on this later; see Section 6.12, Mitchell)
- Pattern Recognition Examples
- Human-computer intelligent interaction (HCII)
- Detecting facial features in emotion recognition
- Gesture recognition in virtual environments
- Computational medicine [Frey, 1998]
- Determining morphology (shapes) of bacteria and viruses in microscopy
- Identifying cell structures (e.g., nucleus) and shapes in microscopy
- Other image processing
- Many other examples (audio, speech, and signal processing; motor control; etc.)
- Inference Examples
- Plan recognition: mapping from (observed) actions to agents' (hidden) plans
- Hidden changes in context: e.g., aviation, computer security, MUDs
13. Unsupervised Learning: AutoClass 1
14. Unsupervised Learning: AutoClass 2
- AutoClass Algorithm [Cheeseman et al., 1988]
- Based on maximizing P(x | θj, yj, J)
- θj: class (cluster) parameters (e.g., mean and variance)
- yj: synthetic classes (can estimate the marginal P(yj) at any time)
- Apply Bayes's Theorem; use numerical BOC estimation techniques (cf. Gibbs)
- Search objectives
- Find the best J (ideally, integrate out θj, yj; in practice, start with a large J and decrease it)
- Find θj, yj: use MAP estimation, then integrate in the neighborhood of yMAP
- EM: Find the MAP Estimate for P(x | θj, yj, J) by Iterative Refinement
- Advantages over Symbolic (Non-Numerical) Methods
- Returns a probability distribution over class membership (written out after this list)
- More robust than the single best yj
- Compare: fuzzy set membership (similar, but probabilistically motivated)
- Can deal with continuous as well as discrete data
15. Unsupervised Learning: AutoClass 3
- AutoClass Resources
- Beginning tutorial (AutoClass II): Cheeseman et al., 4.2.2 in Buchanan and Wilkins
- Project page: http://ic-www.arc.nasa.gov/ic/projects/bayes-group/autoclass/
- Applications
- Knowledge discovery in databases (KDD) and data mining
- Infrared Astronomical Satellite (IRAS) spectral atlas (sky survey)
- Molecular biology: pre-clustering DNA acceptor and donor sites (mouse, human)
- LandSat data from Kansas (30 km2 region, 1024 x 1024 pixels, 7 channels)
- Positive findings: see the book chapter by Cheeseman and Stutz, online
- Other typical applications: see KD Nuggets (http://www.kdnuggets.com)
- Implementations
- Obtaining source code from the project page
- AutoClass III: Lisp implementation [Cheeseman, Stutz, Taylor, 1992]
- AutoClass C: C implementation [Cheeseman, Stutz, Taylor, 1998]
- These and others at http://www.recursive-partitioning.com/cluster.html
16. Unsupervised Learning: Competitive Learning for Feature Discovery
- Intuitive Idea: Competitive Mechanisms for Unsupervised Learning
- Global organization from local, competitive weight updates
- Basic principle expressed by Von der Malsburg
- Guiding examples from (neuro)biology: lateral inhibition
- Previous work: Hebb, 1949; Rosenblatt, 1959; Von der Malsburg, 1973; Fukushima, 1975; Grossberg, 1976; Kohonen, 1982
- A Procedural Framework for Unsupervised Connectionist Learning
- Start with identical (neural) processing units with random initial parameters
- Set a limit on the activation strength of each unit
- Allow units to compete for the right to respond to a set of inputs
- Feature Discovery
- Identifying (or constructing) new features relevant to supervised learning
- Examples: finding distinguishable letter characteristics in handwritten character recognition (HCR) and optical character recognition (OCR)
- Competitive learning: transform X into X′; train the units in X′ closest to x (a single update step is sketched after this list)
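The following is a minimal sketch of one winner-take-all update consistent with the framework above: the unit whose weight vector is closest to the input wins the competition and moves toward that input. The hard winner-take-all rule, Euclidean distance, and learning rate are illustrative choices.

    import numpy as np

    def competitive_step(units, x, lr=0.05):
        # units: (k, n) float array of unit weight vectors; x: length-n input
        winner = np.argmin(np.linalg.norm(units - x, axis=1))   # competition for x
        units[winner] += lr * (x - units[winner])                # winner moves toward x
        return winner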
17. Unsupervised Learning: Kohonen's Self-Organizing Map (SOM) 1
- Another Clustering Algorithm
- aka Self-Organizing Feature Map (SOFM)
- Given: vectors of attribute values (x1, x2, …, xn)
- Returns: vectors of attribute values (x1′, x2′, …, xk′)
- Typically, n >> k (n is high; k = 1, 2, or 3; hence dimensionality-reducing)
- Output vectors x′: the projections of input points x; also get P(xj′ | xi)
- Mapping from x to x′ is topology preserving
- Topology-Preserving Networks
- Intuitive idea: similar input vectors will map to similar clusters
- Recall the informal definition of a cluster (isolated set of mutually similar entities)
- Restatement: clusters of X (high-dimensional) will still be clusters of X′ (low-dimensional)
- Representation of Node Clusters
- Group of neighboring artificial neural network units (neighborhood of nodes)
- SOMs combine the ideas of topology-preserving networks and unsupervised learning (a single update step is sketched after this list)
- Implementation: http://www.cis.hut.fi/nnrc/ and the MATLAB NN Toolkit
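Below is a minimal NumPy sketch of one standard SOM update, assuming the usual formulation (best-matching unit plus a Gaussian neighborhood on the low-dimensional grid); the neighborhood width, learning rate, and their decay schedules are illustrative assumptions and are omitted here.

    import numpy as np

    def som_step(weights, grid, x, lr=0.1, sigma=1.0):
        # weights: (k, n) float array of unit weight vectors in input space
        # grid:    (k, d) coordinates of the units on the output grid (d = 1, 2, or 3)
        bmu = np.argmin(np.linalg.norm(weights - x, axis=1))    # best-matching unit
        grid_dist = np.linalg.norm(grid - grid[bmu], axis=1)    # distance on the grid
        h = np.exp(-grid_dist**2 / (2 * sigma**2))              # Gaussian neighborhood
        weights += lr * h[:, None] * (x - weights)              # topology-preserving update
        return bmu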
18. Unsupervised Learning: Kohonen's Self-Organizing Map (SOM) 2
19. Unsupervised Learning: Kohonen's Self-Organizing Map (SOM) 3
20. Unsupervised Learning: SOM and Other Projections for Clustering
Cluster Formation and Segmentation Algorithm (Sketch)
21. Unsupervised Learning: Other Algorithms (PCA, Factor Analysis)
- Intuitive Idea
- Q: Why are dimensionality-reducing transforms good for supervised learning?
- A: There may be many attributes with undesirable properties, e.g.,
- Irrelevance: xi has little discriminatory power over c(x) = yi
- Sparseness of information: the feature of interest is spread out over many xi's (e.g., text document categorization, where xi is a word position)
- We want to increase the information density by squeezing X down
- Principal Components Analysis (PCA) (a minimal projection sketch follows this list)
- Combining redundant variables into a single variable (aka component, or factor)
- Example: ratings (e.g., Nielsen) and polls (e.g., Gallup); responses to certain questions may be correlated (e.g., "like fishing?" and "time spent boating")
- Factor Analysis (FA)
- General term for a class of algorithms that includes PCA
- Tutorial: http://www.statsoft.com/textbook/stfacan.html
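As a concrete illustration of the dimensionality-reducing projection discussed above, the sketch below computes PCA by eigendecomposition of the sample covariance matrix; the function name and the choice of NumPy routines are illustrative, not taken from the lecture.

    import numpy as np

    def pca_project(X, k):
        # X: (m, n) data matrix; returns the (m, k) projection onto the top-k components
        Xc = X - X.mean(axis=0)                    # center the data
        cov = np.cov(Xc, rowvar=False)             # (n, n) sample covariance matrix
        eigvals, eigvecs = np.linalg.eigh(cov)     # eigh: for symmetric matrices
        top = np.argsort(eigvals)[::-1][:k]        # indices of the k largest eigenvalues
        return Xc @ eigvecs[:, top]                # dimensionality-reducing projection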
22. Clustering Methods: Design Choices
23. Clustering Applications
Information Retrieval: Text Document Categorization
24. Unsupervised Learning and Constructive Induction
- Unsupervised Learning in Support of Supervised Learning
- Given: D ≡ labeled vectors (x, y)
- Return: D′ ≡ transformed training examples (x′, y′)
- Solution approach: constructive induction
- Feature construction: generic term
- Cluster definition
- Feature Construction (Front End)
- Synthesizing new attributes
- Logical: x1 ∧ ¬x2; arithmetic: x1 + x5 / x2 (see the sketch at the end of this slide)
- Other synthetic attributes: f(x1, x2, …, xn), etc.
- Dimensionality-reducing projection, feature extraction
- Subset selection: finding relevant attributes for a given target y
- Partitioning: finding relevant attributes for given targets y1, y2, …, yp
- Cluster Definition (Back End)
- Form, segment, and label clusters to get intermediate targets y′
- Change of representation: find an (x′, y′) that is good for learning the target y
x′ ≡ (x1′, …, xp′)
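A tiny sketch of the feature-construction front end described above; the particular logical and arithmetic combinations, and the assumption that X has at least five numeric columns, are hypothetical illustrations only.

    import numpy as np

    def construct_features(X):
        # X: (m, n) numeric data matrix with n >= 5 (assumed for illustration)
        x1, x2, x5 = X[:, 0], X[:, 1], X[:, 4]
        logical = ((x1 > 0) & ~(x2 > 0)).astype(float)    # cf. x1 AND NOT x2
        arithmetic = x1 + x5 / (np.abs(x2) + 1e-9)        # cf. x1 + x5 / x2, guarded against /0
        return np.column_stack([X, logical, arithmetic])  # original plus synthetic attributes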
25. Clustering: Relation to Constructive Induction
- Clustering versus Cluster Definition
- Clustering: a 3-step process
- Cluster definition: the back end of feature construction
- Clustering: 3-Step Process
- Form
- (x1′, …, xk′) in terms of (x1, …, xn)
- NB: typically part of the construction step; sometimes integrates both
- Segment
- (y1, …, yJ) in terms of (x1′, …, xk′)
- NB: the number of clusters J is not necessarily the same as the number of dimensions k
- Label
- Assign names (discrete/symbolic labels (v1, …, vJ)) to (y1, …, yJ)
- Important in document categorization (e.g., clustering text for information retrieval)
- Hierarchical Clustering: Applying Clustering Recursively
26. Training Methods for Hierarchical Mixture of Experts (HME)
27. Terminology
- Committee Machines, aka Combiners
- Static Structures
- Ensemble averaging
- Single-pass, separately trained inducers, common input
- Individual outputs combined to get a scalar output (e.g., linear combination)
- Boosting the margin: separately trained inducers, different input distributions
- Filtering: feed examples to trained inducers (weak classifiers), pass them on to the next classifier iff a conflict is encountered (consensus model)
- Resampling: aka subsampling (Si of fixed size m resampled from D)
- Reweighting: fixed-size Si containing weighted examples for the inducer
- Dynamic Structures
- Mixture of experts: training in the combiner inducer (aka gating network)
- Hierarchical mixtures of experts: hierarchy of inducers and combiners
- Mixture Model, aka Mixture of Experts (ME)
- Expert (classification) and gating (combiner) inducers (modules, networks)
- Hierarchical Mixtures of Experts (HME): multiple combiner (gating) levels
28. Summary Points
- Committee Machines, aka Combiners
- Static Structures (Single-Pass)
- Ensemble averaging
- For improving weak (especially unstable) classifiers
- e.g., weighted majority, bagging, stacking
- Boosting the margin
- Improves the performance of any inducer: weight examples to emphasize errors
- Variants: filtering (aka consensus), resampling (aka subsampling), reweighting
- Dynamic Structures (Multi-Pass)
- Mixture of experts: training in the combiner inducer (aka gating network)
- Hierarchical mixtures of experts: hierarchy of inducers and combiners
- Mixture Model (aka Mixture of Experts)
- Estimation of mixture coefficients (i.e., weights)
- Hierarchical Mixtures of Experts (HME): multiple combiner (gating) levels
- Next Week: Intro to GAs and GP (9.1-9.4, Mitchell; 1, 6.1-6.5, Goldberg)