Title: Spectral Clustering and Embedding with Hidden Markov Models
Spectral Clustering and Embedding with Hidden Markov Models
Tony Jebara, Yingbo Song, Kapil Thadani
Department of Computer Science, Columbia University
Outline
- Unsupervised learning: parametric vs. nonparametric
- Density estimation: parametric vs. nonparametric
- Semi-parametric likelihood (NIPS 07)
- Clustering: parametric vs. nonparametric
- Expectation maximization
- Spectral clustering
- Semi-parametric clustering (ECML 07)
- Probability product kernels (PPK)
- Hidden Markov model kernel
- Spectral clustering on PPK, and results
- Multidimensional scaling on PPK, and results
- Future/upcoming work
Unsupervised Learning
- Parametric methods (sufficient statistics, exponential family)
  - Do not grow with data
  - Density estimation: maximum likelihood
  - Clustering: expectation maximization
  - Visualization: hidden variables (GTM)
  - Models: mixtures, Bayes nets, hidden Markov models
- Nonparametric frequentist methods
  - Grow with data
  - Density estimation: Parzen windows, L1 fitting, infinite mixtures
  - Clustering: spectral clustering
  - Visualization: kNN, multidimensional scaling, LLE
  - Models: kernels, distance metrics, graphs on data
Density Estimation
- Density estimation, most generally: given samples, find the underlying density
- Nonparametric assumes samples are independently distributed (i.d.)
- Parametric assumes samples are independent and identically distributed (i.i.d.)
- Can we combine the two? Semi-parametric density estimation (NIPS)
- A kernel pulls the models together
Density Estimation
- Nonparametric estimate
- Parametric estimate
- Semi-parametric estimate
- A probability kernel pulls the models together (standard forms of the first two estimates are sketched below)
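For reference, the standard textbook forms of the first two estimates, with $k(\cdot,\cdot)$ a Parzen kernel; the semi-parametric estimate that interpolates between them is developed in the NIPS paper:

$$\hat{p}_{\text{nonparametric}}(x) = \frac{1}{N}\sum_{n=1}^{N} k(x, x_n), \qquad \hat{p}_{\text{parametric}}(x) = p(x \mid \hat{\theta}), \quad \hat{\theta} = \arg\max_{\theta} \sum_{n=1}^{N} \log p(x_n \mid \theta)$$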
Probability Product Kernel
- A natural similarity measure between two distributions
- To compute the kernel for a pair of inputs:
  1) Estimate the densities by maximum likelihood (ML)
  2) Compute the kernel between the two estimated densities (definition below)
- The probability product kernel uses either power beta = 1 or beta = 1/2
- Non-negative; the beta = 1/2 case equals 1 if and only if the two densities are identical
- Measures the overlap of two distributions; pulls pairs together
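For reference, the probability product kernel between two densities $p$ and $p'$ is

$$K(p, p') = \int p(x)^{\beta}\, p'(x)^{\beta}\, dx$$

- With $\beta = 1$ this is the expected likelihood kernel; with $\beta = 1/2$ it is the Bhattacharyya affinity, for which $K(p, p') \le 1$ with equality if and only if $p = p'$.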
Probability Product Kernel
- For the exponential family $p(x \mid \theta) = \exp\left(A(x) + \theta^{\top} T(x) - \mathcal{K}(\theta)\right)$
- The kernel has a closed form (sketched below)
- For the Gaussian case, this recovers the RBF kernel
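A sketch of the closed form for $\beta = 1/2$, written in terms of the log-partition function $\mathcal{K}(\theta)$ of the exponential family above:

$$K(\theta, \theta') = \exp\!\left(\mathcal{K}\!\left(\tfrac{\theta + \theta'}{2}\right) - \tfrac{1}{2}\mathcal{K}(\theta) - \tfrac{1}{2}\mathcal{K}(\theta')\right)$$

- For unit-variance Gaussians ($\theta = \mu$, $\mathcal{K}(\theta) = \|\theta\|^2/2$) this reduces to the RBF form $\exp(-\|\mu - \mu'\|^2/8)$.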
Probability Product Kernel
- For hidden Markov models, the kernel must sum over all pairs of hidden state paths
- The brute-force kernel is therefore exponential work
Probability Product Kernel
- Instead of the brute-force cross product, use forward-backward style recursions
- Only compute elementary sub-kernels Ψ for common parents
- Form clique functions and sum them via the junction tree algorithm
Probability Product Kernel
- PPK for two Gaussian HMMs with S and U states
- Get an S x U interaction table between all pairs of emissions
- Then simple pseudo-code sweeps the state prior and transition terms (a sketch follows below)
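A minimal Python sketch of this recursion (not the authors' code), assuming beta = 1/2, a fixed sequence length T, and a hypothetical dict-based HMM format with keys 'pi' (state prior), 'A' (transition matrix), 'means' and 'covs' (per-state Gaussian emission parameters):

```python
import numpy as np

def gauss_sub_kernel(m1, c1, m2, c2):
    """Closed-form integral of sqrt(N(m1, c1) * N(m2, c2)) over x."""
    p1, p2 = np.linalg.inv(c1), np.linalg.inv(c2)
    a = 0.5 * (p1 + p2)                          # fused precision
    b = 0.5 * (p1 @ m1 + p2 @ m2)                # fused linear term
    log_k = (0.5 * b @ np.linalg.solve(a, b)
             - 0.25 * (m1 @ p1 @ m1 + m2 @ p2 @ m2)
             - 0.25 * np.linalg.slogdet(c1)[1]
             - 0.25 * np.linalg.slogdet(c2)[1]
             - 0.5 * np.linalg.slogdet(a)[1])
    return np.exp(log_k)

def ppk_hmm(h1, h2, T):
    """PPK (beta = 1/2) between two Gaussian HMMs over length-T sequences."""
    S, U = len(h1['pi']), len(h2['pi'])
    # S x U table of emission sub-kernels between all state pairs
    psi = np.array([[gauss_sub_kernel(h1['means'][s], h1['covs'][s],
                                      h2['means'][u], h2['covs'][u])
                     for u in range(U)] for s in range(S)])
    B1, B2 = np.sqrt(h1['A']), np.sqrt(h2['A'])  # transitions raised to beta
    phi = psi.copy()                             # table at the final time step
    for _ in range(T - 1):                       # sweep back to time 1
        phi = psi * (B1 @ phi @ B2.T)
    # contract against the state priors raised to beta
    return float(np.sum(np.sqrt(np.outer(h1['pi'], h2['pi'])) * phi))
```

The loop costs O(T(S²U + SU²)) per pair, versus the exponential cost of enumerating all pairs of state paths.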
Clustering
- Parametric clustering (EM mixture model)
  - Local minima
  - Strict shape assumptions
- Nonparametric clustering (spectral cut, max-cut)
  - Global optimum
  - No parametric assumptions
  - Instead, kernel tweaking
- Semi-parametric clustering (probability kernel pulls models together)
  - Makes a parametric (Markov) assumption about each datum, but not about overall cluster shapes
Parametric EM Clustering
- Parametric clustering:
  - E-step: given two models (one per class), get the responsibility of each model for $x_n$: $r_{nc} = \pi_c\, p(x_n \mid \theta_c) / \sum_{c'} \pi_{c'}\, p(x_n \mid \theta_{c'})$
  - M-step: maximize the expected complete likelihood: $\theta_c \leftarrow \arg\max_{\theta} \sum_n r_{nc} \log p(x_n \mid \theta)$
- What if each x is a sequence? Cluster with two HMM models.
- Just extend EM to an HMM mixture with a hidden state trellis (Alon & Sclaroff)
Parametric EM Clustering
- EM clustering works well if we have a true mixture
- Problem: what if we don't have a mixture of 2 Gaussians or HMMs?
- Example: the sequences come from two slowly drifting HMMs
Nonparametric Spectral Clustering
- Spectral clustering is agnostic about the shape of clusters!
- A popular variant is stabilized spectral clustering (Ng, Jordan & Weiss)
- Get the top eigenvectors of the normalized Laplacian $L = D^{-1/2} A D^{-1/2}$
- Usually use an RBF affinity (a sketch of the full procedure follows below)
- What if each datum is a time series? Can use the Yin-Yang kernel
- But how do we use parametric assumptions on each datum? For example, extend so each datum is a 2-state HMM?
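A minimal sketch of the Ng-Jordan-Weiss procedure with an RBF affinity, assuming vector-valued data; function and parameter names are illustrative:

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

def njw_spectral_clustering(X, k, sigma=1.0):
    """Cluster the rows of X into k groups via the normalized Laplacian."""
    A = np.exp(-cdist(X, X, 'sqeuclidean') / (2.0 * sigma ** 2))
    np.fill_diagonal(A, 0.0)                 # zero self-affinity
    d = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = D_inv_sqrt @ A @ D_inv_sqrt          # L = D^{-1/2} A D^{-1/2}
    _, vecs = np.linalg.eigh(L)              # eigenvalues in ascending order
    V = vecs[:, -k:]                         # top-k eigenvectors of L
    V = V / np.linalg.norm(V, axis=1, keepdims=True)  # row-normalize
    return KMeans(n_clusters=k, n_init=10).fit_predict(V)
```

Row-normalizing the spectral embedding before k-means is the stabilization step of Ng, Jordan & Weiss.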
Motion Capture
- Rotating walk/run motion data
- Each sequence is a 2-state HMM
- But each cluster's shape is circular
Spectral Clustering with PPK
- For each time series:
  1) Parametrically learn an HMM
  2) Compute the kernel between all pairs of HMMs
  3) Run nonparametric spectral clustering or embedding (MDS)
Spectral Clustering with PPK
- Algorithm for spectral clustering of HMM models (a sketch follows below)
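A minimal sketch of the full SC-PPK loop, reusing ppk_hmm from the earlier sketch; fit_hmm is a hypothetical stand-in for any EM-based HMM trainer (e.g., Baum-Welch) and is not part of the paper:

```python
import numpy as np
from sklearn.cluster import KMeans

def sc_ppk(sequences, n_clusters, n_states=2, T=10):
    # step 1: fit one HMM per sequence by maximum likelihood
    # (fit_hmm: hypothetical helper returning the dict format of ppk_hmm)
    hmms = [fit_hmm(seq, n_states) for seq in sequences]
    n = len(hmms)
    K = np.zeros((n, n))
    for i in range(n):                      # step 2: PPK Gram matrix
        for j in range(i, n):
            K[i, j] = K[j, i] = ppk_hmm(hmms[i], hmms[j], T)
    # step 3: treat K as the affinity; normalize, embed, k-means
    d = K.sum(axis=1)
    L = K / np.sqrt(np.outer(d, d))         # D^{-1/2} K D^{-1/2}
    _, vecs = np.linalg.eigh(L)
    V = vecs[:, -n_clusters:]
    V = V / np.linalg.norm(V, axis=1, keepdims=True)
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(V)
```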
Clustering MOCAP
- Starting with a single movie of walking and running
- Generate several rotated versions of each
- Two clusters of sequences: walk and run
- Used 2-state Gaussian HMMs in SC-PPK
- Get 2 circular clusters; better than EM or the time-series kernel
Clustering MOCAP
- Built a dataset from sequences of motion capture
- Two motion categories mixed, with several sequences of each (1 sequence = a 123-dimensional time series)
- Used 2-state Gaussian-emission HMMs
- Spectral clustering to predict classes
- The number in parentheses is the subject
Clustering Arabic Characters
- Dataset consists of example sequences of two different characters
- About 20-30 examples per class
- Each sequence is a 2-dimensional time series
- Used 2-state Gaussian-emission HMMs
Clustering Sign Language
- Sign language dataset; each sign is a time series
- Two categories of expressions
- Used multi-state HMMs with Gaussian emissions
Clustering Network Traces
- Clustering network hosts in the Columbia CS department
- Features: packets per port per hour, over 24 hours
- Fit an HMM to each host, then cluster the HMMs
- Example cluster below (hosts in the cluster and their packet volume)
- All are web servers, NFS servers, or database servers.
( 1) 128.59.20.66    zinc.cs.columbia.edu.        num packets: 75,707,059
( 2) 128.59.20.227   planetlab2.cs.columbia.edu.  num packets: 43,710,510
( 3) 128.59.21.157   bagpipe.cs.columbia.edu.     num packets: 42,139,618
( 4) 128.59.16.20    cs.columbia.edu.             num packets: 39,047,751
( 5) 128.59.16.108   hellfire.cs.columbia.edu.    num packets: 39,019,003
( 6) 128.59.23.17    manycore.cs.columbia.edu.    num packets: 38,135,241
( 7) 128.59.22.220   nemo.cs.columbia.edu.        num packets: 26,873,532
( 8) 128.59.18.100   ober.cs.columbia.edu.        num packets: 25,070,903
( 9) 128.59.22.184   db-pc03.cs.columbia.edu.     num packets: 24,431,779
(10) 128.59.16.101   ground.cs.columbia.edu.      num packets: 23,581,185
(11) 128.59.16.145   flame.cs.columbia.edu.       num packets: 19,861,350
(12) 128.59.21.33    bosch.cs.columbia.edu.       num packets: 17,715,535
Clustering MOCAP: Runtime
- SC-PPK is faster: it runs EM once per HMM, independently for each sequence
- The subsequent spectral clustering step is O(N^3)
- EM and k-means clustering must instead iterate HMM training
Embedding MOCAP
- MDS embedding of rotated walking and running (a sketch of the embedding step follows below)
- a) Yin-Yang kernel
- b) Probability product kernel
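A minimal sketch of the embedding step, assuming a classical MDS / kernel-PCA construction on a precomputed, symmetric positive semi-definite Gram matrix K of pairwise PPK values; names are illustrative:

```python
import numpy as np

def mds_embed(K, dim=2):
    """Embed items into dim coordinates from a Gram/affinity matrix K."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    Kc = H @ K @ H                           # double-center the Gram matrix
    w, v = np.linalg.eigh(Kc)                # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:dim]          # keep the top-dim components
    return v[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))
```

Plotting the two returned columns yields a 2-D map like the ones described on these slides.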
Embedding Arabic & ASL
- MDS embedding using the PPK of:
- a) Arabic dataset: each character is a time series datum of spatial coordinates
- b) Sign language dataset: each datum is a time series of hand movement coordinates
Conclusions
- Semi-parametric methods explore the waters between the complementary parametric and nonparametric approaches
- Semi-parametric clustering avoids shape assumptions on clusters but keeps assumptions on each datum
- A novel semi-parametric likelihood interpolates between i.d. nonparametric and i.i.d. parametric
- It encourages model agreement via the probability product kernel
- This also gives rise to a new clustering criterion:
  1) Fit each datum with parametric maximum likelihood
  2) Compute kernels between the models
  3) Solve for spectral clustering or embedding
- Next: the semi-parametric density estimation aspect (NIPS); iteratively maximize likelihood to avoid overfitting HMMs
- Next: a sneak peek at some new applications