Title: Announcements
1 Announcements
- Project proposal is due on 03/11
- Three seminars this Friday (EB 3105)
  - Dealing with Indefinite Representations in Pattern Recognition (10:00 am - 11:00 am)
  - Computational Analysis of Drosophila Gene Expression Pattern Image (11:00 am - 12:00 pm)
  - 3D General Lesion Segmentation in CT (3:00 pm - 4:00 pm)
2 Hierarchical Mixture Expert Model
3 Good Things about Decision Trees
- Decision trees introduce nonlinearity through the tree structure
  - Viewing ABC as ABC
- Compared to kernel methods
  - Less ad hoc
  - Easier to understand
4 Example
In general, mixture models are powerful in fitting complex decision boundaries, for instance, stacking, boosting, and bagging.
Kernel method
5 Generalize Decision Trees
From slides of Andrew Moore
6 Partition Datasets
- The goal of each node is to partition the data set into disjoint subsets such that each subset is easier to classify.
[Figure: the original dataset partitioned by a single attribute, cylinders = 4 / 5 / 6 / 8]
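As a concrete illustration, a small hypothetical table of cars (column names and values invented for this sketch, in the spirit of the figure above) can be split into disjoint subsets by a single attribute with a group-by:

```python
import pandas as pd

# Toy dataset (hypothetical values): each row is a car described by its
# number of cylinders and its weight, with a binary class label.
cars = pd.DataFrame({
    "cylinders": [4, 4, 6, 8, 5, 6, 8, 4],
    "weight":    [2.1, 2.4, 3.2, 4.5, 2.9, 3.8, 4.1, 2.0],  # tons
    "label":     [1, 1, -1, -1, 1, -1, -1, 1],
})

# Partition by a single attribute: one disjoint subset per cylinder count.
for value, subset in cars.groupby("cylinders"):
    print(f"cylinders = {value}: {len(subset)} examples")
```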
7 Partition Datasets (cont'd)
- More complicated partitions: partition by multiple attributes, e.g., Cylinders < 6 and Weight > 4 ton vs. Cylinders ≥ 6 and Weight < 3 ton
- Other cases: using a classification model for each node
[Figure: the original dataset partitioned by multiple attributes]
- How to accomplish such a complicated partition?
  - Each partition ↔ a class
  - Partitioning a dataset into disjoint subsets ↔ classifying a dataset into multiple classes
8 A More General Decision Tree
Each node is a linear classifier
[Figure: a decision tree with simple data partition vs. a decision tree using classifiers for data partition]
9 General Schemes for Decision Trees
- Each node within the tree is a linear classifier
- Pros
  - Usually results in shallow trees
  - Introduces nonlinearity into linear classifiers (e.g., logistic regression)
  - Overcomes overfitting through the regularization mechanism within the classifier
  - Partitions datasets with soft memberships
  - A better way to deal with real-valued attributes
- Examples
  - Neural networks
  - Hierarchical Mixture Expert Model
10 Hierarchical Mixture Expert Model (HME)
[Figure: HME architecture. Root gate r(x); Group Layer: Group 1 g1(x), Group 2 g2(x); Expert Layer: m1,1(x), m1,2(x), m2,1(x), m2,2(x)]
11 Hierarchical Mixture Expert Model (HME)
[Figure: HME architecture, as above]
12 Hierarchical Mixture Expert Model (HME)
[Figure: HME architecture; the root gate outputs r(x) = 1, routing x to Group 1]
13 Hierarchical Mixture Expert Model (HME)
[Figure: HME architecture, as above]
14 Hierarchical Mixture Expert Model (HME)
[Figure: HME architecture; within Group 1, the gate outputs g1(x) = -1, routing x to expert m1,2]
15 Hierarchical Mixture Expert Model (HME)
[Figure: HME architecture; expert m1,2 outputs m1,2(x) = 1, so the class label for x is 1]
16 Hierarchical Mixture Expert Model (HME)
More Complicated Case
[Figure: HME architecture, as above]
17 Hierarchical Mixture Expert Model (HME)
More Complicated Case
r(1|x) = ¾, r(-1|x) = ¼
[Figure: HME architecture]
18 Hierarchical Mixture Expert Model (HME)
More Complicated Case
r(1|x) = ¾, r(-1|x) = ¼
Which expert should be used for classifying x?
[Figure: HME architecture]
19 Hierarchical Mixture Expert Model (HME)
More Complicated Case
r(1|x) = ¾, r(-1|x) = ¼
g1(1|x) = ¼, g1(-1|x) = ¾; g2(1|x) = ½, g2(-1|x) = ½

            y = 1    y = -1
m1,1(x)      ¼        ¾
m1,2(x)      ¾        ¼
m2,1(x)      ¼        ¾
m2,2(x)      ¾        ¼

[Figure: HME architecture]
How do we compute the probabilities p(1|x) and p(-1|x)?
20 HME: Probabilistic Description
Random variable g ∈ {1, 2}: r(1|x) ≡ p(g = 1|x), r(-1|x) ≡ p(g = 2|x)
Random variable m ∈ {11, 12, 21, 22}:
g1(1|x) ≡ p(m = 11|x, g = 1), g1(-1|x) ≡ p(m = 12|x, g = 1)
g2(1|x) ≡ p(m = 21|x, g = 2), g2(-1|x) ≡ p(m = 22|x, g = 2)
[Figure: HME architecture]
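Putting these definitions together (writing m_{i,j}(y|x) for the probability that expert m_{i,j} assigns to label y, which is how the expert tables above are read), the HME prediction marginalizes over the hidden group and expert variables:

\[
p(y \mid x) \;=\; \sum_{g \in \{1,2\}} p(g \mid x) \sum_{m} p(m \mid x, g)\, p(y \mid x, m)
\;=\; r(1 \mid x)\bigl[g_1(1 \mid x)\, m_{1,1}(y \mid x) + g_1(-1 \mid x)\, m_{1,2}(y \mid x)\bigr]
\;+\; r(-1 \mid x)\bigl[g_2(1 \mid x)\, m_{2,1}(y \mid x) + g_2(-1 \mid x)\, m_{2,2}(y \mid x)\bigr]
\]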
21 HME: Probabilistic Description
r(1|x) = ¾, r(-1|x) = ¼
g1(1|x) = ¼, g1(-1|x) = ¾; g2(1|x) = ½, g2(-1|x) = ½

            y = 1    y = -1
m1,1(x)      ¼        ¾
m1,2(x)      ¾        ¼
m2,1(x)      ¼        ¾
m2,2(x)      ¾        ¼

[Figure: HME architecture]
Compute p(1|x) and p(-1|x)
22 HME: Probabilistic Description
(Same probabilities as on the previous slide.)

23 HME: Probabilistic Description
(Same probabilities as on the previous slide.)
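Plugging these numbers into the mixture formula above, the two class probabilities work out to:

\[
p(1 \mid x) = \tfrac{3}{4}\Bigl(\tfrac{1}{4}\cdot\tfrac{1}{4} + \tfrac{3}{4}\cdot\tfrac{3}{4}\Bigr)
            + \tfrac{1}{4}\Bigl(\tfrac{1}{2}\cdot\tfrac{1}{4} + \tfrac{1}{2}\cdot\tfrac{3}{4}\Bigr)
            = \tfrac{3}{4}\cdot\tfrac{10}{16} + \tfrac{1}{4}\cdot\tfrac{1}{2}
            = \tfrac{19}{32},
\qquad
p(-1 \mid x) = \tfrac{3}{4}\Bigl(\tfrac{1}{4}\cdot\tfrac{3}{4} + \tfrac{3}{4}\cdot\tfrac{1}{4}\Bigr)
             + \tfrac{1}{4}\Bigl(\tfrac{1}{2}\cdot\tfrac{3}{4} + \tfrac{1}{2}\cdot\tfrac{1}{4}\Bigr)
             = \tfrac{3}{4}\cdot\tfrac{6}{16} + \tfrac{1}{4}\cdot\tfrac{1}{2}
             = \tfrac{13}{32}.
\]

The two probabilities sum to one, and the HME predicts class 1 since 19/32 > 13/32.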
24 Hierarchical Mixture Expert Model (HME)
[Figure: HME architecture with output y]
Is HME more powerful than a simple majority vote approach?
25 Problem with Training HME
- Use logistic regression to model r(x), g(x), and m(x)
- There are no training examples for r(x) and g(x)
  - For each training example (x, y), we don't know its group ID or expert ID
  - We can't apply the standard training procedure for logistic regression to train r(x) and g(x) directly
- The random variables g and m are called hidden variables since they are not observed in the training data
- How do we train a model with incomplete data?
26 Start with a Random Guess
- Iteration 1: random guess
  - Randomly assign points to groups and experts
Training examples: 1, 2, 3, 4, 5 | 6, 7, 8, 9
[Figure: HME architecture]
27 Start with a Random Guess
- Iteration 1: random guess
  - Randomly assign points to groups and experts
Training examples: 1, 2, 3, 4, 5 | 6, 7, 8, 9
[Figure: random assignment. Group 1 ← {1, 2, 6, 7}, Group 2 ← {3, 4, 5, 8, 9}; experts m1,1 ← {1, 6}, m1,2 ← {2, 7}, m2,1 ← {3, 9}, m2,2 ← {4, 5, 8}]
28 Start with a Random Guess
- Iteration 1: random guess
  - Randomly assign points to groups and experts
  - Learn r(x), g1(x), g2(x), m1,1(x), m1,2(x), m2,1(x), m2,2(x) from the assigned points
Training examples: 1, 2, 3, 4, 5 | 6, 7, 8, 9
[Figure: random assignment, as above]
Now, what should we do?
29 Refine the HME Model
- Iteration 2: regroup the data points
  - Reassign the group membership of each data point
  - Reassign the expert membership of each data point
Training examples: 1, 2, 3, 4, 5 | 6, 7, 8, 9
[Figure: new assignment. Group 1 ← {1, 5, 6, 7}, Group 2 ← {2, 3, 4, 8, 9}]
But how?
30 Determine Group Memberships
Consider an example (x, 1)
r(1|x) = ¾, r(-1|x) = ¼
g1(1|x) = ¼, g1(-1|x) = ¾; g2(1|x) = ½, g2(-1|x) = ½

            y = 1    y = -1
m1,1(x)      ¼        ¾
m1,2(x)      ¾        ¼
m2,1(x)      ¼        ¾
m2,2(x)      ¾        ¼

Compute the posterior on your own sheet!
31 Determine Group Memberships
Consider an example (x, 1)
(Same probabilities as on the previous slide.)
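By Bayes' rule, the group posterior for this example weighs each group's gate probability by how well that group's experts explain the observed label y = 1; with the numbers above:

\[
p(g = 1 \mid x, y = 1)
= \frac{r(1 \mid x)\bigl[g_1(1 \mid x)\, m_{1,1}(1 \mid x) + g_1(-1 \mid x)\, m_{1,2}(1 \mid x)\bigr]}{p(1 \mid x)}
= \frac{\tfrac{3}{4}\cdot\tfrac{10}{16}}{\tfrac{19}{32}} = \frac{15}{19},
\qquad
p(g = 2 \mid x, y = 1)
= \frac{\tfrac{1}{4}\cdot\tfrac{1}{2}}{\tfrac{19}{32}} = \frac{4}{19}.
\]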
32 Determine Expert Memberships
Consider an example (x, 1)
(Same probabilities as above.)
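Likewise, within Group 1 the expert posteriors for this example are proportional to the gate probability times each expert's probability for y = 1:

\[
p(m = 11 \mid x, y = 1, g = 1)
= \frac{g_1(1 \mid x)\, m_{1,1}(1 \mid x)}{g_1(1 \mid x)\, m_{1,1}(1 \mid x) + g_1(-1 \mid x)\, m_{1,2}(1 \mid x)}
= \frac{\tfrac{1}{4}\cdot\tfrac{1}{4}}{\tfrac{10}{16}} = \frac{1}{10},
\qquad
p(m = 12 \mid x, y = 1, g = 1) = \frac{\tfrac{3}{4}\cdot\tfrac{3}{4}}{\tfrac{10}{16}} = \frac{9}{10}.
\]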
33 Refine the HME Model
- Iteration 2: regroup the data points
  - Reassign the group membership of each data point
  - Reassign the expert membership of each data point
  - Compute the posteriors p(g|x, y) and p(m|x, y, g) for each training example (x, y)
  - Retrain r(x), g1(x), g2(x), m1,1(x), m1,2(x), m2,1(x), m2,2(x) using the estimated posteriors
Training examples: 1, 2, 3, 4, 5 | 6, 7, 8, 9
[Figure: new assignment. Group 1 ← {1, 5, 6, 7}, Group 2 ← {2, 3, 4, 8, 9}]
But how?
34 Logistic Regression: Soft Memberships
Soft memberships
35 Logistic Regression: Soft Memberships
Soft memberships
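With soft memberships, each logistic regression in the tree is retrained by posterior-weighted maximum likelihood, using the posteriors from the previous step as example weights. Under the notation above, a standard way to write this objective for the expert models is shown below (the gate models r and g are weighted analogously); the per-expert parameters θ_m are introduced here for illustration:

\[
\max_{\{\theta_m\}} \;\sum_{i=1}^{N} \sum_{g} \sum_{m} p(g, m \mid x_i, y_i)\, \log m(y_i \mid x_i; \theta_m),
\qquad
p(g, m \mid x_i, y_i) = p(g \mid x_i, y_i)\, p(m \mid x_i, y_i, g).
\]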
36 Start with Random Guess
- Iteration 2: regroup the data points
  - Reassign the group membership of each data point
  - Reassign the expert membership of each data point
  - Compute the posteriors p(g|x, y) and p(m|x, y, g) for each training example (x, y)
  - Retrain r(x), g1(x), g2(x), m1,1(x), m1,2(x), m2,1(x), m2,2(x)
Training examples: 1, 2, 3, 4, 5 | 6, 7, 8, 9
[Figure: new assignment. Group 1 ← {1, 5, 6, 7}, Group 2 ← {2, 3, 4, 8, 9}; experts m1,1 ← {1, 6}, m1,2 ← {5, 7}, m2,1 ← {2, 3, 9}, m2,2 ← {4, 8}]
Repeat the above procedure until it converges (it is guaranteed to converge to a local maximum of the likelihood).
This is the famous Expectation-Maximization (EM) algorithm!
37 Formal EM Algorithm for HME
- Unknown logistic regression models
  - r(x; θr), gi(x; θg), and mi(x; θm)
- Unknown group memberships and expert memberships
  - p(g|x, y), p(m|x, y, g)
- E-step
  - Fix the logistic regression models and estimate the memberships
  - Estimate p(g = 1|x, y) and p(g = 2|x, y) for all training examples
  - Estimate p(m = 11, 12|x, y, g = 1) and p(m = 21, 22|x, y, g = 2) for all training examples
38 Formal EM Algorithm for HME
- Unknown logistic regression models
  - r(x; θr), gi(x; θg), and mi(x; θm)
- Unknown group memberships and expert memberships
  - p(g|x, y), p(m|x, y, g)
- E-step: fix the logistic regression models and estimate the memberships (as on the previous slide)
- M-step: fix the estimated memberships and retrain r(x), gi(x), and mi(x) by weighted maximum likelihood, using the posteriors as soft memberships
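To make the E-step / M-step loop concrete, here is a minimal sketch of EM for a flat mixture of logistic-regression experts with a logistic gate; the hierarchical version adds a second gating level but follows the same pattern. It assumes NumPy and scikit-learn are available; the function name, the example-replication trick used to fit the gate with soft targets, the small weight epsilon, and the synthetic data are illustrative choices, not taken from the slides.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def fit_mixture_of_experts(X, y, n_experts=2, n_iters=20, seed=0):
    """EM for a flat mixture of logistic-regression experts with a logistic gate.

    E-step: compute responsibilities p(expert | x, y) for every training example.
    M-step: refit the gate and each expert by posterior-weighted maximum likelihood.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # "Random guess": start from a random soft assignment of points to experts.
    resp = rng.dirichlet(np.ones(n_experts), size=n)        # shape (n, n_experts)
    gate = LogisticRegression(max_iter=1000)
    experts = [LogisticRegression(max_iter=1000) for _ in range(n_experts)]

    for _ in range(n_iters):
        # ---- M-step: weighted refits using the current responsibilities ----
        # Gate: predicts the expert index. Each example is replicated once per
        # expert, weighted by its responsibility for that expert.
        X_rep = np.tile(X, (n_experts, 1))
        z_rep = np.repeat(np.arange(n_experts), n)
        w_rep = resp.T.reshape(-1) + 1e-8
        gate.fit(X_rep, z_rep, sample_weight=w_rep)
        # Experts: each expert sees all data, weighted by its responsibilities.
        for k in range(n_experts):
            experts[k].fit(X, y, sample_weight=resp[:, k] + 1e-8)

        # ---- E-step: recompute responsibilities p(expert | x, y) ----
        gate_p = gate.predict_proba(X)                       # p(expert | x)
        lik = np.column_stack([
            experts[k].predict_proba(X)[np.arange(n),
                                        np.searchsorted(experts[k].classes_, y)]
            for k in range(n_experts)
        ])                                                   # p(y | x, expert)
        joint = gate_p * lik
        resp = joint / joint.sum(axis=1, keepdims=True)

    return gate, experts


if __name__ == "__main__":
    # Usage sketch on a toy problem that a single linear model cannot solve.
    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 2))
    y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)
    gate, experts = fit_mixture_of_experts(X, y)
```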
39 What Are We Doing?
- What is the objective of doing Expectation-Maximization?
- It is still simple maximum likelihood!
  - The Expectation-Maximization algorithm actually tries to maximize the log-likelihood function
  - Most of the time, it converges to a local maximum, not a global one
  - Improved version: annealing EM
40 Annealing EM
41 Improve HME
- HME is sensitive to the initial assignments
  - How can we reduce the risk of bad initial assignments?
- Binary trees → K-way trees
- Logistic regression → conditional exponential model
- Tree structure
  - Can we determine the optimal tree structure for a given dataset?
42 Comparison of Classification Models
- The goal of a classifier
  - Predict the class label y for an input x
  - Estimate p(y|x)
- Gaussian generative model
  - p(y|x) ∝ p(x|y) p(y): posterior ∝ likelihood × prior
  - Difficulty in estimating p(x|y) when x comprises multiple elements
  - Naïve Bayes: p(x|y) ≈ p(x1|y) p(x2|y) ⋯ p(xd|y)
- Linear discriminative model
  - Estimate p(y|x) directly
  - Focus on finding the decision boundary
43 Comparison of Classification Models
- Logistic regression model
  - A linear decision boundary: w·x + b
  - A probabilistic model for p(y|x)
  - Maximum likelihood approach for estimating the weights w and the threshold b
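Written out for labels y ∈ {+1, -1}, the model and its maximum-likelihood estimation problem can be expressed as:

\[
p(y \mid x) = \frac{1}{1 + \exp\bigl(-y\,(\mathbf{w}\cdot\mathbf{x} + b)\bigr)},
\qquad
\max_{\mathbf{w},\, b} \;\sum_{i=1}^{N} \log p(y_i \mid x_i)
\;=\; \max_{\mathbf{w},\, b} \;-\sum_{i=1}^{N} \log\Bigl(1 + e^{-y_i(\mathbf{w}\cdot\mathbf{x}_i + b)}\Bigr).
\]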
44 Comparison of Classification Models
- Logistic regression model
  - Overfitting issue
    - In text classification, a word that appears in only one document will be assigned an infinitely large weight
    - Solution: regularization
- Conditional exponential model
- Maximum entropy model
  - A dual problem of the conditional exponential model
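A common L2-regularized form of the logistic regression objective, which keeps such weights finite, is:

\[
\min_{\mathbf{w},\, b} \;\sum_{i=1}^{N} \log\Bigl(1 + e^{-y_i(\mathbf{w}\cdot\mathbf{x}_i + b)}\Bigr) \;+\; \lambda\, \lVert \mathbf{w} \rVert^2.
\]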
45 Comparison of Classification Models
- Support vector machine
  - Classification margin
  - Maximum margin principle: two objectives
    - Minimize the classification error over the training data
    - Maximize the classification margin
  - Support vectors
    - Only the support vectors have an impact on the location of the decision boundary
46 Comparison of Classification Models
- Separable case
- Noisy case
Quadratic programming!
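The two cases correspond to the standard hard-margin and soft-margin primal problems, both of which are quadratic programs:

\[
\text{Separable:}\quad \min_{\mathbf{w},\, b}\; \tfrac{1}{2}\lVert \mathbf{w}\rVert^2
\quad \text{s.t.}\quad y_i(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1,\;\; i = 1, \dots, N
\]
\[
\text{Noisy:}\quad \min_{\mathbf{w},\, b,\, \boldsymbol{\xi}}\; \tfrac{1}{2}\lVert \mathbf{w}\rVert^2 + C\sum_{i=1}^{N}\xi_i
\quad \text{s.t.}\quad y_i(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1 - \xi_i,\;\; \xi_i \ge 0.
\]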
47 Comparison of Classification Models
- Similarity between the logistic regression model and the support vector machine
The logistic regression model is almost identical to the support vector machine, except for a different expression for the classification error.
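One way to make this precise: with f(x) = w·x + b, both minimize a regularized sum of a margin-based loss over the training data and differ only in which loss they use:

\[
\text{Logistic regression:}\;\; \ell(y, f(x)) = \log\bigl(1 + e^{-y f(x)}\bigr),
\qquad
\text{SVM (hinge loss):}\;\; \ell(y, f(x)) = \max\bigl(0,\; 1 - y f(x)\bigr).
\]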
48 Comparison of Classification Models
- Generative models have trouble at the decision boundary