Title: Announcements
1. Announcements
- Homework 4 is delayed to next Thursday, 02/27
- Project proposal is due on 03/02
2. Hierarchical Mixture Expert Model
3. Good Things about Decision Trees
- Decision trees introduce nonlinearity through the tree structure
  - Viewing ABC as ABC
- Compared to kernel methods
  - Less ad hoc
  - Easier to understand
4. Generalize Decision Trees
From slides of Andrew Moore
5. Partition Datasets
- The goal of each node is to partition the data set into disjoint subsets such that each subset is easier to classify.
[Figure: the original dataset partitioned by a single attribute (cylinders = 4, 5, 6, 8).]
6. Partition Datasets (cont'd)
- More complicated partitions: partition by multiple attributes
  - e.g., Cylinders < 6 and Weight > 4 tons
  - e.g., Cylinders ≥ 6 and Weight < 3 tons
- Other cases: replace each node with a classification model
[Figure: the original dataset partitioned by multiple attributes.]
- How to accomplish such a complicated partition?
  - Each partition ↔ a classification
  - Partitioning a dataset into disjoint subsets ↔ classifying data points into different classes
7. A More General Decision Tree
- Each node is a linear classifier
[Figure: a decision tree using classifiers for data partition vs. a decision tree with simple data partition.]
8. General Scheme for Decision Trees
- Each node within the tree is a linear classifier (a minimal sketch follows this slide)
- Pros
  - Usually results in shallow trees
  - Introduces nonlinearity into linear classifiers (e.g. logistic regression)
  - Overcomes overfitting through the regularization mechanism within the classifier
  - A better way to deal with real-valued attributes
- Examples
  - Neural network
  - Hierarchical Mixture Expert Model
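To make the scheme concrete, here is a minimal, illustrative sketch (not from the slides) of a depth-two tree whose nodes are logistic-regression classifiers: a root classifier routes each point to one of two leaf classifiers, each trained only on the points routed to it. The class name, the assumption that routing labels are given rather than learned, and the use of scikit-learn's LogisticRegression are all choices made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

class TwoLevelLRTree:
    """Depth-two decision tree whose nodes are linear (logistic regression) classifiers."""

    def fit(self, X, y, route):
        # `route` is a 0/1 array saying which child each training point goes to;
        # in the HME of the next slides this assignment is hidden and learned with EM.
        self.root = LogisticRegression().fit(X, route)
        self.leaves = [
            LogisticRegression().fit(X[route == k], y[route == k]) for k in (0, 1)
        ]
        return self

    def predict(self, X):
        branch = self.root.predict(X)          # hard routing decision at the root
        out = np.empty(len(X), dtype=int)      # assumes integer class labels
        for k in (0, 1):                       # each leaf classifies its own points
            mask = branch == k
            if mask.any():
                out[mask] = self.leaves[k].predict(X[mask])
        return out
```

Because each leaf only sees the points its parent routes to it, the overall decision boundary is piecewise linear even though every node is a linear model.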
9. Example
Kernel method
10. Hierarchical Mixture Expert Model (HME)
[Figure: HME tree. Root gate r(x); group layer with gates g1(x) (Group 1) and g2(x) (Group 2); expert layer with classifiers m1,1(x), m1,2(x), m2,1(x), m2,2(x).]
11. Hierarchical Mixture Expert Model (HME)
[Figure: the same HME tree. The root gate outputs r(x) = 1, routing x to Group 1.]
12. Hierarchical Mixture Expert Model (HME)
[Figure: the same HME tree.]
13. Hierarchical Mixture Expert Model (HME)
[Figure: the same HME tree. Within Group 1, the gate outputs g1(x) = -1, routing x to expert m1,2.]
14. Hierarchical Mixture Expert Model (HME)
[Figure: the same HME tree. Expert m1,2 outputs m1,2(x) = 1, which is the predicted class label for x.]
15. Hierarchical Mixture Expert Model (HME)
More Complicated Case
[Figure: the same HME tree.]
16. Hierarchical Mixture Expert Model (HME)
More Complicated Case
- The root gate outputs r(+1|x) = 3/4, r(-1|x) = 1/4
[Figure: the same HME tree.]
17. Hierarchical Mixture Expert Model (HME)
More Complicated Case
- r(+1|x) = 3/4, r(-1|x) = 1/4
- Which expert should be used for classifying x?
[Figure: the same HME tree.]
18. Hierarchical Mixture Expert Model (HME)
More Complicated Case
- r(+1|x) = 3/4, r(-1|x) = 1/4
- g1(+1|x) = 1/4, g1(-1|x) = 3/4; g2(+1|x) = 1/2, g2(-1|x) = 1/2
- Expert outputs (columns: class +1, class -1):
            +1    -1
  m1,1(x)   1/4   3/4
  m1,2(x)   3/4   1/4
  m2,1(x)   1/4   3/4
  m2,2(x)   3/4   1/4
- p(+1|x) = 9/16, p(-1|x) = 7/16
[Figure: the same HME tree.]
19. Hierarchical Mixture Expert Model (HME)
Probabilistic Description
- Random variable g ∈ {1, 2}: r(+1|x) = p(g = 1|x), r(-1|x) = p(g = 2|x)
- Random variable m ∈ {11, 12, 21, 22}:
  g1(+1|x) = p(m = 11|x, g = 1), g1(-1|x) = p(m = 12|x, g = 1)
  g2(+1|x) = p(m = 21|x, g = 2), g2(-1|x) = p(m = 22|x, g = 2)
- m1,1(x), m1,2(x), m2,1(x), m2,2(x) are classifiers: x → {+1, -1} (a small computational sketch follows this slide)
[Figure: the same HME tree.]
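Under this description, the model's overall prediction marginalizes over the hidden group and expert variables: p(y|x) = Σ_g p(g|x) Σ_m p(m|x, g) p(y|x, m). Below is a minimal, hypothetical sketch of that computation; the function names and the toy numbers are illustrative, not from the slides.

```python
import numpy as np

def hme_predict_proba(x, r, gates, experts):
    """p(y|x) = sum_g p(g|x) * sum_m p(m|x,g) * p(y|x,m) for a two-level HME.

    r(x)             -> p(g|x),      array of shape (num_groups,)
    gates[g](x)      -> p(m|x, g),   array of shape (num_experts_in_group,)
    experts[g][m](x) -> p(y|x, m),   array of shape (num_classes,)
    """
    p_y = np.zeros_like(experts[0][0](x), dtype=float)
    for g, p_g in enumerate(r(x)):
        for m, p_m in enumerate(gates[g](x)):
            p_y += p_g * p_m * experts[g][m](x)
    return p_y  # sums to 1 whenever each component distribution does

# Toy usage with made-up constant distributions (order: [p(+1), p(-1)]):
r = lambda x: np.array([0.6, 0.4])
gates = [lambda x: np.array([0.3, 0.7]), lambda x: np.array([0.5, 0.5])]
experts = [[lambda x: np.array([0.9, 0.1]), lambda x: np.array([0.2, 0.8])],
           [lambda x: np.array([0.7, 0.3]), lambda x: np.array([0.4, 0.6])]]
print(hme_predict_proba(None, r, gates, experts))  # -> [0.466, 0.534]
```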
20. Hierarchical Mixture Expert Model (HME)
Probabilistic Description
- How to train the functions r(x), g(x), m(x)?
[Figure: the same HME tree, with training pairs (x, y).]
21. Problem with Training HME
- Consider applying logistic regression to model r(x), g(x), and m(x)
- No training examples for r(x), g(x)
  - For each training example (x, y), we don't know its group ID or expert ID.
  - So we can't apply the training procedure of logistic regression to train r(x) and g(x) directly.
- Random variables g, m are called hidden variables since they are not exposed in the training data.
- How to train a model with hidden variables?
22. Start with Random Guess
- Iteration 1: random guess
  - Randomly assign points to each group and expert
- Training examples: 1, 2, 3, 4, 5 (one class); 6, 7, 8, 9 (the other class)
[Figure: HME tree with the random assignment. Group 1: {1, 2} and {6, 7}; Group 2: {3, 4, 5} and {8, 9}. Experts: m1,1 ← {1, 6}; m1,2 ← {2, 7}; m2,1 ← {3, 9}; m2,2 ← {4, 5, 8}.]
23. Start with Random Guess
- Iteration 1: random guess
  - Randomly assign points to each group and expert
  - Learn the classifiers r(x), g(x), m(x) from the assigned points
- Training examples: 1, 2, 3, 4, 5 (one class); 6, 7, 8, 9 (the other class)
[Figure: the same random assignment. Group 1: {1, 2} and {6, 7}; Group 2: {3, 4, 5} and {8, 9}. Experts: m1,1 ← {1, 6}; m1,2 ← {2, 7}; m2,1 ← {3, 9}; m2,2 ← {4, 5, 8}.]
24. Start with Random Guess
- Iteration 1: random guess
  - Randomly assign points to each group and expert
  - Learn the classifiers r(x), g(x), m(x)
  - Apply the learned classifier r(x) to reclassify all data points
- Training examples: 1, 2, 3, 4, 5 (one class); 6, 7, 8, 9 (the other class)
[Figure: the same HME tree with the iteration-1 assignments (Group 1: {1, 2}, {6, 7}; Group 2: {3, 4, 5}, {8, 9}; experts m1,1 ← {1, 6}, m1,2 ← {2, 7}, m2,1 ← {3, 9}, m2,2 ← {4, 5, 8}), plus a table showing how the learned r(x) reassigns points 1-5 between Group 1 and Group 2.]
25. Start with Random Guess
- Iteration 2: regroup
  - Reassign group membership to each data point
- New grouping: Group 1: {1, 5} and {6, 7}; Group 2: {2, 3, 4} and {8, 9}
- Expert scores for points 1-5 ("x" = point not in that expert's group):
            1      2      3      4      5
  m1,1     0.98    x      x      x     0.51
  m1,2     0.61    x      x      x     0.90
  m2,1      x     0.7    0.90   0.7     x
  m2,2      x     0.6    0.5    0.8     x
[Figure: the same HME tree.]
26. Start with Random Guess
- Iteration 2: regroup
  - Reassign the group membership to each data point
  - Reassign the expert membership to each data point
- New grouping: Group 1: {1, 5} and {6, 7}; Group 2: {2, 3, 4} and {8, 9}
- Expert scores for points 1-5 ("x" = point not in that expert's group):
            1      2      3      4      5
  m1,1     0.98    x      x      x     0.51
  m1,2     0.61    x      x      x     0.90
  m2,1      x     0.7    0.90   0.7     x
  m2,2      x     0.6    0.5    0.8     x
- New expert assignment: m1,1 ← {1, 6}; m1,2 ← {5, 7}; m2,1 ← {2, 3, 9}; m2,2 ← {4, 8}
[Figure: the same HME tree.]
27. Start with Random Guess
- Iteration 2: regroup
  - Reassign the group membership to each data point
  - Reassign the expert membership to each data point
  - Retrain r(x), g(x) and m(x)
- New grouping: Group 1: {1, 5} and {6, 7}; Group 2: {2, 3, 4} and {8, 9}
- Expert scores for points 1-5 ("x" = point not in that expert's group):
            1      2      3      4      5
  m1,1     0.98    x      x      x     0.51
  m1,2     0.61    x      x      x     0.90
  m2,1      x     0.7    0.90   0.7     x
  m2,2      x     0.6    0.5    0.8     x
- New expert assignment: m1,1 ← {1, 6}; m1,2 ← {5, 7}; m2,1 ← {2, 3, 9}; m2,2 ← {4, 8}
[Figure: the same HME tree.]
28. Start with Random Guess
- Iteration 2: regroup
  - Reassign the group membership to each data point
  - Reassign the expert membership to each data point
  - Retrain r(x), g(x) and m(x)
- Repeat the above procedure until it converges (it is guaranteed to converge to a local optimum)
[Figure: the same HME tree. Group 1: {1, 5} and {6, 7}; Group 2: {2, 3, 4} and {8, 9}; experts: m1,1 ← {1, 6}; m1,2 ← {5, 7}; m2,1 ← {2, 3, 9}; m2,2 ← {4, 8}.]
This is the famous Expectation-Maximization (EM) algorithm!
29. Formal EM algorithm for HME
- Two things need to be estimated (a code sketch of the resulting EM loop follows this slide)
  - Logistic regression models for r(x; θr), g(x; θg) and m(x; θm)
  - Unknown group memberships and expert memberships
    - i.e., p(g = 1, 2 | x), p(m = 11, 12 | x, g = 1), p(m = 21, 22 | x, g = 2)
- E-step
  - Estimate p(g = 1 | x) and p(g = 2 | x) for each training example, given the current guesses of r(x; θr), g(x; θg) and m(x; θm)
  - Estimate p(m = 11, 12 | x, g = 1) and p(m = 21, 22 | x, g = 2) for all training examples, given the current guesses of r(x; θr), g(x; θg) and m(x; θm)
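Below is a minimal, hypothetical Python sketch of the iterative loop described on the preceding slides, under several simplifying assumptions: two groups with two experts each, scikit-learn's LogisticRegression for every node, hard reassignments in place of the soft posteriors of the formal E-step, and no handling of degenerate (empty or single-class) subsets. All names are illustrative, not from the slides.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_lr(X, y):
    # Helper: fit one logistic regression node. Real code would guard against
    # empty or single-class subsets, which this sketch does not.
    return LogisticRegression().fit(X, y)

def train_hme_hard_em(X, y, n_iter=10, seed=0):
    """Hard-assignment EM for a 2-group / 2-experts-per-group HME:
    random guess -> train r, g, m on the assignments -> reassign -> repeat.
    X is an (n, d) numpy array, y an (n,) array of class labels."""
    rng = np.random.default_rng(seed)
    n = len(X)
    group = rng.integers(0, 2, n)    # iteration 1: random group memberships
    expert = rng.integers(0, 2, n)   # iteration 1: random expert memberships

    for _ in range(n_iter):
        # "M-step": retrain the root gate, group gates, and experts.
        r = fit_lr(X, group)
        g = [fit_lr(X[group == k], expert[group == k]) for k in (0, 1)]
        m = [[fit_lr(X[(group == k) & (expert == j)],
                     y[(group == k) & (expert == j)]) for j in (0, 1)]
             for k in (0, 1)]

        # "E-step" (hard version): reassign groups with the learned r(x) ...
        group = r.predict(X)
        # ... and reassign each point to the expert in its group that gives
        # the highest probability to the point's true label (the score table
        # on the slides). The formal E-step would use soft posteriors instead.
        for i in range(n):
            k = int(group[i])
            probs = [m[k][j].predict_proba(X[i:i + 1])[0][
                         list(m[k][j].classes_).index(y[i])] for j in (0, 1)]
            expert[i] = int(np.argmax(probs))
    return r, g, m
```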
30. Comparison of Classification Models
- The goal of a classifier
  - Predicting the class label y for an input x
  - Estimate p(y|x)
- Gaussian generative model
  - p(y|x) ∝ p(x|y) p(y): posterior ∝ likelihood × prior
  - Difficulty in estimating p(x|y) if x comprises multiple elements
  - Naïve Bayes: p(x|y) ≈ p(x1|y) p(x2|y) ... p(xm|y) (written out below)
- Linear discriminative model
  - Estimate p(y|x) directly
  - Focus on finding the decision boundary
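Written out, the generative posterior from the bullets above, together with its naïve Bayes approximation, is:

```latex
p(y \mid x) = \frac{p(x \mid y)\, p(y)}{\sum_{y'} p(x \mid y')\, p(y')}
\approx \frac{p(y)\,\prod_{j=1}^{m} p(x_j \mid y)}{\sum_{y'} p(y')\,\prod_{j=1}^{m} p(x_j \mid y')}
```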
31. Comparison of Classification Models
- Logistic regression model
  - A linear decision boundary: w·x + b
  - A probabilistic model for p(y|x) (written out below)
  - Maximum likelihood approach for estimating the weights w and the threshold b
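For reference, the standard logistic regression model being described here, with labels y ∈ {+1, -1}, is:

```latex
p(y \mid x) = \frac{1}{1 + \exp\!\big(-y\,(\mathbf{w}^{\top}\mathbf{x} + b)\big)},
\qquad y \in \{+1, -1\}
```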
32. Comparison of Classification Models
- Logistic regression model
  - Overfitting issue
    - In text classification, a word that appears in only one document will be assigned an infinitely large weight
    - Solution: regularization (see the objective below)
- Conditional exponential model
- Maximum entropy model
  - A dual problem of the conditional exponential model
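One standard way to write the regularized training objective alluded to above; the L2 penalty is an illustrative choice, since the slides do not specify the form of the regularizer:

```latex
\max_{\mathbf{w},\, b}\;
-\sum_{i=1}^{n} \log\!\Big(1 + e^{-y_i(\mathbf{w}^{\top}\mathbf{x}_i + b)}\Big)
\;-\; \frac{\lambda}{2}\,\lVert \mathbf{w} \rVert^{2}
```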
33. Comparison of Classification Models
- Support vector machine
  - Classification margin
  - Maximum margin principle: two objectives
    - Minimize the classification error over the training data
    - Maximize the classification margin
  - Support vectors
    - Only the support vectors have an impact on the location of the decision boundary
34. Comparison of Classification Models
- Separable case
- Noisy case
Quadratic programming!
35. Comparison of Classification Models
- Similarity between the logistic regression model and the support vector machine
  - The logistic regression model is almost identical to the support vector machine, except for a different expression for the classification error (compare the two objectives below).
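To make the similarity explicit, here are standard regularized forms of the two training objectives (illustrative, not copied from the slides); they differ only in how the per-example classification error is expressed, hinge loss versus logistic loss:

```latex
\text{SVM:}\qquad
\min_{\mathbf{w},\, b}\;
\frac{1}{2}\lVert \mathbf{w} \rVert^{2}
+ C \sum_{i=1}^{n} \max\!\big(0,\; 1 - y_i(\mathbf{w}^{\top}\mathbf{x}_i + b)\big)

\text{Logistic regression:}\qquad
\min_{\mathbf{w},\, b}\;
\frac{\lambda}{2}\lVert \mathbf{w} \rVert^{2}
+ \sum_{i=1}^{n} \log\!\Big(1 + e^{-y_i(\mathbf{w}^{\top}\mathbf{x}_i + b)}\Big)
```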
36. Comparison of Classification Models
- Generative models have trouble at the decision boundary