Title: Announcements
1. Announcements
- Homework 4 is due this Thursday (02/27/2004)
- Project proposal is due on 03/02
2. Hierarchical Mixture Expert Model
3. Decision Trees
- Pro
  - Brings nonlinearity into the model
- Con
  - Each split is based on only a single attribute
4. Generalizing Decision Trees
[Figure: left, a decision tree with simple single-attribute data partition; right, a decision tree that uses classifiers for data partition, where each node is a linear classifier]
5. Generalized Decision Trees
- Each node is a linear classifier
- Pro
  - Usually results in shallow trees
  - Introduces nonlinearity into linear classifiers (e.g., logistic regression)
  - Overcomes overfitting through the regularization mechanism within the classifier
  - A better way to deal with real-valued attributes
- Examples
  - Neural networks
  - Hierarchical Mixture Expert Model
6. Example
[Figure: kernel method]
7. Hierarchical Mixture Expert Model (HME)
- Ask r(x) which group should be used for classifying input x?
- If group 1 is chosen, which classifier m(x) should be used?
- Classify input x using the chosen classifier m(x)
8. Hierarchical Mixture Expert Model (HME): Probabilistic Description
- Two hidden variables
  - The hidden variable for groups: g ∈ {1, 2}
  - The hidden variable for classifiers: m ∈ {11, 12, 21, 22}
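With these hidden variables, the predictive distribution marginalizes over both the group and the chosen expert. This is the standard HME mixture form, written in the slides' notation (using j for the expert index within a group):

  p(y|x) = Σ_{g=1..2} r(g|x) · Σ_{j=1..2} g_g(j|x) · m_{g,j}(y|x)

Here r is the top-level gate over groups, g_g is the gate over experts within group g, and m_{g,j} is the expert classifier.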
9. Hierarchical Mixture Expert Model (HME): Example
r(1|x) = ¾, r(-1|x) = ¼
g1(1|x) = ¼, g1(-1|x) = ¾
g2(1|x) = ½, g2(-1|x) = ½

            y = 1    y = -1
m1,1(x)     ¼        ¾
m1,2(x)     ¾        ¼
m2,1(x)     ¼        ¾
m2,2(x)     ¾        ¼

p(1|x) = ?, p(-1|x) = ?
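Applying the mixture formula from the previous slide to these numbers (a worked computation; the slide leaves the answer as a question), reading r's output 1 as group 1 and -1 as group 2, and likewise for each g:

  p(1|x)  = ¾ (¼·¼ + ¾·¾) + ¼ (½·¼ + ½·¾) = ¾·(10/16) + ¼·(8/16) = 19/32 ≈ 0.59
  p(-1|x) = 1 - 19/32 = 13/32 ≈ 0.41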
10. Training HME
- The training examples are (xi, yi)
- There is no information about r(x) or g(x) for each example
- The random variables g, m are called hidden variables, since they are not exposed in the training data
- How do we train a model with hidden variables?
11. Start with a Random Guess
Training points: 1, 2, 3, 4, 5 and 6, 7, 8, 9
- Random assignment
  - Randomly assign points to each group and expert
  - Learn classifiers r(x), g(x), m(x) using the randomly assigned points
[Figure: HME tree after the random assignment. Group layer: g1(x) gets {1, 2} and {6, 7}; g2(x) gets {3, 4, 5} and {8, 9}. Expert layer: m1,1(x) gets {1, 6}; m1,2(x) gets {2, 7}; m2,1(x) gets {3, 9}; m2,2(x) gets {4, 5, 8}]
12. Adjust Group Memberships
- The key is to assign each data point to the group that classifies it correctly with the largest probability
- How?
[Figure: the same HME tree and point assignments as on the previous slide]
13. Adjust Group Memberships
- The key is to assign each data point to the group that classifies it correctly with the largest confidence
- Compute p(g1|x, y) and p(g2|x, y)
[Figure: the same HME tree, now annotated with the group posteriors below for points 1-5]
Posterior probabilities for groups:

Point    Group 1    Group 2
1        0.8        0.2
2        0.4        0.6
3        0.3        0.7
4        0.1        0.9
5        0.65       0.35
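The group posteriors in this table follow from Bayes' rule: p(g|x, y) ∝ r(g|x) · Σ_j g_g(j|x) · m_{g,j}(y|x). Below is a minimal sketch of that computation, assuming the current models' probabilities are available as plain numbers (the function name and data layout are hypothetical, not from the slides):

```python
import numpy as np

def group_posterior(r, g, m, y):
    """p(group | x, y) for one example under the current HME models.

    r[k]       : r(group k | x)                    -- length-2 array
    g[k][j]    : g_k(expert j | x)                 -- 2 x 2 nested list
    m[k][j][y] : m_{k,j}(y | x), a dict keyed by the label y
    """
    # Likelihood of the observed label under each group,
    # marginalizing over that group's experts.
    like = np.array([sum(g[k][j] * m[k][j][y] for j in range(2))
                     for k in range(2)])
    joint = r * like             # proportional to p(group, y | x)
    return joint / joint.sum()   # normalize to get p(group | x, y)

# Example with the numbers from the HME example slide (slide 9):
r = np.array([3/4, 1/4])
g = [[1/4, 3/4], [1/2, 1/2]]
m = [[{1: 1/4, -1: 3/4}, {1: 3/4, -1: 1/4}],
     [{1: 1/4, -1: 3/4}, {1: 3/4, -1: 1/4}]]
print(group_posterior(r, g, m, y=1))   # ~ [0.79, 0.21]
```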
14. Adjust Memberships for Classifiers
- The key is to assign each data point to the classifier that classifies it correctly with the largest confidence
- Compute p(m1,1|x, y), p(m1,2|x, y), p(m2,1|x, y), p(m2,2|x, y)
[Figure: HME tree after the group adjustment. Group layer: g1(x) gets {1, 5} and {6, 7}; g2(x) gets {2, 3, 4} and {8, 9}. Expert layer: m1,1(x), m1,2(x), m2,1(x), m2,2(x), not yet reassigned]
15. Adjust Memberships for Classifiers
- The key is to assign each data point to the classifier that classifies it correctly with the largest confidence
- Compute p(m1,1|x, y), p(m1,2|x, y), p(m2,1|x, y), p(m2,2|x, y)
[Figure: the same HME tree as on the previous slide]
Posterior probabilities for classifiers:

Point    1       2       3       4       5
m1,1     0.7     0.1     0.15    0.1     0.05
m1,2     0.2     0.2     0.20    0.1     0.55
m2,1     0.05    0.5     0.60    0.1     0.3
m2,2     0.05    0.2     0.05    0.7     0.1
16. Adjust Memberships for Classifiers
(Repeats the posterior table from the previous slide.)
17. Adjust Memberships for Classifiers
- The key is to assign each data point to the classifier that classifies it correctly with the largest confidence
- Using the posterior table above, each point goes to the classifier with the largest posterior
[Figure: HME tree after the expert adjustment. Group layer: g1(x) gets {1, 5} and {6, 7}; g2(x) gets {2, 3, 4} and {8, 9}. Expert layer: m1,1(x) gets {1, 6}; m1,2(x) gets {5, 7}; m2,1(x) gets {2, 3, 9}; m2,2(x) gets {4, 8}]
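The expert posteriors are computed the same way, one level deeper: p(m_{g,j}|x, y) ∝ r(g|x) · g_g(j|x) · m_{g,j}(y|x). A sketch reusing the hypothetical data layout from the group-posterior example above:

```python
import numpy as np

def expert_posterior(r, g, m, y):
    """p(expert | x, y) for one example, as a 2 x 2 array indexed by
    (group, expert within group). Entries sum to 1, and each row sums
    to the corresponding group posterior p(group | x, y)."""
    joint = np.array([[r[k] * g[k][j] * m[k][j][y] for j in range(2)]
                      for k in range(2)])
    return joint / joint.sum()
```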
18. Retrain the Model
Training points: 1, 2, 3, 4, 5 and 6, 7, 8, 9
- Retrain r(x), g(x), m(x) using the new
memberships
[Figure: HME tree with the new memberships. Group layer: g1(x) gets {1, 5, 6, 7}; g2(x) gets {2, 3, 4, 8, 9}. Expert layer: m1,1(x) gets {1, 6}; m1,2(x) gets {5, 7}; m2,1(x) gets {2, 3, 9}; m2,2(x) gets {4, 8}]
19. Expectation Maximization
- Two things need to be estimated
  - The logistic regression models r(x; θr), g(x; θg), and m(x; θm)
  - The unknown group and expert memberships, i.e. p(g = 1, 2 | x), p(m = 11, 12 | x, g = 1), p(m = 21, 22 | x, g = 2)
- E-step
  - Estimate p(g = 1 | x, y) and p(g = 2 | x, y) for all training examples, given the guessed r(x; θr), g(x; θg), and m(x; θm)
  - Estimate p(m = 11, 12 | x, y) and p(m = 21, 22 | x, y) for all training examples, given the guessed r(x; θr), g(x; θg), and m(x; θm)
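Written out, the E-step posteriors are (standard EM for this model, with p(y|x) the HME mixture from slide 8):

  p(g | x, y)    = r(g|x) · Σ_j g_g(j|x) · m_{g,j}(y|x) / p(y|x)
  p(g, j | x, y) = r(g|x) · g_g(j|x) · m_{g,j}(y|x) / p(y|x)

The M-step is the "retrain" step on the previous slide: refit r, g, and m by maximum likelihood, weighting each example by these posteriors (the running example uses the hard argmax assignment for intuition).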
20. Comparison of Different Classification Models
- The goal of all classifiers
  - Predicting the class label y for an input x
  - Estimating p(y|x)
- Gaussian generative model
  - p(y|x) ∝ p(x|y) p(y), i.e. posterior ∝ likelihood × prior
  - p(x|y)
    - Describes the input patterns for each class y
    - Difficult to estimate if x is of high dimensionality
    - Naïve Bayes: p(x|y) ≈ p(x1|y) p(x2|y) ... p(xm|y)
    - Essentially a linear model (see the note below)
- Linear discriminative model
  - Directly estimates p(y|x)
  - Focuses on finding the decision boundary
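The "essentially a linear model" remark follows from taking logs: log p(y|x) = log p(y) + Σ_j log p(xj|y) + const, which is a linear function of indicator (or count) features of x, so the Naïve Bayes decision boundary between two classes is linear in those features.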
21. Comparison of Different Classification Models
- Logistic regression model
  - A linear decision boundary: w·x + b
  - A probabilistic model for p(y|x) (written out below)
  - Maximum likelihood approach for estimating the weights w and the threshold b
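Written out (the standard binary logistic regression model, with y ∈ {+1, -1}):

  p(y|x) = 1 / (1 + exp(-y (w·x + b)))

and maximum likelihood chooses w and b to maximize Σ_i log p(yi|xi).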
22. Comparison of Different Classification Models
- Logistic regression model
  - Overfitting issue
  - Example: text classification
    - Every word is assigned a different weight
    - Words that appear in only one document will be assigned an infinitely large weight
  - Solution: regularization (a common form is sketched below)
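A common form of the regularized objective hinted at here (an illustrative choice; the slide does not fix the penalty):

  max_{w,b}  Σ_i log p(yi|xi; w, b) - λ ||w||²

The penalty term keeps the weights of rare words from growing without bound.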
23. Comparison of Different Classification Models
- Conditional exponential model
  - An extension of the logistic regression model to the multi-class case
  - A different set of weights wy and threshold by for each class y (see the form below)
- Maximum entropy model
  - Finds the simplest model that matches the data
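In this notation, the conditional exponential (softmax) model is:

  p(y|x) = exp(wy·x + by) / Σ_{y'} exp(w_{y'}·x + b_{y'})

which reduces to logistic regression when there are only two classes.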
24. Comparison of Different Classification Models
- Support vector machine
  - Classification margin
  - Maximum margin principle
    - Separate the data as far as possible from the decision boundary
  - Two objectives
    - Minimize the classification error over the training data
    - Maximize the classification margin
  - Support vectors
    - Only the support vectors have an impact on the location of the decision boundary
[Figure: linearly separable data with the maximum-margin boundary; one marker denotes class +1, the other class -1]
25. Comparison of Different Classification Models
- Separable case
- Noisy case
Quadratic programming!
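The quadratic program referred to here, in its usual soft-margin form (the noisy case; dropping the slack variables ξi gives the separable case):

  min_{w,b,ξ}  ½||w||² + C Σ_i ξi
  subject to   yi (w·xi + b) ≥ 1 - ξi,  ξi ≥ 0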
26. Comparison of Classification Models
- Logistic regression model vs. support vector
machine
27. Comparison of Different Classification Models
Logistic regression differs from the support vector machine only in the loss function
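Both models fit a linear score f(x) = w·x + b and penalize small or negative margins z = y f(x); the only difference is the loss applied to z:

  SVM (hinge loss):               loss(z) = max(0, 1 - z)
  Logistic regression (log loss): loss(z) = log(1 + exp(-z))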
28. Comparison of Different Classification Models
- Generative models have trouble at the decision
boundary
29. Nonlinear Models
- Kernel methods
  - Add additional dimensions to help separate the data
  - Efficiently compute the dot product in the high-dimensional space (see the sketch below)
[Figure: kernel method mapping the data to a higher-dimensional space]
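A minimal sketch of the "efficient dot product" point, using the degree-2 polynomial kernel as the illustration (the helper names are hypothetical):

```python
import numpy as np

def poly_features(x):
    """Explicit degree-2 feature map for a 2-d input:
    phi(x) = (x1^2, x2^2, sqrt(2)*x1*x2)."""
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

def poly_kernel(x, z):
    """The same inner product, computed directly in the input space."""
    return np.dot(x, z) ** 2

x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(np.dot(poly_features(x), poly_features(z)))  # 1.0
print(poly_kernel(x, z))                           # 1.0, identical value
```

The kernel evaluates the high-dimensional dot product without ever materializing the added dimensions, which is what makes kernel methods practical.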
30. Nonlinear Models
- Decision trees
  - Nonlinearly combine different features through a tree structure
- Hierarchical Mixture Expert Model
  - Replaces each tree node with a logistic regression model
  - Nonlinearly combines multiple linear models