Title: Announcements
1. Announcements
- Homework 4 is due this Thursday (02/27/2004)
- Project proposal is due on 03/02
2. Hierarchical Mixture Expert Model
3. Decision Trees
- Pro
  - Brings nonlinearity into the model
- Con
  - Each split is based on only a single attribute
4. Generalizing Decision Trees
[Figure: left, a decision tree with simple single-attribute data partition; right, a decision tree that uses classifiers for data partition, where each node is a linear classifier]
5. Generalized Decision Trees
- Each node is a linear classifier
- Pro
  - Usually results in shallow trees
  - Introduces nonlinearity into linear classifiers (e.g., logistic regression)
  - Overcomes overfitting through the regularization mechanism within the classifier
  - A better way to deal with real-valued attributes
- Examples
  - Neural networks
  - Hierarchical Mixture Expert Model
6. Example
[Figure: kernel method]
7. Hierarchical Mixture Expert Model (HME)
- Ask r(x) which group should be used for classifying input x?
- If group 1 is chosen, which classifier m(x) should be used?
- Classify input x using the chosen classifier m(x)
8. Hierarchical Mixture Expert Model (HME): Probabilistic Description
- Two hidden variables
  - The hidden variable for groups: g ∈ {1, 2}
  - The hidden variable for classifiers: m ∈ {11, 12, 21, 22}
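With these hidden variables, the predictive distribution marginalizes over both the group and the chosen expert. This is the standard HME mixture form, written in the slides' notation (using j for the expert index within a group):

  p(y|x) = Σ_{g=1..2} r(g|x) · Σ_{j=1..2} g_g(j|x) · m_{g,j}(y|x)

Here r is the top-level gate over groups, g_g is the gate over experts within group g, and m_{g,j} is the expert classifier.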
9. Hierarchical Mixture Expert Model (HME): Example
r(1|x) = ¾, r(-1|x) = ¼
g1(1|x) = ¼, g1(-1|x) = ¾
g2(1|x) = ½, g2(-1|x) = ½

            y = 1    y = -1
m1,1(x)     ¼        ¾
m1,2(x)     ¾        ¼
m2,1(x)     ¼        ¾
m2,2(x)     ¾        ¼

p(1|x) = ?, p(-1|x) = ?
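Applying the mixture formula from the previous slide to these numbers (a worked computation; the slide leaves the answer as a question), reading r's output 1 as group 1 and -1 as group 2, and likewise for each g:

  p(1|x)  = ¾ (¼·¼ + ¾·¾) + ¼ (½·¼ + ½·¾) = ¾·(10/16) + ¼·(8/16) = 19/32 ≈ 0.59
  p(-1|x) = 1 - 19/32 = 13/32 ≈ 0.41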
10. Training HME
- The training examples are (xi, yi)
- There is no information about r(x) or g(x) for each example
- The random variables g, m are called hidden variables, since they are not exposed in the training data
- How do we train a model with hidden variables?
11. Start with a Random Guess
Training points: 1, 2, 3, 4, 5 and 6, 7, 8, 9
- Random assignment
  - Randomly assign points to each group and expert
  - Learn classifiers r(x), g(x), m(x) using the randomly assigned points
[Figure: HME tree after the random assignment. Group layer: g1(x) gets {1, 2} and {6, 7}; g2(x) gets {3, 4, 5} and {8, 9}. Expert layer: m1,1(x) gets {1, 6}; m1,2(x) gets {2, 7}; m2,1(x) gets {3, 9}; m2,2(x) gets {4, 5, 8}]
12. Adjust Group Memberships
- The key is to assign each data point to the group that classifies it correctly with the largest probability
- How?
[Figure: the same HME tree and point assignments as on the previous slide]
13. Adjust Group Memberships
- The key is to assign each data point to the group that classifies it correctly with the largest confidence
- Compute p(g1|x, y) and p(g2|x, y)
[Figure: the same HME tree, now annotated with the group posteriors below for points 1-5]
Posterior probabilities for groups:

Point    Group 1    Group 2
1        0.8        0.2
2        0.4        0.6
3        0.3        0.7
4        0.1        0.9
5        0.65       0.35
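The group posteriors in this table follow from Bayes' rule: p(g|x, y) ∝ r(g|x) · Σ_j g_g(j|x) · m_{g,j}(y|x). Below is a minimal sketch of that computation, assuming the current models' probabilities are available as plain numbers (the function name and data layout are hypothetical, not from the slides):

```python
import numpy as np

def group_posterior(r, g, m, y):
    """p(group | x, y) for one example under the current HME models.

    r[k]       : r(group k | x)                    -- length-2 array
    g[k][j]    : g_k(expert j | x)                 -- 2 x 2 nested list
    m[k][j][y] : m_{k,j}(y | x), a dict keyed by the label y
    """
    # Likelihood of the observed label under each group,
    # marginalizing over that group's experts.
    like = np.array([sum(g[k][j] * m[k][j][y] for j in range(2))
                     for k in range(2)])
    joint = r * like             # proportional to p(group, y | x)
    return joint / joint.sum()   # normalize to get p(group | x, y)

# Example with the numbers from the HME example slide (slide 9):
r = np.array([3/4, 1/4])
g = [[1/4, 3/4], [1/2, 1/2]]
m = [[{1: 1/4, -1: 3/4}, {1: 3/4, -1: 1/4}],
     [{1: 1/4, -1: 3/4}, {1: 3/4, -1: 1/4}]]
print(group_posterior(r, g, m, y=1))   # ~ [0.79, 0.21]
```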
14. Adjust Memberships for Classifiers
- The key is to assign each data point to the classifier that classifies it correctly with the largest confidence
- Compute p(m1,1|x, y), p(m1,2|x, y), p(m2,1|x, y), p(m2,2|x, y)
[Figure: HME tree after the group adjustment. Group layer: g1(x) gets {1, 5} and {6, 7}; g2(x) gets {2, 3, 4} and {8, 9}. Expert layer: m1,1(x), m1,2(x), m2,1(x), m2,2(x), not yet reassigned]
15. Adjust Memberships for Classifiers
- The key is to assign each data point to the classifier that classifies it correctly with the largest confidence
- Compute p(m1,1|x, y), p(m1,2|x, y), p(m2,1|x, y), p(m2,2|x, y)
[Figure: the same HME tree as on the previous slide]
Posterior probabilities for classifiers:

Point    1       2       3       4       5
m1,1     0.7     0.1     0.15    0.1     0.05
m1,2     0.2     0.2     0.20    0.1     0.55
m2,1     0.05    0.5     0.60    0.1     0.3
m2,2     0.05    0.2     0.05    0.7     0.1
16. Adjust Memberships for Classifiers
(Repeats the posterior table from the previous slide.)
17. Adjust Memberships for Classifiers
- The key is to assign each data point to the classifier that classifies it correctly with the largest confidence
- Using the posterior table above, each point goes to the classifier with the largest posterior
[Figure: HME tree after the expert adjustment. Group layer: g1(x) gets {1, 5} and {6, 7}; g2(x) gets {2, 3, 4} and {8, 9}. Expert layer: m1,1(x) gets {1, 6}; m1,2(x) gets {5, 7}; m2,1(x) gets {2, 3, 9}; m2,2(x) gets {4, 8}]
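The expert posteriors are computed the same way, one level deeper: p(m_{g,j}|x, y) ∝ r(g|x) · g_g(j|x) · m_{g,j}(y|x). A sketch reusing the hypothetical data layout from the group-posterior example above:

```python
import numpy as np

def expert_posterior(r, g, m, y):
    """p(expert | x, y) for one example, as a 2 x 2 array indexed by
    (group, expert within group). Entries sum to 1, and each row sums
    to the corresponding group posterior p(group | x, y)."""
    joint = np.array([[r[k] * g[k][j] * m[k][j][y] for j in range(2)]
                      for k in range(2)])
    return joint / joint.sum()
```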
18. Retrain the Model
Training points: 1, 2, 3, 4, 5 and 6, 7, 8, 9
- Retrain r(x), g(x), m(x) using the new
memberships
[Figure: HME tree with the new memberships. Group layer: g1(x) gets {1, 5, 6, 7}; g2(x) gets {2, 3, 4, 8, 9}. Expert layer: m1,1(x) gets {1, 6}; m1,2(x) gets {5, 7}; m2,1(x) gets {2, 3, 9}; m2,2(x) gets {4, 8}]
19. Expectation Maximization
- Two things need to be estimated
  - The logistic regression models r(x; θr), g(x; θg), and m(x; θm)
  - The unknown group and expert memberships, i.e. p(g = 1, 2 | x), p(m = 11, 12 | x, g = 1), p(m = 21, 22 | x, g = 2)
- E-step
  - Estimate p(g = 1 | x, y) and p(g = 2 | x, y) for all training examples, given the guessed r(x; θr), g(x; θg), and m(x; θm)
  - Estimate p(m = 11, 12 | x, y) and p(m = 21, 22 | x, y) for all training examples, given the guessed r(x; θr), g(x; θg), and m(x; θm)
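Written out, the E-step posteriors are (standard EM for this model, with p(y|x) the HME mixture from slide 8):

  p(g | x, y)    = r(g|x) · Σ_j g_g(j|x) · m_{g,j}(y|x) / p(y|x)
  p(g, j | x, y) = r(g|x) · g_g(j|x) · m_{g,j}(y|x) / p(y|x)

The M-step is the "retrain" step on the previous slide: refit r, g, and m by maximum likelihood, weighting each example by these posteriors (the running example uses the hard argmax assignment for intuition).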
20. Comparison of Different Classification Models
- The goal of all classifiers
  - Predicting the class label y for an input x
  - Estimating p(y|x)
- Gaussian generative model
  - p(y|x) ∝ p(x|y) p(y), i.e. posterior ∝ likelihood × prior
  - p(x|y)
    - Describes the input patterns for each class y
    - Difficult to estimate if x is of high dimensionality
    - Naïve Bayes: p(x|y) ≈ p(x1|y) p(x2|y) ... p(xm|y)
    - Essentially a linear model (see the note below)
- Linear discriminative model
  - Directly estimates p(y|x)
  - Focuses on finding the decision boundary
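The "essentially a linear model" remark follows from taking logs: log p(y|x) = log p(y) + Σ_j log p(xj|y) + const, which is a linear function of indicator (or count) features of x, so the Naïve Bayes decision boundary between two classes is linear in those features.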
21. Comparison of Different Classification Models
- Logistic regression model
  - A linear decision boundary: w·x + b
  - A probabilistic model for p(y|x) (written out below)
  - Maximum likelihood approach for estimating the weights w and the threshold b
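Written out (the standard binary logistic regression model, with y ∈ {+1, -1}):

  p(y|x) = 1 / (1 + exp(-y (w·x + b)))

and maximum likelihood chooses w and b to maximize Σ_i log p(yi|xi).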
22. Comparison of Different Classification Models
- Logistic regression model
  - Overfitting issue
  - Example: text classification
    - Every word is assigned a different weight
    - Words that appear in only one document will be assigned an infinitely large weight
  - Solution: regularization (a common form is sketched below)
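A common form of the regularized objective hinted at here (an illustrative choice; the slide does not fix the penalty):

  max_{w,b}  Σ_i log p(yi|xi; w, b) - λ ||w||²

The penalty term keeps the weights of rare words from growing without bound.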
23. Comparison of Different Classification Models
- Conditional exponential model
  - An extension of the logistic regression model to the multi-class case
  - A different set of weights wy and threshold by for each class y (see the form below)
- Maximum entropy model
  - Finds the simplest model that matches the data
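In this notation, the conditional exponential (softmax) model is:

  p(y|x) = exp(wy·x + by) / Σ_{y'} exp(w_{y'}·x + b_{y'})

which reduces to logistic regression when there are only two classes.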
24. Comparison of Different Classification Models
- Support vector machine
  - Classification margin
  - Maximum margin principle
    - Separate the data as far as possible from the decision boundary
  - Two objectives
    - Minimize the classification error over the training data
    - Maximize the classification margin
  - Support vectors
    - Only the support vectors have an impact on the location of the decision boundary
[Figure: linearly separable data with the maximum-margin boundary; one marker denotes class +1, the other class -1]
25. Comparison of Different Classification Models
- Separable case
- Noisy case
Quadratic programming!
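The quadratic program referred to here, in its usual soft-margin form (the noisy case; dropping the slack variables ξi gives the separable case):

  min_{w,b,ξ}  ½||w||² + C Σ_i ξi
  subject to   yi (w·xi + b) ≥ 1 - ξi,  ξi ≥ 0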
26. Comparison of Classification Models
- Logistic regression model vs. support vector
machine
27. Comparison of Different Classification Models
Logistic regression differs from the support vector machine only in the loss function
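Both models fit a linear score f(x) = w·x + b and penalize small or negative margins z = y f(x); the only difference is the loss applied to z:

  SVM (hinge loss):               loss(z) = max(0, 1 - z)
  Logistic regression (log loss): loss(z) = log(1 + exp(-z))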
28. Comparison of Different Classification Models
- Generative models have trouble at the decision
boundary
29. Nonlinear Models
- Kernel methods
  - Add additional dimensions to help separate the data
  - Efficiently compute the dot product in the high-dimensional space (see the sketch below)
[Figure: kernel method mapping the data to a higher-dimensional space]
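A minimal sketch of the "efficient dot product" point, using the degree-2 polynomial kernel as the illustration (the helper names are hypothetical):

```python
import numpy as np

def poly_features(x):
    """Explicit degree-2 feature map for a 2-d input:
    phi(x) = (x1^2, x2^2, sqrt(2)*x1*x2)."""
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

def poly_kernel(x, z):
    """The same inner product, computed directly in the input space."""
    return np.dot(x, z) ** 2

x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(np.dot(poly_features(x), poly_features(z)))  # 1.0
print(poly_kernel(x, z))                           # 1.0, identical value
```

The kernel evaluates the high-dimensional dot product without ever materializing the added dimensions, which is what makes kernel methods practical.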
30. Nonlinear Models
- Decision trees
  - Nonlinearly combine different features through a tree structure
- Hierarchical Mixture Expert Model
  - Replaces each tree node with a logistic regression model
  - Nonlinearly combines multiple linear models