Title: Announcements
1. Announcements
- Homework 4 is delayed to next Thursday, 02/27
- Project proposal is due on 03/02
2. Hierarchical Mixture Expert Model
3. Good Things about Decision Trees
- Decision trees introduce nonlinearity through the tree structure
  - Viewing ABC as ABC
- Compared to kernel methods
  - Less ad hoc
  - Easier to understand
4. Generalize Decision Trees
From slides of Andrew Moore
5. Partition Datasets
- The goal of each node is to partition the data set into disjoint subsets such that each subset is easier to classify.
[Figure: the original dataset partitioned by a single attribute (cylinders = 4, 5, 6, 8).]
6. Partition Datasets (cont'd)
- More complicated partitions: partition by multiple attributes
  - e.g., Cylinders < 6 and Weight > 4 tons
  - e.g., Cylinders ≥ 6 and Weight < 3 tons
- Other cases: replace each node with a classification model
[Figure: the original dataset partitioned by multiple attributes.]
- How to accomplish such a complicated partition?
  - Each partition ↔ a classification
  - Partitioning a dataset into disjoint subsets ↔ classifying data points into different classes
7. A More General Decision Tree
- Each node is a linear classifier
[Figure: a decision tree using classifiers for data partition vs. a decision tree with simple data partition.]
8. General Scheme for Decision Trees
- Each node within the tree is a linear classifier (a minimal sketch follows this slide)
- Pros
  - Usually results in shallow trees
  - Introduces nonlinearity into linear classifiers (e.g. logistic regression)
  - Overcomes overfitting through the regularization mechanism within the classifier
  - A better way to deal with real-valued attributes
- Examples
  - Neural network
  - Hierarchical Mixture Expert Model
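To make the scheme concrete, here is a minimal, illustrative sketch (not from the slides) of a depth-two tree whose nodes are logistic-regression classifiers: a root classifier routes each point to one of two leaf classifiers, each trained only on the points routed to it. The class name, the assumption that routing labels are given rather than learned, and the use of scikit-learn's LogisticRegression are all choices made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

class TwoLevelLRTree:
    """Depth-two decision tree whose nodes are linear (logistic regression) classifiers."""

    def fit(self, X, y, route):
        # `route` is a 0/1 array saying which child each training point goes to;
        # in the HME of the next slides this assignment is hidden and learned with EM.
        self.root = LogisticRegression().fit(X, route)
        self.leaves = [
            LogisticRegression().fit(X[route == k], y[route == k]) for k in (0, 1)
        ]
        return self

    def predict(self, X):
        branch = self.root.predict(X)          # hard routing decision at the root
        out = np.empty(len(X), dtype=int)      # assumes integer class labels
        for k in (0, 1):                       # each leaf classifies its own points
            mask = branch == k
            if mask.any():
                out[mask] = self.leaves[k].predict(X[mask])
        return out
```

Because each leaf only sees the points its parent routes to it, the overall decision boundary is piecewise linear even though every node is a linear model.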
9. Example
Kernel method
10. Hierarchical Mixture Expert Model (HME)
[Figure: HME tree. Root gate r(x); group layer with gates g1(x) (Group 1) and g2(x) (Group 2); expert layer with classifiers m1,1(x), m1,2(x), m2,1(x), m2,2(x).]
11. Hierarchical Mixture Expert Model (HME)
[Figure: the same HME tree. The root gate outputs r(x) = 1, routing x to Group 1.]
12. Hierarchical Mixture Expert Model (HME)
[Figure: the same HME tree.]
13. Hierarchical Mixture Expert Model (HME)
[Figure: the same HME tree. Within Group 1, the gate outputs g1(x) = -1, routing x to expert m1,2.]
14. Hierarchical Mixture Expert Model (HME)
[Figure: the same HME tree. Expert m1,2 outputs m1,2(x) = 1, which is the predicted class label for x.]
15. Hierarchical Mixture Expert Model (HME)
More Complicated Case
[Figure: the same HME tree.]
16. Hierarchical Mixture Expert Model (HME)
More Complicated Case
- The root gate outputs r(+1|x) = 3/4, r(-1|x) = 1/4
[Figure: the same HME tree.]
17. Hierarchical Mixture Expert Model (HME)
More Complicated Case
- r(+1|x) = 3/4, r(-1|x) = 1/4
- Which expert should be used for classifying x?
[Figure: the same HME tree.]
18. Hierarchical Mixture Expert Model (HME)
More Complicated Case
- r(+1|x) = 3/4, r(-1|x) = 1/4
- g1(+1|x) = 1/4, g1(-1|x) = 3/4; g2(+1|x) = 1/2, g2(-1|x) = 1/2
- Expert outputs (columns: class +1, class -1):
            +1    -1
  m1,1(x)   1/4   3/4
  m1,2(x)   3/4   1/4
  m2,1(x)   1/4   3/4
  m2,2(x)   3/4   1/4
- p(+1|x) = 9/16, p(-1|x) = 7/16
[Figure: the same HME tree.]
19. Hierarchical Mixture Expert Model (HME)
Probabilistic Description
- Random variable g ∈ {1, 2}: r(+1|x) = p(g = 1|x), r(-1|x) = p(g = 2|x)
- Random variable m ∈ {11, 12, 21, 22}:
  g1(+1|x) = p(m = 11|x, g = 1), g1(-1|x) = p(m = 12|x, g = 1)
  g2(+1|x) = p(m = 21|x, g = 2), g2(-1|x) = p(m = 22|x, g = 2)
- m1,1(x), m1,2(x), m2,1(x), m2,2(x) are classifiers: x → {+1, -1} (a small computational sketch follows this slide)
[Figure: the same HME tree.]
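Under this description, the model's overall prediction marginalizes over the hidden group and expert variables: p(y|x) = Σ_g p(g|x) Σ_m p(m|x, g) p(y|x, m). Below is a minimal, hypothetical sketch of that computation; the function names and the toy numbers are illustrative, not from the slides.

```python
import numpy as np

def hme_predict_proba(x, r, gates, experts):
    """p(y|x) = sum_g p(g|x) * sum_m p(m|x,g) * p(y|x,m) for a two-level HME.

    r(x)             -> p(g|x),      array of shape (num_groups,)
    gates[g](x)      -> p(m|x, g),   array of shape (num_experts_in_group,)
    experts[g][m](x) -> p(y|x, m),   array of shape (num_classes,)
    """
    p_y = np.zeros_like(experts[0][0](x), dtype=float)
    for g, p_g in enumerate(r(x)):
        for m, p_m in enumerate(gates[g](x)):
            p_y += p_g * p_m * experts[g][m](x)
    return p_y  # sums to 1 whenever each component distribution does

# Toy usage with made-up constant distributions (order: [p(+1), p(-1)]):
r = lambda x: np.array([0.6, 0.4])
gates = [lambda x: np.array([0.3, 0.7]), lambda x: np.array([0.5, 0.5])]
experts = [[lambda x: np.array([0.9, 0.1]), lambda x: np.array([0.2, 0.8])],
           [lambda x: np.array([0.7, 0.3]), lambda x: np.array([0.4, 0.6])]]
print(hme_predict_proba(None, r, gates, experts))  # -> [0.466, 0.534]
```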
20. Hierarchical Mixture Expert Model (HME)
Probabilistic Description
- How to train the functions r(x), g(x), m(x)?
[Figure: the same HME tree, with training pairs (x, y).]
21. Problem with Training HME
- Consider applying logistic regression to model r(x), g(x), and m(x)
- No training examples for r(x), g(x)
  - For each training example (x, y), we don't know its group ID or expert ID.
  - So we can't apply the training procedure of logistic regression to train r(x) and g(x) directly.
- Random variables g, m are called hidden variables since they are not exposed in the training data.
- How to train a model with hidden variables?
22. Start with Random Guess
- Iteration 1: random guess
  - Randomly assign points to each group and expert
- Training examples: 1, 2, 3, 4, 5 (one class); 6, 7, 8, 9 (the other class)
[Figure: HME tree with the random assignment. Group 1: {1, 2} and {6, 7}; Group 2: {3, 4, 5} and {8, 9}. Experts: m1,1 ← {1, 6}; m1,2 ← {2, 7}; m2,1 ← {3, 9}; m2,2 ← {4, 5, 8}.]
23. Start with Random Guess
- Iteration 1: random guess
  - Randomly assign points to each group and expert
  - Learn the classifiers r(x), g(x), m(x) from the assigned points
- Training examples: 1, 2, 3, 4, 5 (one class); 6, 7, 8, 9 (the other class)
[Figure: the same random assignment. Group 1: {1, 2} and {6, 7}; Group 2: {3, 4, 5} and {8, 9}. Experts: m1,1 ← {1, 6}; m1,2 ← {2, 7}; m2,1 ← {3, 9}; m2,2 ← {4, 5, 8}.]
24. Start with Random Guess
- Iteration 1: random guess
  - Randomly assign points to each group and expert
  - Learn the classifiers r(x), g(x), m(x)
  - Apply the learned classifier r(x) to reclassify all data points
- Training examples: 1, 2, 3, 4, 5 (one class); 6, 7, 8, 9 (the other class)
[Figure: the same HME tree with the iteration-1 assignments (Group 1: {1, 2}, {6, 7}; Group 2: {3, 4, 5}, {8, 9}; experts m1,1 ← {1, 6}, m1,2 ← {2, 7}, m2,1 ← {3, 9}, m2,2 ← {4, 5, 8}), plus a table showing how the learned r(x) reassigns points 1-5 between Group 1 and Group 2.]
25. Start with Random Guess
- Iteration 2: regroup
  - Reassign group membership to each data point
- New grouping: Group 1: {1, 5} and {6, 7}; Group 2: {2, 3, 4} and {8, 9}
- Expert scores for points 1-5 ("x" = point not in that expert's group):
            1      2      3      4      5
  m1,1     0.98    x      x      x     0.51
  m1,2     0.61    x      x      x     0.90
  m2,1      x     0.7    0.90   0.7     x
  m2,2      x     0.6    0.5    0.8     x
[Figure: the same HME tree.]
26. Start with Random Guess
- Iteration 2: regroup
  - Reassign the group membership to each data point
  - Reassign the expert membership to each data point
- New grouping: Group 1: {1, 5} and {6, 7}; Group 2: {2, 3, 4} and {8, 9}
- Expert scores for points 1-5 ("x" = point not in that expert's group):
            1      2      3      4      5
  m1,1     0.98    x      x      x     0.51
  m1,2     0.61    x      x      x     0.90
  m2,1      x     0.7    0.90   0.7     x
  m2,2      x     0.6    0.5    0.8     x
- New expert assignment: m1,1 ← {1, 6}; m1,2 ← {5, 7}; m2,1 ← {2, 3, 9}; m2,2 ← {4, 8}
[Figure: the same HME tree.]
27. Start with Random Guess
- Iteration 2: regroup
  - Reassign the group membership to each data point
  - Reassign the expert membership to each data point
  - Retrain r(x), g(x) and m(x)
- New grouping: Group 1: {1, 5} and {6, 7}; Group 2: {2, 3, 4} and {8, 9}
- Expert scores for points 1-5 ("x" = point not in that expert's group):
            1      2      3      4      5
  m1,1     0.98    x      x      x     0.51
  m1,2     0.61    x      x      x     0.90
  m2,1      x     0.7    0.90   0.7     x
  m2,2      x     0.6    0.5    0.8     x
- New expert assignment: m1,1 ← {1, 6}; m1,2 ← {5, 7}; m2,1 ← {2, 3, 9}; m2,2 ← {4, 8}
[Figure: the same HME tree.]
28. Start with Random Guess
- Iteration 2: regroup
  - Reassign the group membership to each data point
  - Reassign the expert membership to each data point
  - Retrain r(x), g(x) and m(x)
- Repeat the above procedure until it converges (it is guaranteed to converge to a local optimum)
[Figure: the same HME tree. Group 1: {1, 5} and {6, 7}; Group 2: {2, 3, 4} and {8, 9}; experts: m1,1 ← {1, 6}; m1,2 ← {5, 7}; m2,1 ← {2, 3, 9}; m2,2 ← {4, 8}.]
This is the famous Expectation-Maximization (EM) algorithm!
29. Formal EM algorithm for HME
- Two things need to be estimated (a code sketch of the resulting EM loop follows this slide)
  - Logistic regression models for r(x; θr), g(x; θg) and m(x; θm)
  - Unknown group memberships and expert memberships
    - i.e., p(g = 1, 2 | x), p(m = 11, 12 | x, g = 1), p(m = 21, 22 | x, g = 2)
- E-step
  - Estimate p(g = 1 | x) and p(g = 2 | x) for each training example, given the current guesses of r(x; θr), g(x; θg) and m(x; θm)
  - Estimate p(m = 11, 12 | x, g = 1) and p(m = 21, 22 | x, g = 2) for all training examples, given the current guesses of r(x; θr), g(x; θg) and m(x; θm)
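Below is a minimal, hypothetical Python sketch of the iterative loop described on the preceding slides, under several simplifying assumptions: two groups with two experts each, scikit-learn's LogisticRegression for every node, hard reassignments in place of the soft posteriors of the formal E-step, and no handling of degenerate (empty or single-class) subsets. All names are illustrative, not from the slides.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_lr(X, y):
    # Helper: fit one logistic regression node. Real code would guard against
    # empty or single-class subsets, which this sketch does not.
    return LogisticRegression().fit(X, y)

def train_hme_hard_em(X, y, n_iter=10, seed=0):
    """Hard-assignment EM for a 2-group / 2-experts-per-group HME:
    random guess -> train r, g, m on the assignments -> reassign -> repeat.
    X is an (n, d) numpy array, y an (n,) array of class labels."""
    rng = np.random.default_rng(seed)
    n = len(X)
    group = rng.integers(0, 2, n)    # iteration 1: random group memberships
    expert = rng.integers(0, 2, n)   # iteration 1: random expert memberships

    for _ in range(n_iter):
        # "M-step": retrain the root gate, group gates, and experts.
        r = fit_lr(X, group)
        g = [fit_lr(X[group == k], expert[group == k]) for k in (0, 1)]
        m = [[fit_lr(X[(group == k) & (expert == j)],
                     y[(group == k) & (expert == j)]) for j in (0, 1)]
             for k in (0, 1)]

        # "E-step" (hard version): reassign groups with the learned r(x) ...
        group = r.predict(X)
        # ... and reassign each point to the expert in its group that gives
        # the highest probability to the point's true label (the score table
        # on the slides). The formal E-step would use soft posteriors instead.
        for i in range(n):
            k = int(group[i])
            probs = [m[k][j].predict_proba(X[i:i + 1])[0][
                         list(m[k][j].classes_).index(y[i])] for j in (0, 1)]
            expert[i] = int(np.argmax(probs))
    return r, g, m
```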
30. Comparison of Classification Models
- The goal of a classifier
  - Predicting the class label y for an input x
  - Estimate p(y|x)
- Gaussian generative model
  - p(y|x) ∝ p(x|y) p(y): posterior ∝ likelihood × prior
  - Difficulty in estimating p(x|y) if x comprises multiple elements
  - Naïve Bayes: p(x|y) ≈ p(x1|y) p(x2|y) ... p(xm|y) (written out below)
- Linear discriminative model
  - Estimate p(y|x) directly
  - Focus on finding the decision boundary
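Written out, the generative posterior from the bullets above, together with its naïve Bayes approximation, is:

```latex
p(y \mid x) = \frac{p(x \mid y)\, p(y)}{\sum_{y'} p(x \mid y')\, p(y')}
\approx \frac{p(y)\,\prod_{j=1}^{m} p(x_j \mid y)}{\sum_{y'} p(y')\,\prod_{j=1}^{m} p(x_j \mid y')}
```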
31. Comparison of Classification Models
- Logistic regression model
  - A linear decision boundary: w·x + b
  - A probabilistic model for p(y|x) (written out below)
  - Maximum likelihood approach for estimating the weights w and the threshold b
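For reference, the standard logistic regression model being described here, with labels y ∈ {+1, -1}, is:

```latex
p(y \mid x) = \frac{1}{1 + \exp\!\big(-y\,(\mathbf{w}^{\top}\mathbf{x} + b)\big)},
\qquad y \in \{+1, -1\}
```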
32. Comparison of Classification Models
- Logistic regression model
  - Overfitting issue
    - In text classification, a word that appears in only one document will be assigned an infinitely large weight
    - Solution: regularization (see the objective below)
- Conditional exponential model
- Maximum entropy model
  - A dual problem of the conditional exponential model
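One standard way to write the regularized training objective alluded to above; the L2 penalty is an illustrative choice, since the slides do not specify the form of the regularizer:

```latex
\max_{\mathbf{w},\, b}\;
-\sum_{i=1}^{n} \log\!\Big(1 + e^{-y_i(\mathbf{w}^{\top}\mathbf{x}_i + b)}\Big)
\;-\; \frac{\lambda}{2}\,\lVert \mathbf{w} \rVert^{2}
```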
33. Comparison of Classification Models
- Support vector machine
  - Classification margin
  - Maximum margin principle: two objectives
    - Minimize the classification error over the training data
    - Maximize the classification margin
  - Support vectors
    - Only the support vectors have an impact on the location of the decision boundary
34. Comparison of Classification Models
- Separable case
- Noisy case
Quadratic programming!
35. Comparison of Classification Models
- Similarity between the logistic regression model and the support vector machine
  - The logistic regression model is almost identical to the support vector machine, except for a different expression for the classification error (compare the two objectives below).
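To make the similarity explicit, here are standard regularized forms of the two training objectives (illustrative, not copied from the slides); they differ only in how the per-example classification error is expressed, hinge loss versus logistic loss:

```latex
\text{SVM:}\qquad
\min_{\mathbf{w},\, b}\;
\frac{1}{2}\lVert \mathbf{w} \rVert^{2}
+ C \sum_{i=1}^{n} \max\!\big(0,\; 1 - y_i(\mathbf{w}^{\top}\mathbf{x}_i + b)\big)

\text{Logistic regression:}\qquad
\min_{\mathbf{w},\, b}\;
\frac{\lambda}{2}\lVert \mathbf{w} \rVert^{2}
+ \sum_{i=1}^{n} \log\!\Big(1 + e^{-y_i(\mathbf{w}^{\top}\mathbf{x}_i + b)}\Big)
```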
36. Comparison of Classification Models
- Generative models have trouble at the decision boundary