1
Announcements
  • Homework 4 is delayed to next Thursday, 02/27
  • Project proposal is due on 03/02

2
Hierarchical Mixture Expert Model
  • Rong Jin

3
Good Things about Decision Trees
  • Decision trees introduce nonlinearity through the
    tree structure
  • Viewing ABC as ABC
  • Compared to kernel methods
  • Less ad hoc
  • Easier to understand

4
Generalize Decision Trees
From slides of Andrew Moore
5
Partition Datasets
  • The goal of each node is to partition the data
    set into disjoint subsets such that each subset
    is easier to classify.

[Figure: the original dataset is partitioned by a single attribute (cylinders), producing subsets with cylinders = 4, 5, 6, and 8]
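As a minimal sketch (not from the slides), partitioning a dataset by a single attribute is just a group-by; the toy car records and attribute name below are hypothetical:

```python
from collections import defaultdict

# Hypothetical toy dataset; each record is a dict of attribute values.
cars = [
    {"cylinders": 4, "weight": 2.2, "mpg": "good"},
    {"cylinders": 6, "weight": 3.1, "mpg": "bad"},
    {"cylinders": 8, "weight": 4.0, "mpg": "bad"},
    {"cylinders": 4, "weight": 1.9, "mpg": "good"},
]

def partition_by(records, attribute):
    """Split records into disjoint subsets keyed by the value of one attribute."""
    groups = defaultdict(list)
    for record in records:
        groups[record[attribute]].append(record)
    return dict(groups)

subsets = partition_by(cars, "cylinders")
for value, subset in sorted(subsets.items()):
    print(f"cylinders = {value}: {len(subset)} record(s)")
```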
6
Partition Datasets (cont'd)
  • More complicated partitions

[Figure: partition by multiple attributes, e.g. "Cylinders < 6 and Weight > 4 tons" vs. "Cylinders ≥ 6 and Weight < 3 tons"; in other cases, each node is replaced with a classification model]
  • How to accomplish such a complicated partition?
  • Each partition ⇔ a classification
  • Partitioning a dataset into disjoint subsets ⇔
    classifying data points into different classes

7
A More General Decision Tree
Each node is a linear classifier
[Figure: two trees side by side: a decision tree using classifiers for data partition, and a decision tree with simple data partition]
8
General Scheme for Decision Trees
  • Each node within the tree is a linear classifier
  • Pros
  • Usually results in shallow trees
  • Introduces nonlinearity into linear classifiers
    (e.g. logistic regression)
  • Overcomes overfitting through the regularization
    mechanism within the classifier
  • Better handling of real-valued attributes
  • Examples
  • Neural network
  • Hierarchical Mixture Expert Model
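A hedged sketch of this scheme (not the lecture's code): a two-level "tree" in which the root and each child node are logistic regression models, with scikit-learn assumed to be available and a simple hand-picked routing rule used at the root.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical 2-D data with a nonlinear (XOR-like) class structure.
X = rng.normal(size=(400, 2))
y = ((X[:, 0] > 0) ^ (X[:, 1] > 0)).astype(int)

# Root node: a linear classifier that only decides which child gets the point.
root = LogisticRegression().fit(X, (X[:, 0] > 0).astype(int))
route = root.predict(X)

# Each child node is itself a linear classifier trained on its partition.
children = {
    side: LogisticRegression().fit(X[route == side], y[route == side])
    for side in (0, 1)
}

def predict(X_new):
    """Route each point with the root, then classify with the chosen child."""
    sides = root.predict(X_new)
    out = np.empty(len(X_new), dtype=int)
    for side in (0, 1):
        mask = sides == side
        if mask.any():
            out[mask] = children[side].predict(X_new[mask])
    return out

print("training accuracy:", (predict(X) == y).mean())
```

Each node being a regularized linear model is what keeps the tree shallow while still producing a nonlinear overall decision boundary.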

9
Example
Kernel method
10
Hierarchical Mixture Expert Model (HME)
[Diagram: HME with root r(x); group layer with Group 1 g1(x) and Group 2 g2(x); expert layer with m1,1(x), m1,2(x), m2,1(x), m2,2(x)]
11
Hierarchical Mixture Expert Model (HME)
[Diagram: the root outputs r(x) = 1, routing x to Group 1 g1(x); expert layer m1,1(x), m1,2(x), m2,1(x), m2,2(x)]
12
Hierarchical Mixture Expert Model (HME)
[Diagram: HME structure as above]
13
Hierarchical Mixture Expert Model (HME)
[Diagram: within Group 1, g1(x) = -1, routing x to expert m1,2(x)]
14
Hierarchical Mixture Expert Model (HME)
[Diagram: expert m1,2(x) = 1, so the class label for x is +1]
15
Hierarchical Mixture Expert Model (HME)
More Complicated Case
[Diagram: HME structure as above]
16
Hierarchical Mixture Expert Model (HME)
More Complicated Case
[Diagram: the root now outputs soft probabilities r(+1|x) = ¾, r(-1|x) = ¼ over Group 1 and Group 2]
17
Hierarchical Mixture Expert Model (HME)
More Complicated Case
[Diagram: r(+1|x) = ¾, r(-1|x) = ¼]
Which expert should be used for classifying x?
18
Hierarchical Mixture Expert Model (HME)
More Complicated Case
[Diagram annotations]
r(+1|x) = ¾, r(-1|x) = ¼
g1(+1|x) = ¼, g1(-1|x) = ¾; g2(+1|x) = ½, g2(-1|x) = ½

Expert outputs:
           +1   -1
m1,1(x)    ¼    ¾
m1,2(x)    ¾    ¼
m2,1(x)    ¼    ¾
m2,2(x)    ¾    ¼

Overall: p(+1|x) = 9/16, p(-1|x) = 7/16
19
Hierarchical Mixture Expert Model (HME)
Probabilistic Description
[Diagram: HME structure as above]
Random variable g ∈ {1, 2}: r(+1|x) = p(g=1|x), r(-1|x) = p(g=2|x)
Random variable m ∈ {11, 12, 21, 22}: g1(+1|x) = p(m=11|x, g=1), g1(-1|x) = p(m=12|x, g=1); g2(+1|x) = p(m=21|x, g=2), g2(-1|x) = p(m=22|x, g=2)
m1,1(x), m1,2(x), m2,1(x), m2,2(x) are classifiers mapping x to {+1, -1}
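A hedged sketch of this probabilistic view (the probability tables below are made up for illustration, not the slide's numbers): the overall prediction mixes the experts' outputs, weighted by the group and expert gating probabilities.

```python
# Hypothetical gating and expert probabilities for one input x.
# p_group[g]     = p(g | x)      (output of r)
# p_expert[g][m] = p(m | x, g)   (output of the group gate g_g)
# p_label[m][y]  = p(y | x, m)   (output of expert m)
p_group = {1: 0.6, 2: 0.4}
p_expert = {1: {"11": 0.7, "12": 0.3}, 2: {"21": 0.5, "22": 0.5}}
p_label = {
    "11": {+1: 0.9, -1: 0.1},
    "12": {+1: 0.2, -1: 0.8},
    "21": {+1: 0.6, -1: 0.4},
    "22": {+1: 0.3, -1: 0.7},
}

def hme_predict(y):
    """p(y | x) = sum_g p(g|x) * sum_m p(m|x,g) * p(y|x,m)."""
    return sum(
        p_group[g] * sum(p_expert[g][m] * p_label[m][y] for m in p_expert[g])
        for g in p_group
    )

print("p(+1|x) =", hme_predict(+1))
print("p(-1|x) =", hme_predict(-1))
```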
20
Hierarchical Mixture Expert Model (HME)
Probabilistic Description
[Diagram: HME structure as above, producing output y]
How do we train the functions r(x), g(x), and m(x)?
21
Problem with Training HME
  • Consider applying logistic regression to model
    r(x), g(x), and m(x)
  • There are no labeled training examples for r(x) and g(x)
  • For each training example (x, y), we don't know
    its group ID or expert ID
  • We can't apply the training procedure of the logistic
    regression model to train r(x) and g(x) directly
  • The random variables g, m are called hidden variables
    since they are not exposed in the training data
  • How do we train a model with hidden variables?

22
Start with Random Guess
  • Iteration 1: random guess
  • Randomly assign points to each group and expert

Data points 1, 2, 3, 4, 5 and 6, 7, 8, 9
[Diagram: random assignment: Group 1 g1(x) ← {1, 2, 6, 7}, Group 2 g2(x) ← {3, 4, 5, 8, 9}; experts m1,1(x) ← {1, 6}, m1,2(x) ← {2, 7}, m2,1(x) ← {3, 9}, m2,2(x) ← {4, 5, 8}]
23
Start with Random Guess
  • Iteration 1: random guess
  • Randomly assign points to each group and expert
  • Learn the classifiers r(x), g(x), m(x)

Data points 1, 2, 3, 4, 5 and 6, 7, 8, 9
[Diagram: same random assignment as above]
24
Start with Random Guess
  • Iteration 1: random guess
  • Randomly assign points to each group and expert
  • Learn the classifiers r(x), g(x), m(x)
  • Apply the learned classifier r(x) to reclassify
    all data points

Data points 1, 2, 3, 4, 5 and 6, 7, 8, 9
[Diagram: the learned r(x) is applied to reclassify data points 1-5 between Group 1 and Group 2; previous group and expert assignments as above]
25
Start with Random Guess
  • Iteration 2: regroup
  • Reassign the group membership of each data point

Data points 1, 2, 3, 4, 5 and 6, 7, 8, 9
[Diagram: new group assignments g1(x) ← {1, 5, 6, 7}, g2(x) ← {2, 3, 4, 8, 9}]

Expert probabilities for data points 1-5:
         1      2      3      4      5
m1,1    0.98   x      x      x      0.51
m1,2    0.61   x      x      x      0.90
m2,1    x      0.7    0.90   0.7    x
m2,2    x      0.6    0.5    0.8    x
26
Start with Random Guess
  • Iteration 2: regroup
  • Reassign the group membership of each data point
  • Reassign the expert membership of each data point

Data points 1, 2, 3, 4, 5 and 6, 7, 8, 9
[Diagram: groups g1(x) ← {1, 5, 6, 7}, g2(x) ← {2, 3, 4, 8, 9}; expert probability table as above; new expert assignments m1,1(x) ← {1, 6}, m1,2(x) ← {5, 7}, m2,1(x) ← {2, 3, 9}, m2,2(x) ← {4, 8}]
27
Start with Random Guess
  • Iteration 2: regroup
  • Reassign the group membership of each data point
  • Reassign the expert membership of each data point
  • Retrain r(x), g(x) and m(x)

Data points 1, 2, 3, 4, 5 and 6, 7, 8, 9
[Diagram: same assignments and expert probability table as above]
28
Start with Random Guess
  • Iteration 2: regroup
  • Reassign the group membership of each data point
  • Reassign the expert membership of each data point
  • Retrain r(x), g(x) and m(x)
  • Repeat the above procedure until it converges (it
    is guaranteed to converge to a local optimum)

Data points 1, 2, 3, 4, 5 and 6, 7, 8, 9
[Diagram: same assignments as above]
This is the famous Expectation-Maximization (EM) algorithm!
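A minimal sketch of this alternating procedure (not the lecture's code), using the simplified hard-assignment version described on these slides. To keep the example short it uses a flat mixture with one gate and two experts instead of the full two-level hierarchy, and scikit-learn logistic regression is assumed.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical data the sketch operates on.
X = rng.normal(size=(300, 2))
y = ((X[:, 0] > 0) ^ (X[:, 1] > 0)).astype(int)

n_experts = 2
# Step 0: random guess -- randomly assign every point to an expert.
assign = rng.integers(n_experts, size=len(X))

for iteration in range(10):
    # "M-step": retrain the gate g(x) and each expert m_k(x) on the
    # current assignments.
    gate = LogisticRegression().fit(X, assign)
    experts = [LogisticRegression().fit(X[assign == k], y[assign == k])
               for k in range(n_experts)]

    # "E-step": reassign each point to the expert that explains its label
    # best, weighted by the gate (hard assignment, as on the slides).
    gate_p = gate.predict_proba(X)                      # p(k | x)
    label_p = np.stack(
        [e.predict_proba(X)[np.arange(len(X)), y] for e in experts], axis=1
    )                                                   # p(y | x, k)
    new_assign = (gate_p * label_p).argmax(axis=1)

    if np.array_equal(new_assign, assign):              # converged
        break
    assign = new_assign

pred = np.array([experts[k].predict(X[i:i + 1])[0] for i, k in enumerate(assign)])
print("training accuracy:", (pred == y).mean())
```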
29
Formal EM algorithm for HME
  • Two things need to be estimated
  • Logistic regression models for r(x; θr), g(x; θg),
    and m(x; θm)
  • Unknown group memberships and expert memberships,
    i.e. p(g=1,2|x), p(m=11,12|x, g=1), p(m=21,22|x, g=2)
  • E-step
  • Estimate p(g=1|x), p(g=2|x) for all training
    examples, given the guessed r(x; θr), g(x; θg) and
    m(x; θm)
  • Estimate p(m=11,12|x, g=1) and p(m=21,22|x, g=2)
    for all training examples, given the guessed
    r(x; θr), g(x; θg) and m(x; θm)
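As a hedged sketch of the standard EM derivation (notation assumed here, not copied from the slides), the E-step responsibilities are the posteriors of the hidden variables given the observed label:

$$
p(g, m \mid x, y) \;=\;
\frac{p(g \mid x)\, p(m \mid x, g)\, p(y \mid x, m)}
{\sum_{g'} \sum_{m'} p(g' \mid x)\, p(m' \mid x, g')\, p(y \mid x, m')}
$$

and p(g | x, y) is obtained by summing this quantity over the experts m within group g.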

30
Comparison of Classification Models
  • The goal of a classifier
  • Predicting the class label y for an input x
  • Estimate p(y|x)
  • Gaussian generative model
  • p(y|x) ∝ p(x|y) p(y): posterior ∝ likelihood ×
    prior
  • Difficulty in estimating p(x|y) if x comprises
    multiple elements
  • Naïve Bayes: p(x|y) ≈ p(x1|y) p(x2|y) ... p(xm|y)
  • Linear discriminative model
  • Estimate p(y|x) directly
  • Focus on finding the decision boundary

31
Comparison of Classification Models
  • Logistic regression model
  • A linear decision boundary: w·x + b
  • A probabilistic model for p(y|x)
  • Maximum likelihood approach for estimating the
    weights w and threshold b
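A minimal numpy sketch of maximum-likelihood training for logistic regression (plain gradient ascent with a fixed step size; the data and hyperparameters below are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical roughly separable data with labels y in {-1, +1}.
X = rng.normal(size=(200, 2))
y = np.where(X @ np.array([1.5, -2.0]) + 0.3 > 0, 1, -1)

w, b = np.zeros(2), 0.0
step = 0.1
for _ in range(500):
    # p(y | x) = 1 / (1 + exp(-y (w.x + b)))
    margin = y * (X @ w + b)
    # Gradient of the log-likelihood sum_i log p(y_i | x_i).
    coef = y * (1.0 - 1.0 / (1.0 + np.exp(-margin)))
    w += step * (coef @ X) / len(X)
    b += step * coef.mean()

log_lik = -np.log1p(np.exp(-y * (X @ w + b))).sum()
print("w =", w, "b =", b, "log-likelihood =", log_lik)
```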

32
Comparison of Classification Models
  • Logistic regression model
  • Overfitting issue
  • In text classification, words that appear in only
    one document will be assigned infinitely large
    weights
  • Solution regularization
  • Conditional exponential model
  • Maximum entropy model
  • A dual problem of conditional exponential model
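A hedged sketch of the regularized objective (the sign convention and regularization constant c are assumptions, not taken from the slides), with y ∈ {-1, +1}:

$$
\max_{\mathbf{w}, b} \;\; \sum_{i=1}^{n} \log \frac{1}{1 + \exp\!\left(-y_i(\mathbf{w}\cdot\mathbf{x}_i + b)\right)} \;-\; c\,\lVert \mathbf{w} \rVert_2^2
$$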

33
Comparison of Classification Models
  • Support vector machine
  • Classification margin
  • Maximum margin principle: two objectives
  • Minimize the classification error over the training
    data
  • Maximize the classification margin
  • Support vectors
  • Only support vectors have an impact on the location
    of the decision boundary

34
Comparison of Classification Models
  • Separable case
  • Noisy case

Quadratic programming!
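As a sketch of the quadratic program being referenced (the standard soft-margin formulation; the constant C and slack variables ξi are the usual ones, assumed here rather than copied from the slides):

$$
\min_{\mathbf{w}, b, \boldsymbol{\xi}} \;\; \tfrac{1}{2}\lVert \mathbf{w} \rVert^2 + C \sum_{i=1}^{n} \xi_i
\quad \text{s.t.} \quad y_i(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1 - \xi_i, \;\; \xi_i \ge 0
$$

In the separable case the slack variables are dropped (ξi = 0).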
35
Comparison of Classification Models
  • Similarity between logistic regression model and
    support vector machine

The logistic regression model is almost identical to
the support vector machine except for the different
expressions for classification error
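A small sketch of that similarity (an assumed illustration, not the lecture's code): both models penalize the margin y·f(x) = y(w·x + b), differing only in the loss applied to it.

```python
import numpy as np

# Margins y * f(x) on a grid: negative means misclassified.
margins = np.linspace(-3, 3, 7)

logistic_loss = np.log1p(np.exp(-margins))   # logistic regression error term
hinge_loss = np.maximum(0.0, 1.0 - margins)  # SVM (hinge) error term

for m, l, h in zip(margins, logistic_loss, hinge_loss):
    print(f"margin {m:+.1f}: logistic {l:.3f}, hinge {h:.3f}")
```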
36
Comparison of Classification Models
  • Generative models have trouble at the decision
    boundary