1
Announcements
  • Homework 4 is due this Thursday (02/27/2004)
  • Project proposal is due on 03/02

2
Hierarchical Mixture Expert Model
  • Rong Jin

3
Decision Trees
  • Pro
  • Brings nonlinearity into the model
  • Con
  • Each split is based on only a single attribute.

4
Generalizing Decision Trees
[Figure: a decision tree with simple data partition vs. a decision tree using classifiers for data partition, where each node is a linear classifier]
5
Generalized Decision Trees
  • Each node is a linear classifier (see the sketch below)
  • Pro
  • Usually results in shallow trees
  • Introducing nonlinearity into linear classifiers
    (e.g. logistic regression)
  • Overcoming overfitting issues through the
    regularization mechanism within the classifier
  • Better way to deal with real-valued attributes
  • Example
  • Neural network
  • Hierarchical Mixture Expert Model
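As a concrete illustration, here is a minimal sketch (not from the slides) of a depth-1 "tree" whose root node is a logistic regression that routes points to one of two branches, and whose leaves are logistic regressions. The toy data, the use of k-means to pick the partition, and all names are illustrative assumptions; HME (later in this deck) instead learns the partition and the leaf classifiers jointly.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Two blobs; the label depends on the second feature with opposite sign in each
# blob, so no single linear classifier can do well on the whole data set.
X = np.vstack([rng.normal([-2, 0], 1, (150, 2)), rng.normal([+2, 0], 1, (150, 2))])
y = np.r_[(X[:150, 1] > 0), (X[150:, 1] < 0)].astype(int)

# Root node: a linear classifier that partitions the data.  Here the partition
# itself is chosen by clustering, just to keep the sketch simple.
part = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
root = LogisticRegression().fit(X, part)

# Leaf nodes: one linear classifier per branch.
leaves = [LogisticRegression().fit(X[part == k], y[part == k]) for k in (0, 1)]

def predict(Xnew):
    branch = root.predict(Xnew)
    return np.array([leaves[b].predict(x[None])[0] for b, x in zip(branch, Xnew)])

print("training accuracy:", (predict(X) == y).mean())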

6
Example
Kernel method
7
Hierarchical Mixture Expert Model (HME)
  • Ask r(x) which group should be used for
    classifying input x ?
  • If group 1 is chosen, which classifier m(x)
    should be used ?
  • Classify input x using the chosen classifier m(x)

8
Hierarchical Mixture Expert Model (HME): Probabilistic Description
Two hidden variables:
The hidden variable for groups: g ∈ {1, 2}
The hidden variable for classifiers: m ∈ {11, 12, 21, 22}
9
Hierarchical Mixture Expert Model (HME): Example
r(+1|x) = ¾, r(-1|x) = ¼
g1(+1|x) = ¼, g1(-1|x) = ¾
g2(+1|x) = ½, g2(-1|x) = ½

        p(+1|x)   p(-1|x)
m1,1    ¼         ¾
m1,2    ¾         ¼
m2,1    ¼         ¾
m2,2    ¾         ¼

p(+1|x) = ?, p(-1|x) = ?
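Working these out with the mixture formula above (a worked check; the slide leaves them as a question):
p(+1|x) = ¾ (¼·¼ + ¾·¾) + ¼ (½·¼ + ½·¾) = 30/64 + 8/64 = 19/32 ≈ 0.59
p(-1|x) = 1 - 19/32 = 13/32 ≈ 0.41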
10
Training HME
  • In the training examples (xi, yi)
  • No information about r(x), g(x) for each example
  • Random variables g, m are called hidden variables
    since they are not exposed in the training data.
  • How to train a model with hidden variables?

11
Start with Random Guess
1, 2, 3, 4, 5 ? 6, 7, 8, 9
  • Random assignment
  • Randomly assign points to each group and expert
  • Learn classifiers r(x), g(x), m(x) using the
    randomly assigned points

[Figure: the HME tree after random assignment. Group layer r(x): group 1 = {1, 2, 6, 7}, group 2 = {3, 4, 5, 8, 9}. Expert layer: m1,1 = {1, 6}, m1,2 = {2, 7}, m2,1 = {3, 9}, m2,2 = {5, 4, 8}]
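A tiny sketch of this initialization (illustrative only; the point numbering follows the figure, everything else is made up): each point gets a random group and a random expert within that group, and the classifiers are then fit on those assignments.

import numpy as np

rng = np.random.default_rng(0)
points = np.arange(1, 10)                      # the nine points on the slide
group = rng.integers(1, 3, size=points.size)   # random group (1 or 2) per point
expert = rng.integers(1, 3, size=points.size)  # random expert within that group
for g in (1, 2):
    print(f"group {g}:", points[group == g], "-> experts:", expert[group == g])
# r(x), g1(x), g2(x) and the m(x) classifiers would then be fit on these
# randomly assigned memberships.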
12
Adjust Group Memberships
  • The key is to assign each data point to the
    group that classifies the data point correctly
    with the largest probability
  • How ?

1, 2, 3, 4, 5 ? 6, 7, 8, 9
[Figure: the HME tree with current memberships. Group layer r(x): group 1 = {1, 2, 6, 7}, group 2 = {3, 4, 5, 8, 9}. Expert layer: m1,1 = {1, 6}, m1,2 = {2, 7}, m2,1 = {3, 9}, m2,2 = {5, 4, 8}]
13
Adjust Group Memberships
  • The key is to assign each data point to the
    group that classifies the data point correctly
    with the largest confidence
  • Compute p(g=1|x, y) and p(g=2|x, y)

1, 2, 3, 4, 5 ? 6, 7, 8, 9
[Figure: the HME tree with group layer r(x): group 1 = {1, 2, 6, 7}, group 2 = {3, 4, 5, 8, 9}; expert layer m1,1 = {1, 6}, m1,2 = {2, 7}, m2,1 = {3, 9}, m2,2 = {5, 4, 8}]

Posterior Prob. for Groups
Point   Group 1   Group 2
1       0.8       0.2
2       0.4       0.6
3       0.3       0.7
4       0.1       0.9
5       0.65      0.35
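One way to compute these posteriors, by Bayes' rule on the mixture above (the standard E-step form; the slide shows only the resulting numbers):
p(g=1|x, y) = r(+1|x) [g1(+1|x) m1,1(y|x) + g1(-1|x) m1,2(y|x)] / p(y|x)
p(g=2|x, y) = r(-1|x) [g2(+1|x) m2,1(y|x) + g2(-1|x) m2,2(y|x)] / p(y|x)
where p(y|x) is the full mixture, so the two posteriors sum to one.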
14
Adjust Memberships for Classifiers
  • The key is to assign each data point to the
    classifier that classifies the data point
    correctly with the largest confidence
  • Compute p(m=1,1|x, y), p(m=1,2|x, y),
    p(m=2,1|x, y), p(m=2,2|x, y)

1, 2, 3, 4, 5 ? 6, 7, 8, 9
[Figure: the HME tree with updated group memberships. Group layer r(x): group 1 = {1, 5, 6, 7}, group 2 = {2, 3, 4, 8, 9}. Expert layer: m1,1(x), m1,2(x), m2,1(x), m2,2(x)]
15
Adjust Memberships for Classifiers
  • The key is to assign each data point to the
    classifier that classifies the data point
    correctly with the largest confidence
  • Compute p(m=1,1|x, y), p(m=1,2|x, y),
    p(m=2,1|x, y), p(m=2,2|x, y)

1, 2, 3, 4, 5 ? 6, 7, 8, 9
[Figure: the HME tree with group layer r(x): group 1 = {1, 5, 6, 7}, group 2 = {2, 3, 4, 8, 9}; expert layer m1,1(x), m1,2(x), m2,1(x), m2,2(x)]

Posterior Prob. for Classifiers
Point   1      2      3      4      5
m1,1    0.7    0.1    0.15   0.1    0.05
m1,2    0.2    0.2    0.20   0.1    0.55
m2,1    0.05   0.5    0.60   0.1    0.3
m2,2    0.05   0.2    0.05   0.7    0.1
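The classifier posteriors follow the same pattern (standard form, added here for reference):
p(m=1,1|x, y) = r(+1|x) g1(+1|x) m1,1(y|x) / p(y|x), and similarly for m1,2, m2,1 and m2,2;
each point is then assigned to the classifier with the largest posterior, which is how the table above is read.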
17
Adjust Memberships for Classifiers
  • The key is to assign each data point to the
    classifier that classifies the data point
    correctly with the largest confidence
  • Compute p(m=1,1|x, y), p(m=1,2|x, y),
    p(m=2,1|x, y), p(m=2,2|x, y)

1, 2, 3, 4, 5 ? 6, 7, 8, 9
[Figure: the HME tree with group layer r(x): group 1 = {1, 5, 6, 7}, group 2 = {2, 3, 4, 8, 9}]

Posterior Prob. for Classifiers
Point   1      2      3      4      5
m1,1    0.7    0.1    0.15   0.1    0.05
m1,2    0.2    0.2    0.20   0.1    0.55
m2,1    0.05   0.5    0.60   0.1    0.3
m2,2    0.05   0.2    0.05   0.7    0.1

Updated expert memberships: m1,1 = {1, 6}, m1,2 = {5, 7}, m2,1 = {2, 3, 9}, m2,2 = {4, 8}
18
Retrain The Model
1, 2, 3, 4, 5 ? 6, 7, 8, 9
  • Retrain r(x), g(x), m(x) using the new
    memberships

[Figure: the HME tree after reassignment. Group layer r(x): group 1 = {1, 5, 6, 7}, group 2 = {2, 3, 4, 8, 9}. Expert layer: m1,1 = {1, 6}, m1,2 = {5, 7}, m2,1 = {2, 3, 9}, m2,2 = {4, 8}]
19
Expectation Maximization
  • Two things need to be estimated
  • Logistic regression models for r(x; θr), g(x; θg)
    and m(x; θm)
  • Unknown group memberships and expert memberships
  • i.e. p(g=1,2|x), p(m=11,12|x, g=1), p(m=21,22|x, g=2)
  • E-step
  • Estimate p(g=1|x, y), p(g=2|x, y) for training
    examples, given the guessed r(x; θr), g(x; θg) and
    m(x; θm)
  • Estimate p(m=11,12|x, y) and p(m=21,22|x, y)
    for all training examples, given the guessed r(x; θr),
    g(x; θg) and m(x; θm)
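The loop on the preceding slides (random guess, adjust memberships, retrain) is this EM procedure: adjusting memberships is the E-step and retraining r, g, m is the M-step. Below is a minimal runnable sketch of it, not taken from the slides: it assumes binary labels, uses scikit-learn's LogisticRegression with sample_weight to carry the soft memberships, and makes up a small toy dataset; the variable names and iteration count are illustrative.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)           # a nonlinear toy target
n = len(y)

def fit_soft(X, targets, weights):
    # Fit a logistic regression using soft membership weights.
    return LogisticRegression().fit(X, targets, sample_weight=weights)

# Random initialization ("Start with Random Guess"): soft memberships
# p(m | x, y) over the four experts, drawn at random for every point.
resp = rng.dirichlet(np.ones(4), size=n)

for it in range(20):
    # M-step ("Retrain the Model"): refit r, g, m from the memberships.
    resp_g = np.stack([resp[:, 0] + resp[:, 1],   # p(g=1 | x, y)
                       resp[:, 2] + resp[:, 3]], axis=1)
    two = np.vstack([X, X])
    lab = np.r_[np.zeros(n), np.ones(n)]
    r = fit_soft(two, lab, np.r_[resp_g[:, 0], resp_g[:, 1]])
    gates, experts = [], []
    for g in range(2):
        w1, w2 = resp[:, 2 * g], resp[:, 2 * g + 1]
        gates.append(fit_soft(two, lab, np.r_[w1, w2]))
        experts.append(fit_soft(X, y, w1 + 1e-6))
        experts.append(fit_soft(X, y, w2 + 1e-6))

    # E-step ("Adjust Memberships"): recompute p(m | x, y) given r, g, m.
    pr = r.predict_proba(X)                       # p(g | x)
    joint = np.zeros((n, 4))
    for g in range(2):
        pg = gates[g].predict_proba(X)            # p(m | x, g)
        for k in range(2):
            lik = experts[2 * g + k].predict_proba(X)[np.arange(n), y]
            joint[:, 2 * g + k] = pr[:, g] * pg[:, k] * lik
    resp = joint / joint.sum(axis=1, keepdims=True)

# Final prediction p(y=1 | x) mixes the experts through the gates.
prob = sum(pr[:, g] * gates[g].predict_proba(X)[:, k] *
           experts[2 * g + k].predict_proba(X)[:, 1]
           for g in range(2) for k in range(2))
print("training accuracy:", ((prob > 0.5).astype(int) == y).mean())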

20
Comparison of Different Classification Models
  • The goal of all classifiers
  • Predicting class label y for an input x
  • Estimate p(y|x)
  • Gaussian generative model
  • p(y|x) ∝ p(x|y) p(y): posterior ∝ likelihood ×
    prior
  • p(x|y)
  • Describing the input patterns for each class y
  • Difficult to estimate if x is of high
    dimensionality
  • Naïve Bayes: p(x|y) = p(x1|y) p(x2|y) … p(xm|y)
  • Essentially a linear model
  • Linear discriminative model
  • Directly estimate p(y|x)
  • Focusing on finding the decision boundary
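To make the contrast concrete, a small sketch with made-up data (not from the slides): scikit-learn's GaussianNB fits a generative model, estimating the class prior p(y) and per-feature Gaussians for p(x|y), while LogisticRegression estimates p(y|x) and the decision boundary directly.

import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (100, 5)), rng.normal(+1, 1, (100, 5))])
y = np.r_[np.zeros(100, dtype=int), np.ones(100, dtype=int)]

gen = GaussianNB().fit(X, y)             # estimates class priors and p(x|y)
disc = LogisticRegression().fit(X, y)    # estimates the boundary / p(y|x)
print(gen.predict_proba(X[:1]), disc.predict_proba(X[:1]))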

21
Comparison of Different Classification Models
  • Logistic regression model
  • A linear decision boundary w·x + b
  • A probabilistic model p(y|x)
  • Maximum likelihood approach for estimating
    weights w and threshold b
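For reference, the standard logistic model these bullets assume, with labels y ∈ {+1, -1}:
p(y|x) = 1 / (1 + exp(-y (w·x + b)))
Maximum likelihood then chooses w and b to maximize Σi log p(yi|xi) over the training data.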

22
Comparison of Different Classification Models
  • Logistic regression model
  • Overfitting issue
  • Example: text classification
  • Every word is assigned a different weight
  • Words that appear in only one document will be
    assigned an infinitely large weight
  • Solution: regularization
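A common way to write this regularization (the slide does not specify the penalty; an L2 penalty is assumed here for illustration):
maximize Σi log p(yi|xi) - c ||w||²
where c > 0 is a regularization constant; the penalty keeps the weight of a word that appears in only one document from growing without bound.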

23
Comparison of Different Classification Models
  • Conditional exponential model
  • An extension of the logistic regression model to
    the multi-class case
  • A different set of weights wy and threshold b for
    each class y
  • Maximum entropy model
  • Finding the simplest model that matches the
    data
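For reference, the standard form of the conditional exponential (maximum entropy) model:
p(y|x) = exp(wy·x + by) / Σy' exp(wy'·x + by')
which reduces to logistic regression when there are only two classes.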

24
Comparison of Different Classification Models
  • Support vector machine
  • Classification margin
  • Maximum margin principle
  • Separate data far away from the decision boundary
  • Two objectives
  • Minimize the classification error over training
    data
  • Maximize the classification margin
  • Support vector
  • Only support vectors have an impact on the location
    of the decision boundary
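For reference (standard geometry, not written out on the slide): for a boundary w·x + b = 0 with the training points scaled so that yi(w·xi + b) ≥ 1, the margin between the two classes is 2/||w||, so maximizing the margin is equivalent to minimizing ||w||².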

[Figure: two classes of data points, labeled +1 and -1]
25
Comparison of Different Classification Models
  • Separable case
  • Noisy case

Quadratic programming!
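The two cases correspond to the usual primal formulations (reconstructed here, since the slide shows only the headings):
Separable case: minimize ½||w||² subject to yi(w·xi + b) ≥ 1 for all i.
Noisy case: minimize ½||w||² + C Σi ξi subject to yi(w·xi + b) ≥ 1 - ξi and ξi ≥ 0.
Both are quadratic programs in the model parameters.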
26
Comparison of Classification Models
  • Logistic regression model vs. support vector
    machine

27
Comparison of Different Classification Models
Logistic regression differs from the support vector
machine only in the loss function
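Concretely, with f(x) = w·x + b and y ∈ {+1, -1} (standard forms, assumed here since the slide's plot is not reproduced): logistic regression penalizes each example by log(1 + exp(-y f(x))), while the support vector machine uses the hinge loss max(0, 1 - y f(x)); both losses shrink as a point moves farther onto the correct side of the boundary.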
28
Comparison of Different Classification Models
  • Generative models have trouble at the decision
    boundary

29
Nonlinear Models
  • Kernel methods
  • Add additional dimensions to help separate data
  • Efficiently computing the dot product in a
    high-dimensional space
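A common example of such a kernel (assumed here for illustration; the slide does not name one): the RBF kernel K(x, x') = exp(-||x - x'||² / (2σ²)), which equals a dot product of the two points in an implicit high-dimensional feature space without ever constructing that space.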

Kernel method
30
Nonlinear Models
  • Decision trees
  • Nonlinearly combine different features through a
    tree structure
  • Hierarchical Mixture Expert Model
  • Replace each node with a logistic regression
    model
  • Nonlinearly combine multiple linear models