Transcript and Presenter's Notes
1
Announcements
  • Homework 4 is due this Thursday (02/27/2004)
  • Project proposal is due on 03/02

2
Unconstrained Optimization
  • Rong Jin

3
Logistic Regression
The optimization problem is to find the weights w and threshold b
that maximize this log-likelihood (shown below).
How can we do this efficiently?
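The likelihood formula on the slide was an image; the standard logistic regression log-likelihood it refers to, for labels y_i ∈ {−1, +1}, is

    \ell(w, b) = \sum_{i=1}^{n} \log \sigma\big(y_i (w^\top x_i + b)\big), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}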
4
Gradient Ascent
  • Compute the gradient of the log-likelihood
  • Increase the weights w and threshold b in the
    gradient direction (a sketch follows below)
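A minimal sketch of this update in Python with NumPy (the fixed step size eta, the averaging over examples, and all names are illustrative assumptions, not the lecture's code):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def gradient_ascent(X, y, eta=0.1, n_iters=1000):
        # X: (n, m) feature matrix; y: (n,) labels in {-1, +1}.
        n, m = X.shape
        w, b = np.zeros(m), 0.0
        for _ in range(n_iters):
            # Gradient of the average log-likelihood:
            #   d/dw mean_i log sigma(y_i (w.x_i + b))
            #     = mean_i (1 - sigma(y_i (w.x_i + b))) y_i x_i
            coef = (1.0 - sigmoid(y * (X @ w + b))) * y
            w += eta * (X.T @ coef) / n     # step along the gradient
            b += eta * coef.mean()
        return w, b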

5
Problem with Gradient Ascent
  • Difficult to find an appropriate step size
  • Small step size → slow convergence
  • Large step size → oscillation or bubbling
  • Convergence conditions
  • Robbins-Monro conditions (below)
  • Together with a regular (well-behaved) objective
    function, they ensure convergence
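The conditions shown on the slide were lost in extraction; the standard Robbins-Monro conditions on the step sizes η_t are

    \sum_{t=1}^{\infty} \eta_t = \infty, \qquad \sum_{t=1}^{\infty} \eta_t^2 < \infty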
6
Newton Method
  • Utilize the second-order derivative
  • Expand the objective function to second order
    around x0
  • The minimum point of this expansion gives the
    next iterate (see below)
  • Newton method for optimization
  • Guaranteed to converge when the objective function
    is convex
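The expansion and update on the slide were images; in standard form, for a univariate objective f expanded around x_0,

    f(x) \approx f(x_0) + f'(x_0)(x - x_0) + \tfrac{1}{2} f''(x_0)(x - x_0)^2

and setting the derivative of the right-hand side to zero gives the Newton update

    x \leftarrow x_0 - \frac{f'(x_0)}{f''(x_0)}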

7
Multivariate Newton Method
  • The objective function involves multiple variables
  • Example: the logistic regression model
  • Text categorization: thousands of words →
    thousands of variables
  • Multivariate Newton Method
  • Multivariate function
  • First-order derivative → a gradient vector
  • Second-order derivative → the Hessian matrix
  • The Hessian is an m × m matrix
  • Each element of the Hessian is defined as
    H_ij = ∂²f / (∂x_i ∂x_j)

8
Multivariate Newton Method
  • Updating equation (below)
  • Hessian matrix for the logistic regression model
    (below)
  • Can be expensive to compute
  • Example: text categorization with 10,000 words
  • The Hessian is of size 10,000 × 10,000 → 100
    million entries
  • Even worse, we have to compute the inverse of the
    Hessian, H⁻¹
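The updating equation and the Hessian on the slide were images; the standard forms are the multivariate Newton step

    x_{k+1} = x_k - H^{-1} \nabla f(x_k)

and, for the logistic regression log-likelihood above, the Hessian

    H = -\sum_i \sigma_i (1 - \sigma_i)\, x_i x_i^\top, \qquad \sigma_i = \sigma(w^\top x_i + b)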

9
Quasi-Newton Method
  • Approximate the Hessian matrix H with another
    matrix B
  • B is updated iteratively (BFGS; see below)
  • Uses the gradients of previous iterations
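The slide's update rule was an image; the standard BFGS update, with s_k = x_{k+1} − x_k and y_k = ∇f(x_{k+1}) − ∇f(x_k), is

    B_{k+1} = B_k - \frac{B_k s_k s_k^\top B_k}{s_k^\top B_k s_k} + \frac{y_k y_k^\top}{y_k^\top s_k}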

10
Limited-Memory Quasi-Newton
  • Quasi-Newton
  • Avoids computing the inverse of the Hessian matrix
  • But it still requires storing the B matrix →
    large storage
  • Limited-Memory Quasi-Newton (L-BFGS)
  • Avoids even forming the B matrix explicitly
  • B can be expressed as products of stored vectors
  • Only the most recent vector pairs (typically 3-20)
    are kept
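In practice L-BFGS is available off the shelf; a minimal sketch fitting the logistic regression objective with SciPy's implementation (the synthetic data and all names here are assumptions for illustration):

    import numpy as np
    from scipy.optimize import minimize

    def neg_log_likelihood(w, X, y):
        # Negative logistic log-likelihood; minimizing it maximizes the likelihood.
        return np.sum(np.log1p(np.exp(-y * (X @ w))))

    def neg_gradient(w, X, y):
        s = 1.0 / (1.0 + np.exp(y * (X @ w)))   # = 1 - sigma(margin)
        return -(X.T @ (s * y))

    rng = np.random.default_rng(0)
    X = np.hstack([rng.standard_normal((50, 3)), np.ones((50, 1))])  # bias column
    y = np.where(X[:, 0] > 0, 1.0, -1.0)
    res = minimize(neg_log_likelihood, np.zeros(X.shape[1]),
                   args=(X, y), jac=neg_gradient, method="L-BFGS-B")
    print(res.x)

SciPy stores only a limited history of gradient pairs, which is exactly the memory trick described above.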

11
Linear Conjugate Gradient Method
  • Consider optimizing a quadratic function
  • Conjugate vectors
  • The set of vectors p1, p2, …, pl is said to be
    conjugate with respect to a matrix A if
    pᵢᵀ A pⱼ = 0 for all i ≠ j
  • Important property
  • The quadratic function can be optimized by simply
    optimizing the function along each individual
    direction in the conjugate set
  • Optimal solution: x* = α1 p1 + … + αl pl
  • αk is the minimizer along the kth conjugate
    direction (closed form below)
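With the standard quadratic form f(x) = ½ xᵀAx − bᵀx (the slide's function itself was an image), each coefficient has a closed form: since A x* = b,

    \alpha_k = \frac{p_k^\top b}{p_k^\top A p_k}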

12
Example
  • Minimize the following quadratic function
  • Matrix A
  • Conjugate directions
  • Optimization
  • First direction: along x1 = x2, i.e., p1 = (1, 1)
  • Second direction: along x1 = −x2, i.e., p2 = (1, −1)
  • Solution: x1 = x2 = 1 (verified numerically below)
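The slide's actual function was lost in extraction; a numeric sketch with an assumed A = [[2, 1], [1, 2]] and b chosen so the minimizer is (1, 1), both consistent with the stated solution, verifies conjugacy and the two line minimizations:

    import numpy as np

    A = np.array([[2.0, 1.0], [1.0, 2.0]])    # assumed; symmetric with A11 = A22
    b = A @ np.array([1.0, 1.0])              # so the minimizer of f is (1, 1)

    p1 = np.array([1.0, 1.0])                 # direction along x1 = x2
    p2 = np.array([1.0, -1.0])                # direction along x1 = -x2
    print(p1 @ A @ p2)                        # 0.0  => conjugate w.r.t. A

    x = np.zeros(2)
    for p in (p1, p2):
        r = b - A @ x                         # residual = -gradient of f at x
        x = x + ((r @ p) / (p @ A @ p)) * p   # exact line minimization along p
    print(x)                                  # [1. 1.]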

13
How to Efficiently Find a Set of Conjugate
Directions
  • Iterative procedure
  • Given conjugate directions p1, p2, …, pk−1
  • Set pk as below
  • Theorem: the direction generated in the above
    step is conjugate to all previous directions
    p1, p2, …, pk−1, i.e., pkᵀ A pⱼ = 0 for all j < k
  • Note: computing the kth direction pk requires only
    the previous direction pk−1
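The construction on the slide was an image; in the linear-CG setting the standard form of this step is

    p_k = -g_k + \beta_k p_{k-1}, \qquad \beta_k = \frac{g_k^\top A p_{k-1}}{p_{k-1}^\top A p_{k-1}}

where g_k is the gradient at the current iterate; this choice of β_k forces p_kᵀ A p_{k−1} = 0.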

14
Nonlinear Conjugate Gradient
  • Even though conjugate gradient is derived for a
    quadratic objective function, it can be applied
    directly to other nonlinear functions
  • Several variants
  • Fletcher-Reeves conjugate gradient (FR-CG)
  • Polak-Ribiere conjugate gradient (PR-CG)
  • More robust than FR-CG
  • Compared to the Newton method
  • No need to compute the Hessian matrix
  • No need to store the Hessian matrix
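The two variants differ only in how β_k is computed from the successive gradients g_{k−1}, g_k (standard formulas; the slide's were images):

    \beta_k^{FR} = \frac{g_k^\top g_k}{g_{k-1}^\top g_{k-1}}, \qquad \beta_k^{PR} = \frac{g_k^\top (g_k - g_{k-1})}{g_{k-1}^\top g_{k-1}}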

15
Generalizing Decision Trees
Each node is a linear classifier
[Figure: a decision tree with simple data partitions vs. a decision tree using classifiers for data partition]
16
Generalized Decision Trees
  • Each node is a linear classifier
  • Pros
  • Usually results in shallow trees
  • Introduces nonlinearity into linear classifiers
    (e.g., logistic regression)
  • Overcomes overfitting through the regularization
    mechanism within the classifier
  • A better way to deal with real-valued attributes
  • Examples
  • Neural network
  • Hierarchical Mixture Expert Model

17
Example
[Figure: kernel method example]
18
Hierarchical Mixture Expert Model (HME)
  • Ask r(x) which group should be used for
    classifying input x
  • If group 1 is chosen, ask which classifier m(x)
    should be used
  • Classify input x using the chosen classifier m(x)

19
Hierarchical Mixture Expert Model
(HME)Probabilistic Description
Two hidden variables:
The hidden variable for groups: g ∈ {1, 2}
The hidden variable for classifiers: m ∈ {11, 12, 21, 22}
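In this notation the predictive distribution of the HME (the slide's equation was an image) decomposes over the hidden variables as

    p(y \mid x) = \sum_{g} r(g \mid x) \sum_{m} g_g(m \mid x)\, m(y \mid x)

where r gates the groups, g_g gates the experts within group g, and each expert m outputs a class distribution.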
20
Hierarchical Mixture Expert Model (HME)Example
r(1|x) = 3/4, r(−1|x) = 1/4; g1(1|x) = 1/4, g1(−1|x) = 3/4; g2(1|x) = 1/2, g2(−1|x) = 1/2
[Figure: HME tree annotated with these gating probabilities and the experts' outputs]
p(1|x) = ?, p(−1|x) = ?
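The experts' output probabilities appear only in the lost figure; with hypothetical values m11(1|x) = 3/4, m12(1|x) = 1/2, m21(1|x) = 1/4, m22(1|x) = 1/2, chosen purely for illustration, the mixture works out to

    p(1 \mid x) = \tfrac{3}{4}\left(\tfrac{1}{4} \cdot \tfrac{3}{4} + \tfrac{3}{4} \cdot \tfrac{1}{2}\right) + \tfrac{1}{4}\left(\tfrac{1}{2} \cdot \tfrac{1}{4} + \tfrac{1}{2} \cdot \tfrac{1}{2}\right) = \tfrac{33}{64}, \qquad p(-1 \mid x) = \tfrac{31}{64}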
21
Training HME
  • The training examples are (xi, yi)
  • They carry no information about r(x), g(x) for
    each example
  • The random variables g, m are called hidden
    variables since they are not exposed in the
    training data
  • How do we train a model with hidden variables?

22
Start with Random Guess
Training data: points 1, 2, 3, 4, 5 (one class) and 6, 7, 8, 9 (the other)
  • Random assignment
  • Randomly assign points to each group and expert
  • Learn the classifiers r(x), g(x), m(x) from the
    randomly assigned points

[Tree diagram: r(x) at the group layer routes {1, 2, 6, 7} to g1(x) and {3, 4, 5, 8, 9} to g2(x); at the expert layer, m1,1(x): {1, 6}, m1,2(x): {2, 7}, m2,1(x): {3, 9}, m2,2(x): {4, 5, 8}]
23
Adjust Group Memberships
  • The key is to assign each data point to the group
    that classifies it correctly with the largest
    probability
  • How?

Training data: points 1, 2, 3, 4, 5 (one class) and 6, 7, 8, 9 (the other)
[Tree diagram: groups g1(x): {1, 2, 6, 7}, g2(x): {3, 4, 5, 8, 9}; experts m1,1(x): {1, 6}, m1,2(x): {2, 7}, m2,1(x): {3, 9}, m2,2(x): {4, 5, 8}]
24
Adjust Group Memberships
  • The key is to assign each data point to the group
    that classifies it correctly with the largest
    confidence
  • Compute p(g=1|x, y) and p(g=2|x, y), as shown
    after the figure

Training data: points 1, 2, 3, 4, 5 (one class) and 6, 7, 8, 9 (the other)
[Tree diagram: groups g1(x): {1, 2, 6, 7}, g2(x): {3, 4, 5, 8, 9}; experts m1,1(x): {1, 6}, m1,2(x): {2, 7}, m2,1(x): {3, 9}, m2,2(x): {4, 5, 8}]
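The slide's formula for these posteriors was an image; by Bayes' rule they take the form

    p(g \mid x, y) = \frac{r(g \mid x)\, p(y \mid x, g)}{\sum_{g'} r(g' \mid x)\, p(y \mid x, g')}, \qquad p(y \mid x, g) = \sum_{m} g_g(m \mid x)\, m(y \mid x)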
25
Adjust Memberships for Classifiers
  • The key is to assign each data point to the
    classifier that classifies it correctly with the
    largest confidence
  • Compute p(m=1,1|x, y), p(m=1,2|x, y),
    p(m=2,1|x, y), and p(m=2,2|x, y)

Training data: points 1, 2, 3, 4, 5 (one class) and 6, 7, 8, 9 (the other)
[Tree diagram: updated groups g1(x): {1, 5, 6, 7}, g2(x): {2, 3, 4, 8, 9}; experts m1,1(x) through m2,2(x) awaiting reassignment]
28
Adjust Memberships for Classifiers
  • The key is to assign each data point to the
    classifier that classifies it correctly with the
    largest confidence
  • Compute p(m=1,1|x, y), p(m=1,2|x, y),
    p(m=2,1|x, y), and p(m=2,2|x, y) (formula below)

Training data: points 1, 2, 3, 4, 5 (one class) and 6, 7, 8, 9 (the other)
[Tree diagram: groups g1(x): {1, 5, 6, 7}, g2(x): {2, 3, 4, 8, 9}; updated experts m1,1(x): {1, 6}, m1,2(x): {5, 7}, m2,1(x): {2, 3, 9}, m2,2(x): {4, 8}]
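The corresponding expert posteriors (also images on the slides) follow the same Bayes-rule pattern within each group g:

    p(m \mid x, y, g) = \frac{g_g(m \mid x)\, m(y \mid x)}{\sum_{m'} g_g(m' \mid x)\, m'(y \mid x)}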
29
Retrain The Model
Training data: points 1, 2, 3, 4, 5 (one class) and 6, 7, 8, 9 (the other)
  • Retrain r(x), g(x), m(x) using the new
    memberships

[Tree diagram: groups g1(x): {1, 5, 6, 7}, g2(x): {2, 3, 4, 8, 9}; experts m1,1(x): {1, 6}, m1,2(x): {5, 7}, m2,1(x): {2, 3, 9}, m2,2(x): {4, 8}]
30
Expectation Maximization
  • Two things need to be estimated
  • Logistic regression models for r(x; θr), g(x; θg),
    and m(x; θm)
  • The unknown group memberships and expert memberships
  • i.e., p(g=1,2|x), p(m=11,12|x, g=1), p(m=21,22|x, g=2)
  • E-step (a sketch follows below)
  • Estimate p(g=1|x, y) and p(g=2|x, y) for all training
    examples, given the guessed r(x; θr), g(x; θg) and
    m(x; θm)
  • Estimate p(m=11,12|x, y) and p(m=21,22|x, y)
    for all training examples, given the guessed r(x; θr),
    g(x; θg) and m(x; θm)
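A minimal sketch of the E-step posteriors for a single example, in Python with NumPy (array shapes and names are assumptions for illustration, not the lecture's code):

    import numpy as np

    def e_step(r_probs, g_probs, m_probs):
        # r_probs: (2,)   gate over groups, r(g|x)
        # g_probs: (2, 2) gate over experts within each group, g_g(m|x)
        # m_probs: (2, 2) each expert's probability of the observed label, m(y|x)
        joint = r_probs[:, None] * g_probs * m_probs   # unnormalized p(g, m, y | x)
        joint /= joint.sum()                           # -> p(g, m | x, y)
        post_group = joint.sum(axis=1)                 # p(g | x, y)
        post_expert = joint / post_group[:, None]      # p(m | x, y, g)
        return post_group, post_expert

    # Gating values from the earlier example slide; expert outputs hypothetical:
    r = np.array([0.75, 0.25])
    g = np.array([[0.25, 0.75], [0.50, 0.50]])
    m = np.array([[0.75, 0.50], [0.25, 0.50]])
    print(e_step(r, g, m))

In the retraining step on the slides above, these posteriors become the soft membership weights for refitting r(x), g(x), and m(x).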

31
Comparison of Different Classification Models
  • The goal of all classifiers
  • Predicting the class label y for an input x
  • Estimate p(y|x)
  • Gaussian generative model
  • p(y|x) ∝ p(x|y) p(y): posterior ∝ likelihood ×
    prior
  • p(x|y)
  • Describes the input patterns for each class y
  • Difficult to estimate if x is of high
    dimensionality
  • Naïve Bayes: p(x|y) ≈ p(x1|y) p(x2|y) ⋯ p(xm|y)
  • Essentially a linear model
  • Linear discriminative model
  • Directly estimates p(y|x)
  • Focuses on finding the decision boundary

32
Comparison of Different Classification Models
  • Logistic regression model
  • A linear decision boundary: w·x + b = 0
  • A probabilistic model: p(y|x) = σ(y(w·x + b))
  • Maximum likelihood approach for estimating the
    weights w and threshold b

33
Comparison of Different Classification Models
  • Logistic regression model
  • Overfitting issue
  • Example: text classification
  • Every word is assigned its own weight
  • Words that appear in only one document will be
    driven to infinitely large weights
  • Solution: regularization (below)
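A common form of the regularized objective (the slide's own formula was not preserved) penalizes large weights:

    \max_{w, b}\; \sum_i \log p(y_i \mid x_i) - \frac{\lambda}{2}\|w\|^2

which prevents any single word's weight from growing without bound.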

34
Comparison of Different Classification Models
  • Conditional exponential model
  • An extension of the logistic regression model to
    the multi-class case
  • A different set of weights wy and threshold by for
    each class y
  • Maximum entropy model
  • Finds the simplest model that matches the data
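In standard form (the slide's equation was an image), the conditional exponential model is the softmax

    p(y \mid x) = \frac{\exp(w_y^\top x + b_y)}{\sum_{y'} \exp(w_{y'}^\top x + b_{y'})}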

35
Comparison of Different Classification Models
  • Support vector machine
  • Classification margin
  • Maximum margin principle
  • Separate the data as far as possible from the
    decision boundary
  • Two objectives
  • Minimize the classification error over the training
    data
  • Maximize the classification margin
  • Support vectors
  • Only the support vectors have an impact on the
    location of the decision boundary (formulation below)

[Figure: points of class +1 and class −1 separated by a maximum-margin boundary]
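The standard hard-margin formulation this slide refers to is the quadratic program

    \min_{w, b}\; \tfrac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i (w^\top x_i + b) \ge 1 \;\; \forall i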
36
Comparison of Different Classification Models
  • Separable case
  • Noisy case: slack variables allow some margin
    violations (below)

Quadratic programming!
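For the noisy case, the standard soft-margin form adds slack variables ξ_i that trade classification error against margin:

    \min_{w, b, \xi}\; \tfrac{1}{2}\|w\|^2 + C \sum_i \xi_i \quad \text{subject to} \quad y_i (w^\top x_i + b) \ge 1 - \xi_i,\; \xi_i \ge 0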
37
Comparison of Classification Models
  • Logistic regression model vs. support vector
    machine

38
Comparison of Different Classification Models
Logistic regression differs from support vector
machine only in the loss function
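Written in terms of the margin f(x) = w·x + b, the two losses being compared (standard forms; the slide's plot was lost) are

    \ell_{\text{logistic}}(x, y) = \log\big(1 + e^{-y f(x)}\big), \qquad \ell_{\text{hinge}}(x, y) = \max\big(0,\; 1 - y f(x)\big)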
39
Comparison of Different Classification Models
  • Generative models have trouble at the decision
    boundary

40
Nonlinear Models
  • Kernel methods
  • Add additional dimensions to help separate the data
  • Efficiently compute the dot product in the
    high-dimensional space (sketch below)

[Figure: kernel method mapping data into a higher-dimensional space]
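A small Python sketch of the dot-product trick with a degree-2 polynomial kernel (the kernel choice and numbers are illustrative assumptions; the slides do not specify one):

    import numpy as np

    def phi(x):
        # Explicit feature map whose dot product equals (x . z)^2 in 2-D:
        # phi(x) = (x1^2, x2^2, sqrt(2) x1 x2)
        return np.array([x[0]**2, x[1]**2, np.sqrt(2.0) * x[0] * x[1]])

    x, z = np.array([1.0, 2.0]), np.array([3.0, 0.5])
    print(phi(x) @ phi(z))    # dot product in the 3-D feature space: 16.0
    print((x @ z) ** 2)       # same value, computed in 2-D via the kernel: 16.0

The kernel evaluates the high-dimensional dot product without ever constructing the added dimensions.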
41
Nonlinear Models
  • Decision trees
  • Nonlinearly combine different features through a
    tree structure
  • Hierarchical Mixture Expert Model
  • Replaces each node with a logistic regression
    model
  • Nonlinearly combines multiple linear models