Title: Announcements
1. Announcements
- Homework 4 is due this Thursday (02/27/2004)
- Project proposal is due on 03/02
2. Unconstrained Optimization
3. Logistic Regression
- The optimization problem is to find the weights w and threshold b that maximize the log-likelihood (written out below)
- How can we do this efficiently?
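For reference, the log-likelihood being maximized has the standard form for labels y in {-1, +1} (the equation on the original slide was not preserved):

    l(w, b) = \sum_i \log p(y_i | x_i) = -\sum_i \log( 1 + \exp( -y_i (w \cdot x_i + b) ) )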
4. Gradient Ascent
- Compute the gradient
- Increase the weights w and threshold b in the gradient direction (a minimal sketch follows)
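A minimal sketch of this update for the logistic regression log-likelihood, assuming labels in {-1, +1} and a fixed step size eta; the function name and the toy data are illustrative, not from the slides:

    import numpy as np

    def gradient_ascent(X, y, eta=0.1, n_iters=1000):
        """Maximize the logistic regression log-likelihood by gradient ascent."""
        n, d = X.shape
        w, b = np.zeros(d), 0.0
        for _ in range(n_iters):
            z = y * (X @ w + b)              # per-example margins y_i (w.x_i + b)
            coeff = y / (1.0 + np.exp(z))    # y_i * sigma(-z_i): gradient weights
            w += eta * (X.T @ coeff) / n     # move w along the gradient
            b += eta * coeff.sum() / n       # move b along the gradient
        return w, b

    # toy usage with synthetic data
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    y = np.where(X[:, 0] + 0.5 * X[:, 1] > 0, 1.0, -1.0)
    w, b = gradient_ascent(X, y)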
5. Problem with Gradient Ascent
- Difficult to find an appropriate step size η
- Small η → slow convergence
- Large η → oscillation or bubbling
- Convergence conditions
- Robbins-Monro conditions: \sum_t \eta_t = \infty and \sum_t \eta_t^2 < \infty
- Along with a regular objective function, these conditions ensure convergence
6. Newton Method
- Utilizes the second-order derivative
- Expand the objective function to second order around x_0: f(x) \approx f(x_0) + f'(x_0)(x - x_0) + \frac{1}{2} f''(x_0)(x - x_0)^2
- The minimum point of this expansion is x = x_0 - f'(x_0) / f''(x_0)
- Newton method for optimization: iterate x \leftarrow x - f'(x) / f''(x)
- Guaranteed to converge when the objective function is convex
7. Multivariate Newton Method
- Objective function involves multiple variables
- Example: logistic regression model
- Text categorization: thousands of words → thousands of variables
- Multivariate Newton Method
- Multivariate function f(x_1, ..., x_m)
- First-order derivative → a gradient vector
- Second-order derivative → Hessian matrix
- The Hessian matrix is an m × m matrix
- Each element of the Hessian matrix is defined as H_{ij} = \partial^2 f / (\partial x_i \partial x_j)
8. Multivariate Newton Method
- Updating equation: x \leftarrow x - H^{-1} \nabla f(x)
- Hessian matrix for the logistic regression model
- Can be expensive to compute
- Example: text categorization with 10,000 words
- The Hessian matrix is of size 10,000 × 10,000 → 100 million entries
- Even worse, we have to compute the inverse of the Hessian matrix, H^{-1} (see the sketch below)
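A minimal sketch of one multivariate Newton update for the logistic regression log-likelihood (illustrative code, not from the slides; labels are assumed to be in {-1, +1} and the threshold is folded into the weight vector via a bias column). Note that it forms and solves with the full m × m Hessian, which is exactly what becomes prohibitive when m is in the thousands:

    import numpy as np

    def newton_step(Xb, y, w):
        # Xb: (n, m) inputs with a bias column appended; y: labels in {-1, +1}
        z = y * (Xb @ w)
        p = 1.0 / (1.0 + np.exp(-z))                 # sigma(y_i w.x_i)
        grad = Xb.T @ (y * (1.0 - p))                # first-order derivative: a vector
        H = -(Xb * (p * (1.0 - p))[:, None]).T @ Xb  # Hessian: an m x m matrix
        return w - np.linalg.solve(H, grad)          # x <- x - H^{-1} * gradient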
9. Quasi-Newton Method
- Approximate the Hessian matrix H with another matrix B
- B is updated iteratively (BFGS)
- Uses the derivatives (gradients) from previous iterations
10. Limited-Memory Quasi-Newton
- Quasi-Newton
- Avoids computing the inverse of the Hessian matrix
- But it still requires computing and storing the B matrix → large storage
- Limited-Memory Quasi-Newton (L-BFGS)
- Avoids even explicitly computing the B matrix
- B can be expressed as a product of vectors
- Only keep the most recent vectors (typically 3 to 20); see the usage sketch below
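A usage sketch with SciPy's L-BFGS implementation, applied to the logistic regression negative log-likelihood (the data, function names, and the choice of 10 stored vector pairs are illustrative assumptions, not from the slides):

    import numpy as np
    from scipy.optimize import minimize

    def neg_log_likelihood(w, Xb, y):
        z = y * (Xb @ w)
        return np.sum(np.log1p(np.exp(-z)))        # negative log-likelihood

    def neg_gradient(w, Xb, y):
        z = y * (Xb @ w)
        return -Xb.T @ (y / (1.0 + np.exp(z)))     # its gradient

    rng = np.random.default_rng(0)
    Xb = np.hstack([rng.normal(size=(100, 5)), np.ones((100, 1))])  # bias column
    y = np.where(Xb[:, 0] > 0, 1.0, -1.0)
    w0 = np.zeros(Xb.shape[1])

    # maxcor caps how many recent vector pairs L-BFGS keeps in memory
    res = minimize(neg_log_likelihood, w0, args=(Xb, y), jac=neg_gradient,
                   method="L-BFGS-B", options={"maxcor": 10})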
11. Linear Conjugate Gradient Method
- Consider optimizing the quadratic function f(x) = \frac{1}{2} x^T A x - b^T x
- Conjugate vectors
- The set of vectors p_1, p_2, ..., p_l is said to be conjugate with respect to a matrix A if p_i^T A p_j = 0 for all i ≠ j
- Important property
- The quadratic function can be optimized by simply optimizing the function along the individual directions in the conjugate set
- Optimal solution: x^* = \sum_k \alpha_k p_k, where \alpha_k is the minimizer along the k-th conjugate direction
12. Example
- Minimize the following quadratic function (the explicit expression was not preserved in this extraction)
- Matrix A
- Conjugate directions
- Optimization
- First direction: along x_1 = x_2
- Second direction: along x_1 = -x_2
- Solution: x_1 = x_2 = 1
13. How to Efficiently Find a Set of Conjugate Directions
- Iterative procedure
- Given conjugate directions p_1, p_2, ..., p_{k-1}
- Set p_k as follows: p_k = -\nabla f(x_k) + \beta_k p_{k-1}, with \beta_k = \frac{\nabla f(x_k)^T \nabla f(x_k)}{\nabla f(x_{k-1})^T \nabla f(x_{k-1})}
- Theorem: the direction generated in the above step is conjugate to all previous directions p_1, p_2, ..., p_{k-1}, i.e., p_k^T A p_i = 0 for i < k
- Note: computing the k-th direction p_k only requires the previous direction p_{k-1} (see the sketch below)
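A minimal sketch of the resulting linear conjugate gradient iteration for f(x) = 1/2 x^T A x - b^T x (the helper name and the small example matrix are illustrative, not from the slides):

    import numpy as np

    def linear_cg(A, b, x0, tol=1e-10, max_iters=1000):
        x = x0.astype(float).copy()
        r = A @ x - b                  # gradient of f at x
        p = -r                         # first direction: steepest descent
        for _ in range(max_iters):
            if np.linalg.norm(r) <= tol:
                break
            alpha = (r @ r) / (p @ A @ p)      # exact minimizer along direction p
            x = x + alpha * p
            r_new = r + alpha * (A @ p)        # updated gradient
            beta = (r_new @ r_new) / (r @ r)   # needs only the previous direction
            p = -r_new + beta * p              # new direction, conjugate to the previous ones
            r = r_new
        return x

    # small usage example
    A = np.array([[4.0, 1.0], [1.0, 3.0]])
    b = np.array([1.0, 2.0])
    x_star = linear_cg(A, b, np.zeros(2))      # satisfies A @ x_star ≈ b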
14. Nonlinear Conjugate Gradient
- Even though conjugate gradient is derived for a quadratic objective function, it can be applied directly to other nonlinear functions
- Several variants
- Fletcher-Reeves conjugate gradient (FR-CG)
- Polak-Ribiere conjugate gradient (PR-CG)
- More robust than FR-CG
- Compared to the Newton method
- No need to compute the Hessian matrix
- No need to store the Hessian matrix
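For example, SciPy exposes a nonlinear conjugate gradient solver (documented as a Polak-Ribiere variant) that needs only gradients, never the Hessian; here it is applied to the Rosenbrock function, a standard nonlinear test problem (illustrative, not from the slides):

    import numpy as np
    from scipy.optimize import minimize, rosen, rosen_der

    # method="CG" = nonlinear conjugate gradient; only the gradient (jac) is required
    res = minimize(rosen, np.array([1.3, 0.7, 0.8]), jac=rosen_der, method="CG")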
15. Generalizing Decision Trees
- Each node is a linear classifier
[Figure: a decision tree using classifiers for data partition vs. a decision tree with simple data partition]
16. Generalized Decision Trees
- Each node is a linear classifier
- Pros
- Usually results in shallow trees
- Introduces nonlinearity into linear classifiers (e.g., logistic regression)
- Overcomes overfitting through the regularization mechanism within the classifier
- A better way to deal with real-valued attributes
- Examples
- Neural network
- Hierarchical Mixture Expert Model
17. Example
[Figure: kernel method]
18. Hierarchical Mixture Expert Model (HME)
- Ask r(x) which group should be used for classifying input x
- If group 1 is chosen, ask which classifier m(x) should be used
- Classify input x using the chosen classifier m(x)
19. Hierarchical Mixture Expert Model (HME): Probabilistic Description
- Two hidden variables
- The hidden variable for groups: g ∈ {1, 2}
- The hidden variable for classifiers: m ∈ {11, 12, 21, 22}
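With this notation the output probability is the standard HME mixture (a reconstruction; the slide's own equation was not preserved):

    p(y | x) = \sum_{g \in \{1, 2\}} r(g | x) \sum_{m} g_g(m | x) \, p(y | x, m)

where r(·|x) gives the probability of each group, g_g(·|x) gives the probability of each classifier within group g, and p(y | x, m) is the prediction of expert m.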
20. Hierarchical Mixture Expert Model (HME): Example
- r(1|x) = 3/4, r(-1|x) = 1/4
- g1(1|x) = 1/4, g1(-1|x) = 3/4
- g2(1|x) = 1/2, g2(-1|x) = 1/2
[Figure: tree diagram with the individual expert outputs; those values were not preserved in this extraction]
- p(1|x) = ?, p(-1|x) = ?
21. Training HME
- The training examples are (x_i, y_i)
- There is no information about r(x), g(x) for each example
- The random variables g, m are called hidden variables since they are not exposed in the training data
- How do we train a model with hidden variables?
22. Start with a Random Guess
- Training points 1, 2, 3, 4, 5 and 6, 7, 8, 9
- Random assignment
- Randomly assign points to each group and expert
- Learn the classifiers r(x), g(x), m(x) using the randomly assigned points
[Diagram: r(x) at the root; group layer g1(x): {1, 2, 6, 7}, g2(x): {3, 4, 5, 8, 9}; expert layer m1,1(x): {1, 6}, m1,2(x): {2, 7}, m2,1(x): {3, 9}, m2,2(x): {4, 5, 8}]
23. Adjust Group Memberships
- The key is to assign each data point to the group that classifies it correctly with the largest probability
- How?
[Diagram: the same tree, with the current assignments g1(x): {1, 2, 6, 7}, g2(x): {3, 4, 5, 8, 9}; m1,1(x): {1, 6}, m1,2(x): {2, 7}, m2,1(x): {3, 9}, m2,2(x): {4, 5, 8}]
24. Adjust Group Memberships
- The key is to assign each data point to the group that classifies it correctly with the largest confidence
- Compute p(g=1|x, y) and p(g=2|x, y) (see the formula below)
[Diagram: the same tree, with the current assignments g1(x): {1, 2, 6, 7}, g2(x): {3, 4, 5, 8, 9}; m1,1(x): {1, 6}, m1,2(x): {2, 7}, m2,1(x): {3, 9}, m2,2(x): {4, 5, 8}]
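The group posterior follows from Bayes' rule applied to the mixture above (a standard reconstruction; the slide's own equation was not preserved):

    p(g = 1 | x, y) = \frac{ r(1 | x) \sum_{m} g_1(m | x) \, p(y | x, m) }{ p(y | x) }

and analogously for g = 2; the denominator p(y | x) is the full HME mixture.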
25. Adjust Memberships for Classifiers
- The key is to assign each data point to the classifier that classifies it correctly with the largest confidence
- Compute p(m=11|x, y), p(m=12|x, y), p(m=21|x, y), p(m=22|x, y)
[Diagram: the tree after adjusting group memberships, g1(x): {1, 5, 6, 7}, g2(x): {2, 3, 4, 8, 9}; the experts m1,1(x), m1,2(x), m2,1(x), m2,2(x) are not yet reassigned]
28. Adjust Memberships for Classifiers
- The key is to assign each data point to the classifier that classifies it correctly with the largest confidence
- Compute p(m=11|x, y), p(m=12|x, y), p(m=21|x, y), p(m=22|x, y)
[Diagram: the tree after adjusting expert memberships, g1(x): {1, 5, 6, 7}, g2(x): {2, 3, 4, 8, 9}; m1,1(x): {1, 6}, m1,2(x): {5, 7}, m2,1(x): {2, 3, 9}, m2,2(x): {4, 8}]
29. Retrain the Model
- Retrain r(x), g(x), m(x) using the new memberships
[Diagram: the retrained tree with g1(x): {1, 5, 6, 7}, g2(x): {2, 3, 4, 8, 9}; m1,1(x): {1, 6}, m1,2(x): {5, 7}, m2,1(x): {2, 3, 9}, m2,2(x): {4, 8}]
30. Expectation Maximization
- Two things to estimate
- The logistic regression models r(x; θ_r), g(x; θ_g) and m(x; θ_m)
- The unknown group memberships and expert memberships
- i.e., p(g=1,2 | x), p(m=11,12 | x, g=1), p(m=21,22 | x, g=2)
- E-step
- Estimate p(g=1 | x, y), p(g=2 | x, y) for all training examples, given the guessed r(x; θ_r), g(x; θ_g) and m(x; θ_m)
- Estimate p(m=11,12 | x, y) and p(m=21,22 | x, y) for all training examples, given the guessed r(x; θ_r), g(x; θ_g) and m(x; θ_m)
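The expert posteriors in the E-step have the same Bayes-rule form as the group posteriors (a reconstruction, not the slide's own equation), for example:

    p(m = 11 | x, y) = \frac{ r(1 | x) \, g_1(11 | x) \, p(y | x, m_{1,1}) }{ p(y | x) }

In the M-step, each logistic regression model is then refit with the training examples weighted by these posteriors.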
31. Comparison of Different Classification Models
- The goal of all classifiers
- Predicting the class label y for an input x
- Estimate p(y|x)
- Gaussian generative model
- p(y|x) ∝ p(x|y) p(y): posterior ∝ likelihood × prior
- p(x|y)
- Describes the input patterns for each class y
- Difficult to estimate if x is of high dimensionality
- Naïve Bayes: p(x|y) = p(x_1|y) p(x_2|y) ... p(x_m|y)
- Essentially a linear model
- Linear discriminative model
- Directly estimates p(y|x)
- Focuses on finding the decision boundary
32. Comparison of Different Classification Models
- Logistic regression model
- A linear decision boundary: w·x + b
- A probabilistic model for p(y|x) (see below)
- Maximum likelihood approach for estimating the weights w and threshold b
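For reference, the standard form of the model for labels y in {-1, +1} (the slide's own equation was not preserved):

    p(y | x) = \frac{1}{1 + \exp( -y (w \cdot x + b) )}

and maximum likelihood chooses w, b to maximize \sum_i \log p(y_i | x_i).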
33. Comparison of Different Classification Models
- Logistic regression model
- Overfitting issue
- Example: text classification
- Every word is assigned a different weight
- Words that appear in only one document will be assigned an infinitely large weight
- Solution: regularization (a common form is sketched below)
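One common choice (an illustration; the slide's exact regularizer was not preserved) is to add an L2 penalty to the log-likelihood:

    \max_{w, b} \; \sum_i \log p(y_i | x_i) - \frac{\lambda}{2} \lVert w \rVert^2

which keeps rarely observed words from receiving unbounded weights.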
34. Comparison of Different Classification Models
- Conditional exponential model
- An extension of the logistic regression model to the multi-class case
- A different set of weights w_y and threshold b for each class y
- Maximum entropy model
- Finds the simplest model that matches the data
35. Comparison of Different Classification Models
- Support vector machine
- Classification margin
- Maximum margin principle
- Separate the data far away from the decision boundary
- Two objectives
- Minimize the classification error over the training data
- Maximize the classification margin
- Support vectors
- Only the support vectors have an impact on the location of the decision boundary
36. Comparison of Different Classification Models
- Separable case
- Noisy case
- Quadratic programming! (standard formulations sketched below)
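The standard primal formulations (a reconstruction; the slide's own equations were not preserved):

    Separable case:  \min_{w, b} \; \frac{1}{2} \lVert w \rVert^2  \quad \text{s.t.} \quad y_i (w \cdot x_i + b) \ge 1 \;\; \forall i

    Noisy case:  \min_{w, b, \xi} \; \frac{1}{2} \lVert w \rVert^2 + C \sum_i \xi_i  \quad \text{s.t.} \quad y_i (w \cdot x_i + b) \ge 1 - \xi_i, \; \xi_i \ge 0

Both are quadratic programs.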
37. Comparison of Classification Models
- Logistic regression model vs. support vector machine
38. Comparison of Different Classification Models
- Logistic regression differs from the support vector machine only in the loss function (see below)
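Writing f(x) = w·x + b, the two per-example losses are (standard forms, not the slide's own rendering):

    Logistic regression:  \ell(y, f(x)) = \log( 1 + \exp( -y f(x) ) )

    Support vector machine (hinge loss):  \ell(y, f(x)) = \max( 0, \; 1 - y f(x) )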
39. Comparison of Different Classification Models
- Generative models have trouble at the decision boundary
40. Nonlinear Models
- Kernel methods
- Add additional dimensions to help separate the data
- Efficiently compute the dot product in the high-dimensional space (example below)
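A standard example (not from the slides): the polynomial kernel

    K(x, x') = (x \cdot x' + 1)^d

equals the dot product \Phi(x) \cdot \Phi(x') in the space of all monomials of degree up to d, without ever constructing \Phi(x) explicitly; the Gaussian kernel K(x, x') = \exp( -\lVert x - x' \rVert^2 / 2\sigma^2 ) corresponds to an infinite-dimensional feature space.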
41. Nonlinear Models
- Decision trees
- Nonlinearly combine different features through a tree structure
- Hierarchical Mixture Expert Model
- Replace each node with a logistic regression model
- Nonlinearly combine multiple linear models