Title: Announcements
1. Announcements
- Homework 4 is due this Thursday (02/27/2004)
- Project proposal is due on 03/02
2. Unconstrained Optimization
3. Logistic Regression
- The optimization problem is to find the weights w and threshold b that maximize the log-likelihood (written out below)
- How can we do this efficiently?
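For reference, the log-likelihood being maximized has the standard form for labels y in {-1, +1} (the equation on the original slide was not preserved):

    l(w, b) = \sum_i \log p(y_i | x_i) = -\sum_i \log( 1 + \exp( -y_i (w \cdot x_i + b) ) )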
4. Gradient Ascent
- Compute the gradient
- Increase the weights w and threshold b in the gradient direction (a minimal sketch follows)
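A minimal sketch of this update for the logistic regression log-likelihood, assuming labels in {-1, +1} and a fixed step size eta; the function name and the toy data are illustrative, not from the slides:

    import numpy as np

    def gradient_ascent(X, y, eta=0.1, n_iters=1000):
        """Maximize the logistic regression log-likelihood by gradient ascent."""
        n, d = X.shape
        w, b = np.zeros(d), 0.0
        for _ in range(n_iters):
            z = y * (X @ w + b)              # per-example margins y_i (w.x_i + b)
            coeff = y / (1.0 + np.exp(z))    # y_i * sigma(-z_i): gradient weights
            w += eta * (X.T @ coeff) / n     # move w along the gradient
            b += eta * coeff.sum() / n       # move b along the gradient
        return w, b

    # toy usage with synthetic data
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    y = np.where(X[:, 0] + 0.5 * X[:, 1] > 0, 1.0, -1.0)
    w, b = gradient_ascent(X, y)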
5. Problem with Gradient Ascent
- Difficult to find an appropriate step size η
- Small η → slow convergence
- Large η → oscillation or bubbling
- Convergence conditions
- Robbins-Monro conditions: \sum_t \eta_t = \infty and \sum_t \eta_t^2 < \infty
- Along with a regular objective function, these conditions ensure convergence
6. Newton Method
- Utilizes the second-order derivative
- Expand the objective function to second order around x_0: f(x) \approx f(x_0) + f'(x_0)(x - x_0) + \frac{1}{2} f''(x_0)(x - x_0)^2
- The minimum point of this expansion is x = x_0 - f'(x_0) / f''(x_0)
- Newton method for optimization: iterate x \leftarrow x - f'(x) / f''(x)
- Guaranteed to converge when the objective function is convex
7. Multivariate Newton Method
- Objective function involves multiple variables
- Example: logistic regression model
- Text categorization: thousands of words → thousands of variables
- Multivariate Newton Method
- Multivariate function f(x_1, ..., x_m)
- First-order derivative → a gradient vector
- Second-order derivative → Hessian matrix
- The Hessian matrix is an m × m matrix
- Each element of the Hessian matrix is defined as H_{ij} = \partial^2 f / (\partial x_i \partial x_j)
8. Multivariate Newton Method
- Updating equation: x \leftarrow x - H^{-1} \nabla f(x)
- Hessian matrix for the logistic regression model
- Can be expensive to compute
- Example: text categorization with 10,000 words
- The Hessian matrix is of size 10,000 × 10,000 → 100 million entries
- Even worse, we have to compute the inverse of the Hessian matrix, H^{-1} (see the sketch below)
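A minimal sketch of one multivariate Newton update for the logistic regression log-likelihood (illustrative code, not from the slides; labels are assumed to be in {-1, +1} and the threshold is folded into the weight vector via a bias column). Note that it forms and solves with the full m × m Hessian, which is exactly what becomes prohibitive when m is in the thousands:

    import numpy as np

    def newton_step(Xb, y, w):
        # Xb: (n, m) inputs with a bias column appended; y: labels in {-1, +1}
        z = y * (Xb @ w)
        p = 1.0 / (1.0 + np.exp(-z))                 # sigma(y_i w.x_i)
        grad = Xb.T @ (y * (1.0 - p))                # first-order derivative: a vector
        H = -(Xb * (p * (1.0 - p))[:, None]).T @ Xb  # Hessian: an m x m matrix
        return w - np.linalg.solve(H, grad)          # x <- x - H^{-1} * gradient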
9. Quasi-Newton Method
- Approximate the Hessian matrix H with another matrix B
- B is updated iteratively (BFGS)
- Uses the derivatives (gradients) from previous iterations
10. Limited-Memory Quasi-Newton
- Quasi-Newton
- Avoids computing the inverse of the Hessian matrix
- But it still requires computing and storing the B matrix → large storage
- Limited-Memory Quasi-Newton (L-BFGS)
- Avoids even explicitly computing the B matrix
- B can be expressed as a product of vectors
- Only keep the most recent vectors (typically 3 to 20); see the usage sketch below
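A usage sketch with SciPy's L-BFGS implementation, applied to the logistic regression negative log-likelihood (the data, function names, and the choice of 10 stored vector pairs are illustrative assumptions, not from the slides):

    import numpy as np
    from scipy.optimize import minimize

    def neg_log_likelihood(w, Xb, y):
        z = y * (Xb @ w)
        return np.sum(np.log1p(np.exp(-z)))        # negative log-likelihood

    def neg_gradient(w, Xb, y):
        z = y * (Xb @ w)
        return -Xb.T @ (y / (1.0 + np.exp(z)))     # its gradient

    rng = np.random.default_rng(0)
    Xb = np.hstack([rng.normal(size=(100, 5)), np.ones((100, 1))])  # bias column
    y = np.where(Xb[:, 0] > 0, 1.0, -1.0)
    w0 = np.zeros(Xb.shape[1])

    # maxcor caps how many recent vector pairs L-BFGS keeps in memory
    res = minimize(neg_log_likelihood, w0, args=(Xb, y), jac=neg_gradient,
                   method="L-BFGS-B", options={"maxcor": 10})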
11. Linear Conjugate Gradient Method
- Consider optimizing the quadratic function f(x) = \frac{1}{2} x^T A x - b^T x
- Conjugate vectors
- The set of vectors p_1, p_2, ..., p_l is said to be conjugate with respect to a matrix A if p_i^T A p_j = 0 for all i ≠ j
- Important property
- The quadratic function can be optimized by simply optimizing the function along the individual directions in the conjugate set
- Optimal solution: x^* = \sum_k \alpha_k p_k, where \alpha_k is the minimizer along the k-th conjugate direction
12. Example
- Minimize the following quadratic function (the explicit expression was not preserved in this extraction)
- Matrix A
- Conjugate directions
- Optimization
- First direction: along x_1 = x_2
- Second direction: along x_1 = -x_2
- Solution: x_1 = x_2 = 1
13. How to Efficiently Find a Set of Conjugate Directions
- Iterative procedure
- Given conjugate directions p_1, p_2, ..., p_{k-1}
- Set p_k as follows: p_k = -\nabla f(x_k) + \beta_k p_{k-1}, with \beta_k = \frac{\nabla f(x_k)^T \nabla f(x_k)}{\nabla f(x_{k-1})^T \nabla f(x_{k-1})}
- Theorem: the direction generated in the above step is conjugate to all previous directions p_1, p_2, ..., p_{k-1}, i.e., p_k^T A p_i = 0 for i < k
- Note: computing the k-th direction p_k only requires the previous direction p_{k-1} (see the sketch below)
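A minimal sketch of the resulting linear conjugate gradient iteration for f(x) = 1/2 x^T A x - b^T x (the helper name and the small example matrix are illustrative, not from the slides):

    import numpy as np

    def linear_cg(A, b, x0, tol=1e-10, max_iters=1000):
        x = x0.astype(float).copy()
        r = A @ x - b                  # gradient of f at x
        p = -r                         # first direction: steepest descent
        for _ in range(max_iters):
            if np.linalg.norm(r) <= tol:
                break
            alpha = (r @ r) / (p @ A @ p)      # exact minimizer along direction p
            x = x + alpha * p
            r_new = r + alpha * (A @ p)        # updated gradient
            beta = (r_new @ r_new) / (r @ r)   # needs only the previous direction
            p = -r_new + beta * p              # new direction, conjugate to the previous ones
            r = r_new
        return x

    # small usage example
    A = np.array([[4.0, 1.0], [1.0, 3.0]])
    b = np.array([1.0, 2.0])
    x_star = linear_cg(A, b, np.zeros(2))      # satisfies A @ x_star ≈ b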
14. Nonlinear Conjugate Gradient
- Even though conjugate gradient is derived for a quadratic objective function, it can be applied directly to other nonlinear functions
- Several variants
- Fletcher-Reeves conjugate gradient (FR-CG)
- Polak-Ribiere conjugate gradient (PR-CG)
- More robust than FR-CG
- Compared to the Newton method
- No need to compute the Hessian matrix
- No need to store the Hessian matrix
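For example, SciPy exposes a nonlinear conjugate gradient solver (documented as a Polak-Ribiere variant) that needs only gradients, never the Hessian; here it is applied to the Rosenbrock function, a standard nonlinear test problem (illustrative, not from the slides):

    import numpy as np
    from scipy.optimize import minimize, rosen, rosen_der

    # method="CG" = nonlinear conjugate gradient; only the gradient (jac) is required
    res = minimize(rosen, np.array([1.3, 0.7, 0.8]), jac=rosen_der, method="CG")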
15. Generalizing Decision Trees
- Each node is a linear classifier
[Figure: a decision tree using classifiers for data partition vs. a decision tree with simple data partition]
16. Generalized Decision Trees
- Each node is a linear classifier
- Pros
- Usually results in shallow trees
- Introduces nonlinearity into linear classifiers (e.g., logistic regression)
- Overcomes overfitting through the regularization mechanism within the classifier
- A better way to deal with real-valued attributes
- Examples
- Neural network
- Hierarchical Mixture Expert Model
17. Example
[Figure: kernel method]
18. Hierarchical Mixture Expert Model (HME)
- Ask r(x) which group should be used for classifying input x
- If group 1 is chosen, ask which classifier m(x) should be used
- Classify input x using the chosen classifier m(x)
19. Hierarchical Mixture Expert Model (HME): Probabilistic Description
- Two hidden variables
- The hidden variable for groups: g ∈ {1, 2}
- The hidden variable for classifiers: m ∈ {11, 12, 21, 22}
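With this notation the output probability is the standard HME mixture (a reconstruction; the slide's own equation was not preserved):

    p(y | x) = \sum_{g \in \{1, 2\}} r(g | x) \sum_{m} g_g(m | x) \, p(y | x, m)

where r(·|x) gives the probability of each group, g_g(·|x) gives the probability of each classifier within group g, and p(y | x, m) is the prediction of expert m.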
20. Hierarchical Mixture Expert Model (HME): Example
- r(1|x) = 3/4, r(-1|x) = 1/4
- g1(1|x) = 1/4, g1(-1|x) = 3/4
- g2(1|x) = 1/2, g2(-1|x) = 1/2
[Figure: tree diagram with the individual expert outputs; those values were not preserved in this extraction]
- p(1|x) = ?, p(-1|x) = ?
21. Training HME
- The training examples are (x_i, y_i)
- There is no information about r(x), g(x) for each example
- The random variables g, m are called hidden variables since they are not exposed in the training data
- How do we train a model with hidden variables?
22. Start with a Random Guess
- Training points 1, 2, 3, 4, 5 and 6, 7, 8, 9
- Random assignment
- Randomly assign points to each group and expert
- Learn the classifiers r(x), g(x), m(x) using the randomly assigned points
[Diagram: r(x) at the root; group layer g1(x): {1, 2, 6, 7}, g2(x): {3, 4, 5, 8, 9}; expert layer m1,1(x): {1, 6}, m1,2(x): {2, 7}, m2,1(x): {3, 9}, m2,2(x): {4, 5, 8}]
23. Adjust Group Memberships
- The key is to assign each data point to the group that classifies it correctly with the largest probability
- How?
[Diagram: the same tree, with the current assignments g1(x): {1, 2, 6, 7}, g2(x): {3, 4, 5, 8, 9}; m1,1(x): {1, 6}, m1,2(x): {2, 7}, m2,1(x): {3, 9}, m2,2(x): {4, 5, 8}]
24. Adjust Group Memberships
- The key is to assign each data point to the group that classifies it correctly with the largest confidence
- Compute p(g=1|x, y) and p(g=2|x, y) (see the formula below)
[Diagram: the same tree, with the current assignments g1(x): {1, 2, 6, 7}, g2(x): {3, 4, 5, 8, 9}; m1,1(x): {1, 6}, m1,2(x): {2, 7}, m2,1(x): {3, 9}, m2,2(x): {4, 5, 8}]
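The group posterior follows from Bayes' rule applied to the mixture above (a standard reconstruction; the slide's own equation was not preserved):

    p(g = 1 | x, y) = \frac{ r(1 | x) \sum_{m} g_1(m | x) \, p(y | x, m) }{ p(y | x) }

and analogously for g = 2; the denominator p(y | x) is the full HME mixture.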
25. Adjust Memberships for Classifiers
- The key is to assign each data point to the classifier that classifies it correctly with the largest confidence
- Compute p(m=11|x, y), p(m=12|x, y), p(m=21|x, y), p(m=22|x, y)
[Diagram: the tree after adjusting group memberships, g1(x): {1, 5, 6, 7}, g2(x): {2, 3, 4, 8, 9}; the experts m1,1(x), m1,2(x), m2,1(x), m2,2(x) are not yet reassigned]
28. Adjust Memberships for Classifiers
- The key is to assign each data point to the classifier that classifies it correctly with the largest confidence
- Compute p(m=11|x, y), p(m=12|x, y), p(m=21|x, y), p(m=22|x, y)
[Diagram: the tree after adjusting expert memberships, g1(x): {1, 5, 6, 7}, g2(x): {2, 3, 4, 8, 9}; m1,1(x): {1, 6}, m1,2(x): {5, 7}, m2,1(x): {2, 3, 9}, m2,2(x): {4, 8}]
29. Retrain the Model
- Retrain r(x), g(x), m(x) using the new memberships
[Diagram: the retrained tree with g1(x): {1, 5, 6, 7}, g2(x): {2, 3, 4, 8, 9}; m1,1(x): {1, 6}, m1,2(x): {5, 7}, m2,1(x): {2, 3, 9}, m2,2(x): {4, 8}]
30. Expectation Maximization
- Two things to estimate
- The logistic regression models r(x; θ_r), g(x; θ_g) and m(x; θ_m)
- The unknown group memberships and expert memberships
- i.e., p(g=1,2 | x), p(m=11,12 | x, g=1), p(m=21,22 | x, g=2)
- E-step
- Estimate p(g=1 | x, y), p(g=2 | x, y) for all training examples, given the guessed r(x; θ_r), g(x; θ_g) and m(x; θ_m)
- Estimate p(m=11,12 | x, y) and p(m=21,22 | x, y) for all training examples, given the guessed r(x; θ_r), g(x; θ_g) and m(x; θ_m)
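The expert posteriors in the E-step have the same Bayes-rule form as the group posteriors (a reconstruction, not the slide's own equation), for example:

    p(m = 11 | x, y) = \frac{ r(1 | x) \, g_1(11 | x) \, p(y | x, m_{1,1}) }{ p(y | x) }

In the M-step, each logistic regression model is then refit with the training examples weighted by these posteriors.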
31. Comparison of Different Classification Models
- The goal of all classifiers
- Predicting the class label y for an input x
- Estimate p(y|x)
- Gaussian generative model
- p(y|x) ∝ p(x|y) p(y): posterior ∝ likelihood × prior
- p(x|y)
- Describes the input patterns for each class y
- Difficult to estimate if x is of high dimensionality
- Naïve Bayes: p(x|y) = p(x_1|y) p(x_2|y) ... p(x_m|y)
- Essentially a linear model
- Linear discriminative model
- Directly estimates p(y|x)
- Focuses on finding the decision boundary
32. Comparison of Different Classification Models
- Logistic regression model
- A linear decision boundary: w·x + b
- A probabilistic model for p(y|x) (see below)
- Maximum likelihood approach for estimating the weights w and threshold b
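For reference, the standard form of the model for labels y in {-1, +1} (the slide's own equation was not preserved):

    p(y | x) = \frac{1}{1 + \exp( -y (w \cdot x + b) )}

and maximum likelihood chooses w, b to maximize \sum_i \log p(y_i | x_i).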
33. Comparison of Different Classification Models
- Logistic regression model
- Overfitting issue
- Example: text classification
- Every word is assigned a different weight
- Words that appear in only one document will be assigned an infinitely large weight
- Solution: regularization (a common form is sketched below)
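One common choice (an illustration; the slide's exact regularizer was not preserved) is to add an L2 penalty to the log-likelihood:

    \max_{w, b} \; \sum_i \log p(y_i | x_i) - \frac{\lambda}{2} \lVert w \rVert^2

which keeps rarely observed words from receiving unbounded weights.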
34. Comparison of Different Classification Models
- Conditional exponential model
- An extension of the logistic regression model to the multi-class case
- A different set of weights w_y and threshold b for each class y
- Maximum entropy model
- Finds the simplest model that matches the data
35. Comparison of Different Classification Models
- Support vector machine
- Classification margin
- Maximum margin principle
- Separate the data far away from the decision boundary
- Two objectives
- Minimize the classification error over the training data
- Maximize the classification margin
- Support vectors
- Only the support vectors have an impact on the location of the decision boundary
36. Comparison of Different Classification Models
- Separable case
- Noisy case
- Quadratic programming! (standard formulations sketched below)
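The standard primal formulations (a reconstruction; the slide's own equations were not preserved):

    Separable case:  \min_{w, b} \; \frac{1}{2} \lVert w \rVert^2  \quad \text{s.t.} \quad y_i (w \cdot x_i + b) \ge 1 \;\; \forall i

    Noisy case:  \min_{w, b, \xi} \; \frac{1}{2} \lVert w \rVert^2 + C \sum_i \xi_i  \quad \text{s.t.} \quad y_i (w \cdot x_i + b) \ge 1 - \xi_i, \; \xi_i \ge 0

Both are quadratic programs.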
37. Comparison of Classification Models
- Logistic regression model vs. support vector machine
38. Comparison of Different Classification Models
- Logistic regression differs from the support vector machine only in the loss function (see below)
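Writing f(x) = w·x + b, the two per-example losses are (standard forms, not the slide's own rendering):

    Logistic regression:  \ell(y, f(x)) = \log( 1 + \exp( -y f(x) ) )

    Support vector machine (hinge loss):  \ell(y, f(x)) = \max( 0, \; 1 - y f(x) )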
39. Comparison of Different Classification Models
- Generative models have trouble at the decision boundary
40. Nonlinear Models
- Kernel methods
- Add additional dimensions to help separate the data
- Efficiently compute the dot product in the high-dimensional space (example below)
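A standard example (not from the slides): the polynomial kernel

    K(x, x') = (x \cdot x' + 1)^d

equals the dot product \Phi(x) \cdot \Phi(x') in the space of all monomials of degree up to d, without ever constructing \Phi(x) explicitly; the Gaussian kernel K(x, x') = \exp( -\lVert x - x' \rVert^2 / 2\sigma^2 ) corresponds to an infinite-dimensional feature space.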
41. Nonlinear Models
- Decision trees
- Nonlinearly combine different features through a tree structure
- Hierarchical Mixture Expert Model
- Replace each node with a logistic regression model
- Nonlinearly combine multiple linear models