Title: CS6772 Advanced Machine Learning Fall 2006
1 / 28
CS6772 Advanced Machine Learning, Fall 2006
Extending Maximum Entropy Discrimination on Mixtures of Gaussians with Transduction
Final Project by Barry Rafkind
2 / 28
Presentation Outline
- Maximum Entropy Discrimination
- Transduction with MED / SVMs
- Application to Yeast Protein Classification
- Toy Experiments with Transduction
- Conclusions
3 / 28
Discriminative Classifier with Gaussians
The discriminative classifier is the log-likelihood ratio of two Gaussians:

$$\mathcal{L}(X;\Theta) = \log \frac{P(X \mid \theta_+)}{P(X \mid \theta_-)} + b$$

where $\theta_+$ and $\theta_-$ parameterize the Gaussian models for the positive and negative classes and $b$ is a scalar bias.
4 / 28
Discriminative Classifier with Gaussians
Specify the discriminant function by a choice of parameters $\Theta$. One approach from regularization theory would choose $\Theta$ to agree with the labels, such that

$$y_t \, \mathcal{L}(X_t;\Theta) \;\ge\; \gamma \quad \text{for } t = 1..T,$$

where $\gamma$ determines the margin, while minimizing a regularization function $R(\Theta)$.
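To make this setup concrete, here is a minimal Python sketch (my own illustration, not from the slides) that evaluates the Gaussian log-likelihood-ratio discriminant on a few toy points and checks the margin constraints; the means, covariance, bias, and margin value are all assumptions chosen for illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Toy sketch: Gaussian log-likelihood-ratio discriminant and margin checks.
mu_pos, mu_neg = np.array([2.0, 0.0]), np.array([-2.0, 0.0])  # assumed class means
cov = np.eye(2)                                               # shared identity covariance
bias = 0.0
gamma = 1.0                                                   # desired margin (assumption)

def discriminant(x):
    """L(x) = log P(x | theta+) - log P(x | theta-) + b."""
    return (multivariate_normal.logpdf(x, mu_pos, cov)
            - multivariate_normal.logpdf(x, mu_neg, cov)
            + bias)

X = np.array([[2.5, 0.3], [1.8, -0.2], [-2.2, 0.1], [-1.7, 0.4]])
y = np.array([+1, +1, -1, -1])

for x_t, y_t in zip(X, y):
    margin = y_t * discriminant(x_t)
    status = "satisfies" if margin >= gamma else "violates"
    print(x_t, f"{status} the constraint (y * L = {margin:.2f})")
```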
5 / 28
Maximum Entropy Discrimination (MED)
In MED, we solve for a distribution over solutions, $P(\Theta)$, such that the expected value of the discriminant under this distribution agrees with the labeling:

$$\int P(\Theta)\,\big[\, y_t \, \mathcal{L}(X_t;\Theta) - \gamma \,\big]\, d\Theta \;\ge\; 0 \quad \text{for } t = 1..T.$$
6 / 28
Maximum Entropy Discrimination (MED)
In addition to finding a $P(\Theta)$ that satisfies the classification constraints in expectation, MED regularizes the solution distribution $P(\Theta)$ by either maximizing its entropy or minimizing its relative entropy to some prior target distribution $P_0(\Theta)$.
7 / 28
Maximum Entropy Discrimination (MED)
Minimize the relative entropy to the prior target distribution $P_0(\Theta)$, where the relative Shannon entropy (KL divergence) is given by

$$KL\big(P \,\|\, P_0\big) = \int P(\Theta)\, \log \frac{P(\Theta)}{P_0(\Theta)}\, d\Theta.$$

Note that minimizing relative entropy is more general, since choosing a uniform $P_0(\Theta)$ recovers maximum entropy.
8 / 28
Maximum Entropy Discrimination (MED)
Thus, MED solves the constrained optimization problem

$$\min_{P(\Theta)} \; KL\big(P \,\|\, P_0\big) \quad \text{subject to} \quad \int P(\Theta)\,\big[\, y_t \, \mathcal{L}(X_t;\Theta) - \gamma_t \,\big]\, d\Theta \;\ge\; 0, \quad t = 1..T,$$

which projects the prior $P_0(\Theta)$ to the closest point $P(\Theta)$ in the admissible set (a convex hull) defined by the above $t = 1..T$ constraints.
9 / 28
Maximum Entropy Discrimination (MED)
The solution for the posterior $P(\Theta)$ has the standard maximum entropy form:

$$P(\Theta) = \frac{1}{Z(\lambda)}\, P_0(\Theta)\, \exp\!\Big( \sum_t \lambda_t \big[\, y_t \, \mathcal{L}(X_t;\Theta) - \gamma_t \,\big] \Big).$$
10 / 28
Maximum Entropy Discrimination (MED)
The solution for the posterior $P(\Theta)$ has the standard maximum entropy form shown above. The partition function $Z(\lambda)$ normalizes $P(\Theta)$. MED finds the optimal setting of the Lagrange multipliers ($\lambda_t$ for $t = 1..T$) by maximizing the concave objective function $J(\lambda) = -\ln Z(\lambda)$.
11 / 28
Maximum Entropy Discrimination (MED)
Given $\lambda$, the solution distribution $P(\Theta)$ is fully specified. We can then predict the label of a new data point $X$ via

$$\hat{y} = \operatorname{sign}\!\left( \int P(\Theta)\, \mathcal{L}(X;\Theta)\, d\Theta \right).$$
12 / 28
Maximum Entropy Discrimination (MED)
SVMs as a Special Case of MED
- Interestingly, applying MED to a ratio of Gaussians exactly reproduces support vector machines.
- We simply assume the prior distribution factorizes into a prior over the Gaussian mean parameters and a prior over the scalar bias: $P_0(\Theta) = P_0(\mu_+)\, P_0(\mu_-)\, P_0(b)$.
- The first two priors are white zero-mean Gaussians over the means, which encourages means of low magnitude for our Gaussians.
- The last prior is a non-informative (i.e. flat) prior, indicating that any scalar bias is equally probable a priori.
- The resulting objective function is, up to scaling, the familiar SVM dual:

$$J(\lambda) = \sum_t \lambda_t - \tfrac{1}{2} \sum_{t,t'} \lambda_t \lambda_{t'}\, y_t y_{t'}\, X_t^\top X_{t'}, \qquad \lambda_t \ge 0, \quad \sum_t \lambda_t y_t = 0.$$
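As a rough illustration of how MED finds the Lagrange multipliers by maximizing a concave $J(\lambda)$, here is a minimal Python sketch (my own toy code, not the MED implementation) that runs projected gradient ascent on the SVM-dual-style objective above; the synthetic data, step size, and approximate projection onto the constraints are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two Gaussian blobs with labels +1 / -1 (toy data).
X = np.vstack([rng.normal(+2, 1, size=(20, 2)),
               rng.normal(-2, 1, size=(20, 2))])
y = np.hstack([np.ones(20), -np.ones(20)])

K = X @ X.T                              # linear kernel
Q = (y[:, None] * y[None, :]) * K

lam = np.zeros(len(y))
step = 1e-3
for _ in range(2000):
    grad = 1.0 - Q @ lam                 # gradient of J = sum(lam) - 0.5 * lam' Q lam
    lam += step * grad
    lam = np.maximum(lam, 0.0)           # enforce lambda_t >= 0
    lam -= y * (lam @ y) / len(y)        # (approximate) projection onto sum_t lam_t y_t = 0
    lam = np.maximum(lam, 0.0)

J = lam.sum() - 0.5 * lam @ Q @ lam
print("J(lambda) =", J, " active multipliers:", int((lam > 1e-6).sum()))
```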
13 / 28
Maximum Entropy Discrimination (MED)
Non-Separable Cases
- To handle non-separable problems, we use a distribution over margins in the prior and posterior instead of fixing them to a constant (which amounts to a delta-function prior).
- The MED solution distribution then involves an augmented $\Theta$ that includes all margin variables: $\Theta = \{\theta, b, \gamma_1, \ldots, \gamma_T\}$.
- The formula for the partition function $Z(\lambda)$ is as above, except we now have the following factorized prior distribution: $P_0(\Theta) = P_0(\theta)\, P_0(b)\, \prod_t P_0(\gamma_t)$.
- The margin priors are chosen to favor large margins.
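For concreteness, one standard choice of margin prior from the MED literature (Jaakkola, Meila, and Jebara 1999) is the following; I am assuming this is the prior intended on the slide:

$$P_0(\gamma_t) = c\, e^{-c\,(1-\gamma_t)}, \qquad \gamma_t \le 1,$$

which, after integrating out $\gamma_t$, contributes an additive term of the form $\lambda_t + \log\!\big(1 - \lambda_t / c\big)$ to $J(\lambda)$, keeping each multiplier $\lambda_t$ below the regularization constant $c$.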
14 / 28
Maximum Entropy Discrimination (MED)
Discriminative Latent Likelihood Ratio Classifiers
Consider a discriminant that is a ratio of two mixture models:

$$\mathcal{L}(X;\Theta) = \log \frac{\sum_m \alpha_m\, P(X \mid \theta_{+,m})}{\sum_n \beta_n\, P(X \mid \theta_{-,n})} + b$$

- Computing the partition function for mixtures becomes intractable, with exponentially many terms.
- To compensate, we can use Jensen's inequality and variational methods.
- Jensen is first applied in the primal MED problem to tighten the classification constraints.
- Then, Jensen is applied to the dual MED problem to yield a tractable projection.
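As a reminder of the key step (my own summary of the standard variational argument, not text from the slides), Jensen's inequality lower-bounds the log of a mixture by introducing responsibilities $q_m$:

$$\log \sum_m \alpha_m\, P(X \mid \theta_m) \;\ge\; \sum_m q_m \log \frac{\alpha_m\, P(X \mid \theta_m)}{q_m}, \qquad \sum_m q_m = 1, \; q_m \ge 0,$$

which replaces the intractable log-sum with an expectation that keeps the partition function tractable; the bound is tight when $q_m$ equals the posterior responsibility of component $m$.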
15 / 28
Transduction with Gaussian Ratio Discriminants
Classification is Straightforward When All Labels Are Known
16 / 28
Transduction with Gaussian Ratio Discriminants
But labeling data is expensive and usually we
have many unlabeled data points that might still
be useful to us.
17 / 28
Transduction with Gaussian Ratio Discriminants
Transductive learners can take advantage of unlabeled data to better capture the distribution of each class. But how?
18 / 28
A Principled Approach to Transduction
- Uncertain labels can be handled in a principled way within the MED formalism.
- Let $y = (y_1, \ldots, y_T)$ be a set of binary variables corresponding to the labels of the training examples.
- We can define a prior uncertainty over the labels by specifying $P_0(y)$.
- For simplicity, we can take this to be a product distribution: $P_0(y) = \prod_t P_{t,0}(y_t)$.
- $P_{t,0}(y_t) = 1$ if the label is known and $1/2$ otherwise.
19 / 28
A Principled Approach to Transduction
The MED solution is found by calculating the relative entropy projection from the overall prior distribution $P_0(\Theta, y)$ to the admissible set of distributions $P(\Theta, y)$ (no longer directly a function of the labels) that are consistent with the constraints, for all $t = 1..T$:

$$\sum_{y} \int P(\Theta, y)\, \big[\, y_t\, \mathcal{L}(X_t;\Theta) - \gamma_t \,\big]\, d\Theta \;\ge\; 0.$$

A feasible solution has been proposed for this using a mean-field approximation in a two-step process.
20 / 28
Thorsten Joachims' Approach to Transductive SVMs
21 / 28
Thorsten Joachims' Approach to Transductive SVMs
- Start by training an inductive SVM on the labeled training data and classifying the unlabeled test data accordingly.
- Then uniformly increase the influence of the test examples by incrementing the cost factors C*₋ and C*₊ up to the user-defined value of C*.
- A criterion condition identifies pairs of examples for which switching the class labels leads to a decrease in the current objective function, and then swaps their labels (see the simplified sketch below).
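For intuition, here is a heavily simplified Python sketch of the outer structure of Joachims' algorithm (my own illustration built on scikit-learn's SVC; the pair-swapping criterion, the schedule for growing the test-example cost, and the safety cap are paraphrased assumptions, not the reference implementation):

```python
import numpy as np
from sklearn.svm import SVC

def transductive_svm(X_lab, y_lab, X_unl, C=10.0, max_swaps=100):
    """Simplified sketch of a Joachims-style transductive SVM (labels in {-1, +1})."""
    # Step 1: inductive SVM on the labeled data, then tentatively label the test points.
    clf = SVC(kernel="linear", C=C).fit(X_lab, y_lab)
    y_unl = clf.predict(X_unl)

    C_unl = 1e-3 * C                           # small initial influence of test examples
    while C_unl < C:
        for _ in range(max_swaps):             # bounded label-swapping loop (safety cap)
            X_all = np.vstack([X_lab, X_unl])
            y_all = np.hstack([y_lab, y_unl])
            w_all = np.hstack([np.ones(len(y_lab)),
                               np.full(len(y_unl), C_unl / C)])  # reduced cost for test points
            clf = SVC(kernel="linear", C=C).fit(X_all, y_all, sample_weight=w_all)

            # Slack of each test example under its current tentative label.
            xi = np.maximum(0.0, 1.0 - y_unl * clf.decision_function(X_unl))
            pos = [i for i in range(len(y_unl)) if y_unl[i] > 0]
            neg = [j for j in range(len(y_unl)) if y_unl[j] < 0]

            # Find a +/- pair whose combined slack suggests that swapping their
            # labels would lower the objective (criterion paraphrased from Joachims 1999).
            pair = next(((i, j) for i in pos for j in neg
                         if xi[i] > 0 and xi[j] > 0 and xi[i] + xi[j] > 2), None)
            if pair is None:
                break
            i, j = pair
            y_unl[i], y_unl[j] = -1, +1

        C_unl = min(2 * C_unl, C)              # uniformly increase the test-example cost factor

    return clf, y_unl
```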
22 / 28
Application of MED to Yeast Protein Classification
- A comparison was performed among 3 methods:
  - The latent MED approach (without transduction)
  - SVMs with single kernels
  - Semi-Definite Programming (SDP) with a stationary mixture of kernels
- Trained one-versus-all classifiers on 3 functional classes of yeast genetic data from the Noble Research Lab, University of Washington. Classes:
  - Energy
  - Interaction with Cellular Environment
  - Control of Cellular Organization
- Found that MED surpassed the performance of SVMs with single kernels, but SDP still did the best. My goal is to extend MED with transduction to try to improve its accuracy further.
23 / 28
Toy Experiments with Transduction
I have been working with Darrin Lewis in Prof. Jebara's machine learning research lab. Since he already has working MED code, we would like to extend it to incorporate transduction. Before we start changing his code, I am familiarizing myself with some simple transductive SVM algorithms on toy data.
24 / 28
Toy Experiments with Transduction
25 / 28
Toy Experiments with Transduction
- Idea: simple transductive methods should be evaluated first, if only for comparison with the more complex, principled approaches.
- A simple transduction algorithm (sketched in code below):
  - Step 1: Train on the labeled data.
  - Step 2: Test on all data (labeled + unlabeled) to get the inductive accuracy.
  - Step 3: Apply the predicted labels to the unlabeled data and retrain.
  - Step 4: Test the new classifier on all data and record the accuracy.
  - Step 5: Repeat from Step 3 for a fixed number of iterations.
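Here is a minimal Python sketch of that loop (my own illustration, not the project code), using scikit-learn's SVC; the kernel, the number of iterations, and the availability of the true labels for all points (as in a toy experiment) are assumptions.

```python
import numpy as np
from sklearn.svm import SVC

def simple_transduction(X_lab, y_lab, X_unl, y_true_all, n_iters=10):
    """Self-training loop: retrain on predicted labels and track accuracy on all data."""
    X_all = np.vstack([X_lab, X_unl])

    # Step 1: train on the labeled data only.
    clf = SVC(kernel="linear").fit(X_lab, y_lab)

    # Step 2: test on all data to get the inductive accuracy.
    acc = [np.mean(clf.predict(X_all) == y_true_all)]

    for _ in range(n_iters):
        # Step 3: apply the predicted labels to the unlabeled data and retrain.
        y_unl_pred = clf.predict(X_unl)
        clf = SVC(kernel="linear").fit(X_all, np.hstack([y_lab, y_unl_pred]))

        # Step 4: test the new classifier on all data.
        acc.append(np.mean(clf.predict(X_all) == y_true_all))
        # Step 5: repeat from Step 3 for a fixed number of iterations.

    return clf, acc
```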
26 / 28
Toy Experiments with Transduction
In this case, transduction does worse than
induction (the first observation)
27 / 28
Conclusions
- Latent Maximum Entropy Discrimination with mixtures of Gaussians can be extended with transduction by incorporating distributions over the labels.
- Transduction can sometimes be helpful for incorporating knowledge about the distribution of unlabeled data into our learning approach.
- MED is currently inferior to SDP for the protein classification task. Perhaps transduction can improve MED's results.
- Further analysis should be done on simple transductive methods for comparison with more complicated, more principled ones.
- I need more sleep. Good night!
28 / 28
References
Jebara, T., Lewis, D., and Noble, W., "Max Margin Mixture Models and Non-Stationary Kernel Selection", NIPS 2005, Columbia University.

Jaakkola, T., Meila, M., and Jebara, T., "Maximum Entropy Discrimination", Neural Information Processing Systems 12 (NIPS '99), Denver, CO, December 1999.

Joachims, T., "Transductive Inference for Text Classification using Support Vector Machines", Proceedings of the International Conference on Machine Learning (ICML), 1999.

Questions?