Title: CS6772 Advanced Machine Learning Fall 2006
1 / 28
CS6772 Advanced Machine Learning, Fall 2006
Extending Maximum Entropy Discrimination on Mixtures of Gaussians with Transduction
Final Project by Barry Rafkind
2 / 28
Presentation Outline
- Maximum Entropy Discrimination
- Transduction with MED / SVMs
- Application to Yeast Protein Classification
- Toy Experiments with Transduction
- Conclusions
3 / 28
Discriminative Classifier with Gaussians
The discriminative classifier is the log-likelihood ratio of two Gaussians:

$$\mathcal{L}(X;\Theta) = \log \frac{P(X \mid \theta_+)}{P(X \mid \theta_-)} + b$$

where $\theta_+$ and $\theta_-$ parameterize the Gaussian models for the positive and negative classes and $b$ is a scalar bias.
4 / 28
Discriminative Classifier with Gaussians
Specify the discriminant function by a choice of parameters $\Theta$. One approach from regularization theory would choose $\Theta$ to agree with the labels, such that

$$y_t \, \mathcal{L}(X_t;\Theta) \;\ge\; \gamma \quad \text{for } t = 1..T,$$

where $\gamma$ determines the margin, while minimizing a regularization function $R(\Theta)$.
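To make this setup concrete, here is a minimal Python sketch (my own illustration, not from the slides) that evaluates the Gaussian log-likelihood-ratio discriminant on a few toy points and checks the margin constraints; the means, covariance, bias, and margin value are all assumptions chosen for illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Toy sketch: Gaussian log-likelihood-ratio discriminant and margin checks.
mu_pos, mu_neg = np.array([2.0, 0.0]), np.array([-2.0, 0.0])  # assumed class means
cov = np.eye(2)                                               # shared identity covariance
bias = 0.0
gamma = 1.0                                                   # desired margin (assumption)

def discriminant(x):
    """L(x) = log P(x | theta+) - log P(x | theta-) + b."""
    return (multivariate_normal.logpdf(x, mu_pos, cov)
            - multivariate_normal.logpdf(x, mu_neg, cov)
            + bias)

X = np.array([[2.5, 0.3], [1.8, -0.2], [-2.2, 0.1], [-1.7, 0.4]])
y = np.array([+1, +1, -1, -1])

for x_t, y_t in zip(X, y):
    margin = y_t * discriminant(x_t)
    status = "satisfies" if margin >= gamma else "violates"
    print(x_t, f"{status} the constraint (y * L = {margin:.2f})")
```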
5 / 28
Maximum Entropy Discrimination (MED)
In MED, we solve for a distribution over solutions, $P(\Theta)$, such that the expected value of the discriminant under this distribution agrees with the labeling:

$$\int P(\Theta)\,\big[\, y_t \, \mathcal{L}(X_t;\Theta) - \gamma \,\big]\, d\Theta \;\ge\; 0 \quad \text{for } t = 1..T.$$
6 / 28
Maximum Entropy Discrimination (MED)
In addition to finding a $P(\Theta)$ that satisfies the classification constraints in expectation, MED regularizes the solution distribution $P(\Theta)$ by either maximizing its entropy or minimizing its relative entropy to some prior target distribution $P_0(\Theta)$.
7 / 28
Maximum Entropy Discrimination (MED)
Minimize the relative entropy to the prior target distribution $P_0(\Theta)$, where the relative Shannon entropy (KL divergence) is given by

$$KL\big(P \,\|\, P_0\big) = \int P(\Theta)\, \log \frac{P(\Theta)}{P_0(\Theta)}\, d\Theta.$$

Note that minimizing relative entropy is more general, since choosing a uniform $P_0(\Theta)$ recovers maximum entropy.
8 / 28
Maximum Entropy Discrimination (MED)
Thus, MED solves the constrained optimization problem

$$\min_{P(\Theta)} \; KL\big(P \,\|\, P_0\big) \quad \text{subject to} \quad \int P(\Theta)\,\big[\, y_t \, \mathcal{L}(X_t;\Theta) - \gamma_t \,\big]\, d\Theta \;\ge\; 0, \quad t = 1..T,$$

which projects the prior $P_0(\Theta)$ to the closest point $P(\Theta)$ in the admissible set (a convex hull) defined by the above $t = 1..T$ constraints.
9 / 28
Maximum Entropy Discrimination (MED)
The solution for the posterior $P(\Theta)$ has the standard maximum entropy form:

$$P(\Theta) = \frac{1}{Z(\lambda)}\, P_0(\Theta)\, \exp\!\Big( \sum_t \lambda_t \big[\, y_t \, \mathcal{L}(X_t;\Theta) - \gamma_t \,\big] \Big).$$
10 / 28
Maximum Entropy Discrimination (MED)
The solution for the posterior $P(\Theta)$ has the standard maximum entropy form shown above. The partition function $Z(\lambda)$ normalizes $P(\Theta)$. MED finds the optimal setting of the Lagrange multipliers ($\lambda_t$ for $t = 1..T$) by maximizing the concave objective function $J(\lambda) = -\ln Z(\lambda)$.
11 / 28
Maximum Entropy Discrimination (MED)
Given $\lambda$, the solution distribution $P(\Theta)$ is fully specified. We can then predict the label of a new data point $X$ via

$$\hat{y} = \operatorname{sign}\!\left( \int P(\Theta)\, \mathcal{L}(X;\Theta)\, d\Theta \right).$$
12 / 28
Maximum Entropy Discrimination (MED)
SVMs as a Special Case of MED
- Interestingly, applying MED to a ratio of Gaussians exactly reproduces support vector machines.
- We simply assume the prior distribution factorizes into a prior over the Gaussian mean parameters and a prior over the scalar bias: $P_0(\Theta) = P_0(\mu_+)\, P_0(\mu_-)\, P_0(b)$.
- The first two priors are white zero-mean Gaussians over the means, which encourages means of low magnitude for our Gaussians.
- The last prior is a non-informative (i.e. flat) prior, indicating that any scalar bias is equally probable a priori.
- The resulting objective function is, up to scaling, the familiar SVM dual:

$$J(\lambda) = \sum_t \lambda_t - \tfrac{1}{2} \sum_{t,t'} \lambda_t \lambda_{t'}\, y_t y_{t'}\, X_t^\top X_{t'}, \qquad \lambda_t \ge 0, \quad \sum_t \lambda_t y_t = 0.$$
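As a rough illustration of how MED finds the Lagrange multipliers by maximizing a concave $J(\lambda)$, here is a minimal Python sketch (my own toy code, not the MED implementation) that runs projected gradient ascent on the SVM-dual-style objective above; the synthetic data, step size, and approximate projection onto the constraints are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two Gaussian blobs with labels +1 / -1 (toy data).
X = np.vstack([rng.normal(+2, 1, size=(20, 2)),
               rng.normal(-2, 1, size=(20, 2))])
y = np.hstack([np.ones(20), -np.ones(20)])

K = X @ X.T                              # linear kernel
Q = (y[:, None] * y[None, :]) * K

lam = np.zeros(len(y))
step = 1e-3
for _ in range(2000):
    grad = 1.0 - Q @ lam                 # gradient of J = sum(lam) - 0.5 * lam' Q lam
    lam += step * grad
    lam = np.maximum(lam, 0.0)           # enforce lambda_t >= 0
    lam -= y * (lam @ y) / len(y)        # (approximate) projection onto sum_t lam_t y_t = 0
    lam = np.maximum(lam, 0.0)

J = lam.sum() - 0.5 * lam @ Q @ lam
print("J(lambda) =", J, " active multipliers:", int((lam > 1e-6).sum()))
```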
13 / 28
Maximum Entropy Discrimination (MED)
Non-Separable Cases
- To handle non-separable problems, we use a distribution over margins in the prior and posterior instead of fixing them to a constant (which amounts to a delta-function prior).
- The MED solution distribution then involves an augmented $\Theta$ that includes all margin variables: $\Theta = \{\theta, b, \gamma_1, \ldots, \gamma_T\}$.
- The formula for the partition function $Z(\lambda)$ is as above, except we now have the following factorized prior distribution: $P_0(\Theta) = P_0(\theta)\, P_0(b)\, \prod_t P_0(\gamma_t)$.
- The margin priors are chosen to favor large margins.
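For concreteness, one standard choice of margin prior from the MED literature (Jaakkola, Meila, and Jebara 1999) is the following; I am assuming this is the prior intended on the slide:

$$P_0(\gamma_t) = c\, e^{-c\,(1-\gamma_t)}, \qquad \gamma_t \le 1,$$

which, after integrating out $\gamma_t$, contributes an additive term of the form $\lambda_t + \log\!\big(1 - \lambda_t / c\big)$ to $J(\lambda)$, keeping each multiplier $\lambda_t$ below the regularization constant $c$.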
14 / 28
Maximum Entropy Discrimination (MED)
Discriminative Latent Likelihood Ratio Classifiers
Consider a discriminant that is a ratio of two mixture models:

$$\mathcal{L}(X;\Theta) = \log \frac{\sum_m \alpha_m\, P(X \mid \theta_{+,m})}{\sum_n \beta_n\, P(X \mid \theta_{-,n})} + b$$

- Computing the partition function for mixtures becomes intractable, with exponentially many terms.
- To compensate, we can use Jensen's inequality and variational methods.
- Jensen is first applied in the primal MED problem to tighten the classification constraints.
- Then, Jensen is applied to the dual MED problem to yield a tractable projection.
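As a reminder of the key step (my own summary of the standard variational argument, not text from the slides), Jensen's inequality lower-bounds the log of a mixture by introducing responsibilities $q_m$:

$$\log \sum_m \alpha_m\, P(X \mid \theta_m) \;\ge\; \sum_m q_m \log \frac{\alpha_m\, P(X \mid \theta_m)}{q_m}, \qquad \sum_m q_m = 1, \; q_m \ge 0,$$

which replaces the intractable log-sum with an expectation that keeps the partition function tractable; the bound is tight when $q_m$ equals the posterior responsibility of component $m$.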
15 / 28
Transduction with Gaussian Ratio Discriminants
Classification is Straightforward When All Labels Are Known
16 / 28
Transduction with Gaussian Ratio Discriminants
But labeling data is expensive and usually we
have many unlabeled data points that might still
be useful to us.
17 / 28
Transduction with Gaussian Ratio Discriminants
Transductive learners can take advantage of unlabeled data to better capture the distribution of each class. But how?
18 / 28
A Principled Approach to Transduction
- Uncertain labels can be handled in a principled way within the MED formalism.
- Let $y = (y_1, \ldots, y_T)$ be a set of binary variables corresponding to the labels of the training examples.
- We can define a prior uncertainty over the labels by specifying $P_0(y)$.
- For simplicity, we can take this to be a product distribution: $P_0(y) = \prod_t P_{t,0}(y_t)$.
- $P_{t,0}(y_t) = 1$ if the label is known and $1/2$ otherwise.
19 / 28
A Principled Approach to Transduction
The MED solution is found by calculating the relative entropy projection from the overall prior distribution $P_0(\Theta, y)$ to the admissible set of distributions $P(\Theta, y)$ (no longer directly a function of the labels) that are consistent with the constraints, for all $t = 1..T$:

$$\sum_{y} \int P(\Theta, y)\, \big[\, y_t\, \mathcal{L}(X_t;\Theta) - \gamma_t \,\big]\, d\Theta \;\ge\; 0.$$

A feasible solution has been proposed for this using a mean-field approximation in a two-step process.
20 / 28
Thorsten Joachims' Approach to Transductive SVMs
21 / 28
Thorsten Joachims' Approach to Transductive SVMs
- Start by training an inductive SVM on the labeled training data and classifying the unlabeled test data accordingly.
- Then uniformly increase the influence of the test examples by incrementing the cost factors C*₋ and C*₊ up to the user-defined value of C*.
- A criterion condition identifies pairs of examples for which switching the class labels leads to a decrease in the current objective function, and then swaps their labels (see the simplified sketch below).
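For intuition, here is a heavily simplified Python sketch of the outer structure of Joachims' algorithm (my own illustration built on scikit-learn's SVC; the pair-swapping criterion, the schedule for growing the test-example cost, and the safety cap are paraphrased assumptions, not the reference implementation):

```python
import numpy as np
from sklearn.svm import SVC

def transductive_svm(X_lab, y_lab, X_unl, C=10.0, max_swaps=100):
    """Simplified sketch of a Joachims-style transductive SVM (labels in {-1, +1})."""
    # Step 1: inductive SVM on the labeled data, then tentatively label the test points.
    clf = SVC(kernel="linear", C=C).fit(X_lab, y_lab)
    y_unl = clf.predict(X_unl)

    C_unl = 1e-3 * C                           # small initial influence of test examples
    while C_unl < C:
        for _ in range(max_swaps):             # bounded label-swapping loop (safety cap)
            X_all = np.vstack([X_lab, X_unl])
            y_all = np.hstack([y_lab, y_unl])
            w_all = np.hstack([np.ones(len(y_lab)),
                               np.full(len(y_unl), C_unl / C)])  # reduced cost for test points
            clf = SVC(kernel="linear", C=C).fit(X_all, y_all, sample_weight=w_all)

            # Slack of each test example under its current tentative label.
            xi = np.maximum(0.0, 1.0 - y_unl * clf.decision_function(X_unl))
            pos = [i for i in range(len(y_unl)) if y_unl[i] > 0]
            neg = [j for j in range(len(y_unl)) if y_unl[j] < 0]

            # Find a +/- pair whose combined slack suggests that swapping their
            # labels would lower the objective (criterion paraphrased from Joachims 1999).
            pair = next(((i, j) for i in pos for j in neg
                         if xi[i] > 0 and xi[j] > 0 and xi[i] + xi[j] > 2), None)
            if pair is None:
                break
            i, j = pair
            y_unl[i], y_unl[j] = -1, +1

        C_unl = min(2 * C_unl, C)              # uniformly increase the test-example cost factor

    return clf, y_unl
```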
22 / 28
Application of MED to Yeast Protein Classification
- A comparison was performed among 3 methods:
  - The latent MED approach (without transduction)
  - SVMs with single kernels
  - Semi-Definite Programming (SDP) with a stationary mixture of kernels
- Trained one-versus-all classifiers on 3 functional classes of yeast genetic data from the Noble Research Lab, University of Washington. Classes:
  - Energy
  - Interaction with Cellular Environment
  - Control of Cellular Organization
- Found that MED surpassed the performance of SVMs with single kernels, but SDP still did the best. My goal is to extend MED with transduction to try to improve its accuracy further.
23 / 28
Toy Experiments with Transduction
I have been working with Darrin Lewis in Prof. Jebara's machine learning research lab. Since he already has working MED code, we would like to extend it to incorporate transduction. Before we start changing his code, I am familiarizing myself with some simple transductive SVM algorithms on toy data.
24 / 28
Toy Experiments with Transduction
25 / 28
Toy Experiments with Transduction
- Idea: simple transductive methods should be evaluated first, if only for comparison with the more complex, principled approaches.
- A simple transduction algorithm (sketched in code below):
  - Step 1: Train on the labeled data.
  - Step 2: Test on all data (labeled + unlabeled) to get the inductive accuracy.
  - Step 3: Apply the predicted labels to the unlabeled data and retrain.
  - Step 4: Test the new classifier on all data and record the accuracy.
  - Step 5: Repeat from Step 3 for a fixed number of iterations.
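Here is a minimal Python sketch of that loop (my own illustration, not the project code), using scikit-learn's SVC; the kernel, the number of iterations, and the availability of the true labels for all points (as in a toy experiment) are assumptions.

```python
import numpy as np
from sklearn.svm import SVC

def simple_transduction(X_lab, y_lab, X_unl, y_true_all, n_iters=10):
    """Self-training loop: retrain on predicted labels and track accuracy on all data."""
    X_all = np.vstack([X_lab, X_unl])

    # Step 1: train on the labeled data only.
    clf = SVC(kernel="linear").fit(X_lab, y_lab)

    # Step 2: test on all data to get the inductive accuracy.
    acc = [np.mean(clf.predict(X_all) == y_true_all)]

    for _ in range(n_iters):
        # Step 3: apply the predicted labels to the unlabeled data and retrain.
        y_unl_pred = clf.predict(X_unl)
        clf = SVC(kernel="linear").fit(X_all, np.hstack([y_lab, y_unl_pred]))

        # Step 4: test the new classifier on all data.
        acc.append(np.mean(clf.predict(X_all) == y_true_all))
        # Step 5: repeat from Step 3 for a fixed number of iterations.

    return clf, acc
```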
26 / 28
Toy Experiments with Transduction
In this case, transduction does worse than
induction (the first observation)
27 / 28
Conclusions
- Latent Maximum Entropy Discrimination with mixtures of Gaussians can be extended with transduction by incorporating distributions over the labels.
- Transduction can sometimes be helpful for incorporating knowledge about the distribution of unlabeled data into our learning approach.
- MED is currently inferior to SDP for the protein classification task. Perhaps transduction can improve MED's results.
- Further analysis should be done on simple transductive methods for comparison with more complicated, more principled ones.
- I need more sleep. Good night!
28 / 28
References
Jebara, T., Lewis, D., and Noble, W., "Max Margin Mixture Models and Non-Stationary Kernel Selection", NIPS 2005, Columbia University.

Jaakkola, T., Meila, M., and Jebara, T., "Maximum Entropy Discrimination", Neural Information Processing Systems 12 (NIPS '99), Denver, CO, December 1999.

Joachims, T., "Transductive Inference for Text Classification using Support Vector Machines", Proceedings of the International Conference on Machine Learning (ICML), 1999.

Questions?