1 / 28
CS6772 Advanced Machine Learning, Fall 2006
Extending Maximum Entropy Discrimination on Mixtures of Gaussians with Transduction
Final Project by Barry Rafkind
2 / 28
Presentation Outline
  • Maximum Entropy Discrimination
  • Transduction with MED / SVMs
  • Application to Yeast Protein Classification
  • Toy Experiments with Transduction
  • Conclusions

3 / 28
Discriminative Classifier with Gaussians
Discriminative classifier: the log-likelihood of a ratio of Gaussians (sketched below).
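A minimal sketch of this discriminant, assuming the standard ratio-of-Gaussians form used in the MED literature; the means, covariances, and bias term b are my notation, not taken from the slides:

$$
\mathcal{L}(X;\Theta) \;=\; \log\frac{\mathcal{N}(X\mid\mu_{+},\Sigma_{+})}{\mathcal{N}(X\mid\mu_{-},\Sigma_{-})} \;+\; b,
\qquad \Theta=\{\mu_{+},\Sigma_{+},\mu_{-},\Sigma_{-},b\}.
$$

A point X is assigned to the positive class when the discriminant is positive.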
4 / 28
Discriminative Classifier with Gaussians
Specify the discriminant function by a choice of parameters Θ. One approach from regularization theory would choose Θ to agree with the labels, subject to margin constraints of the form sketched below, where a threshold determines the margin, while minimizing a regularization function.
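A sketch of that regularization-theory formulation, under the assumption that the slide used the usual margin constraints; γ and R(Θ) are my placeholders for the margin threshold and the regularizer:

$$
\min_{\Theta}\; R(\Theta)
\quad\text{subject to}\quad
y_t\,\mathcal{L}(X_t;\Theta)\;\ge\;\gamma,\qquad t=1,\ldots,T.
$$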
5 / 28
Maximum Entropy Discrimination (MED)
In MED, we solve for a distribution over solutions, P(Θ), such that the expected value of the discriminant under this distribution agrees with the labeling (see the constraints sketched below).
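A sketch of these expectation constraints, following the notation in Jaakkola, Meila, and Jebara's MED paper; the per-example margins γ_t are an assumption at this point (they are treated explicitly a few slides later):

$$
\int P(\Theta)\,\bigl[\,y_t\,\mathcal{L}(X_t;\Theta)-\gamma_t\,\bigr]\,d\Theta \;\ge\; 0,
\qquad t=1,\ldots,T.
$$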
6 / 28
Maximum Entropy Discrimination (MED)
In addition to finding a P(Θ) that satisfies the classification constraints in expectation, MED regularizes the solution distribution P(Θ) by either maximizing its entropy or minimizing its relative entropy toward some prior target distribution P0(Θ).
7 / 28
Maximum Entropy Discrimination (MED)
Minimize relative entropy toward some prior target distribution P0(Θ), where the relative Shannon entropy (KL divergence) is given by

$$
D\bigl(P \,\|\, P_0\bigr) \;=\; \int P(\Theta)\,\log\frac{P(\Theta)}{P_0(\Theta)}\,d\Theta.
$$

Note that minimizing relative entropy is more general, since choosing a uniform P0(Θ) recovers maximum entropy.
8 / 28
Maximum Entropy Discrimination (MED)
Thus, MED solves a constrained optimization problem (sketched below), which projects the prior P0(Θ) onto the closest point P(Θ) in the admissible set (a convex hull) defined by the t = 1..T classification constraints.
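A sketch of this projection, assuming the expectation constraints given earlier:

$$
\min_{P}\; D\bigl(P \,\|\, P_0\bigr)
\quad\text{subject to}\quad
\int P(\Theta)\,\bigl[\,y_t\,\mathcal{L}(X_t;\Theta)-\gamma_t\,\bigr]\,d\Theta \;\ge\; 0,
\qquad t=1,\ldots,T.
$$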
9 / 28
Maximum Entropy Discrimination (MED)
The solution for the posterior P(Θ) is the standard maximum entropy setting.
10 / 28
Maximum Entropy Discrimination (MED)
The solution for the posterior P(Θ) is the standard maximum entropy setting (sketched below). The partition function Z(λ) normalizes P(Θ). MED finds the optimal setting of the Lagrange multipliers (λt for t = 1..T) by maximizing the concave objective function J(λ) = − ln Z(λ).
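A sketch of this exponential-family solution, assuming the constraint form above; λ_t ≥ 0 are the Lagrange multipliers:

$$
P(\Theta) \;=\; \frac{P_0(\Theta)}{Z(\lambda)}\,
\exp\!\Bigl(\textstyle\sum_{t=1}^{T}\lambda_t\,\bigl[\,y_t\,\mathcal{L}(X_t;\Theta)-\gamma_t\,\bigr]\Bigr),
\qquad
J(\lambda) \;=\; -\ln Z(\lambda).
$$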
11 / 28
Maximum Entropy Discrimination (MED)
Given λ, the solution distribution P(Θ) is fully specified. We can then predict the label of a new data point X via the equation sketched below.
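A sketch of the prediction rule, assuming the expected discriminant is thresholded at zero (the usual choice in the MED paper):

$$
\hat{y} \;=\; \operatorname{sign}\!\left(\int P(\Theta)\,\mathcal{L}(X;\Theta)\,d\Theta\right).
$$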
12 / 28
Maximum Entropy Discrimination (MED)
SVMs as a Special Case of MED
  • Interestingly, applying MED to a ratio of
    Gaussians exactly reproduces support vector
    machines.
  • We simply assume the prior distribution factorizes into a prior over the vector parameters and a prior over the scalar bias.
  • The first two priors are white, zero-mean Gaussians over the means, which encourages means of low magnitude for our Gaussians.
  • The last prior is a non-informative (i.e. flat) prior, indicating that any scalar bias is equally probable a priori.
  • The resulting objective function is sketched below.
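A sketch of what this objective looks like in the separable, linear case, obtained by plugging the Gaussian and flat priors into J(λ) = −ln Z(λ); the equality constraint comes from integrating out the flat bias prior. The exact expression on the original slide may differ (for example, it may already include the margin-prior terms introduced on the next slide):

$$
J(\lambda) \;=\; \sum_{t}\lambda_t\gamma_t \;-\; \tfrac{1}{2}\sum_{t,t'}\lambda_t\lambda_{t'}\,y_t\,y_{t'}\,X_t^{\top}X_{t'},
\qquad
\sum_{t}\lambda_t\,y_t = 0,\quad \lambda_t \ge 0.
$$

In this form the objective matches the SVM dual, which is why MED applied to a ratio of Gaussians reproduces support vector machines.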

13 / 28
Maximum Entropy Discrimination (MED)
Non-Separable Cases
  • To work with non-separable problems, we use a
    distribution over margins in the prior and
    posterior instead of simply setting them equal to
    a constant (which is like using a delta-function
    prior)
  • The MED solution distribution then involves an augmented Θ that includes all of the margin variables.
  • The formula for the partition function Z(λ) is as above, except that we now have a factorized prior distribution over the parameters and the margins.
  • The margin priors are chosen to favor large margins (a sketch follows below).
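A sketch of such a factorized prior, assuming the exponential margin prior used in the MED paper; the constant c, which controls how strongly margin violations are penalized, is an assumption here rather than something read off the slide:

$$
P_0(\Theta,\gamma) \;=\; P_0(\Theta)\prod_{t=1}^{T}P_0(\gamma_t),
\qquad
P_0(\gamma_t) \;=\; c\,e^{-c\,(1-\gamma_t)}\quad\text{for }\gamma_t\le 1.
$$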

14 / 28
Maximum Entropy Discrimination (MED)
Discriminative Latent Likelihood Ratio Classifiers
Consider a discriminant that is a ratio of two mixture models (sketched below).
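A sketch of such a latent discriminant, assuming Gaussian mixture components; the mixture weights α_m, β_n and the component parameters are my notation, not the slide's:

$$
\mathcal{L}(X;\Theta) \;=\;
\log\frac{\sum_{m}\alpha_m\,\mathcal{N}(X\mid\mu_m,\Sigma_m)}
         {\sum_{n}\beta_n\,\mathcal{N}(X\mid\nu_n,\Omega_n)} \;+\; b.
$$

The sums over latent components inside the logarithm are what make the partition function intractable, which motivates the Jensen and variational bounds below.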
  • Computing the partition function for mixtures
    becomes intractable with exponentially many
    terms.
  • To compensate, we can use Jensen's inequality
    and variational methods.
  • Jensen is first applied in the primal MED
    problem to tighten the classification
    constraints.
  • Then, Jensen is applied to the dual MED problem
    to yield a tractable projection.

15 / 28
Transduction with Gaussian Ratio Discriminants
Classification is Straightforward When All Labels Are Known
16 / 28
Transduction with Gaussian Ratio Discriminants
But labeling data is expensive, and usually we have many unlabeled data points that might still be useful to us.
17 / 28
Transduction with Gaussian Ratio Discriminants
Transductive learners can take advantage of unlabeled data to capture the distribution of each class better. But how?
18 / 28
A Principled Approach to Transduction
  • Uncertain labels can be handled in a principled way within the MED formalism.
  • Let y = (y1, ..., yT) be a set of binary variables corresponding to the labels of the training examples.
  • We can define a prior uncertainty over the labels by specifying P0(y).
  • For simplicity, we can take this to be a product distribution (sketched below).
  • Pt,0(yt) = 1 if the label is known and 1/2 otherwise.
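A sketch of that product prior; the case notation below is my paraphrase of the rule in the last bullet:

$$
P_0(y) \;=\; \prod_{t=1}^{T}P_{t,0}(y_t),
\qquad
P_{t,0}(y_t) \;=\;
\begin{cases}
1, & \text{example } t \text{ is labeled and } y_t \text{ equals its label},\\
\tfrac{1}{2}, & \text{example } t \text{ is unlabeled}.
\end{cases}
$$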

19 / 28
A Principled Approach to Transduction
The MED solution is found by calculating the relative entropy projection from the overall prior distribution to the admissible set of distributions P (no longer directly a function of the labels) that are consistent with the constraints for all t = 1..T. A feasible solution has been proposed for this using a mean-field approximation in a two-step process.
20 / 28
Thorsten Joachims' Approach to Transductive SVMs
21 / 28
Thorsten Joachims' Approach to Transductive SVMs
  • Start by training an inductive SVM on the labeled training data and classifying the unlabeled test data accordingly.
  • Then uniformly increase the influence of the test examples by incrementing the cost factors C− and C+ up to the user-defined value of C.
  • A criterion identifies pairs of test examples for which switching the class labels leads to a decrease in the current objective function, and then switches their labels.
  • A simplified sketch of this loop appears below.

22 / 28
Application of MED to Yeast Protein Classification
  • A comparison was performed among three methods:
    • the latent MED approach (without transduction),
    • SVMs with single kernels,
    • Semi-Definite Programming (SDP) with a stationary mixture of kernels.
  • Trained one-versus-all classifiers on three functional classes of yeast genetic data from the Noble Research Lab, University of Washington:
    • Energy,
    • Interaction with Cellular Environment,
    • Control of Cellular Organization.
  • Found that MED surpassed the performance of SVMs with single kernels, but SDP still did the best. My goal is to extend MED with transduction to try to improve its accuracy further.

23 / 28
Toy Experiments with Transduction
I have been working with Darrin Lewis in Prof. Jebara's Machine Learning research lab. Since he already has working MED code, we would like to extend it to incorporate transduction. Before we start changing his code, I am familiarizing myself with some simple transductive SVM algorithms on toy data.
24 / 28
Toy Experiments with Transduction
25 / 28
Toy Experiments with Transduction
  • Idea: simple transductive methods should be evaluated first, if only for comparison with the more complex, principled approaches.
  • A simple transduction algorithm (sketched in code below):
  • Step 1: Train on labeled data.
  • Step 2: Test on all data (labeled + unlabeled) to get the inductive accuracy.
  • Step 3: Apply predicted labels to the unlabeled data and retrain.
  • Step 4: Test the new classifier on all data and find its accuracy.
  • Step 5: Repeat from Step 3 for a certain number of iterations.
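A minimal sketch of this self-training loop, assuming a scikit-learn SVM on toy data with known ground-truth labels for evaluation; the function name and parameters are illustrative, not taken from the project's actual code:

```python
# Self-training ("simple transduction") sketch following Steps 1-5 above.
import numpy as np
from sklearn.svm import SVC

def simple_transduction(X_lab, y_lab, X_unl, y_true_all, n_iters=10):
    # y_true_all: ground-truth labels for [labeled; unlabeled] points, in that order,
    # used only to measure accuracy on the toy data.
    X_all = np.vstack([X_lab, X_unl])

    # Step 1: train on the labeled data only.
    clf = SVC(kernel="rbf").fit(X_lab, y_lab)

    # Step 2: test on all data (labeled + unlabeled) for the inductive accuracy.
    accuracies = [clf.score(X_all, y_true_all)]

    for _ in range(n_iters):
        # Step 3: apply predicted labels to the unlabeled data and retrain.
        y_unl_pred = clf.predict(X_unl)
        clf = SVC(kernel="rbf").fit(X_all, np.r_[y_lab, y_unl_pred])

        # Step 4: test the new classifier on all data and record its accuracy.
        accuracies.append(clf.score(X_all, y_true_all))
        # Step 5: repeat for a fixed number of iterations.
    return clf, accuracies
```

A known failure mode of this kind of loop is that early labeling mistakes get reinforced on later iterations, which is one way transduction can end up worse than induction, as on the next slide.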

26 / 28
Toy Experiments with Transduction
In this case, transduction does worse than induction (the first observation).
27 / 28
Conclusions
  • Latent Maximum Entropy Discrimination with Mixtures of Gaussians can be extended with transduction by incorporating distributions over the labels.
  • Transduction can sometimes be helpful for incorporating knowledge about the distribution of unlabeled data into our learning approach.
  • MED is currently inferior to SDP for the protein classification task. Perhaps transduction can improve MED's results.
  • Further analysis should be done on simple transductive methods for comparison with more complicated, more principled ones.
  • I need more sleep. Good night!

28 / 28
References
  • Jebara, T., Lewis, D., and Noble, W., "Max Margin Mixture Models and Non-Stationary Kernel Selection", NIPS 2005, Columbia University.
  • Jaakkola, T., Meila, M., and Jebara, T., "Maximum Entropy Discrimination", Neural Information Processing Systems 12 (NIPS '99), Denver, CO, December 1999.
  • Joachims, T., "Transductive Inference for Text Classification using Support Vector Machines", Proceedings of the International Conference on Machine Learning (ICML), 1999.

Questions?