Title: An Overview on Semi-Supervised Learning Methods
1. An Overview on Semi-Supervised Learning Methods
- Matthias Seeger, MPI for Biological Cybernetics
- Tuebingen, Germany
2. Overview
- The SSL Problem
- Paradigms for SSL. Examples
- The Importance of Input-Dependent Regularization
- Note: Citations are omitted here (they are given in my literature review)
3. Semi-Supervised Learning
- SSL is Supervised Learning...
- Goal: Estimate P(y|x) from labeled data D_l = {(x_i, y_i)}
- But: An additional source tells us about P(x) (e.g., unlabeled data D_u = {x_j})
The interesting case!
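In symbols (this notation is assumed here rather than spelled out on the slide), the two data sources are

    D_l = \{(x_i, y_i)\}_{i=1}^{n}, \qquad D_u = \{x_j\}_{j=n+1}^{n+m},

and the goal is to estimate P(y|x) using both.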
4. Obvious Baseline Methods
The goal of SSL is to do better than these baselines. Not uniformly and always (no free lunch, and yes, of course, unlabeled data can hurt), but, as always, when our modelling and algorithmic efforts reflect the true problem characteristics.
- Do not use info about P(x) → plain supervised learning
- Fit a mixture model using unsupervised learning, then label up the components using the y_i (a sketch follows below)
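A minimal sketch of the second baseline, assuming scikit-learn's GaussianMixture and integer class labels in y_l; the function and variable names are illustrative, not from the talk:

    # Baseline: unsupervised mixture fit, then label up components with the labeled data.
    import numpy as np
    from sklearn.mixture import GaussianMixture

    def mixture_baseline(X_l, y_l, X_u, n_components):
        # Fit the mixture on ALL inputs; the labels are ignored at this stage.
        gmm = GaussianMixture(n_components=n_components, random_state=0)
        gmm.fit(np.vstack([X_l, X_u]))
        # Label up each component by a majority vote of the labeled points it claims.
        comp_of_labeled = gmm.predict(X_l)
        comp_label = {}
        for k in range(n_components):
            members = y_l[comp_of_labeled == k]
            comp_label[k] = np.bincount(members).argmax() if len(members) else -1
        # Predict by mapping a point to its most likely component's label.
        return lambda X: np.array([comp_label[k] for k in gmm.predict(X)])

The majority vote is only one convention for labelling up components; the point is that the mixture itself is fit without looking at the labels.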
5. The Generative Paradigm
- Model the class distributions P(x|y) and the class prior P(y)
- This implies a model for P(y|x) and for P(x), via Bayes' rule (see below)
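The formulas themselves did not survive into this transcript; the standard relations meant here follow from Bayes' rule:

    P(x) = \sum_y P(y)\, P(x \mid y), \qquad
    P(y \mid x) = \frac{P(y)\, P(x \mid y)}{P(x)}.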
6. The Joint Likelihood
- Natural criterion in this context (a common form is sketched below)
- Maximize using EM (an idea as old as EM)
- Early and recent theoretical work on asymptotic variance
- Advantage: Easy to implement for standard mixture model setups
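The criterion is not reproduced in this transcript; a common form of the weighted joint log-likelihood, with the source weighting λ that slide 7 refers to, is

    \sum_{i=1}^{n} \log P(x_i, y_i \mid \theta)
    \;+\; \lambda \sum_{j > n} \log P(x_j \mid \theta),
    \qquad P(x_j \mid \theta) = \sum_{y} P(x_j, y \mid \theta),

maximized over θ by EM, treating the missing labels of the unlabeled points as latent variables.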
7. Drawbacks of Generative SSL
- Choice of the source weighting λ is crucial
  - Cross-validation fails for small n
  - Homotopy continuation (Corduneanu et al.)
- Just like in supervised learning:
  - Model for P(y|x) is specified only indirectly
  - Fitting is not primarily concerned with P(y|x)
- Also have to represent P(x) generally well, not just the aspects which help with P(y|x)
8. The Diagnostic Paradigm
- Model P(y|x, θ) and P(x|μ) directly
- But: Since θ and μ are independent a priori, θ does not depend on μ given the data → knowledge of μ does not influence the P(y|x) prediction in a probabilistic setup! (A short derivation is sketched below.)
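A one-line way to see this, assuming the diagnostic factorization of the likelihood and the independent prior P(θ)P(μ):

    P(\theta, \mu \mid D_l, D_u) \;\propto\;
    \Big[\, P(\theta) \prod_{i} P(y_i \mid x_i, \theta) \Big]
    \Big[\, P(\mu) \prod_{x \in D_l \cup D_u} P(x \mid \mu) \Big],

so the posterior factorizes, and the marginal posterior over θ never sees the unlabeled x_j.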
9. What To Do About It
- Non-probabilistic diagnostic techniques
- Replace the expected loss by a different criterion (Tong, Koller; Chapelle et al.) → very limited effect if n is small
- Some old work (e.g., Anderson)
- Drop the prior independence of θ and μ → input-dependent regularization
10. Input-Dependent Regularization
- Conditional priors P(θ|μ) make the P(y|x) estimation dependent on P(x) (a sketch follows below)
- Now, unlabeled data can really help...
- And can hurt for the same reason!
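Under the same factorized likelihood but with a conditional prior P(θ|μ), marginalizing out μ gives (up to a constant)

    P(\theta \mid D_l, D_u) \;\propto\;
    \prod_{i} P(y_i \mid x_i, \theta)
    \int P(\theta \mid \mu)\, P(\mu \mid \text{all } x)\, d\mu,

so whatever the inputs (labeled and unlabeled) say about μ now flows into the effective prior over θ.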
11. The Cluster Assumption (CA)
- Empirical observation: Clustering of the data x_j w.r.t. a sensible distance / features is often fairly compatible with the class regions
- Weaker: Class regions do not tend to cut high-volume regions of P(x)
- Why? Ask philosophers! My guess: selection bias for features / distance
No matter why: Many SSL methods implement the CA and work fine in practice
12. Examples for IDR Using the CA
- Label propagation, Gaussian random fields: Regularization depends on the graph structure, which is built from all x_j → more smoothness in regions of high connectivity / affinity flows (a sketch follows this list)
- Cluster kernels for SVM (Chapelle et al.)
- Information regularization (Corduneanu, Jaakkola)
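A minimal sketch of the label-propagation idea, using the harmonic solution on an RBF affinity graph as in Zhu et al.'s Gaussian random fields; the affinity choice, the bandwidth sigma, and all names are illustrative:

    # Label propagation: harmonic solution on an RBF affinity graph.
    # X holds all inputs with the n_l labeled points first; y_l in {0, ..., C-1}.
    import numpy as np

    def label_propagation(X, y_l, sigma=1.0):
        n_l = len(y_l)
        # Affinity graph over ALL points -- this is where the P(x) information enters.
        d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        W = np.exp(-d2 / (2.0 * sigma ** 2))
        np.fill_diagonal(W, 0.0)
        L = np.diag(W.sum(axis=1)) - W          # graph Laplacian
        # Harmonic solution: solve L_uu f_u = W_ul f_l for the unlabeled block.
        f_l = np.eye(y_l.max() + 1)[y_l]        # one-hot encoding of the labels
        f_u = np.linalg.solve(L[n_l:, n_l:], W[n_l:, :n_l] @ f_l)
        return f_u.argmax(axis=1)               # hard labels for the unlabeled points

The unlabeled points enter only through the graph W, which is exactly where the P(x) information, and hence the cluster assumption, gets encoded.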
13. More Examples for IDR
- Some methods do IDR, but implement the CA only in special cases
- Fisher kernels (Jaakkola et al.): Kernel built from Fisher features → automatic feature induction from a P(x) model
- Co-training (Blum, Mitchell): Consistency across different views (feature sets); a minimal sketch follows this list
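A simplified co-training sketch in the spirit of Blum and Mitchell, assuming two feature views and any scikit-learn classifier with predict_proba; the shared-pool bookkeeping and all names are illustrative simplifications, not the original algorithm:

    # Co-training: two classifiers, one per feature view, teach each other
    # by exchanging confident pseudo-labels on a shared unlabeled pool.
    import numpy as np
    from sklearn.base import clone

    def co_train(clf, X1_l, X2_l, y_l, X1_u, X2_u, rounds=10, add_per_round=5):
        c1, c2 = clone(clf), clone(clf)
        X1, X2, y = X1_l.copy(), X2_l.copy(), y_l.copy()
        U1, U2 = X1_u.copy(), X2_u.copy()
        for _ in range(rounds):
            if len(U1) == 0:
                break
            c1.fit(X1, y)
            c2.fit(X2, y)
            p1, p2 = c1.predict_proba(U1), c2.predict_proba(U2)
            # Pick the pool points either view is most confident about...
            top = np.argsort(np.maximum(p1.max(axis=1), p2.max(axis=1)))[-add_per_round:]
            # ...and pseudo-label each with the more confident view's guess.
            pseudo = np.where(p1[top].max(axis=1) >= p2[top].max(axis=1),
                              c1.classes_[p1[top].argmax(axis=1)],
                              c2.classes_[p2[top].argmax(axis=1)])
            X1, X2 = np.vstack([X1, U1[top]]), np.vstack([X2, U2[top]])
            y = np.concatenate([y, pseudo])
            U1, U2 = np.delete(U1, top, axis=0), np.delete(U2, top, axis=0)
        return c1, c2

In the original recipe each classifier adds its own most confident positives and negatives to a shared labeled set; the consistency-across-views idea is the same.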
14. Is SSL Always Generative?
- Wait: We have to model P(x) somehow. Is this not always generative then? ... No!
- Generative: Models P(x|y) fairly directly; the P(y|x) model and the effect of P(x) are implicit
- Diagnostic IDR:
  - Direct model for P(y|x), more flexibility
  - Influence of the P(x) knowledge on the P(y|x) prediction is directly controlled, e.g. through the CA → the model for P(x) can be much less elaborate
15. Conclusions
- Gave a taxonomy for probabilistic approaches to SSL
- Illustrated the paradigms with examples from the literature
- Tried to clarify some points which have led to confusion in the past