Title: Semi-supervised Learning
1. Semi-supervised Learning
2. Overview
- Introduction to SSL Problem
- SSL Algorithms
3. Why SSL?
- Data labeling is expensive and difficult
- Labeling is often unreliable
- Unlabeled examples
- Easy to obtain in large numbers
- e.g. webpage classification, bioinformatics,
image classification
4. Notations (classification)
- input instance x, label y
- estimate p(y | x)
- labeled data (x_{1:l}, y_{1:l})
- unlabeled data x_{l+1:n} (u = n − l points), available during training (an additional source that tells us about p(x))
- usually l << u
- test data x_test, not available during training
5. SSL vs. Transductive Learning
- Semi-supervised learning is ultimately applied to the test data (inductive).
- Transductive learning is only concerned with the unlabeled data.
6. Glossary
- supervised learning (classification, regression)
- (x_{1:n}, y_{1:n})
- semi-supervised classification/regression
- (x_{1:l}, y_{1:l}), x_{l+1:n}, x_test
- transductive classification/regression
- (x_{1:l}, y_{1:l}), x_{l+1:n}
- semi-supervised clustering
- x_{1:n}, must-links, cannot-links
- unsupervised learning (clustering)
- x_{1:n}
7. Are unlabeled samples useful?
- In general yes, but not always (discussed later)
- Classification error decreases
- exponentially with the number of labeled examples
- linearly with the number of unlabeled examples
8. SSL Algorithms
- Self-Training
- Generative Models
- S3VMs
- Graph-Based Algorithms
- Co-training
- Multiview algorithms
9. Self-Training
- Assumption
- One's own high-confidence predictions are correct.
- Self-training algorithm (a sketch follows this list)
- Train f from (X_l, Y_l)
- Predict on x ∈ X_u
- Add (x, f(x)) to the labeled data
- Add all pairs
- Add only a few most confident pairs
- Add all pairs, each weighted by its confidence
- Repeat
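Below is a minimal sketch of this loop, assuming a scikit-learn-style base classifier with predict_proba and 2-D feature matrices; the names base_clf, X_l, y_l, X_u and the choice of adding the k most confident points per round are illustrative, not part of the slides.

```python
import numpy as np

def self_train(base_clf, X_l, y_l, X_u, k=10, max_iter=20):
    """Self-training: repeatedly add the k most confident predictions
    on unlabeled data to the labeled set and retrain."""
    X_l, y_l, X_u = np.array(X_l), np.array(y_l), np.array(X_u)
    for _ in range(max_iter):
        if len(X_u) == 0:
            break
        base_clf.fit(X_l, y_l)                      # train f on current labeled set
        proba = base_clf.predict_proba(X_u)         # confidence of each prediction
        top = np.argsort(-proba.max(axis=1))[:k]    # k most confident unlabeled points
        y_new = base_clf.classes_[proba[top].argmax(axis=1)]
        X_l = np.vstack([X_l, X_u[top]])            # add (x, f(x)) to labeled data
        y_l = np.concatenate([y_l, y_new])
        X_u = np.delete(X_u, top, axis=0)           # shrink the unlabeled pool
    base_clf.fit(X_l, y_l)
    return base_clf
```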
10. Advantages of Self-Training
- The simplest semi-supervised learning method.
- A wrapper method: applies to existing classifiers.
- Often used in real tasks like natural language processing.
11. Disadvantages of Self-Training
- Early mistakes can reinforce themselves.
- Heuristic fixes exist, e.g. down-weighting the added points or adding only the most confident ones.
- Little can be said about convergence in general.
13. Generative Models
- Assuming each class has a Gaussian distribution, what is the decision boundary?
14. Decision boundary (figure: boundary estimated from the labeled data alone)
15. Adding unlabeled data (figure: the same data with unlabeled points added)
16. The new decision boundary (figure: boundary estimated from labeled and unlabeled data)
17. They are different because
- The two boundaries maximize different quantities: p(X_l, Y_l | θ) with labeled data alone versus p(X_l, Y_l, X_u | θ) with the unlabeled data included.
18. Basic idea
- If we have the full generative model p(x, y | θ)
- quantity of interest: p(y | x, θ) (spelled out below)
- find the maximum likelihood estimate (MLE) of θ, the maximum a posteriori (MAP) estimate, or be Bayesian
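For reference, the quantity of interest follows from Bayes' rule applied to the generative model (a standard identity in this notation):

```latex
p(y \mid x, \theta)
  = \frac{p(x, y \mid \theta)}{\sum_{y'} p(x, y' \mid \theta)}
  = \frac{p(y \mid \theta)\, p(x \mid y, \theta)}{\sum_{y'} p(y' \mid \theta)\, p(x \mid y', \theta)}
```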
19. Some generative models
- Mixture of Gaussian distributions (GMM)
- image classification
- the EM algorithm
- Mixture of multinomial distributions
- text categorization
- the EM algorithm
- Hidden Markov Models (HMM)
- speech recognition
- Baum-Welch algorithm
20. Example: GMM
- For simplicity, consider binary classification with a GMM using MLE.
- Model parameters: θ = {w_1, w_2, μ_1, μ_2, Σ_1, Σ_2}
- So p(x, y | θ) = p(y | θ) p(x | y, θ) = w_y N(x; μ_y, Σ_y)
- To estimate θ, we maximize the labeled-data log-likelihood log p(X_l, Y_l | θ) = Σ_{i=1}^{l} log w_{y_i} N(x_i; μ_{y_i}, Σ_{y_i})
- Then we have a closed-form MLE (written out below)
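The closed-form MLE consists of the usual per-class sample statistics (the standard supervised estimates for this model):

```latex
\hat{w}_k = \frac{l_k}{l}, \qquad
\hat{\mu}_k = \frac{1}{l_k} \sum_{i\,:\,y_i = k} x_i, \qquad
\hat{\Sigma}_k = \frac{1}{l_k} \sum_{i\,:\,y_i = k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^{\top},
\quad \text{where } l_k = |\{\, i \le l : y_i = k \,\}|
```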
21. Continued
- Now that we have θ̂, predict y by maximum a posteriori: y* = argmax_y p(y | x, θ̂) = argmax_y w_y N(x; μ_y, Σ_y)
22. What about SSGMM?
- To estimate θ, we now maximize p(X_l, Y_l, X_u | θ) (written out below)
- More complicated? Yes: each unlabeled point contributes a mixture of the two normal distributions, with a sum inside the log.
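Under the GMM above, the quantity being maximized can be written as follows; the second sum, over the unlabeled points, has a sum inside the log, which is what makes the problem harder:

```latex
\log p(X_l, Y_l, X_u \mid \theta)
  = \sum_{i=1}^{l} \log\!\big( w_{y_i}\, \mathcal{N}(x_i; \mu_{y_i}, \Sigma_{y_i}) \big)
  + \sum_{j=l+1}^{n} \log \sum_{y=1}^{2} w_{y}\, \mathcal{N}(x_j; \mu_{y}, \Sigma_{y})
```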
23. A more complicated case
- For simplicity, consider a mixture of two normal distributions.
- Model parameters: θ = (π, μ_1, σ_1², μ_2, σ_2²)
- So the density is p(x | θ) = (1 − π) N(x; μ_1, σ_1²) + π N(x; μ_2, σ_2²)
24. A more complicated case
- Then the log-likelihood is ℓ(θ) = Σ_{i=1}^{n} log[(1 − π) N(x_i; μ_1, σ_1²) + π N(x_i; μ_2, σ_2²)]
- Direct MLE is numerically difficult: the sum inside the log couples all the parameters.
25. The EM for GMM
- We introduce unobserved latent variables Δ_i
- If Δ_i = 0, then (x_i, y_i) comes from model 0
- Else Δ_i = 1, and (x_i, y_i) comes from model 1
- If we knew the values of the Δ_i's, the MLE would be easy: fit each model separately on the points assigned to it.
26. The EM for GMM
- The values of the Δ_i's are actually unknown.
- EM's idea: proceed in an iterative fashion, substituting for each Δ_i its expected value.
27. Another version of EM for GMM
- Start from the MLE θ = {w_1, w_2, μ_1, μ_2, Σ_1, Σ_2} on (X_l, Y_l), then repeat:
- The E-step: compute the expected label p(y | x, θ) for all x ∈ X_u
- label a p(y = 1 | x, θ)-fraction of x with class 1
- label a p(y = 2 | x, θ)-fraction of x with class 2
28. Another version of EM for GMM
- The M-step: update the MLE θ with the (now soft-labeled) X_u (a code sketch of both steps follows)
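A minimal sketch of this E/M loop in Python, assuming two classes with labels 0/1, both classes present among the labeled points, 2-D feature matrices, and NumPy/SciPy available; the function name and the small ridge added to the covariances are illustrative choices, not from the slides.

```python
import numpy as np
from scipy.stats import multivariate_normal

def ssl_gmm_em(X_l, y_l, X_u, n_iter=50):
    """EM for a 2-class semi-supervised GMM (labels assumed to be 0/1).
    Iteration 0 is the MLE on the labeled data alone; afterwards the
    E-step soft-labels X_u and the M-step refits w, mu, Sigma."""
    X_l, y_l, X_u = np.asarray(X_l), np.asarray(y_l), np.asarray(X_u)
    X = np.vstack([X_l, X_u])
    l, K, d = len(X_l), 2, X_l.shape[1]

    # responsibilities r[i, k] = p(y_i = k | x_i, theta); labeled rows stay one-hot
    r = np.zeros((len(X), K))
    r[np.arange(l), y_l] = 1.0

    for _ in range(n_iter):
        # M-step: weighted MLE of mixing weights, means, covariances
        denom = r.sum(axis=0)
        w = denom / denom.sum()
        mu = [(r[:, k:k + 1] * X).sum(axis=0) / denom[k] for k in range(K)]
        cov = [np.cov(X.T, aweights=r[:, k], bias=True) + 1e-6 * np.eye(d)
               for k in range(K)]
        # E-step: expected labels p(y | x, theta) for the unlabeled points only
        dens = np.column_stack([w[k] * multivariate_normal.pdf(X_u, mu[k], cov[k])
                                for k in range(K)])
        r[l:] = dens / dens.sum(axis=1, keepdims=True)
    return w, mu, cov
```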
29. The EM algorithm in general
- Setup
- observed data D = (X_l, Y_l, X_u)
- hidden data Y_u
- Goal: find θ to maximize p(D | θ)
- Properties
- starts from an arbitrary θ_0 (or an estimate on (X_l, Y_l))
- The E-step: estimate p(Y_u | X_u, θ_t)
- The M-step: maximize the expected complete-data log-likelihood (the Q-function below)
- iteratively improves p(D | θ)
- converges to a local maximum of p(D | θ)
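In the usual notation, the two steps can be written as (standard EM, with Y_u as the hidden data):

```latex
\text{E-step:}\quad
Q(\theta \mid \theta_t)
  = \mathbb{E}_{Y_u \sim p(Y_u \mid X_u, \theta_t)}
    \big[ \log p(X_l, Y_l, X_u, Y_u \mid \theta) \big]
\qquad
\text{M-step:}\quad
\theta_{t+1} = \arg\max_{\theta} Q(\theta \mid \theta_t)
```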
30. Beyond EM
- The key is to maximize p(X_l, Y_l, X_u | θ).
- EM is just one way to maximize it.
- Other ways to find the parameters are possible too, e.g. variational approximation or direct optimization.
31. Advantages of generative models
- Clear, well-studied probabilistic framework
- Can be extremely effective, if the model is close
to correct
32. Disadvantages of generative models
- Often difficult to verify the correctness of the model
- Model identifiability, e.g. these two models:
- p(y = 1) = 0.2, p(x | y = 1) = unif(0, 0.2), p(x | y = −1) = unif(0.2, 1)
- p(y = 1) = 0.6, p(x | y = 1) = unif(0, 0.6), p(x | y = −1) = unif(0.6, 1)
- Both produce the same marginal p(x) = unif(0, 1), so unlabeled data cannot tell them apart. Can we predict on x = 0.5?
- EM local optima
- Unlabeled data may hurt if the generative model is wrong
33. Unlabeled data may hurt SSL
34. Heuristics to lessen the danger
- Carefully construct the generative model to reflect the task
- e.g. multiple Gaussian distributions per class, instead of a single one
- Down-weight the unlabeled data (λ < 1)
35. Related method: cluster-and-label
- Instead of a probabilistic generative model, any clustering algorithm can be used for semi-supervised classification (see the sketch after this list)
- Run your favorite clustering algorithm on X_l, X_u.
- Label all points within a cluster by the majority of the labeled points in that cluster.
- Pro: yet another simple method built on existing algorithms.
- Con: can be difficult to analyze due to its algorithmic nature.
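A minimal sketch of cluster-and-label, here using k-means as the "favorite clustering algorithm"; the function name, the scikit-learn dependency, and the fallback to the overall majority label for clusters with no labeled points are all illustrative choices.

```python
import numpy as np
from collections import Counter
from sklearn.cluster import KMeans

def cluster_and_label(X_l, y_l, X_u, n_clusters=2):
    """Cluster-and-label: cluster X_l and X_u together, then label every
    point in a cluster with the majority label of the labeled points in it."""
    X_l, y_l, X_u = np.asarray(X_l), np.asarray(y_l), np.asarray(X_u)
    clusters = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(np.vstack([X_l, X_u]))
    c_l, c_u = clusters[:len(X_l)], clusters[len(X_l):]

    overall_majority = Counter(y_l.tolist()).most_common(1)[0][0]   # fallback label
    y_u = np.empty(len(X_u), dtype=y_l.dtype)
    for c in range(n_clusters):
        labels_in_c = y_l[c_l == c]
        majority = (Counter(labels_in_c.tolist()).most_common(1)[0][0]
                    if len(labels_in_c) else overall_majority)      # empty-cluster fallback
        y_u[c_u == c] = majority
    return y_u
```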
37. Semi-supervised SVMs
- Semi-supervised SVMs (S3VMs)
- Transductive SVMs (TSVMs)
38. SVM with hinge loss
- The hinge loss
- The optimization problem (objective function); both are written out below
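In standard form (the usual soft-margin SVM written in regularized-risk style, with f(x) = w·x + b and λ controlling regularization):

```latex
\text{hinge loss:}\quad c\big(x, y, f(x)\big) = \max\big(1 - y f(x),\; 0\big)
\qquad
\text{SVM objective:}\quad
\min_{f} \; \sum_{i=1}^{l} \max\big(1 - y_i f(x_i),\; 0\big) \;+\; \lambda \lVert w \rVert^{2}
```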
39. S3VMs
- Assumption
- Unlabeled data from different classes are separated with a large margin.
- Basic idea
- Enumerate all 2^u possible labelings of X_u
- Build one standard SVM for each labeling (together with X_l)
- Pick the SVM with the largest margin
- NP-hard!
40. A smart trick
- How do we incorporate unlabeled points?
- Assign the putative label sign(f(x)) to each x ∈ X_u, so every unlabeled point is classified "correctly" by construction.
- Is this equivalent to the basic idea? (Yes)
- The hinge loss on unlabeled points then becomes max(1 − |f(x)|, 0), the so-called hat loss.
41. S3VM objective function
- The S3VM objective (written out below)
- The decision boundary f = 0 wants to be placed so that there are few unlabeled data points near it.
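Written out, the objective adds the hat loss on unlabeled points to the standard SVM terms (λ1, λ2 trade off regularization and the unlabeled term):

```latex
\min_{f} \;
  \sum_{i=1}^{l} \max\big(1 - y_i f(x_i),\; 0\big)
  \;+\; \lambda_1 \lVert w \rVert^{2}
  \;+\; \lambda_2 \sum_{j=l+1}^{n} \max\big(1 - \lvert f(x_j) \rvert,\; 0\big)
```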
42. The class balancing constraint
- Directly optimizing the S3VM objective often produces unbalanced classifications: most points fall into one class.
- Heuristic class balance: require the fraction of unlabeled points assigned to each class to match the fraction observed in the labeled data.
- Relaxed class balancing constraint: (1/u) Σ_{j=l+1}^{n} f(x_j) = (1/l) Σ_{i=1}^{l} y_i
43. S3VM algorithm
- Solve the optimization problem above (the S3VM objective subject to the class balancing constraint).
- Classify a new test point x by sign(f(x)).
44. The S3VM optimization challenge
- The SVM objective is convex.
- The S3VM objective is non-convex.
- Finding a solution for the semi-supervised SVM is difficult, and this has been the focus of S3VM research.
- Different approaches: SVMlight, ∇S3VM, continuation S3VM, deterministic annealing, CCCP, Branch and Bound, SDP convex relaxation, etc.
45. Advantages of S3VMs
- Applicable wherever SVMs are applicable, i.e. almost everywhere
- Clear mathematical framework
- More modest assumptions than generative models or graph-based methods
46. Disadvantages of S3VMs
- Optimization is difficult
- Can be trapped in bad local optima
48. Graph-Based Algorithms
- Assumption
- A graph is given on the labeled and unlabeled data. Instances connected by a heavy edge tend to have the same label.
- The optimization problem: fit the labeled points while keeping f smooth on the graph, e.g. minimize Σ_{i=1}^{l} (f(x_i) − y_i)² + λ Σ_{i,j} w_ij (f(x_i) − f(x_j))²
- Some algorithms
- mincut
- harmonic (see the sketch below)
- local and global consistency
- manifold regularization
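As a concrete instance, below is a minimal sketch of the harmonic-function method on a given graph; it assumes the weight matrix W is symmetric with the labeled nodes ordered first and binary labels in {0, 1} (the function name and the 0.5 threshold are illustrative).

```python
import numpy as np

def harmonic_function(W, y_l):
    """Harmonic-function labeling: labeled nodes are clamped to their labels,
    and each unlabeled node's score equals the weighted average of its
    neighbours, which has the closed form f_u = -L_uu^{-1} L_ul y_l."""
    l = len(y_l)
    D = np.diag(W.sum(axis=1))           # degree matrix
    L = D - W                            # unnormalized graph Laplacian
    L_uu = L[l:, l:]                     # unlabeled-unlabeled block
    L_ul = L[l:, :l]                     # unlabeled-labeled block
    f_u = np.linalg.solve(L_uu, -L_ul @ np.asarray(y_l, dtype=float))
    return (f_u > 0.5).astype(int), f_u  # hard labels and soft scores
```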
50. Co-training
- Two views of an item: image and HTML text
51. Feature split
- Each instance is represented by two sets of features: x = (x(1), x(2))
- x(1): image features
- x(2): web page text
- This is a natural feature split (or multiple views)
- Co-training idea
- Train an image classifier and a text classifier
- The two classifiers teach each other
52. Co-training assumptions
- Assumptions
- a feature split x = (x(1), x(2)) exists
- x(1) or x(2) alone is sufficient to train a good classifier
- x(1) and x(2) are conditionally independent given the class
53. Co-training algorithm
- Train two classifiers: f(1) from (X_l(1), Y_l), f(2) from (X_l(2), Y_l).
- Classify X_u with f(1) and f(2) separately.
- Add f(1)'s k most confident (x, f(1)(x)) to f(2)'s labeled data.
- Add f(2)'s k most confident (x, f(2)(x)) to f(1)'s labeled data.
- Repeat. (A sketch of the loop follows.)
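A minimal sketch of this loop, assuming scikit-learn-style classifiers with predict_proba and that the two views X1/X2 are aligned row by row; all names and the pool-shrinking detail are illustrative choices rather than the slides' exact procedure.

```python
import numpy as np

def co_train(clf1, clf2, X1_l, X2_l, y_l, X1_u, X2_u, k=5, n_rounds=10):
    """Co-training: one classifier per view; each adds its k most confident
    unlabeled predictions to the *other* classifier's labeled set."""
    X1_l, X2_l, y_l = np.asarray(X1_l), np.asarray(X2_l), np.asarray(y_l)
    X1_u, X2_u = np.asarray(X1_u), np.asarray(X2_u)
    data = {1: [X1_l, y_l.copy()], 2: [X2_l, y_l.copy()]}   # per-view labeled sets

    for _ in range(n_rounds):
        if len(X1_u) == 0:
            break
        clf1.fit(*data[1]); clf2.fit(*data[2])
        chosen = set()
        for clf, src_u, dst in ((clf1, X1_u, 2), (clf2, X2_u, 1)):
            proba = clf.predict_proba(src_u)
            top = np.argsort(-proba.max(axis=1))[:k]        # k most confident points
            y_new = clf.classes_[proba[top].argmax(axis=1)]
            dst_u = X2_u if dst == 2 else X1_u              # teach the other view
            data[dst][0] = np.vstack([data[dst][0], dst_u[top]])
            data[dst][1] = np.concatenate([data[dst][1], y_new])
            chosen.update(top.tolist())
        keep = np.setdiff1d(np.arange(len(X1_u)), list(chosen))
        X1_u, X2_u = X1_u[keep], X2_u[keep]                 # shrink the unlabeled pool
    clf1.fit(*data[1]); clf2.fit(*data[2])
    return clf1, clf2
```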
54. Pros and cons of co-training
- Pros
- Simple wrapper method; applies to almost all existing classifiers
- Less sensitive to mistakes than self-training
- Cons
- Natural feature splits may not exist
- Models using BOTH feature sets should do better
55. Variants of co-training
- Co-EM: add all, not just the top k
- Each classifier probabilistically labels X_u
- Add (x, y) with weight P(y | x)
- Fake feature split
- create a random, artificial feature split
- apply co-training
- Multiview: agreement among multiple classifiers
- no feature split
- train multiple classifiers of different types
- classify unlabeled data with all classifiers
- add the majority-vote label
57. Multiview algorithms
- A regularized risk minimization framework that encourages multi-learner agreement (one standard form is written out below)
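One common way to write such an objective: each of M learners fits the labeled data under its own regularizer, and a pairwise disagreement penalty over the unlabeled data couples them. Here c is a loss and λ1, λ2 are trade-off weights; this is a representative formulation rather than the slide's exact equation.

```latex
\min_{f_1, \dots, f_M} \;
  \sum_{v=1}^{M} \Big( \sum_{i=1}^{l} c\big(y_i, f_v(x_i)\big) + \lambda_1 \lVert f_v \rVert^{2} \Big)
  \;+\; \lambda_2 \sum_{v \ne w} \sum_{j=l+1}^{n} \big( f_v(x_j) - f_w(x_j) \big)^{2}
```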