Title: Semi-supervised Learning
1. Semi-supervised Learning
2. Overview
- Introduction to SSL Problem
- SSL Algorithms
3. Why SSL?
- Data labeling is expensive and difficult
- Labeling is often unreliable
- Unlabeled examples
- Easy to obtain in large numbers
- e.g. webpage classification, bioinformatics,
image classification
4. Notations (classification)
- input instance x, label y
- estimate p(y | x)
- labeled data (x_{1:l}, y_{1:l})
- unlabeled data x_{l+1:n} (u = n − l points), available during training (an additional source that tells us about p(x))
- usually l << u
- test data x_test, not available during training
5. SSL vs. Transductive Learning
- Semi-supervised learning is ultimately applied to the test data (inductive).
- Transductive learning is only concerned with the unlabeled data.
6. Glossary
- supervised learning (classification, regression)
- (x_{1:n}, y_{1:n})
- semi-supervised classification/regression
- (x_{1:l}, y_{1:l}), x_{l+1:n}, x_test
- transductive classification/regression
- (x_{1:l}, y_{1:l}), x_{l+1:n}
- semi-supervised clustering
- x_{1:n}, must-links, cannot-links
- unsupervised learning (clustering)
- x_{1:n}
7. Are unlabeled samples useful?
- In general yes, but not always (discussed later)
- Classification error decreases
- exponentially with the number of labeled examples
- linearly with the number of unlabeled examples
8. SSL Algorithms
- Self-Training
- Generative Models
- S3VMs
- Graph-Based Algorithms
- Co-training
- Multiview algorithms
9. Self-Training
- Assumption
- One's own high-confidence predictions are correct.
- Self-training algorithm (a sketch follows this list)
- Train f from (X_l, Y_l)
- Predict on x ∈ X_u
- Add (x, f(x)) to the labeled data
- Add all pairs
- Add only a few most confident pairs
- Add all pairs, each weighted by its confidence
- Repeat
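Below is a minimal sketch of this loop, assuming a scikit-learn-style base classifier with predict_proba and 2-D feature matrices; the names base_clf, X_l, y_l, X_u and the choice of adding the k most confident points per round are illustrative, not part of the slides.

```python
import numpy as np

def self_train(base_clf, X_l, y_l, X_u, k=10, max_iter=20):
    """Self-training: repeatedly add the k most confident predictions
    on unlabeled data to the labeled set and retrain."""
    X_l, y_l, X_u = np.array(X_l), np.array(y_l), np.array(X_u)
    for _ in range(max_iter):
        if len(X_u) == 0:
            break
        base_clf.fit(X_l, y_l)                      # train f on current labeled set
        proba = base_clf.predict_proba(X_u)         # confidence of each prediction
        top = np.argsort(-proba.max(axis=1))[:k]    # k most confident unlabeled points
        y_new = base_clf.classes_[proba[top].argmax(axis=1)]
        X_l = np.vstack([X_l, X_u[top]])            # add (x, f(x)) to labeled data
        y_l = np.concatenate([y_l, y_new])
        X_u = np.delete(X_u, top, axis=0)           # shrink the unlabeled pool
    base_clf.fit(X_l, y_l)
    return base_clf
```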
10. Advantages of Self-Training
- The simplest semi-supervised learning method.
- A wrapper method: applies to existing classifiers.
- Often used in real tasks like natural language processing.
11. Disadvantages of Self-Training
- Early mistakes can reinforce themselves.
- Heuristic fixes exist, e.g. down-weighting the added points or adding only the most confident ones.
- Little can be said about convergence in general.
13. Generative Models
- Assuming each class has a Gaussian distribution, what is the decision boundary?
14. Decision boundary (figure: boundary estimated from the labeled data alone)
15. Adding unlabeled data (figure: the same data with unlabeled points added)
16. The new decision boundary (figure: boundary estimated from labeled and unlabeled data)
17. They are different because
- The two boundaries maximize different quantities: p(X_l, Y_l | θ) with labeled data alone versus p(X_l, Y_l, X_u | θ) with the unlabeled data included.
18. Basic idea
- If we have the full generative model p(x, y | θ)
- quantity of interest: p(y | x, θ) (spelled out below)
- find the maximum likelihood estimate (MLE) of θ, the maximum a posteriori (MAP) estimate, or be Bayesian
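For reference, the quantity of interest follows from Bayes' rule applied to the generative model (a standard identity in this notation):

```latex
p(y \mid x, \theta)
  = \frac{p(x, y \mid \theta)}{\sum_{y'} p(x, y' \mid \theta)}
  = \frac{p(y \mid \theta)\, p(x \mid y, \theta)}{\sum_{y'} p(y' \mid \theta)\, p(x \mid y', \theta)}
```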
19. Some generative models
- Mixture of Gaussian distributions (GMM)
- image classification
- the EM algorithm
- Mixture of multinomial distributions
- text categorization
- the EM algorithm
- Hidden Markov Models (HMM)
- speech recognition
- Baum-Welch algorithm
20. Example: GMM
- For simplicity, consider binary classification with a GMM using MLE.
- Model parameters: θ = {w_1, w_2, μ_1, μ_2, Σ_1, Σ_2}
- So p(x, y | θ) = p(y | θ) p(x | y, θ) = w_y N(x; μ_y, Σ_y)
- To estimate θ, we maximize the labeled-data log-likelihood log p(X_l, Y_l | θ) = Σ_{i=1}^{l} log w_{y_i} N(x_i; μ_{y_i}, Σ_{y_i})
- Then we have a closed-form MLE (written out below)
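The closed-form MLE consists of the usual per-class sample statistics (the standard supervised estimates for this model):

```latex
\hat{w}_k = \frac{l_k}{l}, \qquad
\hat{\mu}_k = \frac{1}{l_k} \sum_{i\,:\,y_i = k} x_i, \qquad
\hat{\Sigma}_k = \frac{1}{l_k} \sum_{i\,:\,y_i = k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^{\top},
\quad \text{where } l_k = |\{\, i \le l : y_i = k \,\}|
```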
21. Continued
- Now that we have θ̂, predict y by maximum a posteriori: y* = argmax_y p(y | x, θ̂) = argmax_y w_y N(x; μ_y, Σ_y)
22. What about SSGMM?
- To estimate θ, we now maximize p(X_l, Y_l, X_u | θ) (written out below)
- More complicated? Yes: each unlabeled point contributes a mixture of the two normal distributions, with a sum inside the log.
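Under the GMM above, the quantity being maximized can be written as follows; the second sum, over the unlabeled points, has a sum inside the log, which is what makes the problem harder:

```latex
\log p(X_l, Y_l, X_u \mid \theta)
  = \sum_{i=1}^{l} \log\!\big( w_{y_i}\, \mathcal{N}(x_i; \mu_{y_i}, \Sigma_{y_i}) \big)
  + \sum_{j=l+1}^{n} \log \sum_{y=1}^{2} w_{y}\, \mathcal{N}(x_j; \mu_{y}, \Sigma_{y})
```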
23. A more complicated case
- For simplicity, consider a mixture of two normal distributions.
- Model parameters: θ = (π, μ_1, σ_1², μ_2, σ_2²)
- So the density is p(x | θ) = (1 − π) N(x; μ_1, σ_1²) + π N(x; μ_2, σ_2²)
24. A more complicated case
- Then the log-likelihood is ℓ(θ) = Σ_{i=1}^{n} log[(1 − π) N(x_i; μ_1, σ_1²) + π N(x_i; μ_2, σ_2²)]
- Direct MLE is numerically difficult: the sum inside the log couples all the parameters.
25. The EM for GMM
- We introduce unobserved latent variables Δ_i
- If Δ_i = 0, then (x_i, y_i) comes from model 0
- Else Δ_i = 1, and (x_i, y_i) comes from model 1
- If we knew the values of the Δ_i's, the MLE would be easy: fit each model separately on the points assigned to it.
26. The EM for GMM
- The values of the Δ_i's are actually unknown.
- EM's idea: proceed in an iterative fashion, substituting for each Δ_i its expected value.
27. Another version of EM for GMM
- Start from the MLE θ = {w_1, w_2, μ_1, μ_2, Σ_1, Σ_2} on (X_l, Y_l), then repeat:
- The E-step: compute the expected label p(y | x, θ) for all x ∈ X_u
- label a p(y = 1 | x, θ)-fraction of x with class 1
- label a p(y = 2 | x, θ)-fraction of x with class 2
28. Another version of EM for GMM
- The M-step: update the MLE θ with the (now soft-labeled) X_u (a code sketch of both steps follows)
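A minimal sketch of this E/M loop in Python, assuming two classes with labels 0/1, both classes present among the labeled points, 2-D feature matrices, and NumPy/SciPy available; the function name and the small ridge added to the covariances are illustrative choices, not from the slides.

```python
import numpy as np
from scipy.stats import multivariate_normal

def ssl_gmm_em(X_l, y_l, X_u, n_iter=50):
    """EM for a 2-class semi-supervised GMM (labels assumed to be 0/1).
    Iteration 0 is the MLE on the labeled data alone; afterwards the
    E-step soft-labels X_u and the M-step refits w, mu, Sigma."""
    X_l, y_l, X_u = np.asarray(X_l), np.asarray(y_l), np.asarray(X_u)
    X = np.vstack([X_l, X_u])
    l, K, d = len(X_l), 2, X_l.shape[1]

    # responsibilities r[i, k] = p(y_i = k | x_i, theta); labeled rows stay one-hot
    r = np.zeros((len(X), K))
    r[np.arange(l), y_l] = 1.0

    for _ in range(n_iter):
        # M-step: weighted MLE of mixing weights, means, covariances
        denom = r.sum(axis=0)
        w = denom / denom.sum()
        mu = [(r[:, k:k + 1] * X).sum(axis=0) / denom[k] for k in range(K)]
        cov = [np.cov(X.T, aweights=r[:, k], bias=True) + 1e-6 * np.eye(d)
               for k in range(K)]
        # E-step: expected labels p(y | x, theta) for the unlabeled points only
        dens = np.column_stack([w[k] * multivariate_normal.pdf(X_u, mu[k], cov[k])
                                for k in range(K)])
        r[l:] = dens / dens.sum(axis=1, keepdims=True)
    return w, mu, cov
```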
29. The EM algorithm in general
- Setup
- observed data D = (X_l, Y_l, X_u)
- hidden data Y_u
- Goal: find θ to maximize p(D | θ)
- Properties
- starts from an arbitrary θ_0 (or an estimate on (X_l, Y_l))
- The E-step: estimate p(Y_u | X_u, θ_t)
- The M-step: maximize the expected complete-data log-likelihood (the Q-function below)
- iteratively improves p(D | θ)
- converges to a local maximum of p(D | θ)
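In the usual notation, the two steps can be written as (standard EM, with Y_u as the hidden data):

```latex
\text{E-step:}\quad
Q(\theta \mid \theta_t)
  = \mathbb{E}_{Y_u \sim p(Y_u \mid X_u, \theta_t)}
    \big[ \log p(X_l, Y_l, X_u, Y_u \mid \theta) \big]
\qquad
\text{M-step:}\quad
\theta_{t+1} = \arg\max_{\theta} Q(\theta \mid \theta_t)
```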
30. Beyond EM
- The key is to maximize p(X_l, Y_l, X_u | θ).
- EM is just one way to maximize it.
- Other ways to find the parameters are possible too, e.g. variational approximation or direct optimization.
31. Advantages of generative models
- Clear, well-studied probabilistic framework
- Can be extremely effective, if the model is close
to correct
32. Disadvantages of generative models
- Often difficult to verify the correctness of the model
- Model identifiability, e.g. these two models:
- p(y = 1) = 0.2, p(x | y = 1) = unif(0, 0.2), p(x | y = −1) = unif(0.2, 1)
- p(y = 1) = 0.6, p(x | y = 1) = unif(0, 0.6), p(x | y = −1) = unif(0.6, 1)
- Both produce the same marginal p(x) = unif(0, 1), so unlabeled data cannot tell them apart. Can we predict on x = 0.5?
- EM local optima
- Unlabeled data may hurt if the generative model is wrong
33. Unlabeled data may hurt SSL
34. Heuristics to lessen the danger
- Carefully construct the generative model to reflect the task
- e.g. multiple Gaussian distributions per class, instead of a single one
- Down-weight the unlabeled data (λ < 1)
35. Related method: cluster-and-label
- Instead of a probabilistic generative model, any clustering algorithm can be used for semi-supervised classification (see the sketch after this list)
- Run your favorite clustering algorithm on X_l, X_u.
- Label all points within a cluster by the majority of the labeled points in that cluster.
- Pro: yet another simple method built on existing algorithms.
- Con: can be difficult to analyze due to its algorithmic nature.
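A minimal sketch of cluster-and-label, here using k-means as the "favorite clustering algorithm"; the function name, the scikit-learn dependency, and the fallback to the overall majority label for clusters with no labeled points are all illustrative choices.

```python
import numpy as np
from collections import Counter
from sklearn.cluster import KMeans

def cluster_and_label(X_l, y_l, X_u, n_clusters=2):
    """Cluster-and-label: cluster X_l and X_u together, then label every
    point in a cluster with the majority label of the labeled points in it."""
    X_l, y_l, X_u = np.asarray(X_l), np.asarray(y_l), np.asarray(X_u)
    clusters = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(np.vstack([X_l, X_u]))
    c_l, c_u = clusters[:len(X_l)], clusters[len(X_l):]

    overall_majority = Counter(y_l.tolist()).most_common(1)[0][0]   # fallback label
    y_u = np.empty(len(X_u), dtype=y_l.dtype)
    for c in range(n_clusters):
        labels_in_c = y_l[c_l == c]
        majority = (Counter(labels_in_c.tolist()).most_common(1)[0][0]
                    if len(labels_in_c) else overall_majority)      # empty-cluster fallback
        y_u[c_u == c] = majority
    return y_u
```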
37. Semi-supervised SVMs
- Semi-supervised SVMs (S3VMs)
- Transductive SVMs (TSVMs)
38. SVM with hinge loss
- The hinge loss
- The optimization problem (objective function); both are written out below
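In standard form (the usual soft-margin SVM written in regularized-risk style, with f(x) = w·x + b and λ controlling regularization):

```latex
\text{hinge loss:}\quad c\big(x, y, f(x)\big) = \max\big(1 - y f(x),\; 0\big)
\qquad
\text{SVM objective:}\quad
\min_{f} \; \sum_{i=1}^{l} \max\big(1 - y_i f(x_i),\; 0\big) \;+\; \lambda \lVert w \rVert^{2}
```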
39. S3VMs
- Assumption
- Unlabeled data from different classes are separated with a large margin.
- Basic idea
- Enumerate all 2^u possible labelings of X_u
- Build one standard SVM for each labeling (together with X_l)
- Pick the SVM with the largest margin
- NP-hard!
40. A smart trick
- How do we incorporate unlabeled points?
- Assign the putative label sign(f(x)) to each x ∈ X_u, so every unlabeled point is classified "correctly" by construction.
- Is this equivalent to the basic idea? (Yes)
- The hinge loss on unlabeled points then becomes max(1 − |f(x)|, 0), the so-called hat loss.
41. S3VM objective function
- The S3VM objective (written out below)
- The decision boundary f = 0 wants to be placed so that there are few unlabeled data points near it.
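Written out, the objective adds the hat loss on unlabeled points to the standard SVM terms (λ1, λ2 trade off regularization and the unlabeled term):

```latex
\min_{f} \;
  \sum_{i=1}^{l} \max\big(1 - y_i f(x_i),\; 0\big)
  \;+\; \lambda_1 \lVert w \rVert^{2}
  \;+\; \lambda_2 \sum_{j=l+1}^{n} \max\big(1 - \lvert f(x_j) \rvert,\; 0\big)
```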
42. The class balancing constraint
- Directly optimizing the S3VM objective often produces unbalanced classifications: most points fall into one class.
- Heuristic class balance: require the fraction of unlabeled points assigned to each class to match the fraction observed in the labeled data.
- Relaxed class balancing constraint: (1/u) Σ_{j=l+1}^{n} f(x_j) = (1/l) Σ_{i=1}^{l} y_i
43. S3VM algorithm
- Solve the optimization problem above (the S3VM objective subject to the class balancing constraint).
- Classify a new test point x by sign(f(x)).
44. The S3VM optimization challenge
- The SVM objective is convex.
- The S3VM objective is non-convex.
- Finding a solution for the semi-supervised SVM is difficult, and this has been the focus of S3VM research.
- Different approaches: SVMlight, ∇S3VM, continuation S3VM, deterministic annealing, CCCP, Branch and Bound, SDP convex relaxation, etc.
45. Advantages of S3VMs
- Applicable wherever SVMs are applicable, i.e. almost everywhere
- Clear mathematical framework
- More modest assumptions than generative models or graph-based methods
46. Disadvantages of S3VMs
- Optimization is difficult
- Can be trapped in bad local optima
48. Graph-Based Algorithms
- Assumption
- A graph is given on the labeled and unlabeled data. Instances connected by a heavy edge tend to have the same label.
- The optimization problem: fit the labeled points while keeping f smooth on the graph, e.g. minimize Σ_{i=1}^{l} (f(x_i) − y_i)² + λ Σ_{i,j} w_ij (f(x_i) − f(x_j))²
- Some algorithms
- mincut
- harmonic (see the sketch below)
- local and global consistency
- manifold regularization
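As a concrete instance, below is a minimal sketch of the harmonic-function method on a given graph; it assumes the weight matrix W is symmetric with the labeled nodes ordered first and binary labels in {0, 1} (the function name and the 0.5 threshold are illustrative).

```python
import numpy as np

def harmonic_function(W, y_l):
    """Harmonic-function labeling: labeled nodes are clamped to their labels,
    and each unlabeled node's score equals the weighted average of its
    neighbours, which has the closed form f_u = -L_uu^{-1} L_ul y_l."""
    l = len(y_l)
    D = np.diag(W.sum(axis=1))           # degree matrix
    L = D - W                            # unnormalized graph Laplacian
    L_uu = L[l:, l:]                     # unlabeled-unlabeled block
    L_ul = L[l:, :l]                     # unlabeled-labeled block
    f_u = np.linalg.solve(L_uu, -L_ul @ np.asarray(y_l, dtype=float))
    return (f_u > 0.5).astype(int), f_u  # hard labels and soft scores
```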
50. Co-training
- Two views of an item: image and HTML text
51. Feature split
- Each instance is represented by two sets of features: x = (x(1), x(2))
- x(1): image features
- x(2): web page text
- This is a natural feature split (or multiple views)
- Co-training idea
- Train an image classifier and a text classifier
- The two classifiers teach each other
52. Co-training assumptions
- Assumptions
- a feature split x = (x(1), x(2)) exists
- x(1) or x(2) alone is sufficient to train a good classifier
- x(1) and x(2) are conditionally independent given the class
53. Co-training algorithm
- Train two classifiers: f(1) from (X_l(1), Y_l), f(2) from (X_l(2), Y_l).
- Classify X_u with f(1) and f(2) separately.
- Add f(1)'s k most confident (x, f(1)(x)) to f(2)'s labeled data.
- Add f(2)'s k most confident (x, f(2)(x)) to f(1)'s labeled data.
- Repeat. (A sketch of the loop follows.)
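A minimal sketch of this loop, assuming scikit-learn-style classifiers with predict_proba and that the two views X1/X2 are aligned row by row; all names and the pool-shrinking detail are illustrative choices rather than the slides' exact procedure.

```python
import numpy as np

def co_train(clf1, clf2, X1_l, X2_l, y_l, X1_u, X2_u, k=5, n_rounds=10):
    """Co-training: one classifier per view; each adds its k most confident
    unlabeled predictions to the *other* classifier's labeled set."""
    X1_l, X2_l, y_l = np.asarray(X1_l), np.asarray(X2_l), np.asarray(y_l)
    X1_u, X2_u = np.asarray(X1_u), np.asarray(X2_u)
    data = {1: [X1_l, y_l.copy()], 2: [X2_l, y_l.copy()]}   # per-view labeled sets

    for _ in range(n_rounds):
        if len(X1_u) == 0:
            break
        clf1.fit(*data[1]); clf2.fit(*data[2])
        chosen = set()
        for clf, src_u, dst in ((clf1, X1_u, 2), (clf2, X2_u, 1)):
            proba = clf.predict_proba(src_u)
            top = np.argsort(-proba.max(axis=1))[:k]        # k most confident points
            y_new = clf.classes_[proba[top].argmax(axis=1)]
            dst_u = X2_u if dst == 2 else X1_u              # teach the other view
            data[dst][0] = np.vstack([data[dst][0], dst_u[top]])
            data[dst][1] = np.concatenate([data[dst][1], y_new])
            chosen.update(top.tolist())
        keep = np.setdiff1d(np.arange(len(X1_u)), list(chosen))
        X1_u, X2_u = X1_u[keep], X2_u[keep]                 # shrink the unlabeled pool
    clf1.fit(*data[1]); clf2.fit(*data[2])
    return clf1, clf2
```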
54. Pros and cons of co-training
- Pros
- Simple wrapper method; applies to almost all existing classifiers
- Less sensitive to mistakes than self-training
- Cons
- Natural feature splits may not exist
- Models using BOTH feature sets should do better
55. Variants of co-training
- Co-EM: add all, not just the top k
- Each classifier probabilistically labels X_u
- Add (x, y) with weight P(y | x)
- Fake feature split
- create a random, artificial feature split
- apply co-training
- Multiview: agreement among multiple classifiers
- no feature split
- train multiple classifiers of different types
- classify unlabeled data with all classifiers
- add the majority-vote label
57. Multiview algorithms
- A regularized risk minimization framework that encourages multi-learner agreement (one standard form is written out below)
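One common way to write such an objective: each of M learners fits the labeled data under its own regularizer, and a pairwise disagreement penalty over the unlabeled data couples them. Here c is a loss and λ1, λ2 are trade-off weights; this is a representative formulation rather than the slide's exact equation.

```latex
\min_{f_1, \dots, f_M} \;
  \sum_{v=1}^{M} \Big( \sum_{i=1}^{l} c\big(y_i, f_v(x_i)\big) + \lambda_1 \lVert f_v \rVert^{2} \Big)
  \;+\; \lambda_2 \sum_{v \ne w} \sum_{j=l+1}^{n} \big( f_v(x_j) - f_w(x_j) \big)^{2}
```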