1
Semi-supervised Learning
2
Overview
  • Introduction to SSL Problem
  • SSL Algorithms

3
Why SSL?
  • Data labeling is expensive and difficult
  • Labeling is often unreliable
  • Unlabeled examples
  • Easy to obtain in large numbers
  • e.g. webpage classification, bioinformatics,
    image classification

4
Notation (classification)
  • input instance x, label y
  • estimate f: X → Y, or P(y | x)
  • labeled data (x_{1:l}, y_{1:l})
  • unlabeled data x_{l+1:n}, available
    during training (an additional source that tells us
    about P(x))
  • usually l ≪ n
  • test data x_test, not available during
    training

5
SSL vs. Transductive Learning
  • Semi-supervised learning is ultimately applied to
    the test data (inductive).
  • Transductive learning is only concerned with the
    unlabeled data.

6
Glossary
  • supervised learning (classification, regression)
  • (x_{1:n}, y_{1:n})
  • semi-supervised classification/regression
  • (x_{1:l}, y_{1:l}), x_{l+1:n}, x_test
  • transductive classification/regression
  • (x_{1:l}, y_{1:l}), x_{l+1:n}
  • semi-supervised clustering
  • x_{1:n}, must-links, cannot-links
  • unsupervised learning (clustering)
  • x_{1:n}

7
Are unlabeled samples useful?
  • In general yes, but not always (discussed later)
  • Classification error decreases
  • exponentially with the number of labeled examples
  • only linearly with the number of unlabeled examples

8
SSL Algorithms
  • Self-Training
  • Generative Models
  • S3VMs
  • Graph-Based Algorithms
  • Co-training
  • Multiview algorithms

9
Self-Training
  • Assumption
  • One's own high-confidence predictions are
    correct.
  • Self-training algorithm (see the sketch below)
  • Train f from the labeled data (x_{1:l}, y_{1:l})
  • Predict on the unlabeled data x ∈ Xu
  • Add (x, f(x)) to the labeled data, either
  • add all pairs,
  • add only a few of the most confident pairs, or
  • add every pair weighted by its confidence
  • Repeat
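A minimal self-training loop in Python, assuming a scikit-learn-style base classifier with fit/predict_proba; the names (base_clf, X_l, y_l, X_u) and the "k most confident" rule are illustrative choices, not prescribed by the slides.

    import numpy as np

    def self_train(base_clf, X_l, y_l, X_u, k=10, max_iter=20):
        """Self-training: repeatedly add the k most confident predictions to the labeled set."""
        X_l, y_l, X_u = X_l.copy(), y_l.copy(), X_u.copy()
        for _ in range(max_iter):
            if len(X_u) == 0:
                break
            base_clf.fit(X_l, y_l)                   # train f on the current labeled set
            proba = base_clf.predict_proba(X_u)      # predict on the unlabeled data
            conf = proba.max(axis=1)                 # confidence of each prediction
            top = np.argsort(-conf)[:k]              # indices of the k most confident points
            new_y = base_clf.classes_[proba[top].argmax(axis=1)]
            X_l = np.vstack([X_l, X_u[top]])         # add (x, f(x)) to the labeled data
            y_l = np.concatenate([y_l, new_y])
            X_u = np.delete(X_u, top, axis=0)
        return base_clf.fit(X_l, y_l)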

10
Advantages of Self-Training
  • The simplest semi-supervised learning method.
  • A wrapper method, applies to existing
    classifiers.
  • Often used in real tasks like natural language
    processing.

11
Disadvantages of Self-Training
  • Early mistakes could reinforce themselves.
  • Heuristic solutions, e.g. weight the added pairs or
    add only the most confident ones.
  • Cannot say too much in terms of convergence.

12
SSL Algorithms
  • Self-Training
  • Generative Models
  • S3VMs
  • Graph-Based Algorithms
  • Co-training
  • Multiview algorithms

13
Generative Models
  • Assuming each class has a Gaussian distribution,
    what is the decision boundary?

14
Decision boundary
15
Adding unlabeled data
16
The new decision boundary
17
They are different because we maximize different
quantities: p(Xl, Yl | θ) without the unlabeled data vs.
p(Xl, Yl, Xu | θ) with it
18
Basic idea
  • If we have the full generative model p(X, Y | θ)
  • quantity of interest: p(y | x, θ)
  • find the maximum likelihood estimate (MLE) of θ,
    the maximum a posteriori (MAP) estimate, or be
    Bayesian

19
Some generative models
  • Mixture of Gaussian distributions (GMM)
  • image classification
  • the EM algorithm
  • Mixture of multinomial distributions
  • text categorization
  • the EM algorithm
  • Hidden Markov Models (HMM)
  • speech recognition
  • Baum-Welch algorithm

20
Example GMM
  • For simplicity, consider binary classification
    with a GMM using MLE.
  • Model parameters θ = {w1, w2, µ1, µ2, Σ1, Σ2}
  • So p(x, y | θ) = w_y N(x; µ_y, Σ_y)
  • To estimate θ, we maximize the log-likelihood
    log p(Xl, Yl | θ) = Σ_{i=1..l} log w_{y_i} N(x_i; µ_{y_i}, Σ_{y_i})
  • Then, we have the closed-form MLE: w_k is the fraction
    of labeled points in class k, and µ_k, Σ_k are the
    sample mean and covariance of class k (see the sketch
    below)
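A small sketch of this closed-form MLE and of the MAP prediction on the next slide, in Python with NumPy/SciPy; the function and variable names are illustrative.

    import numpy as np
    from scipy.stats import multivariate_normal

    def gmm_mle(X_l, y_l):
        """Closed-form MLE of θ = {w_k, µ_k, Σ_k} from the labeled data only."""
        theta = {}
        for k in np.unique(y_l):
            Xk = X_l[y_l == k]
            theta[k] = (len(Xk) / len(X_l),          # class prior w_k
                        Xk.mean(axis=0),             # class mean µ_k
                        np.cov(Xk, rowvar=False))    # class covariance Σ_k
        return theta

    def predict_map(theta, x):
        """MAP prediction: argmax_y w_y N(x; µ_y, Σ_y)."""
        scores = {k: w * multivariate_normal.pdf(x, mean=mu, cov=cov)
                  for k, (w, mu, cov) in theta.items()}
        return max(scores, key=scores.get)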

21
Continued
  • Now that we have θ, predict y by maximum a
    posteriori: y = argmax_y p(y | x, θ), where
    p(y | x, θ) ∝ w_y N(x; µ_y, Σ_y)

22
What about semi-supervised GMM (SSGMM)?
  • To estimate θ, we now maximize log p(Xl, Yl, Xu | θ),
    which adds the unlabeled term
    Σ_{i=l+1..n} log ( Σ_y w_y N(x_i; µ_y, Σ_y) )
  • More complicated: each unlabeled point contributes a
    mixture of the two normal distributions, so there is no
    closed-form solution

23
A more complicated case
  • For simplicity, consider a mixture of two normal
    distributions.
  • Model parameters θ = (π, µ0, σ0², µ1, σ1²)
  • So p(x | θ) = (1 − π) N(x; µ0, σ0²) + π N(x; µ1, σ1²)

24
A more complicated case
  • Then the log-likelihood is
    ℓ(θ) = Σ_i log [ (1 − π) N(x_i; µ0, σ0²) + π N(x_i; µ1, σ1²) ]
  • Direct MLE is difficult numerically (the log of a sum
    has no closed-form maximizer).

25
The EM for GMM
  • We consider unobserved latent variables Δi
  • If Δi = 0, then (xi, yi) comes from model 0
  • Else Δi = 1, then (xi, yi) comes from model 1
  • Suppose we knew the values of the Δi's; then the MLE
    would be easy (fit each model on its own points)

26
The EM for GMM
  • The values of the Δi's are actually unknown.
  • EM's idea: we proceed in an iterative fashion,
    substituting for each Δi its expected value under the
    current parameters.

27
Another version of EM for GMM
  • Start from the MLE θ = {w1, w2, µ1, µ2, Σ1, Σ2} on (Xl,
    Yl), then repeat:
  • The E-step: compute the expected label p(y | x, θ)
    for all x ∈ Xu
  • label p(y = 1 | x, θ) fraction of x with class 1
  • label p(y = 2 | x, θ) fraction of x with class 2

28
Another version of EM for GMM
  • The M-step: update the MLE θ with the (now
    weighted, labeled) Xu (see the sketch below)
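A compact sketch of this E-step/M-step loop for a two-class semi-supervised GMM in Python (classes assumed encoded as 0 and 1; all names are illustrative). The first M-step reduces to the MLE on (Xl, Yl) because the unlabeled responsibilities start at zero.

    import numpy as np
    from scipy.stats import multivariate_normal

    def em_ssgmm(X_l, y_l, X_u, n_iter=50):
        """EM for a 2-class semi-supervised GMM; y_l must be encoded as 0/1."""
        X = np.vstack([X_l, X_u])
        l = len(y_l)
        r = np.zeros((len(X), 2))                 # responsibilities p(y | x, θ)
        r[np.arange(l), y_l] = 1.0                # labeled points keep their labels fixed
        for _ in range(n_iter):
            # M-step: weighted MLE of (w_k, µ_k, Σ_k) from labeled + soft-labeled points
            params = []
            for k in range(2):
                rk = r[:, k]
                w = rk.sum() / r.sum()
                mu = (rk[:, None] * X).sum(axis=0) / rk.sum()
                d = X - mu
                cov = (rk[:, None, None] * np.einsum('ij,ik->ijk', d, d)).sum(axis=0) / rk.sum()
                params.append((w, mu, cov))
            # E-step: expected labels p(y | x, θ) for the unlabeled points only
            dens = np.column_stack([w * multivariate_normal.pdf(X, mean=mu, cov=cov)
                                    for (w, mu, cov) in params])
            r[l:] = dens[l:] / dens[l:].sum(axis=1, keepdims=True)
        return params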

29
The EM algorithm in general
  • Set up
  • observed data D = (Xl, Yl, Xu)
  • hidden data Yu
  • Goal: find θ to maximize the likelihood p(D | θ)
  • Properties
  • starts from an arbitrary θ0 (or an estimate on (Xl,
    Yl))
  • The E-step estimates p(Yu | Xu, θ0)
  • The M-step maximizes the expected complete-data
    log-likelihood E_{Yu} [ log p(Xl, Yl, Xu, Yu | θ) ]
  • iteratively improves p(D | θ)
  • converges to a local maximum of the likelihood

30
Beyond EM
  • The key is to maximize p(Xl, Yl, Xu | θ).
  • EM is just one way to maximize it.
  • Other ways to find parameters are possible too,
    e.g. variational approximation, or direct
    optimization.

31
Advantages of generative models
  • Clear, well-studied probabilistic framework
  • Can be extremely effective, if the model is close
    to correct

32
Disadvantages of generative models
  • Often difficult to verify the correctness of the
    model
  • Model identifiability
  • (1) p(y=1) = 0.2, p(x|y=1) = unif(0, 0.2),
    p(x|y=−1) = unif(0.2, 1)
  • (2) p(y=1) = 0.6, p(x|y=1) = unif(0, 0.6),
    p(x|y=−1) = unif(0.6, 1)
  • Both give the same marginal p(x) = unif(0, 1), so
    unlabeled data cannot tell them apart. Can we predict
    on x = 0.5? (see the check after this list)
  • EM local optima
  • Unlabeled data may hurt if generative model is
    wrong
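Returning to the identifiability example above, a quick arithmetic check (written as Python for concreteness) shows why x = 0.5 is problematic: both models produce the same marginal p(x) = unif(0, 1), yet they make opposite predictions there.

    def unif_density(a, b, x):
        """Density of unif(a, b) at x."""
        return 1.0 / (b - a) if a <= x <= b else 0.0

    x = 0.5
    # model (1): p(y=1 | x) ∝ p(y=1) p(x | y=1) = 0.2 * unif(0, 0.2) = 0    -> predicts y = -1
    # model (2): p(y=1 | x) ∝ p(y=1) p(x | y=1) = 0.6 * unif(0, 0.6) = 1.0  -> predicts y = +1
    print(0.2 * unif_density(0.0, 0.2, x), 0.6 * unif_density(0.0, 0.6, x))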

33
Unlabeled data may hurt SSL
34
Heuristics to lessen the danger
  • Carefully construct the generative model to
    reflect the task
  • e.g. multiple Gaussian distributions per class,
    instead of a single one
  • Down-weight the unlabeled data (weight λ < 1)

35
Related method: cluster-and-label
  • Instead of probabilistic generative models, any
    clustering algorithm can be used for
    semi-supervised classification too
  • Run your favorite clustering algorithm on Xl,Xu.
  • Label all points within a cluster by the majority
    of labeled points in that cluster.
  • Pro: yet another simple method using existing
    algorithms (see the sketch below).
  • Con: can be difficult to analyze due to its
    algorithmic nature.
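A minimal cluster-and-label sketch in Python, using scikit-learn's KMeans as the "favorite clustering algorithm"; the choice of KMeans, the number of clusters, and all names are illustrative.

    import numpy as np
    from sklearn.cluster import KMeans

    def cluster_and_label(X_l, y_l, X_u, n_clusters=10):
        """Cluster Xl ∪ Xu, then label each cluster by majority vote of its labeled points."""
        X_all = np.vstack([X_l, X_u])
        km = KMeans(n_clusters=n_clusters, n_init=10).fit(X_all)
        labels_l = km.labels_[:len(X_l)]             # cluster ids of labeled points
        labels_u = km.labels_[len(X_l):]             # cluster ids of unlabeled points
        y_u = np.empty(len(X_u), dtype=y_l.dtype)
        for c in range(n_clusters):
            members = y_l[labels_l == c]
            # majority label in cluster c (fall back to the global majority if empty)
            votes = members if len(members) else y_l
            vals, counts = np.unique(votes, return_counts=True)
            y_u[labels_u == c] = vals[np.argmax(counts)]
        return y_u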

36
SSL Algorithms
  • Self-Training
  • Generative Models
  • S3VMs
  • Graph-Based Algorithms
  • Co-training
  • Multiview algorithms

37
Semi-supervised SVMs
  • Semi-supervised SVMs (S3VMs)
  • Transductive SVMs (TSVMs)

38
SVM with hinge loss
  • The hinge loss: max(1 − y f(x), 0)
  • The optimization problem (objective function, see the
    sketch below):
    min_f Σ_{i=1..l} max(1 − y_i f(x_i), 0) + λ ‖f‖²
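A tiny illustration of the hinge loss and this objective for a linear decision function f(x) = w·x + b; the linear form and the regularization weight λ are assumptions made only for the example.

    import numpy as np

    def hinge(y, fx):
        """Hinge loss max(1 - y f(x), 0): zero once a point is correct with margin >= 1."""
        return np.maximum(1 - y * fx, 0)

    def svm_objective(w, b, X_l, y_l, lam=1.0):
        """Labeled hinge loss plus the regularizer λ‖w‖² (labels y in {-1, +1})."""
        fx = X_l @ w + b                    # linear decision function f(x) = w·x + b
        return hinge(y_l, fx).sum() + lam * np.dot(w, w)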

39
S3VMs
  • Assumption
  • Unlabeled data from different classes are
    separated with large margin.
  • Basic idea
  • Enumerate all 2^u possible labelings of Xu
  • Build one standard SVM for each labeling (and Xl)
  • Pick the SVM with the largest margin
  • NP-hard!

40
A smart trick
  • How to incorporate unlabeled points?
  • Assign the putative label sign(f(x)) to each x ∈ Xu,
    i.e. the unlabeled points are always classified
    "correctly".
  • Is it equivalent to our basic idea? (Yes)
  • The hinge loss on unlabeled points then becomes the
    hat loss max(1 − |f(x)|, 0)

41
S3VM objective function
  • S3VM objective (see the sketch below):
    min_f Σ_{i=1..l} max(1 − y_i f(x_i), 0) + λ1 ‖f‖²
    + λ2 Σ_{j=l+1..n} max(1 − |f(x_j)|, 0)
  • the decision boundary f = 0 prefers to be placed
    where few unlabeled points are near it.
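The same linear setup extended with the hat loss on unlabeled points gives a direct (non-convex) transcription of this objective; λ1, λ2 and the linear f are again illustrative assumptions.

    import numpy as np

    def s3vm_objective(w, b, X_l, y_l, X_u, lam1=1.0, lam2=1.0):
        """Labeled hinge loss + regularizer + hat loss pushing f = 0 away from unlabeled points."""
        f_l = X_l @ w + b
        f_u = X_u @ w + b
        hinge = np.maximum(1 - y_l * f_l, 0).sum()      # labeled points: ordinary hinge loss
        hat = np.maximum(1 - np.abs(f_u), 0).sum()      # unlabeled points: hat loss
        return hinge + lam1 * np.dot(w, w) + lam2 * hat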

42
The class balancing constraint
  • Directly optimizing the S3VM objective often
    produces unbalanced classification
  • most points fall in one class.
  • Heuristic class balance: constrain the fraction of
    unlabeled points assigned to each class to match the
    class proportions in the labeled data
  • Relaxed class balancing constraint:
    (1/u) Σ_{j=l+1..n} f(x_j) = (1/l) Σ_{i=1..l} y_i

43
S3VM algorithm
  • The optimization problem: minimize the S3VM
    objective above, subject to the class balancing
    constraint
  • Classify a new test point x by sign(f(x))

44
The S3VM optimization challenge
  • SVM objective is convex.
  • S3VM objective is non-convex.
  • Finding a solution for semi-supervised SVM is
    difficult, which has been the focus of S3VM
    research.
  • Different approaches: SVMlight, ∇S3VM,
    continuation S3VM, deterministic annealing, CCCP,
    Branch and Bound, SDP convex relaxation, etc.

45
Advantages of S3VMs
  • Applicable wherever SVMs are applicable, i.e.
    almost everywhere
  • Clear mathematical framework
  • More modest assumptions than generative models or
    graph-based methods

46
Disadvantages of S3VMs
  • Optimization difficult
  • Can be trapped in bad local optima

47
SSL Algorithms
  • Self-Training
  • Generative Models
  • S3VMs
  • Graph-Based Algorithms
  • Co-training
  • Multiview algorithms

48
Graph-Based Algorithms
  • Assumption
  • A graph is given on the labeled and unlabeled
    data. Instances connected by a heavy edge tend to
    have the same label.
  • The optimization problem (see the sketch after this
    list): min_f Σ_{i=1..l} (f(x_i) − y_i)²
    + λ Σ_{i,j} w_ij (f(x_i) − f(x_j))²
  • Some algorithms
  • mincut
  • harmonic
  • local and global consistency
  • manifold regularization
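As referenced above, here is a small sketch of the harmonic-function variant in Python: labels are clamped on the labeled nodes and the quadratic smoothness term is minimized exactly on the unlabeled nodes. The dense linear solve and the node ordering (labeled nodes first) are simplifying assumptions.

    import numpy as np

    def harmonic_labels(W, y_l):
        """Harmonic solution: f = y on labeled nodes, f_u = L_uu^-1 W_ul y_l on unlabeled nodes.

        W   : (n, n) symmetric edge-weight matrix, labeled nodes listed first
        y_l : labels of the first l nodes (e.g. 0/1)
        """
        l = len(y_l)
        D = np.diag(W.sum(axis=1))
        L = D - W                                   # graph Laplacian
        L_uu = L[l:, l:]
        W_ul = W[l:, :l]
        f_u = np.linalg.solve(L_uu, W_ul @ y_l)     # minimizes Σ_ij w_ij (f_i - f_j)² given f_l = y_l
        return f_u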

49
SSL Algorithms
  • Self-Training
  • Generative Models
  • S3VMs
  • Graph-Based Algorithms
  • Co-training
  • Multiview algorithms

50
Co-training
  • Two views of an item: image and HTML text

51
Feature split
  • Each instance is represented by two sets of
    features: x = (x(1), x(2))
  • x(1): image features
  • x(2): web page text
  • This is a natural feature split (or multiple
    views)
  • Co-training idea
  • Train an image classifier and a text classifier
  • The two classifiers teach each other

52
Co-training assumptions
  • Assumptions
  • a feature split x = (x(1), x(2)) exists
  • x(1) or x(2) alone is sufficient to train a good
    classifier
  • x(1) and x(2) are conditionally independent given
    the class

53
Co-training algorithm
  • Train two classifiers
  • f(1) from (Xl(1), Yl), f(2) from (Xl(2), Yl)
  • Classify Xu with f(1) and f(2) separately.
  • Add f(1)'s k most confident (x, f(1)(x)) to
    f(2)'s labeled data.
  • Add f(2)'s k most confident (x, f(2)(x)) to
    f(1)'s labeled data.
  • Repeat (see the sketch below).
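A compact co-training loop in Python, assuming scikit-learn-style classifiers and a given column split of the features; clf1, clf2, view1, view2 and k are illustrative names.

    import numpy as np

    def co_train(clf1, clf2, X_l, y_l, X_u, view1, view2, k=5, max_iter=30):
        """Two classifiers, one per feature view, each teaching the other its most confident labels."""
        X1, y1 = X_l[:, view1], y_l.copy()      # f(1)'s labeled pool (view-1 features)
        X2, y2 = X_l[:, view2], y_l.copy()      # f(2)'s labeled pool (view-2 features)
        U = X_u.copy()
        for _ in range(max_iter):
            if len(U) == 0:
                break
            clf1.fit(X1, y1)
            clf2.fit(X2, y2)
            p1 = clf1.predict_proba(U[:, view1])
            p2 = clf2.predict_proba(U[:, view2])
            top1 = np.argsort(-p1.max(axis=1))[:k]      # f(1)'s k most confident points
            top2 = np.argsort(-p2.max(axis=1))[:k]      # f(2)'s k most confident points
            # f(1) teaches f(2), and vice versa
            X2 = np.vstack([X2, U[top1][:, view2]])
            y2 = np.concatenate([y2, clf1.classes_[p1[top1].argmax(axis=1)]])
            X1 = np.vstack([X1, U[top2][:, view1]])
            y1 = np.concatenate([y1, clf2.classes_[p2[top2].argmax(axis=1)]])
            U = np.delete(U, np.union1d(top1, top2), axis=0)
        return clf1.fit(X1, y1), clf2.fit(X2, y2)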

54
Pros and cons of co-training
  • Pros
  • Simple wrapper method. Applies to almost all
    existing classifiers
  • Less sensitive to mistakes than self-training
  • Cons
  • Natural feature splits may not exist
  • Models using BOTH features should do better

55
Variants of co-training
  • Co-EM: add all, not just the top k
  • Each classifier probabilistically labels Xu
  • Add (x, y) with weight P(y | x)
  • Fake feature split
  • create random, artificial feature split
  • apply co-training
  • Multiview: agreement among multiple classifiers
  • no feature split
  • train multiple classifiers of different types
  • classify unlabeled data with all classifiers
  • add majority vote label

56
SSL Algorithms
  • Self-Training
  • Generative Models
  • S3VMs
  • Graph-Based Algorithms
  • Co-training
  • Multiview algorithms

57
Multiview algorithms
  • A regularized risk minimization framework to
    encourage multi-learner agreement

58
  • Thanks!