Title: Semi-supervised learning and self-training
1. Semi-supervised learning and self-training
- LING 572
- Fei Xia
- 02/14/06
2. Outline
- Overview of Semi-supervised learning (SSL)
- Self-training (a.k.a. Bootstrapping)
3. Additional References
- Xiaojin Zhu (2006). Semi-supervised learning literature survey.
- Olivier Chapelle et al. (2005). Semi-Supervised Learning. The MIT Press.
4. Overview of SSL
5. What is SSL?
- Labeled data
  - Ex: POS tagging → tagged sentences
  - Creating labeled data is difficult, expensive, and/or time-consuming.
- Unlabeled data
  - Ex: POS tagging → untagged sentences
  - Obtaining unlabeled data is easier.
- Goal: use both labeled and unlabeled data to improve performance.
6. What is SSL? (cont)
- Learning
  - Supervised (labeled data only)
  - Semi-supervised (both labeled and unlabeled data)
  - Unsupervised (unlabeled data only)
- Problems
  - Classification
  - Regression
  - Clustering
- → Focus: the semi-supervised classification problem
7. A brief history of SSL
- The idea of self-training appeared in the 1960s.
- SSL took off in the 1970s.
- Interest in SSL increased in the 1990s, mostly due to applications in NLP.
8. Does SSL work?
- Yes, under certain conditions.
  - Problem itself: the knowledge of p(x) carries information that is useful in inferring p(y|x).
  - Algorithm: the modeling assumptions fit the problem structure well.
- SSL is most useful when there is far more unlabeled data than labeled data.
- SSL can degrade performance when mistakes reinforce themselves.
9. Illustration (Zhu, 2006)
10. Illustration (cont)
11. Assumptions
- Smoothness (continuity) assumption: if two points x1 and x2 in a high-density region are close, then so should be the corresponding outputs y1 and y2.
- Cluster assumption: if points are in the same cluster, they are likely to be of the same class.
- Low-density separation: the decision boundary should lie in a low-density region.
12. SSL algorithms
- Self-training
- Co-training
- Generative models
  - Ex: EM with generative mixture models
- Low-density separation
  - Ex: Transductive SVM
- Graph-based models
13. Which SSL method should we use?
- It depends.
- Semi-supervised methods make strong model
assumptions. Choose the ones whose assumptions
fit the problem structure.
14. Semi-supervised and active learning
- They address the same issue: labeled data are hard to get.
- Semi-supervised: choose the unlabeled data to be added to the labeled data.
- Active learning: choose the unlabeled data to be annotated.
15. Self-training
16. Basics of self-training
- Probably the earliest SSL idea.
- Also called self-teaching or bootstrapping.
- Appeared in the 1960s and 1970s.
- First well-known NLP paper: Yarowsky (1995)
17. Self-training algorithm
- Let L be the set of labeled data and U be the set of unlabeled data.
- Repeat:
  - Train a classifier h with training data L
  - Classify the data in U with h
  - Find the subset U' of U with the most confident scores
  - L ∪ U' → L
  - U − U' → U
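A minimal Python sketch of this loop follows. The scikit-learn-style classifier interface (fit / predict_proba / classes_), the fixed confidence threshold, and the iteration cap are illustrative assumptions, not part of the algorithm itself.

import numpy as np

def self_train(clf, L, U, threshold=0.95, max_iter=10):
    # L: list of (x, y) pairs; U: list of unlabeled feature vectors x.
    U = list(U)
    for _ in range(max_iter):
        X, y = zip(*L)
        clf.fit(list(X), list(y))             # train a classifier h on L
        if not U:
            break
        probs = clf.predict_proba(U)          # classify the data in U with h
        best = probs.argmax(axis=1)           # most likely label per example
        conf = probs.max(axis=1)
        keep = conf >= threshold              # the most confident subset U'
        if not keep.any():
            break                             # nothing confident enough: stop
        L = L + [(U[i], clf.classes_[best[i]])      # L ∪ U' → L
                 for i in np.flatnonzero(keep)]
        U = [U[i] for i in np.flatnonzero(~keep)]   # U − U' → U
    return clf, L, U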
18. Case study (Yarowsky, 1995)
19. Setting
- Task: WSD
  - Ex: "plant" → living vs. factory
- Unsupervised: just need a few seed collocations for each sense.
- Learner: decision list
20. Assumption 1: One sense per collocation
- Nearby words provide strong and consistent clues to the sense of a target word.
- The effect varies depending on the type of collocation; it is strongest for immediately adjacent collocations.
- Assumption 1 → use collocations in the decision rules (see the sketch below).
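As an illustration of what such decision rules can condition on, here is a small sketch of extracting collocational features for one occurrence of a target word; the window size k and the feature names are assumptions made for illustration.

def collocation_features(tokens, i, k=2):
    # Collocational features for the target word tokens[i]:
    # the immediately adjacent words plus words within a +/-k window.
    feats = []
    if i > 0:
        feats.append(("word-1", tokens[i - 1]))     # word to the left
    if i + 1 < len(tokens):
        feats.append(("word+1", tokens[i + 1]))     # word to the right
    for j in range(max(0, i - k), min(len(tokens), i + k + 1)):
        if j != i:
            feats.append(("in-window", tokens[j]))  # word within +/-k words
    return feats

For example, collocation_features("the plant manufactures cars".split(), 1) yields the adjacent-word and window features for this occurrence of "plant".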
21. Assumption 2: One sense per discourse
- The sense of a target word is highly consistent within any given document.
- The assumption holds most of the time (99.8% in their experiment).
- Assumption 2 → filter and augment the addition of unlabeled data (see the sketch below).
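A simplified sketch of the filtering half of this step, assuming the tagged examples are grouped by document (the data layout is an assumption; the actual step both filters mismatching tags and augments coverage within a document):

from collections import Counter

def one_sense_per_discourse(labels_by_doc):
    # labels_by_doc: {doc_id: [sense, sense, ...]}
    # Relabel every tagged occurrence in a document with the
    # document's majority sense.
    out = {}
    for doc, senses in labels_by_doc.items():
        majority, _ = Counter(senses).most_common(1)[0]
        out[doc] = [majority] * len(senses)   # one sense per discourse
    return out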
22. Step 1: Identify all examples of the given word ("plant")
This yields our sample set S.
23. Step 2: Create initial labeled data using a small number of seed collocations
- Sense A seed: "life"; Sense B seed: "manufacturing"
- The seed-matching examples form our L(0); U(0) = S − L(0) is the residual data set (see the sketch below).
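A small sketch of this seeding step, under the assumption that each example is represented by its set of (name, word) features as above; examples matching the seeds of exactly one sense go into L(0), and the rest remain in U(0):

def seed_label(examples, seeds):
    # seeds: e.g. {"A": {"life"}, "B": {"manufacturing"}} (illustrative)
    L, U = [], []
    for feats in examples:
        words = {w for _, w in feats}          # collocate words in this example
        matched = [s for s, seed in seeds.items() if words & seed]
        if len(matched) == 1:
            L.append((feats, matched[0]))      # goes into L(0)
        else:
            U.append(feats)                    # residual set U(0)
    return L, U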
24. Initial labeled data
25. Step 3a: Train a DL classifier
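Yarowsky ranks decision-list rules by a smoothed log-likelihood ratio between the two senses of each feature. A minimal sketch, with the smoothing constant alpha and the two-sense data layout as assumptions:

import math
from collections import defaultdict

def train_decision_list(labeled, alpha=0.1):
    # labeled: list of (features, sense) pairs with sense in {"A", "B"}.
    counts = defaultdict(lambda: {"A": 0, "B": 0})
    for feats, sense in labeled:
        for f in feats:
            counts[f][sense] += 1
    rules = []
    for f, c in counts.items():
        # smoothed log-likelihood ratio log P(A|f) / P(B|f)
        llr = math.log((c["A"] + alpha) / (c["B"] + alpha))
        rules.append((abs(llr), f, "A" if llr > 0 else "B"))
    rules.sort(reverse=True)                   # strongest evidence first
    return rules

def classify(rules, feats):
    # Apply the first (highest-ranked) matching rule.
    fs = set(feats)
    for strength, f, sense in rules:
        if f in fs:
            return sense, strength
    return None, 0.0                           # no rule matched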
26. Step 3b: Apply the classifier to the entire set
- Add to L the members of U that are tagged with probability above a threshold.
27. Step 3c: Filter and augment this addition with Assumption 2
28. Repeat Step 3 until convergence
29. The final DL classifier
30. The original algorithm
Keep initial labeling unchanged.
31. Options for obtaining the seeds
- Use words in dictionary definitions
- Use a single defining collocate for each sense
  - Ex: "bird" and "machine" for the word "crane"
- Label salient corpus collocates:
  - Collect frequent collocates
  - Manually check the collocates
- → Getting the seeds is not hard.
32. Performance
- Baseline: 63.9%
- Supervised learning: 96.1%
- Unsupervised learning: 90.6% to 96.5%
33. Why does it work?
- See Steven Abney (2004), "Understanding the Yarowsky Algorithm".
34. Summary of self-training
- The algorithm is straightforward and intuitive.
- It produces outstanding results.
- However, the added unlabeled data can pollute the original labeled data.
35. Additional slides
36. Papers on self-training
- Yarowsky (1995): WSD
- Riloff et al. (2003): identify subjective nouns
- Maeireizo et al. (2004): classify dialogues as emotional or non-emotional