SemiSupervised Boosting for Statistical Word Alignment

About This Presentation

Title:

SemiSupervised Boosting for Statistical Word Alignment

Description:

Phrase-based machine translation. System: Pharaoh. Metrics: NIST and BLEU ... Boosting does improve word alignment and translation quality ... – PowerPoint PPT presentation

Number of Views:41

Avg rating:3.0/5.0

Slides: 22

Provided by: wuh9

Category:

more less

Transcript and Presenter's Notes

Title: SemiSupervised Boosting for Statistical Word Alignment

1
Semi-Supervised Boosting for Statistical Word
Alignment

Wu Hua
2006/10/18

2
Outline

Introduction to semi-supervised learning
Introduction to boosting
Semi-supervised boosting for word alignment
Evaluation results
Conclusion

3
Machine Learning Methods

Supervised Learning
Labeled data
Unsupervised learning
Unlabeled data
Semi-supervised learning
Combine both labeled data and unlabeled data

4
Semi-Supervised Learning in NLP

Word sense disambiguation
(Yarowsky, 1995 Pham et al., 2005)
Classification
(Blum and Mitchell, 1998 Thorsten, 1999)
Clustering
(Basu et al., 2004)
Named entity classification
(Collins and Singer, 1999)
Parsing
(Sarkar, 2001)

5
Boosting Supervised Learning
Initialization
Supervised Learning
Call Learner
Calculate Error Rate
Re-weight Training data
Yes
Build Ensemble
6
Boosting in NLP

Tagging and PP attachment
(Abney et al., 1999)
Word sense disambiguation
(Escudero et al., 2000)
Parser construction
(Haruno et al., 1999 Henderson and Brill, 2000)
Sentence generation
(Walker et al., 2001)

7
Semi-Supervised Boosting

Three main problems
Semi-supervised learner
Combine labeled data and unlabeled data
Reference set
Automatically construct a reference set for
unlabeled data
Error rate calculation
How to calculate the error rate with both labeled
data and unlabeled data

8
Semi-Supervised Boosting Applied to Word Alignment
Labeled Data
Unlabeled Data
Supervised Training
Unsupervised Training
Model Interpolation
Real Reference Set
Error Rate Calculation
Pseudo Reference Set
Re-weight Training data
Yes
Build Ensemble
9
Semi-Supervised Boosting Applied to Word Alignment

Five main components
Word alignment model interpolation
Pseudo reference set construction for unlabeled
data
Error rate calculation
Weight update
Final Ensemble

10
Word Alignment Model

Supervised alignment model
Calculate the probabilities for IBM Model 4 based
on the labeled data
Unsupervised alignment model
Use GIZA to train IBM Model 4
Perform model interpolation

11
Pseudo Reference Set Construction

Obtain bi-directional word alignment sets S1 and
S2 on the training data
Obtain the intersection set of these two
alignment sets
Filter the union set of the two alignment sets
Build the pseudo reference set

where
12
Error Rate Calculation

For a sentence pair
Calculate the error rate of a aligner
Based on the labeled data instead of the whole
data

where
is the normalized weight of the ith sentence pair
at the lth round
13
Re-Weight the Training Data

Reweight each sentence pair in the training set
For each sentence pair, there may exist correct
links and incorrect links as compared with the
pseudo reference set
Calculate the weight of each sentence pair
according to the correct and incorrect links

where
K is the number of the error links n is the total
number of the links in the reference
14
Final Ensemble

Obtain the final ensemble according to the
trained word aligners on each round

where
is the final ensemble for word alignment
is the weight of each alignment pair (s,t)
produced by the word aligner
is the weight of the word aligner
15
Evaluation