Title: Syntax-based Statistical Machine Translation Models
1Syntax-based Statistical Machine Translation Models
- Amr Ahmed
- March 26th 2008
2Outline
- The Translation Problem
- The Noisy Channel Model
- Syntax-light SMT
- Why Syntax?
- Syntax-based SMT Models
- Summary
3Statistical Machine Translation
Problem
- Given a sentence (f) in one language, produce
its equivalent in another language (e)
I know how to do this
"One naturally wonders if the problem of
translation could conceivably be treated as a
problem in cryptography. When I look at an
article in Arabic, I say: 'This is really written
in English, but it has been coded in some strange
symbols. I will now proceed to decode.'"
(Warren Weaver, 1947)
4Statistical Machine Translation
Problem
- Given a sentence (f) in one language, produce
its equivalent in another language (e)
Noisy Channel Model
[Figure: e is drawn from P(e) and passes through the noisy channel to produce f]
- P(e) models good English
- P(f|e) models good translation
We know how to factor P(e)!
Today: how to factor P(f|e)?
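The decision rule behind the noisy channel picture can be written out as a short derivation. This is the standard Bayes-rule argument; the symbol \hat{e} for the chosen translation is a name introduced here, not one used on the slide.

```latex
\begin{align*}
\hat{e} &= \arg\max_{e} P(e \mid f) \\
        &= \arg\max_{e} \frac{P(e)\, P(f \mid e)}{P(f)} \\
        &= \arg\max_{e} P(e)\, P(f \mid e)
\end{align*}
```

P(f) is dropped in the last step because it does not depend on e, which leaves the language model P(e) and the translation model P(f|e).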
5Outline
- The Translation Problem
- The Noisy Channel Model
- Syntax-light SMT
- Word-based Models
- Phrase-based Models
- Why Syntax?
- Syntax-based SMT Models
- Summary
6Word-Translation Models
[Figure: word alignment between the German sentence "Auf diese Frage habe ich leider keine Antwort bekommen" and the English sentence "NULL I did not unfortunately receive an answer to this question"]
Blue word links aren't observed in data.
- What is the generative story?
- IBM Models 1-4
- Roughly equivalent to FST (modulo reordering)
- Learning and decoding?
Slide credit: adapted from Smith et al.
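As one way to make the "learning" question concrete, here is a minimal sketch of EM training for IBM Model 1, the simplest of the word-based models. The function name `ibm_model1`, the three-sentence toy corpus, and the iteration count are assumptions made for illustration; Models 2-4 add distortion and fertility, which this sketch omits.

```python
from collections import defaultdict


def ibm_model1(corpus, iterations=10):
    """Estimate t[(f, e)] = P(f | e) with EM.

    corpus: list of (foreign_words, english_words) pairs (lists of tokens).
    A NULL token is added to every English sentence, as in the alignment figure.
    """
    f_vocab = {f for fs, _ in corpus for f in fs}
    # Uniform initialisation over the foreign vocabulary.
    t = defaultdict(lambda: 1.0 / len(f_vocab))
    for _ in range(iterations):
        count = defaultdict(float)   # expected count of (f, e) links
        total = defaultdict(float)   # expected count of e being linked at all
        for fs, es in corpus:
            es = ["NULL"] + es
            for f in fs:
                # E-step: distribute one unit of mass for f over the English words.
                z = sum(t[(f, e)] for e in es)
                for e in es:
                    c = t[(f, e)] / z
                    count[(f, e)] += c
                    total[e] += c
        # M-step: count and normalize.
        t = defaultdict(float, {(f, e): c / total[e] for (f, e), c in count.items()})
    return t


# Toy corpus (hypothetical data, chosen only to show the behaviour).
toy_corpus = [
    (["das", "Haus"], ["the", "house"]),
    (["das", "Buch"], ["the", "book"]),
    (["ein", "Buch"], ["a", "book"]),
]
t = ibm_model1(toy_corpus)
```

On this toy data the table concentrates mass on the pairs that co-occur consistently, for example `t[("das", "the")]` ends up larger than `t[("Haus", "the")]`.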
7Word-Based Translation Models
[Figure: e is transformed into f by a sequence of channel operations]
- Stochastic operations
- Associated with probabilities
- Estimated using EM
In a Nutshell
Q: What are we learning? A: Word movement
Linguistic hypothesis, leading to phrase-based models:
1- Words move in blocks
2- Context is important
8Phrase-Based Translation Models
[Figure: e is segmented into phrases, each phrase is translated, and the translated phrases are re-ordered to produce f]
- Stochastic operations: segment, translation, re-ordering
- Associated with probabilities
- Estimated using EM
In a Nutshell
Q: What are we learning? A: Word movement
Linguistic hypothesis behind phrase-based models:
1- Words move in blocks
2- Context is important
Markovian dependency
9Phrase-Based Translation Models
[Figure: the same e-to-f pipeline (segment, translation, re-ordering), now annotated with the alignment variables a1, a2, a3 and a Markovian dependency between them]
10Phrase-Based Models Example
- Not necessarily syntactic phrases
- Division into phrases is hidden
[Figure: phrase-level alignment between "Auf diese Frage habe ich leider keine Antwort bekommen" and "I did not unfortunately receive an answer to this question"]
- Score each phrase pair using several features
Slide credit: from Smith et al.
11Phrase Table Estimation
Basically count and Normalize
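"Count and normalize" can be sketched as a relative-frequency estimate over extracted phrase pairs. The function name `estimate_phrase_table` and the toy phrase pairs are assumptions for illustration; a real system would first extract the pairs from word-aligned data and would usually estimate both translation directions.

```python
from collections import Counter


def estimate_phrase_table(phrase_pairs):
    """Relative-frequency estimate phi(f | e) = count(f, e) / count(e).

    phrase_pairs: list of (foreign_phrase, english_phrase) string pairs.
    """
    pair_counts = Counter(phrase_pairs)
    english_counts = Counter(e for _, e in phrase_pairs)
    return {(f, e): c / english_counts[e] for (f, e), c in pair_counts.items()}


# Hypothetical extracted phrase pairs.
pairs = [
    ("es gibt", "there is"),
    ("es gibt", "there is"),
    ("gibt es", "there is"),
    ("keine Antwort", "no answer"),
]
phrase_table = estimate_phrase_table(pairs)
```

Here "there is" was seen three times, twice with "es gibt", so that pair gets probability 2/3.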
13Outline
- The Translation Problem
- The Noisy Channel Model
- Syntax-light SMT
- Why Syntax?
- Syntax-based SMT Models
- Summary
14Why Syntax?
- Reference: consequently proposals are submitted
to parliament under the assent procedure, meaning
that parliament can no longer table amendments,
as directives in this area were adopted as single
market legislation under the codecision procedure
on the basis of art.100a tec.
- Translation: consequently, the proposals
parliament after the assent procedure, the tabled
amendments for offers no possibility of community
directives, because as part of the internal
market legislation on the basis of article 100a
of the treaty in the codecision procedure have
been adopted.
Slide credit: example from Cowan et al.
15Why Syntax?
Slide credit: adapted from Cowan et al.
- Reference: consequently proposals are submitted
to parliament under the assent procedure, meaning
that parliament can no longer table amendments,
as directives in this area were adopted as single
market legislation under the codecision procedure
on the basis of art.100a tec.
- Translation: consequently, the proposals
parliament after the assent procedure, the tabled
amendments for offers no possibility of community
directives, because as part of the internal
market legislation on the basis of article 100a
of the treaty in the codecision procedure have
been adopted.
Here syntax can help!
What Went Wrong?
- Phrase-based systems are very good at predicting
content words
- But they are less accurate at producing function
words, or at producing output that correctly encodes
grammatical relations between content words
16Structure Does Help!
Does adding more structure help?
[Figure: noisy-channel diagrams mapping Se to Sf, with phrases x1 x2 x3 re-ordered as x2 x1 x3. Performance improves from word-based to phrase-based models; for syntax-based models: better performance?]
17Syntax and the Translation Pipeline
[Figure: translation pipeline: Input -> Pre-reordering -> Translation system -> Post processing (re-ranking) -> Output. Syntax can be used at each stage, including inside the translation model]
18Early Exposition (Koehn et al 2003)
- Fix a phrase-based system and vary the way
phrases are extracted
- Frequency-based, generative, constituent
- Adding syntax hurts the performance
- Phrase pairs like "there is -> es gibt" are not
constituents (this eliminates 80% of phrase pairs)
- Explanation
- No hierarchical re-ordering
- Syntax is not fully exploited here!
- Parse trees produce errors
19Outline
- The Translation Problem
- The Noisy Channel Model
- Syntax-light SMT
- Why Syntax?
- Syntax-based SMT Models
- Summary
20The Big Picture: Translation Models
[Figure: translation triangles with String, Syntax, and Inter-lingua levels on the source and target sides, one per model family: Word-based; Phrase-based; SCFG (Chiang 2005), ITG (Wu 97); Tree-Tree Transducers; Tree-String Transducers; String-Tree Transducers]
21Learning Synchronous Grammar
- No linguistic annotation
- Model P(e,f) jointly
- Trees are hidden variables
- EM doesn't work well with large missing
information
- Structural restrictions
- Binary rules (ITG, Wu 97)
- Lexical restriction (Chiang 2005)
SCFG to represent hierarchical phrases
What is Synchronous Grammar?
22Interlude: Synchronous Grammar
- Extension of monolingual theory to bitext
- CFG -> SCFG
- TAG -> STAG
- etc.
- Monolingual parsers are extended for bitext
parsing
23Synchronous Grammar: SCFG
[Figure: side-by-side panels titled CFG and SCFG]
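To illustrate what "synchronous" means, here is a minimal sketch of an SCFG derivation that rewrites a source and a target string in lockstep. The rule names (`r1` to `r3`), the `expand` helper, and the toy rules themselves are assumptions for illustration, written in the style of the hierarchical rules of Chiang 2005; integers in a right-hand side mark linked nonterminal slots.

```python
# Each rule pairs a source right-hand side with a target right-hand side.
# An integer k means "the k-th child derivation goes here" on both sides,
# which is what keeps the two sides synchronized.
RULES = {
    "r1": (["yu", 0, "you", 1], ["have", 1, "with", 0]),
    "r2": (["Bei", "Han"], ["North", "Korea"]),
    "r3": (["bangjiao"], ["diplomatic", "relations"]),
}


def expand(derivation):
    """Turn a derivation tree (rule_name, [child derivations]) into a
    (source_words, target_words) pair."""
    name, children = derivation
    src_rhs, tgt_rhs = RULES[name]
    parts = [expand(child) for child in children]

    def fill(rhs, side):
        out = []
        for sym in rhs:
            if isinstance(sym, int):
                out.extend(parts[sym][side])   # linked nonterminal slot
            else:
                out.append(sym)                # terminal word
        return out

    return fill(src_rhs, 0), fill(tgt_rhs, 1)


source, target = expand(("r1", [("r2", []), ("r3", [])]))
```

A single derivation yields both strings: the two slots appear in one order on the source side and in the opposite order on the target side, which is how an SCFG encodes re-ordering.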
24Learning Synchronous Grammar
- No linguistic annotation
- Model P(e,f) jointly
- Trees are hidden variables
- EM doesn't work well with large missing
information
- Structural restrictions
- Binary rules (ITG, Wu 97)
- Lexical restriction (Chiang 2005)
SCFG to represent hierarchical phrases
What is Synchronous Grammar?
How?
25Hierarchical Phrase-based Model
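[Figure: Hierarchical phrase-based models: paired derivations S1 and S2 in which the phrases x1, x2, x3 contain words (e1 ... e6 and f1 ... f4) as well as nested sub-phrases, so re-ordering applies recursively. Phrase-based models: Se = x1 x2 x3 maps to Sf = x2 x1 x3 with flat phrases]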
26Example (Chiang 2005)
27Hierarchical Phrase-based Model
Question 1: How to train the model?
What are the restrictions?
- At most two recursive phrases
- Restriction on length
Question 2: How to decode?
28Training and Decoding
- Collect initial grammar rules
29Training and Decoding
- Collect initial grammar rules
- Tune rule weights: count and normalize!
- Decoding
- CYK (remember, rules have at most two
non-terminals)
- Parse the f part only.
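One way to picture how initial rules become hierarchical ones: when a phrase pair contains a smaller phrase pair, the smaller pair can be replaced by a linked nonterminal on both sides. The sketch below assumes plain token matching and a hypothetical helper named `make_hierarchical_rule`; an actual extractor would work from alignment spans and would enforce the restrictions listed earlier (at most two nonterminals, length limits).

```python
def make_hierarchical_rule(phrase_pair, sub_pair, slot="X1"):
    """Replace one occurrence of sub_pair inside phrase_pair with a slot.

    Both arguments are (foreign_tokens, english_tokens) pairs of lists.
    Returns the generalized pair, or None if the sub-phrase is missing
    from either side.
    """
    def replace_once(words, sub):
        n = len(sub)
        for i in range(len(words) - n + 1):
            if words[i:i + n] == sub:
                return words[:i] + [slot] + words[i + n:]
        return None

    f_words, e_words = phrase_pair
    f_sub, e_sub = sub_pair
    new_f = replace_once(f_words, f_sub)
    new_e = replace_once(e_words, e_sub)
    if new_f is None or new_e is None:
        return None
    return new_f, new_e


# Toy phrase pairs (illustrative, in the style of Chiang 2005).
big = ("yu Bei Han you bangjiao".split(),
       "have diplomatic relations with North Korea".split())
small = ("Bei Han".split(), "North Korea".split())
rule = make_hierarchical_rule(big, small)
```

The resulting rule keeps the lexical anchors ("yu", "you", "have", "with") and leaves a gap that any other phrase pair can fill, which is what lets the grammar capture re-ordering patterns.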
30Does it help?
- Experimental Details
- Mandarin-to-English (FBIS corpus)
- 7.2M + 9.2M words
- Dev set: NIST 2002 MT evaluation
- Test set: NIST 2003 MT evaluation
- 7.5% relative improvement over phrase-based
models using BLEU score
- 0.02 absolute improvement over baseline
31Does it help?
- 7.5% relative improvement over phrase-based models
- Learnt rules are formally SCFG but not
linguistically interpretable
- The model learns re-ordering patterns guided by
lexical function words
- Captures long-range movements via recursion
32Follow-Up study
- Why not decorate the phrases with their
grammatical constituents?
- Zollmann et al. 2006, 2007
- If possible, decorate the phrase with a
constituent
- Generalize phrases as in Chiang 2005
- Parse using chart parsing
- Moved from 31.85 -> 32.15 over the CMU phrase-based
system
- Spanish-English corpus
33The Big Picture: Translation Models
[Figure: translation triangles with String, Syntax, and Inter-lingua levels on the source and target sides, one per model family: Word-based; Phrase-based; SCFG (Chiang 2005), ITG (Wu 97); Tree-Tree Transducers; Tree-String Transducers; String-Tree Transducers]
34Tree-String Transducers
- Linguistic Tools
- English Parse Trees
- from statistical parser
- Alignment
- from GIZA
- Conditional Model
- P(f | Te)
- Models differ on
- How to factor P(f | Te)
- Domain of locality
- SCFG (Yamada, Knight 2001)
- STSG (Galley et al. 2004)
Caveat
35Tree-String (Yamada & Knight)
- Back to noisy channel model
- Transduces Te into f
- Stochastic channel operations (on trees)
- Reorder children
- Insert node
- Lexical translation
36Channel operations
Reordering: P(VB TO -> TO VB)
Insertion: P(right | PRP) * P_i(ha)
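The three channel operations can be sketched on a toy tree as follows. All tables and numbers below are made up for illustration, as are the names `R`, `N`, `INS`, `T`, and `best_output`. The function follows the most probable choice at every node and multiplies the probabilities, which assumes that the choices at different nodes are independent.

```python
from math import isclose

# Hypothetical probability tables (not taken from any paper).
R = {  # reordering: child labels -> {permutation of child indices: prob}
    ("PRP", "VB", "TO"): {(0, 2, 1): 0.7, (0, 1, 2): 0.3},
    ("TO", "NN"): {(1, 0): 0.9, (0, 1): 0.1},
}
N = {"PRP": {"right": 0.8, "none": 0.2}}   # insertion position given node label
INS = {"ha": 0.6, "ga": 0.4}               # which word is inserted
T = {                                      # lexical translation P(f | e)
    "he": {"kare": 0.9, "kanojo": 0.1},
    "listens": {"kiku": 0.8, "kikoeru": 0.2},
    "to": {"wo": 0.6, "ni": 0.4},
    "music": {"ongaku": 0.9, "kyoku": 0.1},
}


def best_output(node):
    """Return (foreign_words, probability) for the most probable choices."""
    label, body = node
    if isinstance(body, str):              # leaf: translate the English word
        word, p = max(T[body].items(), key=lambda kv: kv[1])
        words = [word]
    else:                                  # internal node: reorder children
        labels = tuple(child[0] for child in body)
        identity = tuple(range(len(body)))
        order, p = max(R.get(labels, {identity: 1.0}).items(),
                       key=lambda kv: kv[1])
        words = []
        for i in order:
            child_words, child_p = best_output(body[i])
            words += child_words
            p *= child_p
    # Insertion: nothing, or the best word to the left or right of this node.
    ins_word, ins_p = max(INS.items(), key=lambda kv: kv[1])
    options = {pos: q * (1.0 if pos == "none" else ins_p)
               for pos, q in N.get(label, {"none": 1.0}).items()}
    pos, q = max(options.items(), key=lambda kv: kv[1])
    if pos == "left":
        words = [ins_word] + words
    elif pos == "right":
        words = words + [ins_word]
    return words, p * q


tree = ("VB", [("PRP", "he"), ("VB", "listens"),
               ("TO", [("TO", "to"), ("NN", "music")])])
words, prob = best_output(tree)
```

With these toy tables the verb moves to the end, "ha" is inserted to the right of the PRP node, and every leaf is translated word by word.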
37Learning
- Learn channel operation probabilities
- Reordering
- Insertion
- Translation
- Standard EM-Training
- E-step: compute expected rule counts (dynamic programming)
- M-step: count and normalize
38Decoding As Parsing
- In a nutshell, we learnt how to parse the foreign
side
- Add CFG rules from the English side
- Channel rules
- Reordering
- If (VB2 -> VB TO) is reordered as (VB2 -> TO VB)
- Add rule VB2 -> TO VB with probability p
- Insertion
- V -> X V with probability p_l, V -> V X with probability p_r, and X -> f_i
- Translation
- e_i -> f_i with probability p_t
39Decoding Example
40Results and Expressiveness
- English-Chinese task
- Short sentences, fewer than 20 words (3M word corpus)
- Test set: 347 sentences with at most 14 words
- Better BLEU score (0.102) than IBM Model 4 (0.072)
What it can represent
- Depends on syntactic divergence between language
pairs
- Trees must be isomorphic up to child re-ordering
- Channel rules have the following format
[Figure: channel rule format (child re-ordering)]
Q: What can't it model?
41Limitations
- Can't model syntactic movements that cross
brackets
- SVO to VSO
- Modal movement between English and French
- Not -> ne .. pas (from English to French)
[Figure: English VP tree (Aux "Does", "Not", VB "go") next to the French VP tree ("ne", "va", "pas")]
The span of Not can't intersect that of Go
Can't interleave Green with the other two
42Limitations Possible solutions
- Some follow-up studies showed relative improvement
- Gildea 2003 added cloning operations
- AER went from 0.42 -> 0.3 on a Korean-English corpus
[Figure: English VP tree (Aux "Does", "Not", VB "go") next to the French VP tree ("ne", "va", "pas")]
The span of Not can't intersect that of Go
Can't interleave Green with the other two
43Tree-String Transducers
- Linguistic Tools
- English Parse Trees
- from statistical parser
- Alignment
- from GIZA
- Conditional Model
- P(f | Te)
- Models differ on
- How to factor P(f | Te)
- Domain of locality
- SCFG (Yamada, Knight 2001)
- STSG (Galley et al. 2004)
Caveat
44Learning Expressive Rules (Galley 2004)
[Figure: Yamada & Knight: channel operation tables -> parsing rules for the F-side (CFG rules), applied to f1, f2, ..., fn. Galley et al. 2004: rule extraction -> TSG rules]
- Condition on larger fragments of the trees
45Rule Format and Decoding
[Figure: derivation table with columns Rule, Current State, and Derivation Step for "he does not go". A tree-fragment rule pairs a VP fragment (Aux Does, Not, VB) with the CFG-style string "ne VB pas"; a later step combines NP and VP over "NP ne VB pas"]
- Tree is built bottom up
- Foreign string at each derivation step may have
non-terminals
- Rules are extracted from the training corpus
- English side trees
- Foreign side strings
- Alignment from GIZA
46Rule Extraction
[Figure: parse tree S(NP(PRP he), VP(Aux Does, RB Not, VB go)) aligned to "il ne va pas" (he-il, go-va, Not-ne and pas), with each node annotated by the foreign words it covers]
Upward projection
Frontier nodes: nodes whose span is exclusive
Frontier set: S, NP, PRP, he, VP, VB, go
Frontier graph
Extract rules as before
47Illustrating Rule Extraction
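A minimal sketch of how frontier nodes could be found for the running example (he does not go / il ne va pas). It assumes the tree shape S(NP(PRP), VP(AUX, RB, VB)), the alignment he-il, not-ne and pas, go-va with "does" unaligned, and the following test: a node is a frontier node when its span is non-empty and the contiguous closure of its span contains no foreign word aligned to a leaf outside the node. The names `frontier_nodes`, `tree`, and `alignment` are hypothetical.

```python
def frontier_nodes(tree, alignment, n_english):
    """List the labels of frontier nodes in pre-order.

    tree: (label, children), where a child is either another node or the
          integer position of an English word.
    alignment: dict from English position to a set of foreign positions.
    """
    result = []

    def leaves(node):
        _, children = node
        out = []
        for child in children:
            if isinstance(child, int):
                out.append(child)
            else:
                out.extend(leaves(child))
        return out

    def visit(node):
        label, children = node
        inside = set(leaves(node))
        # Foreign positions reachable from this node (its span) ...
        span = set().union(*(alignment.get(i, set()) for i in inside))
        # ... and from every English word outside this node.
        outside = set().union(*(alignment.get(i, set())
                                for i in range(n_english) if i not in inside))
        if span:
            closure = set(range(min(span), max(span) + 1))
            if not (closure & outside):
                result.append(label)
        for child in children:
            if not isinstance(child, int):
                visit(child)

    visit(tree)
    return result


# English: he(0) does(1) not(2) go(3); French: il(0) ne(1) va(2) pas(3)
tree = ("S", [("NP", [("PRP", [0])]),
              ("VP", [("AUX", [1]), ("RB", [2]), ("VB", [3])])])
alignment = {0: {0}, 2: {1, 3}, 3: {2}}
frontier = frontier_nodes(tree, alignment, n_english=4)
```

RB is excluded because the closure of its span (ne ... pas) also covers "va", which belongs to VB; AUX is excluded because "does" is unaligned. The sketch reports labeled nodes only, so the words "he" and "go" from the slide's frontier set do not appear separately.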
48Minimality of Extracted rules
- Other rules can be composed from these minimal
rules
[Figure: minimal VP rule fragments (Aux Does, RB Not, VB go, aligned to ne ... va ... pas) being composed into a larger rule]
49Probability Estimation
- Just EM
- Modified inside-outside for the E-step
- Decoding as parsing
- Training can be done using off-the-shelf
tree transducers (Knight et al. 2004)
50Evaluation
- Coverage
- How well the learnt rules explain the corpus
- 100% coverage on the F-E and C-E corpora
Translation Results
Decoder was still work in progress
51The Big Picture: Translation Models
[Figure: translation triangles with String, Syntax, and Inter-lingua levels on the source and target sides, one per model family: Word-based; Phrase-based; SCFG (Chiang 2005), ITG (Wu 97); Tree-Tree Transducers; Tree-String Transducers; String-Tree Transducers]
52Tree-Tree Transducers
- Linguistic Tools
- E/F Parse Trees
- from statistical parser
- Alignment
- from Giza
- Conditional Model
- P(Tf | Te)
- Models differ on
- How to factor P(Tf | Te)
- Really many, many, many ways
- CFG vs. dependency trees
- How to train
- EM: most of them
- Discriminative (Collins 2006)
- Directly model P(Te | Tf)
Same Caveat as before
53Discriminative tree-tree models
- Directly model P(Te | Tf)
- Translate from German to English
- Extended Projections (EPs)
- Just elementary trees from TAG with
- One verb
- Lexical function words
- NP, PP placeholders
- Learns to map between tree fragments
- German clause -> EP
- Modeled as structured learning
54How to Decode?
- Why no generative story?
- Because this is a direct model!
- Given a German string
- Parse it
- Break it into clauses
- Predict an EP from each clause
- Translate German NP, PP using Pharaoh
- Map translated German NP and PP to holes in EP
- Structure learning comes here
- Stitch clauses to get English translation
55How to train
- Training data: aligned clauses
- Extraction procedure
- Parse English and German
- Align (NP, PP) in them using GIZA
- Break parse trees into clauses
- Order clauses based on verb position
- Discard sentences with a different number of
clauses
- Training set: (e1,g1), ..., (en,gn)
56How to train (2)
57How to train (3)
- Just our good old perceptron friend!
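For readers who have not met it, a structured perceptron can be sketched as below. This is a generic version over an explicit candidate list, not the training setup of the cited work; the function names, the word-paired-with-output feature map, and the two toy clauses are all assumptions for illustration.

```python
from collections import Counter


def features(x, y):
    """Hypothetical feature map: pair every source word with the output y."""
    return Counter((word, y) for word in x.split())


def score(w, feats):
    return sum(w[k] * v for k, v in feats.items())


def predict(w, x, candidates):
    return max(candidates, key=lambda y: score(w, features(x, y)))


def train_perceptron(examples, epochs=5):
    """examples: list of (x, gold_y, candidates) triples."""
    w = Counter()
    for _ in range(epochs):
        for x, gold, candidates in examples:
            guess = predict(w, x, candidates)
            if guess != gold:
                # Reward the gold structure's features, penalize the guess's.
                w.update(features(x, gold))
                w.subtract(features(x, guess))
    return w


# Toy data: German clauses mapped to stand-ins for extended projections.
cands = ["I have", "you have"]
examples = [("ich habe", "I have", cands), ("du hast", "you have", cands)]
w = train_perceptron(examples)
```

The update only fires on mistakes, so training needs nothing more than a way to find the best-scoring candidate under the current weights.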
58Results
- German-English Europarl corpus
- 750k training sentences -> 441,000 training
clauses, test on 2000 sentences
- BLEU score
- Baseline: 25.26
- This system: 23.96
- Human judgment
- 62% equal
- 16% better under this system
- 22% better for baseline
- Largely because lots of restrictions were imposed
59Outline
- The Translation Problem
- The Noisy Channel Model
- Syntax-light SMT
- Why Syntax?
- Syntax-based SMT Models
- Summary
60Summary
- Syntax does help, but
- What is the right representation?
- Is it language-pair specific?
- How to deal with parser errors?
- Modeling the uncertainty of the parsing process
- Large-scale syntax-based models
- Are they possible?
- What are the trade-offs?
- Better parameter estimation!
- Should we trust the GIZA alignment results?
- Block-translation vs. word-word?
62Related Work
- Fast parsers for synchronous grammars
- Grammar binarization
- Fast K-best list parses
- Re-ranking
- Syntax driven evaluation measures
- Impact of parsing quality on overall system
performance
63SAMT