Title: Dependency Trees and Machine Translation
1Dependency Trees and Machine Translation
- Vamshi Ambati
- Vamshi_at_cs.cmu.edu
- Spring 2008 Adv MT Seminar
- 02 April 2008
2Today
- Introduction
- Dependency formalism
- Syntax in Machine Translation
- Dependency Tree based Machine Translation
- By projection
- By synchronous modeling
- Conclusion and Future
4Dependency Trees
Phrase Structure Trees
John gave Mary an apple
5Dependency Trees
Phrase Structure Trees Labels
[Figure: phrase structure tree with node labels S, VP, NP over the tagged sentence "John/N gave/V Mary/N an/DT apple/N"]
6Dependency Trees
- Head Percolation
- Usually done deterministically
- Assuming one head per phrase
[Figure: phrase structure tree for "John gave Mary an apple" with phrase nodes relabeled by their head words (gave, gave, apple)]
7Dependency Trees
[Figure: tree with "gave" at the top, "apple" below it, and "John", "Mary", "an" at the bottom level]
8Dependency Trees
[Figure: dependency tree over the words John, Mary, gave, an, apple]
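The head percolation step on the preceding slides can be sketched in a few lines of Python. This is a minimal illustration, not a standard converter: the `HEAD_RULES` table, the tuple encoding of the tree, and the function names are all assumptions made for this example, and words double as node identifiers, which only works when no word repeats.

```python
# Toy head rules (an assumption for illustration, not a published head table):
# for each phrase label, the child labels to try as head, in priority order.
HEAD_RULES = {"S": ["VP", "NP"], "VP": ["V", "VP"], "NP": ["N", "NP"]}

def is_leaf(node):
    """A leaf is (tag, word); an internal node is (label, [children])."""
    return isinstance(node[1], str)

def to_dependencies(node, deps):
    """Return the head word of `node`; append (head, dependent) pairs to `deps`."""
    label, body = node
    if is_leaf(node):
        return body
    heads = [to_dependencies(child, deps) for child in body]
    child_labels = [child[0] for child in body]
    # Pick exactly one head child per phrase, deterministically.
    head_idx = 0
    for wanted in HEAD_RULES.get(label, []):
        if wanted in child_labels:
            head_idx = child_labels.index(wanted)
            break
    # Every non-head child's head word becomes a dependent of the phrase head.
    for i, word in enumerate(heads):
        if i != head_idx:
            deps.append((heads[head_idx], word))
    return heads[head_idx]

tree = ("S", [
    ("NP", [("N", "John")]),
    ("VP", [("V", "gave"),
            ("NP", [("N", "Mary")]),
            ("NP", [("DT", "an"), ("N", "apple")])]),
])
deps = []
root = to_dependencies(tree, deps)
print(root, sorted(deps))
```

With these toy rules, "gave" becomes the root with John, Mary and apple as dependents, and "an" attaches to "apple".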
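9Dependency Trees Basics
[Figure: an arc between "John" and "gave" carrying the (optional) label SUBJ]
- Names for the dependent word (John): Child, Dependent, Modifier
- Names for the governing word (gave): Parent, Governor, Head, Modified
- The direction of arrows can be head-to-child or child-to-head, so the convention in use has to be stated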
10Dependency Trees Basics
- Properties
- Every word has a single head/parent
- Except for the root
- Completely connected tree
- Acyclic
- If w_i -> w_j then never w_j -> w_i
- Variants
- Projective: no crossing between dependencies
- If w_i -> w_j, then for all k between i and j, either w_k -> w_i or w_k -> w_j holds, directly or through a chain of dependencies
- Non-projective: allows crossings between dependencies
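The properties listed on this slide can be checked mechanically. The sketch below assumes a tree is stored as a list `heads`, where `heads[i]` is the index of word i's head and -1 marks the root; that encoding and the function names are choices made for this example. Storing one head per word enforces the single-head property by construction, so only the root, cycle and projectivity checks remain.

```python
def is_well_formed(heads):
    """Exactly one root, and every word reaches it without a cycle."""
    roots = [i for i, h in enumerate(heads) if h == -1]
    if len(roots) != 1:
        return False
    for start in range(len(heads)):
        seen, node = set(), start
        while node != -1:
            if node in seen:      # came back to a visited word: cycle
                return False
            seen.add(node)
            node = heads[node]
    return True

def depends_on(heads, k, target):
    """True if word k reaches `target` by following head links (k == target counts)."""
    while k != -1:
        if k == target:
            return True
        k = heads[k]
    return False

def is_projective(heads):
    """Call only on well-formed trees; cycles would make depends_on loop forever."""
    for dep, head in enumerate(heads):
        if head == -1:
            continue
        lo, hi = sorted((dep, head))
        for k in range(lo + 1, hi):
            # Every word between the two ends must depend on one of them.
            if not (depends_on(heads, k, dep) or depends_on(heads, k, head)):
                return False
    return True

# "John gave Mary an apple": gave is the root, an -> apple -> gave.
print(is_projective([1, -1, 1, 4, 1]))   # projective
# Four words whose arcs 0-2 and 1-3 cross.
print(is_projective([2, 3, -1, 2]))      # not projective
```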
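11Projective dependency tree
[Figure: projective dependency tree over an example sentence that includes the word "ounces"]
- Projectiveness: all the words in the marked stretch of the sentence finally depend on either "was" or "."
Example credit: Yuji Matsumoto, NAIST, Japan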
12Non-projective dependency tree
- Direction of edges: from a parent to its children
- Note: phrases extracted as units joined by dependencies can be discontinuous
Example from R. McDonald and F. Pereira, EACL 2006.
13Dependency Grammar (DG) in the Grammar Formalism Timeline
- Panini (2600 years ago, India) recognised, distinguished and classified semantic, syntactic and morphological dependencies (Bharati, Natural Language Processing)
- The Arabic grammarians (1200 years ago, Iraq) recognised government and syntactic dependency structure (Owens, The Foundations of Grammar)
- The Latin grammarians (800 years ago) recognised 'determination' and dependency structures (Percival, "Reflections on the History of Dependency Notions")
- Lucien Tesnière (1930s, France) developed a relatively formal and sophisticated theory of DG for use in schools
- PSG, CCG etc were around the same time, in the early 20th century
Source: ESSLLI 2000 Tutorial on Dependency Grammars
14Dependency Trees: some phenomena
- DG has been widely accepted as a variant of PSG, but it is not strongly equivalent
- Constituents are implicit in a DepTree and can be derived
- Relations are explicit and can be labelled, although labels are optional
- No explicit non-terminal nodes, which also means no unary productions
- Can handle discontinuous phrases too
- Known problems with coordination and gerunds
15Phrase structure vs Dependency
- Phrase structure suitable to languages with
- rather fixed word order patterns
- clear constituency structures
- English etc
- Dependency structure suitable to languages with
- greater freedom of word order
- order is controlled more by pragmatic than by syntactic factors
- Slavonic (Czech, Polish) and some Romance (Italian, Spanish etc)
17Phrasal SMT discussion
- Advantages
- Do not have to compose translations unnecessarily
- Local re-ordering captured in phrases
- Already specific to the domain and capture context locally
- Disadvantages
- Specificity and no generalization
- Discontiguous phrases not considered
- Global reordering
- Estimation problems (long vs short phrases)
- Cannot model phenomena across phrases
- Limitations
- Phrase sizes (how large before running out of memory?)
- Corpus availability makes it feasible only for certain language pairs
18Syntax in MT: Many Representations
- Word-level MT: no syntax
- SMT: phrases / contiguous sequences
- SMT: hierarchical pseudo-syntax
- Syntax based SMT: constituent
- Syntax based SMT: CCG
- Syntax based SMT: LFG
- Syntax based SMT: dependency
19Syntax in MT: Many ways of incorporation
- Pre-processing
- Reordering input
- Reordered training corpus
- Translation models
- Syntactically informed alignment models
- Better distortion models
- Language Models
- Syntactic language models
- Syntax motivated models
- Post-processing
- N-best list reranking with syntactic information
- Translation correction: case marker/TAM correction
- True casing etc?
- Multi combinations with syntactic backbones?
20Syntax based SMT discussion
- Inversion Transduction Grammar (Wu 96)
- Very constrained form of syntax: one non-terminal
- Some expressive limitations
- Not linguistically motivated
- Effectively learns preferences for flip/no-flip
- Generative Tree to String (Yamada and Knight 2001)
- Expressiveness (last week's presentation)
- No discontiguous phrases
- Multitext grammars (Melamed 2003)
- Formalized, but MT work yet to be realized
- Hierarchical MT (Chiang 2005)
- Linguistic generalizations
- Handles discontiguous phrases recursively
- Estimation problems and phrase table size are increased even more
- Across phrase boundary modeling
21Syntax in MT and Dependency Trees
- Tree and String (syntax on the source side only: source tree S_e, target string)
- Source side tree is provided
- Target side is obtained by projection
- Problem of isomorphism between trees
- head-switching
- empty-dep, extra-dep
- Tree and Tree (syntax on both sides: source tree S_e, target tree S_f)
- Source side tree is provided
- Target side tree is provided
- Ideally non-isomorphic trees should be modeled too
23Dependency Tree based Machine Translation
- By projection
- Fox 2002
- Dekang Lin 2004
- Quirk et al 2004, Quirk et al 2006, Menezes et al 2007
- By synchronous modeling
- Alshawi et al 2001
- Jason Eisner 2003
- Fox 2005
- Yuan Ding and Martha Palmer 2005
24Phrasal Cohesion and Statistical Machine Translation (Heidi Fox, EMNLP 2002)
- English-French corpus was used
- En-Fr are similar
- For phrase structure trees
- Head crossings involve the head constituent of the phrase and its modifier spans
- Modifier crossings involve only spans of modifier constituents
- For dependency trees
- Head crossings are crossings of the span of a child with its parent
- Modifier crossings: same as above
- Dependency structures show a cohesive nature across translation
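To make the two kinds of crossing concrete for dependency trees, here is a rough Python sketch. It is a simplified span-overlap count written for this example, not the exact procedure of the paper: a subtree's target span is taken as the minimum and maximum aligned target position, and any overlap of spans counts as a crossing. The toy tree, the alignment and all function names are assumptions.

```python
def subtree_nodes(children, node):
    """All source words in the subtree rooted at `node`."""
    out = [node]
    for child in children.get(node, []):
        out.extend(subtree_nodes(children, child))
    return out

def target_span(nodes, alignment):
    """(min, max) target position covered by `nodes`, or None if unaligned."""
    positions = [t for n in nodes for t in alignment.get(n, [])]
    return (min(positions), max(positions)) if positions else None

def overlaps(a, b):
    return a is not None and b is not None and a[0] <= b[1] and b[0] <= a[1]

def count_crossings(children, alignment):
    """Return (head_crossings, modifier_crossings) under the simplification above."""
    head_x = mod_x = 0
    for head, kids in children.items():
        head_span = target_span([head], alignment)
        kid_spans = [target_span(subtree_nodes(children, k), alignment) for k in kids]
        # Head crossing: the head word's span overlaps a child's subtree span.
        head_x += sum(overlaps(head_span, s) for s in kid_spans)
        # Modifier crossing: two sibling subtree spans overlap.
        for i in range(len(kid_spans)):
            for j in range(i + 1, len(kid_spans)):
                mod_x += overlaps(kid_spans[i], kid_spans[j])
    return head_x, mod_x

# Source: tired(0) men(1) and(2) dogs(3), with an assumed tree and(men(tired), dogs).
children = {2: [1, 3], 1: [0]}
# Target: hommes(0) et(1) chiens(2) fatigués(3).
alignment = {0: [3], 1: [0], 2: [1], 3: [2]}
print(count_crossings(children, alignment))
```

In this toy pair, moving the adjective to the end of the French phrase stretches the "men" subtree over the whole target, producing one head crossing and one modifier crossing; a monotone alignment would give none.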
25A Path-based Transfer Model (Dekang Lin 2004)
- Input
- Word-aligned
- Source parsed
- Syntax translation model
- Set of paths in source tree
- Extract connected target path
- Generalization of paths to POS
- Modeling
- Relative likelihood
- Smoothing factor for noise
26A Path-based Transfer Model (Dekang Lin 2004)
- Decoding
- Parse input and extract all paths, extract target paths
- Find a set of transfer rules
- Cover the entire source tree
- Can be consistently merged
- Lexicalized rule preferred
- Future work?
- Word ordering is addressed
- Transfer rules from the same sentence follow the order in the sentence
- Only one example of a path: follow the order in the rule
- Many examples: pick relative distance from head
- Highest probability
- Dynamic Programming
- Min-set cover problem applied to trees
27A Path-based Transfer Model (Dekang Lin 2004)
- Evaluation
- English-French 1.2M
- Source parsed by Minipar
- 1755 test set
- 5 to 15 words long sentences
- Compared to Koehn's results from the 2003 paper
- No Language Model or extra generation module
- Order defined by paths is linear
- Some heuristics to maintain linearity
- Generalization of paths (transfer rules): quadratic vs. exponential
- Direct Correspondence Approach (DCA) is violated when translation divergences exist
- Very naïve notion of reordering and merge conflict resolution
28Dependency Treelet Translation (Quirk et al, ACL 2004, 05, 06)
- Project dependencies from source to target via word alignment
- One-one: project dependency to aligned words
- Many-one: nothing to do, as the projected is the head
- One-many: project to right most, and rest are attached to it
- Reattachment of modifiers to lowest possible node that preserves target word order
- Treelet extraction
- All subtrees on source until a particular limit, and the corresponding target fragment which is connected
- MLE for scoring
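The one-one, many-one and one-many projection rules above can be sketched as follows. This is an illustrative reading of the bullets, not the authors' implementation: the data layout (`src_heads` as a list of head indices with -1 for the root, `alignment` as a dict from source index to target indices) and the function name are assumptions, and the reattachment step and unaligned target words are left out.

```python
def project_dependencies(src_heads, alignment, n_tgt):
    """Project a source dependency tree onto the target side via word alignment.

    Returns a list of target head indices (-1 for the root, None where the
    sketch assigns no head, e.g. unaligned target words).
    """
    tgt_heads = [None] * n_tgt
    # Each aligned source word is represented by its rightmost target word.
    rep = {s: max(ts) for s, ts in alignment.items() if ts}

    # One-many: the remaining target words attach to the rightmost one.
    for s, ts in alignment.items():
        for t in ts:
            if t != rep[s]:
                tgt_heads[t] = rep[s]

    for s, t in rep.items():
        h = src_heads[s]
        # Climb past source heads that are unaligned or share target word t
        # (many-one: a dependency inside one target word needs no projection).
        while h != -1 and rep.get(h, t) == t:
            h = src_heads[h]
        # One-one: the dependency carries over to the aligned words.
        tgt_heads[t] = -1 if h == -1 else rep[h]
    return tgt_heads

# tired(0) men(1) and(2) dogs(3), with an assumed tree: tired -> men -> and <- dogs
src_heads = [1, 2, -1, 2]
# hommes(0) et(1) chiens(2) fatigués(3)
alignment = {0: [3], 1: [0], 2: [1], 3: [2]}
print(project_dependencies(src_heads, alignment, 4))
```

In this example the projected arc from "hommes" to "fatigués" spans "et" and "chiens", which do not depend on either word, so the raw projection is non-projective; that is the situation the reattachment bullet addresses.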
29Dependency Treelet Translation (Quirk et al, ACL 2004, 05, 06)
[Figure: aligned dependency trees for "tired men and dogs" and "hommes et chiens fatigués", with treelet pairs over and/et, men/hommes, dogs/chiens and tired/fatigués]
- Treelet with missing roots
30Dependency Treelet Translation (Quirk et al, 2004, 05, 06)
- Translation Model
- Trained from the aligned projected corpus
- Log-linear with feature functions
- Channel Model
- Treelet Prob
- Lexical Prob
- Order Model
- Head relative
- Swap model
- Target Model
- Target language model
- Bigram Agreement model (opt)
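A log-linear combination of the models listed above reduces to a weighted sum of feature values, usually log-probabilities. The sketch below is only illustrative: the feature names mirror the bullets, while the weights and scores are made-up numbers, not values from the papers.

```python
def loglinear_score(features, weights):
    """Weighted sum of feature values (here: log-probabilities)."""
    return sum(weights[name] * value for name, value in features.items())

# Hypothetical weights; in practice these are tuned on held-out data.
weights = {"treelet": 1.0, "lexical": 0.5, "order": 1.0, "target_lm": 2.0}

# Hypothetical log-scores for two candidate translations.
candidates = {
    "A": {"treelet": -1.0, "lexical": -2.0, "order": -0.5, "target_lm": -3.0},
    "B": {"treelet": -2.0, "lexical": -1.0, "order": -1.0, "target_lm": -2.0},
}

best = max(candidates, key=lambda c: loglinear_score(candidates[c], weights))
for name, feats in candidates.items():
    print(name, loglinear_score(feats, weights))
print("best:", best)
```

Candidate A has the better treelet and order scores, but the heavier language-model weight makes B the winner, which shows how the weights trade the component models off against each other.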
31Dependency Treelet Translation (Quirk et al, ACL 2004, 05, 06)
- Decoding (step by step)
- Input is a dependency analyzed source
- Challenge is that left-right may not work when starting with a tree
- Obtain best target tree combining the models
- Exhaustive search using DP
- Translate bottom up, from a given subtree (ITG)
- For each head node extract all matching treelets x_i
- For each uncovered subtree extract all matching treelets y_i
- Try all insertions of y_i into slots in x_i
- Ordering model ranks all the re-ordering possibilities for the modifiers
32Dependency Treelet Translation (Quirk et al, ACL 2004, 05, 06)
- Decoding optimizations
- Duplicate translations: check and reuse
- N-best list (only maintain top best candidates)
- Early pruning before reordering (channel model)
- Greedy reordering (pick best one and move on)
- Variable n-best size (dynamically reduce n with increasing uncovered subtrees)
- Deterministic pruning of treelets based on MLE (allowing decoder to try more reorderings)
- A* decoding
- Estimate the cost of an uncovered node reordering instead of computing it exactly
- Heuristics for optimistic estimates for each of the models
33Dependency Treelet Translation (Quirk et al, ACL 2004, 05, 06)
- Evaluation
- Eng-French
- 1.5M parallel sentences of Microsoft technical documentation
- NLPWIN parsed on Eng side
- GIZA++ trained
- Target LM: French side of parallel data
- Tuned on 250 sentences for MaxBLEU
- Tested on 10K unseen
- 1 reference
34Improvements to Treelet Translation
- Dependency Order Templates (ACL 2007)
- Improve generality in translation
- Learn un-lexicalised order templates
- Only used at runtime, for restricting search space in reordering
- Minimal Translation Units (HLT NAACL 2005)
- Bilingual n-gram channel model (Banchs et al. 2005)
- M = <m1, m2>
- m1 = <s_i, t_j>
- Instead of conditioning on the surface-adjacent MTU, they condition on the head-word chain
35Dependency Tree based Machine Translation
- By projection
- Fox 2002
- Dekang Lin 2004
- Quirk et al 2004, Quirk et al 2006, Menezes et al 2007
- By synchronous modeling
- Alshawi et al 2001
- Jason Eisner 2003
- Yuan Ding and Martha Palmer 2005
- Fox 2005
36Learning Dependency Translation Models as Collections of Finite-State Head Transducers (Alshawi et al 2001)
- Head transducers variant
- Middle-out string transduction vs. left-right
- Can be used in a hierarchical fashion, if you consider input/output for non-head transitions as strings rather than words
- Dependency transduction model
- May not always be a dependency model in the conventional sense
- Empty in/out
37Learning Dependency Translation Models as Collections of Finite-State Head Transducers (Alshawi et al 2001)
- Training: given unaligned bitext
- Compute co-occurrence statistics at word level
- Find a hierarchical synchronous alignment driven by a cost function
- Construct a set of head transducers that explain the alignment
- Calculate the transition weights by MLE
- Decoding
- Similar to CKY or chart parsing, but middle-out
- Given input, find the best applications of transducers
- A derivation spanning the entire input means it has probably found the best dependencies for source and target
- Else string together the most probable partial hypotheses to form a tree
- Pick the target tree with the lowest score and read off the string
38Learning Dependency Translation Models as Collections of Finite-State Head Transducers (Alshawi et al 2001)
- Evaluation
- Eng-Spanish (ATIS data: 13,966 train, 1185 test)
- Eng-Jap (speech transcribed data: 12,226 train, 3253 test)
- Discussion
- Language agnostic, direction agnostic
- Induced dependency tree may not be syntactically motivated, but is suited to translation
- Application of transducers is done locally, so less context information is available
- A single transducer tries to do everything, so training may have sparsity problems
39Learning non-isomorphic tree mappings for MT (Jason Eisner 2003)
- Non-isomorphism arises not just from language divergences but also from free translation
- A version of Tree Substitution Grammar
- To learn from unaligned non-isomorphic trees
- Generalization driven by a statistical model instead of linguistic minimalism
- Expressive, with empty string insertions
- Formulated for both PSG and DG
- Translation model
- Joint model P(T_s, T_t, A)
- Alignment
- Decoding
- Training
- Factorization helps
- Reconstruct all derivations for a tree with an efficient tree parsing algorithm for TSG
- EM as an efficient inside-outside training on all derivations
- Decoding
- Chart parsing to create a forest of derivations for the input tree
- Maximize over probability of derivations
- The 1-best derivation parse is the syntactic alignment
- Example pair: 1. Kids kiss Sam quite often 2. Lots of kids give kisses to Sam
40Machine Translation Using Probabilistic Synchronous Dependency Insertion Grammars (Ding and Palmer 2005)
- SDIG
- Like STAG, STIG for phrase structures
- Basic units are elementary trees
- Handles non-isomorphism at sub-tree level
- Cross-lingual inconsistencies are handled if they appear within basic units
- Crossing-dependency
- Broken-dependency
41Machine Translation Using Probabilistic Synchronous Dependency Insertion Grammars (Ding and Palmer 2005)
- Induction of SDIG for MT as synchronous hierarchical tree partitioning
- Train IBM Model 1 scores for bitext
- For each category of node, starting with NP
- Perform synchronous tree partitioning operations
- Compute prob of word pair (e_i, f_i) where the operation can be performed
- Heuristic functions (graphical model) guide the partitioning
42Machine Translation Using Probabilistic Synchronous Dependency Insertion Grammars (Ding and Palmer 2005)
- Translation
- Decoding for MT
- Translation is obtained by
- maximizing over all possible derivations of the source tree
- translation of the elementary trees
- Analogous to HMM (emission and transition probs with elementary trees)
- Decoding is similar to a Viterbi-style algorithm on the tree
- Hooks
- Augmenting corpus by singleton ETs from Model 1
- Smoothing probabilities
43Machine Translation Using Probabilistic Synchronous Dependency Insertion Grammars (Ding and Palmer 2005)
- Evaluation
- Chinese-English system
- Parses from Dan Bikel's parser for both Cn and En, trained from parallel treebanks
- Test with 4 refs
- Compared with
- GIZA++ trained
- ISI Rewrite Decoder
- NIST increased 97
- BLEU increased 27
- Reordering ignored for now
44Dependency Based Statistical MT (Fox 2005)
- Czech-English parallel corpus (Penn TB and Prague TB)
- Morphological processing and tecto-grammatical conversion for Czech trees
- No processing for English trees
- Alignment of subtrees via IBM Model 4 scores
- followed by structural modification of trees to suit alignment (KEEP, SPLIT, BUD)
- Translation Model
45Dependency Based Statistical MT (Fox 2005)
- Decoding
- Best-first decoder
- Process given Czech input to dependency tree and translate each node independently
- For each node
- Choose head position
- Generate English POS sequence
- Generate the feature list
- Perform structural mutations
- Syntax Language Model
- Takes as input a forest of phrase structures
- Invert decoder forest output (dep tree nodes) into phrase structures
- Reordering is entirely left to LM
- Evaluation
- Work in progress
- Proposed to use BLEU
47Conclusion
- The good
- Easy to work with
- Cohesive during projection
- Builds well on top of existing PBSMT (effective combination of lexicalization and syntax)
- Supports modeling a target even with crossing phrase boundaries
- Degrades gracefully over new domains
- The bad
- Reordering is not crucial, but expensive
- Lots of hooks for decoding
- Generalization explodes space
- The not so good
- Current approaches require a dependency tree on source side and a strong model for the target side
48What Next
- 1 year
- Better scoring and estimation in syntactic translation models
- Does improvement in dependency parse quality translate directly? (Chris Quirk et al 2006) What about MST Parser etc?
- Better word alignment and its effect on the model
- Incorporating labeled dependencies. Will it help?
- Factored dependency tree based models
- Approximate subtree matching and parsing algorithms
- 3-5 years
- Decoding algorithms and the target-ordering problem
- Discriminative approaches to MT are catching up. How can syntax be incorporated into such a framework?
- Better syntactic language models based on the dependency formalisms
- Semantics in translation (are DepTrees the first step?)
- Fusion of dependency and constituent approaches (LFG style)
- Joint modeling approaches (Eisner 03, Smith 06 QS Grammar)
- Taking MT to other applications like cross-lingual retrieval and QA, which already use dependency formalisms
49Thanks to
- Lori Levin: for discussion on dependency tree formalism
- Amr Ahmed: for discussion and separation of work
- Respective authors of the papers, for some of the graphic images I liberally used in the slides
50Questions
51DG Variants
- Case Grammar (Anderson)
- Daughter-Dependency Theory (Hudson)
- Dependency Unification Grammar (Hellwig)
- Functional-Generative Description (Sgall)
- Lexicase (Starosta)
- Meaning-Text Model (Mel'cuk)
- Metataxis (Schubert)
- Unification Dependency Grammar (Maxwell)
- Constraint Dependency Grammar (Maruyama)
52Motivation Questions
- 1. How is dependency analysis used in syntax-based MT? How do the algorithms vary if only the source side analysis is present?
- 2. How do the decoding and transfer phases adapt when using dependency analysis? What algorithms exist and what is the complexity analysis?
- 3. How does dependency-based syntax incorporation in MT compare with other grammar formalisms like phrase structure grammar?
- 4. Is there a class of languages which yield better to dependency analysis than to other analyses?
- 5. Dependency analysis being close to semantics, does it help MT produce better results?
53Other Papers
- Quasi-Synchronous Grammars for Soft Syntactic Projection, David Smith and Jason Eisner 2007
- Automatic Learning of Parallel Dependency Treelet Pairs, Yuan Ding and Martha Palmer 2004
- Dependency vs. Constituents for Tree-Based Alignment, Dan Gildea 2003
- My compilation
- http://kathmandu.lti.cs.cmu.edu:8080/wiki/index.php/AMTSchedule