Title: Morphological Analysis for Phrase-Based Statistical Machine Translation
1. Morphological Analysis for Phrase-Based Statistical Machine Translation
HYP update - part1
- Luong Minh Thang
- WING group meeting 15 Aug, 2008
2. Agenda
- Introduction - what does my project title mean?
- Language pair
- English-Finnish challenges
- Related work
- Project direction
3. Introduction I - phrase-based SMT
- Statistical: derive statistical information from large data
- Phrase-based: capture local constraints
Source (positions 0-7): NULL Mary did not slap the green witch
Target (positions 1-9): Maria no daba una bofetada a la bruja verde
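A minimal sketch of how phrase pairs could be read off a word alignment for the Mary/Maria sentence pair above. The alignment links below are an assumption (the slide's alignment lines are not preserved here), positions are 0-based with NULL left out, and `extract_phrase_pairs` is a hypothetical helper implementing a simplified consistency check, not necessarily the extraction used in any particular SMT toolkit.

```python
def extract_phrase_pairs(src, tgt, links, max_len=4):
    """Collect phrase pairs whose words are aligned only to each other."""
    pairs = []
    for i1 in range(len(src)):
        for i2 in range(i1, min(i1 + max_len, len(src))):
            # Target positions linked to any source word in the span.
            tgt_pos = [j for i, j in links if i1 <= i <= i2]
            if not tgt_pos:
                continue  # span contains only unaligned words
            j1, j2 = min(tgt_pos), max(tgt_pos)
            # Reject the span if a target word inside [j1, j2] is linked
            # to a source word outside [i1, i2].
            if any(j1 <= j <= j2 and not (i1 <= i <= i2) for i, j in links):
                continue
            pairs.append((" ".join(src[i1:i2 + 1]), " ".join(tgt[j1:j2 + 1])))
    return pairs


src = "Mary did not slap the green witch".split()
tgt = "Maria no daba una bofetada a la bruja verde".split()
# Assumed links (source index, target index); "did" and "a" are left unaligned.
links = [(0, 0), (2, 1), (3, 2), (3, 3), (3, 4), (4, 6), (5, 8), (6, 7)]

pairs = extract_phrase_pairs(src, tgt, links)
for pair in pairs:
    print(pair)
```

Under these assumed links, "green witch" pairs with "bruja verde" as a unit, while "the green" alone is rejected because "bruja" inside its target span is linked to "witch". This is the kind of local constraint a phrase pair captures.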
4. Introduction II - Morphology
- Morpheme: minimal meaning-bearing unit
- machines = machine + s
- translation = translate + ion
- goalkeeper = goal + keeper
- English is a low-inflected language with a simple morphological structure
- -> Highly inflected languages are much more complicated!
5. Introduction III - highly inflected languages
- Concatenate a chain of morphemes to form a word
- Finnish: oppositio + kansa + n + edusta + ja
- (opposition + people + of + represent + -ative)
- = opposition member of parliament
- Turkish: uygarlaştıramadıklarımızdanmışsınızcasına
- (uygar + laş + tır + ama + dık + lar + ımız + dan + mış + sınız + casına)
- "(behaving) as if you are among those whom we could not cause to become civilized"
This is a word!!!
6. Introduction IV - Why morphology-aware SMT?
- Tackle the data sparseness problem
- (Statistics from 1,021,180 sentence pairs)
- Capture the relations among words

Language   Type count   Token count
English    105,144      121,442,173
Finnish    516,102      130,128,883
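To make the sparseness argument concrete, the counts above can be turned into an average number of occurrences per word type. This is a small sketch using only the figures on this slide; the variable and function names are illustrative.

```python
# (type count, token count) for each side of the parallel corpus, from the slide.
counts = {
    "English": (105_144, 121_442_173),
    "Finnish": (516_102, 130_128_883),
}


def tokens_per_type(types, tokens):
    """Average number of times each distinct word form occurs."""
    return tokens / types


ratios = {lang: tokens_per_type(*c) for lang, c in counts.items()}
# How many times more distinct word forms Finnish has than English.
type_ratio = counts["Finnish"][0] / counts["English"][0]

for lang, ratio in ratios.items():
    print(f"{lang}: {ratio:.1f} tokens per type")
print(f"Finnish/English type ratio: {type_ratio:.2f}")
```

With a similar number of tokens, Finnish has almost five times as many word types, so each Finnish word form is seen about 252 times on average against about 1,155 times for English. Splitting words into morphemes is one way to share statistics across those forms.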
7. Language pair I - our choice?
Vietnamese
- We chose English-Finnish as our main translation task
8. Language pair II - why Finnish?
- Honestly, I don't know Finnish
- But because:
- Available corpora
- Finnish is an agglutinative, morphologically complex language, suitable for our project scope
- Investigate translation from low- to high-inflected languages -> an area to explore, yet hard!
9. English-Finnish challenges I - many-to-one word relationship
- Finnish uses suffixes to express grammatical
relations and also to derive new words
Case          Suffix   English prep.     Sample word form   Translation of the sample
nominatiivi   -        -                 talo               house
genetiivi     -n       of                talon              of (a) house
essiivi       -na      as                talona             as a house
inessiivi     -ssa     in                talossa            in (a) house
elatiivi      -sta     from (inside)     talosta            from (a) house
komitatiivi   -ne-     together (with)   taloineni          with my house(s)
(about 14-15 cases for nouns)
Not merely concatenating
- Many-to-one English-Finnish word relationship -> need word-morpheme correspondence
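The word-morpheme correspondence can be illustrated with a naive suffix stripper built from the case table above. This is only a sketch: `split_case` is a hypothetical helper, it covers just the word-final suffixes listed on the slide, and it ignores vowel harmony (e.g. -ssa vs. -ssä), stem changes, and nominative forms that happen to end in one of these suffixes.

```python
# Case suffixes copied from the table above. komitatiivi (-ne-) is left out
# because it is not word-final: in "taloineni" it is followed by a possessive ending.
CASE_SUFFIXES = [
    ("-ssa", "inessiivi", "in"),
    ("-sta", "elatiivi", "from (inside)"),
    ("-na", "essiivi", "as"),
    ("-n", "genetiivi", "of"),
]


def split_case(word):
    """Return (stem, case, English preposition), trying longer suffixes first."""
    for suffix, case, prep in sorted(CASE_SUFFIXES, key=lambda e: len(e[0]), reverse=True):
        ending = suffix.lstrip("-")
        if word.endswith(ending) and len(word) > len(ending):
            return word[: -len(ending)], case, prep
    # No listed suffix matched: treat the word as an uninflected (nominative) form.
    return word, "nominatiivi", None


for form in ["talo", "talon", "talona", "talossa", "talosta"]:
    print(form, "->", split_case(form))
```

Each inflected form maps to one stem plus one suffix, and the suffix lines up with a separate English word ("in", "of", ...), which is the many-to-one relationship an alignment between English words and Finnish morphemes would have to capture.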
10. English-Finnish challenges II - word order
- Word order is free in Finnish
- Pete rakastaa Annaa: "Pete loves Anna" (normal)
- Annaa Pete rakastaa: emphasizes Annaa
- Rakastaa Pete Annaa: emphasizes rakastaa ("Pete does love Anna")
- Pete Annaa rakastaa: stress on Pete
- Rakastaa Annaa Pete: does not sound like a normal sentence, but is quite understandable
11. English-Finnish challenges III - surface form generation
- After translating from English words -> Finnish morphemes, we need a surface generation step
- oppositio kansa n edusta ja -> oppositiokansanedustaja
- What if morphemes are missing or the morpheme order changes?
- -> Need a more error-tolerant surface recovery algorithm
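One possible shape for such a recovery step, sketched under strong assumptions: concatenate the morphemes and, if the result is not a known word, fall back to the closest entry in a vocabulary of surface forms. The vocabulary, the `recover_surface` helper, and the use of string similarity from Python's standard `difflib` module are all illustrative choices, not the algorithm this project proposes.

```python
import difflib


def recover_surface(morphemes, vocabulary, cutoff=0.6):
    """Join morphemes; if the result is unknown, pick the most similar known word."""
    candidate = "".join(morphemes)
    if candidate in vocabulary:
        return candidate
    # get_close_matches returns the best matches first, scored by SequenceMatcher.
    matches = difflib.get_close_matches(candidate, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else candidate


# Hypothetical vocabulary of known Finnish surface forms.
vocabulary = ["oppositiokansanedustaja", "kansanedustaja", "oppositio", "talossa"]

full = recover_surface(["oppositio", "kansa", "n", "edusta", "ja"], vocabulary)
missing = recover_surface(["oppositio", "kansa", "edusta", "ja"], vocabulary)
print(full)
print(missing)
```

Here the second call drops the morpheme "n" but still recovers the intended word. Plain string similarity is less reliable when morphemes are reordered, so a real recovery step would likely need something beyond this sketch.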
12. Related work I - low-to-high inflected languages
- Many works on translating from high- to low-inflected languages, but very few on the opposite direction, which is considered hard (Koehn, 2005)
- (Yang & Kirchhoff, 2006): Finnish-English, backoff
- (Oflazer & Durgar El-Kahlout, 2006, 2007): English-Turkish, word-morpheme translation, then simply concatenating morphemes
- All use language-dependent tools and syntactic knowledge: TreeTagger, Snowball stemmer
13. Related work II - surface form recovery
- (Toutanova et al., 2007, 2008): English-Russian, English-Arabic; translate stem-to-stem
- Predict inflection from stems using many different features (lexical, morphological, and syntactic)
- (Avramidis & Koehn, 2008): English-Greek
- Use syntax to get the missing morphology, depending on the syntactic position
- Noun case agreement and verb person conjugation
- -> Rely mostly on manually annotated data
14. Project direction
- Use a language-independent tool (Morfessor), based on unannotated data only
- (i.e. no feature data or syntactic information)
- Work on a general surface-form recovery
- We would like to have a unified view of the translation process, separating low-low, low-high, high-low, high-high
We are here
15. References I
- Chris Dyer, 2007: http://www.ling.umd.edu/redpony/edinburgh.pdf
- Jurafsky, D., & Martin, J. H. (2007). Speech and Language Processing (book)
- The Finnish language: http://www.cs.tut.fi/jkorpela/Finnish.html
- Yang & Kirchhoff, 2006: Phrase-based backoff models for machine translation of highly inflected languages
- Oflazer & Durgar El-Kahlout, 2006: Initial Explorations in English to Turkish Statistical Machine Translation
16. References II
- Oflazer & Durgar El-Kahlout, 2007: Exploring different representational units in English-to-Turkish statistical machine translation
- Toutanova et al., 2007: Generating complex morphology for machine translation
- Toutanova et al., 2008: Applying morphology generation models to machine translation
- Avramidis & Koehn, 2008: Enriching morphologically poor languages for statistical machine translation
17. Q & A?
18. To be continued