Morphological Analysis for Phrase-Based Statistical Machine Translation

1
Morphological Analysis for Phrase-Based
Statistical Machine Translation
HYP update - part1
  • Luong Minh Thang
  • WING group meeting 15 Aug, 2008

2
Agenda
  • Introduction - what does my project title mean?
  • Language pair
  • English-Finnish challenges
  • Related works
  • Project direction

3
Introduction I - phrase-based SMT
  • Statistical: derive statistical information from
    large data
  • Phrase-based: capture local constraints

Word alignment example (word positions in parentheses)
Source: NULL(0) Mary(1) did(2) not(3) slap(4) the(5) green(6) witch(7)
Target: Maria(1) no(2) daba(3) una(4) bofetada(5) a(6) la(7) bruja(8) verde(9)
4
Introduction II - Morphology
  • Morpheme: the minimal meaning-bearing unit
  • machines = machine + s
  • translation = translate + ion
  • goalkeeper = goal + keeper
  • English is a weakly inflected language with a simple
    morphological structure
  • -> Highly inflected languages are much more complicated!

5
Introduction III - highly inflected languages
  • Concatenate a chain of morphemes to form a word
  • Finnish: oppositio + kansa + n + edusta + ja
  • (opposition + people + of + represent + -ative)
  • = opposition member of parliament
  • Turkish: uygarlaştıramadıklarımızdanmışsınızcasına
  • (uygar + laş + tır + ama + dık + lar + ımız + dan +
    mış + sınız + casına)
  • "(behaving) as if you are among those whom we
    could not cause to become civilized"

This is a word!!!
6
Introduction IV - Why morphology-aware SMT?
  • Tackle the data sparseness problem
  • (Statistics from 1,021,180 sentence pairs)
  • Capture the relations among words

Language   Type count   Token count
English    105,144      121,442,173
Finnish    516,102      130,128,883
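
The type and token counts above show the sparseness problem: Finnish has roughly five times as many distinct word forms as English for a similar number of tokens. A minimal sketch of how such counts could be computed, assuming whitespace tokenization and lowercasing (the function name and the toy sentences are illustrative, not taken from the corpus):

```python
from collections import Counter


def type_token_counts(sentences):
    """Return (type count, token count) for whitespace-tokenized sentences."""
    counts = Counter(tok.lower() for s in sentences for tok in s.split())
    return len(counts), sum(counts.values())


# Toy data (an assumption for illustration, not corpus data).
english = ["in the house", "from the house", "of the house"]
finnish = ["talossa", "talosta", "talon"]

for name, data in [("English", english), ("Finnish", finnish)]:
    types, tokens = type_token_counts(data)
    print(name, "types:", types, "tokens:", tokens)
```

In the toy data every Finnish word form occurs only once, while the English side keeps reusing "the" and "house"; this is the effect that morpheme-level statistics are meant to counter.
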
7
Language pair I - our choice?
Vietnamese
  • We chose English - Finnish as our main
    translation task

8
Language pair II - why Finnish?
  • Honestly, I don't know Finnish
  • But we chose it because:
  • Corpora are available
  • Finnish is an agglutinative, morphologically complex
    language, suitable for our project scope
  • Translation from weakly to highly inflected
    languages -> an area to explore, yet hard!

9
English-Finnish challenges I - many-to-one word
relationship
  • Finnish uses suffixes to express grammatical
    relations and also to derive new words

Case         Suffix  English prep.    Sample word form  Translation of the sample
nominatiivi  -       -                talo              house
genetiivi    -n      of               talon             of (a) house
essiivi      -na     as               talona            as a house
inessiivi    -ssa    in               talossa           in (a) house
elatiivi     -sta    from (inside)    talosta           from (a) house
komitatiivi  -ne-    together (with)  taloineni         with my house(s)
(about 14-15 cases for nouns)
Not mere concatenation
  • Many-to-one English-Finnish word relationship ->
    need word-morpheme correspondence
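
A word-morpheme correspondence can be illustrated with a naive suffix splitter built from the case table above. This is only a sketch under strong assumptions: the stem list is given, only four suffixes from the table are covered, and vowel harmony, consonant gradation and compound suffixes such as in taloineni are ignored. The names `CASE_SUFFIXES` and `split_case` are hypothetical.

```python
# Suffix -> English preposition, copied from the case table above.
CASE_SUFFIXES = {
    "n": "of",
    "na": "as",
    "ssa": "in",
    "sta": "from",
}


def split_case(word, stems):
    """Split a word into (stem, suffix, preposition) by longest-suffix match.

    Returns None when the word cannot be analysed with the known stems.
    """
    for suffix in sorted(CASE_SUFFIXES, key=len, reverse=True):
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and stem in stems:
            return stem, suffix, CASE_SUFFIXES[suffix]
    if word in stems:
        return word, "", ""  # nominative: bare stem, no preposition
    return None


stems = {"talo"}
for form in ["talo", "talon", "talona", "talossa", "talosta"]:
    print(form, "->", split_case(form, stems))
```

Each analysed form pairs one Finnish word with an English preposition plus noun, which is the many-to-one relationship the slide refers to.
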

10
English-Finnish challenges II - word order
  • Word order is free in Finnish
  • Pete rakastaa Annaa: Pete loves Anna (normal)
  • Annaa Pete rakastaa: emphasizes Annaa
  • Rakastaa Pete Annaa: emphasizes rakastaa (Pete
    does love Anna)
  • Pete Annaa rakastaa: stress on Pete
  • Rakastaa Annaa Pete: does not sound like a normal
    sentence, but is quite understandable

11
English-Finnish challenges III - surface form
generation
  • After translating English words -> Finnish
    morphemes, a surface generation step is needed
  • oppositio + kansa + n + edusta + ja ->
    oppositiokansanedustaja
  • What if morphemes are missing or the morpheme
    order changes?
  • -> Need a more error-tolerant surface recovery
    algorithm
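
To make the problem concrete, here is one possible sketch of an error-tolerant recovery step. It is not the algorithm proposed in this project: it simply concatenates the morphemes and, if the result is not a known word, falls back to the closest entry in a word list using Python's standard `difflib`. The function name, the vocabulary and the 0.8 cutoff are assumptions for illustration.

```python
import difflib


def recover_surface(morphemes, vocabulary, cutoff=0.8):
    """Join morphemes; if the result is unknown, return the closest known word."""
    candidate = "".join(morphemes)
    if candidate in vocabulary:
        return candidate
    matches = difflib.get_close_matches(candidate, vocabulary, n=1, cutoff=cutoff)
    # Fall back to plain concatenation when nothing is similar enough.
    return matches[0] if matches else candidate


# Hypothetical word list; in practice it would come from the target-side corpus.
vocabulary = ["oppositiokansanedustaja", "kansanedustaja", "talossa"]

complete = ["oppositio", "kansa", "n", "edusta", "ja"]
missing_n = ["oppositio", "kansa", "edusta", "ja"]
reordered = ["oppositio", "kansa", "edusta", "n", "ja"]

for morphemes in (complete, missing_n, reordered):
    print(morphemes, "->", recover_surface(morphemes, vocabulary))
```

String similarity against a word list can only recover forms already seen in the data, so it does not help with unseen word forms; it is meant to show the kind of tolerance the slide asks for, nothing more.
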

12
Related works I - low-to-high inflected languages
  • Many works translate from highly to weakly inflected
    languages, but very few address the opposite
    direction, considered hard in (Koehn, 2005)
  • (Yang & Kirchhoff, 2006): Finnish-English,
    backoff
  • (Oflazer & Durgar El-Kahlout, 2006, 2007):
    English-Turkish, word-morpheme translation, then
    simply concatenating morphemes
  • All use language-dependent tools or syntactic
    knowledge: TreeTagger, Snowball stemmer

13
Related works II - surface form recovery
  • (Toutanova et al., 2007, 2008): English-Russian,
    English-Arabic; translate stem-to-stem
  • Predict inflection from stems using many
    different features (lexical, morphological, and
    syntactic)
  • (Avramidis & Koehn, 2008): English-Greek
  • Use syntax to get the missing morphology,
    depending on the syntactic position
  • Noun case agreement and verb person conjugation
  • -> Rely mostly on manually annotated data

14
Project direction
  • Use a language-independent tool (Morfessor), based
    on unannotated data only
  • (i.e. no feature data or syntactic
    information)
  • Work on a general surface-form recovery method
  • We would like a unified view of the
    translation process, separating low-low,
    low-high, high-low, high-high

We are here
15
Reference I
  • Chris Dyer, 2007. http://www.ling.umd.edu/~redpony/edinburgh.pdf
  • Jurafsky, D., & Martin, J. H. (2007). Speech and
    Language Processing (book)
  • The Finnish language. http://www.cs.tut.fi/~jkorpela/Finnish.html
  • Yang & Kirchhoff, 2006. Phrase-based backoff
    models for machine translation of highly
    inflected languages
  • Oflazer & Durgar El-Kahlout, 2006. Initial
    explorations in English to Turkish statistical
    machine translation

16
Reference II
  • Oflazer & Durgar El-Kahlout, 2007. Exploring
    different representational units in
    English-to-Turkish statistical machine
    translation
  • Toutanova et al., 2007. Generating complex
    morphology for machine translation
  • Toutanova et al., 2008. Applying morphology
    generation models to machine translation
  • Avramidis & Koehn, 2008. Enriching
    morphologically poor languages for statistical
    machine translation

17
Q & A?
18
To be continued
  • Thank you!