Morphological Preprocessing for Statistical Machine Translation

Transcript and Presenter's Notes

1
Morphological Preprocessing for Statistical
Machine Translation
NLP Meeting 10/19/2006
  • Nizar Habash, Columbia University
    habash@cs.columbia.edu

2
Road Map
  • Hybrid MT Research @ Columbia
  • Morphological Preprocessing for SMT
  • (Habash & Sadat, NAACL 2006)
  • Combination of Preprocessing Schemes
  • (Sadat & Habash, ACL 2006)

3
Why Hybrid MT?
  • StatMT and RuleMT have complementary advantages
  • RuleMT: Handling of possible but unseen word
    forms
  • StatMT: Robust translation of seen words
  • RuleMT: Better global target syntactic structure
  • StatMT: Robust local phrase-based translation
  • RuleMT: Cross-genre generalizations/robustness
  • StatMT: Robust within-genre translation
  • StatMT and RuleMT use complementary resources
  • Parallel corpora vs. dictionaries, parsers,
    analyzers, linguists
  • Hybrids can potentially improve over either
    approach

4
Hybrid MT Challenges
  • Linguistic phrase versus StatMT phrase
  • e.g., the StatMT phrase ". on the other hand , the"
  • Meaningful probabilities for linguistic resources
  • Increased system complexity
  • The potential to produce the combined worst
    rather than the combined best
  • Low Arabic parsing performance (70% Parseval
    F-score)
  • Statistical hallucinations

5
Hybrid MT Continuum
  • Hybrid is a moving target
  • StatMT systems use some rule-based components
  • Orthographic normalization, number/date
    translation, etc.
  • RuleMT systems nowadays use statistical n-gram
    language modeling
  • Hybrid MT systems
  • Different mixes of statistical/rule-based
    components
  • Resource availability
  • General approach directions
  • Adding rules/linguistics to StatMT systems
  • Adding statistics/statistical resources to RuleMT
    systems
  • Depth of hybridization
  • Morphology, syntax, semantics

6
Columbia MT Projects
  • Arabic-English MT focus
  • Different hybrid approaches

8
System Overview
[Diagram: Koehn hybrid scale from RuleMT to StatMT, locating the Columbia Primary and Columbia Contrast systems]
9
Research Directions
  • Syntactic SMT preprocessing
  • Syntax-aware phrase extraction
  • Statistical linearization using richer CFGs
  • Creation and integration of rule-generated
    phrase-tables
  • Lowering dependence on source language resources
  • Extension to other languages and dialects

10
Road Map
  • Hybrid MT Research @ Columbia
  • Morphological Preprocessing for SMT
  • Linguistic Issues
  • Previous Work
  • Schemes and Techniques
  • Evaluation
  • Combination of Preprocessing Schemes

11
Arabic Linguistic Issues
  • Rich Morphology
  • Clitics: w+ l+ Al+ mktb (CONJ + PART + DET + BASE;
    object PRON clitics can also attach) "and for the office"
  • Morphotactics
  • wlAlmktb is written wllmktb: the Alef of Al+
    drops after l+
  • Ambiguity
  • wjd "he found"
  • w+ jd "and grandfather"
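The clitic splitting above can be sketched with a toy regular expression over Buckwalter transliteration. This is only an illustration for this example: the clitic inventory is incomplete, and real systems use a morphological analyzer plus disambiguation rather than a single regex.

```python
import re

# Toy decliticization in Buckwalter transliteration: optionally split
# the conjunction w+, one of the particles s/l/b/k+, the determiner
# Al+, and the object pronoun +hA off a word. The non-greedy stem
# group (\w+?) yields the shortest stem compatible with the clitics.
def decliticize(word):
    m = re.match(r'^(w)?(s|l|b|k)?(Al)?(\w+?)(hA)?$', word)
    if not m:
        return [word]
    return [part for part in m.groups() if part]

print(decliticize('wsyktbhA'))  # ['w', 's', 'yktb', 'hA']
print(decliticize('wjd'))       # ['w', 'jd']
```

Note that the second call splits wjd "he found" as if it were w+ jd "and grandfather"; this is exactly the ambiguity the slide illustrates, and why context-sensitive disambiguation is needed.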

12
Previous Work
  • Morphological & syntactic preprocessing for SMT
  • French-English (Berger et al., 1994)
  • German-English (Nießen and Ney, 2000; 2004)
  • Spanish, Catalan, and Serbian to English (Popović
    and Ney, 2004)
  • Czech-English (Goldwater and McClosky, 2005)
  • Arabic-English (Lee, 2004)
  • We focus on morphological preprocessing
  • Larger set of conditions: schemes, techniques,
    learning curve, genre variation
  • No additional kinds of preprocessing (e.g., dates,
    numbers)

13
Road Map
  • Hybrid MT Research @ Columbia
  • Morphological Preprocessing for SMT
  • Linguistic Issues
  • Previous Work
  • Schemes and Techniques
  • Evaluation
  • Combination of Preprocessing Schemes

14
Preprocessing Schemes
Input: wsyktbhA "and he will write it"
ST: wsyktbhA
D1: w syktbhA
D2: w s yktbhA
D3: w s yktb hA
BW: w s y ktb hA
EN: w s ktb/VBZ S3MS hA
15
Preprocessing Schemes
  • ST: Simple Tokenization
  • D1: Decliticize CONJ
  • D2: Decliticize CONJ, PART
  • D3: Decliticize all clitics
  • BW: Morphological stem and affixes
  • EN: D3 + Lemmatize, English-like POS tags, Subj
  • ON: Orthographic Normalization
  • WA: wa decliticization
  • TB: Arabic Treebank
  • L1: Lemmatize, Arabic POS tags
  • L2: Lemmatize, English-like POS tags

Input: wsyktbhA "and he will write it"
ST: wsyktbhA
D1: w syktbhA
D2: w s yktbhA
D3: w s yktb hA
BW: w s y ktb hA
EN: w s ktb/VBZ S3MS hA
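The running example can also be written down as data. A quick sketch (token lists transcribed from the slide, segment markers omitted) makes the trade-off visible: finer schemes produce shorter, more reusable tokens at the cost of longer sentences.

```python
# wsyktbhA "and he will write it" under each scheme, in Buckwalter
# transliteration, as transcribed from the slide.
SCHEMES = {
    'ST': ['wsyktbhA'],
    'D1': ['w', 'syktbhA'],
    'D2': ['w', 's', 'yktbhA'],
    'D3': ['w', 's', 'yktb', 'hA'],
    'BW': ['w', 's', 'y', 'ktb', 'hA'],
    'EN': ['w', 's', 'ktb/VBZ', 'S3MS', 'hA'],
}

# Token count grows monotonically from ST to BW.
for name, toks in SCHEMES.items():
    print(f"{name}: {len(toks)} token(s) -> {' '.join(toks)}")
```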

22
Preprocessing Schemes
  • MT04: 1,353 sentences, 36,000 words

23
Preprocessing Schemes
  • Scheme accuracy
  • Measured against the Penn Arabic Treebank

24
Preprocessing Techniques
  • REGEX: Regular Expressions
  • BAMA: Buckwalter Arabic Morphological Analyzer
    (Buckwalter, 2002; 2004)
  • Pick first analysis
  • Use TOKAN (Habash, 2006)
  • A generalized tokenizer
  • Assumes a disambiguated morphological analysis
  • Declarative specification of any preprocessing
    scheme
  • MADA: Morphological Analysis and Disambiguation
    for Arabic (Habash & Rambow, 2005)
  • Combines multiple SVM classifiers
  • Selects a BAMA analysis
  • Use TOKAN

25
TOKAN
  • A generalized tokenizer
  • Assumes a disambiguated morphological analysis
  • Declarative specification of any tokenization
    scheme
  • D1: w f REST
  • D2: w f b k l s REST
  • D3: w f b k l s Al REST P O
  • TB: w f b k l REST P O
  • BW: MORPH
  • L1: LEXEME POS
  • ENG: w f b k l s Al LEXEME BIESPOS S
  • Uses a generator (Habash, 2006)
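The declarative idea can be sketched in a few lines. This is a hypothetical mini-interpreter, not TOKAN itself: a scheme is an ordered list of field names, and tokenization emits whichever requested fields the disambiguated analysis actually contains. Field names and the analysis layout are illustrative assumptions.

```python
# TOKAN-like declarative tokenization (hypothetical sketch): the
# scheme spec selects and orders fields of a disambiguated analysis.
def tokenize(analysis, spec):
    return [analysis[f] for f in spec if analysis.get(f)]

# A disambiguated analysis of wsyktbhA as a field -> value mapping
# (illustrative field names).
analysis = {'w': 'w', 's': 's', 'REST': 'yktbhA',
            'STEM': 'yktb', 'PRON': 'hA'}

print(tokenize(analysis, ['w', 's', 'REST']))          # D2-like: ['w', 's', 'yktbhA']
print(tokenize(analysis, ['w', 's', 'STEM', 'PRON']))  # D3-like: ['w', 's', 'yktb', 'hA']
```

The point of the design is that a new scheme is a one-line spec change rather than new tokenizer code.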

26
Road Map
  • Hybrid MT Research @ Columbia
  • Morphological Preprocessing for SMT
  • Linguistic Issues
  • Previous Work
  • Schemes and Techniques
  • Evaluation
  • Combination of Preprocessing Schemes

27
Experiments
  • Portage phrase-based MT (Sadat et al., 2005)
  • Training data: 5 million words of parallel text only
  • All in the news genre
  • Learning curve: 1%, 10% and 100%
  • Language modeling: 250 million words
  • Development (tuning) data: MT03 eval set
  • Test data:
  • MT04 (mixed genre: news, speeches, editorials)
  • MT05 (all news)

28
Experiments (contd)
  • Metric: BLEU (Papineni et al., 2001)
  • 4 references, case insensitive
  • Each experiment:
  • Select a preprocessing scheme
  • Select a preprocessing technique
  • Some combinations do not exist
  • e.g., REGEX with EN
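For reference, BLEU can be sketched compactly. This minimal version assumes a single reference per sentence, uniform 4-gram weights, and no smoothing (so it fails on zero n-gram matches); the experiments used four references and standard tooling.

```python
import math
from collections import Counter

def ngrams(toks, n):
    # Multiset of n-grams of a token list.
    return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))

def bleu(hyps, refs, max_n=4):
    # Corpus-level modified n-gram precision (clipped by the reference
    # counts via Counter intersection), uniform weights.
    log_p = 0.0
    for n in range(1, max_n + 1):
        match = total = 0
        for h, r in zip(hyps, refs):
            hg, rg = ngrams(h.split(), n), ngrams(r.split(), n)
            match += sum((hg & rg).values())
            total += sum(hg.values())
        log_p += math.log(match / total) / max_n
    # Brevity penalty against the reference length.
    hyp_len = sum(len(h.split()) for h in hyps)
    ref_len = sum(len(r.split()) for r in refs)
    bp = min(1.0, math.exp(1 - ref_len / hyp_len))
    return bp * math.exp(log_p)

print(bleu(['the cat sat on the mat'], ['the cat sat on the mat']))  # 1.0
```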

29
MT04 Results
[Chart: BLEU on MT04 by preprocessing scheme at 1%, 10% and 100% training data]
30
MT05 Results
[Chart: BLEU on MT05 by preprocessing scheme at 1%, 10% and 100% training data]
31
MT04 Genre Variation: Best Scheme & Technique
EN+MADA @ 1%, D2+MADA @ 100%
[Chart: BLEU by genre]
32
Other Results
  • Orthographic Normalization (ON) generally did better
    than the baseline ST
  • statistically significant at 1% training data
    only
  • wa decliticization (WA) was generally similar to D1
  • The Arabic Treebank scheme (TB) was similar to D2
  • Full lemmatization schemes (L1, L2) behaved like EN but
    always worse
  • 50% training data:
  • D2 @ 50% data > ST @ 100% data
  • Larger phrase size (14) did not differ
    significantly from the size 8 we used
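A common form of Arabic orthographic normalization, sketched in Buckwalter transliteration, conflates Alef variants with bare Alef and Alef-Maqsura with Ya. The exact rule set used in these experiments may differ; this is the widely used variant.

```python
import re

def normalize(text):
    # Alef-Hamza-above (>), Alef-Hamza-below (<), Alef-Madda (|) and
    # Alef-Wasla ({) -> bare Alef (A); Alef-Maqsura (Y) -> Ya (y).
    text = re.sub(r'[><|{]', 'A', text)
    return text.replace('Y', 'y')

print(normalize('<lY'))  # 'Aly' ("to")
```

The motivation is that these characters are written inconsistently in real text, so conflating them reduces sparsity at the cost of some ambiguity.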

33
Latest Results (July 2006)
34
Road Map
  • Hybrid MT Research @ Columbia
  • Morphological Preprocessing for SMT
  • Combination of Preprocessing Schemes

35
Oracle Combination
  • Preliminary study: oracle combination
  • MT04, 100% data, MADA technique, 11 schemes,
    sentence-level selection
  • Achieved 46.0 BLEU
  • (24% improvement over the best single system at 37.1)
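Sentence-level oracle selection can be sketched as follows. The study scored candidates with BLEU; to keep the sketch self-contained it uses a stand-in unigram-overlap score instead, and the candidate strings are invented.

```python
from collections import Counter

def overlap(hyp, ref):
    # Stand-in quality score: clipped unigram matches over hyp length.
    h, r = Counter(hyp.split()), Counter(ref.split())
    return sum((h & r).values()) / max(len(hyp.split()), 1)

def oracle_select(candidates, ref):
    # candidates: {scheme_name: one-best hypothesis string}.
    # Pick, per sentence, the scheme whose output scores best.
    return max(candidates, key=lambda s: overlap(candidates[s], ref))

cands = {'D2': 'he will write it down', 'EN': 'and he will write it'}
print(oracle_select(cands, 'and he will write it'))  # 'EN'
```

An oracle needs the reference, so it only bounds what combination could gain; the next slides look at realizable combination methods.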

36
System Combination
  • Exploit scheme complementarity to improve MT
    quality
  • Explore two methods of system combination
  • Rescoring-Only Combination (ROC)
  • Decoding-plus-Rescoring Combination (DRC)
  • We use all 11 schemes with the MADA technique

37
Rescoring-Only Combination (ROC)
  • Rescore all the one-best outputs generated from
    separate scheme-specific systems and return the
    top choice
  • Each scheme-specific system uses its own
    scheme-specific preprocessing, phrase tables and
    decoding weights

38
Rescoring-Only Combination (ROC)
  • Standard combo:
  • Trigram language model, phrase translation model,
    distortion model, and sentence length
  • IBM Model 1 and 2 probabilities in both
    directions
  • Other combo adds more features:
  • Perplexity of the source sentence (PPL) against a
    source LM (in the same scheme)
  • Number of out-of-vocabulary words in the source
    sentence (OOV)
  • Source sentence length (SL)
  • An encoding of the specific scheme (SC)
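A minimal rescoring sketch: score each system's one-best output with a weighted sum of feature values and return the top scheme. The feature names, values, and weights below are illustrative, not those tuned in the paper.

```python
def rescore(outputs, weights):
    # outputs: {scheme: {feature: value}} for that scheme's one-best
    # hypothesis. Features where lower is better (PPL, OOV) carry
    # negative weights so that max-score selection still applies.
    def score(feats):
        return sum(weights[f] * v for f, v in feats.items())
    return max(outputs, key=lambda s: score(outputs[s]))

outputs = {
    'D2': {'lm_logprob': -42.0, 'ppl': 180.0, 'oov': 1},
    'EN': {'lm_logprob': -45.0, 'ppl': 150.0, 'oov': 0},
}
weights = {'lm_logprob': 1.0, 'ppl': -0.01, 'oov': -0.5}
print(rescore(outputs, weights))  # 'D2'
```

In practice the weights would be tuned on the development set (MT03 here) rather than set by hand.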

39
Decoding-plus-Rescoring Combination (DRC)
  • Step 1: Decode
  • For each preprocessing scheme:
  • Use the union of phrase tables from all schemes
  • Optimize and decode (with the same scheme)
  • Step 2: Rescore
  • Rescore the one-best outputs of each
    preprocessing scheme
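Step 1's phrase-table union can be sketched as below. The table layout (source phrase mapped to a target phrase with a probability) is a simplification, and the slides do not say how duplicate source phrases across schemes are resolved; this sketch keeps the higher-scoring entry, while a real system could keep all entries.

```python
def union_tables(tables):
    # Merge scheme-specific phrase tables; on a duplicate source
    # phrase, keep the higher-probability (target, prob) entry.
    merged = {}
    for table in tables:
        for src, (tgt, prob) in table.items():
            if src not in merged or prob > merged[src][1]:
                merged[src] = (tgt, prob)
    return merged

d2_table = {'w s yktbhA': ('and he will write it', 0.6)}
en_table = {'w s yktbhA': ('he will write it', 0.4), 'ktb': ('write', 0.9)}
merged = union_tables([d2_table, en_table])
print(len(merged))  # 2
```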

40
Results
  • MT04 set
  • The best single scheme (D2) scores 37.1

41
Results
  • Statistical significance using bootstrap
    re-sampling (Koehn, 2004)

42
Conclusions
  • For large amounts of training data, splitting off
    conjunctions and particles performs best
  • For small amounts of training data, an
    English-like tokenization performs best
  • A suitable choice of preprocessing scheme and
    technique yields a significant increase in BLEU
    score if
  • there is little training data, or
  • there is a change in genre between training and
    test
  • System combination is potentially highly
    rewarding, especially when combining the phrase
    tables of different preprocessing schemes.

43
Future Work
  • Study additional variant schemes that current
    results support
  • Factored translation modeling
  • Decoder extension to use multiple schemes in
    parallel
  • Syntactic preprocessing
  • Investigate combination techniques at the
    sentence and sub-sentence levels

44
  • Thank you!
  • Questions?
  • Nizar Habash
  • habash@cs.columbia.edu