Title: Morphological Preprocessing for Statistical Machine Translation
1. Morphological Preprocessing for Statistical Machine Translation
NLP Meeting, 10/19/2006
- Nizar Habash, Columbia University
- habash@cs.columbia.edu
2. Road Map
- Hybrid MT Research @ Columbia
- Morphological Preprocessing for SMT (Habash and Sadat, NAACL 2006)
- Combination of Preprocessing Schemes (Sadat and Habash, ACL 2006)
3. Why Hybrid MT?
- StatMT and RuleMT have complementary advantages
  - RuleMT: handling of possible but unseen word forms; StatMT: robust translation of seen words
  - RuleMT: better global target syntactic structure; StatMT: robust local phrase-based translation
  - RuleMT: cross-genre generalization/robustness; StatMT: robust within-genre translation
- StatMT and RuleMT use complementary resources
  - Parallel corpora vs. dictionaries, parsers, analyzers, linguists
- Hybrids can potentially improve over either approach
4. Hybrid MT Challenges
- Linguistic phrase versus StatMT phrase (e.g., ". on the other hand , the")
- Meaningful probabilities for linguistic resources
- Increased system complexity
- The potential to produce the combined worst rather than the combined best
- Low Arabic parsing performance (70% Parseval F-score)
- Statistical hallucinations
5. Hybrid MT Continuum
- Hybrid is a moving target
- StatMT systems use some rule-based components
  - Orthographic normalization, number/date translation, etc.
- RuleMT systems nowadays use statistical n-gram language modeling
- Hybrid MT systems
  - Different mixes of statistical/rule-based components
  - Resource availability
- General approach directions
  - Adding rules/linguistics to StatMT systems
  - Adding statistics/statistical resources to RuleMT systems
- Depth of hybridization
  - Morphology, syntax, semantics
6. Columbia MT Projects
- Arabic-English MT focus
- Different hybrid approaches
8. System Overview
[Diagram: RuleMT-StatMT hybrid scale (Koehn), with the Columbia Primary and Columbia Contrast systems placed along it]
9. Research Directions
- Syntactic SMT preprocessing
- Syntax-aware phrase extraction
- Statistical linearization using richer CFGs
- Creation and integration of rule-generated phrase tables
- Lowering dependence on source-language resources
- Extension to other languages and dialects
10. Road Map
- Hybrid MT Research @ Columbia
- Morphological Preprocessing for SMT
  - Linguistic Issues
  - Previous Work
  - Schemes and Techniques
  - Evaluation
- Combination of Preprocessing Schemes
11. Arabic Linguistic Issues
- Rich morphology
- Clitics
  - CONJ PART DET BASE PRON: w+ l+ Al+ mktb 'and for the office'
- Morphotactics
  - w+l+Al+mktb surfaces as wllmktb (the Alef of Al+ is dropped after l+)
- Ambiguity
  - wjd 'he found' vs. w+ jd 'and a grandfather'
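The clitic and ambiguity points above can be illustrated with a toy splitter over Buckwalter-transliterated words. This is a sketch only: real systems use a full morphological analyzer precisely because greedy prefix stripping misanalyzes ambiguous words like wjd.

```python
# Toy clitic splitter for Buckwalter-transliterated Arabic (illustrative only).
CONJ = ("w", "f")        # conjunction proclitics
PART = ("l", "b", "k")   # particle/preposition proclitics
DET = "Al"               # definite article

def split_clitics(word):
    """Greedily peel conjunction, particle, and article prefixes."""
    tokens = []
    if word[:1] in CONJ and len(word) > 1:
        tokens.append(word[0] + "+")
        word = word[1:]
    if word[:1] in PART and len(word) > 1:
        tokens.append(word[0] + "+")
        word = word[1:]
    if word.startswith(DET) and len(word) > len(DET):
        tokens.append(DET + "+")
        word = word[len(DET):]
    tokens.append(word)
    return tokens

print(split_clitics("wlAlmktb"))  # ['w+', 'l+', 'Al+', 'mktb'] 'and for the office'
# Greedy splitting always strips w+, which is wrong for the verb reading of wjd:
print(split_clitics("wjd"))      # ['w+', 'jd'], but wjd may be 'he found'
```

Note that this sketch does not handle morphotactic changes such as the dropped Alef in wllmktb; that is one more reason an analyzer is needed.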
12. Previous Work
- Morphological and syntactic preprocessing for SMT
  - French-English (Berger et al., 1994)
  - German-English (Nießen and Ney, 2000, 2004)
  - Spanish, Catalan and Serbian to English (Popovic and Ney, 2004)
  - Czech-English (Goldwater and McClosky, 2005)
  - Arabic-English (Lee, 2004)
- We focus on morphological preprocessing
  - Larger set of conditions: schemes, techniques, learning curve, genre variation
  - No additional kinds of preprocessing (e.g., dates, numbers)
13. Road Map
- Hybrid MT Research @ Columbia
- Morphological Preprocessing for SMT
  - Linguistic Issues
  - Previous Work
  - Schemes and Techniques
  - Evaluation
- Combination of Preprocessing Schemes
14. Preprocessing Schemes
- Input: wsyktbhA 'and he will write it'
  - ST: wsyktbhA
  - D1: w+ syktbhA
  - D2: w+ s+ yktbhA
  - D3: w+ s+ yktb +hA
  - BW: w+ s+ y+ ktb +hA
  - EN: w+ s+ ktb/VBZ S:3MS +hA
15. Preprocessing Schemes
- ST: Simple Tokenization
- D1: Decliticize CONJ
- D2: Decliticize CONJ, PART
- D3: Decliticize all clitics
- BW: Morphological stem and affixes
- EN: D3 + lemmatization + English-like POS tags + subject marker
- ON: Orthographic Normalization
- WA: wa+ decliticization
- TB: Arabic Treebank tokenization
- L1: Lemmatize, Arabic POS tags
- L2: Lemmatize, English-like POS tags
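A minimal sketch of how the ST/D1/D2/D3 schemes differ on the running example, assuming the word's segmentation has already been disambiguated (as BAMA/MADA would provide). The hand-coded analysis table and clitic sets are illustrative, not the real pipeline.

```python
# One pre-disambiguated analysis, keyed by surface word (hand-coded for the example).
ANALYSIS = {"wsyktbhA": ["w+", "s+", "yktb", "+hA"]}  # 'and he will write it'
CONJ = {"w+", "f+"}                 # conjunction proclitics
PART = {"s+", "l+", "b+", "k+"}     # particle proclitics

def apply_scheme(word, scheme):
    """Split off the clitics the scheme targets; fold the rest back together."""
    segs = ANALYSIS[word]
    if scheme == "ST":
        keep = set()
    elif scheme == "D1":
        keep = CONJ
    elif scheme == "D2":
        keep = CONJ | PART
    else:  # D3: all clitics, including pronominal enclitics like +hA
        keep = CONJ | PART | {s for s in segs if s.startswith("+")}
    out, buf = [], ""
    for seg in segs:
        if seg in keep:
            if buf:
                out.append(buf)
                buf = ""
            out.append(seg)
        else:
            buf += seg.strip("+")   # unsplit clitics rejoin the base
    if buf:
        out.append(buf)
    return out

print(apply_scheme("wsyktbhA", "D1"))  # ['w+', 'syktbhA']
print(apply_scheme("wsyktbhA", "D3"))  # ['w+', 's+', 'yktb', '+hA']
```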
22. Preprocessing Schemes
- [Table: scheme statistics on MT04, 1,353 sentences, 36,000 words]
23. Preprocessing Schemes
- Scheme accuracy, measured against the Penn Arabic Treebank
24. Preprocessing Techniques
- REGEX: Regular Expressions
- BAMA: Buckwalter Arabic Morphological Analyzer (Buckwalter, 2002, 2004)
  - Pick first analysis
  - Use TOKAN (Habash, 2006)
    - A generalized tokenizer
    - Assumes disambiguated morphological analysis
    - Declarative specification of any preprocessing scheme
- MADA: Morphological Analysis and Disambiguation for Arabic (Habash and Rambow, 2005)
  - Multiple SVM classifiers + combiner
  - Selects a BAMA analysis
  - Use TOKAN
25. TOKAN
- A generalized tokenizer
- Assumes disambiguated morphological analysis
- Declarative specification of any tokenization scheme
  - D1: w+ f+ REST
  - D2: w+ f+ b+ k+ l+ s+ REST
  - D3: w+ f+ b+ k+ l+ s+ Al+ REST +P +O
  - TB: w+ f+ b+ k+ l+ REST +P +O
  - BW: MORPH
  - L1: LEXEME POS
  - EN: w+ f+ b+ k+ l+ s+ Al+ LEXEME BIESPOS +S
- Uses a generator (Habash, 2006)
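The declarative specs above can be read roughly as "emit these slots as separate tokens, fold everything else into REST". The sketch below interprets such specs; the analysis fields (`pro`, `base`, `pron`) are invented for illustration and are not TOKAN's actual input format.

```python
def tokenize(ana, spec):
    """Emit one token per spec slot this word fills; clitics the scheme
    does not split are folded back into the REST token."""
    slots = spec.split()
    toks, rest, tail = [], "", []
    for p in ana.get("pro", []):          # proclitics, outermost first
        if p in slots:
            toks.append(p)
        else:
            rest += p.strip("+")
    rest += ana["base"]
    for e in ana.get("pron", []):         # pronominal enclitics
        if "+P" in slots or "+O" in slots:
            tail.append(e)
        else:
            rest += e.strip("+")
    return toks + [rest] + tail

# Disambiguated analysis of wsyktbhA (fields are illustrative):
ana = {"pro": ["w+", "s+"], "base": "yktb", "pron": ["+hA"]}
print(tokenize(ana, "w+ f+ REST"))                          # D1: ['w+', 'syktbhA']
print(tokenize(ana, "w+ f+ b+ k+ l+ s+ Al+ REST +P +O"))    # D3: ['w+', 's+', 'yktb', '+hA']
```

The attraction of the declarative form is that a new scheme is just a new spec string, with no new code.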
26. Road Map
- Hybrid MT Research @ Columbia
- Morphological Preprocessing for SMT
  - Linguistic Issues
  - Previous Work
  - Schemes and Techniques
  - Evaluation
- Combination of Preprocessing Schemes
27. Experiments
- Portage phrase-based MT (Sadat et al., 2005)
- Training data: 5 million words of parallel text only, all in the news genre
- Learning curve: 1%, 10% and 100% of the training data
- Language modeling: 250 million words
- Development/tuning data: MT03 eval set
- Test data
  - MT04 (mixed genre: news, speeches, editorials)
  - MT05 (all news)
28. Experiments (cont'd)
- Metric: BLEU (Papineni et al., 2001)
  - 4 references, case insensitive
- Each experiment
  - Select a preprocessing scheme
  - Select a preprocessing technique
- Some combinations do not exist (e.g., REGEX with EN)
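For reference, the BLEU metric used throughout can be sketched as modified n-gram precision with a brevity penalty. This is a simplified sentence-level version without smoothing or corpus-level pooling, not the official evaluation script.

```python
import math
from collections import Counter

def ngrams(toks, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))

def bleu(hyp, refs, max_n=4):
    """Geometric mean of clipped n-gram precisions times a brevity penalty.
    Unsmoothed: any zero n-gram precision zeroes the whole score."""
    hyp = hyp.split()
    refs = [r.split() for r in refs]
    logp = 0.0
    for n in range(1, max_n + 1):
        h = ngrams(hyp, n)
        # Clip each hypothesis n-gram count by its best single-reference count.
        clipped = sum(min(c, max(ngrams(r, n)[g] for r in refs)) for g, c in h.items())
        if clipped == 0:
            return 0.0
        logp += math.log(clipped / sum(h.values())) / max_n
    closest = min(refs, key=lambda r: abs(len(r) - len(hyp)))
    bp = min(1.0, math.exp(1 - len(closest) / len(hyp)))  # penalize short output
    return bp * math.exp(logp)
```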
29. MT04 Results
[Chart: BLEU by preprocessing scheme and technique at 1%, 10% and 100% training data]
30. MT05 Results
[Chart: BLEU by preprocessing scheme and technique at 1%, 10% and 100% training data]
31. MT04 Genre Variation: Best Scheme and Technique
- EN+MADA @ 1% training data, D2+MADA @ 100%
[Chart: BLEU by genre]
32. Other Results
- Orthographic Normalization (ON) generally did better than the baseline ST
  - Statistically significant at 1% training data only
- wa+ decliticization (WA) was generally similar to D1
- The Arabic Treebank scheme (TB) was similar to D2
- Full lemmatization schemes (L1, L2) behaved like EN but were always worse
- 50% training data
  - D2 @ 50% of the data > ST @ 100% of the data
- A larger maximum phrase size (14) did not differ significantly from the size 8 we used
33. Latest Results (July 2006)
34. Road Map
- Hybrid MT Research @ Columbia
- Morphological Preprocessing for SMT
- Combination of Preprocessing Schemes
35. Oracle Combination
- Preliminary study: oracle combination
- MT04, 100% training data, MADA technique, 11 schemes, sentence-level selection
- Achieved 46.0 BLEU
  - 24% relative improvement over the best single system (37.1)
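Sentence-level oracle selection of this kind can be sketched as picking, per sentence, whichever system's output scores best against the reference. The `overlap` scorer below is a toy stand-in for sentence-level BLEU.

```python
def overlap(hyp, ref):
    """Toy stand-in for sentence-level BLEU: fraction of reference word
    types that appear in the hypothesis."""
    h, r = set(hyp.split()), set(ref.split())
    return len(h & r) / max(1, len(r))

def oracle_select(candidates_per_sentence, refs, score=overlap):
    """For each sentence, keep the candidate with the best reference score."""
    return [max(cands, key=lambda c: score(c, ref))
            for cands, ref in zip(candidates_per_sentence, refs)]

cands = [["he write it", "and he will write it"], ["an office", "the office"]]
refs = ["and he will write it", "the office"]
print(oracle_select(cands, refs))  # ['and he will write it', 'the office']
```

Since the oracle sees the references, its score is an upper bound on what any real combination method could achieve, which is what makes the 46.0 vs. 37.1 gap informative.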
36. System Combination
- Exploit scheme complementarity to improve MT quality
- Explore two methods of system combination
  - Rescoring-Only Combination (ROC)
  - Decoding-plus-Rescoring Combination (DRC)
- We use all 11 schemes with the MADA technique
37. Rescoring-Only Combination (ROC)
- Rescore all the one-best outputs generated from separate scheme-specific systems and return the top choice
- Each scheme-specific system uses its own scheme-specific preprocessing, phrase tables and decoding weights
38. Rescoring-Only Combination (ROC)
- Standard combo
  - Trigram language model, phrase translation model, distortion model, and sentence length
  - IBM model 1 and 2 probabilities in both directions
- Other combo: add more features
  - Perplexity of the source sentence against a source LM in the same scheme (PPL)
  - Number of out-of-vocabulary words in the source sentence (OOV)
  - Source sentence length (SL)
  - An encoding of the specific scheme (SC)
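The ROC idea can be sketched as a log-linear rescorer: each system's one-best hypothesis carries a feature vector, and a tuned weight vector picks the top-scoring one. Feature names and weight values here are illustrative, not the tuned values from the experiments.

```python
def rescore(hyps, weights):
    """Return the text of the hypothesis with the highest weighted feature sum."""
    def score(h):
        return sum(weights.get(k, 0.0) * v for k, v in h["features"].items())
    return max(hyps, key=score)["text"]

# One-best outputs from two scheme-specific systems (toy feature values):
hyps = [
    {"text": "and he will write it",
     "features": {"lm": -4.1, "tm": -2.0, "oov": 0, "len": 5, "scheme_D2": 1}},
    {"text": "and wrote it",
     "features": {"lm": -5.0, "tm": -2.5, "oov": 1, "len": 3, "scheme_EN": 1}},
]
weights = {"lm": 1.0, "tm": 1.0, "oov": -1.0, "len": 0.1,
           "scheme_D2": 0.2, "scheme_EN": 0.2}
print(rescore(hyps, weights))  # and he will write it
```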
39. Decoding-plus-Rescoring Combination (DRC)
- Step 1: Decode
  - For each preprocessing scheme
    - Use the union of the phrase tables from all schemes
    - Optimize and decode (with the same scheme)
- Step 2: Rescore
  - Rescore the one-best outputs of each preprocessing scheme
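Step 1's phrase-table union can be sketched as a merge that keeps every entry and tags it with its originating scheme, so the decoder and the later rescoring step can tell entries apart. The data layout (pair-keyed dicts of probabilities) is assumed for illustration.

```python
def union_phrase_tables(tables):
    """Merge scheme-specific phrase tables into one table whose entries
    record which scheme(s) contributed each (source, target) pair."""
    merged = {}
    for scheme, table in tables.items():
        for pair, prob in table.items():
            merged.setdefault(pair, {})[scheme] = prob
    return merged

tables = {
    "D1": {("w+ syktbhA", "and he will write it"): 0.4},
    "D2": {("w+ s+ yktbhA", "and he will write it"): 0.5,
           ("w+ syktbhA", "and he will write it"): 0.3},
}
merged = union_phrase_tables(tables)
print(len(merged))  # 2
```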
40. Results
- MT04 set
- Best single scheme (D2) scores 37.1 BLEU
41. Results
- Statistical significance using bootstrap re-sampling (Koehn, 2004)
42. Conclusions
- For large amounts of training data, splitting off conjunctions and particles performs best
- For small amounts of training data, following an English-like tokenization performs best
- A suitable choice of preprocessing scheme and technique yields an important increase in BLEU score if
  - there is little training data
  - there is a change in genre between training and test
- System combination is potentially highly rewarding, especially when combining the phrase tables of different preprocessing schemes
43. Future Work
- Study additional variant schemes that current results support
- Factored translation modeling
- Decoder extension to use multiple schemes in parallel
- Syntactic preprocessing
- Investigate combination techniques at the sentence and sub-sentence levels
44. Thank You!
- Questions?
- Nizar Habash
- habash@cs.columbia.edu