Title: Multilingual Word Alignment
1Multilingual Word Alignment
- Preslav Nakov, Computer Science Division, University of California, Berkeley
2Plan
- Introduction
- Models
- Evaluation
- Conclusions
3Introduction
- Traditionally, translation relied on bilingual texts
- parliamentary debates
- news agencies
- Growing number of multilingual texts
- European Union
- 21 official languages (more to come)
- Europarl corpus: 11 languages
- United Nations
- Applications
- Machine translation for new language pairs
- Better pairwise word alignments
4IBM Word Alignments
- Sample IBM word alignment: every French word has a single-word English translation
5The Problem
- How can multilingual parallel texts be used to obtain better word alignments for a pair of languages?
- For example, given a parallel French-English-Spanish corpus, we want to obtain better French-to-English word alignments using Spanish.
6Plan
- Introduction
- Models
- Evaluation
- Conclusions
7Four Models
- Heuristics: combine pairwise alignments
- Heuristic 1: probabilistic (linear model)
- Heuristic 2: graph-theoretical (looks at out-degrees for each word)
9Heuristic 1 Probabilistic (1)
- Uses IBM Model 1 pairwise alignment probabilities
- En→Fr: Pr(f_j | e_i)
- En→Xx: Pr(s_k | e_i)
- Xx→Fr: Pr(f_j | s_k)
- Linear combination
- Viterbi alignment using the combined probability (as in IBM Model 1)
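A minimal sketch of how the linear combination and the Model 1 style Viterbi step might fit together. The combination formula itself is not preserved in this text, so the form used here (the direct probability interpolated with a sum over bridge-language words, with weight `lam`) is an assumption, as are the function names and the toy probability tables. The NULL word is omitted for brevity.

```python
def combined_prob(f, e, p_f_e, p_s_e, p_f_s, bridge_vocab, lam=0.5):
    """Interpolate the direct Pr(f|e) with a bridge estimate sum_k Pr(f|s_k) * Pr(s_k|e).

    The interpolation form and the weight `lam` are assumptions of this sketch.
    Each table maps (target_word, source_word) -> probability.
    """
    direct = p_f_e.get((f, e), 0.0)
    via_bridge = sum(p_f_s.get((f, s), 0.0) * p_s_e.get((s, e), 0.0)
                     for s in bridge_vocab)
    return lam * direct + (1.0 - lam) * via_bridge


def viterbi_model1(french, english, prob):
    """Model 1 style Viterbi: each French word independently takes its best English word."""
    return [max(range(len(english)), key=lambda i: prob(f, english[i]))
            for f in french]


# Toy tables with made-up numbers, for illustration only.
p_f_e = {("la", "the"): 0.6, ("la", "house"): 0.1,
         ("maison", "the"): 0.4, ("maison", "house"): 0.4}  # En->Fr, ambiguous for "maison"
p_s_e = {("la", "the"): 0.7, ("casa", "house"): 0.8}        # En->Xx (Spanish)
p_f_s = {("la", "la"): 0.9, ("maison", "casa"): 0.9}        # Xx->Fr

english = ["the", "house"]
french = ["la", "maison"]


def prob(f, e):
    return combined_prob(f, e, p_f_e, p_s_e, p_f_s, ["la", "casa"])


# alignment[j] is the index of the English word chosen for French word j.
alignment = viterbi_model1(french, english, prob)
print(alignment)
```

In this toy case the direct table cannot decide between "the" and "house" for "maison"; the bridge term through "casa" breaks the tie.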
10Heuristic 1 Probabilistic (2)
- Plus
- Probabilistic
- Easy to extend to more languages
- Minus
- Requires a probability for each link and thus does not work for IBM Models 3 and 4
- BUT! We could obtain link probabilities for Model 4 too (e.g. by inspecting the top 10 alignments)
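One way the "top 10 alignments" idea could be realized is to treat the n-best list as a small sample and sum the normalized scores of the alignments that contain each link. This is only a sketch under that assumption; the function name `link_posteriors`, the input format, and the scores are hypothetical and do not come from any aligner's actual output.

```python
from collections import defaultdict


def link_posteriors(nbest):
    """Approximate link probabilities from an n-best list.

    nbest: list of (links, score) pairs, where links is a set of
    (french_index, english_index) tuples and score is a non-negative
    alignment score (assumed here to be proportional to its probability).
    """
    total = sum(score for _, score in nbest)
    posteriors = defaultdict(float)
    for links, score in nbest:
        for link in links:
            # Each alignment contributes its normalized score to every link it contains.
            posteriors[link] += score / total
    return dict(posteriors)


# Made-up 3-best list for a two-word sentence pair.
nbest = [
    ({(0, 0), (1, 1)}, 0.6),
    ({(0, 0), (1, 0)}, 0.3),
    ({(0, 1), (1, 1)}, 0.1),
]
posteriors = link_posteriors(nbest)
```

Links that appear in most of the high-scoring alignments receive a probability close to 1, which is what the linear combination of Heuristic 1 needs as input.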
12Heuristic 2 Graph-theoretical (1)
- Motivation
- Extend the search space for intersect+grow.
- Allow for any IBM Model.
- Uses 3 Viterbi alignments
- En→Fr: does f_j align to e_i?
- En→Xx: does s_k align to e_i?
- Xx→Fr: does f_j align to s_k?
- Assigns weights (probabilities) to each link according to the out-degree of the source word.
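The exact weighting formula is not given in this text. A plausible reading, sketched below, is that a source word with k outgoing links spreads its mass evenly, so each link gets weight 1/k, and that weights along an En→Xx→Fr path are multiplied and summed over bridge words. Both choices, and the function names, are assumptions of this sketch.

```python
from collections import Counter


def outdegree_weights(links):
    """Weight each (source, target) link by 1 / out-degree of its source word."""
    outdeg = Counter(src for src, _ in links)
    return {(src, tgt): 1.0 / outdeg[src] for src, tgt in links}


def compose_via_bridge(w_en_xx, w_xx_fr):
    """Weight of an En-Fr link through the bridge: sum over bridge words of path products."""
    composed = {}
    for (e, s), w1 in w_en_xx.items():
        for (s2, f), w2 in w_xx_fr.items():
            if s == s2:
                composed[(e, f)] = composed.get((e, f), 0.0) + w1 * w2
    return composed


# Toy Viterbi alignments as sets of (source_index, target_index) links.
en_xx = {(0, 0), (1, 1), (1, 2)}   # English word 1 links to two bridge words
xx_fr = {(0, 0), (1, 1), (2, 1)}   # both of those bridge words link to French word 1

w_en_xx = outdegree_weights(en_xx)
w_xx_fr = outdegree_weights(xx_fr)
w_en_fr = compose_via_bridge(w_en_xx, w_xx_fr)
```

Because the input is only a set of links, this works with the Viterbi output of any IBM model, which matches the motivation stated above.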
13Heuristic 2 Graph-theoretical (2)
14Heuristic 2 Graph-theoretical (3)
- Linear combination
- accept all links above a threshold
- Weighted intersect+grow (using the link weights) works better
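The threshold variant can be sketched as follows: combine the direct link weights with the weights obtained through the bridge language, then keep every link whose combined score exceeds a cutoff. The mixing weight `lam`, the cutoff value, and the function name are assumptions; the text does not state which values were used.

```python
def combine_and_threshold(direct, via_bridge, lam=0.5, threshold=0.5):
    """Linearly combine two link-weight tables and keep links scoring above the threshold.

    direct, via_bridge: dicts mapping (english_index, french_index) -> weight.
    Returns (accepted_links, combined_scores).
    """
    candidates = set(direct) | set(via_bridge)
    scores = {
        link: lam * direct.get(link, 0.0) + (1.0 - lam) * via_bridge.get(link, 0.0)
        for link in candidates
    }
    accepted = {link for link, score in scores.items() if score > threshold}
    return accepted, scores


# Made-up weights: the direct alignment is unsure about English word 1,
# while the bridge supports only the link (1, 1).
direct = {(0, 0): 1.0, (1, 1): 0.5, (1, 2): 0.5}
via_bridge = {(0, 0): 1.0, (1, 1): 1.0}

accepted, scores = combine_and_threshold(direct, via_bridge)
```

A fixed cutoff treats every link independently, which may be one reason the weighted intersect+grow procedure, which takes already accepted links into account, performed better.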
15Heuristic 2 Graph-theoretical (4)
- Plus
- Works with all IBM models
- Ready to use for more intermediate languages
- Minus
- Non-probabilistic
16Plan
- Introduction
- Models
- Evaluation
- Conclusions
17Dataset
- Europarl (http://www.statmt.org/europarl/)
- 11 languages (10 aligned to English)
- Romance: French, Italian, Spanish, Portuguese
- Germanic: English, Dutch, German, Danish, Swedish
- Other: Greek and Finnish
- We used a subset
- ACL 2005 shared task on Exploiting Parallel Texts for Statistical Machine Translation
- 5 languages: English, French, Spanish, German, Finnish
18Evaluation Setup
- Training: 543,379 sentences
- Testing: 38 sentences
- Measure: Alignment Error Rate (AER)
- Baseline: GIZA++, Fr-En alignments, Model 4
- Fr→En: 0.2957
- Fr←En: 0.2970
- Fr-En intersect: 0.3237
- Fr-En union: 0.2739
- Fr-En intersect+grow: 0.2604
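For reference, AER in its usual definition compares a predicted link set A against gold "sure" links S and "possible" links P (with S a subset of P): AER = 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|). The sketch below computes it for the intersection and union of two directional alignments. The gold-standard format used for the 38 test sentences is not described here, so the sure/possible split and all toy data are assumptions.

```python
def aer(predicted, sure, possible):
    """Alignment Error Rate; lower is better. All arguments are sets of (fr, en) links."""
    possible = possible | sure  # sure links count as possible by convention
    if not predicted and not sure:
        return 0.0
    matched = len(predicted & sure) + len(predicted & possible)
    return 1.0 - matched / (len(predicted) + len(sure))


# Toy directional alignments as sets of (french_index, english_index) links.
fr_to_en = {(0, 0), (1, 1), (2, 1)}
en_to_fr = {(0, 0), (1, 1), (2, 2)}

intersect = fr_to_en & en_to_fr   # high precision, lower recall
union = fr_to_en | en_to_fr       # high recall, lower precision

# Made-up gold annotation.
sure = {(0, 0), (1, 1)}
possible = {(2, 2)}

aer_intersect = aer(intersect, sure, possible)
aer_union = aer(union, sure, possible)
print(round(aer_intersect, 4), round(aer_union, 4))
```

Intersect+grow sits between these two extremes: it starts from the intersection and adds selected links from the union.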
19- Heuristic 1
- Probabilistic
- Fr→Xx→En
20Heuristic 1 Probabilistic (1): Model 1, Fr→Xx→En, 1 language
- AER: no gain
21Heuristic 1 Probabilistic (2): Model 1, Fr→Xx,Yy→En, 2 languages
- AER: no gain
22- Heuristic 1
- (continued)
- Probabilistic
- En→Xx→Fr
23Heuristic 1 Probabilistic (3): Model 1, En→Xx→Fr, 1 language
- AER: gain of 1 point via German
24Heuristic 1 Probabilistic (4): Model 1, En→Xx,Yy→Fr, 2 languages
- AER: gain of less than 1 point via German and Finnish
25- Heuristic 2
- Graph-theoretical
- Fr → Xx → En
26Heuristic 2 Graph-Theoretical (1): Model 4, intersect+grow via Es
- AER (Model 4) after each step
- 1 Fr-En intersect: .3237
- 2 Fr-Es-En intersect: .3090
- 3 Fr-En unal.-unal.
- 4 Fr-Es-En unal.-unal.: .2912
- 5 Fr-En al.-unal.
- 6 Fr-Es-En al.-unal.: .2529
- 7 Fr-Es-En al.-unal. (u): .2513
- Starting from step 3, candidates are considered in order of decreasing weight.
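The growing steps can be sketched as a greedy procedure over weighted candidate links. The reading used here is an assumption: "unal.-unal." is taken to mean that a candidate is added only if both of its words are still unaligned, and "al.-unal." that exactly one of them is. The function name and the toy weights are likewise hypothetical, and the separate Fr-En and Fr-Es-En candidate pools of steps 3 to 7 are merged into one table for brevity.

```python
def weighted_grow(seed, candidates):
    """Grow an alignment from a seed (e.g. the intersection) using weighted candidates.

    seed: set of (fr, en) links accepted up front.
    candidates: dict mapping (fr, en) -> weight; visited in order of decreasing weight.
    """
    links = set(seed)
    fr_aligned = {f for f, _ in links}
    en_aligned = {e for _, e in links}
    # Sort by decreasing weight; ties are broken by link position for determinism.
    ordered = sorted((l for l in candidates if l not in links),
                     key=lambda l: (-candidates[l], l))

    def add(f, e):
        links.add((f, e))
        fr_aligned.add(f)
        en_aligned.add(e)

    # Pass 1: link two words that are both still unaligned.
    for f, e in ordered:
        if f not in fr_aligned and e not in en_aligned:
            add(f, e)

    # Pass 2: attach a still unaligned word to an already aligned one.
    for f, e in ordered:
        if (f in fr_aligned) != (e in en_aligned):
            add(f, e)

    return links


seed = {(0, 0)}
candidates = {(1, 1): 0.9, (1, 2): 0.8, (2, 2): 0.4, (0, 1): 0.3, (3, 2): 0.2}
grown = weighted_grow(seed, candidates)
print(sorted(grown))
```

Visiting candidates by decreasing weight means that a strong link claims its words before a weaker competitor can, which is where the out-degree weights influence the result.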
27Heuristic 2 Graph-Theoretical (2): Model 4, intersect+grow via Es
- Gain: 1 point of AER
28Plan
- Introduction
- Models
- Evaluation
- Conclusions
29Conclusions
- Only a little gain so far
- 1 point in AER
- Language family plays a role
- Need to try other (and more) languages.
30Future Work
- Add more languages from Europarl
- German and Finnish might pose problems because of compounding.
- Try other corpora (e.g. Acquis)
- Try other models
- Error analysis.
31The End