Multilingual Word Alignment - PowerPoint PPT Presentation

1
Multilingual Word Alignment
  • Preslav Nakov, Computer Science Division,
    University of California, Berkeley

2
Plan
  • Introduction
  • Models
  • Evaluation
  • Conclusions

3
Introduction
  • Traditionally, translation used bilingual texts:
    • parliamentary debates
    • news agencies
  • Growing number of multilingual texts:
    • European Union
      • 21 official languages (more to come)
      • Europarl corpus: 11 languages
    • United Nations
  • Applications:
    • Machine translation for new language pairs
    • Better pairwise word alignments

4
IBM Word Alignments
  • Sample IBM word alignment: every French word has
    a single-word English translation

5
The Problem
  • How can multilingual parallel texts be used to
    obtain better word alignments for a pair of
    languages?
  • For example, given a parallel
    French-English-Spanish corpus, we want to obtain
    better French-to-English word alignments using
    Spanish.

6
Plan
  • Introduction
  • Models
  • Evaluation
  • Conclusions

7
Four Models
  • Heuristics: combine pairwise alignments
    • Heuristic 1: probabilistic (linear model)
    • Heuristic 2: graph-theoretical (looks at
      out-degrees for each word)

8
  • Heuristic 1

9
Heuristic 1 Probabilistic (1)
  • Uses IBM Model 1 pairwise alignment probabilities:
    • En→Fr: Pr(f_j | e_i)
    • En→Xx: Pr(s_k | e_i)
    • Xx→Fr: Pr(f_j | s_k)
  • Linear combination
  • Viterbi alignment using the combined probability
    (as in IBM Model 1)
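
The combination formula itself did not survive in this transcript. As a minimal sketch, assume the combined score interpolates the direct probability with a path through the intermediate language, Pr'(f_j | e_i) = lam * Pr(f_j | e_i) + (1 - lam) * sum_k Pr(s_k | e_i) * Pr(f_j | s_k). The interpolation form, the weight `lam`, the function names, and the dictionary layout are all assumptions made for illustration, and the NULL word is left out for brevity. Under IBM Model 1 the Viterbi alignment factorises per French word, so each f_j is simply linked to the e_i with the highest combined score.

```python
def combined_prob(f, e, pivot_words, p_fe, p_se, p_fs, lam=0.5):
    """Interpolate the direct En->Fr probability with a pivot path En->Xx->Fr.

    p_fe[(f, e)] = Pr(f | e), p_se[(s, e)] = Pr(s | e), p_fs[(f, s)] = Pr(f | s).
    The interpolation form and `lam` are assumptions, not taken from the slides.
    """
    direct = p_fe.get((f, e), 0.0)
    via_pivot = sum(
        p_se.get((s, e), 0.0) * p_fs.get((f, s), 0.0) for s in pivot_words
    )
    return lam * direct + (1.0 - lam) * via_pivot


def viterbi_model1(french, english, pivot_words, p_fe, p_se, p_fs, lam=0.5):
    """Model 1 Viterbi: each French word independently picks its best English word.

    Returns a list of (french_index, english_index) links.
    """
    alignment = []
    for j, f in enumerate(french):
        best_i = max(
            range(len(english)),
            key=lambda i: combined_prob(
                f, english[i], pivot_words, p_fe, p_se, p_fs, lam
            ),
        )
        alignment.append((j, best_i))
    return alignment
```

Because the second term sums over every word of the intermediate sentence, adding a second intermediate language would only add one more weighted term of the same shape.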

10
Heuristic 1 Probabilistic (2)
  • Plus:
    • Probabilistic
    • Easy to extend to more languages
  • Minus:
    • Requires a probability for each link and thus
      does not work for IBM Models 3 and 4
    • BUT! We could obtain link probabilities for
      Model 4 too (e.g. by inspecting the top 10
      alignments)
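
The slide only mentions inspecting the top 10 alignments and does not say how link probabilities would be derived from them. One possible reading, sketched under the assumption that the aligner can emit an n-best list with one unnormalised score per alignment: the probability of a link is the normalised score mass of the n-best alignments that contain it. The function name and input format are hypothetical.

```python
from collections import defaultdict


def link_posteriors(nbest):
    """Estimate link probabilities from an n-best list of alignments.

    nbest: list of (score, alignment) pairs, where score is an unnormalised
    probability and alignment is a set of (i, j) links.
    This estimator is an assumption; the slides do not specify one.
    """
    total = sum(score for score, _ in nbest)
    if total <= 0:
        raise ValueError("n-best scores must sum to a positive value")
    probs = defaultdict(float)
    for score, alignment in nbest:
        for link in alignment:
            # Each alignment contributes its share of the total score mass.
            probs[link] += score / total
    return dict(probs)
```

A link that appears in every one of the n-best alignments gets probability 1, while a link that appears only in low-scoring alternatives gets a small value.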

11
  • Heuristic 2

12
Heuristic 2 Graph-theoretical (1)
  • Motivation:
    • Extend the search space for intersect+grow.
    • Allow for any IBM Model.
  • Uses 3 Viterbi alignments:
    • En→Fr: does f_j align to e_i?
    • En→Xx: does s_k align to e_i?
    • Xx→Fr: does f_j align to s_k?
  • Assigns weights (probabilities) to each link
    according to the out-degree of the source word.
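
The exact weighting formula is not in the transcript. A minimal sketch, assuming each link gets weight 1 / out-degree of its source word (so a word aligned to a single target passes on weight 1, and a word aligned to several targets splits its weight evenly), and assuming a path through the intermediate language is weighted by the product of its two links. Both assumptions and all function names are illustrative only.

```python
from collections import Counter, defaultdict


def outdegree_weights(links):
    """Weight each (source, target) link by 1 / out-degree of its source word.

    links: iterable of (source_index, target_index) pairs from one Viterbi
    alignment. The 1/out-degree form is an assumption.
    """
    links = set(links)
    out_degree = Counter(src for src, _ in links)
    return {(src, tgt): 1.0 / out_degree[src] for src, tgt in links}


def pivot_weights(en_xx, xx_fr):
    """Weight of e_i -> f_j through the intermediate language.

    Sums, over every intermediate word s_k, the product
    w(e_i -> s_k) * w(s_k -> f_j).
    """
    w_en_xx = outdegree_weights(en_xx)
    w_xx_fr = outdegree_weights(xx_fr)
    combined = defaultdict(float)
    for (e, s), a in w_en_xx.items():
        for (s2, f), b in w_xx_fr.items():
            if s == s2:
                combined[(e, f)] += a * b
    return dict(combined)
```

Since only the 0/1 link decisions of each Viterbi alignment are used, this works with the output of any IBM model, which matches the motivation on the slide.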

13
Heuristic 2 Graph-theoretical (2)

14
Heuristic 2 Graph-theoretical (3)
  • Linear combination
    • accept all links above a threshold
  • Weighted intersect+grow using [formula missing
    from the transcript] works better

15
Heuristic 2 Graph-theoretical (4)
  • Plus:
    • Works with all IBM models
    • Ready to use for more intermediate languages
  • Minus:
    • Non-probabilistic

16
Plan
  • Introduction
  • Models
  • Evaluation
  • Conclusions

17
Dataset
  • Europarl (http://www.statmt.org/europarl/)
    • 11 languages (10 aligned to English)
    • Romanic: French, Italian, Spanish, Portuguese
    • Germanic: English, Dutch, German, Danish, Swedish
    • Other: Greek and Finnish
  • We used a subset:
    • ACL 2005 shared task on Exploiting Parallel
      Texts for Statistical Machine Translation
    • 5 languages: English, French, Spanish, German,
      Finnish

18
Evaluation Setup
  • Training: 543,379 sentences
  • Testing: 38 sentences
  • Measure: Alignment Error Rate (AER)
  • Baseline: GIZA++, Fr-En alignments, Model 4
    • Fr→En: 0.2957
    • Fr←En: 0.2970
    • Fr-En intersect: 0.3237
    • Fr-En union: 0.2739
    • Fr-En intersect+grow: 0.2604
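
For reference, AER is usually computed against a gold standard with sure links S and possible links P (S is a subset of P) as AER = 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|), where A is the predicted alignment. The sketch below shows that formula together with the intersect and union combinations of two directional alignments. It assumes both directions have already been converted to the same (french_index, english_index) orientation; the function names are illustrative.

```python
def intersect(a_fe, a_ef):
    """Links proposed by both directional alignments (high precision)."""
    return set(a_fe) & set(a_ef)


def union(a_fe, a_ef):
    """Links proposed by either directional alignment (high recall)."""
    return set(a_fe) | set(a_ef)


def aer(predicted, sure, possible):
    """Alignment Error Rate: 1 - (|A & S| + |A & P|) / (|A| + |S|).

    predicted, sure, possible: sets of (french_index, english_index) links.
    Sure links are treated as possible links as well.
    """
    a = set(predicted)
    s = set(sure)
    p = set(possible) | s
    denominator = len(a) + len(s)
    if denominator == 0:
        raise ValueError("AER is undefined when both A and S are empty")
    return 1.0 - (len(a & s) + len(a & p)) / denominator
```

Lower is better: an alignment that contains every sure link and nothing outside the possible links scores 0.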

19
  • Heuristic 1
  • Probabilistic
  • Fr→Xx→En

20
Heuristic 1 Probabilistic (1): Model 1, Fr→Xx→En,
1 language
  • AER: no gain

21
Heuristic 1 Probabilistic (2): Model 1,
Fr→Xx,Yy→En, 2 languages
  • AER: no gain

22
  • Heuristic 1
  • (continued)
  • Probabilistic
  • En→Xx→Fr

23
Heuristic 1 Probabilistic (3): Model 1, En→Xx→Fr,
1 language
  • AER: gain of 1 point via German

24
Heuristic 1 Probabilistic (4): Model 1,
En→Xx,Yy→Fr, 2 languages
  • AER: gain of <1 point via German and Finnish

25
  • Heuristic 2
  • Graph-theoretical
  • Fr → Xx → En

26
Heuristic 2 Graph-Theoretical (1): Model 4,
intersect+grow via Es

Step  Candidates               AER, Model 4
1     Fr-En intersect          .3237
2     Fr-Es-En intersect       .3090
3     Fr-En unal.-unal.
4     Fr-Es-En unal.-unal.     .2912
5     Fr-En al.-unal.
6     Fr-Es-En al.-unal.       .2529
7     Fr-Es-En al.-unal. (u)   .2513

Starting from step 3, candidates are considered
in order of decreasing weight.
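
The growing procedure is only outlined on the slide, so the following is a rough sketch rather than the method actually used. It assumes "unal.-unal." means both words of a candidate link are still unaligned and "al.-unal." means exactly one of them is, and that each candidate pool (for example direct Fr-En links, then Fr-Es-En links) is a dictionary from link to weight. The function name and data layout are hypothetical.

```python
def weighted_intersect_grow(intersection, pools):
    """Grow an alignment from the intersection using weighted candidate links.

    intersection: set of (f, e) links to start from.
    pools: list of dicts {(f, e): weight}, visited in the given order.
    Pass 1 adds links whose two words are both unaligned; pass 2 adds links
    with exactly one unaligned word. Within a pool, candidates are tried in
    order of decreasing weight (ties broken by link indices).
    """
    alignment = set(intersection)
    aligned_f = {f for f, _ in alignment}
    aligned_e = {e for _, e in alignment}

    def grow(pool, accept):
        ranked = sorted(pool.items(), key=lambda kv: (-kv[1], kv[0]))
        for (f, e), _weight in ranked:
            if accept(f in aligned_f, e in aligned_e):
                alignment.add((f, e))
                aligned_f.add(f)
                aligned_e.add(e)

    for pool in pools:  # unaligned-unaligned candidates
        grow(pool, lambda f_done, e_done: not f_done and not e_done)
    for pool in pools:  # aligned-unaligned candidates
        grow(pool, lambda f_done, e_done: f_done != e_done)
    return alignment
```

Visiting candidates by decreasing weight matters because every accepted link changes which words count as unaligned, so high-weight links get the first chance to claim a word.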
27
Heuristic 2 Graph-Theoretical (2): Model 4,
intersect+grow via Es
  • Gain: about 1 point of AER

28
Plan
  • Introduction
  • Models
  • Evaluation
  • Conclusions

29
Conclusions
  • Only a little gain so far: 1 point in AER
  • Language family plays a role
  • Need to try other (and more) languages

30
Future Work
  • Add more languages from Europarl
  • German and Finnish might pose problems because of
    compounding.
  • Try other corpora (e.g. Acquis)
  • Try other models
  • Error analysis.

31
The End
  • Thank you!