Symmetric Probabilistic Alignment

Learn more at: http://www.cs.cmu.edu

1
Symmetric Probabilistic Alignment
Jae Dong Kim
Committee: Jaime G. Carbonell, Ralf D. Brown, Peter J. Jansen
2
Motivation
  • In the CMU EBMT system, alignment has been less
    studied compared to the other components.
  • We want to investigate a new sub-sentential
    aligner which uses translation probabilities in a
    symmetric fashion.

3
Outline
  • Introduction
  • Symmetric Probabilistic Alignment
  • Experiments and Results
  • Conclusions
  • Future Work

4
Aligner in the EBMT
5
Sub-sentential Alignment
  • The CMU EBMT system draws on translation examples
    to translate an unseen source sentence
  • Since it is hard to find an exactly matching
    example sentence, the system finds the longest
    match
  • Encapsulated local context
  • Local reordering
  • The aligner should work on fragments
    (sub-sentences)

6
Need for a new aligner
  • Relatively less studied compared to the other
    components
  • The old aligner
  • Heuristic based
  • Builds a correspondence table
  • Finds the longest target fragment and the
    shortest target fragment
  • Checks every substring of the longest one, which
    includes the shortest one
  • Fast, but doesn't use probabilities

7
Related Work
  • IBM models (Brown et al, 93)
  • HMM (Vogel et al, 96)
  • Competitive link (Melamed, 97)
  • Explicit Syntactic Information (Yamada et al, 02)
  • ISA (Zhang, 03)
  • The SPA differs from the above in that it aligns
    sub-sentences using translation probabilities
    and some heuristics when the boundaries of the
    source fragment are given.

8
Outline
  • Introduction
  • Symmetric Probabilistic Alignment
  • Experiments and Results
  • Conclusions
  • Future Work

9
Basic Algorithm (1)
  • Assumptions
  • A bilingual probabilistic dictionary is available
  • Contiguous source fragments are translated into
    contiguous target fragments
  • Fragments are translated independently of
    surrounding context
  • Given and

10
Basic Algorithm (2)
  • Assume that we are considering a candidate target
    fragment 't2 t3 t4' given a source fragment 's7
    s8 s9'
  • Source -> Target Translation Score
  • S_tmp = max( p(t2|s7), p(t3|s7), p(t4|s7), ε )
  •   × max( p(t2|s8), p(t3|s8), p(t4|s8), ε )
  •   × max( p(t2|s9), p(t3|s9), p(t4|s9), ε )
  • S_st = S_tmp^(1/3)
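
The source-to-target score above can be sketched in Python as a geometric mean of per-word best probabilities. This is a minimal sketch, not the system's implementation: the function name `directional_score`, the dictionary layout, the toy probabilities, and the floor value `1e-6` are all assumptions made for illustration. It also assumes the exponent 1/3 is one over the number of factors in the product.

```python
from math import prod

EPSILON = 1e-6  # assumed floor value; the slides only name the floor, not its size


def directional_score(cond_words, out_words, prob, epsilon=EPSILON):
    """Geometric mean, over cond_words, of the best p(out | cond), floored at epsilon."""
    best = [
        max([prob.get((o, c), 0.0) for o in out_words] + [epsilon])
        for c in cond_words
    ]
    return prod(best) ** (1.0 / len(best))


# Toy dictionary (hypothetical values): keys are (target_word, source_word),
# values are p(target_word | source_word).
p_t_given_s = {
    ("t2", "s7"): 0.5,
    ("t3", "s8"): 0.4,
    ("t4", "s9"): 0.1,
}

s_st = directional_score(["s7", "s8", "s9"], ["t2", "t3", "t4"], p_t_given_s)
print(s_st)
```

The floor keeps a single source word with no dictionary entry from driving the whole product to zero.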

11
Basic Algorithm (3)
  • Source <- Target Translation Score
  • S_tmp = max( p(s7|t2), p(s8|t2), p(s9|t2), ε )
  •   × max( p(s7|t3), p(s8|t3), p(s9|t3), ε )
  •   × max( p(s7|t4), p(s8|t4), p(s9|t4), ε )
  • S_ts = S_tmp^(1/3)
  • Source <-> Target Translation Score
  • Score: combines S_st and S_ts
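
A possible way to use the bidirectional score is to evaluate every contiguous target span and keep the best one. The sketch below assumes the two directional scores are multiplied, which the transcript does not confirm, and that candidates are enumerated exhaustively. All names (`geo_mean_of_best`, `symmetric_score`, `best_target_fragment`) and the toy probabilities are hypothetical.

```python
from math import prod


def geo_mean_of_best(cond_words, out_words, prob, epsilon=1e-6):
    # prob maps (out_word, cond_word) -> p(out_word | cond_word); epsilon is an assumed floor
    best = [
        max([prob.get((o, c), 0.0) for o in out_words] + [epsilon])
        for c in cond_words
    ]
    return prod(best) ** (1.0 / len(best))


def symmetric_score(src, tgt, p_t_given_s, p_s_given_t):
    s_st = geo_mean_of_best(src, tgt, p_t_given_s)  # source -> target
    s_ts = geo_mean_of_best(tgt, src, p_s_given_t)  # source <- target
    return s_st * s_ts  # assumption: the two directions are multiplied


def best_target_fragment(src, target_sentence, p_t_given_s, p_s_given_t):
    """Score every contiguous target span; return (score, start, end), end exclusive."""
    n = len(target_sentence)
    candidates = (
        (symmetric_score(src, target_sentence[i:j], p_t_given_s, p_s_given_t), i, j)
        for i in range(n)
        for j in range(i + 1, n + 1)
    )
    return max(candidates)


# Toy data (hypothetical): s7~t2, s8~t3, s9~t4 in both directions.
p_t_given_s = {("t2", "s7"): 0.6, ("t3", "s8"): 0.5, ("t4", "s9"): 0.7}
p_s_given_t = {("s7", "t2"): 0.6, ("s8", "t3"): 0.5, ("s9", "t4"): 0.7}
target = ["t1", "t2", "t3", "t4", "t5"]

score, start, end = best_target_fragment(["s7", "s8", "s9"], target, p_t_given_s, p_s_given_t)
print(target[start:end], score)
```

Scoring in both directions penalizes spans that are too short (source words left unexplained) as well as spans that are too long (target words left unexplained), which a single direction cannot do.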

12
Restrictions (1)
  • Untranslated word penalty
  • s7 s8 s9
  • t2 t3 t4
  • Anchor Context
  • s6 s7 s8 s9 s10 s6 s7 s8 s9 s10
  • t1 t2 t3 t4 t5 t1 t2 t3 t4 t5

13
Restrictions (2)
  • Length penalty
  • t2 ... t30 for s7 s8 s9. Realistic?
  • We expect the target fragment length to be
    proportional to the source fragment length.
  • Distance penalty
  • t45 t46 t47 for s7 s8 s9. Realistic? Maybe.
  • Between languages with similar word order, we
    might expect the target fragment to sit at a
    proportional position.
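
The slides do not give formulas for these two penalties, so the following is only one hypothetical way to express them. Both functions, their names, and the assumed 50-word sentence length are illustrative inventions: each returns 1.0 for an ideal candidate and a smaller multiplier as the candidate departs from the proportional length or position.

```python
def length_penalty(src_len, tgt_len, length_ratio=1.0):
    """Hypothetical: 1.0 when tgt_len equals the expected length, shrinking toward 0."""
    expected = src_len * length_ratio  # length_ratio is an assumed language-pair constant
    return min(expected, tgt_len) / max(expected, tgt_len)


def distance_penalty(src_start, src_sent_len, tgt_start, tgt_sent_len):
    """Hypothetical: 1.0 when relative positions coincide, 0.0 at opposite ends."""
    return 1.0 - abs(src_start / src_sent_len - tgt_start / tgt_sent_len)


# The slide's examples, with 50-word sentences assumed for the distance case.
print(length_penalty(3, 3))              # 's7 s8 s9' vs 't2 t3 t4'
print(length_penalty(3, 29))             # 's7 s8 s9' vs 't2 ... t30'
print(distance_penalty(7, 50, 45, 50))   # s7 aligned near t45
print(distance_penalty(7, 50, 8, 50))    # s7 aligned near t8
```

Multipliers of this kind could be applied to the translation score, with the distance penalty switched off for language pairs whose word order differs substantially.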

14
The SPA CFD
15
Combined Aligner
  • Set a threshold for the SPA
  • The SPA returns only results whose score is higher
    than the threshold
  • For each source fragment
  • If there is a result from the SPA -> use the SPA
    result
  • Otherwise, use the IBM result
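
The fallback rule above can be sketched as follows. The two aligners are represented by hypothetical callables (`spa_align` returning a span and score or `None`, `ibm_align` returning a span), and the lookup tables and threshold value are toy stand-ins.

```python
def combine_alignments(source_fragments, spa_align, ibm_align, threshold):
    """Prefer the SPA result when its score clears the threshold, else fall back to IBM."""
    chosen = {}
    for fragment in source_fragments:
        spa_result = spa_align(fragment)  # hypothetical: (target_span, score) or None
        if spa_result is not None and spa_result[1] > threshold:
            chosen[fragment] = ("SPA", spa_result[0])
        else:
            chosen[fragment] = ("IBM", ibm_align(fragment))
    return chosen


# Toy stand-ins for the two aligners; spans are (start, end) target positions.
spa_table = {("s7", "s8", "s9"): ((1, 4), 0.35), ("s1", "s2", "s3"): ((0, 2), 0.01)}
ibm_table = {("s7", "s8", "s9"): (1, 5), ("s1", "s2", "s3"): (0, 3)}

result = combine_alignments(spa_table, spa_table.get, ibm_table.get, threshold=0.1)
for fragment, (origin, span) in result.items():
    print(fragment, origin, span)
```

Raising the threshold hands more fragments to the IBM alignment; lowering it trusts the SPA more often.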

16
Outline
  • Introduction
  • Symmetric Probabilistic Alignment
  • Experiments and Results
  • Conclusions
  • Future Work

17
Alignment Accuracy (1)
  • Evaluation Metrics
  • F1 (Precision, Recall) - based on positions
  • Data
  • English-Chinese
  • Xinhua news wire
  • Training data: 1M sentence pairs
  • Trained GIZA with default parameters
  • For the SPA, used the dictionary produced by GIZA
  • Test data
  • 366 sentence pairs - 3 copies by 3 people
  • 20 more sentence pairs - 1 copy by another
  • 27,286 source fragments, 3-8 words long

18
Alignment Accuracy (2)
  • Data
  • French-English
  • Canadian Hansard
  • Training data: 1M sentence pairs
  • Trained GIZA with default parameters
  • For the SPA, used the dictionary produced by GIZA
  • Test data
  • 91 sentence pairs
  • 12,466 source fragments, 3-8 words long

19
Alignment Accuracy (3)
  • Alignments to be compared
  • Random: random alignment to a reasonably long
    target fragment
  • Positional: alignment to a proportionally
    positioned target fragment
  • Oracle: the best possible contiguous human
    alignment
  • SPA-uni: unidirectional basic alignment
  • SPA-basic: bidirectional basic alignment
  • SPA: the best SPA alignment with restrictions
  • IBM4: non-contiguous alignment by IBM Model 4
  • COMB: the combination of SPA and IBM4 alignments
  • SPA-top10: the best of the top 10 alignment
    results of SPA

20
Alignment Accuracy En-Cn
  • SPA-basic outperformed SPA-uni
  • SPA was the best when we applied untranslated
    word penalty and length penalty
  • Our significance test showed that the difference
    between IBM4 and COMB is significant

21
Alignment Accuracy Fr-En
  • SPA-basic outperformed SPA-uni
  • SPA was the best when we applied all the
    restrictions
  • Our significance test showed that the difference
    between IBM4 and COMB is not significant

22
Human Alignment Evaluation
  • Rough idea about how much humans agree on
    alignment

23
EBMT Performance (1)
  • Data
  • French-English (Canadian Hansard)
  • 20k training sentence pairs
  • Test
  • Development set: 100 sentence pairs
  • 2-reference set: 2 references for 100 source
    sentences
  • Evaluation set: 10 x 100 sentence pairs
  • Evaluation Metric
  • BLEU

24
EBMT performance (2)
  • SPA, IBM4 and COMB perform significantly better
    than EBMT (the old aligner)
  • For 'Test', SPA outperformed EBMT by 28.5
  • Among SPA, IBM4 and COMB, nothing is
    significantly better than the others

25
Outline
  • Introduction
  • Symmetric Probabilistic Alignment
  • Experiments and Results
  • Conclusions
  • Future Work

26
Conclusions
  • Improvement on EBMT performance
  • Combined aligner worked the best on
    English-Chinese set
  • Bidirectional alignment worked better than
    unidirectional alignment

27
Future Work
  • Incorporating human dictionaries to cover more
    general domains
  • Non-contiguous alignment
  • Co-training of the SPA and a dictionary
  • Experiments on different data sets and different
    language pairs
  • Experiments with different metrics
  • Speed up

28
References
  • Ying Zhang, Stephan Vogel and Alex Waibel. 2003.
    Integrated Phrase Segmentation and Alignment
    Model for Statistical Machine Translation.
    Submitted to Proc. of International Conference
    on Natural Language Processing and Knowledge
    Engineering (NLP-KE), Beijing, China.
  • Peter F. Brown, Stephen A. Della Pietra, Vincent
    J. Della Pietra, and Robert L. Mercer. 1993. The
    mathematics of statistical machine translation:
    Parameter estimation. Computational Linguistics,
    19(2):263-311.
  • Stephan Vogel, Hermann Ney, and Christoph
    Tillmann. 1996. HMM-based word alignment in
    statistical translation. In COLING '96: The 16th
    Int. Conf. on Computational Linguistics, pages
    836-841, Copenhagen, August.
  • I. Dan Melamed. 1997. A Word-to-Word Model of
    Translational Equivalence. In Procs. of the
    ACL97, pp. 490-497, Madrid, Spain.
  • K. Yamada and K. Knight. 2002. A decoder for
    syntax-based statistical MT. In ACL '02.

29
Thank You !!
  • Questions?

30
Backup Slides
  • Alignment Accuracy Calculation
  • Non-contiguous Alignment

31
Alignment Accuracy Calculation
  • Human Answer
  • ... under the unemployment insurance plan of the
    other country ...
  • Machine Answer
  • ... under the unemployment insurance plan of the
    other country ...
  • Precision = 4/5 = 0.8
  • Recall = 4/8 = 0.5
  • F1 = 2 × 0.8 × 0.5 / (0.8 + 0.5) ≈ 0.6154
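
A position-based precision, recall and F1 can be computed as below. This is a sketch under assumptions: the function name `position_prf` is hypothetical, and the word positions are invented solely to reproduce the fractions 4/5 and 4/8 shown above, since the highlighted spans of the human and machine answers are not preserved in this transcript.

```python
def position_prf(human_positions, machine_positions):
    """Precision, recall and F1 over aligned target word positions."""
    human, machine = set(human_positions), set(machine_positions)
    overlap = len(human & machine)
    precision = overlap / len(machine) if machine else 0.0
    recall = overlap / len(human) if human else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


# Hypothetical positions: 8 in the human answer, 5 in the machine answer, 4 shared.
human = range(0, 8)
machine = range(4, 9)

p, r, f1 = position_prf(human, machine)
print(p, r, f1)
```

F1 is the harmonic mean 2PR / (P + R), so it rewards candidates that balance covering the human span against staying inside it.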

32
Non-contiguous Alignment