Stream Decoding for Simultaneous Translation
1
Word-based Translation Models
2
Overview
  • Introduction
  • Lexica
  • Alignment
  • IBM Model 1
  • Higher IBM Models
  • Word Alignment

3
System overview
4
Introduction
  • Word-based models were introduced by Brown et al.
    in the early 90s
  • Directly translate source words to target words
  • Model word-by-word translation probabilities
  • First statistical approach to machine translation
  • No longer state of the art
  • Used to generate word alignments for phrase
    extraction in phrase-based models

5
Lexica
  • Store translations of the source words
  • One word can have several translations
  • Example
  • Haus → house, building, home, household, shell
  • Some are more likely, others are only used in
    certain circumstances
  • How to decide which one to use in the
    translation?
  • Use statistics

6
Lexica
  • Collect counts of the different translations
  • Approximate a probability distribution from the
    counts (see the estimate below)
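
A minimal sketch of the estimate, assuming c(f, e) counts how often the source word f was observed translated as the target word e, and using the notation t(e | f) for the lexical translation probability that also appears on the later slides:

    t(e \mid f) = \frac{c(f, e)}{\sum_{e'} c(f, e')}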


7
Alignment
  • Mapping between source and target words that are
    translations of each other
  • Example
  • Input
  • das Haus ist klein
  • Probabilistic Lexicon
  • Possible word-by-word translation
  • The house is small
  • Implicit alignment between source and target
    sentence

8
Alignment
  • Formalized as a function
  • Maps target word position to source word position
  • Example (written out below)
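
For the running example "das Haus ist klein" → "the house is small", the alignment function a maps each target position j to a source position i; here it is simply the identity (a rendering of the slide's lost graphic, given as an assumption):

    a : j \rightarrow i \qquad \text{e.g. } a(1) = 1,\; a(2) = 2,\; a(3) = 3,\; a(4) = 4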

9
Alignment Difficulties
  • Word reordering
  • Leads to non-monotone alignments

10
Alignment Difficulties
  • Many-to-one alignments
  • One word of the input language is translated into
    several words

11
Alignment Difficulties
  • Deletion
  • For some source words there is no equivalent in
    the translation

12
Alignment Difficulties
  • Insertion
  • Some words of the target sentence have no
    equivalent in the source sentence
  • Add a NULL word so that the alignment function is
    still fully defined

13
Alignment Remarks
  • Many-to-one alignments are possible, but not
    one-to-many alignments
  • In these models alignments are represented by a
    function
  • Leads to problems with language pairs like
    Chinese-English
  • In phrase-based systems this is solved by looking
    at the translation process from both directions

14
IBM Model 1
  • A model that generates different translations for a
    sentence, each with an associated probability
  • Generative model: break the modeling of sentence
    translation into smaller steps of word-to-word
    translation with a coherent story
  • Probability of the English sentence e and
    alignment a given the foreign sentence f
    (see the formula below)
  • Number of possible alignments
  • Normalization constant
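
A sketch of the IBM Model 1 formula these bullets refer to, in its standard form, with l_e and l_f the lengths of the target and source sentences:

    p(e, a \mid f) = \frac{\epsilon}{(l_f + 1)^{l_e}} \prod_{j=1}^{l_e} t(e_j \mid f_{a(j)})

Here (l_f + 1)^{l_e} is the number of possible alignments (each target word may align to any source word or to NULL) and \epsilon is the normalization constant.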

15
IBM 1 Example
16
IBM 1 Training
  • Learn translation probability distributions
  • Problem: incomplete data
  • Only large amounts of sentence-aligned parallel
    text are available
  • The alignment information is missing
  • Consider the alignment as a hidden variable
  • Approach: Expectation Maximization (EM) algorithm

17
EM Algorithm
  • Initialize the model
  • Use a uniform distribution
  • Apply the model to the data (expectation step)
  • Compute alignment probabilities
  • At first all are equal, but later Haus will most
    likely be translated to house
  • Learn the model from the data (maximization step)
  • Learn translation probabilities from the guessed
    alignments
  • Use the best alignment, or all alignments weighted
    by their probability
  • Iterate steps 2 and 3 until convergence

18
Step 2
  • Calculate the probability of an alignment
  • Using dynamic programming, we can reduce the
    complexity from exponential to quadratic in the
    sentence length (see the rearrangement below)
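
The reduction comes from the fact that the sum over all alignments factorizes, so each target position can be treated independently (a standard rearrangement for IBM Model 1):

    p(e \mid f) = \sum_a p(e, a \mid f)
                = \frac{\epsilon}{(l_f + 1)^{l_e}} \prod_{j=1}^{l_e} \sum_{i=0}^{l_f} t(e_j \mid f_i)

The alignment probabilities of step 2 are then p(a \mid e, f) = p(e, a \mid f) / p(e \mid f), computable in time quadratic in the sentence length.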

19
Step 3
  • Collect counts from every sentence pair (e,f)
  • Calculate translation probabilities from the
    counts (see the formulas below)
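
One standard way to write the two sub-steps, assuming \delta(x, y) is 1 if x = y and 0 otherwise: first the expected counts collected from a sentence pair (e, f), then the re-estimated translation probabilities.

    c(e \mid f; \mathbf{e}, \mathbf{f}) =
        \frac{t(e \mid f)}{\sum_{i=0}^{l_f} t(e \mid f_i)}
        \sum_{j=1}^{l_e} \delta(e, e_j) \sum_{i=0}^{l_f} \delta(f, f_i)

    t(e \mid f) =
        \frac{\sum_{(\mathbf{e},\mathbf{f})} c(e \mid f; \mathbf{e}, \mathbf{f})}
             {\sum_{e'} \sum_{(\mathbf{e},\mathbf{f})} c(e' \mid f; \mathbf{e}, \mathbf{f})}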

20
Pseudo-code
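A minimal Python sketch of the EM loop described on the previous slides, assuming the corpus is given as tokenized sentence pairs; the NULL word and a convergence check are omitted for brevity:

    from collections import defaultdict

    def train_ibm1(corpus, iterations=10):
        """EM training of IBM Model 1 lexical translation probabilities.

        corpus: list of (target_sentence, source_sentence) pairs,
        each a list of tokens.  Returns t[(e, f)] = t(e | f).
        """
        # Initialization: uniform distribution over the target vocabulary
        target_vocab = {e for e_sent, _ in corpus for e in e_sent}
        t = defaultdict(lambda: 1.0 / len(target_vocab))

        for _ in range(iterations):
            count = defaultdict(float)  # expected counts c(e, f)
            total = defaultdict(float)  # expected counts c(f)

            # Expectation step: collect counts from every sentence pair
            for e_sent, f_sent in corpus:
                for e in e_sent:
                    # mass of e over all source words it could align to
                    z = sum(t[(e, f)] for f in f_sent)
                    for f in f_sent:
                        c = t[(e, f)] / z
                        count[(e, f)] += c
                        total[f] += c

            # Maximization step: re-estimate translation probabilities
            for (e, f), c in count.items():
                t[(e, f)] = c / total[f]

        return t

    # Toy usage (German -> English)
    corpus = [
        ("the house".split(), "das Haus".split()),
        ("the house is small".split(), "das Haus ist klein".split()),
        ("the book".split(), "das Buch".split()),
    ]
    t = train_ibm1(corpus)
    print(t[("house", "Haus")])  # should grow toward 1 over the iterations
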
21
Example
22
Convergence
  • The probability of the training data increases with
    each iteration
  • EM training converges to a local maximum
  • For IBM Model 1, EM reaches the global maximum

23
Higher IBM Models
  • IBM1 is very simple
  • No treatment of reordering, or of adding and
    dropping words
  • Five models of increasing complexity were
    proposed by Brown et al.

24
Higher IBM Models
  • The complexity of training grows, but the general
    principle stays the same
  • During training
  • First train IBM Model 1
  • Use IBM Model 1 to initialize IBM Model 2
  • All models are implemented in the GIZA Toolkit
  • Used by many groups
  • Parallel version developed at CMU

25
IBM Model 2
  • Problem of IBM Model 1: the same probability for
    both of these sentence pairs
  • Model the alignment based on the positions of the
    input and output words

26
IBM Model 2
  • Two-step procedure
  • Mathematical formulation (see below)
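
A sketch of the formulation these bullets refer to, in the standard IBM Model 2 form: the translation step uses t(e_j | f_{a(j)}) as before, and the alignment step adds an explicit alignment probability conditioned on the positions and sentence lengths:

    p(e, a \mid f) = \epsilon \prod_{j=1}^{l_e}
        t(e_j \mid f_{a(j)}) \; a(a(j) \mid j, l_e, l_f)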

27
IBM Model 2
  • Training
  • Similar to IBM Model 1 training
  • Initialization
  • Initialize with the values from IBM Model 1 training
  • Alignment probability

28
IBM Model 3
  • The previous models did not model how many words
    are generated by an input word
  • Model fertility with a probability distribution
  • Examples

29
IBM Model 3
  • Word insertions
  • Modeled by using the NULL word
  • The number of inserted words should depend on the
    sentence length
  • Insert a NULL token after every generated word with
    probability p1, or not with probability p0 = 1 - p1

30
IBM Model 3
31
IBM Model 3
32
IBM Model 3
  • Distortion model instead of an alignment model
  • The same alignment yields different distortions in
    the two productions
  • The two models work in opposite directions

33
IBM Model 3 Training
  • A dynamic programming approach is no longer
    possible
  • Use a sampling approach
  • Find the most probable alignments using hill
    climbing (neighborhood sketched below)
  • Add additional similar alignments
  • Use only these alignments for normalization
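
A hedged sketch of the neighborhood typically used for this hill climbing, where neighboring alignments differ by moving a single link or swapping two links (the list representation and the function name are assumptions):

    def neighbors(alignment, l_f):
        """Yield neighboring alignments for hill climbing.

        alignment: list with alignment[j] = i, the source position
        (0 = NULL) that target position j is aligned to;
        l_f is the source sentence length.
        """
        n = len(alignment)
        # "Move": re-align one target position to another source position
        for j in range(n):
            for i in range(l_f + 1):
                if i != alignment[j]:
                    moved = list(alignment)
                    moved[j] = i
                    yield moved
        # "Swap": exchange the links of two target positions
        for j1 in range(n):
            for j2 in range(j1 + 1, n):
                if alignment[j1] != alignment[j2]:
                    swapped = list(alignment)
                    swapped[j1], swapped[j2] = swapped[j2], swapped[j1]
                    yield swapped

Hill climbing then scores each neighbor under the model and moves to the best one until no neighbor improves.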

34
IBM Model 4
  • Long sentences are relatively rare
  • The distortion probability cannot be approximated
    well
  • Use relative positions instead
  • Use word classes to better approximate the
    distribution

35
IBM Model 5
  • Deficiency: according to IBM Models 3 and 4,
    multiple output words can be placed at the same
    position
  • Positive probability for impossible alignments
  • IBM Model 5 prevents this
  • No improvement in alignment quality
  • Not used in most state-of-the-art systems

36
IBM Models
  • Phrase-based systems outperform these word-based
    translation models
  • The IBM Models can be used to generate a word
    alignment by using the Viterbi path
  • Problem: no 1-to-many alignments
  • But we can generate many-to-1 alignments
  • Use alignments from both directions and combine
    them with a heuristic (see the sketch below)
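
A minimal sketch of such a combination, assuming both directional alignments have already been mapped to sets of (source position, target position) links; the intersection is high precision, the union high recall, and heuristics such as grow-diag interpolate between the two:

    def symmetrize(src_to_tgt, tgt_to_src):
        """Combine word alignments from the two translation directions.

        Both arguments are sets of (source_pos, target_pos) links.
        Returns the intersection and the union of the two alignments.
        """
        intersection = src_to_tgt & tgt_to_src
        union = src_to_tgt | tgt_to_src
        return intersection, union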

37
Word alignment
38
Word alignment
39
Word alignment
  • Reference alignments use sure and possible links
  • Most common metric: the alignment error rate
    (AER), defined below
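
The alignment error rate is usually defined over the sure links S, the possible links P (with S a subset of P), and the hypothesis alignment A:

    \text{AER}(S, P; A) = 1 - \frac{|A \cap S| + |A \cap P|}{|A| + |S|}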