Stream Decoding for Simultaneous Translation
1
Word-based Translation Models
2
Overview
  • Introduction
  • Lexica
  • Alignment
  • IBM Model 1
  • Higher IBM Models
  • Word Alignment

3
System overview
4
Introduction
  • Word-based models were introduced by Brown et al.
    in the early 90s
  • Directly translate source words to target words
  • Model word-by-word translation probabilities
  • First statistical approach to machine translation
  • No longer state of the art
  • Used to generate word alignments for phrase
    extraction in phrase-based models

5
Lexica
  • Store translations of the source words
  • One word can have several translations
  • Example
  • Haus → house, building, home, household, shell
  • Some are more likely, others are only used in
    certain circumstances
  • How to decide which one to use in the
    translation?
  • Use statistics

6
Lexica
  • Collect counts of the different translations
  • Approximate a probability distribution from the
    counts (see the estimate below)
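
A minimal sketch of the estimate, assuming c(f, e) counts how often the source word f was observed translated as the target word e, and using the notation t(e | f) for the lexical translation probability that also appears on the later slides:

    t(e \mid f) = \frac{c(f, e)}{\sum_{e'} c(f, e')}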


7
Alignment
  • Mapping between source and target words that are
    translations of each other
  • Example
  • Input
  • das Haus ist klein
  • Probabilistic Lexicon
  • Possible word-by-word translation
  • The house is small
  • Implicit alignment between source and target
    sentence

8
Alignment
  • Formalized as a function
  • Maps target word position to source word position
  • Example (written out below)
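
For the running example "das Haus ist klein" → "the house is small", the alignment function a maps each target position j to a source position i; here it is simply the identity (a rendering of the slide's lost graphic, given as an assumption):

    a : j \rightarrow i \qquad \text{e.g. } a(1) = 1,\; a(2) = 2,\; a(3) = 3,\; a(4) = 4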

9
Alignment Difficulties
  • Word reordering
  • Leads to non-monotone alignments

10
Alignment Difficulties
  • Many-to-one alignments
  • One word of the input language is translated into
    several words

11
Alignment Difficulties
  • Deletion
  • For some source words there is no equivalent in
    the translation

12
Alignment Difficulties
  • Insertion
  • Some words of the target sentence have no
    equivalent in the source sentence
  • Add a NULL word so that the alignment function is
    still fully defined

13
Alignment Remarks
  • Many-to-one alignments are possible, but not
    one-to-many alignments
  • In these models alignments are represented by a
    function
  • Leads to problems with language pairs like
    Chinese-English
  • In phrase-based systems this is solved by looking
    at the translation process from both directions

14
IBM Model 1
  • A model that generates different translations for a
    sentence, each with an associated probability
  • Generative model: break the modeling of sentence
    translation into smaller steps of word-to-word
    translation with a coherent story
  • Probability of the English sentence e and
    alignment a given the foreign sentence f
    (see the formula below)
  • Number of possible alignments
  • Normalization constant
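
A sketch of the IBM Model 1 formula these bullets refer to, in its standard form, with l_e and l_f the lengths of the target and source sentences:

    p(e, a \mid f) = \frac{\epsilon}{(l_f + 1)^{l_e}} \prod_{j=1}^{l_e} t(e_j \mid f_{a(j)})

Here (l_f + 1)^{l_e} is the number of possible alignments (each target word may align to any source word or to NULL) and \epsilon is the normalization constant.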

15
IBM 1 Example
16
IBM 1 Training
  • Learn translation probability distributions
  • Problem: incomplete data
  • Only large amounts of sentence-aligned parallel
    text are available
  • The alignment information is missing
  • Consider the alignment as a hidden variable
  • Approach: Expectation Maximization (EM) algorithm

17
EM Algorithm
  • Initialize the model
  • Use a uniform distribution
  • Apply the model to the data (expectation step)
  • Compute alignment probabilities
  • At first all are equal, but later Haus will most
    likely be translated to house
  • Learn the model from the data (maximization step)
  • Learn translation probabilities from the guessed
    alignments
  • Use the best alignment, or all alignments weighted
    by their probability
  • Iterate steps 2 and 3 until convergence

18
Step 2
  • Calculate the probability of an alignment
  • Using dynamic programming, we can reduce the
    complexity from exponential to quadratic in the
    sentence length (see the rearrangement below)
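
The reduction comes from the fact that the sum over all alignments factorizes, so each target position can be treated independently (a standard rearrangement for IBM Model 1):

    p(e \mid f) = \sum_a p(e, a \mid f)
                = \frac{\epsilon}{(l_f + 1)^{l_e}} \prod_{j=1}^{l_e} \sum_{i=0}^{l_f} t(e_j \mid f_i)

The alignment probabilities of step 2 are then p(a \mid e, f) = p(e, a \mid f) / p(e \mid f), computable in time quadratic in the sentence length.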

19
Step 3
  • Collect counts from every sentence pair (e,f)
  • Calculate translation probabilities from the
    counts (see the formulas below)
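
One standard way to write the two sub-steps, assuming \delta(x, y) is 1 if x = y and 0 otherwise: first the expected counts collected from a sentence pair (e, f), then the re-estimated translation probabilities.

    c(e \mid f; \mathbf{e}, \mathbf{f}) =
        \frac{t(e \mid f)}{\sum_{i=0}^{l_f} t(e \mid f_i)}
        \sum_{j=1}^{l_e} \delta(e, e_j) \sum_{i=0}^{l_f} \delta(f, f_i)

    t(e \mid f) =
        \frac{\sum_{(\mathbf{e},\mathbf{f})} c(e \mid f; \mathbf{e}, \mathbf{f})}
             {\sum_{e'} \sum_{(\mathbf{e},\mathbf{f})} c(e' \mid f; \mathbf{e}, \mathbf{f})}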

20
Pseudo-code
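A minimal Python sketch of the EM loop described on the previous slides, assuming the corpus is given as tokenized sentence pairs; the NULL word and a convergence check are omitted for brevity:

    from collections import defaultdict

    def train_ibm1(corpus, iterations=10):
        """EM training of IBM Model 1 lexical translation probabilities.

        corpus: list of (target_sentence, source_sentence) pairs,
        each a list of tokens.  Returns t[(e, f)] = t(e | f).
        """
        # Initialization: uniform distribution over the target vocabulary
        target_vocab = {e for e_sent, _ in corpus for e in e_sent}
        t = defaultdict(lambda: 1.0 / len(target_vocab))

        for _ in range(iterations):
            count = defaultdict(float)  # expected counts c(e, f)
            total = defaultdict(float)  # expected counts c(f)

            # Expectation step: collect counts from every sentence pair
            for e_sent, f_sent in corpus:
                for e in e_sent:
                    # mass of e over all source words it could align to
                    z = sum(t[(e, f)] for f in f_sent)
                    for f in f_sent:
                        c = t[(e, f)] / z
                        count[(e, f)] += c
                        total[f] += c

            # Maximization step: re-estimate translation probabilities
            for (e, f), c in count.items():
                t[(e, f)] = c / total[f]

        return t

    # Toy usage (German -> English)
    corpus = [
        ("the house".split(), "das Haus".split()),
        ("the house is small".split(), "das Haus ist klein".split()),
        ("the book".split(), "das Buch".split()),
    ]
    t = train_ibm1(corpus)
    print(t[("house", "Haus")])  # should grow toward 1 over the iterations
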
21
Example
22
Convergence
  • The probability of the training data increases with
    each iteration
  • EM training converges to a local maximum
  • For IBM Model 1, EM reaches the global maximum

23
Higher IBM Models
  • IBM1 is very simple
  • No treatment of reordering, or of adding and
    dropping words
  • Five models of increasing complexity were
    proposed by Brown et al.

24
Higher IBM Models
  • The complexity of training grows, but the general
    principle stays the same
  • During training
  • First train IBM Model 1
  • Use IBM Model 1 to initialize IBM Model 2
  • All models are implemented in the GIZA Toolkit
  • Used by many groups
  • Parallel version developed at CMU

25
IBM Model 2
  • Problem of IBM Model 1: the same probability for
    both of these sentence pairs
  • Model the alignment based on the positions of the
    input and output words

26
IBM Model 2
  • Two-step procedure
  • Mathematical formulation (see below)
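
A sketch of the formulation these bullets refer to, in the standard IBM Model 2 form: the translation step uses t(e_j | f_{a(j)}) as before, and the alignment step adds an explicit alignment probability conditioned on the positions and sentence lengths:

    p(e, a \mid f) = \epsilon \prod_{j=1}^{l_e}
        t(e_j \mid f_{a(j)}) \; a(a(j) \mid j, l_e, l_f)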

27
IBM Model 2
  • Training
  • Similar to IBM Model 1 training
  • Initialization
  • Initialize with the values from IBM Model 1 training
  • Alignment probability

28
IBM Model 3
  • The previous models did not model how many words
    are generated by an input word
  • Model fertility with a probability distribution
  • Examples

29
IBM Model 3
  • Word insertions
  • Modeled by using the NULL word
  • The number of inserted words should depend on the
    sentence length
  • Insert a NULL token after every generated word with
    probability p1, or not with probability p0 = 1 - p1

30
IBM Model 3
31
IBM Model 3
32
IBM Model 3
  • Distortion model instead of an alignment model
  • The same alignment yields different distortions in
    the two productions
  • The two models work in opposite directions

33
IBM Model 3 Training
  • A dynamic programming approach is no longer
    possible
  • Use a sampling approach
  • Find the most probable alignments using hill
    climbing (neighborhood sketched below)
  • Add additional similar alignments
  • Use only these alignments for normalization
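
A hedged sketch of the neighborhood typically used for this hill climbing, where neighboring alignments differ by moving a single link or swapping two links (the list representation and the function name are assumptions):

    def neighbors(alignment, l_f):
        """Yield neighboring alignments for hill climbing.

        alignment: list with alignment[j] = i, the source position
        (0 = NULL) that target position j is aligned to;
        l_f is the source sentence length.
        """
        n = len(alignment)
        # "Move": re-align one target position to another source position
        for j in range(n):
            for i in range(l_f + 1):
                if i != alignment[j]:
                    moved = list(alignment)
                    moved[j] = i
                    yield moved
        # "Swap": exchange the links of two target positions
        for j1 in range(n):
            for j2 in range(j1 + 1, n):
                if alignment[j1] != alignment[j2]:
                    swapped = list(alignment)
                    swapped[j1], swapped[j2] = swapped[j2], swapped[j1]
                    yield swapped

Hill climbing then scores each neighbor under the model and moves to the best one until no neighbor improves.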

34
IBM Model 4
  • Long sentences are relatively rare
  • The distortion probability cannot be approximated
    well
  • Use relative positions instead
  • Use word classes to better approximate the
    distribution

35
IBM Model 5
  • Deficiency: according to IBM Models 3 and 4,
    multiple output words can be placed at the same
    position
  • Positive probability for impossible alignments
  • IBM Model 5 prevents this
  • No improvement in alignment quality
  • Not used in most state-of-the-art systems

36
IBM Models
  • Phrase-based systems outperform these word-based
    translation models
  • The IBM Models can be used to generate a word
    alignment by using the Viterbi path
  • Problem: no 1-to-many alignments
  • But we can generate many-to-1 alignments
  • Use alignments from both directions and combine
    them with a heuristic (see the sketch below)
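
A minimal sketch of such a combination, assuming both directional alignments have already been mapped to sets of (source position, target position) links; the intersection is high precision, the union high recall, and heuristics such as grow-diag interpolate between the two:

    def symmetrize(src_to_tgt, tgt_to_src):
        """Combine word alignments from the two translation directions.

        Both arguments are sets of (source_pos, target_pos) links.
        Returns the intersection and the union of the two alignments.
        """
        intersection = src_to_tgt & tgt_to_src
        union = src_to_tgt | tgt_to_src
        return intersection, union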

37
Word alignment
38
Word alignment
39
Word alignment
  • Reference alignments use sure and possible links
  • Most common metric: the alignment error rate
    (AER), defined below
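
The alignment error rate is usually defined over the sure links S, the possible links P (with S a subset of P), and the hypothesis alignment A:

    \text{AER}(S, P; A) = 1 - \frac{|A \cap S| + |A \cap P|}{|A| + |S|}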