Title: Machine Translation - 2
1. Machine Translation - 2
Lecture 17, 4 Sep 2008
2. Statistical Machine Translation
- Goal
- Given foreign sentence f:
- Maria no dio una bofetada a la bruja verde
- Find the most likely English translation e:
- Maria did not slap the green witch
3. Statistical Machine Translation
- The most likely English translation e is given by
  e* = argmax_e P(e|f)
- P(e|f) estimates the conditional probability of any e given f
4. What makes a good translation
- Translators often talk about two factors we want to maximize:
- Faithfulness or fidelity
- How close the meaning of the translation is to the meaning of the original
- (Even better: does the translation cause the reader to draw the same inferences as the original would have?)
- Fluency or naturalness
- How natural the translation is, considering only its fluency in the target language
5. Statistical MT Systems
(Diagram: Spanish/English bilingual text and English text each undergo statistical analysis, yielding a Spanish -> Broken English component and a Broken English -> English component.)
Que hambre tengo yo -> (What hunger have I / Hungry I am so / I am so hungry / Have I that hunger) -> I am so hungry
6. Statistical MT Systems
(Same diagram, now with the components labeled:)
- Spanish/English bilingual text -> statistical analysis -> Translation Model P(s|e)
- English text -> statistical analysis -> Language Model P(e)
- Decoding algorithm: e* = argmax_e P(e) P(s|e)
Que hambre tengo yo -> I am so hungry
7. Statistical MT: Faithfulness and Fluency, formalized!
- Best translation of a source sentence S:
  T* = argmax_T fluency(T) faithfulness(T, S)
- Developed by researchers who were originally in speech recognition at IBM
- Called the IBM model
8. Three Problems for Statistical MT
- Language model
- Given an English string e, assigns P(e) by formula
- good English string -> high P(e)
- random word sequence -> low P(e)
- Translation model
- Given a pair of strings <f,e>, assigns P(f|e) by formula
- <f,e> look like translations -> high P(f|e)
- <f,e> don't look like translations -> low P(f|e)
- Decoding algorithm
- Given a language model, a translation model, and a new sentence f, find the translation e maximizing P(e) P(f|e) (see the sketch below)
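A minimal sketch of how the three pieces interact at decode time. The candidate generator and the two scoring functions are hypothetical placeholders for a real language model, translation model, and search procedure:

```python
import math

def decode(f, candidates, lm_logprob, tm_logprob):
    """Return the English sentence e maximizing P(e) * P(f|e).

    `candidates(f)` is assumed to yield English strings for foreign
    sentence f; `lm_logprob(e)` and `tm_logprob(f, e)` are assumed to
    return log P(e) and log P(f|e) from previously trained models.
    """
    best_e, best_score = None, -math.inf
    for e in candidates(f):
        score = lm_logprob(e) + tm_logprob(f, e)  # log P(e) + log P(f|e)
        if score > best_score:
            best_e, best_score = e, score
    return best_e
```

In practice the candidate space is far too large to enumerate, which is why decoding needs heuristic search (see the Decoding slides later).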
9. Parallel Corpus
- Example from DE-News (8/1/1996)
English: Diverging opinions about planned tax reform
German: Unterschiedliche Meinungen zur geplanten Steuerreform
English: The discussion around the envisaged major tax reform continues .
German: Die Diskussion um die vorgesehene grosse Steuerreform dauert an .
English: The FDP economics expert , Graf Lambsdorff , today came out in favor of advancing the enactment of significant parts of the overhaul , currently planned for 1999 .
German: Der FDP - Wirtschaftsexperte Graf Lambsdorff sprach sich heute dafuer aus , wesentliche Teile der fuer 1999 geplanten Reform vorzuziehen .
10. Word-Level Alignments
- Given a parallel sentence pair, we can link (align) words or phrases that are translations of each other
11. Parallel Resources
- Newswire: DE-News (German-English), Hong-Kong News, Xinhua News (Chinese-English)
- Government: Canadian Hansards (French-English), Europarl (Danish, Dutch, English, Finnish, French, German, Greek, Italian, Portuguese, Spanish, Swedish), UN Treaties (Russian, English, Arabic, ...)
- Manuals: PHP, KDE, OpenOffice (all from OPUS, many languages)
- Web pages: STRAND project (Philip Resnik)
12. Sentence Alignment
- If document De is a translation of document Df, how do we find the translation for each sentence?
- The n-th sentence in De is not necessarily the translation of the n-th sentence in document Df
- In addition to 1:1 alignments, there are also 1:0, 0:1, 1:n, and n:1 alignments
- Approximately 90% of the sentence alignments are 1:1
13. Sentence Alignment (cont'd)
- There are several sentence alignment algorithms:
- Align (Gale & Church): aligns sentences based on their character length (shorter sentences tend to have shorter translations than longer sentences). Works astonishingly well
- Char-align (Church): aligns based on shared character sequences. Works fine for similar languages or technical domains
- K-Vec (Fung & Church): induces a translation lexicon from the parallel texts based on the distribution of foreign-English word pairs
14. Computing Translation Probabilities
- Given a parallel corpus, we can estimate P(e|f). The maximum likelihood estimate of P(e|f) is freq(e,f) / freq(f) (see the sketch below)
- Way too specific to get any reasonable frequencies at the sentence level! The vast majority of unseen data will have zero counts!
- P(e|f) could be re-defined at the word level instead
- Problem: the English words maximizing P(e|f) might not result in a readable sentence
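A sketch of the count-and-divide estimate above; `pairs` is a hypothetical list of (e, f) co-occurrence events harvested from aligned sentences:

```python
from collections import Counter

def mle_translation_probs(pairs):
    """Maximum likelihood estimate P(e|f) = freq(e,f) / freq(f)."""
    joint = Counter(pairs)                     # freq(e, f)
    f_marginal = Counter(f for _, f in pairs)  # freq(f)
    return {(e, f): c / f_marginal[f] for (e, f), c in joint.items()}
```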
15. Computing Translation Probabilities (cont'd)
- We can account for adequacy: each foreign word translates into its most likely English word
- We cannot guarantee that this will result in a fluent English sentence
- Solution: transform P(e|f) with Bayes' rule:
  P(e|f) = P(e) P(f|e) / P(f)
- P(f|e) accounts for adequacy
- P(e) accounts for fluency
16. Decoding
- The decoder combines the evidence from P(e) and P(f|e) to find the sequence e that is the best translation
- The choice of a word e as the translation of f depends on the translation probability P(f|e) and on the context, i.e. the other English words preceding e
17. Noisy Channel Model for Translation
18. Noisy Channel Model
- Generative story:
- Generate e with probability p(e)
- Pass e through the noisy channel
- Out comes f with probability p(f|e)
- Translation task:
- Given f, deduce the most likely e that produced f, or
  e* = argmax_e p(e) p(f|e)
19. Translation Model
- How to model P(f|e)?
- Learn the parameters of P(f|e) from a bilingual corpus S of sentence pairs <e_i, f_i>:
- <e1, f1> = <the blue witch, la bruja azul>
- <e2, f2> = <green, verde>
- ...
- <eS, fS> = <the witch, la bruja>
20. Translation Model
- There is insufficient data in the parallel corpus to estimate P(f|e) at the sentence level (Why? Most sentences occur only once)
- Decompose the process of translating e -> f into small steps whose probabilities can be estimated
21. Translation Model
- English sentence e = e1 ... el
- Foreign sentence f = f1 ... fm
- Alignment A = a1 ... am, where aj ∈ {0, ..., l}
- A indicates which English word generates each foreign word
22. Alignments
- e: the blue witch
- f: la bruja azul
- A = (1, 3, 2) (intuitively good alignment)
23. Alignments
- e: the blue witch
- f: la bruja azul
- A = (1, 1, 1) (intuitively bad alignment)
24. Alignments
- e: the blue witch
- f: la bruja azul
- (illegal alignment!)
25. Alignments
- Question: how many possible alignments are there for a given e and f, where |e| = l and |f| = m?
26. Alignments
- Question: how many possible alignments are there for a given e and f, where |e| = l and |f| = m?
- Answer:
- Each foreign word can align with any one of the |e| = l words, or it can remain unaligned
- Each foreign word therefore has (l + 1) choices for an alignment, and there are |f| = m foreign words
- So there are (l+1)^m alignments for a given e and f (see the sketch below)
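A quick sketch that enumerates the alignment space and checks the (l+1)^m count; index 0 plays the role of the NULL (unaligned) choice:

```python
from itertools import product

def all_alignments(l, m):
    """All ways to align m foreign words to l English words plus NULL."""
    return list(product(range(l + 1), repeat=m))

# e.g. l = 3 English words, m = 3 foreign words: (3+1)**3 = 64 alignments
assert len(all_alignments(3, 3)) == 64
```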
27. Alignments
- Question: if all alignments are equally likely, what is the probability of any one alignment, given e?
28. Alignments
- Question: if all alignments are equally likely, what is the probability of any one alignment, given e?
- Answer:
- P(A|e) = p(|f| = m) * 1/(l+1)^m
- If we assume that p(|f| = m) is uniform over all possible lengths, then we can let p(|f| = m) = C
- P(A|e) = C / (l+1)^m
29. Generative Story
- e: blue witch
- f: bruja azul
- How do we get from e to f?
30. (No transcript)
31. Language Modeling
- Determines the probability of some English sequence e of length l
- P(e) is hard to estimate directly, unless l is very small
- P(e) is normally approximated as
  P(e) ≈ prod_i P(w_i | w_{i-m}, ..., w_{i-1})
- where m is the size of the context, i.e. the number of previous words that are considered; normally m = 2 (a trigram language model; see the sketch below)
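A sketch of the trigram (m = 2) approximation; `cond_prob` is a hypothetical lookup returning a smoothed estimate of P(w | u, v) trained on English text:

```python
import math

def trigram_logprob(sentence, cond_prob):
    """log P(e) under P(e) ~ prod_i P(w_i | w_{i-2}, w_{i-1})."""
    words = ["<s>", "<s>"] + sentence + ["</s>"]  # pad with boundary symbols
    logp = 0.0
    for i in range(2, len(words)):
        logp += math.log(cond_prob(words[i], words[i - 2], words[i - 1]))
    return logp
```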
32. Translation Modeling
- Determines the probability that the foreign word f is a translation of the English word e
- How to compute P(f|e) from a parallel corpus?
- Statistical approaches rely on the co-occurrence of e and f in the parallel data: if e and f tend to co-occur in parallel sentence pairs, they are likely to be translations of one another
33. Finding Translations in a Parallel Corpus
- Into which foreign words f1, ..., fn does e translate?
- Commonly, four factors are used:
- How often do e and f co-occur? (translation)
- How likely is a word occurring at position i to translate into a word occurring at position j? (distortion) For example, English is a verb-second language, whereas German is a verb-final language
- How likely is e to translate into more than one word? (fertility) For example, "defeated" can translate into "eine Niederlage erleiden"
- How likely is a foreign word to be spuriously generated? (null translation)
34. Translation Steps
35. IBM Models 1-5
- Model 1: bag of words
- Unique local maximum
- Efficient EM algorithm (Models 1-2)
- Model 2: general alignment
- Model 3: fertility n(k|e)
- No full EM, count only neighbors (Models 3-5)
- Deficient (Models 3-4)
- Model 4: relative distortion, word classes
- Model 5: extra variables to avoid deficiency
36. IBM Model 1
- Model parameters:
- T(f_j | e_{a_j}): translation probability of a foreign word given the English word that generated it
37. IBM Model 1
- Generative story: given e,
- Pick m = |f|, where all lengths m are equally probable
- Pick A with probability P(A|e) = 1/(l+1)^m, since all alignments are equally likely given l and m
- Pick f1 ... fm with probability
  P(f|A,e) = prod_{j=1..m} T(f_j | e_{a_j})
- where T(f_j | e_{a_j}) is the translation probability of f_j given the English word it is aligned to (see the sketch below)
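Putting the three steps together gives the Model 1 joint probability of a foreign sentence and an alignment. A minimal sketch, with `epsilon` standing in for the uniform length probability and `t` a translation table (the probabilities below are made-up toy numbers):

```python
def model1_joint(f, a, e, t, epsilon=1.0):
    """P(f, A | e) = epsilon/(l+1)^m * prod_j t(f_j | e_{a_j}).

    e[0] is taken to be the NULL word, so a[j] == 0 means "unaligned".
    """
    l, m = len(e) - 1, len(f)
    p = epsilon / (l + 1) ** m
    for j in range(m):
        p *= t[(f[j], e[a[j]])]
    return p

# The example on the next slides: e = blue witch, f = bruja azul, A = (2, 1)
t = {("bruja", "witch"): 0.8, ("azul", "blue"): 0.7}  # hypothetical values
print(model1_joint(["bruja", "azul"], [2, 1], ["NULL", "blue", "witch"], t))
```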
38. IBM Model 1 Example
39. IBM Model 1 Example
- Pick m = |f| = 2
40. IBM Model 1 Example
- Pick A = (2, 1) with probability 1/(l+1)^m
41. IBM Model 1 Example
- Pick f1 = bruja with probability t(bruja|witch)
42. IBM Model 1 Example
- e: blue witch
- f: bruja azul
- Pick f2 = azul with probability t(azul|blue)
43. IBM Model 1: Parameter Estimation
- How does this generative story help us to estimate P(f|e) from the data?
- Since the model for P(f|e) contains the parameter T(f_j | e_{a_j}), we first need to estimate T(f_j | e_{a_j})
44. IBM Model 1: Parameter Estimation
- How to estimate T(f_j | e_{a_j}) from the data?
- If we had the data and the alignments A, along with P(A|f,e), then we could estimate T(f_j | e_{a_j}) using expected counts, i.e. counts of (f_j, e_{a_j}) pairs weighted by the probability of the alignments that produce them
45. IBM Model 1: Parameter Estimation
- How to estimate P(A|f,e)?
- P(A|f,e) = P(A,f|e) / P(f|e)
- But P(f|e) = sum_A P(A,f|e)
- So we need to compute P(A,f|e)
- This is given by the Model 1 generative story
46. IBM Model 1 Example
- e: the blue witch
- f: la bruja azul
- P(A|f,e) = P(f,A|e) / P(f|e)
47. IBM Model 1: Parameter Estimation
- So, in order to estimate P(f|e), we first need to estimate the model parameter T(f_j | e_{a_j})
- In order to compute T(f_j | e_{a_j}), we need to estimate P(A|f,e)
- And in order to compute P(A|f,e), we need to estimate T(f_j | e_{a_j})
48. IBM Model 1: Parameter Estimation
- Training data is a set of pairs <e_i, f_i>
- The log likelihood of the training data given the model parameters is sum_i log P(f_i | e_i)
- To maximize the log likelihood of the training data given the model parameters, use EM:
- hidden variable: alignments A
- model parameters: translation probabilities T
49. EM
- Initialize model parameters T(f|e)
- Calculate alignment probabilities P(A|f,e) under the current values of T(f|e)
- Calculate expected counts from the alignment probabilities
- Re-estimate T(f|e) from these expected counts
- Repeat until the log likelihood of the training data converges to a maximum
- (A runnable sketch of this loop follows; the worked example on the next slides traces it by hand)
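A compact, runnable version of the loop above, using the two-sentence toy corpus from the worked example that follows. For Model 1 the expected counts factor per word, so P(A|f,e) never has to be enumerated explicitly: the posterior probability that foreign word f was generated by English word e is just t(f|e) normalized over the English sentence.

```python
from collections import defaultdict

# Toy corpus from the next slides; NULL is prepended to each English side.
corpus = [
    (["NULL", "the", "dog"], ["le", "chien"]),
    (["NULL", "the", "cat"], ["le", "chat"]),
]
f_vocab = {f for _, fs in corpus for f in fs}

# Steps 1-2: initialize T(f|e) uniformly (1/3 for each foreign word here).
t = {(f, e): 1.0 / len(f_vocab)
     for e_sent, _ in corpus for e in e_sent for f in f_vocab}

for iteration in range(5):
    tc = defaultdict(float)     # expected counts tc(f|e)
    total = defaultdict(float)  # expected counts per English word
    for e_sent, f_sent in corpus:
        for f in f_sent:
            norm = sum(t[(f, e)] for e in e_sent)  # the "total" on the slides
            for e in e_sent:
                c = t[(f, e)] / norm               # posterior P(e generated f)
                tc[(f, e)] += c
                total[e] += c
    for (f, e) in t:                               # M-step: re-estimate T
        t[(f, e)] = tc[(f, e)] / total[e]

print(round(t[("le", "the")], 4))     # 0.7556, matching slide 57
print(round(t[("chien", "dog")], 4))  # 0.8381, matching slide 57
```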
50. IBM Model 1 Example
- Parallel corpus:
- the dog -- le chien
- the cat -- le chat
- Steps 1-2 (collect candidates and initialize uniformly):
- P(le|the) = P(chien|the) = P(chat|the) = 1/3
- P(le|dog) = P(chien|dog) = P(chat|dog) = 1/3
- P(le|cat) = P(chien|cat) = P(chat|cat) = 1/3
- P(le|NULL) = P(chien|NULL) = P(chat|NULL) = 1/3
51. IBM Model 1 Example
- Step 3: iterate
- NULL the dog -- le chien
- j=1 (le):
- total = P(le|NULL) + P(le|the) + P(le|dog) = 1
- tc(le|NULL) = 0 + P(le|NULL)/1 = 0 + 0.333/1 = 0.333
- tc(le|the) = 0 + P(le|the)/1 = 0 + 0.333/1 = 0.333
- tc(le|dog) = 0 + P(le|dog)/1 = 0 + 0.333/1 = 0.333
- j=2 (chien):
- total = P(chien|NULL) + P(chien|the) + P(chien|dog) = 1
- tc(chien|NULL) = 0 + 0.333/1 = 0.333
- tc(chien|the) = 0 + 0.333/1 = 0.333
- tc(chien|dog) = 0 + 0.333/1 = 0.333
52. IBM Model 1 Example
- NULL the cat -- le chat
- j=1 (le):
- total = P(le|NULL) + P(le|the) + P(le|cat) = 1
- tc(le|NULL) = 0.333 + 0.333/1 = 0.666
- tc(le|the) = 0.333 + 0.333/1 = 0.666
- tc(le|cat) = 0 + 0.333/1 = 0.333
- j=2 (chat):
- total = P(chat|NULL) + P(chat|the) + P(chat|cat) = 1
- tc(chat|NULL) = 0 + 0.333/1 = 0.333
- tc(chat|the) = 0 + 0.333/1 = 0.333
- tc(chat|cat) = 0 + 0.333/1 = 0.333
53. IBM Model 1 Example
- Re-compute translation probabilities:
- total(the) = tc(le|the) + tc(chien|the) + tc(chat|the) = 0.666 + 0.333 + 0.333 = 1.333
- P(le|the) = tc(le|the)/total(the) = 0.666 / 1.333 = 0.5
- P(chien|the) = tc(chien|the)/total(the) = 0.333 / 1.333 = 0.25
- P(chat|the) = tc(chat|the)/total(the) = 0.333 / 1.333 = 0.25
- total(dog) = tc(le|dog) + tc(chien|dog) = 0.666
- P(le|dog) = tc(le|dog)/total(dog) = 0.333 / 0.666 = 0.5
- P(chien|dog) = tc(chien|dog)/total(dog) = 0.333 / 0.666 = 0.5
- ...
54. IBM Model 1 Example
- Iteration 2:
- NULL the dog -- le chien
- j=1 (le):
- total = P(le|NULL) + P(le|the) + P(le|dog) = 0.5 + 0.5 + 0.5 = 1.5
- tc(le|NULL) = 0 + 0.5/1.5 = 0.333
- tc(le|the) = 0 + 0.5/1.5 = 0.333
- tc(le|dog) = 0 + 0.5/1.5 = 0.333
- j=2 (chien):
- total = P(chien|NULL) + P(chien|the) + P(chien|dog) = 0.25 + 0.25 + 0.5 = 1
- tc(chien|NULL) = 0 + 0.25/1 = 0.25
- tc(chien|the) = 0 + 0.25/1 = 0.25
- tc(chien|dog) = 0 + 0.5/1 = 0.5
55. IBM Model 1 Example
- NULL the cat -- le chat
- j=1 (le):
- total = P(le|NULL) + P(le|the) + P(le|cat) = 0.5 + 0.5 + 0.5 = 1.5
- tc(le|NULL) = 0.333 + 0.5/1.5 = 0.666
- tc(le|the) = 0.333 + 0.5/1.5 = 0.666
- tc(le|cat) = 0 + 0.5/1.5 = 0.333
- j=2 (chat):
- total = P(chat|NULL) + P(chat|the) + P(chat|cat) = 0.25 + 0.25 + 0.5 = 1
- tc(chat|NULL) = 0 + 0.25/1 = 0.25
- tc(chat|the) = 0 + 0.25/1 = 0.25
- tc(chat|cat) = 0 + 0.5/1 = 0.5
56. IBM Model 1 Example
- Re-compute translation probabilities (iteration 2):
- total(the) = tc(le|the) + tc(chien|the) + tc(chat|the) = 0.666 + 0.25 + 0.25 = 1.166
- P(le|the) = tc(le|the)/total(the) = 0.666 / 1.166 = 0.571
- P(chien|the) = tc(chien|the)/total(the) = 0.25 / 1.166 = 0.214
- P(chat|the) = tc(chat|the)/total(the) = 0.25 / 1.166 = 0.214
- total(dog) = tc(le|dog) + tc(chien|dog) = 0.333 + 0.5 = 0.833
- P(le|dog) = tc(le|dog)/total(dog) = 0.333 / 0.833 = 0.4
- P(chien|dog) = tc(chien|dog)/total(dog) = 0.5 / 0.833 = 0.6
57. IBM Model 1 Example
- After 5 iterations:
- P(le|NULL) = 0.755608028335301
- P(chien|NULL) = 0.122195985832349
- P(chat|NULL) = 0.122195985832349
- P(le|the) = 0.755608028335301
- P(chien|the) = 0.122195985832349
- P(chat|the) = 0.122195985832349
- P(le|dog) = 0.161943319838057
- P(chien|dog) = 0.838056680161943
- P(le|cat) = 0.161943319838057
- P(chat|cat) = 0.838056680161943
58. IBM Model 1 Recap
- IBM Model 1 allows for an efficient computation of translation probabilities
- No notion of fertility, i.e., it's possible that the same English word is the best translation for all foreign words
- No positional information, i.e., even if, for a given language pair, words at the beginning of the English sentence tend to align to words at the beginning of the foreign sentence, the model cannot capture this
59. IBM Model 2
- Model parameters:
- T(f_j | e_{a_j}): translation probability of foreign word f_j given the English word e_{a_j} that generated it
- d(i | j, l, m): distortion probability, or probability that f_j is aligned to e_i, given l and m
60. IBM Model 3
- Model parameters:
- T(f_j | e_{a_j}): translation probability of foreign word f_j given the English word e_{a_j} that generated it
- r(j | i, l, m): reverse distortion probability, or probability of position j for f_j, given its alignment to e_i, l, and m
- n(k | e_i): fertility of word e_i, or probability that k foreign words are aligned to e_i
- p1: probability of generating a foreign word by alignment with the NULL English word
61. IBM Model 3
- IBM Model 3 offers two additional features compared to IBM Model 1:
- How likely is an English word e to align to k foreign words? (fertility)
- Positional information: how likely is a word in position i to align to a word in position j? (distortion)
62. IBM Model 3: Fertility
- The best Model 1 alignment could be that a single English word aligns to all foreign words
- This is clearly not desirable, and we want to constrain the number of words an English word can align to
- Fertility models a probability distribution that word e aligns to k words: n(k|e)
- Consequence: translation probabilities can no longer be computed independently of each other
- IBM Model 3 has to work with full alignments; note that there are up to (l+1)^m different alignments
63. IBM Model 3
- Generative story:
- Choose fertilities for each English word
- Insert spurious words according to the probability of being aligned to the NULL English word
- Translate English words -> foreign words
- Reorder words according to reverse distortion probabilities
64. IBM Model 3 Example
- Consider the following example from Knight (1999):
- Maria did not slap the green witch
65. IBM Model 3 Example
- Maria did not slap the green witch
- Maria not slap slap slap the green witch
- Choose fertilities: phi(Maria) = 1, ...
66. IBM Model 3 Example
- Maria did not slap the green witch
- Maria not slap slap slap the green witch
- Maria not slap slap slap NULL the green witch
- Insert spurious words: p(NULL)
67. IBM Model 3 Example
- Maria did not slap the green witch
- Maria not slap slap slap the green witch
- Maria not slap slap slap NULL the green witch
- Maria no dio una bofetada a la verde bruja
- Translate words: t(verde|green)
68. IBM Model 3 Example
- Maria no dio una bofetada a la verde bruja
- Maria no dio una bofetada a la bruja verde
- Reorder words
69. IBM Model 3
- For Models 1 and 2:
- We can compute exact EM updates
- For Models 3 and 4:
- Exact EM updates cannot be efficiently computed
- Use the best alignments from previous iterations to initialize each successive model
- Explore only the subspace of potential alignments that lies within the same neighborhood as the initial alignments
70. IBM Model 4
- Model parameters:
- Same as Model 3, except it uses a more complicated model of reordering (for details, see Brown et al. 1993)
71. (No transcript)
72. IBM Model 1 -> Model 3
- Iterating over all possible alignments is computationally infeasible
- Solution: compute the best alignment with Model 1 and change some of the alignments to generate a set of likely alignments (pegging)
- Model 3 takes this restricted set of alignments as input
73. Pegging
- Given an alignment a, we can derive additional alignments from it by making small changes:
- Changing a link (j, i) to (j, i')
- Swapping a pair of links (j, i) and (j', i') to (j, i') and (j', i)
- The resulting set of alignments is called the neighborhood of a
74. IBM Model 3: Distortion
- The distortion factor determines how likely it is that an English word in position i aligns to a foreign word in position j, given the lengths of both sentences:
  d(j | i, l, m)
- Note: positions are absolute positions
75. Deficiency
- Problem with IBM Model 3: it assigns probability mass to impossible strings
- Well-formed string: "This is possible"
- Ill-formed but possible string: "This possible is"
- Impossible string: one in which several words occupy the same position
- Impossible strings are due to distortion values that generate different words at the same position
- Impossible strings can still be filtered out in later stages of the translation process
76. Limitations of IBM Models
- Only 1-to-N word mapping
- Handling fertility-zero words (difficult for decoding)
- Almost no syntactic information
- Word classes
- Relative distortion
- Long-distance word movement
- Fluency of the output depends entirely on the English language model
77. Decoding
- How to translate new sentences?
- A decoder uses the parameters learned on a parallel corpus:
- Translation probabilities
- Fertilities
- Distortions
- In combination with a language model, the decoder generates the most likely translation
- Standard algorithms can be used to explore the search space (A*, greedy search, ...)
- Similar to the traveling salesman problem
78. Three Problems for Statistical MT
- Language model
- Given an English string e, assigns P(e) by formula
- good English string -> high P(e)
- random word sequence -> low P(e)
- Translation model
- Given a pair of strings <f,e>, assigns P(f|e) by formula
- <f,e> look like translations -> high P(f|e)
- <f,e> don't look like translations -> low P(f|e)
- Decoding algorithm
- Given a language model, a translation model, and a new sentence f, find the translation e maximizing P(e) P(f|e)
Slide from Kevin Knight
79. The Classic Language Model: Word N-Grams
- Goal of the language model: choose among
- He is on the soccer field
- He is in the soccer field
- Is table the on cup the
- The cup is on the table
- Rice shrine
- American shrine
- Rice company
- American company
Slide from Kevin Knight
80. Intuition of Phrase-Based Translation (Koehn et al. 2003)
- Generative story has three steps:
- Group words into phrases
- Translate each phrase
- Move the phrases around
81. Generative story again
- Group English source words into phrases e1, e2, ..., en
- Translate each English phrase ei into a Spanish phrase fj
- The probability of doing this is phi(fj|ei)
- Then (optionally) reorder each Spanish phrase
- We do this with a distortion probability
- A measure of distance between the positions of a corresponding phrase in the two languages
- What is the probability that a phrase in position X in the English sentence moves to position Y in the Spanish sentence?
82. Distortion probability
- The distortion probability is parameterized by a_i - b_{i-1}
- where a_i is the start position of the foreign (Spanish) phrase generated by the i-th English phrase e_i
- and b_{i-1} is the end position of the foreign (Spanish) phrase generated by the (i-1)-th English phrase e_{i-1}
- We'll call the distortion probability d(a_i - b_{i-1})
- And we'll have a really stupid model:
- d(a_i - b_{i-1}) = alpha^|a_i - b_{i-1} - 1|
- where alpha is some small constant (see the sketch below)
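A sketch of that model. It follows the parameterization above, where the exponent is zero (so d = 1) when phrase i starts right after phrase i-1 ends, i.e. monotone order is not penalized; the value of alpha here is an arbitrary made-up constant:

```python
def distortion(a_i, b_prev, alpha=0.1):
    """d(a_i - b_{i-1}) = alpha ** |a_i - b_{i-1} - 1|."""
    return alpha ** abs(a_i - b_prev - 1)

print(distortion(a_i=4, b_prev=3))  # adjacent, monotone phrases: 1.0
print(distortion(a_i=7, b_prev=3))  # a 3-position jump: 0.1**3 = 0.001
```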
83. Final translation model for phrase-based MT
  P(F|E) = prod_i phi(f_i|e_i) d(a_i - b_{i-1})
- Let's look at a simple example with no distortion
84. Phrase-based MT
- Language model P(E)
- Translation model P(F|E)
- Model
- How to train the model
- Decoder: finding the sentence E that is most probable
85. Training P(F|E)
- What we mainly need to train is phi(fj|ei)
- Suppose we had a large bilingual training corpus:
- A bitext
- In which each English sentence is paired with a Spanish sentence
- And suppose we knew exactly which phrase in Spanish was the translation of which phrase in the English
- We call this a phrase alignment
- If we had this, we could just count-and-divide:
  phi(f|e) = count(f, e) / count(e)
86. But we don't have phrase alignments
- What we have instead are word alignments
87. Getting phrase alignments
- To get phrase alignments:
- We first get word alignments
- Then we "symmetrize" the word alignments into phrase alignments
88. How to get Word Alignments
- Word alignment: a mapping between the source words and the target words in a set of parallel sentences
- Restriction: each foreign word comes from exactly one English word
- Advantage: we can represent an alignment by the index of the English word that the French word comes from
- The alignment above is thus 2,3,4,5,6,6,6 (see the sketch below)
89. One addition: spurious words
- A word in the foreign sentence
- that doesn't align with any word in the English sentence
- is called a spurious word
- We model these by pretending they are generated by an English word e0
90. More sophisticated models of alignment
91. Computing word alignments: IBM Model 1
- For phrase-based machine translation:
- We want a word alignment
- To extract a set of phrases
- A word alignment algorithm gives us P(F,E)
- We want this to train our phrase probabilities phi(fj|ei) as part of P(F|E)
- But a word-alignment algorithm can also be part of a mini-translation model itself
92. IBM Model 1
93. IBM Model 1
94. How does the generative story assign P(F|E) for a Spanish sentence F?
- Terminology:
- Suppose we had done steps 1 and 2, i.e. we already knew the Spanish length J and the alignment A (and the English source E)
95. Let's formalize steps 1 and 2
- We want P(A|E) of an alignment A (of length J) given an English sentence E
- IBM Model 1 makes the (very) simplifying assumption that each alignment is equally likely
- How many possible alignments are there between an English sentence of length I and a Spanish sentence of length J?
- Hint: each Spanish word must come from one of the English source words (or the NULL word)
- (I+1)^J
- Let's assume the probability of choosing length J is a small constant epsilon
96. Model 1 continued
- Probability of choosing a length and then one of the possible alignments:
  P(A|E) = epsilon / (I+1)^J
- Combining with step 3:
  P(F, A|E) = epsilon / (I+1)^J * prod_{j=1..J} t(f_j | e_{a_j})
- The total probability of a given foreign sentence F:
  P(F|E) = sum_A P(F, A|E) = epsilon / (I+1)^J * prod_{j=1..J} sum_{i=0..I} t(f_j | e_i)
- (see the sketch below)
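The last equality is what makes Model 1 tractable: the sum over all (I+1)^J alignments factors into a product of per-word sums. A sketch, with `t` an assumed translation table and epsilon the length constant:

```python
def model1_total_prob(F, E, t, epsilon=1.0):
    """P(F|E) = epsilon/(I+1)^J * prod_j sum_i t(f_j | e_i), with e_0 = NULL."""
    E = ["NULL"] + E
    I, J = len(E) - 1, len(F)
    p = epsilon / (I + 1) ** J
    for f in F:
        p *= sum(t.get((f, e), 0.0) for e in E)  # sum over alignment choices
    return p
```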
97. Decoding
- How do we find the best A?
- For Model 1 the alignment posterior factorizes, so each a_j can be picked independently: a_j = argmax_i t(f_j | e_i)
98. Training alignment probabilities
- Step 1: get a parallel corpus
- Hansards:
- Canadian parliamentary proceedings, in French and English
- Hong Kong Hansards: English and Chinese
- Step 2: sentence alignment
- Step 3: use EM (Expectation Maximization) to train word alignments
99. Step 1: Parallel corpora
- Example from DE-News (8/1/1996)
English: Diverging opinions about planned tax reform
German: Unterschiedliche Meinungen zur geplanten Steuerreform
English: The discussion around the envisaged major tax reform continues .
German: Die Diskussion um die vorgesehene grosse Steuerreform dauert an .
English: The FDP economics expert , Graf Lambsdorff , today came out in favor of advancing the enactment of significant parts of the overhaul , currently planned for 1999 .
German: Der FDP - Wirtschaftsexperte Graf Lambsdorff sprach sich heute dafuer aus , wesentliche Teile der fuer 1999 geplanten Reform vorzuziehen .
Slide from Christof Monz
100. Step 2: Sentence Alignment
- The old man is happy. He has fished many times. His wife talks to him. The fish are jumping. The sharks await.
- El viejo está feliz porque ha pescado muchos veces. Su mujer habla con él. Los tiburones esperan.
- Intuition:
- use length in words or chars
- together with dynamic programming
- or use a simpler MT model
Slide from Kevin Knight
101. Sentence Alignment
- The old man is happy.
- He has fished many times.
- His wife talks to him.
- The fish are jumping.
- The sharks await.
- El viejo está feliz porque ha pescado muchos veces. Su mujer habla con él. Los tiburones esperan.
Slide from Kevin Knight
102. Sentence Alignment
(Same example as the previous slide, with candidate alignment links drawn.)
Slide from Kevin Knight
103. Sentence Alignment
- The old man is happy. He has fished many times.
- His wife talks to him.
- The sharks await.
- El viejo está feliz porque ha pescado muchos veces.
- Su mujer habla con él.
- Los tiburones esperan.
Note that unaligned sentences are thrown out, and sentences are merged in n-to-m alignments (n, m > 0).
Slide from Kevin Knight
104. Step 3: word alignments
- It turns out we can bootstrap alignments
- from a sentence-aligned bilingual corpus
- using the Expectation-Maximization or EM algorithm
105. EM for training alignment probs
la maison -- la maison bleue -- la fleur
the house -- the blue house -- the flower
All word alignments equally likely; all P(french-word | english-word) equally likely
Slide from Kevin Knight
106. EM for training alignment probs
la maison -- la maison bleue -- la fleur
the house -- the blue house -- the flower
"la" and "the" observed to co-occur frequently, so P(la | the) is increased.
Slide from Kevin Knight
107. EM for training alignment probs
la maison -- la maison bleue -- la fleur
the house -- the blue house -- the flower
"house" co-occurs with both "la" and "maison", but P(maison | house) can be raised without limit, to 1.0, while P(la | house) is limited because "la" also co-occurs with "the" (pigeonhole principle)
Slide from Kevin Knight
108. EM for training alignment probs
la maison -- la maison bleue -- la fleur
the house -- the blue house -- the flower
settling down after another iteration
Slide from Kevin Knight
109. EM for training alignment probs
la maison -- la maison bleue -- la fleur
the house -- the blue house -- the flower
- Inherent hidden structure revealed by EM training!
- For details, see:
- Section 24.6.1 in the chapter
- "A Statistical MT Tutorial Workbook" (Knight, 1999)
- "The Mathematics of Statistical Machine Translation" (Brown et al., 1993)
- Software: GIZA++
Slide from Kevin Knight
110. Statistical Machine Translation
la maison -- la maison bleue -- la fleur
the house -- the blue house -- the flower
P(juste | fair) = 0.411, P(juste | correct) = 0.027, P(juste | right) = 0.020, ...
New French sentence -> possible English translations, to be rescored by the language model
Slide from Kevin Knight
111. A more complex model: IBM Model 3 (Brown et al., 1993)
Generative approach:
Mary did not slap the green witch
  -> n(3|slap)
Mary not slap slap slap the green witch
  -> P-Null
Mary not slap slap slap NULL the green witch
  -> t(la|the)
Maria no dió una bofetada a la verde bruja
  -> d(j|i)
Maria no dió una bofetada a la bruja verde
Probabilities can be learned from raw bilingual text.
112. How do we evaluate MT? Human tests for fluency
- Rating tests: give the raters a scale (1 to 5) and ask them to rate
- Or distinct scales for:
- Clarity, Naturalness, Style
- Or check for specific problems:
- Cohesion (lexical chains, anaphora, ellipsis)
- Hand-checking for cohesion
- Well-formedness
- 5-point scale of syntactic correctness
- Comprehensibility tests
- Noise test
- Multiple-choice questionnaire
- Readability tests
- cloze
113. How do we evaluate MT? Human tests for fidelity
- Adequacy
- Does it convey the information in the original?
- Ask raters to rate on a scale
- Bilingual raters: give them the source and target sentences, ask how much information is preserved
- Monolingual raters: give them the target plus a good human translation
- Informativeness
- Task-based: is there enough info to do some task?
- Give raters multiple-choice questions about content
114. Evaluating MT: Problems
- Asking humans to judge sentences on a 5-point scale for 10 factors takes time and money (weeks or months!)
- We can't build language engineering systems if we can only evaluate them once every quarter!!!!
- We need a metric that we can run every time we change our algorithm
- It would be OK if it wasn't perfect, but it should tend to correlate with the expensive human metrics, which we could still run quarterly
Bonnie Dorr
115. Automatic evaluation
- Miller and Beebe-Center (1958):
- Assume we have one or more human translations of the source passage
- Compare the automatic translation to these human translations
- BLEU
- NIST
- Meteor
- Precision/Recall
116. BiLingual Evaluation Understudy (BLEU; Papineni, 2001)
http://www.research.ibm.com/people/k/kishore/RC22176.pdf
- Automatic technique, but ...
- Requires the pre-existence of human (reference) translations
- Approach:
- Produce a corpus of high-quality human translations
- Judge closeness numerically (word-error rate)
- Compare n-gram matches between the candidate translation and 1 or more reference translations
Slide from Bonnie Dorr
117. BLEU Evaluation Metric (Papineni et al., ACL-2002)
Reference (human) translation: The U.S. island of Guam is maintaining a high state of alert after the Guam airport and its offices both received an e-mail from someone calling himself the Saudi Arabian Osama bin Laden and threatening a biological/chemical attack against public places such as the airport.
- N-gram precision (score is between 0 and 1)
- What percentage of machine n-grams can be found in the reference translation?
- An n-gram is a sequence of n words
- Not allowed to use the same portion of the reference translation twice (can't cheat by typing out "the the the the the")
- Brevity penalty
- Can't just type out the single word "the" (precision 1.0!)
- Amazingly hard to game the system (i.e., find a way to change machine output so that BLEU goes up, but quality doesn't)
Machine translation: The American ? international airport and its the office all receives one calls self the sand Arab rich business ? and so on electronic mail, which sends out; The threat will be able after public place and so on the airport to start the biochemistry attack, ? highly alerts after the maintenance.
Slide from Bonnie Dorr
118. BLEU Evaluation Metric (Papineni et al., ACL-2002)
Reference (human) translation: The U.S. island of Guam is maintaining a high state of alert after the Guam airport and its offices both received an e-mail from someone calling himself the Saudi Arabian Osama bin Laden and threatening a biological/chemical attack against public places such as the airport.
- BLEU4 formula (counts n-grams up to length 4; see the sketch below):
  exp(1.0 * log p1 + 0.5 * log p2 + 0.25 * log p3 + 0.125 * log p4
      - max(words-in-reference / words-in-machine - 1, 0))
- p1 = 1-gram precision
- p2 = 2-gram precision
- p3 = 3-gram precision
- p4 = 4-gram precision
Machine translation: The American ? international airport and its the office all receives one calls self the sand Arab rich business ? and so on electronic mail, which sends out; The threat will be able after public place and so on the airport to start the biochemistry attack, ? highly alerts after the maintenance.
Slide from Bonnie Dorr
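A sketch of the slide's BLEU4 formula, with its weighted n-gram terms and its brevity-penalty term; the choice of reference length when there are several references is an assumption here, since the slide shows a single reference:

```python
import math
from collections import Counter

def ngrams(words, n):
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def bleu4(candidate, references):
    """exp(sum_n w_n log p_n - max(ref_len/cand_len - 1, 0))."""
    score = 0.0
    for n, w in zip(range(1, 5), (1.0, 0.5, 0.25, 0.125)):
        cand = ngrams(candidate, n)
        # Clip each n-gram count at its max count in any single reference.
        max_ref = Counter()
        for ref in references:
            for g, c in ngrams(ref, n).items():
                max_ref[g] = max(max_ref[g], c)
        matched = sum(min(c, max_ref[g]) for g, c in cand.items())
        p_n = matched / max(sum(cand.values()), 1)
        score += w * math.log(p_n) if p_n > 0 else float("-inf")
    ref_len = min(len(r) for r in references)      # assumed convention
    score -= max(ref_len / len(candidate) - 1, 0)  # brevity penalty term
    return math.exp(score)
```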
119. Multiple Reference Translations
Slide from Bonnie Dorr
120. BLEU in Action
(Foreign original, in Chinese)
Reference translation: the gunman was shot to death by the police .
Machine translations:
1. the gunman was police kill .
2. wounded police jaya of
3. the gunman was shot dead by the police .
4. the gunman arrested by police kill .
5. the gunmen were killed .
6. the gunman was shot to death by the police .
7. gunmen were killed by police ?SUB>0 ?SUB>0
8. al by the police .
9. the ringer is killed by the police .
10. police killed the gunman .
Slide from Bonnie Dorr
121. BLEU in Action
(Same ten candidates as the previous slide, now color-coded:)
green = 4-gram match (good!); red = word not matched (bad!)
Slide from Bonnie Dorr
122. BLEU Comparison
Chinese-English Translation Example:
Candidate 1: It is a guide to action which ensures that the military always obeys the commands of the party.
Candidate 2: It is to insure the troops forever hearing the activity guidebook that party direct.
Reference 1: It is a guide to action that ensures that the military will forever heed Party commands.
Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party.
Reference 3: It is the practical guide for the army always to heed the directions of the party.
Slide from Bonnie Dorr
123. How Do We Compute BLEU Scores?
- Intuition: what percentage of words in the candidate occurred in some human translation?
- Proposal: count up the number of candidate translation words (unigrams) in any reference translation, divide by the total number of words in the candidate translation
- But we can't just count the total number of overlapping n-grams!
- Candidate: the the the the the the
- Reference 1: The cat is on the mat
- Solution: a reference word should be considered exhausted after a matching candidate word is identified
Slide from Bonnie Dorr
124. Modified n-gram precision
- For each word, compute:
- (1) the total number of times it occurs in any single reference translation
- (2) the number of times it occurs in the candidate translation
- Instead of using count (2), use the minimum of (2) and (1), i.e. clip the counts at the max for the reference translation
- Now use that modified count
- And divide by the number of candidate words (see the sketch below)
Slide from Bonnie Dorr
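A sketch of the clipping rule; the "Catching Cheaters" example a few slides below comes out to 2/7 with it:

```python
from collections import Counter

def modified_precision(candidate, references, n=1):
    """Clipped n-gram precision: each candidate n-gram counts at most as
    often as it appears in any single reference translation."""
    def grams(s):
        return Counter(tuple(s[i:i + n]) for i in range(len(s) - n + 1))
    cand = grams(candidate)
    clipped = sum(min(c, max(grams(r)[g] for r in references))
                  for g, c in cand.items())
    return clipped / sum(cand.values())

cand = "the the the the the the the".split()
refs = ["the cat is on the mat".split(),
        "there is a cat on the mat".split()]
print(modified_precision(cand, refs))  # 2/7 ~ 0.2857
```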
125. Modified Unigram Precision: Candidate 1
It(1) is(1) a(1) guide(1) to(1) action(1) which(1) ensures(1) that(2) the(4) military(1) always(1) obeys(0) the commands(1) of(1) the party(1)
Reference 1: It is a guide to action that ensures that the military will forever heed Party commands.
Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party.
Reference 3: It is the practical guide for the army always to heed the directions of the party.
What's the answer??? 17/18
Slide from Bonnie Dorr
126. Modified Unigram Precision: Candidate 2
It(1) is(1) to(1) insure(0) the(4) troops(0) forever(1) hearing(0) the activity(0) guidebook(0) that(2) party(1) direct(0)
Reference 1: It is a guide to action that ensures that the military will forever heed Party commands.
Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party.
Reference 3: It is the practical guide for the army always to heed the directions of the party.
What's the answer??? 8/14
Slide from Bonnie Dorr
127. Modified Bigram Precision: Candidate 1
It is(1) is a(1) a guide(1) guide to(1) to action(1) action which(0) which ensures(0) ensures that(1) that the(1) the military(1) military always(0) always obeys(0) obeys the(0) the commands(0) commands of(0) of the(1) the party(1)
Reference 1: It is a guide to action that ensures that the military will forever heed Party commands.
Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party.
Reference 3: It is the practical guide for the army always to heed the directions of the party.
What's the answer??? 10/17
Slide from Bonnie Dorr
128. Modified Bigram Precision: Candidate 2
It is(1) is to(0) to insure(0) insure the(0) the troops(0) troops forever(0) forever hearing(0) hearing the(0) the activity(0) activity guidebook(0) guidebook that(0) that party(0) party direct(0)
Reference 1: It is a guide to action that ensures that the military will forever heed Party commands.
Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party.
Reference 3: It is the practical guide for the army always to heed the directions of the party.
What's the answer??? 1/13
Slide from Bonnie Dorr
129. Catching Cheaters
the(2) the the the(0) the(0) the(0) the(0)
Reference 1: The cat is on the mat
Reference 2: There is a cat on the mat
What's the unigram answer? 2/7
What's the bigram answer? 0/7
Slide from Bonnie Dorr
130. BLEU distinguishes human from machine translations
Slide from Bonnie Dorr
131. BLEU: problems with sentence length
- Candidate: "of the"
- Solution: the brevity penalty prefers candidate translations which are the same length as one of the references
Reference 1: It is a guide to action that ensures that the military will forever heed Party commands.
Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party.
Reference 3: It is the practical guide for the army always to heed the directions of the party.
Problem: modified unigram precision is 2/2, bigram 1/1!
Slide from Bonnie Dorr
132. BLEU Tends to Predict Human Judgments
(Figure: a variant of BLEU plotted against human judgments.)
slide from G. Doddington (NIST)
133. Summary
- Intro and a little history
- Language Similarities and Divergences
- Four main MT Approaches
- Transfer
- Interlingua
- Direct
- Statistical
- Evaluation
134. Classes
- LINGUIST 139M/239M. Human and Machine Translation (Martin Kay)
- CS 224N. Natural Language Processing (Chris Manning)