Word-based SMT - PowerPoint PPT Presentation

Provided by: coursesWa5

Learn more at: http://courses.washington.edu

1
Word-based SMT

Ling 575
Fei Xia
Week 3 1/16/06

2
Outline

General concepts
Source channel model
Notations
Word alignment
Model 1 and 2
Model 3 and 4
Model 5

3
Expectation

This part is very heavy in math, at least for non-math majors.
You need to understand the general concepts (e.g., source channel, alignment, iterative training, distortion, ...).
It is OK if you don't follow all the math.
If you get all the math right, maybe you can join their company.

4
General concepts
5
IBM Model Basics

Classic paper: Brown et al. (1993)
Translation: F → E (or Fr → Eng)
Resource required:
  Parallel data (a set of sentence pairs)
Main concepts:
  Source channel model
  Hidden word alignment
  EM training

6
Intuition

Given sentence pairs, assume the word mapping is one-to-one.
(1) S: a b c d e
    T: l m n o p
(2) S: c a e
    T: p n m
(3) S: d a c
    T: n p l
⇒ (b, o), (d, l), (e, m), and either
  (a, p), (c, n), or
  (a, n), (c, p)

7
Source channel model

Task: S → T
Source channel (a.k.a. noisy channel, noisy source channel): use Bayes' rule.
Two types of parameters:
  P(T): language model
  P(S | T): its meaning varies with the task.
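Written out, the Bayes-rule step looks as follows. This is the standard noisy-channel derivation rather than a formula copied from the slide; \hat{T} is a name introduced here for the output that gets chosen.

```latex
\hat{T} \;=\; \arg\max_{T} P(T \mid S)
        \;=\; \arg\max_{T} \frac{P(T)\, P(S \mid T)}{P(S)}
        \;=\; \arg\max_{T} P(T)\, P(S \mid T)
```

P(S) can be dropped because it does not depend on T.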

8
Source channel model for MT
[Diagram: P(T) generates the Tgt sent, which passes through the noisy channel P(S | T) and comes out as the Src sent.]

Two types of parameters:
  Language model: P(T)
  Translation model: P(S | T)

9
Source channel model for MT
[Diagram: P(E) generates the Eng sent, which passes through the noisy channel P(F | E) and comes out as the Fr sent.]

Two types of parameters:
  Language model: P(E)
  Translation model: P(F | E)

10
Source channel for MT

People think in English.
English thoughts can be characterized by a plausibility filter P(E).
Sentences are corrupted into a different language by a translation model P(F | E).
Our goal is to find the original, uncorrupted English sentence e. To achieve this goal, we efficiently evaluate P(E) P(F | E) over many candidate Eng sentences.

11
Source channel vs. direct model

Source channel: demands plausible Eng and a strong correlation between e and f.
Direct model: demands a strong correlation between e and f.
Question: are the two models the same?
  Formally, they are the same.
  In practice, they are not, due to different approximations.

12
Word alignment

A function from F position to E position:
a(j) = i ⇔ aj = i
a = (a1, ..., am)
Ex:
F: f1 f2 f3 f4 f5
E: e1 e2 e3 e4
a4 = 3
a = (0, 1, 1, 3, 2)
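A minimal sketch of how the alignment vector from the example can be stored and read. The list names and the `links` helper are illustrative only and do not come from the slides.

```python
# Toy sentences from the slide: five Fr words, four Eng words plus NULL (e0).
french = ["f1", "f2", "f3", "f4", "f5"]
english = ["e0", "e1", "e2", "e3", "e4"]  # index 0 is the NULL word

# a[j-1] holds a_j, the Eng position that generates the j-th Fr word.
a = (0, 1, 1, 3, 2)


def links(french, english, a):
    """Return the (Fr word, Eng word) pairs implied by the alignment vector."""
    return [(f, english[i]) for f, i in zip(french, a)]


pairs = links(french, english, a)
for f, e in pairs:
    print(f, "<-", e)
```

Here f1 is generated by the NULL word, and both f2 and f3 are generated by e1, which the constraint allows because it only restricts the Fr side.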

13
The constraint on word alignment

The constraint: each Fr word is generated by exactly one Eng word (including e0). l is the Eng sent length, m is the Fr sent length.
Number of possible alignments:
  Without the constraint: 2^(lm).
  With the constraint: (l+1)^m.
Why do the models use the constraint?
  We want to use P(fj | ei) to estimate P(F | E).
How to handle the exceptional cases?
  Various methods: target word grouping, phrase-based SMT, etc.
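To get a feel for the two counts, here is a small sketch. Without the constraint, each of the l×m possible links is either present or absent; with it, each of the m Fr words picks one of l+1 Eng positions. The function name is made up for illustration.

```python
def alignment_counts(l, m):
    """Number of alignments without and with the one-Eng-word-per-Fr-word constraint."""
    unconstrained = 2 ** (l * m)   # every (i, j) link is independently on or off
    constrained = (l + 1) ** m     # each Fr position chooses one of e0..el
    return unconstrained, constrained


# Sentence lengths from the earlier alignment example: l = 4, m = 5.
unconstrained, constrained = alignment_counts(l=4, m=5)
print(unconstrained, constrained)
```

Even for this tiny sentence pair the constraint shrinks the space from about a million alignments to a few thousand.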

14
Modeling P(F | E) with alignment
15
Notation

E: the Eng sentence, E = e1 ... el
ei: the i-th Eng word.
F: the Fr sentence, F = f1 ... fm
fj: the j-th Fr word.
e0: the Eng NULL word.
f0: the Fr NULL word.
aj: the position of the Eng word that generates fj.

16
Word alignment

An alignment, a, is a function from Fr word position to Eng word position: a(j) = i means that fj is generated by ei.
The constraint: each Fr word is generated by exactly one Eng word (including e0).

17
Notation (cont)

l: Eng sent length
m: Fr sent length
i: Eng word position
j: Fr word position
e: an Eng word
f: a Fr word

18
Outline

General concepts
Source channel model
Word alignment
Notations
Model 1-2
Model 3-4

19
Model 1 and 2
20
Model 1 and 2

Modeling
Generative process
Decomposition
Formula and types of parameters
Training
Finding the best alignment
Decoding: find the translation

21
Generative process

To generate F from E:
  Pick a length m for F, with prob P(m | l).
  Choose an alignment a, with prob P(a | E, m).
  Generate the Fr sent given the Eng sent and the alignment, with prob P(F | E, a, m).
Another way to look at it:
  Pick a length m for F, with prob P(m | l).
  For j = 1 to m:
    Pick an Eng word index aj, with prob P(aj | j, m, l).
    Pick a Fr word fj according to the Eng word ei, where i = aj, with prob P(fj | ei).
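The second generative story above can be sketched as a sampler. Everything concrete here is an assumption made for illustration: the toy probability tables, the function and parameter names, and the uniform choice of P(aj | j, m, l), which is the Model 1 setting.

```python
import random


def generate_french(english, length_prob, align_prob, t, rng):
    """Sample (F, a) from E.

    english     : list of Eng words, with english[0] the NULL word e0
    length_prob : dict m -> P(m | l) for this l (hypothetical toy table)
    align_prob  : function (i, j, m, l) -> P(a_j = i | j, m, l)
    t           : dict Eng word -> {Fr word: t(f | e)}
    """
    l = len(english) - 1  # NULL does not count toward the Eng length
    lengths = list(length_prob)
    m = rng.choices(lengths, weights=[length_prob[k] for k in lengths])[0]
    french, a = [], []
    for j in range(1, m + 1):
        positions = list(range(l + 1))
        weights = [align_prob(i, j, m, l) for i in positions]
        i = rng.choices(positions, weights=weights)[0]      # pick a_j
        words = list(t[english[i]])
        f = rng.choices(words, weights=[t[english[i]][w] for w in words])[0]
        french.append(f)                                    # pick f_j given e_i
        a.append(i)
    return french, a


# Hypothetical toy parameters, invented purely for illustration.
english = ["NULL", "b", "c"]
length_prob = {1: 0.2, 2: 0.6, 3: 0.2}
t = {
    "NULL": {"x": 0.5, "y": 0.5},
    "b": {"x": 0.1, "y": 0.9},
    "c": {"x": 0.8, "y": 0.2},
}


def uniform(i, j, m, l):
    """Model 1 choice: every Eng position is equally likely."""
    return 1.0 / (l + 1)


french, a = generate_french(english, length_prob, uniform, t, random.Random(0))
print(french, a)
```

Swapping `uniform` for a position-dependent function would give the Model 2 version of the same story.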

22
Decomposition
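The formula on this slide did not survive in the transcript. A reconstruction that follows the three steps of the generative process (length, alignment, words) would read as below; treat it as a sketch consistent with those steps, not as the slide's exact rendering.

```latex
P(F \mid E) \;=\; \sum_{a} P(F, a \mid E)
            \;=\; \sum_{a} P(m \mid l)\; P(a \mid E, m)\; P(F \mid E, a, m)
```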
23
Approximation

Fr sent length depends only on Eng sent length
Fr word depends only on the Eng word that
generates it

24
Approximation (cont)

Estimating P(a | E, m):
  Model 1: all alignments are equally likely.
  Model 2: alignments have different probs.
Model 1 can be seen as a special case of Model 2, where P(aj | j, m, l) = 1 / (l + 1) for every Eng position aj.

25
Decomposition for Model 1
26
The magic (for Model 1)
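The slide's formula is missing from the transcript. The usual Model 1 "trick", stated here as a reconstruction under the approximations listed earlier, is that the sum over all alignments of a product factors into a product of per-position sums:

```latex
P(F \mid E) \;=\; \frac{P(m \mid l)}{(l+1)^{m}} \sum_{a_1=0}^{l} \cdots \sum_{a_m=0}^{l} \; \prod_{j=1}^{m} t(f_j \mid e_{a_j})

\sum_{a_1=0}^{l} \cdots \sum_{a_m=0}^{l} \; \prod_{j=1}^{m} t(f_j \mid e_{a_j})
  \;=\; \prod_{j=1}^{m} \sum_{i=0}^{l} t(f_j \mid e_i)

P(F \mid E) \;=\; \frac{P(m \mid l)}{(l+1)^{m}} \prod_{j=1}^{m} \sum_{i=0}^{l} t(f_j \mid e_i)
```

The factoring works because each a_j appears in exactly one factor of the product, so the cost drops from (l+1)^m terms to m(l+1) terms.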
27
Final formula and parameters for Model 1

Two types of parameters:
  Length prob: P(m | l)
  Translation prob: P(fj | ei), or t(fj | ei)
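As a sanity check on the Model 1 formula, the sketch below computes P(F | E) twice: once by summing over every alignment, and once with the factored product-of-sums form. The translation table and the value used for P(m | l) are invented toy numbers, and the function names are hypothetical.

```python
from itertools import product
from math import isclose, prod


def model1_brute_force(french, english, t, length_prob):
    """Sum P(F, a | E) over all (l+1)^m alignments."""
    l, m = len(english) - 1, len(french)
    total = 0.0
    for a in product(range(l + 1), repeat=m):
        total += prod(t[english[i]][f] for f, i in zip(french, a))
    return length_prob * total / (l + 1) ** m


def model1_factored(french, english, t, length_prob):
    """Product over Fr positions of a sum over Eng positions."""
    l, m = len(english) - 1, len(french)
    value = prod(sum(t[e][f] for e in english) for f in french)
    return length_prob * value / (l + 1) ** m


# Hypothetical toy table t[e][f] = t(f | e); english[0] is the NULL word.
english = ["NULL", "b", "c"]
french = ["x", "y"]
t = {
    "NULL": {"x": 0.5, "y": 0.5},
    "b": {"x": 0.1, "y": 0.9},
    "c": {"x": 0.8, "y": 0.2},
}
p_m_given_l = 0.6  # assumed value of P(m=2 | l=2)

slow = model1_brute_force(french, english, t, p_m_given_l)
fast = model1_factored(french, english, t, p_m_given_l)
print(slow, fast)
```

Both routes give the same number; only the second stays tractable as sentences grow.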

28
Decomposition for Model 2

Same as Model 1 except that Model 2 does not
assume all alignments are equally likely.

29
The magic for Model 2
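The formula on this slide is missing from the transcript. Assuming the same factoring argument as in Model 1, with the distortion prob d(i | j, m, l) replacing the uniform 1/(l+1), a reconstruction is:

```latex
\sum_{a_1=0}^{l} \cdots \sum_{a_m=0}^{l} \; \prod_{j=1}^{m} t(f_j \mid e_{a_j})\, d(a_j \mid j, m, l)
  \;=\; \prod_{j=1}^{m} \sum_{i=0}^{l} t(f_j \mid e_i)\, d(i \mid j, m, l)

P(F \mid E) \;=\; P(m \mid l) \prod_{j=1}^{m} \sum_{i=0}^{l} t(f_j \mid e_i)\, d(i \mid j, m, l)
```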
30
Final formula and parameters for Model 2

Three types of parameters:
  Length prob: P(m | l)
  Translation prob: t(fj | ei)
  Distortion prob: d(i | j, m, l)

31
Summary of Modeling
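Model 1: P(F | E) = P(m | l) / (l+1)^m × Π_{j=1..m} Σ_{i=0..l} t(fj | ei)
Model 2: P(F | E) = P(m | l) × Π_{j=1..m} Σ_{i=0..l} t(fj | ei) d(i | j, m, l)

Parameters:
  Length prob: P(m | l)
  Translation prob: t(fj | ei)
  Distortion prob (for Model 2): d(i | j, m, l)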

32
Model 1 and 2

Modeling
Generative process
Decomposition
Formula and types of parameters
Training
Finding the best alignment
Decoding

33
Training

Mathematically motivated:
  Having an objective function to optimize
  Using several clever tricks
The resulting formulae
  are intuitively expected
  can be calculated efficiently
EM algorithm:
  Hill climbing; each iteration is guaranteed to improve (or at least not decrease) the objective function.
  It is not guaranteed to reach the global optimum.

34
Length prob P(j | i)

Let Ct(j, i) be the number of sentence pairs where the Fr length is j and the Eng length is i.
Length prob: P(j | i) = Ct(j, i) / Σ_j' Ct(j', i)
No need for iterations.
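A minimal sketch of the relative-frequency estimate of the length prob. The three-pair corpus and the function name are hypothetical; only the counting scheme comes from the slide.

```python
from collections import Counter


def estimate_length_prob(sentence_pairs):
    """Estimate P(j | i) from a list of (Eng tokens, Fr tokens) pairs."""
    # Ct(j, i): number of pairs with Fr length j and Eng length i.
    ct = Counter((len(fr), len(en)) for en, fr in sentence_pairs)
    totals = Counter()
    for (j, i), n in ct.items():
        totals[i] += n
    return {(j, i): n / totals[i] for (j, i), n in ct.items()}


# Hypothetical toy corpus, invented for illustration.
pairs = [
    (["b", "c"], ["x", "y"]),
    (["b"], ["y"]),
    (["b", "c"], ["x"]),
]
length_prob = estimate_length_prob(pairs)
print(length_prob)
```

Because sentence lengths are observed directly, a single pass over the data is enough.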

35
Estimating t(f | e): a naïve approach

A naïve approach:
  Count the times that f appears in F and e appears in E.
  Count the times that e appears in E.
  Divide the 1st number by the 2nd number.
Problem:
  It cannot distinguish true translations from pure coincidence.
  Ex: t(el | white) vs. t(blanco | white)
Solution: count the times that f aligns to e.

36
Estimating t(f | e) in Model 1

When each sent pair has a unique word alignment
When each sent pair has several word alignments with prob
When there are no word alignments

37
When there is a single word alignment

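We can simply count.
Training data:
  Sent 1: Eng b c, Fr x y
  Sent 2: Eng b, Fr y
  (Alignment links, as implied by the counts: b-y and c-x in Sent 1, b-y in Sent 2.)
Prob:
  ct(x, b) = 0, ct(y, b) = 2, ct(x, c) = 1, ct(y, c) = 0
  t(x | b) = 0, t(y | b) = 1.0, t(x | c) = 1.0, t(y | c) = 0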

38
When there are several word alignments

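If a sent pair has several word alignments, use fractional counts.
Training data:
  Sent 1: Eng b c, Fr x y, with four alignments whose probs P(a | E, F) are 0.3, 0.2, 0.4, and 0.1
  Sent 2: Eng b, Fr y, with one alignment whose prob is 1.0
  [Alignment diagrams are not recoverable from the transcript.]
Prob:
  Ct(x, b) = 0.7, Ct(y, b) = 1.5, Ct(x, c) = 0.3, Ct(y, c) = 0.5
  P(x | b) = 7/22, P(y | b) = 15/22, P(x | c) = 3/8, P(y | c) = 5/8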
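The fractional counts can be reproduced with a few lines of code. One caveat: the slide's alignment drawings are lost, so the assignment of link sets to the probs 0.3, 0.2, 0.4, 0.1 below is a guess that happens to be consistent with the reported counts; the function name is also made up.

```python
from collections import defaultdict
from fractions import Fraction


def fractional_counts(weighted_alignments):
    """Accumulate Ct(f, e) from (prob, links) pairs, then normalize per Eng word."""
    ct = defaultdict(Fraction)
    for prob, links in weighted_alignments:
        for f, e in links:
            ct[(f, e)] += prob
    totals = defaultdict(Fraction)
    for (f, e), c in ct.items():
        totals[e] += c
    t = {(f, e): c / totals[e] for (f, e), c in ct.items()}
    return ct, t


F = Fraction
# ASSUMPTION: which link set carries which prob is inferred, not read off the slide.
data = [
    (F(3, 10), [("x", "b"), ("y", "b")]),
    (F(2, 10), [("x", "c"), ("y", "b")]),
    (F(4, 10), [("x", "b"), ("y", "c")]),
    (F(1, 10), [("x", "c"), ("y", "c")]),
    (F(1), [("y", "b")]),  # the single alignment of the second sent pair
]
ct, t = fractional_counts(data)
for key in sorted(t):
    print(key, ct[key], t[key])
```

Exact fractions are used so the results can be compared with the slide's 7/22, 15/22, 3/8, and 5/8 without rounding.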

39
Fractional counts

Let Ct(f, e) be the fractional count of the (f, e) pair in the training data, given alignment prob P.

Ct(f, e) = Σ_(E,F) Σ_a P(a | E, F) × count(f, e; a, E, F)
  P(a | E, F): alignment prob
  count(f, e; a, E, F): actual count of times e and f are linked in (E, F) by alignment a
40
When there are no word alignments

We could list all the alignments, and estimate P(a | E, F).

41
Formulae so far
⇒ New estimate for t(f | e)
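The formulae themselves are missing from the transcript. Under the Model 1 assumptions, a reconstruction of the chain from alignment probs to the new estimate would be as follows; the Kronecker-delta notation is a choice made here, not necessarily the slide's.

```latex
P(a \mid E, F) \;=\; \frac{P(F, a \mid E)}{\sum_{a'} P(F, a' \mid E)}
               \;=\; \frac{\prod_{j=1}^{m} t(f_j \mid e_{a_j})}{\sum_{a'} \prod_{j=1}^{m} t(f_j \mid e_{a'_j})}

Ct(f, e) \;=\; \sum_{(E,F)} \sum_{a} P(a \mid E, F) \sum_{j=1}^{m} \delta(f, f_j)\, \delta(e, e_{a_j})

t(f \mid e) \;=\; \frac{Ct(f, e)}{\sum_{f'} Ct(f', e)}
```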
42
The algorithm

1. Start with an initial estimate of t(f | e), e.g., a uniform distribution.
2. Calculate P(a | F, E).
3. Calculate Ct(f, e), then normalize to get t(f | e).
4. Repeat Steps 2-3 until the improvement is too small.

43
So far, we estimate t(f | e) by enumerating all possible alignments

This process is very expensive, as the number of all possible alignments is (l+1)^m.

[Formula annotations: the alignment prob is computed from the previous iteration's estimate of t(f | e); it weights the actual count of times e and f are linked in (E, F) by alignment a.]
44
No need to enumerate all word alignments

Luckily, for Model 1, there is a way to calculate
Ct(f, e) efficiently.
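A sketch of the efficient count collection, assuming the standard Model 1 shortcut: for each Fr token, the posterior that Eng word e generated it is t(f | e) divided by the sum of t(f | e') over the Eng words in the sentence. The function name is hypothetical, and the NULL word is left out to match the toy data used later in the deck.

```python
from collections import defaultdict
from fractions import Fraction


def model1_em_step(corpus, t):
    """One EM iteration for Model 1 without enumerating alignments.

    corpus : list of (Eng words, Fr words); prepend a NULL token to the Eng
             side yourself if e0 is wanted (omitted in the toy data below).
    t      : dict (f, e) -> current t(f | e)
    """
    ct = defaultdict(Fraction)
    for english, french in corpus:
        for f in french:
            z = sum(t[(f, e)] for e in english)  # normalizer for this Fr token
            for e in english:
                ct[(f, e)] += t[(f, e)] / z      # posterior that e generated f
    totals = defaultdict(Fraction)
    for (f, e), c in ct.items():
        totals[e] += c
    return {(f, e): c / totals[e] for (f, e), c in ct.items()}


corpus = [(["b", "c"], ["x", "y"]), (["b"], ["y"])]
half = Fraction(1, 2)
t0 = {(f, e): half for f in "xy" for e in "bc"}
t1 = model1_em_step(corpus, t0)
for key in sorted(t1):
    print(key, t1[key])
```

Starting from the uniform table, one step gives t(x | b) = 1/4, t(y | b) = 3/4, t(x | c) = t(y | c) = 1/2. Later iterations of this unrestricted version will not match a hand calculation that allows only one-to-one alignments, since it sums over all alignments.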

45
The algorithm

1. Start with an initial estimate of t(f | e), e.g., a uniform distribution.
2. Calculate P(a | F, E).
3. Calculate Ct(f, e), then normalize to get t(f | e).
4. Repeat Steps 2-3 until the improvement is too small.

46
Estimating t(f | e) in Model 2

Ct(f, e) is slightly different from the one in
Model 1

47
Estimating d(i | j, m, l) in Model 2

Let Ct(i, j, m, l) be the fractional count that
Fr position j is linked to the Eng position i.
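The Model 2 count formulae for this slide and the previous one are missing from the transcript. A reconstruction consistent with the Model 2 parameters, using a per-position posterior, is sketched below. The posterior notation P(a_j = i | E, F) is introduced here for convenience.

```latex
P(a_j = i \mid E, F) \;=\; \frac{t(f_j \mid e_i)\, d(i \mid j, m, l)}{\sum_{i'=0}^{l} t(f_j \mid e_{i'})\, d(i' \mid j, m, l)}

Ct(f, e) \;=\; \sum_{(E,F)} \sum_{j=1}^{m} \sum_{i=0}^{l} P(a_j = i \mid E, F)\; \delta(f, f_j)\, \delta(e, e_i)

Ct(i, j, m, l) \;=\; \sum_{(E,F):\, |F| = m,\ |E| = l} P(a_j = i \mid E, F)

d(i \mid j, m, l) \;=\; \frac{Ct(i, j, m, l)}{\sum_{i'=0}^{l} Ct(i', j, m, l)}
```

The only change from Model 1 in Ct(f, e) is the extra factor d(i | j, m, l) inside the posterior.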

48
The algorithm

1. Start with an initial estimate of t(f | e), e.g., a uniform distribution.
2. Calculate P(a | F, E).
3. Calculate Ct(f, e), then normalize to get t(f | e).
4. Repeat Steps 2-3 until the improvement is too small.

49
An example of Model 1 training

Training data:
  Sent 1: Eng b c, Fr x y
  Sent 2: Eng b, Fr y
To reduce the number of alignments, assume that each Eng word generates exactly one Fr word ⇒ two possible alignments for Sent 1, and one for Sent 2.
Step 1: Initial t(f | e): t(x | b) = t(y | b) = 1/2, t(x | c) = t(y | c) = 1/2

50
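Repeating step 2: calculating P(a | F, E)

Alignments: a1 = {b-x, c-y}, a2 = {b-y, c-x} (Sent 1); a3 = {b-y} (Sent 2)
Before normalization (Z is the normalizer):
P(a1 | E1, F1) × Z = 1/4 × 1/2 = 1/8
P(a2 | E1, F1) × Z = 3/4 × 1/2 = 3/8
P(a3 | E2, F2) × Z = 3/4
After normalization:
P(a1 | E1, F1) = 1/8 / (1/8 + 3/8) = 1/4
P(a2 | E1, F1) = 3/8 / (4/8) = 3/4
P(a3 | E2, F2) = 3/4 / (3/4) = 1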

51
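Step 2: calculating P(a | F, E)

Alignments: a1 = {b-x, c-y}, a2 = {b-y, c-x} (Sent 1); a3 = {b-y} (Sent 2)
Before normalization (Z is the normalizer):
P(a1 | E1, F1) × Z = 1/2 × 1/2 = 1/4
P(a2 | E1, F1) × Z = 1/2 × 1/2 = 1/4
P(a3 | E2, F2) × Z = 1/2
After normalization:
P(a1 | E1, F1) = 1/4 / (1/4 + 1/4) = 1/2
P(a2 | E1, F1) = 1/4 / (1/2) = 1/2
P(a3 | E2, F2) = 1/2 / (1/2) = 1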

52
Repeating step 3: calculating t(f | e)

Alignments: a1 = {b-x, c-y}, a2 = {b-y, c-x} (Sent 1); a3 = {b-y} (Sent 2)
Collecting counts:
Ct(x, b) = 1/4
Ct(y, b) = 3/4 + 1 = 7/4
Ct(x, c) = 3/4
Ct(y, c) = 1/4
After normalization:
t(x | b) = 1/4 / (1/4 + 7/4) = 1/8, t(y | b) = 7/8
t(x | c) = 3/4 / (3/4 + 1/4) = 3/4, t(y | c) = 1/4

53
See the trend?
           t(x|b)  t(y|b)  t(x|c)  t(y|c)  a1    a2
init       1/2     1/2     1/2     1/2     -     -
1st iter   1/4     3/4     1/2     1/2     1/2   1/2
2nd iter   1/8     7/8     3/4     1/4     1/4   3/4
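The table can be regenerated with a short script. This sketch enumerates only one-to-one alignments, following the simplifying assumption of the worked example, so it is not a general Model 1 trainer. The function name is hypothetical.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import permutations


def em_iteration(corpus, t):
    """One EM pass over one-to-one alignments only (requires equal lengths)."""
    ct = defaultdict(Fraction)
    posteriors = []
    for english, french in corpus:
        # Each alignment pairs every Eng word with a distinct Fr word.
        alignments = [list(zip(perm, english)) for perm in permutations(french)]
        scores = []
        for links in alignments:
            score = Fraction(1)
            for f, e in links:
                score *= t[(f, e)]
            scores.append(score)
        z = sum(scores)
        post = [s / z for s in scores]           # P(a | E, F) after normalization
        posteriors.append(post)
        for p, links in zip(post, alignments):
            for f, e in links:
                ct[(f, e)] += p                  # fractional counts
    totals = defaultdict(Fraction)
    for (f, e), c in ct.items():
        totals[e] += c
    new_t = {(f, e): c / totals[e] for (f, e), c in ct.items()}
    return new_t, posteriors


corpus = [(["b", "c"], ["x", "y"]), (["b"], ["y"])]
t0 = {(f, e): Fraction(1, 2) for f in "xy" for e in "bc"}
t1, post1 = em_iteration(corpus, t0)
t2, post2 = em_iteration(corpus, t1)
print(t1, post1)
print(t2, post2)
```

For Sent 1 the two posteriors come out in the order a1 (b-x, c-y), a2 (b-y, c-x), so `post1[0]` and `post2[0]` correspond to the a1 and a2 columns of the table.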
54
Calculating t(f | e) with the new formulae

E1: b c    E2: b
F1: x y    F2: y
Collecting counts:
Ct(x, b) = 1/2 / (1/2 + 1/2) = 1/2
Ct(y, b) = 1/2 / (1/2 + 1/2) + 1/1 = 3/2
Ct(x, c) = 1/2 / (1/2 + 1/2) = 1/2
Ct(y, c) = 1/2 / (1/2 + 1/2) = 1/2
After normalization:
t(x | b) = 1/2 / (1/2 + 3/2) = 1/4, t(y | b) = 3/4
t(x | c) = 1/2 / 1 = 1/2, t(y | c) = 1/2

55
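Step 3: calculating t(f | e)

Alignments: a1 = {b-x, c-y}, a2 = {b-y, c-x} (Sent 1); a3 = {b-y} (Sent 2)
Collecting counts:
Ct(x, b) = 1/2
Ct(y, b) = 1/2 + 1 = 3/2
Ct(x, c) = 1/2
Ct(y, c) = 1/2
After normalization:
t(x | b) = 1/2 / (1/2 + 3/2) = 1/4, t(y | b) = 3/4
t(x | c) = 1/2 / 1 = 1/2, t(y | c) = 1/2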

56
EM algorithm

EM: expectation maximization
In a model with hidden states (e.g., word alignment), how can we estimate model parameters?
EM does the following:
  E-step: Take an initial model parameterization and calculate the expected values of the hidden data.
  M-step: Use the expected values to maximize the likelihood of the training data.

57
Training Summary

EM algorithm:
  Hill climbing; each iteration is guaranteed to improve (or at least not decrease) the objective function.
  It is not guaranteed to reach the global optimum.
The resulting formulae
  are intuitively expected
  can be calculated efficiently

58
Model 1 and 2

Modeling
Generative process
Decomposition
Formula and types of parameters
Training
Finding the best alignment

59
The best alignment in Model 1-5
Given E and F, we are looking for the best alignment a* = argmax_a P(a | E, F).
60
The best alignment in Model 1
61
The best alignment in Model 2
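The formulae for these two slides are missing from the transcript. Because each Fr position is aligned independently in Model 1 and 2, the best alignment can be found one position at a time: aj = argmax_i t(fj | ei) for Model 1, and aj = argmax_i t(fj | ei) d(i | j, m, l) for Model 2. The sketch below assumes that standard result; the function name, the toy table, and the toy distortion function are all hypothetical.

```python
def best_alignment(french, english, t, d=None):
    """Best alignment for Model 1 (d is None) or Model 2 (d given).

    english[0] is the NULL word; t maps (f, e) to t(f | e);
    d, if given, is a function (i, j, m, l) -> d(i | j, m, l).
    """
    l, m = len(english) - 1, len(french)
    a = []
    for j, f in enumerate(french, start=1):
        def score(i):
            s = t.get((f, english[i]), 0.0)
            return s if d is None else s * d(i, j, m, l)
        a.append(max(range(l + 1), key=score))  # independent argmax per Fr position
    return a


# Hypothetical toy parameters, invented for illustration.
english = ["NULL", "b", "c"]
french = ["x", "y"]
t = {
    ("x", "NULL"): 0.1, ("y", "NULL"): 0.1,
    ("x", "b"): 0.125, ("y", "b"): 0.875,
    ("x", "c"): 0.75, ("y", "c"): 0.25,
}


def diagonal(i, j, m, l):
    """Toy distortion that strongly prefers i == j."""
    return 0.9 if i == j else 0.05


a_model1 = best_alignment(french, english, t)
a_model2 = best_alignment(french, english, t, d=diagonal)
print(a_model1, a_model2)
```

With these toy numbers the strong diagonal preference overrides the translation probs, which illustrates how distortion can change the best alignment.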
62
Summary of Model 1 and 2

Modeling:
  Pick the length of F with prob P(m | l).
  For each position j:
    Pick an English word position aj, with prob P(aj | j, m, l).
    Pick a Fr word fj according to the Eng word ei, with t(fj | ei), where i = aj.
  The resulting formula can be calculated efficiently.
Training: EM algorithm. The update can be done efficiently.
Finding the best alignment: can be easily done.

63
Limitations of Model 1 and 2

There could be some relations among the Fr words
generated by the same Eng word (w.r.t. positions
and fertility).
The relations are not captured by Model 1 and 2.
They are captured by Model 3 and 4.

64
Outline

General concepts
Source channel model
Word alignment
Notations
Model 1-2
Model 3-4

65
Model 3 and 4
66
Model 3 and 4

Modeling
Generative process
Decomposition and final formula
Types of parameters
Training
Finding the best alignment
Decoding

67
Generative process

For each Eng word ei, choose a fertility
For each ei, generate Fr words
Choose the position of each Fr word.

68
An example
NULL the cheapest nonstop flights
69
An example
Eng: NULL the cheapest nonstop flights
Fr: vols sans escale le moins cher
[Alignment diagram is not recoverable from the transcript.]
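The three generative steps (fertility, word generation, placement) can be sketched on this example. The alignment drawing is lost, so the fertilities, word choices, and positions below are one plausible reading and are assumptions; probabilities are left out entirely and the helper only assembles the result.

```python
def assemble(english, fertility, translations, positions):
    """Place the Fr words generated by each Eng word into their chosen slots.

    fertility[e]    : number of Fr words that e generates
    translations[e] : the Fr words generated by e (length == fertility[e])
    positions[e]    : 1-based Fr positions chosen for those words
    """
    m = sum(fertility[e] for e in english)
    slots = [None] * m
    for e in english:
        assert len(translations[e]) == len(positions[e]) == fertility[e]
        for f, pos in zip(translations[e], positions[e]):
            assert slots[pos - 1] is None, "two Fr words chose the same slot"
            slots[pos - 1] = f
    return slots


english = ["NULL", "the", "cheapest", "nonstop", "flights"]
# ASSUMPTION: one plausible reading of the lost alignment diagram.
fertility = {"NULL": 0, "the": 1, "cheapest": 2, "nonstop": 2, "flights": 1}
translations = {
    "NULL": [],
    "the": ["le"],
    "cheapest": ["moins", "cher"],
    "nonstop": ["sans", "escale"],
    "flights": ["vols"],
}
positions = {
    "NULL": [],
    "the": [4],
    "cheapest": [5, 6],
    "nonstop": [2, 3],
    "flights": [1],
}
french = assemble(english, fertility, translations, positions)
print(" ".join(french))
```

In the actual models each of the three choices carries a probability (fertility, translation, and distortion), and their product gives P(F, a | E).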
70
Decomposition
71
Approximations and types of parameters
Where N is the number of empty slots.
72
Approximations and types of parameters (cont)
73
Modeling summary

For each Eng word ei, choose a fertility, which depends only on ei.
For each ei, generate Fr words, which depend only on ei.
Choose the position of each Fr word:
  Model 3: the position depends only on the position of the Eng word generating it.
  Model 4: the position depends on more.

74
Training

Use EM, just like Model 1 and 2.
Translation and distortion probabilities can be calculated efficiently; fertility probabilities cannot.
There is no efficient algorithm to find the best alignment.

75
Model 3 and 4

Modeling
Generative process
Decomposition and final formula
Types of parameters
Training
Finding the best alignment
Decoding

76
Model 1-4 modeling
77
Model 1-4 training

Similarities:
  Same objective function
  Same algorithm: EM algorithm
Differences:
  Summation over all alignments can be done efficiently for Model 1-2, but not for Model 3-4.
  Best alignment can be found efficiently for Model 1-2, but not for Model 3-4.

78
Summary