Title: Advanced Techniques in NLP
1. Advanced Techniques in NLP
- Machine Translation III
- Statistical MT
2. Approaching MT
- There are many different ways of approaching the problem of MT.
- The choice of approach is complex and depends upon:
  - Task requirements
  - Human resources
  - Linguistic resources
3. Criterial Issues
- Do we want a translation system for one language pair or for many language pairs?
- Can we assume a constrained vocabulary or do we need to deal with arbitrary text?
- What resources exist for the languages that we are dealing with?
- How long will it take to develop the resources, and what human resources will that require?
4. Parallel Data
- Lots of translated text is available: hundreds of millions of words of translated text for some language pairs.
  - a book has a few hundred thousand words
  - an educated person may read 10,000 words a day
    - 3.5 million words a year
    - 300 million in a lifetime
- Computers can see more translated text than humans read in a lifetime.
- Machines can learn how to translate foreign languages.
- (Koehn 2006)
5. Statistical Translation
- Robust
- Domain independent
- Extensible
- Does not require language specialists
- Does require parallel texts
- Uses the noisy channel model of translation
6. Noisy Channel Model: Sentence Translation (Brown et al. 1990)
- [Figure: the source sentence passes through a noisy channel and emerges as the target sentence.]
7. Statistical Modelling
- Learn P(f|e) from a parallel corpus
- Not sufficient data to estimate P(f|e) directly
- (from Koehn 2006)
8. The Problem of Translation
- Given a sentence T of the target language, seek the source sentence S from which a translator produced T, i.e.
- find S that maximises P(S|T)
- By Bayes' theorem:
  P(S|T) = P(S) × P(T|S) / P(T)
- whose denominator is independent of S.
- Hence it suffices to maximise P(S) × P(T|S)
9. The Three Components of a Statistical MT Model
- Method for computing language model probabilities P(S)
- Method for computing translation probabilities P(T|S)
- Method for searching amongst source sentences for one that maximises P(S) × P(T|S)
10. A Statistical MT System
- [Figure: the Source Language Model supplies P(S) and the Translation Model supplies P(T|S); given T, the Decoder searches for the S maximising P(S|T).]
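To make the division of labour concrete, here is a minimal sketch of the decoder's job; the candidate sentences and model scores are invented for illustration, not taken from the slides:

```python
# Hypothetical toy scores, for illustration only.
language_model = {"John loves Mary": 0.9, "Mary loves John": 0.8}    # P(S)
translation_model = {                                                # P(T|S)
    ("Jean aime Marie", "John loves Mary"): 0.7,
    ("Jean aime Marie", "Mary loves John"): 0.2,
}

def decode(t, candidates):
    """Return the source sentence S maximising P(S) * P(T|S)."""
    return max(candidates,
               key=lambda s: language_model[s] * translation_model[(t, s)])

print(decode("Jean aime Marie", ["John loves Mary", "Mary loves John"]))
# -> John loves Mary
```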
11. Three Kinds of Model
12. Language Models Based on N-Grams of Words
- General: P(s1 s2 ... sn) = P(s1) P(s2|s1) ... P(sn|s1 ... s(n-1))
- Trigram: P(s1 s2 ... sn) = P(s1) P(s2|s1) P(s3|s1,s2) ... P(sn|s(n-2),s(n-1))
- Bigram: P(s1 s2 ... sn) = P(s1) P(s2|s1) ... P(sn|s(n-1))
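As an illustration, here is a minimal bigram language model trained by relative frequency on the English side of the toy bitext from slide 15. This is a sketch only: real systems smooth these estimates so unseen n-grams do not get probability zero.

```python
from collections import Counter

def train_bigram_lm(corpus):
    """Estimate P(w_i | w_{i-1}) by relative frequency."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens[:-1])
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    return {(h, w): c / unigrams[h] for (h, w), c in bigrams.items()}

def sentence_prob(lm, sentence):
    """P(s1 ... sn) = product of P(s_i | s_{i-1}); zero for unseen bigrams."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    p = 1.0
    for h, w in zip(tokens[:-1], tokens[1:]):
        p *= lm.get((h, w), 0.0)
    return p

lm = train_bigram_lm(["the cat sleeps", "the dog sleeps", "the horse eats"])
print(sentence_prob(lm, "the cat sleeps"))   # 1 * 1/3 * 1 * 1 = 0.333...
```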
13. Syntax-Based Language Models
- Good syntax tree → good English
- Allows for long-distance constraints
- [Figure: two candidate sentences; the left sentence is preferred by the syntax-based model.]
14. Word-Based Translation Models
- The translation process is decomposed into smaller steps
- Each step is tied to words
- Based on the IBM Models (Brown et al., 1993)
- (from Koehn 2006)
15. Word TM Derived from Bitext
- ENGLISH
  - the cat sleeps
  - the dog sleeps
  - the horse eats
- FRENCH
  - le chat dort
  - le chien dort
  - le cheval mange
16. le chat dort / the cat sleeps

            the   cat   dog   horse   sleeps   eats
  le         I     I                     I
  chat       I     I                     I
  chien
  cheval
  dort       I     I                     I
  mange
17. le chien dort / the dog sleeps

            the   cat   dog   horse   sleeps   eats
  le        II     I     I              II
  chat       I     I                     I
  chien      I           I               I
  cheval
  dort      II     I     I              II
  mange
18. le cheval mange / the horse eats

P(t|s):

            the   cat   dog   horse   sleeps   eats
  le        III    I     I      I       II       I
  chat       I     I                     I
  chien      I           I               I
  cheval     I                  I                I
  dort      II     I     I              II
  mange      I                  I                I

- Relative frequencies for the 'le' row: P(the|le) = 3/9, P(cat|le) = 1/9, P(dog|le) = 1/9, P(horse|le) = 1/9, P(sleeps|le) = 2/9, P(eats|le) = 1/9.
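The tally tables above can be reproduced directly in code. The sketch below counts co-occurrences over the toy bitext and turns the 'le' row into the relative frequencies shown on this slide:

```python
from collections import Counter, defaultdict

# The toy bitext from slide 15.
bitext = [("le chat dort", "the cat sleeps"),
          ("le chien dort", "the dog sleeps"),
          ("le cheval mange", "the horse eats")]

# Tally: each French word is paired with every English word that
# appears in the same sentence pair, exactly as in the tables above.
cooc = defaultdict(Counter)
for french, english in bitext:
    for s in french.split():
        cooc[s].update(english.split())

# Relative frequencies give the P(t|s) estimates of this slide.
total = sum(cooc["le"].values())              # 9 tallies in the 'le' row
for t, c in cooc["le"].items():
    print(f"P({t}|le) = {c}/{total}")
# P(the|le) = 3/9, P(cat|le) = 1/9, P(sleeps|le) = 2/9, ...
```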
19. Parameter Estimation
- Based on counting occurrences within monolingual and bilingual data.
- For the language model, we need only source language text.
- For the translation model, we need pairs of sentences that are translations of each other.
- Use the EM (Expectation Maximisation) algorithm (Baum 1972) to optimise model parameters.
20. EM Algorithm
- Word alignments for the sentence pair ("a b c", "x y z") are formed from arbitrary pairings of words from the two sentences, and include (a.x, b.y, c.z), (a.z, b.y, c.x), etc.
- There is a large number of possible alignments, since we also allow alignments such as (ab.x, 0.y, c.z).
21. EM Algorithm
- Make an initial estimate of the parameters. This can be used to compute the probability of any possible word alignment.
- Re-estimate the parameters by ranking each possible alignment by its probability according to the initial guess.
- Repeated iterations assign ever greater probability to the set of sentences actually observed.
- The algorithm leads to a local maximum of the probability of the observed sentence pairs as a function of the model parameters (a sketch implementation follows).
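As a concrete sketch of this procedure, here is IBM Model 1 trained by EM on the toy bitext. This illustrates only the word translation probabilities: fertility, distortion and the NULL word of the fuller models on the next slide are omitted.

```python
from collections import defaultdict

def ibm_model1(bitext, iterations=10):
    """EM for IBM Model 1 word translation probabilities t(f|e)."""
    f_vocab = {f for fs, _ in bitext for f in fs.split()}
    # Initial estimate: uniform, as in the experiments described later.
    t = defaultdict(lambda: 1.0 / len(f_vocab))
    for _ in range(iterations):
        count = defaultdict(float)            # expected counts c(f, e)
        total = defaultdict(float)            # expected counts c(e)
        for fs, es in bitext:
            for f in fs.split():
                # E-step: distribute f's probability mass over English words.
                z = sum(t[(f, e)] for e in es.split())
                for e in es.split():
                    count[(f, e)] += t[(f, e)] / z
                    total[e] += t[(f, e)] / z
        # M-step: re-estimate t(f|e) from the expected counts.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t

bitext = [("le chat dort", "the cat sleeps"),
          ("le chien dort", "the dog sleeps"),
          ("le cheval mange", "the horse eats")]
t = ibm_model1(bitext)
print(round(t[("le", "the")], 3))             # rises towards 1.0
```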
22. Parameters for the IBM Translation Model
- Word translation probability P(t|s): probability that source word s is translated as target word t.
- Fertility P(n|s): probability that source word s is translated by n target words (25 ≥ n ≥ 0).
- Distortion P(i|j,l): probability that the source word at position j is translated by the target word at position i in a target sentence of length l.
23. Experiment 1 (Brown et al. 1990)
- Hansard: 40,000 pairs of sentences, approx. 800,000 words in each language.
- Considered the 9,000 most common words in each language.
- Assumptions (initial parameter values):
  - each of the 9,000 target words equally likely as a translation of each of the source words
  - each of the fertilities from 0 to 25 equally likely for each of the 9,000 source words
  - each target position equally likely given each source position and target length
24. English: the
- French / Probability:
  - le .610
  - la .178
  - l' .083
  - les .023
  - ce .013
  - il .012
  - de .009
  - à .007
  - que .007
- Fertility / Probability:
  - 1 .871
  - 0 .124
  - 2 .004
25. English: not
- French / Probability:
  - pas .469
  - ne .460
  - non .024
  - pas du tout .003
  - faux .003
  - plus .002
  - ce .002
  - que .002
  - jamais .002
- Fertility / Probability:
  - 2 .758
  - 0 .133
  - 1 .106
26. English: hear
- French / Probability:
  - bravo .992
  - entendre .005
  - entendu .002
  - entends .001
- Fertility / Probability:
  - 0 .584
  - 1 .416
- (In Hansard, "hear" occurs mostly in "Hear, hear!", which is rendered in French as "Bravo!")
27. Sentence Translation Probability
- Given a translation model for words, we can compute the translation probability of a sentence, taking the parameters into account.
- P(Jean aime Marie | John loves Mary) =
  P(Jean|John) × P(1|John) × P(1|1,3) ×
  P(aime|loves) × P(1|loves) × P(2|2,3) ×
  P(Marie|Mary) × P(1|Mary) × P(3|3,3)
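The product above is straightforward to evaluate once the parameter tables exist. The numbers below are hypothetical (the slides give no values for this example); only the structure of the computation is the point:

```python
# Hypothetical parameter values, for illustration only.
t = {("Jean", "John"): 0.8, ("aime", "loves"): 0.7, ("Marie", "Mary"): 0.9}
n = {("John", 1): 0.9, ("loves", 1): 0.9, ("Mary", 1): 0.9}   # fertility
d = {(1, 1, 3): 0.8, (2, 2, 3): 0.8, (3, 3, 3): 0.8}          # distortion

# P(Jean aime Marie | John loves Mary), as the product on this slide.
p = (t[("Jean", "John")]  * n[("John", 1)]  * d[(1, 1, 3)] *
     t[("aime", "loves")] * n[("loves", 1)] * d[(2, 2, 3)] *
     t[("Marie", "Mary")] * n[("Mary", 1)]  * d[(3, 3, 3)])
print(p)
```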
28. Flaws in Word-Based Translation
- The model handles one-to-many translation, P(t t t|s), but not many-to-one, P(t|s s s)
- e.g.
  - Zeitmangel erschwert das Problem .
  - (gloss: lack of time makes more difficult the problem .)
  - Correct translation: Lack of time makes the problem more difficult.
  - MT output: Time makes the problem.
- (from Koehn 2006)
29. Flaws in Word-Based Translation (2)
- Phrasal translation, P(t t t|s s s s)
- e.g. erübrigt sich / there is no point in
  - Eine Diskussion erübrigt sich demnach .
  - (gloss: a discussion is made unnecessary itself therefore .)
  - Correct translation: Therefore, there is no point in a discussion.
  - MT output: A debate turned therefore .
- (from Koehn 2006)
30. Flaws in Word-Based Translation (3)
- Syntactic transformations
- Example: object/subject reordering
  - Den Vorschlag lehnt die Kommission ab
  - (gloss: the proposal rejects the commission off)
  - Correct translation: The commission rejects the proposal.
  - MT output: The proposal rejects the commission.
- (from Koehn 2006)
31. Phrase-Based Translation Models
- Foreign input is segmented into phrases.
- Phrases are any sequences of words, not necessarily linguistically motivated.
- Each phrase is translated into English.
- Phrases are reordered.
- (from Koehn 2006; a toy sketch of these steps follows)
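A toy sketch of these steps; the phrase table, the segmentation and the monotone ordering are all invented for illustration:

```python
# Hypothetical phrase table: foreign phrase -> English phrase.
phrase_table = {"das ist": "this is", "ein": "a", "kleines haus": "small house"}

def translate(segments):
    """Translate each foreign phrase; ordering is trivially monotone here."""
    return " ".join(phrase_table[seg] for seg in segments)

# Input segmented into phrases, then translated phrase by phrase.
print(translate(["das ist", "ein", "kleines haus"]))
# -> this is a small house
```

In general the translated phrases may also be permuted, which is where a reordering model comes in.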
32. Syntax-Based Translation Models
33. Word-Based Decoding: Searching for the Best Translation (Brown 1990)
- Maintain a list of hypotheses.
- Initial hypothesis: (Jean aime Marie | )
- Search proceeds iteratively.
- At each iteration we extend the most promising hypotheses with additional words:
  - (Jean aime Marie | John(1))
  - (Jean aime Marie | loves(2))
  - (Jean aime Marie | Mary(3))
  - (Jean aime Marie | Jean(1))
- Parenthesised numbers indicate the corresponding position in the target sentence.
34. Phrase-Based Decoding
- Build the translation left to right:
  - select foreign word(s) to be translated
  - find an English phrase translation
  - add the English phrase to the end of the partial translation
- (Koehn 2006)
35. Decoding Process
- [Figure] one-to-many translation
- (Koehn 2006)
36. Decoding Process
- [Figure] many-to-one translation
- (Koehn 2006)
37. Decoding Process
- [Figure] translation finished
- (Koehn 2006)
38. Hypothesis Expansion
- Start with the empty hypothesis:
  - e: no English words
  - f: no foreign words covered
  - p: probability 1
- (Koehn 2006)
39. Hypothesis Expansion
40. Hypothesis Expansion
- further hypothesis expansion
- (Koehn 2006)
41. Decoding Process
- Adding more hypotheses leads to an explosion of the search space.
- (Koehn 2006)
42. Hypothesis Recombination
- Sometimes different choices of hypothesis lead to the same translation result.
- Such paths can be combined.
- (Koehn 2006)
43. Hypothesis Recombination
- Drop the weaker path
- Keep a pointer from the weaker path
- (Koehn 2006)
44. Pruning
- Hypothesis recombination is not sufficient
- Heuristically discard weak hypotheses early
- Organise hypotheses in stacks, e.g. by:
  - same foreign words covered
  - same number of foreign words covered (Pharaoh does this)
  - same number of English words produced
- Compare hypotheses in stacks, discard bad ones:
  - histogram pruning: keep the top n hypotheses in each stack (e.g., n = 100)
  - threshold pruning: keep hypotheses that are at most α times the cost of the best hypothesis in the stack (e.g., α = 0.001)
45. Hypothesis Stacks
- Organisation of hypotheses into stacks:
  - here based on the number of foreign words translated
  - during translation all hypotheses from one stack are expanded
  - expanded hypotheses are placed into stacks
- (Koehn 2006; a sketch decoder using such stacks follows)
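Below is a minimal stack decoder in this style, with stacks indexed by the number of foreign words covered and histogram pruning on each stack. It is a sketch under strong simplifying assumptions: monotone translation only, an invented phrase table with toy probabilities, and no hypothesis recombination or future cost.

```python
import math
from dataclasses import dataclass

# Hypothetical phrase table: foreign words -> (English phrase, probability).
PHRASES = {("das", "ist"): [("this is", 0.9)],
           ("das",): [("the", 0.4), ("this", 0.5)], ("ist",): [("is", 0.9)],
           ("ein",): [("a", 0.8)], ("kleines",): [("small", 0.8)],
           ("haus",): [("house", 0.9)]}

@dataclass
class Hyp:
    covered: int       # number of foreign words translated so far
    english: str       # partial translation, built left to right
    logprob: float

def decode(foreign, stack_size=100):
    words = foreign.split()
    stacks = [[] for _ in range(len(words) + 1)]
    stacks[0].append(Hyp(0, "", 0.0))                 # empty hypothesis
    for i in range(len(words)):
        for hyp in stacks[i]:
            # Expand: translate the next 1 or 2 foreign words (monotone).
            for j in (1, 2):
                for eng, p in PHRASES.get(tuple(words[i:i + j]), []):
                    stacks[i + j].append(
                        Hyp(i + j, (hyp.english + " " + eng).strip(),
                            hyp.logprob + math.log(p)))
        # Histogram pruning: keep the top stack_size hypotheses per stack.
        for s in stacks:
            s.sort(key=lambda h: -h.logprob)
            del s[stack_size:]
    return max(stacks[-1], key=lambda h: h.logprob).english

print(decode("das ist ein kleines haus"))   # -> this is a small house
```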
46. Comparing Hypotheses Covering the Same Number of Foreign Words
- A hypothesis that covers an easy part of the sentence is preferred
- Need to consider the future cost of the uncovered parts
- Should take one-to-many translation into account
- (Koehn 2006)
47. Future Cost Estimation
- Use future cost estimates when pruning hypotheses
- Look up the future cost of each maximal contiguous uncovered span
- Add it to the actually accumulated cost of the translation option for pruning
- (Koehn 2006; a sketch follows)
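One way to precompute such estimates is a small dynamic program over spans: the cheapest cost of covering a span is either its best single translation option or the best split into two sub-spans. The costs below are invented for illustration, and a real estimate would also include a language model component.

```python
import math

words = ["das", "ist", "ein", "kleines", "haus"]
# Hypothetical cheapest translation-option cost (negative log prob) per span.
option_cost = {(i, i + 1): 0.5 for i in range(len(words))}
option_cost[(0, 2)] = 0.7      # e.g. "das ist" covered by a single phrase

# cost[(i, j)] = cheapest way to cover the span words[i:j].
n = len(words)
cost = {}
for length in range(1, n + 1):
    for i in range(n - length + 1):
        j = i + length
        best = option_cost.get((i, j), math.inf)
        for k in range(i + 1, j):              # or split the span in two
            best = min(best, cost[(i, k)] + cost[(k, j)])
        cost[(i, j)] = best

# When pruning, sum cost[...] over each maximal uncovered span and add
# it to the hypothesis' accumulated cost before comparing hypotheses.
print(cost[(0, 5)])                            # 0.7 + 3 * 0.5 = 2.2
```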
48. Pharaoh
- A beam search decoder for phrase-based models:
  - works with various phrase-based models
  - beam search algorithm
  - time complexity roughly linear with input length
  - good quality takes about 1 second per sentence
- Very good performance in the DARPA/NIST evaluation
- Freely available for researchers: http://www.isi.edu/licensed-sw/pharaoh/
- Coming soon: open source version of Pharaoh
49. Pharaoh Demo

  % echo 'das ist ein kleines haus' | pharaoh -f pharaoh.ini > out
  Pharaoh v1.2.9, written by Philipp Koehn
  a beam search decoder for phrase-based statistical machine translation models
  (c) 2002-2003 University of Southern California
  (c) 2004 Massachusetts Institute of Technology
  (c) 2005 University of Edinburgh, Scotland
  loading language model from europarl.srilm
  loading phrase translation table from phrase-table, stored 21, pruned 0, kept 21
  loaded data structures in 2 seconds
  reading input sentences
  translating 1 sentences ... translated 1 sentences in 0 seconds

  % cat out
  this is a small house
50. Brown Experiment 2
- Perform translation using the 1,000 most frequent words in the English corpus.
- 1,700 most frequently used French words in translations of sentences completely covered by the 1,000-word English vocabulary.
- 117,000 pairs of sentences completely covered by both vocabularies.
- Parameters of the English language model from 570,000 sentences in the English part.
51. Experiment 2 (contd)
- 73 French sentences tested from elsewhere in the corpus. Results were classified as:
  - Exact: same as actual translation
  - Alternate: same meaning
  - Different: legitimate translation but different meaning
  - Wrong: could not be interpreted as a translation
  - Ungrammatical: grammatically deficient
- Corrections to the last three categories were made and keystrokes were counted.
52. Results

  Category        Sentences   Percent
  Exact                   4        5%
  Alternate              18       25%
  Different              13       18%
  Wrong                  11       15%
  Ungrammatical          27       37%
  Total                  73
53. Results - Discussion
- According to Brown et al., the system performed successfully 48% of the time (first three categories).
- 776 keystrokes were needed to repair the output, versus 1,916 keystrokes to generate all 73 translations from scratch.
- According to the authors, the system therefore reduces work by 60%.
54. Issues
- Automatic evaluation methods
  - can computers decide what are good translations?
- Phrase-based models
  - what are the atomic units of translation?
  - how are they discovered?
  - the best method in statistical machine translation
- Discriminative training
  - what methods directly optimise translation performance?
55. The Speculative (Koehn 2006)
- Syntax-based transfer models
  - how can we build models that take advantage of syntax?
  - how can we ensure that the output is grammatical?
- Factored translation models
  - how can we integrate different levels of abstraction?
56. Bibliography
- Statistical MT: Brown et al., "A Statistical Approach to Machine Translation", Computational Linguistics 16(2), 1990, pp. 79-85 (search the ACL Anthology)
- Koehn tutorial (see http://www.iccs.inf.ed.ac.uk/~pkoehn/)