Title: Coupling between ASR and MT in Speech-to-Speech Translation
1. Coupling between ASR and MT in Speech-to-Speech Translation
- Arthur Chan
- Prepared for the Advanced Machine Translation Seminar
2. This Seminar
- Introduction (6 slides)
- Ringger's categorization of coupling between ASR and NLU (7 slides)
- Interfaces in Loose Coupling
  - 1-best and N-best (5 slides)
  - Lattices / Confusion Network / Confidence Estimation (12 slides)
  - Results from the literature
- Tight Coupling
  - Ney's theory and 2 methods of implementation (14 slides)
  - (Sorry, without FST approaches.)
- Some As-Is Ideas on This Topic
3. 6 papers on Coupling of Speech-to-Speech Translation
- H. Ney, "Speech translation: Coupling of recognition and translation," in Proc. ICASSP, 1999.
- Casacuberta et al., "Architectures for speech-to-speech translation using finite-state models," in Proc. Workshop on Speech-to-Speech Translation, 2002.
- E. Matusov, S. Kanthak, and H. Ney, "On the integration of speech recognition and statistical machine translation," in Proc. Interspeech, 2005.
- S. Saleem, S. C. Jou, S. Vogel, and T. Schultz, "Using word lattice information for a tighter coupling in speech translation systems," in Proc. ICSLP, 2004.
- V. H. Quan et al., "Integrated N-best re-ranking for spoken language translation," in Proc. Eurospeech, 2005.
- N. Bertoldi and M. Federico, "A new decoder for spoken language translation based on confusion networks," in Proc. IEEE ASRU Workshop, 2005.
4. A Conceptual Model of Speech-to-Speech Translation
- Pipeline: waveforms -> Speech Recognizer -> decoding result(s) -> Machine Translator -> translation -> Speech Synthesizer -> waveforms
5. Motivation of Tight Coupling between ASR and MT
- The 1-best output of ASR could be wrong
- MT could benefit from the wide range of supplementary information provided by ASR
  - N-best list
  - Lattice
  - Sentence-/word-based confidence scores (e.g. word posterior probability)
  - Confusion network, or consensus decoding (Mangu 1999)
- MT quality may depend on the WER of ASR (?)
6. Scope of this talk
- Same pipeline: waveforms -> Speech Recognizer -> ? -> Machine Translator -> translation -> Speech Synthesizer -> waveforms
- Focus on the ASR-to-MT interface: 1-best? N-best? Lattice? Confusion network?
7. Topics Covered Today
- The concept of coupling
  - Tightness of coupling between ASR and Technology X (Ringger 95)
- Two questions
  - What could ASR provide in loose coupling?
    - Discussion of interfaces between ASR and MT in loose coupling
  - What is the status of tight coupling?
    - Ney's formulation
8. Topics not covered
- Direct modeling
  - Using features from both ASR and MT
  - Sometimes referred to as ASR/MT unification
- Implications of the MT search algorithm on the coupling
- Generation of speech from text
  - The presenter doesn't know enough.
9. The Concept of Coupling
10. Classification of Coupling of ASR and Natural Language Understanding (NLU)
- Proposed in Ringger 95, Harper 94
- 3 dimensions of ASR/NLU coupling
  - Complexity of the search algorithm: simple N-gram?
  - Incrementality of the coupling: on-line? Left-to-right?
  - Tightness of the coupling: tight? Loose? Semi-tight?
11. Tightness of Coupling
- Tight
- Semi-Tight
- Loose
12. Notes
- Semi-tight coupling could appear as
  - a feedback loop between ASR and Technology X for the whole utterance of speech, or
  - a feedback loop between ASR and Technology X for every frame.
- The Ringger system: a good way to understand how a speech-based system is developed.
13. Example 1: LM
- Suppose someone asserts that ASR has to be used with 13-grams.
- In tight coupling:
  - A search will be devised to find the word sequence with the best combined acoustic score and 13-gram likelihood.
- In loose coupling:
  - A simple search will be used to generate some outputs (N-best list, lattice, etc.).
  - The 13-gram will then be used to rescore those outputs (see the sketch after this list).
- In semi-tight coupling:
  - 1. A simple search will be used to generate results.
  - 2. The 13-gram will be applied at word ends only (but the exact history will not be stored).
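To make the loose-coupling option concrete, here is a minimal sketch (not from the original slides) of rescoring an N-best list with a higher-order LM. The function names, the weight, and the toy LM below are hypothetical placeholders, not any specific recognizer's API.

```python
# Hypothetical loose coupling: a simple first pass emits an N-best list with
# acoustic scores; a higher-order LM (the "13-gram") rescores it afterwards.
import math

def rescore_nbest(nbest, lm_logprob, lm_weight=0.7):
    """nbest: list of (word_sequence, acoustic_log_score) from the first pass.
    lm_logprob: maps a word sequence to its log probability under the big LM.
    Returns the hypotheses re-sorted by combined score, best first."""
    rescored = [(am + lm_weight * lm_logprob(words), words) for words, am in nbest]
    rescored.sort(key=lambda x: x[0], reverse=True)
    return [words for _, words in rescored]

# Toy usage with a fake LM standing in for the 13-gram.
fake_lm = lambda words: sum(math.log(0.1) for _ in words)   # placeholder only
nbest = [(["wreck", "a", "nice", "beach"], -118.5),
         (["recognize", "speech"], -120.0)]
print(rescore_nbest(nbest, fake_lm))   # the rescoring changes the ranking
```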
14. Example 2: Higher-order AM
- Segmental models assume the observation probability is not conditionally independent.
- Suppose someone asserts that a segmental model is better than a plain HMM.
- Tight coupling: direct search for the best word sequence using the segmental model.
- Loose coupling: use the segmental model to rescore.
- Semi-tight coupling: a hybrid HMM/segmental-model algorithm?
15. Summary of Coupling between ASR and NLU
16. Implications for ASR/MT coupling
- The categorization generalizes to many systems
- Loose coupling
  - Any system which uses 1-best, N-best, lattice, or other inputs for one-way module communication
  - (Bertoldi 2005)
  - CMU system (Saleem 2004)
  - (Matusov 2005)
- Tight coupling
  - (Ney 1999)
  - (Casacuberta 2002)
- Semi-tight coupling
  - (Quan 2005)
17. Interfaces in Loose Coupling: 1-best and N-best
18. Perspectives
- ASR outputs
  - 1-best results
  - N-best results
  - Lattice
  - Consensus network
  - Confidence scores
- How does ASR generate these outputs?
- Why are they generated?
- What if there are multiple ASRs? (And what if their results are combined?)
19. Origin of the 1-best
- Decoding of HMM-based ASR
  - Searching for the best path in a huge HMM-state lattice
- 1-best ASR result
  - The best path one could find from backtracking
- State lattice (next page)
20. (No transcript: figure of the HMM state lattice)
21. Note on 1-best
- Most of the time, 1-best = word sequence. Why?
  - In LVCSR, storing the backtracking pointer table for the state sequence takes a lot of memory (even nowadays).
  - Compare this with the number of frames of scores one would need to store.
- Usually a backtrack pointer stores the previous word before the current word.
- Clever structures dynamically allocate the back-tracking pointer table.
22. What is an N-best list?
- Trace back not only from the 1st best, but also from the 2nd best, 3rd best, etc.
- Pathways:
  - Directly from the search backtrack pointer table
    - Exact N-best algorithm (Chow 90)
    - Word-pair N-best algorithm (Chow 91)
    - A* search using the Viterbi score as a heuristic (Chow 92)
  - Generate a lattice first, then generate the N-best list from the lattice
23. Interfaces in Loose Coupling: Lattice, Consensus Network and Confidence Estimation
24. What is a Lattice?
- A compact representation of the state lattice
  - Only word nodes (or links) are involved
- Difference between N-best and lattice:
  - A lattice can be a compact representation of an N-best list (a toy code representation follows the next slide).
25. (No Transcript)
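As a rough illustration (mine, not from the slides), a word lattice can be stored as a small directed acyclic graph whose links carry a word identity plus its acoustic and LM scores. The class and field names below are made up for illustration only.

```python
# Illustrative word-lattice representation: nodes are time points, links carry
# word hypotheses with their scores. All names here are hypothetical.
from dataclasses import dataclass, field

@dataclass
class Link:
    start_node: int     # lattice node where the word hypothesis begins
    end_node: int       # lattice node where it ends
    word: str
    am_score: float     # acoustic log score for this word span
    lm_score: float     # language model log score

@dataclass
class Lattice:
    links: list = field(default_factory=list)

    def outgoing(self, node):
        """All word links leaving a given node."""
        return [l for l in self.links if l.start_node == node]

# A tiny toy lattice: "a b" competing with the single word "ab".
lat = Lattice()
lat.links += [Link(0, 1, "a", -10.0, -2.0),
              Link(1, 2, "b", -12.0, -2.5),
              Link(0, 2, "ab", -21.0, -4.0)]
print([l.word for l in lat.outgoing(0)])   # -> ['a', 'ab']
```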
26. How is the lattice generated?
- From the decoding backtracking pointer table
  - Only record all the links between word nodes.
- From the N-best list
  - Becomes a compact representation of the N-best
  - Sometimes spurious links will be introduced
27. How is the lattice generated when there are phone contexts at the word ends?
- Very complicated when phonetic context is involved
- Not only the word ends need to be stored, but also the phone contexts
- The lattice carries the word identity as well as the contexts
- The lattice can become very large.
28. How is this resolved?
- Some use only approximate triphones to generate the lattice in the first stage (BBN)
- Some generate the lattice with full CD phones but convert it back to a context-free lattice (RWTH)
- Some use the lattice with full CD-phone contexts (RWTH)
29. What do ASR folks do when the lattice is still too large?
- Use some criteria to prune the lattice (a small pruning sketch follows this slide). Example criteria:
  - Word posterior probability
  - Application of another LM or AM, then filtering
  - General confidence score
  - Maximum lattice density (number of words in the lattice / number of words in the utterance)
- Or generate an even more compact representation than lattices, e.g. a consensus network.
30. Conclusions on lattices
- Lattice generation itself can be a complicated issue.
- Sometimes, what the post-processing stage (e.g. MT) gets is a pre-filtered, pre-processed result.
31. Confusion Network and Consensus Hypothesis
- Confusion network
  - Also called a "sausage network" or a "consensus network"
32. Special Properties (?)
- More local than a lattice
- One can apply simple criteria to find the best results
  - E.g. consensus decoding applies word posterior probabilities on the confusion network.
- More tractable in terms of size
- Found to be useful in
  - ?
  - ?
33. How to generate a consensus network?
- From the lattice
- Summary of Mangu's algorithm (a rough sketch follows this slide):
  - Intra-word clustering
  - Inter-word clustering
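The following is only a crude stand-in for Mangu's two clustering steps (my simplification, not the actual algorithm): links are bucketed into time-ordered bins by the midpoint of their time span, word posteriors are summed inside each bin, and the consensus hypothesis picks the top word per bin.

```python
# Crude approximation of confusion-network construction (NOT Mangu's actual
# intra-/inter-word clustering): bucket links by the midpoint of their time
# span and sum word posteriors inside each bin.
from collections import defaultdict

def naive_confusion_network(links, bin_width=0.5):
    """links: iterable of (start_time, end_time, word, posterior).
    Returns a time-ordered list of bins, each mapping word -> total posterior."""
    bins = defaultdict(lambda: defaultdict(float))
    for start, end, word, post in links:
        midpoint = (start + end) / 2.0
        bins[int(midpoint / bin_width)][word] += post
    return [dict(bins[k]) for k in sorted(bins)]

def consensus(network):
    """Consensus hypothesis: the most probable word in each bin."""
    return [max(b, key=b.get) for b in network]

links = [(0.0, 0.4, "I", 0.9), (0.0, 0.4, "eye", 0.1),
         (0.4, 0.9, "see", 0.6), (0.4, 0.9, "sea", 0.4)]
print(consensus(naive_confusion_network(links)))   # -> ['I', 'see']
```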
34. Note on Consensus Network
- Time information might not be preserved in the confusion network.
- The similarity function directly affects the final output of the consensus network.
35. Other ways to generate a confusion network
- From the N-best list, using ROVER
  - A mixture of voting and adding word confidences
36. Confidence Measure
- Anything other than the likelihood which could tell whether the answer is useful
- E.g.:
  - Word posterior probability P(W|A)
    - Usually computed using lattices (see the sketch below)
  - Language model backoff mode
  - Other posterior probabilities (frame, sentence)
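To show how word posteriors P(W|A) fall out of a lattice, here is a minimal forward-backward sketch in the spirit of Wessel 98 and Mangu 2000 (my simplification, not their exact algorithms). It assumes the lattice nodes are already topologically numbered and works with raw probabilities; a real implementation would work in the log domain.

```python
# Minimal word-posterior computation on a small word lattice.
# Each link is (start_node, end_node, word, prob), where prob is the combined,
# already-scaled acoustic + LM score expressed as a probability.

def word_posteriors(links, num_nodes):
    forward = [0.0] * num_nodes
    backward = [0.0] * num_nodes
    forward[0] = 1.0                       # path mass entering the start node
    for s, e, _, p in sorted(links, key=lambda l: l[0]):
        forward[e] += forward[s] * p       # accumulate in topological order
    backward[num_nodes - 1] = 1.0          # path mass leaving the final node
    for s, e, _, p in sorted(links, key=lambda l: -l[1]):
        backward[s] += p * backward[e]
    total = forward[num_nodes - 1]         # total probability of all lattice paths
    return {(s, e, w): forward[s] * p * backward[e] / total
            for s, e, w, p in links}

links = [(0, 1, "a", 0.6), (0, 1, "uh", 0.4), (1, 2, "cat", 1.0)]
print(word_posteriors(links, 3))   # posteriors: a = 0.6, uh = 0.4, cat = 1.0
```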
37. Interfaces in Loose Coupling: Results from the Literature
38. General word
- Coupling in SST is still pretty new.
- Papers were chosen according to whether some ASR outputs have been used.
- Other techniques, such as direct modeling, might be mixed into the papers.
39. N-best list (Quan 2005)
- Uses the N-best list for reranking.
- The interpolation weights of the AM and TM are then optimized.
- Summary: reranking gives improvements.
40. Lattices: CMU results (Saleem 2004)
- Summary of results:
  - Lattice word error rate improves as lattice density increases.
  - Lattice density and the weight on acoustic scores turn out to be important parameters to tune; too large or too small could hurt.
41. LWER against lattice density (figure)
42. Modified BLEU scores against lattice density (figure)
43. Optimal density and score weight based on utterance length (figure)
44. Consensus Network
- Bertoldi 2005 is probably the only work on a confusion-network-based method.
- Summary of results:
  - When direct modeling is applied, the consensus network doesn't beat the N-best method.
  - The authors argue for the speed and simplicity of the algorithm.
45. Confidence: Does it help?
- According to Zhang 2006, yes.
- Confidence measure (CM) filtering is used to filter out unnecessary results in the N-best list.
- Note: the approaches used are quite different.
46. Conclusion on Loose Coupling
- ASR could give a rich set of outputs.
- It is still unknown what type of output should be used in the pipeline.
- Currently, there seems to be a lack of comprehensive experimental studies on which method is best.
- Usage of confusion networks and confidence estimation seems to be under-explored.
47. Tight Coupling: Theory and Practice
48. Theory (Ney 1999)
- Decision rule, derived step by step:
  - e_hat = argmax_e Pr(e|x) = argmax_e Pr(e) Pr(x|e)   [Bayes rule]
  - = argmax_e Pr(e) sum_f Pr(f, x|e)   [introduce the source sentence f as a hidden variable]
  - = argmax_e Pr(e) sum_f Pr(f|e) Pr(x|f, e)   [Bayes rule]
  - ≈ argmax_e Pr(e) sum_f Pr(f|e) Pr(x|f)   [assume x doesn't depend on the target language given f]
  - ≈ argmax_e Pr(e) max_f Pr(f|e) Pr(x|f)   [sum-to-max approximation]
49. Layman point of view
- Three factors:
  - Pr(e): target language model
  - Pr(f|e): translation model
  - Pr(x|f): acoustic model
- Note: an assumption has been made that only the best-matching f for e is used (see the scoring sketch below).
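To make the three-factor decision rule concrete, here is a hypothetical log-domain scoring function for one candidate pair (e, f); the scorer names and weights are placeholders, not Ney's implementation, and the acoustic term is assumed to be read off the ASR word lattice.

```python
# Hypothetical log-domain scoring for tight coupling: a candidate target
# sentence e is scored together with its best-matching source sentence f.
# The three terms correspond to Pr(e), Pr(f|e) and Pr(x|f).

def tight_coupling_score(e, f, target_lm, translation_model, acoustic_from_lattice,
                         lm_w=1.0, tm_w=1.0, am_w=1.0):
    """All three scorers are assumed to return log probabilities:
    target_lm(e)              ~ log Pr(e)
    translation_model(f, e)   ~ log Pr(f|e)
    acoustic_from_lattice(f)  ~ log Pr(x|f), read off the ASR word lattice."""
    return (lm_w * target_lm(e)
            + tm_w * translation_model(f, e)
            + am_w * acoustic_from_lattice(f))

# The full search would then be: e_hat = argmax over e of (max over f of this score).
```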
50. Comparison with SR
- In SR:
  - Pr(f): source language model
- In tight coupling:
  - Pr(f|e), Pr(e): translation model and target language model
51. Algorithmic Point of View
- Brute-force method: instead of incorporating only the LM into the standard Viterbi algorithm, incorporate P(e) and P(f|e)
- -> Very complicated
52. Assumptions in Modeling
- Alignment models (HMM)
- Acoustic modeling
  - The speech recognizer will produce a word graph.
  - Each link carrying a word hypothesis covers a portion of the acoustic scores. (The notation is confusing in the paper.)
53. Lexicon Modeling
- Further assumption beyond the standard IBM models
  - The target word of the lexicon model (i.e. the source-language word f) is assumed to also depend on the previous word.
  - So, in fact, a source LM is actually there.
54. First Implementation: Local Average Assumption
- Local average assumption
  - P(x|e) is used to capture the local characteristics of the acoustics.
55. Justification of Using the Local Average Assumption
- Rephrased from the author (p. 3, para. 2):
  - Lexicon modeling and language modeling cause f_{j-1}, f_j, f_{j+1} to appear in the math.
  - In other words, it is too complicated to carry out exactly.
- Computational advantage: the local score can be obtained just from the word graph, before translation.
- -> A full translation strategy can still be carried out.
56. Computation of P(x|e)
- Makes use of the best source sequence.
- Also refers to Wessel 98, a commonly used word posterior probability algorithm for lattices.
- A forward-backward-like procedure is used.
57. Second Method: Monotone Alignment Assumption - Network (figure)
58. Monotone Alignment Assumption: Formula for Text Input
- A closed-form DP solution exists: O(J * E^2) (a simplified sketch follows).
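Below is a minimal sketch of the kind of monotone DP this slide alludes to, under strong simplifications of my own (exactly one target word per source word, a bigram target LM, no jumps). It is not Ney's exact recursion, but it shows where the O(J * E^2) complexity comes from: for each of the J source positions and each of the E target words, we maximize over E predecessor words.

```python
# Simplified monotone-alignment DP for text input (a sketch, not the exact
# formulation in Ney 1999): each source word f_j is translated by exactly one
# target word, and a bigram target LM scores adjacent target words.

def monotone_decode(f_words, vocab_e, log_lex, log_bigram):
    """log_lex(f, e)      ~ log p(f|e)    (lexicon / translation model)
    log_bigram(e, prev)   ~ log p(e|prev) (target bigram language model)
    Returns (best log score, best target word sequence)."""
    # Q[e] = best log score of covering f_1..f_j with e as the last target word.
    Q = {e: log_lex(f_words[0], e) + log_bigram(e, "<s>") for e in vocab_e}
    backpointers = []
    for j in range(1, len(f_words)):                  # J-1 source positions ...
        newQ, bp = {}, {}
        for e in vocab_e:                             # ... times E target words ...
            prev = max(vocab_e, key=lambda ep: Q[ep] + log_bigram(e, ep))
            newQ[e] = Q[prev] + log_bigram(e, prev) + log_lex(f_words[j], e)
            bp[e] = prev                              # ... each maximized over E predecessors
        Q = newQ
        backpointers.append(bp)
    # Backtrace from the best final target word.
    e = max(Q, key=Q.get)
    seq = [e]
    for bp in reversed(backpointers):
        e = bp[e]
        seq.append(e)
    return max(Q.values()), list(reversed(seq))
```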
59. Monotone Alignment Assumption: Formula for Speech Input
60. How to make the monotone assumption work?
- Words need to be reordered as part of the search strategy.
- Is the acoustic model assumption used? I.e., are we talking about a word lattice or still a state lattice?
  - Don't know; it seems we are actually talking about a word lattice.
  - Supported by Matusov 2005.
61. Experimental Results in Matusov, Kanthak and Ney 2005
- Summary of the results:
  - Translation quality is only improved by tight coupling when the lattice density is not high.
  - As in Saleem 2004, incorporation of acoustic scores helps.
62. Conclusion: Possible Issues of Tight Coupling
- Possibilities:
  - In SR, a source n-gram LM is already very close to the best configuration.
  - The complexity of the algorithm is too high; approximation is still necessary to make it work.
  - When the tight-coupling criterion is used, it is possible that the LM and the TM need to be jointly estimated.
  - The current approaches still haven't really implemented tight coupling.
  - There might be bugs in the programs.
63. Conclusion
- Two major issues in coupling of SST were discussed.
- In loose coupling:
  - Consensus networks and confidence scoring are still not fully utilized.
- In tight coupling:
  - The approach seems to be haunted by the very high complexity of search algorithm construction.
64. Discussion
65. The End. Thanks.
66. Literature
- R. Zhang and G. Kikui, "Integration of speech recognition and machine translation: Speech recognition word lattice translation," Speech Communication, vol. 48, issues 3-4, 2006.
- H. Ney, "Speech translation: Coupling of recognition and translation," in Proc. ICASSP, 1999.
- E. Matusov, S. Kanthak, and H. Ney, "On the integration of speech recognition and statistical machine translation," in Proc. Interspeech, 2005.
- S. Saleem, S. C. Jou, S. Vogel, and T. Schultz, "Using word lattice information for a tighter coupling in speech translation systems," in Proc. ICSLP, 2004.
- V. H. Quan et al., "Integrated N-best re-ranking for spoken language translation," in Proc. Eurospeech, 2005.
- N. Bertoldi and M. Federico, "A new decoder for spoken language translation based on confusion networks," in Proc. IEEE ASRU Workshop, 2005.
- L. Mangu, E. Brill, and A. Stolcke, "Finding consensus in speech recognition: word error minimization and other applications of confusion networks," Computer Speech and Language, 14(4), 373-400, 2000.
- E. Ringger, "A Robust Loose Coupling for Speech Recognition and Natural Language Understanding," 1995.