Title: Decoding Techniques for Automatic Speech Recognition
1. Decoding Techniques for Automatic Speech Recognition
- Florian Metze
- Interactive Systems Laboratories
2. Outline
- Decoding in ASR
- Search Problem
- Evaluation Problem
- Viterbi Algorithm
- Tree Search
- Re-Entry
- Recombination
3. The ASR Problem: argmax_W p(W|x)
- Two major knowledge sources
  - Acoustic Model p(x|W)
  - Language Model P(W)
- Bayes: p(W|x) = p(x|W) P(W) / p(x)
- Search problem: argmax_W p(x|W) P(W) (spelled out below)
- p(x|W) consists of Hidden Markov Models
  - Dictionary defines the state sequence: hello = /hh eh l ow/
  - Full model: concatenation of states (i.e. sounds)
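The decision rule from this slide, written out step by step; this only restates the bullets above and makes explicit why p(x) may be dropped from the maximization:

```latex
\begin{align*}
\hat{W} &= \arg\max_{W}\; p(W \mid x)
         = \arg\max_{W}\; \frac{p(x \mid W)\,P(W)}{p(x)}   % Bayes' rule
         = \arg\max_{W}\; p(x \mid W)\,P(W)
\end{align*}
% p(x) does not depend on the word sequence W, so it cannot change the argmax.
% p(x|W): acoustic model (HMMs), P(W): language model.
```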
4. Target Function / Measure
- WER: minimum editing distance between reference and hypothesis (edit-distance sketch below)
- Example
  - REF: the quick brown fox jumps -- over
  - HYP: --- quick brown fox jump  is over
  - Errors: 1 deletion (the), 1 substitution (jumps/jump), 1 insertion (is)
- WER = 3 errors / 6 reference words = 50%
- Different measure from max p(W|x)!
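A minimal Python sketch (not from the slides) of the word-level edit distance behind WER; the function and variable names are illustrative only. WER is the returned edit count divided by the number of reference words:

```python
# Minimal sketch: word-level Levenshtein distance between reference and
# hypothesis; WER = edit count / number of reference words.
def word_edit_distance(ref, hyp):
    """Return the minimum number of substitutions, deletions and insertions
    needed to turn the reference word list into the hypothesis word list."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                                  # delete all of ref[:i]
    for j in range(len(hyp) + 1):
        d[0][j] = j                                  # insert all of hyp[:j]
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,           # deletion
                          d[i][j - 1] + 1,           # insertion
                          d[i - 1][j - 1] + sub)     # match or substitution
    return d[-1][-1]

errors = word_edit_distance("the quick brown fox jumps over".split(),
                            "quick brown fox jump is over".split())
print(errors)   # 3 edits: one deletion, one substitution, one insertion
```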
5. A Simpler Problem: Evaluation
- So far we have
  - Dictionary: hello = /hh eh l ow/
  - Acoustic Model: p_hh(x), p_eh(x), p_l(x), p_ow(x)
  - Language Model: P(hello world)
  - State sequence: /hh eh l ow w er l d/
- Given W and x: an alignment is needed!
[Figure: observation sequence x to be aligned to the states / hh eh l ow /]
7. The Viterbi Algorithm
- Beam search from left to right
- The resulting alignment is the best match given the state models p_i(x) and the observations x
[Figure: trellis of local scores p_i(x) for the states /hh eh l ow/ over time]
8. The Viterbi Algorithm (cont'd)
- Evaluation problem = Dynamic Time Warping
- The best alignment for given W, x, and p_i(x) is found by locally adding the scores (-log p) for states and transitions (alignment sketch below)
[Figure: the same trellis with accumulated path scores; the best path through /hh eh l ow/ is obtained by dynamic programming]
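A minimal Python sketch of this dynamic-programming alignment for a strictly left-to-right state sequence. It is an illustration of the idea, not the decoder's implementation, and assumes the local scores -log p_i(x_t) are already available as a matrix:

```python
import numpy as np

# Minimal sketch: Viterbi alignment of a fixed left-to-right state sequence
# (e.g. /hh eh l ow/) to T frames. local[s, t] holds the local score
# -log p_s(x_t) for state s at frame t.
def viterbi_align(local):
    S, T = local.shape
    INF = np.inf
    acc = np.full((S, T), INF)       # accumulated path scores
    back = np.zeros((S, T), int)     # backpointers: best previous state
    acc[0, 0] = local[0, 0]          # must start in the first state
    for t in range(1, T):
        for s in range(S):
            # left-to-right topology: either stay in s or advance from s-1
            stay = acc[s, t - 1]
            advance = acc[s - 1, t - 1] if s > 0 else INF
            back[s, t] = s if stay <= advance else s - 1
            acc[s, t] = min(stay, advance) + local[s, t]
    # trace back the best alignment ending in the last state
    path = [S - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[path[-1], t])
    return acc[S - 1, T - 1], path[::-1]   # total score, state index per frame

# Example: 4 states (/hh eh l ow/) against 7 frames of made-up local scores.
scores = np.random.rand(4, 7)
total, alignment = viterbi_align(scores)
```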
9. Pronunciation Prefix Trees (PPT)
- Tree representation of the search dictionary
- Very compact → fast!
- The Viterbi Algorithm also works for trees
- Example (a construction sketch follows below):
  - BROADWAY  B R OA D W EY
  - BROADLY   B R OA D L IE
  - BUT       B AH T
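A small Python sketch (illustrative, not the decoder's data structure) of how such a prefix tree can be built from a pronunciation dictionary, using the three example words above; the shared prefix B R OA D is stored only once:

```python
# Minimal sketch: building a pronunciation prefix tree from a dictionary
# that maps words to phone sequences.
class PPTNode:
    def __init__(self, phone=None):
        self.phone = phone       # phone label on the arc into this node
        self.children = {}       # phone -> PPTNode
        self.word = None         # set at the node where a word ends

def build_ppt(dictionary):
    root = PPTNode()
    for word, phones in dictionary.items():
        node = root
        for phone in phones:
            node = node.children.setdefault(phone, PPTNode(phone))
        node.word = word         # mark the word end
    return root

ppt = build_ppt({
    "BROADWAY": ["B", "R", "OA", "D", "W", "EY"],
    "BROADLY":  ["B", "R", "OA", "D", "L", "IE"],
    "BUT":      ["B", "AH", "T"],
})
```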
10. Viterbi Search for PPTs
- A PPT is traversed in a time-synchronous way
- Apply the Viterbi Algorithm on the
  - state level (sub-phonemic units b, m, e)
    - constrained by the HMM topology
  - phone level
    - constrained by the PPT
- What do we do when we reach the end of a word?
11. Re-Entrant PPTs for Continuous Speech
- Isolated word recognition
  - Search terminates in the leaves of the PPT
- Decoding of word sequences
  - Re-enter the PPT and store the Viterbi path using a backpointer table (sketch below)
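One possible shape of such a backpointer table, sketched in Python; the field and function names are hypothetical, not taken from any particular decoder. Every surviving word-end hypothesis appends an entry, and the best word sequence is read off at the end by following the predecessor links:

```python
from dataclasses import dataclass

# Hypothetical backpointer-table entry for re-entrant PPT decoding.
@dataclass
class Backpointer:
    word: str          # word whose leaf was reached
    start_frame: int   # frame at which this word hypothesis started
    end_frame: int     # frame at which its leaf was reached
    score: float       # accumulated (-log) path score up to end_frame
    prev: int          # index of the predecessor entry, -1 at utterance start

table = []             # one entry per surviving word-end hypothesis

def trace_back(table, best_index):
    """Follow predecessor links from the best final entry back to the start."""
    words = []
    i = best_index
    while i >= 0:
        words.append(table[i].word)
        i = table[i].prev
    return list(reversed(words))
```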
12. Problem: Branching Factor
- Imagine a sequence of 3 words with a 10k vocabulary
  - 10k^3 = 10^12 = 1000G word sequences (potentially)
- Not everything will be expanded, of course
- Viterbi approximation → path recombination
  - e.g. compare P(Candy | hi I am) with P(Candy | hello I am)
13. Path Recombination
- At time t:
  - Path1 = w_1 ... w_N with score s1
  - Path2 = v_1 ... v_M with score s2
- where
  - s1 = p(x_1 ... x_t | w_1 ... w_N) · ∏_i P(w_i | w_{i-1}, w_{i-2})
  - s2 = p(x_1 ... x_t | v_1 ... v_M) · ∏_i P(v_i | v_{i-1}, v_{i-2})
- In the end, we're only interested in the best path! (recombination sketch below)
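A small Python sketch of this recombination under the Viterbi approximation, assuming -log scores and a trigram LM; the function and variable names are illustrative only. Paths whose last two words agree get identical future LM scores, so only the better one needs to survive:

```python
# Minimal sketch: Viterbi-style path recombination at a word end. All paths
# that re-enter the PPT at time t with the same recombination key (here the
# last n-1 words needed by the trigram LM) are merged, keeping the best one.
def recombine(paths, n=3):
    """paths: list of (word_sequence, neg_log_score) hypotheses at time t."""
    best = {}
    for words, score in paths:
        key = tuple(words[-(n - 1):])          # LM history defines the key
        if key not in best or score < best[key][1]:
            best[key] = (words, score)         # keep the lower -log score
    return list(best.values())

hyps = [(("hi", "i", "am"), 42.7), (("hello", "i", "am"), 41.3)]
print(recombine(hyps))   # only the better of the two "... i am" paths survives
```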
14. Path Recombination (cont'd)
- To expand the search space into a new root
  - Pick the path with the best score so far (Viterbi approximation)
  - Initialize scores and backpointers for the root node according to the best predecessor word
  - Store the left-context model information with the last phone from the predecessor (context-dependent acoustic models: /s ih t/ → /l ih p/)
15. Problem with Re-Entry
- For a correct use of the Viterbi algorithm, the choice of the best path must include the score for the transition from the predecessor word to the successor word
- The word identity is not known at the root level; the choice of the best predecessor therefore cannot be made at this point
16. Consequences
- Wrong predecessor words
  - → language model information is available only at the leaf level
- Wrong word boundaries
  - The starting point of the successor word is determined without any language model information
- Incomplete linguistic information
  - Open (i.e. wide) pruning thresholds are needed for beam search
17. Three-Pass Search Strategy
- Pass 1: search on a tree-organized lexicon (PPT)
  - Aggressive path recombination at word ends
  - Use linguistic information only approximately
  - Generate a list of starting words for each frame
- Pass 2: search on a flat-organized lexicon
  - Fix the word segmentation from the first pass
  - Full use of the language model (often needs a third pass)
18. Three-Pass Decoder Results
- Q4g system with a cache for acoustic scores
- 4000 acoustic models trained on BNESST
- 40k vocabulary
- Test on readBN data

  Search Pass          Error rate (%)   Real-time factor
  Tree Pass            22.0             9.6
  Flat Pass            18.8             0.9
  Lattice Rescoring    15.0             0.2
19. One-Pass Decoder: Motivation
- The efficient use of all available knowledge sources as early as possible should result in faster decoding
- Use the same engine to decode along
  - statistical n-gram language models with arbitrary n
  - context-free grammars (CFG)
  - word graphs
20. Linguistic States
- Linguistic state, examples:
  - (n-1)-word history for a statistical n-gram LM
  - grammar state for CFGs
  - (lattice node, word history) for word graphs
- To fully use the linguistic knowledge source, the linguistic state has to be kept during decoding
- Path recombination has to be delayed until the word identity is known
21. Linguistic Context Assignment
- Key idea: establish a linguistic polymorphism for each node of the PPT
- Maintain a list of linguistically morphed instances in each node
- Each instance stores its own backpointer and the scores for each state of the underlying HMM, with respect to the linguistic state of that instance (data-structure sketch below)
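A Python sketch of what such a node with linguistically morphed instances could look like; the class and field names are hypothetical, but each instance carries its own linguistic state, per-HMM-state scores, and backpointer as described above, and instances are created on demand:

```python
from dataclasses import dataclass, field

# Hypothetical structure: a PPT node carrying a list of linguistically
# morphed instances. Each instance is specific to one linguistic state
# (e.g. the two-word trigram history), so recombination can be delayed
# until the word identity is known.
@dataclass
class Instance:
    linguistic_state: tuple        # e.g. ("bullets", "over") for a trigram LM
    state_scores: list             # accumulated -log score per HMM state
    backpointer: int               # index into the backpointer table

@dataclass
class PPTNode:
    phone: str
    children: dict = field(default_factory=dict)   # phone -> PPTNode
    instances: list = field(default_factory=list)  # linguistically morphed copies

def get_instance(node, lct, num_states=3):
    """Return the instance of `node` for linguistic context `lct`,
    allocating it dynamically on first use."""
    for inst in node.instances:
        if inst.linguistic_state == lct:
            return inst
    inst = Instance(lct, [float("inf")] * num_states, -1)
    node.instances.append(inst)
    return inst
```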
22. PPT with Linguistically Morphed Instances
[Figure: the PPT for BROADWAY (B R OA D W EY), BROADLY (B R OA D L IE), and BUT (B AH T), with linguistically morphed instances attached to its nodes]
- Typically a 3-gram LM, i.e. P(W) = ∏_i P(w_i | W_i), where W_i is the word history, e.g. P(broadway | bullets over)
23. Language Model Lookahead
- Since the linguistic state is known, the complete LM information P(W) can be applied to the instances, given the possible successor words for that node of the PPT (lookahead sketch below)
- Let
  - lct = linguistic context/state of instance i from node n
  - path(w) = path of word w in the PPT
  - π(n, lct) = min over all w with node n ∈ path(w) of the LM score for P(w | lct)
  - score(i) = p(x_1 ... x_t | w_1 ... w_N) · P(w_N | ...) · π(n, lct)
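A Python sketch of this lookahead, computed bottom-up over the tree (illustrative only; it assumes the node layout from the earlier PPT sketch and LM scores given as -log probabilities, so the minimum over reachable words is the best score any completion of the current prefix can still achieve):

```python
import math

# Minimal sketch: language-model lookahead scores π(n, lct) for a PPT.
def lm_lookahead(node, lm_score, lct):
    """node: PPT node with .children (dict) and .word (str or None);
    lm_score(w, lct): -log P(w | lct) for linguistic context lct.
    Annotates every node with node.pi = π(node, lct) and returns it."""
    best = math.inf
    if getattr(node, "word", None) is not None:        # a word ends here
        best = lm_score(node.word, lct)
    for child in node.children.values():
        best = min(best, lm_lookahead(child, lm_score, lct))
    node.pi = best
    return best

# During search, an instance's pruning score would then combine its acoustic
# score, the LM score of the words decoded so far, and node.pi, so LM
# information is applied before the word identity is known.
```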
24. LM Lookahead (cont'd)
- When the word becomes unique, the exact LM score is already incorporated and no explicit word transition needs to be computed
- The LM scores π are updated on demand, based on a compressed PPT (smearing of LM scores)
- Tighter pruning thresholds can be used since the language model information is no longer delayed
25. Early Path Recombination
- Path recombination can be performed as soon as the word becomes unique, which is usually a few nodes before reaching the leaf; this reduces the number of unique linguistic contexts and instances
- This is particularly effective for cross-word models due to the fan-out in the right-context models
26. One-Pass Decoder: Summary
- One-pass decoder based on
  - one copy of the tree with dynamically allocated instances
  - early path recombination
  - full language model lookahead
- Linguistic knowledge sources
  - statistical n-grams, with n > 3 possible
  - context-free grammars
27. Results

  Task      Real-time factor        Error rate (%)
            3-pass    1-pass        3-pass    1-pass
  VM        6.8       4.0           26.9      26.9
  readBN    12.2      4.2           14.7      13.9
  Meeting   55        38            43.7      43.4
28. Remarks on Speed-Up
- The speed-up ranges from a factor of almost 3 for the readBN task down to 1.4 for the meeting data
- The speed-up depends strongly on matched domain conditions
- The decoder profits from sharp language models
- LM lookahead is less effective for weak language models due to unmatched conditions
29. Memory Usage (Q4g)

  Module                 3-pass    1-pass
  Acoustic Models        44 MB     44 MB
  Language Model         87 MB     82 MB
  Overhead               16 MB     16 MB
  Decoder (permanent)    120 MB    18 MB
  Decoder (dynamic)      100 MB    20 MB
  Total                  367 MB    180 MB
30. Summary
- Decoding is time- and memory-consuming
- Search errors occur when beams are too tight (→ trade-off) or the Viterbi assumption is violated
- State-of-the-art one-pass decoder
  - tree structure for efficiency
  - linguistically morphed instances of nodes and leaves
- Other approaches exist (stack decoding, a-posteriori decoding, ...)