Title: Automatic Speech Recognition
1Automatic Speech Recognition
Chai Wutiwiwatchai, Ph.D.
Human Language Technology Laboratory, NECTEC
2Outline
- Hidden Markov Model (HMM)
- Sequence Modeling Problems
- HMM
- Automatic Speech Recognition (ASR)
- Classification of ASR
- ASR Formulation
- Components
3Sequence Modeling Problems
4Sequence Modeling
Forecasting Tomorrow's Weather
- Given weather sequences observed in the past
- Create a model that can predict tomorrow's weather
5Sequence Modeling
- Prediction is done by calculating probabilities for each candidate sequence (e.g., Sunny, Rain, Cloud)
- And selecting the maximum-probability sequence
6Sequence Modeling
Modeling Brazil's World-Cup Results
- Given match results in the past
- How likely is it that Brazil will never lose in the next tournament?
7Sequence Classification
Genomic DNA Sequencing
- Given a long genomic string
- Classify substrings into DNA region types
8Sequence Classification
- Where each region type (e.g., EXON, INTRON) has its own characteristics
9Sequence Classification
Football Event Detection
- Example events: Free-kick, Foul, Adv.
10In Conclusion
Problem 1 (Training): How to create a model λ given a training set of observation sequences O?
Problem 2 (Scoring): How to compute the probability of an observation sequence O given a model λ, i.e. P(O|λ)?
11Hidden Markov Model (HMM)
12Sequence Modeling Problem
Problem 1 (Training): How to create a model λ given a training set of observation sequences O?
Problem 2 (Scoring): How to compute the probability of an observation sequence O given a model λ, i.e. P(O|λ)?
13Example Urn-and-Ball 1
Rabiner, L.R., "A tutorial on hidden Markov models and selected applications in speech recognition," Proceedings of the IEEE, Vol. 77, No. 2, pp. 257-286, 1989.
- Mary plays a game with a blind observer
- (1) Mary initially selects an urn at random
- (2) Mary picks a ball and tells its color to the blind observer
- (3) Mary puts the ball back into the urn
- (4) Mary moves to the next urn at random
- (5) Repeat steps 2-4
14Example Urn-and-Ball 2
- Observation Sequence: the color sequence of the picked-up balls
- States: the identities of the urns
- State Transitions: the process of selecting the next urn
- Initial State: the identity of the first selected urn
15Example Urn-and-Ball 3
Initial state probabilities:
π1 = 0.4, π2 = 0.3, π3 = 0.3
Transition probabilities (states 1, 2, 3):
a11 = 0.3, a12 = 0.6, a13 = 0.1
a21 = 0.3, a22 = 0.5, a23 = 0.2
a31 = 0.7, a32 = 0.1, a33 = 0.2
Emission probabilities (ball colors R, G, B, Y):
b1(R) = 0.3, b1(G) = 0.2, b1(B) = 0.2, b1(Y) = 0.3
b2(R) = 0.33, b2(G) = 0.17, b2(B) = 0.17, b2(Y) = 0.33
b3(R) = 0.2, b3(G) = 0.3, b3(B) = 0.3, b3(Y) = 0.2
16HMM Notations
- An Observation Sequence
- A Set of N States
- Resulting State Sequence
- Transition Probabilities
- A Set of M Observation Symbols
- Emission Probabilities
- Initial State Probabilities
17HMM Notations In Short
- An HMM is compactly symbolized by λ = (A, B, π), where A = {a_ij} are the transition probabilities, B = {b_j(k)} the emission probabilities, and π = {π_i} the initial state probabilities (see the sketch below)
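To make the notation concrete, the urn-and-ball model of slide 15 can be written directly as λ = (A, B, π). The sketch below is only a restatement of those numbers in plain NumPy (ball colors indexed R=0, G=1, B=2, Y=3); it is illustrative and not part of the original slides.

import numpy as np

# Initial state probabilities pi_i (urns 1..3 -> indices 0..2)
pi = np.array([0.4, 0.3, 0.3])

# Transition probabilities a_ij = P(next urn j | current urn i)
A = np.array([[0.3, 0.6, 0.1],
              [0.3, 0.5, 0.2],
              [0.7, 0.1, 0.2]])

# Emission probabilities b_i(k) for ball colors R, G, B, Y
B = np.array([[0.30, 0.20, 0.20, 0.30],
              [0.33, 0.17, 0.17, 0.33],
              [0.20, 0.30, 0.30, 0.20]])

# Sanity check: pi and every row of A and B sum to 1
assert np.isclose(pi.sum(), 1) and np.allclose(A.sum(axis=1), 1) and np.allclose(B.sum(axis=1), 1)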
18HMM Topology
Left-to-Right Model
Ergodic Model
19Example World Cup
Brazil's match results modeled by an HMM
20Scoring
Forward Algorithm
- Define the forward variable α_t(i) = P(o_1 o_2 ... o_t, q_t = S_i | λ)
Step 1 (Initialization): α_1(i) = π_i b_i(o_1), 1 <= i <= N
Step 2 (Induction): α_{t+1}(j) = [ Σ_i α_t(i) a_ij ] b_j(o_{t+1}), 1 <= t <= T-1
Step 3 (Termination): P(O|λ) = Σ_i α_T(i)
(a short code sketch follows)
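A minimal NumPy sketch of these three steps, assuming the pi, A, B arrays of the urn-and-ball example above and an observation sequence given as symbol indices; this is a didactic illustration, not optimized code.

import numpy as np

def forward(pi, A, B, obs):
    # Step 1: initialization, alpha_1(i) = pi_i * b_i(o_1)
    alpha = pi * B[:, obs[0]]
    # Step 2: induction, alpha_{t+1}(j) = [sum_i alpha_t(i) a_ij] * b_j(o_{t+1})
    for o_t in obs[1:]:
        alpha = (alpha @ A) * B[:, o_t]
    # Step 3: termination, P(O | lambda) = sum_i alpha_T(i)
    return alpha.sum()

# Example: probability of observing the color sequence R, G, B (indices 0, 1, 2)
print(forward(pi, A, B, [0, 1, 2]))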
21Forward Algorithm
[Trellis illustration: states S1, S2, S3 unrolled over time; forward probabilities are accumulated along all paths]
22Example
What is the probability of generating a given observation sequence, i.e. P(O|λ), using the Forward algorithm?
23Scoring
Viterbi Algorithm
- Define the best-path score δ_t(i) = max over q_1 ... q_{t-1} of P(q_1 ... q_{t-1}, q_t = S_i, o_1 ... o_t | λ)
Step 1 (Initialization): δ_1(i) = π_i b_i(o_1), ψ_1(i) = 0
Step 2 (Recursion): δ_t(j) = max_i [ δ_{t-1}(i) a_ij ] b_j(o_t), ψ_t(j) = argmax_i [ δ_{t-1}(i) a_ij ]
Step 3 (Termination and backtracking): P* = max_i δ_T(i), q*_T = argmax_i δ_T(i), then q*_t = ψ_{t+1}(q*_{t+1})
(a short code sketch follows)
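A matching sketch of the Viterbi recursion with back-pointers, under the same assumptions as the Forward sketch above (pi, A, B as NumPy arrays, observations as symbol indices).

import numpy as np

def viterbi(pi, A, B, obs):
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[:, obs[0]]                   # Step 1: initialization
    for t in range(1, T):                          # Step 2: recursion
        scores = delta[t - 1][:, None] * A         # scores[i, j] = delta_{t-1}(i) * a_ij
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    best_prob = delta[-1].max()                    # Step 3: termination
    state = int(delta[-1].argmax())
    path = [state]
    for t in range(T - 1, 0, -1):                  # backtracking via psi
        state = int(psi[t][state])
        path.append(state)
    return list(reversed(path)), best_prob

print(viterbi(pi, A, B, [0, 1, 2]))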
24Viterbi Algorithm Illustration
[Trellis illustration: states S1, S2, S3 over time; only the best-scoring path into each state is kept]
25Viterbi vs. Forward
[Trellis comparison: the Forward algorithm sums over all paths into a state, whereas Viterbi takes the maximum over them]
26Example
What is the probability of generating a given observation sequence using the Viterbi algorithm?
27Viterbi in Log Domain
- Due to numerical underflow, it is common to run the Viterbi algorithm in the log domain
Step 1: φ_1(i) = log π_i + log b_i(o_1)
Step 2: φ_t(j) = max_i [ φ_{t-1}(i) + log a_ij ] + log b_j(o_t)
Step 3: log P* = max_i φ_T(i)
where φ_t(i) = log δ_t(i)
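In the log domain the same recursion only replaces products with sums; a minimal variant of the Viterbi sketch above (it returns just the best log-probability, and assumes no zero-valued model parameters so that all logs are finite).

import numpy as np

def viterbi_log(pi, A, B, obs):
    log_A, log_B = np.log(A), np.log(B)
    phi = np.log(pi) + log_B[:, obs[0]]                        # Step 1
    for o_t in obs[1:]:                                        # Step 2: max of sums
        phi = (phi[:, None] + log_A).max(axis=0) + log_B[:, o_t]
    return phi.max()                                           # Step 3: log P*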
28CDHMM
- Discrete HMM: models discrete symbols, such as World-Cup match results
- Continuous-Density HMM (CDHMM): models continuous values, such as speech feature vectors
- In a CDHMM the emission probability b_j(o) becomes a probability density function of the observation vector o
29Mixture Gaussian PDF
- A typical choice of pdf is an M-component Gaussian mixture: b_j(o) = Σ_{m=1..M} c_jm N(o; μ_jm, Σ_jm), with weights c_jm >= 0 and Σ_m c_jm = 1 (see the sketch below)
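A small sketch of evaluating such an M-component mixture for one state, using diagonal covariances as is typical in ASR; the weights, means, and variances below are made-up illustrative values, not taken from the slides.

import numpy as np

def gmm_pdf(o, weights, means, variances):
    # b_j(o) = sum_m c_m * N(o; mu_m, diag(var_m))
    o = np.asarray(o, dtype=float)
    total = 0.0
    for c, mu, var in zip(weights, means, variances):
        norm = np.prod(1.0 / np.sqrt(2.0 * np.pi * var))       # Gaussian normalizer
        expo = np.exp(-0.5 * np.sum((o - mu) ** 2 / var))      # exponent term
        total += c * norm * expo
    return total

# Hypothetical 2-component mixture over 2-dimensional feature vectors
w   = [0.6, 0.4]
mu  = [np.array([0.0, 0.0]), np.array([1.0, -1.0])]
var = [np.array([1.0, 1.0]), np.array([0.5, 2.0])]
print(gmm_pdf([0.2, -0.3], w, mu, var))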
30Training
Given O, tune λ to maximize P(O|λ)
Baum-Welch Algorithm
- Also called the Forward-Backward algorithm
- Step 1: Initialize λ
- Step 2: Compute state-occupancy probabilities using the current λ
- Step 3: Adjust λ based on the computed probabilities
- Step 4: Repeat steps 2-3 until convergence
31Baum-Welch Algorithm
- Define ξ_t(i,j) = P(q_t = S_i, q_{t+1} = S_j | O, λ), the probability of being in state i at time t and state j at time t+1 given O and λ
- Its state marginal γ_t(i) = Σ_j ξ_t(i,j) is the probability of being in state i at time t
32Baum-Welch Algorithm
- Adjust π: the re-estimated π_i = γ_1(i), the expected frequency of being in state i at time t = 1
- Adjust A: the re-estimated a_ij = Σ_t ξ_t(i,j) / Σ_t γ_t(i), i.e. the expected number of transitions from state i to state j divided by the expected number of transitions out of state i
33Baum-Welch Algorithm
- Adjust B: the re-estimated b_i(k) = Σ_{t: o_t = v_k} γ_t(i) / Σ_t γ_t(i), i.e. the expected number of times symbol v_k is observed in state i divided by the expected number of times in state i
(a compact code sketch of one full re-estimation pass follows)
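Putting slides 30-33 together, a compact single-sequence re-estimation pass for a discrete HMM might look as follows; this is a didactic sketch (no scaling, one training sequence) rather than production training code.

import numpy as np

def baum_welch_step(pi, A, B, obs):
    N, T = len(pi), len(obs)
    obs = np.asarray(obs)
    # Forward and backward variables
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    prob = alpha[T - 1].sum()                                   # P(O | lambda)

    # gamma_t(i): probability of being in state i at time t
    gamma = alpha * beta / prob
    # xi_t(i, j): probability of being in state i at t and state j at t+1
    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        xi[t] = alpha[t][:, None] * A * B[:, obs[t + 1]] * beta[t + 1] / prob

    # Re-estimation (adjust pi, A, B as in slides 32-33)
    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros_like(B)
    for k in range(B.shape[1]):
        new_B[:, k] = gamma[obs == k].sum(axis=0) / gamma.sum(axis=0)
    return new_pi, new_A, new_B, prob

Repeating this step until P(O|λ) stops increasing implements the training loop of slide 30.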
34In Conclusion
- Hidden Markov Model (HMM)
- for Sequence Modeling
35In Conclusion
- Hidden Markov Model (HMM)
- for Sequence Classification
36Questions ?
37Automatic Speech Recognition
38Advantages Of ASR
- Easy to perform, no specialized skill needed
- 3-4 times faster than typewriters
- 8-10 times faster than handwriting
- (if recognition is correct!)
- Convenient while carrying out multiple activities
- Economical equipment
39Classification Of ASR
- Continuity of speech
- 1) Isolated word recognition (IWR)
- 2) Continuous speech recognition (CSR)
- - Transcription/Understanding task
- - Restricted/Free grammar
- Speaking style
- 1) Isolated words or phrases (e.g., voice commands)
- 2) Connected speech (e.g., digit strings)
- 3) Read speech (e.g., dictation)
- 4) Fluent speech (e.g., broadcast news)
- 5) Spontaneous speech (e.g., conversation)
40Classification Of ASR
- Speaker dependency
- 1) Speaker dependent
- 2) Speaker independent
- Unit of reference template
- 1) Word unit
- 2) Subword unit
- - Phoneme
- - Syllable
41Status of ASR
42Pattern Recognition
[Block diagram: training set -> feature extraction -> training -> trained model; development set -> feature extraction -> testing -> result, with the result used for adjusting, giving a trained model with optimization]
43Feature Extraction
44ASR Problem Formulation 1
- Given an observation sequence O (a sequence of feature vectors) extracted from a speech signal
- Determine the underlying word (IWR) or word sequence (CSR)
45ASR Problem Formulation 2
- Solution: choose the word sequence with the maximum a posteriori probability, W* = argmax_W P(W|O)
46ASR Problem Formulation 3
- By Bayes' rule P(W|O) = P(O|W) P(W) / P(O); since P(O) is the same for all word sequences W, W* = argmax_W P(O|W) P(W)
- P(O|W): Acoustic Model (AM), the probability that the word sequence W is uttered as O
- P(W): Language Model (LM), how often the word sequence W is said
47IWR and CSR
Examples
- IWR: a digit 0-9 recognizer computes P(O|W) P(W) for each digit W and chooses the maximum
- CSR: a digit string recognizer computes P(O|W) P(W) for each digit string W and chooses the maximum
(a small IWR scoring sketch follows)
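For IWR this amounts to scoring O against each word HMM and keeping the best; a minimal sketch, assuming one trained model per digit stored as (pi, A, B) arrays and the forward() function from the earlier sketch (all names here are illustrative, not from the slides).

import numpy as np

def recognize_isolated_word(obs, word_models, word_priors):
    # argmax over words W of log P(O|W) + log P(W)
    best_word, best_score = None, -np.inf
    for word, (pi, A, B) in word_models.items():
        score = np.log(forward(pi, A, B, obs)) + np.log(word_priors[word])
        if score > best_score:
            best_word, best_score = word, score
    return best_word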
48Basic Idea
- It is impossible to build an HMM for every possible word sequence W
- Advantageously, the HMM of a longer unit can be built by concatenating HMMs of smaller units
- Each smaller-unit HMM is trained in the same way as a word HMM in IWR
49CSR Structure
50Acoustic Model
51Subword-Unit Acoustic Model
52Context-Dependent Model
- In continuous speech, the sound of a phoneme often changes with its context
- /a/ in d e t a b e s (database) vs. /a/ in d e t a (data)
- Using different HMMs for a phoneme in different contexts -> Context-Dependent Model
- Always using the same HMM for a phoneme regardless of context -> Context-Independent Model
53Context-Dependent Model
Context-Independent (Monophone) Model
Context-Dependent (Triphone) Model
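To make the monophone/triphone distinction concrete, the helper below expands a phoneme string into left-center+right triphone names, following the common l-c+r naming convention (as used, for example, in HTK); word-boundary handling is deliberately simplified and the example is illustrative only.

def to_triphones(phones):
    # Each phone becomes "left-phone+right"; boundary phones keep only the available context
    names = []
    for i, p in enumerate(phones):
        name = p
        if i > 0:
            name = phones[i - 1] + "-" + name
        if i < len(phones) - 1:
            name = name + "+" + phones[i + 1]
        names.append(name)
    return names

print(to_triphones("d e t a b e s".split()))   # /a/ becomes the model t-a+b
print(to_triphones("d e t a".split()))         # /a/ becomes the model t-a (a different HMM)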
54Tied-State Model 1
- An example of a Thai acoustic model (NECTEC)
- No. of monophone HMMs: 76
- No. of triphone HMMs: 49,631 !
- Not all triphones appear in the training data -> unseen triphones
- Many triphones occur infrequently -> data sparseness problem
- Triphones with similar context share HMM states -> Tied-State Triphone
55Tied-State Model 2
[Illustration: two triphones with similar context, p-aan and p-aang, sharing HMM states]
- 49,631 triphone HMMs can be constructed from a small set of shared states (1K to 3K states)
56Tied-State Model 3
- Decision Tree-Based State Tying
- Decision trees are built for each phoneme
57Conclusion Of AM In CSR
- Phoneme-based HMM
- Context-dependent triphone HMM
- Tied-state triphone HMM
58Language Model
59Language Modeling
- Typical language modeling techniques for CSR
- Regular Grammar Model
- - Finite-state model
- - Small vocabulary
- - Restricted grammar
- N-gram Model
- - Large vocabulary
- - Free grammar
- - Higher computation
60Regular Grammar Model 1
- Grammar is defined by a Finite-State Network
An example is the Voice Dialing task in the HTK Book, Cambridge University
61N-gram Model 1
- Large Vocabulary Continuous Speech Recognition (LVCSR)
- - Vocabulary size > 1,000 words
- - Unrestricted grammar
- - Example tasks
- - - Broadcast news transcription
- - - Dictation system
- - - Meeting transcription
- - Typically uses an N-gram language model
62N-gram Model 2
- Given a word sequence W = w1 w2 ... wM, e.g. W = w1 w2 w3 w4, compute P(W) by the chain rule: P(W) = P(w1) P(w2|w1) P(w3|w1 w2) P(w4|w1 w2 w3)
63 2-gram Model
- Assume wi depends only on the one previous word wi-1 -> 2-gram (Bigram) Model: P(W) ≈ P(w1) P(w2|w1) P(w3|w2) P(w4|w3)
64 3-gram Model
- Assume wi depends only on the two previous words wi-1 and wi-2 -> 3-gram (Trigram) Model: P(W) ≈ P(w1) P(w2|w1) P(w3|w1 w2) P(w4|w2 w3)
65Training N-gram
- P(wi|wi-1 wi-2), P(wi|wi-1) and P(wi) are computed from a training text:
- P(wi) = C(wi) / N, P(wi|wi-1) = C(wi-1 wi) / C(wi-1), P(wi|wi-1 wi-2) = C(wi-2 wi-1 wi) / C(wi-2 wi-1)
- where C(.) = number of occurrences of the word or word sequence, and N = total number of words
66Example Of N-gram Model
- Given a training text:
he is a man
he is a student
is he a man
- Compute P(W) for W = "is he a student" and W = "student is man"
(a small counting sketch follows below)
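The counts and probabilities for this toy corpus can be computed mechanically; a minimal sketch of the 2-gram case using the maximum-likelihood estimates of slide 65 (no smoothing yet).

from collections import Counter

training_text = ["he is a man", "he is a student", "is he a man"]

unigrams, bigrams = Counter(), Counter()
for sentence in training_text:
    words = sentence.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))
N = sum(unigrams.values())   # total number of words in the training text

def p_bigram(sentence):
    # P(W) ~ P(w1) * prod_i P(w_i | w_{i-1}), with P(w_i | w_{i-1}) = C(w_{i-1} w_i) / C(w_{i-1})
    words = sentence.split()
    p = unigrams[words[0]] / N
    for prev, cur in zip(words, words[1:]):
        if bigrams[(prev, cur)] == 0:
            return 0.0
        p *= bigrams[(prev, cur)] / unigrams[prev]
    return p

print(p_bigram("is he a student"))   # small but non-zero
print(p_bigram("student is man"))    # zero: "student is" and "is man" never occur -> needs smoothing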
67Smoothing
- N-grams that do not occur in the training text always get zero probabilities; however, these N-grams might occur in the real world
- Giving probabilities to unseen word pairs -> Smoothing
- Common smoothing techniques: Add-1, Deleted Interpolation, Good-Turing, Katz
68Add-1 Smoothing
- Eliminate zero probabilities by adding 1 occurrence to every N-gram: P(wi|wi-1) = (C(wi-1 wi) + 1) / (C(wi-1) + V), where V = vocabulary size
- Simple, but not so good! (see the small sketch below)
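Applied to the bigram sketch above, add-1 smoothing only changes the conditional estimate; V is the vocabulary size of the toy corpus.

V = len(unigrams)   # vocabulary size (5 distinct words in the toy corpus)

def p_bigram_add1(prev, cur):
    # (C(prev cur) + 1) / (C(prev) + V): every bigram gets at least one virtual count
    return (bigrams[(prev, cur)] + 1) / (unigrams[prev] + V)

print(p_bigram_add1("student", "is"))   # unseen bigram now receives a small non-zero probability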
69Deleted Interpolation
- Linear combination of different-order N-grams, e.g. P'(wi|wi-1 wi-2) = λ3 P(wi|wi-1 wi-2) + λ2 P(wi|wi-1) + λ1 P(wi)
- The interpolation weights λi can be estimated from held-out training text
- Better than Add-1, but still not so good!
70Conclusion Of LM In CSR
- Regular grammar model for small tasks
- N-gram model with smoothing for LVCSR
71Pronunciation Modeling
72Pronunciation Modeling
- Finding the best phoneme sequence given a word sequence W
- Not easy! Let's consider:
- - "the elephant" vs. "the butterfly" (the same word "the" is pronounced differently depending on the following sound)
- - "lead nitrate" vs. "he follows her lead" (the same spelling has two different pronunciations)
- A simple way is a Pronunciation Dictionary (sketched below)
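A pronunciation dictionary is essentially a mapping from words to one or more phoneme sequences; the tiny sketch below is purely illustrative (the phoneme strings are hypothetical, not from any real lexicon), with the decoder left to choose among the variants.

# Hypothetical multi-pronunciation dictionary: word -> list of candidate phoneme sequences
pron_dict = {
    "the":  [["dh", "ah"], ["dh", "iy"]],          # "the butterfly" vs. "the elephant"
    "lead": [["l", "eh", "d"], ["l", "iy", "d"]],  # "lead nitrate" vs. "her lead"
}

def pronunciations(word):
    # Return all candidate phoneme sequences; the decoder can try each variant
    return pron_dict.get(word.lower(), [])

print(pronunciations("the"))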
73Decoding
74Decoding Problem
- A critical problem in CSR is the infinite number of possible word sequences W
- Example: a digit string recognizer would have to compute P(O|W) P(W) for every possible digit string W and choose the maximum
75Decoding Solution 1
- We instead expand the LM, whether regular grammar or N-gram, into a word network with LM probabilities on its arcs
[Word network: a Start node branching to digit nodes 0-9, with arcs carrying LM probabilities such as P(0), P(0|0), P(0|0,0), P(1), P(1|0), ..., P(9), P(9|0), ...]
76Decoding Solution 2
- Then expand each word node into its phoneme sequence using the pronunciation dictionary
[Word network expanded with pronunciations: 0 -> z iy r o, 1 -> w a n, ..., 9 -> n ay n, keeping the LM probabilities on the arcs]
77Decoding Solution 3
- Finally, incorporate the phoneme HMMs into each phoneme node; these produce the AM probabilities
[Fully expanded decoding network: each phoneme node replaced by its HMM, producing AM probabilities P(O|0), P(O|1), ..., P(O|9) that are combined with the LM probabilities P(0), P(0|0), P(1), P(1|0), ..., P(9), P(9|0)]
78Decoding Solution 4
- Frame-Synchronous Viterbi Beam Search
- - The observation sequence O is fed frame by frame into the decoding network
- - The probabilities P(O|W) P(W) are accumulated for every possible path using the Viterbi algorithm
- - At every frame, paths whose cumulative probabilities fall below a threshold are eliminated (beam pruning)
- - After all frames have been processed, the path with the maximum cumulative probability gives the resulting word sequence
(a simplified sketch of the pruning step follows)
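A highly simplified token-passing sketch of frame-synchronous beam pruning; the decoding network is abstracted as arcs carrying a log transition score and an emission scoring function, and all names here are illustrative assumptions rather than part of any toolkit.

import math

def beam_search(frames, arcs, start_state, beam):
    # arcs: dict state -> list of (next_state, log_transition, log_emission_fn)
    # active: best cumulative log probability found so far for each state
    active = {start_state: 0.0}
    for o_t in frames:                                  # feed O in frame by frame
        new_active = {}
        for state, log_p in active.items():
            for next_state, log_a, log_b in arcs.get(state, []):
                score = log_p + log_a + log_b(o_t)      # Viterbi accumulation of P(O|W) P(W)
                if score > new_active.get(next_state, -math.inf):
                    new_active[next_state] = score
        if not new_active:
            raise ValueError("all paths pruned or network has no outgoing arcs")
        best = max(new_active.values())
        # Beam pruning: keep only paths within 'beam' of the current best
        active = {s: p for s, p in new_active.items() if p >= best - beam}
    best_state = max(active, key=active.get)
    return best_state, active[best_state]

In a real decoder each surviving hypothesis would also carry its word history so that the best word sequence can be read off at the end.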
79Decoding Solution 5
Node type    Emission probability              Transition probability
HMM node     HMM state emission probability    HMM state transition probability
Word node    None                              N-gram probability
80Decoding Illustration
81Question ?
82Building ASR
http://htk.eng.cam.ac.uk
83Procedure
- Preparation
- - Phoneme inventory design
- - Task grammar
- - Pronunciation dictionary
- - Phoneme list
- - Training speech data
- - Training data transcriptions
- Training
- - Acoustic model training
- - Language model training
- Evaluation
84Preparation 1
Task Grammar
Digit (Isolated Word):
word = one | two | ... | zero
( SENT-START word SENT-END )
Digit String (Continuous Speech):
word = one | two | ... | zero
( SENT-START <word> SENT-END )
85Preparation 2
Pronun Dict
zero        zero   s ii r oo
one         one    w a n
two         two    th uu
...
nine        nine   n aa j
SENT-START         sil
SENT-END           sil
86Preparation 3
87Training 1
Task Grammar: config/dgs.gram
Pronun Dict: config/dgs.dict
Phoneme List: config/monophn.list
Training Speech Data: wav/train/*.wav
Data Transcription: config/monophn.mlf
Script Files
- Feature Extraction: config/trcode.scp
- HMM Training: config/train.scp
Configuration Files
- Feature Extraction: config/code.config
- HMM Training: config/train.config
HMM Prototype: config/proto5s
88Training 2
Feature Extraction:
HCopy -T 1 -C config/code.config -S config/trcode.scp
HMM Initialization:
mkdir am/hmm_0
HCompV -T 1 -C config/train.config -f 0.01 -m -S config/train.scp -M am/hmm_0 config/proto5s
perl script/createmono.pl config/monophn.list am/hmm_0/proto5s am/hmm_0/vFloors am/hmm_0/newMacros
HMM Parameter Re-estimation:
mkdir am/hmm_1
HERest -T 1 -C config/train.config -I config/monophn.mlf -S config/train.scp -H am/hmm_0/newMacros -M am/hmm_1 config/monophn.list
Repeat HERest 3 times (obtaining am/hmm_3)
89Training 3
Ways to improve
- Increasing Gaussian Mixtures in HMM
- Tied-state triphone HMM
- Speaker adaptation
Increasing Gaussian Mixtures in HMM:
mkdir am/hmm2m_0
HHEd -T 1 -H am/hmm_3/newMacros -M am/hmm2m_0 config/mix1to2.hed config/monophn.list
mkdir am/hmm2m_1
HERest -T 1 -C config/train.config -v 1.0e-8 -m 1 -I config/monophn.mlf -S config/train.scp -H am/hmm2m_0/newMacros -M am/hmm2m_1 config/monophn.list
Repeat HERest 3 times (obtaining am/hmm2m_3)
90Evaluation
Compiling Language Model:
HParse lm/dgs.gram lm/dgs.wdnet
Test Speech Data: wav/test/*.wav
Script Files
- Feature Extraction: config/tscode.scp
- Testing: config/test.scp
Testing:
HCopy -T 1 -C config/code.config -S config/tscode.scp
HVite -H am/hmm2m_3/newMacros -S config/test.scp -l '*' -w lm/dgs.wdnet -i result/result.mlf config/dgs.dict config/monophn.list
91Demonstration
- Isolated Digit recognition
- Digit String recognition
92Improving
- In terms of training data
- Adding more speech training data
- Recording training speech that best matches the test speech
- - Speaking style
- - Environment
- - Equipment
- In terms of training algorithms
- Tied-state triphone HMMs
- Optimizing training parameters
- Trying algorithms for robust ASR
- - Noise classification and model selection
- - Noise/speaker adaptation
93Question ?