Title: Automatic Speech Recognition
1Automatic Speech Recognition
Chai Wutiwiwatchai, Ph.D.
Human Language Technology Laboratory, NECTEC
2Outline
- Hidden Markov Model (HMM)
- Sequence Modeling Problems
- HMM
- Automatic Speech Recognition (ASR)
- Classification of ASR
- ASR Formulation
- Components
3Sequence Modeling Problems
4Sequence Modeling
Forecasting Tomorrow's Weather
- Given weather sequences observed in the past
- Create a model that can predict tomorrow's weather
5Sequence Modeling
- Prediction is done by calculating probabilities for each candidate sequence (e.g., Sunny, Rain, Cloud)
- And selecting the maximum-probability sequence
6Sequence Modeling
Modeling Brazil's World-Cup Results
- Given match results in the past
- How likely is it that Brazil will never lose in the next tournament?
7Sequence Classification
Genomic DNA Sequencing
- Given a long genomic string
- Classify substrings into DNA region types
8Sequence Classification
- Where each region type (e.g., EXON, INTRON) has its own characteristics
9Sequence Classification
Football Event Detection
- Example events: Free-kick, Foul, Adv.
10In Conclusion
Problem 1 (Training): How to create a model λ given a training set of observation sequences O?
Problem 2 (Scoring): How to compute the probability of an observation sequence O given a model λ, i.e. P(O|λ)?
11Hidden Markov Model (HMM)
12Sequence Modeling Problem
Problem 1 (Training): How to create a model λ given a training set of observation sequences O?
Problem 2 (Scoring): How to compute the probability of an observation sequence O given a model λ, i.e. P(O|λ)?
13Example Urn-and-Ball 1
Rabiner, L.R., "A tutorial on hidden Markov models and selected applications in speech recognition," Proceedings of the IEEE, Vol. 77, No. 2, pp. 257-286, 1989.
- Mary plays a game with a blind observer
- (1) Mary initially selects an urn at random
- (2) Mary picks a ball and tells its color to the blind observer
- (3) Mary puts the ball back into the urn
- (4) Mary moves to the next urn at random
- (5) Repeat steps 2-4
14Example Urn-and-Ball 2
- Observation Sequence: the color sequence of the picked-up balls
- States: the identities of the urns
- State Transitions: the process of selecting the next urn
- Initial State: the identity of the first selected urn
15Example Urn-and-Ball 3
Initial state probabilities:
π1 = 0.4, π2 = 0.3, π3 = 0.3
Transition probabilities (states 1, 2, 3):
a11 = 0.3, a12 = 0.6, a13 = 0.1
a21 = 0.3, a22 = 0.5, a23 = 0.2
a31 = 0.7, a32 = 0.1, a33 = 0.2
Emission probabilities (ball colors R, G, B, Y):
b1(R) = 0.3, b1(G) = 0.2, b1(B) = 0.2, b1(Y) = 0.3
b2(R) = 0.33, b2(G) = 0.17, b2(B) = 0.17, b2(Y) = 0.33
b3(R) = 0.2, b3(G) = 0.3, b3(B) = 0.3, b3(Y) = 0.2
16HMM Notations
- An Observation Sequence
- A Set of N States
- Resulting State Sequence
- Transition Probabilities
- A Set of M Observation Symbols
- Emission Probabilities
- Initial State Probabilities
17HMM Notations In Short
- An HMM is compactly symbolized by λ = (A, B, π), where A = {a_ij} are the transition probabilities, B = {b_j(k)} the emission probabilities, and π = {π_i} the initial state probabilities (see the sketch below)
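To make the notation concrete, the urn-and-ball model of slide 15 can be written directly as λ = (A, B, π). The sketch below is only a restatement of those numbers in plain NumPy (ball colors indexed R=0, G=1, B=2, Y=3); it is illustrative and not part of the original slides.

import numpy as np

# Initial state probabilities pi_i (urns 1..3 -> indices 0..2)
pi = np.array([0.4, 0.3, 0.3])

# Transition probabilities a_ij = P(next urn j | current urn i)
A = np.array([[0.3, 0.6, 0.1],
              [0.3, 0.5, 0.2],
              [0.7, 0.1, 0.2]])

# Emission probabilities b_i(k) for ball colors R, G, B, Y
B = np.array([[0.30, 0.20, 0.20, 0.30],
              [0.33, 0.17, 0.17, 0.33],
              [0.20, 0.30, 0.30, 0.20]])

# Sanity check: pi and every row of A and B sum to 1
assert np.isclose(pi.sum(), 1) and np.allclose(A.sum(axis=1), 1) and np.allclose(B.sum(axis=1), 1)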
18HMM Topology
Left-to-Right Model
Ergodic Model
19Example World Cup
Brazil's match results modeled by an HMM
20Scoring
Forward Algorithm
- Define the forward variable α_t(i) = P(o_1 o_2 ... o_t, q_t = S_i | λ)
Step 1 (Initialization): α_1(i) = π_i b_i(o_1), 1 <= i <= N
Step 2 (Induction): α_{t+1}(j) = [ Σ_i α_t(i) a_ij ] b_j(o_{t+1}), 1 <= t <= T-1
Step 3 (Termination): P(O|λ) = Σ_i α_T(i)
(a short code sketch follows)
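A minimal NumPy sketch of these three steps, assuming the pi, A, B arrays of the urn-and-ball example above and an observation sequence given as symbol indices; this is a didactic illustration, not optimized code.

import numpy as np

def forward(pi, A, B, obs):
    # Step 1: initialization, alpha_1(i) = pi_i * b_i(o_1)
    alpha = pi * B[:, obs[0]]
    # Step 2: induction, alpha_{t+1}(j) = [sum_i alpha_t(i) a_ij] * b_j(o_{t+1})
    for o_t in obs[1:]:
        alpha = (alpha @ A) * B[:, o_t]
    # Step 3: termination, P(O | lambda) = sum_i alpha_T(i)
    return alpha.sum()

# Example: probability of observing the color sequence R, G, B (indices 0, 1, 2)
print(forward(pi, A, B, [0, 1, 2]))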
21Forward Algorithm
[Trellis illustration: states S1, S2, S3 unrolled over time; forward probabilities are accumulated along all paths]
22Example
What is the probability of generating a given observation sequence, i.e. P(O|λ), using the Forward algorithm?
23Scoring
Viterbi Algorithm
- Define the best-path score δ_t(i) = max over q_1 ... q_{t-1} of P(q_1 ... q_{t-1}, q_t = S_i, o_1 ... o_t | λ)
Step 1 (Initialization): δ_1(i) = π_i b_i(o_1), ψ_1(i) = 0
Step 2 (Recursion): δ_t(j) = max_i [ δ_{t-1}(i) a_ij ] b_j(o_t), ψ_t(j) = argmax_i [ δ_{t-1}(i) a_ij ]
Step 3 (Termination and backtracking): P* = max_i δ_T(i), q*_T = argmax_i δ_T(i), then q*_t = ψ_{t+1}(q*_{t+1})
(a short code sketch follows)
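A matching sketch of the Viterbi recursion with back-pointers, under the same assumptions as the Forward sketch above (pi, A, B as NumPy arrays, observations as symbol indices).

import numpy as np

def viterbi(pi, A, B, obs):
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[:, obs[0]]                   # Step 1: initialization
    for t in range(1, T):                          # Step 2: recursion
        scores = delta[t - 1][:, None] * A         # scores[i, j] = delta_{t-1}(i) * a_ij
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    best_prob = delta[-1].max()                    # Step 3: termination
    state = int(delta[-1].argmax())
    path = [state]
    for t in range(T - 1, 0, -1):                  # backtracking via psi
        state = int(psi[t][state])
        path.append(state)
    return list(reversed(path)), best_prob

print(viterbi(pi, A, B, [0, 1, 2]))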
24Viterbi Algorithm Illustration
[Trellis illustration: states S1, S2, S3 over time; only the best-scoring path into each state is kept]
25Viterbi vs. Forward
[Trellis comparison: the Forward algorithm sums over all paths into a state, whereas Viterbi takes the maximum over them]
26Example
What is the probability of generating a given observation sequence using the Viterbi algorithm?
27Viterbi in Log Domain
- Due to numerical underflow, it is common to run the Viterbi algorithm in the log domain
Step 1: φ_1(i) = log π_i + log b_i(o_1)
Step 2: φ_t(j) = max_i [ φ_{t-1}(i) + log a_ij ] + log b_j(o_t)
Step 3: log P* = max_i φ_T(i)
where φ_t(i) = log δ_t(i)
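In the log domain the same recursion only replaces products with sums; a minimal variant of the Viterbi sketch above (it returns just the best log-probability, and assumes no zero-valued model parameters so that all logs are finite).

import numpy as np

def viterbi_log(pi, A, B, obs):
    log_A, log_B = np.log(A), np.log(B)
    phi = np.log(pi) + log_B[:, obs[0]]                        # Step 1
    for o_t in obs[1:]:                                        # Step 2: max of sums
        phi = (phi[:, None] + log_A).max(axis=0) + log_B[:, o_t]
    return phi.max()                                           # Step 3: log P*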
28CDHMM
- Discrete HMM: models discrete symbols, such as World-Cup match results
- Continuous-Density HMM (CDHMM): models continuous values, such as speech feature vectors
- In a CDHMM the emission probability b_j(o) becomes a probability density function of the observation vector o
29Mixture Gaussian PDF
- A typical choice of pdf is an M-component Gaussian mixture: b_j(o) = Σ_{m=1..M} c_jm N(o; μ_jm, Σ_jm), with weights c_jm >= 0 and Σ_m c_jm = 1 (see the sketch below)
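A small sketch of evaluating such an M-component mixture for one state, using diagonal covariances as is typical in ASR; the weights, means, and variances below are made-up illustrative values, not taken from the slides.

import numpy as np

def gmm_pdf(o, weights, means, variances):
    # b_j(o) = sum_m c_m * N(o; mu_m, diag(var_m))
    o = np.asarray(o, dtype=float)
    total = 0.0
    for c, mu, var in zip(weights, means, variances):
        norm = np.prod(1.0 / np.sqrt(2.0 * np.pi * var))       # Gaussian normalizer
        expo = np.exp(-0.5 * np.sum((o - mu) ** 2 / var))      # exponent term
        total += c * norm * expo
    return total

# Hypothetical 2-component mixture over 2-dimensional feature vectors
w   = [0.6, 0.4]
mu  = [np.array([0.0, 0.0]), np.array([1.0, -1.0])]
var = [np.array([1.0, 1.0]), np.array([0.5, 2.0])]
print(gmm_pdf([0.2, -0.3], w, mu, var))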
30Training
Given O, tune λ to maximize P(O|λ)
Baum-Welch Algorithm
- Also called the Forward-Backward algorithm
- Step 1: Initialize λ
- Step 2: Compute state-occupancy probabilities using the current λ
- Step 3: Adjust λ based on the computed probabilities
- Step 4: Repeat steps 2-3 until convergence
31Baum-Welch Algorithm
- Define ξ_t(i,j) = P(q_t = S_i, q_{t+1} = S_j | O, λ), the probability of being in state i at time t and state j at time t+1 given O and λ
- Its state marginal γ_t(i) = Σ_j ξ_t(i,j) is the probability of being in state i at time t
32Baum-Welch Algorithm
- Adjust π: the re-estimated π_i = γ_1(i), the expected frequency of being in state i at time t = 1
- Adjust A: the re-estimated a_ij = Σ_t ξ_t(i,j) / Σ_t γ_t(i), i.e. the expected number of transitions from state i to state j divided by the expected number of transitions out of state i
33Baum-Welch Algorithm
- Adjust B: the re-estimated b_i(k) = Σ_{t: o_t = v_k} γ_t(i) / Σ_t γ_t(i), i.e. the expected number of times symbol v_k is observed in state i divided by the expected number of times in state i
(a compact code sketch of one full re-estimation pass follows)
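Putting slides 30-33 together, a compact single-sequence re-estimation pass for a discrete HMM might look as follows; this is a didactic sketch (no scaling, one training sequence) rather than production training code.

import numpy as np

def baum_welch_step(pi, A, B, obs):
    N, T = len(pi), len(obs)
    obs = np.asarray(obs)
    # Forward and backward variables
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    prob = alpha[T - 1].sum()                                   # P(O | lambda)

    # gamma_t(i): probability of being in state i at time t
    gamma = alpha * beta / prob
    # xi_t(i, j): probability of being in state i at t and state j at t+1
    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        xi[t] = alpha[t][:, None] * A * B[:, obs[t + 1]] * beta[t + 1] / prob

    # Re-estimation (adjust pi, A, B as in slides 32-33)
    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros_like(B)
    for k in range(B.shape[1]):
        new_B[:, k] = gamma[obs == k].sum(axis=0) / gamma.sum(axis=0)
    return new_pi, new_A, new_B, prob

Repeating this step until P(O|λ) stops increasing implements the training loop of slide 30.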
34In Conclusion
- Hidden Markov Model (HMM)
- for Sequence Modeling
35In Conclusion
- Hidden Markov Model (HMM)
- for Sequence Classification
36Questions ?
37Automatic Speech Recognition
38Advantages Of ASR
- Easy to perform, no specialized skill needed
- 3-4 times faster than typewriters
- 8-10 times faster than handwriting
- (if recognition is correct!)
- Convenient while carrying out multiple activities
- Economical equipment
39Classification Of ASR
- Continuity of speech
- 1) Isolated word recognition (IWR)
- 2) Continuous speech recognition (CSR)
- - Transcription/Understanding task
- - Restricted/Free grammar
- Speaking style
- 1) Isolated words or phrases (e.g., voice commands)
- 2) Connected speech (e.g., digit strings)
- 3) Read speech (e.g., dictation)
- 4) Fluent speech (e.g., broadcast news)
- 5) Spontaneous speech (e.g., conversation)
40Classification Of ASR
- Speaker dependency
- 1) Speaker dependent
- 2) Speaker independent
- Unit of reference template
- 1) Word unit
- 2) Subword unit
- - Phoneme
- - Syllable
41Status of ASR
42Pattern Recognition
[Block diagram: training set -> feature extraction -> training -> trained model; development set -> feature extraction -> testing -> result, with the result used for adjusting, giving a trained model with optimization]
43Feature Extraction
44ASR Problem Formulation 1
- Given an observation sequence O (a sequence of feature vectors) extracted from a speech signal
- Determine the underlying word (IWR) or word sequence (CSR)
45ASR Problem Formulation 2
- Solution: choose the word sequence with the maximum a posteriori probability, W* = argmax_W P(W|O)
46ASR Problem Formulation 3
- By Bayes' rule P(W|O) = P(O|W) P(W) / P(O); since P(O) is the same for all word sequences W, W* = argmax_W P(O|W) P(W)
- P(O|W): Acoustic Model (AM), the probability that the word sequence W is uttered as O
- P(W): Language Model (LM), how often the word sequence W is said
47IWR and CSR
Examples
- IWR: a digit 0-9 recognizer computes P(O|W) P(W) for each digit W and chooses the maximum
- CSR: a digit string recognizer computes P(O|W) P(W) for each digit string W and chooses the maximum
(a small IWR scoring sketch follows)
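For IWR this amounts to scoring O against each word HMM and keeping the best; a minimal sketch, assuming one trained model per digit stored as (pi, A, B) arrays and the forward() function from the earlier sketch (all names here are illustrative, not from the slides).

import numpy as np

def recognize_isolated_word(obs, word_models, word_priors):
    # argmax over words W of log P(O|W) + log P(W)
    best_word, best_score = None, -np.inf
    for word, (pi, A, B) in word_models.items():
        score = np.log(forward(pi, A, B, obs)) + np.log(word_priors[word])
        if score > best_score:
            best_word, best_score = word, score
    return best_word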
48Basic Idea
- It is impossible to build an HMM for every possible word sequence W
- Advantageously, the HMM of a longer unit can be built by concatenating HMMs of smaller units
- Each smaller-unit HMM is trained in the same way as a word HMM in IWR
49CSR Structure
50Acoustic Model
51Subword-Unit Acoustic Model
52Context-Dependent Model
- In continuous speech, the sound of a phoneme often changes with its context
- /a/ in d e t a b e s (database) vs. /a/ in d e t a (data)
- Using different HMMs for a phoneme in different contexts -> Context-Dependent Model
- Always using the same HMM for a phoneme regardless of context -> Context-Independent Model
53Context-Dependent Model
Context-Independent (Monophone) Model
Context-Dependent (Triphone) Model
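To make the monophone/triphone distinction concrete, the helper below expands a phoneme string into left-center+right triphone names, following the common l-c+r naming convention (as used, for example, in HTK); word-boundary handling is deliberately simplified and the example is illustrative only.

def to_triphones(phones):
    # Each phone becomes "left-phone+right"; boundary phones keep only the available context
    names = []
    for i, p in enumerate(phones):
        name = p
        if i > 0:
            name = phones[i - 1] + "-" + name
        if i < len(phones) - 1:
            name = name + "+" + phones[i + 1]
        names.append(name)
    return names

print(to_triphones("d e t a b e s".split()))   # /a/ becomes the model t-a+b
print(to_triphones("d e t a".split()))         # /a/ becomes the model t-a (a different HMM)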
54Tied-State Model 1
- An example of a Thai acoustic model (NECTEC)
- No. of monophone HMMs: 76
- No. of triphone HMMs: 49,631 !
- Not all triphones appear in the training data -> unseen triphones
- Many triphones occur infrequently -> data sparseness problem
- Triphones with similar context share HMM states -> Tied-State Triphone
55Tied-State Model 2
[Illustration: two triphones with similar context, p-aan and p-aang, sharing HMM states]
- 49,631 triphone HMMs can be constructed from a small set of shared states (1K to 3K states)
56Tied-State Model 3
- Decision Tree-Based State Tying
- Decision trees are built for each phoneme
57Conclusion Of AM In CSR
- Phoneme-based HMM
- Context-dependent triphone HMM
- Tied-state triphone HMM
58Language Model
59Language Modeling
- Typical language modeling techniques for CSR
- Regular Grammar Model
- - Finite-state model
- - Small vocabulary
- - Restricted grammar
- N-gram Model
- - Large vocabulary
- - Free grammar
- - Higher computation
60Regular Grammar Model 1
- Grammar is defined by a Finite-State Network
An example is the Voice Dialing task in the HTK Book, Cambridge University
61N-gram Model 1
- Large Vocabulary Continuous Speech Recognition (LVCSR)
- - Vocabulary size > 1,000 words
- - Unrestricted grammar
- - Example tasks
- - - Broadcast news transcription
- - - Dictation system
- - - Meeting transcription
- - Typically uses an N-gram language model
62N-gram Model 2
- Given a word sequence W = w1 w2 ... wM, e.g. W = w1 w2 w3 w4, compute P(W) by the chain rule: P(W) = P(w1) P(w2|w1) P(w3|w1 w2) P(w4|w1 w2 w3)
63 2-gram Model
- Assume wi depends only on the one previous word wi-1 -> 2-gram (Bigram) Model: P(W) ≈ P(w1) P(w2|w1) P(w3|w2) P(w4|w3)
64 3-gram Model
- Assume wi depends only on the two previous words wi-1 and wi-2 -> 3-gram (Trigram) Model: P(W) ≈ P(w1) P(w2|w1) P(w3|w1 w2) P(w4|w2 w3)
65Training N-gram
- P(wi|wi-1 wi-2), P(wi|wi-1) and P(wi) are computed from a training text:
- P(wi) = C(wi) / N, P(wi|wi-1) = C(wi-1 wi) / C(wi-1), P(wi|wi-1 wi-2) = C(wi-2 wi-1 wi) / C(wi-2 wi-1)
- where C(.) = number of occurrences of the word or word sequence, and N = total number of words
66Example Of N-gram Model
- Given a training text:
he is a man
he is a student
is he a man
- Compute P(W) for W = "is he a student" and W = "student is man"
(a small counting sketch follows below)
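The counts and probabilities for this toy corpus can be computed mechanically; a minimal sketch of the 2-gram case using the maximum-likelihood estimates of slide 65 (no smoothing yet).

from collections import Counter

training_text = ["he is a man", "he is a student", "is he a man"]

unigrams, bigrams = Counter(), Counter()
for sentence in training_text:
    words = sentence.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))
N = sum(unigrams.values())   # total number of words in the training text

def p_bigram(sentence):
    # P(W) ~ P(w1) * prod_i P(w_i | w_{i-1}), with P(w_i | w_{i-1}) = C(w_{i-1} w_i) / C(w_{i-1})
    words = sentence.split()
    p = unigrams[words[0]] / N
    for prev, cur in zip(words, words[1:]):
        if bigrams[(prev, cur)] == 0:
            return 0.0
        p *= bigrams[(prev, cur)] / unigrams[prev]
    return p

print(p_bigram("is he a student"))   # small but non-zero
print(p_bigram("student is man"))    # zero: "student is" and "is man" never occur -> needs smoothing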
67Smoothing
- N-grams that do not occur in the training text always get zero probabilities; however, these N-grams might occur in the real world
- Giving probabilities to unseen word pairs -> Smoothing
- Common smoothing techniques: Add-1, Deleted Interpolation, Good-Turing, Katz
68Add-1 Smoothing
- Eliminate zero probabilities by adding 1 occurrence to every N-gram: P(wi|wi-1) = (C(wi-1 wi) + 1) / (C(wi-1) + V), where V = vocabulary size
- Simple, but not so good! (see the small sketch below)
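Applied to the bigram sketch above, add-1 smoothing only changes the conditional estimate; V is the vocabulary size of the toy corpus.

V = len(unigrams)   # vocabulary size (5 distinct words in the toy corpus)

def p_bigram_add1(prev, cur):
    # (C(prev cur) + 1) / (C(prev) + V): every bigram gets at least one virtual count
    return (bigrams[(prev, cur)] + 1) / (unigrams[prev] + V)

print(p_bigram_add1("student", "is"))   # unseen bigram now receives a small non-zero probability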
69Deleted Interpolation
- Linear combination of different-order N-grams, e.g. P'(wi|wi-1 wi-2) = λ3 P(wi|wi-1 wi-2) + λ2 P(wi|wi-1) + λ1 P(wi)
- The interpolation weights λi can be estimated from held-out training text
- Better than Add-1, but still not so good!
70Conclusion Of LM In CSR
- Regular grammar model for small tasks
- N-gram model with smoothing for LVCSR
71Pronunciation Modeling
72Pronunciation Modeling
- Finding the best phoneme sequence given a word sequence W
- Not easy! Let's consider:
- - "the elephant" vs. "the butterfly" (the same word "the" is pronounced differently depending on the following sound)
- - "lead nitrate" vs. "he follows her lead" (the same spelling has two different pronunciations)
- A simple way is a Pronunciation Dictionary (sketched below)
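A pronunciation dictionary is essentially a mapping from words to one or more phoneme sequences; the tiny sketch below is purely illustrative (the phoneme strings are hypothetical, not from any real lexicon), with the decoder left to choose among the variants.

# Hypothetical multi-pronunciation dictionary: word -> list of candidate phoneme sequences
pron_dict = {
    "the":  [["dh", "ah"], ["dh", "iy"]],          # "the butterfly" vs. "the elephant"
    "lead": [["l", "eh", "d"], ["l", "iy", "d"]],  # "lead nitrate" vs. "her lead"
}

def pronunciations(word):
    # Return all candidate phoneme sequences; the decoder can try each variant
    return pron_dict.get(word.lower(), [])

print(pronunciations("the"))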
73Decoding
74Decoding Problem
- A critical problem in CSR is the infinite number of possible word sequences W
- Example: a digit string recognizer would have to compute P(O|W) P(W) for every possible digit string W and choose the maximum
75Decoding Solution 1
- We instead expand the LM, whether regular grammar or N-gram, into a word network with LM probabilities on its arcs
[Word network: a Start node branching to digit nodes 0-9, with arcs carrying LM probabilities such as P(0), P(0|0), P(0|0,0), P(1), P(1|0), ..., P(9), P(9|0), ...]
76Decoding Solution 2
- Then expand each word node into its phoneme sequence using the pronunciation dictionary
[Word network expanded with pronunciations: 0 -> z iy r o, 1 -> w a n, ..., 9 -> n ay n, keeping the LM probabilities on the arcs]
77Decoding Solution 3
- Finally, incorporate the phoneme HMMs into each phoneme node; these produce the AM probabilities
[Fully expanded decoding network: each phoneme node replaced by its HMM, producing AM probabilities P(O|0), P(O|1), ..., P(O|9) that are combined with the LM probabilities P(0), P(0|0), P(1), P(1|0), ..., P(9), P(9|0)]
78Decoding Solution 4
- Frame-Synchronous Viterbi Beam Search
- - The observation sequence O is fed frame by frame into the decoding network
- - The probabilities P(O|W) P(W) are accumulated for every possible path using the Viterbi algorithm
- - At every frame, paths whose cumulative probabilities fall below a threshold are eliminated (beam pruning)
- - After all frames have been processed, the path with the maximum cumulative probability gives the resulting word sequence
(a simplified sketch of the pruning step follows)
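A highly simplified token-passing sketch of frame-synchronous beam pruning; the decoding network is abstracted as arcs carrying a log transition score and an emission scoring function, and all names here are illustrative assumptions rather than part of any toolkit.

import math

def beam_search(frames, arcs, start_state, beam):
    # arcs: dict state -> list of (next_state, log_transition, log_emission_fn)
    # active: best cumulative log probability found so far for each state
    active = {start_state: 0.0}
    for o_t in frames:                                  # feed O in frame by frame
        new_active = {}
        for state, log_p in active.items():
            for next_state, log_a, log_b in arcs.get(state, []):
                score = log_p + log_a + log_b(o_t)      # Viterbi accumulation of P(O|W) P(W)
                if score > new_active.get(next_state, -math.inf):
                    new_active[next_state] = score
        if not new_active:
            raise ValueError("all paths pruned or network has no outgoing arcs")
        best = max(new_active.values())
        # Beam pruning: keep only paths within 'beam' of the current best
        active = {s: p for s, p in new_active.items() if p >= best - beam}
    best_state = max(active, key=active.get)
    return best_state, active[best_state]

In a real decoder each surviving hypothesis would also carry its word history so that the best word sequence can be read off at the end.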
79Decoding Solution 5
Node type    Emission probability              Transition probability
HMM node     HMM state emission probability    HMM state transition probability
Word node    None                              N-gram probability
80Decoding Illustration
81Question ?
82Building ASR
http://htk.eng.cam.ac.uk
83Procedure
- Preparation
- - Phoneme inventory design
- - Task grammar
- - Pronunciation dictionary
- - Phoneme list
- - Training speech data
- - Training data transcriptions
- Training
- - Acoustic model training
- - Language model training
- Evaluation
84Preparation 1
Task Grammar
Digit (Isolated Word):
word = one | two | ... | zero
( SENT-START word SENT-END )
Digit String (Continuous Speech):
word = one | two | ... | zero
( SENT-START <word> SENT-END )
85Preparation 2
Pronun Dict
zero        zero   s ii r oo
one         one    w a n
two         two    th uu
...
nine        nine   n aa j
SENT-START         sil
SENT-END           sil
86Preparation 3
87Training 1
Task Grammar: config/dgs.gram
Pronun Dict: config/dgs.dict
Phoneme List: config/monophn.list
Training Speech Data: wav/train/*.wav
Data Transcription: config/monophn.mlf
Script Files
- Feature Extraction: config/trcode.scp
- HMM Training: config/train.scp
Configuration Files
- Feature Extraction: config/code.config
- HMM Training: config/train.config
HMM Prototype: config/proto5s
88Training 2
Feature Extraction:
HCopy -T 1 -C config/code.config -S config/trcode.scp
HMM Initialization:
mkdir am/hmm_0
HCompV -T 1 -C config/train.config -f 0.01 -m -S config/train.scp -M am/hmm_0 config/proto5s
perl script/createmono.pl config/monophn.list am/hmm_0/proto5s am/hmm_0/vFloors am/hmm_0/newMacros
HMM Parameter Re-estimation:
mkdir am/hmm_1
HERest -T 1 -C config/train.config -I config/monophn.mlf -S config/train.scp -H am/hmm_0/newMacros -M am/hmm_1 config/monophn.list
Repeat HERest 3 times (obtaining am/hmm_3)
89Training 3
Ways to improve
- Increasing Gaussian Mixtures in HMM
- Tied-state triphone HMM
- Speaker adaptation
Increasing Gaussian Mixtures in HMM:
mkdir am/hmm2m_0
HHEd -T 1 -H am/hmm_3/newMacros -M am/hmm2m_0 config/mix1to2.hed config/monophn.list
mkdir am/hmm2m_1
HERest -T 1 -C config/train.config -v 1.0e-8 -m 1 -I config/monophn.mlf -S config/train.scp -H am/hmm2m_0/newMacros -M am/hmm2m_1 config/monophn.list
Repeat HERest 3 times (obtaining am/hmm2m_3)
90Evaluation
Compiling Language Model:
HParse lm/dgs.gram lm/dgs.wdnet
Test Speech Data: wav/test/*.wav
Script Files
- Feature Extraction: config/tscode.scp
- Testing: config/test.scp
Testing:
HCopy -T 1 -C config/code.config -S config/tscode.scp
HVite -H am/hmm2m_3/newMacros -S config/test.scp -l '*' -w lm/dgs.wdnet -i result/result.mlf config/dgs.dict config/monophn.list
91Demonstration
- Isolated Digit recognition
- Digit String recognition
92Improving
- In terms of training data
- Adding more speech training data
- Recording training speech that best matches the test speech
- - Speaking style
- - Environment
- - Equipment
- In terms of training algorithms
- Tied-state triphone HMMs
- Optimizing training parameters
- Trying algorithms for robust ASR
- - Noise classification and model selection
- - Noise/speaker adaptation
93Question ?