Title: Natural Language Processing (11): Speech Recognition
1. Natural Language Processing (11): Speech Recognition
- Dr. Xuan Wang
- Intelligence Computing Research Center
- Harbin Institute of Technology Shenzhen Graduate School
- Slides from Dr. Mary P. Harper, ECE, Purdue University
2. LVCSR
- Large Vocabulary Continuous Speech Recognition
- 20,000-64,000 words
- Speaker independent (vs. speaker-dependent)
- Continuous speech (vs. isolated-word)
3. LVCSR
- Build a statistical model of the speech-to-words process
- Collect lots and lots of speech, and transcribe all the words
- Train the model on the labeled speech
- Paradigm: Supervised Machine Learning + Search
4. Speech Recognition Architecture
5. The Noisy Channel Model
- Search through the space of all possible sentences.
- Pick the one that is most probable given the waveform.
6. The Noisy Channel Model (II)
- What is the most likely sentence out of all sentences in the language L, given some acoustic input O?
- Treat the acoustic input O as a sequence of individual observations: O = o1, o2, o3, ..., ot
- Define a sentence as a sequence of words: W = w1, w2, w3, ..., wn
7. The Noisy Channel Model (III)
- Probabilistic implication: pick the sentence with the highest probability
- We can use Bayes' rule to rewrite this
- Since the denominator is the same for each candidate sentence W, we can ignore it for the argmax
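Spelled out (a reconstruction of the equations the slide shows as images), the Bayes-rule rewrite is:

$$\hat{W} \;=\; \operatorname*{argmax}_{W \in L} P(W \mid O) \;=\; \operatorname*{argmax}_{W \in L} \frac{P(O \mid W)\,P(W)}{P(O)} \;=\; \operatorname*{argmax}_{W \in L} P(O \mid W)\,P(W)$$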
8. The Noisy Channel Model
- P(O|W) is the likelihood; P(W) is the prior.
9. The Noisy Channel Model
- Ignoring the denominator leaves us with two factors: P(Source) and P(Signal|Source)
10. Speech Architecture meets the Noisy Channel
11. Architecture: Five easy pieces
- Feature extraction
- Acoustic Modeling
- HMMs, Lexicons, and Pronunciation
- Decoding
- Language Modeling
12. Feature Extraction
- Digitize Speech
- Extract Frames
13. Digitizing Speech
14. Digitizing Speech (A-D)
- Sampling
- measuring the amplitude of the signal at time t
- 16,000 Hz (samples/sec): microphone (wideband)
- 8,000 Hz (samples/sec): telephone
- Why?
- Need at least 2 samples per cycle
- max measurable frequency is half the sampling rate
- Human speech < 10,000 Hz, so we need at most 20 kHz
- Telephone speech is filtered at 4 kHz, so 8 kHz is enough
15. Digitizing Speech (II)
- Quantization
- Representing the real value of each amplitude as an integer
- 8-bit (-128 to 127) or 16-bit (-32768 to 32767)
- Formats
- 16-bit PCM
- 8-bit mu-law log compression (see the sketch after this list)
- LSB (Intel) vs. MSB (Sun, Apple) byte order
- Headers
- Raw (no header)
- Microsoft .wav
- Sun .au (40-byte header)
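To make the mu-law item above concrete, here is a minimal sketch of the standard mu-law companding curve (mu = 255 for 8-bit telephony); the function names and the 8-bit packing step are illustrative, not from the slides.

```python
import numpy as np

def mu_law_compress(x, mu=255):
    """Compress samples in [-1, 1] with the mu-law companding curve (mu = 255 for 8-bit)."""
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)

def mu_law_expand(y, mu=255):
    """Invert the compression to recover (approximately) the original samples."""
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(mu)) / mu

# Example: squeeze samples scaled to [-1, 1] into 8-bit codes and back
x = np.linspace(-1.0, 1.0, 5)
codes = np.round((mu_law_compress(x) + 1) / 2 * 255).astype(np.uint8)
recovered = mu_law_expand(codes.astype(float) / 255 * 2 - 1)
```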
16. Frame Extraction
- A frame (25 ms wide) is extracted every 10 ms
- (Figure: overlapping frames a1, a2, a3, one every 10 ms. From Simon Arnfield.)
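A minimal numpy sketch of this framing step (25 ms windows, one every 10 ms); the Hamming window and the helper name are my additions, not from the slides.

```python
import numpy as np

def extract_frames(signal, sample_rate=16000, frame_ms=25, shift_ms=10):
    """Slice a 1-D signal into overlapping frames: 25 ms wide, one every 10 ms."""
    frame_len = int(sample_rate * frame_ms / 1000)   # 400 samples at 16 kHz
    shift = int(sample_rate * shift_ms / 1000)       # 160 samples at 16 kHz
    n_frames = 1 + max(0, (len(signal) - frame_len) // shift)
    frames = np.stack([signal[i * shift : i * shift + frame_len]
                       for i in range(n_frames)])
    return frames * np.hamming(frame_len)            # taper the edges of each frame

frames = extract_frames(np.random.randn(16000))      # 1 s of fake audio -> 98 frames of 400 samples
```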
17. MFCC (Mel Frequency Cepstral Coefficients)
- Do an FFT to get spectral information
- Like the spectrogram/spectrum we saw earlier
- Apply Mel scaling
- Linear below 1 kHz, logarithmic above, with equal numbers of samples above and below 1 kHz
- Models the human ear's greater sensitivity at lower frequencies
- Plus a Discrete Cosine Transform
18. Final Feature Vector
- 39 features per 10 ms frame:
- 12 MFCC features
- 12 delta MFCC features
- 12 delta-delta MFCC features
- 1 (log) frame energy
- 1 delta (log) frame energy
- 1 delta-delta (log) frame energy
- So each frame is represented by a 39-dimensional vector
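One way such a 39-dimensional vector could be assembled, sketched with the librosa library (not the toolkit behind these slides); the file name, frame sizes, and energy definition are assumptions for illustration.

```python
import numpy as np
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)             # hypothetical file name

# 12 MFCCs per 25 ms frame, one frame every 10 ms
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12,
                            n_fft=int(0.025 * sr), hop_length=int(0.010 * sr))
d1 = librosa.feature.delta(mfcc)                             # 12 delta features
d2 = librosa.feature.delta(mfcc, order=2)                    # 12 delta-delta features

# Log frame energy plus its deltas (one simple per-frame energy definition)
frames = librosa.util.frame(y, frame_length=int(0.025 * sr), hop_length=int(0.010 * sr))
log_e = np.log(np.sum(frames ** 2, axis=0) + 1e-10)[np.newaxis, :]
e1 = librosa.feature.delta(log_e)
e2 = librosa.feature.delta(log_e, order=2)

# Align frame counts (librosa pads the MFCC frames) and stack into 39 features per frame
n = min(mfcc.shape[1], log_e.shape[1])
features = np.vstack([mfcc[:, :n], d1[:, :n], d2[:, :n],
                      log_e[:, :n], e1[:, :n], e2[:, :n]])   # shape: (39, n_frames)
```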
19. Where we are
- Given: a sequence of acoustic feature vectors, one every 10 ms
- Goal: output a string of words
- We'll work on how to do this
20. Speech Recognition with HMMs
21. The Three Basic Problems for HMMs
- (From the classic formulation by Larry Rabiner, after Jack Ferguson)
- L. R. Rabiner. 1989. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proc. IEEE 77(2), 257-286. Also in the Waibel and Lee volume.
22. The Three Basic Problems for HMMs
- Problem 1 (Evaluation): Given the observation sequence O = (o1 o2 ... oT) and an HMM model λ = (A, B, π), how do we efficiently compute P(O|λ), the probability of the observation sequence given the model?
- Problem 2 (Decoding): Given the observation sequence O = (o1 o2 ... oT) and an HMM model λ = (A, B, π), how do we choose a corresponding state sequence Q = (q1 q2 ... qT) that is optimal in some sense (i.e., best explains the observations)?
- Problem 3 (Learning): How do we adjust the model parameters λ = (A, B, π) to maximize P(O|λ)?
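To make Problem 1 (Evaluation) concrete, here is a minimal numpy sketch of the forward algorithm for a discrete-observation HMM, using the λ = (A, B, π) notation above; the toy numbers are illustrative only.

```python
import numpy as np

def forward(obs, A, B, pi):
    """P(O | lambda) for a discrete HMM.
    A:   (N, N) state-transition probabilities
    B:   (N, M) observation probabilities per state
    pi:  (N,)   initial state distribution
    obs: sequence of observation symbol indices
    """
    T, N = len(obs), A.shape[0]
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                           # initialization
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]       # induction
    return alpha[-1].sum()                                 # termination: P(O | lambda)

# Tiny example: 2 states, 3 observation symbols
A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])
print(forward([0, 1, 2], A, B, pi))
```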
23. The Viterbi Trellis
24. HMMs for speech
25. But phones aren't homogeneous
26. So we'll need to break phones into subphones
27. Now a word looks like this
28. Back to Viterbi with speech, but without subphones for a second
29. Vector Quantization
- Idea: make MFCC vectors look like symbols that we can count
- By building a mapping function that maps each input vector onto one of a small number of symbols
- Then compute probabilities just by counting
- This is called Vector Quantization, or VQ
- Not used for ASR any more: too simple
- But it is useful to consider as a starting point.
30. Vector Quantization
- Create a training set of feature vectors
- Cluster them into a small number of classes
- Represent each class by a discrete symbol
- For each class vk, we can compute the probability that it is generated by a given HMM state, using Baum-Welch as above
31. VQ
- We'll define a codebook, which lists, for each symbol:
- A prototype vector, or codeword
- If we had 256 classes (8-bit VQ):
- A codebook with 256 prototype vectors
- Given an incoming feature vector, we compare it to each of the 256 prototype vectors
- We pick whichever one is closest (by some distance metric)
- And replace the input vector by the index of this prototype vector (see the sketch below)
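A minimal sketch of this codebook lookup, using scikit-learn's KMeans to build a 256-entry codebook and squared Euclidean distance to pick the closest codeword; the function names and the random stand-in data are mine, not from the slides.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(training_vectors, n_symbols=256):
    """Cluster training feature vectors into n_symbols classes (8-bit VQ)."""
    km = KMeans(n_clusters=n_symbols, n_init=10, random_state=0)
    km.fit(training_vectors)
    return km.cluster_centers_                         # (256, 39) prototype vectors / codewords

def quantize(vector, codebook):
    """Replace an incoming feature vector by the index of its closest codeword."""
    dists = np.sum((codebook - vector) ** 2, axis=1)   # squared Euclidean distance
    return int(np.argmin(dists))

# Usage: build the codebook from a pile of 39-D frames, then map each new frame to a symbol
train = np.random.randn(10000, 39)                     # stand-in for real training frames
codebook = build_codebook(train)
symbol = quantize(np.random.randn(39), codebook)
```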
32. VQ
33. VQ requirements
- A distance metric or distortion metric
- Specifies how similar two vectors are
- Used:
- to build clusters
- to find the prototype vector for a cluster
- and to compare an incoming vector to the prototypes
- A clustering algorithm
- K-means, etc.
34. Distance metrics
- Simplest: Euclidean distance
- Also called sum-squared error
35. Summary: VQ
- To deal with real-valued input:
- Convert the input to a symbol
- By choosing the closest prototype vector in a preclustered codebook
- Where "closest" is defined by:
- Euclidean distance
- Mahalanobis distance
- Then just use Baum-Welch as above
36. Language Model
37. LVCSR Search Algorithm
- Goal of search: how to combine the AM and LM
- Viterbi search
- Review, and adding in the LM
- Beam search
- Silence models
- A* search
- Fast match
- Tree-structured lexicons
- N-best and multipass search
- N-best
- Word lattice and word graph
- Forward-backward search (not related to F-B training)
38. Evaluation
- How do we evaluate recognizers?
- Word error rate
39. Word Error Rate
- Word Error Rate = 100 × (Insertions + Substitutions + Deletions) / (Total Words in Correct Transcript)
- Alignment example:
- REF:  portable ****  PHONE UPSTAIRS last night so
- HYP:  portable FORM  OF    STORES   last night so
- Eval:           I    S     S
- WER = 100 × (1 + 2 + 0) / 6 = 50%
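A minimal sketch of computing WER with word-level edit distance (dynamic programming); this is a generic implementation, not the sclite tool discussed on the next slide.

```python
def wer(ref, hyp):
    """Word error rate: 100 * (S + D + I) / len(ref), via edit distance over words."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = minimum edits to turn the first i ref words into the first j hyp words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                            # i deletions
    for j in range(len(h) + 1):
        d[0][j] = j                            # j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + sub,   # substitution (or match)
                          d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1)         # insertion
    return 100.0 * d[len(r)][len(h)] / len(r)

print(wer("portable PHONE UPSTAIRS last night so",
          "portable FORM OF STORES last night so"))   # 50.0
```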
40. NIST sctk-1.3 Scoring Software: Computing WER with sclite
- http://www.nist.gov/speech/tools/
- sclite aligns a hypothesized text (HYP, from the recognizer) with a correct or reference text (REF, human transcribed)
- id: (2347-b-013)
- Scores: (#C #S #D #I) 9 3 1 2
- REF:  was an engineer SO  I    i was always with MEN  UM            and they
- HYP:  was an engineer **  AND  i was always with THEM THEY ALL THAT and they
- Eval:                 D   S                      S    S    I   I
41. Summary on WER
- WER is clearly better than metrics such as perplexity
- But should we be more concerned with meaning (semantic error rate)?
- Good idea, but hard to agree on
- Has been applied in dialogue systems, where the desired semantic output is clearer
- Recent research: modify training to directly minimize WER instead of maximizing likelihood
42. What we are searching for
- Given the Acoustic Model (AM) and Language Model (LM):
- Ŵ = argmax_W P(O|W) · P(W)    (1)
- where P(O|W) is the AM (likelihood) and P(W) is the LM (prior)
43. Combining Acoustic and Language Models
- We don't actually use equation (1)
- The AM underestimates the acoustic probability
- Why? Bad independence assumptions
- Intuition: we compute (independent) AM probability estimates every 10 ms, but the LM only every word
- AM and LM have vastly different dynamic ranges
44. Language Model Scaling Factor
- Solution: add a language model weight (also called the language weight LW or language model scaling factor LMSF)
- Value determined empirically; it is positive (why?)
- For Sphinx and similar systems, generally around 10.
45. Word Insertion Penalty
- But the LM probability P(W) also functions as a penalty for inserting words
- Intuition: when a uniform language model (every word has an equal probability) is used, the LM probability is a 1/N penalty multiplier taken for each word
- If the penalty is large, the decoder will prefer fewer, longer words
- If the penalty is small, the decoder will prefer more, shorter words
- When tuning the LM weight to balance the AM, this penalty is a side effect
- So we add a separate word insertion penalty to offset it
46. Log domain
- We do everything in the log domain
- So the final equation (see below):
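A reconstruction of the standard form of this equation (the slide shows it as a figure), where LMSF is the language model scaling factor, WIP the word insertion penalty, and N the number of words in W:

$$\hat{W} \;=\; \operatorname*{argmax}_{W} \Big[\, \log P(O \mid W) \;+\; \mathrm{LMSF} \cdot \log P(W) \;+\; N \cdot \log \mathrm{WIP} \,\Big]$$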
47. Language Model Scaling Factor
- As the LMSF is increased:
- More deletion errors (since the penalty for transitioning between words increases)
- Fewer insertion errors
- Need a wider search beam (since path scores are larger)
- Less influence of the acoustic model observation probabilities
48. Word Insertion Penalty
- Controls the trade-off between insertion and deletion errors
- As the penalty becomes larger (more negative):
- More deletion errors
- Fewer insertion errors
- Acts as a model of the effect of length on probability
- But probably not a good model (the geometric assumption is probably bad for short sentences)
49. Adding LM probabilities to Viterbi (1): Uniform LM
- Visualizing the search space for 2 words
50. Viterbi trellis with 2 words and a uniform LM
- Null transition from the end-state of each word to the start-state of all (both) words.
51. Viterbi for 2-word continuous recognition
- Viterbi search computations are done time-synchronously from left to right, i.e., each cell for time t is computed before proceeding to time t+1
52. Search space for unigram LM
53. Search space with bigrams
54. Speeding things up
- Viterbi is O(N²T), where N is the total number of HMM states and T is the length of the utterance
- This is too large for real-time search
- A ton of work in ASR search is just to make search faster:
- Beam search (pruning)
- Fast match
- Tree-based lexicons
55. Beam search
- Instead of retaining all candidates (cells) at every time frame,
- use a threshold T to keep only a subset
- At each time t:
- Identify the state with the lowest cost, Dmin
- Each state with cost > Dmin + T is discarded (pruned) before moving on to time t+1 (see the sketch below)
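A minimal sketch of this pruning step in Python (costs are negative log probabilities, so lower is better); the data layout is illustrative, not from any particular decoder.

```python
def prune_beam(active_states, beam_width):
    """Keep only states whose cost is within beam_width of the best (lowest) cost.

    active_states: dict mapping state id -> path cost (negative log probability)
    beam_width:    the threshold T from the slide; larger keeps more states
    """
    d_min = min(active_states.values())          # best (lowest-cost) state at time t
    return {s: c for s, c in active_states.items()
            if c <= d_min + beam_width}          # states above Dmin + T are pruned

# Example: four active states at some frame, beam width of 10
survivors = prune_beam({"w1_s3": 42.0, "w2_s1": 49.5, "w1_s5": 60.2, "w2_s4": 53.1},
                       beam_width=10.0)
# -> keeps "w1_s3" and "w2_s1"; the others exceed 42.0 + 10.0
```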
56. Viterbi Beam search
- The most common and powerful search algorithm for LVCSR
- Note:
- What makes this possible is time-synchrony:
- We are comparing paths of equal length
- For two different word sequences W1 and W2:
- We are comparing P(W1|O_0..t) and P(W2|O_0..t)
- Based on the same partial observation sequence O_0..t
- So the denominator is the same and can be ignored
- Time-asynchronous search (A*) is harder
57. Viterbi Beam Search
- Empirically, a beam size of 5-10% of the search space works well
- Thus 90-95% of HMM states don't have to be considered at each time t
- Vast savings in time.
58. A* Search (A* Decoding)
- Intuition:
- If we had good heuristics for guiding decoding,
- we could do depth-first (best-first) search and not waste all our time computing all those paths at every time step, as Viterbi does.
- A* decoding, also called stack decoding, is an attempt to do that.
- A* also does not make the Viterbi assumption:
- It uses the actual forward probability, rather than the Viterbi approximation
59. Reminder: A* search
- A search algorithm is admissible if it can guarantee to find an optimal solution if one exists.
- Heuristic search functions rank nodes in the search space by f(N), the goodness of each node N in a search tree, computed as:
- f(N) = g(N) + h(N), where
- g(N) = the distance of the partial path already traveled from the root S to node N
- h(N) = a heuristic estimate of the remaining distance from node N to the goal node G.
60. Reminder: A* search
- If the heuristic function h(N) estimating the remaining distance from N to the goal node G is an underestimate of the true distance, best-first search is admissible, and is called A* search.
61. A* search for speech
- The search space is the set of possible sentences
- The forward algorithm can tell us the cost of the current path so far, g(.)
- We need an estimate of the cost from the current node to the end, h(.)
62. A* Decoding (2)
63. Stack decoding (A*) algorithm
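Since this slide is a figure, here is a minimal Python sketch of the stack (A*) decoding loop it describes: a priority queue of partial sentence hypotheses ordered by f = g + h. The scoring functions are placeholders, not part of the original slides.

```python
import heapq

def stack_decode(vocab, acoustic_score, heuristic, is_complete, max_pops=100000):
    """A* / stack decoding over partial word sequences.

    acoustic_score(words): g(.)  -- forward-probability cost of the partial path (placeholder)
    heuristic(words):      h(.)  -- estimated cost to finish the utterance (placeholder)
    is_complete(words):    True when the hypothesis covers the whole input (placeholder)
    """
    # Priority queue of (f, partial word sequence); lower cost is better
    stack = [(heuristic(()), ())]
    for _ in range(max_pops):
        if not stack:
            break
        f, words = heapq.heappop(stack)          # pop the best partial hypothesis
        if is_complete(words):
            return words                         # first complete pop is optimal if h underestimates
        for w in vocab:                          # extend by every possible next word
            new = words + (w,)
            g = acoustic_score(new)
            heapq.heappush(stack, (g + heuristic(new), new))
    return None
```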
64. Making A* work: h(.)
- If h(.) is zero, this reduces to breadth-first search
- Stupid estimates of h(.):
- Amount of time left in the utterance
- Slightly smarter:
- Estimate the expected cost per frame for the remaining path
- Multiply that by the remaining time
- This can be computed from the training set (what was the average acoustic cost for a frame in the training set?)
- Later: with multipass decoding, we can use the backward algorithm to estimate h for any hypothesis!
65. N-best and multipass search
- The ideal search strategy would use every available knowledge source (KS)
- But it is often difficult or expensive to integrate a very complex KS into the first-pass search
- For example, parsers as a language model have long-distance dependencies that violate dynamic programming assumptions
- Other knowledge sources might not be left-to-right (knowledge of following words can help predict preceding words)
- For this reason (and others we will see), we use multipass search algorithms
66. Multipass Search
67. Some definitions
- N-best list
- Instead of the single best sentence (word string), return an ordered list of N sentence hypotheses
- Word lattice
- Compact representation of word hypotheses and their times and scores
- Word graph
- FSA representation of the lattice in which times are represented by topology
68. N-best list
69. Word lattice
- Encodes:
- Word
- Starting/ending time(s) of the word
- Acoustic score of the word
- More compact than an N-best list
- Utterance with 10 words, 2 hypotheses per word:
- 2^10 = 1024 different sentences,
- but a lattice with only 20 different word hypotheses
70. Word Graph
71. Converting a word lattice to a word graph
- A word lattice can have a range of possible end frames for each word
- Create an edge from (wi, ti) to (wj, tj) if tj - 1 is one of the end times of wi (see the sketch below)
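A minimal sketch of this conversion rule; the lattice representation (word, start frame, set of possible end frames) is an assumption made for illustration.

```python
from collections import namedtuple

# A lattice entry: a word hypothesis with a start frame and a set of possible end frames
Hyp = namedtuple("Hyp", ["word", "start", "ends"])

def lattice_to_word_graph(hyps):
    """Create an edge (wi, ti) -> (wj, tj) whenever tj - 1 is one of wi's end times."""
    edges = []
    for hi in hyps:
        for hj in hyps:
            if hj.start - 1 in hi.ends:
                edges.append(((hi.word, hi.start), (hj.word, hj.start)))
    return edges

# Toy lattice: "flights" ending at frame 40 or 42 can precede "to" starting at 41 or 43
lattice = [Hyp("flights", 10, {40, 42}), Hyp("to", 41, {55}), Hyp("to", 43, {57})]
print(lattice_to_word_graph(lattice))
# [(('flights', 10), ('to', 41)), (('flights', 10), ('to', 43))]
```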
72. Computing N-best lists
- In the worst case, an admissible algorithm for finding the N most likely hypotheses is exponential in the length of the utterance.
- S. Young. 1984. Generating Multiple Solutions from Connected Word DP Recognition Algorithms. Proc. of the Institute of Acoustics, 64, 351-354.
- For example, if the AM and LM scores were nearly identical for all word sequences, we would have to consider all permutations of word sequences for the whole sentence (all with the same scores).
- But of course if this were true, we couldn't do ASR at all!
73. Demo
- LVCSR system
- developed by the Intelligent Computing Lab