Title: Natural Language Processing (11): Speech Recognition
1. Natural Language Processing (11): Speech Recognition
- Dr. Xuan Wang
- Intelligence Computing Research Center
- Harbin Institute of Technology Shenzhen Graduate School
- Slides from Dr. Mary P. Harper, ECE, Purdue University
2. LVCSR
- Large Vocabulary Continuous Speech Recognition
- 20,000-64,000 words
- Speaker independent (vs. speaker-dependent)
- Continuous speech (vs. isolated-word)
3. LVCSR
- Build a statistical model of the speech-to-words process
- Collect lots and lots of speech, and transcribe all the words
- Train the model on the labeled speech
- Paradigm: Supervised Machine Learning + Search
4. Speech Recognition Architecture
5. The Noisy Channel Model
- Search through the space of all possible sentences.
- Pick the one that is most probable given the waveform.
6. The Noisy Channel Model (II)
- What is the most likely sentence out of all sentences in the language L, given some acoustic input O?
- Treat the acoustic input O as a sequence of individual observations: O = o1, o2, o3, ..., ot
- Define a sentence as a sequence of words: W = w1, w2, w3, ..., wn
7. The Noisy Channel Model (III)
- Probabilistic implication: pick the sentence with the highest probability
- We can use Bayes' rule to rewrite this
- Since the denominator is the same for each candidate sentence W, we can ignore it for the argmax
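Spelled out (a reconstruction of the equations the slide shows as images), the Bayes-rule rewrite is:

$$\hat{W} \;=\; \operatorname*{argmax}_{W \in L} P(W \mid O) \;=\; \operatorname*{argmax}_{W \in L} \frac{P(O \mid W)\,P(W)}{P(O)} \;=\; \operatorname*{argmax}_{W \in L} P(O \mid W)\,P(W)$$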
8. The Noisy Channel Model
- P(O|W) is the likelihood; P(W) is the prior.
9. The Noisy Channel Model
- Ignoring the denominator leaves us with two factors: P(Source) and P(Signal|Source)
10. Speech Architecture meets the Noisy Channel
11. Architecture: Five easy pieces
- Feature extraction
- Acoustic Modeling
- HMMs, Lexicons, and Pronunciation
- Decoding
- Language Modeling
12. Feature Extraction
- Digitize Speech
- Extract Frames
13. Digitizing Speech
14. Digitizing Speech (A-D)
- Sampling
- measuring the amplitude of the signal at time t
- 16,000 Hz (samples/sec): microphone (wideband)
- 8,000 Hz (samples/sec): telephone
- Why?
- Need at least 2 samples per cycle
- max measurable frequency is half the sampling rate
- Human speech < 10,000 Hz, so we need at most 20 kHz
- Telephone speech is filtered at 4 kHz, so 8 kHz is enough
15. Digitizing Speech (II)
- Quantization
- Representing the real value of each amplitude as an integer
- 8-bit (-128 to 127) or 16-bit (-32768 to 32767)
- Formats
- 16-bit PCM
- 8-bit mu-law log compression (see the sketch after this list)
- LSB (Intel) vs. MSB (Sun, Apple) byte order
- Headers
- Raw (no header)
- Microsoft .wav
- Sun .au (40-byte header)
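To make the mu-law item above concrete, here is a minimal sketch of the standard mu-law companding curve (mu = 255 for 8-bit telephony); the function names and the 8-bit packing step are illustrative, not from the slides.

```python
import numpy as np

def mu_law_compress(x, mu=255):
    """Compress samples in [-1, 1] with the mu-law companding curve (mu = 255 for 8-bit)."""
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)

def mu_law_expand(y, mu=255):
    """Invert the compression to recover (approximately) the original samples."""
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(mu)) / mu

# Example: squeeze samples scaled to [-1, 1] into 8-bit codes and back
x = np.linspace(-1.0, 1.0, 5)
codes = np.round((mu_law_compress(x) + 1) / 2 * 255).astype(np.uint8)
recovered = mu_law_expand(codes.astype(float) / 255 * 2 - 1)
```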
16. Frame Extraction
- A frame (25 ms wide) is extracted every 10 ms
- (Figure: overlapping frames a1, a2, a3, one every 10 ms. From Simon Arnfield.)
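A minimal numpy sketch of this framing step (25 ms windows, one every 10 ms); the Hamming window and the helper name are my additions, not from the slides.

```python
import numpy as np

def extract_frames(signal, sample_rate=16000, frame_ms=25, shift_ms=10):
    """Slice a 1-D signal into overlapping frames: 25 ms wide, one every 10 ms."""
    frame_len = int(sample_rate * frame_ms / 1000)   # 400 samples at 16 kHz
    shift = int(sample_rate * shift_ms / 1000)       # 160 samples at 16 kHz
    n_frames = 1 + max(0, (len(signal) - frame_len) // shift)
    frames = np.stack([signal[i * shift : i * shift + frame_len]
                       for i in range(n_frames)])
    return frames * np.hamming(frame_len)            # taper the edges of each frame

frames = extract_frames(np.random.randn(16000))      # 1 s of fake audio -> 98 frames of 400 samples
```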
17. MFCC (Mel Frequency Cepstral Coefficients)
- Do an FFT to get spectral information
- Like the spectrogram/spectrum we saw earlier
- Apply Mel scaling
- Linear below 1 kHz, logarithmic above, with equal numbers of samples above and below 1 kHz
- Models the human ear's greater sensitivity at lower frequencies
- Plus a Discrete Cosine Transform
18. Final Feature Vector
- 39 features per 10 ms frame:
- 12 MFCC features
- 12 delta MFCC features
- 12 delta-delta MFCC features
- 1 (log) frame energy
- 1 delta (log) frame energy
- 1 delta-delta (log) frame energy
- So each frame is represented by a 39-dimensional vector
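One way such a 39-dimensional vector could be assembled, sketched with the librosa library (not the toolkit behind these slides); the file name, frame sizes, and energy definition are assumptions for illustration.

```python
import numpy as np
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)             # hypothetical file name

# 12 MFCCs per 25 ms frame, one frame every 10 ms
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12,
                            n_fft=int(0.025 * sr), hop_length=int(0.010 * sr))
d1 = librosa.feature.delta(mfcc)                             # 12 delta features
d2 = librosa.feature.delta(mfcc, order=2)                    # 12 delta-delta features

# Log frame energy plus its deltas (one simple per-frame energy definition)
frames = librosa.util.frame(y, frame_length=int(0.025 * sr), hop_length=int(0.010 * sr))
log_e = np.log(np.sum(frames ** 2, axis=0) + 1e-10)[np.newaxis, :]
e1 = librosa.feature.delta(log_e)
e2 = librosa.feature.delta(log_e, order=2)

# Align frame counts (librosa pads the MFCC frames) and stack into 39 features per frame
n = min(mfcc.shape[1], log_e.shape[1])
features = np.vstack([mfcc[:, :n], d1[:, :n], d2[:, :n],
                      log_e[:, :n], e1[:, :n], e2[:, :n]])   # shape: (39, n_frames)
```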
19. Where we are
- Given: a sequence of acoustic feature vectors, one every 10 ms
- Goal: output a string of words
- We'll work on how to do this
20. Speech Recognition with HMMs
21. The Three Basic Problems for HMMs
- (From the classic formulation by Larry Rabiner, after Jack Ferguson)
- L. R. Rabiner. 1989. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proc. IEEE 77(2), 257-286. Also in the Waibel and Lee volume.
22. The Three Basic Problems for HMMs
- Problem 1 (Evaluation): Given the observation sequence O = (o1 o2 ... oT) and an HMM model λ = (A, B, π), how do we efficiently compute P(O|λ), the probability of the observation sequence given the model?
- Problem 2 (Decoding): Given the observation sequence O = (o1 o2 ... oT) and an HMM model λ = (A, B, π), how do we choose a corresponding state sequence Q = (q1 q2 ... qT) that is optimal in some sense (i.e., best explains the observations)?
- Problem 3 (Learning): How do we adjust the model parameters λ = (A, B, π) to maximize P(O|λ)?
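To make Problem 1 (Evaluation) concrete, here is a minimal numpy sketch of the forward algorithm for a discrete-observation HMM, using the λ = (A, B, π) notation above; the toy numbers are illustrative only.

```python
import numpy as np

def forward(obs, A, B, pi):
    """P(O | lambda) for a discrete HMM.
    A:   (N, N) state-transition probabilities
    B:   (N, M) observation probabilities per state
    pi:  (N,)   initial state distribution
    obs: sequence of observation symbol indices
    """
    T, N = len(obs), A.shape[0]
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                           # initialization
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]       # induction
    return alpha[-1].sum()                                 # termination: P(O | lambda)

# Tiny example: 2 states, 3 observation symbols
A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])
print(forward([0, 1, 2], A, B, pi))
```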
23. The Viterbi Trellis
24. HMMs for speech
25. But phones aren't homogeneous
26. So we'll need to break phones into subphones
27. Now a word looks like this
28. Back to Viterbi with speech, but without subphones for a second
29. Vector Quantization
- Idea: make MFCC vectors look like symbols that we can count
- By building a mapping function that maps each input vector onto one of a small number of symbols
- Then compute probabilities just by counting
- This is called Vector Quantization, or VQ
- Not used for ASR any more: too simple
- But it is useful to consider as a starting point.
30. Vector Quantization
- Create a training set of feature vectors
- Cluster them into a small number of classes
- Represent each class by a discrete symbol
- For each class vk, we can compute the probability that it is generated by a given HMM state, using Baum-Welch as above
31. VQ
- We'll define a codebook, which lists, for each symbol:
- A prototype vector, or codeword
- If we had 256 classes (8-bit VQ):
- A codebook with 256 prototype vectors
- Given an incoming feature vector, we compare it to each of the 256 prototype vectors
- We pick whichever one is closest (by some distance metric)
- And replace the input vector by the index of this prototype vector (see the sketch below)
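A minimal sketch of this codebook lookup, using scikit-learn's KMeans to build a 256-entry codebook and squared Euclidean distance to pick the closest codeword; the function names and the random stand-in data are mine, not from the slides.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(training_vectors, n_symbols=256):
    """Cluster training feature vectors into n_symbols classes (8-bit VQ)."""
    km = KMeans(n_clusters=n_symbols, n_init=10, random_state=0)
    km.fit(training_vectors)
    return km.cluster_centers_                         # (256, 39) prototype vectors / codewords

def quantize(vector, codebook):
    """Replace an incoming feature vector by the index of its closest codeword."""
    dists = np.sum((codebook - vector) ** 2, axis=1)   # squared Euclidean distance
    return int(np.argmin(dists))

# Usage: build the codebook from a pile of 39-D frames, then map each new frame to a symbol
train = np.random.randn(10000, 39)                     # stand-in for real training frames
codebook = build_codebook(train)
symbol = quantize(np.random.randn(39), codebook)
```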
32. VQ
33. VQ requirements
- A distance metric or distortion metric
- Specifies how similar two vectors are
- Used:
- to build clusters
- to find the prototype vector for a cluster
- and to compare an incoming vector to the prototypes
- A clustering algorithm
- K-means, etc.
34. Distance metrics
- Simplest: Euclidean distance
- Also called sum-squared error
35. Summary: VQ
- To deal with real-valued input:
- Convert the input to a symbol
- By choosing the closest prototype vector in a preclustered codebook
- Where "closest" is defined by:
- Euclidean distance
- Mahalanobis distance
- Then just use Baum-Welch as above
36. Language Model
37. LVCSR Search Algorithm
- Goal of search: how to combine the AM and LM
- Viterbi search
- Review, and adding in the LM
- Beam search
- Silence models
- A* search
- Fast match
- Tree-structured lexicons
- N-best and multipass search
- N-best
- Word lattice and word graph
- Forward-backward search (not related to F-B training)
38. Evaluation
- How do we evaluate recognizers?
- Word error rate
39. Word Error Rate
- Word Error Rate = 100 × (Insertions + Substitutions + Deletions) / (Total Words in Correct Transcript)
- Alignment example:
- REF:  portable ****  PHONE UPSTAIRS last night so
- HYP:  portable FORM  OF    STORES   last night so
- Eval:           I    S     S
- WER = 100 × (1 + 2 + 0) / 6 = 50%
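A minimal sketch of computing WER with word-level edit distance (dynamic programming); this is a generic implementation, not the sclite tool discussed on the next slide.

```python
def wer(ref, hyp):
    """Word error rate: 100 * (S + D + I) / len(ref), via edit distance over words."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = minimum edits to turn the first i ref words into the first j hyp words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                            # i deletions
    for j in range(len(h) + 1):
        d[0][j] = j                            # j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + sub,   # substitution (or match)
                          d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1)         # insertion
    return 100.0 * d[len(r)][len(h)] / len(r)

print(wer("portable PHONE UPSTAIRS last night so",
          "portable FORM OF STORES last night so"))   # 50.0
```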
40. NIST sctk-1.3 Scoring Software: Computing WER with sclite
- http://www.nist.gov/speech/tools/
- sclite aligns a hypothesized text (HYP, from the recognizer) with a correct or reference text (REF, human transcribed)
- id: (2347-b-013)
- Scores: (#C #S #D #I) 9 3 1 2
- REF:  was an engineer SO  I    i was always with MEN  UM            and they
- HYP:  was an engineer **  AND  i was always with THEM THEY ALL THAT and they
- Eval:                 D   S                      S    S    I   I
41. Summary on WER
- WER is clearly better than metrics such as perplexity
- But should we be more concerned with meaning (semantic error rate)?
- Good idea, but hard to agree on
- Has been applied in dialogue systems, where the desired semantic output is clearer
- Recent research: modify training to directly minimize WER instead of maximizing likelihood
42. What we are searching for
- Given the Acoustic Model (AM) and Language Model (LM):
- Ŵ = argmax_W P(O|W) · P(W)    (1)
- where P(O|W) is the AM (likelihood) and P(W) is the LM (prior)
43. Combining Acoustic and Language Models
- We don't actually use equation (1)
- The AM underestimates the acoustic probability
- Why? Bad independence assumptions
- Intuition: we compute (independent) AM probability estimates every 10 ms, but the LM only every word
- AM and LM have vastly different dynamic ranges
44. Language Model Scaling Factor
- Solution: add a language model weight (also called the language weight LW or language model scaling factor LMSF)
- Value determined empirically; it is positive (why?)
- For Sphinx and similar systems, generally around 10.
45. Word Insertion Penalty
- But the LM probability P(W) also functions as a penalty for inserting words
- Intuition: when a uniform language model (every word has an equal probability) is used, the LM probability is a 1/N penalty multiplier taken for each word
- If the penalty is large, the decoder will prefer fewer, longer words
- If the penalty is small, the decoder will prefer more, shorter words
- When tuning the LM weight to balance the AM, this penalty is a side effect
- So we add a separate word insertion penalty to offset it
46. Log domain
- We do everything in the log domain
- So the final equation (see below):
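A reconstruction of the standard form of this equation (the slide shows it as a figure), where LMSF is the language model scaling factor, WIP the word insertion penalty, and N the number of words in W:

$$\hat{W} \;=\; \operatorname*{argmax}_{W} \Big[\, \log P(O \mid W) \;+\; \mathrm{LMSF} \cdot \log P(W) \;+\; N \cdot \log \mathrm{WIP} \,\Big]$$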
47. Language Model Scaling Factor
- As the LMSF is increased:
- More deletion errors (since the penalty for transitioning between words increases)
- Fewer insertion errors
- Need a wider search beam (since path scores are larger)
- Less influence of the acoustic model observation probabilities
48. Word Insertion Penalty
- Controls the trade-off between insertion and deletion errors
- As the penalty becomes larger (more negative):
- More deletion errors
- Fewer insertion errors
- Acts as a model of the effect of length on probability
- But probably not a good model (the geometric assumption is probably bad for short sentences)
49. Adding LM probabilities to Viterbi (1): Uniform LM
- Visualizing the search space for 2 words
50. Viterbi trellis with 2 words and a uniform LM
- Null transition from the end-state of each word to the start-state of all (both) words.
51. Viterbi for 2-word continuous recognition
- Viterbi search computations are done time-synchronously from left to right, i.e., each cell for time t is computed before proceeding to time t+1
52. Search space for unigram LM
53. Search space with bigrams
54. Speeding things up
- Viterbi is O(N²T), where N is the total number of HMM states and T is the length of the utterance
- This is too large for real-time search
- A ton of work in ASR search is just to make search faster:
- Beam search (pruning)
- Fast match
- Tree-based lexicons
55. Beam search
- Instead of retaining all candidates (cells) at every time frame,
- use a threshold T to keep only a subset
- At each time t:
- Identify the state with the lowest cost, Dmin
- Each state with cost > Dmin + T is discarded (pruned) before moving on to time t+1 (see the sketch below)
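A minimal sketch of this pruning step in Python (costs are negative log probabilities, so lower is better); the data layout is illustrative, not from any particular decoder.

```python
def prune_beam(active_states, beam_width):
    """Keep only states whose cost is within beam_width of the best (lowest) cost.

    active_states: dict mapping state id -> path cost (negative log probability)
    beam_width:    the threshold T from the slide; larger keeps more states
    """
    d_min = min(active_states.values())          # best (lowest-cost) state at time t
    return {s: c for s, c in active_states.items()
            if c <= d_min + beam_width}          # states above Dmin + T are pruned

# Example: four active states at some frame, beam width of 10
survivors = prune_beam({"w1_s3": 42.0, "w2_s1": 49.5, "w1_s5": 60.2, "w2_s4": 53.1},
                       beam_width=10.0)
# -> keeps "w1_s3" and "w2_s1"; the others exceed 42.0 + 10.0
```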
56. Viterbi Beam search
- The most common and powerful search algorithm for LVCSR
- Note:
- What makes this possible is time-synchrony:
- We are comparing paths of equal length
- For two different word sequences W1 and W2:
- We are comparing P(W1|O_0..t) and P(W2|O_0..t)
- Based on the same partial observation sequence O_0..t
- So the denominator is the same and can be ignored
- Time-asynchronous search (A*) is harder
57. Viterbi Beam Search
- Empirically, a beam size of 5-10% of the search space works well
- Thus 90-95% of HMM states don't have to be considered at each time t
- Vast savings in time.
58. A* Search (A* Decoding)
- Intuition:
- If we had good heuristics for guiding decoding,
- we could do depth-first (best-first) search and not waste all our time computing all those paths at every time step, as Viterbi does.
- A* decoding, also called stack decoding, is an attempt to do that.
- A* also does not make the Viterbi assumption:
- It uses the actual forward probability, rather than the Viterbi approximation
59. Reminder: A* search
- A search algorithm is admissible if it can guarantee to find an optimal solution if one exists.
- Heuristic search functions rank nodes in the search space by f(N), the goodness of each node N in a search tree, computed as:
- f(N) = g(N) + h(N), where
- g(N) = the distance of the partial path already traveled from the root S to node N
- h(N) = a heuristic estimate of the remaining distance from node N to the goal node G.
60. Reminder: A* search
- If the heuristic function h(N) estimating the remaining distance from N to the goal node G is an underestimate of the true distance, best-first search is admissible, and is called A* search.
61. A* search for speech
- The search space is the set of possible sentences
- The forward algorithm can tell us the cost of the current path so far, g(.)
- We need an estimate of the cost from the current node to the end, h(.)
62. A* Decoding (2)
63. Stack decoding (A*) algorithm
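Since this slide is a figure, here is a minimal Python sketch of the stack (A*) decoding loop it describes: a priority queue of partial sentence hypotheses ordered by f = g + h. The scoring functions are placeholders, not part of the original slides.

```python
import heapq

def stack_decode(vocab, acoustic_score, heuristic, is_complete, max_pops=100000):
    """A* / stack decoding over partial word sequences.

    acoustic_score(words): g(.)  -- forward-probability cost of the partial path (placeholder)
    heuristic(words):      h(.)  -- estimated cost to finish the utterance (placeholder)
    is_complete(words):    True when the hypothesis covers the whole input (placeholder)
    """
    # Priority queue of (f, partial word sequence); lower cost is better
    stack = [(heuristic(()), ())]
    for _ in range(max_pops):
        if not stack:
            break
        f, words = heapq.heappop(stack)          # pop the best partial hypothesis
        if is_complete(words):
            return words                         # first complete pop is optimal if h underestimates
        for w in vocab:                          # extend by every possible next word
            new = words + (w,)
            g = acoustic_score(new)
            heapq.heappush(stack, (g + heuristic(new), new))
    return None
```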
64. Making A* work: h(.)
- If h(.) is zero, this reduces to breadth-first search
- Stupid estimates of h(.):
- Amount of time left in the utterance
- Slightly smarter:
- Estimate the expected cost per frame for the remaining path
- Multiply that by the remaining time
- This can be computed from the training set (what was the average acoustic cost for a frame in the training set?)
- Later: with multipass decoding, we can use the backward algorithm to estimate h for any hypothesis!
65. N-best and multipass search
- The ideal search strategy would use every available knowledge source (KS)
- But it is often difficult or expensive to integrate a very complex KS into the first-pass search
- For example, parsers as a language model have long-distance dependencies that violate dynamic programming assumptions
- Other knowledge sources might not be left-to-right (knowledge of following words can help predict preceding words)
- For this reason (and others we will see), we use multipass search algorithms
66. Multipass Search
67. Some definitions
- N-best list
- Instead of the single best sentence (word string), return an ordered list of N sentence hypotheses
- Word lattice
- Compact representation of word hypotheses and their times and scores
- Word graph
- FSA representation of the lattice in which times are represented by topology
68. N-best list
69. Word lattice
- Encodes:
- Word
- Starting/ending time(s) of the word
- Acoustic score of the word
- More compact than an N-best list
- Utterance with 10 words, 2 hypotheses per word:
- 2^10 = 1024 different sentences,
- but a lattice with only 20 different word hypotheses
70. Word Graph
71. Converting a word lattice to a word graph
- A word lattice can have a range of possible end frames for each word
- Create an edge from (wi, ti) to (wj, tj) if tj - 1 is one of the end times of wi (see the sketch below)
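A minimal sketch of this conversion rule; the lattice representation (word, start frame, set of possible end frames) is an assumption made for illustration.

```python
from collections import namedtuple

# A lattice entry: a word hypothesis with a start frame and a set of possible end frames
Hyp = namedtuple("Hyp", ["word", "start", "ends"])

def lattice_to_word_graph(hyps):
    """Create an edge (wi, ti) -> (wj, tj) whenever tj - 1 is one of wi's end times."""
    edges = []
    for hi in hyps:
        for hj in hyps:
            if hj.start - 1 in hi.ends:
                edges.append(((hi.word, hi.start), (hj.word, hj.start)))
    return edges

# Toy lattice: "flights" ending at frame 40 or 42 can precede "to" starting at 41 or 43
lattice = [Hyp("flights", 10, {40, 42}), Hyp("to", 41, {55}), Hyp("to", 43, {57})]
print(lattice_to_word_graph(lattice))
# [(('flights', 10), ('to', 41)), (('flights', 10), ('to', 43))]
```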
72. Computing N-best lists
- In the worst case, an admissible algorithm for finding the N most likely hypotheses is exponential in the length of the utterance.
- S. Young. 1984. Generating Multiple Solutions from Connected Word DP Recognition Algorithms. Proc. of the Institute of Acoustics, 64, 351-354.
- For example, if the AM and LM scores were nearly identical for all word sequences, we would have to consider all permutations of word sequences for the whole sentence (all with the same scores).
- But of course if this were true, we couldn't do ASR at all!
73. Demo
- LVCSR system
- developed by the Intelligent Computing Lab