Natural Language Processing 11 Speech Recognition - PowerPoint PPT Presentation

1
Natural Language Processing (11): Speech
Recognition
  • Dr. Xuan Wang
  • Intelligence Computing Research Center
  • Harbin Institute of Technology Shenzhen Graduate
    School
  • Slides from Dr. Mary P. Harper, ECE,
    Purdue University

2
LVCSR
  • Large Vocabulary Continuous Speech Recognition
  • 20,000-64,000 words
  • Speaker independent (vs. speaker-dependent)
  • Continuous speech (vs. isolated-word)

3
LVCSR
  • Build a statistical model of the speech-to-words
    process
  • Collect lots and lots of speech, and transcribe
    all the words.
  • Train the model on the labeled speech
  • Paradigm: Supervised Machine Learning + Search

4
Speech Recognition Architecture
5
The Noisy Channel Model
  • Search through space of all possible sentences.
  • Pick the one that is most probable given the
    waveform.

6
The Noisy Channel Model (II)
  • What is the most likely sentence out of all
    sentences in the language L given some acoustic
    input O?
  • Treat acoustic input O as sequence of individual
    observations
  • O = o1, o2, o3, …, ot
  • Define a sentence as a sequence of words
  • W = w1, w2, w3, …, wn

7
Noisy Channel Model (III)
  • Probabilistic implication: pick the highest
    probability sentence W
  • We can use Bayes' rule to rewrite this
  • Since denominator is the same for each candidate
    sentence W, we can ignore it for the argmax
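
The equations on this slide exist only as images in the original deck; a standard reconstruction of the noisy-channel derivation they describe, using the notation of the previous slide, is:

    \hat{W} = \arg\max_{W \in L} P(W \mid O)
            = \arg\max_{W \in L} \frac{P(O \mid W)\, P(W)}{P(O)}
            = \arg\max_{W \in L} P(O \mid W)\, P(W)

The last step drops P(O) because, as noted above, the denominator is the same for every candidate sentence W.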

8
Noisy channel model
[Figure: Ŵ = argmax_W P(O|W) · P(W), with P(O|W) labeled as the likelihood and P(W) as the prior]
9
The noisy channel model
  • Ignoring the denominator leaves us with two
    factors: P(Source) and P(Signal|Source)

10
Speech Architecture meets Noisy Channel
11
Architecture Five easy pieces
  • Feature extraction
  • Acoustic Modeling
  • HMMs, Lexicons, and Pronunciation
  • Decoding
  • Language Modeling

12
Feature Extraction
  • Digitize Speech
  • Extract Frames

13
Digitizing Speech
14
Digitizing Speech (A-D)
  • Sampling
  • measuring amplitude of signal at time t
  • 16,000 Hz (samples/sec) Microphone (Wideband)
  • 8,000 Hz (samples/sec) Telephone
  • Why these rates?
  • Need at least 2 samples per cycle
  • Max measurable frequency is half the sampling rate
    (the Nyquist frequency)
  • Human speech < 10,000 Hz, so we need at most 20K
    samples/sec
  • Telephone speech is filtered at 4K, so 8K is enough

15
Digitizing Speech (II)
  • Quantization
  • Representing real value of each amplitude as
    integer
  • 8-bit (-128 to 127) or 16-bit (-32768 to 32767)
  • Formats
  • 16 bit PCM
  • 8 bit mu-law log compression
  • LSB (Intel) vs. MSB (Sun, Apple)
  • Headers
  • Raw (no header)
  • Microsoft wav
  • Sun .au

[Figure annotation: 40-byte header]
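
As an illustration of the quantization step above (my own sketch, not code from the course), here is a minimal conversion of a floating-point waveform in [-1, 1] to 16-bit linear PCM and to 8-bit mu-law; mu = 255 is the usual telephony value:

    import numpy as np

    def to_pcm16(x):
        """Quantize samples in [-1, 1] to 16-bit signed integers."""
        return np.clip(np.round(x * 32767), -32768, 32767).astype(np.int16)

    def to_mulaw8(x, mu=255):
        """8-bit mu-law log compression of samples in [-1, 1]."""
        y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)  # compress to [-1, 1]
        return np.clip(np.round((y + 1) / 2 * 255), 0, 255).astype(np.uint8)

    # Example: 10 ms of a 1 kHz tone sampled at 8 kHz (telephone rate)
    t = np.arange(0, 0.01, 1 / 8000)
    tone = 0.5 * np.sin(2 * np.pi * 1000 * t)
    print(to_pcm16(tone)[:5], to_mulaw8(tone)[:5])
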
16
Frame Extraction
  • A frame (25 ms wide) extracted every 10 ms

[Figure: overlapping 25 ms frames taken every 10 ms along the waveform, yielding one feature vector a1, a2, a3, … per frame; figure from Simon Arnfield]
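
A minimal sketch of this frame extraction (my illustration, assuming a 16 kHz sampling rate rather than anything stated on the slide):

    import numpy as np

    def extract_frames(signal, sample_rate=16000, frame_ms=25, step_ms=10):
        """Slice a 1-D signal into overlapping frames, one frame per row."""
        frame_len = int(sample_rate * frame_ms / 1000)  # 400 samples at 16 kHz
        step = int(sample_rate * step_ms / 1000)        # 160 samples at 16 kHz
        n_frames = 1 + max(0, (len(signal) - frame_len) // step)
        return np.stack([signal[i * step : i * step + frame_len]
                         for i in range(n_frames)])

    frames = extract_frames(np.random.randn(16000))     # 1 second of fake audio
    print(frames.shape)                                  # (98, 400)
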
17
MFCC (Mel Frequency Cepstral Coefficients)
  • Do FFT to get spectral information
  • Like the spectrogram/spectrum we saw earlier
  • Apply Mel scaling
  • Linear below 1 kHz, logarithmic above, with equal
    numbers of samples above and below 1 kHz
  • Models the human ear's greater sensitivity at
    lower frequencies
  • Plus a Discrete Cosine Transform (DCT)
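
A compressed sketch of this pipeline for a single frame; the choices of 26 mel filters, a 512-point FFT, and 12 kept cepstral coefficients are common defaults assumed for illustration, not parameters stated on the slide:

    import numpy as np

    def hz_to_mel(f):
        return 2595 * np.log10(1 + f / 700.0)           # standard mel-scale formula

    def mel_filterbank(n_filters=26, n_fft=512, sr=16000):
        """Triangular filters spaced evenly on the mel scale."""
        mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
        hz_pts = 700 * (10 ** (mel_pts / 2595) - 1)
        bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)
        fbank = np.zeros((n_filters, n_fft // 2 + 1))
        for i in range(1, n_filters + 1):
            l, c, r = bins[i - 1], bins[i], bins[i + 1]
            fbank[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
            fbank[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
        return fbank

    def mfcc(frame, n_ceps=12, n_fft=512, sr=16000):
        """FFT -> mel filterbank -> log -> DCT, keeping coefficients 1..n_ceps."""
        spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n_fft)) ** 2
        log_mel = np.log(mel_filterbank(n_fft=n_fft, sr=sr) @ spectrum + 1e-10)
        n = len(log_mel)
        dct = np.cos(np.pi / n * (np.arange(n)[:, None] + 0.5) * np.arange(n)[None, :])
        return (log_mel @ dct)[1:n_ceps + 1]

    print(mfcc(np.random.randn(400)).shape)              # (12,)
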

18
Final Feature Vector
  • 39 Features per 10 ms frame
  • 12 MFCC features
  • 12 Delta MFCC features
  • 12 Delta-Delta MFCC features
  • 1 (log) frame energy
  • 1 Delta (log) frame energy
  • 1 Delta-Delta (log) frame energy
  • So each frame is represented by a 39-dimensional
    vector
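
Delta features are a frame-to-frame difference (or local regression) of the static features; a minimal sketch using a simple symmetric difference, one common choice among several:

    import numpy as np

    def deltas(features):
        """Per-frame symmetric difference: d[t] = (x[t+1] - x[t-1]) / 2."""
        padded = np.pad(features, ((1, 1), (0, 0)), mode="edge")
        return (padded[2:] - padded[:-2]) / 2.0

    cepstra = np.random.randn(100, 13)                   # 12 MFCCs + log energy per frame
    full = np.hstack([cepstra, deltas(cepstra), deltas(deltas(cepstra))])
    print(full.shape)                                    # (100, 39)
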

19
Where we are
  • Given a sequence of acoustic feature vectors,
    one every 10 ms
  • Goal: output a string of words
  • We'll work on how to do this

20
Speech Recognition with HMMs
21
The Three Basic Problems for HMMs
  • (From the classic formulation by Larry Rabiner
    after Jack Ferguson)
  • L. R. Rabiner. 1989. A Tutorial on Hidden Markov
    Models and Selected Applications in Speech
    Recognition. Proceedings of the IEEE, 77(2),
    257-286. Also in the Waibel and Lee volume.

22
The Three Basic Problems for HMMs
  • Problem 1 (Evaluation): Given the observation
    sequence O = (o1 o2 … oT) and an HMM model
    λ = (A, B, π), how do we efficiently compute
    P(O | λ), the probability of the observation
    sequence given the model?
  • Problem 2 (Decoding): Given the observation
    sequence O = (o1 o2 … oT) and an HMM model
    λ = (A, B, π), how do we choose a corresponding
    state sequence Q = (q1 q2 … qT) that is optimal
    in some sense (i.e., best explains the
    observations)? (A decoding sketch follows this
    list.)
  • Problem 3 (Learning): How do we adjust the model
    parameters λ = (A, B, π) to maximize P(O | λ)?
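
A minimal Viterbi decoder for Problem 2 on a discrete-observation HMM, worked in the log domain (an illustrative sketch, not the presentation's own code):

    import numpy as np

    def viterbi(obs, A, B, pi):
        """Most likely state sequence for a discrete HMM.
        A: (N, N) transitions, B: (N, V) emissions, pi: (N,) initial probabilities."""
        N, T = A.shape[0], len(obs)
        logA, logB, logpi = np.log(A), np.log(B), np.log(pi)
        delta = np.zeros((T, N))             # best log score ending in state j at time t
        back = np.zeros((T, N), dtype=int)   # backpointers
        delta[0] = logpi + logB[:, obs[0]]
        for t in range(1, T):
            scores = delta[t - 1][:, None] + logA        # scores[i, j]: from state i to j
            back[t] = scores.argmax(axis=0)
            delta[t] = scores.max(axis=0) + logB[:, obs[t]]
        path = [int(delta[-1].argmax())]
        for t in range(T - 1, 0, -1):
            path.append(int(back[t, path[-1]]))
        return path[::-1], delta[-1].max()

    # Toy 2-state, 3-symbol example
    A = np.array([[0.7, 0.3], [0.4, 0.6]])
    B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
    pi = np.array([0.6, 0.4])
    print(viterbi([0, 1, 2, 2], A, B, pi))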

23
The Viterbi Trellis
24
HMMs for speech
25
But phones aren't homogeneous
26
So we'll need to break phones into subphones
27
Now a word looks like this
28
Back to Viterbi with speech, but w/out subphones
for a sec
29
Vector Quantization
  • Idea Make MFCC vectors look like symbols that we
    can count
  • By building a mapping function that maps each
    input vector into one of a small number of
    symbols
  • Then compute probabilities just by counting
  • This is called Vector Quantization or VQ
  • Not used for ASR any more: too simple
  • But is useful to consider as a starting point.

30
Vector Quantization
  • Create a training set of feature vectors
  • Cluster them into a small number of classes
  • Represent each class by a discrete symbol
  • For each class vk, we can compute the probability
    that it is generated by a given HMM state using
    Baum-Welch as above

31
VQ
  • We'll define a
  • Codebook, which lists for each symbol
  • A prototype vector, or codeword
  • If we had 256 classes (8-bit VQ),
  • A codebook with 256 prototype vectors
  • Given an incoming feature vector, we compare it
    to each of the 256 prototype vectors
  • We pick whichever one is closest (by some
    distance metric)
  • And replace the input vector by the index of this
    prototype vector
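
A sketch of the lookup step just described, using squared Euclidean distance and a hypothetical random 256-entry codebook (in practice the codebook comes from clustering, as on the next slides):

    import numpy as np

    def quantize(vectors, codebook):
        """Replace each feature vector by the index of its nearest codeword."""
        d = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
        return d.argmin(axis=1)

    rng = np.random.default_rng(0)
    codebook = rng.standard_normal((256, 39))   # 8-bit VQ: 256 prototype vectors
    frames = rng.standard_normal((100, 39))     # 100 frames of 39-D features
    print(quantize(frames, codebook)[:10])      # discrete symbols, one per frame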

32
VQ
33
VQ requirements
  • A distance metric or distortion metric
  • Specifies how similar two vectors are
  • Used
  • to build clusters
  • To find prototype vector for cluster
  • And to compare incoming vector to prototypes
  • A clustering algorithm
  • K-means, etc.

34
Distance metrics
  • Simplest
  • Euclidean distance
  • Also called sum-squared error
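
For reference, the standard definitions (not reproduced from the slide's figure) of the Euclidean (sum-squared error) distance and the diagonal-covariance Mahalanobis distance mentioned on the next slide, between an input vector x and a prototype μ with per-dimension variances σ_i², are:

    d_{\mathrm{euclidean}}(x, \mu) = \sum_{i=1}^{D} (x_i - \mu_i)^2

    d_{\mathrm{mahalanobis}}(x, \mu) = \sum_{i=1}^{D} \frac{(x_i - \mu_i)^2}{\sigma_i^2}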

35
Summary VQ
  • To deal with real-valued input
  • Convert the input to a symbol
  • By choosing closest prototype vector in a
    preclustered codebook
  • Where closest is defined by
  • Euclidean distance
  • Mahalanobis distance
  • Then just use Baum-Welch as above

36
Language Model
  • N-gram
  • Grammar

37
LVCSR Search Algorithm
  • Goal of search how to combine AM and LM
  • Viterbi search
  • Review and adding in LM
  • Beam search
  • Silence models
  • A* Search
  • Fast match
  • Tree structured lexicons
  • N-Best and multipass search
  • N-best
  • Word lattice and word graph
  • Forward-Backward search (not related to F-B
    training)

38
Evaluation
  • How do we evaluate recognizers?
  • Word error rate

39
Word Error Rate
  • Word Error Rate =
  •   100 × (Insertions + Substitutions + Deletions)
  •   ----------------------------------------------
  •        Total Words in Correct Transcript
  • Alignment example:
  • REF:  portable **   PHONE UPSTAIRS last night so
  • HYP:  portable FORM OF    STORES   last night so
  • Eval:          I    S     S
  • WER = 100 × (1 + 2 + 0) / 6 = 50%
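
A sketch of computing WER with a dynamic-programming word alignment (minimum edit distance); this is my own illustration, not the NIST sclite tool described on the next slide:

    def wer(ref, hyp):
        """Word error rate: 100 * (S + D + I) / len(ref), via minimum edit distance."""
        r, h = ref.split(), hyp.split()
        # d[i][j] = minimum edits to turn the first i ref words into the first j hyp words
        d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
        for i in range(len(r) + 1):
            d[i][0] = i                      # i deletions
        for j in range(len(h) + 1):
            d[0][j] = j                      # j insertions
        for i in range(1, len(r) + 1):
            for j in range(1, len(h) + 1):
                sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
                d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
        return 100.0 * d[-1][-1] / len(r)

    print(wer("portable phone upstairs last night so",
              "portable form of stores last night so"))   # 50.0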

40
NIST sctk-1.3 scoring software: computing WER with
sclite
  • http://www.nist.gov/speech/tools/
  • Sclite aligns a hypothesized text (HYP) (from the
    recognizer) with a correct or reference text
    (REF) (human transcribed)
  • id: (2347-b-013)
  • Scores: (#C #S #D #I) 9 3 1 2
  • REF:  was an engineer SO I   i was always with **** **** MEN UM   and they
  • HYP:  was an engineer ** AND i was always with THEM THEY ALL THAT and they
  • Eval:                 D  S                     I    I    S   S

41
Summary on WER
  • WER is clearly a better metric than, e.g.,
    perplexity
  • But should we be more concerned with meaning
    (semantic error rate)?
  • Good idea, but hard to agree on
  • Has been applied in dialogue systems, where the
    desired semantic output is clearer
  • Recent research: modify training to directly
    minimize WER instead of maximizing likelihood

42
What we are searching for
  • Given Acoustic Model (AM) and Language Model
    (LM)

[Equation (1): Ŵ = argmax_W P(O|W) · P(W), where P(O|W)
is the AM (likelihood) and P(W) is the LM (prior)]
43
Combining Acoustic and Language Models
  • We don't actually use equation (1)
  • The AM underestimates the acoustic probability
  • Why? Bad independence assumptions
  • Intuition: we compute (independent) AM probability
    estimates every 10 ms, but the LM only every word
  • AM and LM have vastly different dynamic ranges

44
Language Model Scaling Factor
  • Solution: add a language model weight (also
    called the language weight LW or language model
    scaling factor LMSF)
  • Value determined empirically; it is positive (why?)
  • For Sphinx and similar systems, generally in the
    range 10 - 3.

45
Word Insertion Penalty
  • But LM prob P(W) also functions as penalty for
    inserting words
  • Intuition: when a uniform language model (every
    word has an equal probability) is used, the LM
    probability acts as a 1/N penalty multiplier
    taken for each word
  • If the penalty is large, the decoder will prefer
    fewer, longer words
  • If the penalty is small, the decoder will prefer
    more, shorter words
  • When tuning the LM weight to balance the AM, this
    penalty is a side effect
  • So we add a separate word insertion penalty to
    offset it

46
Log domain
  • We do everything in the log domain
  • So the final equation is:
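
The equation itself appears only as an image on the original slide; the standard log-domain combination of the acoustic model, language model scaling factor, and word insertion penalty (as in Jurafsky and Martin's textbook treatment of this material) is:

    \hat{W} = \arg\max_{W} \Big[ \log P(O \mid W) + \mathrm{LMSF} \cdot \log P(W) + N \cdot \log \mathrm{WIP} \Big]

where N is the number of words in W.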

47
Language Model Scaling Factor
  • As LMSF is increased
  • More deletion errors (since the penalty for
    transitioning between words increases)
  • Fewer insertion errors
  • Need wider search beam (since path scores larger)
  • Less influence of acoustic model observation
    probabilities

48
Word Insertion Penalty
  • Controls trade-off between insertion and deletion
    errors
  • As penalty becomes larger (more negative)
  • More deletion errors
  • Fewer insertion errors
  • Acts as a model of effect of length on
    probability
  • But probably not a good model (geometric
    assumption probably bad for short sentences)

49
Adding LM probabilities to Viterbi (1) Uniform
LM
  • Visualizing the search space for 2 words

50
Viterbi trellis with 2 words and uniform LM
  • Null transition from the end-state of each word
    to start-state of all (both) words.

51
Viterbi for 2 word continuous recognition
  • Viterbi search computations are done
    time-synchronously from left to right, i.e., each
    cell for time t is computed before proceeding to
    time t+1

52
Search space for unigram LM
53
Search space with bigrams
54
Speeding things up
  • Viterbi is O(N²T), where N is the total number of
    HMM states and T is the length of the utterance
  • This is too large for real-time search
  • A ton of work in ASR search is just to make
    search faster
  • Beam search (pruning)
  • Fast match
  • Tree-based lexicons

55
Beam search
  • Instead of retaining all candidates (cells) at
    every time frame
  • Use a threshold T to keep only a subset
  • At each time t
  • Identify the state with the lowest cost Dmin
  • Each state with cost > Dmin + T is discarded
    (pruned) before moving on to time t+1 (see the
    sketch below)
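
A minimal sketch of that pruning step in the cost (negative log probability) domain; real decoders interleave this with the Viterbi update, so this is an illustration only:

    def prune(costs, beam_width):
        """Keep only states whose cost is within beam_width of the best cost.
        costs: dict mapping state -> accumulated negative log probability."""
        d_min = min(costs.values())
        return {s: c for s, c in costs.items() if c <= d_min + beam_width}

    active = {"s1": 12.3, "s2": 14.1, "s3": 25.0, "s4": 13.0}
    print(prune(active, beam_width=5.0))    # {'s1': 12.3, 's2': 14.1, 's4': 13.0}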

56
Viterbi Beam search
  • Is the most common and powerful search algorithm
    for LVCSR
  • Note
  • What makes this possible is the time-synchronous
    search
  • We are comparing paths of equal length
  • For two different word sequences W1 and W2
  • We are comparing P(W1 | O[0:t]) and P(W2 | O[0:t])
  • Based on the same partial observation sequence
    O[0:t]
  • So the denominator is the same and can be ignored
  • Time-asynchronous search (A*) is harder

57
Viterbi Beam Search
  • Empirically, a beam size of 5-10% of the search
    space suffices
  • Thus 90-95% of HMM states don't have to be
    considered at each time t
  • Vast savings in time.

58
A* Search (A* Decoding)
  • Intuition
  • If we had good heuristics for guiding decoding
  • We could do depth-first (best-first) search and
    not waste all our time on computing all those
    paths at every time step as Viterbi does.
  • A* decoding, also called stack decoding, is an
    attempt to do that.
  • A* also does not make the Viterbi assumption
  • Uses the actual forward probability, rather than
    the Viterbi approximation

59
Reminder: A* search
  • A search algorithm is admissible if it can
    guarantee to find an optimal solution if one
    exists.
  • Heuristic search functions rank nodes in the
    search space by f(N), the goodness of each node N
    in a search tree, computed as
  • f(N) = g(N) + h(N), where
  • g(N): the distance of the partial path already
    traveled from the root S to node N
  • h(N): a heuristic estimate of the remaining
    distance from node N to the goal node G (a generic
    A* sketch follows this list)
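
A generic A* skeleton over an abstract search graph, illustrating f(N) = g(N) + h(N); this is my own sketch of the general algorithm, not the speech-specific stack decoder, and the neighbors and h functions are assumed to be supplied by the caller:

    import heapq

    def a_star(start, is_goal, neighbors, h):
        """Best-first search ranked by f(N) = g(N) + h(N).
        neighbors(n) yields (next_node, edge_cost); h(n) estimates remaining cost."""
        frontier = [(h(start), 0.0, start, [start])]   # (f, g, node, path)
        best_g = {start: 0.0}
        while frontier:
            f, g, node, path = heapq.heappop(frontier)
            if is_goal(node):
                return path, g
            for nxt, cost in neighbors(node):
                g2 = g + cost
                if g2 < best_g.get(nxt, float("inf")):
                    best_g[nxt] = g2
                    heapq.heappush(frontier, (g2 + h(nxt), g2, nxt, path + [nxt]))
        return None, float("inf")

    # Toy graph; with h = 0 this reduces to uniform-cost search
    graph = {"S": [("A", 1), ("B", 4)], "A": [("G", 5)], "B": [("G", 1)], "G": []}
    print(a_star("S", lambda n: n == "G", lambda n: graph[n], lambda n: 0.0))
    # (['S', 'B', 'G'], 5.0)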

60
Reminder: A* search
  • If the heuristic function h(N) estimating the
    remaining distance from N to the goal node G is an
    underestimate of the true distance, best-first
    search is admissible, and is called A* search.

61
A* search for speech
  • The search space is the set of possible sentences
  • The forward algorithm can tell us the cost of the
    current path so far, g(·)
  • We need an estimate of the cost from the current
    node to the end, h(·)

62
A* Decoding (2)
63
Stack decoding (A*) algorithm
64
Making A* work: h(·)
  • If h(·) is zero, this is just breadth-first search
  • Stupid estimates of h(·)
  • Amount of time left in the utterance
  • Slightly smarter
  • Estimate the expected cost-per-frame for the
    remaining path
  • Multiply that by the remaining time
  • This can be computed from the training set (how
    much was the average acoustic cost for a frame in
    the training set)
  • Later: in multi-pass decoding, we can use the
    backward algorithm to estimate h for any
    hypothesis!

65
N-best and multipass search
  • The ideal search strategy would use every
    available knowledge source (KS)
  • But it is often difficult or expensive to
    integrate a very complex KS into the first-pass
    search
  • For example, parsers as a language model have
    long-distance dependencies that violate dynamic
    programming assumptions
  • Other knowledge sources might not be
    left-to-right (knowledge of following words can
    help predict preceding words)
  • For this reason (and others we will see) we use
    multipass search algorithms

66
Multipass Search
67
Some definitions
  • N-best list
  • Instead of single best sentence (word string),
    return ordered list of N sentence hypotheses
  • Word lattice
  • Compact representation of word hypotheses and
    their times and scores
  • Word graph
  • FSA representation of lattice in which times are
    represented by topology

68
N-best list
69
Word lattice
  • Encodes
  • Word
  • Starting/ending time(s) of word
  • Acoustic score of word
  • More compact than an N-best list
  • An utterance with 10 words and 2 hypotheses per
    word gives 2^10 = 1024 different sentences
  • But a lattice represents them with only 20
    different word hypotheses

70
Word Graph
71
Converting word lattice to word graph
  • Word lattice can have range of possible end
    frames for word
  • Create an edge from (w_i, t_i) to (w_j, t_j) if
    t_j - 1 is one of the end-times of w_i (see the
    sketch below)
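
A sketch of that rule over a hypothetical lattice representation in which each entry holds a word, a start frame, and a set of possible end frames (illustration only; real lattice formats differ):

    def lattice_to_graph(lattice):
        """Return edges (i, j) where hypothesis j can directly follow hypothesis i."""
        edges = []
        for i, wi in enumerate(lattice):
            for j, wj in enumerate(lattice):
                # w_j starts in the frame right after one of w_i's possible end frames
                if wj["start"] - 1 in wi["ends"]:
                    edges.append((i, j))
        return edges

    lattice = [{"word": "if", "start": 0, "ends": {9, 10}},
               {"word": "breath", "start": 11, "ends": {30}},
               {"word": "bread", "start": 11, "ends": {29, 31}}]
    print(lattice_to_graph(lattice))    # [(0, 1), (0, 2)]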

72
Computing N-best lists
  • In the worst case, an admissible algorithm for
    finding the N most likely hypotheses is
    exponential in the length of the utterance.
  • S. Young. 1984. Generating Multiple Solutions
    from Connected Word DP Recognition Algorithms.
    Proc. of the Institute of Acoustics, 64,
    351-354.
  • For example, if the AM and LM scores were nearly
    identical for all word sequences, we would have to
    consider all permutations of word sequences for
    the whole sentence (all with the same scores).
  • But of course if this were true, we couldn't do
    ASR at all!

73
  • Demo
  • LVCSR system
  • developed by the Intelligent Computing Lab

74
  • Thanks!