Speech Processing: Text to Speech Synthesis

1
Speech Processing: Text to Speech Synthesis
  • August 11, 2005

2
CS 224S / LINGUIST 236: Speech Recognition and
Synthesis
  • Dan Jurafsky

Lecture 3: TTS Overview, History, and
Letter-to-Sound
IP Notice: lots of info, text, and diagrams on
these slides come (thanks!) from Alan Black's
excellent lecture notes and from Richard Sproat's
great new slides.
3
  • Text-To-Speech: defined as the automatic
    production of speech, through a
    grapheme-to-phoneme transcription of the
    sentences to utter.
  • Text Analyzer
  • Normalization
  • Morphological Analyzer
  • Contextual Analyzer
  • Syntactic-Prosodic Parser
  • Grapheme to phoneme rules
  • Prosody Generator

4
  • Dictionary based G2P
  • Rule based G2P

5
Text Normalization
  • Analysis of raw text into pronounceable words
  • Sample problems
  • The robbers stole Rs 100 lakhs from the bank
  • It's 13/4 Modern Ave.
  • The home page is http://www.facweb.iitkgp.ernet.in/sudeshna/
  • yes, see you the following tues, that's 23/08/05
  • Steps
  • Identify tokens in text
  • Chunk tokens into reasonably sized sections
  • Map tokens to words
  • Identify types for words
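The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the abbreviation table, the token types, and the digit-by-digit reading of numbers are all hypothetical choices, not something the slides specify.

```python
import re

# Hypothetical expansion tables; a real system would need far larger ones.
ABBREVIATIONS = {"ave.": "avenue", "tues": "tuesday", "rs": "rupees"}
DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]


def classify(token):
    """Assign a coarse type to a token (the 'identify types' step)."""
    if re.fullmatch(r"\d{1,2}/\d{1,2}/\d{2,4}", token):
        return "date"
    if re.fullmatch(r"\d+", token):
        return "number"
    if token.lower() in ABBREVIATIONS:
        return "abbrev"
    return "word"


def expand(token):
    """Map one token to a list of pronounceable words."""
    kind = classify(token)
    if kind == "abbrev":
        return [ABBREVIATIONS[token.lower()]]
    if kind == "number":
        # Assumption: read digit by digit. Real systems pick between
        # cardinal, ordinal, year and digit-string readings from context.
        return [DIGITS[int(d)] for d in token]
    if kind == "date":
        words = []
        for part in token.split("/"):
            words.extend(DIGITS[int(d)] for d in part)
        return words
    return [token.lower()]


def normalize(text):
    """Tokenize on whitespace, strip edge punctuation, expand each token."""
    words = []
    for raw in text.split():
        token = raw.strip(",;:!?")
        if token:
            words.extend(expand(token))
    return words


print(normalize("see you tues, that's 23/08/05"))
```

Even this toy version shows why typing tokens matters: the same digits would be read differently as a date, a house number, or an amount of money.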

6
Grapheme to Phoneme
  • How to pronounce a word? Look it up in a dictionary! But:
  • Unknown words and names will be missing
  • Turkish, German, and other hard languages
  • uygarlaStIramadIklarImIzdanmISsInIzcasIna
  • "(behaving) as if you are among those whom we
    could not civilize"
  • Morphemes: uygar +laS +tIr +ama +dIk +lar +ImIz +dan +mIS
    +sInIz +casIna
  • Glosses: civilized +bec +caus +NegAble +ppart +pl +p1pl
    +abl +past +2pl +AsIf
  • So we need Letter-to-Sound rules
  • Also homograph disambiguation (wind, live, read)

7
Grapheme to Phoneme in Indian languages
  • Hindi: does not need a dictionary. Letter-to-sound
    rules can capture the pronunciation of most
    words.
  • Bengali: harder than Hindi, but can mostly be
    handled using rules and a list of exceptions.

8
Prosody: from words+phones to boundaries, accent,
F0, duration
  • The term prosody refers to certain properties of
    the speech signal which are related to audible
    changes in pitch, loudness, syllable length.
  • Prosodic phrasing
  • Need to break utterances into phrases
  • Punctuation is useful, not sufficient
  • Accents
  • Prediction of accents: which syllables should be
    accented
  • Realization of F0 contour: given accents/tones,
    generate F0 contour
  • Duration
  • Predicting duration of each phone

9
Waveform synthesis: from segments, F0, duration
to waveform
  • Collecting diphones
  • need to record diphones in correct contexts
  • l sounds different in onset than coda, t is
    flapped sometimes, etc.
  • need quiet recording room
  • then need to label them very very exactly
  • Unit selection: how to pick the right unit?
    Search
  • Joining the units
  • dumb (just stick'em together)
  • PSOLA (Pitch-Synchronous Overlap and Add)
  • MBROLA (Multi-Band Resynthesis Overlap and Add)

10
Festival
  • Open source speech synthesis system
  • Designed for development and runtime use
  • Used in many commercial and academic systems
  • Distributed with RedHat 9.x
  • Hundreds of thousands of users
  • Multilingual
  • No built-in language
  • Designed to allow addition of new languages
  • Additional tools for rapid voice development
  • Statistical learning tools
  • Scripts for building models

Text from Richard Sproat
11
Festival as software
  • http://festvox.org/festival/
  • General system for multi-lingual TTS
  • C/C++ code with Scheme scripting language
  • General replaceable modules
  • Lexicons, LTS, duration, intonation, phrasing,
    POS tagging, tokenizing, diphone/unit selection,
    signal processing
  • General tools
  • Intonation analysis (f0, Tilt), signal
    processing, CART building, N-gram, SCFG, WFST

Text from Richard Sproat
12
Festival as software
  • http://festvox.org/festival/
  • No fixed theories
  • New languages without new C code
  • Multiplatform (Unix/Windows)
  • Full sources in distribution
  • Free software

Text from Richard Sproat
13
CMU FestVox project
  • Festival is an engine, how do you make voices?
  • Festvox: building synthetic voices
  • Tools, scripts, documentation
  • Discussion and examples for building voices
  • Example voice databases
  • Step by step walkthroughs of processes
  • Support for English and other languages
  • Support for different waveform synthesis methods
  • Diphone
  • Unit selection
  • Limited domain

Text from Richard Sproat
14
Synthesis tools
  • I want my computer to talk
  • Festival Speech Synthesis
  • I want my computer to talk in my voice
  • FestVox Project
  • I want it to be fast and efficient
  • Flite

Text from Richard Sproat
15
Using Festival
  • How to get Festival to talk
  • Scheme (Festival's scripting language)
  • Basic Festival commands

Text from Richard Sproat
16
Getting it to talk
  • Say a file
  • festival --tts file.txt
  • From Emacs
  • say region, say buffer
  • Command line interpreter
  • festival> (SayText "hello")

Text from Richard Sproat
17
Quick Intro to Scheme
  • Scheme is a dialect of LISP
  • expressions are
  • atoms or
  • lists
  • a bcd "hello world" 12.3
  • (a b c)
  • (a (1 2) seven)
  • Interpreter evaluates expressions
  • Atoms evaluate as variables
  • Lists evaluate as function calls
  • bxx
  • 3.14
  • (+ 2 3)

Text from Richard Sproat
18
Quick Intro to Scheme
  • Setting variables
  • (set! a 3.14)
  • Defining functions
  • (define (timestwo n) (* 2 n))
  • (timestwo a)
  • 6.28

Text from Richard Sproat
19
Lists in Scheme
  • festival> (set! alist '(apples pears bananas))
  • (apples pears bananas)
  • festival> (car alist)
  • apples
  • festival> (cdr alist)
  • (pears bananas)
  • festival> (set! blist (cons 'oranges alist))
  • (oranges apples pears bananas)
  • festival> append alist blist
  • #<SUBR(6): append>
  • (apples pears bananas)
  • (oranges apples pears bananas)
  • festival> (append alist blist)
  • (apples pears bananas oranges apples pears
    bananas)
  • festival> (length alist)
  • 3
  • festival> (length (append alist blist))
  • 7

Text from Richard Sproat
20
Scheme speech
  • Make an utterance of type text
  • festival> (set! utt1 (Utterance Text "hello"))
  • #<Utterance 0xf6855718>
  • Synthesize an utterance
  • festival> (utt.synth utt1)
  • #<Utterance 0xf6855718>
  • Play waveform
  • festival> (utt.play utt1)
  • #<Utterance 0xf6855718>
  • Do all together
  • festival> (SayText "This is an example")
  • #<Utterance 0xf6961618>

Text from Richard Sproat
21
Scheme speech
  • In a file
  • (define (SpeechPlus a b)
  •   (SayText
  •     (format nil
  •       "%d plus %d equals %d"
  •       a b (+ a b))))
  • Loading files
  • festival> (load "file.scm")
  • t
  • Do all together
  • festival> (SpeechPlus 2 4)
  • #<Utterance 0xf6961618>

Text from Richard Sproat
22
Scheme speech
  • (define (sp_time hour minute)
  •   (cond
  •     ((< hour 12)
  •       (SayText
  •         (format nil
  •           "It is %d %d in the morning"
  •           hour minute)))
  •     ((< hour 18)
  •       (SayText
  •         (format nil
  •           "It is %d %d in the afternoon"
  •           (- hour 12) minute)))
  •     (t
  •       (SayText
  •         (format nil
  •           "It is %d %d in the evening"
  •           (- hour 12) minute)))))

Text from Richard Sproat
23
Getting help
  • Online manual
  • http://festvox.org/docs/manual-1.4.3
  • Alt-h (or esc-h) on current symbol: short help
  • Alt-s (or esc-s) to speak help
  • Alt-m: go to man page
  • Use TAB key for completion

24
Lexicons and Lexical Entries
  • You can explicitly give pronunciations for words
  • Each language/dialect has its own lexicon
  • You can look up words with
  • (lex.lookup WORD)
  • You can add entries to the current lexicon
  • (lex.add.entry NEWENTRY)
  • Entry: (WORD POS (SYL0 SYL1 ...))
  • Syllable: ((PHONE0 PHONE1 ...) STRESS)
  • Example
  • ("cepstra" n (((k eh p) 1) ((s t r aa) 0)))

25
Converting from words to phones
  • Two methods
  • Dictionary-based
  • Rule-based (Letter-to-sound = LTS)
  • Early systems: all LTS
  • MITalk was radical in having a huge 10K-word
    dictionary
  • Now systems use a combination
  • CMU dictionary: 127K words
  • http://www.speech.cs.cmu.edu/cgi-bin/cmudict

26
Dictionaries aren't always sufficient
  • Unknown words
  • Their number seems to grow linearly with the number
    of words in unseen text
  • Mostly person, company, product names
  • But also foreign words, etc.
  • So commercial systems have 3-part system
  • Big dictionary
  • Special code for handling names
  • Machine learned LTS system for other unknown words
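The three-part arrangement amounts to a fallback chain, which can be sketched as follows. Everything here is a hypothetical stand-in: the two tiny lexicons, the plural-stripping name rule, and the one-symbol-per-letter `letter_to_sound` placeholder are assumptions for illustration only.

```python
# Toy stand-ins; every entry and rule below is hypothetical.
LEXICON = {"speech": ["s", "p", "iy", "ch"]}
NAME_LEXICON = {"walter": ["w", "ao", "l", "t", "er"]}


def name_pronunciation(word):
    """Names component: try the name list, then strip a final 's'."""
    key = word.lower()
    if key in NAME_LEXICON:
        return NAME_LEXICON[key]
    if key.endswith("s") and key[:-1] in NAME_LEXICON:
        return NAME_LEXICON[key[:-1]] + ["z"]
    return None


def letter_to_sound(word):
    """Placeholder for a learned LTS model: one symbol per letter."""
    return list(word.lower())


def pronounce(word):
    """Return (phones, source), trying each component in order."""
    key = word.lower()
    if key in LEXICON:
        return LEXICON[key], "dictionary"
    phones = name_pronunciation(word)
    if phones is not None:
        return phones, "names"
    return letter_to_sound(word), "lts"


for w in ["speech", "Walters", "blick"]:
    print(w, pronounce(w))
```

Returning the source alongside the phones makes it easy to measure how often each component is actually used on new text.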

27
Letter-to-Sound Rules
  • Festival LTS rules
  • ( LEFTCONTEXT [ ITEMS ] RIGHTCONTEXT = NEWITEMS )
  • Example
  • ( # [ c h ] C = k )
  • ( # [ c h ] = ch )
  • # denotes beginning of word
  • C means all consonants
  • Rules apply in order
  • "christmas" is pronounced with k
  • But a word with ch followed by a non-consonant is
    pronounced ch
  • E.g., "choice"
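The "rules apply in order" behaviour can be illustrated with a small Python matcher. This is a simplified sketch, not Festival's rule engine: contexts are limited to a single symbol, and letters with no matching rule fall back to themselves, which is an assumption made only to keep the example short.

```python
CONSONANTS = set("bcdfghjklmnpqrstvwxyz")

# (left context, letters to match, right context, output phones)
# "#" = word boundary, "C" = any consonant, "" = no constraint.
RULES = [
    ("#", "ch", "C", ["k"]),    # word-initial ch before a consonant
    ("#", "ch", "", ["ch"]),    # any other word-initial ch
]


def context_ok(symbol, char):
    """Check one context symbol against one character of the word."""
    if symbol == "":
        return True
    if symbol == "#":
        return char == "#"
    if symbol == "C":
        return char in CONSONANTS
    return symbol == char


def apply_lts(word, rules=RULES):
    """Scan left to right; at each position the first matching rule wins."""
    padded = "#" + word.lower() + "#"
    phones = []
    i = 1
    while i < len(padded) - 1:
        for left, items, right, out in rules:
            end = i + len(items)
            if (padded.startswith(items, i)
                    and context_ok(left, padded[i - 1])
                    and context_ok(right, padded[end])):
                phones.extend(out)
                i = end
                break
        else:
            # Illustration-only fallback: the letter stands for itself.
            phones.append(padded[i])
            i += 1
    return phones


print(apply_lts("christmas"))
print(apply_lts("choice"))
```

Swapping the two rules would make the general rule fire first, so "christmas" would wrongly start with ch; the ordering carries the exception logic.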

28
Stress rules in LTS
  • English: famously evil; one from Allen et al. 1987
  • V -> [1-stress] / X _ C* {Vshort C C? | V} {Vshort C* | V}
  • Where X must contain all prefixes
  • Assign 1-stress to the vowel in a syllable
    preceding a weak syllable followed by a
    morpheme-final syllable containing a short vowel
    and 0 or more consonants (e.g. difficult)
  • Assign 1-stress to the vowel in a syllable
    preceding a weak syllable followed by a
    morpheme-final vowel (e.g. oregano)
  • etc

29
Modern method: learning LTS rules automatically
  • Induce LTS from a dictionary of the language
  • Black et al. 1998
  • Applied to English, German, French
  • Two steps: alignment and (CART-based)
    rule induction

30
Alignment
  • Letters: c h e c k e d
  • Phones: ch _ eh _ k _ t
  • Black et al. Method 1
  • First scatter epsilons in all possible ways to
    cause letters and phones to align
  • Then collect stats for P(letter|phone) and select
    the best alignment to generate new stats
  • This is iterated a number of times until it settles
    (5-6 iterations)
  • This is the EM (expectation maximization) algorithm

31
Alignment
  • Black et al method 2
  • Hand specify which letters can be rendered as
    which phones
  • C goes to k/ch/s/sh
  • W goes to w/v/f, etc.
  • Once the mapping table is created, find all valid
    alignments, find p(letter|phone), score all
    alignments, take best
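The "find all valid alignments" step of the hand-specified method can be sketched as a small recursive enumeration. The allowables table below is a hypothetical fragment covering only the letters of "checked", and "_" stands for epsilon (a letter that produces no phone); scoring with p(letter|phone) is left out.

```python
# Hypothetical allowables table: which phones (or "_") each letter may map to.
ALLOWED = {
    "c": {"k", "ch", "s", "sh", "_"},
    "h": {"_"},
    "e": {"eh", "_"},
    "k": {"k", "_"},
    "d": {"d", "t"},
}


def alignments(letters, phones, allowed=ALLOWED):
    """Return every way to assign one phone or '_' to each letter so that
    the non-epsilon phones, read in order, spell out the phone string."""
    if not letters:
        return [[]] if not phones else []
    head, rest = letters[0], letters[1:]
    options = allowed.get(head, set())
    results = []
    if "_" in options:                       # letter is silent
        for tail in alignments(rest, phones, allowed):
            results.append(["_"] + tail)
    if phones and phones[0] in options:      # letter produces next phone
        for tail in alignments(rest, phones[1:], allowed):
            results.append([phones[0]] + tail)
    return results


for a in alignments(list("checked"), ["ch", "eh", "k", "t"]):
    print(" ".join(a))
```

For this table the word has two valid alignments, differing in whether the second c or the k carries the k phone; the probability scoring described on the slide is what would choose between them.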

32
Alignment
  • Some alignments will turn out to be really bad.
  • These are just the cases where pronunciation
    doesn't match letters
  • Dept: d ih p aa r t m ah n t
  • CMU: s iy eh m y uw
  • Lieutenant: l eh f t eh n ax n t (British)
  • Also foreign words
  • These can just be removed from alignment training

33
Building CART trees
  • Build a CART tree for each letter in the alphabet (26
    plus accented) using a context of +/-3 letters
  • # # # c h e c -> ch
  • c h e c k e d -> _
  • This produces 92-96% correct LETTER accuracy
    (58-75% word accuracy) for English
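The training examples fed to the trees can be generated from an aligned word as sketched below. The function name and the use of "#" as padding are assumptions for illustration; the tree learner itself is not shown.

```python
def letter_windows(letters, phones, width=3):
    """Pair each letter's context window (width letters on each side,
    padded with '#') with the phone aligned to that letter."""
    pad = ["#"] * width
    padded = pad + list(letters) + pad
    examples = []
    for i, phone in enumerate(phones):
        centre = i + width
        window = padded[centre - width: centre + width + 1]
        examples.append((window, phone))
    return examples


aligned = ["ch", "_", "eh", "_", "k", "_", "t"]   # one phone or '_' per letter
for window, phone in letter_windows("checked", aligned):
    print(" ".join(window), "->", phone)
```

Each row is one classification example: the seven letters are the features and the phone (or epsilon) is the label for the tree of the centre letter.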

34
Improvements
  • Take names out of the training data
  • And acronyms
  • Detect both of these separately
  • And build special-purpose tools to do LTS for
    names and acronyms
  • Names
  • Can do morphology (Walters -> Walter, Lucasville)
  • Can write stress-shifting rules (Jordan ->
    Jordanian)
  • Rhyme analogy: Plotsky by analogy with Trotsky
    (replace tr with pl)
  • Liberman and Church: for the 250K most common names,
    got 212K (85%) from these modified-dictionary
    methods, used LTS for the rest.

35
Speech Recognition
36
Speech Recognition
  • Applications of Speech Recognition (ASR)
  • Dictation
  • Telephone-based Information (directions, air
    travel, banking, etc)
  • Hands-free (in car)
  • Speaker Identification
  • Language Identification
  • Second language ('L2') (accent reduction)
  • Audio archive searching

37
LVCSR
  • Large Vocabulary Continuous Speech Recognition
  • 20,000-64,000 words
  • Speaker independent (vs. speaker-dependent)
  • Continuous speech (vs isolated-word)

38
LVCSR Design Intuition
  • Build a statistical model of the speech-to-words
    process
  • Collect lots and lots of speech, and transcribe
    all the words.
  • Train the model on the labeled speech
  • Paradigm: Supervised Machine Learning + Search

39
Speech Recognition Architecture
40
The Noisy Channel Model
  • Search through space of all possible sentences.
  • Pick the one that is most probable given the
    waveform.

41
The Noisy Channel Model (II)
  • What is the most likely sentence out of all
    sentences in the language L given some acoustic
    input O?
  • Treat acoustic input O as sequence of individual
    observations
  • O = o1, o2, o3, ..., ot
  • Define a sentence as a sequence of words
  • W = w1, w2, w3, ..., wn

42
Noisy Channel Model (III)
  • Probabilistic implication: pick the highest-probability
    S
  • We can use Bayes' rule to rewrite this
  • Since the denominator is the same for each candidate
    sentence W, we can ignore it for the argmax
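The equations for this slide did not survive in the transcript. A standard way to write the three steps described above, using W for the candidate sentence and O for the acoustic observations as defined on the previous slide, is the following (a reconstruction, not the slide's exact notation):

```latex
\[
\hat{W} \;=\; \operatorname*{arg\,max}_{W \in L} P(W \mid O)
        \;=\; \operatorname*{arg\,max}_{W \in L} \frac{P(O \mid W)\, P(W)}{P(O)}
        \;=\; \operatorname*{arg\,max}_{W \in L} P(O \mid W)\, P(W)
\]
```

The last step holds because P(O) does not depend on W, so dividing by it cannot change which sentence attains the maximum.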

43
A quick derivation of Bayes Rule
  • Conditionals
  • Rearranging
  • And also

44
Bayes (II)
  • We know
  • So rearranging things
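The formulas on these two slides were lost in extraction. The usual derivation that the headings point to, written for generic events A and B (symbols assumed here), is:

```latex
\begin{align*}
P(A \mid B) &= \frac{P(A, B)}{P(B)}
  && \text{(definition of conditional probability)} \\
P(A, B) &= P(A \mid B)\, P(B)
  && \text{(rearranging)} \\
P(A, B) &= P(B \mid A)\, P(A)
  && \text{(and also, by symmetry)} \\
P(A \mid B)\, P(B) &= P(B \mid A)\, P(A)
  && \text{(both equal } P(A, B)\text{)} \\
P(A \mid B) &= \frac{P(B \mid A)\, P(A)}{P(B)}
  && \text{(Bayes' rule)}
\end{align*}
```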

45
Noisy channel model
[Equation labels: P(O|W) is the likelihood, P(W) is the prior]
46
The noisy channel model
  • Ignoring the denominator leaves us with two
    factors: P(Source) and P(Signal|Source)

47
Speech Architecture meets Noisy Channel
48
Five easy pieces
  • Feature extraction
  • Acoustic Modeling
  • HMMs, Lexicons, and Pronunciation
  • Decoding
  • Language Modeling

49
Feature Extraction
  • Digitize Speech
  • Extract Frames

50
Digitizing Speech
51
Digitizing Speech (A-D)
  • Sampling
  • measuring amplitude of signal at time t
  • 16,000 Hz (samples/sec) Microphone (Wideband)
  • 8,000 Hz (samples/sec) Telephone
  • Why?
  • Need at least 2 samples per cycle
  • max measurable frequency is half sampling rate
  • Human speech < 10,000 Hz, so need max 20K
  • Telephone filtered at 4K, so 8K is enough
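The "half the sampling rate" limit can be checked numerically. The sketch below is an illustration with hypothetical function names: it computes the limit and shows that a 5 kHz tone sampled at 8 kHz yields exactly the same samples as a 3 kHz tone, which is why telephone speech is filtered at 4 kHz before sampling.

```python
import math


def nyquist_limit(sample_rate_hz):
    """Highest frequency that can be represented at this sampling rate."""
    return sample_rate_hz / 2.0


def apparent_frequency(tone_hz, sample_rate_hz):
    """Frequency a sampled pure tone appears to have after aliasing."""
    folded = tone_hz % sample_rate_hz
    return min(folded, sample_rate_hz - folded)


def sample_cosine(tone_hz, sample_rate_hz, n_samples):
    """Sample cos(2*pi*f*t) at t = n / sample_rate."""
    return [math.cos(2.0 * math.pi * tone_hz * n / sample_rate_hz)
            for n in range(n_samples)]


print(nyquist_limit(8000))             # limit for telephone speech
print(apparent_frequency(5000, 8000))  # a 5 kHz tone looks like this
high = sample_cosine(5000, 8000, 16)
low = sample_cosine(3000, 8000, 16)
print(all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(high, low)))
```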

52
Digitizing Speech (II)
  • Quantization
  • Representing real value of each amplitude as
    integer
  • 8-bit (-128 to 127) or 16-bit (-32768 to 32767)
  • Formats
  • 16 bit PCM
  • 8 bit mu-law log compression
  • LSB (Intel) vs. MSB (Sun, Apple)
  • Headers
  • Raw (no header)
  • Microsoft wav
  • Sun .au
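Quantization and log compression can be sketched as follows. Amplitudes are assumed to be floats in [-1.0, 1.0], and the mu-law function uses the continuous textbook formula with mu = 255; it is not a bit-exact telephone codec.

```python
import math


def quantize_16bit(x):
    """Map an amplitude in [-1.0, 1.0] to a signed 16-bit integer."""
    x = max(-1.0, min(1.0, x))
    return max(-32768, min(32767, int(round(x * 32768))))


def mu_law_compress(x, mu=255):
    """Continuous mu-law curve: logarithmic in |x|, sign preserved."""
    x = max(-1.0, min(1.0, x))
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)


def mu_law_8bit(x, mu=255):
    """Compress, then quantize to a signed 8-bit integer."""
    y = mu_law_compress(x, mu)
    return max(-128, min(127, int(round(y * 128))))


quiet = 0.01
print(int(round(quiet * 128)))   # linear 8-bit code for a quiet sample
print(mu_law_8bit(quiet))        # mu-law 8-bit code for the same sample
print(quantize_16bit(quiet))     # linear 16-bit code
```

The quiet sample lands on a much larger code under mu-law than under linear 8-bit quantization, which is the point of log compression: more of the 8-bit range is spent on low amplitudes.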

40 byte header
53
Frame Extraction
  • A frame (25 ms wide) extracted every 10 ms

[Figure from Simon Arnfield: overlapping 25 ms frames taken every 10 ms, producing feature vectors a1, a2, a3, ...]
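Frame extraction with a 25 ms window and a 10 ms step can be sketched in plain Python. The function name and the choice to drop a final partial frame are assumptions; real front ends usually also apply a window function to each frame.

```python
def extract_frames(samples, sample_rate_hz, frame_ms=25, step_ms=10):
    """Slice a signal into overlapping frames of frame_ms, one every step_ms."""
    frame_len = int(sample_rate_hz * frame_ms / 1000)
    step = int(sample_rate_hz * step_ms / 1000)
    frames = []
    start = 0
    while start + frame_len <= len(samples):   # drop a trailing partial frame
        frames.append(samples[start:start + frame_len])
        start += step
    return frames


one_second = list(range(16000))                # stand-in for 1 s of 16 kHz audio
frames = extract_frames(one_second, 16000)
print(len(frames), len(frames[0]))
```

At 16 kHz each frame holds 400 samples and consecutive frames start 160 samples apart, so neighbouring frames share 15 ms of signal.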
54
MFCC (Mel Frequency Cepstral Coefficients)
  • Do FFT to get spectral information
  • Like the spectrogram/spectrum we saw earlier
  • Apply Mel scaling
  • Linear below 1 kHz, log above, equal samples above
    and below 1 kHz
  • Models the human ear: more sensitivity at lower
    frequencies
  • Plus Discrete Cosine Transformation
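The mel warping step can be illustrated with one widely used closed-form approximation, mel = 2595 * log10(1 + f / 700). The slides do not say which formula is intended, so this choice and the helper names are assumptions. The sketch places filter centres evenly on the mel axis and converts them back to Hz.

```python
import math


def hz_to_mel(f_hz):
    """Common mel approximation (assumed here, not taken from the slides)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_spaced_centres(low_hz, high_hz, n_filters):
    """Centre frequencies in Hz, equally spaced on the mel scale."""
    low, high = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (high - low) / (n_filters + 1)
    return [mel_to_hz(low + step * (i + 1)) for i in range(n_filters)]


print(round(hz_to_mel(1000.0)))
print([round(c) for c in mel_spaced_centres(0.0, 8000.0, 10)])
```

The centres crowd together at low frequencies and spread out at high ones, which is how the filterbank gives the lower part of the spectrum more resolution.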

55
Final Feature Vector
  • 39 Features per 10 ms frame
  • 12 MFCC features
  • 12 Delta MFCC features
  • 12 Delta-Delta MFCC features
  • 1 (log) frame energy
  • 1 Delta (log) frame energy
  • 1 Delta-Delta (log) frame energy
  • So each frame represented by a 39D vector
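How 13 static values per frame become 39 can be sketched as below. The delta here is a simple two-point difference with the edges clamped, which is an assumption; toolkits typically use a regression over a wider window. The output ordering (13 static, 13 delta, 13 delta-delta) is also just one possible layout.

```python
def deltas(frames):
    """Per-dimension slope at each frame: (next - previous) / 2,
    repeating the first and last frame at the edges."""
    n = len(frames)
    out = []
    for t in range(n):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, n - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out


def add_dynamic_features(base):
    """Append delta and delta-delta features to each static vector."""
    d1 = deltas(base)
    d2 = deltas(d1)
    return [b + x + y for b, x, y in zip(base, d1, d2)]


# Five toy frames of 13 static values (12 MFCCs + log energy), made up here.
static = [[float(t)] * 13 for t in range(5)]
features = add_dynamic_features(static)
print(len(features), len(features[0]))
```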

56
Where we are
  • Given a sequence of acoustic feature vectors,
    one every 10 ms
  • Goal: output a string of words
  • We'll spend 6 lectures on how to do this
  • Rest of today
  • Markov Models
  • Hidden Markov Models in the abstract
  • Forward Algorithm
  • Viterbi Algorithm
  • Start of HMMs for speech

57
Acoustic Modeling
  • Given a 39D vector corresponding to the
    observation of one frame oi
  • And given a phone q we want to detect
  • Compute p(oi|q)
  • Most popular method
  • GMM (Gaussian mixture models)
  • Other methods
  • MLP (multi-layer perceptron)

58
Acoustic Modeling: MLP computes p(q|o)
59
Gaussian Mixture Models
  • Also called fully-continuous HMMs
  • P(o|q) computed by a Gaussian

60
Gaussians for Acoustic Modeling
A Gaussian is parameterized by a mean and a
variance
[Figure: Gaussian curves of P(o|q) plotted against o, with different means. P(o|q) is highest at the mean and low very far from the mean.]
61
Training Gaussians
  • A (single) Gaussian is characterized by a mean
    and a variance
  • Imagine that we had some training data in which
    each phone was labeled
  • We could just compute the mean and variance from
    the data

62
But we need 39 gaussians, not 1!
  • The observation o is really a vector of length 39
  • So need a vector of Gaussians
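Estimating a "vector of Gaussians" from labeled frames and scoring a new frame can be sketched as follows. The sketch assumes one independent Gaussian per dimension, uses the plain maximum-likelihood variance (divide by n), and works in the log domain; the two-dimensional toy data is made up, and a real system would floor tiny variances.

```python
import math


def fit_diagonal_gaussian(vectors):
    """Per-dimension mean and variance of the frames labeled with one phone."""
    n = len(vectors)
    dim = len(vectors[0])
    means = [sum(v[d] for v in vectors) / n for d in range(dim)]
    variances = [sum((v[d] - means[d]) ** 2 for v in vectors) / n
                 for d in range(dim)]
    return means, variances


def log_likelihood(o, means, variances):
    """log p(o|q): sum of the per-dimension Gaussian log densities."""
    total = 0.0
    for x, m, var in zip(o, means, variances):
        total += -0.5 * (math.log(2.0 * math.pi * var) + (x - m) ** 2 / var)
    return total


frames_for_phone = [[0.0, 2.0], [2.0, 6.0]]    # toy 2-D frames, not 39-D
means, variances = fit_diagonal_gaussian(frames_for_phone)
print(means, variances)
print(log_likelihood([1.0, 4.0], means, variances) >
      log_likelihood([9.0, 9.0], means, variances))
```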

63
Actually, mixture of gaussians
  • Each phone is modeled by a sum of different
    gaussians
  • Hence able to model complex facts about the data

Phone A
Phone B
64
Gaussians acoustic modeling
  • Summary: each phone is represented by a GMM
    parameterized by
  • M mixture weights
  • M mean vectors
  • M covariance matrices
  • Usually assume covariance matrix is diagonal
  • I.e. just keep separate variance for each
    cepstral feature
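Scoring one frame against a diagonal-covariance GMM can be sketched as below. The parameter values are toy numbers chosen for illustration, and the log-sum-exp step is a standard numerical precaution rather than something the slides prescribe.

```python
import math


def diag_gaussian_logpdf(o, mean, var):
    """Log density of a Gaussian with a diagonal covariance matrix."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
               for x, m, v in zip(o, mean, var))


def gmm_log_likelihood(o, weights, means, variances):
    """log p(o|q) = log sum_m w_m * N(o; mean_m, var_m), computed stably."""
    terms = [math.log(w) + diag_gaussian_logpdf(o, m, v)
             for w, m, v in zip(weights, means, variances)]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))


# Toy one-dimensional phone model with two modes (M = 2).
weights = [0.5, 0.5]
means = [[-3.0], [3.0]]
variances = [[1.0], [1.0]]
for point in ([-3.0], [0.0], [3.0]):
    print(point, gmm_log_likelihood(point, weights, means, variances))
```

The score is high near either mode and low in between, a shape no single Gaussian can produce; that is what the mixture buys.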

65
ASR Lexicon: Markov Models for pronunciation
66
The Hidden Markov model
67
Formal definition of HMM
  • States: a set of states Q = q1, q2, ..., qN
  • Transition probabilities: a set of probabilities
    A = a01, a02, ..., an1, ..., ann.
  • Each aij represents P(j|i)
  • Observation likelihoods: a set of likelihoods
    B = bi(ot), the probability that state i generated
    observation ot
  • Special non-emitting initial and final states
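To show how the pieces A and B are used together, here is a minimal sketch of the forward computation on a toy HMM. Several simplifications are assumed: observations are discrete symbols rather than 39-dimensional vectors, the non-emitting initial state is folded into a start distribution, there is no final state, and all probabilities are made-up numbers.

```python
def forward(observations, states, start_p, trans_p, emit_p):
    """Total probability of the observation sequence, summed over all
    state paths (forward algorithm with discrete emissions)."""
    alpha = {s: start_p[s] * emit_p[s][observations[0]] for s in states}
    for obs in observations[1:]:
        alpha = {
            j: sum(alpha[i] * trans_p[i][j] for i in states) * emit_p[j][obs]
            for j in states
        }
    return sum(alpha.values())


# Toy two-state left-to-right model; all numbers are hypothetical.
STATES = ["a", "b"]
START = {"a": 1.0, "b": 0.0}                       # plays the role of a0j
TRANS = {"a": {"a": 0.5, "b": 0.5},                # aij = P(j|i)
         "b": {"a": 0.0, "b": 1.0}}
EMIT = {"a": {"x": 1.0, "y": 0.0},                 # bi(ot)
        "b": {"x": 0.0, "y": 1.0}}

print(forward(["x", "y"], STATES, START, TRANS, EMIT))
print(forward(["y", "x"], STATES, START, TRANS, EMIT))
```

In a recognizer the emission table would be replaced by the acoustic model's likelihood for each frame, while the recursion over states stays the same.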

68
Pieces of the HMM
  • Observation likelihoods (b), p(o|q), represents
    the acoustics of each phone, and are computed by
    the gaussians (Acoustic Model, or AM)
  • Transition probabilities represent the
    probability of different pronunciations
    (different sequences of phones)
  • States correspond to phones

69
Pieces of the HMM
  • Actually, I lied when I said states correspond to
    phones
  • Actually states usually correspond to triphones
  • CHEESE (phones): ch iy z
  • CHEESE (triphones): #-ch+iy, ch-iy+z, iy-z+#
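Expanding a phone string into triphones is mechanical. The sketch below assumes the common left-phone+right naming (left context, hyphen, phone, plus sign, right context) with "#" marking the word edge; the function name is hypothetical.

```python
def to_triphones(phones, boundary="#"):
    """Name each phone together with its left and right neighbours."""
    padded = [boundary] + list(phones) + [boundary]
    return [f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
            for i in range(1, len(padded) - 1)]


print(to_triphones(["ch", "iy", "z"]))   # CHEESE
```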

70
Pieces of the HMM
  • Actually, I lied again when I said states
    correspond to triphones
  • In fact, each triphone has 3 states for
    beginning, middle, and end of the triphone.

71
A real HMM
72
Cross-word triphones
  • Word-Internal Context-Dependent Models
  • OUR LIST
  • SIL AA+R AA-R L+IH L-IH+S IH-S+T S-T
  • Cross-Word Context-Dependent Models
  • OUR LIST
  • SIL-AA+R AA-R+L R-L+IH L-IH+S IH-S+T S-T+SIL
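The difference between the two schemes is only where the context is allowed to come from. The sketch below assumes left-phone+right naming and keeps silence (SIL) context-independent in both modes, which is an illustrative choice; the function name is hypothetical.

```python
def context_names(words, cross_word=False, silence="SIL"):
    """Name each phone with its context, taken from inside the word only
    or from neighbouring words as well."""
    flat = [p for w in words for p in w]
    names = []
    pos = 0                                   # index of the phone in flat
    for w in words:
        for i, p in enumerate(w):
            if p == silence:
                names.append(p)               # silence stays context-free
            else:
                if cross_word:
                    left = flat[pos - 1] if pos > 0 else None
                    right = flat[pos + 1] if pos + 1 < len(flat) else None
                else:
                    left = w[i - 1] if i > 0 else None
                    right = w[i + 1] if i + 1 < len(w) else None
                name = p
                if left is not None:
                    name = left + "-" + name
                if right is not None:
                    name = name + "+" + right
                names.append(name)
            pos += 1
    return names


OUR_LIST = [["SIL"], ["AA", "R"], ["L", "IH", "S", "T"], ["SIL"]]
print(" ".join(context_names(OUR_LIST)))
print(" ".join(context_names(OUR_LIST, cross_word=True)))
```

Word-internal models need only biphones at word edges, while cross-word models multiply the number of distinct units because every word-final phone can be followed by any word-initial phone.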

73
Summary
  • ASR Architecture
  • The Noisy Channel Model
  • Five easy pieces of an ASR system
  • Feature Extraction
  • 39 MFCC features
  • Acoustic Model
  • Gaussians for computing p(o|q)
  • Lexicon/Pronunciation Model
  • HMM
  • Next time: Decoding, how to combine these to
    compute words from speech!

74
Perceptual properties
  • Pitch: perceptual correlate of frequency
  • Loudness: perceptual correlate of power, which is
    related to the square of amplitude
