Title: Speech Processing: Text to Speech Synthesis
1 Speech Processing: Text to Speech Synthesis
2 CS 224S / LINGUIST 236: Speech Recognition and Synthesis
Lecture 3: TTS Overview, History, and Letter-to-Sound
IP Notice: much of the info, text, and diagrams on these slides comes (thanks!) from Alan Black's excellent lecture notes and from Richard Sproat's great new slides.
3 Text-To-Speech
- Text-To-Speech: defined as the automatic production of speech, through a grapheme-to-phoneme transcription of the sentences to utter.
- Text Analyzer
- Normalization
- Morphological Analyzer
- Contextual Analyzer
- Syntactic-Prosodic Parser
- Grapheme-to-phoneme rules
- Prosody Generator
5 Text Normalization
- Analysis of raw text into pronounceable words
- Sample problems
- The robbers stole Rs 100 lakhs from the bank
- It's 13/4 Modern Ave.
- The home page is http://www.facweb.iitkgp.ernet.in/sudeshna/
- Yes, see you the following Tues, that's 23/08/05
- Steps (see the sketch after this list):
- Identify tokens in text
- Chunk tokens into reasonably sized sections
- Map tokens to words
- Identify types for words
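A minimal sketch of the token-to-word mapping step in Python; the rules, categories, and helper below are illustrative assumptions for this lecture, not Festival's actual normalizer:

    import re

    # Toy token-to-word mapper; rules and categories are illustrative only.
    def normalize_token(tok):
        if re.fullmatch(r"\d{2}/\d{2}/\d{2}", tok):    # date like 23/08/05
            day, month, year = tok.split("/")
            # crude; a real system produces ordinals and month names
            return f"the {int(day)} of the {int(month)}, {year}"
        if tok in ("Rs", "Rs."):                       # currency marker
            return "rupees"
        if tok.startswith("http://"):                  # URL: read out the host
            host = tok[len("http://"):].split("/")[0]
            return " dot ".join(host.split("."))
        if tok.isdigit():                              # digit-by-digit stub
            digits = "zero one two three four five six seven eight nine".split()
            return " ".join(digits[int(d)] for d in tok)
        return tok.lower()

    print(" ".join(normalize_token(t) for t in "The robbers stole Rs 100".split()))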
6 Grapheme to Phoneme
- How to pronounce a word? Look it up in a dictionary! But:
- Unknown words and names will be missing
- Turkish, German, and other hard languages
- uygarlaStIramadIklarImIzdanmISsInIzcasIna
- "(behaving) as if you are among those whom we could not civilize"
- uygar +laS +tIr +ama +dIk +lar +ImIz +dan +mIS +sInIz +casIna
- civilized +bec +caus +NegAble +ppart +pl +p1pl +abl +past +2pl +AsIf
- So we need Letter-to-Sound rules
- Also homograph disambiguation (wind, live, read)
7 Grapheme to Phoneme in Indian languages
- Hindi: does not need a dictionary. Letter-to-sound rules can capture the pronunciation of most words.
- Bengali: harder than Hindi, but can mostly be handled using rules, plus a list of exceptions.
8 Prosody: from words+phones to boundaries, accent, F0, duration
- The term prosody refers to certain properties of the speech signal related to audible changes in pitch, loudness, and syllable length.
- Prosodic phrasing
- Need to break utterances into phrases
- Punctuation is useful, but not sufficient
- Accents
- Prediction of accents: which syllables should be accented?
- Realization of the F0 contour: given accents/tones, generate an F0 contour
- Duration
- Predicting the duration of each phone
9 Waveform synthesis: from segments, F0, duration to waveform
- Collecting diphones:
- need to record diphones in correct contexts
- l sounds different in onset than coda, t is flapped sometimes, etc.
- need a quiet recording room
- then need to label them very, very exactly
- Unit selection: how to pick the right unit? Search
- Joining the units:
- dumb (just stick 'em together)
- PSOLA (Pitch-Synchronous Overlap and Add)
- MBROLA (Multi-band overlap and add)
10 Festival
- Open-source speech synthesis system
- Designed for development and runtime use
- Used in many commercial and academic systems
- Distributed with RedHat 9.x
- Hundreds of thousands of users
- Multilingual
- No built-in language
- Designed to allow addition of new languages
- Additional tools for rapid voice development
- Statistical learning tools
- Scripts for building models
Text from Richard Sproat
11 Festival as software
- http://festvox.org/festival/
- General system for multilingual TTS
- C/C++ code with Scheme scripting language
- General replaceable modules:
- Lexicons, LTS, duration, intonation, phrasing, POS tagging, tokenizing, diphone/unit selection, signal processing
- General tools:
- Intonation analysis (F0, Tilt), signal processing, CART building, N-gram, SCFG, WFST
Text from Richard Sproat
12 Festival as software
- http://festvox.org/festival/
- No fixed theories
- New languages without new C code
- Multiplatform (Unix/Windows)
- Full sources in distribution
- Free software
Text from Richard Sproat
13 CMU FestVox project
- Festival is an engine; how do you make voices?
- FestVox: building synthetic voices
- Tools, scripts, documentation
- Discussion and examples for building voices
- Example voice databases
- Step by step walkthroughs of processes
- Support for English and other languages
- Support for different waveform synthesis methods
- Diphone
- Unit selection
- Limited domain
Text from Richard Sproat
14 Synthesis tools
- I want my computer to talk
- Festival Speech Synthesis
- I want my computer to talk in my voice
- FestVox Project
- I want it to be fast and efficient
- Flite
Text from Richard Sproat
15 Using Festival
- How to get Festival to talk
- Scheme (Festival's scripting language)
- Basic Festival commands
Text from Richard Sproat
16 Getting it to talk
- Say a file:
- festival --tts file.txt
- From Emacs:
- say region, say buffer
- Command-line interpreter:
- festival> (SayText "hello")
Text from Richard Sproat
17 Quick Intro to Scheme
- Scheme is a dialect of LISP
- Expressions are:
- atoms, or
- lists
- a bcd "hello world" 12.3
- (a b c)
- (a (1 2) seven)
- The interpreter evaluates expressions:
- Atoms evaluate as variables
- Lists evaluate as function calls
- bxx
- 3.14
- (+ 2 3)
Text from Richard Sproat
18 Quick Intro to Scheme
- Setting variables:
- (set! a 3.14)
- Defining functions:
- (define (timestwo n) (* 2 n))
- (timestwo a)
- 6.28
Text from Richard Sproat
19 Lists in Scheme
- festival> (set! alist '(apples pears bananas))
- (apples pears bananas)
- festival> (car alist)
- apples
- festival> (cdr alist)
- (pears bananas)
- festival> (set! blist (cons 'oranges alist))
- (oranges apples pears bananas)
- festival> append alist blist
- <SUBR(6) append>
- (apples pears bananas)
- (oranges apples pears bananas)
- festival> (append alist blist)
- (apples pears bananas oranges apples pears bananas)
- festival> (length alist)
- 3
- festival> (length (append alist blist))
- 7
Text from Richard Sproat
20 Scheme speech
- Make an utterance of type text:
- festival> (set! utt1 (Utterance Text "hello"))
- <Utterance 0xf6855718>
- Synthesize an utterance:
- festival> (utt.synth utt1)
- <Utterance 0xf6855718>
- Play the waveform:
- festival> (utt.play utt1)
- <Utterance 0xf6855718>
- Do it all together:
- festival> (SayText "This is an example")
- <Utterance 0xf6961618>
Text from Richard Sproat
21 Scheme speech
- In a file:
- (define (SpeechPlus a b)
-   (SayText
-     (format nil
-       "%d plus %d equals %d"
-       a b (+ a b))))
- Loading files:
- festival> (load "file.scm")
- t
- Do it all together:
- festival> (SpeechPlus 2 4)
- <Utterance 0xf6961618>
Text from Richard Sproat
22 Scheme speech
- (define (sp_time hour minute)
-   (cond
-     ((< hour 12)
-      (SayText
-       (format nil
-        "It is %d %d in the morning"
-        hour minute)))
-     ((< hour 18)
-      (SayText
-       (format nil
-        "It is %d %d in the afternoon"
-        (- hour 12) minute)))
-     (t
-      (SayText
-       (format nil
-        "It is %d %d in the evening"
-        (- hour 12) minute)))))
Text from Richard Sproat
23 Getting help
- Online manual:
- http://festvox.org/docs/manual-1.4.3
- Alt-h (or Esc-h) on the current symbol: short help
- Alt-s (or Esc-s): speak the help
- Alt-m: go to the man page
- Use TAB key for completion
24 Lexicons and Lexical Entries
- You can explicitly give pronunciations for words
- Each language/dialect has its own lexicon
- You can look up words with:
- (lex.lookup WORD)
- You can add entries to the current lexicon:
- (lex.add.entry NEWENTRY)
- Entry: (WORD POS (SYL0 SYL1 ...))
- Syllable: ((PHONE0 PHONE1 ...) STRESS)
- Example:
- ("cepstra" n (((k eh p) 1) ((s t r aa) 0)))
25 Converting from words to phones
- Two methods:
- Dictionary-based
- Rule-based (Letter-to-Sound, LTS)
- Early systems: all LTS
- MITalk was radical in having a huge 10K-word dictionary
- Now systems use a combination
- CMU dictionary: 127K words
- http://www.speech.cs.cmu.edu/cgi-bin/cmudict
26 Dictionaries aren't always sufficient
- Unknown words
- Seem to grow linearly with the number of words in unseen text
- Mostly person, company, and product names
- But also foreign words, etc.
- So commercial systems have a 3-part system:
- Big dictionary
- Special code for handling names
- Machine-learned LTS system for other unknown words
27 Letter-to-Sound Rules
- Festival LTS rules:
- ( LEFTCONTEXT [ ITEMS ] RIGHTCONTEXT = NEWITEMS )
- Example:
- ( # [ c h ] C = k )
- ( # [ c h ] = ch )
- # denotes beginning of word
- C means all consonants
- Rules apply in order (a toy interpreter is sketched below)
- "christmas" is pronounced with k
- But word-initial "ch" followed by a non-consonant is pronounced ch
- E.g., "choice"
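A minimal Python sketch of applying ordered rewrite rules in this style; the matcher is an illustrative toy, not Festival's implementation:

    # Toy interpreter for ordered LTS rules; the two "ch" rules are from the
    # slide, everything else here is an illustrative assumption.
    CONSONANTS = set("bcdfghjklmnpqrstvwxz")

    def matches(side, text):
        """Check a context pattern: '#' = word edge, 'C' = any consonant."""
        if side == "#":
            return text == ""
        if side == "C":
            return text[:1] in CONSONANTS
        return text.startswith(side) if side else True

    # (left context, items, right context, output phones), tried in order.
    RULES = [
        ("#", "ch", "C", ["k"]),    # ( # [ c h ] C = k )   e.g. christmas
        ("#", "ch", "",  ["ch"]),   # ( # [ c h ] = ch )    e.g. choice
    ]

    def apply_rules(word, pos):
        for left, items, right, out in RULES:
            if word[pos:pos + len(items)] == items \
               and matches(left, word[:pos]) \
               and matches(right, word[pos + len(items):]):
                return out, pos + len(items)
        return [word[pos]], pos + 1        # default: letter maps to itself

    def lts(word):
        phones, i = [], 0
        while i < len(word):
            out, i = apply_rules(word, i)[0], apply_rules(word, i)[1]
            phones += out
        return phones

    print(lts("christmas"))   # starts with 'k'
    print(lts("choice"))      # starts with 'ch'

Because the rules are tried in order, the more specific consonant-context rule fires for "christmas" before the general word-initial rule catches "choice".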
28 Stress rules in LTS
- English: the famously evil one from Allen et al. (1987):
- V -> [1-stress] / X _ C* {Vshort C C? | V} {Vshort C* | V}
- Where X must contain all prefixes
- Assign 1-stress to the vowel in a syllable preceding a weak syllable followed by a morpheme-final syllable containing a short vowel and 0 or more consonants (e.g., "difficult")
- Assign 1-stress to the vowel in a syllable preceding a weak syllable followed by a morpheme-final vowel (e.g., "oregano")
- etc.
29 Modern method: Learning LTS rules automatically
- Induce LTS from a dictionary of the language
- Black et al. (1998)
- Applied to English, German, French
- Two steps: alignment and (CART-based) rule induction
30 Alignment
- Letters: c  h  e  c  k  e  d
- Phones:  ch _  eh _  k  _  t
- Black et al., Method 1:
- First scatter epsilons in all possible ways to make the letters and phones align
- Then collect stats for P(letter|phone) and select the best alignment to generate new stats
- Iterate a number of times until it settles (5-6 iterations)
- This is the EM (expectation maximization) algorithm (a toy version is sketched below)
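A toy version of this epsilon-scattering loop, under simplifying assumptions: phones align left-to-right to letters or to epsilon '_', one symbol per letter, hard (best-alignment) EM, and a made-up two-word lexicon:

    from itertools import combinations
    from collections import defaultdict
    from math import prod

    LEXICON = [("checked", ["ch", "eh", "k", "t"]),
               ("check",   ["ch", "eh", "k"])]

    def all_alignments(letters, phones):
        # Choose which letter positions get epsilon; keep phone order elsewhere.
        for gaps in combinations(range(len(letters)), len(letters) - len(phones)):
            it = iter(phones)
            yield [(l, "_" if i in set(gaps) else next(it))
                   for i, l in enumerate(letters)]

    prob = defaultdict(lambda: 1.0)            # start uniform
    for _ in range(5):                         # "iterate until it settles (5-6)"
        counts = defaultdict(float)
        for word, phones in LEXICON:
            best = max(all_alignments(list(word), phones),
                       key=lambda a: prod(prob[pair] + 1e-6 for pair in a))
            for pair in best:                  # collect stats from best alignment
                counts[pair] += 1
        total = sum(counts.values())
        prob = defaultdict(float, {k: v / total for k, v in counts.items()})

    print(sorted(prob.items(), key=lambda kv: -kv[1])[:5])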
31 Alignment
- Black et al., Method 2:
- Hand-specify which letters can be rendered as which phones
- C goes to k/ch/s/sh
- W goes to w/v/f, etc.
- Once the mapping table is created: find all valid alignments, compute P(letter|phone), score all alignments, take the best
32 Alignment
- Some alignments will turn out to be really bad.
- These are just the cases where the pronunciation doesn't match the letters:
- Dept: d ih p aa r t m ah n t
- CMU: s iy eh m y uw
- Lieutenant: l eh f t eh n ax n t (British)
- Also foreign words
- These can just be removed from alignment training
33 Building CART trees
- Build a CART tree for each letter in the alphabet (26 plus accented), using a context of +/-3 letters (toy sketch below)
- # # # c h e c -> ch
- c h e c k e d -> _
- This produces 92-96% correct letter accuracy (58-75% word accuracy) for English
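A sketch of the idea using scikit-learn's decision trees in place of Festival's CART tools; the one-word training set and the crude integer encoding of letters are illustrative assumptions:

    from sklearn.tree import DecisionTreeClassifier

    # Aligned training data: letters of "checked" vs. phones (with epsilons).
    ALIGNED = [("checked", ["ch", "_", "eh", "_", "k", "_", "t"])]

    def window(word, i, k=3):
        """+/-k letter context around position i, '#'-padded at word edges."""
        padded = "#" * k + word + "#" * k
        return [ord(c) for c in padded[i:i + 2 * k + 1]]   # crude encoding

    X, y = [], []
    for word, phones in ALIGNED:
        for i, phone in enumerate(phones):
            if word[i] == "c":          # one tree per letter; here, letter 'c'
                X.append(window(word, i))
                y.append(phone)

    tree = DecisionTreeClassifier().fit(X, y)
    print(tree.predict([window("check", 0)]))   # expect 'ch' for word-initial c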
34 Improvements
- Take names out of the training data
- And acronyms
- Detect both of these separately
- And build special-purpose tools to do LTS for names and acronyms
- Names:
- Can do morphology (Walters -> Walter, Lucasville)
- Can write stress-shifting rules (Jordan -> Jordanian)
- Rhyme analogy: Plotsky by analogy with Trotsky (replace tr with pl)
- Liberman and Church: for the 250K most common names, got 212K (85%) from these modified-dictionary methods, and used LTS for the rest.
35 Speech Recognition
36 Speech Recognition
- Applications of Speech Recognition (ASR)
- Dictation
- Telephone-based information (directions, air travel, banking, etc.)
- Hands-free (in car)
- Speaker Identification
- Language Identification
- Second language ('L2') (accent reduction)
- Audio archive searching
37 LVCSR
- Large Vocabulary Continuous Speech Recognition
- 20,000-64,000 words
- Speaker-independent (vs. speaker-dependent)
- Continuous speech (vs. isolated-word)
38 LVCSR Design Intuition
- Build a statistical model of the speech-to-words process
- Collect lots and lots of speech, and transcribe all the words
- Train the model on the labeled speech
- Paradigm: Supervised Machine Learning + Search
39 Speech Recognition Architecture
40 The Noisy Channel Model
- Search through the space of all possible sentences.
- Pick the one that is most probable given the waveform.
41 The Noisy Channel Model (II)
- What is the most likely sentence out of all sentences in the language L, given some acoustic input O?
- Treat acoustic input O as a sequence of individual observations:
- O = o1, o2, o3, ..., ot
- Define a sentence as a sequence of words:
- W = w1, w2, w3, ..., wn
42 Noisy Channel Model (III)
- Probabilistic implication: pick the highest-probability S (see the equations below)
- We can use Bayes' rule to rewrite this
- Since the denominator is the same for each candidate sentence W, we can ignore it for the argmax
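The equation this slide builds up, reconstructed in LaTeX:

    \hat{W} = \operatorname*{argmax}_{W \in L} P(W \mid O)
            = \operatorname*{argmax}_{W \in L} \frac{P(O \mid W)\,P(W)}{P(O)}
            = \operatorname*{argmax}_{W \in L} P(O \mid W)\,P(W)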
43 A quick derivation of Bayes' Rule
- Conditionals
- Rearranging
- And also
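The three relations this slide names, reconstructed in LaTeX:

    \text{Conditionals:}\quad P(A \mid B) = \frac{P(A \wedge B)}{P(B)}
    \text{Rearranging:}\quad P(A \wedge B) = P(A \mid B)\,P(B)
    \text{And also:}\quad P(A \wedge B) = P(B \mid A)\,P(A)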
44 Bayes (II)
- We know
- So rearranging things
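Equating the two expressions for the joint probability and rearranging gives Bayes' rule, in LaTeX:

    P(A \mid B)\,P(B) = P(B \mid A)\,P(A)
    \quad\Longrightarrow\quad
    P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}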
45 Noisy channel model
- likelihood: P(O|W)
- prior: P(W)
46 The noisy channel model
- Ignoring the denominator leaves us with two factors: P(Source) and P(Signal|Source)
47 Speech Architecture meets Noisy Channel
48 Five easy pieces
- Feature extraction
- Acoustic Modeling
- HMMs, Lexicons, and Pronunciation
- Decoding
- Language Modeling
49 Feature Extraction
- Digitize Speech
- Extract Frames
50 Digitizing Speech
51 Digitizing Speech (A-D)
- Sampling:
- measuring the amplitude of the signal at time t
- 16,000 Hz (samples/sec): microphone (wideband)
- 8,000 Hz (samples/sec): telephone
- Why?
- Need at least 2 samples per cycle
- max measurable frequency is half the sampling rate
- Human speech < 10,000 Hz, so we need at most 20K samples/sec
- Telephone speech is filtered at 4K, so 8K is enough
52 Digitizing Speech (II)
- Quantization:
- Representing the real-valued amplitude as an integer
- 8-bit (-128 to 127) or 16-bit (-32768 to 32767)
- Formats:
- 16-bit PCM
- 8-bit mu-law (log compression; sketched below)
- LSB (Intel) vs. MSB (Sun, Apple)
- Headers:
- Raw (no header)
- Microsoft .wav
- Sun .au (40-byte header)
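A sketch of 8-bit mu-law companding of 16-bit PCM samples; the formula is the standard mu-law curve (mu = 255), while the rounding details here are simplified assumptions:

    import numpy as np

    MU = 255.0

    def mulaw_encode(pcm16):
        x = pcm16.astype(np.float64) / 32768.0               # normalize to [-1, 1)
        y = np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)
        return np.round((y + 1) / 2 * 255).astype(np.uint8)  # map to 0..255

    samples = np.array([0, 1000, -1000, 32767], dtype=np.int16)
    print(mulaw_encode(samples))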
53 Frame Extraction
- A frame (25 ms wide) is extracted every 10 ms (sketched in code below)
(Figure from Simon Arnfield: overlapping 25 ms frames a1, a2, a3, ..., each shifted by 10 ms.)
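A minimal framing sketch in Python, assuming a 16 kHz sample rate:

    import numpy as np

    def extract_frames(signal, sr=16000, frame_ms=25, shift_ms=10):
        frame_len, shift = sr * frame_ms // 1000, sr * shift_ms // 1000
        n_frames = 1 + max(0, (len(signal) - frame_len) // shift)
        return np.stack([signal[i * shift: i * shift + frame_len]
                         for i in range(n_frames)])

    audio = np.random.randn(16000)            # 1 second of fake audio
    print(extract_frames(audio).shape)        # (98, 400): 400-sample frames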
54 MFCC (Mel Frequency Cepstral Coefficients)
- Do an FFT to get spectral information
- Like the spectrogram/spectrum we saw earlier
- Apply Mel scaling
- Linear below 1 kHz, log above; equal samples above and below 1 kHz
- Models the human ear: more sensitivity at lower frequencies
- Plus a Discrete Cosine Transform (the whole pipeline is sketched below)
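A compact sketch of the FFT -> mel filterbank -> log -> DCT pipeline for one frame; the filterbank construction and all constants are illustrative assumptions, not a reference implementation:

    import numpy as np
    from scipy.fftpack import dct

    def mel(f):   return 2595 * np.log10(1 + f / 700.0)
    def imel(m):  return 700 * (10 ** (m / 2595.0) - 1)

    def mfcc_frame(frame, sr=16000, n_filters=26, n_ceps=12):
        spectrum = np.abs(np.fft.rfft(frame)) ** 2          # power spectrum
        freqs = np.fft.rfftfreq(len(frame), 1.0 / sr)
        # Triangular filters centered at points equally spaced on the mel scale.
        centers = imel(np.linspace(mel(0), mel(sr / 2), n_filters + 2))
        energies = np.zeros(n_filters)
        for i in range(n_filters):
            lo, mid, hi = centers[i], centers[i + 1], centers[i + 2]
            up = np.clip((freqs - lo) / (mid - lo), 0, None)
            down = np.clip((hi - freqs) / (hi - mid), 0, None)
            energies[i] = np.sum(spectrum * np.minimum(up, down))
        return dct(np.log(energies + 1e-10), norm='ortho')[:n_ceps]

    print(mfcc_frame(np.random.randn(400)).shape)           # (12,)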
55 Final Feature Vector
- 39 features per 10 ms frame:
- 12 MFCC features
- 12 delta MFCC features
- 12 delta-delta MFCC features
- 1 (log) frame energy
- 1 delta (log) frame energy
- 1 delta-delta (log) frame energy
- So each frame is represented by a 39-dimensional vector (see the sketch below)
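A sketch of assembling the 39-dimensional vector from 13 base features per frame (12 MFCCs + log energy); the simple numerical-gradient delta is an illustrative stand-in for the regression windows real systems use:

    import numpy as np

    def add_deltas(base):                      # base: (n_frames, 13)
        delta = np.gradient(base, axis=0)      # approximate d/dt per feature
        delta2 = np.gradient(delta, axis=0)
        return np.hstack([base, delta, delta2])  # (n_frames, 39)

    feats = np.random.randn(98, 13)
    print(add_deltas(feats).shape)             # (98, 39)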
56 Where we are
- Given: a sequence of acoustic feature vectors, one every 10 ms
- Goal: output a string of words
- We'll spend 6 lectures on how to do this
- Rest of today:
- Markov Models
- Hidden Markov Models in the abstract
- Forward Algorithm
- Viterbi Algorithm
- Start of HMMs for speech
57 Acoustic Modeling
- Given a 39-dimensional vector corresponding to the observation of one frame, oi
- And given a phone q we want to detect
- Compute p(oi|q)
- Most popular method:
- GMM (Gaussian mixture models)
- Other methods:
- MLP (multi-layer perceptron)
58 Acoustic Modeling: an MLP computes p(q|o)
59 Gaussian Mixture Models
- Also called fully-continuous HMMs
- P(o|q) computed by a Gaussian
60 Gaussians for Acoustic Modeling
- A Gaussian is parameterized by a mean and a variance
(Figure: Gaussians P(o|q) with different means; P(o|q) is highest at the mean and low far from it.)
61 Training Gaussians
- A (single) Gaussian is characterized by a mean and a variance
- Imagine that we had some training data in which each phone was labeled
- We could just compute the mean and variance from the data (see the sketch below)
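A toy sketch of exactly that: fit a mean and variance per labeled phone, then score p(o|q); the data is made up and one-dimensional for brevity:

    import numpy as np

    frames = np.array([1.0, 1.2, 0.9, 3.1, 2.9, 3.0])
    labels = np.array(["aa", "aa", "aa", "iy", "iy", "iy"])

    models = {q: (frames[labels == q].mean(), frames[labels == q].var())
              for q in set(labels)}

    def likelihood(o, q):
        mu, var = models[q]
        return np.exp(-(o - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

    print(likelihood(1.05, "aa"), likelihood(1.05, "iy"))   # 'aa' should win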
62 But we need 39 Gaussians, not 1!
- The observation o is really a vector of length 39
- So we need a vector of Gaussians
63 Actually, a mixture of Gaussians
- Each phone is modeled by a sum of different Gaussians
- Hence we are able to model complex facts about the data
(Figure: mixture densities for Phone A and Phone B.)
64 Gaussians acoustic modeling
- Summary: each phone is represented by a GMM parameterized by:
- M mixture weights
- M mean vectors
- M covariance matrices
- Usually we assume the covariance matrix is diagonal
- I.e., we just keep a separate variance for each cepstral feature (sketch below)
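A sketch using scikit-learn's GaussianMixture as the off-the-shelf GMM; M = 4 components and the random "frames" are illustrative assumptions:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    frames_for_phone = np.random.randn(500, 39)      # fake 39-dim MFCC frames

    gmm = GaussianMixture(n_components=4, covariance_type="diag")
    gmm.fit(frames_for_phone)    # learns 4 weights, means, diagonal variances

    o = np.random.randn(1, 39)
    print(gmm.score_samples(o))  # log p(o | this phone's GMM)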
65 ASR Lexicon: Markov Models for pronunciation
66 The Hidden Markov Model
67 Formal definition of HMM
- States: a set of states Q = q1, q2, ..., qN
- Transition probabilities: a set of probabilities A = a01, a02, ..., an1, ..., ann
- Each aij represents P(j|i)
- Observation likelihoods: a set of likelihoods B = bi(ot), the probability that state i generated observation t
- Special non-emitting initial and final states
- (A forward-algorithm sketch over this parameterization follows below.)
68 Pieces of the HMM
- Observation likelihoods (b), p(o|q), represent the acoustics of each phone, and are computed by the Gaussians (Acoustic Model, or AM)
- Transition probabilities represent the probability of different pronunciations (different sequences of phones)
- States correspond to phones
69 Pieces of the HMM
- Actually, I lied when I said states correspond to phones
- States usually correspond to triphones
- CHEESE (phones): ch iy z
- CHEESE (triphones): #-ch+iy, ch-iy+z, iy-z+#
70 Pieces of the HMM
- Actually, I lied again when I said states correspond to triphones
- In fact, each triphone has 3 states, for the beginning, middle, and end of the triphone
71 A real HMM
72 Cross-word triphones
- Word-Internal Context-Dependent Models
- OUR LIST:
- SIL AA+R AA-R L+IH L-IH+S IH-S+T S-T
- Cross-Word Context-Dependent Models
- OUR LIST:
- SIL-AA+R AA-R+L R-L+IH L-IH+S IH-S+T S-T+SIL
- (A toy triphone-expansion sketch follows below.)
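A toy expansion of phone strings into the two triphone inventories above, using the l-p+r naming from the slide; the code itself is just an illustration:

    def word_internal(word_phones):
        tris = []
        for i, p in enumerate(word_phones):
            left = word_phones[i - 1] + "-" if i > 0 else ""
            right = "+" + word_phones[i + 1] if i < len(word_phones) - 1 else ""
            tris.append(left + p + right)
        return tris

    def cross_word(words):
        flat = ["SIL"] + [p for w in words for p in w] + ["SIL"]
        return [flat[i - 1] + "-" + flat[i] + "+" + flat[i + 1]
                for i in range(1, len(flat) - 1)]

    OUR, LIST = ["AA", "R"], ["L", "IH", "S", "T"]
    print(word_internal(OUR) + word_internal(LIST))
    # ['AA+R', 'AA-R', 'L+IH', 'L-IH+S', 'IH-S+T', 'S-T'] (plus a separate SIL)
    print(cross_word([OUR, LIST]))
    # ['SIL-AA+R', 'AA-R+L', 'R-L+IH', 'L-IH+S', 'IH-S+T', 'S-T+SIL']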
73 Summary
- ASR Architecture
- The Noisy Channel Model
- Five easy pieces of an ASR system
- Feature Extraction:
- 39 MFCC features
- Acoustic Model:
- Gaussians for computing p(o|q)
- Lexicon/Pronunciation Model:
- HMM
- Next time: Decoding, i.e. how to combine these to compute words from speech!
74 Perceptual properties
- Pitch: the perceptual correlate of frequency
- Loudness: the perceptual correlate of power, which is related to the square of amplitude