Title: HMM-based speech synthesis: the new generation of artificial voices
1. HMM-based speech synthesis: the new generation of artificial voices
- Thomas Drugman
- thomas.drugman_at_umons.ac.be
2. TCTS Lab
Laboratoire de Théorie des Circuits et de Traitement du Signal (Circuit Theory and Signal Processing Laboratory)
25 people: 3 professors, 10 PhD students
TCTS Lab:
- Image & Video
- Numerical Arts
- Audio & Speech
3. Content
- Speech synthesis history
- HMM-based speech synthesis
- Parametric modeling of speech
- Statistical generation
- Conclusions
5. Speech Synthesis
Text-to-speech system
Hello
GOAL: read aloud an unknown text typed by the user
6. Challenges
- Naturalness
- Intelligibility
- Cost-effectiveness
- Expressivity
7. Challenge 3: Cost-effectiveness
- Industry expects intelligibility and naturalness, plus:
- Small footprint (a few MB)
- Small CPU requirements (embedded market)
- Easy extension to other languages
- Possibility to create new voices as fast as possible
- Through an automatic recording/segmentation process
- Through efficient voice conversion
- Possibility to bootstrap an existing TTS voice into any voice
8. Challenge 4 (new): Expressivity
- Emotional speech synthesis (?art!)
- Being able to render an expressive voice
- In terms of prosody
- In terms of voice quality
- Knowing when to do it (yet unsolved)
- Today's holy grail for the industry
- Strategic advantage for whoever gets it first
- New markets (e-books?)
9. Methods for Speech Synthesis
- Expert-based (rule-based) approach
- Corpus-based approach
- Diphone concatenation
- Unit Selection
- Statistical parametric synthesis (HMM-based synthesis)
10. Von Kempelen's talking machine (1791)
11. Homer Dudley's Voder (Bell Labs, 1936)
12. And other developments in articulatory synthesis
- Work by
- K. Stevens, G. Fant, P. Mermelstein, R. Carré (GNUSpeech), S. Maeda, J. Schroeter & M. Sondhi
- More recently
- O. Engwall, S. Fels (ArtiSynth), Birkholz and Kröger, A. Alwan & S. Narayanan (MRI)
13. Rule-based synthesis
Intelligibility? Naturalness?
Mem/CPU/Voices? Expressivity ?
15. Diphone concatenation
Intelligibility? Naturalness Mem/CPU/Voices?
Expressivity ?
16. Unit selection
Intelligibility? Naturalness ?
Mem/CPU/Voices Expressivity
18. Statistical Parametric Speech Synthesis
TRAINING: Database -> Speech Analysis -> Speech Parameters -> Statistical Modeling -> SPS Synthesizer
SYNTHESIS: Text ("Hello!") -> Statistical Generation -> Speech Parameters -> Speech Processing -> Speech ("Hello!")
19. HMM-based speech synthesis
http://hts.sp.nitech.ac.jp/
Intelligibility? Naturalness ?? Mem/CPU/Voices
? Expressivity ??
20. TRAINING OF THE HMM-BASED SYNTHESIZER
21. Parameter extraction
22. Parameter extraction
[Diagram: source-filter model. A pulse train (voiced) or white noise (unvoiced) excites a filter to produce synthetic speech]
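The source-filter diagram above can be sketched in a few lines. This is only a minimal illustration, assuming a 16 kHz sampling rate and a hand-picked two-coefficient all-pole filter. The function names and the filter coefficients are hypothetical and are not taken from the slides or from any real vocoder.

```python
import random

def pulse_train(n_samples, f0, fs):
    """Voiced excitation: one unit pulse every fs/f0 samples."""
    period = int(round(fs / f0))
    return [1.0 if i % period == 0 else 0.0 for i in range(n_samples)]

def white_noise(n_samples, seed=0):
    """Unvoiced excitation: zero-mean Gaussian noise."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n_samples)]

def all_pole_filter(excitation, a):
    """y[n] = x[n] - sum_k a[k] * y[n-1-k], where a holds a_1..a_p."""
    y = []
    for n, x in enumerate(excitation):
        acc = x
        for k, ak in enumerate(a):
            if n - 1 - k >= 0:
                acc -= ak * y[n - 1 - k]
        y.append(acc)
    return y

fs = 16000
a = [-1.3, 0.8]  # illustrative stable resonator, not estimated from speech
voiced = all_pole_filter(pulse_train(400, f0=100.0, fs=fs), a)
unvoiced = all_pole_filter(white_noise(400), a)
```

In a real system the filter coefficients would come from the spectral parameters extracted for each frame, and the voiced/unvoiced switch would follow the extracted F0.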
23. Labels
24. Labels
Labels consist of a description of the phonetic environment
- Contextual factors
- Phone identity
- Syntactic factors
- Stress-related factors
- Locational factors, ...
25. Labels
Example
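As a rough illustration of what a context-dependent label can carry, the sketch below builds one label per phone from its two left and two right neighbours, its position, and a stress flag. The label layout is loosely inspired by HTS-style full-context labels, but the exact field names (`POS`, `STRESS`) and the function `quinphone_labels` are assumptions made for this example only.

```python
def quinphone_labels(phones, stress, pad="x"):
    """Return one label per phone: LL^L-C+R=RR plus position and stress fields."""
    n = len(phones)
    padded = [pad, pad] + list(phones) + [pad, pad]
    labels = []
    for i in range(n):
        ll, l, c, r, rr = padded[i:i + 5]
        labels.append(f"{ll}^{l}-{c}+{r}={rr}/POS:{i + 1}_{n}/STRESS:{stress[i]}")
    return labels

# Hypothetical phone sequence and stress flags for the word "hello".
labels = quinphone_labels(["hh", "ax", "l", "ow"], stress=[0, 0, 1, 1])
for label in labels:
    print(label)
```

Real label sets add many more fields (syllable, word, and phrase level), which is what makes the number of distinct contexts so large.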
26. HMM training
27. System architecture
Contextual factors may affect duration, source, and filter differently
Context-oriented clustering using decision trees
28. System architecture
State Duration Model
HMM for Source and Filter
Decision tree for State Duration
Decision trees for Filter
Decision trees for Source
29. Training decision trees
An exhaustive list of possible questions is first drawn up.
Example:
QS "LL-Nasal" m,n,en,ng
QS "LL-Fricative" ch,dh,f,hh,hv,s,sh,th,v,z,zh
QS "LL-Liquid" el,hh,l,r,w,y
QS "LL-Front" ae,b,eh,em,f,ih,ix,iy,m,p,v,w
QS "LL-Central" ah,ao,axr,d,dh,dx,el,en,er,l,n,r,s,t,th,z,zh
QS "LL-Back" aa,ax,ch,g,hh,jh,k,ng,ow,sh,uh,uw,y
QS "LL-Front_Vowel" ae,eh,ey,ih,iy
QS "LL-Central_Vowel" aa,ah,ao,axr,er
QS "LL-Back_Vowel" ax,ow,uh,uw
QS "LL-Long_Vowel" ao,aw,el,em,en,en,iy,ow,uw
QS "LL-Short_Vowel" aa,ah,ax,ay,eh,ey,ih,ix,oy,uh
QS "LL-Dipthong_Vowel" aw,axr,ay,el,em,en,er,ey,oy
QS "LL-Front_Start_Vowel" aw,axr,er,ey
In total, about 1500 questions
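To make the role of these questions concrete, here is a small sketch that parses question lines of the simplified form shown on the slide and answers them for a given left-left phone. The helper names `parse_questions` and `answer` are hypothetical. Real HTS question files use a richer pattern syntax that this sketch does not attempt to reproduce.

```python
QUESTIONS = '''
QS "LL-Nasal" m,n,en,ng
QS "LL-Front_Vowel" ae,eh,ey,ih,iy
QS "LL-Back_Vowel" ax,ow,uh,uw
'''

def parse_questions(text):
    """Map each question name to the set of phones that answer 'yes'."""
    questions = {}
    for line in text.strip().splitlines():
        _, name, phones = line.split(None, 2)
        questions[name.strip('"')] = set(phones.split(","))
    return questions

def answer(questions, name, ll_phone):
    """True if the phone two positions to the left belongs to the question's set."""
    return ll_phone in questions[name]

qs = parse_questions(QUESTIONS)
print(answer(qs, "LL-Nasal", "ng"))        # is the left-left phone a nasal?
print(answer(qs, "LL-Front_Vowel", "ng"))  # is it a front vowel?
```

Each question is a yes/no test on the context, which is exactly the form a binary decision tree needs at its nodes.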
30. Training decision trees
Decision trees are trained using a Maximum Likelihood criterion
Example
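One way to picture the maximum-likelihood criterion is as a greedy choice: at each node, pick the question whose yes/no split most increases the log-likelihood of the data. The sketch below does this for a single node, under the assumption that each node is modeled by one scalar Gaussian. The toy samples, the question sets, and the function names are invented for illustration, and stopping criteria used in practice are left out.

```python
import math

def gaussian_loglik(values, var_floor=1e-4):
    """Log-likelihood of the values under their own ML-fitted Gaussian."""
    n = len(values)
    if n == 0:
        return 0.0
    mean = sum(values) / n
    var = max(sum((v - mean) ** 2 for v in values) / n, var_floor)
    return -0.5 * n * (math.log(2.0 * math.pi * var) + 1.0)

def best_question(samples, questions):
    """samples: list of (left_phone, value); questions: name -> phone set."""
    parent = gaussian_loglik([v for _, v in samples])
    best_name, best_gain = None, 0.0
    for name, phones in questions.items():
        yes = [v for p, v in samples if p in phones]
        no = [v for p, v in samples if p not in phones]
        if not yes or not no:
            continue  # a split must leave data on both sides
        gain = gaussian_loglik(yes) + gaussian_loglik(no) - parent
        if gain > best_gain:
            best_name, best_gain = name, gain
    return best_name, best_gain

# Toy data: a parameter value observed after different left phones.
questions = {"L-Nasal": {"m", "n", "ng"}, "L-Fricative": {"s", "z", "f"}}
samples = [("m", 1.0), ("n", 1.1), ("ng", 0.9), ("s", 3.0),
           ("z", 3.2), ("f", 2.9), ("aa", 3.1), ("iy", 3.0)]
name, gain = best_question(samples, questions)
print(name, round(gain, 2))
```

Here the nasal question wins because it separates the low values from the high ones. Growing a full tree repeats the same selection on each child node.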
31. Emission likelihood and training
Finally, each leaf is modeled by a Gaussian Mixture Model (GMM).
Training is guided by the Viterbi and Baum-Welch re-estimation algorithms.
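For reference, the emission log-likelihood of an observation under a leaf's GMM can be computed as below. This is a minimal sketch for scalar observations with strictly positive weights. The function name `gmm_loglik` is hypothetical, and real systems use multivariate vectors, typically with diagonal covariances.

```python
import math

def gmm_loglik(x, weights, means, variances):
    """Log-likelihood of scalar x under a 1-D Gaussian mixture (log-sum-exp)."""
    logs = [
        math.log(w) - 0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(logs)
    return top + math.log(sum(math.exp(l - top) for l in logs))

# A two-component mixture evaluated at a point near the first component.
ll = gmm_loglik(0.1, weights=[0.6, 0.4], means=[0.0, 2.0], variances=[0.5, 1.0])
print(ll)
```

These per-state likelihoods are the quantities that both Viterbi alignment and Baum-Welch re-estimation consume during training.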
32. SYNTHESIS BY THE HMM-BASED SYNTHESIZER
33. Text analysis
34. Parameter generation
35. Parameter generation
Given the sequence of labels, durations are determined by maximizing the state-sequence likelihood.
A trajectory through the context-dependent HMM states is then known.
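A common way to carry out this step, assuming each state duration is modeled by a Gaussian (an assumption not stated on the slide), is to take the mean duration of each state, optionally shifted in proportion to its variance so that the total matches a target length. The sketch below illustrates that rule. The function name and the numbers are made up for the example.

```python
def state_durations(means, variances, total_frames=None):
    """Most likely per-state durations under Gaussian duration models.

    Without a target length, each duration is its mean. With a target,
    d_k = mean_k + rho * var_k, where rho spreads the difference over the
    states in proportion to their variances.
    """
    if total_frames is None:
        rho = 0.0
    else:
        rho = (total_frames - sum(means)) / sum(variances)
    # Round to whole frames and keep at least one frame per state.
    return [max(1, round(m + rho * v)) for m, v in zip(means, variances)]

means = [3.0, 5.0, 4.0]
variances = [1.0, 4.0, 1.0]
print(state_durations(means, variances))                   # natural speaking rate
print(state_durations(means, variances, total_frames=18))  # slower speech
```

States with a larger duration variance absorb more of the stretching, which matches the intuition that stable sounds are lengthened more than transient ones. Because of the rounding, the durations may not sum exactly to the target in general.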
36. Parameter generation
Using this trajectory, source and filter parameters are generated by maximizing the output probability.
Dynamic features make the evolution of the parameters more realistic and smooth.
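The effect of dynamic features can be illustrated with a small maximum-likelihood trajectory solver. It is a sketch under several assumptions: a single scalar parameter, one Gaussian per frame for the static value and for its delta, a delta defined as 0.5 * (c[t+1] - c[t-1]) with indices clamped at the edges, and a dense pure-Python solver that is only suitable for short sequences. The names `mlpg` and `solve` are hypothetical.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= f * M[col][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][k] * x[k] for k in range(i + 1, n))) / M[i][i]
    return x

def mlpg(static_mean, static_var, delta_mean, delta_var):
    """Trajectory c maximizing the likelihood of [static, delta] observations."""
    T = len(static_mean)
    rows, mu, prec = [], [], []
    for t in range(T):
        r = [0.0] * T
        r[t] = 1.0                       # static row: picks c[t]
        rows.append(r); mu.append(static_mean[t]); prec.append(1.0 / static_var[t])
        r = [0.0] * T
        r[min(t + 1, T - 1)] += 0.5      # delta row: 0.5 * (c[t+1] - c[t-1])
        r[max(t - 1, 0)] -= 0.5
        rows.append(r); mu.append(delta_mean[t]); prec.append(1.0 / delta_var[t])
    # Normal equations (W' P W) c = W' P mu, with P the diagonal precision matrix.
    A = [[sum(p * r[i] * r[j] for r, p in zip(rows, prec)) for j in range(T)]
         for i in range(T)]
    b = [sum(p * r[i] * m for r, p, m in zip(rows, prec, mu)) for i in range(T)]
    return solve(A, b)

# Step-like state means; tight delta variances pull the trajectory smooth.
step = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
smooth = mlpg(step, [1.0] * 6, [0.0] * 6, [0.01] * 6)
print([round(v, 3) for v in smooth])
```

Without the delta constraints the generated trajectory would simply be the piecewise-constant sequence of state means. With them, the jump between states is spread over neighbouring frames.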
37. Speech synthesizers comparison
38. Speech synthesizers comparison
[Figure: quality versus footprint for Unit Selection, HTS, and Diphone Concatenation; footprint axis marks: <1 MB, 5 MB, 200 MB]
40. Problem positioning
Parametric speech synthesizers generally suffer from a typical buzziness, as encountered in LPC-like vocoders.
Source-filter approach: enhance the excitation signal
[Diagram: a pulse train or white noise drives a filter to produce synthetic speech]
41. Proposed solution
SOURCE
FILTER
T. Drugman, G. Wilfart, T. Dutoit, "A Deterministic plus Stochastic Model of the Residual Signal for Improved Parametric Speech Synthesis", Interspeech 2009
42. Results
Traditional vs. Proposed
44. Problem of oversmoothing
45. Compensation of oversmoothing
46. Global Variance
47. Global Variance
48. Results
50. Speech synthesizers comparison
Rule-based synthesis
Intelligibility? Naturalness?
Mem/CPU/Voices? Expressivity ?
Diphone concatenation
Intelligibility? Naturalness Mem/CPU/Voices?
Expressivity ?
Unit selection
Intelligibility? Naturalness ?
Mem/CPU/Voices Expressivity
HMM-based speech synthesis
Intelligibility? Naturalness ?? Mem/CPU/Voices
? Expressivity ??
52. Future work
- Voice Conversion
- Expressive/emotional synthesis
- Better parametric representation
- Real-time speech synthesis
53. Questions?