HMM-based speech synthesis: the new generation of artificial voices

About This Presentation

Description:

HMM-based speech synthesis: the new generation of artificial voices. Thomas Drugman, thomas.drugman_at_umons.ac.be ...

Slides: 54

Provided by: Thierry110

Transcript and Presenter's Notes

1
HMM-based speech synthesis: the new generation of
artificial voices

Thomas Drugman
thomas.drugman_at_umons.ac.be

2
TCTS Lab
Laboratoire de Théorie des Circuits et de
Traitement du Signal (Circuit Theory and Signal
Processing Lab)
25 people: 3 professors, 10 PhD students
Areas: Image and Video, Numerical Arts,
Audio and Speech
3
Content

Speech synthesis history
HMM-based speech synthesis
Parametric modeling of speech
Statistical generation
Conclusions

4
Content

Speech synthesis history
HMM-based speech synthesis
Parametric modeling of speech
Statistical generation
Conclusions

5
Speech Synthesis
Text-to-speech system (example input: "Hello")
Goal: read aloud any previously unseen text
typed by the user
6
Challenges

Naturalness
Intelligibility
Cost-effectiveness
Expressivity

7
Challenge 3: Cost-effectiveness

Industry expects intelligibility and naturalness, plus:
Small footprint: a few megabytes
Small CPU requirements (embedded market)
Easy extension to other languages
Possibility to create new voices as fast as
possible:
  through an automatic recording/segmentation process
  through efficient voice conversion
Possibility to bootstrap an existing TTS voice
into any voice

8
Challenge 4 (new): Expressivity

Emotional speech synthesis (?art!)
Being able to render an expressive voice:
  in terms of prosody
  in terms of voice quality
Knowing when to do it (still unsolved)
Today's holy grail for the industry
Strategic advantage for whoever gets it first
New markets (e-books?)

9
Methods for Speech Synthesis

Expert-based (rule-based) approach
Corpus-based approach
Diphone concatenation
Unit Selection
Statistical parametric synthesis (HMM-based
synthesis)

10
Von Kempelen's talking machine (1791)
11
Homer Dudley's Voder (Bell Labs, 1936)
12
And other developments in articulatory synthesis

Work by:
K. Stevens, G. Fant, P. Mermelstein, R. Carré
(GNUSpeech), S. Maeda, J. Schroeter and M. Sondhi
More recently:
O. Engwall, S. Fels (ArtiSynth), Birkholz and
Kröger, A. Alwan and S. Narayanan (MRI)

13
Rule-based synthesis
Intelligibility? Naturalness?
Mem/CPU/Voices? Expressivity ?
14
Methods for Speech Synthesis

Expert-based (rule-based) approach
Corpus-based approach
Diphone concatenation
Unit Selection
Statistical parametric synthesis (HMM-based
synthesis)

15
Diphone concatenation
Intelligibility? Naturalness Mem/CPU/Voices?
Expressivity ?
16
Unit selection
Intelligibility? Naturalness ?
Mem/CPU/Voices Expressivity
17
Content

Speech synthesis history
HMM-based speech synthesis
Parametric modeling of speech
Statistical generation
Conclusions

18
Statistical Parametric Speech Synthesis
[Block diagram]
TRAINING: Database, Speech Analysis, Speech
Parameters, Statistical Modeling, SPS Synthesizer
SYNTHESIS: text input ("Hello!"), Statistical
Generation, Speech Parameters, Speech Processing,
speech output ("Hello!")
19
HMM-based speech synthesis
http://hts.sp.nitech.ac.jp/
Intelligibility? Naturalness ?? Mem/CPU/Voices
? Expressivity ??
20
TRAINING OF THE HMM-BASED SYNTHESIZER
21
Parameter extraction
22
Parameter extraction
[Diagram: excitation (pulse train or white noise)
-> Filter -> Synthetic Speech]
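
The source-filter diagram on this slide (a pulse train or white noise feeding a filter) can be illustrated with a minimal sketch. It assumes the usual vocoder convention that voiced frames are excited by a pulse train at the pitch period and unvoiced frames by white noise. The function names, the 16 kHz sample rate and the single-pole filter coefficient are illustrative choices, not values from the presentation.

```python
import numpy as np

def make_excitation(n_samples, f0_hz, sample_rate=16000, seed=0):
    """Pulse train when f0_hz > 0 (voiced), white noise otherwise (unvoiced)."""
    if f0_hz > 0:
        period = int(round(sample_rate / f0_hz))  # pitch period in samples
        excitation = np.zeros(n_samples)
        excitation[::period] = 1.0
        return excitation
    rng = np.random.default_rng(seed)
    return rng.standard_normal(n_samples)

def all_pole_filter(excitation, a):
    """Apply the filter 1 / (1 + a[0] z^-1 + a[1] z^-2 + ...) sample by sample."""
    out = np.zeros(len(excitation))
    for n in range(len(excitation)):
        acc = excitation[n]
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc -= ak * out[n - k]
        out[n] = acc
    return out

# Hypothetical one-pole "vocal tract" filter, used for both frame types.
voiced = all_pole_filter(make_excitation(400, f0_hz=100.0), a=[-0.9])
unvoiced = all_pole_filter(make_excitation(400, f0_hz=0.0), a=[-0.9])
```

In a real vocoder the filter coefficients change frame by frame and come from the extracted spectral parameters; here they are fixed to keep the example short.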
23
Labels
24
Labels
Labels consist of a description of the phonetic
environment

Contextual factors:
Phone identity
Syntactic factors
Stress-related factors
Locational factors, ...

25
Labels
Example
26
HMM training
27
System architecture
Contextual factors may affect duration, source
and filter differently
Context Oriented Clustering using Decision Trees
28
System architecture
State Duration Model
HMM for Source and Filter
Decision tree for State Duration
Decision trees for Filter
Decision trees for Source
29
Training decision trees
An exhaustive list of possible questions is first
drawn up
Example:
QS "LL-Nasal" m,n,en,ng
QS "LL-Fricative" ch,dh,f,hh,hv,s,sh,th,v,z,zh
QS "LL-Liquid" el,hh,l,r,w,y
QS "LL-Front" ae,b,eh,em,f,ih,ix,iy,m,p,v,w
QS "LL-Central" ah,ao,axr,d,dh,dx,el,en,er,l,n,r,s,t,th,z,zh
QS "LL-Back" aa,ax,ch,g,hh,jh,k,ng,ow,sh,uh,uw,y
QS "LL-Front_Vowel" ae,eh,ey,ih,iy
QS "LL-Central_Vowel" aa,ah,ao,axr,er
QS "LL-Back_Vowel" ax,ow,uh,uw
QS "LL-Long_Vowel" ao,aw,el,em,en,en,iy,ow,uw
QS "LL-Short_Vowel" aa,ah,ax,ay,eh,ey,ih,ix,oy,uh
QS "LL-Dipthong_Vowel" aw,axr,ay,el,em,en,er,ey,oy
QS "LL-Front_Start_Vowel" aw,axr,er,ey
Total: about 1500 questions
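
Each question is a yes/no test on the context of a phone. The sketch below shows how such a test could partition a set of contexts. It assumes that the "LL" prefix refers to the phone two positions to the left of the current one, and it represents a context as a plain dictionary. Both the dictionary layout and the function names are hypothetical simplifications, not the HTS label format.

```python
# Three of the questions listed on the slide, stored as phone sets.
QUESTIONS = {
    "LL-Nasal": {"m", "n", "en", "ng"},
    "LL-Front_Vowel": {"ae", "eh", "ey", "ih", "iy"},
    "LL-Back_Vowel": {"ax", "ow", "uh", "uw"},
}

def answer(question_name, context):
    """Return True when the left-left phone belongs to the question's set."""
    return context["LL"] in QUESTIONS[question_name]

def split_by_question(question_name, contexts):
    """Partition contexts into those answering yes and those answering no."""
    yes = [c for c in contexts if answer(question_name, c)]
    no = [c for c in contexts if not answer(question_name, c)]
    return yes, no

# Hypothetical contexts: "LL" is the left-left phone, "C" the current phone.
contexts = [
    {"LL": "m", "C": "ae"},
    {"LL": "iy", "C": "t"},
    {"LL": "ng", "C": "ow"},
]
yes, no = split_by_question("LL-Nasal", contexts)
```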
30
Training decision trees
Decision trees are trained using a Maximum
Likelihood criterion
Example
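
A maximum-likelihood split can be sketched as follows: at each node, the question retained is the one whose yes/no partition most increases the total log-likelihood of the data. This sketch assumes that each node is modeled by a single one-dimensional Gaussian fitted by maximum likelihood. The variance floor, the toy data and the function names are illustrative assumptions, and no stopping criterion is shown.

```python
import math

def node_log_likelihood(values):
    """Log-likelihood of values under a Gaussian fitted to them by ML."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    var = max(var, 1e-6)  # variance floor (assumption) to avoid log(0)
    return -0.5 * n * (math.log(2.0 * math.pi * var) + 1.0)

def best_question(samples, questions):
    """samples: list of (phone, value); questions: name -> set of phones."""
    parent = node_log_likelihood([v for _, v in samples])
    best_name, best_gain = None, 0.0
    for name, phones in questions.items():
        yes = [v for p, v in samples if p in phones]
        no = [v for p, v in samples if p not in phones]
        if not yes or not no:
            continue  # the question does not split this node
        gain = node_log_likelihood(yes) + node_log_likelihood(no) - parent
        if gain > best_gain:
            best_name, best_gain = name, gain
    return best_name, best_gain

# Toy data: nasal contexts cluster near 1, the other contexts near 5.
samples = [("m", 1.0), ("n", 1.2), ("ng", 0.9),
           ("ae", 5.0), ("iy", 5.3), ("s", 4.8)]
questions = {
    "Nasal": {"m", "n", "en", "ng"},
    "Front_Vowel": {"ae", "eh", "ey", "ih", "iy"},
}
name, gain = best_question(samples, questions)
```

On this toy data the nasal question separates the two clusters cleanly, so it yields the larger likelihood gain and is selected.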
31
Emission likelihood and training
Finally, each leaf is modeled by a Gaussian
Mixture Model (GMM)
Training is guided by the Viterbi and Baum-Welch
re-estimation algorithms
32
SYNTHESIS BY THE HMM-BASED SYNTHESIZER
33
Text analysis
34
Parameters generation
35
Parameters generation
Given the sequence of labels, durations are
determined by maximizing the state-sequence
likelihood
A trajectory through the context-dependent HMM
states is then known
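
As a rough illustration of the duration step, the sketch below assumes that each state duration follows a Gaussian distribution. Under that assumption, the most likely durations are the means when no total length is imposed. When a total number of frames is requested, each mean is shifted in proportion to its variance. The function name and the numbers are hypothetical.

```python
def state_durations(means, variances, total_frames=None):
    """Most likely state durations under Gaussian duration models.

    With no constraint, each duration is its mean. With a requested total
    length, every mean is shifted by rho * variance, where rho spreads the
    difference between the requested total and the sum of the means.
    """
    if total_frames is None:
        rho = 0.0
    else:
        rho = (total_frames - sum(means)) / sum(variances)
    # Round to whole frames and keep at least one frame per state.
    return [max(1, round(m + rho * v)) for m, v in zip(means, variances)]

# Hypothetical duration statistics for a three-state model.
natural = state_durations([3.0, 5.0, 4.0], [1.0, 2.0, 1.0])
slower = state_durations([3.0, 5.0, 4.0], [1.0, 2.0, 1.0], total_frames=16)
```

States with a larger duration variance absorb more of the stretch. Because of rounding, the durations may not sum exactly to the requested total in general.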
36
Parameters generation
Using this trajectory, source and filter
parameters are generated by maximizing the output
probability
Taking dynamic features into account makes the
parameter evolution more realistic and smooth
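
The effect of dynamic features can be sketched for a single parameter stream. The sketch assumes each frame carries a Gaussian over a static value and a delta value, with diagonal covariance, and that the delta is defined as 0.5 * (c[t+1] - c[t-1]). Writing the observations as o = W c, the most likely static trajectory c solves (W' P W) c = W' P mu, where P is the diagonal precision matrix. The window definition, function names and numbers are assumptions made for illustration.

```python
import numpy as np

def build_window_matrix(n_frames):
    """Stack static and delta rows so that o = W @ c."""
    W = np.zeros((2 * n_frames, n_frames))
    for t in range(n_frames):
        W[2 * t, t] = 1.0  # static row
        if t > 0:
            W[2 * t + 1, t - 1] = -0.5  # delta row, left neighbour
        if t < n_frames - 1:
            W[2 * t + 1, t + 1] = 0.5   # delta row, right neighbour
    return W

def generate_trajectory(means, variances):
    """means, variances: shape (n_frames, 2), holding [static, delta] per frame."""
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    W = build_window_matrix(means.shape[0])
    precision = 1.0 / variances.reshape(-1)  # diagonal of the inverse covariance
    mu = means.reshape(-1)
    A = W.T @ (precision[:, None] * W)
    b = W.T @ (precision * mu)
    return np.linalg.solve(A, b)

# Two hypothetical states: static mean 0 for 5 frames, then 1 for 5 frames.
static_means = np.array([0.0] * 5 + [1.0] * 5)
means = np.stack([static_means, np.zeros(10)], axis=1)
variances = np.stack([np.ones(10), np.full(10, 0.1)], axis=1)
trajectory = generate_trajectory(means, variances)
```

Without the delta constraint the output would jump from 0 to 1 at the state boundary. With it, the generated trajectory moves gradually between the two state means.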
37
Speech synthesizers comparison
38
Speech synthesizers comparison
Quality versus footprint
[Plot comparing Unit Selection, HTS and Diphone
Concatenation; footprint axis marked at < 1 Mb,
5 Mb and 200 Mb]
39
Content

Speech synthesis history
HMM-based speech synthesis
Parametric modeling of speech
Statistical generation
Conclusions

40
Problem positioning
Parametric speech synthesizers generally suffer
from a typical buzziness, as encountered in
LPC-like vocoders
Source-filter approach
Goal: enhance the excitation signal
[Diagram: excitation (pulse train or white noise)
-> Filter -> Synthetic Speech]
41
Proposed solution
SOURCE
FILTER
T. Drugman, G. Wilfart, T. Dutoit, "A Deterministic
plus Stochastic Model of the Residual Signal for
Improved Parametric Speech Synthesis",
Interspeech 2009
42
Results
Traditional Proposed
43
Content

Speech synthesis history
HMM-based speech synthesis
Parametric modeling of speech
Statistical generation
Conclusions

44
Problem of oversmoothing
45
Compensation of oversmoothing
46
Global Variance
47
Global Variance
48
Results
49
Content

Speech synthesis history
HMM-based speech synthesis
Parametric modeling of speech
Statistical generation
Conclusions

50
Speech synthesizers comparison
Rule-based synthesis
Intelligibility? Naturalness?
Mem/CPU/Voices? Expressivity ?
Diphone concatenation
Intelligibility? Naturalness Mem/CPU/Voices?
Expressivity ?
Unit selection
Intelligibility? Naturalness ?
Mem/CPU/Voices Expressivity
HMM-based speech synthesis
Intelligibility? Naturalness ?? Mem/CPU/Voices
? Expressivity ??
51
Speech synthesizers comparison
Quality versus footprint
[Plot comparing Unit Selection, HTS and Diphone
Concatenation; footprint axis marked at < 1 Mb,
5 Mb and 200 Mb]
52
Future Work