Title: Philip Jackson and Martin Russell
1. Models of speech dynamics in a segmental-HMM recognizer using intermediate linear representations
- Philip Jackson and Martin Russell
Electronic, Electrical and Computer Engineering
http://web.bham.ac.uk/p.jackson/balthasar/
2. Speech dynamics into ASR
INTRODUCTION
3. Conventional model
[Figure: HMM with states 1 to 4, frames labelled by state, and the acoustic PDF]
INTRODUCTION
4. Linear-trajectory model
[Figure: segmental HMM with states 1 to 4, intermediate layer, articulatory-to-acoustic mapping W, and the acoustic PDF]
INTRODUCTION
5. Multi-level Segmental HMM
- segmental finite-state process
- intermediate articulatory layer
- linear trajectories
- mapping required
- linear transformation
- radial basis function network
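The structure listed above can be sketched in a few lines: a linear trajectory is generated in the intermediate layer and then passed through a linear transformation into the acoustic space. This is a minimal illustration, not the authors' implementation; the function names, the matrix `W`, and all numeric values are hypothetical.

```python
def linear_trajectory(midpoint, slope, duration):
    """Return points f(t) = midpoint + (t - t_bar) * slope for t = 1..duration."""
    t_bar = (duration + 1) / 2.0
    return [
        [c + (t - t_bar) * m for c, m in zip(midpoint, slope)]
        for t in range(1, duration + 1)
    ]


def apply_linear_map(W, x):
    """Multiply matrix W (one row per acoustic dimension) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


# Hypothetical 2-D intermediate layer mapped to a 3-D acoustic space.
W = [[1.0, 0.0],
     [0.5, -1.0],
     [2.0, 1.0]]
midpoint = [1.0, 2.0]
slope = [0.5, -0.25]

# Trajectory in the intermediate layer, then mapped frame by frame.
articulatory = linear_trajectory(midpoint, slope, duration=5)
acoustic = [apply_linear_map(W, x) for x in articulatory]

# Mapping the midpoint and slope first gives the same acoustic trajectory,
# because a linear map of a linear trajectory is itself linear.
direct = linear_trajectory(
    apply_linear_map(W, midpoint), apply_linear_map(W, slope), duration=5
)
```

Because the mapped trajectory is again linear, a single linear mapping cannot describe anything that a linear trajectory defined directly in the acoustic space could not.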
INTRODUCTION
6. Estimation of linear mapping
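One common way to estimate a linear mapping, given here as a sketch rather than as the authors' exact procedure, is ordinary least squares over paired intermediate-layer vectors and acoustic vectors. The symbols x_t, y_t and N are assumptions for illustration, and the second sum is assumed to be invertible.

```latex
\hat{W} = \arg\min_{W} \sum_{t=1}^{N} \lVert y_t - W x_t \rVert^{2}
        = \Bigl( \sum_{t=1}^{N} y_t x_t^{\top} \Bigr)
          \Bigl( \sum_{t=1}^{N} x_t x_t^{\top} \Bigr)^{-1}
```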
THEORY
7. Linear-trajectory equations
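A hedged sketch of a linear-trajectory segment model with a linear mapping, under assumed notation: c is the trajectory midpoint, m its slope, T the segment duration in frames, W the mapping to the acoustic space, and Σ the acoustic covariance. These symbols are illustrative and may differ from the authors' own.

```latex
f(t) = c + (t - \bar{t})\, m, \qquad \bar{t} = \frac{T+1}{2}, \qquad t = 1, \dots, T

y_t = W f(t) + \varepsilon_t, \qquad \varepsilon_t \sim \mathcal{N}(0, \Sigma)

p(y_1, \dots, y_T \mid c, m) = \prod_{t=1}^{T} \mathcal{N}\bigl(y_t;\, W f(t),\, \Sigma\bigr)
```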
THEORY
8. Training the model parameters
- For optimal least-squares estimates (acoustic domain)
midpoint
slope
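For a single segment, least-squares estimates of the midpoint and slope reduce to a per-dimension straight-line fit. The sketch below assumes frames indexed t = 1..T and a trajectory centred on the segment midpoint; the function name and example data are hypothetical, and training over many segments would pool the corresponding sums.

```python
def fit_linear_trajectory(frames):
    """Least-squares midpoint and slope for one segment.

    frames: list of equal-length feature vectors y_1..y_T, with T >= 2.
    """
    T = len(frames)
    if T < 2:
        raise ValueError("need at least two frames to estimate a slope")
    dim = len(frames[0])
    t_bar = (T + 1) / 2.0
    offsets = [t - t_bar for t in range(1, T + 1)]  # (t - t_bar), sums to zero
    denom = sum(d * d for d in offsets)
    # Midpoint: the mean of the frames in each dimension.
    midpoint = [sum(y[k] for y in frames) / T for k in range(dim)]
    # Slope: time-weighted sum of the frames, normalised by sum of (t - t_bar)^2.
    slope = [
        sum(d * y[k] for d, y in zip(offsets, frames)) / denom
        for k in range(dim)
    ]
    return midpoint, slope


# Toy segment: first dimension rises by 1 per frame, second falls by 1.
frames = [[1.0, 4.0], [2.0, 3.0], [3.0, 2.0], [4.0, 1.0], [5.0, 0.0]]
midpoint, slope = fit_linear_trajectory(frames)
```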
THEORY
9. Training the model parameters
- For optimal least-squares estimates (articulatory domain)
midpoint
slope
THEORY
10. Training the model parameters
- For optimal maximum-likelihood estimates (articulatory domain)
midpoint
slope
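A single-segment sketch of what such estimates can look like, assuming frames y_1..y_T, a trajectory f(t) = c + (t - t̄)m in the articulatory layer, a fixed linear mapping W with full column rank, and Gaussian noise with covariance Σ in the acoustic domain. The symbols are assumptions for illustration, not the authors' notation, and training over many segments would pool the corresponding sums.

```latex
\bar{t} = \frac{T+1}{2}, \qquad
A = \bigl(W^{\top}\Sigma^{-1}W\bigr)^{-1} W^{\top}\Sigma^{-1}

\hat{c} = A \,\frac{1}{T}\sum_{t=1}^{T} y_t
\qquad\text{(midpoint)}

\hat{m} = A \,\frac{\sum_{t=1}^{T} (t-\bar{t})\, y_t}{\sum_{t=1}^{T} (t-\bar{t})^{2}}
\qquad\text{(slope)}
```

With Σ equal to the identity, A becomes the pseudo-inverse of W and the estimates reduce to an unweighted least-squares fit.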
THEORY
11. Tests on MOCHA
- S. British English, at 16 kHz (Wrench, 2000)
- MFCC13: acoustic features, incl. zeroth
- articulatory: x- and y-coords from 7 EMA coils
- PCA9Lx: first nine articulatory modes plus the laryngograph log energy
METHOD
12. MOCHA baseline performance
- Constant-trajectory SHMM (ID_0)
- Linear-trajectory SHMM (ID_1)
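To make the contrast between the two baselines concrete, the sketch below scores one toy segment under a constant trajectory (slope fixed at zero) and under a linear trajectory, using a Gaussian with a fixed variance. It is an illustration under stated assumptions only: scalar features, a hand-picked slope, and invented data; it says nothing about the reported results.

```python
import math


def trajectory_log_likelihood(frames, midpoint, slope, variance):
    """Gaussian log-likelihood of scalar frames around a linear trajectory."""
    T = len(frames)
    t_bar = (T + 1) / 2.0
    total = 0.0
    for t, y in enumerate(frames, start=1):
        mean = midpoint + (t - t_bar) * slope
        total += -0.5 * (
            math.log(2.0 * math.pi * variance) + (y - mean) ** 2 / variance
        )
    return total


# Invented segment that rises roughly one unit per frame.
frames = [1.0, 2.1, 2.9, 4.2, 5.0]
mean_value = sum(frames) / len(frames)

# Constant trajectory: slope fixed at zero.
constant_ll = trajectory_log_likelihood(frames, mean_value, 0.0, variance=1.0)
# Linear trajectory: same midpoint, hand-picked slope of one unit per frame.
linear_ll = trajectory_log_likelihood(frames, mean_value, 1.0, variance=1.0)
```

On data with a clear trend, the linear trajectory leaves smaller residuals and so scores higher; on stationary segments the two coincide when the slope is zero.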
RESULTS
13. Performance across mappings
RESULTS
14. Phone categorisation
     No.  Description
A      1  all data
B      2  silence; speech
C      6  linguistic categories: silence/stop, vowel, liquid, nasal, fricative, affricate
D     10  as (Deng and Ma, 2000): silence, vowel, liquid, nasal, UV fric, /s,ch/, V fric, /z,jh/, UV stop, V stop
E     10  discrete articulatory regions
F     49  silence; individual phones
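One way to read the table is that each category shares a single linear mapping, so a grouping with six categories gives a piecewise linear mapping built from six matrices. The sketch below assumes that reading and uses grouping C; the phone-to-category assignments are a small illustrative subset, and the 1x1 placeholder matrices stand in for trained mappings.

```python
# Grouping C from the table: six broad categories.
# The phone labels below are an illustrative subset, not the full phone set.
PHONE_CATEGORY = {
    "sil": "silence/stop", "p": "silence/stop", "t": "silence/stop",
    "iy": "vowel", "aa": "vowel",
    "l": "liquid", "r": "liquid",
    "m": "nasal", "n": "nasal",
    "s": "fricative", "f": "fricative",
    "ch": "affricate", "jh": "affricate",
}

# One placeholder mapping matrix per category (hypothetical values).
categories = sorted(set(PHONE_CATEGORY.values()))
mappings = {cat: [[float(i + 1)]] for i, cat in enumerate(categories)}


def mapping_for_phone(phone):
    """Return the mapping shared by every phone in the same category."""
    return mappings[PHONE_CATEGORY[phone]]


# Phones in the same category share a mapping; other categories do not.
same = mapping_for_phone("iy") is mapping_for_phone("aa")
different = mapping_for_phone("s") is not mapping_for_phone("m")
```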
METHOD
15. Tests on TIMIT
- N. American English, at 8 kHz
- MFCC13: acoustic features, incl. zeroth
- F1-3: formants F1, F2 and F3, estimated by Holmes formant tracker
- F1-3BE5: five band energies added
- PFS12: synthesiser control parameters
METHOD
16. TIMIT baseline performance
- Constant-trajectory SHMM (ID_0)
- Linear-trajectory SHMM (ID_1)
RESULTS
17. Performance across feature sets
RESULTS
18. Performance across groupings
RESULTS
19. Results across groupings
RESULTS
20. Model visualisation
Constant-trajectory model
Linear-trajectory model (c,F)
DISCUSSION
21. Conclusions
- Developed framework for speech dynamics in an intermediate space
- Linear traj. with piecewise linear mapping: bounded by performance of linear traj. in acoustic space
- Near optimal performance achieved:
- For more than 3 formant parameters
- For 6 or more linear mappings
- Formants and articulatory parameters gave qualitatively similar results
- What next?
SUMMARY
22. Further work
- Complete experiments with language model
- Include segment duration models
- Derive pseudo-articulatory representations by unsupervised (embedded) training
- Implement non-linear mapping (i.e., RBF)
- Further information
- here and now
- p.jackson_at_bham.ac.uk
- web.bham.ac.uk/p.jackson/balthasar
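The non-linear mapping listed above as further work could take the form of a Gaussian radial basis function network. The sketch below shows only the forward pass and is an assumption about the general technique, not a description of the authors' planned implementation; the function name, centres, widths, weights and bias are all hypothetical.

```python
import math


def rbf_map(x, centres, widths, weights, bias):
    """Forward pass of a Gaussian RBF network.

    Each basis k responds with exp(-||x - centre_k||^2 / (2 * width_k^2));
    each output is a bias plus a weighted sum of the basis responses.
    """
    activations = [
        math.exp(
            -sum((xi - ci) ** 2 for xi, ci in zip(x, centre)) / (2.0 * width ** 2)
        )
        for centre, width in zip(centres, widths)
    ]
    return [
        b + sum(w * a for w, a in zip(row, activations))
        for row, b in zip(weights, bias)
    ]


# Hypothetical network: 2-D input, two basis functions, 2-D output.
centres = [[0.0, 0.0], [1.0, 1.0]]
widths = [1.0, 1.0]
weights = [[1.0, 0.0],   # output 0 uses only the first basis
           [0.0, 2.0]]   # output 1 uses only the second basis
bias = [0.0, 0.5]

y = rbf_map([0.0, 0.0], centres, widths, weights, bias)
```

Unlike a linear transformation, this mapping does not turn a linear trajectory in the intermediate layer into a linear trajectory in the acoustic space.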
SUMMARY