Title: TANDEM OBSERVATION MODELS
1. TANDEM OBSERVATION MODELS
2. Introduction
- Tandem is a method that uses the predictions of an MLP as observation vectors in generative models, e.g. HMMs
- Extensively used in the ICSI/SRI systems: 10-20% improvement for English, Arabic, and Mandarin
- Most previous work derived tandem features from phone MLPs (e.g., Hermansky et al. '00 and Morgan et al. '05)
- We explore tandem features based on articulatory MLPs
  - Similar to the approach in Kirchhoff '99
- Questions
  - Are articulatory tandems better than the phonetic ones?
  - Are factored observation models for tandem and acoustic (e.g. PLP) observations better than the observation concatenation approaches?
3. Tandem Processing Steps
- MLP posteriors are processed to make them more Gaussian-like
- There are 8 articulatory MLPs; their outputs are concatenated (64 dims)
- PCA reduces the dimensionality to 26 (95% of the total variance)
- This 26-dimensional vector is used as the acoustic observation in an HMM or some other model
- The tandem features are usually used in combination with a standard feature, e.g. PLP (a sketch of the full pipeline follows this list)
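The steps above amount to a short feature pipeline. Below is a minimal sketch, assuming per-frame posteriors are available as NumPy arrays and using scikit-learn's PCA in place of the KLT; tandem_features and all parameter names are illustrative, not from the original system.

import numpy as np
from sklearn.decomposition import PCA

def tandem_features(mlp_posteriors, plp, n_components=26, eps=1e-10):
    # mlp_posteriors: list of 8 arrays, each (n_frames, n_classes_i),
    # with class counts summing to 64; plp: (n_frames, plp_dim).
    logged = [np.log(p + eps) for p in mlp_posteriors]  # log-transform to Gaussianize
    joined = np.hstack(logged)                          # concatenate MLP outputs -> 64 dims
    pca = PCA(n_components=n_components)                # PCA/KLT down to 26 dims
    tandem = pca.fit_transform(joined)                  # in practice, fit on training data only
    return np.hstack([plp, tandem])                     # append tandem to PLP features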
4. Tandem Observation Models
- Feature concatenation: simply append tandems to PLPs
  - All of the standard modeling methods apply to this meta observation vector (e.g., MLLR, MMIE, and HLDA)
- Factored models: tandem and PLP distributions are factored at the HMM state output distributions (see the formulas after this list)
  - Potentially more efficient use of free parameters, especially if the streams are conditionally independent
  - Can use, e.g., separate triphone clusters for each observation stream
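In our own notation (the slides give no formulas), the contrast between the two models at HMM state q, with PLP vector x_t^PLP and tandem vector x_t^tandem, is:

b_q^{\mathrm{concat}}(o_t) = p\big([x_t^{\mathrm{PLP}};\, x_t^{\mathrm{tandem}}] \mid q\big)
b_q^{\mathrm{factored}}(o_t) = p\big(x_t^{\mathrm{PLP}} \mid q\big)\; p\big(x_t^{\mathrm{tandem}} \mid q\big)

Concatenation fits one density (e.g., a GMM) to the joint vector; factoring assumes the two streams are conditionally independent given the state, so each factor can be modeled, clustered, and parameterized separately.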
5. Articulatory vs. Phone Tandems
Model                                Test WER (%)
PLP                                  67.7
PLP/Phone Tandem (SVBD)              63.0
PLP/Articulatory Tandem (SVBD)       62.3
PLP/Articulatory Tandem (Fisher)     59.7
- Monophone models on the 500-word vocabulary task, without alignments; feature-concatenated PLP/tandem models
- All tandem systems are significantly better than PLP alone
- Articulatory tandems are as good as phone tandems
- Articulatory tandems from MLPs trained on Fisher (1776 hrs) outperform those from MLPs trained on SVBD (3 hrs)
6. Concatenation vs. Factoring
Model                        Task (vocab size)   Test WER (%)
PLP                          10                  24.5
PLP / Tandem Concatenation   10                  21.1
PLP x Tandem Factoring       10                  19.7
PLP                          500                 67.7
PLP / Tandem Concatenation   500                 59.7
PLP x Tandem Factoring       500                 59.1
- Monophone models without alignments
- All tandem results are significant over the PLP baseline
- Consistent improvements from factoring, statistically significant on the 500-word task
7. Triphone Experiments
Model                        # of Clusters   Test WER (%)
PLP                          477             59.2
PLP / Tandem Concatenation   880             55.0
PLP x Tandem Factoring       467 x 641       53.8
- 500-word vocabulary task without alignments
- PLP x Tandem factoring uses separate decision trees for PLP and tandem, as well as factored pdfs
- A significant improvement from factoring over the feature concatenation approach
- All pairwise differences are statistically significant
8. Observation Factoring and Weight Tuning
Factored tandem results:

Model            Validation WER (%)   Test WER (%)
Factored         58.7                 59.5
Fully-factored   R                    R
Dimensions of streams:

Stream     Before KLT   After KLT
All MLPs   64           26
dg1        6            4
frt        7            5
glo        4            2
ht         8            5
nas        3            2
pl1        10           7
rou        3            2
vow        23           13
[Figure: fully factored tandem. The HMM phone state generates the PLP stream plus one observation stream per articulatory MLP (dg1, pl1, rd, . . .), each built from the log outputs of its MLP.]
Dimensions after KLT account for 95% of the variance.
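In the same notation as above (ours, not the slides'), the fully factored model extends the factoring to one conditionally independent stream per articulatory MLP:

b_q^{\mathrm{fully}}(o_t) = p\big(x_t^{\mathrm{PLP}} \mid q\big) \prod_{s \in \{\mathrm{dg1},\, \ldots,\, \mathrm{vow}\}} p\big(x_t^{s} \mid q\big)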
9. Weight Tuning
[Plot: WER as a function of the MLP stream weight; the language model was tuned for a PLP weight of 1.]
- Weight tuning in progress
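The weights being tuned are presumably the per-stream exponents of the usual weighted multi-stream combination (our notation; the slides name only the MLP and PLP weights):

b_q(o_t) = p\big(x_t^{\mathrm{PLP}} \mid q\big)^{\lambda_{\mathrm{PLP}}} \prod_{s} p\big(x_t^{s} \mid q\big)^{\lambda_{s}}

with \lambda_{\mathrm{PLP}} = 1 held fixed (the setting the language model was tuned for) while the MLP-stream weights are varied.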
10. Summary
- Tandem features with PLPs outperform PLPs alone for both monophones and triphones
  - 8-13% relative improvements (statistically significant)
- Articulatory tandems are as good as phone tandems
  - Further comparisons with phone MLPs trained on Fisher
- Factored models look promising (significant results on the 500-word vocabulary task)
  - Further experiments with tying, initialization
  - Judiciously selected dependencies between the factored vectors, instead of complete independence