Title: Explorations in Discriminative Training
1. Explorations in Discriminative Training
- Jing Zheng
- Andreas Stolcke
- Speech Technology Research Laboratory
- SRI International, Menlo Park, CA
2. Talk Outline
- Motivation
- Review of training criteria
- Denominator lattice generation
- Exploring training procedures
- Minimum Phone Frame Error (MPFE) criterion
- Summary
3. Motivation
- Available training data increased significantly this year. To keep discriminative training practical we need to speed up:
  - Denominator lattice generation
  - Statistics collection
  - Convergence rate
- Explore different discriminative training approaches, especially:
  - MMI
  - MPE
  - Alternatives
4. Discriminative Training Criteria
- Maximum Likelihood Criterion (ML)
- Maximum Mutual Information Criterion (MMI)
- Minimum Phone Error Criterion (MPE)
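For reference, a sketch of the standard forms of these objectives (usual lattice-based notation, not taken from the slides), with O_r the r-th training utterance, s_r its reference transcription, P(s) the language-model prior, \kappa an acoustic scale, and A(s, s_r) the phone accuracy of hypothesis s against the reference:

  F_{ML}(\lambda)  = \sum_r \log p_\lambda(O_r \mid s_r)
  F_{MMI}(\lambda) = \sum_r \log \frac{ p_\lambda(O_r \mid s_r)^\kappa \, P(s_r) }{ \sum_s p_\lambda(O_r \mid s)^\kappa \, P(s) }
  F_{MPE}(\lambda) = \sum_r \frac{ \sum_s p_\lambda(O_r \mid s)^\kappa \, P(s) \, A(s, s_r) }{ \sum_s p_\lambda(O_r \mid s)^\kappa \, P(s) }

All three are maximized during training; MPE maximizes the expected phone accuracy over the competing hypotheses.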
5. Estimation Algorithms
- Maximum Likelihood (ML)
  - Baum-Welch algorithm (EM) (Dempster et al., 77)
- Maximum Mutual Information (MMIE)
  - Extended Baum-Welch (EBW) (Normandin, 91)
- Minimum Phone Error (MPE)
  - Adapted EBW with I-smoothing (Povey & Woodland, 02)
6. EBW and I-smoothing
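A sketch of the standard updates behind this slide (following Povey & Woodland, 02; notation is ours): for Gaussian m, with numerator/denominator occupancies \gamma and first/second-order statistics \theta(O), \theta(O^2), the EBW mean and variance updates are

  \hat{\mu}_m = \frac{ \theta^{num}_m(O) - \theta^{den}_m(O) + D_m \mu_m }{ \gamma^{num}_m - \gamma^{den}_m + D_m }
  \hat{\sigma}^2_m = \frac{ \theta^{num}_m(O^2) - \theta^{den}_m(O^2) + D_m (\sigma^2_m + \mu^2_m) }{ \gamma^{num}_m - \gamma^{den}_m + D_m } - \hat{\mu}^2_m

where D_m is a per-Gaussian smoothing constant. I-smoothing adds \tau points of prior (e.g. ML) statistics to the numerator:

  \gamma^{num}_m \to \gamma^{num}_m + \tau, \qquad \theta^{num}_m(O) \to \theta^{num}_m(O) + \tau \, \mu^{prior}_m \quad (\text{and similarly for } O^2 \text{ using the prior second moment})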
7. Prior Approaches to Denominator Lattice Generation
- Word lattices with phone boundaries (Valtchev et al., 1997)
  - Approach: use a bigram/unigram LM to generate word lattices, then mark model boundaries. Collect statistics from the lattices, restricted by the phone boundaries.
  - Pros: statistics collection is fast, especially with multiple iterations.
  - Cons: takes large disk space; slow to generate lattices with rich alternative hypotheses.
- Implicit-lattice MMI (Huang et al., 2002)
  - Approach: drop word information, compile the pronunciation dictionary into a compact HMM state network with unigram LM probabilities encoded. Collect statistics via a forward-backward pass over the whole network.
  - Pros: no need to generate lattices; saves a lot of disk space.
  - Cons: statistics collection is slower; non-trivial to extend to MPE.
8. New Denominator Lattice Generation
- Drop word information; compile the pronunciation dictionary into a determinized/minimized phone network, encoding unigram probabilities and pronunciation probabilities.
- Generate phone lattices by decoding from the network; each arc is associated with a context-dependent phone and start/end times.
- Use !NULL links to reduce lattice size and computation.
- Collect statistics constrained by the phone start and end times (see the sketch after this list).
- Pros
  - Fast lattice generation; efficient representation of alternative phone hypotheses.
  - Fast statistics collection, even faster than with word lattices due to less arc overlap.
  - Can be used for both MMI and MPE training.
- Cons
  - Still takes a lot of disk space, though less than word lattices with equal richness of alternative hypotheses.
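To make the constrained statistics collection concrete, here is a minimal sketch (assumed data structures and names, not SRI's implementation) of turning phone-lattice arc posteriors into per-frame denominator occupancies, where each arc only ever touches the frames between its fixed start and end times:

# Minimal sketch (assumed structure, not SRI's code): denominator statistics
# from a phone lattice whose arcs carry a context-dependent phone label and
# fixed start/end frames.  Nodes are assumed to be numbered in topological
# order; arc.logp is the combined acoustic + scaled unigram LM log-score.
import math
from collections import defaultdict, namedtuple

Arc = namedtuple("Arc", "src dst phone t_start t_end logp")

def logadd(a, b):
    """log(exp(a) + exp(b)) without overflow."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def phone_frame_occupancies(arcs, start_node, end_node):
    """Return gamma[(phone, frame)]: posterior occupancy summed over arcs."""
    # Forward pass: alpha[n] = log-prob of all partial paths reaching node n.
    alpha = defaultdict(lambda: -math.inf)
    alpha[start_node] = 0.0
    for a in sorted(arcs, key=lambda a: a.src):
        alpha[a.dst] = logadd(alpha[a.dst], alpha[a.src] + a.logp)
    # Backward pass: beta[n] = log-prob of all path suffixes leaving node n.
    beta = defaultdict(lambda: -math.inf)
    beta[end_node] = 0.0
    for a in sorted(arcs, key=lambda a: a.dst, reverse=True):
        beta[a.src] = logadd(beta[a.src], a.logp + beta[a.dst])
    total = alpha[end_node]
    # Because start/end times are fixed, each arc contributes only to the
    # frames it spans; !NULL arcs (phone=None, t_start == t_end) drop out.
    gamma = defaultdict(float)
    for a in arcs:
        post = math.exp(alpha[a.src] + a.logp + beta[a.dst] - total)
        for t in range(a.t_start, a.t_end):
            gamma[(a.phone, t)] += post
    return gamma

Fixed arc boundaries mean the inner loop only visits an arc's own frames, and fewer overlapping arcs per frame than in a word lattice means less total work per utterance, which is where the statistics-collection speedup comes from.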
9. Exploring Training Procedures
- Setup
  - For each iteration, we collect statistics for both MMI and MPE training simultaneously. We can therefore estimate the following different models (see the note below on the criterion/prior naming):
    - MMI model
    - MPE model
    - MMI model with MLE prior using I-smoothing (MMI/MLE)
    - MPE model with MLE prior using I-smoothing (MPE/MLE)
    - MPE model with MMI prior using I-smoothing (MPE/MMI)
    - MMI model with MPE prior using I-smoothing (MMI/MPE)
- Observations
  - I-smoothing generally helps.
  - The best model is not necessarily of the same type as the seed model used to generate the statistics; model evaluation and selection is needed.
  - Alternating MMI and MPE (with priors) usually speeds up training convergence (denoted MPE+MMI).
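In terms of the I-smoothing update sketched under slide 6, one way to read the criterion/prior naming above (an interpretation, not a formula from the slides) is: train with criterion X, but smooth the numerator statistics toward a model estimated with criterion Y,

  \theta^{num}_m(O) \to \theta^{num}_m(O) + \tau \, \mu^{Y}_m, \qquad Y \in \{ \mathrm{MLE}, \mathrm{MMI}, \mathrm{MPE} \}

so MPE/MMI, for example, is MPE training whose I-smoothing backs off to the MMI estimate instead of the ML one.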
10. Broadcast News Results
- Broadcast News crossword PLP model, trained on 400 hours of data (Hub4 96+97 and TDT4).
- Tested on dev03, trigram decoding, with SAT transforms.
11. CTS Results
- CTS gender-dependent crossword models with SAT.
- MFC, MFC+ICSI, and PLP features, trained on 1400 hours of data (SWB 1/2 + Fisher).
- Tested on dev04, multiword bigram decoding.
- Note: the MFC+ICSI features benefit less from discriminative training, probably because the ICSI features are already extracted in a discriminative manner.
12. Full CTS System Results
- Tested with SRI's 20xRT CTS system on dev04.
- The SRI eval system uses an MFC+ICSI front end in one branch and PLP in another branch; the two branches cross-adapt and combine with each other.
- We ran a control experiment without using the ICSI features.
- Again, the system with ICSI features benefits less from discriminative training, which may make the two branches more similar and thus gain less from combination.
13. New Training Criterion: Minimum Phone Frame Error (MPFE)
- Minimum Phone Error (MPE) (Povey & Woodland, 02)
  - Close to the word error rate definition
  - Biased toward hypotheses with fewer phones
  - Low occupancy values, sensitive to data sparseness
- Minimum Phone Frame Error (MPFE) (sketched below)
  - Different hypotheses are more comparable, independent of the number of phones
  - Occupancy values similar to MMI
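A sketch of the difference (our notation; the exact accuracy definition is assumed, not quoted from the slides): MPE weights each hypothesis s by its phone-level accuracy A(s, s_r) against the reference, whereas MPFE scores accuracy per frame,

  A_{frame}(s, s_r) = \sum_t \delta\big( \text{phone of } s \text{ at frame } t, \; \text{phone of } s_r \text{ at frame } t \big)

so every hypothesis is evaluated over the same number of frames regardless of how many phones it contains, and the resulting occupancies are on a per-frame scale comparable to MMI.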
14. MPFE Results: English CTS
- CTS MFC crossword model, trained on 1400 hours (SWB 1/2 + Fisher).
- Tested on dev04: lattices generated with a multiword bigram LM and SAT transforms, then rescored with a 4-gram LM, with results output from consensus decoding.
15. MPFE Results: Different Test Sets
- Same model tested on eval04, eval03, eval02 and eval01.
- Decoding with a bigram to generate lattices, 4-gram rescoring, consensus decoding.
16. MPFE Results: Mandarin CTS
- Mandarin MFCC + pitch features (42 dim), SAT-normalized, gender-independent crossword model.
- Trained on 100 hours of data.
- Tested on dev04, rescoring trigram lattices, with MLLR adaptation to the hypotheses of the previous decoding pass.
17. Summary
- Discriminative training with phone lattices saves time in both lattice generation and statistics collection.
- Alternating the MPE and MMI criteria, combined with I-smoothing, can speed up convergence.
- Minimum Phone Frame Error (MPFE) gives larger word error rate reductions than standard MPE.