Title: Explorations in Discriminative Training
1. Explorations in Discriminative Training
- Jing Zheng
- Andreas Stolcke
- Speech Technology Research Laboratory
- SRI International, Menlo Park, CA
2. Talk Outline
- Motivation
- Review of training criteria
- Denominator lattice generation
- Exploring training procedures
- Minimum Phone Frame Error (MPFE) criterion
- Summary
3. Motivation
- Available training data increased significantly this year. To keep discriminative training practical we need to speed up:
  - Denominator lattice generation
  - Statistics collection
  - Convergence rate
- Explore different discriminative training approaches, especially:
  - MMI
  - MPE
  - Alternatives
4. Discriminative Training Criteria
- Maximum Likelihood Criterion (ML)
- Maximum Mutual Information Criterion (MMI)
- Minimum Phone Error Criterion (MPE)
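For reference, a sketch of the standard forms of these objectives (usual lattice-based notation, not taken from the slides), with O_r the r-th training utterance, s_r its reference transcription, P(s) the language-model prior, \kappa an acoustic scale, and A(s, s_r) the phone accuracy of hypothesis s against the reference:

  F_{ML}(\lambda)  = \sum_r \log p_\lambda(O_r \mid s_r)
  F_{MMI}(\lambda) = \sum_r \log \frac{ p_\lambda(O_r \mid s_r)^\kappa \, P(s_r) }{ \sum_s p_\lambda(O_r \mid s)^\kappa \, P(s) }
  F_{MPE}(\lambda) = \sum_r \frac{ \sum_s p_\lambda(O_r \mid s)^\kappa \, P(s) \, A(s, s_r) }{ \sum_s p_\lambda(O_r \mid s)^\kappa \, P(s) }

All three are maximized during training; MPE maximizes the expected phone accuracy over the competing hypotheses.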
5. Estimation Algorithms
- Maximum Likelihood (ML)
  - Baum-Welch algorithm (EM) (Dempster et al., 77)
- Maximum Mutual Information (MMIE)
  - Extended Baum-Welch (EBW) (Normandin, 91)
- Minimum Phone Error (MPE)
  - Adapted EBW with I-smoothing (Povey & Woodland, 02)
6. EBW and I-smoothing
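A sketch of the standard updates behind this slide (following Povey & Woodland, 02; notation is ours): for Gaussian m, with numerator/denominator occupancies \gamma and first/second-order statistics \theta(O), \theta(O^2), the EBW mean and variance updates are

  \hat{\mu}_m = \frac{ \theta^{num}_m(O) - \theta^{den}_m(O) + D_m \mu_m }{ \gamma^{num}_m - \gamma^{den}_m + D_m }
  \hat{\sigma}^2_m = \frac{ \theta^{num}_m(O^2) - \theta^{den}_m(O^2) + D_m (\sigma^2_m + \mu^2_m) }{ \gamma^{num}_m - \gamma^{den}_m + D_m } - \hat{\mu}^2_m

where D_m is a per-Gaussian smoothing constant. I-smoothing adds \tau points of prior (e.g. ML) statistics to the numerator:

  \gamma^{num}_m \to \gamma^{num}_m + \tau, \qquad \theta^{num}_m(O) \to \theta^{num}_m(O) + \tau \, \mu^{prior}_m \quad (\text{and similarly for } O^2 \text{ using the prior second moment})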
7. Prior Approaches to Denominator Lattice Generation
- Word lattices with phone boundaries (Valtchev et al., 1997)
  - Approach: use a bigram/unigram LM to generate word lattices, then mark model boundaries. Collect statistics from the lattices, restricted by the phone boundaries.
  - Pros: statistics collection is fast, especially with multiple iterations.
  - Cons: takes large disk space; slow to generate lattices with rich alternative hypotheses.
- Implicit-lattice MMI (Huang et al., 2002)
  - Approach: drop word information, compile the pronunciation dictionary into a compact HMM state network with unigram LM probabilities encoded. Collect statistics via a forward-backward pass over the whole network.
  - Pros: no need to generate lattices; saves a lot of disk space.
  - Cons: statistics collection is slower; non-trivial to extend to MPE.
8. New Denominator Lattice Generation
- Drop word information; compile the pronunciation dictionary into a determinized/minimized phone network, encoding unigram probabilities and pronunciation probabilities.
- Generate phone lattices by decoding from the network; each arc is associated with a context-dependent phone and start/end times.
- Use !NULL links to reduce lattice size and computation.
- Collect statistics constrained by the phone start and end times (see the sketch after this list).
- Pros
  - Fast lattice generation; efficient representation of alternative phone hypotheses.
  - Fast statistics collection, even faster than with word lattices due to less arc overlap.
  - Can be used for both MMI and MPE training.
- Cons
  - Still takes a lot of disk space, though less than word lattices with equal richness of alternative hypotheses.
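To make the constrained statistics collection concrete, here is a minimal sketch (assumed data structures and names, not SRI's implementation) of turning phone-lattice arc posteriors into per-frame denominator occupancies, where each arc only ever touches the frames between its fixed start and end times:

# Minimal sketch (assumed structure, not SRI's code): denominator statistics
# from a phone lattice whose arcs carry a context-dependent phone label and
# fixed start/end frames.  Nodes are assumed to be numbered in topological
# order; arc.logp is the combined acoustic + scaled unigram LM log-score.
import math
from collections import defaultdict, namedtuple

Arc = namedtuple("Arc", "src dst phone t_start t_end logp")

def logadd(a, b):
    """log(exp(a) + exp(b)) without overflow."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def phone_frame_occupancies(arcs, start_node, end_node):
    """Return gamma[(phone, frame)]: posterior occupancy summed over arcs."""
    # Forward pass: alpha[n] = log-prob of all partial paths reaching node n.
    alpha = defaultdict(lambda: -math.inf)
    alpha[start_node] = 0.0
    for a in sorted(arcs, key=lambda a: a.src):
        alpha[a.dst] = logadd(alpha[a.dst], alpha[a.src] + a.logp)
    # Backward pass: beta[n] = log-prob of all path suffixes leaving node n.
    beta = defaultdict(lambda: -math.inf)
    beta[end_node] = 0.0
    for a in sorted(arcs, key=lambda a: a.dst, reverse=True):
        beta[a.src] = logadd(beta[a.src], a.logp + beta[a.dst])
    total = alpha[end_node]
    # Because start/end times are fixed, each arc contributes only to the
    # frames it spans; !NULL arcs (phone=None, t_start == t_end) drop out.
    gamma = defaultdict(float)
    for a in arcs:
        post = math.exp(alpha[a.src] + a.logp + beta[a.dst] - total)
        for t in range(a.t_start, a.t_end):
            gamma[(a.phone, t)] += post
    return gamma

Fixed arc boundaries mean the inner loop only visits an arc's own frames, and fewer overlapping arcs per frame than in a word lattice means less total work per utterance, which is where the statistics-collection speedup comes from.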
9. Exploring Training Procedures
- Setup
  - For each iteration, we collect statistics for both MMI and MPE training simultaneously. We can therefore estimate the following different models (see the note below on the criterion/prior naming):
    - MMI model
    - MPE model
    - MMI model with MLE prior using I-smoothing (MMI/MLE)
    - MPE model with MLE prior using I-smoothing (MPE/MLE)
    - MPE model with MMI prior using I-smoothing (MPE/MMI)
    - MMI model with MPE prior using I-smoothing (MMI/MPE)
- Observations
  - I-smoothing generally helps.
  - The best model is not necessarily of the same type as the seed model used to generate the statistics; model evaluation and selection is needed.
  - Alternating MMI and MPE (with priors) usually speeds up training convergence (denoted MPE+MMI).
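In terms of the I-smoothing update sketched under slide 6, one way to read the criterion/prior naming above (an interpretation, not a formula from the slides) is: train with criterion X, but smooth the numerator statistics toward a model estimated with criterion Y,

  \theta^{num}_m(O) \to \theta^{num}_m(O) + \tau \, \mu^{Y}_m, \qquad Y \in \{ \mathrm{MLE}, \mathrm{MMI}, \mathrm{MPE} \}

so MPE/MMI, for example, is MPE training whose I-smoothing backs off to the MMI estimate instead of the ML one.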
10. Broadcast News Results
- Broadcast News crossword PLP model, trained on 400 hours of data (Hub4 96+97 and TDT4).
- Tested on dev03, trigram decoding, with SAT transforms.
11. CTS Results
- CTS gender-dependent crossword models with SAT.
- MFC, MFC+ICSI, and PLP features, trained on 1400 hours of data (SWB 1/2 + Fisher).
- Tested on dev04, multiword bigram decoding.
- Note: the MFC+ICSI features benefit less from discriminative training, probably because the ICSI features are already extracted in a discriminative manner.
12. Full CTS System Results
- Tested with SRI's 20xRT CTS system on dev04.
- The SRI eval system uses an MFC+ICSI front end in one branch and PLP in another branch; the two branches cross-adapt and combine with each other.
- We ran a control experiment without using the ICSI features.
- Again, the system with ICSI features benefits less from discriminative training, which may make the two branches more similar and thus gain less from combination.
13. New Training Criterion: Minimum Phone Frame Error (MPFE)
- Minimum Phone Error (MPE) (Povey & Woodland, 02)
  - Close to the word error rate definition
  - Biased toward hypotheses with fewer phones
  - Low occupancy values, sensitive to data sparseness
- Minimum Phone Frame Error (MPFE) (sketched below)
  - Different hypotheses are more comparable, independent of the number of phones
  - Occupancy values similar to MMI
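A sketch of the difference (our notation; the exact accuracy definition is assumed, not quoted from the slides): MPE weights each hypothesis s by its phone-level accuracy A(s, s_r) against the reference, whereas MPFE scores accuracy per frame,

  A_{frame}(s, s_r) = \sum_t \delta\big( \text{phone of } s \text{ at frame } t, \; \text{phone of } s_r \text{ at frame } t \big)

so every hypothesis is evaluated over the same number of frames regardless of how many phones it contains, and the resulting occupancies are on a per-frame scale comparable to MMI.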
14. MPFE Results: English CTS
- CTS MFC crossword model, trained on 1400 hours (SWB 1/2 + Fisher).
- Tested on dev04: lattices generated with a multiword bigram LM and SAT transforms, then rescored with a 4-gram LM, with results output from consensus decoding.
15. MPFE Results: Different Test Sets
- Same model tested on eval04, eval03, eval02 and eval01.
- Decoding with a bigram to generate lattices, 4-gram rescoring, consensus decoding.
16. MPFE Results: Mandarin CTS
- Mandarin MFCC + pitch features (42 dim), SAT-normalized, gender-independent crossword model.
- Trained on 100 hours of data.
- Tested on dev04, rescoring trigram lattices, with MLLR adaptation to the hypotheses of the previous decoding pass.
17. Summary
- Discriminative training with phone lattices saves time in both lattice generation and statistics collection.
- Alternating the MPE and MMI criteria, combined with I-smoothing, can speed up convergence.
- Minimum Phone Frame Error (MPFE) gives larger word error rate reductions than standard MPE.