1
Explorations in Discriminative Training
  • Jing Zheng
  • Andreas Stolcke
  • Speech Technology Research Laboratory
  • SRI International, Menlo Park, CA

2
Talk Outline
  • Motivation
  • Review of training criteria
  • Denominator lattice generation
  • Exploring training procedures
  • Minimum Phone Frame Error (MPFE) criterion
  • Summary

3
Motivation
  • The amount of available training data increased
    significantly this year. To keep discriminative
    training practical, we need to speed up:
  • Denominator lattice generation
  • Statistics collection
  • Convergence rate
  • Explore different discriminative training
    approaches, especially:
  • MMI
  • MPE
  • Alternatives

4
Discriminative Training Criteria
  • Maximum Likelihood Criterion (ML)
  • Maximum Mutual Information Criterion (MMI)
  • Minimum Phone Error Criterion (MPE)
    (standard forms sketched below)
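
The slide's formulas are not preserved in this
transcript; the following is a hedged LaTeX sketch of
the three criteria in their standard forms. The
notation (lambda for the HMM parameters, O_r for the
r-th training utterance, s_r for its reference
transcription, P(s) for the language-model probability,
A(s, s_r) for the raw phone accuracy of hypothesis s)
is assumed, not taken from the slide.

```latex
% Hedged sketch of the standard criteria; notation assumed.
\begin{align*}
  F_{\mathrm{ML}}(\lambda)  &= \sum_r \log p_\lambda(O_r \mid s_r) \\
  F_{\mathrm{MMI}}(\lambda) &= \sum_r \log
      \frac{p_\lambda(O_r \mid s_r)\, P(s_r)}
           {\sum_s p_\lambda(O_r \mid s)\, P(s)} \\
  F_{\mathrm{MPE}}(\lambda) &= \sum_r
      \frac{\sum_s p_\lambda(O_r \mid s)\, P(s)\, A(s, s_r)}
           {\sum_{s'} p_\lambda(O_r \mid s')\, P(s')}
\end{align*}
```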

5
Estimation Algorithms
  • Maximum Likelihood (ML)
  • Baum-Welch algorithm (EM)
  • (Dempster et al., 77)
  • Maximum Mutual Information (MMIE)
  • Extended Baum-Welch (EBW)
  • (Normandin, 91)
  • Minimum Phone Error (MPE)
  • Adapted EBW with I-smoothing
  • (Povey & Woodland, 02)

6
EBW and I-smoothing
  • EBW updates (see the sketch below)
  • I-smoothing (see the sketch below)
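
A hedged sketch of the EBW mean update and I-smoothing
in the usual formulation of (Povey & Woodland, 02); the
symbols (occupancies gamma, first-order statistics
theta(O), per-Gaussian constant D_jm, I-smoothing
constant tau) are assumed rather than copied from the
slide.

```latex
% Hedged sketch in the usual EBW formulation; notation assumed.
\begin{align*}
  % EBW mean update for Gaussian m of state j:
  \hat{\mu}_{jm} &=
    \frac{\theta^{\mathrm{num}}_{jm}(O) - \theta^{\mathrm{den}}_{jm}(O)
          + D_{jm}\,\mu_{jm}}
         {\gamma^{\mathrm{num}}_{jm} - \gamma^{\mathrm{den}}_{jm} + D_{jm}} \\[4pt]
  % I-smoothing: add tau "points" of prior (e.g. ML) statistics
  % to the numerator occupancy and first-order statistics:
  \gamma^{\mathrm{num}}_{jm} &\leftarrow \gamma^{\mathrm{num}}_{jm} + \tau,
  \qquad
  \theta^{\mathrm{num}}_{jm}(O) \leftarrow
    \theta^{\mathrm{num}}_{jm}(O) + \tau\,\mu^{\mathrm{prior}}_{jm}
\end{align*}
```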

7
Prior Approaches to Denominator Lattice Generation
  • Word Lattices with phone boundaries (Valtchev et
    al., 1997)
  • Approach uses bigram/unigram LM to generate word
    lattices, then marks model boundaries. Collects
    statistics from the lattices restricted by phone
    boundaries.
  • Pros: statistics collection is fast, especially
    with multiple iterations.
  • Cons: takes large disk space; slow to generate
    lattices with rich alternative hypotheses.
  • Implicit-lattice MMI (Huang et al., 2002)
  • Approach drops word information, compiles
    pronunciation dictionary into compact HMM state
    network, with unigram LM probabilities encoded.
    Collects statistics via forward-backward pass on
    the whole network.
  • Pros: no need to generate lattices, saving a lot
    of disk space.
  • Cons: statistics collection is slower;
    non-trivial to extend to MPE.

8
New Denominator Lattice Generation
  • Drop word information, compile pronunciation
    dictionary into determinized / minimized phone
    network, encoding unigram probabilities and
    pronunciation probabilities.
  • Generate phone lattices by decoding from the
    network, each arc associated with a
    context-dependent phone and start/end times.
  • Use !NULL links to reduce lattice size and
    computation.
  • Collect statistics constrained by phone start and
    end times (sketched after this list).
  • Pros
  • Fast lattice generation; efficient representation
    of alternative phone hypotheses.
  • Fast statistics collection, even faster than with
    word lattices because of less arc overlap.
  • Can be used for both MMI and MPE training.
  • Cons
  • Still takes a lot of disk space, though less than
    word lattices with equal richness of alternative
    hypotheses.
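
To make the statistics-collection step concrete, below
is a minimal Python sketch (not SRI's implementation)
of an arc-level forward-backward pass over such a phone
lattice; the Arc fields, the topological-ordering
assumption, and the placeholder scores are all
illustrative.

```python
# Hedged sketch (not SRI's code): arc-level forward-backward over a phone
# lattice.  Each arc carries a context-dependent phone and fixed start/end
# frames, so per-frame statistics are later accumulated only inside the
# arc's own time span (the phone-boundary constraint described above).
from dataclasses import dataclass
import math

@dataclass
class Arc:
    src: int          # source lattice node
    dst: int          # destination lattice node
    phone: str        # context-dependent phone label
    start: int        # first frame covered by the arc
    end: int          # one past the last frame
    log_score: float  # acoustic + unigram LM + pronunciation log score

def logadd(a: float, b: float) -> float:
    """log(exp(a) + exp(b)) without overflow."""
    if a == float("-inf"):
        return b
    if b == float("-inf"):
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def arc_log_posteriors(arcs, n_nodes, start_node, end_node):
    """Arcs must be topologically sorted by node order."""
    alpha = [float("-inf")] * n_nodes
    beta = [float("-inf")] * n_nodes
    alpha[start_node] = 0.0
    beta[end_node] = 0.0
    for arc in arcs:                       # forward pass
        alpha[arc.dst] = logadd(alpha[arc.dst],
                                alpha[arc.src] + arc.log_score)
    for arc in reversed(arcs):             # backward pass
        beta[arc.src] = logadd(beta[arc.src],
                               beta[arc.dst] + arc.log_score)
    total = alpha[end_node]
    # Posterior of each arc; Gaussian occupancies are then accumulated
    # only for frames in range(arc.start, arc.end), weighted by this value.
    return [alpha[a.src] + a.log_score + beta[a.dst] - total for a in arcs]
```

Because each arc's frames are fixed, the per-frame
accumulation never has to consider overlapping time
alignments, which is where the speedup over word
lattices comes from.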

9
Exploring Training Procedures
  • Setup
  • In each iteration, we collect statistics for
    both MMI and MPE training simultaneously. We can
    therefore estimate the following models:
  • MMI model
  • MPE model
  • MMI model with MLE prior using I-smoothing
    (MMI/MLE)
  • MPE model with MLE prior using I-smoothing
    (MPE/MLE)
  • MPE model with MMI prior using I-smoothing
    (MPE/MMI)
  • MMI model with MPE prior using I-smoothing
    (MMI/MPE)
  • Observations
  • I-smoothing generally helps.
  • The best model is not necessarily of the same type
    as the seed model used to generate the statistics;
    model evaluation and selection are needed.
  • Alternating MMI and MPE (with priors) usually
    speeds up training convergence (denoted MPEMMI);
    see the sketch after this list.
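
A hedged Python sketch of the evaluate-and-select
schedule just described; collect_stats, update_mmi,
update_mpe, and wer_on_dev are placeholder hooks, not
names from the SRI training tools.

```python
# Hedged sketch of the evaluate-and-select schedule described above;
# all of the callables are placeholders standing in for EBW/I-smoothing
# re-estimation and dev-set scoring.
def train_alternating(seed_model, n_iters, collect_stats,
                      update_mmi, update_mpe, wer_on_dev):
    model = seed_model
    for _ in range(n_iters):
        # One pass over the training data yields MMI and MPE
        # statistics simultaneously.
        mmi_stats, mpe_stats = collect_stats(model)

        # Candidate models under each criterion, each I-smoothed toward
        # a prior model; these loosely mirror the MMI/MLE, MPE/MLE,
        # MPE/MMI, and MMI/MPE variants listed above.
        mmi_model = update_mmi(mmi_stats, prior=model)
        mpe_model = update_mpe(mpe_stats, prior=model)
        candidates = [mmi_model, mpe_model,
                      update_mpe(mpe_stats, prior=mmi_model),
                      update_mmi(mmi_stats, prior=mpe_model)]

        # The best model need not share the seed's criterion, so pick by
        # dev-set WER; re-seeding with it alternates MMI and MPE updates.
        model = min(candidates, key=wer_on_dev)
    return model
```

Selecting on a dev set each iteration is what lets an
MMI-estimated model seed an MPE update and vice versa.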

10
Broadcast News Results
  • Broadcast News crossword PLP model, trained on
    400 hours of data (Hub4 96+97 and TDT4).
  • Tested on dev03, trigram decoding, with SAT
    transforms.

11
CTS Results
  • CTS gender-dependent crossword models with SAT.
  • MFC, MFC+ICSI, and PLP features, trained on 1400
    hours of data (SWB + ½ Fisher).
  • Tested on dev04, multiword bigram decoding.

Note: MFC+ICSI features benefit less from
discriminative training, probably because the ICSI
features are already extracted in a discriminative
manner.
12
Full CTS System Results
  • Tested with SRI's 20xRT CTS system on dev04.
  • The SRI eval system uses an MFC+ICSI feature front
    end in one branch and PLP in another; the two
    branches cross-adapt and are combined.
  • We ran a control experiment without ICSI features.

Again, the system with ICSI features benefits less
from discriminative training; this may make the two
branches more similar and thus reduce the gain from
combination.
13
New Training Criterion: Minimum Phone Frame Error
(MPFE)
  • Minimum Phone Error (MPE)
  • (Povey & Woodland, 02)
  • Close to word error rate definition
  • Biased toward hypotheses with fewer phones
  • Low occupancy values, sensitive to data
    sparseness
  • Minimum Phone Frame Error (MPFE)
  • Different hypotheses are more comparable,
    independent of the number of phones (contrasted
    with MPE in the sketch after this list)
  • Occupancy values similar to MMI
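
A hedged sketch contrasting the accuracy terms inside
an MPE-style objective; the frame-level form is one
reading of the MPFE idea, with q ranging over the phone
arcs of hypothesis s and [b_q, e_q) their frame spans.

```latex
% Hedged sketch; the exact definitions on the slide are not preserved.
\begin{align*}
  % MPE: one accuracy term per hypothesized phone arc,
  % so hypotheses with fewer phones contribute fewer terms
  A_{\mathrm{MPE}}(s, s_r)  &= \sum_{q \in s} \mathrm{PhoneAcc}(q, s_r) \\
  % MPFE: one term per frame, so the number of terms is fixed by the
  % utterance length and hypotheses of different lengths stay comparable
  A_{\mathrm{MPFE}}(s, s_r) &= \sum_{q \in s} \sum_{t = b_q}^{e_q - 1}
      \delta\bigl(\mathrm{phone}(q) = \mathrm{phone}_{s_r}(t)\bigr)
\end{align*}
```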

14
MPFE Results: English CTS
  • CTS MFC crossword model, trained on 1400 hours
    (SWB + ½ Fisher).
  • Tested on dev04: lattices generated with a
    multiword bigram LM and SAT transforms, then
    rescored with a 4-gram LM, with results output
    from consensus decoding.

15
MPFE Results: Different Test Sets
  • Same model tested on eval04, eval03, eval02, and
    eval01.
  • Decoding with bigram lattice generation, 4-gram
    rescoring, and consensus decoding.

16
MPFE Results: Mandarin CTS
  • Mandarin MFCC + pitch features (42 dimensions),
    SAT-normalized, gender-independent crossword
    model.
  • Trained on 100 hours of data.
  • Tested on dev04: trigram lattice rescoring, with
    MLLR adaptation to the hypotheses of the previous
    decoding pass.

17
Summary
  • Discriminative training with phone lattices saves
    time for both lattice generation and statistics
    collection.
  • Alternating the MPE and MMI criteria, combined
    with I-smoothing, can speed up convergence.
  • Minimum Phone Frame Error (MPFE) gives more word
    error rate reduction than standard MPE.