Title: SVitchboard
Data: SVitchboard - Small Vocabulary Switchboard
- SVitchboard (King, Bartels & Bilmes, 2005) is a
  collection of small-vocabulary tasks extracted
  from Switchboard 1
- Closed vocabulary: no OOV issues
- Various tasks of increasing vocabulary size (10
  to 500 words)
- Pre-defined train/validation/test sets, and a
  5-fold cross-validation scheme
- Utterance fragments extracted from SWB 1, always
  surrounded by silence
- Word alignments available (from MSState)
- Whole-word HMM baselines already built
- SVitchboard is abbreviated SVB below
SVitchboard: amount of data
SVitchboard: word frequency distributions
SVitchboard: number of words per utterance
SVitchboard: example utterances
- 10 word task
- oh
- right
- oh really
- so
- well the
- 500 word task
- oh how funny
- oh no
- i feel like they need a big home a nice place
  where someone can have the time to play with them
  and things but i can't give them up
- oh
- oh i know it's like the end of the world
- i know i love mine too
SVitchboard: isn't it too easy (or too hard)?
- No (no).
- Results on the 500 word task test set using a
  recent SRI system
- SVitchboard data included in the training set for
  this system
- SRI system has a 50k vocab
- System not tuned to SVB in any way
SVitchboard: what is the point of a 10 word task?
- Originally designed for debugging purposes
- However, results on the 10 and 500 word tasks
obtained in this workshop show good correlation
between WERs on the two tasks
[Figure: scatter plot of WER (%) on the 500 word task
(y-axis, 50-85) against WER (%) on the 10 word task
(x-axis, 15-29)]
SVitchboard: pre-existing baseline word error rates
- Whole-word HMMs trained on SVitchboard
- These results are from King, Bartels & Bilmes, 2005
- Built with HTK
- Use MFCC observations
SVitchboard: experimental technique
- We only performed task 1 of SVitchboard (the
  first of the 5 cross-fold sets)
- Training set is known as ABC
- Validation set is known as D
- Test set is known as E
- SVitchboard defines cross-validation sets
- But these were too big for the very large number
  of experiments we ran
- We mainly used a fixed 500 utterance
  randomly-chosen subset of D, which we call the
  small validation set (a selection sketch follows
  this list)
- All validation set results reported today are on
  this set, unless stated otherwise
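A minimal sketch of drawing such a fixed subset reproducibly; the
list file names and the seed are hypothetical, not part of
SVitchboard:

    import random

    # Utterance IDs of validation set D (hypothetical file name).
    with open("svb_task1_D.list") as f:
        d_utts = [line.strip() for line in f]

    # A fixed seed keeps the small validation set identical across
    # all experiments.
    rng = random.Random(12345)
    small_set = sorted(rng.sample(d_utts, 500))

    with open("svb_task1_D_small.list", "w") as f:
        f.write("\n".join(small_set) + "\n")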
SVitchboard: experimental technique
- SVitchboard includes word alignments
- We found that using these made training
  significantly faster, and gave improved results
  in most cases
- Word alignments are only ever used during
  training
- Results above are for a monophone HMM with PLP
  observations
SVitchboard: workshop baseline word error rates
- Monophone HMMs trained on SVitchboard
- PLP observations
SVitchboard: workshop baseline word error rates
- Triphone HMMs trained on SVitchboard
- PLP observations
- 500 word task only
- (GMTK system was trained without word alignments)
SVitchboard: baseline word error rates summary
- Test set word error rates
gmtkTie
- General parameter clustering and tying tool for
  GMTK
- Written for this workshop
- Currently most developed parts:
  - Decision-tree clustering of Gaussians, using the
    same technique as HTK (sketched after this list)
  - Bottom-up agglomerative clustering
- Decision-tree tying was tested in this workshop
  on various observation models using Gaussians:
  - Conventional triphone models
  - Tandem models, including with factored
    observation streams
  - Feature-based models
- Can tie based on the values of any variables in
  the graph, not just the phone state (e.g. feature
  values)
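A minimal sketch of the HTK-style decision-tree clustering
objective (not gmtkTie's actual code): each question splits a pool
of single diagonal Gaussians, and the split is scored by the gain in
approximate log-likelihood when each side is re-modelled by one
shared Gaussian.

    import numpy as np

    # Each "state" carries an occupancy count plus the mean and
    # variance vectors of its Gaussian; "context" holds whatever
    # user-supplied features the questions ask about.

    def pooled_loglik(states):
        """Approximate log-likelihood of all frames in the pool if
        re-modelled by a single shared diagonal Gaussian."""
        occ = np.array([s["occ"] for s in states])
        mu = np.array([s["mean"] for s in states])
        var = np.array([s["var"] for s in states])
        n = occ.sum()
        m = (occ[:, None] * mu).sum(axis=0) / n
        v = (occ[:, None] * (var + mu**2)).sum(axis=0) / n - m**2
        v = np.maximum(v, 1e-8)  # numerical floor
        return -0.5 * n * np.sum(np.log(2 * np.pi * v) + 1.0)

    def best_split(states, questions):
        """Pick the yes/no question with the largest likelihood
        gain; applied recursively, this grows the decision tree."""
        base = pooled_loglik(states)
        best_name, best_gain = None, 0.0
        for name, q in questions.items():
            yes = [s for s in states if q(s["context"])]
            no = [s for s in states if not q(s["context"])]
            if yes and no:
                gain = pooled_loglik(yes) + pooled_loglik(no) - base
                if gain > best_gain:
                    best_name, best_gain = name, gain
        return best_name, best_gain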
gmtkTie
- gmtkTie is more general than HTK
- HTK asks questions about previous/next phone
  identity
- HTK clusters states only within the same phone
- gmtkTie can ask user-supplied questions about
  user-supplied features, with no assumptions about
  states, triphones, or anything else
- gmtkTie clusters user-defined groups of
  parameters, not just states
- gmtkTie can compute cluster sizes and centroids
  in lots of different ways
- The GMTK/gmtkTie triphone system built in this
  workshop is at least as good as the HTK system
gmtkTie: conclusions
- It works!
- Triphone performance at least as good as HTK
- Can cluster arbitrary groups of parameters,
  asking questions about any feature the user can
  supply
- Later in this presentation, we will see an
  example of separately clustering the Gaussians
  for two observation streams
- Opens up new possibilities for clustering
- Much to explore:
  - Building different decision trees for various
    factorings of the acoustic observation vector
  - Asking questions about other contextual factors
Hybrid models: introduction
- Motivation:
  - Want to use a feature-based representation
  - In previous work, we have successfully recovered
    feature values from continuous speech using
    neural networks (MLPs)
  - MLPs alone are just frame-by-frame classifiers
  - Need some back-end model to decode their output
    into words
- Ways to use such classifiers:
  - Hybrid models
  - Tandem observations
Hybrid models: introduction
- Conventional HMMs generate observations via a
  likelihood p(O|state) or p(O|class) using a
  mixture of Gaussians
- Hybrid models use another classifier (typically
  an MLP) to obtain the posterior P(class|O)
- Dividing by the prior gives a scaled likelihood,
  which can be used directly in the HMM; no
  Gaussians required (see the derivation below)
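The posterior-to-likelihood step is just Bayes' rule; written out:

    \[
    p(O \mid c) \;=\; \frac{P(c \mid O)\, p(O)}{P(c)}
    \;\propto\; \frac{P(c \mid O)}{P(c)}
    \]

Since p(O) does not depend on the class c, it cancels in decoding,
so the MLP posterior divided by the class prior (typically estimated
from the class frequencies of the training targets) can stand in for
the Gaussian likelihood.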
Hybrid models: introduction
- Advantages of hybrid models include:
  - Can easily train the classifier discriminatively
  - Once trained, MLPs will compute P(class|O)
    relatively fast
  - MLPs can use a long window of acoustic input
    frames
  - MLPs don't require the input feature
    distribution to have diagonal covariance (e.g.
    can use filterbank outputs from computational
    auditory scene analysis front-ends)
Hybrid models: standard method
- Standard phone-based hybrid:
  - Train an MLP to classify phonemes, frame by frame
  - Decode the MLP output using simple HMMs for
    smoothing (transition probabilities are easily
    derived from phone duration statistics; we don't
    even need to train them) - sketched below
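A minimal sketch of this decoding step, assuming we already have
per-frame MLP posteriors and class priors; all names and shapes are
illustrative:

    import numpy as np

    def scaled_loglik(posteriors, priors, floor=1e-8):
        """Bayes' rule: p(O|c) is proportional to P(c|O)/P(c).
        posteriors: (T, K) MLP outputs; priors: (K,) class priors."""
        return np.log(np.maximum(posteriors, floor)) - np.log(priors)

    def self_loop_prob(mean_duration_frames):
        """A state with mean duration d frames gets self-loop
        probability 1 - 1/d, derived from duration statistics alone
        (no transition training needed)."""
        return 1.0 - 1.0 / mean_duration_frames

    def viterbi_linear(loglik, self_loop):
        """Viterbi over a left-to-right chain of states (one column
        of loglik per state) with a shared self-loop probability;
        returns the best-path log-score."""
        T, S = loglik.shape
        stay, move = np.log(self_loop), np.log(1.0 - self_loop)
        delta = np.full((T, S), -np.inf)
        delta[0, 0] = loglik[0, 0]
        for t in range(1, T):
            for s in range(S):
                best = delta[t - 1, s] + stay
                if s > 0:
                    best = max(best, delta[t - 1, s - 1] + move)
                delta[t, s] = best + loglik[t, s]
        return delta[-1, -1]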
Hybrid models: our method
- Feature-based hybrid:
  - Use ANNs to classify articulatory features
    instead of phones
  - 8 MLPs, classifying pl1, dg1, etc., frame-by-frame
- One of the motivations for using features is that
  it should be easier to build a multi-lingual /
  cross-language system this way
Hybrid models: using feature-classifying MLPs
[Diagram: DBN in which each feature variable (dg1, pl1, ..., rd)
depends on phoneState via a learned, non-deterministic CPT such as
p(dg1 | phoneState); the MLPs provide virtual evidence at dummy
child variables of the features]
Hybrid models: training the MLPs
- We use MLPs to classify speech into AFs,
  frame-by-frame
- Must obtain targets for training:
  - These are derived from phone labels
  - obtained by forced alignment using the SRI
    recogniser
  - this is less than ideal, but embedded training
    might help (results later)
- MLPs were trained by Joe Frankel (Edinburgh/ICSI)
  and Mathew Magimai (ICSI)
- Standard feedforward MLPs
- Trained using Quicknet
- Input to the nets is a 9-frame window of PLPs
  (with VTLN and per-speaker mean and variance
  normalisation) - sketched below
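A minimal sketch of assembling that input window from per-utterance
PLP features; shapes are illustrative and the per-speaker statistics
are assumed precomputed:

    import numpy as np

    def normalise(plp, speaker_mean, speaker_std):
        """Per-speaker mean and variance normalisation."""
        return (plp - speaker_mean) / speaker_std

    def context_window(plp, radius=4):
        """Stack each frame with its 4 neighbours on either side,
        giving a 9-frame input vector per frame; utterance edges
        are padded by repeating the first/last frame."""
        T, D = plp.shape
        padded = np.pad(plp, ((radius, radius), (0, 0)), mode="edge")
        return np.stack([padded[t:t + 2 * radius + 1].reshape(-1)
                         for t in range(T)])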
Hybrid models: training the MLPs
- Two versions of the MLPs were initially trained:
  - Fisher
    - Trained on all of Fisher, but not on any data
      from Switchboard 1
  - SVitchboard
    - Trained only on the training set of SVB
- The Fisher nets performed better, so were used in
  all hybrid experiments
Hybrid models: MLP details
- MLP architecture is
  input units x hidden units x output units
Hybrid models: MLP overall accuracies
- Frame-level accuracies
- MLPs trained on Fisher
- Accuracy computed with respect to SVB test set
- Silence frames excluded from this calculation
- More detailed analysis coming up later
Hybrid models: experiments
- Using MLPs trained on Fisher using original
  phone-derived targets
- vs.
- Using MLPs retrained on SVB data, which has been
  aligned using one of our models
- Hybrid model
- vs.
- Hybrid model plus PLP observations
Hybrid models: experiments - basic model
- Basic model is trained on activations from the
  original MLPs (Fisher-trained)
- The only parameters in this DBN are the
  conditional probability tables (CPTs) describing
  how each feature depends on the phone state
- Embedded training (a loop sketch follows):
  - Use the model to realign the SVB data (500 word
    task)
  - Starting from the Fisher-trained nets, retrain
    on these new targets
  - Retrain the DBN on the new net activations
[Diagram: the same DBN as before - phoneState with feature
variables dg1, pl1, ..., rd]
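A minimal sketch of that loop; each callable is a hypothetical
stand-in for the real GMTK/Quicknet step, not an actual API:

    def embedded_training(dbn, mlps, svb_data, realign, retrain_mlps,
                          retrain_dbn, n_iters=1):
        """Alternate between realignment and retraining."""
        for _ in range(n_iters):
            # 1. Realign SVB with the current hybrid model to get
            #    new frame-level feature targets.
            targets = realign(dbn, mlps, svb_data)
            # 2. Starting from the Fisher-trained weights, retrain
            #    the MLPs on the new targets.
            mlps = retrain_mlps(mlps, svb_data, targets)
            # 3. Retrain the DBN's CPTs on the new net activations.
            dbn = retrain_dbn(dbn, mlps, svb_data)
        return dbn, mlps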
Hybrid models: 500 word results
Hybrid models: adding in PLPs
- To improve accuracy, we combined the pure hybrid
  model with a standard monophone model
- Can/must weight the contribution of the PLPs
- Used a global weight on each of the 8 virtual
  evidences, and a fixed weight of 1.0 on the PLPs
  (sketched below)
- Weight tuning worked best if done both during
  training and decoding
- Computationally expensive: must train and
  cross-validate many different systems
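A minimal sketch of how the per-frame, per-state log-score is
assembled under this weighting scheme; names are illustrative:

    import numpy as np

    def combined_logscore(ve_logliks, plp_loglik, ve_weights):
        """ve_logliks: (8,) virtual-evidence log-likelihoods from
        the feature MLPs; plp_loglik: Gaussian-mixture
        log-likelihood of the PLPs (fixed weight 1.0);
        ve_weights: (8,) tuned global weights."""
        return float(np.dot(ve_weights, ve_logliks) + 1.0 * plp_loglik)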
Hybrid models: adding PLPs
[Diagram: the DBN above, extended with a PLP observation depending
on phoneState; each feature (dg1, pl1, ..., rd) still has a learned,
non-deterministic CPT p(feature | phoneState), and the MLP
likelihoods enter at dummy variables (implemented via virtual
evidence in GMTK)]
Hybrid models: weighting virtual evidence vs PLPs
Hybrid models: experiments - basic model + PLP
- Basic model is augmented with PLP observations
- Generated from mixtures of Gaussians, initialised
  from a conventional monophone model
- A big improvement over the hybrid-only model
- A small improvement over the PLP-only monophone
  model
[Diagram: the DBN with phoneState, feature variables dg1, pl1, ...,
rd, and PLP observations]
Hybrid experiments: conclusions
- Hybrid models perform reasonably well, but not
  yet as well as conventional models
- But they have fewer parameters to be trained
- So they may be a viable approach for small
  databases:
  - Train MLPs on a large database (e.g. Fisher)
  - Train the hybrid model on the small database
  - Cross-language??
- Embedded training gives good improvements for the
  pure hybrid model
- Hybrid models augmented with PLPs perform better
  than baseline PLP-only models
- But the improvement is only small
- The best way to use the MLPs trained on Fisher
  might be to construct tandem observation vectors
Using MLPs to transfer knowledge from larger
databases
- Scenario:
  - We need to build a system for a
    domain/accent/language for which we have only a
    small amount of data
  - We have lots of data from other
    domains/accents/languages
- Method:
  - Train an MLP on the large database
  - Use it in either a hybrid or a tandem system in
    the target domain
Using MLPs to transfer knowledge from larger
databases
- Articulatory features
- It is plausible that training MLPs to be AF
  classifiers could be more accent/language
  independent than training them to be phone
  classifiers
- Tandem results coming up shortly will show that,
  across very similar domains (Fisher to SVB), AF
  nets perform as well as or better than phone nets
Hybrid models vs tandem observations
- Standard hybrid:
  - Train an MLP to classify phonemes, frame by frame
  - Decode the MLP output using simple HMMs
    (transition probabilities are easily derived
    from phone duration statistics; we don't even
    need to train them)
- Standard tandem (sketched below):
  - Instead of using the MLP output to directly
    obtain the likelihood, just use it as a feature
    vector, after some transformations (e.g. taking
    logs) and dimensionality reduction
  - Append the resulting features to standard
    features, e.g. PLPs or MFCCs
  - Use this vector as the observation for a
    standard HMM with a mixture-of-Gaussians
    observation model
  - Currently used in state-of-the-art systems such
    as SRI's
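A minimal sketch of that tandem feature construction; the PCA
transform is assumed to have been estimated on training data, and
all names are illustrative:

    import numpy as np

    def tandem_features(posteriors, plps, pca_mean, pca_basis,
                        floor=1e-8):
        """posteriors: (T, K) MLP outputs; plps: (T, D) standard
        features; pca_basis: (K, R) projection onto the top R
        principal components."""
        logp = np.log(np.maximum(posteriors, floor))  # log transform
        reduced = (logp - pca_mean) @ pca_basis       # reduce dims
        return np.hstack([plps, reduced])             # append to PLPs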
- but first, a look at structural modifications . . .