Title: SVitchboard
Data: SVitchboard - Small Vocabulary Switchboard
- SVitchboard (King, Bartels & Bilmes, 2005) is a
  collection of small-vocabulary tasks extracted
  from Switchboard 1
- Closed vocabulary: no OOV issues
- Various tasks of increasing vocabulary size (10
  to 500 words)
- Pre-defined train/validation/test sets, and a
  5-fold cross-validation scheme
- Utterance fragments extracted from SWB 1, always
  surrounded by silence
- Word alignments available (from MSState)
- Whole-word HMM baselines already built
- SVitchboard is abbreviated SVB below
SVitchboard: amount of data
SVitchboard: word frequency distributions
SVitchboard: number of words per utterance
SVitchboard: example utterances
- 10 word task
- oh
- right
- oh really
- so
- well the
- 500 word task
- oh how funny
- oh no
- i feel like they need a big home a nice place
  where someone can have the time to play with them
  and things but i can't give them up
- oh
- oh i know it's like the end of the world
- i know i love mine too
SVitchboard: isn't it too easy (or too hard)?
- No (no).
- Results on the 500 word task test set using a
  recent SRI system
- SVitchboard data included in the training set for
  this system
- SRI system has a 50k vocab
- System not tuned to SVB in any way
SVitchboard: what is the point of a 10 word task?
- Originally designed for debugging purposes
- However, results on the 10 and 500 word tasks
obtained in this workshop show good correlation
between WERs on the two tasks
[Figure: scatter plot of WER (%) on the 500 word task
(y-axis, 50-85) against WER (%) on the 10 word task
(x-axis, 15-29)]
SVitchboard: pre-existing baseline word error rates
- Whole-word HMMs trained on SVitchboard
- These results are from King, Bartels & Bilmes, 2005
- Built with HTK
- Use MFCC observations
SVitchboard: experimental technique
- We only performed task 1 of SVitchboard (the
  first of the 5 cross-fold sets)
- Training set is known as ABC
- Validation set is known as D
- Test set is known as E
- SVitchboard defines cross-validation sets
- But these were too big for the very large number
  of experiments we ran
- We mainly used a fixed 500 utterance
  randomly-chosen subset of D, which we call the
  small validation set (a selection sketch follows
  this list)
- All validation set results reported today are on
  this set, unless stated otherwise
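A minimal sketch of drawing such a fixed subset reproducibly; the
list file names and the seed are hypothetical, not part of
SVitchboard:

    import random

    # Utterance IDs of validation set D (hypothetical file name).
    with open("svb_task1_D.list") as f:
        d_utts = [line.strip() for line in f]

    # A fixed seed keeps the small validation set identical across
    # all experiments.
    rng = random.Random(12345)
    small_set = sorted(rng.sample(d_utts, 500))

    with open("svb_task1_D_small.list", "w") as f:
        f.write("\n".join(small_set) + "\n")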
SVitchboard: experimental technique
- SVitchboard includes word alignments
- We found that using these made training
  significantly faster, and gave improved results
  in most cases
- Word alignments are only ever used during
  training
- Results above are for a monophone HMM with PLP
  observations
SVitchboard: workshop baseline word error rates
- Monophone HMMs trained on SVitchboard
- PLP observations
SVitchboard: workshop baseline word error rates
- Triphone HMMs trained on SVitchboard
- PLP observations
- 500 word task only
- (GMTK system was trained without word alignments)
SVitchboard: baseline word error rates summary
- Test set word error rates
gmtkTie
- General parameter clustering and tying tool for
  GMTK
- Written for this workshop
- Currently most developed parts:
  - Decision-tree clustering of Gaussians, using the
    same technique as HTK (sketched after this list)
  - Bottom-up agglomerative clustering
- Decision-tree tying was tested in this workshop
  on various observation models using Gaussians:
  - Conventional triphone models
  - Tandem models, including with factored
    observation streams
  - Feature-based models
- Can tie based on the values of any variables in
  the graph, not just the phone state (e.g. feature
  values)
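A minimal sketch of the HTK-style decision-tree clustering
objective (not gmtkTie's actual code): each question splits a pool
of single diagonal Gaussians, and the split is scored by the gain in
approximate log-likelihood when each side is re-modelled by one
shared Gaussian.

    import numpy as np

    # Each "state" carries an occupancy count plus the mean and
    # variance vectors of its Gaussian; "context" holds whatever
    # user-supplied features the questions ask about.

    def pooled_loglik(states):
        """Approximate log-likelihood of all frames in the pool if
        re-modelled by a single shared diagonal Gaussian."""
        occ = np.array([s["occ"] for s in states])
        mu = np.array([s["mean"] for s in states])
        var = np.array([s["var"] for s in states])
        n = occ.sum()
        m = (occ[:, None] * mu).sum(axis=0) / n
        v = (occ[:, None] * (var + mu**2)).sum(axis=0) / n - m**2
        v = np.maximum(v, 1e-8)  # numerical floor
        return -0.5 * n * np.sum(np.log(2 * np.pi * v) + 1.0)

    def best_split(states, questions):
        """Pick the yes/no question with the largest likelihood
        gain; applied recursively, this grows the decision tree."""
        base = pooled_loglik(states)
        best_name, best_gain = None, 0.0
        for name, q in questions.items():
            yes = [s for s in states if q(s["context"])]
            no = [s for s in states if not q(s["context"])]
            if yes and no:
                gain = pooled_loglik(yes) + pooled_loglik(no) - base
                if gain > best_gain:
                    best_name, best_gain = name, gain
        return best_name, best_gain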
gmtkTie
- gmtkTie is more general than HTK
- HTK asks questions about previous/next phone
  identity
- HTK clusters states only within the same phone
- gmtkTie can ask user-supplied questions about
  user-supplied features, with no assumptions about
  states, triphones, or anything else
- gmtkTie clusters user-defined groups of
  parameters, not just states
- gmtkTie can compute cluster sizes and centroids
  in lots of different ways
- The GMTK/gmtkTie triphone system built in this
  workshop is at least as good as the HTK system
gmtkTie: conclusions
- It works!
- Triphone performance at least as good as HTK
- Can cluster arbitrary groups of parameters,
  asking questions about any feature the user can
  supply
- Later in this presentation, we will see an
  example of separately clustering the Gaussians
  for two observation streams
- Opens up new possibilities for clustering
- Much to explore:
  - Building different decision trees for various
    factorings of the acoustic observation vector
  - Asking questions about other contextual factors
Hybrid models: introduction
- Motivation:
  - Want to use a feature-based representation
  - In previous work, we have successfully recovered
    feature values from continuous speech using
    neural networks (MLPs)
  - MLPs alone are just frame-by-frame classifiers
  - Need some back-end model to decode their output
    into words
- Ways to use such classifiers:
  - Hybrid models
  - Tandem observations
Hybrid models: introduction
- Conventional HMMs generate observations via a
  likelihood p(O|state) or p(O|class) using a
  mixture of Gaussians
- Hybrid models use another classifier (typically
  an MLP) to obtain the posterior P(class|O)
- Dividing by the prior gives a scaled likelihood,
  which can be used directly in the HMM; no
  Gaussians required (see the derivation below)
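The posterior-to-likelihood step is just Bayes' rule; written out:

    \[
    p(O \mid c) \;=\; \frac{P(c \mid O)\, p(O)}{P(c)}
    \;\propto\; \frac{P(c \mid O)}{P(c)}
    \]

Since p(O) does not depend on the class c, it cancels in decoding,
so the MLP posterior divided by the class prior (typically estimated
from the class frequencies of the training targets) can stand in for
the Gaussian likelihood.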
Hybrid models: introduction
- Advantages of hybrid models include:
  - Can easily train the classifier discriminatively
  - Once trained, MLPs will compute P(class|O)
    relatively fast
  - MLPs can use a long window of acoustic input
    frames
  - MLPs don't require the input feature
    distribution to have diagonal covariance (e.g.
    can use filterbank outputs from computational
    auditory scene analysis front-ends)
Hybrid models: standard method
- Standard phone-based hybrid:
  - Train an MLP to classify phonemes, frame by frame
  - Decode the MLP output using simple HMMs for
    smoothing (transition probabilities are easily
    derived from phone duration statistics; we don't
    even need to train them) - sketched below
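A minimal sketch of this decoding step, assuming we already have
per-frame MLP posteriors and class priors; all names and shapes are
illustrative:

    import numpy as np

    def scaled_loglik(posteriors, priors, floor=1e-8):
        """Bayes' rule: p(O|c) is proportional to P(c|O)/P(c).
        posteriors: (T, K) MLP outputs; priors: (K,) class priors."""
        return np.log(np.maximum(posteriors, floor)) - np.log(priors)

    def self_loop_prob(mean_duration_frames):
        """A state with mean duration d frames gets self-loop
        probability 1 - 1/d, derived from duration statistics alone
        (no transition training needed)."""
        return 1.0 - 1.0 / mean_duration_frames

    def viterbi_linear(loglik, self_loop):
        """Viterbi over a left-to-right chain of states (one column
        of loglik per state) with a shared self-loop probability;
        returns the best-path log-score."""
        T, S = loglik.shape
        stay, move = np.log(self_loop), np.log(1.0 - self_loop)
        delta = np.full((T, S), -np.inf)
        delta[0, 0] = loglik[0, 0]
        for t in range(1, T):
            for s in range(S):
                best = delta[t - 1, s] + stay
                if s > 0:
                    best = max(best, delta[t - 1, s - 1] + move)
                delta[t, s] = best + loglik[t, s]
        return delta[-1, -1]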
Hybrid models: our method
- Feature-based hybrid:
  - Use ANNs to classify articulatory features
    instead of phones
  - 8 MLPs, classifying pl1, dg1, etc., frame-by-frame
- One of the motivations for using features is that
  it should be easier to build a multi-lingual /
  cross-language system this way
Hybrid models: using feature-classifying MLPs
[Diagram: DBN in which each feature variable (dg1, pl1, ..., rd)
depends on phoneState via a learned, non-deterministic CPT such as
p(dg1 | phoneState); the MLPs provide virtual evidence at dummy
child variables of the features]
Hybrid models: training the MLPs
- We use MLPs to classify speech into AFs,
  frame-by-frame
- Must obtain targets for training:
  - These are derived from phone labels
  - obtained by forced alignment using the SRI
    recogniser
  - this is less than ideal, but embedded training
    might help (results later)
- MLPs were trained by Joe Frankel (Edinburgh/ICSI)
  and Mathew Magimai (ICSI)
- Standard feedforward MLPs
- Trained using Quicknet
- Input to the nets is a 9-frame window of PLPs
  (with VTLN and per-speaker mean and variance
  normalisation) - sketched below
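A minimal sketch of assembling that input window from per-utterance
PLP features; shapes are illustrative and the per-speaker statistics
are assumed precomputed:

    import numpy as np

    def normalise(plp, speaker_mean, speaker_std):
        """Per-speaker mean and variance normalisation."""
        return (plp - speaker_mean) / speaker_std

    def context_window(plp, radius=4):
        """Stack each frame with its 4 neighbours on either side,
        giving a 9-frame input vector per frame; utterance edges
        are padded by repeating the first/last frame."""
        T, D = plp.shape
        padded = np.pad(plp, ((radius, radius), (0, 0)), mode="edge")
        return np.stack([padded[t:t + 2 * radius + 1].reshape(-1)
                         for t in range(T)])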
Hybrid models: training the MLPs
- Two versions of the MLPs were initially trained:
  - Fisher
    - Trained on all of Fisher, but not on any data
      from Switchboard 1
  - SVitchboard
    - Trained only on the training set of SVB
- The Fisher nets performed better, so were used in
  all hybrid experiments
Hybrid models: MLP details
- MLP architecture is
  input units x hidden units x output units
Hybrid models: MLP overall accuracies
- Frame-level accuracies
- MLPs trained on Fisher
- Accuracy computed with respect to SVB test set
- Silence frames excluded from this calculation
- More detailed analysis coming up later
Hybrid models: experiments
- Using MLPs trained on Fisher using original
  phone-derived targets
- vs.
- Using MLPs retrained on SVB data, which has been
  aligned using one of our models
- Hybrid model
- vs.
- Hybrid model plus PLP observations
Hybrid models: experiments - basic model
- Basic model is trained on activations from the
  original MLPs (Fisher-trained)
- The only parameters in this DBN are the
  conditional probability tables (CPTs) describing
  how each feature depends on the phone state
- Embedded training (a loop sketch follows):
  - Use the model to realign the SVB data (500 word
    task)
  - Starting from the Fisher-trained nets, retrain
    on these new targets
  - Retrain the DBN on the new net activations
[Diagram: the same DBN as before - phoneState with feature
variables dg1, pl1, ..., rd]
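A minimal sketch of that loop; each callable is a hypothetical
stand-in for the real GMTK/Quicknet step, not an actual API:

    def embedded_training(dbn, mlps, svb_data, realign, retrain_mlps,
                          retrain_dbn, n_iters=1):
        """Alternate between realignment and retraining."""
        for _ in range(n_iters):
            # 1. Realign SVB with the current hybrid model to get
            #    new frame-level feature targets.
            targets = realign(dbn, mlps, svb_data)
            # 2. Starting from the Fisher-trained weights, retrain
            #    the MLPs on the new targets.
            mlps = retrain_mlps(mlps, svb_data, targets)
            # 3. Retrain the DBN's CPTs on the new net activations.
            dbn = retrain_dbn(dbn, mlps, svb_data)
        return dbn, mlps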
Hybrid models: 500 word results
Hybrid models: adding in PLPs
- To improve accuracy, we combined the pure hybrid
  model with a standard monophone model
- Can/must weight the contribution of the PLPs
- Used a global weight on each of the 8 virtual
  evidences, and a fixed weight of 1.0 on the PLPs
  (sketched below)
- Weight tuning worked best if done both during
  training and decoding
- Computationally expensive: must train and
  cross-validate many different systems
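A minimal sketch of how the per-frame, per-state log-score is
assembled under this weighting scheme; names are illustrative:

    import numpy as np

    def combined_logscore(ve_logliks, plp_loglik, ve_weights):
        """ve_logliks: (8,) virtual-evidence log-likelihoods from
        the feature MLPs; plp_loglik: Gaussian-mixture
        log-likelihood of the PLPs (fixed weight 1.0);
        ve_weights: (8,) tuned global weights."""
        return float(np.dot(ve_weights, ve_logliks) + 1.0 * plp_loglik)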
Hybrid models: adding PLPs
[Diagram: the DBN above, extended with a PLP observation depending
on phoneState; each feature (dg1, pl1, ..., rd) still has a learned,
non-deterministic CPT p(feature | phoneState), and the MLP
likelihoods enter at dummy variables (implemented via virtual
evidence in GMTK)]
Hybrid models: weighting virtual evidence vs PLPs
Hybrid models: experiments - basic model + PLP
- Basic model is augmented with PLP observations
- Generated from mixtures of Gaussians, initialised
  from a conventional monophone model
- A big improvement over the hybrid-only model
- A small improvement over the PLP-only monophone
  model
[Diagram: the DBN with phoneState, feature variables dg1, pl1, ...,
rd, and PLP observations]
Hybrid experiments: conclusions
- Hybrid models perform reasonably well, but not
  yet as well as conventional models
- But they have fewer parameters to be trained
- So they may be a viable approach for small
  databases:
  - Train MLPs on a large database (e.g. Fisher)
  - Train the hybrid model on the small database
  - Cross-language??
- Embedded training gives good improvements for the
  pure hybrid model
- Hybrid models augmented with PLPs perform better
  than baseline PLP-only models
- But the improvement is only small
- The best way to use the MLPs trained on Fisher
  might be to construct tandem observation vectors
Using MLPs to transfer knowledge from larger
databases
- Scenario:
  - We need to build a system for a
    domain/accent/language for which we have only a
    small amount of data
  - We have lots of data from other
    domains/accents/languages
- Method:
  - Train an MLP on the large database
  - Use it in either a hybrid or a tandem system in
    the target domain
Using MLPs to transfer knowledge from larger
databases
- Articulatory features
- It is plausible that training MLPs to be AF
  classifiers could be more accent/language
  independent than training them to be phone
  classifiers
- Tandem results coming up shortly will show that,
  across very similar domains (Fisher to SVB), AF
  nets perform as well as or better than phone nets
Hybrid models vs tandem observations
- Standard hybrid:
  - Train an MLP to classify phonemes, frame by frame
  - Decode the MLP output using simple HMMs
    (transition probabilities are easily derived
    from phone duration statistics; we don't even
    need to train them)
- Standard tandem (sketched below):
  - Instead of using the MLP output to directly
    obtain the likelihood, just use it as a feature
    vector, after some transformations (e.g. taking
    logs) and dimensionality reduction
  - Append the resulting features to standard
    features, e.g. PLPs or MFCCs
  - Use this vector as the observation for a
    standard HMM with a mixture-of-Gaussians
    observation model
  - Currently used in state-of-the-art systems such
    as SRI's
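A minimal sketch of that tandem feature construction; the PCA
transform is assumed to have been estimated on training data, and
all names are illustrative:

    import numpy as np

    def tandem_features(posteriors, plps, pca_mean, pca_basis,
                        floor=1e-8):
        """posteriors: (T, K) MLP outputs; plps: (T, D) standard
        features; pca_basis: (K, R) projection onto the top R
        principal components."""
        logp = np.log(np.maximum(posteriors, floor))  # log transform
        reduced = (logp - pca_mean) @ pca_basis       # reduce dims
        return np.hstack([plps, reduced])             # append to PLPs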
- but first, a look at structural modifications . . .