Title: Crosslingual Speech Recognition
1. Crosslingual Speech Recognition
Firbush Presentation, Partha Lal
Some slides were swiped from Karen Livescu's WS06 slides
2. Multilingual Speech Recognition
- Speech recognisers need large amounts of labelled training data
- What about languages for which little labelled data exists?
- Speech data in one language could be used for another language
- (The focus here is on acoustic modelling)
Image swiped from Joseph Picone's WS99 slides
3. Hidden Markov Models
- HMMs are traditionally used here
- Hidden variable q_i represents discrete sub-phone units
- Observation obs_i represents continuous acoustic observations
- Parameters to be estimated:
  - Transition probabilities P(q_i | q_{i-1})
  - Emission probabilities P(obs_i | q_i) (GMM)
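These two parameter sets fully determine the model. As a minimal sketch (assuming a toy discrete-observation HMM rather than the GMM emissions on the slide), the forward algorithm combines them to score an observation sequence:

```python
def forward_likelihood(init, trans, emit, obs):
    """Total probability of an observation sequence under a discrete HMM.

    init[q]      : P(q_0 = q)
    trans[qp][q] : transition probability P(q_i | q_{i-1})
    emit[q][o]   : emission probability P(obs_i | q_i)
    """
    # alpha[q] = P(obs_0..obs_t, q_t = q), updated frame by frame
    alpha = {q: init[q] * emit[q][obs[0]] for q in init}
    for o in obs[1:]:
        alpha = {q: sum(alpha[qp] * trans[qp][q] for qp in alpha) * emit[q][o]
                 for q in init}
    return sum(alpha.values())
```

As a sanity check, the likelihoods of all possible observation sequences of a fixed length sum to one.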
4. Hybrid HMMs
- Emission probabilities are usually estimated with a Gaussian mixture model
- Instead we could use a neural network (MLP) to estimate P(q_i | obs_i)
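In the hybrid approach, the MLP's posterior P(q_i | obs_i) is typically converted into a scaled likelihood by dividing by the state prior (Bayes' rule, dropping the constant P(obs_i)). A sketch with made-up numbers:

```python
def scaled_likelihoods(posteriors, priors):
    """Convert MLP state posteriors into scaled likelihoods.

    By Bayes' rule: P(obs | q) = P(q | obs) * P(obs) / P(q).
    P(obs) is constant across states, so P(q | obs) / P(q) can
    stand in for the emission probability during decoding.
    """
    return {q: posteriors[q] / priors[q] for q in posteriors}
```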
5. MLP training targets
- Could use phones, but...
  - Number of units in the output layer: 30-45
  - Not all phones occur in all languages
  - Phones sound different in different languages
[Figure: HMM with hidden state q_i and observation obs_i]
6. MLP training targets
- So instead, try articulatory features
  - Smaller output layer
  - Language independent
[Figure: state q_i decomposed into articulatory features: voicing, manner, place, ...]
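One way to see why this shrinks the output layer: each phone decomposes into a handful of feature values, so several small per-feature classifiers replace one large phone classifier. The mapping below is an illustrative sketch (simplified values, not the actual WS06 feature set):

```python
# Illustrative phone-to-feature mapping. The feature values here are
# standard phonetics, but the table is a toy example, not the WS06 set.
PHONE_FEATURES = {
    "p": {"voicing": "voiceless", "manner": "stop",      "place": "labial"},
    "b": {"voicing": "voiced",    "manner": "stop",      "place": "labial"},
    "m": {"voicing": "voiced",    "manner": "nasal",     "place": "labial"},
    "s": {"voicing": "voiceless", "manner": "fricative", "place": "alveolar"},
    "z": {"voicing": "voiced",    "manner": "fricative", "place": "alveolar"},
}

def mlp_targets(phone):
    """Per-feature training targets for one frame labelled with `phone`."""
    f = PHONE_FEATURES[phone]
    return f["voicing"], f["manner"], f["place"]
```

A frame labelled with an unseen language's phone can still supply valid training targets, as long as the phone maps onto the same feature inventory.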
7. MLP training data
- Now that the MLPs classify the acoustic signal into more language-universal classes, data can be shared more easily between languages
- e.g. English data can help train a Mandarin recogniser
8. Conclusions
- Speech data in resource-rich languages can be used to train recognisers for resource-poor languages
- Neural networks that detect articulatory feature values may be useful for transferring knowledge between languages
9. Thank you!
Questions? Comments?
10. Bayesian networks (BNs)
- Directed acyclic graph (DAG) with one-to-one correspondence between nodes and variables X_1, X_2, ..., X_N
- Node X_i with parents pa(X_i) has a local probability function p(X_i | pa(X_i))
- Joint probability = product of local probabilities:
  p(x_1, ..., x_N) = ∏_i p(x_i | pa(x_i))
- Example: p(a, b, c, d) = p(a) p(b|a) p(c|b) p(d|b,c)
[Figure: four-node BN with local probabilities p(a), p(b|a), p(c|b), p(d|b,c)]
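The four-node factorisation can be checked numerically: with arbitrary but properly normalised local probability tables, the products sum to one over all joint assignments. A sketch with hypothetical binary variables:

```python
def joint(a, b, c, d, p_a, p_b_a, p_c_b, p_d_bc):
    """Joint probability from the BN's local probability functions:
    p(a, b, c, d) = p(a) p(b|a) p(c|b) p(d|b,c)
    """
    return p_a[a] * p_b_a[a][b] * p_c_b[b][c] * p_d_bc[(b, c)][d]
```

Summing `joint` over all 16 assignments of four binary variables returns 1, confirming the factorisation defines a valid distribution.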
11. Dynamic Bayesian networks (DBNs)
- BNs consisting of a structure that repeats an indefinite (i.e. dynamic) number of times
- Useful for modeling time series (e.g. speech!)
12. Notation: Representations of HMMs as DBNs
13. A phone HMM-based recognizer
[Figure: DBN unrolled over frame 0, frame i, and the last frame, annotated with variable names and values]
- Standard phone HMM-based recognizer with a bigram language model
14. Inference
- Definition: computation of the probability of one subset of the variables given another subset
- Inference is a subroutine of:
  - Viterbi decoding
    argmax p(word, subWordState, phoneState, ... | obs)
  - Maximum-likelihood parameter estimation
    θ* = argmax_θ p(obs | θ)
- For WS06, all models were implemented, trained, and tested using the Graphical Models Toolkit (GMTK) [Bilmes 02]
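The Viterbi argmax above can be sketched for a plain discrete HMM (a toy stand-in for the full GMTK models, which decode over words, sub-word states, and phone states jointly):

```python
def viterbi(init, trans, emit, obs):
    """Most likely state sequence: argmax over q_1..q_T of P(q | obs)."""
    states = list(init)
    # delta[q] = probability of the best path ending in state q
    delta = {q: init[q] * emit[q][obs[0]] for q in states}
    backptrs = []
    for o in obs[1:]:
        new_delta, bp = {}, {}
        for q in states:
            best = max(states, key=lambda qp: delta[qp] * trans[qp][q])
            bp[q] = best
            new_delta[q] = delta[best] * trans[best][q] * emit[q][o]
        delta = new_delta
        backptrs.append(bp)
    # Trace back from the best final state.
    q = max(states, key=lambda s: delta[s])
    path = [q]
    for bp in reversed(backptrs):
        q = bp[q]
        path.append(q)
    return path[::-1]
```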
15. Feature set for observation modeling
16. Hybrid models: MLP overall accuracies
- Frame-level accuracies
- MLPs trained on Fisher
- Accuracy computed with respect to the SVB test set
- Silence frames excluded from this calculation
17. Tandem Processing Steps
- MLP posteriors are processed to make them Gaussian-like
- There are 8 articulatory MLPs; their outputs are joined together at the input (64 dims)
- PCA reduces dimensionality to 26 (95% of the total variance)
- Use this 26-dimensional vector as acoustic observations in an HMM or some other model
- The tandem features are usually used in combination with a standard feature, e.g. PLP
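The steps above can be sketched end to end. The log transform as the Gaussianisation step is an assumption from the standard tandem recipe, not stated on this slide:

```python
import numpy as np

def tandem_features(mlp_posteriors, n_components=26):
    """Turn per-frame MLP posteriors into tandem acoustic features.

    mlp_posteriors: list of 8 arrays, each (n_frames, n_classes),
                    64 posterior dimensions in total.
    """
    # Log transform makes the bounded posteriors more Gaussian-like.
    logp = np.concatenate([np.log(p + 1e-10) for p in mlp_posteriors], axis=1)
    logp = logp - logp.mean(axis=0)            # centre before PCA
    cov = np.cov(logp, rowvar=False)           # (64, 64) covariance
    eigvals, eigvecs = np.linalg.eigh(cov)     # ascending eigenvalues
    top = eigvecs[:, np.argsort(eigvals)[::-1][:n_components]]
    return logp @ top                          # (n_frames, 26) projection
```

The resulting 26-dimensional vectors would then be appended to (or modelled alongside) a standard feature such as PLP.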
18. Articulatory vs. Phone Tandems
- Monophones on a 500-word-vocabulary task without alignments; feature-concatenated PLP/tandem models
- All tandem systems are significantly better than PLP alone
- Articulatory tandems are as good as phone tandems
- Articulatory tandems from MLPs trained on Fisher (1776 hrs) outperform those from MLPs trained on SVB (3 hrs)