Title: Crosslingual Speech Recognition
1. Crosslingual Speech Recognition
Firbush Presentation, Partha Lal
Some slides were swiped from Karen Livescu's WS06 slides
2. Multilingual Speech Recognition
- Speech recognisers need large amounts of labelled training data
- What about languages for which little labelled data exists?
- Speech data in one language could be used for another language
- (The focus here is on acoustic modelling)
Image swiped from Joseph Picone's WS99 slides
3. Hidden Markov Models
- HMMs are traditionally used here
- Hidden variable q_i represents discrete sub-phone units
- Observation obs_i represents continuous acoustic observations
- Parameters to be estimated:
  - Transition probabilities P(q_i | q_{i-1})
  - Emission probabilities P(obs_i | q_i) (GMM)
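These two parameter sets fully determine the model. As a minimal sketch (assuming a toy discrete-observation HMM rather than the GMM emissions on the slide), the forward algorithm combines them to score an observation sequence:

```python
def forward_likelihood(init, trans, emit, obs):
    """Total probability of an observation sequence under a discrete HMM.

    init[q]      : P(q_0 = q)
    trans[qp][q] : transition probability P(q_i | q_{i-1})
    emit[q][o]   : emission probability P(obs_i | q_i)
    """
    # alpha[q] = P(obs_0..obs_t, q_t = q), updated frame by frame
    alpha = {q: init[q] * emit[q][obs[0]] for q in init}
    for o in obs[1:]:
        alpha = {q: sum(alpha[qp] * trans[qp][q] for qp in alpha) * emit[q][o]
                 for q in init}
    return sum(alpha.values())
```

As a sanity check, the likelihoods of all possible observation sequences of a fixed length sum to one.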
4. Hybrid HMMs
- Emission probabilities are usually estimated with a Gaussian mixture model
- Instead we could use a neural network (MLP) to estimate P(q_i | obs_i)
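In the hybrid approach, the MLP's posterior P(q_i | obs_i) is typically converted into a scaled likelihood by dividing by the state prior (Bayes' rule, dropping the constant P(obs_i)). A sketch with made-up numbers:

```python
def scaled_likelihoods(posteriors, priors):
    """Convert MLP state posteriors into scaled likelihoods.

    By Bayes' rule: P(obs | q) = P(q | obs) * P(obs) / P(q).
    P(obs) is constant across states, so P(q | obs) / P(q) can
    stand in for the emission probability during decoding.
    """
    return {q: posteriors[q] / priors[q] for q in posteriors}
```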
5. MLP training targets
- Could use phones, but...
  - Number of units in the output layer: 30-45
  - Not all phones occur in all languages
  - Phones sound different in different languages
[Figure: HMM with hidden state q_i and observation obs_i]
6. MLP training targets
- So instead, try articulatory features
  - Smaller output layer
  - Language independent
[Figure: state q_i decomposed into articulatory features: voicing, manner, place, ...]
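One way to see why this shrinks the output layer: each phone decomposes into a handful of feature values, so several small per-feature classifiers replace one large phone classifier. The mapping below is an illustrative sketch (simplified values, not the actual WS06 feature set):

```python
# Illustrative phone-to-feature mapping. The feature values here are
# standard phonetics, but the table is a toy example, not the WS06 set.
PHONE_FEATURES = {
    "p": {"voicing": "voiceless", "manner": "stop",      "place": "labial"},
    "b": {"voicing": "voiced",    "manner": "stop",      "place": "labial"},
    "m": {"voicing": "voiced",    "manner": "nasal",     "place": "labial"},
    "s": {"voicing": "voiceless", "manner": "fricative", "place": "alveolar"},
    "z": {"voicing": "voiced",    "manner": "fricative", "place": "alveolar"},
}

def mlp_targets(phone):
    """Per-feature training targets for one frame labelled with `phone`."""
    f = PHONE_FEATURES[phone]
    return f["voicing"], f["manner"], f["place"]
```

A frame labelled with an unseen language's phone can still supply valid training targets, as long as the phone maps onto the same feature inventory.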
7. MLP training data
- Now that the MLPs classify the acoustic signal into more language-universal classes, data can be shared more easily between languages
- e.g. English data can help train a Mandarin recogniser
8. Conclusions
- Speech data in resource-rich languages can be used to train recognisers for resource-poor languages
- Neural networks that detect articulatory feature values may be useful for transferring knowledge between languages
9. Thank you!
Questions? Comments?
10. Bayesian networks (BNs)
- Directed acyclic graph (DAG) with one-to-one correspondence between nodes and variables X_1, X_2, ..., X_N
- Node X_i with parents pa(X_i) has a local probability function p(X_i | pa(X_i))
- Joint probability = product of local probabilities:
  p(x_1, ..., x_N) = ∏_i p(x_i | pa(x_i))
- Example: p(a, b, c, d) = p(a) p(b|a) p(c|b) p(d|b,c)
[Figure: four-node BN with local probabilities p(a), p(b|a), p(c|b), p(d|b,c)]
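The four-node factorisation can be checked numerically: with arbitrary but properly normalised local probability tables, the products sum to one over all joint assignments. A sketch with hypothetical binary variables:

```python
def joint(a, b, c, d, p_a, p_b_a, p_c_b, p_d_bc):
    """Joint probability from the BN's local probability functions:
    p(a, b, c, d) = p(a) p(b|a) p(c|b) p(d|b,c)
    """
    return p_a[a] * p_b_a[a][b] * p_c_b[b][c] * p_d_bc[(b, c)][d]
```

Summing `joint` over all 16 assignments of four binary variables returns 1, confirming the factorisation defines a valid distribution.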
11. Dynamic Bayesian networks (DBNs)
- BNs consisting of a structure that repeats an indefinite (i.e. dynamic) number of times
- Useful for modeling time series (e.g. speech!)
12. Notation: Representations of HMMs as DBNs
13. A phone HMM-based recognizer
[Figure: DBN unrolled over frame 0, frame i, and the last frame, annotated with variable names and values]
- Standard phone HMM-based recognizer with a bigram language model
14. Inference
- Definition: computation of the probability of one subset of the variables given another subset
- Inference is a subroutine of:
  - Viterbi decoding
    argmax p(word, subWordState, phoneState, ... | obs)
  - Maximum-likelihood parameter estimation
    θ* = argmax_θ p(obs | θ)
- For WS06, all models were implemented, trained, and tested using the Graphical Models Toolkit (GMTK) [Bilmes 02]
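The Viterbi argmax above can be sketched for a plain discrete HMM (a toy stand-in for the full GMTK models, which decode over words, sub-word states, and phone states jointly):

```python
def viterbi(init, trans, emit, obs):
    """Most likely state sequence: argmax over q_1..q_T of P(q | obs)."""
    states = list(init)
    # delta[q] = probability of the best path ending in state q
    delta = {q: init[q] * emit[q][obs[0]] for q in states}
    backptrs = []
    for o in obs[1:]:
        new_delta, bp = {}, {}
        for q in states:
            best = max(states, key=lambda qp: delta[qp] * trans[qp][q])
            bp[q] = best
            new_delta[q] = delta[best] * trans[best][q] * emit[q][o]
        delta = new_delta
        backptrs.append(bp)
    # Trace back from the best final state.
    q = max(states, key=lambda s: delta[s])
    path = [q]
    for bp in reversed(backptrs):
        q = bp[q]
        path.append(q)
    return path[::-1]
```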
15. Feature set for observation modeling
16. Hybrid models: MLP overall accuracies
- Frame-level accuracies
- MLPs trained on Fisher
- Accuracy computed with respect to the SVB test set
- Silence frames excluded from this calculation
17. Tandem Processing Steps
- MLP posteriors are processed to make them Gaussian-like
- There are 8 articulatory MLPs; their outputs are joined together at the input (64 dims)
- PCA reduces dimensionality to 26 (95% of the total variance)
- Use this 26-dimensional vector as acoustic observations in an HMM or some other model
- The tandem features are usually used in combination with a standard feature, e.g. PLP
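The steps above can be sketched end to end. The log transform as the Gaussianisation step is an assumption from the standard tandem recipe, not stated on this slide:

```python
import numpy as np

def tandem_features(mlp_posteriors, n_components=26):
    """Turn per-frame MLP posteriors into tandem acoustic features.

    mlp_posteriors: list of 8 arrays, each (n_frames, n_classes),
                    64 posterior dimensions in total.
    """
    # Log transform makes the bounded posteriors more Gaussian-like.
    logp = np.concatenate([np.log(p + 1e-10) for p in mlp_posteriors], axis=1)
    logp = logp - logp.mean(axis=0)            # centre before PCA
    cov = np.cov(logp, rowvar=False)           # (64, 64) covariance
    eigvals, eigvecs = np.linalg.eigh(cov)     # ascending eigenvalues
    top = eigvecs[:, np.argsort(eigvals)[::-1][:n_components]]
    return logp @ top                          # (n_frames, 26) projection
```

The resulting 26-dimensional vectors would then be appended to (or modelled alongside) a standard feature such as PLP.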
18. Articulatory vs. Phone Tandems
- Monophones on a 500-word-vocabulary task without alignments; feature-concatenated PLP/tandem models
- All tandem systems are significantly better than PLP alone
- Articulatory tandems are as good as phone tandems
- Articulatory tandems from MLPs trained on Fisher (1776 hrs) outperform those from MLPs trained on SVB (3 hrs)