Crosslingual Speech Recognition

1
Crosslingual Speech Recognition
Firbush Presentation Partha Lal
Some slides were swiped from Karen Livescu's WS06
slides
2
Multilingual Speech Recognition
  • Speech recognisers need large amounts of labelled
    training data
  • What about languages for which little labelled
    data exists?
  • Speech data in one language could be used for
    another language
  • (the focus here is on acoustic modelling)

Image swiped from Joseph Picone's WS99 slides
3
Hidden Markov Models
  • HMMs traditionally used here
  • Hidden variable Q represents discrete sub-phone
    units
  • Observation obs represents continuous acoustic
    observations
  • Parameters to be estimated
  • Transition probabilities P(q_i | q_{i-1})
  • Emission probabilities P(obs_i | q_i) (GMM)
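As a sketch of how these parameters fit together, here is a toy forward pass for a three-state left-to-right HMM. All numbers are illustrative, not from the presentation, and the emission scores stand in for GMM likelihoods:

```python
import numpy as np

# Toy HMM: 3 sub-phone states, left-to-right transitions.
A = np.array([[0.7, 0.3, 0.0],   # P(q_i | q_{i-1})
              [0.0, 0.8, 0.2],
              [0.0, 0.0, 1.0]])
pi = np.array([1.0, 0.0, 0.0])   # always start in the first state

# Emission likelihoods P(obs_i | q_i) for 4 frames (stand-in for GMM scores)
B = np.array([[0.90, 0.05, 0.05],
              [0.60, 0.30, 0.10],
              [0.10, 0.70, 0.20],
              [0.05, 0.15, 0.80]])

# Forward recursion: alpha[j] = P(obs_1..t, q_t = j)
alpha = pi * B[0]
for t in range(1, len(B)):
    alpha = (alpha @ A) * B[t]

# Sequence likelihood P(obs_1..T), summed over final states
seq_likelihood = alpha.sum()
```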

4
Hybrid HMMs
  • Emission probabilities are usually estimated with
    a Gaussian Mixture Model (GMM)
  • Instead we could use a neural network (MLP)

P(q_i | obs_i)
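In the hybrid setup the MLP estimates the posterior P(q_i | obs_i) rather than the likelihood P(obs_i | q_i); by Bayes' rule, dividing by the state prior gives a "scaled likelihood" that can stand in for the GMM score. A minimal sketch with made-up numbers:

```python
import numpy as np

# MLP posteriors P(q_i | obs_i) for one frame (illustrative values)
posterior = np.array([0.7, 0.2, 0.1])
# State priors P(q_i), e.g. relative frequencies on the training set
prior = np.array([0.5, 0.3, 0.2])

# Bayes' rule: P(obs | q) = P(q | obs) * P(obs) / P(q).
# P(obs) is constant within a frame, so dividing the posterior by the
# prior yields a likelihood known only up to that constant scale.
scaled_likelihood = posterior / prior
```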
5
MLP training targets
  • Could use phones but...
  • Number of units in output layer: 30-45
  • Not all phones occur in all languages
  • Phones sound different in different languages

[Diagram: HMM with hidden state q_i and acoustic observation obs_i]
6
MLP training targets
  • So instead try articulatory features
  • Smaller output layer
  • Language independent

[Diagram: MLPs mapping the acoustic observation to articulatory feature values: voicing, manner, place, ...]
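A hypothetical mapping from phone labels to articulatory-feature training targets might look like the following. The phone inventory and feature values here are illustrative, not the presentation's actual feature set:

```python
# Hypothetical articulatory-feature values for a few phones.
# In the tandem setup, a separate MLP classifies each feature stream.
FEATURES = {
    #  phone:  (voicing,     manner,      place)
    "p":       ("voiceless", "stop",      "labial"),
    "b":       ("voiced",    "stop",      "labial"),
    "s":       ("voiceless", "fricative", "alveolar"),
    "z":       ("voiced",    "fricative", "alveolar"),
    "m":       ("voiced",    "nasal",     "labial"),
}

def feature_targets(phone):
    """Return per-feature training targets for a phone label."""
    return dict(zip(("voicing", "manner", "place"), FEATURES[phone]))
```

Because the feature values (voiced/voiceless, stop/fricative, and so on) are shared across languages, frames from any language with phone-level labels can contribute training targets for the same small set of output units.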
7
MLP training data
  • Now that the MLPs classify the acoustic signal
    into more language universal classes, data can
    be shared more easily between languages
  • e.g. English data can help train a Mandarin
    recogniser

8
Conclusions
  • Speech data in resource-rich languages can be
    used to train recognisers for resource-poor
    languages
  • Neural networks that detect articulatory feature
    values may be useful for transferring knowledge
    between languages

9
Thank you!
Questions? Comments?
10
Bayesian networks (BNs)
  • Directed acyclic graph (DAG) with one-to-one
    correspondence between nodes and variables X_1,
    X_2, ..., X_N
  • Node X_i with parents pa(X_i) has a local
    probability function p(X_i | pa(X_i))
  • Joint probability = product of local
    probabilities: p(x_1, ..., x_N) = ∏_i p(x_i | pa(x_i))

[Diagram: four-node BN with edges a → b, b → c, b → d, c → d]
p(a, b, c, d) = p(a) p(b|a) p(c|b) p(d|b,c)
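The factorization can be checked numerically: with any valid conditional probability tables, the product of the local probabilities sums to one over all assignments. A sketch of the four-node example with made-up binary CPTs:

```python
# Joint probability of the four-node example BN via its factorization:
#   p(a, b, c, d) = p(a) * p(b|a) * p(c|b) * p(d|b,c)
# All CPT values below are illustrative, not from the slides.
p_a = {0: 0.6, 1: 0.4}
p_b_given_a = {(0, 0): 0.9, (0, 1): 0.1,
               (1, 0): 0.3, (1, 1): 0.7}          # key: (a, b)
p_c_given_b = {(0, 0): 0.8, (0, 1): 0.2,
               (1, 0): 0.5, (1, 1): 0.5}          # key: (b, c)
p_d_given_bc = {(b, c, d): 0.5                    # key: (b, c, d)
                for b in (0, 1) for c in (0, 1) for d in (0, 1)}

def joint(a, b, c, d):
    """Joint probability from the product of local probabilities."""
    return (p_a[a] * p_b_given_a[(a, b)]
            * p_c_given_b[(b, c)] * p_d_given_bc[(b, c, d)])

# The joint distribution must sum to 1 over all assignments
total = sum(joint(a, b, c, d)
            for a in (0, 1) for b in (0, 1)
            for c in (0, 1) for d in (0, 1))
```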
11
Dynamic Bayesian networks (DBNs)
  • BNs consisting of a structure that repeats an
    indefinite (i.e. dynamic) number of times
  • Useful for modeling time series (e.g. speech!)

12
Notation: Representations of HMMs as DBNs
13
A phone HMM-based recognizer
[Diagram: recognizer DBN unrolled from frame 0 through the last frame, with variable names and values labelled]
  • Standard phone HMM-based recognizer with bigram
    language model

14
Inference
  • Definition: computation of the probability of one
    subset of the variables given another subset
  • Inference is a subroutine of:
  • Viterbi decoding:
    argmax p(word, subWordState, phoneState, ...
    obs)
  • Maximum-likelihood parameter estimation:
    θ* = argmax_θ p(obs | θ)
  • For WS06, all models implemented, trained, and
    tested using the Graphical Models Toolkit (GMTK)
    [Bilmes 02]
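For the plain-HMM special case, the Viterbi decoding mentioned above can be sketched in log space as follows. This is a generic textbook implementation, not GMTK's:

```python
import numpy as np

def viterbi(log_A, log_B, log_pi):
    """Most likely state sequence argmax_q p(q, obs) for an HMM.

    log_A:  (S, S) log transition probabilities
    log_B:  (T, S) log emission scores per frame
    log_pi: (S,)   log initial-state probabilities
    """
    T, S = log_B.shape
    delta = log_pi + log_B[0]          # best log-prob ending in each state
    back = np.zeros((T, S), dtype=int) # backpointers to the best predecessor
    for t in range(1, T):
        scores = delta[:, None] + log_A    # scores[p, c]: prev p -> cur c
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_B[t]
    # Trace the best path backwards from the best final state
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

With a left-to-right toy model whose emissions favour states 0, 1, 2 in turn, the decoder recovers the path 0 → 1 → 2.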

15
Feature set for observation modeling
16
Hybrid models: MLP overall accuracies
  • Frame-level accuracies
  • MLPs trained on Fisher
  • Accuracy computed with respect to SVB test set
  • Silence frames excluded from this calculation
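The "silence frames excluded" computation can be made concrete as below; the frame labels and the silence symbol "sil" are assumptions for illustration:

```python
def frame_accuracy(pred, ref, silence="sil"):
    """Frame-level accuracy, skipping frames whose reference label
    is silence (as in the accuracy table on this slide)."""
    scored = [(p, r) for p, r in zip(pred, ref) if r != silence]
    if not scored:
        return 0.0
    return sum(p == r for p, r in scored) / len(scored)
```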

17
Tandem Processing Steps
  • MLP posteriors are processed to make them
    Gaussian-like
  • There are 8 articulatory MLPs; their outputs are
    joined together at the input (64 dims)
  • PCA reduces dimensionality to 26 (95% of the
    total variance)
  • Use this 26-dimensional vector as acoustic
    observations in an HMM or some other model
  • The tandem features are usually used in
    combination w/ a standard feature, e.g. PLP
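The pipeline above can be sketched end-to-end with random stand-in data. The dimensions (8 MLPs, 64-dim concatenation, 26 PCA components, concatenation with a standard front end) follow the slide; the data, the log as the Gaussianising step, and the 39-dim PLP size are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in posteriors from 8 articulatory MLPs, 8 outputs each
# (64 dims total), over 200 frames. Real posteriors would come
# from the trained networks.
posteriors = rng.dirichlet(np.ones(8), size=(200, 8)).reshape(200, 64)

# 1. Gaussianise: take logs (posteriors bunch up near 0 and 1)
feats = np.log(posteriors + 1e-8)

# 2. PCA to 26 dimensions via SVD of the centred data
centred = feats - feats.mean(axis=0)
U, s, Vt = np.linalg.svd(centred, full_matrices=False)
tandem = centred @ Vt[:26].T              # (200, 26) tandem features

# 3. Concatenate with a standard front end (stand-in for 39-dim PLP)
plp = rng.standard_normal((200, 39))
observations = np.hstack([plp, tandem])   # (200, 65) per-frame vectors
```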

18
Articulatory vs. Phone Tandems
  • Monophones on a 500-word vocabulary task, without
    alignments; feature-concatenated PLP/tandem models
  • All tandem systems are significantly better than
    PLP alone
  • Articulatory tandems are as good as phone tandems
  • Articulatory tandems from Fisher (1776 hrs)
    trained MLPs outperform those from SVB (3 hrs)
    trained MLPs