1
Articulatory Feature-based Speech Recognition: A Proposal for the 2006 JHU Summer Workshop on Language Engineering
November 11, 2005
  • Potential team members to date:
  • Karen Livescu (presenter)
  • Simon King
  • Florian Metze
  • Jeff Bilmes
  • Mark Hasegawa-Johnson
  • Özgür Çetin
  • Kate Saenko
2
Motivations
  • Why articulatory feature-based ASR?
  • Improved modeling of co-articulatory pronunciation phenomena
  • Takes advantage of knowledge about human perception and production
  • Application to audio-visual modeling
  • Application to multilingual ASR
  • Evidence of improved ASR performance with feature-based models
  • In noise (Kirchhoff et al. 2002)
  • For hyperarticulated speech (Soltau et al. 2002)
  • Why this workshop project?
  • A growing number of sites are investigating complementary aspects of this idea; a non-exhaustive list:
  • U. Edinburgh (King et al.)
  • UIUC (Hasegawa-Johnson et al.)
  • MIT (Livescu, Glass, Saenko)
  • Recently developed tools (e.g. graphical models) allow systematic exploration of the model space

3
Approach: Main Ideas
  • There are many ways to use articulatory features in ASR
  • Approach for this project: multiple streams of hidden articulatory states that can desynchronize and stray from their target values
  • Inspired by linguistic theories, but simplified and cast in a probabilistic setting

[Figure: a baseform dictionary entry expanded into per-frame streams of target indices (ind GLOT, ind VEL, ind LIP-OPEN) and underlying vs. surface feature values (U LIP-OPEN, S LIP-OPEN, taking values such as W, N, C), illustrating how streams can desynchronize and stray from their targets.]
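The desynchronization idea above can be made concrete with a minimal sketch (not the project's implementation): each feature stream advances through its own sequence of targets, and an asynchrony bound limits how far apart any two streams' target indices may drift. The stream names and the bound of 1 below are illustrative assumptions.

```python
from itertools import product

def allowed_states(n_targets, n_streams, max_async):
    """Enumerate joint target-index configurations in which no two
    streams' positions differ by more than max_async targets."""
    states = []
    for idxs in product(range(n_targets), repeat=n_streams):
        if max(idxs) - min(idxs) <= max_async:
            states.append(idxs)
    return states

# 3 streams (e.g. GLOT, VEL, LIP-OPEN), 3 targets each, asynchrony bound 1
states = allowed_states(3, 3, 1)
print(len(states))  # 15 of the 27 unconstrained joint states survive
```

Tightening the bound shrinks the joint state space, which is one way such constraints can keep decoding tractable.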
4
Dynamic Bayesian network implementation: The context-independent case
Example: DBN with 3 features
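As a hedged sketch of what a two-slice DBN over three feature streams computes (an assumed simplification, not the actual GMTK model): suppose each stream independently keeps or changes its value between frames; the real model additionally couples streams through target-index and asynchrony variables. The probability value is illustrative.

```python
# Assumed per-stream "stay" probability -- an illustrative number, not
# a parameter from the proposal.
STAY = 0.8

def transition_prob(prev, curr):
    """Product of per-stream first-order transition probabilities for
    one frame-to-frame step of the factored DBN."""
    p = 1.0
    for a, b in zip(prev, curr):
        p *= STAY if a == b else 1.0 - STAY
    return p

# Two streams keep their values, one changes: 0.8 * 0.8 * 0.2
print(transition_prob((0, 1, 1), (0, 1, 0)))
```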
5
Recent related work
  • Product observation models combining phones and features, p(obs|s) = p(obs|ph_s) · ∏_i p(obs|f_i), improve ASR in some conditions
  • Kirchhoff et al. 2002; Metze et al. 2002; Stüker et al. 2002
  • Lexical access from manual transcriptions of Switchboard words using the DBN model above (Livescu & Glass 2004, 2005)
  • Improves over phone-based pronunciation models (50% → 25% error)
  • Preliminary result: articulatory phonology features preferable to IPA-style (place/manner) features
  • JHU WS04 project (Hasegawa-Johnson et al. 2004)
  • Can combine landmark/IPA-style features at the acoustic level with articulatory phonology features at the pronunciation level
  • Articulatory recognition using DBN and ANN/DBN models (Wester et al. 2004; Frankel et al. 2005)
  • Modeling inter-feature dependencies is useful; asynchrony may also be useful
  • Lipreading using a multistream DBN model with SVM feature detectors
  • Improves over viseme-based models in medium-vocabulary word ranking and on a realistic small-vocabulary task (Saenko et al. 2005)
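The product observation model in the first bullet can be sketched numerically; the likelihood values below are made up for illustration, and in practice each factor would come from a trained phone or feature classifier.

```python
def product_obs_score(phone_likelihood, feature_likelihoods):
    """Combine a phone-level observation likelihood p(obs|ph_s) with
    per-feature likelihoods p(obs|f_i) by taking their product."""
    score = phone_likelihood
    for lf in feature_likelihoods:
        score *= lf
    return score

# e.g. phone model gives 0.5; three feature models give 0.9, 0.8, 0.7
print(product_obs_score(0.5, [0.9, 0.8, 0.7]))
```

Because the combined score is a product, any single low-likelihood factor can veto a state, which is one intuition for why such combinations help in noise.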

6
Ongoing work: Audio-visual ASR
7
Plan for 2006 Workshop
  • Goals
  • To build complete articulatory feature-based ASR
    systems using multistream DBN structures
  • To develop a thorough understanding of the design
    issues involved
  • Questions to be addressed
  • What are appropriate ways to combine models of
    articulation with observations?
  • Are discriminative feature classifiers preferable
    to generative observation models?
  • What asynchrony constraints can account for
    co-articulation while permitting efficient
    implementations?
  • How does context affect the modeling of
    articulatory feature streams?
  • Must the features modeled at the observation
    level be the same as the hidden state streams?
  • How can such models be applied to audio-visual
    ASR?
  • A possible work plan
  • Prior to workshop
  • Selection of feature sets to be considered
  • Baseline feature-based and phone-based models on
    selected data
  • Workshop, first half
  • Exploration of feature sets and classifiers

8
Potential participants and contributors
  • Local participants
  • Karen Livescu, MIT
  • Feature-based ASR structures, graphical models,
    GMTK
  • Mark Hasegawa-Johnson, U. Illinois at
    Urbana-Champaign
  • Discriminative feature classification, JHU WS04
  • Simon King, U. Edinburgh
  • Articulatory feature recognition, ANN/DBN
    structures
  • Özgür Çetin, ICSI Berkeley
  • Multistream/multirate modeling, graphical models,
    GMTK
  • Florian Metze
  • Articulatory features in HMM framework
  • Jeff Bilmes, U. Washington
  • Graphical models, GMTK
  • Kate Saenko, MIT
  • Visual feature classification, AVSR
  • Others?
  • Satellite/advisory contributors
  • Jim Glass, MIT

9
Resources
  • Tools
  • GMTK
  • HTK
  • Intel AVCSR toolkit
  • Data
  • Audio-only
  • Svitchboard (CSTR Edinburgh): small-vocab, continuous, conversational
  • PhoneBook: medium-vocab, isolated-word, read
  • (Switchboard rescoring? LVCSR)
  • Audio-visual
  • AVTIMIT (MIT): medium-vocab, continuous, read, added noise
  • Digit strings database (MIT): continuous, read, naturalistic setting (noise and video background)
  • Articulatory measurements
  • X-ray microbeam database (U. Wisconsin): many speakers, large-vocab, isolated-word and continuous
  • MOCHA (QMUC, Edinburgh): few speakers, medium-vocab, continuous
  • Others?
  • Manual transcriptions: ICSI Berkeley Switchboard transcription project

10
Thanks! Questions? Comments?