Potential team members to date:

Provided by: kliv8

Learn more at: https://people.csail.mit.edu

1
Articulatory Feature-based Speech Recognition:
A Proposal for the 2006 JHU Summer Workshop on Language Engineering
November 12, 2005

Potential team members to date:
Karen Livescu (presenter)
Simon King
Florian Metze
Jeff Bilmes
Mark Hasegawa-Johnson
Ozgur Cetin
Kate Saenko

2
Dynamic Bayesian network implementation: The context-independent case
Example DBN with 3 features
3
Combination of articulatory-phonology coarticulation modeling with IPA-feature-based acoustic modeling (deterministic mapping)

Suggests a potential work plan:
1st half of workshop: Sub-teams work in parallel on
(1) Set of features and classifiers for acoustic model, using only articulatory 'ground truth' and acoustics
(2) Aspects of hidden structure (asynchrony, substitutions, context dependency), using only articulatory 'ground truth' and words
2nd half of workshop: Integrate most successful methods from 1st half
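
As a rough illustration of the "asynchrony" aspect of hidden structure in item (2), the sketch below enumerates the joint positions of several feature streams within a word's sequence of sub-word units, letting the streams drift apart by at most a fixed number of units. The stream names, the `max_async` bound, and the function name are assumptions made for illustration only. They are not taken from the proposal or from any toolkit, and a real DBN would attach probabilities to these states rather than simply list them.

```python
from itertools import product


def allowed_joint_states(num_units, streams, max_async):
    """Enumerate joint stream positions whose spread is at most max_async.

    num_units: number of sub-word units in the word (hypothetical).
    streams:   names of the articulatory feature streams (hypothetical).
    max_async: largest allowed gap between the most advanced and the
               least advanced stream, measured in sub-word units.
    """
    states = []
    for positions in product(range(num_units), repeat=len(streams)):
        # Keep only combinations where no stream runs too far ahead.
        if max(positions) - min(positions) <= max_async:
            states.append(dict(zip(streams, positions)))
    return states


# Three streams, echoing the "example DBN with 3 features"; names are made up.
streams = ["lips", "tongue", "glottis"]
sync = allowed_joint_states(4, streams, max_async=0)   # fully synchronous
loose = allowed_joint_states(4, streams, max_async=1)  # one unit of drift
print(len(sync), len(loose))
```

Setting `max_async=0` recovers the fully synchronous case, in which all streams always point at the same unit. Raising the bound enlarges the joint state space, which is one reason the amount of permitted asynchrony is itself a modeling choice.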

4
Resources

Tools:
GMTK
HTK
Intel AVCSR toolkit
Data:
Audio-only:
Svitchboard (CSTR Edinburgh): Small-vocab, continuous, conversational
PhoneBook: Medium-vocab, isolated-word, read
(Switchboard rescoring? LVCSR)
Audio-visual:
AVTIMIT (MIT): Medium-vocab, continuous, read, added noise
Digit strings database (MIT): Continuous, read, naturalistic setting (noise and video background)
AVICAR (UIUC)
Articulatory measurements:
X-ray microbeam database (U. Wisconsin): Many speakers, large-vocab, isolated-word and continuous
MOCHA (QMUC, Edinburgh): Few speakers, medium-vocab, continuous
Others?
Manual transcriptions: ICSI Berkeley Switchboard transcription project

5
Questions to address (soon)

Audio-only, audio-visual only, or both?
Audio-only:
Better understood by current team members
Has more spontaneous speech data
Audio-visual:
Potentially, many more interesting phenomena in read data
Visual observations more closely tied to articulatory features
Smaller tasks → faster turnaround time → higher impact?
Can we reliably decouple investigation of acoustic modeling and pronunciation modeling?
Evaluation via measures other than word error rate:
Forced alignments
Articulatory tracking
Reasonableness of model parameters
(Multi-style ASR: Train on slow, test on fast?)
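
One way the "forced alignments" evaluation could be scored is by frame-level agreement between the feature values in a model's forced alignment and those in a reference labeling, such as a manual transcription. The following is a minimal sketch under that assumption. The data layout (one label per frame for each feature), the feature names and values, and the function name are all hypothetical and are not part of the proposal.

```python
def frame_agreement(hyp, ref):
    """Return, per feature, the fraction of frames where hyp matches ref.

    hyp, ref: dicts mapping a feature name to a list of per-frame labels
              (an assumed layout, chosen only for this illustration).
    """
    scores = {}
    for feature, ref_labels in ref.items():
        hyp_labels = hyp[feature]
        if len(hyp_labels) != len(ref_labels):
            raise ValueError(f"frame count mismatch for {feature!r}")
        matches = sum(h == r for h, r in zip(hyp_labels, ref_labels))
        # An empty reference has no frames to score; report 0.0 for it.
        scores[feature] = matches / len(ref_labels) if ref_labels else 0.0
    return scores


# Toy four-frame example with made-up labels.
ref = {
    "lips": ["closed", "closed", "open", "open"],
    "glottis": ["voiced", "voiced", "voiced", "voiceless"],
}
hyp = {
    "lips": ["closed", "open", "open", "open"],
    "glottis": ["voiced", "voiced", "voiced", "voiceless"],
}
scores = frame_agreement(hyp, ref)
print(scores)
```

Unlike word error rate, a per-feature score of this kind can show which articulatory stream the model aligns poorly, even when the recognized words are correct.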