Underspecified feature models for pronunciation variation in ASR

1
Underspecified feature models for pronunciation
variation in ASR
  • Eric Fosler-Lussier
  • The Ohio State University
  • Speech Language Technologies Lab
  • ITRW - Speech Recognition & Intrinsic Variation
  • 20 May 2006

2
Fill in the blanks
  • 3, 6, __, 12, 15, __, 21, 24
  • A B C __ E F __ H
  • You're going to Toulouse? Drink a bottle of
    _____ for me!
  • What's the red object?

We're very good at filling in the blanks when we
have context!
3
Filling in the blanks: missing data
  • Missing data approaches have been used to
    integrate over noisy acoustics

Wang & Hu 06
4
Decode this!
  • (brackets indicate options)
  • s iy n y [ah,ax,axr,er]
  • [l,r] [eh,ih,iy] s er ch
  • [ah,ax] s ow [s,sh,z,zh] [eh,ih,iy] [eh,ey] [t,d]

5
Decode this!
  • (brackets indicate options)
  • s iy n y [ah,ax,axr,er] -> senior
  • [l,r] [eh,ih,iy] s er ch -> research
  • [ah,ax] s ow [s,sh,z,zh] [eh,ih,iy] [eh,ey]
    [t,d] -> associate

6
Decode this!
  • (brackets indicate options)
  • s iy n y [ah,ax,axr,er] -> senior
  • [l,r] [eh,ih,iy] s er ch -> research
  • [ah,ax] s ow [s,sh,z,zh] [eh,ih,iy] [eh,ey]
    [t,d] -> associate
  • dictionary pronunciation

7
Decode this!
  • (brackets indicate options)
  • s iy n y [ah,ax,axr,er] -> senior
  • [l,r] [eh,ih,iy] s er ch -> research
  • [ah,ax] s ow [s,sh,z,zh] [eh,ih,iy] [eh,ey]
    [t,d] -> associate
  • dictionary pronunciation
  • as marked by transcribers (Buckeye Corpus of
    Speech)
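The decoding exercise above can be read as a lookup problem: every position holds a set of allowed phones, and a word is a candidate if its pronunciation fits every position. The following is a minimal sketch of that idea, not the method used in the talk. The mini-lexicon, its pronunciations, and the square-bracket notation for option sets are all assumptions made for illustration.

```python
# Hypothetical mini-lexicon; the pronunciations are illustrative assumptions.
LEXICON = {
    "senior": ["s", "iy", "n", "y", "er"],
    "research": ["r", "iy", "s", "er", "ch"],
    "associate": ["ax", "s", "ow", "s", "iy", "ey", "t"],
    "search": ["s", "er", "ch"],
}


def parse(spec):
    """Turn 's iy [ah,ax]' into a list of sets of allowed phones.

    Option sets are written without spaces inside the brackets.
    """
    return [set(token.strip("[]").split(",")) for token in spec.split()]


def decode(spec, lexicon=LEXICON):
    """Return every word whose pronunciation fits all positions of spec."""
    slots = parse(spec)
    return [
        word
        for word, pron in lexicon.items()
        if len(pron) == len(slots)
        and all(phone in slot for phone, slot in zip(pron, slots))
    ]


first = decode("s iy n y [ah,ax,axr,er]")       # ['senior']
second = decode("[l,r] [eh,ih,iy] s er ch")     # ['research']
third = decode("[ah,ax] s ow [s,sh,z,zh] [eh,ih,iy] [eh,ey] [t,d]")  # ['associate']
```

With a realistic lexicon several words may match one underspecified string, which is where word-level context would have to do the remaining disambiguation.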

8
What do these tasks have in common?
  • Recovering from erroneous information?
  • Context plays a big role in helping clean up

9
What do these tasks have in common?
  • Recovering from erroneous information?
  • Context plays a big role in helping clean up
  • Recovering from incomplete information!
  • We should be treating pronunciation variation as
    a missing data problem
  • Integrate over missing phonological features
  • How much information do you need to decode words?
  • Particularly taking into account the context of
    the word, syllabic context of phones, etc.
  • Information theory problem

10
Outline
  • Problems with phonetic representations of
    variation
  • Potential advantages of phonological features
  • Re-examining the role of phonetic transcription
  • Phonological feature approaches to ASR
  • Feature attribute detection
  • Feature combination methods
  • Learning to (dis-)trust features
  • A challenge for the future

11
The Case Against The Phoneme: Homage to
Ostendorf (ASRU 99)
  • Four major indications that phonetic modeling of
    variation is not appropriate

12
The Case Against The Phoneme: Homage to
Ostendorf (ASRU 99)
  • Four major indications that phonetic modeling of
    variation is not appropriate
  • Lack of progress on spontaneous speech WER
  • McAllaster et al (98): 50% improvement possible
  • Finke & Waibel (97): 6% WER reduction

13
The Case Against The Phoneme: Homage to
Ostendorf (ASRU 99)
  • Four major indications that phonetic modeling of
    variation is not appropriate
  • Lack of progress on spontaneous speech WER
  • Independence of decisions in phone-based models
  • When pronunciation variation is modeled on
    phone-by-phone level, unusual baseforms are often
    created
  • Word-based learning fails to generalize across
    words

Riley et al 98
14
The Case Against The Phoneme: Homage to
Ostendorf (ASRU 99)
  • Four major indications that phonetic modeling of
    variation is not appropriate
  • Lack of progress on spontaneous speech WER
  • Independence of decisions in phone-based models
  • Lack of granularity
  • Triphone contexts mean a symbolic change in phone
    can affect 9 HMM states (min 90 msec)
  • Much variation is already handled by triphone
    context

Saraçlar et al 00
Jurafsky et al 01
15
The Case Against The Phoneme: Homage to
Ostendorf (ASRU 99)
  • Four major indications that phonetic modeling of
    variation is not appropriate
  • Lack of progress on spontaneous speech WER
  • Independence of decisions in phone-based models
  • Lack of granularity
  • Difficulty in transcription
  • Phonetic transcription is expensive and time
    consuming
  • Many decisions difficult to make for transcribers

16
Using phonological features
  • Finer granularity
  • Some phonological changes don't result in
    canonical phones for a language
  • English uw can sometimes be fronted (toot)
  • Common enough that TIMIT introduced a special phone
    (ux)
  • Symbol change loses all commonality between
    phones (uw -> ux)
  • Handling odd phonological effects
  • Phone deletions: many deletions really leave
    small traces of coarticulation on neighboring
    segments
  • E.g. vowel nasalization with nasal deletion
  • Features may provide basis for cross-lingual
    recognition
  • International Phonetic Alphabet

17
Issues with phonological features
  • Interlingua: high vowels in English are not the
    same as high vowels in Japanese
  • Richard Wright, lunch Wednesday, ICASSP 2006
  • Concept of independent directions false
  • Correlation of feature values
  • Distances no longer Euclidean among feature
    dimensions
  • Dealing with feature spreading
  • Even more difficulty in transcription
  • (but Karen Livescu's group, JHU workshop 2006)
  • Articulatory vs. acoustic features
  • No two definitions are exactly the same (see
    Richard's talk)

18
Phonetic transcription
  • There have been a number of efforts to transcribe
    speech phonetically
  • American English
  • TIMIT (4 hr read speech)
  • Switchboard (4 hr spontaneous speech)
  • Buckeye Corpus (40 hr spontaneous speech)
    http://buckeyecorpus.osu.edu
  • ASR researchers have found it difficult to
    utilize phonetic transcriptions directly

Riley et al 99
19
ASR & Phonetic Transcription
  • Saraçlar & Khudanpur (04) examined the means of
    acoustic models where canonical phone /x/ was
    transcribed as y, over all pairs x:y
  • Compared means of x:y to x:x, y:y
  • Data showed that x:y means often fell between x:x
    and y:y, sometimes closer to x:x
  • Another view: data from Buckeye Corpus
  • /ae/ is sometimes transcribed as eh
  • Examined 80 vowels from one speaker
  • Formant frequencies from center of vowel

20
(No Transcript)
21
(No Transcript)
22
Can you trust transcription?
  • Perceptual marking ≠ acoustic measurement
  • Can't take transcription at face value
  • What are the transcribers trying to tell us?
  • "This phone doesn't sound like a canonical phone"
  • Perhaps we can look at commonalities across
    canonical/transcribed phone
  • ae:eh -> front vowel (not high?)
  • Phonological features may help us represent
    transcription differences.

23
Variation in single-phone changes
  • Compared canonical vs. transcribed consonants
    with single-phone substitutions in Switchboard,
    Buckeye
  • Differences in manner, place, voicing counted

Manner / Place / Voicing   SWB (%)  BCS (%)
x                          42.1     41.5
x                          7.3      13.8
x                          39.7     27.1
x x                        8.2      12.5
x x                        1.4      1.5
x x                        0.0      1.1
x x x                      0.7      2.1
24
Recent approaches to feature modeling in ASR
  • Since the 90s there has been increased interest in
    phonological feature modeling
  • Deng et al (92 ff), Kirchhoff (96 ff)
  • Current directions of research
  • Approaches for detecting phonological features
    from data
  • Methods of combining phonological features
  • Knowing when to ignore information

25
Feature detection methods
  • Frame-level decisions
  • Most common: artificial neural network methods
  • Input: various flavors of spectral/cepstral
    representations
  • Output: estimating posterior P(feature | acoustics)
    on a per-frame level
  • Recent competitor: support vector machines
  • Typically used for binary decision problems
  • Segmental-level decisions: integrate over time
  • HMM detectors
  • Hybrid ANN/Dynamic Bayesian Network
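To make "posterior per frame" concrete, here is a minimal sketch of a frame-level detector that turns one frame of acoustic features into a distribution over feature values. It uses a single linear layer followed by a softmax as the simplest possible stand-in for the neural networks mentioned above. The weights, class names, and two-dimensional "acoustics" are assumptions, not values from any trained system.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def frame_posteriors(frames, weights, biases, classes):
    """Return one dict of P(class | frame) per input frame."""
    posteriors = []
    for frame in frames:
        scores = [
            sum(w * x for w, x in zip(row, frame)) + b
            for row, b in zip(weights, biases)
        ]
        posteriors.append(dict(zip(classes, softmax(scores))))
    return posteriors


# Hypothetical parameters for a two-class manner detector.
CLASSES = ["stop", "fricative"]
WEIGHTS = [[2.0, -1.0], [-1.0, 2.0]]
BIASES = [0.0, 0.0]

post = frame_posteriors([[1.0, 0.0], [0.0, 1.0]], WEIGHTS, BIASES, CLASSES)
```

Each entry of `post` sums to one, which is what lets these outputs be combined later as probabilities rather than as raw classifier scores.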

26
Binary vs. n-ary features
  • Features can either be described as binary or
    n-ary if they can contrast
  • Binary: /t/ +stop, -fricative
  • N-ary: /t/ manner=stop
  • No real conclusion on which is better
  • Binary: more matched to SVM learning
  • N-ary: allows for discrimination among classes
  • Should a segment be allowed to be +stop
    +fricative?
  • Anecdotally (our lab) we find n-ary features
    slightly better
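The difference between the two encodings can be shown with a small data-structure sketch. A binary encoding keeps one flag per class, so several can be on at once. An n-ary encoding forces exactly one value. The manner inventory and function names below are assumptions made for illustration.

```python
MANNERS = ("stop", "fricative", "nasal", "approximant")  # assumed inventory


def to_binary(active):
    """Binary encoding: one True/False flag per manner class."""
    unknown = set(active) - set(MANNERS)
    if unknown:
        raise ValueError("unknown manner classes: %s" % sorted(unknown))
    return {manner: manner in active for manner in MANNERS}


def to_nary(flags):
    """N-ary encoding: a single manner value, so exactly one flag must be on."""
    on = [manner for manner in MANNERS if flags.get(manner)]
    if len(on) != 1:
        raise ValueError("n-ary manner needs exactly one active class")
    return on[0]


t_binary = to_binary({"stop"})                 # only the stop flag is on
t_nary = to_nary(t_binary)                     # 'stop'
both_on = to_binary({"stop", "fricative"})     # legal in the binary encoding
```

Calling `to_nary(both_on)` raises a `ValueError`. That is the code-level version of the question of whether a segment may be a stop and a fricative at once.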

27
Hierarchical representations
  • Phonological features are not truly independent
  • Chang et al (01): Place prediction improves if
    manner is known
  • ANN predicts P(place=x | manner=y, X) vs
    P(place=x | X)
  • Suggests need for hierarchical detectors
  • Rajamanohar & Fosler-Lussier (05): Cascading
    errors make chained decisions worse
  • Better to jointly model P(place=x, manner=y | X), or
    even derive P(place=x | X) from phone
    probabilities
  • Frankel et al (04): Hierarchy can be integrated
    as additional dependencies in DBN
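Deriving feature posteriors from phone probabilities amounts to summing the phone posteriors that share a feature value. The sketch below shows that marginalization, along with the joint and conditional quantities mentioned above, under an assumed five-phone inventory. The table and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical phone-to-(place, manner) table for a tiny assumed inventory.
PHONE_FEATURES = {
    "p": ("labial", "stop"),
    "t": ("alveolar", "stop"),
    "k": ("velar", "stop"),
    "s": ("alveolar", "fricative"),
    "m": ("labial", "nasal"),
}


def joint_posterior(phone_post):
    """P(place, manner | X): sum phone posteriors sharing both values."""
    joint = defaultdict(float)
    for phone, prob in phone_post.items():
        joint[PHONE_FEATURES[phone]] += prob
    return dict(joint)


def place_posterior(phone_post):
    """P(place | X): marginalize the joint over manner."""
    place = defaultdict(float)
    for (pl, _manner), prob in joint_posterior(phone_post).items():
        place[pl] += prob
    return dict(place)


def place_given_manner(phone_post, manner):
    """P(place | manner, X): renormalize the joint within one manner."""
    joint = joint_posterior(phone_post)
    norm = sum(prob for (_pl, m), prob in joint.items() if m == manner)
    if norm == 0:
        raise ValueError("manner has zero posterior mass")
    return {pl: prob / norm for (pl, m), prob in joint.items() if m == manner}


frame = {"p": 0.1, "t": 0.5, "k": 0.0, "s": 0.3, "m": 0.1}
places = place_posterior(frame)
stops = place_given_manner(frame, "stop")
```

Because everything is computed from one phone distribution, the place and manner estimates stay mutually consistent. A chained detector that commits to a manner decision first does not have that property.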

28
Combining features into higher-level structures
  • Once you have (frame-level) estimates of
    phonological features, need to combine
  • Temporal integration: Markov structures
  • Phonetic spatial integration: combining into
    higher-level units (phones, syllables, words)
  • Differences in methodologies
  • spatial first, then temporal
  • joint/factored spatio-temporal integration
  • phone-level temporal integration with spatial
    rescoring

29
Combining features into higher-level structures
  • Tandem ANN/HMM Systems
  • ANN feature posterior estimates are used as
    replacements for MFCCs for Mixture of Gaussians
    HMM system
  • We find decorrelation of features (via PCA)
    necessary to keep models well conditioned
  • Lattice rescoring with Landmarks
  • Maximum entropy models for local word
    discrimination
  • SVMs used as local features for MaxEnt model.
  • Dynamic Bayesian Models
  • Model asynchrony as a hidden variable
  • SVM outputs used as observations of features

Launay et al 02
Hasegawa-Johnson et al 05
Livescu 05
30
Combining features into higher-level structures
  • Conditional random fields
  • CRFs jointly model spatio-temporal integration
  • Probability expressed in terms of indicator
    functions s (state), t (transition)
  • Usually binary in NLP applications
  • Frame-level ANN posteriors are bounded
  • Probabilities can serve as observation feature
    functions
  • s_stop(/t/, x, i) = P(manner=stop | x_i)

Morris & Fosler-Lussier 06
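A toy linear-chain CRF shows how frame-level posteriors can act as real-valued state functions. It is a minimal sketch, not the system from the cited work. The two-phone label set, the weight tables `mu` (state) and `lam` (transition), and the posterior values are all assumptions. Normalization is done by brute-force enumeration, which is only feasible for tiny examples.

```python
import itertools
import math

LABELS = ("t", "s")  # assumed two-phone label set


def score(labels, frames, mu, lam):
    """Unnormalized log-score of one label sequence.

    State functions take the value of a frame-level posterior and are
    weighted by mu[(label, feature)]; transitions are weighted by
    lam[(previous_label, label)].
    """
    total = 0.0
    for i, label in enumerate(labels):
        for feature, posterior in frames[i].items():
            total += mu.get((label, feature), 0.0) * posterior
        if i > 0:
            total += lam.get((labels[i - 1], label), 0.0)
    return total


def sequence_prob(labels, frames, mu, lam):
    """P(labels | frames), normalized over every possible label sequence."""
    z = sum(
        math.exp(score(seq, frames, mu, lam))
        for seq in itertools.product(LABELS, repeat=len(frames))
    )
    return math.exp(score(labels, frames, mu, lam)) / z


# Hypothetical posteriors and weights.
FRAMES = [
    {"manner=stop": 0.9, "manner=fricative": 0.1},
    {"manner=stop": 0.8, "manner=fricative": 0.2},
    {"manner=stop": 0.1, "manner=fricative": 0.9},
]
MU = {("t", "manner=stop"): 2.0, ("s", "manner=fricative"): 2.0}
LAM = {("t", "t"): 0.5, ("s", "s"): 0.5}

best = max(
    itertools.product(LABELS, repeat=len(FRAMES)),
    key=lambda seq: sequence_prob(seq, FRAMES, MU, LAM),
)
```

A real implementation would replace the enumeration with the forward algorithm. The point here is only that the posteriors enter the score directly, scaled by learned weights.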
31
Conditional Random Fields
  • CRFs make no independence assumptions about input
  • Posteriors can be used directly without
    decorrelation
  • Can combine features, phones,
  • No assumption of temporal independence
  • Entire label sequence is modeled jointly
  • Monophone feature CRF phone recog. similar to
    triphone HMM
  • Learning parameters (λ, μ) determines importance
    of feature/phone relationships
  • Implicit model of partial phonological
    underspecification
  • Slow to train

32
Underspecification
  • All of these models learn what phonological
    information is important in higher-level
    processing
  • Ignoring canonical feature definitions for a
    phone is a form of underspecification
  • Traditional underspecification: some features are
    undefined for a particular phone
  • Weighted models: partial underspecification
  • When can you ignore phonetic information?
  • Crucially, when it doesn't help you disambiguate
    between word hypotheses

33
Underspecification
  • Example: unstressed syllables tend to show more
    phonetic variation than stressed syllables
  • Experiment: reduce phonetic representation for
    unstressed syllables to manner class
  • Allowing recognizer to choose best representation
    (phone/manner) during training (WSJ0)
  • Minor degradation for clean speech (9.9% vs. 9.1%
    WER)
  • Larger improvement in 10dB car noise (15.8% vs
    13.0% WER)
  • Moral: we don't need to have exact phonetic
    representation to decode words
  • But we may need to integrate more higher-level
    knowledge

Fosler-Lussier et al 05
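The reduction step in the experiment above, replacing phones in unstressed syllables with their manner class, can be sketched as a simple lexicon transform. This illustrates only the representation change, not the training procedure in which the recognizer chooses between phone and manner units. The manner table, class names, and syllabification of "washington" are assumptions.

```python
# Hypothetical phone-to-manner-class table (assumed class names).
MANNER = {
    "w": "GLIDE",
    "sh": "FRIC",
    "t": "STOP",
    "n": "NASAL",
    "ng": "NASAL",
    "aa": "VOWEL",
    "ix": "VOWEL",
    "ax": "VOWEL",
}


def underspecify(syllables):
    """Keep phones in stressed syllables; reduce the rest to manner class.

    syllables is a list of (phones, stressed) pairs.
    """
    units = []
    for phones, stressed in syllables:
        if stressed:
            units.extend(phones)
        else:
            units.extend(MANNER[phone] for phone in phones)
    return units


# Assumed syllabification and stress marking, for illustration only.
washington = [
    (["w", "aa"], True),
    (["sh", "ix", "ng"], False),
    (["t", "ax", "n"], False),
]
reduced = underspecify(washington)
```

Here `reduced` keeps `w aa` intact and turns the two unstressed syllables into `FRIC VOWEL NASAL STOP VOWEL NASAL`. Any word with the same stressed syllable and the same manner skeleton would become indistinguishable, which is why higher-level knowledge may be needed.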
34
Vision for the Future
  • Acoustic-phonetic variation is difficult
  • Still significant cause of errors in ASR
  • Underspecified models give a new way of looking
    at the problem
  • Rather than the "change x to y" model
  • Challenge for the field
  • Current techniques for accent modeling, intrinsic
    pronunciation variation are separate
  • Can we build a model that handles both?

35
Conclusions
  • We have come quite a distance since 1999
  • New methods for phonological feature detection
  • New methods for feature integration
  • New ways of thinking about variation:
    underspecification
  • Still have a long way to go
  • Integrating more knowledge sources
  • Stress, prosody, word confusability
  • Solving the pronunciation adaptation problem in a
    general way

36
Fin
37
An example feature grid
            g    ow   t    uw   w    aa   sh   ix   ng   t    ax   n
CLASS       OBS  VOW  OBS  VOW  SON  VOW  OBS  VOW  SON  OBS  VOW  SON
VOICED      VCD       VLS       VCD       VLS       VCD  VLS       VCD
CMANNER     SP   -    SP   -    AT   -    FE   -    NL   SP   -    NL
CPLACE      VR   -    AR   -    LB   -    PL   -    VR   AR   -    AR
VHEIGHT     -    MD   -    HH   -    LW   -    HH   -    -    MD   -
VFRONTNESS  -    BK   -    BK   -    BK   -    CL   -    -    CL   -
VROUND      -    RD   -    RD   -    ND   -    ND   -    -    ND   -
VTENSE      -    TE   -    TE   -    TE   -    LX   -    -    LX   -
Words: go (g ow), to (t uw), washington (w aa sh ix ng t ax n)