Title: A glimpsing model of speech perception
1. A glimpsing model of speech perception
- Martin Cooke & Sarah Simpson
- Speech and Hearing Research, Department of Computer Science, University of Sheffield
- http://www.dcs.shef.ac.uk/martin
2. Motivation: The nonstationarity paradox
- Speech technology performance falls with the nonstationarity of the noise background (Simpson & Cooke, 2003)
4. Possible factors
- In a 1-speaker background, listeners can
  - employ organisational cues from the background source to help segregate foreground
  - employ schemas for both foreground and background
  - benefit from better glimpses of the speech target
- but multi-speaker backgrounds have certain advantages
  - less chance of informational masking
  - easier enhancement algorithm
5. Glimpsing opportunities
Spectro-temporal glimpse densities: percentage of time-frequency regions with a locally positive SNR
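As a rough illustration of how such a density could be computed, the sketch below counts the share of time-frequency cells whose local SNR is positive. It assumes that the speech and noise energies are available separately for every cell, which is only possible when the premixed signals are known. The function names and the toy values are hypothetical and are not taken from the study.

```python
import math

def local_snr_db(speech_energy, noise_energy, floor=1e-12):
    """Local SNR in dB for one time-frequency cell (energies are linear power)."""
    return 10.0 * math.log10(max(speech_energy, floor) / max(noise_energy, floor))

def glimpse_density(speech, noise, threshold_db=0.0):
    """Percentage of cells whose local SNR exceeds threshold_db.

    speech, noise: equal-shaped lists of rows (channels) of per-frame energies.
    """
    total = 0
    glimpsed = 0
    for s_row, n_row in zip(speech, noise):
        for s, n in zip(s_row, n_row):
            total += 1
            if local_snr_db(s, n) > threshold_db:
                glimpsed += 1
    return 100.0 * glimpsed / total

# Toy 2-channel x 4-frame example (hypothetical values).
speech = [[4.0, 1.0, 0.5, 9.0],
          [0.1, 2.0, 3.0, 0.2]]
noise = [[1.0, 2.0, 1.0, 1.0],
         [1.0, 1.0, 1.0, 1.0]]
density = glimpse_density(speech, noise)
```

In this toy case four of the eight cells have speech energy above noise energy, so half of the cells count as glimpses.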
6. Glimpsing
Informal definition: a glimpse is some time-frequency region which contains a reasonably undistorted view of local signal properties.
- Precursors
  - Term used by Miller & Licklider (1950) to explain intelligibility of interrupted speech
  - Related to the multiple looks model of Viemeister & Wakefield (1991), which demonstrated intelligent temporal integration of tone bursts
  - Assmann & Summerfield (in press) suggest glimpsing and tracking as a way of understanding how listeners cope with adverse conditions
  - Culling & Darwin (1994) developed a glimpsing model to explain double vowel identification for small ΔF0s
  - de Cheveigné & Kawahara (1999) can be considered a glimpsing model of vowel identification
  - Close relation to missing data processing (Cooke et al, 1994)
7. Types of glimpses
- Comodulated, e.g. Miller & Licklider (1950)
- Spectral, e.g. Warren et al (1995)
- General uncomodulated, e.g. Howard-Jones & Rosen (1993), Buss et al (2003)
8. Evidence from distorted speech
e.g. Drullman (1995) filtered noisy speech into 24 ¼-octave bands, extracted the temporal envelope in each band, and replaced those parts of the envelope below a target level with a constant value. Found intelligibility of 60% when 98% of the signal was missing.
9. Glimpsing in natural conditions: the dominance effect
- Although audio signals combine additively, the occlusion metaphor is more appropriate due to log-like compression in the auditory system
- Consequently, most regions in a mixture are dominated by one or other source, leaving very few ambiguous regions, even for a pair of speech signals mixed at 0 dB.
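The dominance effect can be checked with a short worked calculation. Assuming the two sources are uncorrelated, so that their powers add, the mixture level on a dB scale lies at most about 3 dB above the stronger source, and the gap shrinks quickly as the level difference grows. The sketch below is illustrative only and its function names are hypothetical.

```python
import math

def db(power):
    """Convert linear power to decibels."""
    return 10.0 * math.log10(power)

def mixture_excess_db(level_difference_db):
    """How far (in dB) the mixture level lies above the stronger source
    when the weaker source is level_difference_db below it and powers add."""
    weaker = 10.0 ** (-level_difference_db / 10.0)  # stronger source has power 1
    return db(1.0 + weaker)

# Excess of the mixture over the stronger source for several level differences.
excess = {d: mixture_excess_db(d) for d in (0, 3, 6, 10, 20)}
```

Even when the two sources are equal in a cell, the mixture differs from either one by only about 3 dB. Once one source leads by 10 dB or more, the mixture is within half a dB of the dominant source, which is why taking the larger of the two log energies is a reasonable approximation.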
10. Issues for a glimpsing model
- What constitutes a useful glimpse?
- Is sufficient information contained in glimpses?
- How do listeners detect glimpses?
- How can they be integrated?
Glimpse detection
Glimpse integration
11. Glimpsing study
- Aims
  - Determine if glimpses contain sufficient information
  - Explore definition of useful glimpse
- Comparison between listeners and model using natural VCV stimuli
  - Subset of Shannon et al (1999) corpus
  - V = /a/
  - C = b, d, g, p, t, k, m, n, l, r, f, v, s, z, sh, ch
- Background source
  - reversed multispeaker babble for N = 1, 8
  - Allows variation in glimpsing opportunities
- 3 SNRs (TMRs): 0, -6 and -12 dB
- 12 listeners heard 160 tokens in each condition
  - 2 repeats x 16 VCVs x 5 male speakers
12. Identification results
[Figure panels: 1-speaker; 8-speaker]
13. Glimpsing model
- CDHMM employing missing data techniques
  - 16 whole-word HMMs
  - 8 states
  - 4-component Gaussian mixture per state
- Input representation
  - 10 ms frames of modelled auditory excitation pattern (40 gammatone filters, Hilbert envelope, 8 ms smoothing)
  - NB only simultaneous masking is modelled
- Training
  - 8 repetitions of each VCV by 5 male speakers per model
- Testing
  - As for listeners, viz. 2 repetitions of each VCV by 5 male speakers
- Performance in clean > 99%
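One common missing-data technique is marginalisation: with diagonal-covariance Gaussians, integrating out the unreliable feature dimensions amounts to leaving them out of the state likelihood. The sketch below shows this for a single Gaussian-mixture state. It is a minimal illustration under that assumption and is not necessarily the exact scheme used in the model described here. All names and parameter values are hypothetical.

```python
import math

def log_gauss_1d(x, mean, var):
    """Log density of a univariate Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def marginal_log_likelihood(frame, mask, weights, means, variances):
    """Log-likelihood of the reliable part of `frame` under a diagonal GMM.

    mask[i] is True where channel i is a glimpse (reliable); other channels
    are integrated out, which for diagonal covariances means skipping them.
    """
    component_logs = []
    for w, mu, var in zip(weights, means, variances):
        ll = math.log(w)
        for x, reliable, m, v in zip(frame, mask, mu, var):
            if reliable:
                ll += log_gauss_1d(x, m, v)
        component_logs.append(ll)
    # log-sum-exp over mixture components for numerical stability
    peak = max(component_logs)
    return peak + math.log(sum(math.exp(c - peak) for c in component_logs))

# Toy example: 3 channels, 2 mixture components (hypothetical parameters).
weights = [0.5, 0.5]
means = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
variances = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
frame = [0.2, 5.0, 0.1]  # channel 1 is swamped by noise
full = marginal_log_likelihood(frame, [True, True, True], weights, means, variances)
glimpsed = marginal_log_likelihood(frame, [True, False, True], weights, means, variances)
```

Leaving out the noise-dominated channel stops it from dragging the likelihood down, so the state score depends only on the glimpsed evidence.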
14. Model performance I: ideal glimpses
- Ideal glimpses
  - All time-frequency regions whose local SNR exceeds a threshold
  - Optimum threshold: 0 dB
- For this task, there is more than sufficient information in the glimpsed regions
- Listeners perform suboptimally with respect to this glimpse definition
[Figure panels: 1-speaker; 8-speaker]
15. Model performance: variation in detection threshold
- Q: Can varying the local SNR threshold for glimpse detection produce a better match?
- No choice of local SNR threshold provides a good fit to listeners
- Closest fit shown (-6 dB)
[Figure panels: 1-speaker; 8-speaker]
16. Analysis
- Unreasonable to expect listeners to detect
individual glimpses in a sea of noise unless
glimpse region is large enough
18. Model performance: useable glimpses
- Definition: glimpsed region must occupy at least N ERBs and T ms
- Search over 1-15 ERBs, 10-100 ms, at various detection thresholds
- Best match at
  - 6.3 ERBs (9 channels)
  - 40 ms
  - 0 dB local SNR threshold
[Figure panels: 1-speaker; 8-speaker]
- Howard-Jones & Rosen (1993) suggested 2-4 bands limit for uncomodulated glimpsing
- Buss et al (2003) found evidence for uncomodulated glimpsing in up to 9 bands
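One possible reading of the minimum-extent definition can be sketched as follows: find connected regions in the local SNR mask and keep only those that span enough frequency channels and enough frames. The slides do not state how region extent was measured, so the 4-connectivity and the span-based test used here are assumptions, and the function name is hypothetical.

```python
def useable_glimpses(mask, min_channels, min_frames):
    """Keep only connected glimpse regions spanning enough channels and frames.

    mask: list of rows (frequency channels) of booleans over frames.
    Returns a new mask of the same shape.
    """
    n_ch = len(mask)
    n_fr = len(mask[0]) if mask else 0
    seen = [[False] * n_fr for _ in range(n_ch)]
    out = [[False] * n_fr for _ in range(n_ch)]
    for c0 in range(n_ch):
        for f0 in range(n_fr):
            if not mask[c0][f0] or seen[c0][f0]:
                continue
            # Flood-fill one 4-connected region starting from this cell.
            stack = [(c0, f0)]
            seen[c0][f0] = True
            region = []
            while stack:
                c, f = stack.pop()
                region.append((c, f))
                for dc, df in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    cc, ff = c + dc, f + df
                    if (0 <= cc < n_ch and 0 <= ff < n_fr
                            and mask[cc][ff] and not seen[cc][ff]):
                        seen[cc][ff] = True
                        stack.append((cc, ff))
            # Extent = number of distinct channels and frames the region touches.
            channels = {c for c, _ in region}
            frames = {f for _, f in region}
            if len(channels) >= min_channels and len(frames) >= min_frames:
                for c, f in region:
                    out[c][f] = True
    return out

# Toy mask: a 2-channel x 3-frame block plus one isolated cell.
mask = [
    [True,  True,  True,  False, False],
    [True,  True,  True,  False, True],
    [False, False, False, False, False],
]
kept = useable_glimpses(mask, min_channels=2, min_frames=3)
```

With 10 ms frames, the best-match values quoted above would correspond to `min_channels=9` and `min_frames=4` under this reading.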
19. Consonant identification
- Reasonable matches overall apart from b, s and z
- However, little token-by-token agreement between common listener errors and model errors.
- Why?
20. Factors
Confusability
Audibility of target
Informational masking
Energetic masking
Existence of schemas for target
Successful identification
Organisational cues in target
Existence of schemas for background
Organisational cues in background
21. Measuring energetic masking
- Approach: resynthesise glimpses alone
  - Filter, time-reverse, refilter to remove phase distortion
  - Select regions based on local SNR mask
- Results
  - Little difference for 1-speaker background, suggesting relatively low contribution of informational masking in this case (due to reversed masker?)
  - Larger difference for 8-speaker case, possibly due to unrealistic glimpses
[Figure panels: 1-speaker; 8-speaker. Conditions: glimpses alone; speech+noise]
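The filter, time-reverse, refilter step can be sketched with any causal filter. The example below uses a trivial one-pole low-pass as a stand-in for the filterbank, which is an assumption made only to keep the illustration short. Running the filter forwards and then backwards cancels the phase delay, so an impulse stays centred instead of being smeared to the right.

```python
def one_pole_lowpass(x, a=0.5):
    """Causal one-pole low-pass: y[n] = a*y[n-1] + (1-a)*x[n]."""
    y = []
    prev = 0.0
    for sample in x:
        prev = a * prev + (1.0 - a) * sample
        y.append(prev)
    return y

def zero_phase(x, a=0.5):
    """Filter, time-reverse, refilter, then reverse back to the original time axis."""
    forward = one_pole_lowpass(x, a)
    backward = one_pole_lowpass(forward[::-1], a)
    return backward[::-1]

# Impulse in the middle of a short signal (hypothetical test input).
n, centre = 41, 20
impulse = [0.0] * n
impulse[centre] = 1.0
causal = one_pole_lowpass(impulse)   # response trails after the impulse only
symmetric = zero_phase(impulse)      # response is centred on the impulse
```

The single forward pass responds only after the impulse, whereas the two-pass output peaks at the impulse position and is symmetric around it, apart from negligible edge effects.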
22. Comparison with ideal model
- Results
  - Ideal model performs well in excess of listeners when supplied with precisely the same information
- Possible reasons
  - Distortions
  - Glimpses do not occur in isolation: possibility that a noise background will help
  - Lack of nonsimultaneous masking model will inflate model performance
[Figure legend: Ideal (model); Ideal? (listeners)]
23. The glimpse decoder
- Attempt at a unifying statistical theory for primitive and model-driven processes in CASA
- Basic idea: decoder not only determines the most likely speech hypothesis but also decides which glimpses to use
- Key advantage: no longer need to rely on clean acoustics!
- Can interpret (some) informational masking effects as the incorrect assignment of glimpses during signal interpretation
- Barker, J., Cooke, M.P. & Ellis, D.P.W., Decoding speech in the presence of other sources, accepted for Speech Communication
24. Summary & outlook
- Proposed a glimpsing model of speech identification in noise
- Demonstrated sufficiency of information in target glimpses, at least for VCV task
- Preliminary definition of useful glimpse gives good overall model-listener match
- Introduced 2 procedures for measuring the amount of energetic masking: (i) via ASR, (ii) via glimpse resynthesis
- Need nonsimultaneous masking model
- Need to isolate effects due to schemas
- Repeat using non-reversed speech to introduce more informational masking
- Need to quantify effect of distortion in glimpse resynthesis
25. Masking noise can be beneficial
- Warren et al (1995) demonstrated a spectral induction effect with 2 narrow bands of speech with intervening noise
- Cooke & Cunningham (in prep): spectral induction with single speech bands
[Figure label: fullband]
26. Speech modulated noise
- Speech modulated noise (SMN)
  - As in Brungart (2001)
- Model results and glimpse distributions indicate an increase in energetic masking for this type of masker
[Figure panels: natural speech; speech modulated noise. Conditions: natural, 1 spkr; natural, 8 spkr; SMN, 1 spkr; SMN, 8 spkr]
27. Speech modulated noise
- Listeners perform better with SMN than predicted on the basis of reduced glimpses (cf SMN model), but not quite as well as they do with natural speech masker
- Suggests energetic masking is not the whole story (cf Brungart, 2001), but further work is needed to quantify the relative contribution of
  - Release from IM
  - Absence of background models/cues