A glimpsing model of speech perception


1
A glimpsing model of speech perception

Martin Cooke & Sarah Simpson

Speech and Hearing Research, Department of Computer Science, University of Sheffield
http://www.dcs.shef.ac.uk/martin
2
Motivation: The nonstationarity paradox

Speech technology performance falls with the nonstationarity of the noise background

Simpson & Cooke (2003)
4
Possible factors

In a 1-speaker background, listeners can:
employ organisational cues from the background source to help segregate the foreground
employ schemas for both foreground and background
benefit from better glimpses of the speech target
But multi-speaker backgrounds have certain advantages:
less chance of informational masking
easier enhancement algorithm

5
Glimpsing opportunities
Spectro-temporal glimpse densities

Percentage of time-frequency regions with a locally-positive SNR
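
A glimpse density of this kind can be sketched as follows. This is a minimal illustration, assuming the speech and the masker are available separately as per-frame, per-channel energies; the function names `local_snr_db` and `glimpse_proportion` and the toy values are hypothetical, not taken from the talk.

```python
import math

def local_snr_db(speech_energy, noise_energy, floor=1e-12):
    """Local SNR in dB for one time-frequency cell (floor avoids log of zero)."""
    return 10.0 * math.log10(max(speech_energy, floor) / max(noise_energy, floor))

def glimpse_proportion(speech, noise, threshold_db=0.0):
    """Fraction of time-frequency cells whose local SNR exceeds threshold_db.

    speech, noise: lists of frames, each frame a list of channel energies.
    """
    total = 0
    glimpsed = 0
    for speech_frame, noise_frame in zip(speech, noise):
        for s, n in zip(speech_frame, noise_frame):
            total += 1
            if local_snr_db(s, n) > threshold_db:
                glimpsed += 1
    return glimpsed / total if total else 0.0

# Toy example: 2 frames x 3 channels, flat masker energy of 1.0 per cell.
speech = [[4.0, 0.5, 1.0], [0.1, 2.0, 3.0]]
noise = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
print(glimpse_proportion(speech, noise))
```

Cells at exactly 0 dB are not counted here, because the comparison is strict; whether the boundary should be included is a choice the slide does not specify.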
6
Glimpsing
Informal definition: a glimpse is some time-frequency region which contains a reasonably undistorted view of local signal properties

Precursors
Term used by Miller & Licklider (1950) to explain intelligibility of interrupted speech
Related to the multiple looks model of Viemeister & Wakefield (1991), which demonstrated intelligent temporal integration of tone bursts
Assmann & Summerfield (in press) suggest glimpsing and tracking as a way of understanding how listeners cope with adverse conditions
Culling & Darwin (1994) developed a glimpsing model to explain double vowel identification for small ΔF0s
de Cheveigné & Kawahara (1999) can be considered a glimpsing model of vowel identification
Close relation to missing data processing (Cooke et al, 1994)

7
Types of glimpses
Comodulated: e.g. Miller & Licklider (1950)
Spectral: e.g. Warren et al (1995)
General uncomodulated: e.g. Howard-Jones & Rosen (1993), Buss et al (2003)
8
Evidence from distorted speech
e.g. Drullman (1995) filtered noisy speech into 24 ¼-octave bands, extracted the temporal envelope in each band, and replaced those parts of the envelope below a target level with a constant value. Found intelligibility of 60% when 98% of the signal was missing
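
The envelope manipulation described above can be sketched for a single band. This is only an illustration of the idea: the target level, the constant fill value, and the helper names `floor_band_envelope` and `fraction_replaced` are assumptions, not Drullman's actual parameters.

```python
def floor_band_envelope(envelope, target_level, fill_value):
    """Replace envelope samples below target_level with a constant fill_value."""
    return [e if e >= target_level else fill_value for e in envelope]

def fraction_replaced(envelope, target_level):
    """Fraction of envelope samples that fall below target_level."""
    if not envelope:
        return 0.0
    return sum(1 for e in envelope if e < target_level) / len(envelope)

# Toy envelope for one band (arbitrary units, hypothetical values).
band_envelope = [0.2, 0.9, 0.5, 1.4, 0.1, 0.3, 1.1, 0.05]
floored = floor_band_envelope(band_envelope, target_level=0.5, fill_value=0.0)
print(floored)
print(fraction_replaced(band_envelope, target_level=0.5))
```

Applying the same operation in every band and raising the target level leaves only the envelope peaks, which is the sense in which most of the signal can be "missing" while some intelligibility remains.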
9
Glimpsing in natural conditions: the dominance effect

Although audio signals combine additively, the occlusion metaphor is more appropriate due to log-like compression in the auditory system

Consequently, most regions in a mixture are
dominated by one or other source, leaving very
few ambiguous regions, even for a pair of speech
signals mixed at 0 dB.
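
The dominance effect can be checked numerically. The sketch below assumes the energies of two uncorrelated sources add within a time-frequency cell, and compares the mixture level in dB with the level of the stronger source alone; the function names are hypothetical.

```python
import math

def to_db(energy):
    """Convert an energy value to decibels."""
    return 10.0 * math.log10(energy)

def logmax_error_db(energy_a, energy_b):
    """How far (in dB) the mixture level exceeds the stronger source's level."""
    return to_db(energy_a + energy_b) - max(to_db(energy_a), to_db(energy_b))

# Level difference between the two sources: 0 dB, 10 dB and 20 dB.
for ratio in (1.0, 10.0, 100.0):
    print(ratio, round(logmax_error_db(ratio, 1.0), 3))
```

Under this assumption the error is largest when the two sources are equal in a cell (about 3 dB) and falls below half a decibel once one source leads by 10 dB, so on a log scale the mixture looks much like the stronger source in all but the near-equal cells.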
10
Issues for a glimpsing model

What constitutes a useful glimpse?
Is sufficient information contained in glimpses?
How do listeners detect glimpses?
How can they be integrated?

Glimpse detection
Glimpse integration
11
Glimpsing study

Aims
Determine if glimpses contain sufficient information
Explore definition of useful glimpse
Comparison between listeners and model using natural VCV stimuli
Subset of Shannon et al (1999) corpus
V = /a/
C = b, d, g, p, t, k, m, n, l, r, f, v, s, z, sh, ch
Background source
Reversed multispeaker babble for N = 1, 8
Allows variation in glimpsing opportunities
3 SNRs (TMRs): 0, -6 and -12 dB
12 listeners heard 160 tokens in each condition
2 repeats × 16 VCVs × 5 male speakers

12
Identification results
[Figure panels: 1-speaker and 8-speaker backgrounds]
13
Glimpsing model

CDHMM employing missing data techniques
16 whole-word HMMs
8 states
4-component Gaussian mixture per state
Input representation
10 ms frames of modelled auditory excitation pattern (40 gammatone filters, Hilbert envelope, 8 ms smoothing)
NB: only simultaneous masking is modelled
Training
8 repetitions of each VCV by 5 male speakers per model
Testing
As for listeners, viz. 2 repetitions of each VCV by 5 male speakers
Performance in clean > 99%
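
One common missing data technique is marginalisation: the state likelihood is computed from the reliable (glimpsed) channels only, and the unreliable channels are integrated out. The sketch below assumes a diagonal-covariance Gaussian mixture and plain marginalisation; the slides do not say which variant the model used, and the function names and toy parameters are hypothetical.

```python
import math

def log_gauss(x, mean, var):
    """Log density of a univariate Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def masked_log_likelihood(frame, mask, weights, means, variances):
    """Log-likelihood of the reliable channels of one frame under a diagonal GMM.

    mask[i] is True where channel i is treated as a glimpse of the target.
    Channels with mask[i] False contribute nothing (they are integrated out).
    """
    component_logs = []
    for weight, mean_vec, var_vec in zip(weights, means, variances):
        log_p = math.log(weight)
        for x, reliable, mean, var in zip(frame, mask, mean_vec, var_vec):
            if reliable:
                log_p += log_gauss(x, mean, var)
        component_logs.append(log_p)
    # Log-sum-exp over mixture components for numerical stability.
    top = max(component_logs)
    return top + math.log(sum(math.exp(c - top) for c in component_logs))

# Toy 2-channel frame scored against a 2-component mixture.
weights = [0.5, 0.5]
means = [[0.0, 0.0], [1.0, 1.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
print(masked_log_likelihood([0.2, 5.0], [True, False], weights, means, variances))
```

In this toy case the second channel holds an outlying value, but because it is masked as unreliable it has no influence on the score.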

14
Model performance I: ideal glimpses

Ideal glimpses
All time-frequency regions whose local SNR exceeds a threshold
Optimum threshold: 0 dB
For this task, there is more than sufficient information in the glimpsed regions
Listeners perform suboptimally with respect to this glimpse definition

[Figure panels: 1-speaker and 8-speaker backgrounds]
15
Model performance: variation in detection threshold

Q: Can varying the local SNR threshold for glimpse detection produce a better match?
No choice of local SNR threshold provides a good fit to listeners
Closest fit shown (-6 dB)

[Figure panels: 1-speaker and 8-speaker backgrounds]
16
Analysis

Unreasonable to expect listeners to detect
individual glimpses in a sea of noise unless
glimpse region is large enough

18
Model performance: useable glimpses

Definition: glimpsed region must occupy at least N ERBs and T ms
Search over 1-15 ERBs, 10-100 ms, at various detection thresholds
Best match at:
6.3 ERBs (9 channels)
40 ms
0 dB local SNR threshold
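
One way to apply a minimum-size criterion like this is sketched below. It assumes a binary glimpse mask indexed as `mask[frame][channel]`, groups cells into 4-connected regions, and reads "occupy at least N ERBs and T ms" as a test on each region's extent in channels and frames. That reading, and the function name `useable_glimpses`, are assumptions; the slides do not give the exact procedure.

```python
def useable_glimpses(mask, min_channels, min_frames):
    """Keep only connected glimpse regions spanning enough channels and frames.

    mask: list of frames, each a list of booleans (True = local SNR above threshold).
    A region is kept if its extent is >= min_channels in frequency and
    >= min_frames in time (e.g. 9 channels and 4 frames of 10 ms).
    """
    n_frames = len(mask)
    n_channels = len(mask[0]) if mask else 0
    seen = [[False] * n_channels for _ in range(n_frames)]
    kept = [[False] * n_channels for _ in range(n_frames)]
    for t0 in range(n_frames):
        for f0 in range(n_channels):
            if not mask[t0][f0] or seen[t0][f0]:
                continue
            # Flood fill to collect one 4-connected region.
            region = []
            stack = [(t0, f0)]
            seen[t0][f0] = True
            while stack:
                t, f = stack.pop()
                region.append((t, f))
                for dt, df in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    u, g = t + dt, f + df
                    if (0 <= u < n_frames and 0 <= g < n_channels
                            and mask[u][g] and not seen[u][g]):
                        seen[u][g] = True
                        stack.append((u, g))
            frames = [t for t, _ in region]
            channels = [f for _, f in region]
            if (max(frames) - min(frames) + 1 >= min_frames
                    and max(channels) - min(channels) + 1 >= min_channels):
                for t, f in region:
                    kept[t][f] = True
    return kept

# Toy mask: one 3-frame x 2-channel region plus two isolated cells.
toy_mask = [[bool(v) for v in row] for row in (
    [1, 1, 0, 0, 1],
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 1, 0],
)]
filtered = useable_glimpses(toy_mask, min_channels=2, min_frames=3)
```

With the toy mask, the isolated cells are discarded and only the larger region survives, mirroring the argument that listeners are unlikely to detect very small glimpses.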

[Figure panels: 1-speaker and 8-speaker backgrounds]

Howard-Jones & Rosen (1993) suggested a limit of 2-4 bands for uncomodulated glimpsing
Buss et al (2003) found evidence for uncomodulated glimpsing in up to 9 bands

19
Consonant identification

Reasonable matches overall apart from b, s & z
However, little token-by-token agreement between common listener errors and model errors.
Why?

20
Factors
Confusability
Audibility of target
Informational masking
Energetic masking
Existence of schemas for target
Successful identification
Organisational cues in target
Existence of schemas for background
Organisational cues in background
21
Measuring energetic masking

Approach: resynthesise glimpses alone
Filter, time-reverse, refilter to remove phase distortion
Select regions based on local SNR mask
Results
Little difference for 1-speaker background, suggesting relatively low contribution of informational masking in this case (due to reversed masker?)
Larger difference for 8-speaker case, possibly due to unrealistic glimpses
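
The "filter, time-reverse, refilter" step is the standard forward-backward trick for cancelling a filter's phase response. The sketch below demonstrates it with a simple one-pole smoother standing in for a real filterbank channel; the filter choice, the coefficient and the function names are assumptions for illustration only.

```python
def one_pole_lowpass(signal, alpha):
    """Causal smoother y[n] = alpha*x[n] + (1-alpha)*y[n-1]; it delays the signal."""
    out = []
    state = 0.0
    for x in signal:
        state = alpha * x + (1.0 - alpha) * state
        out.append(state)
    return out

def zero_phase_filter(signal, alpha):
    """Filter, time-reverse, refilter, then reverse again to restore time order."""
    forward = one_pole_lowpass(signal, alpha)
    backward = one_pole_lowpass(forward[::-1], alpha)
    return backward[::-1]

# An impulse in the middle of a silent signal exposes the phase behaviour.
impulse = [0.0] * 201
impulse[100] = 1.0
single_pass = one_pole_lowpass(impulse, 0.3)
double_pass = zero_phase_filter(impulse, 0.3)
```

After a single pass the response trails only after the impulse. After the forward-backward pass it is symmetric about the impulse position, so the result carries no net delay, at the cost of applying the filter's magnitude response twice.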

[Figure panels: 1-speaker and 8-speaker backgrounds; conditions: glimpses alone, speech+noise]
22
Comparison with ideal model

Results
Ideal model performs well in excess of listeners
when supplied with precisely the same information
Possible reasons
Distortions
Glimpses do not occur in isolation: possibility that a noise background will help
Lack of nonsimultaneous masking model will
inflate model performance

Ideal (model)
Ideal? (listeners)
23
The glimpse decoder

Attempt at a unifying statistical theory for
primitive and model-driven processes in CASA
Basic idea: decoder not only determines the most likely speech hypothesis but also decides which glimpses to use
Key advantage: no longer need to rely on clean acoustics!
Can interpret (some) informational masking effects as the incorrect assignment of glimpses during signal interpretation
Barker, J., Cooke, M.P. & Ellis, D.P.W. Decoding speech in the presence of other sources, accepted for Speech Communication

24
Summary & outlook

Proposed a glimpsing model of speech identification in noise
Demonstrated sufficiency of information in target glimpses, at least for VCV task
Preliminary definition of useful glimpse gives good overall model-listener match
Introduced 2 procedures for measuring the amount of energetic masking: (i) via ASR, (ii) via glimpse resynthesis
Need nonsimultaneous masking model
Need to isolate effects due to schemas
Repeat using non-reversed speech to introduce more informational masking
Need to quantify effect of distortion in glimpse resynthesis

25
Masking noise can be beneficial
Warren et al (1995) demonstrated spectral
induction effect with 2 narrow bands of speech
with intervening noise
fullband
Cooke & Cunningham (in prep): spectral induction with single speech-bands.
26
Speech modulated noise

Speech modulated noise
As in Brungart (2001)
Model results and glimpse distributions indicate
increase in energetic masking for this type of
masker
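
A speech modulated noise masker can be sketched as noise whose amplitude follows the envelope of a speech signal. The version below is a rough illustration under that assumption: the envelope follower (rectify, then one-pole smoothing), the white Gaussian carrier, the smoothing constant and the function names are all hypothetical, and are not the procedure of Brungart (2001).

```python
import math
import random

def amplitude_envelope(signal, alpha=0.05):
    """Crude envelope follower: full-wave rectify, then one-pole smoothing."""
    env = []
    state = 0.0
    for x in signal:
        state = alpha * abs(x) + (1.0 - alpha) * state
        env.append(state)
    return env

def speech_modulated_noise(speech, seed=0, alpha=0.05):
    """Gaussian noise scaled sample-by-sample by the envelope of `speech`."""
    rng = random.Random(seed)  # fixed seed so the example is repeatable
    env = amplitude_envelope(speech, alpha)
    return [e * rng.gauss(0.0, 1.0) for e in env]

# Toy "speech": silence, a burst of a sinusoid, then silence again.
toy_speech = [0.0] * 50 + [math.sin(0.3 * n) for n in range(200)] + [0.0] * 50
smn = speech_modulated_noise(toy_speech)
```

The masker keeps the temporal dips of the original talker but carries no intelligible speech, which is why it is useful for separating energetic from informational masking.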

[Figure panels: natural speech; speech modulated noise. Conditions: natural, 1 spkr; natural, 8 spkr; SMN, 1 spkr; SMN, 8 spkr]
27
Speech modulated noise

Listeners perform better with SMN than predicted on the basis of reduced glimpses (cf SMN model), but not quite as well as they do with a natural speech masker
Suggests energetic masking is not the whole story (cf Brungart, 2001), but further work is needed to quantify the relative contribution of:
Release from IM
Absence of background models/cues