Title: From speech signal acoustics to perception
1 From speech signal acoustics to perception
- Louis C.W. Pols
- Institute of Phonetic Sciences (IFA)
- Amsterdam Center for Language and Communication (ACLC)
NATO-ASI "Dynamics of Speech Production and Perception", Il Ciocco, Tuscany, Italy, July 4, 2002
2 Overview
- how do we perceive (speech) dynamics?
- The Intelligent Ear. On the Nature of Sound Perception, by Reinier Plomp (2002)
- from psychoacoustics to speech perception
- (lack of) context, robustness, continuity
- V and C reduction, coarticulation
- perceptual compensation for artic. undershoot?
- speech efficiency
- conclusions
3 Various scientific preferences
- several biases have affected the history of (speech) hearing research (Plomp, 2002):
- dominance of sinusoidal tones as stimuli
- preference for microscopic approach (e.g., phoneme discrimination rather than intelligibility)
- emphasis on psychophysical (rather than cognitive) aspects of hearing
- clean stimuli in the lab rather than the acoustic reality of the outside world (disruptive sounds)
4 From psychoacoustics to speech perception
- duration, pitch, loudness, timbre, direction
- absolute and masked threshold, jnd, discrimination
- continuity
- complexity (pure to complex tone, voicing)
- effect of context, meaning (intelligibility), frequency of occurrence
- phoneme more text-guided than perceived
- speech perceptual tasks
- phoneme to sentence; identification, discrimination, matching
5 Detection thresholds and jnd
multi-harmonic, simple, stationary signals; single-formant-like periodic signals
6 Perceiving speech-like transitions
- Ph.D. thesis A. van Wieringen (1995)
- Perceiving dynamic speechlike sounds. Psycho-acoustics and speech perception
- see also van Wieringen & Pols, Acustica 84 (1998), 520-528
- stimulus characteristics
- (segmented and/or reversed) natural or synthetic
- tone glide, single- or multi-formant transition
- isolated transition, or initial or final transition with steady state
- converging or diverging transitions (variable duration or slope)
- task: jnd/DL, matching, absolute identification, classification
7 DL for short speech-like transitions
Adapted from van Wieringen & Pols (1998), Discrimination of short and rapid speechlike transitions, Acta Acustica 84, 520-528
8 Perceiving (speech) dynamics
- vowel perception with or without transitions?
- our claims (van Son, IFA Proc. 17 (1993)):
- evidence for compensatory processes, i.e. perceptual overshoot and dynamic specification, only when in an appropriate context
- synthetic isolated dynamic formant tracks lead to perceptual undershoot (averaging)
- silent center studies are ambiguous
- conclusion: information in formant dynamics is only used when Vs are heard in an appropriate context
10 Vowel identification
- compare V responses for dynamic stimuli with those for static stimuli
- calculate net shift in V responses per onglide (CV), complete (CVC), or offglide (VC)
- result: responses average over the trailing part of the formant track
11 Perceptual undershoot
Net shift in vowel responses to tokens with
curved formant tracks vs. stationary tokens. All
values significant, except small open triangles
12 Effect of local context
- Perisegmental speech improves consonant and vowel identification, van Son & Pols, Speech Comm. 29, 1-22 (1999)
- also Phoneme recognition as a function of task and context, IFA Proc. 24, 27-38 (2001) and Proc. SPRAAC, 25-30 (2001)
- also Pols & van Son (1993), Acoustics and perception of dynamic vowel segments, Speech Comm. 13, 135-147
13 V and C identification
- gated tokens from 120 CVC speech fragments taken from a long text reading
- 50 ms V kernel, V transitions, C part (L/R)
- stimuli randomized; V identification (17 Ss) and Ci and Cf identification (15 Ss)
- results
- phoneme identification benefits from extra speech
- left context more beneficial than right context
- better identification when the other member of the pair was also identified correctly (context effect)
14, 15 Error rates of vowel identification for the individual stimulus token types. Long-short vowel errors (/a-a, -o/) are ignored
16 V and C in CV tokens were identified better when the other member of the pair was identified correctly
17 Effect of (lack of) context
- 100 Dutch listeners identifying V segments
- Vowel contrast reduction, Koopmans-van Beinum (1980)
3 conditions; per speaker: identification score (%) and ASC
                          M1           M2           F1           F2           Av.
                          %     ASC    %     ASC    %     ASC    %     ASC    %     ASC
isolated V (3)            95.2  433    88.9  404    88.0  447    86.4  634    89.6  480
words (5)                 88.1  406    78.8  320    84.9  374    85.3  529    84.3  407
unstr., free conv. (10)   31.2  174    28.7  119    33.3  209    38.9  255    33.0  189
$\mathrm{ASC} = \frac{1}{n} \sum_{i=1}^{n} (LF_i - \overline{LF_i})^2$ (total variance), with $LF_i = 100 \cdot {}^{10}\log F_i$
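The ASC definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each vowel is taken to be a tuple of formant frequencies in Hz (here F1 and F2), the "total variance" is read as the mean squared distance to the centroid summed over formants, and the function name `asc` and all formant values are hypothetical, not data from the slide.

```python
import math

def asc(vowel_formants_hz):
    """Mean squared distance of each vowel's log-formant vector LF_i
    from the centroid of all n vowels (assumed reading of 'total variance')."""
    # LF_i = 100 * log10(F_i), applied to every formant of every vowel
    lf = [[100.0 * math.log10(f) for f in vowel] for vowel in vowel_formants_hz]
    n = len(lf)
    dims = len(lf[0])
    # Centroid of the vowel system in the log-formant space
    centroid = [sum(v[d] for v in lf) / n for d in range(dims)]
    # Sum squared deviations over formants, then average over vowels
    return sum(
        sum((v[d] - centroid[d]) ** 2 for d in range(dims)) for v in lf
    ) / n

# Hypothetical (F1, F2) values in Hz, for illustration only
clear_vowels = [(300.0, 2300.0), (750.0, 1300.0), (320.0, 800.0)]
reduced_vowels = [(400.0, 1800.0), (550.0, 1400.0), (420.0, 1100.0)]

print(asc(clear_vowels) > asc(reduced_vowels))
```

A vowel system whose members lie closer together in the log-formant plane gives a lower ASC, which is the direction of the trend from isolated vowels to unstressed conversational vowels in the table.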
18 Human word intelligibility vs. noise
from Ph.D. thesis H. Steeneken (1992), On measuring and predicting speech intelligibility
19 Robustness to degraded speech
- speech: time-modulated signal in frequency bands
- relatively insensitive to (spectral) distortions
- prerequisite for digital hearing aid
- modulating spectral slope: -5 to 5 dB/oct, 0.25-2 Hz
- temporal smearing of envelope modulation
- ca. 4 Hz max. in modulation spectrum (syllable)
- LP > 4 Hz and HP < 8 Hz: little effect on intelligibility
- spectral envelope smearing
- for BW > 1/3 oct, masked SRT starts to degrade
- (for references, see keynote paper Pols in Proc. ICPhS99)
20 Some examples
- partly reversed speech (Saberi & Perrott, Nature, 4/99)
- fixed duration segments time reversed or shifted in time
- perfect sentence intelligibility up to 50 ms
- (demo: every 50 ms reversed; original)
- low frequency modulation envelope (3-8 Hz) vs. acoustic spectrum
- syllable as information unit? (S. Greenberg)
- gap and click restoration (Warren)
- gating experiments
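The partly reversed speech manipulation (fixed-duration segments, each time-reversed in place) can be sketched as follows. This is a minimal illustration, not the procedure of the cited study: the function name `reverse_segments`, the plain-list signal representation, and the handling of a shorter final segment are assumptions.

```python
def reverse_segments(samples, sample_rate_hz, segment_ms=50.0):
    """Time-reverse every consecutive segment of fixed duration.

    The order of the segments is kept; only the samples inside each
    segment are reversed. A shorter final segment is reversed as well.
    """
    seg_len = max(1, int(round(sample_rate_hz * segment_ms / 1000.0)))
    out = []
    for start in range(0, len(samples), seg_len):
        out.extend(reversed(samples[start:start + seg_len]))
    return out

# Toy signal: 10 samples at 1000 Hz, reversed in 4 ms (= 4-sample) segments
toy = list(range(10))
print(reverse_segments(toy, sample_rate_hz=1000, segment_ms=4.0))
```

Because the segment boundaries do not move, applying the function twice with the same settings restores the original signal.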
21 Continuity, especially while masked
- continuity effect (Miller & Licklider), auditory induction (Warren), pulsation threshold (Houtgast)
- also for gliding tones
- also for complex tones
- also for pitch
- fission, fusion
- segregation, streaming
- phonemic restoration
22 V and C reduction, coarticulation
- spectral variability is not random but, at least partly, speaker-, style-, and context-specific
- read vs. spontaneous; stressed vs. unstressed
- not just for vowels, but also for consonants
- duration, spectral balance
- intervocalic sound energy difference
- F2 slope difference, locus equation
23 C-duration, C error rate
Mean consonant duration
Mean error rate for C identification
791 VCV pairs (read and spontaneous, stressed and unstressed segments; one male speaker); C-identification by 22 Dutch subjects
Adapted from van Son & Pols (Eurospeech97)
24 Perception of acoustic V reduction
- Ph.D. thesis Dick van Bergem (1995)
- Acoustic and lexical vowel reduction
- lexical V reduction: Fr /betõ/ vs. Du /b@tOn/
- acoustic V reduction
- Du miljoen as /mIljun/ or as /m@ljun/
- identify the unstressed vowels (as V or @)
- by 20 listeners (8 M, 12 F)
- in 47 words (cond. W and S)
- or 20 words (cond. P), like milJOEN or biosCOOP
- spoken by 20 male speakers (2280 stimuli)
25 Adapted from van Bergem (1995)
Conclusion: vowel reduction is not centralization but contextual assimilation
26 Speech efficiency
- speech is most efficient if it contains only the information needed to understand it
- Speech is the missing information (Lindblom, JASA 96)
- less information needed for more predictable things
- shorter duration and more spectral reduction for high-frequency syllables and words
- C-confusion correlates with acoustic factors (duration, CoG) and with information content (syll./word freq.): I(x) = -log2(Prob(x)) in bits
(see van Son, Koopmans-van Beinum, and Pols (ICSLP98))
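As a small worked example of the information measure I(x) = -log2(Prob(x)), the sketch below estimates Prob(x) from relative frequencies in a token list. The function name `information_content` and the toy word counts are hypothetical; they are not taken from the corpus used in the cited study.

```python
import math
from collections import Counter

def information_content(tokens):
    """Return I(x) = -log2(Prob(x)) in bits for each distinct token,
    with Prob(x) estimated as the relative frequency of x."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {x: -math.log2(c / total) for x, c in counts.items()}

# Toy corpus of 8 word tokens (illustrative counts only)
toy_tokens = ["de"] * 4 + ["het"] * 2 + ["boek", "fiets"]
info = information_content(toy_tokens)
for word in ("de", "het", "boek"):
    print(word, info[word])
```

Frequent items carry fewer bits than rare ones, which is the sense in which more predictable syllables and words need less acoustic information.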
27 Correlation between consonant confusion and 4 measures indicated
Dutch male speaker; 20 min. read/spontaneous (R/S); 12k syllables, 8k words; 791 VCV R/S (308 lexically stressed, 483 unstressed); C identification by 22 Ss
Adapted from van Son et al. (Proc. ICSLP98)
28 Conclusions
- perceiving speech (segments) very much depends on speech quality and context
- presenting segments in isolation is also a kind of context
- formant transitions are only interpreted properly (perceptual compensation for spectro-temporal undershoot) when presented in an appropriate context
- reduced V are best perceived as schwa if transitions are contextually assimilated