From speech signal acoustics to perception

1
From speech signal acoustics to perception
  • Louis C.W. Pols
  • Institute of Phonetic Sciences (IFA)
  • Amsterdam Center for Language and Communication
    (ACLC)

NATO-ASI "Dynamics of Speech Production and
Perception", Il Ciocco, Tuscany,
Italy, July 4, 2002
2
Overview
  • how do we perceive (speech) dynamics?
  • The Intelligent Ear. On the Nature of Sound
    Perception, by Reinier Plomp (2002)
  • from psychoacoustics to speech perception
  • (lack of) context, robustness, continuity
  • V and C reduction, coarticulation
  • perceptual compensation for artic. undershoot?
  • speech efficiency
  • conclusions

3
Various scientific preferences
  • several biases have affected the history of
    (speech) hearing research (Plomp, 2002)
  • dominance of sinusoidal tones as stimuli
  • preference for microscopic approach (e.g.,
    phoneme discrimination rather than
    intelligibility)
  • emphasis on psychophysical (rather than
    cognitive) aspects of hearing
  • clean stimuli in the lab rather than the acoustic
    reality of the outside world (disruptive sounds)

4
Psychoacoustics - speech perc.
  • duration, pitch, loudness, timbre, direction
  • absolute and masked threshold, jnd, discrim.
  • continuity
  • complexity (pure - complex tone, voicing)
  • effect of context, meaning (intell.), freq. occ.
  • phoneme more text-guided than perceived
  • speech perceptual tasks
  • phoneme > sent.; identif., discrim., matching

5
Detection thresholds and jnd
multi-harmonic, simple, stationary signals;
single-formant-like periodic signals
6
Perceiving speech-like trans.
  • Ph.D thesis A. van Wieringen (1995)
  • Perceiving dynamic speechlike sounds.
    Psycho-acoustics and speech perception
  • see also vWie & Pols, Acustica 84 (1998) 520-528
  • stimulus characteristics
  • (segmented and/or reversed) natural or synthetic
  • tone glide; single- or multi-formant transition
  • isolated trans.; initial or final trans. with
    steady st.
  • converg. or diverg. trans. (var. duration or
    slope)
  • task: jnd/DL, matching, abs. ident., classif.

7
DL for short speech-like transitions
Adopted from van Wieringen & Pols (1998),
"Discrimination of short and rapid speechlike
transitions", Acta Acustica 84, 520-528
8
Perceiving (speech) dynamics
  • vowel perception with or w/o transitions?
  • our claims (vSon, IFA Proc. 17 (1993))
  • only evidence for compensatory processes, i.e.
    perceptual-overshoot and dynamic-specification,
    when in an appropriate context
  • synthetic isolated dynamic formant tracks lead to
    perceptual undershoot (averaging)
  • silent center studies are ambiguous
  • concl.: info in formant dynamics is only used
    when Vs are heard in appropriate context

9
10
Vowel identification
  • compare V responses for dynamic stimuli with
    those for static stimuli
  • calculate net shift in V responses per onglide
    (CV), complete (CVC), or offglide (VC)
  • result: responses average over the trailing part
    of the formant track

11
Perceptual undershoot
Net shift in vowel responses to tokens with
curved formant tracks vs. stationary tokens. All
values significant, except small open triangles
12
Effect of local context
  • "Perisegmental speech improves consonant and
    vowel identification", vSon & Pols, Speech Comm.
    29, 1-22 (1999)
  • also "Phoneme recognition as a function of task
    and context", IFA Proc. 24, 27-38 (2001) and
    Proc. SPRAAC, 25-30 (2001)
  • also Pols & vSon (1993), "Acoustics and
    perception of dynamic vowel segments", Speech
    Comm. 13, 135-147

13
V and C identification
  • gated tokens from 120 CVC speech fragments taken
    from a long text reading
  • 50 ms V kernel, V trans., C part (L/R)
  • stimuli randomized; V identification (17 Ss) and
    Ci and Cf identification (15 Ss)
  • results
  • phoneme identification benefits from extra speech
  • left context more beneficial than right context
  • better identification when also other member of
    pair was identified correctly (context effect)

14

15
Error rates of vowel identification for the
individual stimulus token types. Long-short vowel
errors (/a-a, -o/) are ignored
16

V and C in CV tokens were identified better when
the other member of the pair was identified
correctly
17
Effect of (lack of) context
  • 100 Dutch listeners identifying V segments
  • Vowel contrast reduction, K-vBeinum (1980)

3 conditions                M1           M2           F1           F2           Av.
                          %    ASC     %    ASC     %    ASC     %    ASC     %    ASC
isolated V (3)          95.2   433   88.9   404   88.0   447   86.4   634   89.6   480
words (5)               88.1   406   78.8   320   84.9   374   85.3   529   84.3   407
unstr., free conv. (10) 31.2   174   28.7   119   33.3   209   38.9   255   33.0   189
$ASC = \frac{1}{n} \sum_{i=1}^{n} \left( LF_i - \overline{LF} \right)^2$ (total variance),
where $LF_i = 100 \cdot \log_{10} F_i$
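
The ASC measure on this slide (total variance of log-scaled formant values) could be computed roughly as follows. This is only a sketch under stated assumptions: each vowel is taken to be a point of formant frequencies in Hz (e.g. F1 and F2), the total variance is taken as the mean squared distance of those points to their centroid, and the formant values in the example are made up for illustration, not data from the study. The function names are hypothetical.

```python
import math

def log_formant(f_hz):
    """Log-scaled formant value, LF = 100 * log10(F), with F in Hz."""
    return 100.0 * math.log10(f_hz)

def acoustic_system_contrast(vowels):
    """Total variance of a set of vowel points in log-formant space.

    vowels: list of tuples of formant frequencies in Hz, e.g. (F1, F2).
    Assumption: total variance = mean squared Euclidean distance of the
    log-scaled points to their centroid.
    """
    points = [tuple(log_formant(f) for f in v) for v in vowels]
    n = len(points)
    dims = len(points[0])
    centroid = [sum(p[d] for p in points) / n for d in range(dims)]
    squared_distances = [
        sum((p[d] - centroid[d]) ** 2 for d in range(dims)) for p in points
    ]
    return sum(squared_distances) / n

# Illustrative (not measured) F1/F2 values in Hz for a spread-out and a
# reduced vowel system; the reduced system should give the smaller ASC.
spread = [(300, 2300), (750, 1300), (350, 800)]
reduced = [(400, 1800), (600, 1400), (450, 1100)]
print(acoustic_system_contrast(spread) > acoustic_system_contrast(reduced))
```

A smaller ASC means the vowel points lie closer together, which is the pattern the table shows when going from isolated vowels to unstressed vowels in free conversation.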
18
Human word intelligibility vs. noise
from Ph.D. thesis H. Steeneken (1992), "On
measuring and predicting speech intelligibility"
19
Robustness to degraded speech
  • speech: time-modulated signal in frequency bands
  • relatively insensitive to (spectral) distortions
  • prerequisite for digital hearing aid
  • modulating spectral slope -5 to 5 dB/oct,
    0.25-2 Hz
  • temporal smearing of envelope modulation
  • ca. 4 Hz max. in modulation spectrum (syllable rate)
  • LP > 4 Hz and HP < 8 Hz: little effect on
    intelligibility
  • spectral envelope smearing
  • for BW > 1/3 oct, masked SRT starts to degrade
  • (for references, see keynote paper Pols in Proc.
    ICPhS'99)

20
Some examples
  • partly reversed speech (Saberi & Perrott, Nature,
    4/99)
  • fixed duration segments time reversed or shifted
    in time
  • perfect sentence intelligibility up to 50 ms
  • (demo: every 50 ms reversed; original)
  • low frequency modulation envelope (3-8 Hz) vs.
    acoustic spectrum
  • syllable as information unit? (S. Greenberg)
  • gap and click restoration (Warren)
  • gating experiments
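
The "partly reversed speech" manipulation listed above (fixed-duration segments reversed in time) can be illustrated with a short sketch. It is not the stimulus-generation code of the cited study: the function name and parameters are hypothetical, and the signal is assumed to be a plain list of samples.

```python
def reverse_segments(samples, sample_rate, segment_ms):
    """Reverse a signal locally in fixed-duration segments.

    samples: sequence of audio samples (assumed mono).
    sample_rate: sampling rate in Hz.
    segment_ms: segment duration in milliseconds (e.g. 50).
    The order of the segments is kept; only the samples inside each
    segment are reversed. A shorter final segment is reversed as well.
    """
    seg_len = max(1, round(sample_rate * segment_ms / 1000))
    out = []
    for start in range(0, len(samples), seg_len):
        out.extend(reversed(samples[start:start + seg_len]))
    return out

# Toy example: 10 "samples" at 1 kHz, reversed in 4 ms segments.
toy = list(range(10))
print(reverse_segments(toy, sample_rate=1000, segment_ms=4))
# → [3, 2, 1, 0, 7, 6, 5, 4, 9, 8]
```

Applying the function twice with the same segment duration restores the original signal, which makes it easy to check that only the local sample order was changed.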

21
Continuity, especially while masked
  • continuity effect (Miller & Licklider), auditory
    induction (Warren), pulsation threshold
    (Houtgast)
  • also for gliding tones
  • also for complex tones
  • also for pitch
  • fission, fusion
  • segregation, streaming
  • phonemic restoration

22
V and C reduction, coarticulation
  • spectral variability is not random but, at
    least partly, speaker-, style-, and
    context-specific
  • read - spontaneous; stressed - unstressed
  • not just for vowels, but also for consonants
  • duration; spectral balance
  • intervocalic sound energy difference
  • F2 slope difference; locus equation

23
C-duration; C error rate
Mean consonant duration
Mean error rate for C identification
791 VCV pairs (read & spontan.; stressed & unstr.
segments; one male); C-identification by 22 Dutch
subjects
Adopted from van Son & Pols (Eurospeech'97)
24
Perception of ac. V reduction
  • Ph.D thesis Dick van Bergem (1995)
  • Acoustic and lexical vowel reduction
  • lexical V reduction: Fr /betõ/ vs. Du /b@tOn/
  • acoustic V reduction
  • Du miljoen as /mIljun/ or as /m@ljun/
  • identify the unstressed vowels (as V or @)
  • by 20 listeners (8 M, 12 F)
  • in 47 words (cond. W and S)
  • or 20 words (cond. P), like milJOEN or
    biosCOOP
  • spoken by 20 male speakers (2280 stimuli)

25
adapted from vBergem (1995)
Conclusion: vowel reduction is not centralization
but contextual assimilation
26
Speech efficiency
  • speech is most efficient if it contains only the
    information needed to understand it
  • "Speech is the missing information" (Lindblom,
    JASA 96)
  • less information needed for more predictable
    things
  • shorter duration and more spectral reduction for
    high-frequency syllables and words
  • C-confusion correlates with acoustic factors
    (duration, CoG) and with information content
    (syll./word freq.): I(x) = -log2(Prob(x)) in
    bits

(see van Son, Koopmans-van Beinum, and Pols
(ICSLP98))
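
The information content I(x) = -log2(Prob(x)) used on this slide can be illustrated as follows. This is a minimal sketch: estimating Prob(x) from relative frequencies in a token list is an assumption made here for illustration, and the function names and the toy token list are hypothetical.

```python
import math
from collections import Counter

def information_content(prob):
    """Information content in bits of an item with probability prob."""
    return -math.log2(prob)

def information_from_counts(tokens):
    """Estimate I(x) per item from relative frequencies in a token list.

    Assumption: Prob(x) is approximated by count(x) / total count.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    return {item: information_content(c / total) for item, c in counts.items()}

# Toy example: the frequent item carries less information than the rare ones.
print(information_from_counts(["a", "a", "b", "c"]))
# → {'a': 1.0, 'b': 2.0, 'c': 2.0}
```

Frequent (more predictable) syllables or words get a low I(x), which matches the slide's point that they can be produced with shorter duration and more spectral reduction.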
27
Correlation between consonant confusion and 4
measures indicated
Dutch male sp.; 20 min. R/S; 12k syll., 8k
words; 791 VCV R/S (308 lex. str., 483
unstr.); C ident. by 22 Ss
Adopted from van Son et al. (Proc. ICSLP98)
28
Conclusions
  • perceiving speech (segments) very much depends on
    speech quality and context
  • presenting segments in isolation is also a kind
    of context
  • formant transitions are interpreted properly
    (perceptual compensation for spectro-temporal
    undershoot) only when presented in an
    appropriate context
  • reduced V are best perceived as schwa if
    transitions are contextually assimilated