Title: Flexible, Robust, and Efficient Human Speech Processing Versus Present-day Speech Technology
1Flexible, Robust, and Efficient Human Speech
Processing Versus Present-day Speech Technology
- Louis C.W. Pols
- Institute of Phonetic Sciences / IFOTT
- University of Amsterdam
- The Netherlands
2IFA, Herengracht 338, Amsterdam
- Welcome
- My pre-predecessor Louise Kaiser: Secretary of the First International Congress of Phonetic Sciences, Amsterdam, 3-8 July 1932
3Amsterdam ICPhS32
- Jac. van Ginneken, president
- L. Kaiser, secretary; A. Roozendaal, treasurer
- Subjects
- - physiology of speech and voice (experimental phonetics in its strict meaning)
- - study of the development of speech and voice in the individual; their evolution in the history of mankind; the influence of heredity
- - anthropology of speech and voice
- - phonology
- - linguistic psychology
- - pathology of speech and voice
- - comparative physiology of the sounds of animals
- - musicology
- 136 participants from 16 countries
- 43 plenary papers
- 24 demonstrations
4Amsterdam ICPhS32
- Some of the participants
- Prof. Daniel Jones, London: "The theory of phonemes, and its importance in Practical Linguistics"
- Sir Richard Paget, London: "The Evolution of Speech in Men"
- Prof. R.H. Stetson, Oberlin: "Breathing Movements in Speech"
- Prof. Prince N. Trubetzkoy, Wien: "Charakter und Methode der systematischen phonologischen Darstellung einer gegebenen Sprache"
- Dr. E. Zwirner, Berlin-Buch:
- - "Phonetische Untersuchungen an Aphasischen und Amusischen"
- - "Quantität, Lautdauerschätzung und Lautkurvenmessung (Theorie und Material)"
- Later congresses: 2nd, London '35; 3rd, Ghent '38; 4th, Helsinki '61; 5th, Münster '64
5Overview
- Phonetics and speech technology
- Do recognizers need intelligent ears?
- What is knowledge?
- How good is human/machine speech recogn.?
- How good is synthetic speech?
- Pre-processor characteristics
- Useful (phonetic) knowledge
- Computational phonetics
- Discussion/conclusions
6Phonetics ↔ Speech Technology
7Do recognizers need intelligent ears?
- intelligent ears → front-end pre-processor
- only if it improves performance
- humans are generally better speech processors than machines; perhaps system developers can learn from human behavior
- robustness at stake (noise, reverberation, incompleteness, restoration, competing speakers, variable speaking rate, context, dialects, non-nativeness, style, emotion)
8What is knowledge?
- phonetic knowledge
- probabilistic knowledge from databases
- fixed set of features vs. adaptable set
- trading relations, selectivity
- knowledge of the world, expectation
- global vs. detailed
- → see video
- (with permission from Interbrew Nederland NV)
9Video is a metaphor for
- from global to detail (world → Europe → Holland → North Sea coast → Scheveningen → beach → young lady → drinking Dommelsch beer)
- sound → speech → speaker → English → utterance
- "recognize speech" or "wreck a nice beach"
- zoom in on whatever information is available
- make intelligent interpretation, given context
- beware of distractors!
10Human auditory sensitivity
- stationary vs. dynamic signals
- simple vs. spectrally complex
- detection threshold
- just noticeable differences
- see Table 3 in paper
11Detection thresholds and jnd
[Figure: detection thresholds and jnd for multi-harmonic, simple, stationary signals and for single-formant-like periodic signals; values shown: frequency 1.5 Hz, F2 3 - 5, BW 20 - 40]
12DL for short speech-like transitions
[Figure: DL for simple vs. complex signals, with short vs. longer transitions]
Adopted from van Wieringen & Pols (Acta Acustica 98)
13How good is human / machine speech recognition?
14How good is human / machine speech recognition?
- machine SR surprisingly good for certain tasks
- machine SR could be better for many others
- robustness, outliers
- what are the limits of human performance?
- - in noise
- - for degraded speech
- - missing information (trading)
15Human word intelligibility vs. noise
[Figure: human word intelligibility vs. noise; annotations: "recognizers have trouble!" and "humans start to have some trouble"]
Adopted from Steeneken (1992)
16Robustness to degraded speech
- speech: time-modulated signal in frequency bands
- relatively insensitive to (spectral) distortions
- prerequisite for digital hearing aid
- modulating spectral slope: -5 to +5 dB/oct, 0.25-2 Hz
- temporal smearing of envelope modulation
- ca. 4 Hz max. in modulation spectrum → syllable
- LP > 4 Hz and HP < 8 Hz: little effect on intelligibility
- spectral envelope smearing
- for BW > 1/3 oct masked SRT starts to degrade
- (for references, see paper in Proc. ICPhS99)
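The maximum near 4 Hz in the modulation spectrum can be illustrated with a small sketch. It assumes the temporal envelope of a frequency band is already available as a list of samples. The function names, the envelope sampling rate, and the toy 4 Hz envelope are all hypothetical and only serve the illustration; a naive DFT is used here for transparency, not as the method behind the cited studies.

```python
import cmath
import math
from typing import List, Sequence, Tuple


def modulation_spectrum(envelope: Sequence[float],
                        envelope_rate: float) -> List[Tuple[float, float]]:
    """Return (modulation frequency in Hz, magnitude) pairs.

    Uses a naive DFT of the mean-removed envelope; the DC bin is skipped.
    """
    n = len(envelope)
    mean = sum(envelope) / n
    centered = [v - mean for v in envelope]  # remove the DC component
    spectrum = []
    for k in range(1, n // 2 + 1):
        coeff = sum(v * cmath.exp(-2j * math.pi * k * i / n)
                    for i, v in enumerate(centered))
        spectrum.append((k * envelope_rate / n, abs(coeff)))
    return spectrum


def peak_modulation_frequency(envelope: Sequence[float],
                              envelope_rate: float) -> float:
    """Modulation frequency (Hz) with the largest magnitude."""
    return max(modulation_spectrum(envelope, envelope_rate),
               key=lambda pair: pair[1])[0]


# Toy envelope (hypothetical data): 2 s sampled at 100 Hz,
# amplitude-modulated at a syllable-like rate of 4 Hz.
rate = 100.0
toy_envelope = [1.0 + 0.5 * math.cos(2 * math.pi * 4.0 * i / rate)
                for i in range(200)]
print(peak_modulation_frequency(toy_envelope, rate))  # 4.0
```

For real speech the envelope would first have to be extracted per frequency band (for example by rectification and low-pass filtering), which this sketch leaves out.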
17Robustness to degraded speech and missing information
- partly reversed speech (Saberi & Perrott, Nature, 4/99)
- fixed duration segments time reversed or shifted in time
- perfect sentence intelligibility up to 50 ms
- (demo: every 50 ms reversed vs. original)
- low frequency modulation envelope (3-8 Hz) vs. acoustic spectrum
- syllable as information unit? (S. Greenberg)
- gap and click restoration (Warren)
- gating experiments
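The partly reversed speech manipulation (fixed-duration segments, each reversed in time) can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of the cited study: the function name, the plain-list signal representation, and the toy input are hypothetical.

```python
from typing import List, Sequence


def reverse_segments(samples: Sequence[float],
                     sample_rate: int,
                     segment_ms: float = 50.0) -> List[float]:
    """Time-reverse each consecutive fixed-duration segment of a signal.

    The segment order is kept; only the samples inside each segment are
    reversed. A shorter final segment is reversed as well.
    """
    seg_len = max(1, int(round(sample_rate * segment_ms / 1000.0)))
    out: List[float] = []
    for start in range(0, len(samples), seg_len):
        segment = samples[start:start + seg_len]
        out.extend(reversed(segment))
    return out


# Toy example (hypothetical data): 7 samples at 1000 Hz, 3 ms segments.
toy = [0, 1, 2, 3, 4, 5, 6]
print(reverse_segments(toy, sample_rate=1000, segment_ms=3))
# [2, 1, 0, 5, 4, 3, 6]
```

Applying the function twice with the same settings restores the original signal, which is a quick sanity check on any implementation.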
18How good is synthetic speech?
- good enough for certain applications
- could be better in most others
- evaluation application-specific
- or multi-tier required
- interesting experience: Synthesis workshop at Jenolan Caves, Australia, Nov. 1998
19Workshop evaluation procedure
- participants as native listeners
- DARPA-type procedures in data preparations
- balanced listening design
- no detailed results made public
- 3 text types
- - newspaper sentences
- - semantically unpredictable sentences
- - telephone directory entries
- 42 systems in 8 languages tested
20Screen for newspaper sentences
21Some global results
- it worked!, but many practical problems
- (for demo see http://www.fon.hum.uva.nl)
- this seems the way to proceed and to expand
- global rating (poor to excellent)
- text analysis, prosody & signal processing
- and/or more detailed scores
- transcriptions subjectively judged
- major/minor/no problems per entry
- web site access of several systems
- (http://www.ldc.upenn.edu/ltts/)
22Phonetic knowledge to improve speech synthesis
- (suppose concatenative synthesis)
- control emotion, style, voice characteristics
- perceptual implications of
- parameterization (LPC, PSOLA)
- discontinuities (spectral, temporal, prosody)
- improve naturalness (prosody!)
- active adaptation to other conditions
- hyper/hypo, noise, comm. channel, listener impairment
- systematic evaluation
23Desired pre-processor characteristics in Automatic Speech Recognition
- basic sensitivity for stationary and dynamic sounds
- robustness to degraded speech
- rather insensitive to spectral and temporal smearing
- robustness to noise and reverberation
- filter characteristics
- is BP, PLP, MFCC, RASTA, TRAPS good enough?
- lateral inhibition (spectral sharpening), dynamics
- what can be neglected?
- non-linearities, limited dynamic range, active elements, co-modulation, secondary pitch, etc.
24Caricature of present-day speech recognizer
- trained with a variety of speech input
- much global information, no interrelations
- monaural, uni-modal input
- pitch extractor generally not operational
- performs well on average behavior
- does poorly on any type of outlier (OOV, non-native, fast or whispered speech, other communication channel)
- neglects lots of useful (phonetic) information
- heavily relies on language model
25Useful (phonetic) knowledge neglected so far
- pitch information
- (systematic) durational variability
- spectral reduction/coarticulation (other than multiphone)
- intelligent selection from multiple features
- quick adaptation to speaker, style & channel
- communicative expectations
- multi-modality
- binaural hearing
26Useful information: durational variability
Adopted from Wang (1998)
27Useful information: durational variability
- overall average: 95 ms
- normal rate: 95 ms
- primary stress: 104 ms
- word final: 136 ms
- utterance final: 186 ms
Adopted from Wang (1998)
28Useful information: V and C reduction, coarticulation
- spectral variability is not random but, at least partly, speaker-, style-, and context-specific
- read vs. spontaneous, stressed vs. unstressed
- not just for vowels, but also for consonants
- duration
- spectral balance
- intervocalic sound energy difference
- F2 slope difference
- locus equation
29 C-duration and C error rate
[Figure panels: mean consonant duration; mean error rate for C identification]
791 VCV pairs (read & spontaneous; stressed & unstressed segments; one male speaker); C-identification by 22 Dutch subjects
Adopted from van Son & Pols (Eurospeech97)
30Other useful information
- pronunciation variation (ESCA workshop)
- acoustic attributes of prominence (B. Streefkerk)
- speech efficiency (post-doc project R. v. Son)
- confidence measure
- units in speech recognition
- rather than PLU, perhaps syllables (S. Greenberg)
- quick adaptation
- prosody-driven recognition / understanding
- multiple features
31Speech efficiency
- speech is most efficient if it contains only the information needed to understand it
- "Speech is the missing information" (Lindblom, JASA 96)
- less information needed for more predictable things
- shorter duration and more spectral reduction for high-frequency syllables and words
- C-confusion correlates with acoustic factors (duration, CoG) and with information content (syll./word freq.): I(x) = -log2(Prob(x)) in bits
- (see van Son, Koopmans-van Beinum, and Pols (ICSLP98))
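The information measure I(x) = -log2(Prob(x)) can be made concrete with a short sketch. It assumes that Prob(x) is estimated as the relative frequency of a syllable or word in a corpus; the function name and the toy token list are hypothetical and are not taken from the cited study.

```python
import math
from collections import Counter
from typing import Dict, Iterable


def information_content(tokens: Iterable[str]) -> Dict[str, float]:
    """Information content in bits per token type.

    Prob(x) is estimated as the relative frequency of x in `tokens`
    (an assumption made for this sketch), and I(x) = -log2(Prob(x)).
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    return {tok: -math.log2(n / total) for tok, n in counts.items()}


# Toy corpus (hypothetical data): relative frequencies 1/2, 1/4, 1/8, 1/8.
toy_tokens = ["a", "a", "a", "a", "b", "b", "c", "d"]
bits = information_content(toy_tokens)
print(bits["a"], bits["b"], bits["c"], bits["d"])  # 1.0 2.0 3.0 3.0
```

Frequent (more predictable) items carry fewer bits, in line with the shorter durations and stronger reduction noted for them.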
32Correlation between consonant confusion and 4 measures indicated
- Dutch male speaker, 20 min. R/S (read/spontaneous); 12k syll., 8k words
- 791 VCV R/S: 308 lex. str., 483 unstr.
- C ident. by 22 Ss
Adopted from van Son et al. (Proc. ICSLP98)
33Computational Phonetics (R. Moore, ICPhS95 Stockholm)
- duration modeling
- optimal unit selection (like in concatenative synthesis)
- pronunciation variation modeling
- vowel reduction models
- computational prosody
- information measures for confusion
- speech efficiency models
- modulation transfer function for speech
34Discussion / Conclusions
- speech technology needs further improvement for certain tasks (flexibility, robustness)
- phonetic knowledge can help if provided in an implementable form; computational phonetics is probably a good way to do that
- phonetics and speech/language technology should work together more closely, for their mutual benefit
- this conference is the ideal platform for that