Flexible, Robust, and Efficient Human Speech Processing Versus Present-day Speech Technology

Slides: 35

Provided by: louisc

Transcript and Presenter's Notes

1
Flexible, Robust, and Efficient Human Speech
Processing Versus Present-day Speech Technology

Louis C.W. Pols
Institute of Phonetic Sciences / IFOTT
University of Amsterdam
The Netherlands

2
IFA, Herengracht 338, Amsterdam
Welcome
My pre-predecessor: Louise Kaiser, Secretary of the
First International Congress of Phonetic
Sciences, Amsterdam, 3-8 July 1932
3
Amsterdam ICPhS'32

Jac. van Ginneken, president
L. Kaiser, secretary; A. Roozendaal, treasurer
Subjects
- physiology of speech and voice
(experimental phonetics in its strict meaning)
- study of the development of speech and voice in
the individual; their evolution in the history of
mankind; the influence of heredity
- anthropology of speech and voice
- phonology
- linguistic psychology
- pathology of speech and voice
- comparative physiology of the sounds of animals
- musicology
136 participants from 16 countries
43 plenary papers
24 demonstrations

4
Amsterdam ICPhS'32

Some of the participants
prof. Daniel Jones, London: The theory of
phonemes, and its importance in Practical
Linguistics
Sir Richard Paget, London: The Evolution of
Speech in Men
prof. R.H. Stetson, Oberlin: Breathing Movements
in Speech
prof. Prince N. Trubetzkoy, Wien: Charakter und
Methode der systematischen phonologischen
Darstellung einer gegebenen Sprache
dr. E. Zwirner, Berlin-Buch:
- Phonetische Untersuchungen an Aphasischen und
Amusischen
- Quantität, Lautdauerschätzung und
Lautkurvenmessung (Theorie und Material)
----------
2nd, London '35; 3rd, Ghent '38; 4th, Helsinki
'61; 5th, Münster '64

5
Overview

Phonetics and speech technology
Do recognizers need intelligent ears?
What is knowledge?
How good is human/machine speech recogn.?
How good is synthetic speech?
Pre-processor characteristics
Useful (phonetic) knowledge
Computational phonetics
Discussion/conclusions

6
Phonetics ↔ Speech Technology
7
Do recognizers need intelligent ears?

intelligent ears → front-end pre-processor
only if it improves performance
humans are generally better speech processors
than machines, perhaps system developers can
learn from human behavior
robustness at stake (noise, reverberation,
incompleteness, restoration, competing speakers,
variable speaking rate, context, dialects,
non-nativeness, style, emotion)

8
What is knowledge?

phonetic knowledge
probabilistic knowledge from databases
fixed set of features vs. adaptable set
trading relations, selectivity
knowledge of the world, expectation
global vs. detailed
→ see video
(with permission from Interbrew Nederland NV)

9
Video is a metaphor for

from global to detail (world → Europe → Holland →
North Sea coast → Scheveningen → beach
→ young lady → drinking Dommelsch beer)
sound → speech → speaker → English → utterance
"recognize speech" or "wreck a nice beach"
zoom in on whatever information is available
make intelligent interpretation, given context
beware of distractors!

10
Human auditory sensitivity

stationary vs. dynamic signals
simple vs. spectrally complex
detection threshold
just noticeable differences
see Table 3 in paper

11
Detection thresholds and jnd
multi-harmonic, simple, stationary signals; single-formant-like
periodic signals
[Figure labels: frequency 1.5 Hz; F2 3 - 5; BW 20 - 40]
12
DL for short speech-like transitions
[Figure labels: complex vs. simple; short vs. longer trans.]
Adapted from van Wieringen & Pols (Acta Acustica
'98)
13
How good is human / machine speech recognition?
14
How good is human / machine speech recognition?

machine SR surprisingly good for certain tasks
machine SR could be better for many others
robustness, outliers
what are the limits of human performance?
in noise
for degraded speech
missing information (trading)

15
Human word intelligibility vs. noise
recognizers have trouble!
humans start to have some trouble
Adapted from Steeneken (1992)
16
Robustness to degraded speech

speech = time-modulated signal in frequency bands
relatively insensitive to (spectral) distortions
prerequisite for digital hearing aid
modulating spectral slope: -5 to 5 dB/oct,
0.25-2 Hz
temporal smearing of envelope modulation
ca. 4 Hz max. in modulation spectrum → syllable
LP > 4 Hz and HP < 8 Hz: little effect on
intelligibility
spectral envelope smearing
for BW > 1/3 oct, masked SRT starts to degrade
(for references, see paper in Proc. ICPhS'99)

17
Robustness to degraded speech and missing
information

partly reversed speech (Saberi & Perrott, Nature,
4/99)
fixed duration segments time reversed or shifted
in time
perfect sentence intelligibility up to 50 ms
(demo: every 50 ms reversed; original)
low frequency modulation envelope (3-8 Hz) vs.
acoustic spectrum
syllable as information unit? (S. Greenberg)
gap and click restoration (Warren)
gating experiments
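
The partly reversed speech manipulation listed on this slide (fixed-duration segments reversed in time) can be sketched roughly as below. This is only an illustration of the idea, not the procedure used in the cited study: the function name `reverse_segments`, the toy integer "signal", and the 1000 Hz sample rate are all assumptions made for the example.

```python
def reverse_segments(samples, sample_rate_hz, segment_ms):
    """Reverse the sample order inside each consecutive fixed-duration segment.

    Hypothetical helper for illustration; the segment order itself is kept,
    only the samples within every segment are played backwards.
    """
    # Number of samples per segment (at least 1 to avoid an empty step).
    seg_len = max(1, int(round(sample_rate_hz * segment_ms / 1000.0)))
    out = []
    for start in range(0, len(samples), seg_len):
        segment = samples[start:start + seg_len]
        out.extend(reversed(segment))
    return out


# Toy signal: 10 samples at an assumed 1000 Hz rate, reversed in 4 ms blocks.
signal = list(range(10))
reversed_signal = reverse_segments(signal, sample_rate_hz=1000, segment_ms=4)
print(reversed_signal)
```

With real speech sampled at, say, 16 kHz, a 50 ms segment would correspond to 800 samples. Applying the function twice with the same settings restores the original signal, which is a convenient sanity check.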

18
How good is synthetic speech?

good enough for certain applications
could be better in most others
evaluation application-specific
or multi-tier required
interesting experience: Synthesis workshop at
Jenolan Caves, Australia, Nov. 1998

19
Workshop evaluation procedure

participants as native listeners
DARPA-type procedures in data preparations
balanced listening design
no detailed results made public
3 text types
newspaper sentences
semantically unpredictable sentences
telephone directory entries
42 systems in 8 languages tested

20
Screen for newspaper sentences
21
Some global results

it worked!, but many practical problems
(for demo see http://www.fon.hum.uva.nl)
this seems the way to proceed and to expand
global rating (poor to excellent)
text analysis, prosody & signal processing
and/or more detailed scores
transcriptions subjectively judged
major/minor/no problems per entry
web site access of several systems
(http://www.ldc.upenn.edu/ltts/)

22
Phonetic knowledge to improve speech synthesis

(suppose concatenative synthesis)
control emotion, style, voice characteristics
perceptual implications of
parameterization (LPC, PSOLA)
discontinuities (spectral, temporal, prosody)
improve naturalness (prosody!)
active adaptation to other conditions
hyper/hypo, noise, comm. channel, listener
impairment
systematic evaluation

23
Desired pre-processor characteristics in
Automatic Speech Recognition

basic sensitivity for stationary and dynamic
sounds
robustness to degraded speech
rather insensitive to spectral and temporal
smearing
robustness to noise and reverberation
filter characteristics
is BP, PLP, MFCC, RASTA, TRAPS good enough?
lateral inhibition (spectral sharpening)
dynamics
what can be neglected?
non-linearities, limited dynamic range, active
elements, co-modulation, secondary pitch, etc.

24
Caricature of present-day speech recognizer

trained with a variety of speech input
much global information, no interrelations
monaural, uni-modal input
pitch extractor generally not operational
performs well on average behavior
does poorly on any type of outlier (OOV,
non-native, fast
or whispered speech, other communication
channel)
neglects lots of useful (phonetic) information
heavily relies on language model

25
Useful (phonetic) knowledge neglected so far

pitch information
(systematic) durational variability
spectral reduction/coarticulation (other than
multiphone)
intelligent selection from multiple features
quick adaptation to speaker, style & channel
communicative expectations
multi-modality
binaural hearing

26
Useful information: durational variability
Adapted from Wang (1998)
27
Useful information: durational variability
overall average: 95 ms
normal rate: 95 ms
primary stress: 104 ms
word final: 136 ms
utterance final: 186 ms
Adapted from Wang (1998)
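
To make the size of these durational effects easier to compare, the mean durations on this slide can be expressed relative to the overall average. The sketch below assumes that all five values are in milliseconds (the slide states the unit only for the first one); the function name `lengthening_factors` is hypothetical.

```python
# Mean durations as listed on the slide; unit assumed to be ms throughout.
mean_duration_ms = {
    "overall average": 95,
    "normal rate": 95,
    "primary stress": 104,
    "word final": 136,
    "utterance final": 186,
}


def lengthening_factors(durations, reference="overall average"):
    """Divide every condition's duration by the reference condition's duration."""
    ref = durations[reference]
    return {name: value / ref for name, value in durations.items()}


factors = lengthening_factors(mean_duration_ms)
for name, factor in factors.items():
    print(f"{name}: {factor:.2f}")
```

Under that assumption, utterance-final position comes out at nearly twice the overall average (186 / 95), while primary stress adds only about ten percent.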
28
Useful information: V and C reduction,
coarticulation

spectral variability is not random but, at least
partly, speaker-, style-, and context-specific
read - spontaneous; stressed - unstressed
not just for vowels, but also for consonants
duration
spectral balance
intervocalic sound energy difference
F2 slope difference
locus equation
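
A locus equation, the last item in the list above, is commonly described as a straight-line fit of the F2 value at vowel onset against the F2 value in the vowel nucleus, computed per consonant across vowel contexts. The following is a minimal least-squares sketch of that idea; the function name `fit_locus_equation` and the F2 values are invented for illustration and are not data from the presentation.

```python
def fit_locus_equation(f2_vowel_hz, f2_onset_hz):
    """Least-squares fit of F2_onset = slope * F2_vowel + intercept."""
    n = len(f2_vowel_hz)
    mean_x = sum(f2_vowel_hz) / n
    mean_y = sum(f2_onset_hz) / n
    sxx = sum((x - mean_x) ** 2 for x in f2_vowel_hz)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(f2_vowel_hz, f2_onset_hz))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical F2 values in Hz, constructed to lie exactly on
# onset = 0.5 * vowel + 900, so the fit should recover those numbers.
f2_vowel = [800.0, 1200.0, 1600.0, 2000.0, 2400.0]
f2_onset = [0.5 * v + 900.0 for v in f2_vowel]

slope, intercept = fit_locus_equation(f2_vowel, f2_onset)
print(f"slope = {slope:.2f}, intercept = {intercept:.0f} Hz")
```

A slope near 1 means the onset follows the vowel closely (strong coarticulation), and a slope near 0 means the onset stays near a fixed locus. That is why a difference in slope between speaking styles is informative.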

29
C-duration / C error rate
Mean consonant duration
Mean error rate for C identification
791 VCV pairs (read & spontan.; stressed & unstr.
segments; one male); C-identification by 22 Dutch
subjects
Adapted from van Son & Pols (Eurospeech'97)
30
Other useful information

pronunciation variation (ESCA workshop)
acoustic attributes of prominence (B. Streefkerk)
speech efficiency (post-doc project R. v. Son)
confidence measure
units in speech recognition
rather than PLU, perhaps syllables (S. Greenberg)
quick adaptation
prosody-driven recognition / understanding
multiple features

31
Speech efficiency

speech is most efficient if it contains only the
information needed to understand it
"Speech is the missing information" (Lindblom,
JASA '96)
less information needed for more predictable
things
shorter duration and more spectral reduction for
high-frequency syllables and words
C-confusion correlates with acoustic factors
(duration, CoG) and with information content
(syll./word freq.): I(x) = -log2(Prob(x)) in bits
(see van Son, Koopmans-van Beinum, and Pols
(ICSLP'98))
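
The information measure on this slide, the negative base-2 logarithm of an item's probability, can be illustrated with a small sketch in which Prob(x) is estimated as a relative frequency. The token list and the function name `information_bits` are assumptions made for the example, not material from the cited study.

```python
import math
from collections import Counter


def information_bits(counts):
    """Return I(x) = -log2(Prob(x)) per item, with Prob(x) = count / total."""
    total = sum(counts.values())
    return {item: -math.log2(n / total) for item, n in counts.items()}


# Hypothetical syllable tokens: 16 in total, with very uneven frequencies.
tokens = ["de"] * 8 + ["van"] * 4 + ["het"] * 2 + ["spraak"] + ["klank"]
info = information_bits(Counter(tokens))

for item, bits in sorted(info.items(), key=lambda pair: pair[1]):
    print(f"{item}: {bits:.1f} bits")
```

In this toy corpus the most frequent item carries 1 bit and the rarest items 4 bits, which matches the slide's point that more predictable items need less information.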

32
Correlation between consonant confusion and 4
measures indicated
Dutch male sp.; 20 min. R/S; 12k syll.; 8k
words; 791 VCV R/S: 308 lex. str., 483
unstr.; C ident. by 22 Ss
Adapted from van Son et al. (Proc. ICSLP'98)
33
Computational Phonetics (R. Moore, ICPhS'95
Stockholm)

duration modeling
optimal unit selection (like in concatenative
synthesis)
pronunciation variation modeling
vowel reduction models
computational prosody
information measures for confusion
speech efficiency models
modulation transfer function for speech

34
Discussion / Conclusions

speech technology needs further improvement for
certain tasks (flexibility, robustness)
phonetic knowledge can help if provided in an
implementable form; computational phonetics is
probably a good way to do that
phonetics and speech/language technology should
work together more closely, for their mutual
benefit
this conference is the ideal platform for that
