Human and Machine Performance in Speech Processing

About This Presentation

Description:

Title: Flexible, Robust, and Efficient Human Speech Processing Versus Present-day Speech Technology
Author: Louis C.W. Pols
Last modified by: Louis Pols

Slides: 35

Provided by: LouisC1

Transcript and Presenter's Notes

1
Human and Machine Performance in Speech Processing

Louis C.W. Pols
Institute of Phonetic Sciences / ACLC
University of Amsterdam, The Netherlands
(Apologies: this presentation resembles the keynote
at ICPhS99, San Francisco, CA)

2
IFA Herengracht 338 Amsterdam
welcome
Heraeus-Seminar "Speech Recognition and Speech Understanding",
April 3-5, 2000, Physikzentrum Bad Honnef, Germany
3
Overview

Phonetics and speech technology
Do recognizers need intelligent ears?
What is knowledge?
How good is human/machine speech recogn.?
How good is synthetic speech?
Pre-processor characteristics
Useful (phonetic) knowledge
Computational phonetics
Discussion/conclusions

4
Phonetics ↔ Speech Technology
5
Machine performance more difficult, if ...

test condition deviates from training condition,
because of
nativeness and age of speakers
size and content of vocabulary
speaking style, emotion, rate
microphone, background noise, reverberation,
communication channel
nonavailability of certain features
however, machines never get tired, bored or
distracted

6
Do recognizers need intelligent ears?

intelligent ears → front-end pre-processor
only if it improves performance
humans are generally better speech processors
than machines; perhaps system developers can
learn from human behavior
robustness at stake (noise, reverberation,
incompleteness, restoration, competing speakers,
variable speaking rate, context, dialects,
non-nativeness, style, emotion)

7
What is knowledge?

phonetic knowledge
probabilistic knowledge from databases
fixed set of features vs. adaptable set
trading relations, selectivity
knowledge of the world, expectation
global vs. detailed
→ see video
(with permission from Interbrew Nederland NV)

8
(No Transcript)
9
Video is a metaphor for

from global to detail (world → Europe → Holland →
North Sea coast → Scheveningen → beach
→ young lady → drinking Dommelsch beer)
sound → speech → speaker → English → utterance
recognize speech or wreck a nice beach
zoom in on whatever information is available
make intelligent interpretation, given context
beware of distractors!

10
Human auditory sensitivity

stationary vs. dynamic signals
simple vs. spectrally complex
detection threshold
just noticeable differences

11
Detection thresholds and jnd
multi-harmonic, simple, stationary signals; single-formant-like periodic signals
[Table: jnd values legible in the transcript: frequency 1.5 Hz; F2 3 - 5; BW 20 - 40]
Table 3 in Proc. ICPhS99 paper
12
DL for short speech-like transitions
[Figure: DL for simple vs. complex signals, short vs. longer transitions]
Adapted from van Wieringen & Pols (Acta Acustica 98)
13
How good is human / machine speech recognition?
14
How good is human / machine speech recognition?

machine SR surprisingly good for certain tasks
machine SR could be better for many others
robustness, outliers
what are the limits of human performance?
in noise
for degraded speech
missing information (trading)

15
Human word intelligibility vs. noise
recognizers have trouble!
humans start to have some trouble
Adapted from Steeneken (1992)
16
Robustness to degraded speech

speech: time-modulated signal in frequency bands
relatively insensitive to (spectral) distortions
prerequisite for digital hearing aid
modulating spectral slope: -5 to 5 dB/oct,
0.25-2 Hz
temporal smearing of envelope modulation
ca. 4 Hz max. in modulation spectrum → syllable
LP > 4 Hz and HP < 8 Hz: little effect on
intelligibility
spectral envelope smearing
for BW > 1/3 oct masked SRT starts to degrade
(for references, see paper in Proc. ICPhS99)
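
The temporal-smearing bullets can be illustrated with a toy envelope. The sketch below is not the procedure used in the cited studies. It assumes a synthetic envelope sampled at 100 Hz with a 4 Hz and a 16 Hz modulation component, and a 60 ms moving average standing in for the low-pass "smearing" filter. All names and parameter values are hypothetical.

```python
import math

def modulation_magnitude(envelope, sample_rate_hz, mod_freq_hz):
    """Amplitude of one modulation-frequency component (single-bin DFT)."""
    n = len(envelope)
    re = sum(v * math.cos(2 * math.pi * mod_freq_hz * i / sample_rate_hz)
             for i, v in enumerate(envelope))
    im = sum(v * math.sin(2 * math.pi * mod_freq_hz * i / sample_rate_hz)
             for i, v in enumerate(envelope))
    return 2 * math.hypot(re, im) / n

def smear(envelope, window):
    """Moving-average low-pass filter (circular, so the length is preserved)."""
    n = len(envelope)
    return [sum(envelope[(i + k) % n] for k in range(window)) / window
            for i in range(n)]

RATE = 100  # envelope sample rate in Hz (assumed)

# Toy 2 s envelope: a 4 Hz "syllabic" component plus a faster 16 Hz component.
env = [1.0
       + 0.5 * math.sin(2 * math.pi * 4 * i / RATE)
       + 0.5 * math.sin(2 * math.pi * 16 * i / RATE)
       for i in range(2 * RATE)]

smeared = smear(env, window=6)  # 6 samples = 60 ms averaging window (assumed)

for f in (4, 16):
    before = modulation_magnitude(env, RATE, f)
    after = modulation_magnitude(smeared, RATE, f)
    print(f"{f} Hz modulation: {before:.3f} -> {after:.3f}")
```

In this toy case the averaging removes most of the 16 Hz component and leaves the 4 Hz component largely intact, which is the qualitative pattern the slide describes for low-pass envelope filtering.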

17
Robustness to degraded speech and missing
information

partly reversed speech (Saberi & Perrott, Nature,
4/99)
fixed-duration segments time-reversed or shifted
in time
perfect sentence intelligibility up to 50 ms
(demo: every 50 ms reversed; original)
low frequency modulation envelope (3-8 Hz) vs.
acoustic spectrum
syllable as information unit? (S. Greenberg)
gap and click restoration (Warren)
gating experiments
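
The "partly reversed speech" manipulation (fixed-duration segments, each time-reversed) can be sketched as follows. This is a minimal illustration, not the stimulus-generation code of the cited study. The function name, its parameters, and the integer test signal are hypothetical.

```python
def reverse_segments(samples, sample_rate_hz, segment_ms):
    """Time-reverse each consecutive fixed-duration segment of a signal.

    The last segment may be shorter than the others; it is reversed as well.
    """
    seg_len = max(1, round(sample_rate_hz * segment_ms / 1000))
    out = []
    for start in range(0, len(samples), seg_len):
        out.extend(reversed(samples[start:start + seg_len]))
    return out

# Ten dummy samples at 1 kHz, reversed in 4 ms (4-sample) segments.
signal = list(range(10))
print(reverse_segments(signal, sample_rate_hz=1000, segment_ms=4))
# prints [3, 2, 1, 0, 7, 6, 5, 4, 9, 8]
```

For real speech one would call it with the recording's sample rate and `segment_ms=50` to approximate the 50 ms condition mentioned on the slide. Applying the function twice with the same settings restores the original signal.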

18
How good is synthetic speech? (not main theme of
this seminar, however, still attention for
synthesis and dialogue)

good enough for certain applications
could be better in most others
evaluation application-specific
or multi-tier required
interesting experience: Synthesis workshop at
Jenolan Caves, Australia, Nov. 1998

19
Workshop evaluation procedure

participants as native listeners
DARPA-type procedures in data preparations
balanced listening design
no detailed results made public
3 text types
newspaper sentences
semantically unpredictable sentences
telephone directory entries
42 systems in 8 languages tested

20
Screen for newspaper sentences
21
Some global results

it worked!, but many practical problems
(for demo see http://www.fon.hum.uva.nl)
this seems the way to proceed and to expand
global rating (poor to excellent)
text analysis, prosody & signal processing
and/or more detailed scores
transcriptions subjectively judged
major/minor/no problems per entry
web site access of several systems
(http://www.ldc.upenn.edu/ltts/)

22
Phonetic knowledge to improve speech synthesis

(supposing concatenative synthesis)
control emotion, style, voice characteristics
perceptual implications of
parameterization (LPC, PSOLA)
discontinuities (spectral, temporal, prosody)
improve naturalness (prosody!)
active adaptation to other conditions
hyper/hypo, noise, comm. channel, listener
impairment
systematic evaluation

23
Desired pre-processor characteristics in
Automatic Speech Recognition

basic sensitivity for stationary and dynamic
sounds
robustness to degraded speech
rather insensitive to spectral and temporal
smearing
robustness to noise and reverberation
filter characteristics
is BP, PLP, MFCC, RASTA, TRAPS good enough?
lateral inhibition (spectral sharpening)
dynamics
what can be neglected?
non-linearities, limited dynamic range, active
elements, co-modulation, secondary pitch, etc.

24
Caricature of present-day speech recognizer

trained with a variety of speech input
much global information, no interrelations
monaural, uni-modal input
pitch extractor generally not operational
performs well on average behavior
does poorly on any type of outlier (OOV,
non-native, fast
or whispered speech, other communication
channel)
neglects lots of useful (phonetic) information
heavily relies on language model

25
Useful (phonetic) knowledge neglected so far

pitch information
(systematic) durational variability
spectral reduction/coarticulation (other than
multiphone)
intelligent selection from multiple features
quick adaptation to speaker, style & channel
communicative expectations
multi-modality
binaural hearing

26
Useful information: durational variability
Adapted from Wang (1998)
27
Useful information: durational variability
overall average: 95 ms
normal rate: 95
primary stress: 104
word final: 136
utterance final: 186
Adapted from Wang (1998)
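
To make the size of these effects concrete, the listed figures can be expressed relative to the overall average. This sketch assumes all five values are mean durations in milliseconds (only the first carries a unit on the slide). The dictionary name and labels are illustrative and are not taken from Wang (1998).

```python
# Mean durations as listed on the slide; the "ms" unit is assumed for all rows.
durations_ms = {
    "overall average": 95,
    "normal rate": 95,
    "primary stress": 104,
    "word final": 136,
    "utterance final": 186,
}

baseline = durations_ms["overall average"]

# Lengthening factor of each condition relative to the overall average.
ratios = {label: value / baseline for label, value in durations_ms.items()}

for label, ratio in ratios.items():
    print(f"{label:16s} {durations_ms[label]:4d} ms  x{ratio:.2f}")
```

On these numbers, utterance-final position roughly doubles the duration relative to the overall average, while primary stress alone adds about 10%.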
28
Useful information: V and C reduction,
coarticulation

spectral variability is not random but, at least
partly, speaker-, style-, and context-specific
read - spontaneous; stressed - unstressed
not just for vowels, but also for consonants
duration
spectral balance
intervocalic sound energy difference
F2 slope difference
locus equation

29
C-duration & C error rate
[Figure panels: mean consonant duration; mean error rate for C identification]
791 VCV pairs (read & spontan.; stressed & unstr.
segments; one male); C-identification by 22 Dutch
subjects
Adapted from van Son & Pols (Eurospeech97)
30
Other useful information

pronunciation variation (ESCA workshop)
acoustic attributes of prominence (B. Streefkerk)
speech efficiency (post-doc project R. v. Son)
confidence measure
units in speech recognition
rather than PLU, perhaps syllables (S. Greenberg)
quick adaptation
prosody-driven recognition / understanding
multiple features

31
Speech efficiency

speech is most efficient if it contains only the
information needed to understand it
Speech is the missing information (Lindblom,
JASA 96)
less information needed for more predictable
things
shorter duration and more spectral reduction for
high-frequency syllables and words
C-confusion correlates with acoustic factors
(duration, CoG) and with information content
(syll./word freq.): I(x) = -log2(Prob(x)) in bits
(see van Son, Koopmans-van Beinum, and Pols
(ICSLP98))
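
The information measure I(x) = -log2(Prob(x)) can be computed directly from frequency counts. This is a minimal sketch in which Prob(x) is estimated as the relative frequency. The counts are invented toy values chosen to give round numbers, not data from the cited study, and the function name is hypothetical.

```python
import math
from collections import Counter

def information_bits(counts):
    """Information content I(x) = -log2(Prob(x)) in bits for every item x.

    Prob(x) is estimated as the relative frequency count(x) / total.
    """
    total = sum(counts.values())
    return {x: -math.log2(c / total) for x, c in counts.items()}

# Hypothetical word counts (total = 16).
toy_counts = Counter({"de": 8, "van": 4, "het": 2, "beer": 1, "beach": 1})

info = information_bits(toy_counts)
for word, bits in info.items():
    print(f"{word:6s} count={toy_counts[word]:2d}  I = {bits:.1f} bits")
```

Frequent items carry few bits (here 1 bit for the most frequent word) and rare items carry more (4 bits for a word seen once in 16), which is the sense in which the slide links high frequency to shorter duration and more reduction.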

32
Correlation between consonant confusion and 4
measures indicated
Dutch male sp.; 20 min. R/S; 12k syll.; 8k words
791 VCV R/S: 308 lex. str., 483 unstr.
C ident. by 22 Ss
significance marked at the 0.01 and 0.001 levels
Adapted from van Son et al. (Proc. ICSLP98)
33
Computational Phonetics (first suggested by R.
Moore, ICPhS95 Stockholm)

duration modeling
optimal unit selection (like in concatenative
synthesis)
pronunciation variation modeling (SpeCom Nov.
99)
vowel reduction models
computational prosody
information measures for confusion
speech efficiency models
modulation transfer function for speech

34
Discussion / Conclusions

speech technology needs further improvement for
certain tasks (flexibility, robustness)
phonetic knowledge can help if provided in an
implementable form; computational phonetics is
probably a good way to do that
phonetics and speech / language technology should
work together more closely, for their mutual
benefit
this Heraeus-seminar is a possible platform for
that discussion
