Title: Prosody in Recognition/Understanding
1. Prosody in Recognition/Understanding
2. Prosody in ASR Today
- Little success in improving ASR transcription
- More promise in non-traditional ASR-related tasks:
- Improving rejection
- Shrinking the search space
- Automatic segmentation
- Identifying salient words
- Disambiguating speech/dialogue acts
- Prosody in ASR understanding
3. Overview
- Recognizing communicative problems
- ASR errors
- User corrections
- Identifying speech acts
- Locating topic boundaries for topic tracking and audio browsing
- Recognizing speaker emotion
4. But... Systems Have Trouble Knowing When They've Made a Mistake
- Hard for humans to correct system misconceptions (Krahmer et al 99)
- User: I want to go to Boston.
- System: What day do you want to go to Baltimore?
- Easier: answering explicit requests for confirmation or responding to ASR rejections
- System: Did you say you want to go to Baltimore?
- System: I'm sorry. I didn't understand you. Could you please repeat your utterance?
- But constant confirmation or over-cautious rejection lengthens dialogue and decreases user satisfaction
5. And Systems Have Trouble Recognizing User Corrections
- Probability of recognition failure increases after a misrecognition (Levow 98)
- Corrections of system errors are often hyperarticulated (louder, slower, more internal pauses, exaggerated pronunciation) → more ASR error (Wade et al 92, Oviatt et al 96, Swerts & Ostendorf 97, Levow 98, Bell & Gustafson 99)
6. Can Prosodic Information Help Systems Perform Better?
- If errors occur where speaker turns are prosodically marked:
- Can we recognize turns that will be misrecognized by examining their prosody?
- Can we modify our dialogue and recognition strategies to handle corrections more appropriately?
7. Approach
- Collect a corpus from an interactive voice response system
- Identify speaker turns that are:
- incorrectly recognized (misrecognitions)
- where speakers are first aware of the error (aware sites)
- corrections of misrecognitions (corrections)
- Identify prosodic features of turns in each category and compare to other turns
- Use machine learning techniques to train a classifier to make these distinctions automatically
8. Turn Types
TOOT: Hi. This is AT&T Amtrak Schedule System. This is TOOT. How may I help you?
User: Hello. I would like trains from Philadelphia to New York leaving on Sunday at ten thirty in the evening. [misrecognition]
TOOT: Which city do you want to go to?
User: New York. [aware site, correction]
9. TOOT Dialogues
- Collected to study effects of differences in dialogue strategy on user performance and satisfaction (Litman & Pan 99)
- type of initiative (system, user, mixed)
- type of confirmation (explicit, implicit, none)
- adaptability condition
- Subjects
- 39 summer students
- 16/23 (F/M)
- 20/19 (native/non-native speakers)
10.
- Platform: combined over-the-phone ASR and TTS (Kamm et al 97) with web access to train information
- Task: find train information for 4 scenarios
- Corpus for current study
- 2328 speaker turns
- 52 dialogues
- Misrecognitions
- Overall word accuracy: 61%
- Overall concept accuracy (CA): 71%
- e.g., "I want to go to Boston from Philadelphia" (2 domain concepts) recognized as "I want to go to Boston" (one concept) = 50% CA (see the sketch below)
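For the arithmetic behind that 50% figure, here is a minimal sketch of concept accuracy as the fraction of uttered domain concepts that survive recognition; the function and concept labels are illustrative, not the TOOT system's actual scorer.

```python
def concept_accuracy(reference_concepts, recognized_concepts):
    """Fraction of the reference turn's domain concepts preserved
    in the recognized string (hypothetical representation)."""
    if not reference_concepts:
        return 1.0
    preserved = sum(1 for c in reference_concepts if c in recognized_concepts)
    return preserved / len(reference_concepts)

# The slide's example: two domain concepts uttered, one recognized -> 50% CA.
ref = {"arrival_city:Boston", "departure_city:Philadelphia"}
hyp = {"arrival_city:Boston"}
print(concept_accuracy(ref, hyp))  # 0.5
```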
11. A Successful Dialogue
- S: Hi, this is AT&T Amtrak schedule system. This is TOOT. How may I help you?
- U: I would like a train to New York City from Philadelphia on Sunday at ten thirty P.M.
- S: I heard you say "go to New York City from Philadelphia on Sunday approximately at ten thirty p.m." Do you want me to find the trains from Philadelphia to New York City on Sunday approximately at ten thirty p.m. now?
- U: Yes.
- S: I am going to get the train schedule for you...
12. An Unsuccessful Dialogue
- S: Hi, this is AT&T Amtrak schedule system. This is TOOT. How may I help you?
- U: I'd like to find a train from Ch- to Chicago from Baltimore.
- S: Sorry, I can't understand you. Please repeat your utterance.
- U: I'd like to find a train from Baltimore to Chicago.
- S: I heard you say "leave today around 5 thirty in the morning." Which city do you want to go to?
- U: I would like to leave Saturday at eight in the morning.
13.
S: You can say the arrival city name, such as "New York City."
U: Baltimore to Chicago.
S: You can say the arrival city name, such as "New York City."
U: Baltimore to Chicago. (I hate it.)
U: Train from Baltimore... S: ...I heard you s... U: to Chicago.... Hello?
S: You can say the name of your departure city, such as "New York City."
U: Baltimore.
14.
U: Bal-ti-more....
S: You can say...
U: Baltimore.
S: Do you want to continue and find a new set of trains?
U: Yes.
S: You can say "yes" or "I am done here" to find a new set of trains.
U: Yes.
S: Thank you for using AT&T Amtrak train time table system. See you next time.
U: I nev-
15. Are Misrecognitions, Aware Turns, and Corrections Measurably Different from Other Turns?
- For each type of turn:
- for each speaker and each prosodic feature, calculate mean values for, e.g., all correctly recognized speaker turns and for all incorrectly recognized turns
- perform paired t-tests on these speaker pairs of means (e.g., for each speaker, pairing mean values for correctly and incorrectly recognized turns; sketched below)
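A minimal sketch of that per-speaker paired test with pandas and scipy; the tiny feature table and its column names are invented for illustration.

```python
import pandas as pd
from scipy import stats

# Hypothetical per-turn data: speaker id, recognition outcome, one feature.
turns = pd.DataFrame({
    "speaker": ["s1", "s1", "s1", "s2", "s2", "s2"],
    "misrecognized": [True, False, False, True, True, False],
    "f0_max": [310.0, 240.0, 250.0, 205.0, 210.0, 180.0],
})

# One (misrecognized mean, correct mean) pair per speaker...
means = turns.groupby(["speaker", "misrecognized"])["f0_max"].mean().unstack()

# ...then a paired t-test across speakers.
t, p = stats.ttest_rel(means[True], means[False])
print(f"paired t = {t:.2f}, p = {p:.3f}")
```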
16. Prosodic Features Examined per Turn
- Raw prosodic/acoustic features
- f0 maximum and mean (pitch excursion/range)
- rms maximum and mean (amplitude)
- total duration
- duration of preceding silence
- amount of silence within the turn
- speaking rate (estimated as syllables of the recognized string per second)
- Normalized versions of each feature (compared to the first turn in the task, to the previous turn in the task, and as Z scores; sketched below)
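A small sketch of those normalizations, assuming one speaker's per-turn values arrive in task order; the function names are illustrative.

```python
import numpy as np

def normalize_feature(values):
    """values: one speaker's per-turn values for a single feature, in
    task order. Returns the three normalizations the slide lists."""
    v = np.asarray(values, dtype=float)
    vs_first = v / v[0]                      # relative to first turn in task
    vs_prev = np.r_[np.nan, v[1:] / v[:-1]]  # relative to previous turn
    z = (v - v.mean()) / v.std()             # Z scores over this speaker
    return vs_first, vs_prev, z

def speaking_rate(n_syllables_recognized, turn_duration_sec):
    """Speaking rate as on the slide: syllables of the recognized
    string per second of turn duration."""
    return n_syllables_recognized / turn_duration_sec
```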
17. Distinguishing Correct Recognitions from Misrecognitions (NAACL 00)
- Misrecognitions differ prosodically from correct recognitions in:
- F0 maximum (higher)
- RMS maximum (louder)
- turn duration (longer)
- preceding pause (longer)
- speaking rate (slower)
- The effect holds up across speakers, and even when hyperarticulated turns are excluded
18. WER-Based Results
Misrecognitions are higher in pitch, louder, and longer, with more preceding pause and less internal silence.
19. Does Hyperarticulation Lead to ASR Error?
- In the TOOT corpus:
- 24.1% of turns are (perceived as) hyperarticulated
- Hyperarticulated turns are recognized more poorly (59.5% WER) than non-hyperarticulated turns (32.8%)
- More misrecognized turns are hyperarticulated (36.5%) than correctly recognized turns (16.0%)
- But: the same results hold without hyperarticulated turns
20. Predicting Turn Types Using Machine Learning
- Ripper (Cohen 96) automatically induces rule sets for predicting turn types
- greedy search guided by a measure of information gain
- input: vectors of feature values (an illustrative vector follows below)
- output: ordered rules for predicting the dependent variable, with cross-validated scores for each ruleset
- Independent variables
- all prosodic features, raw and normalized
- experimental conditions (initiative type, confirmation style, adaptability, subject, task)
- gender, native/non-native status
- ASR recognized string, grammar, and acoustic confidence score
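For concreteness, one input vector of the kind described above, written as a Python dict; every field name and value here is hypothetical rather than taken from the study's data files.

```python
# One illustrative turn, with the independent variables the slide lists
# and the dependent variable to be predicted.
example_turn = {
    # prosodic features, raw and normalized (a few of each)
    "f0_max": 310.0, "f0_max_z": 1.8, "rms_max": 0.42,
    "dur": 3.1, "ppau": 0.9, "tempo": 2.7,
    # experimental conditions
    "initiative": "mixed", "confirmation": "implicit",
    "subject": "s07", "task": 2,
    # speaker attributes
    "gender": "F", "native": True,
    # ASR outputs
    "asr_string": "i want to go to boston",
    "grammar": "cityname", "conf": -3.2,
    # dependent variable (here: misrecognized or not)
    "misrecognized": True,
}
```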
21. ML Results: WER-Defined Misrecognition
22. Best Rule-Set for Predicting WER
Using prosody, ASR confidence, ASR string, and ASR grammar (Python transcription below):
if (conf < -2.85) and (duration > 1.27) then F
if (conf < -4.34) then F
if (tempo < .81) then F
if (conf < -4.09) then F
if (conf < -2.46) and (str contains "help") then F
if (conf < -2.47) and (ppau > .77) and (tempo < .25) then F
if (str contains "nope") then F
if (dur > 1.71) and (tempo < 1.76) then F
else T
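The same rule set transcribed directly into Python for readability, treating `duration` and `dur` as the same feature; on the slide's labeling, the rules fire "F" on low-confidence, long, or slow turns (i.e., likely misrecognitions), with "T" as the default.

```python
def predict_wer_class(conf, dur, tempo, ppau, s):
    """conf = ASR acoustic confidence, dur = turn duration, tempo =
    speaking rate, ppau = preceding pause, s = ASR recognized string."""
    if conf < -2.85 and dur > 1.27: return "F"
    if conf < -4.34: return "F"
    if tempo < 0.81: return "F"
    if conf < -4.09: return "F"
    if conf < -2.46 and "help" in s: return "F"
    if conf < -2.47 and ppau > 0.77 and tempo < 0.25: return "F"
    if "nope" in s: return "F"
    if dur > 1.71 and tempo < 1.76: return "F"
    return "T"
```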
23. Analyses of Awares and Corrections
- Awares
- shorter, somewhat louder, with less internal silence compared to other turns
- poorly recognized (49.9% misrecognized vs. 34.6%)
- ML results: 30.4% baseline (predict !aware) / mean error 12.2% (+/- .61)
- Corrections
- longer, louder, higher in pitch excursion, longer preceding pause, less internal silence
- ML results: 30% baseline / mean error 21.48% (+/- 0.68)
24. Turn Types
S: Hi. This is AT&T Amtrak Schedule System. This is TOOT. How may I help you?
U: Hello. I would like trains from Philadelphia to New York leaving on Sunday at ten thirty in the evening.
S: Which city do you want to go to?
U: New York. [aware site]
25. Awareness Sites
- Shorter, somewhat louder, with less internal silence compared to other turns
- Poorly recognized (49.9% vs. 34.6%)
26. ML Rules for Aware Prediction
- 30.4% baseline (predict !aware) / average error 12.2% (+/- .61)
- T if preconf < -4.06, pretpo < 2.65, ppau > 0.25
- T if preconf < -3.59, prerej = T
- T if preconf < -2.85, predur > 1.04, tponm2 > 1.04, preppau > 0.57, pretpo < 2.18
- T if preconf < -3.78, pmnsyls > 4.04
- T if preconf < -2.75, prestr contains "help"
- T if pregram = universal, pprewords > 2
- T if preconf < -2.60, predur > 1.04, zerosnm1 < 1.06, prermsav > 370.65
- T if pretpo < 0.13
- T if predur > 1.27, pretpo < 2.36, prermsav > 245.36
- T if pretpo < 0.80, pmntpo < 1.75, ppretponm2 < 1.39
- default: F
27. TOOT Corrections (ICSLP 00b)
TOOT: Hi. This is AT&T Amtrak Schedule System. This is TOOT. How may I help you?
User: Hello. I would like trains from Philadelphia to New York leaving on Sunday at ten thirty in the evening.
TOOT: Which city do you want to go to?
User: New York. [correction]
28. A Serious Problem for Spoken Dialogue Systems
- 29% of turns in our corpus are corrections
- 52% of corrections are hyperarticulated, but only 12% of other turns
- Corrections are misrecognized at least twice as often as non-corrections (60% vs. 31%)
- But corrections are no more likely to be rejected than non-corrections (9% vs. 8%)
- Are corrections also measurably distinct from non-corrections?
29. Prosodic Indicators of Corrections
- Corrections differ from other turns prosodically: longer, louder, higher in pitch excursion, longer preceding pause, less internal silence
- ML results
- baseline: 30% error
- normalized prosody + non-prosodic features: 18.45% (+/- 0.78)
- automatic features: 21.48% (+/- 0.68)
30. Prosodic Indicators of Corrections
- Corrections differ from other turns prosodically: longer, louder, higher in pitch excursion, longer preceding pause, less internal silence
31. ML Rules for Correction Prediction
- Baseline: 30% error (predict not-correction)
- normalized prosody + non-prosodic features: 18.45% (+/- 0.78)
- automatic features: 21.48% (+/- 0.68)
- T if gram = universal, f0max > 0.96, dur > 6.55
- T if gram = universal, zeros > 0.57, asr < -2.95
- T if gram = universal, f0max < 1.98, dur < 1.10, tempo > 1.21, zeros > 0.71
- T if dur > 0.76, asr < -2.97, strat = UsrNoConf
- T if dur > 2.28, ppau < 0.86
- T if rmsav > 1.11, strat = MixedImplicit, gram = cityname, f0max > 0.70
- default: FALSE
32. Corrections in Context
- Corrections are similar in prosodic features, but:
- What about their form and content?
- How do system behaviors affect the corrections users produce?
- What sorts of corrections are most, and least, effective?
- When users correct the same mistake more than once, do they vary their strategy in productive ways?
33. User Correction Behavior
- Correction classes
- omits and repetitions lead to fewer misrecognitions than adds and paraphrases
- Turns that correct rejections are more likely to be repetitions, while turns correcting misrecognitions are more likely to be omits
34. Role of System Strategy
- Would you use again? (1-5): SE (3.5), MI (2.6), UNC (1.7)
- Satisfaction (0-40): SE (31.25), MI (24.00), UNC (22.10)
- (MI = MixedImplicit, UNC = UserNoConfirm, SE = system initiative with explicit confirmation)
35.
- Type of correction is sensitive to strategy
- users are much more likely to exactly repeat their misrecognized utterance in a system-initiative environment
- users are much more likely to correct by omitting information with no system confirmation than with explicit confirmation
- omits are used more in the MixedImplicit and UserNoConfirm conditions
- Restarts are unlikely to be recognized (77% misrecognized) and are skewed in distribution
- 31% of corrections are restarts in MI and UNC
- none in SE, where initial turns are well recognized
- It doesn't pay to start over!
36. Correction Chains
S: You can say the arrival city name, such as "New York City."
U: Baltimore to Chicago. (I hate it.)
U: Train from Baltimore... S: ...I heard you s... U: to Chicago.... Hello?
S: You can say the name of your departure city, such as "New York City."
U: Baltimore. Bal-ti-more....
S: You can say...
U: Baltimore.
37. Effects of Distance from Error on User Corrections
- Corrections farther from the original error, in terms of chain position, are:
- higher in f0 (max, mean), lower in rms (max, mean), longer in duration (seconds, words), slower, preceded by more pause, and lower in CA (and in word accuracy for the latter)
- A puzzle: corrections more distant from the immediately preceding error are more likely to be recognized
- but they are similar in f0, rms, and duration
- and this is not a difference among strategies
38. Future Research
- Hypothesis: we can improve system recognition and error recovery by
- analyzing user input differently, to identify higher-level features of the turns systems are likely to misrecognize -- and of the turns speakers produce to correct those errors
- anticipating user responses to system errors in the context of different strategies, and targeting special error-recovery procedures
- Next: combining our predictors with an over-the-phone interface to SCANMail, our voicemail browsing and retrieval system
39. Overview
- Recognizing communicative problems
- ASR errors
- User corrections
- Disfluencies and self-repairs
- Identifying speech acts
- Locating topic boundaries for topic tracking and audio browsing
- Recognizing speaker emotion
40. Disfluencies: Problems and Possibilities for Dialogue Systems
- Disfluencies abound in spontaneous speech
- every 4.6s in radio call-in (Blackmer & Mitton 91)
- many more in audio-only conditions (Kasl & Mahl 65, Oviatt 95)
- Hard to define, and often unperceived by listeners
- hesitation: "Ch- change strategy."
- filled pause: "Um Baltimore."
- self-repair: "Ba- uh Chicago."
41. Recognizing Disfluency
- Hard to detect in ASR systems and often leads to recognition error
- "Ch- change strategy." → "to D C D C today ten fifteen."
- "Um Baltimore." → "From Baltimore ten."
- "Ba- uh Chicago." → "For Boston Chicago."
- ASR systems usually try to treat disfluencies as garbage, or assume that e.g. repairs can be identified from a (correct) transcription
- But this may lose valuable meta-level information
42. What Might Disfluencies Convey?
- Speaker's cognitive load?
- Difficulty of domain (Clark & Brennan 91)
- Difficulty of task
- Inter-speaker coordination?
- Turn-taking (Boomer 65, Goodwin 81, Shriberg 96)
- Difficulties in accessing information (Brennan & Williams 95, Bortfeld & Brennan 97)
- Speaker error (Brennan & Schober t.a.)
- Discourse structure (Swerts 98)
- Effects of familiarity, age, gender? (Bortfeld et al 99)
43. Can Prosodic Cues Help in Detection?
- Filled pauses
- Shriberg 94: longer duration and falling f0 contour (toy sketch below)
- Self-repairs
- Levelt & Cutler 83: increase in prominence on the repair
- Hindle 83: edit signal hypothesis
- Blackmer & Mitton 91: aberrant phones
- Howell & Young 91: increase in prior pause and f0 before the repair (the latter facilitated perception)
- Bear et al 92: increase in internal pausal duration and f0 of the repair
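As a toy illustration only, the Shriberg 94 cues might be operationalized like this; the thresholds and argument names are invented, and a real detector would be trained rather than hand-thresholded.

```python
def looks_like_filled_pause(duration_sec, f0_start_hz, f0_end_hz,
                            min_dur=0.3, min_fall_hz=5.0):
    """Flag a vowel-like region as a filled-pause candidate if it is
    long and its f0 falls over the region (per the cues above)."""
    falling = (f0_start_hz - f0_end_hz) >= min_fall_hz
    return duration_sec >= min_dur and falling

print(looks_like_filled_pause(0.45, 120.0, 105.0))  # True: long and falling
print(looks_like_filled_pause(0.12, 118.0, 119.0))  # False: short and level
```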
45. RIM Model of Self-Repairs (Nakatani & Hirschberg 94)
- ATIS corpus
- 6414 turns with 346 (5.4%) repairs, from 122 speakers
- hand-labeled for repairs and prosodic features
- Findings
- Reparanda: 73% end in fragments, 30% in glottalization, with co-articulatory gestures
- DI: pausal duration differs significantly from fluent boundaries; small increase in f0 and amplitude
- Repairs: offsets occur at phrase boundaries; phrasing differences
- CART prediction: 86% precision, 91% recall (rough stand-in below)
- features: duration of interval, presence of fragment, pause filler, p.o.s., lexical matching across the DI
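A rough stand-in for that CART predictor, using sklearn's CART-style DecisionTreeClassifier over the feature set the slide lists; the toy rows below are invented, not ATIS data.

```python
from sklearn.tree import DecisionTreeClassifier

# columns: [interval_dur, has_fragment, has_pause_filler, pos_match, lex_match]
X = [
    [0.42, 1, 0, 1, 1],  # repair-like boundary
    [0.05, 0, 0, 0, 0],  # fluent boundary
    [0.60, 0, 1, 1, 1],
    [0.02, 0, 0, 1, 0],
]
y = [1, 0, 1, 0]         # 1 = repair interruption site, 0 = fluent

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(clf.predict([[0.5, 1, 0, 1, 1]]))  # flags a likely repair site
```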
46. Overview
- Recognizing communicative problems
- ASR errors
- User corrections
- Disfluencies and self-repairs
- Identifying speech acts
- Locating topic boundaries for topic tracking and audio browsing
- Recognizing speaker emotion
47. Interpreting Speech/Dialogue Acts
- What function(s) is a speaker turn serving?
- The same phrase can perform different speech acts
- "Yes": acknowledgment, acceptance, question, ...
- Different phrases can perform the same speech act
- "Yes," "Right," "Okay," "Certainly," ...
- Can prosody distinguish the differences and identify the commonalities?
- "Okay"
- contours distinguish different uses (Hockey 91)
- contours + context distinguish different uses (Kowtko 96)
48. Automatic Speech/Dialogue Act Recognition
- S/DA recognition is important for
- Turn recognition (which grammar to use when)
- Turn disambiguation, e.g.
- S: What city do you want to go to?
- U1: Boston. (reply)
- U2: Pardon? (request for information)
- S: Do you want to go to Boston?
- U1: Boston. (confirmation)
- U2: Boston? (question)
49. Current Approaches
- Statistical modeling to recognize phrase boundaries and accent from acoustic evidence for Verbmobil (Nöth et al 99)
- prosodic boundaries provide potential DA boundaries
- most frequently accented words (salient words) in the training corpus + p.o.s. improve key-word selection for identifying DAs (toy scorer below)
- ACCEPT (ok, all right, marvellous, Friday, free)
- SUGGEST (Monday, Friday, Thursday, Wednesday, Saturday)
- Improvements in DA identification over non-prosody approaches (also cf. Shriberg et al 98 on Switchboard, Taylor et al 98 on Map Task)
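A toy version of that salient-keyword idea: score each dialogue act by matches against its most frequently accented training words. The two keyword sets are the slide's examples; the scoring function itself is invented for illustration and ignores the prosodic and p.o.s. evidence a real system would add.

```python
DA_KEYWORDS = {
    "ACCEPT": {"ok", "all", "right", "marvellous", "friday", "free"},
    "SUGGEST": {"monday", "friday", "thursday", "wednesday", "saturday"},
}

def score_dialogue_acts(words):
    """Count salient-keyword matches per dialogue act."""
    tokens = {w.lower() for w in words}
    return {da: len(tokens & kws) for da, kws in DA_KEYWORDS.items()}

print(score_dialogue_acts("ok that is all right".split()))
# {'ACCEPT': 3, 'SUGGEST': 0}
```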
50.
- Key features
- f0 (range better than accent or phrasing), duration, energy, rate (Shriberg et al 98)
- But little improvement in ASR accuracy
- Importance of the DA coding scheme
- some DAs are more usefully disambiguated than others
- some coding schemes are more disambiguable than others
51. Overview
- Recognizing communicative problems
- ASR errors
- User corrections
- Disfluencies and self-repairs
- Identifying speech acts
- Locating topic boundaries for topic tracking and audio browsing
- Recognizing speaker emotion
52. Prosodic Correlates of Discourse/Topic Structure
- Pitch range
- Lehiste 75, Brown et al 83, Silverman 86, Avesani & Vayra 88, Ayers 92, Swerts et al 92, Grosz & Hirschberg 92, Swerts & Ostendorf 95, Hirschberg & Nakatani 96
- Preceding pause
- Lehiste 79, Chafe 80, Brown et al 83, Silverman 86, Woodbury 87, Avesani & Vayra 88, Grosz & Hirschberg 92, Passonneau & Litman 93, Hirschberg & Nakatani 96
53.
- Rate
- Butterworth 75, Lehiste 80, Grosz & Hirschberg 92, Hirschberg & Nakatani 96
- Amplitude
- Brown et al 83, Grosz & Hirschberg 92, Hirschberg & Nakatani 96
- Contour
- Brown et al 83, Woodbury 87, Swerts et al 92
54. Automatic Topic Segmentation
- Important for audio browsing and retrieval tasks
- Broadcast News (NIST TREC SDR track)
- Topic Detection and Tracking (NIST/DARPA TDT)
- Customer care call recordings, focus groups
- Most work relies on lexical information (Hearst 94, Reynar 98, Beeferman et al 99)
55. Prosodic Cues to Segmentation
- Paratones ("intonational paragraphs") (Brown et al 80, Nakatani & Hirschberg 95)
- Recent results (Shriberg et al 00) show prosodic cues perform as well as, or better than, text-based cues at topic segmentation -- and may generalize better
- Goal: identify sentence and topic boundaries at ASR-defined word boundaries
- Procedure
- CART decision trees provide boundary predictions
- an HMM combines these with lexical boundary predictions (simplified sketch after slide 58)
56.
- Features
- Pause at boundary (raw and normalized by speaker)
- Pause at word before boundary
- Normalized phone and rhyme duration
- F0 (smoothed/stylized): reset, range, slope, and continuity
- Voice quality (halving/doubling estimates as correlates of creak or glottalization)
- Speaker change, time from start of turn, turns in conversation, and gender
- Topic segmentation results (BN only)
- prosody alone better than LM; combined improves significantly
- useful features: pause at boundary, f0 range, turn/no turn, gender, time in turn
57. Features Examined
- For each potential boundary location
- Pause at boundary (raw and normalized by speaker)
- Pause at word before boundary (is this a new turn or part of a continuous speech segment?)
- Phone and rhyme duration (normalized by inherent duration) (phrase-final lengthening?)
- F0 (smoothed and stylized): reset, range (topline, baseline), slope, and continuity
- Voice quality (halving/doubling estimates as correlates of creak or glottalization)
- Speaker change, time from start of turn, turns in conversation, and gender
- Trained/tested on Switchboard and Broadcast News
58.
- Sentence segmentation results
- prosody better than LM for BN, but worse (on transcription) and the same (on recognition output) for SB; all better than chance
- useful features for BN: pause at boundary, turn/no turn, f0 difference across boundary, rhyme duration
- useful features for SB: phone/rhyme duration before boundary, pause at boundary, turn/no turn, pause at preceding word boundary, time in turn
- Topic segmentation results (BN only)
- useful features: pause at boundary, f0 range, turn/no turn, gender, time in turn
- prosody alone better than LM; combined improves significantly
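Shriberg et al (00) combine the prosodic tree's predictions with the lexical model through an HMM; as a much-simplified stand-in, the sketch below just interpolates the two boundary posteriors at each word boundary. The weight and the example probabilities are invented.

```python
def combined_boundary_prob(p_prosody, p_lm, lam=0.5):
    """Interpolate prosodic (CART) and language-model boundary posteriors."""
    return lam * p_prosody + (1.0 - lam) * p_lm

p_prosody = [0.05, 0.80, 0.10]  # CART posteriors at three word boundaries
p_lm = [0.10, 0.60, 0.30]       # lexical-model posteriors at the same points
boundaries = [combined_boundary_prob(pp, pl) > 0.5
              for pp, pl in zip(p_prosody, p_lm)]
print(boundaries)  # [False, True, False]
```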
59. SCAN
60. SCAN Demo
61. (SCANMail architecture diagram) Components: POP-3 Server, ASR Server, Email Server, AUDIX Server, SCANMail HUB/DB, Information Extraction Server, Caller Id Server, IR Server, Client
65. Overview
- Recognizing communicative problems
- ASR errors
- User corrections
- Disfluencies and self-repairs
- Identifying speech acts
- Locating topic boundaries for topic tracking and audio browsing
- Recognizing speaker emotion
66. Identifying Emotion
- Human perception (Cahn 88, Murray & Arnott 93)
- Automatic identification (Nöth et al 99)
67. Future of Prosody in Recognition/Understanding
- Finding more non-traditional aspects of recognition/understanding where prosody can be useful
- Finding better ways to map linguistic information (e.g., accent) into objective acoustic measures
- Finding applications where prosodic information makes a difference