Title: Prosody in Recognition/Understanding
1. Prosody in Recognition/Understanding
2. Prosody in ASR Today
- Little success in improving ASR transcription
- More promise in non-traditional ASR-related tasks:
- Improving rejection
- Shrinking the search space
- Automatic segmentation
- Identifying salient words
- Disambiguating speech/dialogue acts
- Prosody in ASR understanding
3. Overview
- Recognizing communicative problems
- ASR errors
- User corrections
- Identifying speech acts
- Locating topic boundaries for topic tracking and audio browsing
- Recognizing speaker emotion
4. But... Systems Have Trouble Knowing When They've Made a Mistake
- Hard for humans to correct system misconceptions (Krahmer et al 99)
- User: I want to go to Boston.
- System: What day do you want to go to Baltimore?
- Easier: answering explicit requests for confirmation or responding to ASR rejections
- System: Did you say you want to go to Baltimore?
- System: I'm sorry. I didn't understand you. Could you please repeat your utterance?
- But constant confirmation or over-cautious rejection lengthens dialogue and decreases user satisfaction
5. And Systems Have Trouble Recognizing User Corrections
- Probability of recognition failure increases after a misrecognition (Levow 98)
- Corrections of system errors are often hyperarticulated (louder, slower, more internal pauses, exaggerated pronunciation) → more ASR error (Wade et al 92, Oviatt et al 96, Swerts & Ostendorf 97, Levow 98, Bell & Gustafson 99)
6. Can Prosodic Information Help Systems Perform Better?
- If errors occur where speaker turns are prosodically marked:
- Can we recognize turns that will be misrecognized by examining their prosody?
- Can we modify our dialogue and recognition strategies to handle corrections more appropriately?
7. Approach
- Collect a corpus from an interactive voice response system
- Identify speaker turns that are:
- incorrectly recognized (misrecognitions)
- where speakers are first aware of the error (aware sites)
- corrections of misrecognitions (corrections)
- Identify prosodic features of turns in each category and compare to other turns
- Use machine learning techniques to train a classifier to make these distinctions automatically
8. Turn Types
TOOT: Hi. This is AT&T Amtrak Schedule System. This is TOOT. How may I help you?
User: Hello. I would like trains from Philadelphia to New York leaving on Sunday at ten thirty in the evening. [misrecognition]
TOOT: Which city do you want to go to?
User: New York. [aware site, correction]
9. TOOT Dialogues
- Collected to study effects of differences in dialogue strategy on user performance and satisfaction (Litman & Pan 99)
- type of initiative (system, user, mixed)
- type of confirmation (explicit, implicit, none)
- adaptability condition
- Subjects
- 39 summer students
- 16/23 (F/M)
- 20/19 (native/non-native speakers)
10.
- Platform: combined over-the-phone ASR and TTS (Kamm et al 97) with web access to train information
- Task: find train information for 4 scenarios
- Corpus for current study
- 2328 speaker turns
- 52 dialogues
- Misrecognitions
- Overall word accuracy: 61%
- Overall concept accuracy (CA): 71%
- e.g., "I want to go to Boston from Philadelphia" (2 domain concepts) recognized as "I want to go to Boston" (one concept) = 50% CA (see the sketch below)
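For the arithmetic behind that 50% figure, here is a minimal sketch of concept accuracy as the fraction of uttered domain concepts that survive recognition; the function and concept labels are illustrative, not the TOOT system's actual scorer.

```python
def concept_accuracy(reference_concepts, recognized_concepts):
    """Fraction of the reference turn's domain concepts preserved
    in the recognized string (hypothetical representation)."""
    if not reference_concepts:
        return 1.0
    preserved = sum(1 for c in reference_concepts if c in recognized_concepts)
    return preserved / len(reference_concepts)

# The slide's example: two domain concepts uttered, one recognized -> 50% CA.
ref = {"arrival_city:Boston", "departure_city:Philadelphia"}
hyp = {"arrival_city:Boston"}
print(concept_accuracy(ref, hyp))  # 0.5
```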
11. A Successful Dialogue
- S: Hi, this is AT&T Amtrak schedule system. This is TOOT. How may I help you?
- U: I would like a train to New York City from Philadelphia on Sunday at ten thirty P.M.
- S: I heard you say "go to New York City from Philadelphia on Sunday approximately at ten thirty p.m." Do you want me to find the trains from Philadelphia to New York City on Sunday approximately at ten thirty p.m. now?
- U: Yes.
- S: I am going to get the train schedule for you...
12. An Unsuccessful Dialogue
- S: Hi, this is AT&T Amtrak schedule system. This is TOOT. How may I help you?
- U: I'd like to find a train from Ch- to Chicago from Baltimore.
- S: Sorry, I can't understand you. Please repeat your utterance.
- U: I'd like to find a train from Baltimore to Chicago.
- S: I heard you say "leave today around 5 thirty in the morning." Which city do you want to go to?
- U: I would like to leave Saturday at eight in the morning.
13.
S: You can say the arrival city name, such as "New York City."
U: Baltimore to Chicago.
S: You can say the arrival city name, such as "New York City."
U: Baltimore to Chicago. (I hate it.)
U: Train from Baltimore... S: ...I heard you s... U: to Chicago.... Hello?
S: You can say the name of your departure city, such as "New York City."
U: Baltimore.
14.
U: Bal-ti-more....
S: You can say...
U: Baltimore.
S: Do you want to continue and find a new set of trains?
U: Yes.
S: You can say "yes" or "I am done here" to find a new set of trains.
U: Yes.
S: Thank you for using AT&T Amtrak train time table system. See you next time.
U: I nev-
15. Are Misrecognitions, Aware Turns, and Corrections Measurably Different from Other Turns?
- For each type of turn:
- for each speaker and each prosodic feature, calculate mean values for, e.g., all correctly recognized speaker turns and for all incorrectly recognized turns
- perform paired t-tests on these speaker pairs of means (e.g., for each speaker, pairing mean values for correctly and incorrectly recognized turns; sketched below)
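A minimal sketch of that per-speaker paired test with pandas and scipy; the tiny feature table and its column names are invented for illustration.

```python
import pandas as pd
from scipy import stats

# Hypothetical per-turn data: speaker id, recognition outcome, one feature.
turns = pd.DataFrame({
    "speaker": ["s1", "s1", "s1", "s2", "s2", "s2"],
    "misrecognized": [True, False, False, True, True, False],
    "f0_max": [310.0, 240.0, 250.0, 205.0, 210.0, 180.0],
})

# One (misrecognized mean, correct mean) pair per speaker...
means = turns.groupby(["speaker", "misrecognized"])["f0_max"].mean().unstack()

# ...then a paired t-test across speakers.
t, p = stats.ttest_rel(means[True], means[False])
print(f"paired t = {t:.2f}, p = {p:.3f}")
```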
16. Prosodic Features Examined per Turn
- Raw prosodic/acoustic features
- f0 maximum and mean (pitch excursion/range)
- rms maximum and mean (amplitude)
- total duration
- duration of preceding silence
- amount of silence within the turn
- speaking rate (estimated as syllables of the recognized string per second)
- Normalized versions of each feature (compared to the first turn in the task, to the previous turn in the task, and as Z scores; sketched below)
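A small sketch of those normalizations, assuming one speaker's per-turn values arrive in task order; the function names are illustrative.

```python
import numpy as np

def normalize_feature(values):
    """values: one speaker's per-turn values for a single feature, in
    task order. Returns the three normalizations the slide lists."""
    v = np.asarray(values, dtype=float)
    vs_first = v / v[0]                      # relative to first turn in task
    vs_prev = np.r_[np.nan, v[1:] / v[:-1]]  # relative to previous turn
    z = (v - v.mean()) / v.std()             # Z scores over this speaker
    return vs_first, vs_prev, z

def speaking_rate(n_syllables_recognized, turn_duration_sec):
    """Speaking rate as on the slide: syllables of the recognized
    string per second of turn duration."""
    return n_syllables_recognized / turn_duration_sec
```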
17. Distinguishing Correct Recognitions from Misrecognitions (NAACL 00)
- Misrecognitions differ prosodically from correct recognitions in:
- F0 maximum (higher)
- RMS maximum (louder)
- turn duration (longer)
- preceding pause (longer)
- speaking rate (slower)
- The effect holds up across speakers, and even when hyperarticulated turns are excluded
18. WER-Based Results
Misrecognitions are higher in pitch, louder, and longer, with more preceding pause and less internal silence.
19. Does Hyperarticulation Lead to ASR Error?
- In the TOOT corpus:
- 24.1% of turns are (perceived as) hyperarticulated
- Hyperarticulated turns are recognized more poorly (59.5% WER) than non-hyperarticulated turns (32.8%)
- More misrecognized turns are hyperarticulated (36.5%) than correctly recognized turns (16.0%)
- But: the same results hold without hyperarticulated turns
20. Predicting Turn Types Using Machine Learning
- Ripper (Cohen 96) automatically induces rule sets for predicting turn types
- greedy search guided by a measure of information gain
- input: vectors of feature values (an illustrative vector follows below)
- output: ordered rules for predicting the dependent variable, with cross-validated scores for each ruleset
- Independent variables
- all prosodic features, raw and normalized
- experimental conditions (initiative type, confirmation style, adaptability, subject, task)
- gender, native/non-native status
- ASR recognized string, grammar, and acoustic confidence score
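For concreteness, one input vector of the kind described above, written as a Python dict; every field name and value here is hypothetical rather than taken from the study's data files.

```python
# One illustrative turn, with the independent variables the slide lists
# and the dependent variable to be predicted.
example_turn = {
    # prosodic features, raw and normalized (a few of each)
    "f0_max": 310.0, "f0_max_z": 1.8, "rms_max": 0.42,
    "dur": 3.1, "ppau": 0.9, "tempo": 2.7,
    # experimental conditions
    "initiative": "mixed", "confirmation": "implicit",
    "subject": "s07", "task": 2,
    # speaker attributes
    "gender": "F", "native": True,
    # ASR outputs
    "asr_string": "i want to go to boston",
    "grammar": "cityname", "conf": -3.2,
    # dependent variable (here: misrecognized or not)
    "misrecognized": True,
}
```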
21. ML Results: WER-Defined Misrecognition
22. Best Rule-Set for Predicting WER
Using prosody, ASR confidence, ASR string, and ASR grammar (Python transcription below):
if (conf < -2.85) and (duration > 1.27) then F
if (conf < -4.34) then F
if (tempo < .81) then F
if (conf < -4.09) then F
if (conf < -2.46) and (str contains "help") then F
if (conf < -2.47) and (ppau > .77) and (tempo < .25) then F
if (str contains "nope") then F
if (dur > 1.71) and (tempo < 1.76) then F
else T
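The same rule set transcribed directly into Python for readability, treating `duration` and `dur` as the same feature; on the slide's labeling, the rules fire "F" on low-confidence, long, or slow turns (i.e., likely misrecognitions), with "T" as the default.

```python
def predict_wer_class(conf, dur, tempo, ppau, s):
    """conf = ASR acoustic confidence, dur = turn duration, tempo =
    speaking rate, ppau = preceding pause, s = ASR recognized string."""
    if conf < -2.85 and dur > 1.27: return "F"
    if conf < -4.34: return "F"
    if tempo < 0.81: return "F"
    if conf < -4.09: return "F"
    if conf < -2.46 and "help" in s: return "F"
    if conf < -2.47 and ppau > 0.77 and tempo < 0.25: return "F"
    if "nope" in s: return "F"
    if dur > 1.71 and tempo < 1.76: return "F"
    return "T"
```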
23. Analyses of Awares and Corrections
- Awares
- shorter, somewhat louder, with less internal silence compared to other turns
- poorly recognized (49.9% misrecognized vs. 34.6%)
- ML results: 30.4% baseline (predict !aware) / mean error 12.2% (+/- .61)
- Corrections
- longer, louder, higher in pitch excursion, longer preceding pause, less internal silence
- ML results: 30% baseline / mean error 21.48% (+/- 0.68)
24. Turn Types
S: Hi. This is AT&T Amtrak Schedule System. This is TOOT. How may I help you?
U: Hello. I would like trains from Philadelphia to New York leaving on Sunday at ten thirty in the evening.
S: Which city do you want to go to?
U: New York. [aware site]
25. Awareness Sites
- Shorter, somewhat louder, with less internal silence compared to other turns
- Poorly recognized (49.9% vs. 34.6%)
26. ML Rules for Aware Prediction
- 30.4% baseline (predict !aware) / average error 12.2% (+/- .61)
- T if preconf < -4.06, pretpo < 2.65, ppau > 0.25
- T if preconf < -3.59, prerej = T
- T if preconf < -2.85, predur > 1.04, tponm2 > 1.04, preppau > 0.57, pretpo < 2.18
- T if preconf < -3.78, pmnsyls > 4.04
- T if preconf < -2.75, prestr contains "help"
- T if pregram = universal, pprewords > 2
- T if preconf < -2.60, predur > 1.04, zerosnm1 < 1.06, prermsav > 370.65
- T if pretpo < 0.13
- T if predur > 1.27, pretpo < 2.36, prermsav > 245.36
- T if pretpo < 0.80, pmntpo < 1.75, ppretponm2 < 1.39
- default: F
27. TOOT Corrections (ICSLP 00b)
TOOT: Hi. This is AT&T Amtrak Schedule System. This is TOOT. How may I help you?
User: Hello. I would like trains from Philadelphia to New York leaving on Sunday at ten thirty in the evening.
TOOT: Which city do you want to go to?
User: New York. [correction]
28. A Serious Problem for Spoken Dialogue Systems
- 29% of turns in our corpus are corrections
- 52% of corrections are hyperarticulated, but only 12% of other turns
- Corrections are misrecognized at least twice as often as non-corrections (60% vs. 31%)
- But corrections are no more likely to be rejected than non-corrections (9% vs. 8%)
- Are corrections also measurably distinct from non-corrections?
29. Prosodic Indicators of Corrections
- Corrections differ from other turns prosodically: longer, louder, higher in pitch excursion, longer preceding pause, less internal silence
- ML results
- baseline: 30% error
- normalized prosody + non-prosodic features: 18.45% (+/- 0.78)
- automatic features: 21.48% (+/- 0.68)
30. Prosodic Indicators of Corrections
- Corrections differ from other turns prosodically: longer, louder, higher in pitch excursion, longer preceding pause, less internal silence
31. ML Rules for Correction Prediction
- Baseline: 30% error (predict not-correction)
- normalized prosody + non-prosodic features: 18.45% (+/- 0.78)
- automatic features: 21.48% (+/- 0.68)
- T if gram = universal, f0max > 0.96, dur > 6.55
- T if gram = universal, zeros > 0.57, asr < -2.95
- T if gram = universal, f0max < 1.98, dur < 1.10, tempo > 1.21, zeros > 0.71
- T if dur > 0.76, asr < -2.97, strat = UsrNoConf
- T if dur > 2.28, ppau < 0.86
- T if rmsav > 1.11, strat = MixedImplicit, gram = cityname, f0max > 0.70
- default: FALSE
32. Corrections in Context
- Corrections are similar in prosodic features, but:
- What about their form and content?
- How do system behaviors affect the corrections users produce?
- What sorts of corrections are most, and least, effective?
- When users correct the same mistake more than once, do they vary their strategy in productive ways?
33. User Correction Behavior
- Correction classes
- omits and repetitions lead to fewer misrecognitions than adds and paraphrases
- Turns that correct rejections are more likely to be repetitions, while turns correcting misrecognitions are more likely to be omits
34. Role of System Strategy
- Would you use again? (1-5): SE (3.5), MI (2.6), UNC (1.7)
- Satisfaction (0-40): SE (31.25), MI (24.00), UNC (22.10)
- (MI = MixedImplicit, UNC = UserNoConfirm, SE = system initiative with explicit confirmation)
35.
- Type of correction is sensitive to strategy
- users are much more likely to exactly repeat their misrecognized utterance in a system-initiative environment
- users are much more likely to correct by omitting information with no system confirmation than with explicit confirmation
- omits are used more in the MixedImplicit and UserNoConfirm conditions
- Restarts are unlikely to be recognized (77% misrecognized) and are skewed in distribution
- 31% of corrections are restarts in MI and UNC
- none in SE, where initial turns are well recognized
- It doesn't pay to start over!
36. Correction Chains
S: You can say the arrival city name, such as "New York City."
U: Baltimore to Chicago. (I hate it.)
U: Train from Baltimore... S: ...I heard you s... U: to Chicago.... Hello?
S: You can say the name of your departure city, such as "New York City."
U: Baltimore. Bal-ti-more....
S: You can say...
U: Baltimore.
37. Effects of Distance from Error on User Corrections
- Corrections farther from the original error, in terms of chain position, are:
- higher in f0 (max, mean), lower in rms (max, mean), longer in duration (seconds, words), slower, preceded by more pause, and lower in CA (and in word accuracy for the latter)
- A puzzle: corrections more distant from the immediately preceding error are more likely to be recognized
- but they are similar in f0, rms, and duration
- and this is not a difference among strategies
38. Future Research
- Hypothesis: we can improve system recognition and error recovery by
- analyzing user input differently, to identify higher-level features of the turns systems are likely to misrecognize -- and of the turns speakers produce to correct those errors
- anticipating user responses to system errors in the context of different strategies, and targeting special error-recovery procedures
- Next: combining our predictors with an over-the-phone interface to SCANMail, our voicemail browsing and retrieval system
39. Overview
- Recognizing communicative problems
- ASR errors
- User corrections
- Disfluencies and self-repairs
- Identifying speech acts
- Locating topic boundaries for topic tracking and audio browsing
- Recognizing speaker emotion
40. Disfluencies: Problems and Possibilities for Dialogue Systems
- Disfluencies abound in spontaneous speech
- every 4.6s in radio call-in (Blackmer & Mitton 91)
- many more in audio-only conditions (Kasl & Mahl 65, Oviatt 95)
- Hard to define, and often unperceived by listeners
- hesitation: "Ch- change strategy."
- filled pause: "Um Baltimore."
- self-repair: "Ba- uh Chicago."
41. Recognizing Disfluency
- Hard to detect in ASR systems and often leads to recognition error
- "Ch- change strategy." → "to D C D C today ten fifteen."
- "Um Baltimore." → "From Baltimore ten."
- "Ba- uh Chicago." → "For Boston Chicago."
- ASR systems usually try to treat disfluencies as garbage, or assume that e.g. repairs can be identified from a (correct) transcription
- But this may lose valuable meta-level information
42. What Might Disfluencies Convey?
- Speaker's cognitive load?
- Difficulty of domain (Clark & Brennan 91)
- Difficulty of task
- Inter-speaker coordination?
- Turn-taking (Boomer 65, Goodwin 81, Shriberg 96)
- Difficulties in accessing information (Brennan & Williams 95, Bortfeld & Brennan 97)
- Speaker error (Brennan & Schober t.a.)
- Discourse structure (Swerts 98)
- Effects of familiarity, age, gender? (Bortfeld et al 99)
43. Can Prosodic Cues Help in Detection?
- Filled pauses
- Shriberg 94: longer duration and falling f0 contour (toy sketch below)
- Self-repairs
- Levelt & Cutler 83: increase in prominence on the repair
- Hindle 83: edit signal hypothesis
- Blackmer & Mitton 91: aberrant phones
- Howell & Young 91: increase in prior pause and f0 before the repair (the latter facilitated perception)
- Bear et al 92: increase in internal pausal duration and f0 of the repair
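As a toy illustration only, the Shriberg 94 cues might be operationalized like this; the thresholds and argument names are invented, and a real detector would be trained rather than hand-thresholded.

```python
def looks_like_filled_pause(duration_sec, f0_start_hz, f0_end_hz,
                            min_dur=0.3, min_fall_hz=5.0):
    """Flag a vowel-like region as a filled-pause candidate if it is
    long and its f0 falls over the region (per the cues above)."""
    falling = (f0_start_hz - f0_end_hz) >= min_fall_hz
    return duration_sec >= min_dur and falling

print(looks_like_filled_pause(0.45, 120.0, 105.0))  # True: long and falling
print(looks_like_filled_pause(0.12, 118.0, 119.0))  # False: short and level
```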
45. RIM Model of Self-Repairs (Nakatani & Hirschberg 94)
- ATIS corpus
- 6414 turns with 346 (5.4%) repairs, from 122 speakers
- hand-labeled for repairs and prosodic features
- Findings
- Reparanda: 73% end in fragments, 30% in glottalization, with co-articulatory gestures
- DI: pausal duration differs significantly from fluent boundaries; small increase in f0 and amplitude
- Repairs: offsets occur at phrase boundaries; phrasing differences
- CART prediction: 86% precision, 91% recall (rough stand-in below)
- features: duration of interval, presence of fragment, pause filler, p.o.s., lexical matching across the DI
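A rough stand-in for that CART predictor, using sklearn's CART-style DecisionTreeClassifier over the feature set the slide lists; the toy rows below are invented, not ATIS data.

```python
from sklearn.tree import DecisionTreeClassifier

# columns: [interval_dur, has_fragment, has_pause_filler, pos_match, lex_match]
X = [
    [0.42, 1, 0, 1, 1],  # repair-like boundary
    [0.05, 0, 0, 0, 0],  # fluent boundary
    [0.60, 0, 1, 1, 1],
    [0.02, 0, 0, 1, 0],
]
y = [1, 0, 1, 0]         # 1 = repair interruption site, 0 = fluent

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(clf.predict([[0.5, 1, 0, 1, 1]]))  # flags a likely repair site
```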
46. Overview
- Recognizing communicative problems
- ASR errors
- User corrections
- Disfluencies and self-repairs
- Identifying speech acts
- Locating topic boundaries for topic tracking and audio browsing
- Recognizing speaker emotion
47. Interpreting Speech/Dialogue Acts
- What function(s) is a speaker turn serving?
- The same phrase can perform different speech acts
- "Yes": acknowledgment, acceptance, question, ...
- Different phrases can perform the same speech act
- "Yes," "Right," "Okay," "Certainly," ...
- Can prosody distinguish the differences and identify the commonalities?
- "Okay"
- contours distinguish different uses (Hockey 91)
- contours + context distinguish different uses (Kowtko 96)
48. Automatic Speech/Dialogue Act Recognition
- S/DA recognition is important for
- Turn recognition (which grammar to use when)
- Turn disambiguation, e.g.
- S: What city do you want to go to?
- U1: Boston. (reply)
- U2: Pardon? (request for information)
- S: Do you want to go to Boston?
- U1: Boston. (confirmation)
- U2: Boston? (question)
49. Current Approaches
- Statistical modeling to recognize phrase boundaries and accent from acoustic evidence for Verbmobil (Nöth et al 99)
- prosodic boundaries provide potential DA boundaries
- most frequently accented words (salient words) in the training corpus + p.o.s. improve key-word selection for identifying DAs (toy scorer below)
- ACCEPT (ok, all right, marvellous, Friday, free)
- SUGGEST (Monday, Friday, Thursday, Wednesday, Saturday)
- Improvements in DA identification over non-prosody approaches (also cf. Shriberg et al 98 on Switchboard, Taylor et al 98 on Map Task)
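A toy version of that salient-keyword idea: score each dialogue act by matches against its most frequently accented training words. The two keyword sets are the slide's examples; the scoring function itself is invented for illustration and ignores the prosodic and p.o.s. evidence a real system would add.

```python
DA_KEYWORDS = {
    "ACCEPT": {"ok", "all", "right", "marvellous", "friday", "free"},
    "SUGGEST": {"monday", "friday", "thursday", "wednesday", "saturday"},
}

def score_dialogue_acts(words):
    """Count salient-keyword matches per dialogue act."""
    tokens = {w.lower() for w in words}
    return {da: len(tokens & kws) for da, kws in DA_KEYWORDS.items()}

print(score_dialogue_acts("ok that is all right".split()))
# {'ACCEPT': 3, 'SUGGEST': 0}
```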
50.
- Key features
- f0 (range better than accent or phrasing), duration, energy, rate (Shriberg et al 98)
- But little improvement in ASR accuracy
- Importance of the DA coding scheme
- some DAs are more usefully disambiguated than others
- some coding schemes are more disambiguable than others
51. Overview
- Recognizing communicative problems
- ASR errors
- User corrections
- Disfluencies and self-repairs
- Identifying speech acts
- Locating topic boundaries for topic tracking and audio browsing
- Recognizing speaker emotion
52. Prosodic Correlates of Discourse/Topic Structure
- Pitch range
- Lehiste 75, Brown et al 83, Silverman 86, Avesani & Vayra 88, Ayers 92, Swerts et al 92, Grosz & Hirschberg 92, Swerts & Ostendorf 95, Hirschberg & Nakatani 96
- Preceding pause
- Lehiste 79, Chafe 80, Brown et al 83, Silverman 86, Woodbury 87, Avesani & Vayra 88, Grosz & Hirschberg 92, Passonneau & Litman 93, Hirschberg & Nakatani 96
53.
- Rate
- Butterworth 75, Lehiste 80, Grosz & Hirschberg 92, Hirschberg & Nakatani 96
- Amplitude
- Brown et al 83, Grosz & Hirschberg 92, Hirschberg & Nakatani 96
- Contour
- Brown et al 83, Woodbury 87, Swerts et al 92
54. Automatic Topic Segmentation
- Important for audio browsing and retrieval tasks
- Broadcast News (NIST TREC SDR track)
- Topic Detection and Tracking (NIST/DARPA TDT)
- Customer care call recordings, focus groups
- Most work relies on lexical information (Hearst 94, Reynar 98, Beeferman et al 99)
55. Prosodic Cues to Segmentation
- Paratones ("intonational paragraphs") (Brown et al 80, Nakatani & Hirschberg 95)
- Recent results (Shriberg et al 00) show prosodic cues perform as well as, or better than, text-based cues at topic segmentation -- and may generalize better
- Goal: identify sentence and topic boundaries at ASR-defined word boundaries
- Procedure
- CART decision trees provide boundary predictions
- an HMM combines these with lexical boundary predictions (simplified sketch after slide 58)
56.
- Features
- Pause at boundary (raw and normalized by speaker)
- Pause at word before boundary
- Normalized phone and rhyme duration
- F0 (smoothed/stylized): reset, range, slope, and continuity
- Voice quality (halving/doubling estimates as correlates of creak or glottalization)
- Speaker change, time from start of turn, turns in conversation, and gender
- Topic segmentation results (BN only)
- prosody alone better than LM; combined improves significantly
- useful features: pause at boundary, f0 range, turn/no turn, gender, time in turn
57. Features Examined
- For each potential boundary location
- Pause at boundary (raw and normalized by speaker)
- Pause at word before boundary (is this a new turn or part of a continuous speech segment?)
- Phone and rhyme duration (normalized by inherent duration) (phrase-final lengthening?)
- F0 (smoothed and stylized): reset, range (topline, baseline), slope, and continuity
- Voice quality (halving/doubling estimates as correlates of creak or glottalization)
- Speaker change, time from start of turn, turns in conversation, and gender
- Trained/tested on Switchboard and Broadcast News
58.
- Sentence segmentation results
- prosody better than LM for BN, but worse (on transcription) and the same (on recognition output) for SB; all better than chance
- useful features for BN: pause at boundary, turn/no turn, f0 difference across boundary, rhyme duration
- useful features for SB: phone/rhyme duration before boundary, pause at boundary, turn/no turn, pause at preceding word boundary, time in turn
- Topic segmentation results (BN only)
- useful features: pause at boundary, f0 range, turn/no turn, gender, time in turn
- prosody alone better than LM; combined improves significantly
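Shriberg et al (00) combine the prosodic tree's predictions with the lexical model through an HMM; as a much-simplified stand-in, the sketch below just interpolates the two boundary posteriors at each word boundary. The weight and the example probabilities are invented.

```python
def combined_boundary_prob(p_prosody, p_lm, lam=0.5):
    """Interpolate prosodic (CART) and language-model boundary posteriors."""
    return lam * p_prosody + (1.0 - lam) * p_lm

p_prosody = [0.05, 0.80, 0.10]  # CART posteriors at three word boundaries
p_lm = [0.10, 0.60, 0.30]       # lexical-model posteriors at the same points
boundaries = [combined_boundary_prob(pp, pl) > 0.5
              for pp, pl in zip(p_prosody, p_lm)]
print(boundaries)  # [False, True, False]
```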
59. SCAN
60. SCAN Demo
61. (SCANMail architecture diagram) Components: POP-3 Server, ASR Server, Email Server, AUDIX Server, SCANMail HUB/DB, Information Extraction Server, Caller Id Server, IR Server, Client
65. Overview
- Recognizing communicative problems
- ASR errors
- User corrections
- Disfluencies and self-repairs
- Identifying speech acts
- Locating topic boundaries for topic tracking and audio browsing
- Recognizing speaker emotion
66. Identifying Emotion
- Human perception (Cahn 88, Murray & Arnott 93)
- Automatic identification (Nöth et al 99)
67. Future of Prosody in Recognition/Understanding
- Finding more non-traditional aspects of recognition/understanding where prosody can be useful
- Finding better ways to map linguistic information (e.g., accent) into objective acoustic measures
- Finding applications where prosodic information makes a difference