Title: Agust
1Turn-Yielding Cues in Task-Oriented Dialogue
- Agustín Gravano (1,2)
- Julia Hirschberg (1)
- (1) Columbia University, New York, USA
- (2) Universidad de Buenos Aires, Argentina
2Interactive Voice Response Systems
Introduction
- Quickly spreading.
- Uncomfortable, awkward.
- ASR and TTS account for most IVR problems.
- Other problems revealed.
- Coordination of system-user exchanges.
- Long pauses after user turns; interruptions.
- Modeling turn-taking behavior should lead to
improved system-user coordination.
3Goal
Introduction
- Learn when the speaker is likely to end her/his
conversational turn.
- Find turn-yielding cues.
- Cues displayed by the speaker when approaching a
potential turn boundary.
- This should improve the coordination of IVRs:
- Speech understanding: Detect the end of the
user's turn.
- Speech generation: Display cues signalling the
end of the system's turn.
4Talk Outline
- Previous work
- Material
- Method
- Results
- Conclusions
5Previous Work on Turn-Taking
- Duncan 1972, 1973, 1974, inter alia.
- Hypothesized 6 turn-yielding cues in face-to-face
dialogue.
- Conjectured a linear relation between the number
of displayed cues and the likelihood of a
turn-taking attempt.
- Studies formalized and verified some of Duncan's
hypotheses. [ForTho96] [WenSie03] [CutPea86]
[WicCas01]
- Implementations of turn-boundary detection.
- Simulations: [Fer et al. 02, 03] [Edl et al. 05] [Sch06]
[Att et al. 08] [Bau08]
- Actual systems: Let's Go! [RauEsk08]
- Exploiting turn-yielding cues improves
performance.
6Columbia Games Corpus
Material
- 12 task-oriented spontaneous dialogues.
- Standard American English.
- 13 subjects: 6 female, 7 male.
- Series of collaborative computer games.
- No eye contact. No speech restrictions.
- 9 hours of dialogue.
- Manual orthographic transcription, alignment.
- Manual prosodic annotations (ToBI).
7Columbia Games Corpus
Material
Player 1: Describer
Player 2: Follower
8Turn-Yielding Cues
- Cues displayed by the speaker when approaching a
potential turn boundary.
9Method
Turn-Yielding Cues
- IPU (Inter-Pausal Unit): Maximal sequence of
words from the same speaker surrounded by silence
≥ 50 ms.
- Smooth switch: Speaker A finishes her utterance;
speaker B takes the turn with no overlapping
speech.
- Trained annotators distinguished Smooth switches
from Interruptions and Backchannels using a
scheme based on Ferguson 1977, Beattie 1982.
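The IPU definition above can be sketched in a few lines. This is only an illustration, assuming that word-level time alignments are available for a single speaker and that the 50 ms figure is a minimum silence duration; the `Word` class, `segment_ipus` function, and the sample timings are hypothetical and not taken from the corpus tools.

```python
from dataclasses import dataclass

# Assumed pause threshold, read off the IPU definition (50 ms).
MIN_SILENCE_S = 0.050


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds


def segment_ipus(words, min_silence=MIN_SILENCE_S):
    """Group one speaker's time-aligned words into IPUs.

    A new IPU starts whenever the silence between two consecutive
    words is at least `min_silence` seconds.
    """
    ipus = []
    for word in sorted(words, key=lambda w: w.start):
        if ipus and word.start - ipus[-1][-1].end < min_silence:
            ipus[-1].append(word)   # short gap: same IPU
        else:
            ipus.append([word])     # first word, or long enough silence
    return ipus


# Made-up alignment for illustration only.
words = [
    Word("move", 0.00, 0.30),
    Word("the", 0.32, 0.40),
    Word("lion", 0.41, 0.80),
    Word("okay", 1.50, 1.90),
]
ipu_texts = [" ".join(w.text for w in ipu) for ipu in segment_ipus(words)]
print(ipu_texts)
```

Classifying what follows each IPU (Hold, Smooth switch, Interruption, Backchannel) was done by trained annotators and is not attempted here.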
10Method
Turn-Yielding Cues
- To find turn-yielding cues, we compare
- IPUs preceding Holds,
- IPUs preceding Smooth switches.
- 200 features: acoustic, prosodic, lexical,
syntactic.
11Individual Cues
Turn-Yielding Cues
- Final intonation
- Falling (L-L%) or high-rising (H-H%).
- Faster speaking rate.
- Reduction of final lengthening.
- Lower intensity level.
- Lower pitch level.
- Higher jitter, shimmer, NHR.
- Related to perception of voice quality.
- Longer IPU duration (seconds and words).
12Individual Cues
Turn-Yielding Cues
- Textual completion (independent of intonation).
- (1) Manually annotated a portion of the data.
- Labelers read up to the end of a target IPU (no
right context), judged whether it could
constitute a complete utterance. 400 tokens.
κ = 0.81.
- (2) Trained an SVM classifier. 19 lexical and
syntactic features. Accuracy: 80%. Majority-class
baseline: 55%. Human agreement: 91%.
- (3) Labeled all IPUs in the corpus with the SVM
model.
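As a reminder of what a kappa agreement score measures, here is a minimal sketch of Cohen's kappa for two labelers. The slide does not say which kappa variant or how many labelers were used, so the two-annotator form is an assumption, and the judgments below are made up.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's label distribution.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy judgments (made up): True = "could be a complete utterance".
annotator_1 = [True, True, True, False, False, True, False, True, False, False]
annotator_2 = [True, True, False, False, False, True, False, True, True, False]
kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 2))
```

Here the labelers agree on 8 of 10 items while chance agreement is 0.5, so kappa comes out well below the raw agreement rate.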
13Individual Cues
Turn-Yielding Cues
- Final intonation: L-L% or H-H%.
- Faster speaking rate.
- Lower intensity level.
- Lower pitch level.
- Higher jitter, shimmer, NHR.
- Longer IPU duration.
- Textual completion.
14Defining Presence of a Cue
Turn-Yielding Cues
- 2-3 representative features for each cue:
Final intonation: Abs. pitch slope over final 200ms, 300ms.
Speaking rate: Syllables/sec, phonemes/sec over IPU.
Intensity level: Mean intensity over final 500ms, 1000ms.
Pitch level: Mean pitch over final 500ms, 1000ms.
Voice quality: Jitter, shimmer, NHR over final 500ms.
IPU duration: Duration in ms, and in number of words.
Textual completion: Complete vs. incomplete (binary).
- Define presence/absence based on whether the
value is closer to the mean before S (Smooth switches) or H (Holds).
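The presence/absence rule can be written as a one-line comparison. The sketch below covers a single numeric feature only; how the 2-3 features of a cue are combined is not stated on the slide, and the function name, tie handling, and dB values are assumptions for illustration.

```python
def cue_present(value, mean_before_s, mean_before_h):
    """Return True if `value` is closer to the mean observed before
    Smooth switches (S) than to the mean before Holds (H).

    Ties are counted as "absent" here; that choice is an assumption.
    """
    return abs(value - mean_before_s) < abs(value - mean_before_h)


# Hypothetical means for one feature (mean intensity in dB, final 500 ms).
MEAN_S, MEAN_H = 58.0, 62.0

low_intensity = cue_present(59.0, MEAN_S, MEAN_H)   # nearer the S mean
high_intensity = cue_present(61.5, MEAN_S, MEAN_H)  # nearer the H mean
print(low_intensity, high_intensity)
```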
15Top Frequencies of Complex Cues
Legend: digit = cue present; dot = cue absent.
Turn-yielding cues: 1 Final intonation; 2 Speaking rate;
3 Intensity level; 4 Pitch level; 5 IPU duration;
6 Voice quality; 7 Completion.
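The digit/dot notation lends itself to a simple frequency count of cue combinations. The sketch below assumes each IPU has already been reduced to the set of cue numbers found present; the `encode` helper and the toy IPUs are hypothetical.

```python
from collections import Counter

CUES = ["Final intonation", "Speaking rate", "Intensity level", "Pitch level",
        "IPU duration", "Voice quality", "Completion"]


def encode(present):
    """Encode a set of present cue numbers (1-7) as a pattern string,
    with a digit where the cue is present and a dot where it is absent."""
    return "".join(str(i) if i in present else "."
                   for i in range(1, len(CUES) + 1))


# Toy data (made up): cues found present in six IPUs.
ipus = [
    {1, 2, 3, 4, 5, 6, 7},
    {1, 2, 4, 6, 7},
    {1, 2, 3, 4, 5, 6, 7},
    {7},
    {1, 2, 4, 6, 7},
    {1, 2, 3, 4, 5, 6, 7},
]
pattern_counts = Counter(encode(p) for p in ipus)
top = pattern_counts.most_common(2)
print(top)
```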
16Combined Cues
Turn-Yielding Cues
(Plot) x-axis: Number of cues conjointly displayed.
y-axis: Percentage of turn-taking attempts.
r² = 0.969
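For readers who want to reproduce this kind of analysis on their own data, here is a minimal least-squares sketch that fits a line to (number of cues, percentage of turn-taking attempts) pairs and reports the coefficient of determination. It assumes a straight-line fit; the percentages are invented for illustration and are not the values behind the plot.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares line y = intercept + slope * x.

    Returns (slope, intercept, r_squared).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # r^2 = 1 - (residual sum of squares / total sum of squares)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1.0 - ss_res / ss_tot


# Made-up data: number of cues displayed vs. % of turn-taking attempts.
num_cues = [0, 1, 2, 3, 4, 5, 6, 7]
pct_attempts = [5, 12, 18, 30, 35, 48, 55, 65]
slope, intercept, r_squared = linear_fit(num_cues, pct_attempts)
print(f"slope={slope:.2f} intercept={intercept:.2f} r^2={r_squared:.3f}")
```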
17IVR Systems
Turn-Yielding Cues
- After each IPU from the user:
- if estimated likelihood > threshold
- then take the turn
- To signal the end of a system's turn:
- Include as many cues as possible in the system's
final IPU.
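The decision rule for the user's side could be sketched as follows. This is a hedged illustration, not a description of any deployed system: the linear likelihood model, its coefficients, and the threshold are all placeholder assumptions that would have to be estimated from data.

```python
def turn_end_likelihood(num_cues, slope=0.1, intercept=0.05):
    """Hypothetical estimate of the likelihood that the user's turn has
    ended, as a linear function of the number of cues displayed in the
    last IPU, clipped to the range [0, 1]."""
    return min(1.0, max(0.0, intercept + slope * num_cues))


def should_take_turn(num_cues, threshold=0.5):
    """Apply the rule: take the turn if estimated likelihood > threshold."""
    return turn_end_likelihood(num_cues) > threshold


# After each user IPU, count the cues detected and decide.
decision_few_cues = should_take_turn(num_cues=2)   # likelihood 0.25
decision_many_cues = should_take_turn(num_cues=6)  # likelihood 0.65
print(decision_few_cues, decision_many_cues)
```

Raising the threshold trades fewer interruptions for longer pauses after user turns, which are the two coordination problems named in the introduction.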
18Summary
- Study of turn-yielding cues.
- Objective, automatically computable.
- Combined cues.
- Improve turn-taking decisions of IVR systems.
- Results drawn from task-oriented dialogues.
- Not necessarily generalizable.
- Suitable for most IVR domains.
- Interspeech 2009: Study of backchannel-inviting
cues.
19Special thanks to
- Julia Hirschberg
- Thesis Committee Members
- Maxine Eskenazi, Kathy McKeown, Becky Passonneau,
Amanda Stent.
- Speech Lab at Columbia University
- Stefan Benus, Fadi Biadsy, Sasha Caskey, Bob
Coyne, Frank Enos, Martin Jansche, Jackson
Liscombe, Sameer Maskey, Andrew Rosenberg.
- Collaborators
- Gregory Ward and Elisa Sneed German (Northwestern
U); Ani Nenkova (UPenn); Héctor Chávez, David
Elson, Michel Galley, Enrique Henestroza, Hanae
Koiso, Shira Mitchell, Michael Mulley, Kristen
Parton, Ilia Vovsha, Lauren Wilcox.