Title: Characterizing Task-Oriented Dialog using a Simulated ASR Channel
1. Characterizing Task-Oriented Dialog using a Simulated ASR Channel
- Jason D. Williams
- Machine Intelligence Laboratory
- Cambridge University Engineering Department
2. SACTI-1 Corpus: Simulated ASR-Channel Tourist Information
- Motivation for the data collection
- Experimental set-up
- Transcription Annotation
- Effects of ASR error rate on
- Turn length / dialog length
- Perception of error rate
- Task completion
- Initiative
- Overall satisfaction (PARADISE)
3. ASR channel vs. HH channel
Properties
- HH channel:
  - Instant communication
  - Effectively perfect recognition of words
  - Prosodic information carries additional information
- ASR channel:
  - Turns explicitly segmented (barge-in, end-pointed)
  - Prosody virtually eliminated
  - ASR parsing errors
Observations
- HH channel:
  - Frequent but brief overlaps
  - 80% of utterances contain fewer than 12 words (50% fewer than 5)
  - Approximately equal turn length
  - Approximately equal balance of initiative
  - About half of turns are ACK (often spliced)
- ASR channel:
  - Few overlaps
  - Longer system turns, shorter user turns
  - Initiative more often with the system
  - Virtually no turns are ACK
  - Virtually no splicing
Are models of HC dialog/grounding appropriate in the presence of the ASR channel?
4. My approach
- Study the ASR channel in the abstract
- WoZ experiments using a simulated ASR channel
- Understand how people behave with an ideal dialog manager
  - For example, a grounding model
- Use these insights to inform state space and action set selection
- Note that the collected data has unique properties useful to:
  - RL-based systems
  - Hidden-state estimation
  - User modeling
- Formulate the dialog management problem as a POMDP
- Decompose the state into BN nodes, for example:
  - Conversation state (grounding state)
  - User action
  - User belief (goal)
- Train using the collected data
- Solve using approximations
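A minimal sketch, not the talk's implementation, of what the factored belief update implied by this decomposition could look like: the hidden state is split into user goal, user action, and conversation/grounding state, and the belief is updated from the last system action and a noisy ASR observation. All state sets and probabilities below are hypothetical placeholders; in practice they would be trained from the collected data.

```python
# Illustrative factored POMDP belief update over (goal, user action, grounding state).
from collections import defaultdict
from itertools import product

GOALS  = ["hotel", "restaurant"]        # user goal (hidden)
U_ACTS = ["inform", "ack"]              # user action (hidden)
GROUND = ["grounded", "ungrounded"]     # conversation/grounding state (hidden)

def p_goal(g2, g, a_s):       return 1.0 if g2 == g else 0.0      # assume the goal is fixed
def p_user_act(a2, g2, a_s):  return 0.7 if a2 == "inform" else 0.3
def p_ground(c2, c, a2, a_s): return 0.6 if c2 == "grounded" else 0.4
def p_obs(o, a2):             return 0.8 if o == a2 else 0.2      # simulated-ASR confusion

def belief_update(belief, a_s, o):
    """belief maps (g, a_u, c) -> probability; returns the normalized update."""
    new = defaultdict(float)
    for (g, a_u, c), p in belief.items():
        for g2, a2, c2 in product(GOALS, U_ACTS, GROUND):
            new[(g2, a2, c2)] += (p * p_goal(g2, g, a_s) * p_user_act(a2, g2, a_s)
                                  * p_ground(c2, c, a2, a_s) * p_obs(o, a2))
    z = sum(new.values()) or 1.0
    return {s: v / z for s, v in new.items()}

uniform = {s: 1.0 / 8 for s in product(GOALS, U_ACTS, GROUND)}     # 2*2*2 joint states
print(belief_update(uniform, a_s="ask_area", o="inform"))
```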
5. The paradox of dialog data
- To build a user model, we need to see the user's reaction to all kinds of misunderstandings
- However, most systems use a fixed policy
- Systems typically do not take different actions in the same situation
- Taking random actions is clearly not an option!
- Constraining actions means building very complex systems
- ...and which actions should be in the system's repertoire?
6. An ideal data collection
- ...would show users' reactions to a variety of error-handling strategies (no fixed policy)
- BUT would not be nonsense dialogs!
- ...would use the ASR channel
- ...would explore a variety of operating conditions, e.g., WER
- ...would not assume a particular state space
- ...would somehow discover the set of system actions
7. Data collection set-up
8. ASR simulation state machine
- Simple energy-based barge-in
- User interrupts wizard
[State-machine diagram: states SILENCE, USER_TALKING, TYPIST_TYPING, WIZARD_TALKING; transition events: user starts talking, user stops talking, typist done (reco result displayed), wizard starts talking, wizard stops talking, user starts talking during wizard speech (barge-in)]
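A minimal sketch of the turn-taking state machine above. State and event names follow the diagram; the exact edge set is a reconstruction and should be read as illustrative only.

```python
# Illustrative reconstruction of the simulation's turn-taking FSM.
TRANSITIONS = {
    ("SILENCE",        "user_starts_talking"):   "USER_TALKING",
    ("SILENCE",        "wizard_starts_talking"): "WIZARD_TALKING",
    ("USER_TALKING",   "user_stops_talking"):    "TYPIST_TYPING",
    ("TYPIST_TYPING",  "typist_done"):           "SILENCE",        # reco result displayed to wizard
    ("WIZARD_TALKING", "wizard_stops_talking"):  "SILENCE",
    ("WIZARD_TALKING", "user_starts_talking"):   "USER_TALKING",   # energy-based barge-in
}

def step(state, event):
    # Events with no outgoing edge from the current state are ignored.
    return TRANSITIONS.get((state, event), state)

state = "SILENCE"
for event in ["user_starts_talking", "user_stops_talking", "typist_done"]:
    state = step(state, event)
    print(f"{event:22s} -> {state}")
```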
9. ASR simulation
- Simplified FSM-based recognizer
  - Weighted finite state transducer (WFST)
- Flow (a toy sketch follows this list):
  - Reference input
  - Spell-checked against the full dictionary
  - Converted to a phonetic string using the full dictionary
  - Phonetic lattice generated based on a confusion model
  - Word lattice produced
  - Language model composed to re-score the lattice
  - Decoded to produce word strings
  - N-best list extracted
- Various free variables to induce random behavior
- Plumb the N-best list for variability
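The real simulation composes weighted finite-state transducers; the sketch below is only a toy analogue of the same flow (dictionary lookup, phone-level confusion, language-model re-scoring, N-best extraction). The lexicon, confusion table, language model, and error rate are all invented for illustration.

```python
# Toy word-level analogue of the simulated-ASR flow (not the WFST implementation).
import random

LEXICON    = {"hotel": "HH OW T EH L", "bar": "B AA R", "tram": "T R AE M"}
CONFUSIONS = {"T": ["T", "D"], "B": ["B", "P"], "EH": ["EH", "AE"], "AA": ["AA", "AE"]}
LM         = {"hotel": 0.5, "bar": 0.3, "tram": 0.2}   # toy unigram language model

def confuse(phone_str, p_err=0.3):
    """Produce one noisy path through a 'phonetic lattice' by confusing phones."""
    return " ".join(random.choice(CONFUSIONS.get(p, [p])) if random.random() < p_err else p
                    for p in phone_str.split())

def phone_distance(a, b):
    a, b = a.split(), b.split()
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))

def simulate_asr(word, n_best=3):
    phones = LEXICON[word]                                # reference -> phonetic string
    noisy  = confuse(phones)                              # confusion model
    scored = [(phone_distance(noisy, p) - LM[w], w)       # acoustic score re-scored by the LM
              for w, p in LEXICON.items()]
    return [w for _, w in sorted(scored)[:n_best]]        # decode and extract the N-best list

print(simulate_asr("hotel"))
```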
10. ASR simulation evaluation
- Hypothesis: the simulation produces errors similar to errors induced by additive noise, w.r.t. concept accuracy
- Assess concept accuracy as an F-measure using an automated, data-driven procedure (HVS model); a sketch of the measure follows this list
- Plot concept accuracy of:
  - Real additive noise
  - A naïve confusion model (simple insertion, substitution, deletion)
  - The WFST confusion model
- The WFST model appears to follow the real data much more closely than the naïve model
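A minimal sketch of concept accuracy as an F-measure over (slot, value) concepts. In the evaluation the concept extraction is done automatically with an HVS model; here the concept sets are supplied by hand, purely to show the measure itself.

```python
# F-measure over sets of (slot, value) concepts.
def concept_f_measure(ref_concepts, hyp_concepts):
    ref, hyp = set(ref_concepts), set(hyp_concepts)
    if not ref and not hyp:
        return 1.0
    tp = len(ref & hyp)                                  # correctly recovered concepts
    precision = tp / len(hyp) if hyp else 0.0
    recall    = tp / len(ref) if ref else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

ref = [("type", "hotel"), ("near", "main square")]       # concepts in the reference
hyp = [("type", "hotel"), ("near", "station")]           # concepts after the ASR channel
print(round(concept_f_measure(ref, hyp), 2))             # 0.5
```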
11. Scenario Tasks
- Tourist / tourist information scenario
- Intentionally goal-directed
- Intentionally simple tasks
- Mixtures of simple information gathering and basic planning
- Wizard (information giver)
  - Access to bus times, tram times, restaurants, hotels, bars, tourist attraction information, etc.
- User given a series of tasks
  - Likert scores asked at the end of each task
- 4 dialogs per user; 3 users per wizard
- Example task: finding the perfect hotel
  - You're looking for a hotel for you and your travelling partner that meets a number of requirements.
  - You'd like the following:
    - En suite rooms
    - Quiet rooms
    - As close to the main square as possible
  - Given those desires, find the least expensive hotel. You'd prefer not to compromise on your requirements, but of course you will if you must!
  - Please indicate the location of the hotel on the map and fill in the boxes below.
    - Name of accommodation
    - Cost per night for 2 people
12. User's map
13. Wizard's map
14. Likert-scale questions
- User and wizard each given 6 questions after each task
- Example questions (subject):
  - In this task, I accomplished the goal.
  - In this task, I thought the speech recognition was accurate.
  - In this task, I found it difficult to communicate because of the speech recognition.
  - In this task, I believe the other subject was very helpful.
  - In this task, the other subject found using the speech recognition difficult.
  - Overall, I was very satisfied with this past task.
- Response scale: Disagree strongly (1), Disagree (2), Disagree somewhat (3), Neither agree nor disagree (4), Agree somewhat (5), Agree (6), Strongly agree (7)
15. Transcription
- User side transcribed during the experiments
  - Prioritized for speed
  - "I NEED UH IM LOOKING FOR A PIZZA"
- Wizard side transcribed in more detail, using a subset of the LDC transcription guidelines
  - "ok uh (()) sure you - i can"
  - epErrorEndtrue
16. Annotation (acts)
- Each turn is a sequence of tags (a representation sketch follows the table)
- Inspired by Traum's Grounding Acts
- More detailed / easier to infer from surface words

Tag | Meaning
Request | Question/request requiring a response
Inform | Statement/provision of task information
Greet-Farewell | "Hello", "How can I help", "that's all", "Thanks", "Goodbye", etc.
ExplAck | Explicit statement of acknowledgement, showing the speaker's understanding of the other speaker (OS)
Unsolicited-Affirm | Explicit statement of acknowledgement, showing that the OS understands the speaker
HoldFloor | Explicit request for the OS to wait
ReqRepeat | Request for the OS to repeat their last turn
ReqAck | Request for the OS to show understanding
RspAffirm | Affirmative response to ReqAck
RspNegate | Negative response to ReqAck
StateInterp | A statement of the intention of the OS
DisAck | Show of lack of understanding of the OS
RejOther | Display of lack of understanding of the speaker's intention or desire by the OS
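A minimal sketch of how an annotated turn could be represented: each turn is a sequence of tags drawn from the inventory above. The tag names are copied from the table; the data structure and the example turn are illustrative, not the corpus format.

```python
# Illustrative representation of a turn as a sequence of annotation tags.
from enum import Enum

class Tag(Enum):
    REQUEST = "Request";  INFORM = "Inform";  GREET_FAREWELL = "Greet-Farewell"
    EXPL_ACK = "ExplAck";  UNSOLICITED_AFFIRM = "Unsolicited-Affirm"
    HOLD_FLOOR = "HoldFloor";  REQ_REPEAT = "ReqRepeat";  REQ_ACK = "ReqAck"
    RSP_AFFIRM = "RspAffirm";  RSP_NEGATE = "RspNegate"
    STATE_INTERP = "StateInterp";  DIS_ACK = "DisAck";  REJ_OTHER = "RejOther"

# e.g. a wizard turn "ok -- the castle is on the main square -- anything else?"
turn = [Tag.EXPL_ACK, Tag.INFORM, Tag.REQUEST]
print([t.value for t in turn])
```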
17. Annotation (understanding)
- Each wizard turn was labeled to indicate whether the wizard understood the previous user turn

Label | Wizard's understanding of the previous user turn
Full | All intentions understood correctly.
Partial | Some intentions understood, none misunderstood.
Non | Wizard made no guess at user intention.
Flagged-Mis | The wizard formed an incorrect hypothesis of the user's meaning, and signalled a dialog problem.
Un-Flagged-Mis | The wizard formed an incorrect hypothesis of the user's meaning, accepted it as correct, and continued with the dialog.
18. Corpus summary

WER target | Wizards | Users | Tasks | Completed in time limit (%) | Per-turn WER (%) | Per-dialog WER (%)
None | 2 | 6 | 24 | 83 | 0 | 0
Low | 4 | 12 | 48 | 83 | 32 | 28
Med | 4 | 12 | 48 | 77 | 46 | 41
Hi | 2 | 6 | 24 | 42 | 63 | 60
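For reference, the two WER columns can be computed as in the sketch below. Whether the corpus uses exactly these aggregations is an assumption: per-turn WER here averages each turn's error rate, while per-dialog WER pools the edit operations over the whole dialog.

```python
# Assumed definitions of the per-turn and per-dialog WER summaries.
def edits(ref, hyp):
    """Levenshtein distance between two word lists."""
    d = [[i + j if 0 in (i, j) else 0 for j in range(len(hyp) + 1)]
         for i in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            d[i][j] = min(d[i-1][j] + 1,                          # deletion
                          d[i][j-1] + 1,                          # insertion
                          d[i-1][j-1] + (ref[i-1] != hyp[j-1]))   # substitution
    return d[-1][-1]

def per_turn_wer(turns):    # turns: list of (reference_words, hypothesis_words)
    return sum(edits(r, h) / len(r) for r, h in turns) / len(turns)

def per_dialog_wer(turns):
    return sum(edits(r, h) for r, h in turns) / sum(len(r) for r, _ in turns)

turns = [("i need a hotel".split(), "i need the hotel".split()),
         ("near the main square".split(), "near main square".split())]
print(round(per_turn_wer(turns), 2), round(per_dialog_wer(turns), 2))
```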
19. Perception of ASR accuracy
- How accurately do users and wizards perceive WER?
- Perceptions of recognition quality broadly
reflected actual performance, but users
consistently gave higher quality scores than
wizards for the same WER
20. Average turn length (words)
- How does WER affect wizard and user turn length?
- Wizard turn length increases
- User turn length stays relatively constant
21. Grounding behavior
- How does WER affect wizard grounding behavior?
- As WER increases, wizard grounding behaviors
become increasingly prevalent
22. Wizard understanding
- How does WER affect the wizard's understanding status?
- Misunderstanding increases with WER
- ...and task completion falls (83%, 83%, 77%, 42%)
23. Wizard strategies
- Classify each wizard turn into one of 5 strategies (a mapping sketch follows the table)

Label | Meaning | Wizard initiative? | Tags
REPAIR | Attempt to repair | Yes | ReqAck, ReqRepeat, StateInterp, DisAck, RejOther
ASKQ | Ask task question | Yes | Request
GIVEINFO | Provide task info | No | Inform
RSPND | Non-initiative-taking grounding actions | No | ExplAck, Rsp-Affirm, Rsp-Negate, Unsolicited-Affirm
OTHER | Not included in analysis | n/a | All others
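A minimal sketch of the tag-to-strategy mapping in the table above. The groupings come from the table; applying them in the priority order REPAIR, ASKQ, GIVEINFO, RSPND when a turn carries several tags is an assumption made for illustration.

```python
# Illustrative mapping from a turn's tag sequence to one of the five strategies.
REPAIR_TAGS = {"ReqAck", "ReqRepeat", "StateInterp", "DisAck", "RejOther"}
RSPND_TAGS  = {"ExplAck", "Rsp-Affirm", "Rsp-Negate", "Unsolicited-Affirm"}

def classify_strategy(turn_tags):
    tags = set(turn_tags)
    if tags & REPAIR_TAGS:
        return "REPAIR"
    if "Request" in tags:
        return "ASKQ"
    if "Inform" in tags:
        return "GIVEINFO"
    if tags & RSPND_TAGS:
        return "RSPND"
    return "OTHER"

print(classify_strategy(["ExplAck", "Inform", "Request"]))   # ASKQ under this ordering
```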
24. Wizard strategies
- What is the most successful strategy after known dialog trouble?
- This plot shows the wizard's understanding status one turn after known dialog trouble: the effect of REPAIR vs. ASKQ
- "S" indicates significant differences
25. User reactions to misunderstandings
- How does a user respond after being misunderstood?
- Surprisingly little explicit indication!

User turns including each tag (%):
WER target | DisAck | RejectOther | Request
None | N/A | N/A | N/A
Low | 0.0 | 3.8 | 92.3
Med | 2.5 | 19.0 | 75.9
Hi | 0.0 | 12.3 | 87.0
26. Level of wizard initiative
- How does initiative vary with WER?
- Define wizard initiative using the strategies above
27. Reward measures / PARADISE
- Satisfaction ≈ Task completion + Dialog cost metrics (PARADISE)
- 2 kinds of user satisfaction:
  - Single
  - Combi
- 3 kinds of task completion:
  - User
  - Obj
  - Hyb
- Cost metrics:
  - PerDialogWER
  - UnFlaggedMis
  - FlaggedMis
  - Non
  - Turns
  - REPAIR
  - ASKQ
28. Reward measures / PARADISE
- In almost all experiments using the User task completion metric, it was the only significant predictor
- The Single/Combi satisfaction metrics almost always selected the same predictors (a regression sketch follows the table)

Dataset | Metric (task completion, user sat.) | R² (%) | Significant predictors
ALL | User-S | 52 | 1.03 Task
ALL | User-C | 60 | 5.29 Task, 1.54 UnFlagMis
ALL | Obj-S | 24 | -0.49 Turns, 0.38 Task
ALL | Obj-C | 27 | -2.43 Turns, 1.45 UnFlagMis, 1.35 Task
ALL | Hyb-S | 41 | 0.74 Task, 0.36 Turns
Hi | Obj-S | 40 | 0.98 Task
Hi | Hyb-S | 48 | 1.07 Task
Med | Obj-S | 16 | -0.62 Non
Med | Obj-C | 37 | -3.35 Non, 2.94 Turns
Med | Hyb-S | 38 | 0.97 Task
Low | Obj-S | 28 | -0.59 Turns
Low | Hyb-S | 40 | -0.49 Turns, 0.40 Task
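A minimal sketch of the PARADISE-style analysis behind the table: regress user satisfaction on normalized task-completion and cost metrics and read off the coefficients. The per-dialog numbers below are fabricated purely to make the example run; they are not from the corpus.

```python
# Illustrative PARADISE-style linear regression of user satisfaction.
import numpy as np

def zscore(x):
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

task  = zscore([1, 1, 0, 1, 0, 1])                # task completion
turns = zscore([20, 35, 50, 25, 60, 30])          # dialog length (turns)
unflg = zscore([0, 2, 5, 1, 6, 1])                # unflagged misunderstandings
sat   = np.array([6.5, 5.0, 2.5, 6.0, 2.0, 5.5])  # user satisfaction (Likert)

X = np.column_stack([np.ones(len(sat)), task, turns, unflg])
coef, *_ = np.linalg.lstsq(X, sat, rcond=None)    # ordinary least squares fit
print(dict(zip(["intercept", "Task", "Turns", "UnFlagMis"], np.round(coef, 2))))
```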
29. Reward measures / PARADISE
- What indicators best predict user satisfaction?
- When run on all data, mixtures of Task, Turns, and UnFlaggedMis best predict user satisfaction
- UnFlaggedMis serves as a better measurement of understanding accuracy than WER alone, since it effectively combines recognition accuracy with a measure of confidence
- Broadly speaking:
  - Task completion is most important at the High WER level
  - Task completion and dialog quality are most important at the Med WER level
  - Efficiency is most important at the Low WER level
- These patterns mirror findings from other PARADISE experiments using Human/Computer data
- This gives us some confidence that this data set is valid for training Human/Computer systems
30. Conclusions / Next steps
- At moderate WER levels, asking task-related questions appears to be more successful than direct dialog repair
- Levels of expert initiative increase with WER, primarily as a result of grounding behavior
- Users infrequently give a direct indication of having been misunderstood, with no clear correlation to WER
- When run on all data, mixtures of Task, Turns, and UnFlaggedMis best predict user satisfaction
- Task completion appears to be most predictive of user satisfaction; however, efficiency shows some influence at lower WERs
- Next: apply this corpus to statistical systems
31. Thanks!
Jason D. Williams jdw30_at_cam.ac.uk