Title: Characterizing Task-Oriented Dialog using a Simulated ASR Channel
1. Characterizing Task-Oriented Dialog using a Simulated ASR Channel
- Jason D. Williams
- Machine Intelligence Laboratory
- Cambridge University Engineering Department
2. SACTI-1 Corpus: Simulated ASR-Channel Tourist Information
- Motivation for the data collection
- Experimental set-up
- Transcription Annotation
- Effects of ASR error rate on
- Turn length / dialog length
- Perception of error rate
- Task completion
- Initiative
- Overall satisfaction (PARADISE)
3. ASR channel vs. HH channel
Properties
- HH channel:
  - Instant communication
  - Effectively perfect recognition of words
  - Prosodic information carries additional information
- ASR channel:
  - Turns explicitly segmented (barge-in, end-pointed)
  - Prosody virtually eliminated
  - ASR parsing errors
Observations
- HH channel:
  - Frequent but brief overlaps
  - 80% of utterances contain fewer than 12 words (50% fewer than 5)
  - Approximately equal turn length
  - Approximately equal balance of initiative
  - About half of turns are ACK (often spliced)
- ASR channel:
  - Few overlaps
  - Longer system turns, shorter user turns
  - Initiative more often with the system
  - Virtually no turns are ACK
  - Virtually no splicing
Are models of HC dialog/grounding appropriate in the presence of the ASR channel?
4. My approach
- Study the ASR channel in the abstract
- WoZ experiments using a simulated ASR channel
- Understand how people behave with an ideal dialog manager
  - For example, a grounding model
- Use these insights to inform state space and action set selection
- Note that the collected data has unique properties useful to:
  - RL-based systems
  - Hidden-state estimation
  - User modeling
- Formulate the dialog management problem as a POMDP
- Decompose the state into BN nodes, for example:
  - Conversation state (grounding state)
  - User action
  - User belief (goal)
- Train using the collected data
- Solve using approximations
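A minimal sketch, not the talk's implementation, of what the factored belief update implied by this decomposition could look like: the hidden state is split into user goal, user action, and conversation/grounding state, and the belief is updated from the last system action and a noisy ASR observation. All state sets and probabilities below are hypothetical placeholders; in practice they would be trained from the collected data.

```python
# Illustrative factored POMDP belief update over (goal, user action, grounding state).
from collections import defaultdict
from itertools import product

GOALS  = ["hotel", "restaurant"]        # user goal (hidden)
U_ACTS = ["inform", "ack"]              # user action (hidden)
GROUND = ["grounded", "ungrounded"]     # conversation/grounding state (hidden)

def p_goal(g2, g, a_s):       return 1.0 if g2 == g else 0.0      # assume the goal is fixed
def p_user_act(a2, g2, a_s):  return 0.7 if a2 == "inform" else 0.3
def p_ground(c2, c, a2, a_s): return 0.6 if c2 == "grounded" else 0.4
def p_obs(o, a2):             return 0.8 if o == a2 else 0.2      # simulated-ASR confusion

def belief_update(belief, a_s, o):
    """belief maps (g, a_u, c) -> probability; returns the normalized update."""
    new = defaultdict(float)
    for (g, a_u, c), p in belief.items():
        for g2, a2, c2 in product(GOALS, U_ACTS, GROUND):
            new[(g2, a2, c2)] += (p * p_goal(g2, g, a_s) * p_user_act(a2, g2, a_s)
                                  * p_ground(c2, c, a2, a_s) * p_obs(o, a2))
    z = sum(new.values()) or 1.0
    return {s: v / z for s, v in new.items()}

uniform = {s: 1.0 / 8 for s in product(GOALS, U_ACTS, GROUND)}     # 2*2*2 joint states
print(belief_update(uniform, a_s="ask_area", o="inform"))
```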
5. The paradox of dialog data
- To build a user model, we need to see the user's reaction to all kinds of misunderstandings
- However, most systems use a fixed policy
- Systems typically do not take different actions in the same situation
- Taking random actions is clearly not an option!
- Constraining actions means building very complex systems
- ...and which actions should be in the system's repertoire?
6. An ideal data collection
- ...would show users' reactions to a variety of error-handling strategies (no fixed policy)
- BUT would not be nonsense dialogs!
- ...would use the ASR channel
- ...would explore a variety of operating conditions, e.g., WER
- ...would not assume a particular state space
- ...would somehow discover the set of system actions
7. Data collection set-up
8. ASR simulation state machine
- Simple energy-based barge-in
- User interrupts wizard
[State-machine diagram: states SILENCE, USER_TALKING, TYPIST_TYPING, WIZARD_TALKING; transition events: user starts talking, user stops talking, typist done (reco result displayed), wizard starts talking, wizard stops talking, user starts talking during wizard speech (barge-in)]
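A minimal sketch of the turn-taking state machine above. State and event names follow the diagram; the exact edge set is a reconstruction and should be read as illustrative only.

```python
# Illustrative reconstruction of the simulation's turn-taking FSM.
TRANSITIONS = {
    ("SILENCE",        "user_starts_talking"):   "USER_TALKING",
    ("SILENCE",        "wizard_starts_talking"): "WIZARD_TALKING",
    ("USER_TALKING",   "user_stops_talking"):    "TYPIST_TYPING",
    ("TYPIST_TYPING",  "typist_done"):           "SILENCE",        # reco result displayed to wizard
    ("WIZARD_TALKING", "wizard_stops_talking"):  "SILENCE",
    ("WIZARD_TALKING", "user_starts_talking"):   "USER_TALKING",   # energy-based barge-in
}

def step(state, event):
    # Events with no outgoing edge from the current state are ignored.
    return TRANSITIONS.get((state, event), state)

state = "SILENCE"
for event in ["user_starts_talking", "user_stops_talking", "typist_done"]:
    state = step(state, event)
    print(f"{event:22s} -> {state}")
```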
9. ASR simulation
- Simplified FSM-based recognizer
  - Weighted finite state transducer (WFST)
- Flow (a toy sketch follows this list):
  - Reference input
  - Spell-checked against the full dictionary
  - Converted to a phonetic string using the full dictionary
  - Phonetic lattice generated based on a confusion model
  - Word lattice produced
  - Language model composed to re-score the lattice
  - Decoded to produce word strings
  - N-best list extracted
- Various free variables to induce random behavior
- Plumb the N-best list for variability
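The real simulation composes weighted finite-state transducers; the sketch below is only a toy analogue of the same flow (dictionary lookup, phone-level confusion, language-model re-scoring, N-best extraction). The lexicon, confusion table, language model, and error rate are all invented for illustration.

```python
# Toy word-level analogue of the simulated-ASR flow (not the WFST implementation).
import random

LEXICON    = {"hotel": "HH OW T EH L", "bar": "B AA R", "tram": "T R AE M"}
CONFUSIONS = {"T": ["T", "D"], "B": ["B", "P"], "EH": ["EH", "AE"], "AA": ["AA", "AE"]}
LM         = {"hotel": 0.5, "bar": 0.3, "tram": 0.2}   # toy unigram language model

def confuse(phone_str, p_err=0.3):
    """Produce one noisy path through a 'phonetic lattice' by confusing phones."""
    return " ".join(random.choice(CONFUSIONS.get(p, [p])) if random.random() < p_err else p
                    for p in phone_str.split())

def phone_distance(a, b):
    a, b = a.split(), b.split()
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))

def simulate_asr(word, n_best=3):
    phones = LEXICON[word]                                # reference -> phonetic string
    noisy  = confuse(phones)                              # confusion model
    scored = [(phone_distance(noisy, p) - LM[w], w)       # acoustic score re-scored by the LM
              for w, p in LEXICON.items()]
    return [w for _, w in sorted(scored)[:n_best]]        # decode and extract the N-best list

print(simulate_asr("hotel"))
```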
10. ASR simulation evaluation
- Hypothesis: the simulation produces errors similar to errors induced by additive noise, w.r.t. concept accuracy
- Assess concept accuracy as an F-measure using an automated, data-driven procedure (HVS model); a sketch of the measure follows this list
- Plot concept accuracy of:
  - Real additive noise
  - A naïve confusion model (simple insertion, substitution, deletion)
  - The WFST confusion model
- The WFST model appears to follow the real data much more closely than the naïve model
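A minimal sketch of concept accuracy as an F-measure over (slot, value) concepts. In the evaluation the concept extraction is done automatically with an HVS model; here the concept sets are supplied by hand, purely to show the measure itself.

```python
# F-measure over sets of (slot, value) concepts.
def concept_f_measure(ref_concepts, hyp_concepts):
    ref, hyp = set(ref_concepts), set(hyp_concepts)
    if not ref and not hyp:
        return 1.0
    tp = len(ref & hyp)                                  # correctly recovered concepts
    precision = tp / len(hyp) if hyp else 0.0
    recall    = tp / len(ref) if ref else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

ref = [("type", "hotel"), ("near", "main square")]       # concepts in the reference
hyp = [("type", "hotel"), ("near", "station")]           # concepts after the ASR channel
print(round(concept_f_measure(ref, hyp), 2))             # 0.5
```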
11. Scenario Tasks
- Tourist / tourist information scenario
- Intentionally goal-directed
- Intentionally simple tasks
- Mixtures of simple information gathering and basic planning
- Wizard (information giver)
  - Access to bus times, tram times, restaurants, hotels, bars, tourist attraction information, etc.
- User given a series of tasks
  - Likert scores asked at the end of each task
- 4 dialogs per user; 3 users per wizard
- Example task: finding the perfect hotel
  - You're looking for a hotel for you and your travelling partner that meets a number of requirements.
  - You'd like the following:
    - En suite rooms
    - Quiet rooms
    - As close to the main square as possible
  - Given those desires, find the least expensive hotel. You'd prefer not to compromise on your requirements, but of course you will if you must!
  - Please indicate the location of the hotel on the map and fill in the boxes below.
    - Name of accommodation
    - Cost per night for 2 people
12. User's map
13. Wizard's map
14. Likert-scale questions
- User and wizard each given 6 questions after each task
- Example questions (subject):
  - In this task, I accomplished the goal.
  - In this task, I thought the speech recognition was accurate.
  - In this task, I found it difficult to communicate because of the speech recognition.
  - In this task, I believe the other subject was very helpful.
  - In this task, the other subject found using the speech recognition difficult.
  - Overall, I was very satisfied with this past task.
- Response scale: Disagree strongly (1), Disagree (2), Disagree somewhat (3), Neither agree nor disagree (4), Agree somewhat (5), Agree (6), Strongly agree (7)
15. Transcription
- User side transcribed during the experiments
  - Prioritized for speed
  - "I NEED UH IM LOOKING FOR A PIZZA"
- Wizard side transcribed in more detail, using a subset of the LDC transcription guidelines
  - "ok uh (()) sure you - i can"
  - epErrorEndtrue
16. Annotation (acts)
- Each turn is a sequence of tags (a representation sketch follows the table)
- Inspired by Traum's Grounding Acts
- More detailed / easier to infer from surface words

Tag | Meaning
Request | Question/request requiring a response
Inform | Statement/provision of task information
Greet-Farewell | "Hello", "How can I help", "that's all", "Thanks", "Goodbye", etc.
ExplAck | Explicit statement of acknowledgement, showing the speaker's understanding of the other speaker (OS)
Unsolicited-Affirm | Explicit statement of acknowledgement, showing that the OS understands the speaker
HoldFloor | Explicit request for the OS to wait
ReqRepeat | Request for the OS to repeat their last turn
ReqAck | Request for the OS to show understanding
RspAffirm | Affirmative response to ReqAck
RspNegate | Negative response to ReqAck
StateInterp | A statement of the intention of the OS
DisAck | Show of lack of understanding of the OS
RejOther | Display of lack of understanding of the speaker's intention or desire by the OS
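A minimal sketch of how an annotated turn could be represented: each turn is a sequence of tags drawn from the inventory above. The tag names are copied from the table; the data structure and the example turn are illustrative, not the corpus format.

```python
# Illustrative representation of a turn as a sequence of annotation tags.
from enum import Enum

class Tag(Enum):
    REQUEST = "Request";  INFORM = "Inform";  GREET_FAREWELL = "Greet-Farewell"
    EXPL_ACK = "ExplAck";  UNSOLICITED_AFFIRM = "Unsolicited-Affirm"
    HOLD_FLOOR = "HoldFloor";  REQ_REPEAT = "ReqRepeat";  REQ_ACK = "ReqAck"
    RSP_AFFIRM = "RspAffirm";  RSP_NEGATE = "RspNegate"
    STATE_INTERP = "StateInterp";  DIS_ACK = "DisAck";  REJ_OTHER = "RejOther"

# e.g. a wizard turn "ok -- the castle is on the main square -- anything else?"
turn = [Tag.EXPL_ACK, Tag.INFORM, Tag.REQUEST]
print([t.value for t in turn])
```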
17. Annotation (understanding)
- Each wizard turn was labeled to indicate whether the wizard understood the previous user turn

Label | Wizard's understanding of the previous user turn
Full | All intentions understood correctly.
Partial | Some intentions understood, none misunderstood.
Non | Wizard made no guess at user intention.
Flagged-Mis | The wizard formed an incorrect hypothesis of the user's meaning, and signalled a dialog problem.
Un-Flagged-Mis | The wizard formed an incorrect hypothesis of the user's meaning, accepted it as correct, and continued with the dialog.
18. Corpus summary

WER target | Wizards | Users | Tasks | Completed in time limit (%) | Per-turn WER (%) | Per-dialog WER (%)
None | 2 | 6 | 24 | 83 | 0 | 0
Low | 4 | 12 | 48 | 83 | 32 | 28
Med | 4 | 12 | 48 | 77 | 46 | 41
Hi | 2 | 6 | 24 | 42 | 63 | 60
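For reference, the two WER columns can be computed as in the sketch below. Whether the corpus uses exactly these aggregations is an assumption: per-turn WER here averages each turn's error rate, while per-dialog WER pools the edit operations over the whole dialog.

```python
# Assumed definitions of the per-turn and per-dialog WER summaries.
def edits(ref, hyp):
    """Levenshtein distance between two word lists."""
    d = [[i + j if 0 in (i, j) else 0 for j in range(len(hyp) + 1)]
         for i in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            d[i][j] = min(d[i-1][j] + 1,                          # deletion
                          d[i][j-1] + 1,                          # insertion
                          d[i-1][j-1] + (ref[i-1] != hyp[j-1]))   # substitution
    return d[-1][-1]

def per_turn_wer(turns):    # turns: list of (reference_words, hypothesis_words)
    return sum(edits(r, h) / len(r) for r, h in turns) / len(turns)

def per_dialog_wer(turns):
    return sum(edits(r, h) for r, h in turns) / sum(len(r) for r, _ in turns)

turns = [("i need a hotel".split(), "i need the hotel".split()),
         ("near the main square".split(), "near main square".split())]
print(round(per_turn_wer(turns), 2), round(per_dialog_wer(turns), 2))
```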
19. Perception of ASR accuracy
- How accurately do users and wizards perceive WER?
- Perceptions of recognition quality broadly
reflected actual performance, but users
consistently gave higher quality scores than
wizards for the same WER
20. Average turn length (words)
- How does WER affect wizard and user turn length?
- Wizard turn length increases
- User turn length stays relatively constant
21. Grounding behavior
- How does WER affect wizard grounding behavior?
- As WER increases, wizard grounding behaviors
become increasingly prevalent
22. Wizard understanding
- How does WER affect the wizard's understanding status?
- Misunderstanding increases with WER
- ...and task completion falls (83%, 83%, 77%, 42%)
23. Wizard strategies
- Classify each wizard turn into one of 5 strategies (a mapping sketch follows the table)

Label | Meaning | Wizard initiative? | Tags
REPAIR | Attempt to repair | Yes | ReqAck, ReqRepeat, StateInterp, DisAck, RejOther
ASKQ | Ask task question | Yes | Request
GIVEINFO | Provide task info | No | Inform
RSPND | Non-initiative-taking grounding actions | No | ExplAck, Rsp-Affirm, Rsp-Negate, Unsolicited-Affirm
OTHER | Not included in analysis | n/a | All others
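A minimal sketch of the tag-to-strategy mapping in the table above. The groupings come from the table; applying them in the priority order REPAIR, ASKQ, GIVEINFO, RSPND when a turn carries several tags is an assumption made for illustration.

```python
# Illustrative mapping from a turn's tag sequence to one of the five strategies.
REPAIR_TAGS = {"ReqAck", "ReqRepeat", "StateInterp", "DisAck", "RejOther"}
RSPND_TAGS  = {"ExplAck", "Rsp-Affirm", "Rsp-Negate", "Unsolicited-Affirm"}

def classify_strategy(turn_tags):
    tags = set(turn_tags)
    if tags & REPAIR_TAGS:
        return "REPAIR"
    if "Request" in tags:
        return "ASKQ"
    if "Inform" in tags:
        return "GIVEINFO"
    if tags & RSPND_TAGS:
        return "RSPND"
    return "OTHER"

print(classify_strategy(["ExplAck", "Inform", "Request"]))   # ASKQ under this ordering
```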
24. Wizard strategies
- What is the most successful strategy after known dialog trouble?
- This plot shows the wizard's understanding status one turn after known dialog trouble: the effect of REPAIR vs. ASKQ
- "S" indicates significant differences
25. User reactions to misunderstandings
- How does a user respond after being misunderstood?
- Surprisingly little explicit indication!

User turns including each tag (%):
WER target | DisAck | RejectOther | Request
None | N/A | N/A | N/A
Low | 0.0 | 3.8 | 92.3
Med | 2.5 | 19.0 | 75.9
Hi | 0.0 | 12.3 | 87.0
26. Level of wizard initiative
- How does initiative vary with WER?
- Define wizard initiative using the strategies above
27. Reward measures / PARADISE
- Satisfaction ≈ Task completion + Dialog cost metrics (PARADISE)
- 2 kinds of user satisfaction:
  - Single
  - Combi
- 3 kinds of task completion:
  - User
  - Obj
  - Hyb
- Cost metrics:
  - PerDialogWER
  - UnFlaggedMis
  - FlaggedMis
  - Non
  - Turns
  - REPAIR
  - ASKQ
28. Reward measures / PARADISE
- In almost all experiments using the User task completion metric, it was the only significant predictor
- The Single/Combi satisfaction metrics almost always selected the same predictors (a regression sketch follows the table)

Dataset | Metric (task completion, user sat.) | R² (%) | Significant predictors
ALL | User-S | 52 | 1.03 Task
ALL | User-C | 60 | 5.29 Task, 1.54 UnFlagMis
ALL | Obj-S | 24 | -0.49 Turns, 0.38 Task
ALL | Obj-C | 27 | -2.43 Turns, 1.45 UnFlagMis, 1.35 Task
ALL | Hyb-S | 41 | 0.74 Task, 0.36 Turns
Hi | Obj-S | 40 | 0.98 Task
Hi | Hyb-S | 48 | 1.07 Task
Med | Obj-S | 16 | -0.62 Non
Med | Obj-C | 37 | -3.35 Non, 2.94 Turns
Med | Hyb-S | 38 | 0.97 Task
Low | Obj-S | 28 | -0.59 Turns
Low | Hyb-S | 40 | -0.49 Turns, 0.40 Task
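A minimal sketch of the PARADISE-style analysis behind the table: regress user satisfaction on normalized task-completion and cost metrics and read off the coefficients. The per-dialog numbers below are fabricated purely to make the example run; they are not from the corpus.

```python
# Illustrative PARADISE-style linear regression of user satisfaction.
import numpy as np

def zscore(x):
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

task  = zscore([1, 1, 0, 1, 0, 1])                # task completion
turns = zscore([20, 35, 50, 25, 60, 30])          # dialog length (turns)
unflg = zscore([0, 2, 5, 1, 6, 1])                # unflagged misunderstandings
sat   = np.array([6.5, 5.0, 2.5, 6.0, 2.0, 5.5])  # user satisfaction (Likert)

X = np.column_stack([np.ones(len(sat)), task, turns, unflg])
coef, *_ = np.linalg.lstsq(X, sat, rcond=None)    # ordinary least squares fit
print(dict(zip(["intercept", "Task", "Turns", "UnFlagMis"], np.round(coef, 2))))
```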
29. Reward measures / PARADISE
- What indicators best predict user satisfaction?
- When run on all data, mixtures of Task, Turns, and UnFlaggedMis best predict user satisfaction
- UnFlaggedMis serves as a better measurement of understanding accuracy than WER alone, since it effectively combines recognition accuracy with a measure of confidence
- Broadly speaking:
  - Task completion is most important at the High WER level
  - Task completion and dialog quality are most important at the Med WER level
  - Efficiency is most important at the Low WER level
- These patterns mirror findings from other PARADISE experiments using Human/Computer data
- This gives us some confidence that this data set is valid for training Human/Computer systems
30. Conclusions / Next steps
- At moderate WER levels, asking task-related questions appears to be more successful than direct dialog repair
- Levels of expert initiative increase with WER, primarily as a result of grounding behavior
- Users infrequently give a direct indication of having been misunderstood, with no clear correlation to WER
- When run on all data, mixtures of Task, Turns, and UnFlaggedMis best predict user satisfaction
- Task completion appears to be most predictive of user satisfaction; however, efficiency shows some influence at lower WERs
- Next: apply this corpus to statistical systems
31. Thanks!
Jason D. Williams jdw30_at_cam.ac.uk