1
Characterizing Task-Oriented Dialog using a
Simulated ASR Channel
  • Jason D. Williams
  • Machine Intelligence Laboratory
  • Cambridge University Engineering Department

2
SACTI-1 Corpus: Simulated ASR-Channel Tourist
Information
  • Motivation for the data collection
  • Experimental set-up
  • Transcription / Annotation
  • Effects of ASR error rate on
  • Turn length / dialog length
  • Perception of error rate
  • Task completion
  • Initiative
  • Overall satisfaction (PARADISE)

3
ASR channel vs. HH channel
Properties
  • HH dialog
  • Instant communication
  • Effectively perfect recognition of words
  • Prosodic information carries additional
    information
  • ASR channel
  • Turns explicitly segmented
  • Barge-in, end-pointed
  • Prosody virtually eliminated
  • ASR parsing errors

Observations
  • HH dialog
  • Frequent but brief overlaps
  • 80% of utterances contain fewer than 12 words;
    50% contain fewer than 5
  • Approximately equal turn length
  • Approximately equal balance of initiative
  • About half of turns are ACK (often spliced)
  • ASR channel
  • Few overlaps
  • Longer system turns, shorter user turns
  • Initiative more often with the system
  • Virtually no turns are ACK
  • Virtually no splicing

Are models of HC dialog/grounding appropriate in
the presence of the ASR channel?
4
My approach
  • Study the ASR channel in the abstract
  • WoZ experiments using a simulated ASR channel
  • Understand how people behave with an ideal
    dialog manager
  • For example, a grounding model
  • Use these insights to inform state space and
    action set selection
  • Note that the collected data has unique
    properties useful for
  • RL-based systems
  • Hidden-state estimation
  • User modeling
  • Formulate the dialog management problem as a
    POMDP
  • Decompose the state into BN nodes, for example
    (see the sketch below)
  • Conversation state (grounding state)
  • User action
  • User belief (goal)
  • Train using the collected data
  • Solve using approximations
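A minimal sketch of the factored belief-state idea described above. The node names, values, prior, and observation model are illustrative assumptions, not the actual model trained on this corpus.

```python
# Sketch of a factored POMDP belief state for dialog management.
# Node values and the toy observation model are illustrative assumptions.

GOALS = ["hotel", "restaurant", "bar"]        # user goal/belief node
USER_ACTS = ["inform", "request", "ack"]      # user action node
GROUNDING = ["grounded", "ungrounded"]        # conversation/grounding-state node

def uniform(values):
    return {v: 1.0 / len(values) for v in values}

# One distribution per BN node (the factored belief).
belief = {"goal": uniform(GOALS),
          "user_act": uniform(USER_ACTS),
          "grounding": uniform(GROUNDING)}

def obs_likelihood(obs_words, user_act, goal):
    """Toy observation model: P(noisy ASR words | user act, goal)."""
    score = 0.1
    if goal in obs_words:
        score += 0.6
    if user_act == "request" and "where" in obs_words:
        score += 0.3
    return score

def update_goal(belief, obs_words):
    """Bayesian update of the goal node from one ASR observation
    (the other nodes are left untouched here for brevity)."""
    new = {g: belief["goal"][g] *
              sum(obs_likelihood(obs_words, a, g) * belief["user_act"][a]
                  for a in USER_ACTS)
           for g in GOALS}
    z = sum(new.values())
    belief["goal"] = {g: p / z for g, p in new.items()}
    return belief

belief = update_goal(belief, ["where", "is", "the", "hotel"])
print(belief["goal"])    # probability mass shifts toward "hotel"
```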

5
The paradox of dialog data
  • To build a user model, we need to see users'
    reactions to all kinds of misunderstandings
  • However, most systems use a fixed policy
  • Systems typically do not take different actions
    in the same situation
  • Taking random actions is clearly not an option!
  • Constraining actions means building very complex
    systems
  • ... and which actions should be in the system's
    repertoire?

6
An ideal data collection
  • would show users' reactions to a variety of
    error handling strategies (no fixed policy)
  • BUT would not be nonsense dialogs!
  • would use the ASR channel
  • would explore a variety of operating conditions,
    e.g., WER
  • would not assume a particular state space
  • ... would somehow discover the set of system
    actions

7
Data collection set-up
8
ASR simulation state machine
  • Simple energy-based barge-in
  • User interrupts the wizard (see the sketch after
    the state diagram below)

[State machine figure: states SILENCE, USER_TALKING, TYPIST_TYPING, and WIZARD_TALKING; transition labels: user starts talking, user stops talking, typist done (reco result displayed), wizard starts talking, wizard stops talking, and user starts talking while the wizard is talking (barge-in)]
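A minimal sketch of this turn-taking machine, assuming a simple energy threshold for detecting user speech. The exact transition topology and the threshold value are illustrative assumptions, not the precise SACTI implementation.

```python
# Sketch of the simulated-ASR turn-taking state machine with simple
# energy-based barge-in.  Topology and threshold are assumptions.
from enum import Enum, auto

class State(Enum):
    SILENCE = auto()
    USER_TALKING = auto()
    TYPIST_TYPING = auto()
    WIZARD_TALKING = auto()

BARGE_IN_ENERGY = 0.5   # assumed energy threshold for detecting user speech

def step(state, user_energy=0.0, user_stopped=False,
         typist_done=False, wizard_started=False, wizard_stopped=False):
    """Advance the turn-taking machine by one event."""
    user_started = user_energy > BARGE_IN_ENERGY
    if state is State.SILENCE:
        if user_started:
            return State.USER_TALKING
        if wizard_started:
            return State.WIZARD_TALKING
    elif state is State.USER_TALKING and user_stopped:
        return State.TYPIST_TYPING            # typist produces the "reco" result
    elif state is State.TYPIST_TYPING and typist_done:
        return State.WIZARD_TALKING           # reco result displayed to the wizard
    elif state is State.WIZARD_TALKING:
        if user_started:
            return State.USER_TALKING         # energy-based barge-in: user interrupts wizard
        if wizard_stopped:
            return State.SILENCE
    return state

# Example: the user barges in while the wizard is talking.
print(step(State.WIZARD_TALKING, user_energy=0.8))   # State.USER_TALKING
```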
9
ASR simulation
  • Simplified FSM-based recognizer
  • Weighted finite state transducer (WFST)
  • Flow (see the simplified sketch below)
  • Reference input
  • Spell-checked against the full dictionary
  • Converted to a phonetic string using the full
    dictionary
  • Phonetic lattice generated based on a confusion
    model
  • Word lattice produced
  • Language model composed to re-score the lattice
  • Decoded to produce word strings
  • N-best list extracted
  • Various free variables to induce random behavior
  • Plumb the N-best list for variability
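The sketch below walks through the same flow end to end with toy data structures: a tiny pronunciation dictionary, a hand-written phone confusion table, and greedy decoding. The real simulator used weighted finite-state transducers and language-model re-scoring over full lattices; everything here is a simplified stand-in.

```python
# Highly simplified sketch of the confusion-based ASR simulation flow:
# reference words -> phones -> noisy phones (confusion model) -> decoded words.
# Dictionary, confusion probabilities, and decoding are toy assumptions.
import random

DICT = {"hotel": ["hh", "ow", "t", "eh", "l"],
        "motel": ["m", "ow", "t", "eh", "l"],
        "bar":   ["b", "aa", "r"],
        "car":   ["k", "aa", "r"]}

CONFUSIONS = {"hh": ["m"], "b": ["k"], "k": ["b"]}   # assumed phone confusions
P_CONFUSE = 0.3                                      # free variable controlling error rate

def to_phones(words):
    return [p for w in words for p in DICT[w]]

def corrupt(phones):
    """Apply the confusion model to the phone string."""
    out = []
    for p in phones:
        if p in CONFUSIONS and random.random() < P_CONFUSE:
            out.append(random.choice(CONFUSIONS[p]))
        else:
            out.append(p)
    return out

def decode(phones):
    """Greedy decode: match dictionary pronunciations against the phone string."""
    words, i = [], 0
    while i < len(phones):
        for w, pron in DICT.items():
            if phones[i:i + len(pron)] == pron:
                words.append(w)
                i += len(pron)
                break
        else:
            i += 1          # skip an unmatchable phone
    return words

random.seed(0)
ref = ["hotel", "bar"]
print(decode(corrupt(to_phones(ref))))   # decoded hypothesis; may contain confusions such as hotel -> motel
```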

10
ASR simulation evaluation
  • Hypothesis: the simulation produces errors
    similar to errors induced by additive noise
    w.r.t. concept accuracy
  • Assess concept accuracy as an F-measure using an
    automated, data-driven procedure (HVS model); see
    the snippet below
  • Plot concept accuracy of
  • Real additive noise
  • A naïve confusion model (simple insertion,
    substitution, deletion)
  • WFST confusion model
  • The WFST appears to follow the real data much
    more closely than the naïve model
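For reference, an F-measure over extracted concepts can be computed as below. The slot/value representation is an assumption for illustration only; the paper derives concepts automatically with an HVS model.

```python
# F-measure over extracted concepts (slot, value pairs).
# The concept representation here is an illustrative assumption.
def concept_f_measure(reference, hypothesis):
    ref, hyp = set(reference), set(hypothesis)
    correct = len(ref & hyp)
    precision = correct / len(hyp) if hyp else 0.0
    recall = correct / len(ref) if ref else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(concept_f_measure({("type", "hotel"), ("area", "centre")},
                        {("type", "hotel"), ("area", "north")}))  # 0.5
```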

11
Scenario Tasks
  • Tourist / Tourist information scenario
  • Intentionally goal-directed
  • Intentionally simple tasks
  • Mixtures of simple information gathering and
    basic planning
  • Wizard (Information giver)
  • Access to bus times, tram times, restaurants,
    hotels, bars, tourist attraction information,
    etc.
  • User given a series of tasks
  • Likert scores collected at the end of each task
  • 4 dialogs/user, 3 users/wizard
  • Example task: finding the perfect hotel
  • You're looking for a hotel for you and your
    travelling partner that meets a number of
    requirements.
  • You'd like the following:
  • En suite rooms
  • Quiet rooms
  • As close to the main square as possible
  • Given those desires, find the least expensive
    hotel. You'd prefer not to compromise on your
    requirements, but of course you will if you must!
  • Please indicate the location of the hotel on the
    map and fill in the boxes below.

Name of accommodation

Cost per night for 2 people

12
User's map
13
Wizard's map
14
Likert-scale questions
  • User/wizard each given 6 questions after each
    task
  • Subject (user) examples
  • In this task, I accomplished the goal.
  • In this task, I thought the speech recognition
    was accurate.
  • In this task, I found it difficult to communicate
    because of the speech recognition.
  • In this task, I believe the other subject was
    very helpful.
  • In this task, the other subject found using the
    speech recognition difficult.
  • Overall, I was very satisfied with this past task.

Disagree strongly (1) Disagree (2) Disagree somewhat (3) Neither agree nor disagree (4) Agree somewhat (5) Agree (6) Strongly agree (7)
15
Transcription
  • User-side transcribed during experiments
  • Prioritized for speed
  • "I NEED UH IM LOOKING FOR A PIZZA"
  • Wizard-side transcribed using a subset of the
    LDC transcription guidelines (more detail)
  • "ok uh (()) sure you - i can"
  • epErrorEndtrue (example markup tag)

16
Annotation (acts)
  • Each turn is a sequence of tags
  • Inspired by Traum's Grounding Acts
  • More detailed / easier to infer from surface
    words

Tag: Meaning
Request: Question/request requiring a response
Inform: Statement/provision of task information
Greet-Farewell: Hello, How can I help, that's all, Thanks, Goodbye, etc.
ExplAck: Explicit statement of acknowledgement, showing speaker understanding of OS
Unsolicited-Affirm: Explicit statement of acknowledgement, showing OS understands speaker
HoldFloor: Explicit request for OS to wait
ReqRepeat: Request for OS to repeat their last turn
ReqAck: Request for OS to show understanding
RspAffirm: Affirmative response to ReqAck
RspNegate: Negative response to ReqAck
StateInterp: A statement of intention of OS
DisAck: Show of lack of understanding of OS
RejOther: Display of lack of understanding of speaker's intention or desire by OS
17
Annotation (understanding)
  • Each wizard turn was labeled to indicate whether
    the wizard understood the previous user turn

Label: Wizard's understanding of the previous user turn
Full: All intentions understood correctly.
Partial: Some intentions understood; none misunderstood.
Non: Wizard made no guess at user intention.
Flagged-Mis: The wizard formed an incorrect hypothesis of the user's meaning, and signalled a dialog problem.
Un-Flagged-Mis: The wizard formed an incorrect hypothesis of the user's meaning, accepted it as correct, and continued with the dialog.
18
Corpus summary
WER target | Wizards | Users | Tasks | Completed in time limit (%) | Per-turn WER (%) | Per-dialog WER (%)
None | 2 | 6 | 24 | 83 | 0 | 0
Low | 4 | 12 | 48 | 83 | 32 | 28
Med | 4 | 12 | 48 | 77 | 46 | 41
Hi | 2 | 6 | 24 | 42 | 63 | 60
19
Perception of ASR accuracy
  • How accurately do users and wizards perceive WER?
  • Perceptions of recognition quality broadly
    reflected actual performance, but users
    consistently gave higher quality scores than
    wizards for the same WER

20
Average turn length (words)
  • How does WER affect wizard and user turn length?
  • Wizard turn length increases
  • User turn length stays relatively constant

21
Grounding behavior
  • How does WER affect wizard grounding behavior?
  • As WER increases, wizard grounding behaviors
    become increasingly prevalent

22
Wizard understanding
  • How does WER affect wizard understanding status?
  • Misunderstanding increases with WER
  • and task completion falls (83%, 83%, 77%, 42%)

23
Wizard strategies
  • Classify each wizard turn into one of 5
    strategies (a classification sketch follows the
    table)

Label | Meaning | Wizard initiative? | Tags
REPAIR | Attempt to repair | Yes | ReqAck, ReqRepeat, StateInterp, DisAck, RejOther
ASKQ | Ask task question | Yes | Request
GIVEINFO | Provide task info | No | Inform
RSPND | Non-initiative-taking grounding actions | No | ExplAck, RspAffirm, RspNegate, Unsolicited-Affirm
OTHER | Not included in analysis | n/a | All others
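A classification sketch for mapping a wizard turn's tag sequence onto these five strategy labels. The precedence among tags within a single turn is an assumption, not something specified on the slide.

```python
# Map a wizard turn (a sequence of act tags) to one of the five strategy
# labels.  The precedence order among tags within a turn is an assumption.
REPAIR_TAGS = {"ReqAck", "ReqRepeat", "StateInterp", "DisAck", "RejOther"}
RSPND_TAGS = {"ExplAck", "RspAffirm", "RspNegate", "Unsolicited-Affirm"}

def classify_strategy(tags):
    tagset = set(tags)
    if tagset & REPAIR_TAGS:
        return "REPAIR"
    if "Request" in tagset:
        return "ASKQ"
    if "Inform" in tagset:
        return "GIVEINFO"
    if tagset & RSPND_TAGS:
        return "RSPND"
    return "OTHER"

print(classify_strategy(["ExplAck", "Request"]))   # 'ASKQ' under this precedence
```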
24
Wizard strategies
  • What is the most successful strategy after
    known dialog trouble?
  • This plot shows wizard understanding status one
    turn after known dialog trouble: the effect of
    REPAIR vs. ASKQ
  • "S" indicates significant differences

25
User reactions to misunderstandings
  • How does a user respond after being
    misunderstood?
  • Surprisingly little explicit indication!

User turns including tag (%)
WER target | DisAck | RejOther | Request
None | N/A | N/A | N/A
Low | 0.0 | 3.8 | 92.3
Med | 2.5 | 19.0 | 75.9
Hi | 0.0 | 12.3 | 87.0
26
Level of wizard initiative
  • How does initiative vary with WER?
  • Define wizard initiative using the strategies above

27
Reward measures/PARADISE
  • Satisfaction modeled from task completion and
    dialog cost metrics (a regression sketch follows
    this list)
  • 2 kinds of user satisfaction
  • Single
  • Combi
  • 3 kinds of task completion
  • User
  • Obj
  • Hyb
  • Cost metrics
  • PerDialogWER
  • UnFlaggedMis
  • FlaggedMis
  • Non
  • Turns
  • REPAIR
  • ASKQ
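A minimal sketch of a PARADISE-style fit: user satisfaction regressed on z-normalized task completion and cost metrics. The data values and the particular metrics chosen below are placeholders, not figures from the corpus.

```python
# PARADISE-style linear regression: predict user satisfaction from
# z-normalized task completion and cost metrics.  Data are placeholders.
import numpy as np

# columns: Task completion, Turns, UnFlaggedMis  (one row per dialog)
X = np.array([[1.0, 20, 0],
              [1.0, 35, 2],
              [0.0, 40, 5],
              [1.0, 25, 1],
              [0.0, 50, 6]], dtype=float)
satisfaction = np.array([6.0, 5.0, 3.0, 6.0, 2.0])

# z-normalize each predictor, as PARADISE prescribes
Xz = (X - X.mean(axis=0)) / X.std(axis=0)
Xz = np.column_stack([np.ones(len(Xz)), Xz])      # add intercept

coeffs, *_ = np.linalg.lstsq(Xz, satisfaction, rcond=None)
for name, w in zip(["intercept", "Task", "Turns", "UnFlaggedMis"], coeffs):
    print(f"{name:>12s}: {w:+.2f}")
```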

28
Reward measures/PARADISE
  • In almost all experiments using the User task
    completion metric, it was the only significant
    predictor
  • The single/combi metrics almost always selected
    the same predictors

Dataset | Metrics (task / user sat) | R² (%) | Significant predictors
ALL | User-S | 52 | 1.03 Task
ALL | User-C | 60 | 5.29 Task + 1.54 UnFlagMis
ALL | Obj-S | 24 | -0.49 Turns + 0.38 Task
ALL | Obj-C | 27 | -2.43 Turns + 1.45 UnFlagMis + 1.35 Task
ALL | Hyb-S | 41 | 0.74 Task + 0.36 Turns
Hi | Obj-S | 40 | 0.98 Task
Hi | Hyb-S | 48 | 1.07 Task
Med | Obj-S | 16 | -0.62 Non
Med | Obj-C | 37 | -3.35 Non + 2.94 Turns
Med | Hyb-S | 38 | 0.97 Task
Low | Obj-S | 28 | -0.59 Turns
Low | Hyb-S | 40 | -0.49 Turns + 0.40 Task
29
Reward measures/PARADISE
  • What indicators best predict user satisfaction?
  • When run on all data, mixtures of Task, Turns,
    UnFlaggedMis best predict user satisfaction.
  • UnFlaggedMis is serving as a better measurement
    of understanding accuracy than WER alone, since
    it effectively combines recognition accuracy with
    a measure of confidence.
  • Broadly speaking
  • Task completion is most important at the High WER
    level
  • Task completion and dialog quality are most
    important at the Med WER level
  • Efficiency is most important at the Low WER level
  • These patterns mirror findings from other
    PARADISE experiments using Human/Computer data
  • This gives us some confidence that this data set
    is valid for training Human/Computer systems

30
Conclusions/Next steps
  • At moderate WER levels, asking task-related
    questions appears to be more successful than
    direct dialog repair.
  • Levels of expert initiative increase with WER,
    primarily as a result of grounding behavior.
  • Users infrequently give a direct indication of
    having been misunderstood, with no clear
    correlation to WER.
  • When run on all data, mixtures of Task, Turns,
    UnFlaggedMis best predict user satisfaction.
  • Task completion appears to be most predictive of
    user satisfaction; however, efficiency shows some
    influence at lower WERs.
  • Next: apply this corpus to statistical systems.

31
Thanks!
Jason D. Williams (jdw30@cam.ac.uk)