Title: Language Technology II, Language-Based Interaction: Dialogue Design and Evaluation
Slide 1: Language Technology II, Language-Based Interaction: Dialogue Design and Evaluation
- Manfred Pinkal
- Course website: www.coli.uni-saarland.de/courses/late2
Slide 2: Outline
- The Software Development Cycle
- Dialogue Design
- Wizard-of-Oz Experiments
- Dialogue System Evaluation
Slide 3: The Software Development Cycle
- Requirements Analysis
- Design
- Implementation
- Testing and Evaluation
- Integration
- Maintenance
Slide 5: Outline
- The Software Development Cycle
- Dialogue Design
- Wizard-of-Oz Experiments
- Dialogue System Evaluation
Slide 6: Dialogue Design: Overall Aims
- Effectiveness (Task Success)
- Efficiency
- User Satisfaction
Slide 7: Dialogue Design: General Steps
The following slides are compiled from slides by Rolf Schwitters and Bernd Plannerer.
- 1. Make sure you understand what you are trying to achieve (use scenarios and build a conceptual model).
- 2. See if you can decompose the task into smaller meaningful subtasks.
- 3. Identify the information tokens you need for each task or subtask.
- 4. Decide how you will obtain this information from the user.
- 5. Sketch a dialogue model that captures this information (see the sketch after this list).
- 6. Test your dialogue model.
- 7. Revise the dialogue model and repeat Step 6.
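To make steps 3-5 concrete, here is a minimal sketch of a form-filling dialogue model in Python. It is an illustration under assumed slot names and prompts (a hypothetical banking task), not part of the original lecture material.

```python
# Minimal form-filling dialogue model: the "information tokens" of step 3
# become slots, and the dialogue loops until every slot is filled.
# Slot names and prompts are hypothetical.

SLOTS = {
    "service": "You can check a balance, transfer funds, or pay a bill. "
               "What would you like to do?",
    "account": "Which account should I use, checking or savings?",
}

def run_dialogue(get_input=input):
    """Ask for each unfilled slot in turn until the form is complete."""
    form = {slot: None for slot in SLOTS}
    while not all(form.values()):
        slot = next(s for s, v in form.items() if v is None)
        answer = get_input(SLOTS[slot] + " ").strip()
        if answer:  # a real system would parse and validate here
            form[slot] = answer
    return form

if __name__ == "__main__":
    print("Completed form:", run_dialogue())
```

Testing the model against scenarios (steps 6-7) then amounts to running it with scripted user inputs passed in via `get_input` instead of the interactive `input`.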
Slide 8: Dialogue Design: Principal Decisions
- Specification of target group and supported languages
  - Frequency of usage
  - Regional / national
  - Monolingual / multilingual / foreign-language speakers
  - Age
- Environment
  - Quiet environment: home, office
  - Noisy environment: car, outdoors, noisy working environments
- Choice of persona and voice
- Dialogue structure
Slide 9: Dialogue Design: Practical Tips
- Guide the user towards responses that maximize clarity and avoid ambiguity.
- Allow for the user not knowing
  - the active vocabulary,
  - the answer to a question, or
  - not understanding a question.
- Guide users toward natural, in-vocabulary responses:
  - Version 1: "Welcome to ABC Bank. How can I help you?"
  - Version 2: "Welcome to ABC Bank. What would you like to do?"
  - Version 3: "Welcome to ABC Bank. You can check an account balance, transfer funds, or pay a bill. What would you like to do?"
Slide 10: More Practical Tips
- Do not give too many options at once (maximum 5).
- Keep prompts brief to encourage the user to be brief.
- Supply confirmation messages frequently, especially when the cost or likelihood of a recognition error is high.
- Prefer implicit over explicit grounding.
- Use recognizer confidence values to avoid unnecessary grounding steps (see the sketch after this list).
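As an illustration of the last two tips, here is a hedged Python sketch of confidence-based grounding. The thresholds and the example domain are assumptions for illustration, not values from the lecture.

```python
# Choose a grounding strategy from the recognizer's confidence score.
# Thresholds are illustrative; real systems tune them on data.

HIGH_CONFIDENCE = 0.85   # accept silently, no grounding step
LOW_CONFIDENCE = 0.50    # below this, confirm explicitly

def grounding_move(hypothesis: str, confidence: float) -> str:
    """Return the system's next utterance for one recognized user input."""
    if confidence >= HIGH_CONFIDENCE:
        return f"Transferring funds to {hypothesis}."
    if confidence >= LOW_CONFIDENCE:
        # implicit grounding: embed the hypothesis in the next prompt
        return f"To {hypothesis}. And what amount?"
    # explicit grounding: reserved for low confidence, since it costs a turn
    return f"Did you say {hypothesis}?"

print(grounding_move("my savings account", 0.62))
```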
Slide 11: More Practical Tips
- Assume a frequent user will have a rapid learning curve. Allow shortcuts:
  - Switch to expert mode / command level.
  - Combine different steps into one.
  - Barge-in.
- Assume errors are the fault of the recognizer, not the user.
- Allow the user to access (context-sensitive) help in any state.
- Provide escape commands.
- Design graceful recovery for when the recognizer makes an error.
Slide 12: Outline
- The Software Development Cycle
- Dialogue Design
- Wizard-of-Oz Experiments
- Dialogue System Evaluation
Slide 13: Dialogue Design: General Steps
- 1. Make sure you understand what you are trying to achieve (use scenarios and build a conceptual model).
- 2. See if you can decompose the task into smaller meaningful subtasks.
- 3. Identify the information tokens you need for each task or subtask.
- 4. Decide how you will obtain this information from the user.
- 5. Sketch a dialogue model that captures this information.
- 6. Test your dialogue model.
- 7. Revise the dialogue model and repeat Step 6.
Slide 14: Wizard-of-Oz Experiments
- Central parts of the system are simulated by a human "wizard".
- Experimental WoZ systems make it possible to test a dialogue system (to some extent) before it has been (fully) implemented, thus uncovering basic problems of the dialogue model.
- They also allow the collection, at an early stage, of
  - data about the dialogue behavior of subjects,
  - the syntax and lexicon used (to hand-code language models), and
  - speech data (to train statistical language models).
Slide 15: Wizard-of-Oz Experiments
- The WoZ is not just a person in a box. The WoZ system must
  - perform as poorly as a computer: "artificial" speech output by typing and a TTS system; simulation of shortcomings in recognition (the wizard sees typed input, hence no prosody), maybe even with simulated recognition failure, e.g., by randomly overwriting words in the typed input (see the sketch after this list);
  - perform as efficiently as a computer: support for quick database access and complex real-time decisions, e.g., by displaying a dialogue-flow diagram, marking the current state, and offering menus with contextually appropriate dialogue moves and system prompts;
  - impose constraints on the options of the wizard (to support the impression of artificiality), and allow those constraints to be varied (to test different dialogue strategies);
  - log all kinds of data in an appropriate and easily accessible form.
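One possible realization of the simulated recognition failure mentioned above, sketched in Python. The error rate and the distractor vocabulary are invented for illustration.

```python
import random

# Simulate recognition errors by randomly overwriting words in the
# typed input before it reaches the wizard. Rate and vocabulary are
# illustrative assumptions.

DISTRACTORS = ["play", "pause", "album", "song", "list", "next"]

def corrupt(utterance, error_rate=0.15, rng=random.Random(42)):
    """Replace each word by a random distractor with probability error_rate."""
    return " ".join(
        rng.choice(DISTRACTORS) if rng.random() < error_rate else word
        for word in utterance.split()
    )

print(corrupt("please play the song believe from the selected album"))
```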
Slide 16: Wizard-of-Oz Experiments
- Ideally, a WoZ system is set up in a modular way, allowing the functions contributed by humans to be replaced one by one in the course of system implementation.
- This permits a gradual transition between the WoZ and the fully artificial system.
- An example: the DiaMant tool, run in WoZ mode.
Slide 17: Motivations for WoZ Experiments
- The original motivation:
  - Early testing, avoiding time-consuming and expensive programming.
  - Studying dialogues while disregarding the bottleneck of unreliable speech recognition.
- Changing conditions:
  - Configuration and design of dialogue systems are becoming comfortable, and recognizers are becoming fairly reliable. Are WoZ experiments still necessary?
  - Dialogue interaction is becoming increasingly flexible, adaptive, and complex. Are WoZ experiments feasible?
- A shift in motivation:
  - from exploration of the user's behavior, given constrained and schematic system behavior,
  - to exploration of alternative behaviors of the wizard, who is given a range of freedom in his/her reactions.
Slide 18: An Example
- A WoZ study in the TALK project, spring 2005
- MP3 player
- Multimodal dialogue, language: German
- In-car / in-home scenario
- Saarland University, DFKI, CLT
Slide 19: Tasks for the Subjects
- MP3 domain:
  - in-car, with the Lane Change Task (LCT) as primary task
  - in-home, without the LCT
- Tasks for the subject:
  - Play a song from the album "New Adventures in Hi-Fi" by REM.
  - Find a song with "believe" in the title and play it.
- Task for the wizard:
  - Help the user reach their goals. (Deliberately vague!)
Slide 20: Goals of the WoZ MP3 Experiment
- Gather pilot data on human multimodal turn planning
- Collect wizard dialogue strategies
- Collect wizard media allocation decisions
- Collect wizard speech data
- Collect user data (speech signals and spontaneous speech)
Slide 21: User View
- Primary task: driving
- Secondary task, on a second screen: MP3 player
Slide 22: Video Recording
Slide 23: DFKI/USAAR WoZ System
- System features:
  - 14 components, communicating via OAA, distributed over 5 machines (3 Windows, 2 Linux)
  - plus the LCT on a separate machine
- People involved in running an experiment: 5
  - 1 experiment leader
  - 1 wizard
  - 1 subject
  - 2 typists
Slide 24: Data Flow
[Diagram: data flow between the wizard, the subject, and the two typists]
Slide 25: A Walk Through the Final Turns
- Wizard: "Ich zeige Ihnen die Liste an." (I am displaying the list.)
- User: "Ok. Zeige mir bitte das Lied aus dem ausgewählten Album und spiel das vor." (Ok. Please show me that song (Believe) from the selected album and play it.)
Slide 26: A Walk Through the Final Turns
- Wizard's actions:
  - Database search
  - Select album presentation (vs. songs or artists)
  - Select list presentation (vs. tables or textual summary)
  - Utterance: "Ich zeige Ihnen die Liste an." (I am displaying the list.)
    - The audio is sent to a typist.
    - The text is sent to speech synthesis.
- User: "Ok. Zeige mir bitte das Lied aus dem ausgewählten Album und spiel das vor." (Ok. Please show me that song (Believe) from the selected album and play it.)
Slide 27: Example (1): Wizard
- Says "Ich zeige Ihnen die Liste an." (I am displaying the list.) and clicks on the list presentation.
Slides 28-29: [screenshots, no transcript]
Slide 30: Options Presenter with User Tab
Slide 31: Data Flow
[Diagram: data flow between the wizard, the subject, and the two typists]
Slide 32: Example (2): Wizard Typist
- Types the wizard's spoken text: "I am displaying the list."
Slide 33: Data Flow
[Diagram: data flow between the wizard, the subject, and the two typists]
Slide 34: Example (3): User
- Listens to the wizard's text synthesized by Mary and receives the selected list presentation.
Slide 35: [screenshot, no transcript]
Slide 36: Example (4): User
- Selects one album and says: "Ok. Zeige mir bitte das Lied aus dem ausgewählten Album und spiel das vor." (Ok. Please show me that song (Believe) from the selected album and play it.)
Slide 37: Automatically updated wizard screen with check
Slide 38: Data Flow
[Diagram: data flow between the wizard, the subject, and the two typists]
Slide 39: Example (5): User Typist
- Types the user's spoken text: "Ok. Please show me that song (Believe) from the selected album and play it."
Slide 40: Data Flow
[Diagram: data flow between the wizard, the subject, and the two typists]
Slide 41: Example (6): Wizard
- Gets a correspondingly updated TextBox window.
Slide 42: The Current Experimental Setup
- Usability Lab, Building C7 4
Slide 43: GUI Development
[Screenshots: old vs. new GUI]
Slide 44: Outline
- The Software Development Cycle
- Dialogue Design
- Wizard-of-Oz Experiments
- Dialogue System Evaluation
Slide 45: Different Levels of Evaluation
- Technical evaluation
- Usability evaluation
- Customer evaluation
- According to L. Dybkjær, N. Bernsen, and W. Minker, "Overview of evaluation and usability", in W. Minker et al. (eds.), Spoken Multimodal Human-Computer Dialogue in Mobile Environments, Springer, 2005.
Slide 46: Different Levels of Evaluation
- Technical evaluation
  - Typically component evaluation (ASR, TTS, grammar, but also, e.g., system robustness)
  - Quantitative and objective, to some extent
- Usability evaluation
- Customer evaluation
Slide 47: Evaluation of ASR Systems
- Word error rate (WER; see the sketch after this list)
- Speed (real-time performance)
- Size of the lexicon
- Perplexity
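WER is the standard ASR metric: the number of word substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the reference length. A minimal Python implementation of this standard definition (not from the lecture):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i          # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j          # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(substitution, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("play the song believe", "play a song relieve"))  # -> 0.5
```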
Slide 48: Evaluation of TTS
- Intuitive evaluation by users with respect to
  - intelligibility,
  - pleasantness,
  - naturalness.
- There are no objective (though there are quantitative) criteria, but TTS quality is extremely important for user satisfaction.
Slide 49: Different Levels of Evaluation
- Technical evaluation
- Usability evaluation
  - Evaluation of user satisfaction
  - Typically end-to-end evaluation
  - Mostly subjective and qualitative measures
- Customer evaluation
Slide 50: Different Levels of Evaluation
- Technical evaluation
- Usability evaluation
- Customer evaluation, including aspects like
  - costs,
  - platform compatibility,
  - maintenance.
Slide 51: Usability Evaluation
- Mostly soft criteria
- "Usability guidelines", best-practice rules, form the basis of expert evaluation or user questionnaires.
Slide 52: Usability Guidelines
- From Dybkjær et al.:
  - Feedback adequacy: the user must feel confident that the system has understood the information input in the way it was intended.
  - Naturalness of the dialogue structure
  - Sufficiency of interaction guidance
  - Sufficiency of adaptation to user differences
  - ...
Slide 53: Usability Evaluation
- Mostly soft criteria
- "Usability guidelines", best-practice rules, form the basis of expert evaluation or user questionnaires.
- Hard, measurable criteria often contradict each other: systems with high task success may lack efficiency, and vice versa.
- Is it possible to evaluate usability in an objective, predictive, and general way?
- Is there one (maybe parametrized) measure for user satisfaction?
Slide 54: PARADISE
- An attempt to provide an objective, quantitative, operational basis for qualitative user assessments.
- M. Walker, D. Litman, C. Kamm, and A. Abella, "PARADISE: A framework for evaluating spoken dialogue agents", Proc. of ACL 1997.
Slide 55: PARADISE: The Idea
- The top criterion for usability evaluation is user satisfaction: an intuitive criterion which cannot be measured directly, but is only accessible through qualitative user judgments.
- User satisfaction is
  - correlated with task success (effectiveness), and
  - inversely correlated with the dialogue costs.
- There are features that can be easily and objectively extracted from dialogue logfiles which approximate both task success and dialogue costs.
Slide 56: PARADISE: The Idea
- Take a set of dialogues produced by the interaction of a dialogue system A with different subjects.
- Let the users assess their satisfaction with the dialogue.
- Calculate the task success, and read the different measures for dialogue costs off the logfiles.
- Compute the correlation between the satisfaction assessments and the quantitative measures, via multiple linear regression (see the sketch after this list).
- Results:
  - Prediction of user satisfaction for new individual dialogues with system A, or for dialogues with a modified system A'.
  - Comparison of different dialogue systems A and B with respect to user satisfaction.
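A hedged Python sketch of the regression step, using NumPy. All numbers are invented to show the mechanics; they are not real experimental data.

```python
import numpy as np

# One row per dialogue: [kappa, elapsed time (s), number of rejects].
X = np.array([[0.9, 120, 1],
              [0.5, 300, 4],
              [0.7, 200, 2],
              [0.8, 150, 1]], dtype=float)
y = np.array([4.5, 2.0, 3.5, 4.0])    # questionnaire satisfaction scores

# z-score normalization, as in PARADISE's N(.)
Xn = (X - X.mean(axis=0)) / X.std(axis=0)

# ordinary least squares with an intercept column
A = np.hstack([np.ones((len(y), 1)), Xn])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print("intercept and weights:", coef)
print("predicted satisfaction:", A @ coef)
```

The fitted weights are what make the approach predictive: applied to the logfile measures of a new dialogue (or a modified system), they yield an estimated satisfaction score without a new questionnaire.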
Slide 57: PARADISE: The Structure
[Diagram: maximize user satisfaction, decomposed into maximizing task success and minimizing costs; costs are measured by efficiency measures and qualitative measures]
Slide 58: Efficiency and Quality Measures
- Efficiency measures:
  - Elapsed time
  - Number of system turns
  - Number of user turns
- Quality measures (a log-counting sketch follows after this list):
  - Number of timeout prompts
  - Number of rejects
  - Number of helps
  - Number of cancels
  - Number of barge-ins
  - Mean ASR score
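A sketch of how such counts could be read off a dialogue logfile in Python. The one-event-per-token log format is an assumption made for illustration; real WoZ and system logs are far richer.

```python
from collections import Counter

# Toy logfile: each whitespace-separated token names one logged event.
LOG = """system_turn
user_turn
timeout
system_turn
user_turn reject
system_turn barge_in
user_turn""".splitlines()

counts = Counter()
for line in LOG:
    counts.update(line.split())   # count every event token

print("system turns:", counts["system_turn"])
print("user turns:", counts["user_turn"])
print("timeouts:", counts["timeout"],
      "| rejects:", counts["reject"],
      "| barge-ins:", counts["barge_in"])
```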
Slide 59: A Measure for Task Success
- Option 1: yes/no evaluation for the complete dialogue.
- Option 2, available for dialogue systems using the form-filling paradigm: let task success be determined by the fields in the form that are filled with correct values.
This and the following 3 slides will not be part of the exam.
Slide 60: Tasks as Attribute-Value Matrices
Slide 61: An Instance
Slide 62: A Measure for Task Success
- Identify task success with the kappa (κ) value for agreement between the actual and the intended values of the AVM (κ is usually employed for measuring inter-annotator agreement):

  κ = (P(A) - P(E)) / (1 - P(E))

- P(A) is the actual relative frequency of coincidence between values, P(E) the expected frequency (a worked computation follows below).
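A worked computation of κ for a small hypothetical AVM in Python. The attributes and values are invented, and P(E) is estimated here from the pooled value distribution, which is one common choice rather than the lecture's prescribed procedure.

```python
from collections import Counter

# Intended vs. actually filled attribute-value matrix (hypothetical).
intended = {"service": "transfer", "account": "savings", "amount": "100"}
actual   = {"service": "transfer", "account": "checking", "amount": "100"}

attrs = list(intended)
# P(A): relative frequency of agreement between actual and intended values
p_a = sum(intended[a] == actual[a] for a in attrs) / len(attrs)

# P(E): chance agreement, estimated from the pooled value distribution
pool = Counter(intended.values()) + Counter(actual.values())
total = sum(pool.values())
p_e = sum((n / total) ** 2 for n in pool.values())

kappa = (p_a - p_e) / (1 - p_e)
print(f"P(A)={p_a:.2f}  P(E)={p_e:.2f}  kappa={kappa:.2f}")
```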
Slide 63: PARADISE: The Structure
[Diagram: maximize user satisfaction, decomposed into maximizing task success and minimizing costs; costs are measured by efficiency measures and qualitative measures]
Slide 64: User Satisfaction
- Measured by adding up the scores assigned to 8 questions by the subjects.
Slide 65: A User Satisfaction Questionnaire
- Was the system easy to understand?
- Did the system understand what you said?
- Was it easy to find the information you wanted?
- Was the pace of interaction with the system appropriate?
- Did you know what you could say at each point in the dialogue?
Slide 66: A User Satisfaction Questionnaire
- How often was the system sluggish and slow to reply to you?
- Did the system work the way you expected it to?
- From your current experience with using the system, do you think you would use the system regularly?
Slide 67: A Hypothetical Example
This and the following slide will not be part of the exam.
Slide 68: The Performance Function

  Performance = α · N(κ) - Σi wi · N(ci)

- N is a normalization function based on the standard deviation (a z-score),
- N(κ) is the normalized task success,
- N(ci) are the normalized cost factors,
- α and the wi are weights on κ and the ci, respectively (a small numeric sketch follows below).
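A small numeric Python sketch of this function. The weights α and wi would in practice come from the regression of slide 56; all numbers here are invented for illustration.

```python
import numpy as np

kappas = np.array([0.9, 0.5, 0.7, 0.8])        # task success per dialogue
costs = np.array([[120, 1], [300, 4],          # [elapsed time, rejects]
                  [200, 2], [150, 1]], dtype=float)

def zscore(x):
    """N(.): normalization based on mean and standard deviation."""
    return (x - x.mean(axis=0)) / x.std(axis=0)

alpha = 0.5                      # weight on task success (illustrative)
w = np.array([0.3, 0.2])         # weights on the cost factors (illustrative)

performance = alpha * zscore(kappas) - zscore(costs) @ w
print(performance)               # one performance value per dialogue
```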
Slide 69: Comments on PARADISE
- The criterion for feature selection is the easy availability of features from logfiles. Are the selected features really the interesting ones?
- There is no strong theoretical foundation for the choice of questions in the user questionnaire.
- Does the methodology extend to more complex dialogue applications in real-world environments?
Slide 70: General Comments
- There is a trade-off between precision/objectivity and usefulness:
  - PARADISE: (more or less) precise and objective, but of limited practical use.
  - Evaluation guidelines: of some practical use, but not really objective.
- The most useful device is intuition, even if it is, at least in part, an artist's intuition: dialogue design is art as well as technology.