Title: Language Technology II, Language-Based Interaction: Dialogue Design and Evaluation
Slide 1: Language Technology II, Language-Based Interaction: Dialogue Design and Evaluation
- Manfred Pinkal
- Course website: www.coli.uni-saarland.de/courses/late2
Slide 2: Outline
- The Software Development Cycle
- Dialogue Design
- Wizard-of-Oz Experiments
- Dialogue System Evaluation
Slide 3: The Software Development Cycle
- Requirements Analysis
- Design
- Implementation
- Testing and Evaluation
- Integration
- Maintenance
Slide 5: Outline
- The Software Development Cycle
- Dialogue Design
- Wizard-of-Oz Experiments
- Dialogue System Evaluation
Slide 6: Dialogue Design: Overall Aims
- Effectiveness (Task Success)
- Efficiency
- User Satisfaction
Slide 7: Dialogue Design: General Steps
The following slides are compiled from slides by Rolf Schwitters and Bernd Plannerer.
- 1. Make sure you understand what you are trying to achieve (use scenarios and build a conceptual model).
- 2. See if you can decompose the task into smaller meaningful subtasks.
- 3. Identify the information tokens you need for each task or subtask.
- 4. Decide how you will obtain this information from the user.
- 5. Sketch a dialogue model that captures this information (see the sketch after this list).
- 6. Test your dialogue model.
- 7. Revise the dialogue model and repeat Step 6.
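To make steps 3-5 concrete, here is a minimal sketch of a form-filling dialogue model in Python. It is an illustration under assumed slot names and prompts (a hypothetical banking task), not part of the original lecture material.

```python
# Minimal form-filling dialogue model: the "information tokens" of step 3
# become slots, and the dialogue loops until every slot is filled.
# Slot names and prompts are hypothetical.

SLOTS = {
    "service": "You can check a balance, transfer funds, or pay a bill. "
               "What would you like to do?",
    "account": "Which account should I use, checking or savings?",
}

def run_dialogue(get_input=input):
    """Ask for each unfilled slot in turn until the form is complete."""
    form = {slot: None for slot in SLOTS}
    while not all(form.values()):
        slot = next(s for s, v in form.items() if v is None)
        answer = get_input(SLOTS[slot] + " ").strip()
        if answer:  # a real system would parse and validate here
            form[slot] = answer
    return form

if __name__ == "__main__":
    print("Completed form:", run_dialogue())
```

Testing the model against scenarios (steps 6-7) then amounts to running it with scripted user inputs passed in via `get_input` instead of the interactive `input`.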
Slide 8: Dialogue Design: Principal Decisions
- Specification of target group and supported languages
  - Frequency of usage
  - Regional / national
  - Monolingual / multilingual / foreign-language speakers
  - Age
- Environment
  - Quiet environment: home, office
  - Noisy environment: car, outdoors, noisy working environments
- Choice of persona and voice
- Dialogue structure
Slide 9: Dialogue Design: Practical Tips
- Guide the user towards responses that maximize clarity and avoid ambiguity.
- Allow for the user not knowing
  - the active vocabulary,
  - the answer to a question, or
  - not understanding a question.
- Guide users toward natural, in-vocabulary responses:
  - Version 1: "Welcome to ABC Bank. How can I help you?"
  - Version 2: "Welcome to ABC Bank. What would you like to do?"
  - Version 3: "Welcome to ABC Bank. You can check an account balance, transfer funds, or pay a bill. What would you like to do?"
Slide 10: More Practical Tips
- Do not give too many options at once (maximum 5).
- Keep prompts brief to encourage the user to be brief.
- Supply confirmation messages frequently, especially when the cost or likelihood of a recognition error is high.
- Prefer implicit over explicit grounding.
- Use recognizer confidence values to avoid unnecessary grounding steps (see the sketch after this list).
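As an illustration of the last two tips, here is a hedged Python sketch of confidence-based grounding. The thresholds and the example domain are assumptions for illustration, not values from the lecture.

```python
# Choose a grounding strategy from the recognizer's confidence score.
# Thresholds are illustrative; real systems tune them on data.

HIGH_CONFIDENCE = 0.85   # accept silently, no grounding step
LOW_CONFIDENCE = 0.50    # below this, confirm explicitly

def grounding_move(hypothesis: str, confidence: float) -> str:
    """Return the system's next utterance for one recognized user input."""
    if confidence >= HIGH_CONFIDENCE:
        return f"Transferring funds to {hypothesis}."
    if confidence >= LOW_CONFIDENCE:
        # implicit grounding: embed the hypothesis in the next prompt
        return f"To {hypothesis}. And what amount?"
    # explicit grounding: reserved for low confidence, since it costs a turn
    return f"Did you say {hypothesis}?"

print(grounding_move("my savings account", 0.62))
```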
Slide 11: More Practical Tips
- Assume a frequent user will have a rapid learning curve. Allow shortcuts:
  - Switch to expert mode / command level.
  - Combine different steps into one.
  - Barge-in.
- Assume errors are the fault of the recognizer, not the user.
- Allow the user to access (context-sensitive) help in any state.
- Provide escape commands.
- Design graceful recovery for when the recognizer makes an error.
Slide 12: Outline
- The Software Development Cycle
- Dialogue Design
- Wizard-of-Oz Experiments
- Dialogue System Evaluation
Slide 13: Dialogue Design: General Steps
- 1. Make sure you understand what you are trying to achieve (use scenarios and build a conceptual model).
- 2. See if you can decompose the task into smaller meaningful subtasks.
- 3. Identify the information tokens you need for each task or subtask.
- 4. Decide how you will obtain this information from the user.
- 5. Sketch a dialogue model that captures this information.
- 6. Test your dialogue model.
- 7. Revise the dialogue model and repeat Step 6.
Slide 14: Wizard-of-Oz Experiments
- Central parts of the system are simulated by a human "wizard".
- Experimental WoZ systems make it possible to test a dialogue system (to some extent) before it has been (fully) implemented, thus uncovering basic problems of the dialogue model.
- They also allow the collection, at an early stage, of
  - data about the dialogue behavior of subjects,
  - the syntax and lexicon used (to hand-code language models), and
  - speech data (to train statistical language models).
Slide 15: Wizard-of-Oz Experiments
- The WoZ is not just a person in a box. The WoZ system must
  - perform as poorly as a computer: "artificial" speech output by typing and a TTS system; simulation of shortcomings in recognition (the wizard sees typed input, hence no prosody), maybe even with simulated recognition failure, e.g., by randomly overwriting words in the typed input (see the sketch after this list);
  - perform as efficiently as a computer: support for quick database access and complex real-time decisions, e.g., by displaying a dialogue-flow diagram, marking the current state, and offering menus with contextually appropriate dialogue moves and system prompts;
  - impose constraints on the options of the wizard (to support the impression of artificiality), and allow those constraints to be varied (to test different dialogue strategies);
  - log all kinds of data in an appropriate and easily accessible form.
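One possible realization of the simulated recognition failure mentioned above, sketched in Python. The error rate and the distractor vocabulary are invented for illustration.

```python
import random

# Simulate recognition errors by randomly overwriting words in the
# typed input before it reaches the wizard. Rate and vocabulary are
# illustrative assumptions.

DISTRACTORS = ["play", "pause", "album", "song", "list", "next"]

def corrupt(utterance, error_rate=0.15, rng=random.Random(42)):
    """Replace each word by a random distractor with probability error_rate."""
    return " ".join(
        rng.choice(DISTRACTORS) if rng.random() < error_rate else word
        for word in utterance.split()
    )

print(corrupt("please play the song believe from the selected album"))
```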
Slide 16: Wizard-of-Oz Experiments
- Ideally, a WoZ system is set up in a modular way, allowing the functions contributed by humans to be replaced one by one in the course of system implementation.
- This permits a gradual transition between the WoZ and the fully artificial system.
- An example: the DiaMant tool, run in WoZ mode.
Slide 17: Motivations for WoZ Experiments
- The original motivation:
  - Early testing, avoiding time-consuming and expensive programming.
  - Studying dialogues while disregarding the bottleneck of unreliable speech recognition.
- Changing conditions:
  - Configuration and design of dialogue systems are becoming comfortable, and recognizers are becoming fairly reliable. Are WoZ experiments still necessary?
  - Dialogue interaction is becoming increasingly flexible, adaptive, and complex. Are WoZ experiments feasible?
- A shift in motivation:
  - from exploration of the user's behavior, given constrained and schematic system behavior,
  - to exploration of alternative behaviors of the wizard, who is given a range of freedom in his/her reactions.
Slide 18: An Example
- A WoZ study in the TALK project, spring 2005
- MP3 player
- Multimodal dialogue, language: German
- In-car / in-home scenario
- Saarland University, DFKI, CLT
Slide 19: Tasks for the Subjects
- MP3 domain:
  - in-car, with the Lane Change Task (LCT) as primary task
  - in-home, without the LCT
- Tasks for the subject:
  - Play a song from the album "New Adventures in Hi-Fi" by REM.
  - Find a song with "believe" in the title and play it.
- Task for the wizard:
  - Help the user reach their goals. (Deliberately vague!)
Slide 20: Goals of the WoZ MP3 Experiment
- Gather pilot data on human multimodal turn planning
- Collect wizard dialogue strategies
- Collect wizard media allocation decisions
- Collect wizard speech data
- Collect user data (speech signals and spontaneous speech)
Slide 21: User View
- Primary task: driving
- Secondary task, on a second screen: MP3 player
Slide 22: Video Recording
Slide 23: DFKI/USAAR WoZ System
- System features:
  - 14 components, communicating via OAA, distributed over 5 machines (3 Windows, 2 Linux)
  - plus the LCT on a separate machine
- People involved in running an experiment: 5
  - 1 experiment leader
  - 1 wizard
  - 1 subject
  - 2 typists
Slide 24: Data Flow
[Diagram: data flow between the wizard, the subject, and the two typists]
Slide 25: A Walk Through the Final Turns
- Wizard: "Ich zeige Ihnen die Liste an." (I am displaying the list.)
- User: "Ok. Zeige mir bitte das Lied aus dem ausgewählten Album und spiel das vor." (Ok. Please show me that song (Believe) from the selected album and play it.)
Slide 26: A Walk Through the Final Turns
- Wizard's actions:
  - Database search
  - Select album presentation (vs. songs or artists)
  - Select list presentation (vs. tables or textual summary)
  - Utterance: "Ich zeige Ihnen die Liste an." (I am displaying the list.)
    - The audio is sent to a typist.
    - The text is sent to speech synthesis.
- User: "Ok. Zeige mir bitte das Lied aus dem ausgewählten Album und spiel das vor." (Ok. Please show me that song (Believe) from the selected album and play it.)
Slide 27: Example (1): Wizard
- Says "Ich zeige Ihnen die Liste an." (I am displaying the list.) and clicks on the list presentation.
Slides 28-29: [screenshots, no transcript]
Slide 30: Options Presenter with User Tab
Slide 31: Data Flow
[Diagram: data flow between the wizard, the subject, and the two typists]
Slide 32: Example (2): Wizard Typist
- Types the wizard's spoken text: "I am displaying the list."
Slide 33: Data Flow
[Diagram: data flow between the wizard, the subject, and the two typists]
Slide 34: Example (3): User
- Listens to the wizard's text synthesized by Mary and receives the selected list presentation.
Slide 35: [screenshot, no transcript]
Slide 36: Example (4): User
- Selects one album and says: "Ok. Zeige mir bitte das Lied aus dem ausgewählten Album und spiel das vor." (Ok. Please show me that song (Believe) from the selected album and play it.)
Slide 37: Automatically updated wizard screen with check
Slide 38: Data Flow
[Diagram: data flow between the wizard, the subject, and the two typists]
Slide 39: Example (5): User Typist
- Types the user's spoken text: "Ok. Please show me that song (Believe) from the selected album and play it."
Slide 40: Data Flow
[Diagram: data flow between the wizard, the subject, and the two typists]
Slide 41: Example (6): Wizard
- Gets a correspondingly updated TextBox window.
Slide 42: The Current Experimental Setup
- Usability Lab, Building C7 4
Slide 43: GUI Development
[Screenshots: old vs. new GUI]
Slide 44: Outline
- The Software Development Cycle
- Dialogue Design
- Wizard-of-Oz Experiments
- Dialogue System Evaluation
Slide 45: Different Levels of Evaluation
- Technical evaluation
- Usability evaluation
- Customer evaluation
- According to L. Dybkjær, N. Bernsen, and W. Minker, "Overview of evaluation and usability", in W. Minker et al. (eds.), Spoken Multimodal Human-Computer Dialogue in Mobile Environments, Springer, 2005.
Slide 46: Different Levels of Evaluation
- Technical evaluation
  - Typically component evaluation (ASR, TTS, grammar, but also, e.g., system robustness)
  - Quantitative and objective, to some extent
- Usability evaluation
- Customer evaluation
Slide 47: Evaluation of ASR Systems
- Word error rate (WER; see the sketch after this list)
- Speed (real-time performance)
- Size of the lexicon
- Perplexity
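WER is the standard ASR metric: the number of word substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the reference length. A minimal Python implementation of this standard definition (not from the lecture):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i          # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j          # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(substitution, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("play the song believe", "play a song relieve"))  # -> 0.5
```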
Slide 48: Evaluation of TTS
- Intuitive evaluation by users with respect to
  - intelligibility,
  - pleasantness,
  - naturalness.
- There are no objective (though there are quantitative) criteria, but TTS quality is extremely important for user satisfaction.
Slide 49: Different Levels of Evaluation
- Technical evaluation
- Usability evaluation
  - Evaluation of user satisfaction
  - Typically end-to-end evaluation
  - Mostly subjective and qualitative measures
- Customer evaluation
Slide 50: Different Levels of Evaluation
- Technical evaluation
- Usability evaluation
- Customer evaluation, including aspects like
  - costs,
  - platform compatibility,
  - maintenance.
Slide 51: Usability Evaluation
- Mostly soft criteria
- "Usability guidelines", best-practice rules, form the basis of expert evaluation or user questionnaires.
Slide 52: Usability Guidelines
- From Dybkjær et al.:
  - Feedback adequacy: the user must feel confident that the system has understood the information input in the way it was intended.
  - Naturalness of the dialogue structure
  - Sufficiency of interaction guidance
  - Sufficiency of adaptation to user differences
  - ...
Slide 53: Usability Evaluation
- Mostly soft criteria
- "Usability guidelines", best-practice rules, form the basis of expert evaluation or user questionnaires.
- Hard, measurable criteria often contradict each other: systems with high task success may lack efficiency, and vice versa.
- Is it possible to evaluate usability in an objective, predictive, and general way?
- Is there one (maybe parametrized) measure for user satisfaction?
Slide 54: PARADISE
- An attempt to provide an objective, quantitative, operational basis for qualitative user assessments.
- M. Walker, D. Litman, C. Kamm, and A. Abella, "PARADISE: A framework for evaluating spoken dialogue agents", Proc. of ACL 1997.
Slide 55: PARADISE: The Idea
- The top criterion for usability evaluation is user satisfaction: an intuitive criterion which cannot be measured directly, but is only accessible through qualitative user judgments.
- User satisfaction is
  - correlated with task success (effectiveness), and
  - inversely correlated with the dialogue costs.
- There are features that can be easily and objectively extracted from dialogue logfiles which approximate both task success and dialogue costs.
Slide 56: PARADISE: The Idea
- Take a set of dialogues produced by the interaction of a dialogue system A with different subjects.
- Let the users assess their satisfaction with the dialogue.
- Calculate the task success, and read the different measures for dialogue costs off the logfiles.
- Compute the correlation between the satisfaction assessments and the quantitative measures, via multiple linear regression (see the sketch after this list).
- Results:
  - Prediction of user satisfaction for new individual dialogues with system A, or for dialogues with a modified system A'.
  - Comparison of different dialogue systems A and B with respect to user satisfaction.
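A hedged Python sketch of the regression step, using NumPy. All numbers are invented to show the mechanics; they are not real experimental data.

```python
import numpy as np

# One row per dialogue: [kappa, elapsed time (s), number of rejects].
X = np.array([[0.9, 120, 1],
              [0.5, 300, 4],
              [0.7, 200, 2],
              [0.8, 150, 1]], dtype=float)
y = np.array([4.5, 2.0, 3.5, 4.0])    # questionnaire satisfaction scores

# z-score normalization, as in PARADISE's N(.)
Xn = (X - X.mean(axis=0)) / X.std(axis=0)

# ordinary least squares with an intercept column
A = np.hstack([np.ones((len(y), 1)), Xn])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print("intercept and weights:", coef)
print("predicted satisfaction:", A @ coef)
```

The fitted weights are what make the approach predictive: applied to the logfile measures of a new dialogue (or a modified system), they yield an estimated satisfaction score without a new questionnaire.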
Slide 57: PARADISE: The Structure
[Diagram: maximize user satisfaction, decomposed into maximizing task success and minimizing costs; costs are measured by efficiency measures and qualitative measures]
Slide 58: Efficiency and Quality Measures
- Efficiency measures:
  - Elapsed time
  - Number of system turns
  - Number of user turns
- Quality measures (a log-counting sketch follows after this list):
  - Number of timeout prompts
  - Number of rejects
  - Number of helps
  - Number of cancels
  - Number of barge-ins
  - Mean ASR score
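A sketch of how such counts could be read off a dialogue logfile in Python. The one-event-per-token log format is an assumption made for illustration; real WoZ and system logs are far richer.

```python
from collections import Counter

# Toy logfile: each whitespace-separated token names one logged event.
LOG = """system_turn
user_turn
timeout
system_turn
user_turn reject
system_turn barge_in
user_turn""".splitlines()

counts = Counter()
for line in LOG:
    counts.update(line.split())   # count every event token

print("system turns:", counts["system_turn"])
print("user turns:", counts["user_turn"])
print("timeouts:", counts["timeout"],
      "| rejects:", counts["reject"],
      "| barge-ins:", counts["barge_in"])
```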
Slide 59: A Measure for Task Success
- Option 1: yes/no evaluation for the complete dialogue.
- Option 2, available for dialogue systems using the form-filling paradigm: let task success be determined by the fields in the form that are filled with correct values.
This and the following 3 slides will not be part of the exam.
Slide 60: Tasks as Attribute-Value Matrices
Slide 61: An Instance
Slide 62: A Measure for Task Success
- Identify task success with the kappa (κ) value for agreement between the actual and the intended values of the AVM (κ is usually employed for measuring inter-annotator agreement):

  κ = (P(A) - P(E)) / (1 - P(E))

- P(A) is the actual relative frequency of coincidence between values, P(E) the expected frequency (a worked computation follows below).
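A worked computation of κ for a small hypothetical AVM in Python. The attributes and values are invented, and P(E) is estimated here from the pooled value distribution, which is one common choice rather than the lecture's prescribed procedure.

```python
from collections import Counter

# Intended vs. actually filled attribute-value matrix (hypothetical).
intended = {"service": "transfer", "account": "savings", "amount": "100"}
actual   = {"service": "transfer", "account": "checking", "amount": "100"}

attrs = list(intended)
# P(A): relative frequency of agreement between actual and intended values
p_a = sum(intended[a] == actual[a] for a in attrs) / len(attrs)

# P(E): chance agreement, estimated from the pooled value distribution
pool = Counter(intended.values()) + Counter(actual.values())
total = sum(pool.values())
p_e = sum((n / total) ** 2 for n in pool.values())

kappa = (p_a - p_e) / (1 - p_e)
print(f"P(A)={p_a:.2f}  P(E)={p_e:.2f}  kappa={kappa:.2f}")
```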
Slide 63: PARADISE: The Structure
[Diagram: maximize user satisfaction, decomposed into maximizing task success and minimizing costs; costs are measured by efficiency measures and qualitative measures]
Slide 64: User Satisfaction
- Measured by adding up the scores assigned to 8 questions by the subjects.
Slide 65: A User Satisfaction Questionnaire
- Was the system easy to understand?
- Did the system understand what you said?
- Was it easy to find the information you wanted?
- Was the pace of interaction with the system appropriate?
- Did you know what you could say at each point in the dialogue?
Slide 66: A User Satisfaction Questionnaire
- How often was the system sluggish and slow to reply to you?
- Did the system work the way you expected it to?
- From your current experience with using the system, do you think you would use the system regularly?
Slide 67: A Hypothetical Example
This and the following slide will not be part of the exam.
Slide 68: The Performance Function

  Performance = α · N(κ) - Σi wi · N(ci)

- N is a normalization function based on the standard deviation (a z-score),
- N(κ) is the normalized task success,
- N(ci) are the normalized cost factors,
- α and the wi are weights on κ and the ci, respectively (a small numeric sketch follows below).
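A small numeric Python sketch of this function. The weights α and wi would in practice come from the regression of slide 56; all numbers here are invented for illustration.

```python
import numpy as np

kappas = np.array([0.9, 0.5, 0.7, 0.8])        # task success per dialogue
costs = np.array([[120, 1], [300, 4],          # [elapsed time, rejects]
                  [200, 2], [150, 1]], dtype=float)

def zscore(x):
    """N(.): normalization based on mean and standard deviation."""
    return (x - x.mean(axis=0)) / x.std(axis=0)

alpha = 0.5                      # weight on task success (illustrative)
w = np.array([0.3, 0.2])         # weights on the cost factors (illustrative)

performance = alpha * zscore(kappas) - zscore(costs) @ w
print(performance)               # one performance value per dialogue
```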
Slide 69: Comments on PARADISE
- The criterion for feature selection is the easy availability of features from logfiles. Are the selected features really the interesting ones?
- There is no strong theoretical foundation for the choice of questions in the user questionnaire.
- Does the methodology extend to more complex dialogue applications in real-world environments?
Slide 70: General Comments
- There is a trade-off between precision/objectivity and usefulness:
  - PARADISE: (more or less) precise and objective, but of limited practical use.
  - Evaluation guidelines: of some practical use, but not really objective.
- The most useful device is intuition, even if it is, at least in part, an artist's intuition: dialogue design is art as well as technology.