Language Technology II Language-Based Interaction: Dialogue design and evaluation - PowerPoint PPT Presentation

About This Presentation
Title:

Language Technology II Language-Based Interaction: Dialogue design and evaluation

Description:

Title: Einführung in die Allgemeine Sprachwissenschaft: Pragmatik Author: coli Last modified by: Coli Created Date: 2/3/2003 3:19:12 PM Document presentation format – PowerPoint PPT presentation

Slides: 71
Provided by: coli9

Transcript and Presenter's Notes

Title: Language Technology II Language-Based Interaction: Dialogue design and evaluation


1
Language Technology II: Language-Based
Interaction: Dialogue design and evaluation
  • Manfred Pinkal
  • Course website: www.coli.uni-saarland.de/courses/late2

2
Outline
  • The Software Development Cycle
  • Dialogue Design
  • Wizard-of-Oz Experiments
  • Dialogue System Evaluation

3
The Software Development Cycle
  • Requirements Analysis
  • Design
  • Implementation
  • Testing and Evaluation
  • Integration
  • Maintenance

4
The Software Development Cycle
  • Requirements Analysis
  • Design
  • Implementation
  • Testing and Evaluation
  • Integration
  • Maintenance

5
Outline
  • The Software Development Cycle
  • Dialogue Design
  • Wizard-of-Oz Experiments
  • Dialogue System Evaluation

6
Dialogue Design: Overall Aims
  • Effectiveness (Task Success)
  • Efficiency
  • User Satisfaction

7
Dialogue Design: General Steps
The following slides are compiled from slides by
Rolf Schwitters and Bernd Plannerer
  • 1. Make sure you understand what you are trying
    to achieve (use scenarios and build a conceptual
    model).
  • 2. See if you can decompose the task into smaller
    meaningful subtasks.
  • 3. Identify the information tokens you need for
    each task or subtask.
  • 4. Decide how you will obtain this information
    from the user.
  • 5. Sketch a dialogue model that captures this
    information.
  • 6. Test your dialogue model.
  • 7. Revise the dialogue model and repeat Step 6.

8
Dialogue Design: Principal Decisions
  • Specification of target group and supported
    languages
    • Frequency of usage
    • Regional / national
    • Monolingual / multilingual / foreign-language
      speakers
    • Age
  • Environment
    • Quiet environment: home, office
    • Noisy environment: car, outdoors, noisy working
      environments
  • Choice of persona and voice
  • Dialogue structure

9
Dialogue Design: Practical Tips
  • Guide the user towards responses that maximize
    • clarity and
    • unambiguousness.
  • Allow for the user not knowing
    • the active vocabulary,
    • the answer to a question, or
    • what a question means.
  • Guide users toward natural, in-vocabulary
    responses.
    • Version 1: "Welcome to ABC Bank. How can I help
      you?"
    • Version 2: "Welcome to ABC Bank. What would you
      like to do?"
    • Version 3: "Welcome to ABC Bank. You can check an
      account balance, transfer funds, or pay a bill.
      What would you like to do?"

10
More Practical Tips
  • Do not give too many options at once (maximum 5)
  • Keep prompts brief to encourage the user to be
    brief.
  • Supply confirmation messages frequently,
    especially when the cost or likelihood of a
    recognition error is high.
  • Prefer implicit over explicit grounding.
  • Use recognizer confidence values to avoid
    unnecessary grounding steps.
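The last three tips can be combined into a simple decision rule. The following is only a minimal sketch: the function name, the three thresholds, and the four strategy labels are illustrative assumptions, not taken from the slides or from any particular recognizer.

```python
# Hypothetical thresholds on a recognizer confidence score in [0, 1].
# Real values would have to be tuned on data from the actual recognizer.
REJECT_BELOW = 0.3
EXPLICIT_BELOW = 0.6
IMPLICIT_BELOW = 0.9


def grounding_strategy(confidence: float, high_cost: bool = False) -> str:
    """Choose a grounding move for one recognized user input.

    "reprompt": confidence too low, ask the user again.
    "explicit": ask a yes/no confirmation question.
    "implicit": repeat the understood value inside the next prompt.
    "none":     continue without any grounding step.
    """
    if confidence < REJECT_BELOW:
        return "reprompt"
    # Costly actions (e.g. transferring funds) are always confirmed explicitly.
    if confidence < EXPLICIT_BELOW or high_cost:
        return "explicit"
    if confidence < IMPLICIT_BELOW:
        return "implicit"
    return "none"


if __name__ == "__main__":
    for score in (0.2, 0.5, 0.75, 0.95):
        print(score, grounding_strategy(score))
```

With these assumed thresholds, most confidently recognized inputs need no grounding step at all, and explicit confirmation is reserved for uncertain or costly cases.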

11
More Practical Tips
  • Assume a frequent user will have a rapid learning
    curve.
  • Allow shortcuts:
    • Switch to expert mode / command level.
    • Combine several steps into one.
    • Barge-in
  • Assume errors are the fault of the recognizer,
    not the user.
  • Allow the user to access (context-sensitive) help
    in any state.
  • Provide escape commands.
  • Design graceful recovery when the recognizer
    makes an error.

12
Outline
  • The Software Development Cycle
  • Dialogue Design
  • Wizard-of-Oz Experiments
  • Dialogue System Evaluation

13
Dialogue Design: General Steps
  • 1. Make sure you understand what you are trying
    to achieve (use scenarios and build a conceptual
    model).
  • 2. See if you can decompose the task into smaller
    meaningful subtasks.
  • 3. Identify the information tokens you need for
    each task or subtask.
  • 4. Decide how you will obtain this information
    from the user.
  • 5. Sketch a dialogue model that captures this
    information.
  • 6. Test your dialogue model.
  • 7. Revise the dialogue model and repeat Step 6.

14
Wizard-of-Oz Experiments
  • Central parts of the system are simulated by a
    human "wizard".
  • Experimental WoZ systems make it possible to test a
    dialogue system (to some extent) before it has been
    (fully) implemented, thus uncovering basic
    problems of the dialogue model.
  • They also make it possible to collect, at an early
    stage,
    • data about the dialogue behavior of subjects
    • the syntax and lexicon used (to hand-code
      language models)
    • speech data (to train statistical language
      models)

15
Wizard-of-Oz Experiments
  • The WoZ is not just a person in a box. The WoZ
    system must
    • perform as poorly as a computer: "artificial"
      speech output via typing and a TTS system;
      simulation of shortcomings in recognition: the
      wizard sees typed input (no prosody), possibly even
      with simulated recognition failures (e.g., by
      randomly overwriting words in the typed input).
    • perform as efficiently as a computer: support for
      quick database access and complex real-time
      decisions, e.g., by displaying a dialogue-flow
      diagram, marking the current state, and offering
      menus with contextually appropriate dialogue
      moves and system prompts.
    • impose constraints on the wizard's options
      (to support the impression of artificiality), and
      allow those constraints to be varied (to test
      different dialogue strategies).
    • log all kinds of data in an appropriate and
      easily accessible form.
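The simulated recognition failure mentioned above (randomly overwriting words in the typed input) could look roughly like this. This is a sketch under assumptions: the function name, the uniform per-word error rate, and the replacement vocabulary are all hypothetical, and real recognizers make far less uniform errors.

```python
import random


def simulate_recognition_errors(text, error_rate, vocabulary, rng=None):
    """Overwrite each word of the typed input with probability `error_rate`.

    Replacement words are drawn uniformly from `vocabulary` (an assumed
    stand-in for the recognizer's active vocabulary).
    """
    if rng is None:
        rng = random.Random()
    noisy_words = []
    for word in text.split():
        if rng.random() < error_rate:
            noisy_words.append(rng.choice(vocabulary))
        else:
            noisy_words.append(word)
    return " ".join(noisy_words)


if __name__ == "__main__":
    vocabulary = ["song", "album", "artist", "play", "list"]
    typed = "please play the next song"
    # A fixed seed makes an experimental session reproducible.
    print(simulate_recognition_errors(typed, 0.3, vocabulary, random.Random(42)))
```

Passing a seeded generator keeps the corruption reproducible, which helps when the logged data are analysed later.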

16
Wizard-of-Oz Experiments
  • Ideally, a WoZ system is set up in a modular way,
    so that functions contributed by humans can be
    replaced step by step in the course of system
    implementation.
  • Gradual transition between WoZ and fully
    artificial system.
  • An example: the DiaMant tool, run in WoZ mode.

17
Motivations for WoZ experiments
  • The original motivation
    • Early testing, avoiding time-consuming and
      expensive programming.
    • Studying dialogues without the bottleneck
      of unreliable speech recognition.
  • Changing conditions
    • Configuration and design of dialogue systems are
      becoming comfortable, and recognizers are becoming
      fairly reliable: are WoZ experiments necessary?
    • Dialogue interaction is becoming increasingly
      flexible, adaptive, and complex: are WoZ experiments
      feasible?
  • A shift in motivation
    • From exploring the user's behavior, given
      constrained and schematic system behavior,
    • to exploring alternative behaviors of the wizard,
      who is given a range of freedom in his/her
      reactions.

18
An example
  • A WoZ study in the TALK project, Spring 2005
  • MP3 Player
  • Multi-modal dialogue; language: German
  • In-car/in-home scenario
  • Saarland University, DFKI, CLT

19
Tasks for the Subjects
  • MP3 domain
    • In-car, with primary task: Lane Change Task (LCT)
    • In-home, without LCT
  • Tasks for the subject
    • Play a song from the album "New Adventures in
      Hi-Fi" by REM.
    • Find a song with "believe" in the title and play
      it.
  • Task for the wizard
    • Help the user reach their goals. (Deliberately
      vague!)

20
Goals of WOZ MP3 Experiment
  • Gather pilot data on human multi-modal turn
    planning
  • Collect wizard dialogue strategies
  • Collect wizard media allocation decisions
  • Collect wizard speech data
  • Collect user data (speech signals and spontaneous
    speech)

21
User View
  • Primary task: driving
  • Secondary task on second screen: MP3 player

22
Video Recording
23
DFKI/USAAR WOZ system
  • System features
    • 14 components communicating via OAA, distributed
      over 5 machines (3 Windows, 2 Linux)
    • Plus the LCT on a separate machine
  • People involved in running an experiment: 5
    • 1 experiment leader
    • 1 wizard
    • 1 subject
    • 2 typists

24
Data Flow
(Diagram: wizard, subject, and two typists)
25
A Walk Through the Final Turns
  • Wizard: "Ich zeige Ihnen die Liste an." (I am
    displaying the list.)
  • User: "Ok. Zeige mir bitte das Lied aus dem
    ausgewählten Album und spiel das vor." (Ok. Please
    show me that song (Believe) from the selected
    album and play it.)

26
A Walk Through the Final Turns
  • Wizard's actions
    • Database search
    • Select album presentation (vs. songs or
      artists)
    • Select list presentation (vs. tables or textual
      summary)
    • Utterance: "Ich zeige Ihnen die Liste an." (I am
      displaying the list.)
  • Audio is sent to the typist
  • Text is sent to speech synthesis
  • User: "Ok. Zeige mir bitte das Lied aus dem
    ausgewählten Album und spiel das vor." (Ok. Please
    show me that song (Believe) from the selected
    album and play it.)

27
Example (1): Wizard
  • Says "Ich zeige Ihnen die Liste an." (I am
    displaying the list.) and clicks on the list
    presentation
30
Options presenter with User-Tab
31
Data Flow
(Diagram: wizard, subject, and two typists)
32
Example (2): Wizard Typist
  • Types the wizard's spoken text:
    "I am displaying the list."
33
Data Flow
(Diagram: wizard, subject, and two typists)
34
Example (3): User
  • Listens to the wizard's text synthesized by Mary and
    receives the selected list presentation

36
Example (4): User
  • Selects one album and says "Ok. Zeige mir bitte
    das Lied aus dem ausgewählten Album und spiel
    das vor." (Ok. Please show me that song (Believe)
    from the selected album and play it.)
37
Automatically updated wizard screen with check
38
Data Flow
(Diagram: wizard, subject, and two typists)
39
Example (5): User Typist
  • Types the user's spoken text:
    "Ok. Please show me that song (Believe) from the
    selected album and play it."
40
Data Flow
(Diagram: wizard, subject, and two typists)
41
Example (6): Wizard
  • Gets a correspondingly updated TextBox window

42
The current experimental setup
  • Usability Lab, Building C7 4

43
GUI Development
Old
New
44
Outline
  • The Software Development Cycle
  • Dialogue Design
  • Wizard-of-Oz Experiments
  • Dialogue System Evaluation

45
Different levels of evaluation
  • Technical evaluation
  • Usability evaluation
  • Customer evaluation
  • According to L. Dybkjaer / N. Bernsen / W. Minker,
    "Overview of evaluation and usability", in W.
    Minker et al., Spoken multimodal human-computer
    dialogue in mobile environments, Springer 2005

46
Different levels of evaluation
  • Technical evaluation
    • Typically component evaluation (ASR, TTS,
      grammar), but also, e.g., system robustness
    • Quantitative and objective, to some extent
  • Usability evaluation
  • Customer evaluation

47
Evaluation of ASR Systems
  • WER (word error rate)
  • Speed (real-time performance)
  • Size of lexicon
  • Perplexity
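As an illustration of the first criterion, word error rate is conventionally computed as the word-level edit distance (substitutions, deletions, insertions) between the recognizer hypothesis and a reference transcript, divided by the number of reference words. A minimal sketch follows; the function name and the whitespace tokenisation are assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("play the next song", "play next song"))
```

Because insertions are counted, WER can exceed 1.0 when the hypothesis is much longer than the reference.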

48
Evaluation of TTS
  • Intuitive evaluation by users with respect to
    • intelligibility
    • pleasantness
    • naturalness
  • No objective (though quantitative) criteria, but
    extremely important for user satisfaction

49
Different levels of evaluation
  • Technical evaluation
  • Usability evaluation
    • Evaluation of user satisfaction
    • Typically end-to-end evaluation
    • Mostly subjective and qualitative measures
  • Customer evaluation

50
Different levels of evaluation
  • Technical evaluation
  • Usability evaluation
  • Customer evaluation, including aspects like
    • Costs
    • Platform compatibility
    • Maintenance

51
Usability Evaluation
  • Mostly soft criteria
  • "Usability Guidelines", best-practice rules, form
    the basis of expert evaluation or user
    questionnaires.

52
Usability Guidelines
  • from Dybkjaer et al.
  • Feedback adequacy: the user must feel confident
    that the system has understood the information
    input in the way it was intended.
  • Naturalness of the dialogue structure
  • Sufficiency of interaction guidance
  • Sufficiency of adaptation to user differences

53
Usability Evaluation
  • Mostly soft criteria
  • "Usability Guidelines", best-practice rules, form
    the basis of expert evaluation or user
    questionnaires.
  • Hard, measurable criteria often contradict each
    other: systems with high task success may lack
    efficiency, and vice versa.
  • Is it possible to evaluate usability in an
    objective, predictive, and general way?
  • Is there one (maybe parametrized) measure for
    user satisfaction?

54
PARADISE
  • An attempt to provide an objective, quantitative,
    operational basis for qualitative user
    assessments
  • M. Walker / D. Litman / C. Kamm / A. Abella, "PARADISE:
    A framework for evaluating spoken dialogue
    agents", Proc. of ACL 1997

55
PARADISE: The Idea
  • The top criterion for usability evaluation is
    user satisfaction: it is an intuitive criterion
    which cannot be measured directly, but is only
    accessible through qualitative user judgments.
  • User satisfaction is
    • positively correlated with task success
      (effectiveness)
    • inversely correlated with the dialogue costs.
  • There are features that can be easily and
    objectively extracted from dialogue logfiles,
    which approximate both task success and dialogue
    costs.

56
PARADISE: The Idea
  • Take a set of dialogues produced by interaction
    of a dialogue system A with different subjects.
  • Let the users assess their satisfaction with the
    dialogue.
  • Calculate the task success, and read the
    different measures for dialogue costs off the
    log-files.
  • Compute the correlation between satisfaction
    assessment and quantitative measures (via
    multiple linear regression).
  • Results
    • Prediction of user satisfaction for new
      individual dialogues with system A, or
      for dialogues with a modified system A'.
    • Comparison of different dialogue systems A and B
      with respect to user satisfaction.
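The regression step can be sketched in plain Python as ordinary least squares solved through the normal equations. This is only an illustration under assumptions: the function names, the two example features (task success and number of turns), and all numbers are invented, and a statistics package would normally be used instead.

```python
def fit_weights(features, satisfaction):
    """Least-squares fit of satisfaction ~ b0 + b1*x1 + ... + bn*xn.

    `features` holds one tuple of log-file measures per dialogue,
    `satisfaction` the corresponding questionnaire scores.
    Returns [b0, b1, ..., bn], where b0 is the intercept.
    """
    rows = [[1.0] + [float(x) for x in f] for f in features]
    n = len(rows[0])
    # Augmented matrix [X^T X | X^T y] of the normal equations.
    a = [
        [sum(r[i] * r[j] for r in rows) for j in range(n)]
        + [sum(r[i] * y for r, y in zip(rows, satisfaction))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("features are linearly dependent")
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(n):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [x - factor * p for x, p in zip(a[r], a[col])]
    return [a[i][n] / a[i][i] for i in range(n)]


def predict(weights, feature_tuple):
    """Predicted satisfaction for a new dialogue."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], feature_tuple))


if __name__ == "__main__":
    # Invented data: (task success, number of system turns) per dialogue.
    logged = [(0.2, 10), (0.5, 8), (0.9, 4), (0.7, 12), (0.4, 6)]
    scores = [20, 24, 33, 25, 26]
    w = fit_weights(logged, scores)
    print(w, predict(w, (0.8, 5)))
```

The fitted weights can then be used to predict satisfaction for new dialogues without asking users to fill in the questionnaire again.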

57
PARADISE: The Structure
  • Maximise user satisfaction
    • Maximise task success
    • Minimise costs
      • Efficiency measures
      • Qualitative measures
58
Efficiency and Quality Measures
  • Efficiency measures
  • Elapsed time
  • System turns
  • User turns
  • Quality measures
  • of timeout prompts
  • of rejects
  • of helps
  • of cancels
  • of barge-ins
  • Mean ASR score

59
A Measure for Task Success
  • Option 1: Yes/No evaluation for the complete
    dialogue
  • Option 2, available for dialogue systems using the
    form-filling paradigm: let task success be
    determined by the fields in the form filled with
    correct values.

This and the following 3 slides will not be part
of the exam
60
Tasks as Attribute-Value Matrices
61
An Instance
62
A Measure for Task Success
  • Identify task success with the κ value for
    agreement between actual and intended values for
    the AVM (κ is usually employed for measuring
    inter-annotator agreement):
    $$\kappa = \frac{P(A) - P(E)}{1 - P(E)}$$
  • P(A) is the actual relative frequency of
    coincidence between values, P(E) the expected
    frequency.
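A small sketch of this computation over the attribute values of a set of dialogues. The slide does not say how P(E) is estimated; the code assumes the standard Cohen's kappa estimate (the product of the marginal frequencies of each value), and the function name and example values are invented.

```python
from collections import Counter


def kappa(intended, actual):
    """Kappa agreement between intended and actually obtained AVM values.

    `intended` and `actual` are equally long lists with one entry per
    attribute instance, e.g. the departure city in each dialogue.
    """
    if not intended or len(intended) != len(actual):
        raise ValueError("need two non-empty lists of equal length")
    n = len(intended)

    # P(A): observed proportion of values that coincide.
    p_a = sum(i == a for i, a in zip(intended, actual)) / n

    # P(E): agreement expected by chance (assumed Cohen-style estimate).
    intended_counts, actual_counts = Counter(intended), Counter(actual)
    p_e = sum(intended_counts[v] * actual_counts[v] for v in intended_counts) / (n * n)

    if p_e == 1.0:
        return 1.0  # degenerate case: only one value ever occurs
    return (p_a - p_e) / (1.0 - p_e)


if __name__ == "__main__":
    intended = ["Milano", "Milano", "Roma", "Roma"]
    actual = ["Milano", "Milano", "Roma", "Milano"]
    print(kappa(intended, actual))
```

Unlike the raw proportion of correct fields, κ corrects for agreement that would arise by chance, which matters when few distinct values are possible.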
63
PARADISE: The Structure
  • Maximise user satisfaction
    • Maximise task success
    • Minimise costs
      • Efficiency measures
      • Qualitative measures
64
User Satisfaction
  • Measured by adding the scores assigned to 8
    questions by the subjects.

65
A user satisfaction questionnaire
  • Was the system easy to understand?
  • Did the system understand what you said?
  • Was it easy to find the information you wanted?
  • Was the pace of interaction with the system
    appropriate?
  • Did you know what you could say at each point in
    the dialogue?

66
A user satisfaction questionnaire
  • How often was the system sluggish and slow to
    reply to you?
  • Did the system work the way you expected it to?
  • From your current experience with using the
    system, do you think you would use the system
    regularly?

67
A hypothetical example
This and the following slide will not be part of
the exam
68
The Performance Function
  • $\mathrm{Performance} = \alpha \cdot N(\kappa) - \sum_{i=1}^{n} w_i \cdot N(c_i)$
  • N is a normalisation function, based on standard
    deviation,
  • N(κ) is normalised task success,
  • N(c_i) are the normalised cost factors,
  • α and w_i are weights on κ and the c_i,
    respectively.
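A minimal sketch of evaluating the performance function for a set of dialogues. It assumes that N is the z-score (value minus mean, divided by standard deviation) and uses the population standard deviation; the function names, the example cost factor, and the weights are invented for illustration.

```python
from statistics import mean, pstdev


def z_normalise(values):
    """N(x) = (x - mean) / standard deviation, over all dialogues."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # a constant measure carries no information
    return [(v - mu) / sigma for v in values]


def performance(kappas, costs, alpha, weights):
    """Per-dialogue performance: alpha * N(kappa) - sum_i w_i * N(c_i).

    `kappas`:  task-success value per dialogue
    `costs`:   dict mapping a cost name to its per-dialogue values
    `weights`: dict mapping the same cost names to their weights w_i
    """
    n_kappa = z_normalise(kappas)
    n_costs = {name: z_normalise(vals) for name, vals in costs.items()}
    return [
        alpha * n_kappa[d] - sum(weights[name] * n_costs[name][d] for name in costs)
        for d in range(len(kappas))
    ]


if __name__ == "__main__":
    kappas = [0.3, 0.8, 0.6]
    costs = {"system_turns": [14, 8, 10]}
    print(performance(kappas, costs, alpha=0.5, weights={"system_turns": 0.25}))
```

Normalisation puts task success and the differently scaled cost measures on a common scale, so the weights can be compared directly.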

69
Comments on PARADISE
  • The criterion for feature selection is the easy
    availability of features through log-files. Are
    the selected features really the interesting
    ones?
  • There is no strong theoretical foundation for the
    choice of questions in the user questionnaire.
  • Does the methodology extend to more complex
    dialogue applications in real-world environments?

70
General Comments
  • A trade-off between precision/objectivity and
    usefulness
    • PARADISE: (more or less) precise and objective,
      but of limited practical use.
    • Evaluation guidelines: of some practical use, but
      not really objective.
  • The most useful device is intuition, and it is,
    at least in part, an artist's intuition: dialogue
    design is art as well as technology.