Title: Steps of The PARADISE Methodology
1George Mason University Learning Agents Laboratory
Concepts, Issues, and Methodologies for
Evaluating Mixed-initiative Intelligent Systems
Research Presentation by Ping Shyr
Research Director: Dr. Gheorghe Tecuci
May 3, 2004
2Overview
Common Evaluation Issues
Evaluating Spoken Dialogue Agents with PARADISE
Some Ideas for Evaluation
Experiments Guidelines and Design
Conclusion
What is initiative ?
Initiative Selection and Experiments
3 Common Issues
- Common Issues Related to Intelligent System Evaluations
  1. Costly (resource-intensive), lengthy, and labor-intensive
  2. Domain experts are scarce and expensive
  3. The quality of the acquired knowledge is hard to analyze and measure
  4. Efficiency and effectiveness are not easy to evaluate
  5. Component evaluation is easier than approach (full system) evaluation
- Issues of Evaluating Mixed-initiative Systems
  1. It is hard to analyze the contribution of the system to the overall performance
  2. Different knowledge sources (imported ontology, SMEs, and the agent)
  3. Mixed-initiative methods are more complex and therefore more difficult to evaluate
4 General Evaluation Questions
1. When to conduct an evaluation?
2. What kind of evaluation should we use? Summative/outcome evaluation (predict/measure the final level of performance)? Diagnostic/process evaluation (identify problems in the prototype/development system that may degrade the performance of the final system)?
3. Where and when to collect evaluation data? Who should supply the data?
4. Who should be involved in an evaluation?
5 Usability Evaluation
1. Usability is a critical factor in judging a system.
2. Usability evaluation provides important information to determine:
   a) How easy the system (especially the human-computer/agent interface and interaction) is to learn and to use
   b) Users' confidence in the system's results
3. Usability evaluation can also be used to predict whether an organization will actually use the system.
6 Issues in Usability Evaluation
1. To what extent does the interface meet acceptable Human Factors standards?
2. To what extent is the system easy to use?
3. To what extent is the system easy to learn how to use?
4. To what extent does the system decrease user workload?
5. To what extent does the explanation capability meet user needs?
6. Is the allocation of tasks to user and system appropriate?
7. Is the supporting documentation adequate?
7 Performance Evaluation
Performance evaluation helps us answer the question: Does the system meet user and organizational objectives and needs? It measures the system's performance and behavior. Experimentation is the best (and often the only) method that can appropriately evaluate the performance of the stable prototype and of the final system.
8 Issues in Performance Evaluation
1. Is the system cost-effective?
2. To what extent does the system meet users' needs?
3. To what extent does the system meet the organization's needs?
4. How effective is the system in enhancing users' performance?
5. How effective is the system in enhancing organizational performance?
6. How effective is the system in specific tasks?
9 Mixed-initiative user interface evaluation (1) (based on Eric Horvitz)
1. Does the automated service (automation) significantly add value to the system?
2. Does the system employ machinery for inferring/exploiting the uncertainty about a user's intentions and focus?
3. Does the system include consideration of the status of a user's attention in the timing of services? The nature and timing of automated services and alerts can be a critical factor in the costs and benefits of actions.
4. Can automated services be guided by the expected value (cost and benefit) of taking actions?
10 Mixed-initiative user interface evaluation (2)
5. Will the system employ dialog to resolve key uncertainties?
6. Can users directly invoke or terminate the automated services?
7. Does the system minimize the cost of poor guesses about action and timing?
8. Can the system adjust its level of services to match the current uncertainty?
11 Mixed-initiative user interface evaluation (3)
9. Does the system provide mechanisms for efficient agent-user collaboration to refine results?
10. Does the system employ socially appropriate behaviors for agent-user interaction?
11. Can the system maintain a memory of recent interactions?
12. Will the system continue to learn about a user's goals and needs?
12Metrics for evaluating models of mixed-initiative
dialog (1) (Guinn)
- OBJECTIVE METRICS
- Percentage of correct answers
- Percentage of successful transactions
- Number of dialog turns
- Dialog time
- User response time
- System response time
- Percentage of error messages
- Percentage of non-trivial utterances
- Mean length of utterances
- Task completion
13Metrics for evaluating models of mixed-initiative
dialog (2) (Guinn)
- SUBJECTIVE METRICS
- Percentage of implicit recovery utterances
- Percentage of explicit recovery utterances
- Percentage of appropriate system utterances
- Cooperativity
- Percentage of correct and partially correct utterances
- User satisfaction
- Number of initiative changes
- Number of explicit initiative changing events
- Number of implicit initiative changing events
- Level of initiative (see examples later)
- Number of discourse segments
- Knowledge density of utterances
- Co-reference patterns
14Level (Strength) of initiative (1)
Level (Strength) of initiative - how strongly a participant is in control when he or she does have the initiative. Initiative can be present in many different degrees or strengths.
1) A: I wish I knew how to split up the work of peeling this banana.
2) B: Yeah.
3) A: What do you think we should do?
4) B: I don't know. It's a tough problem.
5) A: Sure is. I'm so confused.
6) B: Me too. Maybe the waiter has an idea.
7) A: I hope so, I'm getting hungry.
(From example 8 in "What is Initiative?", R. Cohen et al.)
15Level (Strength) of initiative (2)
1) A: So, how should we split up the work of peeling the banana?
2) B: I don't know. What do you think?
3) A: We need a plan.
4) B: I know we need to split this up somehow.
5) A: Yes, you're right. We need something sharp.
6) B: A cleaver?
7) A: Good idea! That way we can split it up evenly.
8) B: Then we can each peel our own half.

1) A: We need to split up the work of peeling this banana. I have the plan. You grab the bottom of the banana and hold it steady. Then I grab the top of the banana and pull hard. That's how we'll do it.
2) B: No! I think I'll just peel the banana myself. That would be way more efficient.
(From examples 9 and 10 in "What is Initiative?", R. Cohen et al.)
16Overview
Common Evaluation Issues
Evaluating spoken dialogue agents with PARADISE
Some Ideas for Evaluation
Experiments Guidelines and Design
Conclusion
What is initiative ?
Initiative Selection and Experiments
17 Evaluating Spoken Dialogue Agents with PARADISE (M. Walker)
- PARADISE (PARAdigm for DIalogue System Evaluation) is a general framework for evaluating spoken dialogue agents. It:
- Decouples task requirements from an agent's dialogue behaviors
- Supports comparisons among dialogue strategies
- Enables the calculation of performance over subdialogues and whole dialogues
- Specifies the relative contribution of various factors to performance
- Makes it possible to compare agents performing different tasks by normalizing for task complexity
18Attribute Value Matrix (AVM)
An attribute value matrix (AVM) can represent
many dialogue tasks. This consists of the
information that must be exchanged between the
agent and the user during the dialogue,
represented as a set of ordered pairs of
attributes and their possible values.
Attribute-value pairs are annotated with the
direction of information flow to represent who
acquires the information. Performance evaluation
for an agent requires a corpus of dialogues
between users and the agent, in which users
execute a set of scenarios. Each scenario
execution has a corresponding AVM instantiation
indicating the task information that was actually
obtained via the dialogue. (from PARADISE, M.
Walker)
19Kappa coefficient
Given a matrix M, success at achieving the
information requirements of the task is measured
with the Kappa coefficient
P(A) is the proportion of times that the AVMs for
the actual set of dialogues agree with the AVMs
for the scenario keys, and P(E) is the proportion
of times that the AVMs for the dialogues and the
keys are expected to agree by chance.
(from PARADISE, M. Walker)
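The Kappa coefficient referenced here has the standard definition (Siegel & Castellan, 1988; Carletta, 1996), restated in terms of the quantities just introduced:

```latex
\kappa = \frac{P(A) - P(E)}{1 - P(E)}
```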
20 Steps in the PARADISE Methodology
- Definition of a task and a set of scenarios
- Specification of the AVM task representation
- Experiments with alternate dialogue agents for the task
- Calculation of user satisfaction using surveys
- Calculation of task success using κ
- Calculation of dialogue cost using efficiency and qualitative measures
- Estimation of a performance function using linear regression and values for user satisfaction, κ, and dialogue costs
- Comparison with other agents/tasks to determine which factors generalize
- Refinement of the performance model
(from PARADISE, M. Walker)
21Objective Metrics for evaluating a dialog (Walker)
- Objective metrics can be calculated without recourse to human judgment and, in many cases, can be logged by the spoken dialogue system so that they can be calculated automatically.
- Percentage of correct answers with respect to a set of reference answers
- Percentage of successful transactions or completed tasks
- Number of turns or utterances
- Dialogue time or task completion time
- Mean user response time
- Mean system response time
- Percentage of diagnostic error messages
- Percentage of non-trivial (more than one word) utterances
- Mean length of non-trivial utterances
(from PARADISE, M. Walker)
22Subjective Metrics for evaluating a dialog (1)
- Percentage of implicit recovery utterances (where the system uses dialogue context to recover from errors of partial recognition or understanding)
- Percentage of explicit recovery utterances
- Percentage of contextually appropriate system utterances
- Cooperativity (the adherence of the system's behavior to Grice's conversational maxims; Grice, 1967)
- Percentage of correct and partially correct answers
- Percentage of appropriate and inappropriate system directive and diagnostic utterances
- User satisfaction (users' perceptions about the usability of a system, usually assessed with multiple-choice questionnaires that ask users to rank the system's performance on a range of usability features according to a scale of potential assessments)
(from PARADISE, M. Walker)
23Subjective Metrics for evaluating a dialog (2)
Subjective metrics require subjects using the
system and/or human evaluators to categorize the
dialogue or utterances within the dialogue along
various qualitative dimensions. Because these
metrics are based on human judgments, such
judgments need to be reliable across judges in
order to compete with the reproducibility of
metrics based on objective criteria.
Subjective metrics can still be quantitative,
as when a ratio between two subjective categories
is computed. (from PARADISE, M. Walker)
24 Limitations of Current Metrics and Methodologies for evaluating a dialog
- The use of reference answers makes it impossible to compare systems that use different dialogue strategies for carrying out the same task. This is because the reference-answer approach requires canonical responses (i.e., a single correct answer) to be defined for every user utterance, even though there are potentially many correct answers.
- Interdependencies between metrics are not yet well understood.
- The inability to trade off or combine various metrics and to make generalizations.
(from PARADISE, M. Walker)
25 How Does PARADISE Address these Limitations for Evaluating a Dialog?
- PARADISE supports comparisons among dialogue strategies by providing a task representation that decouples what an agent needs to achieve in terms of the task requirements from how the agent carries out the task via dialogue.
- PARADISE uses methods from decision theory (Keeney & Raiffa, 1976; Doyle, 1992) to combine a disparate set of performance measures (i.e., user satisfaction, task success, and dialogue cost) into a single performance evaluation function (a weighted function).
- Once a performance function has been derived, it can be used both to make predictions about future versions of the agent and as the basis of feedback to the agent, so that the agent can learn to optimize its behavior based on its experiences with users over time.
(from PARADISE, M. Walker)
26Example Dialogue 1 (D1, Agent A)
Figure 2. Dialogue 1 (D1)
Agent A dialogue interaction (from Figure 2 in
PARADISE, M. Walker)
27Example Dialogue 2 (D2, Agent B)
Figure 3. Dialogue 2 (D2) Agent B dialogue
interaction
(from Figure 3 in PARADISE, M. Walker)
28 PARADISE's structure of objectives for spoken dialogue performance
Figure 1. PARADISE's structure of objectives for spoken dialogue performance.
(from Figure 1 in PARADISE, M. Walker)
29Example in Simplified Train Timetable Domain
Our example scenario requires the user to find a
train from Torino to Milano that leaves in the
evening. During the dialogue the agent must
acquire from the user the values of depart-city,
arrival-city, and depart-range, while the user
must acquire depart-time.
TABLE 1. Attribute value matrix,
simplified train timetable domain
(from Figure 1 in PARADISE, M. Walker)
This AVM consists of four attributes, each containing a single value. (from PARADISE, M. Walker)
30Measuring task success (1)
Success at the task for a whole dialogue (or
sub-dialogue) is measured by how well the agent
and user achieve the information requirements of
the task by the end of the dialogue (or
sub-dialogue). PARADISE uses the Kappa
coefficient (Siegel Castellan, 1988 Carletta,
1996) to operationalize the task-based success
measure in Figure 1.
TABLE 2. Attribute value matrix instantiation,
scenario key for Dialogues 1 and 2
(from Table 2 in PARADISE, M. Walker)
31Measuring task success (3)
TABLE 3. Confusion matrix for Agent A
(from Table 3 in PARADISE, M. Walker)
32Measuring task success (4)
TABLE 4. Confusion matrix for Agent B
(from Table 4 in PARADISE, M. Walker)
33Measuring task success (2)
The values in the matrix cells are based on
comparisons between the dialogue and scenario key
AVMs. Whenever an attribute value in a dialogue
(i.e. data) AVM matches the value in its scenario
key, the number in the appropriate diagonal cell
of the matrix (boldface for clarity) is
incremented by 1. The off-diagonal cells
represent misunderstandings that are not
corrected in the dialogue. The time course of
the dialogue and error handling for any
misunderstandings are assumed to be reflected in
the costs associated with the dialogue. The matrices in Tables 3 and 4 summarize how the 100 AVMs representing each dialogue with Agents A and B compare with the AVMs representing the relevant scenario keys. (from PARADISE, M. Walker)
34Measuring task success (5) - Kappa coefficient
(1)
Given a confusion matrix M, success at achieving
the information requirements of the task is
measured with the Kappa coefficient
P(A) is the proportion of times that the AVMs for
the actual set of dialogues agree with the AVMs
for the scenario keys, and P(E) is the proportion
of times that the AVMs for the dialogues and the
keys are expected to agree by chance. P(A), the
actual agreement between the data and the key,
can be computed over all the scenarios from the
confusion matrix M
(from PARADISE, M. Walker)
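Concretely, the actual agreement P(A) referenced above is the sum of the diagonal of the confusion matrix M divided by the total number of judgments T:

```latex
P(A) = \frac{1}{T} \sum_{i=1}^{n} M(i,i)
```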
35Measuring task success (6) - Kappa coefficient (2)
When the prior distribution of the categories is
unknown, P(E), the expected chance agreement
between the data and the key, can be estimated
from the distribution of the values in the keys.
This can be calculated from confusion matrix M,
since the columns represent the values in the
keys. In particular
where ti is the sum of the frequencies in column
i of M, and T is the sum of the frequencies in M
(t1. . .tn).
(from PARADISE, M. Walker)
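With t_i and T as defined above, the chance agreement is:

```latex
P(E) = \sum_{i=1}^{n} \left( \frac{t_i}{T} \right)^{2}
```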
36Measuring task success (7) - Kappa coefficient (3)
For Agent A:
P(A) = (22 + 29 + 16 + 11 + 20 + 22 + 20 + 15 + 45 + 40 + 20 + 19 + 18 + 21) / 400 = 0.795
P(E) = (30/400)^2 + (30/400)^2 + ... + (25/400)^2 + (25/400)^2 = 0.079375
κ = (0.795 - 0.079375) / (1 - 0.079375) = 0.777
For Agent B: P(A) = 0.59, κ = 0.555
This suggests that Agent A is more successful than Agent B in achieving the task goals.
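The calculation above can be reproduced mechanically. Below is a minimal sketch (not part of the original slides) that computes κ from a confusion matrix, assuming rows hold the dialogue (data) values and columns hold the scenario-key values; the 2x2 matrix used here is a toy example, not Walker's data.

```python
def kappa(confusion):
    """Compute the Kappa coefficient from a square confusion matrix.

    confusion[i][j] = number of times the dialogue AVMs held value i
    while the scenario keys held value j.
    """
    n = len(confusion)
    T = sum(sum(row) for row in confusion)                       # total judgments
    p_a = sum(confusion[i][i] for i in range(n)) / T             # observed agreement
    col_sums = [sum(row[j] for row in confusion) for j in range(n)]
    p_e = sum((t / T) ** 2 for t in col_sums)                    # chance agreement from key distribution
    return (p_a - p_e) / (1 - p_e)

# Toy 2x2 example: 90 matches and 10 confusions per value.
toy = [[90, 10],
       [10, 90]]
print(round(kappa(toy), 3))  # 0.8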
37Measuring Dialogue Costs (1)
PARADISE represents each cost measure as a
function ci that can be applied to any
(sub)dialogue. First, consider the simplest case
of calculating efficiency measures over a whole
dialogue. For example, let c1 be the total number
of utterances. For the whole dialogue D1 in
Figure 2, c1 (D1) is 23 utterances. For the whole
dialogue D2 in Figure 3, c1 (D2) is 10
utterances. Tagging by AVM attributes is
required to calculate costs over sub-dialogues,
since for any sub-dialogue task attributes define
the sub-dialogue. For sub-dialogue S4 in Figure
4, which is about the attribute arrival-city and
consists of utterances A6 and U6, c1 (S4) is 2.
(from PARADISE, M. Walker)
38Measuring Dialogue Costs (2)
Figure 4. Task-defined discourse structure of
Agent A dialogue interaction.
(from Figure 4 in PARADISE, M. Walker)
39Measuring Dialogue Costs (3)
Tagging by AVM attributes is also required to
calculate the cost of some of the qualitative
measures, such as number of repair utterances.
For example, let c2 be the number of repair
utterances. The repair utterances for the whole
dialogue D1 in Figure 2 are A3 through U6, thus
c2 (D1) is 10 utterances and c2 (S4) is two
utterances. The repair utterance for the whole
dialogue D2 in Figure 3 is U2, but note that
according to the AVM task tagging, U2
simultaneously addresses the information goals
for depart-range. In general, if an utterance U
contributes to the information goals of N
different attributes, each attribute accounts for
1/N of any costs derivable from U. Thus, c2 (D2)
is 0.5. (from PARADISE, M. Walker)
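To make the 1/N sharing rule concrete, here is a small sketch (an illustration of the rule as stated, not Walker's exact bookkeeping) that attributes a cost measure such as the repair count to AVM attributes, with an utterance tagged with N attributes contributing 1/N of its cost to each:

```python
def cost_per_attribute(costly_utterances):
    """Attribute utterance costs to AVM attributes using the 1/N sharing rule.

    costly_utterances is a list where each element is the list of
    attributes addressed by one costly (e.g., repair) utterance.
    """
    costs = {}
    for tags in costly_utterances:
        share = 1.0 / len(tags)
        for attr in tags:
            costs[attr] = costs.get(attr, 0.0) + share
    return costs

# Hypothetical example: one repair utterance that addresses two attributes,
# so each attribute is charged half of its cost.
repairs = [["depart-time", "depart-range"]]
print(cost_per_attribute(repairs))  # {'depart-time': 0.5, 'depart-range': 0.5}
```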
40Measuring Dialogue Performance (1)
41Measuring Dialogue Performance (2)
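For reference, the overall performance function of a (sub)dialogue, reconstructed here from the PARADISE paper (treat the exact notation as an assumption), combines normalized task success and normalized costs:

```latex
\text{Performance} = \alpha \cdot \mathcal{N}(\kappa) \;-\; \sum_{i=1}^{n} w_i \cdot \mathcal{N}(c_i),
\qquad
\mathcal{N}(x) = \frac{x - \bar{x}}{\sigma_x}
```

where α is the weight on task success, the w_i are the weights on the cost measures c_i, and N is Z-score normalization.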
42Measuring Dialogue Performance (3)
TABLE 5 Hypothetical performance data from users
of Agent A and B.
(from Table 5 in PARADISE, M. Walker)
43Measuring Dialogue Performance (4)
In this illustrative example, the results of the regression with all factors included show that only κ and rep (the number of repair utterances) are significant (p < 0.02). In order to develop a performance function estimate that includes only significant factors and eliminates redundancies, a second regression including only the significant factors must be done. In this case, the second regression yields the predictive equation.
The mean performance of A is -0.44 and the mean performance of B is 0.44, suggesting that Agent B may perform better than Agent A overall. The evaluator must then, however, test these performance differences for statistical significance. In this case, a t-test shows that the differences are only significant at the p < 0.07 level, indicating a trend only. (from PARADISE, M. Walker)
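To make the regression step concrete, here is a minimal sketch (illustrative only; the data and variable names are hypothetical, not the values in Table 5) of estimating the performance-function weights by regressing user satisfaction on the normalized task-success and cost factors:

```python
import numpy as np

def zscore(x):
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

# Hypothetical per-dialogue data: user satisfaction, kappa, and repair count.
user_sat = np.array([5.7, 5.3, 4.4, 4.0, 6.1, 6.4, 3.5, 5.6])
kappa    = np.array([0.9, 0.6, 0.8, 0.5, 0.7, 0.95, 0.65, 0.85])
repairs  = np.array([2,   1,   4,   3,   0,   1,    5,    2])

# Regress user satisfaction on the normalized factors; the fitted coefficients
# play the role of alpha (task success weight) and the cost weight for repairs.
X = np.column_stack([zscore(kappa), zscore(repairs), np.ones(len(user_sat))])
(alpha, w_rep, intercept), *_ = np.linalg.lstsq(X, user_sat, rcond=None)

# Per-dialogue performance; w_rep is expected to be negative for a cost factor.
performance = alpha * zscore(kappa) + w_rep * zscore(repairs)
print(round(alpha, 3), round(w_rep, 3), np.round(performance, 2))
```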
44Experimental Design (1)
Both experiments required every user to complete a set of application tasks (3 tasks with 2 subtasks; 4 tasks) in conversations with a particular version of the agent. Instructions to the users were given on a set of web pages; there was one page for each experimental task. Each web page consisted of a brief general description of the functionality of the agent, a list of hints for talking to the agent, a task description, and information on how to call the agent. (from PARADISE, M. Walker)
45Experimental Design (2)
Each page also contained a form for specifying information acquired from the agent during the dialogue, and a survey, to be filled out after task completion, designed to probe the user's satisfaction with the system. Users read the instructions in their offices before calling the agent from their office phone. All of the dialogues were recorded. The agent's dialogue behavior was logged in terms of entering and exiting each state in the state transition table for the dialogue. Users were required to fill out a web page form after each task. (from PARADISE, M. Walker)
46Examples of survey questions (1)
- Was SYSTEM_NAME easy to understand in this conversation? (text-to-speech (TTS) Performance)
- In this conversation, did SYSTEM_NAME understand what you said? (automatic speech recognition (ASR) Performance)
- In this conversation, was it easy to find the message you wanted? (Task Ease)
- Was the pace of interaction with SYSTEM_NAME appropriate in this conversation? (Interaction Pace)
- How often was SYSTEM_NAME sluggish and slow to reply to you in this conversation? (System Response)
(from PARADISE, M. Walker)
47Examples of survey questions (2)
- Did SYSTEM_NAME work the way you expected him to in this conversation? (Expected Behavior)
- In this conversation, how did SYSTEM_NAME's voice interface compare to the touch-tone interface to voice mail? (Comparable Interface)
- From your current experience with using SYSTEM_NAME to get your e-mail, do you think you'd use SYSTEM_NAME regularly to access your mail when you are away from your desk? (Future Use)
(from PARADISE, M. Walker)
48Summary of Measurements
The User Satisfaction survey score is used as the measure of user satisfaction. Kappa measures actual task success. The measures of System Turns, User Turns, and Elapsed Time (the total time of the interaction) are efficiency cost measures. The qualitative measures are Completed, Barge-Ins (how many times users barged in on agent utterances), Timeout Prompts (the number of timeout prompts that were played), ASR Rejections (the number of times that ASR rejected the user's utterance), Help Requests, and Mean Recognition Score. (from PARADISE, M. Walker)
49Using the performance equation
One potentially broad use of the PARADISE performance function is as feedback to the agent, enabling the agent to learn how to optimize its behavior automatically. The basic idea is to apply the performance function to any dialogue Di in which the agent conversed with a user. Each dialogue then has an associated real-numbered performance value Pi, which represents the performance of the agent for that dialogue. If the agent can make different choices in the dialogue about what to do in various situations, this performance feedback can be used to help the agent learn automatically, over time, which choices are optimal. Learning could be either on-line, so that the agent tracks its behavior on a dialogue-by-dialogue basis, or off-line, where the agent collects a lot of experience and then tries to learn from it. (from PARADISE, M. Walker)
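A toy sketch (not from the paper) of the off-line variant described above: the agent logs a performance value Pi for each dialogue under each dialogue-strategy choice and then prefers the choice with the higher average performance.

```python
from collections import defaultdict

# performance_log[strategy] = list of per-dialogue performance values Pi
performance_log = defaultdict(list)

def record(strategy, performance_value):
    performance_log[strategy].append(performance_value)

def best_strategy():
    """Return the dialogue strategy with the highest mean logged performance."""
    return max(performance_log,
               key=lambda s: sum(performance_log[s]) / len(performance_log[s]))

# Hypothetical logged performance values for two alternative strategies.
for p in (0.31, 0.42, 0.38):
    record("system-initiative", p)
for p in (0.45, 0.51, 0.40):
    record("mixed-initiative", p)
print(best_strategy())  # mixed-initiative
```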
50Overview
Common Evaluation Issues
Evaluating Spoken Dialogue Agents with PARADISE
Some Ideas for Evaluation
Experiments Guidelines and Design
Conclusion
What is initiative ?
Initiative Selection and Experiments
51 How to Evaluate a Mixed-initiative System? Mike Pazzani's caution:
- Don't lose sight of the goal.
- The metrics are just approximations of the goal.
- Optimizing the metric may not optimize the goal.
52 Usability Evaluation - Sweeny et al., 1993 (1)
Three dimensions of usability evaluation:
1. Evaluation approach: user-based, expert-based, and theory-based approaches
2. Type of evaluation: diagnostic methods, summative evaluation, and certification approach
3. Time of evaluation: specification, rapid prototype, high-fidelity prototype, and operational system
53 Usability Evaluation - Sweeny et al., 1993 (2)
Evaluation approach:
1. User-based approaches can utilize:
   a) Performance indicators (e.g., task times or error rates)
   b) Nonverbal behaviors (e.g., eye movements)
   c) Attitudes (e.g., questionnaires)
   d) Cognitive indicators (e.g., verbal protocols)
   e) Stress indicators (e.g., heart rate or electro-encephalogram)
   f) Motivational indicators (e.g., effort)
2. Expert-based approaches (heuristic evaluation; Nielsen & Phillips, 1993) can apply methods indicating conformance to guidelines or design criteria, and expert attitudes (e.g., comments or ratings).
3. Theory-based approaches use idealized performance measures to predict learning or performance times or ease of understanding.
54 Usability Evaluation - Sweeny et al., 1993 (3)
Evaluation by type:
1. Diagnostic methods seek to identify problems with the design and suggest solutions. All three approaches can be used diagnostically, but expert-based and user-based methods work better.
2. Summative evaluation seeks to determine the extent to which the system helps the user complete the desired task.
3. The certification approach is used to determine whether the system meets the required performance criteria for its operational environment.
55 Usability Evaluation - Sweeny et al., 1993 (4)
Time of evaluation:
1. Specification: theory-based and expert-based methods
2. Rapid prototype: expert-based and user-based methods conducted in the laboratory (e.g., usability testing and questionnaires) work best
3. High-fidelity prototype
4. Operational system
In phases 3 and 4, evaluation should be conducted in the field and should utilize user-based techniques.
56 Usability Evaluation - Nielsen & Phillips (1993)
Ten rules for usability inspection:
1. Use simple and natural dialogue
2. Speak the user's language
3. Minimize the user's memory load
4. Be consistent
5. Provide feedback
6. Provide clearly marked exits
7. Provide shortcuts
8. Provide good error messages
9. Prevent errors
10. Provide good online and offline documentation
57 Subjective Usability Evaluation Methods
Nine subjective evaluation methods:
1. Thinking aloud
2. Observation
3. Questionnaires
4. Interviews
5. Focus groups (5-9 participants, led by a moderator who is a member of the evaluation team)
6. User feedback
7. User diaries and log-books
8. Teaching back (having users try to teach others how to use a system)
9. Video taping
58 How to handle Questionnaires (1)
Adelman & Riedel (1997) suggested using the Multi-Attribute Utility Assessment (MAUA) approach:
1. The overall system utility is decomposed into three categories (dimensions): effect on task performance, system usability, and system fit.
2. Each dimension is decomposed into different criteria (e.g., fit with user and fit with organization in the system-fit dimension), and each criterion may be further decomposed into specific attributes.
59 How to handle Questionnaires (2)
3. There are at least two questions for each bottom-level criterion and attribute in the hierarchy.
4. Each dimension is weighted equally, each criterion is weighted equally within its dimension, and so are the attributes within each criterion.
5. A simple arithmetic operation can be used to score and weight the results (see the sketch after this list).
6. Sensitivity analysis can be performed by determining how sensitive the overall utility score is to changes in the relative weights on the criteria and dimensions, or to the system scores on the criteria and attributes.
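A minimal sketch of the equal-weight roll-up described in points 4 and 5 (the dimension and criterion names below are placeholders drawn loosely from the slides, not the full MAUA hierarchy):

```python
def maua_score(node):
    """Roll up questionnaire scores through an equal-weight MAUA hierarchy.

    A node is either a list of question scores (a bottom-level criterion or
    attribute) or a dict of equally weighted child nodes.
    """
    if isinstance(node, dict):
        return sum(maua_score(child) for child in node.values()) / len(node)
    return sum(node) / len(node)  # average the (at least two) question scores

# Placeholder hierarchy with 1-5 questionnaire answers.
hierarchy = {
    "effect on task performance": {"quality": [4, 5], "speed": [3, 4]},
    "system usability":           {"ease of use": [4, 4], "ease of learning": [5, 3]},
    "system fit":                 {"fit with user": [4, 5], "fit with organization": [3, 3]},
}
print(round(maua_score(hierarchy), 2))  # overall system utility, here 3.92
```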
60 Multi-Attribute Utility Assessment (MAUA) Evaluation Hierarchy
[Figure: MAUA evaluation hierarchy, with nodes including Overall System Utility, Process Quality, Fit With Organization, and Person-Machine Functional Allocation]
(From Adelman, L. & Riedel, S.L. (1997). Handbook for Evaluating Knowledge-Based Systems.)
61 Objective Usability Evaluation Methods
Objective data about how well users can actually use a system can be collected by empirical evaluation methods, and this kind of data is the best one can gather to evaluate system usability. Four evaluation methods are proposed:
1. Usability testing
2. Logging activity use (includes system use)
3. Activity analysis
4. Profile examination
Usability testing and logging system use have been identified as effective methods for evaluating a stable prototype (Adelman and Riedel, 1997).
62 Usability Testing (1)
1. Usability testing is the most common empirical evaluation method; it assesses a system's usability on pre-defined objective performance measures.
2. It involves potential system users in a laboratory-like environment. Users are given either test cases or problems to solve after receiving proper training on the prototype.
3. The evaluation team collects objective data on the usability measures while users are performing the test cases or solving the problems:
   - users' individual or group performance data
   - the relative position (e.g., how much time difference or how many times) of the current level of usability against (1) the best possible expected performance level and (2) the worst expected performance level; this is our upper/lower bound baseline.
63 Usability Testing (2)
4. The best possible expected performance level can be obtained by having development team members perform each of the tasks and recording their results (e.g., time).
5. The worst expected performance level is the lowest level of acceptable performance, i.e., the lowest level at which the system could reasonably be expected to be used. We plan to use 1/6 of the best possible expected performance level as our worst expected performance level in our initial study (see the sketch below). This proportion is based on the study in the usability testing handbook (Uehling, 1994).
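A small illustration (a sketch under stated assumptions, not the evaluation plan itself) of placing an observed performance level between the two baseline bounds, with the worst-expected level taken as 1/6 of the best possible level; time-based measures would first be converted to a "higher is better" rate such as tasks per hour.

```python
def usability_position(observed, best_expected, worst_fraction=1.0 / 6.0):
    """Return where an observed performance level falls between the
    worst-expected bound (best * 1/6) and the best-expected bound, on a 0-1 scale."""
    worst_expected = best_expected * worst_fraction
    return (observed - worst_expected) / (best_expected - worst_expected)

# Example: the development team completes 12 tasks/hour (best expected);
# a test user completes 8 tasks/hour.
print(round(usability_position(8.0, 12.0), 2))  # 0.6
```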
64 Logging Activity Use (1)
Some of the important objective measures (system use) provided by Nielsen (1993) are listed below:
- The time users take to complete a specific task
- The number of tasks of various kinds that can be completed within a given time period
- The ratio between successful interactions and errors
- The time spent recovering from errors
- The number of user errors
- The number of system features utilized by users
- The frequency of use of the manuals and/or the help system, and the time spent using them
- The number of times the user expresses clear frustration (or clear joy)
- The proportion of users who say that they would prefer using the system over some specified competitor
- The number of times the user had to work around an unsolvable problem
65 Logging Activity Use (2)
The proposed logging-activity-use method not only collects information related to system features, but also records user behavior and the relationships between tasks/movements (e.g., sequence, preference, pattern, trend). This provides better coverage at both the system and the user level.
66 Activity Analysis and Profile Examination Methods
- The proposed activity analysis method analyzes user behavior (e.g., sequence or preference) while using the system. Together with profile examination, this method provides extremely useful feedback to users, developers, and the agent.
- The profile examination method analyzes consolidated data from user logging files and provides summary-level results. It reviews single-user, group, and all-user profiles and compares behaviors, trends, and progress for the same user; it also provides comparisons among users, groups, and the best possible expected performance.
67Integration of System Prototyping and Evaluation
68Architecture of Disciple Learning Agent Shell
69 Performance Evaluation
1. Experimentation is the most appropriate method for evaluating the performance of the stable prototype (Adelman and Riedel, 1997; Shadbolt et al., 1999).
2. There are two major kinds of experiments: laboratory experiments and field experiments (the latter allow the evaluator to rigorously evaluate the system's effect in its operational environment).
3. Collect both objective data and subjective data for performance evaluation.
70 Compare to baseline behavior?
Measure and compare speed, memory, accuracy, competence, and creativity for solving a class of problems in different settings.
What are some of the settings to consider?
- Human alone
- Agent alone
- MI: Mixed-initiative human-agent system
- Non-mixed-initiative human-agent system
- MI-: Ablated mixed-initiative human-agent system
71 Other complex questions
Consider the setting: human alone (baseline) vs. mixed-initiative human-agent system (MI).
- How to account for human learning during baseline evaluation? Use other humans?
- How to account for human variability? Use many humans? How to pay for the associated cost?
- Replace a human with a simulation? How well does the simulation actually represent a human? Since the simulation is not perfect, how good is the result? How much does a good simulation cost?
72 Important Studies of Performance Evaluation (1)
Several important studies are needed for a thorough performance evaluation:
1) Knowledge-level study
   - Analyzes the agent's overall behavior, knowledge formation rate, knowledge changes (addition, deletion, and modification), and reasoning.
   - Not only analyzes the size changes among KBs during the KB building process, but also examines the real content changes among KBs (e.g., the same rule (name) in different phases of the KBs may cover different examples or knowledge elements).
   - Just comparing the number of rules (or even the names of rules) in different KBs will not give us the whole picture of knowledge changes.
73 Important Studies of Performance Evaluation (2)
2) Ablation study: tests a specific method, or a specific aspect of the knowledge base development methodology.
   - A tool ablation study tests different versions of the tools that have different capabilities under different conditions, with some of the capabilities inhibited.
   - A knowledge ablation study uses incomplete and incorrect knowledge.
   - These experiments will demonstrate the effectiveness of various capabilities (a good example of why design and development behavior and philosophy must change when incorporating evaluation utilities and data-collection functions into the agent: we need to build the agent to be much more flexible).
74 Important Studies of Performance Evaluation (3)
3) Sensitivity (or degradation) study: degrades parameters or data structures of a system and observes the effects on performance.
   - In order to get meaningful results, the changes to parameters should be plausible and not random.
   - For example, presenting only some of the explanations generated by the agent, or, when doing generalization, using different levels/depths/lengths of semantic net links/paths.
   - Clancey and Cooper (Buchanan and Shortliffe, 1984, chapter 10) tested MYCIN's sensitivity to the accuracy of certainty factors (CFs) by introducing inaccuracy and observing the effect on therapy recommendations.
75Overview
Common Evaluation Issues
Evaluating Spoken Dialogue Agents with PARADISE
Some Ideas for Evaluation
Experiments Guidelines and Design
Conclusion
What is initiative ?
Initiative Selection and Experiments
76Experiments Design Concepts
77Experiments Guidelines
- I. Key Concept: thoroughly design our experiments to collect data and to generate timely evaluation results
- II. General Guidelines
  - Involve multiple iterations of laboratory experiments and field experiments
  - Collect subjective and objective data
  - Compare subjective and objective data
  - Evaluate alternative KB development methods (based on the mixed-initiative approach) and compare their results
  - Compare evaluation results with a baseline (or model answer); if the baseline does not exist, we'll create one from the development laboratory or in our first iteration experiment
  - Utilize scoring functions when applicable
  - Handle the sample size issue
  - Include multiple controlled studies
78
- Participant(s): design SME (laboratory experiment), field SMEs (field experiment), and the development team
- Domain(s): any (e.g., military operational domains)
- Studies: knowledge-level, tool ablation, knowledge ablation, sensitivity/degradation, and simulated experts
- Main tasks: import and enhance ontology, knowledge acquisition, knowledge base repairing/fixing, knowledge base extension and modification, and problem solving (e.g., the Course Of Action challenge problem)
- Key methods: (1) use full system functions to perform tasks and record results and activities, (2) use the agent with and without one or some of the (previously identified) functions, (3) use the agent with and without some (pre-defined) knowledge, (4) use various system-level variables to perform tasks (laboratory experiment), and (5) use various usability evaluation methods to perform tasks
79
- Data collection and measurements
  - Build both user and knowledge profiles
  - Performance measures: speed/efficiency, accuracy/quality, prior knowledge reuse, knowledge element changes, size and size change of the knowledge base, KA rate, mixed knowledge contribution, etc.
  - Usability measures: collect both subjective and objective data (including the best expected data and the derived worst expected data)
  - Test against gold, silver, or model answers (e.g., Recall and Precision measures)
  - Create baselines
- Results analysis
  - Automatic results analysis from the evaluation module
  - Further analyses by the evaluator
80 Generic Method for Tool Ablation Experiment
- Identify the critical components that need to be evaluated
- Create various versions (N) of the tool
  - Base version (lower bound, minimum system requirements: B)
  - Complete version (upper bound, full system functionality: C)
  - Intermediate versions (base version plus one or more critical components: I1, I2, ...)
- Organize experts in groups
  - At least two groups (X and Y)
  - A minimum of two experts per group is recommended
- Prepare M sets (1/2 N < M < N) of tasks/questions (M1, M2, ...)
  - Tasks/questions will be designed to be as disjoint as possible, but with a similar level of complexity
- Perform tasks based on combinations such as X-B-M1, X-I1-M3, Y-I2-M2, Y-C-M4 (see the sketch after this list)
- Minimize transfer effects
- Monitor possible order effects
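A tiny sketch (using the placeholder names from the slide) of laying out the group x tool-version x task-set assignments and checking that no group repeats a task set, which helps limit transfer effects:

```python
# Each trial assigns an expert group, a tool version, and a task set.
# Versions: B = base, I1/I2 = intermediate, C = complete.
assignments = [
    ("X", "B",  "M1"),
    ("X", "I1", "M3"),
    ("Y", "I2", "M2"),
    ("Y", "C",  "M4"),
]

def no_repeated_task_sets(trials):
    """Verify that each group sees each task set at most once."""
    seen = set()
    for group, _version, task_set in trials:
        if (group, task_set) in seen:
            return False
        seen.add((group, task_set))
    return True

print(no_repeated_task_sets(assignments))  # True
```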
81 Generic Method for Knowledge Ablation Experiment
- Identify the knowledge elements that need to be removed or modified
- Use at least three versions of the knowledge base (ontology)
  - Complete knowledge version (W)
  - Version without the identified knowledge elements (O)
  - Version with modified knowledge (M)
- Organize experts in groups
  - At least two groups (A and B)
  - A minimum of two experts per group is recommended
- Prepare the same number of task/question sets as versions of KBs (e.g., X, Y, Z)
  - Tasks/questions will be designed to be as disjoint as possible, but with a similar level of complexity
- Perform tasks based on combinations such as A-X-W, A-Y-O, B-X-O, B-Y-W, and A-Z-M or B-Z-M
- Minimize transfer effects
- Monitor possible order effects
82Provides timely feedback on the research and
design
83Provides timely feedback on the research and
design
84
Laboratory Experiments
- Participants: design SME, development team
- Environment: design/development laboratory
- Main tasks: import/modify ontology; KA (create/refine/fix KBs); KB extension and modification; problem solving

Field Experiments
- Participants: non-design SME, developer
- Environment: non-design laboratory (near-operational environment)
- Main tasks: modify ontology; KA (create/refine/fix KBs); problem solving

Usability
- Subjective methods: thinking aloud, observation, questionnaires, interviews, focus group, user feedback, user diaries, teaching back, video taping, design SMEs' feedback, peer review/feedback, developer diaries/implementation notes
- Objective methods: usability testing; log activities; build profiles; collect best expected performance data (upper bound); derive worst expected performance data (lower bound)
- Techniques: activity analysis, profile examination, create baselines

Evaluation Utilities
- Automatic data collection functions
- Automatic data analysis utilities
- Generate/evaluate user knowledge-level profiles
- Provide feedback (dynamic, based on request, or phase-end)
- Utilize scoring functions
- Automatic questionnaire answer collection and analysis
- Integrate KB verification software
- Consolidate evaluation results

Performance
- Subjective methods: questionnaires, user feedback, video taping
- Objective methods: knowledge-level study, tool ablation study, knowledge ablation study, sensitivity/degradation study, expert simulation, build/apply profiles
- Techniques: measure aspects directly related to the system (e.g., knowledge increase rate); multiple iterations and domains; test against gold, silver, or model answers (Recall and Precision measures); score results; create baselines; controlled studies; apply a large effect size (e.g., increase the KA rate by one or two orders of magnitude) to handle the small-sample-size problem; compare results with similar approaches and systems

Existing, enhanced, and new techniques and methods
85 Scoring methods for problem solving activities
Recall = (Score for all Correct or Partly Correct Answers) / (Number of Original Model Answers)
Precision = (Score for all Answers to Questions) / (Number of System Answers Provided)
TotalScore(Q) = Wc * Correctness(Q) + Wj * Justification(Q) + Wi * Intelligibility(Q) + Ws * Sources(Q) + Wp * Proactivity(Q)
where Justification(Q) = Wjp * Present(Q) + Wjs * Soundness(Q) + Wjd * Detail(Q)
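A minimal sketch of the weighted scoring above (the weights and component scores are hypothetical; the slide does not specify their values):

```python
def total_score(correctness, justification_parts, intelligibility, sources, proactivity,
                w=(0.4, 0.3, 0.1, 0.1, 0.1), wj=(0.3, 0.5, 0.2)):
    """TotalScore(Q) = Wc*Correctness + Wj*Justification + Wi*Intelligibility
                       + Ws*Sources + Wp*Proactivity,
    where Justification = Wjp*Present + Wjs*Soundness + Wjd*Detail."""
    wc, wjust, wi, ws, wp = w
    wjp, wjs, wjd = wj
    present, soundness, detail = justification_parts
    justification = wjp * present + wjs * soundness + wjd * detail
    return (wc * correctness + wjust * justification +
            wi * intelligibility + ws * sources + wp * proactivity)

def recall(score_of_correct_answers, num_model_answers):
    return score_of_correct_answers / num_model_answers

def precision(score_of_all_answers, num_system_answers):
    return score_of_all_answers / num_system_answers

# Hypothetical question scored on a 0-1 scale for each component.
print(round(total_score(0.9, (1.0, 0.8, 0.5), 0.7, 1.0, 0.0), 2))  # 0.77
print(recall(7.5, 10), round(precision(7.5, 9), 2))                # 0.75 0.83
```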
86Overview
Common Evaluation Issues
Evaluating Spoken Dialogue Agents with PARADISE
Some Ideas for Evaluation
Experiments Guidelines and Design
Conclusion
What is initiative ?
Initiative Selection and Experiments
87
- Our approach provides some key concepts for effectively and thoroughly evaluating mixed-initiative intelligent systems:
  - Fully utilizing integrated evaluation utilities for continuous (all-time) and automatic evaluation
  - Thoroughly designing laboratory and field experiments (e.g., multiple iterations and handling the sample size issue)
  - Creating baselines and using them for results comparison
  - Applying various kinds of studies
  - Collecting and analyzing subjective and objective data via many methods (e.g., examining profiles, scoring functions, and model answers)
  - Comparing subjective data with objective data
88
- Our approach provides some key concepts for useful and timely feedback to users and developers:
  - Building data collection, data analysis, and evaluation utilities into the agent
  - Generating and evaluating user knowledge data files and profiles
  - Providing three kinds of feedback (dynamic, on request, and phase-end) to users, developers, and the agent