1 CSA4080 Adaptive Hypertext Systems II
Topic 8 - Evaluation Methods
- Dr. Christopher Staff
- Department of Computer Science and AI
- University of Malta
2 Aims and Objectives
- Background to evaluation methods in user-adaptive systems
- Brief overviews of the evaluation of IR, QA, User Modelling, Recommender Systems, Intelligent Tutoring Systems, and Adaptive Hypertext Systems
3 Background to Evaluation Methods
- Systems need to be evaluated to demonstrate (prove) that the hypothesis on which they are based is correct
- In IR, we need to know that the system is retrieving all and only relevant documents for the given query
4 Background to Evaluation Methods
- In QA, we need to know the correct answer to questions, and measure performance
- In User Modelling, we need to determine that the model is an accurate reflection of the information needed to adapt to the user
- In Recommender Systems, we need to associate user preferences either with other similar users, or with product features
5 Background to Evaluation Methods
- In Intelligent Tutoring Systems, we need to know that learning through an ITS is beneficial, or at least not (too) harmful
- In Adaptive Hypertext Systems, we need to measure the system's ability to automatically represent user interests, to direct the user to relevant information, and to present the information in the best way
6 Measuring Performance
- Information Retrieval
  - Recall and Precision (overall, and also at top-n)
- Question Answering
  - Mean Reciprocal Rank
7 Measuring Performance
- User Modelling
  - Precision and Recall: if the user is given all and only relevant info, or if the system behaves exactly as the user needs, then the model is probably correct
  - Accuracy and predicted probability: how well the system predicts a user's actions, location, or goals (see the sketch below)
  - Utility: the benefit derived from using the system
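As a rough illustration of the accuracy measure (not from the original slides), the sketch below scores a user model's predicted next actions against the actions a user actually took; the logged actions are hypothetical.

```python
# Sketch: accuracy of a user model's action predictions against logged actions
# (hypothetical data).
predicted = ["open_help", "search", "save", "search", "quit"]
actual    = ["open_help", "browse", "save", "search", "quit"]

accuracy = sum(p == a for p, a in zip(predicted, actual)) / len(actual)
print(f"Prediction accuracy: {accuracy:.2f}")  # 4 of 5 predictions correct -> 0.80
```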
8 Measuring Performance
- Recommender Systems
  - Content-based RS may be evaluated using precision and recall
  - Collaborative RS is harder to evaluate, because it depends on the other users the system knows about
  - Quality of individual item prediction
  - Precision and Recall at top-n
9 Measuring Performance
- Intelligent Tutoring Systems
  - Ideally, being able to show that a student can learn more efficiently using an ITS than without
  - Usually, show that no harm is done
  - Then, releasing the tutor and enabling self-paced learning becomes a huge advantage
  - Difficult to evaluate
    - Cannot compare the same student with and without an ITS
    - Students who volunteer are usually very motivated
10 Measuring Performance
- Adaptive Hypertext Systems
  - Can mix UM, IR, and RS (content-based) methods of evaluation
  - Use an empirical approach
    - Different sets of users solve the same task, one group with adaptivity, the other without
  - How to choose participants?
11 Evaluation Methods - IR
- IR system performance is normally measured using precision and recall (see the sketch below)
- Precision: the percentage of retrieved documents that are relevant
- Recall: the percentage of relevant documents that are retrieved
- Who decides which documents are relevant?
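A minimal sketch of these two definitions for a single query, using hypothetical document IDs for the retrieved and relevant sets:

```python
# Precision and recall for one query (hypothetical document IDs).
retrieved = {"d1", "d2", "d3", "d4", "d5"}
relevant  = {"d2", "d4", "d6", "d7"}

hits = retrieved & relevant
precision = len(hits) / len(retrieved)   # 2/5 = 0.40
recall    = len(hits) / len(relevant)    # 2/4 = 0.50
print(f"precision={precision:.2f} recall={recall:.2f}")
```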
12 Evaluation Methods - IR
- Query Relevance Judgements
  - For each test query, the document collection is divided into two sets: relevant and non-relevant
  - Systems are compared using precision and recall
  - In early collections, humans would classify documents (p3-cleverdon.pdf)
    - Cranfield collection: 1400 documents / 221 queries
    - CACM: 3204 documents / 50 queries
13 Evaluation Methods - IR
- Do humans always agree on relevance judgements?
  - No, judgements can vary considerably (mizzaro96relevance.pdf)
  - So only use documents on which there is full agreement
14 Evaluation Methods - IR
- Text REtrieval Conference (TREC) (http://trec.nist.gov)
  - Runs competitions every year
  - QRels and document collections are made available in a number of tracks (e.g., ad hoc, routing, question answering, cross-language, interactive, Web, terabyte, ...)
15 Evaluation Methods - IR
- What happens when the collection grows?
  - E.g., the Web track has 1GB of data! A terabyte track is in the pipeline
- Pooling
  - Give different systems the same document collection to index, and the same queries
  - Take the top-n retrieved documents from each
  - Documents that are present in all retrieved sets are relevant, others not, OR
  - Assessors judge the relevance of the unique documents in the pool
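A small sketch of the pooling step (the run data is hypothetical): the top-n documents from each system's ranked list are merged into a pool of unique documents for the assessors.

```python
# Build a judging pool from the top-n results of several system runs
# (hypothetical ranked lists).
runs = {
    "system_A": ["d3", "d1", "d7", "d2", "d9"],
    "system_B": ["d1", "d4", "d3", "d8", "d5"],
    "system_C": ["d2", "d3", "d1", "d6", "d4"],
}
n = 3
pool = set()
for ranked_docs in runs.values():
    pool.update(ranked_docs[:n])   # take the top-n from each run
print(sorted(pool))                # unique documents to be judged
```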
16 Evaluation Methods - IR
- Advantages
  - Possible to compare system performance
  - Relatively cheap
    - QRels and document collection can be purchased for a moderate price rather than organising expensive user trials
  - Can use standard IR systems (e.g., SMART) and build another layer on top, or build a new IR model
  - Automatic and repeatable
17 Evaluation Methods - IR
- Common criticisms
  - Judgements are subjective
    - The same assessor may change judgement at different times!
    - Doesn't affect the relative ranking of systems
  - Judgements are binary
  - Some relevant documents are missed by pooling (QRels are incomplete)
    - Doesn't affect relative system performance
18 Evaluation Methods - IR
- Common criticisms (contd.)
  - Queries are too long
    - Queries under test conditions can have several hundred terms
    - Average Web query length is 2.35 terms (p5-jansen.pdf)
19 Evaluation Methods - IR
- In massive document collections there may be hundreds, thousands, or even millions of relevant documents
- Must all of them be retrieved?
- Measure precision at top-5, 10, 20, 50, 100, 500, and take a weighted average over the results (Mean Average Precision)
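A sketch of precision at top-k and of average precision for a single query, using a hypothetical ranked list; MAP is then the mean of this average precision over all test queries.

```python
# Precision@k and average precision for one query (hypothetical ranking).
ranking  = ["d4", "d9", "d1", "d7", "d2", "d8"]
relevant = {"d1", "d2", "d4"}

def precision_at(k, ranking, relevant):
    return sum(d in relevant for d in ranking[:k]) / k

def average_precision(ranking, relevant):
    hits, total = 0, 0.0
    for rank, d in enumerate(ranking, start=1):
        if d in relevant:
            hits += 1
            total += hits / rank        # precision at each relevant hit
    return total / len(relevant)

print(precision_at(5, ranking, relevant))     # 3/5 = 0.6
print(average_precision(ranking, relevant))   # (1/1 + 2/3 + 3/5) / 3 ≈ 0.76
```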
20 The E-Measure
- Combines Precision and Recall into one number (http://www.dcs.gla.ac.uk/Keith/Chapter.7/Ch.7.html)
- E = 1 - ((1 + b^2) * P * R) / (b^2 * P + R)
  - P: precision; R: recall; b: a measure of the relative importance of P or R
  - E.g., b = 0.5 means the user is twice as interested in precision as in recall
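A few lines computing the E-measure as written above; the precision/recall values are just example numbers.

```python
# E-measure from precision p, recall r, and importance parameter b
# (b < 1 emphasises precision, b > 1 emphasises recall).
def e_measure(p, r, b=1.0):
    return 1 - ((1 + b**2) * p * r) / (b**2 * p + r)

print(round(e_measure(0.40, 0.50, b=0.5), 2))   # ≈ 0.58 for these example values
```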
21 Evaluation Methods - QA
- The aim in Question Answering is not to ensure that the overwhelming majority of relevant documents are retrieved, but to return an accurate answer
- Precision and recall are not accurate enough
- The usual measure is Mean Reciprocal Rank (MRR)
22 Evaluation Methods - QA
- MRR averages the reciprocal rank of the first correct answer over all queries (1/rank, or 0 if the correct answer is not in the top 5)
- Ideally, the first correct answer is at rank 1
qa_report.pdf
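A short sketch of MRR as just defined, assuming that for each question we have recorded the rank of the first correct answer (None when it is not in the top 5); the ranks are hypothetical.

```python
# Mean Reciprocal Rank over a set of questions (hypothetical ranks).
first_correct_ranks = [1, 3, None, 2, 5]   # None: no correct answer in the top 5

reciprocal_ranks = [1.0 / r if r is not None else 0.0 for r in first_correct_ranks]
mrr = sum(reciprocal_ranks) / len(reciprocal_ranks)
print(f"MRR = {mrr:.3f}")   # (1 + 1/3 + 0 + 1/2 + 1/5) / 5 ≈ 0.407
```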
23 Evaluation Methods - UM
- Information Retrieval evaluation has matured to the extent that it is very unusual to find an academic publication without a standard approach to evaluation
- On the other hand, up to 2001, only one-third of the user models presented in UMUAI had been evaluated, and most of those were ITS-related (see later)
p181-chin.pdf
24 Evaluation Methods - UM
- Unlike IR systems, it is difficult to evaluate UMs automatically
  - Unless they are stereotypes / coarse-grained classification systems
- So they tend to need to be evaluated empirically
  - User studies
  - Want to measure how well participants do with and without a UM supporting their task
25 Evaluation Methods - UM
- Difficulties/problems include
  - Ensuring a large enough number of participants to make results statistically meaningful
  - Catering for participants improving during rounds
  - Failure to use a control group
  - Ensuring that nothing happens to modify participants' behaviour (e.g., thinking aloud)
26 Evaluation Methods - UM
- Difficulties/problems (contd.)
  - Biasing the results
  - Not using blind-/double-blind testing when needed
  - ...
27 Evaluation Methods - UM
- Proposed reporting standards
  - Number, source, and relevant background of participants
  - Independent, dependent, and covariant variables
  - Analysis method
  - Post-hoc probabilities
  - Raw data (in the paper, or on-line via the WWW)
  - Effect size and power (at least 0.8)
p181-chin.pdf
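To make the "effect size" item concrete, here is a small sketch computing Cohen's d for two groups of task scores; the data and the choice of Cohen's d as the effect size measure are illustrative assumptions, not taken from Chin's paper.

```python
import numpy as np

# Cohen's d effect size for two groups of task scores (hypothetical data).
with_um    = np.array([7.2, 8.1, 6.9, 7.8, 8.4, 7.5])
without_um = np.array([6.1, 6.8, 5.9, 7.0, 6.4, 6.6])

pooled_sd = np.sqrt((with_um.var(ddof=1) + without_um.var(ddof=1)) / 2)
cohens_d = (with_um.mean() - without_um.mean()) / pooled_sd
print(f"Cohen's d = {cohens_d:.2f}")   # a power analysis would then check that power >= 0.8
```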
28 Evaluation Methods - RS
- Recommender Systems
- Two types of recommender system
  - Content-based
  - Collaborative
- Both (tend to) use the VSM to plot users / product features into an n-dimensional space
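A sketch of that VSM idea: a user profile and items represented as vectors over product features, with cosine similarity used to rank candidate recommendations (the features and weights are made up).

```python
import numpy as np

# Rank items for a user by cosine similarity in a feature vector space
# (hypothetical feature weights).
user_profile = np.array([0.9, 0.1, 0.6, 0.0])
items = {
    "item_1": np.array([1.0, 0.0, 0.5, 0.0]),
    "item_2": np.array([0.0, 1.0, 0.0, 0.8]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

for name, vec in sorted(items.items(), key=lambda kv: -cosine(user_profile, kv[1])):
    print(name, round(cosine(user_profile, vec), 3))
```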
29 Evaluation Methods - RS
- If we know the correct recommendations to make to a user with a specific profile, then we can use Precision, Recall, E-measure, F-measure, Mean Average Precision, MRR, etc.
30 Evaluation Methods - ITS
- Intelligent Tutoring Systems
  - Evaluation to demonstrate that learning through an ITS is at least as effective as traditional learning
  - Cost benefit of freeing up the tutor, and permitting self-paced learning
  - Show, at a minimum, that the student is not harmed at all or is only minimally harmed
31 Evaluation Methods - ITS
- Difficult to prove that an individual student learns better/the same/worse with an ITS than without
  - Cannot make a student unlearn material in between experiments!
- Attempt to use a statistically significant number of students, to show a probable overall effect
32 Evaluation Methods - ITS
- Usually suffers from the same problems as evaluating UMs and ubiquitous multimedia systems
- Students volunteer to evaluate ITSs
  - So they are more likely to be motivated, and so to perform better
  - The novelty of the system is also a motivator
- Too many variables that are difficult to cater for
33 Evaluation Methods - ITS
- However, an empirical evaluation is usually performed
  - Volunteers work with the system
  - Pass rates, retention rates, etc., may be compared to a conventional learning environment (quantitative analysis)
  - Volunteers are asked for feedback about, e.g., usability (qualitative analysis)
34 Evaluation Methods - ITS
- Frequently, students are split into groups (control and test) and their performance is measured against each other
- The control is usually the ITS without the "I": students must find their own way through the learning material
- However, this is difficult to assess, because the performance of the control group may be worse than with traditional learning!
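One common way to compare the two groups' scores is an independent two-sample t-test; the sketch below uses hypothetical post-test marks and is only one of several analyses that could be reported.

```python
from scipy import stats

# Compare test (ITS) and control group post-test marks (hypothetical data).
test_group    = [68, 74, 71, 80, 77, 69, 73]
control_group = [64, 70, 66, 72, 68, 65, 71]

t_stat, p_value = stats.ttest_ind(test_group, control_group)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")   # a small p suggests a real difference in means
```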
35 Evaluation Methods - ITS
- Learner achievement metric (Muntean, 2004)
  - How much has the student learnt from the ITS?
  - Compare pre-learning knowledge to post-learning knowledge
  - Can compare different systems (as long as they use the same learning material), but with different users, so the same problem as before
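As an illustration of the pre/post comparison (a generic normalised gain, not necessarily Muntean's exact formulation), with hypothetical test scores:

```python
# Pre/post learning gain for one student (hypothetical scores out of 100).
pre_score, post_score, max_score = 45, 72, 100

raw_gain = post_score - pre_score
normalised_gain = raw_gain / (max_score - pre_score)   # fraction of possible improvement achieved
print(raw_gain, round(normalised_gain, 2))              # 27, 0.49
```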
36 Evaluation Methods - AHS
- Adaptive Hypertext Systems
  - There are currently no standard metrics for evaluating AHSs
  - Best practices are taken from fields like ITS, IR, and UM and applied to AHS
  - A typical evaluation compares users' experiences of using the system with and without its adaptive features
37 Evaluation Methods - AHS
- If a test collection existed for AHS (like TREC), what might it look like?
  - Descriptions of user models; relevance judgements for relevant links, relevant documents, relevant presentation styles
  - Would we need a standard open user model description? Are all user models capturing the same information about the user?
38 Evaluation Methods - AHS
- What about following paths through hyperspace to pre-specified points and then having the sets of judgements?
- Currently, adaptive hypertext systems appear to be performing very different tasks, but even if we take just one of the two things that can be adapted (e.g., links), it appears to be beyond our current ability to agree on how adapting links should be evaluated, mainly due to the UM!
39 Evaluation Methods - AHS
- HyperContext (HCT) (HCTCh8.pdf)
  - HCT builds a short-term user model as the user navigates through hyperspace
  - We evaluated HCT's ability to make "See Also" recommendations
  - Ideally, we would have had a hyperspace with independent relevance judgements at particular points in the path of traversal
40 Evaluation Methods - AHS
- Instead, we used two mechanisms for deriving the UM (one using the interpretation, the other using the whole document)
- After 5 link traversals we automatically generated a query from each user model, submitted it to a search engine, and found a relevant interpretation/document respectively
41 Evaluation Methods - AHS
- Users were asked to read all the documents in the path and then give a relevance judgement for each "See Also" recommendation
- Recommendations were shown in random order
  - Users didn't know which was the HCT recommendation and which was not
- We assumed that if the user considered a document to be relevant, then the UM is accurate
42 Evaluation Methods - AHS
- Not really enough participants to make strong claims about the HCT approach to AH
- No really significant differences in RJs between the different ways of deriving the UM (although both performed reasonably well!)
- However, significant findings if reading time is an indication of skim-/deep-reading!
43 Evaluation Methods - AHS
- Should users have been shown both documents?
  - Could reading two documents, instead of just one, have affected the judgement of the document read second?
  - Were users disaffected because it wasn't a task that they needed to perform?
44 Evaluation Methods - AHS
- Ideally, systems are tested in real-world conditions in which the evaluators are performing tasks
- Normally, experimental set-ups require users to perform artificial tasks, and it is difficult to measure performance because relevance is subjective!
45 Evaluation Methods - AHS
- This is one of the criticisms of the TREC collections, but it does allow systems to be compared, even if the story is completely different once the system is in real use
- Building a robust enough system for use in the real world is expensive
  - But then, so is conducting lab-based experiments
46 Modular Evaluation of AUIs
- Adaptive User Interfaces, or User-Adaptive Systems
- Difficult to evaluate monolithic systems
  - So break up UASs into modules that can be evaluated separately
47 Modular Evaluation of AUIs
- Paramythis et al. recommend
  - identifying the evaluation objects - the parts that can be evaluated separately and in combination
  - presenting the evaluation purpose - the rationale for the modules and the criteria for their evaluation
  - identifying the evaluation process - methods and techniques for evaluating modules during the AUI life cycle
paramythis.pdf
48 Modular Evaluation of AUIs
49 Modular Evaluation of AUIs