1 CSA4080 Adaptive Hypertext Systems II
Topic 8 - Evaluation Methods
- Dr. Christopher Staff
- Department of Computer Science and AI
- University of Malta
2 Aims and Objectives
- Background to evaluation methods in user-adaptive systems
- Brief overviews of the evaluation of IR, QA, User Modelling, Recommender Systems, Intelligent Tutoring Systems, and Adaptive Hypertext Systems
3 Background to Evaluation Methods
- Systems need to be evaluated to demonstrate (prove) that the hypothesis on which they are based is correct
- In IR, we need to know that the system is retrieving all and only relevant documents for the given query
4 Background to Evaluation Methods
- In QA, we need to know the correct answer to questions, and measure performance
- In User Modelling, we need to determine that the model is an accurate reflection of the information needed to adapt to the user
- In Recommender Systems, we need to associate user preferences either with other similar users, or with product features
5 Background to Evaluation Methods
- In Intelligent Tutoring Systems, we need to know that learning through an ITS is beneficial, or at least not (too) harmful
- In Adaptive Hypertext Systems, we need to measure the system's ability to automatically represent user interests, to direct the user to relevant information, and to present the information in the best way
6 Measuring Performance
- Information Retrieval
  - Recall and Precision (overall, and also at top-n)
- Question Answering
  - Mean Reciprocal Rank
7 Measuring Performance
- User Modelling
  - Precision and Recall: if the user is given all and only relevant info, or if the system behaves exactly as the user needs, then the model is probably correct
  - Accuracy and predicted probability: how well the system predicts a user's actions, location, or goals (see the sketch below)
  - Utility: the benefit derived from using the system
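As a rough illustration of the accuracy measure (not from the original slides), the sketch below scores a user model's predicted next actions against the actions a user actually took; the logged actions are hypothetical.

```python
# Sketch: accuracy of a user model's action predictions against logged actions
# (hypothetical data).
predicted = ["open_help", "search", "save", "search", "quit"]
actual    = ["open_help", "browse", "save", "search", "quit"]

accuracy = sum(p == a for p, a in zip(predicted, actual)) / len(actual)
print(f"Prediction accuracy: {accuracy:.2f}")  # 4 of 5 predictions correct -> 0.80
```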
8 Measuring Performance
- Recommender Systems
  - Content-based RS may be evaluated using precision and recall
  - Collaborative RS is harder to evaluate, because it depends on the other users the system knows about
  - Quality of individual item prediction
  - Precision and Recall at top-n
9 Measuring Performance
- Intelligent Tutoring Systems
  - Ideally, being able to show that a student can learn more efficiently using an ITS than without
  - Usually, show that no harm is done
  - Then, releasing the tutor and enabling self-paced learning becomes a huge advantage
  - Difficult to evaluate
    - Cannot compare the same student with and without an ITS
    - Students who volunteer are usually very motivated
10 Measuring Performance
- Adaptive Hypertext Systems
  - Can mix UM, IR, and RS (content-based) methods of evaluation
  - Use an empirical approach
    - Different sets of users solve the same task, one group with adaptivity, the other without
  - How to choose participants?
11 Evaluation Methods - IR
- IR system performance is normally measured using precision and recall (see the sketch below)
- Precision: the percentage of retrieved documents that are relevant
- Recall: the percentage of relevant documents that are retrieved
- Who decides which documents are relevant?
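A minimal sketch of these two definitions for a single query, using hypothetical document IDs for the retrieved and relevant sets:

```python
# Precision and recall for one query (hypothetical document IDs).
retrieved = {"d1", "d2", "d3", "d4", "d5"}
relevant  = {"d2", "d4", "d6", "d7"}

hits = retrieved & relevant
precision = len(hits) / len(retrieved)   # 2/5 = 0.40
recall    = len(hits) / len(relevant)    # 2/4 = 0.50
print(f"precision={precision:.2f} recall={recall:.2f}")
```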
12 Evaluation Methods - IR
- Query Relevance Judgements
  - For each test query, the document collection is divided into two sets: relevant and non-relevant
  - Systems are compared using precision and recall
  - In early collections, humans would classify documents (p3-cleverdon.pdf)
    - Cranfield collection: 1400 documents / 221 queries
    - CACM: 3204 documents / 50 queries
13 Evaluation Methods - IR
- Do humans always agree on relevance judgements?
  - No, judgements can vary considerably (mizzaro96relevance.pdf)
  - So only use documents on which there is full agreement
14 Evaluation Methods - IR
- Text REtrieval Conference (TREC) (http://trec.nist.gov)
  - Runs competitions every year
  - QRels and document collections are made available in a number of tracks (e.g., ad hoc, routing, question answering, cross-language, interactive, Web, terabyte, ...)
15 Evaluation Methods - IR
- What happens when the collection grows?
  - E.g., the Web track has 1GB of data! A terabyte track is in the pipeline
- Pooling
  - Give different systems the same document collection to index, and the same queries
  - Take the top-n retrieved documents from each
  - Documents that are present in all retrieved sets are relevant, others not, OR
  - Assessors judge the relevance of the unique documents in the pool
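A small sketch of the pooling step (the run data is hypothetical): the top-n documents from each system's ranked list are merged into a pool of unique documents for the assessors.

```python
# Build a judging pool from the top-n results of several system runs
# (hypothetical ranked lists).
runs = {
    "system_A": ["d3", "d1", "d7", "d2", "d9"],
    "system_B": ["d1", "d4", "d3", "d8", "d5"],
    "system_C": ["d2", "d3", "d1", "d6", "d4"],
}
n = 3
pool = set()
for ranked_docs in runs.values():
    pool.update(ranked_docs[:n])   # take the top-n from each run
print(sorted(pool))                # unique documents to be judged
```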
16 Evaluation Methods - IR
- Advantages
  - Possible to compare system performance
  - Relatively cheap
    - QRels and document collection can be purchased for a moderate price rather than organising expensive user trials
  - Can use standard IR systems (e.g., SMART) and build another layer on top, or build a new IR model
  - Automatic and repeatable
17 Evaluation Methods - IR
- Common criticisms
  - Judgements are subjective
    - The same assessor may change judgement at different times!
    - Doesn't affect the relative ranking of systems
  - Judgements are binary
  - Some relevant documents are missed by pooling (QRels are incomplete)
    - Doesn't affect relative system performance
18 Evaluation Methods - IR
- Common criticisms (contd.)
  - Queries are too long
    - Queries under test conditions can have several hundred terms
    - Average Web query length is 2.35 terms (p5-jansen.pdf)
19 Evaluation Methods - IR
- In massive document collections there may be hundreds, thousands, or even millions of relevant documents
- Must all of them be retrieved?
- Measure precision at top-5, 10, 20, 50, 100, 500, and take a weighted average over the results (Mean Average Precision)
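A sketch of precision at top-k and of average precision for a single query, using a hypothetical ranked list; MAP is then the mean of this average precision over all test queries.

```python
# Precision@k and average precision for one query (hypothetical ranking).
ranking  = ["d4", "d9", "d1", "d7", "d2", "d8"]
relevant = {"d1", "d2", "d4"}

def precision_at(k, ranking, relevant):
    return sum(d in relevant for d in ranking[:k]) / k

def average_precision(ranking, relevant):
    hits, total = 0, 0.0
    for rank, d in enumerate(ranking, start=1):
        if d in relevant:
            hits += 1
            total += hits / rank        # precision at each relevant hit
    return total / len(relevant)

print(precision_at(5, ranking, relevant))     # 3/5 = 0.6
print(average_precision(ranking, relevant))   # (1/1 + 2/3 + 3/5) / 3 ≈ 0.76
```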
20 The E-Measure
- Combines Precision and Recall into one number (http://www.dcs.gla.ac.uk/Keith/Chapter.7/Ch.7.html)
- E = 1 - ((1 + b^2) * P * R) / (b^2 * P + R)
  - P: precision; R: recall; b: a measure of the relative importance of P or R
  - E.g., b = 0.5 means the user is twice as interested in precision as in recall
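A few lines computing the E-measure as written above; the precision/recall values are just example numbers.

```python
# E-measure from precision p, recall r, and importance parameter b
# (b < 1 emphasises precision, b > 1 emphasises recall).
def e_measure(p, r, b=1.0):
    return 1 - ((1 + b**2) * p * r) / (b**2 * p + r)

print(round(e_measure(0.40, 0.50, b=0.5), 2))   # ≈ 0.58 for these example values
```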
21 Evaluation Methods - QA
- The aim in Question Answering is not to ensure that the overwhelming majority of relevant documents are retrieved, but to return an accurate answer
- Precision and recall are not accurate enough
- The usual measure is Mean Reciprocal Rank (MRR)
22 Evaluation Methods - QA
- MRR averages the reciprocal rank of the first correct answer over all queries (1/rank, or 0 if the correct answer is not in the top 5)
- Ideally, the first correct answer is at rank 1
qa_report.pdf
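A short sketch of MRR as just defined, assuming that for each question we have recorded the rank of the first correct answer (None when it is not in the top 5); the ranks are hypothetical.

```python
# Mean Reciprocal Rank over a set of questions (hypothetical ranks).
first_correct_ranks = [1, 3, None, 2, 5]   # None: no correct answer in the top 5

reciprocal_ranks = [1.0 / r if r is not None else 0.0 for r in first_correct_ranks]
mrr = sum(reciprocal_ranks) / len(reciprocal_ranks)
print(f"MRR = {mrr:.3f}")   # (1 + 1/3 + 0 + 1/2 + 1/5) / 5 ≈ 0.407
```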
23 Evaluation Methods - UM
- Information Retrieval evaluation has matured to the extent that it is very unusual to find an academic publication without a standard approach to evaluation
- On the other hand, up to 2001, only one-third of the user models presented in UMUAI had been evaluated, and most of those were ITS-related (see later)
p181-chin.pdf
24 Evaluation Methods - UM
- Unlike IR systems, it is difficult to evaluate UMs automatically
  - Unless they are stereotypes / coarse-grained classification systems
- So they tend to need to be evaluated empirically
  - User studies
  - Want to measure how well participants do with and without a UM supporting their task
25 Evaluation Methods - UM
- Difficulties/problems include
  - Ensuring a large enough number of participants to make results statistically meaningful
  - Catering for participants improving during rounds
  - Failure to use a control group
  - Ensuring that nothing happens to modify participants' behaviour (e.g., thinking aloud)
26 Evaluation Methods - UM
- Difficulties/problems (contd.)
  - Biasing the results
  - Not using blind-/double-blind testing when needed
  - ...
27 Evaluation Methods - UM
- Proposed reporting standards
  - Number, source, and relevant background of participants
  - Independent, dependent, and covariant variables
  - Analysis method
  - Post-hoc probabilities
  - Raw data (in the paper, or on-line via the WWW)
  - Effect size and power (at least 0.8)
p181-chin.pdf
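To make the "effect size" item concrete, here is a small sketch computing Cohen's d for two groups of task scores; the data and the choice of Cohen's d as the effect size measure are illustrative assumptions, not taken from Chin's paper.

```python
import numpy as np

# Cohen's d effect size for two groups of task scores (hypothetical data).
with_um    = np.array([7.2, 8.1, 6.9, 7.8, 8.4, 7.5])
without_um = np.array([6.1, 6.8, 5.9, 7.0, 6.4, 6.6])

pooled_sd = np.sqrt((with_um.var(ddof=1) + without_um.var(ddof=1)) / 2)
cohens_d = (with_um.mean() - without_um.mean()) / pooled_sd
print(f"Cohen's d = {cohens_d:.2f}")   # a power analysis would then check that power >= 0.8
```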
28 Evaluation Methods - RS
- Recommender Systems
- Two types of recommender system
  - Content-based
  - Collaborative
- Both (tend to) use the VSM to plot users / product features into an n-dimensional space
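A sketch of that VSM idea: a user profile and items represented as vectors over product features, with cosine similarity used to rank candidate recommendations (the features and weights are made up).

```python
import numpy as np

# Rank items for a user by cosine similarity in a feature vector space
# (hypothetical feature weights).
user_profile = np.array([0.9, 0.1, 0.6, 0.0])
items = {
    "item_1": np.array([1.0, 0.0, 0.5, 0.0]),
    "item_2": np.array([0.0, 1.0, 0.0, 0.8]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

for name, vec in sorted(items.items(), key=lambda kv: -cosine(user_profile, kv[1])):
    print(name, round(cosine(user_profile, vec), 3))
```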
29 Evaluation Methods - RS
- If we know the correct recommendations to make to a user with a specific profile, then we can use Precision, Recall, E-measure, F-measure, Mean Average Precision, MRR, etc.
30 Evaluation Methods - ITS
- Intelligent Tutoring Systems
  - Evaluation to demonstrate that learning through an ITS is at least as effective as traditional learning
  - Cost benefit of freeing up the tutor, and permitting self-paced learning
  - Show, at a minimum, that the student is not harmed at all or is only minimally harmed
31 Evaluation Methods - ITS
- Difficult to prove that an individual student learns better/the same/worse with an ITS than without
  - Cannot make a student unlearn material in between experiments!
- Attempt to use a statistically significant number of students, to show a probable overall effect
32 Evaluation Methods - ITS
- Usually suffers from the same problems as evaluating UMs and ubiquitous multimedia systems
- Students volunteer to evaluate ITSs
  - So they are more likely to be motivated, and so to perform better
  - The novelty of the system is also a motivator
- Too many variables that are difficult to cater for
33 Evaluation Methods - ITS
- However, an empirical evaluation is usually performed
  - Volunteers work with the system
  - Pass rates, retention rates, etc., may be compared to a conventional learning environment (quantitative analysis)
  - Volunteers are asked for feedback about, e.g., usability (qualitative analysis)
34 Evaluation Methods - ITS
- Frequently, students are split into groups (control and test) and their performance is measured against each other
- The control is usually the ITS without the "I": students must find their own way through the learning material
- However, this is difficult to assess, because the performance of the control group may be worse than with traditional learning!
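One common way to compare the two groups' scores is an independent two-sample t-test; the sketch below uses hypothetical post-test marks and is only one of several analyses that could be reported.

```python
from scipy import stats

# Compare test (ITS) and control group post-test marks (hypothetical data).
test_group    = [68, 74, 71, 80, 77, 69, 73]
control_group = [64, 70, 66, 72, 68, 65, 71]

t_stat, p_value = stats.ttest_ind(test_group, control_group)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")   # a small p suggests a real difference in means
```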
35 Evaluation Methods - ITS
- Learner achievement metric (Muntean, 2004)
  - How much has the student learnt from the ITS?
  - Compare pre-learning knowledge to post-learning knowledge
  - Can compare different systems (as long as they use the same learning material), but with different users, so the same problem as before
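As an illustration of the pre/post comparison (a generic normalised gain, not necessarily Muntean's exact formulation), with hypothetical test scores:

```python
# Pre/post learning gain for one student (hypothetical scores out of 100).
pre_score, post_score, max_score = 45, 72, 100

raw_gain = post_score - pre_score
normalised_gain = raw_gain / (max_score - pre_score)   # fraction of possible improvement achieved
print(raw_gain, round(normalised_gain, 2))              # 27, 0.49
```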
36 Evaluation Methods - AHS
- Adaptive Hypertext Systems
  - There are currently no standard metrics for evaluating AHSs
  - Best practices are taken from fields like ITS, IR, and UM and applied to AHS
  - A typical evaluation compares users' experiences of using the system with and without its adaptive features
37 Evaluation Methods - AHS
- If a test collection existed for AHS (like TREC), what might it look like?
  - Descriptions of user models; relevance judgements for relevant links, relevant documents, relevant presentation styles
  - Would we need a standard open user model description? Are all user models capturing the same information about the user?
38 Evaluation Methods - AHS
- What about following paths through hyperspace to pre-specified points and then having the sets of judgements?
- Currently, adaptive hypertext systems appear to be performing very different tasks, but even if we take just one of the two things that can be adapted (e.g., links), it appears to be beyond our current ability to agree on how adapting links should be evaluated, mainly due to the UM!
39 Evaluation Methods - AHS
- HyperContext (HCT) (HCTCh8.pdf)
  - HCT builds a short-term user model as the user navigates through hyperspace
  - We evaluated HCT's ability to make "See Also" recommendations
  - Ideally, we would have had a hyperspace with independent relevance judgements at particular points in the path of traversal
40 Evaluation Methods - AHS
- Instead, we used two mechanisms for deriving the UM (one using the interpretation, the other using the whole document)
- After 5 link traversals we automatically generated a query from each user model, submitted it to a search engine, and found a relevant interpretation/document respectively
41 Evaluation Methods - AHS
- Users were asked to read all the documents in the path and then give a relevance judgement for each "See Also" recommendation
- Recommendations were shown in random order
  - Users didn't know which was the HCT recommendation and which was not
- We assumed that if the user considered a document to be relevant, then the UM is accurate
42 Evaluation Methods - AHS
- Not really enough participants to make strong claims about the HCT approach to AH
- No really significant differences in RJs between the different ways of deriving the UM (although both performed reasonably well!)
- However, significant findings if reading time is an indication of skim-/deep-reading!
43 Evaluation Methods - AHS
- Should users have been shown both documents?
  - Could reading two documents, instead of just one, have affected the judgement of the document read second?
  - Were users disaffected because it wasn't a task that they needed to perform?
44 Evaluation Methods - AHS
- Ideally, systems are tested in real-world conditions in which the evaluators are performing tasks
- Normally, experimental set-ups require users to perform artificial tasks, and it is difficult to measure performance because relevance is subjective!
45 Evaluation Methods - AHS
- This is one of the criticisms of the TREC collections, but it does allow systems to be compared, even if the story is completely different once the system is in real use
- Building a robust enough system for use in the real world is expensive
  - But then, so is conducting lab-based experiments
46 Modular Evaluation of AUIs
- Adaptive User Interfaces, or User-Adaptive Systems
- Difficult to evaluate monolithic systems
  - So break up UASs into modules that can be evaluated separately
47 Modular Evaluation of AUIs
- Paramythis et al. recommend
  - identifying the evaluation objects - the parts that can be evaluated separately and in combination
  - presenting the evaluation purpose - the rationale for the modules and the criteria for their evaluation
  - identifying the evaluation process - methods and techniques for evaluating modules during the AUI life cycle
paramythis.pdf
48 Modular Evaluation of AUIs
49 Modular Evaluation of AUIs