1
CSA4080 Adaptive Hypertext Systems II
Topic 8: Evaluation Methods
  • Dr. Christopher Staff
  • Department of Computer Science and AI
  • University of Malta

2
Aims and Objectives
  • Background to evaluation methods in user-adaptive
    systems
  • Brief overviews of the evaluation of IR, QA, User
    Modelling, Recommender Systems, Intelligent
    Tutoring Systems, Adaptive Hypertext Systems

3
Background to Evaluation Methods
  • Systems need to be evaluated to demonstrate
    (prove) that the hypothesis on which they are
    based is correct
  • In IR, we need to know that the system is
    retrieving all and only relevant documents for
    the given query

4
Background to Evaluation Methods
  • In QA, we need to know the correct answer to
    questions, and measure performance
  • In User Modelling, we need to determine that the
    model is an accurate reflection of information
    needed to adapt to the user
  • In Recommender Systems, we need to associate user
    preferences either with other similar users, or
    with product features

5
Background to Evaluation Methods
  • In Intelligent Tutoring Systems we need to know
    that learning through an ITS is beneficial or at
    least not (too) harmful
  • In Adaptive Hypertext Systems, we need to measure
    the system's ability to automatically represent
    user interests, to direct the user to relevant
    information, and to present the information in
    the best way

6
Measuring Performance
  • Information Retrieval
  • Recall and Precision (overall, and also at top-n)
  • Question Answering
  • Mean Reciprocal Rank

7
Measuring Performance
  • User Modelling
  • Precision and Recall: if the user is given all
    and only relevant info, or if the system behaves
    exactly as the user needs, then the model is
    probably correct
  • Accuracy and predicted probability: to predict a
    user's actions, location, or goals
  • Utility: the benefit derived from using the system

8
Measuring Performance
  • Recommender Systems
  • Content-based: may be evaluated using precision
    and recall
  • Collaborative: harder to evaluate, because it
    depends on the other users the system knows about
  • Quality of individual item prediction
  • Precision and Recall at top-n

9
Measuring Performance
  • Intelligent Tutoring Systems
  • Ideally, being able to show that student can
    learn more efficiently using ITS than without
  • Usually, show that no harm is done
  • Then, releasing the tutor and enabling
    self-paced learning becomes a huge advantage
  • Difficult to evaluate
  • Cannot compare same student with and without ITS
  • Students who volunteer are usually very motivated

10
Measuring Performance
  • Adaptive Hypertext Systems
  • Can mix UM, IR, RS (content-based) methods of
    evaluation
  • Use empirical approach
  • Different sets of users solve same task, one
    group with adaptivity, the other without
  • How to choose participants?

11
Evaluation Methods IR
  • IR system performance is normally measured
    using precision and recall
  • Precision: the percentage of retrieved documents
    that are relevant
  • Recall: the percentage of relevant documents that
    are retrieved (see the sketch below)
  • Who decides which documents are relevant?
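A minimal sketch of how these two measures might be computed for a single
query, assuming the relevance judgements are available as a set of document
ids; the function name and document ids below are illustrative only:

```python
def precision_recall(retrieved, relevant):
    """Precision and recall for a single query.

    retrieved: ranked list of document ids returned by the system
    relevant:  set of document ids judged relevant (the QRels)
    """
    retrieved = list(retrieved)
    relevant = set(relevant)
    hits = sum(1 for doc in retrieved if doc in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# 3 of the 4 retrieved documents are relevant, but only 3 of the
# 6 relevant documents were retrieved.
p, r = precision_recall(["d1", "d2", "d3", "d4"],
                        {"d1", "d2", "d3", "d7", "d8", "d9"})
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.75 recall=0.50
```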

12
Evaluation Methods IR
  • Query Relevance Judgements
  • For each test query, the document collection is
    divided into two sets: relevant and non-relevant
  • Systems are compared using precision and recall
  • In early collections, humans would classify
    documents (p3-cleverdon.pdf)
  • Cranfield collection: 1,400 documents / 221 queries
  • CACM: 3,204 documents / 50 queries

13
Evaluation Methods IR
  • Do humans always agree on relevance judgements?
  • No: judgements can vary considerably
    (mizzaro96relevance.pdf)
  • So only use documents on which there is full
    agreement

14
Evaluation Methods IR
  • Text REtrieval Conference (TREC)
    (http://trec.nist.gov)
  • Runs competitions every year
  • QRels and document collection made available in a
    number of tracks (e.g., ad hoc, routing, question
    answering, cross-language, interactive, Web,
    terabyte, ...)

15
Evaluation Methods IR
  • What happens when the collection grows?
  • E.g., the Web track has 1GB of data! Terabyte
    track in the pipeline
  • Pooling (see the sketch below)
  • Give different systems the same document
    collection to index and the same queries
  • Take the top-n retrieved documents from each
  • Documents that are present in all retrieved sets
    are judged relevant, the others not, OR
  • Assessors judge the relevance of the unique
    documents in the pool
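A minimal sketch of the pooling step, assuming the second variant above
(assessors judge the unique documents in the pool); the run names, document
ids, and pool depth are made up for illustration:

```python
def build_pool(runs, n=100):
    """Form the assessment pool from several systems' ranked results.

    runs: dict mapping a system/run name to its ranked list of document ids
    n:    pool depth (top-n documents taken from each run)
    """
    pool = set()
    for ranked_docs in runs.values():
        pool.update(ranked_docs[:n])
    return pool  # the unique documents that assessors will judge

runs = {
    "systemA": ["d1", "d2", "d3"],
    "systemB": ["d2", "d4", "d1"],
}
print(sorted(build_pool(runs, n=2)))  # ['d1', 'd2', 'd4']
```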

16
Evaluation Methods IR
  • Advantages
  • Possible to compare system performance
  • Relatively cheap
  • QRels and document collection can be purchased
    for a moderate price rather than organising
    expensive user trials
  • Can use standard IR systems (e.g., SMART) and
    build another layer on top, or build new IR model
  • Automatic and Repeatable

17
Evaluation Methods IR
  • Common criticisms
  • Judgements are subjective
  • The same assessor may change judgement at
    different times!
  • Doesn't affect the relative ranking of systems
  • Judgements are binary
  • Some relevant documents are missed by pooling
    (QRels are incomplete)
  • Doesn't affect comparative system performance

18
Evaluation Methods IR
  • Common criticisms (contd.)
  • Queries are too long
  • Queries under test conditions can have several
    hundred terms
  • Average Web query length: 2.35 terms
    (p5-jansen.pdf)

19
Evaluation Methods IR
  • In massive document collections there may be
    hundreds, thousands, or even millions of relevant
    documents
  • Must all of them be retrieved?
  • Measure precision at top-5, 10, 20, 50, 100, 500,
    or average the precision at each relevant document
    retrieved and take the mean over queries (Mean
    Average Precision) - see the sketch below
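A sketch of precision at a cut-off and of average precision as it is usually
defined (precision taken at each relevant retrieved document, averaged over
the relevant set), with MAP as the mean over queries; the names and toy runs
are illustrative:

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k

def average_precision(retrieved, relevant):
    """Average of precision@rank taken at each relevant document retrieved."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

# Mean Average Precision: the mean of average_precision over all test queries.
queries = {
    "q1": (["d1", "d9", "d2"], {"d1", "d2"}),
    "q2": (["d5", "d6"], {"d6"}),
}
mean_ap = sum(average_precision(run, rel)
              for run, rel in queries.values()) / len(queries)
print(round(mean_ap, 3))  # 0.667
```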

20
The E-Measure
  • Combine Precision and Recall into one number
    (http://www.dcs.gla.ac.uk/Keith/Chapter.7/Ch.7.html)

E = 1 - ((1 + b^2) * P * R) / (b^2 * P + R)

P = precision, R = recall, b = measure of the relative
importance of P or R. E.g., b = 0.5 means the user is
twice as interested in precision as recall.
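A small sketch of the E-measure as reconstructed above; the function name is
illustrative:

```python
def e_measure(precision, recall, b=1.0):
    """van Rijsbergen's E-measure: 0 is best, 1 is worst.

    b < 1 weights precision more heavily; b > 1 weights recall more heavily.
    """
    if precision == 0 or recall == 0:
        return 1.0
    return 1.0 - ((1 + b**2) * precision * recall) / (b**2 * precision + recall)

# b = 0.5: the user is twice as interested in precision as recall.
print(round(e_measure(0.75, 0.50, b=0.5), 3))  # 0.318
```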
21
Evaluation Methods QA
  • The aim in Question Answering is not to ensure
    that the overwhelming majority of relevant
    documents are retrieved, but to return an
    accurate answer
  • Precision and recall are not appropriate measures
    here
  • Usual measure is Mean Reciprocal Rank

22
Evaluation Methods QA
  • MRR is the mean over queries of the reciprocal
    rank of the first correct answer (1/rank, or 0 if
    no correct answer appears in the top-5) - see the
    sketch below
  • Ideally, the first correct answer is placed at
    rank 1

qa_report.pdf
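A minimal sketch of MRR with the top-5 cut-off used in the TREC QA track; the
data structures and names are illustrative:

```python
def mean_reciprocal_rank(ranked_answers, correct, cutoff=5):
    """Mean Reciprocal Rank over a set of questions.

    ranked_answers: dict mapping a question to its ranked candidate answers
    correct:        dict mapping a question to the set of acceptable answers
    Each question scores 1/rank of its first correct answer, or 0 if none
    appears within the cutoff.
    """
    scores = []
    for q, answers in ranked_answers.items():
        score = 0.0
        for rank, ans in enumerate(answers[:cutoff], start=1):
            if ans in correct[q]:
                score = 1.0 / rank
                break
        scores.append(score)
    return sum(scores) / len(scores) if scores else 0.0

# First question answered at rank 2, second at rank 1: MRR = (0.5 + 1.0) / 2
print(mean_reciprocal_rank(
    {"q1": ["wrong", "right", "also right"], "q2": ["yes"]},
    {"q1": {"right", "also right"}, "q2": {"yes"}}))  # 0.75
```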
23
Evaluation Methods UM
  • Information Retrieval evaluation has matured to
    the extent that it is very unusual to find an
    academic publication without a standard approach
    to evaluation
  • On the other hand, up to 2001, only one-third of
    user models presented in UMUAI had been
    evaluated and most of those were ITS related
    (see later)

p181-chin.pdf
24
Evaluation Methods UM
  • Unlike IR systems, it is difficult to evaluate
    UMs automatically
  • Unless they are stereotypes / coarse-grained
    classification systems
  • So they tend to need to be evaluated empirically
  • User studies
  • Want to measure how well participants do with and
    without a UM supporting their task

25
Evaluation Methods UM
  • Difficulties/problems include
  • Ensuring a large enough number of participants to
    make results statistically meaningful
  • Catering for participants improving during rounds
  • Failure to use a control group
  • Ensuring that nothing happens to modify
    participants' behaviour (e.g., thinking aloud)

26
Evaluation Methods UM
  • Difficulties/problems (contd.)
  • Biasing the results
  • Not using blind-/double-blind testing when needed
  • ...

27
Evaluation Methods UM
  • Proposed reporting standards
  • No., source, and relevant background of
    participants
  • independent, dependent and covariant variables
  • analysis method
  • post-hoc probabilities
  • raw data (in the paper, or on-line via WWW)
  • effect size and power (at least 0.8) - see the
    sketch below

p181-chin.pdf
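As a rough illustration of the power requirement, a sketch of a
normal-approximation sample-size calculation for a two-group comparison; the
two-sided z-test and the use of Cohen's d are assumptions added for
illustration, not something the reporting standards above prescribe:

```python
from scipy.stats import norm

def n_per_group(effect_size, alpha=0.05, power=0.8):
    """Approximate participants needed per group for a two-sided z-test,
    with the effect size given as Cohen's d (normal approximation only;
    a real study would use an exact power analysis)."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_power = norm.ppf(power)
    return 2 * ((z_alpha + z_power) / effect_size) ** 2

# A "medium" effect (d = 0.5) at power 0.8 needs roughly 63 participants per group.
print(round(n_per_group(0.5)))  # 63
```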
28
Evaluation Methods RS
  • Recommender Systems
  • Two types of recommender system
  • Content-based
  • Collaborative
  • Both (tend to) use the VSM to plot users/product
    features in an n-dimensional space

29
Evaluation Methods RS
  • If we know the correct recommendations to make
    to a user with a specific profile, then we can
    use Precision, Recall, E-measure, F-measure, Mean
    Average Precision, MRR, etc. (see the sketch below)
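A small sketch of the F-measure (the complement of the E-measure, F = 1 - E)
as it might be applied to a recommendation list judged against known-good
items; the function name is illustrative:

```python
def f_measure(precision, recall, b=1.0):
    """Weighted harmonic mean of precision and recall (F = 1 - E).
    b = 1 gives the familiar F1 score."""
    if precision == 0 or recall == 0:
        return 0.0
    return ((1 + b**2) * precision * recall) / (b**2 * precision + recall)

# Recommendations for one profile: precision 0.75, recall 0.50.
print(round(f_measure(0.75, 0.50), 3))  # F1 = 0.6
```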

30
Evaluation Methods ITS
  • Intelligent Tutoring Systems
  • Evaluation to demonstrate that learning through
    ITS is at least as effective as traditional
    learning
  • Cost benefit of freeing up tutor, and permitting
    self-paced learning
  • Show at a minimum that student is not harmed at
    all or is minimally harmed

31
Evaluation Methods ITS
  • Difficult to prove that individual student
    learns better/same/worse with ITS than without
  • Cannot make student unlearn material in between
    experiments!
  • Attempt to use statistically significant number
    of students, to show probable overall effect

32
Evaluation Methods ITS
  • Usually suffers from same problems as evaluating
    UMs, and ubiquitous multimedia systems
  • Students volunteer to evaluate ITSs
  • So are more likely to be motivated and so perform
    better
  • Novelty of system is also a motivator
  • Too many variables that are difficult to cater for

33
Evaluation Methods ITS
  • However, usually empirical evaluation is
    performed
  • Volunteers work with system
  • Pass rates, retention rates, etc., may be
    compared to conventional learning environment
    (quantitative analysis)
  • Volunteers asked for feedback about, e.g.,
    usability (qualitative analysis)

34
Evaluation Methods ITS
  • Frequently, students are split into groups
    (control and test) and performance measured
    against each other
  • Control is usually ITS without the I - students
    must find their own way through learning material
  • However, this is difficult to assess, because
    performance of control group may be worse than
    traditional learning!

35
Evaluation Methods ITS
  • Learner achievement metric (Muntean, 2004)
  • How much has the student learnt from the ITS?
  • Compare pre-learning knowledge to post-learning
    knowledge (see the sketch below)
  • Can compare different systems (as long as they
    use the same learning material), but with
    different users, so the same problem as before
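A minimal sketch of one way a pre-/post-test comparison could be scored; the
normalised-gain variant is an assumption added for illustration, not
necessarily Muntean's metric:

```python
def learning_gain(pre_score, post_score, max_score):
    """Simple pre/post comparison for one learner.

    Returns the absolute gain and a normalised gain (improvement as a
    fraction of the improvement that was still possible before learning).
    """
    absolute = post_score - pre_score
    possible = max_score - pre_score
    normalised = absolute / possible if possible > 0 else 0.0
    return absolute, normalised

# Pre-test 40/100, post-test 70/100: gain of 30 points, normalised gain 0.5.
print(learning_gain(40, 70, 100))  # (30, 0.5)
```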

36
Evaluation Methods AHS
  • Adaptive Hypertext Systems
  • There are currently no standard metrics for
    evaluating AHSs
  • Best practices are taken from fields like ITS,
    IR, and UM and applied to AHS
  • A typical evaluation compares the experience of
    using the system with and without its adaptive
    features

37
Evaluation Methods AHS
  • If a test collection existed for AHS (like TREC)
    what might it look like?
  • Descriptions of user models; relevance
    judgements for relevant links, relevant
    documents, relevant presentation styles
  • Would we need a standard open user model
    description? Are all user models capturing the
    same information about the user?

38
Evaluation Methods AHS
  • What about following paths through hyperspace to
    pre-specified points and then having sets of
    relevance judgements at those points?
  • Currently, adaptive hypertext systems appear to
    be performing very different tasks, but even if
    we take just one of the two things that can be
    adapted (e.g., links), it appears to be beyond
    our current ability to agree on how adapting
    links should be evaluated, mainly due to UM!

39
Evaluation Methods AHS
  • HyperContext (HCT) (HCTCh8.pdf)
  • HCT builds a short-term user model as a user
    navigates through hyperspace
  • We evaluated HCT's ability to make "See Also"
    recommendations
  • Ideally, we would have had a hyperspace with
    independent relevance judgements at particular
    points in a path of traversal

40
Evaluation Methods AHS
  • Instead, we used two mechanisms for deriving the
    UM (one using the interpretation, the other using
    the whole document)
  • After 5 link traversals we automatically
    generated a query from each user model, submitted
    it to a search engine, and found a relevant
    interpretation/document respectively (see the
    sketch below)
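A hypothetical sketch of the general idea of turning an accumulated user
model into a query; this is illustrative only and is not HCT's actual
algorithm:

```python
def query_from_user_model(term_weights, n_terms=10):
    """Turn a short-term user model (term -> weight, accumulated over the
    documents/interpretations visited so far) into a query made of its
    highest-weighted terms."""
    ranked = sorted(term_weights.items(), key=lambda kv: kv[1], reverse=True)
    return " ".join(term for term, _ in ranked[:n_terms])

# Toy user model accumulated over five link traversals:
um = {"adaptive": 0.9, "hypertext": 0.8, "evaluation": 0.4, "malta": 0.1}
print(query_from_user_model(um, n_terms=3))  # adaptive hypertext evaluation
```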

41
Evaluation Methods AHS
  • Users asked to read all documents in the path and
    then give relevance judgement for each See Also
    recommendation
  • Recommendations shown in random order
  • Users didn't know which was the HCT
    recommendation and which was not
  • Assumed that if the user considered the doc to be
    relevant, then the UM is accurate

42
Evaluation Methods AHS
  • Not really enough participants to make strong
    claims about the HCT approach to AH
  • No really significant differences in RJs between
    the different ways of deriving the UM (although
    both performed reasonably well!)
  • However, significant findings if reading time is
    an indication of skim-/deep-reading!

43
Evaluation Methods AHS
  • Should users have been shown both documents?
  • Could reading two documents, instead of just one,
    have affected the judgement of the doc read
    second?
  • Were users disaffected because it wasn't a task
    that they needed to perform?

44
Evaluation Methods AHS
  • Ideally, systems are tested in real world
    conditions in which evaluators are performing
    tasks
  • Normally, experimental set-ups require users to
    perform artificial tasks, and it is difficult to
    measure performance because relevance is
    subjective!

45
Evaluation Methods AHS
  • This is one of the criticisms of the TREC
    collections, but it does allow systems to be
    compared - even if the story is completely
    different once the system is in real use
  • Building a robust enough system for use in the
    real world is expensive
  • But then, so is conducting lab-based experiments

46
Modular Evaluation of AUIs
  • Adaptive User Interfaces, or User-Adaptive
    Systems
  • Difficult to evaluate monolithic systems
  • So break up UASs into modules that can be
    evaluated separately

47
Modular Evaluation of AUIs
  • Paramythis et al. recommend
  • identifying the evaluation objects - that can
    be evaluated separately and in combination
  • presenting the evaluation purpose - the
    rationale for the modules and criteria for their
    evaluation
  • identifying the evaluation process - methods
    and techniques for evaluating modules during the
    AUI life cycle

paramythis.pdf
48
Modular Evaluation of AUIs
49
Modular Evaluation of AUIs