CS276A Text Information Retrieval, Mining, and Exploitation (Transcript and Presenter's Notes)
1
CS276A Text Information Retrieval, Mining, and Exploitation
  • Lecture 10
  • 7 Nov 2002

2
Information Access in Context
(Flowchart: the user analyzes a high-level goal and synthesizes results; if the goal is not yet met, the cycle repeats; once done, stop.)
3
Exercise
  • Observe your own information seeking behavior
  • WWW
  • University library
  • Grocery store
  • Are you a searcher or a browser?
  • How do you reformulate your query?
  • Read bad hits, then remove (minus) terms
  • Read good hits, then add (plus) terms
  • Try a completely different query

4
Correction: Address Field vs. Search Box
  • Are users who type URLs into the search box ignorant?
  • .com / .org / .net / international URLs
  • cnn.com vs. www.cnn.com
  • Full URL with protocol qualifier vs. partial URL

5
Today's Topics
  • Information design and visualization
  • Evaluation measures and test collections
  • Evaluation of interactive information retrieval
  • Evaluation gotchas

6
Information Visualization and Exploration
  • Tufte
  • Shneiderman
  • Information foraging: Xerox PARC / PARC Inc.

7
Edward Tufte
  • Information design bible: The Visual Display of Quantitative Information
  • The art and science of how to display
    (quantitative) information visually
  • Significant influence on User Interface design

8
The Challenger Accident
  • On January 28, 1986, the space shuttle Challenger exploded shortly after takeoff.
  • Seven crew members died.
  • One of the causes: an O-ring failed due to cold temperatures.
  • How could this happen?

9
How O-Rings were presented
  • Time scale is shown instead of temperature
    scale!
  • Needless junk (rockets don't show information)
  • Graphic does not help answer the question: why do O-rings fail?

10
Tufte Principles for Information Design
  • Omit needless junk 
  • Show what you mean 
  • Don't obscure the meaning and order of scales 
  • Make comparisons of related images possible 
  • Claim authorship, and think twice when others
    don't 
  • Seek truth 

11
Tufte's O-Ring Visualization
12
Tufte Summary
  • Like poor writing, bad graphical displays
    distort or obscure the data, make it harder to
    understand or compare, or otherwise thwart the
    communicative effect which the graph should
    convey.
  • Bad decisions are made based on bad information
    design.
  • Tufte's influence on UI design
  • Examples of the best and worst in information visualization: http://www.math.yorku.ca/SCS/Gallery/noframes.html

13
Shneiderman: Information Visualization
  • How to design user interfaces
  • How to engineer user interfaces for software
  • Task by type taxonomy

14
Shneiderman on HCI
  • Well-designed interactive computer systems
  • Promote positive feelings of success, competence, and mastery
  • Allow users to concentrate on their work, rather than on the system

Marti Hearst
15
Task by Type Taxonomy: Data Types
  • 1-D linear: SeeSoft
  • 2-D map: multidimensional scaling (terms, docs, etc.)
  • 3-D world: Cat-a-Cone
  • Multi-dim: Table Lens
  • Temporal: topic detection
  • Tree: hierarchies a la Yahoo
  • Network: network graphs of sites (KartOO)

16
Task by Type Taxonomy: Tasks
  • Overview: gain an overview of the entire collection
  • Zoom: zoom in on items of interest
  • Filter: filter out uninteresting items
  • Details-on-demand: select an item or group and get details when needed
  • Relate: view relationships among items
  • History: keep a history of actions to support undo and replay
  • Extract: allow extraction of subcollections and the query parameters

17
Exercise
  • If your project has a UI component
  • Which data types are being displayed?
  • Which tasks are you supporting?

18
Xerox PARC Information Foraging
  • Metaphor from ecology/biology
  • People looking for information are like animals foraging for food
  • Predictive model that allows principled way of
    designing user interfaces
  • The main focus is
  • What will the user do next?
  • How can we support a good choice for the next
    action?
  • Rather than
  • Evaluation of a single user-system interaction

19
Foraging Paradigm
Energy
Food Foraging: Biological, behavioral, and cultural designs are adaptive to the extent they optimize the rate of energy intake.
George Robertson, Microsoft
20
Information Foraging Paradigm
Information
Information Foraging: Information access and visualization technologies are adaptive to the extent they optimize the rate of gain of valuable information.
George Robertson, Microsoft
21
Searching Patches
George Robertson, Microsoft
22
Information Foraging Theory
  • G = information/food gained
  • g = average gain per doc/patch
  • TB = total time between docs/patches
  • tb = average time between docs/patches
  • TW = total time within docs/patches
  • tw = average time to process a doc/patch
  • lambda = 1/tb = prevalence of information/food

23
Information Foraging Theory
  • R = G / (TB + TW) = rate of gain
  • R = lambda TB g / (TB + lambda TB tw)
  • R = lambda g / (1 + lambda tw)
  • Goodness measure of UI: R = rate of gain
  • Optimize UI by increasing R
  • Increase prevalence lambda (asymptotic improvement)
  • Decrease tw (time it takes to absorb doc/food)
  • Better: model different types of docs/patches
  • Model can be used to find optimal UI parameters (see the sketch below)
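A minimal sketch (not from the original slides) of the rate-of-gain formula as code; the parameter names mirror the symbols defined above, and the example values are made up.

    def rate_of_gain(lam, g, tw):
        # Information foraging rate of gain: R = lam * g / (1 + lam * tw)
        # lam : prevalence of useful docs/patches (1 / average time between them)
        # g   : average gain per doc/patch
        # tw  : average time to process a doc/patch
        return lam * g / (1 + lam * tw)

    # A UI change that halves processing time tw raises R directly,
    # while raising prevalence lam only gives an asymptotic improvement.
    print(rate_of_gain(lam=0.2, g=1.0, tw=3.0))   # baseline
    print(rate_of_gain(lam=0.2, g=1.0, tw=1.5))   # faster-to-digest documents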

24
Cost-of-Knowledge Characteristic Function
  • Improve productivity: less time or more output

Card, Pirolli, and Mackinlay
25
Creating Test Collections for IR Evaluation
26
Test Corpora
27
Kappa Measure
  • Kappa measures
  • Agreement among coders
  • Designed for categorical judgments
  • Corrects for chance agreement
  • Kappa = [P(A) - P(E)] / [1 - P(E)]
  • P(A) = proportion of the time coders agree
  • P(E) = what agreement would be by chance
  • Kappa = 0 for chance agreement, 1 for total agreement.

28
Kappa Measure Example
P(A)? P(E)?
29
Kappa Example
  • P(A) = 370/400 = 0.925
  • P(nonrelevant) = (10 + 20 + 70 + 70)/800 = 0.2125
  • P(relevant) = (10 + 20 + 300 + 300)/800 = 0.7875
  • P(E) = 0.2125^2 + 0.7875^2 = 0.665
  • Kappa = (0.925 - 0.665)/(1 - 0.665) = 0.776 (see the code sketch below)
  • For > 2 judges: average pairwise kappas
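A minimal code sketch (not from the slides) of the two-judge kappa computation, using the contingency counts implied by the worked example above.

    def kappa(n11, n10, n01, n00):
        # Cohen's kappa for two judges making binary relevance judgments.
        # n11: both say relevant, n00: both say nonrelevant,
        # n10 / n01: the two kinds of disagreement.
        n = n11 + n10 + n01 + n00
        p_agree = (n11 + n00) / n                    # P(A)
        p_rel = (2 * n11 + n10 + n01) / (2 * n)      # pooled marginal P(relevant)
        p_nonrel = (2 * n00 + n10 + n01) / (2 * n)   # pooled marginal P(nonrelevant)
        p_chance = p_rel ** 2 + p_nonrel ** 2        # P(E)
        return (p_agree - p_chance) / (1 - p_chance)

    # 300 agree-relevant, 70 agree-nonrelevant, 20 + 10 disagreements, 400 docs total.
    print(round(kappa(300, 20, 10, 70), 3))   # -> 0.776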

30
Kappa Measure
  • Kappa > 0.8: good agreement
  • 0.67 < Kappa < 0.8: tentative conclusions (Carletta '96)
  • Depends on purpose of study

31
Interjudge Disagreement: TREC-3
32
(No Transcript)
33
Impact of Interjudge Disagreement
  • Impact on absolute performance measure can be
    significant (0.32 vs 0.39)
  • Little impact on ranking of different systems or
    relative performance

34
Evaluation Measures
35
Recap: Precision/Recall
  • Evaluation of ranked results
  • You can return any number of results ordered by similarity
  • By taking various numbers of documents (levels of recall), you can produce a precision-recall curve
  • Precision = |correct ∩ retrieved| / |retrieved|
  • Recall = |correct ∩ retrieved| / |correct|
  • The truth, the whole truth, and nothing but the truth: Recall = 1.0 is the whole truth; precision = 1.0 is nothing but the truth.
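A small sketch (not from the slides; the ranking and judgments are invented) of producing the precision-recall points behind such a curve from a ranked result list:

    def precision_recall_points(ranked_docs, relevant):
        # One (recall, precision) point per cutoff position in the ranking.
        relevant = set(relevant)
        points, hits = [], 0
        for k, doc in enumerate(ranked_docs, start=1):
            if doc in relevant:
                hits += 1
            points.append((hits / len(relevant), hits / k))   # (recall, precision)
        return points

    # Hypothetical ranking with relevant documents d1, d3, d7.
    print(precision_recall_points(["d1", "d2", "d3", "d4", "d7"],
                                  ["d1", "d3", "d7"]))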

36
Recap: Precision-recall curves
37
F Measure
  • F measure is the harmonic mean of precision and
    recall (strictly speaking F1)
  • 1/F ½ (1/P 1/R)
  • Use F measure if you need to optimize a single
    measure that balances precision and recall.
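Equivalently, F1 = 2PR / (P + R); a one-function sketch (not from the slides):

    def f1(precision, recall):
        # Harmonic mean of precision and recall: 1/F = 0.5 * (1/P + 1/R).
        if precision + recall == 0:
            return 0.0
        return 2 * precision * recall / (precision + recall)

    print(f1(0.5, 0.8))   # ~0.615; the harmonic mean is pulled toward the smaller value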

38
F-Measure
(Chart: F1 reaches its maximum, F1(0.956) = 0.96.)
39
Breakeven Point
  • Breakeven point is the point where precision
    equals recall.
  • Alternative single measure of IR effectiveness.
  • How do you compute it?
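One way to compute it (a sketch, not prescribed by the slides): precision equals recall exactly when the number of documents retrieved equals the total number of relevant documents, so the breakeven point can be read off as precision at that cutoff.

    def breakeven_point(ranked_docs, relevant):
        # Precision = recall when #retrieved == #relevant; report precision there.
        relevant = set(relevant)
        cutoff = len(relevant)
        hits = sum(1 for doc in ranked_docs[:cutoff] if doc in relevant)
        return hits / cutoff

    # Hypothetical ranking; 2 of the top 3 docs are relevant -> breakeven 0.67.
    print(breakeven_point(["d1", "d5", "d3", "d4", "d7"], ["d1", "d3", "d7"]))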

40
Area under the ROC Curve
  • True positive rate = recall = sensitivity
  • False positive rate = fp/(tn + fp). Related to precision: fpr = 0 <-> precision = 1
  • Why is the blue line worthless?
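A minimal sketch (not from the slides) of tracing ROC points down a ranked list and estimating the area under the curve with the trapezoid rule; the example labels are made up.

    def roc_points(ranked_labels):
        # (fpr, tpr) points obtained by sweeping the cutoff down a ranked list;
        # ranked_labels holds 1 for relevant, 0 for nonrelevant, in ranked order.
        pos = sum(ranked_labels)
        neg = len(ranked_labels) - pos
        tp = fp = 0
        points = [(0.0, 0.0)]
        for label in ranked_labels:
            if label:
                tp += 1
            else:
                fp += 1
            points.append((fp / neg, tp / pos))
        return points

    def auc(points):
        # Area under the ROC curve via the trapezoid rule.
        return sum((x2 - x1) * (y1 + y2) / 2
                   for (x1, y1), (x2, y2) in zip(points, points[1:]))

    print(auc(roc_points([1, 1, 0, 1, 0, 0])))   # 1.0 is perfect, 0.5 is chance level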

41
Precision-Recall Graph vs. ROC
42
Unit of Evaluation
  • We can compute precision, recall, F, and ROC
    curve for different units.
  • Possible units
  • Documents (most common)
  • Facts (used in some TREC evaluations)
  • Entities (e.g., car companies)
  • May produce different results. Why?

43
Critique of Pure Reason Relevance
  • Relevance vs Marginal Relevance
  • A document can be redundant even if it is highly
    relevant
  • Duplicates
  • The same information from different sources
  • Marginal relevance is a better measure of utility
    for the user.
  • Using facts/entities as evaluation units more
    directly measures true relevance.
  • But harder to create evaluation set
  • See Carbonell reference

44
Evaluation of Interactive Information Retrieval
45
Evaluating Interactive IR
  • Evaluating interactive IR poses special
    challenges
  • Obtaining experimental data is more expensive
  • Experiments involving humans require careful
    design.
  • Control for confounding variables
  • Questionnaire to collect relevant subject data
  • Ensure that experimental setup is close to
    intended real world scenario
  • Approval for human subjects research

46
IIR Evaluation Case Study 1
  • TREC-6 interactive TREC report
  • 9 participating groups (US, Europe, Australia)
  • Control system (simple IR system)
  • Each group ran their system and the control
    system
  • 4 users at each site
  • 6 queries (= topics)
  • Goal of evaluation: find the best-performing system
  • Why do you need a control system for comparing groups?

47
Queries (= Topics)
48
Latin Square Design
49
Analysis of Variance
50
Analysis of Variance
51
Analysis of Variance
52
Observations
  • Query effect is largest (largest std for each site)
  • High degree of query variability
  • Searcher effect negligible for 4 out of 10 sites
  • Best model:
  • Interactions are small compared to overall error.
  • None of the 10 sites was statistically better than the control system!

53
IIR Evaluation Case Study 2
  • Evaluation of relevance feedback
  • Koenemann & Belkin 1996

54
Why Evaluate Relevance Feedback?
55
Questions Being Investigated (Koenemann & Belkin 96)
  • How well do users work with statistical ranking
    on full text?
  • Does relevance feedback improve results?
  • Is user control over operation of relevance
    feedback helpful?
  • How do different levels of user control affect results?

Credit Marti Hearst
56
How much of the guts should the user see?
  • Opaque (black box)
  • (like web search engines)
  • Transparent
  • (see available terms after the r.f. )
  • Penetrable
  • (see suggested terms before the r.f.)
  • Which do you think worked best?

Credit Marti Hearst
57
Credit Marti Hearst
58
Terms available for relevance feedback made visible (from Koenemann & Belkin)
Credit Marti Hearst
59
Details on User Study (Koenemann & Belkin 96)
  • Subjects have a tutorial session to learn the system
  • Their goal is to keep modifying the query until they've developed one that gets high precision
  • This is an example of a routing query (as opposed to ad hoc)
  • Reweighting
  • They did not reweight query terms
  • Instead, only term expansion
  • pool all terms in relevant docs
  • take top n terms, where
  • n = 3 + (number-marked-relevant-docs × 2)
  • (the more marked docs, the more terms added to the query; see the sketch below)
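A minimal sketch (my reconstruction, not code from the study) of that term-expansion rule; the whitespace tokenization and frequency-based term ranking are assumptions.

    from collections import Counter

    def expand_query(query_terms, marked_relevant_docs):
        # Pool all terms from the marked-relevant docs and add the top n new terms,
        # where n = 3 + 2 * (number of marked-relevant docs), per the slide above.
        n = 3 + 2 * len(marked_relevant_docs)
        pool = Counter()
        for doc in marked_relevant_docs:
            pool.update(doc.lower().split())   # assumed: simple whitespace tokenization
        new_terms = [t for t, _ in pool.most_common() if t not in query_terms]
        return list(query_terms) + new_terms[:n]

    # Hypothetical usage: two marked documents allow up to 7 expansion terms.
    print(expand_query(["automobile", "recall"],
                       ["ford recalls automobile models over brake defect",
                        "automobile recall announced for faulty airbags"]))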

Credit Marti Hearst
60
Details on User Study (Koenemann & Belkin 96)
  • 64 novice searchers
  • 43 female, 21 male, native English
  • TREC test bed
  • Wall Street Journal subset
  • Two search topics
  • Automobile Recalls
  • Tobacco Advertising and the Young
  • Relevance judgements from TREC and experimenter
  • System was INQUERY (vector space with some bells
    and whistles)

Credit Marti Hearst
61
Sample TREC query
Credit Marti Hearst
62
Evaluation
  • Precision at 30 documents
  • Baseline (Trial 1)
  • How well does initial search go?
  • One topic has more relevant docs than the other
  • Experimental condition (Trial 2)
  • Subjects get tutorial on relevance feedback
  • Modify query in one of four modes
  • no r.f., opaque, transparent, penetration

Credit Marti Hearst
63
Precision vs. RF condition (from Koenemann & Belkin 96)
Can we conclude from this chart that RF is better?
Credit Marti Hearst
64
Effectiveness Results
  • Subjects with R.F. did 17-34% better than those with no R.F.
  • Subjects in the penetration case did 15% better as a group than those in the opaque and transparent cases.

Credit Marti Hearst
65
Number of iterations in formulating queries (from Koenemann & Belkin 96)
Credit Marti Hearst
66
Number of terms in created queries (from Koenemann & Belkin 96)
Credit Marti Hearst
67
Behavior Results
  • Search times approximately equal
  • Precision increased in first few iterations
  • Penetration case required fewer iterations to
    make a good query than transparent and opaque
  • R.F. queries much longer
  • but fewer terms in penetrable case -- users were
    more selective about which terms were added in.

Credit Marti Hearst
68
Evaluation Gotchas
  • No statistical test (!)
  • Lots of pairwise tests
  • Wrong evaluation measure
  • Query variability
  • Unintentionally biased evaluation

69
Gotchas Evaluation Measures
  • KDD Cup 2002
  • Optimize model parameter: balance factor
  • Area under the ROC curve and BEP have different behaviors
  • These two measures intuitively measure the same property.

70
Gotchas: Query Variability
  • Eichmann et al. claim that for their approach to CLIR, French is harder than Spanish.
  • French: average precision 0.149
  • Spanish: average precision 0.173

71
Gotchas: Query Variability
  • Queries with Spanish > baseline: 14
  • Queries with Spanish = baseline: 40
  • Queries with Spanish < baseline: 53
  • Queries with French > baseline: 20
  • Queries with French = baseline: 22
  • Queries with French < baseline: 64

72
Gotchas: Biased Evaluation
  • Compare two IR algorithms
  • 1. send query, present results
  • 2. send query, cluster results, present clusters
  • Experiment was simulated (no users)
  • Results were clustered into 5 clusters
  • Clusters were ranked according to their percentage of relevant documents
  • Documents within clusters were ranked according to similarity to the query (see the sketch below)
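A minimal sketch (my reconstruction of the simulated setup, not the original code; clustering itself is abstracted as a precomputed assignment) that makes the bias visible: the cluster ordering consults the relevance judgments themselves.

    def cluster_ranked_list(clusters, relevant, similarity):
        # clusters:   list of lists of doc ids (e.g., 5 clusters)
        # relevant:   set of relevant doc ids (note: the ground truth is used here!)
        # similarity: dict mapping doc id -> similarity to the query
        def frac_relevant(cluster):
            return sum(doc in relevant for doc in cluster) / len(cluster)

        ranked = []
        # Rank clusters by their fraction of relevant docs, then rank documents
        # within each cluster by similarity to the query.
        for cluster in sorted(clusters, key=frac_relevant, reverse=True):
            ranked.extend(sorted(cluster, key=similarity.get, reverse=True))
        return ranked

    # Hypothetical data: two clusters, ground-truth relevance, similarity scores.
    print(cluster_ranked_list([["d1", "d2"], ["d3", "d4"]],
                              {"d3", "d4"},
                              {"d1": 0.9, "d2": 0.1, "d3": 0.5, "d4": 0.4}))
    # -> ['d3', 'd4', 'd1', 'd2']: the all-relevant cluster jumps ahead of the
    #    higher-similarity documents because the judgments leak into the ranking.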

73
Sim-Ranked vs. Cluster-Ranked
Does this show superiority of cluster ranking?
74
Relevance Density of Clusters
75
Summary
  • Information Visualization: A good visualization is worth a thousand pictures.
  • But making information visualization work for text is hard.
  • Evaluation Measures: F measure, breakeven point, area under the ROC curve
  • Evaluating interactive systems is harder than evaluating algorithms.
  • Evaluation gotchas: Begin with the end in mind

76
Resources
  • FOA 4.3
  • MIR Ch. 10.8-10.10
  • Ellen Voorhees. Variations in Relevance Judgments and the Measurement of Retrieval Effectiveness. ACM SIGIR '98.
  • Harman, D.K. Overview of the Third REtrieval Conference (TREC-3). In Overview of The Third Text REtrieval Conference (TREC-3), Harman, D.K. (Ed.). NIST Special Publication 500-225, 1995, pp. 1-19.
  • Jean Carletta. Assessing agreement on classification tasks: the kappa statistic. Computational Linguistics 22(2): 249-254, 1996.
  • Marti A. Hearst, Jan O. Pedersen. Reexamining the Cluster Hypothesis: Scatter/Gather on Retrieval Results. Proceedings of SIGIR-96, 1996.
  • http://gim.unmc.edu/dxtests/ROC3.htm
  • Pirolli, P. and Card, S. K. (1999). Information Foraging. Psychological Review 106(4): 643-675.
  • Paul Over. TREC-6 Interactive Track Report. NIST, 1998.

77
Resources
  • http://www.acm.org/sigchi/chi96/proceedings/papers/Koenemann/jk1_txt.htm
  • http://otal.umd.edu/olive
  • Jaime Carbonell, Jade Goldstein. The use of MMR, diversity-based reranking for reordering documents and producing summaries. Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 335-336, August 24-28, 1998, Melbourne, Australia.