Title: CS276A Text Information Retrieval, Mining, and Exploitation
1CS276A Text Information Retrieval, Mining, and Exploitation
2Information Access in Context
[Flow diagram: the user begins with a high-level goal, analyzes and synthesizes information, then checks "Done?"; if no, the loop repeats; if yes, stop.]
3Exercise
- Observe your own information seeking behavior
- WWW
- University library
- Grocery store
- Are you a searcher or a browser?
- How do you reformulate your query?
- Read bad hits, then minus terms
- Read good hits, then plus terms
- Try a completely different query
4Correction: Address Field vs. Search Box
- Are users typing URLs into the search box ignorant?
- .com / .org / .net / international URLs
- cnn.com vs. www.cnn.com
- Full URL with protocol qualifier vs. partial URL
5Today's Topics
- Information design and visualization
- Evaluation measures and test collections
- Evaluation of interactive information retrieval
- Evaluation gotchas
6Information Visualization and Exploration
- Tufte
- Shneiderman
- Information foraging: Xerox PARC / PARC Inc.
7Edward Tufte
- Information design bible: The Visual Display of Quantitative Information
- The art and science of how to display (quantitative) information visually
- Significant influence on User Interface design
8The Challenger Accident
- On January 28, 1986, the space shuttle Challenger explodes shortly after takeoff.
- Seven crew members die.
- One of the causes: an O-ring failed due to cold temperatures.
- How could this happen?
9How O-Rings were presented
- Time scale is shown instead of temperature scale!
- Needless junk (rockets don't show information)
- Graphic does not help answer the question: why do O-rings fail?
10Tufte Principles for Information Design
- Omit needless junk
- Show what you mean
- Don't obscure the meaning and order of scales
- Make comparisons of related images possible
- Claim authorship, and think twice when others don't
- Seek truth
11Tufte's O-Ring Visualization
12Tufte Summary
- Like poor writing, bad graphical displays distort or obscure the data, make it harder to understand or compare, or otherwise thwart the communicative effect which the graph should convey.
- Bad decisions are made based on bad information design.
- Tufte's influence on UI design
- Examples of the best and worst in information visualization: http://www.math.yorku.ca/SCS/Gallery/noframes.html
13Shneiderman: Information Visualization
- How to design user interfaces
- How to engineer user interfaces for software
- Task by type taxonomy
14Shneiderman on HCI
- Well-designed interactive computer systems promote
- Positive feelings of success, competence, and mastery.
- Allow users to concentrate on their work, rather than on the system.
Marti Hearst
15Task by Type Taxonomy: Data Types
- 1-D linear: SeeSoft
- 2-D map: multidimensional scaling (terms, docs, etc.)
- 3-D world: Cat-a-Cone
- Multi-dim: Table Lens
- Temporal: topic detection
- Tree: hierarchies a la Yahoo
- Network: network graphs of sites (KartOO)
16Task by Type Taxonomy: Tasks
- Overview: gain an overview of the entire collection
- Zoom: zoom in on items of interest
- Filter: filter out uninteresting items
- Details-on-demand: select an item or group and get details when needed
- Relate: view relationships among items
- History: keep a history of actions to support undo, replay
- Extract: allow extraction of subcollections and the query parameters
17Exercise
- If your project has a UI component
- Which data types are being displayed?
- Which tasks are you supporting?
18Xerox PARC Information Foraging
- Metaphor from ecology/biology
- People looking for information are like animals foraging for food
- Predictive model that allows a principled way of designing user interfaces
- The main focus is
- What will the user do next?
- How can we support a good choice for the next action?
- Rather than
- Evaluation of a single user-system interaction
19Foraging Paradigm
Food Foraging: Biological, behavioral, and cultural designs are adaptive to the extent they optimize the rate of energy intake.
George Robertson, Microsoft
20Information Foraging Paradigm
Information Foraging: Information access and visualization technologies are adaptive to the extent they optimize the rate of gain of valuable information.
George Robertson, Microsoft
21Searching Patches
George Robertson, Microsoft
22Information Foraging Theory
- G = information/food gained
- g = average gain per doc/patch
- TB = total time between docs/patches
- tb = average time between docs/patches
- TW = total time within docs/patches
- tw = average time to process a doc/patch
- lambda = 1/tb = prevalence of information/food
23Information Foraging Theory
- R = G / (TB + TW) = rate of gain
- R = lambda * TB * g / (TB + lambda * TB * tw)
- R = lambda * g / (1 + lambda * tw)
- Goodness measure of a UI: R, the rate of gain
- Optimize the UI by increasing R
- Increase prevalence lambda (asymptotic improvement)
- Decrease tw (the time it takes to absorb a doc/food)
- Better: model different types of docs/patches
- Model can be used to find optimal UI parameters (a sketch follows below)
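A minimal Python sketch of this rate-of-gain model (not from the lecture); the parameter values are illustrative assumptions, used only to show how decreasing tw or increasing lambda raises R.

```python
# Rate of gain in the information-foraging model: R = lambda * g / (1 + lambda * tw)
def rate_of_gain(lam, g, tw):
    """R given prevalence lam (= 1/tb), average gain g per doc, and time tw to process a doc."""
    return lam * g / (1 + lam * tw)

# Illustrative (made-up) numbers: one useful doc every 20 s, 30 s to read a doc.
baseline = rate_of_gain(lam=0.05, g=1.0, tw=30.0)
faster_reading = rate_of_gain(lam=0.05, g=1.0, tw=15.0)   # better snippets/UI: halve tw
better_ranking = rate_of_gain(lam=0.10, g=1.0, tw=30.0)   # better ranking: double lambda

print(f"baseline R       = {baseline:.4f}")
print(f"decrease tw:   R = {faster_reading:.4f}")
print(f"increase lambda: R = {better_ranking:.4f}")
```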
24Cost-of-Knowledge Characteristic Function
- Improve productivity: less time or more output
Card, Pirolli, and Mackinlay
25Creating Test Collections for IR Evaluation
26Test Corpora
27Kappa Measure
- Kappa measures
- Agreement among coders
- Designed for categorical judgments
- Corrects for chance agreement
- Kappa = [P(A) - P(E)] / [1 - P(E)]
- P(A) = proportion of the time coders agree
- P(E) = what agreement would be by chance
- Kappa = 0 for chance agreement, 1 for total agreement.
28Kappa Measure Example
P(A)? P(E)?
29Kappa Example
- P(A) = 370/400 = 0.925
- P(nonrelevant) = (10 + 20 + 70 + 70)/800 = 0.2125
- P(relevant) = (10 + 20 + 300 + 300)/800 = 0.7875
- P(E) = 0.2125^2 + 0.7875^2 = 0.665
- Kappa = (0.925 - 0.665)/(1 - 0.665) = 0.776
- For > 2 judges: average pairwise kappas (a sketch follows below)
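A minimal Python sketch of this computation; the underlying 2x2 judgment table (300 documents judged relevant by both judges, 70 nonrelevant by both, and a 20/10 split on the disagreements) is an assumption reconstructed from the sums above.

```python
# Kappa for two judges over 400 documents (counts reconstructed from the slide's sums).
both_rel, both_nonrel, disagree_a, disagree_b = 300, 70, 20, 10
n = both_rel + both_nonrel + disagree_a + disagree_b      # 400 documents

p_agree = (both_rel + both_nonrel) / n                    # P(A) = 370/400 = 0.925

# Pooled marginals: each judge makes n judgments, so 2n judgments in total.
p_rel = (2 * both_rel + disagree_a + disagree_b) / (2 * n)        # 630/800
p_nonrel = (2 * both_nonrel + disagree_a + disagree_b) / (2 * n)  # 170/800
p_chance = p_rel ** 2 + p_nonrel ** 2                     # P(E) ~ 0.665

kappa = (p_agree - p_chance) / (1 - p_chance)             # ~ 0.776
print(f"P(A)={p_agree:.3f}  P(E)={p_chance:.3f}  kappa={kappa:.3f}")
```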
30Kappa Measure
- Kappa > 0.8: good agreement
- 0.67 < Kappa < 0.8: tentative conclusions (Carletta 96)
- Depends on purpose of study
31Interjudge Disagreement: TREC-3
33Impact of Interjudge Disagreement
- Impact on absolute performance measure can be significant (0.32 vs. 0.39)
- Little impact on ranking of different systems or relative performance
34Evaluation Measures
35Recap Precision/Recall
- Evaluation of ranked results
- You can return any number of results ordered by similarity
- By taking various numbers of documents (levels of recall), you can produce a precision-recall curve
- Precision = #(correct AND retrieved) / #retrieved
- Recall = #(correct AND retrieved) / #correct
- The truth, the whole truth, and nothing but the truth: recall = 1.0 is the whole truth, precision = 1.0 is nothing but the truth. (A sketch follows below.)
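A minimal sketch of these two measures over sets of document IDs; the relevant and retrieved documents below are made up for illustration.

```python
# Precision and recall over document IDs (illustrative data, not from the lecture).
relevant = {1, 2, 3, 4, 5, 6, 7, 8}             # the "correct" documents
retrieved = [3, 9, 1, 12, 5, 20, 7, 30]         # ranked system output

hits = [d for d in retrieved if d in relevant]  # correct AND retrieved
precision = len(hits) / len(retrieved)
recall = len(hits) / len(relevant)
print(f"precision={precision:.2f}  recall={recall:.2f}")   # 0.50 and 0.50
```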
36Recap Precision-recall curves
37F Measure
- F measure is the harmonic mean of precision and recall (strictly speaking, F1)
- 1/F = 1/2 (1/P + 1/R)
- Use the F measure if you need to optimize a single measure that balances precision and recall (sketched below).
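A minimal sketch of F1 as the harmonic mean; the input values are illustrative.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f1(0.5, 0.5))   # 0.5
print(f1(0.9, 0.1))   # 0.18 -- the harmonic mean punishes imbalance
```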
38F-Measure
[Figure: F1 curve; maximum F1 ≈ 0.96]
39Breakeven Point
- The breakeven point is the point where precision equals recall.
- Alternative single measure of IR effectiveness.
- How do you compute it? (One way is sketched below.)
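One possible way to approximate it, sketched under the assumption that we sweep the rank cutoff and take the point where precision and recall are closest (the lecture does not prescribe a method):

```python
def breakeven(ranking, relevant):
    """Approximate breakeven point: (precision, recall) at the cutoff where they are closest."""
    best = None
    hits = 0
    for k, doc in enumerate(ranking, start=1):
        hits += doc in relevant
        p, r = hits / k, hits / len(relevant)
        if best is None or abs(p - r) < abs(best[0] - best[1]):
            best = (p, r)
    return best

ranking = [3, 9, 1, 12, 5, 20, 7, 30]    # illustrative ranked output
relevant = {1, 2, 3, 4, 5, 6, 7, 8}
print(breakeven(ranking, relevant))       # precision and recall meet at (0.5, 0.5) here
```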
40Area under the ROC Curve
- True positive rate = recall = sensitivity
- False positive rate = fp/(tn + fp). Related to precision: fpr = 0 <-> precision = 1
- Why is the blue line worthless? (An AUC sketch follows below.)
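A minimal sketch of ROC points and the area under the curve computed from a ranked list by the trapezoid rule; the data are the same illustrative IDs as above, not from the lecture.

```python
# ROC points (FPR, TPR) swept over rank cutoffs, and AUC by the trapezoid rule.
ranking = [3, 9, 1, 12, 5, 20, 7, 30]     # illustrative ranked output
relevant = {1, 2, 3, 4, 5, 6, 7, 8}
n_pos = sum(1 for d in ranking if d in relevant)
n_neg = len(ranking) - n_pos

points, tp, fp = [(0.0, 0.0)], 0, 0
for doc in ranking:
    if doc in relevant:
        tp += 1
    else:
        fp += 1
    points.append((fp / n_neg, tp / n_pos))   # (false positive rate, true positive rate)

auc = sum((x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(points, points[1:]))
print(f"AUC = {auc:.3f}")                     # 1.0 = perfect, 0.5 = the worthless diagonal
```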
41Precision-Recall Graph vs. ROC
42Unit of Evaluation
- We can compute precision, recall, F, and the ROC curve for different units.
- Possible units
- Documents (most common)
- Facts (used in some TREC evaluations)
- Entities (e.g., car companies)
- May produce different results. Why?
43Critique of Pure Relevance
- Relevance vs. Marginal Relevance
- A document can be redundant even if it is highly relevant
- Duplicates
- The same information from different sources
- Marginal relevance is a better measure of utility for the user.
- Using facts/entities as evaluation units more directly measures true relevance.
- But harder to create the evaluation set
- See the Carbonell reference (a sketch of its MMR criterion follows below)
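A minimal sketch of the Maximal Marginal Relevance (MMR) criterion from the Carbonell & Goldstein reference: greedily pick the next document that is similar to the query but dissimilar to the documents already selected. The similarity functions and toy data here are illustrative assumptions, not the paper's setup.

```python
def mmr_rerank(candidates, query_sim, doc_sim, lam=0.7, k=5):
    """Greedy MMR: repeatedly pick argmax of
    lam * sim(d, query) - (1 - lam) * max over selected d' of sim(d, d')."""
    selected, pool = [], list(candidates)
    while pool and len(selected) < k:
        def mmr_score(d):
            redundancy = max((doc_sim(d, s) for s in selected), default=0.0)
            return lam * query_sim(d) - (1 - lam) * redundancy
        best = max(pool, key=mmr_score)
        selected.append(best)
        pool.remove(best)
    return selected

# Toy usage: documents and query as term sets, Jaccard similarity (illustrative only).
docs = {"d1": {"car", "recall", "brake"}, "d2": {"car", "recall", "brakes"},
        "d3": {"tobacco", "advertising", "youth"}}
query = {"car", "recall"}
jaccard = lambda a, b: len(a & b) / len(a | b)
order = mmr_rerank(docs, query_sim=lambda d: jaccard(docs[d], query),
                   doc_sim=lambda a, b: jaccard(docs[a], docs[b]), lam=0.3, k=3)
print(order)   # ['d1', 'd3', 'd2']: the near-duplicate d2 drops below the dissimilar d3
```

With lam = 1 this reduces to plain relevance ranking; lowering lam trades relevance for diversity.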
44Evaluation of Interactive Information Retrieval
45Evaluating Interactive IR
- Evaluating interactive IR poses special challenges
- Obtaining experimental data is more expensive
- Experiments involving humans require careful design.
- Control for confounding variables
- Questionnaire to collect relevant subject data
- Ensure that the experimental setup is close to the intended real-world scenario
- Approval for human subjects research
46IIR Evaluation Case Study 1
- TREC-6 Interactive Track report
- 9 participating groups (US, Europe, Australia)
- Control system (simple IR system)
- Each group ran their system and the control system
- 4 users at each site
- 6 queries (= topics)
- Goal of evaluation: find the best-performing system
- Why do you need a control system for comparing groups?
47Queries (= Topics)
48Latin Square Design
49Analysis of Variance
50Analysis of Variance
51Analysis of Variance
52Observations
- Query effect is largest (largest std) for each site
- High degree of query variability
- Searcher effect negligible for 4 out of 10 sites
- Best model: interactions are small compared to overall error.
- None of the 10 sites is statistically better than the control system!
53IIR Evaluation Case Study 2
- Evaluation of relevance feedback
- Koenemann & Belkin 1996
54Why Evaluate Relevance Feedback?
55Questions Being Investigated: Koenemann & Belkin 96
- How well do users work with statistical ranking on full text?
- Does relevance feedback improve results?
- Is user control over operation of relevance feedback helpful?
- How do different levels of user control affect results?
Credit Marti Hearst
56How much of the guts should the user see?
- Opaque (black box)
- (like web search engines)
- Transparent
- (see available terms after the r.f. )
- Penetrable
- (see suggested terms before the r.f.)
- Which do you think worked best?
Credit Marti Hearst
57Credit Marti Hearst
58Terms available for relevance feedback made visible (from Koenemann & Belkin)
Credit Marti Hearst
59Details on User Study: Koenemann & Belkin 96
- Subjects have a tutorial session to learn the system
- Their goal is to keep modifying the query until they've developed one that gets high precision
- This is an example of a routing query (as opposed to ad hoc)
- Reweighting
- They did not reweight query terms
- Instead, only term expansion
- pool all terms in relevant docs
- take top N terms, where
- N = 3 + (number-of-marked-relevant-docs * 2)
- (the more marked docs, the more terms added to the query; a sketch follows after this slide)
Credit Marti Hearst
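A minimal sketch of this expansion-only feedback; scoring "top" terms by raw frequency in the marked-relevant documents is an illustrative assumption (the study's actual term weighting inside INQUERY is not given here).

```python
from collections import Counter

def expand_query(query_terms, marked_relevant_docs):
    """Expansion-only relevance feedback: pool terms from marked-relevant docs
    and add the top N = 3 + 2 * (number of marked docs) of them."""
    n = 3 + 2 * len(marked_relevant_docs)
    pool = Counter()
    for doc in marked_relevant_docs:
        pool.update(t for t in doc.lower().split() if t not in query_terms)
    expansion = [term for term, _ in pool.most_common(n)]
    return list(query_terms) + expansion

docs = ["automobile recall announced by the manufacturer",
        "the recall covers brake defects in several models"]
print(expand_query(["automobile", "recalls"], docs))   # adds up to 7 expansion terms
```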
60Details on User Study: Koenemann & Belkin 96
- 64 novice searchers
- 43 female, 21 male, native English speakers
- TREC test bed
- Wall Street Journal subset
- Two search topics
- Automobile Recalls
- Tobacco Advertising and the Young
- Relevance judgements from TREC and experimenter
- System was INQUERY (vector space with some bells and whistles)
Credit Marti Hearst
61Sample TREC query
Credit Marti Hearst
62Evaluation
- Precision at 30 documents (sketched after this slide)
- Baseline (Trial 1)
- How well does the initial search go?
- One topic has more relevant docs than the other
- Experimental condition (Trial 2)
- Subjects get a tutorial on relevance feedback
- Modify query in one of four modes
- no r.f., opaque, transparent, penetration
Credit Marti Hearst
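A minimal sketch of precision at k (here k = 30), the measure used above, on illustrative inputs.

```python
def precision_at_k(ranking, relevant, k=30):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for d in ranking[:k] if d in relevant) / k

# Illustrative data: 100 ranked doc IDs with a made-up relevant set.
ranking = list(range(1, 101))
relevant = {2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37}
print(precision_at_k(ranking, relevant, k=30))   # 10/30 ~ 0.33
```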
63Precision vs. RF condition (from Koenemann & Belkin 96)
Can we conclude from this chart that RF is better?
Credit Marti Hearst
64Effectiveness Results
- Subjects with R.F. achieved 17-34% better performance than those with no R.F.
- Subjects in the penetration case did 15% better as a group than those in the opaque and transparent cases.
Credit Marti Hearst
65Number of iterations in formulating queries (from Koenemann & Belkin 96)
Credit Marti Hearst
66Number of terms in created queries (from Koenemann & Belkin 96)
Credit Marti Hearst
67Behavior Results
- Search times approximately equal
- Precision increased in first few iterations
- Penetration case required fewer iterations to make a good query than transparent and opaque
- R.F. queries much longer
- but fewer terms in penetrable case -- users were more selective about which terms were added in.
Credit Marti Hearst
68Evaluation Gotchas
- No statistical test (!)
- Lots of pairwise tests
- Wrong evaluation measure
- Query variability
- Unintentionally biased evaluation
69Gotchas Evaluation Measures
- KDD Cup 2002
- Optimize model parameter: balance factor
- Area under the ROC curve and BEP have different behaviors
- Yet these two measures intuitively measure the same property.
70Gotchas Query variability
- Eichmann et al. claim that for their approach to CLIR, French is harder than Spanish.
- French: average precision 0.149
- Spanish: average precision 0.173
71Gotchas Query variability
- Queries with Spanish > baseline: 14
- Queries with Spanish = baseline: 40
- Queries with Spanish < baseline: 53
- Queries with French > baseline: 20
- Queries with French = baseline: 22
- Queries with French < baseline: 64
72Gotchas Biased Evaluation
- Compare two IR algorithms
- 1. send query, present results
- 2. send query, cluster results, present clusters
- Experiment was simulated (no users)
- Results were clustered into 5 clusters
- Clusters were ranked according to the percentage of relevant documents
- Documents within clusters were ranked according to similarity to the query
73Sim-Ranked vs. Cluster-Ranked
Does this show superiority of cluster ranking?
74Relevance Density of Clusters
75Summary
- Information Visualization: A good visualization is worth a thousand pictures.
- But making information visualization work for text is hard.
- Evaluation Measures: F measure, break-even point, area under the ROC curve
- Evaluating interactive systems is harder than evaluating algorithms.
- Evaluation gotchas: begin with the end in mind
76Resources
- FOA 4.3
- MIR Ch. 10.8-10.10
- Ellen Voorhees. Variations in Relevance Judgments and the Measurement of Retrieval Effectiveness. ACM SIGIR '98.
- Harman, D.K. Overview of the Third Text REtrieval Conference (TREC-3). In Overview of the Third Text REtrieval Conference (TREC-3), Harman, D.K. (Ed.), NIST Special Publication 500-225, 1995, pp. 1-19.
- Jean Carletta. Assessing agreement on classification tasks: the kappa statistic. Computational Linguistics 22(2):249-254, 1996.
- Marti A. Hearst, Jan O. Pedersen. Reexamining the Cluster Hypothesis: Scatter/Gather on Retrieval Results. Proceedings of SIGIR-96, 1996.
- http://gim.unmc.edu/dxtests/ROC3.htm
- Pirolli, P. and Card, S. K. (1999). Information Foraging. Psychological Review 106(4):643-675.
- Paul Over. TREC-6 Interactive Track Report. NIST, 1998.
77Resources
- http://www.acm.org/sigchi/chi96/proceedings/papers/Koenemann/jk1_txt.htm
- http://otal.umd.edu/olive
- Jaime Carbonell, Jade Goldstein. The use of MMR, diversity-based reranking for reordering documents and producing summaries. Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 335-336, August 24-28, 1998, Melbourne, Australia.