Search and Retrieval: Term Weighting and Document Ranking

Transcript and Presenter's Notes

1
Search and Retrieval: Term Weighting and Document Ranking
  • Prof. Marti Hearst
  • SIMS 202, Lecture 21

2
Today
  • Review Evaluation from last time
  • Term Weights and Document Ranking
  • Go over Midterm

3
Last Time
  • Relevance
  • Evaluation of IR Systems
  • Precision vs. Recall
  • Cutoff Points, E-measure, search length
  • Test Collections/TREC
  • Blair and Maron Study

4
What to Evaluate?
  • What can be measured that reflects users' ability
    to use the system? (Cleverdon 66)
  • Coverage of Information
  • Form of Presentation
  • Effort required/Ease of Use
  • Time and Space Efficiency
  • Recall
  • proportion of relevant material actually
    retrieved
  • Precision
  • proportion of retrieved material actually relevant

5
Retrieved vs. Relevant Documents
[Diagram: overlapping sets of Retrieved and Relevant documents, contrasting a hypothetical high-recall result with a hypothetical high-precision result]
6
Standard IR Evaluation
  • Precision = |relevant ∩ retrieved| / |retrieved|
  • Recall = |relevant ∩ retrieved| / |relevant|

[Diagram: the document collection, with the retrieved set and the relevant set overlapping]
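A minimal sketch of these two measures, assuming the retrieved results and the relevance judgments are available as plain Python sets of document IDs (the names and numbers below are illustrative, not from the slides):

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall from sets of document IDs."""
    hits = len(retrieved & relevant)           # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Example: 4 of the 6 retrieved documents are relevant, out of 10 relevant overall.
retrieved = {1, 2, 3, 4, 5, 6}
relevant = {2, 3, 5, 6, 8, 9, 10, 11, 12, 13}
p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")     # precision=0.67 recall=0.40
```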
7
Precision/Recall Curves
  • There is a tradeoff between Precision and Recall
  • So measure Precision at different levels of Recall

[Plot: precision on the y-axis against recall on the x-axis, with precision measured at several recall levels]
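A small sketch of how such points can be produced, assuming the system returns a ranked list and precision is recorded each time recall increases (the ranking and judgments below are hypothetical):

```python
def precision_at_recall_points(ranked_ids, relevant):
    """Precision measured at each rank where another relevant document is found."""
    points, hits = [], 0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))  # (recall, precision)
    return points

ranked = [4, 2, 9, 7, 5, 1, 8]   # system output, best first
relevant = {2, 5, 8}             # relevance judgments
for recall, precision in precision_at_recall_points(ranked, relevant):
    print(f"recall={recall:.2f} precision={precision:.2f}")
```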
8
Expected Search Length
  • Documents are presented in order of predicted
    relevance
  • Search length = number of non-relevant documents
    that the user must scan through in order to have
    their information need satisfied
  • The shorter the better
  • In the example: n = 2 gives search length = 2;
    n = 3 gives search length = 6

What is the correction for this?
9
Expected Search Length
Correction: n = 6, search length = 3
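A small sketch of the idea, assuming the ranked output is a list of booleans marking relevance and the user stops after finding n relevant documents (the ranking below is made up):

```python
def search_length(ranking_is_relevant, n):
    """Number of non-relevant documents scanned before the n-th relevant one."""
    found = skipped = 0
    for is_relevant in ranking_is_relevant:
        if is_relevant:
            found += 1
            if found == n:
                return skipped
        else:
            skipped += 1
    return None  # fewer than n relevant documents in the ranking

# Hypothetical ranking: R = relevant, N = non-relevant.
ranking = [c == "R" for c in "NRNNRRNR"]
print(search_length(ranking, 2))  # 3 non-relevant documents before the 2nd relevant one
```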
10
The E-Measure
  • Combine Precision and Recall into one number (van
    Rijsbergen 79)

E = 1 - ((b² + 1)PR) / (b²P + R), where P = precision, R = recall, and
b = measure of relative importance of P or R
11
The E and F-Measures
  • Combine Precision and Recall into one number (van
    Rijsbergen 79)

F = 1 - E = ((b² + 1)PR) / (b²P + R), where P = precision, R = recall
With the F-measure, larger values are better.
12
The F-Measure
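A small sketch of the F-measure and E-measure as defined on the preceding slides; b = 1 reduces F to the familiar harmonic mean 2PR / (P + R):

```python
def f_measure(precision, recall, b=1.0):
    """van Rijsbergen's F-measure; larger is better."""
    if precision == 0 and recall == 0:
        return 0.0
    return (b * b + 1) * precision * recall / (b * b * precision + recall)

def e_measure(precision, recall, b=1.0):
    """van Rijsbergen's E-measure; smaller is better (E = 1 - F)."""
    return 1.0 - f_measure(precision, recall, b)

print(f_measure(0.67, 0.40))         # ~0.50: harmonic mean of P and R when b = 1
print(e_measure(0.67, 0.40, b=2.0))  # b > 1 weights recall more heavily than precision
```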
13
TREC
  • Text REtrieval Conference/Competition
  • Run by NIST (National Institute of Standards and
    Technology)
  • 1997 was the 6th year
  • Collection: 3 gigabytes, >1 million docs
  • Newswire and full-text news (AP, WSJ, Ziff)
  • Government documents (Federal Register)
  • Queries and Relevance Judgments
  • Queries devised and judged by Information
    Specialists
  • Relevance judgments done only for those documents
    retrieved -- not entire collection!
  • Competition
  • Various research and commercial groups compete
  • Results judged on precision and recall, going up
    to a recall level of 1000 documents

14
Other facts about TREC
  • Recall is only an estimate based on the documents
    that have actually been assigned judgments
  • Recall is judged up to a maximum of 1000 relevant
    documents per query
  • In the standard ad hoc situation, everything is
    automated; no human intervention is allowed.
  • Ad hoc: a one-shot query
  • Standing query/routing: a query that is asked over
    and over again; the system can be trained to react
    to it.

15
Finding Out About
  • Three phases
  • Asking of a question
  • Construction of an answer
  • Assessment of the answer
  • Part of an iterative process

16
Problems with Boolean
  • Difficult to control
  • AND: get back too few
  • OR: get back too many
  • No inherent ordering

17
Ranking Algorithms
  • Assign weights to the terms in the query.
  • Assign weights to the terms in the documents.
  • Compare the weighted query terms to the weighted
    document terms.
  • Rank order the results.

18
[Diagram: overview of the retrieval process, relating the user's information need to the text input from the document collections]
19
Vector Representation (revisited; see Salton article in Science)
  • Documents and Queries are represented as vectors.
  • Position 1 corresponds to term 1, position 2 to
    term 2, ..., position t to term t

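A toy sketch of this representation, assuming a small fixed vocabulary so that position i of every vector corresponds to term i (the vocabulary and texts below are made up):

```python
vocabulary = ["information", "retrieval", "ranking", "weights", "zipf"]

def to_vector(text):
    """Term-count vector: position i holds the count of vocabulary term i."""
    tokens = text.lower().split()
    return [tokens.count(term) for term in vocabulary]

doc = "Ranking with term weights improves retrieval"
query = "retrieval ranking"
print(to_vector(doc))    # [0, 1, 1, 1, 0]
print(to_vector(query))  # [0, 1, 1, 0, 0]
```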
20
Assigning Weights
  • Recall the Zipf distribution
  • Want to weight terms highly if they are
  • frequent in relevant documents BUT
  • infrequent in the collection as a whole

21
Assigning Weights
  • tf x idf measure
  • term frequency (tf)
  • inverse document frequency (idf)

22
tf x idf
  • Normalize the term weights (so longer documents
    are not unfairly given more weight)
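A compact sketch of one common tf x idf weighting with length normalization; the particular choices here (raw term counts, log-scaled idf, cosine normalization) are illustrative assumptions rather than the exact formula from the slide:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document,
    using tf x idf with cosine (unit-length) normalization."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))   # document frequency
    idf = {term: math.log(n_docs / df_t) for term, df_t in df.items()}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        weights = {term: count * idf[term] for term, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors

docs = [["term", "weighting", "for", "ranking"],
        ["zipf", "distribution", "of", "term", "frequencies"],
        ["document", "ranking", "by", "similarity"]]
for vec in tfidf_vectors(docs):
    print({t: round(w, 2) for t, w in vec.items()})
```

Normalizing each vector to unit length keeps long documents from dominating simply because they contain more term occurrences.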

23
Vector Space Similarity Measure: combine tf x idf into a similarity measure
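A minimal sketch of the cosine similarity commonly used for this combination, operating on sparse {term: weight} vectors such as those produced by the tf x idf sketch above:

```python
import math

def cosine(vec_a, vec_b):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(w * vec_b.get(term, 0.0) for term, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

query = {"ranking": 0.8, "term": 0.4}
doc = {"term": 0.5, "weighting": 0.6, "ranking": 0.3}
print(round(cosine(query, doc), 3))
```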
24
To Think About
  • How does this ranking algorithm behave?
  • Make a set of hypothetical documents consisting
    of terms and their weights
  • Create some hypothetical queries
  • How are the documents ranked, depending on the
    weights of their terms and the query's terms? (See
    the sketch below.)
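One way to explore these questions is to rank a few made-up weighted documents against made-up queries, as in this self-contained sketch (all document names, weights, and queries are hypothetical):

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical documents as {term: weight} vectors.
docs = {
    "d1": {"retrieval": 0.9, "ranking": 0.2},
    "d2": {"ranking": 0.8, "weighting": 0.5},
    "d3": {"zipf": 0.7, "weighting": 0.7},
}

for name, query in [("q1", {"ranking": 1.0}),
                    ("q2", {"weighting": 0.7, "retrieval": 0.7})]:
    ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
    print(name, ranked)  # observe how the term weights change the ordering
```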