Comparing and Ranking Documents - PowerPoint PPT Presentation

Slides: 14
Provided by: paulaam
Transcript and Presenter's Notes

1
Comparing and Ranking Documents
  • Once our search engine has retrieved a set of
    documents, we may want to:
  • Rank them by relevance
  • Which are the best fit to my query?
  • This involves determining what the query is about
    and how well the document answers it
  • Compare them
  • Show me more like this.
  • This involves determining what the document is
    about.

2
Determining Relevance by Keyword
  • The typical web query consists entirely of
    keywords.
  • Retrieval can be binary: present or absent
  • More sophisticated is to look for degree of
    relatedness: how much does this document reflect
    what the query is about?
  • Simple strategies
  • How many times does word occur in document?
  • How close to head of document?
  • If multiple keywords, how close together?

3
Keywords for Relevance Ranking
  • Count: repetition is an indication of emphasis
  • Very fast (usually in the index)
  • Reasonable heuristic
  • Unduly influenced by document length
  • Can be "stuffed" by web designers
  • Position: lead paragraphs summarize content
  • Requires more computation
  • Also a reasonable heuristic
  • Less influenced by document length
  • Harder to "stuff": can only have a few keywords
    near the beginning
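
The count and position heuristics above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the tokenizer, and the position formula (1.0 for the first word, falling linearly toward 0.0) are hypothetical choices, not something the slides specify.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def count_score(keyword, text):
    """Count heuristic: raw number of occurrences of the keyword."""
    return tokenize(text).count(keyword.lower())

def position_score(keyword, text):
    """Position heuristic (assumed formula): 1.0 if the keyword is the
    first word, decreasing linearly the later it first appears, and
    0.0 if it never appears."""
    tokens = tokenize(text)
    if keyword.lower() not in tokens:
        return 0.0
    first = tokens.index(keyword.lower())
    return 1.0 - first / len(tokens)

doc = "Ranking documents by relevance. A search engine scores ranking candidates."
print(count_score("ranking", doc))     # 2
print(position_score("ranking", doc))  # 1.0
```

Note that the raw count grows with document length, which is the weakness listed above; dividing by the number of tokens is one common way to dampen it.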

4
Keywords for Relevance Ranking
  • Proximity: for multiple keywords
  • Requires even more computation
  • Obviously relevant only if the query has multiple keywords
  • Effectiveness of heuristic varies with
    information need: typically either excellent or
    not very helpful at all
  • Very hard to "stuff"
  • All keyword methods
  • Are computationally simple and adequately fast
  • Are effective heuristics
  • Typically perform as well as in-depth natural
    language methods for standard search
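
One way to make the proximity heuristic concrete is to measure the smallest window of tokens that contains every query keyword. The sketch below assumes that definition (the slides do not give one); `min_span` is a hypothetical name, and the brute-force search is only suitable for short examples.

```python
from itertools import product

def min_span(keywords, tokens):
    """Return the length (in tokens) of the smallest window that contains
    every keyword at least once, or None if any keyword is missing."""
    positions = []
    for kw in keywords:
        hits = [i for i, tok in enumerate(tokens) if tok == kw]
        if not hits:
            return None
        positions.append(hits)
    # Try one position per keyword and keep the tightest combination.
    return min(max(combo) - min(combo) + 1 for combo in product(*positions))

tokens = "the search engine ranks documents while the engine indexes a search log".split()
print(min_span(["search", "engine"], tokens))     # 2
print(min_span(["search", "documents"], tokens))  # 4
```

A smaller span suggests the keywords are used together, so a ranking function could score documents by, for example, the reciprocal of the span.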

5
Comparing Documents
  • "Find me more like this one" really means that we
    are using the document as a query.
  • This requires that we have some conception of
    what a document is about overall.
  • Depends on context of query. We need to:
  • Characterize the entire content of this document
  • Discriminate between this document and others in
    the corpus

6
Comparing Documents (cont.)
  • Two very general approaches
  • statistical
  • semantic
  • We will discuss semantic approaches more in text
    mining
  • Statistical approach still focuses on keywords
  • To what extent does each term characterize this
    document?
  • To what extent does each term discriminate this
    document from other documents?

7
Characterizing a Document: Term Frequency
  • A document can be treated as a sequence of words.
  • Each word characterizes that document to some
    extent.
  • When we have eliminated stop words, the most
    frequent words tend to be what the document is
    about
  • Therefore f_kd (number of occurrences of word k in
    document d) will be an important measure.
  • Also called the term frequency
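
A minimal sketch of computing term frequencies after stop-word removal. The stop-word list here is a tiny illustrative assumption, not a standard list, and `term_frequencies` is a hypothetical name.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption for this example).
STOP_WORDS = {"the", "a", "of", "and", "is", "in", "to"}

def term_frequencies(text):
    """Return f_kd for one document: a mapping from word k to its count."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)

f = term_frequencies("The cat sat in the hat and the cat is the cat of the hat.")
print(f.most_common(2))  # [('cat', 3), ('hat', 2)]
```

Without the stop-word filter, "the" would be the most frequent word and would say nothing about what the document is about.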

8
Characterizing a Document: Document Frequency
  • What makes this document distinct from others in
    the corpus?
  • The terms which discriminate best are not those
    which occur with high frequency across the corpus!
  • Therefore D_k (number of documents in which word k
    occurs) will also be an important measure.
  • Also called the document frequency
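
Document frequency can be sketched as follows, assuming the corpus is already tokenized into lists of words. The toy corpus and the name `document_frequencies` are made up for illustration.

```python
from collections import Counter

def document_frequencies(corpus):
    """corpus: a list of token lists. Returns D_k, the number of
    documents in which each word k occurs."""
    df = Counter()
    for tokens in corpus:
        df.update(set(tokens))  # set(): each document counts once per word
    return df

corpus = [
    ["search", "engine", "ranking"],
    ["search", "query", "keywords"],
    ["search", "engine", "index", "engine"],
]
df = document_frequencies(corpus)
print(df["search"], df["engine"], df["ranking"])  # 3 2 1
```

Here "search" occurs in every document, so it cannot tell these documents apart, while "ranking" occurs in only one.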

9
TFIDF
  • This can all be summarized as:
  • Words are best discriminators when they
  • occur often in this document (term frequency)
  • don't occur in a lot of documents (document
    frequency)
  • One very common measure of the importance of a
    word to a document is TFIDF: term frequency,
    inverse document frequency
  • There are multiple formulas for actually
    computing this; the book gives Robertson and
    Jones. The underlying concept is the same in all
    of them.
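
As a hedged illustration of the concept, the sketch below uses one widely used variant, weight = f_kd * log(N / D_k), where N is the number of documents in the corpus. This is not necessarily the formula given in the book; the function name and toy corpus are assumptions.

```python
import math
from collections import Counter

def tfidf(corpus):
    """corpus: a list of token lists. Returns one {word: weight} dict per
    document, using the assumed variant f_kd * log(N / D_k)."""
    n_docs = len(corpus)
    df = Counter()
    for tokens in corpus:
        df.update(set(tokens))  # D_k: documents containing word k
    weights = []
    for tokens in corpus:
        tf = Counter(tokens)    # f_kd: occurrences of word k in this document
        weights.append({k: f * math.log(n_docs / df[k]) for k, f in tf.items()})
    return weights

corpus = [
    ["search", "engine", "ranking"],
    ["search", "query", "keywords"],
    ["search", "engine", "index", "engine"],
]
w = tfidf(corpus)
print(w[0]["search"])  # 0.0, because "search" occurs in every document
```

In the third document, "index" (one occurrence, found in one document) still outweighs "engine" (two occurrences, found in two documents), which shows both factors at work.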

10
Describing an Entire Document
  • So what is a document about?
  • TFIDF: we can simply list keywords in order of
    their TFIDF values
  • Document is about all of them to some degree: it
    is at some point in some vector space of meaning

11
Vector Space
  • Any corpus has a defined set of terms (index)
  • These terms define a knowledge space
  • Every document is somewhere in that knowledge
    space -- it is or is not about each of those
    terms.
  • Consider each term as a vector (one axis of the space). Then:
  • We have an n-dimensional vector space
  • Where n is the number of terms (very large!)
  • Each document is a point in that vector space
  • The document position in this vector space can be
    treated as what the document is about.
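
To make the vector-space picture concrete, the sketch below places a document at a point whose coordinates are raw term counts, one coordinate per term in the corpus index. Using raw counts is an assumption made for simplicity (TFIDF weights could be used instead), and the helper names are hypothetical.

```python
from collections import Counter

def build_vocabulary(corpus):
    """The corpus index: every distinct term, in a fixed (sorted) order."""
    return sorted({t for tokens in corpus for t in tokens})

def to_vector(tokens, vocabulary):
    """One coordinate per term in the vocabulary; here, the raw count."""
    counts = Counter(tokens)
    return [counts[t] for t in vocabulary]

corpus = [
    ["search", "engine", "ranking"],
    ["search", "query", "keywords"],
    ["search", "engine", "index", "engine"],
]
vocab = build_vocabulary(corpus)
print(vocab)                        # ['engine', 'index', 'keywords', 'query', 'ranking', 'search']
print(to_vector(corpus[2], vocab))  # [2, 1, 0, 0, 0, 1]
```

With a real corpus n is very large and most coordinates are zero, so practical systems store these vectors sparsely.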

12
Similarity Between Documents
  • How similar are two documents?
  • Measures of association
  • How much do the feature sets overlap?
  • Modified for length: Dice coefficient
  • Dice coefficient: terms compared to
    intersection
  • Simple Matching coefficient: takes into account
    exclusions
  • Cosine similarity
  • similarity of angle of the two document vectors
  • not sensitive to vector length

13
Bag of Words
  • All of these techniques are what are known as
    bag-of-words approaches.
  • Keywords treated in isolation
  • Difference between "man bites dog" and "dog bites
    man" is non-existent
  • In text mining we will discuss linguistic approaches
    which pay attention to semantics
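
A short illustration of the word-order limitation, assuming a bag of words is represented as a simple word-count mapping (`bag_of_words` is a hypothetical helper):

```python
from collections import Counter

def bag_of_words(text):
    """Represent a text as word counts only; word order is discarded."""
    return Counter(text.lower().split())

same = bag_of_words("man bites dog") == bag_of_words("dog bites man")
print(same)  # True
```

Any count-based measure built on this representation will therefore treat the two sentences as the same document.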