Comparing and Ranking Documents - PowerPoint PPT Presentation

Slides: 14

Provided by: paulaam

Learn more at: http://www.csc.villanova.edu

Transcript and Presenter's Notes

1
Comparing and Ranking Documents

Once our search engine has retrieved a set of
documents, we may want to:
Rank them by relevance
"Which are the best fit to my query?"
This involves determining what the query is about
and how well the document answers it.
Compare them
"Show me more like this."
This involves determining what the document is
about.

2
Determining Relevance by Keyword

The typical web query consists entirely of
keywords.
Retrieval can be binary: present or absent.
More sophisticated is to look for degree of
relatedness: how much does this document reflect
what the query is about?
Simple strategies
How many times does word occur in document?
How close to head of document?
If multiple keywords, how close together?
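
The first two simple strategies above can be sketched in a few lines of Python. This is only an illustration: the tokenizer and the function names `keyword_count` and `first_position` are assumptions made for the example, not something the slides define.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def keyword_count(keyword, text):
    """How many times does the keyword occur in the document?"""
    return tokenize(text).count(keyword.lower())

def first_position(keyword, text):
    """Token index of the first occurrence (0 = head of document), or None if absent."""
    tokens = tokenize(text)
    key = keyword.lower()
    return tokens.index(key) if key in tokens else None

doc = "Ranking documents by relevance: a search engine ranks documents for a query."
print(keyword_count("documents", doc))   # 2
print(first_position("relevance", doc))  # 3
```

A ranking function could then prefer documents with a higher count and a smaller first position; how to weigh the two against each other is left open here.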

3
Keywords for Relevance Ranking

4
Keywords for Relevance Ranking

Proximity for multiple keywords
Requires even more computation
Obviously relevant only if we have multiple keywords
Effectiveness of the heuristic varies with the
information need: typically either excellent or
not very helpful at all
Very hard to "stuff"
All keyword methods
Are computationally simple and adequately fast
Are effective heuristics
Typically perform as well as in-depth natural
language methods for standard search
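
As a rough sketch of the proximity heuristic on this slide, the function below measures how close two keywords come to each other in a document. The name `min_distance` and the choice of distance in token positions are assumptions for illustration; comparing every pair of occurrences also hints at why proximity costs more computation than simple counting.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def min_distance(word_a, word_b, text):
    """Smallest gap, in token positions, between any occurrence of word_a
    and any occurrence of word_b; None if either word is absent."""
    tokens = tokenize(text)
    pos_a = [i for i, t in enumerate(tokens) if t == word_a.lower()]
    pos_b = [i for i, t in enumerate(tokens) if t == word_b.lower()]
    if not pos_a or not pos_b:
        return None
    # Compare every occurrence of one keyword with every occurrence of the other.
    return min(abs(i - j) for i in pos_a for j in pos_b)

doc = "search engines rank documents; a good engine makes search fast"
print(min_distance("search", "engine", doc))  # 2
```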

5
Comparing Documents

"Find me more like this one" really means that we
are using the document as a query.
This requires that we have some conception of
what a document is about overall.
Depends on context of query. We need to
Characterize the entire content of this document
Discriminate between this document and others in
the corpus

6
Comparing Documents (cont.)

7
Characterizing a Document: Term Frequency

A document can be treated as a sequence of words.
Each word characterizes that document to some
extent.
When we have eliminated stop words, the most
frequent words tend to be what the document is
about.
Therefore fkd (the number of occurrences of word k in
document d) will be an important measure.
Also called the term frequency
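
A minimal sketch of computing fkd for every word in one document, after dropping stop words. The stop-word list here is a tiny illustrative assumption; a real system would use a much longer one.

```python
import re
from collections import Counter

# Illustrative stop-word list only (an assumption for this example).
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "by"}

def term_frequencies(text, stop_words=STOP_WORDS):
    """Map each non-stop word k to fkd, its number of occurrences in the document."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(t for t in tokens if t not in stop_words)

doc = "The ranking of documents is the ranking of answers to a query."
f = term_frequencies(doc)
print(f.most_common(1))  # [('ranking', 2)]
```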

8
Characterizing a Document: Document Frequency

What makes this document distinct from others in
the corpus?
The terms which discriminate best are not those
which occur with high frequency across the corpus!
Therefore Dk (the number of documents in which word k
occurs) will also be an important measure.
Also called the document frequency
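
Dk can be counted in one pass over the corpus. In this sketch the corpus is just a list of strings, and the function name `document_frequency` is an assumption for the example. Each document contributes at most 1 to a word's count, no matter how often the word appears in it.

```python
import re

def document_frequency(corpus):
    """Map each word k to Dk, the number of documents in which it occurs."""
    df = {}
    for text in corpus:
        # set() makes sure a word is counted once per document.
        for word in set(re.findall(r"[a-z']+", text.lower())):
            df[word] = df.get(word, 0) + 1
    return df

corpus = ["dogs chase cats", "cats chase mice", "mice eat cheese cheese"]
D = document_frequency(corpus)
print(D["chase"], D["cheese"])  # 2 1
```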

9
TFIDF

This can all be summarized as:
Words are the best discriminators when they
occur often in this document (term frequency)
don't occur in a lot of documents (document
frequency)
One very common measure of the importance of a
word to a document is TFIDF: term frequency,
inverse document frequency.
There are multiple formulas for actually
computing this; the book gives Robertson and
Jones. The underlying concept is the same in all
of them.
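
To make the idea concrete, here is a sketch using one widely used variant, weight = fkd * log(N / Dk), where N is the number of documents in the corpus. This is an assumption chosen for simplicity and is not necessarily the Robertson and Jones formula the book gives; stop-word removal is omitted to keep the example short.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def tfidf(corpus):
    """Return one {word: weight} dict per document,
    with weight = fkd * log(N / Dk)."""
    docs = [Counter(tokenize(text)) for text in corpus]  # fkd per document
    n_docs = len(docs)
    # Dk: each document contributes each of its distinct words once.
    df = Counter(word for counts in docs for word in counts)
    return [
        {word: f * math.log(n_docs / df[word]) for word, f in counts.items()}
        for counts in docs
    ]

corpus = ["dogs chase cats", "cats chase mice", "mice eat cheese cheese"]
weights = tfidf(corpus)
print(weights[2])
```

In the third document "cheese" gets the highest weight: it occurs twice there and in no other document, while "mice" is shared with another document and is weighted lower. A word that occurs in every document would get weight 0 under this variant.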

10
Describing an Entire Document

So what is a document about?
We can simply list keywords in order of
their TFIDF values.
The document is about all of them to some degree: it
is at some point in some vector space of meaning.

11
Vector Space

Any corpus has a defined set of terms (the index).
These terms define a knowledge space.
Every document is somewhere in that knowledge
space: it is or is not about each of those
terms.
Consider each term as a basis vector (one axis). Then
We have an n-dimensional vector space
Where n is the number of terms (very large!)
Each document is a point in that vector space
The document position in this vector space can be
treated as what the document is about.
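
The sketch below turns a small corpus into points in such a space. It uses raw term counts as coordinates for simplicity; TFIDF weights could be used instead. The function names `build_index` and `to_vector` are assumptions for the example.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def build_index(corpus):
    """Sorted list of every distinct term in the corpus: one dimension per term."""
    return sorted({t for text in corpus for t in tokenize(text)})

def to_vector(text, index):
    """Coordinates of a document: the count of each index term, in index order."""
    tokens = tokenize(text)
    return [tokens.count(term) for term in index]

corpus = ["dogs chase cats", "cats chase mice"]
index = build_index(corpus)
vectors = [to_vector(text, index) for text in corpus]
print(index)    # ['cats', 'chase', 'dogs', 'mice']
print(vectors)  # [[1, 1, 1, 0], [1, 1, 0, 1]]
```

Here n is only 4. With a realistic index most coordinates of any one document are zero, so practical systems store the vectors sparsely.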

12
Similarity Between Documents
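
One widely used way to compare two documents represented as vectors is cosine similarity, the cosine of the angle between them. The transcript does not state which measure the slide uses, so treat the choice here as an assumption; the sketch expects two equal-length lists of term weights.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors.
    Returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Two documents over a four-term index that share two of their three terms.
sim = cosine_similarity([1, 1, 1, 0], [1, 1, 0, 1])
print(round(sim, 3))  # 0.667
```

With non-negative weights the value runs from 0 (no terms in common) to 1 (same direction), so "more like this" can be answered by sorting the other documents by their similarity to the chosen one.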

13
Bag of Words

All of these techniques are what is known as bag
of words approaches.
Keywords are treated in isolation
The difference between "man bites dog" and "dog bites
man" is non-existent
In text mining we will discuss linguistic approaches
which pay attention to semantics
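
The "man bites dog" point is easy to check: once a sentence is reduced to word counts, the order is gone. The helper name `bag_of_words` is an assumption for this short sketch.

```python
from collections import Counter

def bag_of_words(text):
    """Reduce a text to its word counts, discarding word order."""
    return Counter(text.lower().split())

a = bag_of_words("man bites dog")
b = bag_of_words("dog bites man")
print(a == b)  # True
```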
