1
Weighting and Matching against Indices

2
Zipf's Law
  • In any corpus, such as the AIT, we can count how
    often each word occurs in the corpus as a whole:
    the word frequency, F(w).
  • Now imagine that we've sorted the vocabulary
    according to frequency, so that the most
    frequently occurring word has rank 1, the next
    most frequent word has rank 2, and so on.
  • Zipf (1949) found the following empirical
    relation:
  • F(w) ≈ C / rank(w)^a, where a ≈ 1 and C ≈ 20,000.
  • If a = 1, rank × frequency is approximately
    constant (see the sketch below).
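A minimal sketch (not from the slides) of checking this rank-frequency relation on a plain-text corpus; the file name corpus.txt and the simple whitespace tokenisation are illustrative assumptions.

from collections import Counter

def zipf_table(text, top=10):
    # Word frequency F(w), counted over the corpus as a whole.
    freqs = Counter(text.lower().split())
    # Sort by descending frequency so the most frequent word has rank 1.
    for rank, (word, f) in enumerate(freqs.most_common()[:top], start=1):
        # Under Zipf's law with a = 1, rank * F(w) stays roughly constant (about C).
        print(rank, word, f, rank * f)

zipf_table(open("corpus.txt").read())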

3
(No Transcript)
4
Consequences of lexical decisions on word
frequencies
  • Noise words occur frequently.
  • External keywords are also frequent (they tell
    you what the corpus is about, but do not help
    index individual documents).
  • Zipf's Law is seen with and without stemming.

5
Other applications of Zipf's Law
  • Number of unique visitors vs. rank of website
  • Number of speakers of each language
  • Prize money won by golfers
  • Frequency of DNA codons
  • Size of avalanches of grains of sand
  • Frequency of English surnames

6
Resolving Power (1)
  • Luhn (1957): "It is hereby proposed that the
    frequency of word occurrence in an article
    furnishes a useful measurement of word
    significance."
  • If a word is found frequently in a document, more
    frequently than we would expect, it reflects
    emphasis on the part of the author within that
    document.
  • But the raw frequency of occurrence in a document
    is only one of two critical statistics
    recommending good keywords.
  • For example, almost every article in the AIT
    contains the words ARTIFICIAL INTELLIGENCE.

7
Resolving Power (2)
  • Thus we prefer keywords which discriminate
    between documents (i.e. keywords found only in
    some documents).
  • Resolving power is the ability to discriminate
    content.
  • Mid-frequency terms have the greatest resolving
    power.
  • Luhn did not provide a method of establishing the
    maximal and minimal occurrence thresholds.
  • Simple methods: the frequency of stop-list words
    gives an upper limit; words which appear only
    once can only index one document (a rough sketch
    follows).
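A rough sketch of such simple thresholding, assuming documents are given as lists of tokens; the cut-offs min_docs and max_fraction are illustrative choices, not Luhn's.

from collections import Counter

def candidate_keywords(documents, min_docs=2, max_fraction=0.5):
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter(word for doc in documents for word in set(doc))
    ndoc = len(documents)
    # Keep mid-frequency terms: drop words found in too few or too many documents.
    return [w for w, df in doc_freq.items()
            if df >= min_docs and df / ndoc <= max_fraction]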

8
(No Transcript)
9
Exhaustivity and Specificity
  • An index is exhaustive if it includes many
    topics.
  • An index is specific if users can precisely
    identify their information needs.
  • Trade-off: high recall is easiest when an index
    is exhaustive but not very specific; high
    precision is best accomplished when the index is
    highly specific but not very exhaustive; the best
    index will strive for a balance.
  • If a document is indexed with many keywords, it
    will be retrieved more often (representation
    bias); we can expect higher recall, but precision
    will suffer.
  • We can also analyse the problem from a
    query-oriented perspective: how well do the query
    terms discriminate one document from another?

10
(No Transcript)
11
Weighting the Index Relation
  • The simplest notion of an index is binary:
    either a keyword is associated with a document or
    it is not. But it is natural to imagine degrees
    of aboutness.
  • We will use a single real number, a weight,
    capturing the strength of association between
    keyword and document.
  • The retrieval method can exploit these weights
    directly.

12
Weighting (2)
  • One way to describe what this weight means is
    probabilistic. We seek a measure of a document's
    relevance, conditioned on the belief that a
    keyword is relevant:
  • W_kd ∝ Pr(d relevant | k relevant)
  • This is a directed relation; we may or may not
    believe that the symmetric relation
    W_dk ∝ Pr(k relevant | d relevant) is the same.
  • Unless otherwise specified, when we speak of a
    weight w we mean W_kd.

13
Weighting (3)
  • In order to compute statistical estimates for
    such probabilities we define several important
    quantities:
  • F_kd: the number of occurrences of keyword k in
    document d
  • F_k: the total number of occurrences of keyword k
    across the entire corpus
  • D_k: the number of documents containing keyword k
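A small sketch of these counts on a toy corpus, with documents represented as lists of tokens; the two example documents are invented for illustration.

from collections import Counter

docs = [["artificial", "intelligence", "search"],
        ["search", "search", "heuristics"]]

F_kd = [Counter(doc) for doc in docs]                # occurrences of k in document d
F_k = Counter(w for doc in docs for w in doc)        # occurrences of k across the corpus
D_k = Counter(w for doc in docs for w in set(doc))   # number of documents containing k

print(F_kd[1]["search"], F_k["search"], D_k["search"])   # prints: 2 3 2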

14
Weighting (4)
  • We will make two demands on the weight, reflecting
    the degree to which a document is about a
    particular keyword or topic.
  • 1. Repetition is an indicator of emphasis. If an
    author uses a word frequently, it is because she
    or he thinks it's important. (F_kd)
  • 2. A keyword must be a useful discriminator
    within the context of the corpus. Capturing this
    notion statistically is more difficult; for now
    we just give it the name discrim_k.
  • Because we care about both, we will make our
    weight depend on both factors:
  • W_kd ∝ F_kd × discrim_k
  • Various index weighting schemes exist; they all
    use F_kd, but differ in how they quantify
    discrim_k.

15
(No Transcript)
16
Inverse document frequency (IDF)
  • Karen Sparck Jones said that from a
    discrimination point of view, we need to know the
    number of documents which contain a particular
    word.
  • The value of a keyword varies inversely with the
    log of the number of documents in which it
    occurs:
  • W_kd = F_kd × (log(NDoc / D_k) + 1)
  • where NDoc is the total number of documents in
    the corpus.
  • Variations on this formula exist (one is sketched
    below).
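A sketch of the formula above, reusing the F_kd and D_k counters from the earlier toy-corpus sketch; returning 0.0 for unseen keywords is an added assumption.

import math

def weight(k, d, F_kd, D_k, ndoc):
    # W_kd grows with the in-document count and with rarity across the corpus.
    if F_kd[d][k] == 0 or D_k[k] == 0:
        return 0.0
    return F_kd[d][k] * (math.log(ndoc / D_k[k]) + 1)

# e.g. with the counts from the previous sketch: weight("heuristics", 1, F_kd, D_k, 2)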

17
Vector Space Model (1)
  • In a library, closely related books are
    physically close together in three-dimensional
    space.
  • Search engines consider the abstract notion of
    semantic space, where documents about the same
    topic remain close together.
  • We will consider abstract spaces of thousands of
    dimensions.
  • We start with the index matrix relating each
    document in the corpus to all of its keywords.
  • Each and every keyword of the vocabulary is a
    separate dimension of a vector space. The
    dimensionality of the vector space is the size of
    our vocabulary.

18
Vector Space Model (2)
  • In addition to the vectors representing the
    documents, another vector corresponds to a query.
  • Because documents and queries exist within a
    common vector space, we seek those documents that
    are close to our query vector.
  • A simple (unnormalised) measure of proximity is
    the inner (or dot) product of query and document
    vectors:
  • Sim(q, d) = q · d
  • e.g. (1, 2, 3) · (10, 20, 30) = 10 + 40 + 90 = 140
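The same inner product as a short function, reproducing the numbers on this slide.

def dot(q, d):
    # Sum of pairwise products of query and document weights.
    return sum(qi * di for qi, di in zip(q, d))

print(dot([1, 2, 3], [10, 20, 30]))   # prints 140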

19
(No Transcript)
20
Vector Length Normalisation
  • Making weights sensitive to document length
  • Using the dot product alone, longer documents,
    containing more words (more verbose), are more
    likely to match the query than shorter ones, even
    if the scope (amount of actual information
    covered) is the same.
  • One solution is to use the cosine measure of
    similarity, sketched below.
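A sketch of the cosine measure: the dot product divided by the lengths of both vectors, so a verbose document is not favoured merely for being long.

import math

def cosine(q, d):
    # Dot product, normalised by the length of both vectors.
    dot = sum(qi * di for qi, di in zip(q, d))
    norm = math.sqrt(sum(x * x for x in q)) * math.sqrt(sum(x * x for x in d))
    return dot / norm if norm else 0.0

# (1, 2, 3) and (10, 20, 30) point in the same direction, so the cosine is 1.0
# even though the second vector is ten times longer.
print(cosine([1, 2, 3], [10, 20, 30]))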

21
(No Transcript)
22
(No Transcript)
23
Summary
  • Zipf's law: frequency × rank ≈ constant
  • Resolving power of keywords: TF × IDF
  • Exhaustivity vs. specificity
  • Vector space model
  • Cosine similarity measure