1
Weighting and Matching against Indices

2
Zipf's Law
  • In any corpus, such as the AIT, we can count how
    often each word occurs in the corpus as a whole:
    the word frequency, F(w).
  • Now imagine that we've sorted the vocabulary
    according to frequency, so that the most
    frequently occurring word has rank 1, the next
    most frequent word has rank 2, and so on.
  • Zipf (1949) found the following empirical
    relation:
  • F(w) ≈ C / rank(w)^a, where a ≈ 1 and C ≈ 20,000.
  • If a = 1, rank × frequency is approximately
    constant (see the sketch below).
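A minimal sketch (not from the slides) of checking this rank-frequency relation on a plain-text corpus; the file name corpus.txt and the simple whitespace tokenisation are illustrative assumptions.

from collections import Counter

def zipf_table(text, top=10):
    # Word frequency F(w), counted over the corpus as a whole.
    freqs = Counter(text.lower().split())
    # Sort by descending frequency so the most frequent word has rank 1.
    for rank, (word, f) in enumerate(freqs.most_common()[:top], start=1):
        # Under Zipf's law with a = 1, rank * F(w) stays roughly constant (about C).
        print(rank, word, f, rank * f)

zipf_table(open("corpus.txt").read())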

3
(No Transcript)
4
Consequences of lexical decisions on word
frequencies
  • Noise words occur frequently.
  • External keywords are also frequent (they tell
    you what the corpus is about, but do not help
    index individual documents).
  • Zipf's Law is seen with and without stemming.

5
Other applications of Zipf's Law
  • Number of unique visitors vs. rank of website
  • Number of speakers of each language
  • Prize money won by golfers
  • Frequency of DNA codons
  • Size of avalanches of grains of sand
  • Frequency of English surnames

6
Resolving Power (1)
  • Luhn (1957): "It is hereby proposed that the
    frequency of word occurrence in an article
    furnishes a useful measurement of word
    significance."
  • If a word is found frequently in a document, more
    frequently than we would expect, it reflects
    emphasis on the part of the author within that
    document.
  • But the raw frequency of occurrence in a document
    is only one of two critical statistics
    recommending good keywords.
  • For example, almost every article in the AIT
    contains the words ARTIFICIAL INTELLIGENCE.

7
Resolving Power (2)
  • Thus we prefer keywords which discriminate
    between documents (i.e. keywords found only in
    some documents).
  • Resolving power is the ability to discriminate
    content.
  • Mid-frequency terms have the greatest resolving
    power.
  • Luhn did not provide a method of establishing the
    maximal and minimal occurrence thresholds.
  • Simple methods: the frequency of stop-list words
    gives an upper limit; words which appear only
    once can only index one document (a rough sketch
    follows).
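A rough sketch of such simple thresholding, assuming documents are given as lists of tokens; the cut-offs min_docs and max_fraction are illustrative choices, not Luhn's.

from collections import Counter

def candidate_keywords(documents, min_docs=2, max_fraction=0.5):
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter(word for doc in documents for word in set(doc))
    ndoc = len(documents)
    # Keep mid-frequency terms: drop words found in too few or too many documents.
    return [w for w, df in doc_freq.items()
            if df >= min_docs and df / ndoc <= max_fraction]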

8
(No Transcript)
9
Exhaustivity and Specificity
  • An index is exhaustive if it includes many
    topics.
  • An index is specific if users can precisely
    identify their information needs.
  • Trade-off: high recall is easiest when an index
    is exhaustive but not very specific; high
    precision is best accomplished when the index is
    highly specific but not very exhaustive; the best
    index will strive for a balance.
  • If a document is indexed with many keywords, it
    will be retrieved more often (representation
    bias); we can expect higher recall, but precision
    will suffer.
  • We can also analyse the problem from a
    query-oriented perspective: how well do the query
    terms discriminate one document from another?

10
(No Transcript)
11
Weighting the Index Relation
  • The simplest notion of an index is binary:
    either a keyword is associated with a document or
    it is not. But it is natural to imagine degrees
    of aboutness.
  • We will use a single real number, a weight,
    capturing the strength of association between
    keyword and document.
  • The retrieval method can exploit these weights
    directly.

12
Weighting (2)
  • One way to describe what this weight means is
    probabilistic. We seek a measure of a document's
    relevance, conditioned on the belief that a
    keyword is relevant:
  • W_kd ∝ Pr(d relevant | k relevant)
  • This is a directed relation; we may or may not
    believe that the symmetric relation
    W_dk ∝ Pr(k relevant | d relevant) is the same.
  • Unless otherwise specified, when we speak of a
    weight w we mean W_kd.

13
Weighting (3)
  • In order to compute statistical estimates for
    such probabilities we define several important
    quantities:
  • F_kd: the number of occurrences of keyword k in
    document d
  • F_k: the total number of occurrences of keyword k
    across the entire corpus
  • D_k: the number of documents containing keyword k
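A small sketch of these counts on a toy corpus, with documents represented as lists of tokens; the two example documents are invented for illustration.

from collections import Counter

docs = [["artificial", "intelligence", "search"],
        ["search", "search", "heuristics"]]

F_kd = [Counter(doc) for doc in docs]                # occurrences of k in document d
F_k = Counter(w for doc in docs for w in doc)        # occurrences of k across the corpus
D_k = Counter(w for doc in docs for w in set(doc))   # number of documents containing k

print(F_kd[1]["search"], F_k["search"], D_k["search"])   # prints: 2 3 2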

14
Weighting (4)
  • We will make two demands on the weight, reflecting
    the degree to which a document is about a
    particular keyword or topic.
  • 1. Repetition is an indicator of emphasis. If an
    author uses a word frequently, it is because she
    or he thinks it's important. (F_kd)
  • 2. A keyword must be a useful discriminator
    within the context of the corpus. Capturing this
    notion statistically is more difficult; for now
    we just give it the name discrim_k.
  • Because we care about both, we will make our
    weight depend on both factors:
  • W_kd ∝ F_kd × discrim_k
  • Various index weighting schemes exist; they all
    use F_kd, but differ in how they quantify
    discrim_k.

15
(No Transcript)
16
Inverse document frequency (IDF)
  • Karen Sparck Jones said that from a
    discrimination point of view, we need to know the
    number of documents which contain a particular
    word.
  • The value of a keyword varies inversely with the
    log of the number of documents in which it
    occurs:
  • W_kd = F_kd × (log(NDoc / D_k) + 1)
  • where NDoc is the total number of documents in
    the corpus.
  • Variations on this formula exist (one is sketched
    below).
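A sketch of the formula above, reusing the F_kd and D_k counters from the earlier toy-corpus sketch; returning 0.0 for unseen keywords is an added assumption.

import math

def weight(k, d, F_kd, D_k, ndoc):
    # W_kd grows with the in-document count and with rarity across the corpus.
    if F_kd[d][k] == 0 or D_k[k] == 0:
        return 0.0
    return F_kd[d][k] * (math.log(ndoc / D_k[k]) + 1)

# e.g. with the counts from the previous sketch: weight("heuristics", 1, F_kd, D_k, 2)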

17
Vector Space Model (1)
  • In a library, closely related books are
    physically close together in three-dimensional
    space.
  • Search engines consider the abstract notion of
    semantic space, where documents about the same
    topic remain close together.
  • We will consider abstract spaces of thousands of
    dimensions.
  • We start with the index matrix relating each
    document in the corpus to all of its keywords.
  • Each and every keyword of the vocabulary is a
    separate dimension of a vector space. The
    dimensionality of the vector space is the size of
    our vocabulary.

18
Vector Space Model (2)
  • In addition to the vectors representing the
    documents, another vector corresponds to a query.
  • Because documents and queries exist within a
    common vector space, we seek those documents that
    are close to our query vector.
  • A simple (unnormalised) measure of proximity is
    the inner (or dot) product of query and document
    vectors:
  • Sim(q, d) = q · d
  • e.g. (1, 2, 3) · (10, 20, 30) = 10 + 40 + 90 = 140
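The same inner product as a short function, reproducing the numbers on this slide.

def dot(q, d):
    # Sum of pairwise products of query and document weights.
    return sum(qi * di for qi, di in zip(q, d))

print(dot([1, 2, 3], [10, 20, 30]))   # prints 140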

19
(No Transcript)
20
Vector Length Normalisation
  • Making weights sensitive to document length
  • Using the dot product alone, longer documents,
    containing more words (more verbose), are more
    likely to match the query than shorter ones, even
    if the scope (amount of actual information
    covered) is the same.
  • One solution is to use the cosine measure of
    similarity, sketched below.
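A sketch of the cosine measure: the dot product divided by the lengths of both vectors, so a verbose document is not favoured merely for being long.

import math

def cosine(q, d):
    # Dot product, normalised by the length of both vectors.
    dot = sum(qi * di for qi, di in zip(q, d))
    norm = math.sqrt(sum(x * x for x in q)) * math.sqrt(sum(x * x for x in d))
    return dot / norm if norm else 0.0

# (1, 2, 3) and (10, 20, 30) point in the same direction, so the cosine is 1.0
# even though the second vector is ten times longer.
print(cosine([1, 2, 3], [10, 20, 30]))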

21
(No Transcript)
22
(No Transcript)
23
Summary
  • Zipf's law: frequency × rank ≈ constant
  • Resolving power of keywords: TF × IDF
  • Exhaustivity vs. specificity
  • Vector space model
  • Cosine similarity measure