1
Term Frequency and IR
2
What is a good index term?
  • Occurs in only some documents
  • ???
  • Too often: retrieves the whole document set
  • Too seldom: retrieves too few matches
  • Need to know the distribution of terms!
  • Goal: get rid of poor index terms
  • Why?
  • Faster to process
  • Smaller indices
  • Better results

3
Look at
  • Index term extraction
  • Term distribution
  • Entropy
  • Growth of vocabulary
  • Collocation of terms

4
Important Words?
  • Enron Ruling Leaves Corporate Advisers Open to
    Lawsuits
  • By KURT EICHENWALD. A ruling last week by a
    federal judge in Houston may well have
    accomplished what a year's worth of reform by
    lawmakers and regulators has failed to achieve:
    preventing the circumstances that led to Enron's
    stunning collapse from happening again.
  • To casual observers, Friday's decision by the
    judge, Melinda F. Harmon, may seem innocuous and
    not surprising. In it, she held that banks, law
    firms and investment houses, many of them
    criticized on Capitol Hill for helping Enron
    construct off-the-books partnerships that led to
    its implosion, could be sued by investors who are
    seeking to regain billions of dollars they lost
    in the debacle.

5
1. Term Extraction: Preprocessing
  • Lexical normalization (get terms)
  • Stop lists (get rid of terms)
  • Stemming (collapse terms)
  • Thesaurus or categorization construction (replace
    terms)

6
Lexical Normalization
  • Stream of characters → index terms
  • Problems??
  • Numbers: good index terms?
  • Hyphens: online / on line / on-line
  • Punctuation: remove?
  • Letter case?
  • Treat the query terms the same way (see the
    sketch below)
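One plausible rendering of these rules in Python (a minimal sketch; the exact policy for numbers, hyphens, and case is a design choice, not prescribed by the slides):

    import re

    def normalize(text):
        """Turn a stream of characters into candidate index terms.

        Assumed policy: lowercase everything, join hyphenated
        words, strip other punctuation, drop purely numeric tokens.
        """
        text = text.lower()
        text = re.sub(r"(\w)-(\w)", r"\1\2", text)    # on-line -> online
        tokens = re.findall(r"[a-z0-9]+", text)       # strip punctuation
        return [t for t in tokens if not t.isdigit()] # drop bare numbers

    print(normalize("On-line IR systems: 2 judges ruled in 2002."))
    # ['online', 'ir', 'systems', 'judges', 'ruled', 'in']

The same function must be applied to query terms, so documents and queries meet in the same term space.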

7
Stop Lists
  • 10 most frequent words > 20% of occurrences
  • Standard list of 28 words > 30%
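A quick way to see why stop lists pay off is to measure how much of a corpus its most frequent words cover. A sketch (corpus.txt is a hypothetical file; substitute any real corpus):

    from collections import Counter

    def top_k_coverage(tokens, k=10):
        """Fraction of all token occurrences taken by the k most frequent words."""
        counts = Counter(tokens)
        return sum(c for _, c in counts.most_common(k)) / len(tokens)

    tokens = open("corpus.txt").read().lower().split()  # hypothetical path
    print(f"top-10 words cover {top_k_coverage(tokens):.1%} of occurrences")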

8
Stemming
  • Plurals: car/cars
  • Variants: react/reaction/reacted/reacting
  • Category based: adheres/adhesion/adhesive
  • Errors:
  • Understemming: division/divide kept apart
  • Overstemming: experiment/experience conflated;
    divine/divide conflated
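These classic error pairs can be checked with NLTK's Porter stemmer (assuming the nltk package is installed; the word list is just for illustration):

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    for w in ["cars", "division", "divide", "experiment", "experience"]:
        print(w, "->", stemmer.stem(w))
    # division -> divis, divide -> divid          (understemming: kept apart)
    # experiment -> experi, experience -> experi  (overstemming: conflated)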

9
Thesaurus
  • Control the vocabulary
  • Automobile: (car, suv, sedan, convertible, van,
    roadster, ...)
  • Problems?
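One simple realization of a controlled vocabulary is a synonym-to-concept map applied after normalization (a sketch; the entries are illustrative):

    # Map surface terms to a controlled concept term (illustrative entries).
    THESAURUS = {
        "car": "automobile", "suv": "automobile", "sedan": "automobile",
        "convertible": "automobile", "van": "automobile",
        "roadster": "automobile",
    }

    def apply_thesaurus(tokens):
        return [THESAURUS.get(t, t) for t in tokens]

    print(apply_thesaurus(["red", "sedan", "for", "sale"]))
    # ['red', 'automobile', 'for', 'sale']

Ambiguity is the obvious problem: "convertible" the car and "convertible" the bond both collapse to automobile here.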

10
What terms make good index/query terms?
  • Resolving power or selection power
  • Most frequent?
  • Least frequent?
  • In between?
  • Why not use all of them?

11
Resolving Power
12
2. Choosing Terms: 2a. Distribution of Terms in Text
  • What terms occur very frequently?
  • What terms occur only once or twice?
  • What is the general distribution of terms in a
    document set?

13
(No Transcript)
14
Time magazine sample: 243,836 word occurrences
15
Zipf's Relationship
  • The frequency of the i-th most frequent term is
    inversely related to the frequency of the most
    frequent word:
  • f_i = f_1 / i^q
  • where q depends on the text (roughly 1-2)
  • Rank × frequency ≈ constant
  • constant ≈ 0.1 (with frequency expressed as a
    fraction of all occurrences)

16
Principle of Least Effort
  • Describe the weather today
  • Easier to use the same words!

17
Word Frequency and Vocabulary Growth
(Figure: two sketches, word frequency F against rank, and
vocabulary size D against corpus size)
18
Zipf's Law
  • A few words occur a lot
  • Top 10 words: about 20% of occurrences
  • A lot of words occur rarely
  • Almost half of the terms occur only once

19
Actual Zipf's Law
  • Rank × frequency = constant
  • The frequency, p_r, is the probability that a word
    taken at random from N occurrences will have rank r
  • Given D unique words: Σ_r p_r = 1
  • r × p_r = A
  • A ≈ 0.1
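Together these two constraints fix the vocabulary size: Σ_{r=1}^{D} A/r = A·H_D = 1, so the harmonic number H_D must equal 1/A. A quick check (illustrative, not from the slides):

    A = 0.1
    h, D = 0.0, 0
    while h < 1 / A:   # find D whose harmonic number H_D reaches 1/A = 10
        D += 1
        h += 1 / D
    print(D)           # ~12,367 distinct words make the probabilities sum to 1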

20
(No Transcript)
21
Using Zipf to Predict Frequencies
  • r × p_r = A
  • A word occurring n times has rank r_n = AN/n
  • But several words may occur n times
  • We say r_n refers to the last word that occurs n
    times
  • So r_n words occur n or more times
  • The number of unique terms, D, is the highest
    rank, where n = 1: D = AN/1 = AN
  • The number of words occurring exactly n times:
    I_n = r_n - r_(n+1) = AN/n - AN/(n+1) = AN/(n(n+1))
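A sketch that turns these formulas into numbers, using the Time magazine sample size for N. Note that I_1/D = 1/2, which is exactly the "almost half of the terms occur only once" claim from the earlier slide:

    A, N = 0.1, 243_836   # A from Zipf; N from the Time magazine sample

    D = A * N             # predicted number of unique terms
    def I(n):             # predicted count of words occurring exactly n times
        return A * N / (n * (n + 1))

    print(f"predicted vocabulary D = {D:.0f}")
    for n in (1, 2, 3):
        print(f"words occurring {n} time(s): {I(n):.0f} ({I(n) / D:.0%})")
    # 50% of words occur once, 17% twice, 8% three times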

22
Zipf and Power Law
  • A power law has the form y = k x^c
  • Zipf is a power law with c = -1:
  • r = (AN) n^(-1)
  • On a log-log plot, expect a straight line with
    slope c
  • So how does our Reuters data do?
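A sketch of the log-log check using synthetic Zipfian data (NumPy's zipf sampler stands in for a real corpus such as the Reuters data; on real text you would count words with Counter(tokens) instead):

    import numpy as np
    from collections import Counter

    rng = np.random.default_rng(0)
    samples = rng.zipf(a=2.0, size=200_000)  # power-law distributed draws

    freqs = np.array(sorted(Counter(samples).values(), reverse=True), float)
    ranks = np.arange(1, len(freqs) + 1)

    # Fit the head of the curve: log f = c * log r + const.
    head = slice(0, 100)
    c, _ = np.polyfit(np.log(ranks[head]), np.log(freqs[head]), 1)
    print(f"fitted log-log slope c = {c:.2f}")  # roughly -2.0, i.e. -a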

23
Time magazine sample: 243,836 word occurrences
24
Zipf Log-Log Curve
(Figure: log frequency against log rank, a straight line
with slope c)
25
DO Lab 1!!!
26
2b. Vocabulary Growth
  • How quickly does the vocabulary grow as the size
    of the data corpus grows?
  • Upper bound?
  • Rate of increase of new words?
  • Can be derived from Zipf's law

27
Calculation
  • Given n term occurrences in a corpus:
  • D = k n^b
  • where 0 < b < 1, typically between .4 and .6
  • k is usually between 10 and 100
  • (n is the size of the corpus in words)
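A sketch of what these constants imply, with k and b taken from the middle of the quoted ranges (this rule is widely known as Heaps' law):

    k, b = 50, 0.5   # assumed values from the quoted ranges

    for n in (10_000, 100_000, 1_000_000, 10_000_000):
        print(f"corpus of {n:>10,} words -> ~{k * n ** b:,.0f} unique terms")
    # No fixed upper bound, but growth keeps slowing:
    # with b = 0.5, 100x more text gives only 10x more terms.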

28
(No Transcript)
29
2c. Entropy and Choosing Terms
  • Recall that entropy measures information
    content/uncertainty; low is good
  • So we want to find terms that reduce the entropy
    of the document
  • Where are these on our Zipf curve?
  • The maximum likelihood estimate (MLE) uses the
    frequency of a term in a document and the sum of
    all term frequencies in the document:
  • p(t|d) = f(t,d) / Σ_t' f(t',d)
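A sketch that computes the MLE term distribution of a document and its entropy (the toy document is made up for illustration):

    from collections import Counter
    from math import log2

    def term_entropy(tokens):
        """Entropy of the MLE distribution p(t|d) = f(t,d) / sum of f(t',d)."""
        counts = Counter(tokens)
        total = sum(counts.values())
        return -sum((c / total) * log2(c / total) for c in counts.values())

    doc = "the judge ruled the banks could be sued the judge ruled".split()
    print(f"{term_entropy(doc):.2f} bits")  # ~2.7 bits for this toy document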

30
3. Collocation of Terms
  • Bag-of-words indexing is based on term
    independence
  • Why do we do this?
  • Should we do this?
  • What could we do if we kept collocation?

31
What is collocation?
  • Next to:
  • Tape deck
  • Day pass
  • Ordered:
  • Pass day
  • Deck tape
  • Adjacency

32
What data do you need to use collocation?
  • Word position
  • Relative to what?
  • What about updates?

33
Queries and Collocation
  • Information retrieval
  • Information (- 2) retrieval
  • ?? (a proximity constraint; see the sketch below)
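The second query reads like a proximity query: "information" within two positions of "retrieval". A minimal positional-index sketch (the index layout and query function are illustrative, not any specific system's API):

    from collections import defaultdict

    def build_positional_index(docs):
        """Map term -> {doc_id: [positions]}: the data collocation needs."""
        index = defaultdict(dict)
        for doc_id, text in docs.items():
            for pos, term in enumerate(text.lower().split()):
                index[term].setdefault(doc_id, []).append(pos)
        return index

    def within(index, t1, t2, k):
        """Doc ids where t1 and t2 occur within k positions of each other."""
        hits = set()
        for doc_id, ps1 in index.get(t1, {}).items():
            ps2 = index.get(t2, {}).get(doc_id, [])
            if any(abs(p1 - p2) <= k for p1 in ps1 for p2 in ps2):
                hits.add(doc_id)
        return hits

    docs = {1: "information storage and retrieval systems",
            2: "retrieval of information",
            3: "information retrieval in practice"}
    print(within(build_positional_index(docs), "information", "retrieval", 2))
    # {2, 3}: doc 1 has the terms 3 positions apart, so it is excluded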

34
Summary
  • We can use general knowledge about term
    distribution to:
  • Design more efficient systems
  • Choose effective indexing terms
  • Map queries to document indexes
  • Now what??
  • Using keywords in IR systems
  • IR models