1
Term Frequency and IR
2
What is a good index term?
  • Occurs only in some documents
  • ???
  • Too often: matches the whole document set
  • Too seldom: matches too few documents
  • Need to know the distribution of terms!
  • Goal: get rid of poor index terms
  • Why?
  • Faster to process
  • Smaller indices
  • Better results

3
Look at
  • Index term extraction
  • Term distribution
  • Growth of vocabulary
  • Collocation of terms

4
Important Words?
  • Enron Ruling Leaves Corporate Advisers Open to
    Lawsuits
  • By KURT EICHENWALD. A ruling last week by a
    federal judge in Houston may well have
    accomplished what a year's worth of reform by
    lawmakers and regulators has failed to achieve:
    preventing the circumstances that led to Enron's
    stunning collapse from happening again.
  • To casual observers, Friday's decision
    by the judge, Melinda F. Harmon, may seem
    innocuous and not surprising. In it, she held
    that banks, law firms and investment houses
    (many of them criticized on Capitol Hill for
    helping Enron construct off-the-books
    partnerships that led to its implosion) could be
    sued by investors who are seeking to regain
    billions of dollars they lost in the debacle.

5
Index Term Preprocessing
  • Lexical normalization (get terms)
  • Stop lists (get rid of terms)
  • Stemming (collapse terms)
  • Thesaurus or categorization construction (replace
    terms)

6
Lexical Normalization
  • Stream of characters → index terms
  • Problems??
  • Are numbers good index terms?
  • Hyphens: online / on line / on-line
  • Punctuation: remove it?
  • Letter case?
  • Treat the query terms the same way
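A minimal Python sketch of one possible normalization policy; the specific choices here (lowercase everything, split hyphens, strip punctuation, keep digits) are assumptions, since each question above can be answered differently:

    import re

    def normalize(text):
        """One assumed policy: fold case, split hyphens,
        drop punctuation, keep digits."""
        text = text.lower()                    # letter case
        text = text.replace("-", " ")          # "on-line" -> "on line"
        return re.findall(r"[a-z0-9]+", text)  # punctuation, numbers

    print(normalize("On-line IR systems, circa 2002!"))
    # ['on', 'line', 'ir', 'systems', 'circa', '2002']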

7
Stop Lists
  • 10 most frequent words: > 20% of all occurrences
  • A standard list of 28 words: > 30%
  • Look at the 10 most frequent words in applet 1
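Building such a list is a short frequency tally; a minimal sketch (the token list is whatever your tokenizer produced):

    from collections import Counter

    def top_words(tokens, k=10):
        """Rank tokens by raw frequency; the head of this list is
        the usual raw material for a corpus-specific stop list."""
        counts = Counter(tokens)
        total = sum(counts.values())
        return [(w, c, round(c / total, 3))
                for w, c in counts.most_common(k)]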

8
Stemming
  • Plurals: car/cars
  • Variants: react/reaction/reacted/reacting
  • Category-based: adheres/adhesion/adhesive
  • Errors
  • Understemming: division/divide not conflated
  • Overstemming: experiment/experience conflated,
    divine/divide conflated
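A toy suffix-stripper (a sketch with an invented suffix list, not the Porter algorithm) reproduces both error modes:

    # Naive rule-based stemmer: strip the first matching suffix.
    SUFFIXES = ["ation", "ment", "ence", "ion", "ing", "ed", "s", "e"]

    def stem(word):
        for suf in SUFFIXES:
            if word.endswith(suf) and len(word) >= len(suf) + 4:
                return word[:-len(suf)]
        return word

    # Overstemming: unrelated words collapse to the same stem.
    print(stem("experiment"), stem("experience"))  # experi experi
    # Understemming: related words keep different stems.
    print(stem("division"), stem("divide"))        # divis divid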

9
Thesaurus
  • Control the vocabulary
  • Automobile (car, SUV, sedan, convertible, van,
    roadster, ...)
  • Problems?

10
What terms make good index terms?
  • Resolving power or selection power?
  • Most frequent?
  • Least frequent?
  • In between?
  • Why not use all of them?

11
Resolving Power
12
2. Distribution of Terms in Text
  • What terms occur very frequently?
  • What terms occur only once or twice?
  • What is the general distribution of terms in a
    document set?

13
Time magazine sample: 243,836 word occurrences
14
Zipf's Relationship
  • The frequency of the i-th most frequent term is
    inversely related to the frequency of the most
    frequent word:
  • f_i = f_1 / i^q
  • where q depends on the text (1-2)
  • Rank x frequency ≈ constant
  • constant ≈ 0.1 (when frequency is expressed as a
    probability; see slide 19)
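Plugging in illustrative numbers (f_1 = 10,000 and q = 1 are assumed values, not from any slide) shows how fast the predicted frequencies fall off:

    # Predicted f_i = f_1 / i**q for assumed f_1 = 10,000, q = 1.
    f1, q = 10_000, 1.0
    for i in range(1, 6):
        print(i, round(f1 / i ** q))
    # 1 10000, 2 5000, 3 3333, 4 2500, 5 2000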

15
Principle of Least Effort
  • Describe the weather today
  • Easier to use the same words!

17
Word frequency and vocabulary growth
[Figure: frequency F vs. rank; vocabulary size D vs. corpus size]
18
Zipf's Law
  • A few words occur a lot
  • Top 10 words: about 20% of occurrences
  • A lot of words occur rarely
  • Almost half of the terms occur only once

19
Actual Zipf's Law
  • Rank x frequency ≈ constant
  • Frequency, p_r, is the probability that a word taken
    at random from the N occurrences will have rank r
  • Given D unique words: Sum(p_r) = 1
  • r x p_r ≈ A
  • A ≈ 0.1
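A quick empirical check of r x p_r ≈ A on any token list (nothing here is specific to the Time sample):

    from collections import Counter

    def rank_times_prob(tokens, k=10):
        """Return (rank, word, r * p_r) for the top k ranks; under
        Zipf's law the last value should hover near A = 0.1."""
        counts = Counter(tokens)
        N = sum(counts.values())
        return [(r, w, round(r * c / N, 3))
                for r, (w, c) in enumerate(counts.most_common(k), 1)]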

20
Time magazine sample: 243,836 word occurrences
22
Using Zipf to predict frequencies
  • r x p_r ≈ A
  • A word occurring n times has rank r_n
  • r_n = AN/n
  • But several words may occur n times
  • We say r_n refers to the last word that occurs n times
  • So r_n words occur n or more times
  • The number of unique terms, D, is the highest rank,
    with n = 1
  • D = AN/1 = AN
  • Number of words occurring exactly n times: I_n
  • I_n = r_n - r_(n+1) = AN/(n(n+1))
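Worked numbers, using N = 243,836 from the Time sample and the A = 0.1 from the previous slide:

    # Predictions from r_n = A*N/n on the Time magazine sample.
    A, N = 0.1, 243_836

    D = A * N                      # unique words: the rank where n = 1
    def I(n):                      # words occurring exactly n times
        return A * N / (n * (n + 1))

    print(round(D))     # ~24384 unique words predicted
    print(round(I(1)))  # ~12192 singletons, about half of D (cf. slide 18)
    print(round(I(2)))  # ~4064 words occurring exactly twice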

23
Zipf and Power Law
  • A power law has the form y = k x^c
  • Zipf's law is a power law with c = -1:
  • r = (AN) n^(-1)
  • On a log-log plot, expect a straight line with slope
    c
  • So how does our Reuters data do?
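One way to answer that with a number: fit the slope of log(frequency) against log(rank) by least squares (a plain-Python sketch; any stats package does the same):

    import math

    def loglog_slope(freqs):
        """Least-squares slope of log(freq) vs. log(rank); Zipf
        predicts a value near -1. freqs must be sorted descending."""
        xs = [math.log(r) for r in range(1, len(freqs) + 1)]
        ys = [math.log(f) for f in freqs]
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        den = sum((x - mx) ** 2 for x in xs)
        return num / den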

24
Zipf log-log curve
[Figure: log frequency vs. log rank, a straight line with slope = c]
25
2. Vocabulary Growth
  • How quickly does the vocabulary grow as the size
    of the data corpus grows?
  • Upper bound?
  • Rate of increase of new words?
  • Can be derived from Zipf's law

26
Calculation
  • Given n term occurrences in a corpus:
  • D = k n^b
  • where 0 < b < 1, typically between 0.4 and 0.6
  • k is usually between 10 and 100
  • (n is the size of the corpus in words)
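A worked example with assumed mid-range parameters (k = 50, b = 0.5, chosen from the ranges above); note that growing the corpus 100x only grows the vocabulary about 10x:

    # Vocabulary estimates D = k * n**b for assumed k = 50, b = 0.5.
    k, b = 50, 0.5
    for n in (10_000, 100_000, 1_000_000):
        print(n, round(k * n ** b))
    # 10000 -> 5000, 100000 -> 15811, 1000000 -> 50000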

28
3. Collocation of Terms
  • Bag-of-words indexing is based on term
    independence
  • Why do we do this?
  • Should we do this?
  • What could we do if we kept collocation?

29
What is collocation?
  • Next to:
  • Tape deck
  • Day pass
  • Ordered:
  • Pass day
  • Deck tape
  • Adjacency

30
What data do you need to use collocation?
  • Word position
  • Relative to what?
  • What about updates?
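A minimal positional-index sketch; storing positions relative to the start of each document means appending new documents never disturbs existing postings:

    from collections import defaultdict

    def build_positional_index(docs):
        """Map term -> {doc_id: [word positions]}; positions are
        offsets from the start of each document."""
        index = defaultdict(lambda: defaultdict(list))
        for doc_id, text in docs.items():
            for pos, term in enumerate(text.lower().split()):
                index[term][doc_id].append(pos)
        return index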

31
Queries and Collocation
  • "Information retrieval" (exact phrase)
  • Information (-2) retrieval (within two words)
  • ??
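Against the positional index from the previous slide (reusing build_positional_index), both query types reduce to one position test; k = 1 gives the exact phrase, k = 2 the within-two-words query:

    def within(index, t1, t2, k=1):
        """Docs where t2 occurs at most k positions after t1;
        k = 1 is the exact-phrase case."""
        hits = set()
        for doc_id in set(index.get(t1, {})) & set(index.get(t2, {})):
            pos2 = set(index[t2][doc_id])
            if any(p + d in pos2 for p in index[t1][doc_id]
                   for d in range(1, k + 1)):
                hits.add(doc_id)
        return hits

    idx = build_positional_index({1: "information retrieval systems",
                                  2: "retrieval of information"})
    print(within(idx, "information", "retrieval"))       # {1}: phrase
    print(within(idx, "retrieval", "information", k=2))  # {2}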

32
Summary
  • We can use general knowledge about term
    distribution to:
  • Design more efficient systems
  • Choose effective indexing terms
  • Map queries to document indexes
  • Now what??
  • Using keywords in IR systems
  • Most common IR models