Chap 14 Ranking Algorithm - PowerPoint PPT Presentation

About This Presentation
Title:

Chap 14 Ranking Algorithm

Description:

... Providing very poor service for end-users who use the system infrequently The ranking approach Inputting a natural ... sort step of the ... Merge postings from ... – PowerPoint PPT presentation

Number of Views:40
Avg rating:3.0/5.0
Slides: 26
Provided by: jackwu
Category:

less

Transcript and Presenter's Notes

Title: Chap 14 Ranking Algorithm


1
Chap 14 Ranking Algorithm
  • ???? ??? ??
  • ?? ??? ???

2
Outline
  • Introduction
  • Ranking models
  • Selecting ranking techniques
  • Data structures and algorithms
  • The creation of an inverted file
  • Searching the inverted file
  • Stemmed and unstemmed query terms
  • A Boolean systems with ranking
  • Pruning

3
Introduction
  • Boolean systems
  • Providing powerful on-line search capabilities
    for librarians and other trained intermediaries
  • Providing very poor service for end-users who use
    the system infrequently
  • The ranking approach
  • Inputting a natural language query without
    Boolean syntax
  • Producing a list of ranked records that answer
    the query
  • More oriented toward end-users

4
Introduction (cont.)
  • Natural language/ranking approach
  • is more effective for end-users
  • The results being ranked based on co-occurrence
    of query terms
  • modified by statistical term-weighting
  • eliminating the often-wrong Boolean syntax used
    by end-users
  • providing some results even if a query term is
    incorrect

5
Figure 14.1 Statistical ranking
Term Factors Information Help Human Operation Retrieval systems
Qry. Human factors in information retrieval systems Human factors in information retrieval systems Human factors in information retrieval systems Human factors in information retrieval systems Human factors in information retrieval systems Human factors in information retrieval systems Human factors in information retrieval systems
Vtr. 1 1 0 1 0 1 1
Rec1. Human, factors, information, retrieval Human, factors, information, retrieval Human, factors, information, retrieval Human, factors, information, retrieval Human, factors, information, retrieval Human, factors, information, retrieval Human, factors, information, retrieval
Vtr. 1 1 0 1 0 1 0
Rec2. Human, factors, help, systems Human, factors, help, systems Human, factors, help, systems Human, factors, help, systems Human, factors, help, systems Human, factors, help, systems Human, factors, help, systems
Vtr. 1 0 1 1 0 0 1
Rec3. Factors, operation, systems Factors, operation, systems Factors, operation, systems Factors, operation, systems Factors, operation, systems Factors, operation, systems Factors, operation, systems
Vtr. 1 0 0 0 1 0 1
6
Figure 14.1 Statistical ranking
  • Simple Match
  • Query (1 1 0 1 0 1 1)
  • Rec1 (1 1 0 1 0 1 0)
  • (1 1 0 1 0 1 0) 4
  • Query (1 1 0 1 0 1 1)
  • Rec2 (1 0 1 1 0 0 1)
  • (1 0 0 1 0 0 1) 3
  • Query (1 1 0 1 0 1 1)
  • Rec3 (1 0 0 0 1 0 1)
  • (1 0 0 0 0 0 1) 2
  • Weighted Match
  • Query (1 1 0 1 0 1 1)
  • Rec1 (2 3 0 5 0 3 0)
  • (2 3 0 5 0 3 0) 13
  • Query (1 1 0 1 0 1 1)
  • Rec2 (2 0 4 5 0 0 1)
  • (2 0 0 5 0 0 1) 8
  • Query (1 1 0 1 0 1 1)
  • Rec3 (2 0 0 0 2 0 1)
  • (2 0 0 0 0 0 1) 3

7
Ranking models
  • Two types of ranking models
  • ranking the query against Individual documents
  • Vector space model
  • Probabilistic model
  • ranking the query against entire sets of related
    documents

8
Ranking models (cont.)
  • Vector space model
  • Using cosine correlation to compute similarity
  • Early experiments
  • SMART system (overlap similarity function)
  • Results
  • Within document frequency weighting gt no term
    weighting
  • Cosine correlation with frequency term weighting
    gt overlap similarity function
  • Salton Yang (1973) (Relying on term importance
    within an entire collection)
  • Results
  • Significant performance improvement using the
    within-document frequency weighting the
    inverted document frequency (IDF)

9
Ranking models (cont.)
  • Probabilistic model
  • Terms appearing in previously retrieved relevant
    documents was given a higher weight
  • Croft and Harper (1979)
  • Probabilistic indexing without any relevance
    information
  • Assuming all query terms have equal probability
  • Deriving a term-weighting formula

10
Ranking models (cont.)
  • Probabilistic model
  • Croft (1983)
  • Incorporating within-document frequency weights
  • Using a tuning factor K
  • Result
  • Significant improvement over both the IDF
    weighting alone and the combination weighting

11
Other experiments involving ranking
  • Direct comparison of similarity measures and
    term-weighting schemes
  • 4 types of term frequency weightings (Sparch
    Jones,1973)
  • Term frequency within a document
  • Term frequency within a collection
  • Term postings within a document (a binary
    measure)
  • Term postings within a collection
  • Indexing was taken from manually extracted
    keywords
  • Results
  • Using the term frequency (or postings) within a
    collection always improved performance
  • Using term frequency ( or postings) within a
    document improved performance only for some
    collections

12
Other experiments involving ranking (cont.)
  • Harman(1986)
  • Four term-weighting factors
  • (a) The number of matches between a document a
    query
  • (b) The distribution of a term within a document
    collection
  • IDF noise measure
  • (c) The frequency of a term within a document
  • (d) The length of the document
  • Results
  • Using the single measures alone, the distribution
    of the term within the collection 2 (c)
  • Combining the within-document frequency with
    either the IDF or noise measure 2 (using the
    IDF or noise alone)

13
Other experiments involving ranking (cont.)
  • Ranking based on document structure
  • Not only using weights based on term importance
    both within an entire collection and within a
    given document (Bernstein and Williamson, 1984)
  • But also using the structural position of the
    term
  • Summary versus text paragraphs
  • In SIBRIS, increasing term-weights for terms in
    titles of documents and decreasing term-weights
    for terms added to a query from a thesaurus

14
Selecting ranking techniques
  • Using term-weighting based on the distribution of
    a term within a collection
  • always improves performance
  • Within-document frequency IDF weight
  • often provides even more improvement
  • Within-document frequency (Several methods) IDF
    measure
  • Adding additional weight for document structure
  • Eg. higher weightings for terms appearing in the
    title or abstract vs. those appearing only in the
    text
  • Relevance weighting (Chap 11)

15
The creation of an inverted file
  • Implications for supporting inverted file
    structures
  • Only the record id has to be stored (smaller
    index)
  • Using strategies that increase recall at the
    expense of precision
  • Inverted file is usually split into two pieces
    for searching
  • The dictionary containing the term, along with
    statistics about that term such as no. of
    postings and IDF, and a pointer to the location
    of the postings file for term
  • The postings file containing the record ids and
    the weights for all occurrences of the term

16
The creation of an inverted file (cont.)
  • 4 major options for storing weights in the
    postings file
  • Store the raw frequency
  • Slowest search
  • Most flexible
  • Store a normalized frequency
  • Not suitable for use with the cosine similarity
    function
  • Updating would not change the postings

17
The creation of an inverted file (cont.)
  • Store the completely weighted term
  • Any of the combination weighting schemes are
    suitable
  • Disadvantage updating requires changing all
    postings
  • If no within-record weighting is used, then the
    postings records do not have to store weights

18
Searching the inverted file
  • Figure 14.4 flowchart of search engine

query
parser
Dictionary Lookup
Dictionary entry
Get Weights
Record numbers on a per term basis
Accumulator
Record numbers. Total weights
Sort by weight
Ranked record numbers
19
Searching the inverted file (cont.)
  • Inefficiencies of this technique
  • The I/O needs to be minimized
  • A single read for all the postings of a given
    term, and then separating the buffer into record
    ids and weights
  • Time savings can be gained at the expense of some
    memory space
  • Direct access to memory rather than through
    hashing
  • A final major bottleneck can be the sort step of
    the accumulators for large data sets
  • Fast sort of thousands of records is very time
    consuming

20
Stemmed and unstemmed query terms
  • If query terms were automatically stemmed in a
    ranking system, users generally got better
    results (Frakes, 1984 Canadela, 1990)
  • In some cases, a stem is produced that leads to
    improper results
  • the original record terms are not stored in the
    inverted file only their stems are used

21
Stemmed and unstemmed query terms (cont.)
  • Harman Candela (1990)
  • 2 separate inverted files could be created and
    stored
  • Stem terms normal query
  • Unstemmed terms dont stem
  • Hybrid inverted file
  • Saving no space in the dictionary part
  • Saving considerable storage (2 versions of
    posting)
  • At the expense of some additional search time

22
A Boolean systems with ranking
  • SIRE system
  • Full Boolean capability a variation of the
    basic search process
  • Accepts queries that are either Boolean logic
    strings or natural language queries (implicit OR)
  • Major modification to the basic search process
  • Merge postings from the query terms before
    ranking is done
  • Performance
  • Faster response time for Boolean queries
  • No increase in response time for natural language
    queries

23
Pruning
  • A major time bottleneck in the basic search
    process
  • The sort of the accumulators for large data sets
  • Changed search algorithm with pruning
  • Sort all query terms (stems) by decreasing IDF
    value
  • Do a binary search for the first term (i.e., the
    highest IDF) and get the address of the postings
    list for that term
  • Read the entire postings file for that term into
    a buffer and add the term weights for each record
    id into the contents of the unique accumulator
    for the record id

24
Pruning (cont.)
  1. Check the IDF of the next query term. If the IDF
    gt 1/3 (max IDF of any term in the data set)
    then repeat steps 2, 3, and 4 otherwise
    repeat steps 2, 3, and 4, but do not add weights
    to zero weight accumulators
  2. Sort the accumulators with nonzero weights to
    produce the final ranked record list
  3. If a query has only high-frequency terms, then
    pruning cannot be done.

25
  • Thanks
Write a Comment
User Comments (0)
About PowerShow.com