CS 430: Information Discovery

Slides: 31
Provided by: wya1
1
CS 430: Information Discovery
Lecture 11: The Boolean Model
2
Course Administration
Assignment 2: A revised version of the assignment has been posted today.
Assignment 1: If you have questions about your grading, send me email. The following are reasonable requests: the wrong files were graded, points were added up wrongly, comments are unclear, etc. We are not prepared to argue over details of judgment. If you ask for a regrade, the final grade may be lower than the original!

3
Boolean Diagram
[Venn diagram of two sets A and B, with regions labeled: A and B, A or B, not (A or B)]
4
Query Languages: Boolean Queries
Boolean query: two or more search terms, related by logical operators, e.g., and, or, not.
Examples:
  • abacus and actor
  • abacus or actor
  • (abacus and actor) or (abacus and atoll)
  • not actor
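As a minimal sketch, the four example queries can be read as set operations over document ids. The toy collection `docs` and the helper `having` are hypothetical and only serve the illustration.

```python
# Hypothetical toy collection: document id -> set of index terms.
docs = {
    1: {"abacus", "actor"},
    2: {"abacus", "atoll"},
    3: {"actor"},
    4: {"aspen"},
}


def having(term):
    """Return the ids of all documents that contain `term`."""
    return {doc_id for doc_id, terms in docs.items() if term in terms}


all_docs = set(docs)

q_and = having("abacus") & having("actor")      # abacus and actor
q_or = having("abacus") | having("actor")       # abacus or actor
q_mixed = (having("abacus") & having("actor")) | (
    having("abacus") & having("atoll")
)                                               # (abacus and actor) or (abacus and atoll)
q_not = all_docs - having("actor")              # not actor
```

Here "and" becomes intersection, "or" becomes union, and "not" becomes the complement with respect to the whole collection.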
5
Query Languages: Proximity Operators
  • abacus adj actor: the terms abacus and actor are adjacent to each other, as in the string "abacus actor".
  • abacus near 4 actor: the terms abacus and actor are near to each other, as in the string "the actor has an abacus".
Some systems support other operators, such as with (two terms in the same sentence) or same (two terms in the same paragraph).
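A rough sketch of how the two operators might be checked against word positions follows. It assumes that adj is order-sensitive and that "near 4" means "at most 4 word positions apart, in either order"; the slide does not define either detail, and the function names are hypothetical.

```python
def positions(text, term):
    """Word positions (0-based) at which `term` occurs in `text`."""
    words = text.lower().split()
    return [i for i, word in enumerate(words) if word == term]


def adj(pos_a, pos_b):
    """True if some occurrence of b immediately follows an occurrence of a."""
    following = set(pos_b)
    return any(p + 1 in following for p in pos_a)


def near(pos_a, pos_b, k):
    """True if some occurrences of a and b are at most k positions apart."""
    return any(abs(p - q) <= k for p in pos_a for q in pos_b)


text = "the actor has an abacus"
abacus_pos = positions(text, "abacus")
actor_pos = positions(text, "actor")
is_near = near(abacus_pos, actor_pos, 4)   # positions 4 and 1 are 3 apart
is_adj = adj(abacus_pos, actor_pos)        # not adjacent in this string
```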
6
Query Languages: Conventions
By convention, stop words and punctuation are ignored.
  • swan adj 41 matches "John Swan, 41 Main Street."
  • information adj retrieval matches " ... information on retrieval methods ..."
7
Query Languages: Pattern Matching
  • Prefix: "comp?" matches any word beginning "comp".
  • Suffix: "?tal" matches any word ending "tal".
  • Ranges: "1920...1925" matches any number between 1920 and 1925.
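The three pattern forms could be tested against a single word roughly as below. This is a sketch under assumptions: the function name `matches` is hypothetical, and the range is taken to include both endpoints, which the slide does not state.

```python
def matches(pattern, word):
    """Check one word against a prefix, suffix, or range pattern."""
    if "..." in pattern:                       # range, e.g. "1920...1925"
        low, high = pattern.split("...")
        # Assumption: both endpoints are included in the range.
        return word.isdecimal() and int(low) <= int(word) <= int(high)
    if pattern.endswith("?"):                  # prefix, e.g. "comp?"
        return word.startswith(pattern[:-1])
    if pattern.startswith("?"):                # suffix, e.g. "?tal"
        return word.endswith(pattern[1:])
    return word == pattern                     # plain term
```

For example, `matches("comp?", "computer")` and `matches("?tal", "digital")` both hold, while `matches("1920...1925", "1926")` does not.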
8
Evaluation of Boolean Operators
Precedence of operators must be defined:
  • adj, near (high)
  • and, not
  • or (low)
Example: A and B or C and B is evaluated as (A and B) or (C and B).
9
Thoughts on Evaluating a Boolean Expression
General approach: specify the grammar for valid expressions, then write an interpreter defined by this grammar:
  -> parse the query to create an expression
  -> evaluate each document
Simple approach for Assignment 2: for each document:
  -> scan the expression for the highest-priority operator and evaluate it
  -> repeat until all operators have been evaluated
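The general approach can be sketched as a small recursive-descent evaluator. This is an illustration, not the assignment's required design: it handles only and, or, not and parentheses, treats not as a unary prefix operator (an assumption), and does no error handling for malformed queries.

```python
import re


def evaluate(query, doc_terms):
    """Evaluate a Boolean query against one document's set of (lowercase) terms."""
    toks = re.findall(r"\(|\)|[^\s()]+", query.lower())
    pos = 0

    def peek():
        return toks[pos] if pos < len(toks) else None

    def take():
        nonlocal pos
        tok = toks[pos]
        pos += 1
        return tok

    def parse_or():                  # "or" has the lowest precedence
        value = parse_and()
        while peek() == "or":
            take()
            right = parse_and()      # always parse, so tokens are consumed
            value = value or right
        return value

    def parse_and():                 # "and" binds tighter than "or"
        value = parse_not()
        while peek() == "and":
            take()
            right = parse_not()
            value = value and right
        return value

    def parse_not():                 # assumed: unary prefix "not"
        if peek() == "not":
            take()
            return not parse_not()
        return parse_atom()

    def parse_atom():
        tok = take()
        if tok == "(":
            value = parse_or()
            take()                   # consume the closing ")"
            return value
        return tok in doc_terms      # a plain search term

    return parse_or()
```

With this grammar, `evaluate("A and B or C and B", {"b", "c"})` is true because the query groups as (A and B) or (C and B), matching the precedence example in the lecture.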
10
Use of Postings File for Query Matching
Index file entry    Postings
1 abacus            3 94;  19 7;  19 212;  22 56
2 actor             66;  19 213;  29 45
3 aspen             5 43
4 atoll             3;  70;  34 40

11
Query Matching: Vector Ranking Methods
  • Query: abacus asp?
  • 1. From the index file (word list), find the postings file for:
      • "abacus"
      • every word that begins "asp"
  • 2. Merge these postings lists. Calculate the similarity to the query for each document that occurs in any of the postings lists.
  • 3. Sort the similarities to obtain the results in ranked order.
  • Steps 2 and 3 should be carried out in a single pass.
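The steps above can be sketched with a score accumulator. Everything concrete here is an assumption made for illustration: the toy `index`, its weights, and the use of a plain sum of term weights as a stand-in for the similarity measure, which the slide does not specify.

```python
from collections import defaultdict

# Hypothetical index: term -> postings list of (document id, term weight).
index = {
    "abacus": [(3, 0.5), (19, 0.75), (22, 0.25)],
    "actor": [(19, 0.5), (29, 0.5)],
    "aspen": [(5, 0.125)],
    "aspirin": [(19, 0.25), (22, 0.5)],
}


def expand(term):
    """Step 1: expand a prefix pattern such as 'asp?' into index terms."""
    if term.endswith("?"):
        return [t for t in index if t.startswith(term[:-1])]
    return [term] if term in index else []


def rank(query_terms):
    """Merge postings, accumulate a score per document, then sort."""
    scores = defaultdict(float)
    for query_term in query_terms:
        for term in expand(query_term):
            for doc_id, weight in index[term]:
                scores[doc_id] += weight   # stand-in similarity: sum of weights
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranked = rank(["abacus", "asp?"])
```

Only documents that occur in at least one of the postings lists ever receive a score, so the rest of the collection is never touched.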

12
Query Matching: Boolean Methods
  • Query: (abacus or asp?) and actor
  • 1. From the index file (word list), find the postings file for:
      • "abacus"
      • every word that begins "asp"
      • "actor"
  • 2. Merge these postings lists. For each document that occurs in any of the postings lists, evaluate the Boolean expression to see if it is true or false.
  • Step 2 should be carried out in a single pass.
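One way to do the merge and the evaluation in a single pass is to merge the sorted postings lists and evaluate the expression as each document id goes by. The sketch below hard-codes the query (abacus or asp?) and actor; the function name and the document ids in the usage line are hypothetical, and each postings list is assumed to be sorted by document id.

```python
import heapq
from itertools import groupby


def boolean_match(abacus, asp_lists, actor):
    """Return ids of documents satisfying (abacus or asp?) and actor.

    abacus, actor: sorted lists of document ids.
    asp_lists: one sorted list per index term that begins "asp".
    """
    streams = [[(d, "abacus") for d in abacus], [(d, "actor") for d in actor]]
    streams += [[(d, "asp") for d in postings] for postings in asp_lists]

    hits = []
    merged = heapq.merge(*streams)             # one pass over all postings
    for doc_id, group in groupby(merged, key=lambda pair: pair[0]):
        present = {label for _, label in group}
        if ("abacus" in present or "asp" in present) and "actor" in present:
            hits.append(doc_id)
    return hits


hits = boolean_match([3, 19, 22], [[5]], [5, 19, 29])
```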

13
Query Languages: Regular Expressions
Regular expression: a pattern built up from simple strings (which are matched as substrings) and operators.
  • Union: if e1 and e2 are regular expressions, then (e1 | e2) matches whatever matches e1 or e2.
  • Concatenation: if e1 and e2 are regular expressions, the occurrences of (e1 e2) are formed by the occurrences of e1 followed immediately by e2.
  • Repetition: if e is a regular expression, then e* matches a sequence of zero or more contiguous occurrences of e.
14
Regular Expression Examples
  • (wild card) matches "wildcard"
  • travel l* ed matches "traveled" or "travelled", but not "traveed"
  • 192 (0 | 1 | 2 | 3 | 4 | 5) matches any string in the range "1920" to "1925"
Techniques for processing regular expressions are taught in CS 381 and CS 481.
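As an illustration, the three examples can be written with Python's standard `re` module. This assumes the second example uses the repetition operator (`*`) and the third uses union (`|`), symbols that appear to have been lost from the transcript; in Python's syntax, concatenation is written without spaces.

```python
import re

wildcard = re.compile(r"wildcard")           # concatenation of "wild" and "card"
travel = re.compile(r"travell*ed")           # "travel", zero or more "l", then "ed"
year = re.compile(r"192(0|1|2|3|4|5)")       # "192" followed by one digit 0 to 5


def whole_match(pattern, text):
    """True if the entire string matches the compiled pattern."""
    return pattern.fullmatch(text) is not None
```

`whole_match(travel, "traveled")` and `whole_match(travel, "travelled")` hold, while `whole_match(travel, "traveed")` does not.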
15
Problems with the Boolean model
Counter-intuitive results:
  • Query q = A and B and C and D and E. Document d has terms A, B, C and D, but not E. Intuitively, d is quite a good match for q, but it is rejected by the Boolean model.
  • Query q = A or B or C or D or E. Document d1 has terms A, B, C, D and E. Document d2 has term A, but not B, C, D or E. Intuitively, d1 is a much better match than d2, but the Boolean model ranks them as equal.
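Both cases can be reproduced in a few lines. The count of matched query terms is included only as an informal proxy for the "intuitive" quality of a match; it is an assumption for illustration, not a measure taken from the lecture.

```python
query_terms = ["a", "b", "c", "d", "e"]


def boolean_and(doc):
    """Strict Boolean result for: a and b and c and d and e."""
    return all(term in doc for term in query_terms)


def boolean_or(doc):
    """Strict Boolean result for: a or b or c or d or e."""
    return any(term in doc for term in query_terms)


def matched(doc):
    """Number of query terms present in the document (informal proxy)."""
    return sum(term in doc for term in query_terms)


d = {"a", "b", "c", "d"}          # has four of the five terms
d1 = {"a", "b", "c", "d", "e"}    # has all five terms
d2 = {"a"}                        # has only one term
```

`boolean_and(d)` is false even though `matched(d)` is 4, and `boolean_or` returns true for both `d1` and `d2` although they match 5 terms and 1 term respectively.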
16
Problems with the Boolean model (continued)
Boolean is all or nothing:
  • The Boolean model has no way to rank documents.
  • The Boolean model allows for no uncertainty in assigning index terms to documents.
  • The Boolean model has no provision for adjusting the importance of query terms.
17
Boolean model as sets
d is either in the set A or not in A.
[Diagram: document d and set A]
18
Extending the Boolean model
  • Term weighting: give weights to terms in documents and/or queries. Combine standard Boolean retrieval with vector ranking of results.
  • Fuzzy sets: relax the boundaries of the sets used in Boolean retrieval.
19
Ranking methods in Boolean systems
SIRE (Syracuse Information Retrieval Experiment)
  • Term weights: add term weights to documents. Weights are calculated by the standard method of term frequency * inverse document frequency.
  • Ranking: calculate the results set by standard Boolean methods, then rank the results by vector distances.
20
Relevance feedback in SIRE
SIRE (Syracuse Information Retrieval Experiment)
Relevance feedback is particularly important with Boolean retrieval because it allows the results set to be expanded.
  • The results set is created by standard Boolean retrieval.
  • The user selects one document from the results set.
  • Other documents in the collection are ranked by vector distance from this document.
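The feedback step might look like the sketch below. The slide does not say which vector distance SIRE uses, so cosine similarity is an assumption here, as are the function names and the toy term-weight vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def more_like(selected, vectors):
    """Rank every other document by similarity to the selected one."""
    target = vectors[selected]
    others = [doc_id for doc_id in vectors if doc_id != selected]
    return sorted(others, key=lambda d: cosine(target, vectors[d]), reverse=True)


# Hypothetical term-weight vectors for a four-document collection.
vectors = {
    "d1": {"abacus": 1.0, "actor": 1.0},
    "d2": {"abacus": 1.0},
    "d3": {"atoll": 1.0},
    "d4": {"abacus": 1.0, "actor": 1.0, "atoll": 1.0},
}
ranking = more_like("d1", vectors)
```

Because the ranking runs over the whole collection, documents outside the original Boolean results set can appear in it, which is how the results set is expanded.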
21
Boolean model as fuzzy sets
d is more or less in A.
[Diagram: document d and fuzzy set A]
22
Basic concept
  • A document has a term weight associated with each index term. The term weight measures the degree to which that term characterizes the document.
  • Term weights are in the range [0, 1]. (In the standard Boolean model all weights are either 0 or 1.)
  • For a given query, calculate the similarity between the query and each document in the collection.
  • This calculation is needed for every document that has a non-zero weight for any of the terms in the query.
23
MMM: Mixed Min and Max model
Fuzzy set theory: $d_A$ is the degree of membership of an element in set $A$.
  • intersection (and): $d_{A \cap B} = \min(d_A, d_B)$
  • union (or): $d_{A \cup B} = \max(d_A, d_B)$
24
MMM: Mixed Min and Max model
Fuzzy set theory example:
                       standard set theory     fuzzy set theory
$d_A$                  1    1    0    0        0.5   0.5   0     0
$d_B$                  1    0    1    0        0.7   0     0.7   0
and: $d_{A \cap B}$    1    0    0    0        0.5   0     0     0
or:  $d_{A \cup B}$    1    1    1    0        0.7   0.5   0.7   0
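The example values follow directly from the min and max rules, as this short sketch shows. The function names are hypothetical; the membership values are the ones from the example.

```python
def fuzzy_and(d_a, d_b):
    """Membership in the intersection: the smaller of the two degrees."""
    return min(d_a, d_b)


def fuzzy_or(d_a, d_b):
    """Membership in the union: the larger of the two degrees."""
    return max(d_a, d_b)


# Columns of the example: four crisp (0/1) cases, then four fuzzy cases.
d_a = [1, 1, 0, 0, 0.5, 0.5, 0, 0]
d_b = [1, 0, 1, 0, 0.7, 0, 0.7, 0]

and_row = [fuzzy_and(a, b) for a, b in zip(d_a, d_b)]
or_row = [fuzzy_or(a, b) for a, b in zip(d_a, d_b)]
```

When every weight is 0 or 1, min and max reduce to the ordinary Boolean and and or, which is why the first four columns agree with standard set theory.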
25
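MMM: Mixed Min and Max model
Terms: $A_1, A_2, \ldots, A_n$
Document $D$, with index-term weights $d_{A_1}, d_{A_2}, \ldots, d_{A_n}$
$Q_{or} = (A_1 \text{ or } A_2 \text{ or } \ldots \text{ or } A_n)$
Query-document similarity:
$S(Q_{or}, D) = C_{or1} \cdot \max(d_{A_1}, d_{A_2}, \ldots, d_{A_n}) + C_{or2} \cdot \min(d_{A_1}, d_{A_2}, \ldots, d_{A_n})$
where $C_{or1} + C_{or2} = 1$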
26
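MMM: Mixed Min and Max model
Terms: $A_1, A_2, \ldots, A_n$
Document $D$, with index-term weights $d_{A_1}, d_{A_2}, \ldots, d_{A_n}$
$Q_{and} = (A_1 \text{ and } A_2 \text{ and } \ldots \text{ and } A_n)$
Query-document similarity:
$S(Q_{and}, D) = C_{and1} \cdot \min(d_{A_1}, \ldots, d_{A_n}) + C_{and2} \cdot \max(d_{A_1}, \ldots, d_{A_n})$
where $C_{and1} + C_{and2} = 1$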
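Read as weighted sums of the minimum and maximum weights, the two MMM similarities can be sketched as follows. The function names and the default coefficient values are hypothetical choices made for the illustration.

```python
def mmm_or(weights, c_or1=0.75):
    """S(Q_or, D) = C_or1 * max + C_or2 * min, with C_or1 + C_or2 = 1."""
    c_or2 = 1.0 - c_or1
    return c_or1 * max(weights) + c_or2 * min(weights)


def mmm_and(weights, c_and1=0.75):
    """S(Q_and, D) = C_and1 * min + C_and2 * max, with C_and1 + C_and2 = 1."""
    c_and2 = 1.0 - c_and1
    return c_and1 * min(weights) + c_and2 * max(weights)


doc_weights = [1.0, 0.0]            # document weights for the query terms
s_or = mmm_or(doc_weights)          # 0.75 * 1.0 + 0.25 * 0.0
s_and = mmm_and(doc_weights)        # 0.75 * 0.0 + 0.25 * 1.0
```

Setting the first coefficient to 1 recovers pure fuzzy-set max (for or) and min (for and); smaller values soften the operator, so a document missing one term of an and query still receives a non-zero score.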
27
MMM: Mixed Min and Max model
Experimental values:
  • $C_{and1}$ in the range [0.5, 0.8]
  • $C_{or1} > 0.2$
Computational cost is low. Retrieval performance is much improved.
28
Other Models
Paice model:
  • The MMM model considers only the maximum and minimum document weights. The Paice model takes into account all of the document weights.
  • Computational cost is higher than MMM.
P-norm model:
  • Document $D$, with term weights $d_{A_1}, d_{A_2}, \ldots, d_{A_n}$
  • Query terms are given weights $a_1, a_2, \ldots, a_n$
  • Operators have coefficients that indicate degree of strictness.
  • Query-document similarity is calculated by considering each document and query as a point in n-space.
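The slide gives no formula for the P-norm model. The sketch below uses the form in which the P-norm similarities are commonly stated, with `p` as the strictness coefficient; treat both the formulas and the function names as assumptions brought in for illustration rather than as the lecture's own definition.

```python
def pnorm_or(d, a, p):
    """Assumed P-norm 'or': normalized distance from the all-zero point."""
    numerator = sum((ai ** p) * (di ** p) for ai, di in zip(a, d))
    denominator = sum(ai ** p for ai in a)
    return (numerator / denominator) ** (1.0 / p)


def pnorm_and(d, a, p):
    """Assumed P-norm 'and': one minus normalized distance from the all-one point."""
    numerator = sum((ai ** p) * ((1.0 - di) ** p) for ai, di in zip(a, d))
    denominator = sum(ai ** p for ai in a)
    return 1.0 - (numerator / denominator) ** (1.0 / p)


doc = [1.0, 0.0]        # document term weights
query = [1.0, 1.0]      # query term weights
```

With `p = 1` both operators return the same average (0.5 for this document); as `p` grows, the or score moves toward the maximum weight and the and score toward the minimum, approaching strict fuzzy-set behavior.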
29
Test data
Model     CISI    CACM    INSPEC
P-norm    79      106     210
Paice     77      104     206
MMM       68      109     195
Percentage improvement over standard Boolean model (average best precision). Lee and Fox, 1988.
30
Reading
E. Fox, S. Betrabet, M. Koushik, W. Lee, Extended Boolean Models. Frakes, Chapter 15. Methods based on fuzzy set concepts.