Title: Searching
1. Searching
- Binding of search statements
- Boolean queries
- Boolean queries in weighted systems
- Weighted Boolean queries in non-weighted systems
- Similarity measures
- Well-known measures
- Thresholds
- Ranking
- Relevance feedback
2. Binding of Search Statements
- Search statements are generated by users to describe their information needs.
- Typically, a search statement uses Boolean logic or natural language.
- Three levels of binding may be observed. At each level the query statement becomes more specific.
- 1. At the first level, the user attempts to specify the information needed, using his/her vocabulary and past experience.
- Example: Find me information on the impact of oil spills in Alaska on the price of oil.
3. Binding of Search Statements (cont.)
- 2. At the next level, the system translates the query to its own internal language.
- This process is similar to that of processing (indexing) a new document.
- Example: impact, oil (petroleum), spills (accidents), Alaska, price (cost, value)
- 3. At the final level, the system reconsiders the query based upon the specific database; for example, assigning weights to the terms based upon the document frequency of each term.
- Example: impact (.308), oil (.606), petroleum (.65), spills (.12), accidents (.23), Alaska (.45), price (.16), cost (.25), value (.10)
4. Boolean Queries
- Boolean queries are natural in systems where weights are binary: a term either applies or does not apply to a document. Each term T is associated with the set of documents DT that it indexes.
- A AND B: Retrieve all documents for which both A and B are relevant (DA ∩ DB).
- A OR B: Retrieve all documents for which either A or B is relevant (DA ∪ DB).
- A NOT B: Retrieve all documents for which A is relevant and B is not relevant (DA - DB).
- Consider two unnatural situations:
- Boolean queries in systems that index documents with weighted terms.
- Weighted Boolean queries in systems that use non-weighted (binary) terms.
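- A minimal sketch of the three operators as set operations (the term-to-document index below is a hypothetical example):
```python
# Boolean retrieval over binary term-document sets: each term maps
# to the set D_T of documents it indexes.
index = {
    "A": {"doc1", "doc2", "doc3"},
    "B": {"doc2", "doc4"},
}

def boolean_and(t1, t2):
    return index[t1] & index[t2]   # D_A ∩ D_B

def boolean_or(t1, t2):
    return index[t1] | index[t2]   # D_A ∪ D_B

def boolean_not(t1, t2):
    return index[t1] - index[t2]   # D_A - D_B

print(boolean_and("A", "B"))  # {'doc2'}
print(boolean_or("A", "B"))   # {'doc1', 'doc2', 'doc3', 'doc4'}
print(boolean_not("A", "B"))  # {'doc1', 'doc3'}
```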
5. Boolean Queries in Weighted Systems
- Environment:
- A weighted system, where the relevance of a term to a document is expressed with a weight.
- Boolean queries, involving AND and OR.
- Possible approach: use a threshold to convert all weights to binary representations.
- Possible approach:
- Transform the query to disjunctive normal form:
- sets of conjunctions of the form T1 ∧ T2 ∧ T3 ...,
- connected by ∨ (OR) operators.
- Given a document D:
- First, its relevance to each conjunct is computed as the minimum weight of any document term that appears in the conjunct.
- Then, the document relevance for the complete query is the maximum of the conjunct weights.
6. Boolean Queries in Weighted Systems (cont.)
- Example: Two documents indexed by 3 terms:
- Doc1: Term1 / 0.2, Term2 / 0.5, Term3 / 0.6
- Doc2: Term1 / 0.7, Term2 / 0.2, Term3 / 0.1
- Query: (Term1 AND Term2) OR Term3
- Relevance of Doc1 to the query is max(min(0.2, 0.5), 0.6) = 0.6
- Relevance of Doc2 to the query is max(min(0.7, 0.2), 0.1) = 0.2
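- A minimal sketch of this min/max evaluation, assuming the query is already given in disjunctive normal form as a list of conjuncts:
```python
# Min over the terms of each conjunct, max over the conjuncts.
def dnf_relevance(doc_weights, conjuncts):
    return max(min(doc_weights.get(t, 0.0) for t in conjunct)
               for conjunct in conjuncts)

# Query: (Term1 AND Term2) OR Term3
query = [["Term1", "Term2"], ["Term3"]]

doc1 = {"Term1": 0.2, "Term2": 0.5, "Term3": 0.6}
doc2 = {"Term1": 0.7, "Term2": 0.2, "Term3": 0.1}

print(dnf_relevance(doc1, query))  # 0.6 = max(min(0.2, 0.5), 0.6)
print(dnf_relevance(doc2, query))  # 0.2 = max(min(0.7, 0.2), 0.1)
```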
7. Weighted Boolean Queries in Non-weighted Systems
- Environment:
- A conventional system, where a term is either relevant or non-relevant to a document.
- Boolean queries, in which users associate a weight (importance) with each term.
- Possible approach:
- OR:
- A(1) OR B(1) includes all the documents in DA ∪ DB.
- A(1) OR B(0) includes all the documents in DA.
- As the weight of B changes from 0 to 1, documents from DB - DA are added to DA.
- AND:
- A(1) AND B(1) includes all the documents in DA ∩ DB.
- A(1) AND B(0) includes all the documents in DA.
- As the weight of B changes from 1 to 0, documents from DA - DB are added to DA ∩ DB.
8. Weighted Boolean Queries in Non-weighted Systems (cont.)
- NOT:
- A(1) NOT B(1) includes all the documents in DA - DB.
- A(1) NOT B(0) includes all the documents in DA.
- As the weight of B changes from 1 to 0, documents from DA ∩ DB are added to DA - DB.
- Algorithm (a sketch follows below):
- Determine the documents that satisfy either of the extreme interpretations.
- Determine the centroid of the inner set.
- Calculate the similarity of the documents outside the inner set to the centroid.
- Determine the number of documents to be added, by multiplying the actual weight of B (a value between 0 and 1) by the number of documents outside the inner set.
- Select the documents to be added as those most similar to the centroid.
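- A sketch of this algorithm for the OR case; the document vectors are hypothetical, and cosine is used only as a stand-in for whatever similarity measure the system employs:
```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def centroid(vectors):
    n = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(n)]

def weighted_or(inner_docs, outer_docs, weight_b):
    """inner_docs: vectors for D_A (the weight-0 interpretation);
    outer_docs: vectors for D_B - D_A; weight_b in [0, 1]."""
    c = centroid(list(inner_docs.values()))
    # Number of outside documents to add, proportional to the weight of B.
    k = round(weight_b * len(outer_docs))
    # Add the outside documents most similar to the centroid of the inner set.
    ranked = sorted(outer_docs, key=lambda d: cosine(outer_docs[d], c),
                    reverse=True)
    return set(inner_docs) | set(ranked[:k])

inner = {"d1": [1, 0, 1], "d2": [1, 1, 1]}   # D_A
outer = {"d3": [1, 1, 0], "d4": [0, 0, 1]}   # D_B - D_A
print(weighted_or(inner, outer, 0.5))        # {'d1', 'd2', 'd3'}
```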
9. Similarity Measures
- Typically, similarity measures are used when both queries and documents are described by vectors.
- A similarity measure gauges the similarity between two documents (for the purpose of search we consider here document-query similarity, but many of the considerations are identical).
- The measure increases as similarity grows (0 reflects total dissimilarity).
- A variety of similarity measures has been proposed and experimented with.
- As queries are analogous to documents, the same similarity measures can be used to measure:
- document-document similarity (used in document clustering)
- document-query similarity (used in searching)
- query-query similarity (?)
10. Similarity Measures: Inner Product
- Consider again:
- SIM(Di, Dj) = Σk wik · wjk
- where the weights wik are simple frequency counts.
- The problem with this simple measure is that it is not normalized to account for variances in the length of documents.
- This might be corrected by dividing each frequency count by the length of the document.
- It may also be corrected by dividing each frequency count by the maximum frequency count for the document.
- Additional normalization is often performed to force all similarity values into the range between 0 and 1.
11. Similarity Measures: Inner Product (cont.)
- This is a refinement of the previous measure (alternatively, the measure remains the inner product, but the representations are different):
- SIM(Q, D) = Σk qk · dk
- where:
- m is the number of documents in the collection
- n is the number of indexing terms
- Each document is a sequence of n weights: D = (d1, ..., dn)
- A query is also a sequence of n weights: Q = (q1, ..., qn)
- Each weight qk or dk = IDFk · TFk / MF
- IDFk: The inverse document frequency for term Tk; that is, a value that decreases as the frequency of the term in the collection increases (for example, log2(m / DFk) + 1, where DFk counts the number of documents in which term Tk appears).
- TFk / MF: The frequency of term Tk in this document, divided by the maximal frequency of any term in this document.
- There are other constants for fine-tuning the formula's performance.
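- A small sketch of this weighting, assuming a hypothetical toy collection and the log2(m / DFk) + 1 variant of IDF given above:
```python
import math

def tfidf_vector(term_freqs, doc_freq, m):
    """term_freqs: {term: raw frequency in this document};
    doc_freq: {term: number of documents containing the term};
    m: number of documents in the collection."""
    max_freq = max(term_freqs.values())            # MF
    return {t: (math.log2(m / doc_freq[t]) + 1) *  # IDFk
               (f / max_freq)                      # TFk / MF
            for t, f in term_freqs.items()}

def inner_product(q, d):
    return sum(w * d.get(t, 0.0) for t, w in q.items())

# Hypothetical collection of m = 4 documents.
df = {"oil": 3, "spill": 1, "alaska": 2}
doc = tfidf_vector({"oil": 4, "spill": 2, "alaska": 1}, df, m=4)
qry = tfidf_vector({"oil": 1, "spill": 1}, df, m=4)
print(inner_product(qry, doc))
```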
12. Similarity Measures: Cosine
- A document or a query is treated as an n-dimensional vector.
- SIM(Q, D) = Σk qk · dk / (√(Σk qk²) · √(Σk dk²))
- The formula measures the cosine of the angle between the two vectors.
- As the cosine approaches 1, the two vectors become coincident (document and query represent related concepts); as it approaches 0, they become orthogonal (unrelated concepts).
- Problem: does not take into account the lengths of the vectors.
- Consider:
- Query: (4, 8, 0)
- Doc1: (1, 2, 0)
- Doc2: (3, 6, 0)
- SIM(Query, Doc1) and SIM(Query, Doc2) are identical, even though Doc2 has significantly higher weights in the terms in common.
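- A short sketch reproducing this problem with the example vectors above:
```python
import math

def cosine(q, d):
    dot = sum(a * b for a, b in zip(q, d))
    return dot / (math.sqrt(sum(a * a for a in q)) *
                  math.sqrt(sum(b * b for b in d)))

query = (4, 8, 0)
doc1 = (1, 2, 0)
doc2 = (3, 6, 0)
print(cosine(query, doc1))  # 1.0 -- all three vectors are collinear,
print(cosine(query, doc2))  # 1.0    so Doc2's higher weights are invisible
```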
13. Similarity Measures: Summary
- Four well-known measures of vector similarity:

  Similarity Measure sim(X, Y) | Evaluation for Binary Term Vectors | Evaluation for Weighted Term Vectors
  Inner product                | |X ∩ Y|                            | Σ xi·yi
  Dice coefficient             | 2·|X ∩ Y| / (|X| + |Y|)            | 2·Σ xi·yi / (Σ xi² + Σ yi²)
  Cosine coefficient           | |X ∩ Y| / (√|X| · √|Y|)            | Σ xi·yi / (√(Σ xi²) · √(Σ yi²))
  Jaccard coefficient          | |X ∩ Y| / |X ∪ Y|                  | Σ xi·yi / (Σ xi² + Σ yi² - Σ xi·yi)
14. Similarity Measures: Summary (cont.)
- Observations:
- All four measures use the same inner product as numerator.
- The denominators of the last three may be viewed as normalizations of the inner product.
- The definitions for binary term vectors are more intuitive.
- All measures are 1 when X = Y, and 0 when X and Y are disjoint.
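- A sketch of the four measures for weighted term vectors, following the table on the previous slide; binary vectors are handled as a special case:
```python
import math

def inner(x, y):
    return sum(a * b for a, b in zip(x, y))

def dice(x, y):
    return 2 * inner(x, y) / (inner(x, x) + inner(y, y))

def cosine(x, y):
    return inner(x, y) / (math.sqrt(inner(x, x)) * math.sqrt(inner(y, y)))

def jaccard(x, y):
    return inner(x, y) / (inner(x, x) + inner(y, y) - inner(x, y))

x = (1, 1, 0, 1)   # for binary vectors, inner(x, y) = |X ∩ Y|
y = (1, 1, 1, 0)
print(inner(x, y), dice(x, y), cosine(x, y), jaccard(x, y))
# 2, 2*2/(3+3) ≈ 0.67, 2/3 ≈ 0.67, 2/(3+3-2) = 0.5
print(dice(x, x), cosine(x, x), jaccard(x, x))  # all 1.0 when X = Y
```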
15. Thresholds
- Use of similarity measures may return the entire database as a search result, because the similarity measure might yield close-to-zero values for most, if not all, of the documents.
- Similarity measures must be used with thresholds.
- Threshold: a value that the similarity measure must exceed.
- It might also be a limit on the size of the answer.
- Example:
- Terms: American, geography, lake, Mexico, painter, oil, reserve, subject.
- Doc1: geography of Mexico suggests oil reserves are available.
- Doc1 = (0, 1, 0, 1, 0, 1, 1, 0)
- Doc2: American geography has lakes available everywhere.
- Doc2 = (1, 1, 1, 0, 0, 0, 0, 0)
- Doc3: painter suggests Mexico lakes as subjects.
- Doc3 = (0, 0, 1, 1, 1, 0, 0, 1)
- Query: oil reserves in Mexico
- Query = (0, 0, 0, 1, 0, 1, 1, 0)
16. Thresholds (cont.)
- Example (cont.):
- Using the inner product measure:
- SIM(Query, Doc1) = 3
- SIM(Query, Doc2) = 0
- SIM(Query, Doc3) = 1
- If a threshold of 2 is selected, then only Doc1 is retrieved (a sketch follows at the end of this slide).
- Use of thresholds may decrease recall when documents are clustered, and search compares queries to centroids.
- There may be documents in a cluster that are not retrieved, even though they are similar enough to the query, because their cluster centroid is not close enough to the query.
- The risk increases as the deviation in the cluster increases (the documents are not clustered tightly around the centroid -- a bad cluster).
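- A minimal sketch of the thresholded retrieval in this example:
```python
docs = {
    "Doc1": (0, 1, 0, 1, 0, 1, 1, 0),
    "Doc2": (1, 1, 1, 0, 0, 0, 0, 0),
    "Doc3": (0, 0, 1, 1, 1, 0, 0, 1),
}
query = (0, 0, 0, 1, 0, 1, 1, 0)
threshold = 2

def inner(x, y):
    return sum(a * b for a, b in zip(x, y))

scores = {name: inner(query, v) for name, v in docs.items()}
print(scores)                                         # {'Doc1': 3, 'Doc2': 0, 'Doc3': 1}
print([n for n, s in scores.items() if s > threshold])  # ['Doc1']
```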
17. Ranking
- Similarity measures provide a means for ranking the set of retrieved documents:
- ordering the documents from the most likely to satisfy the query to the least likely.
- Ranking reduces the user's overhead.
- Because similarity measures are not accurate, precise ranking may be misleading; documents may be grouped into sets, and the document sets are ranked in order of relevance.
18. Relevance Feedback
- An initial query might not provide an accurate description of the user's needs:
- The user lacks knowledge of the domain.
- The user's vocabulary does not match the authors' vocabulary.
- After examining the result of his query, a user can often improve the description of his needs:
- Querying is an iterative process.
- Further iterations are generated either manually or automatically.
- Relevance feedback: knowledge of which returned documents are relevant and which are not is used to generate the next query.
- Assumption: the documents relevant to a query resemble each other (similar vectors).
- Hence, if a document is known to be relevant, the query can be improved by increasing its similarity to that document.
- Similarly, if a document is known to be non-relevant, the query can be improved by decreasing its similarity to that document.
19. Relevance Feedback (cont.)
- Given a query (a vector), we:
- add to it the average (centroid) of the relevant documents in the result, and
- subtract from it the average (centroid) of the non-relevant documents in the result.
- A vector algebra expression:
- Qi+1 = Qi + (1/r) · Σ D (over D in R) - (1/nr) · Σ D (over D in NR)
- where:
- Qi: The present query.
- Qi+1: The revised query.
- D: A document in the result.
- R: The relevant documents in the result (r = cardinality of R).
- NR: The non-relevant documents in the result (nr = cardinality of NR).
20. Relevance Feedback (cont.)
- A revised formula, giving more control over the various components:
- Qi+1 = α·Qi + β·(1/r)·Σ D (over D in R) - γ·(1/nr)·Σ D (over D in NR)
- where:
- α, β, γ: Tuning constants; for example, 1.0, 0.5, 0.25.
- β·(1/r)·Σ D: Positive feedback factor. Uses the user's judgments on relevant documents to increase the values of terms. Moves the query to retrieve documents similar to the relevant documents retrieved (in the direction of more relevant documents).
- γ·(1/nr)·Σ D: Negative feedback factor. Uses the user's judgments on non-relevant documents to decrease the values of terms. Moves the query away from non-relevant documents.
- Positive feedback is often weighted significantly more than negative feedback; often, only positive feedback is used.
21. Relevance Feedback (cont.)
- Illustration: Impact of relevance feedback. The illustration shows the effect of positive feedback only, or negative feedback only.
- Boxes: filled = present query; hollow = modified query.
- Oval: set of documents retrieved by the present query.
- Circles: filled = non-relevant documents; hollow = relevant documents.
22. Relevance Feedback (cont.)
- Example:
- Assume query Q = (3, 0, 0, 2, 0) retrieved three documents: Doc1, Doc2, Doc3.
- Assume Doc1 and Doc2 are judged relevant and Doc3 is judged non-relevant.
- Assume the constants used are α = 1.0, β = 0.5, γ = 0.25.
- The revised query is:
- Q' = (3, 0, 0, 2, 0)
-      + 0.5 · ((2+1)/2, (4+3)/2, (0+0)/2, (0+0)/2, (2+0)/2)
-      - 0.25 · (0, 0, 4, 3, 2)
-    = (3.75, 1.75, -1, 1.25, 0) → (3.75, 1.75, 0, 1.25, 0) (negative weights are set to 0)
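- A sketch of this computation; the vectors Doc1 = (2, 4, 0, 0, 2), Doc2 = (1, 3, 0, 0, 0), Doc3 = (0, 0, 4, 3, 2) are inferred from the arithmetic above:
```python
def rocchio(q, relevant, nonrelevant, alpha=1.0, beta=0.5, gamma=0.25):
    """Revised query: alpha*Q + beta*centroid(R) - gamma*centroid(NR),
    with negative weights set to 0."""
    n = len(q)
    def centroid(vecs):
        return [sum(v[i] for v in vecs) / len(vecs) for i in range(n)]
    cr = centroid(relevant)
    cnr = centroid(nonrelevant)
    return [max(0.0, alpha * q[i] + beta * cr[i] - gamma * cnr[i])
            for i in range(n)]

q = [3, 0, 0, 2, 0]
doc1, doc2 = [2, 4, 0, 0, 2], [1, 3, 0, 0, 0]  # judged relevant
doc3 = [0, 0, 4, 3, 2]                         # judged non-relevant
print(rocchio(q, [doc1, doc2], [doc3]))  # [3.75, 1.75, 0.0, 1.25, 0.0]
```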
23. Relevance Feedback (cont.)
- Example (cont.):
- Using the similarity formula, we can compare the similarity of Q and Q' to the three documents.
- Compared to the original query, the new query is more similar to Doc1 and Doc2 (judged relevant), and less similar to Doc3 (judged non-relevant).
- Notice how the new query added Term2, which was not in the original query.
- For example, a user may be searching for a word processor to be used on a PC, and the revised query may introduce the term Mac.
24. Relevance Feedback (cont.)
- Problem: Relevance feedback may not operate satisfactorily if the identified relevant documents do not form a tight cluster.
- Possible solution: Cluster the identified relevant documents, then split the original query into several, by constructing a new query for each cluster.
- Problem: Some of the query terms might not be found in any of the retrieved documents. This will lead to a reduction of their relative weight in the modified query (or even their elimination). This is undesirable, because these terms might still be found in future iterations.
- Possible solutions: Ensure the original terms are kept, or present all modified queries to the user for review.
- Fully automatic relevance feedback: The rank values for the documents in the first answer are used as relevance feedback to automatically generate the second query (no human judgment).
- The highest ranking documents are assumed to be relevant (positive feedback only).