1
Searching
  • Binding of search statements
  • Boolean queries
  • Boolean queries in weighted systems
  • Weighted Boolean queries in non-weighted systems
  • Similarity measures
  • Well-known measures
  • Thresholds
  • Ranking
  • Relevance feedback

2
Binding of Search Statements
  • Search statements are generated by users to
    describe their information needs.
  • Typically, a search statement uses Boolean logic
    or natural language.
  • Three levels of binding may be observed. At each
    level, the query statement becomes more specific.
  • 1. At the first level, the user attempts to
    specify the information needed, using his/her
    vocabulary and past experience.
  • Example: Find me information on the impact of
    oil spills in Alaska on the price of oil.

3
Binding of Search Statements (cont.)
  • 2. At the next level, the system translates the
    query to its own internal language.
  • This process is similar to that of processing
    (indexing) a new document.
  • Example: impact, oil (petroleum), spills
    (accidents), Alaska, price (cost, value)
  • 3. At the final level, the system reconsiders the
    query based upon the specific database, for
    example by assigning weights to the terms based
    upon the document frequency of each term.
  • Example: impact (0.308), oil (0.606), petroleum
    (0.65), spills (0.12), accidents (0.23), Alaska
    (0.45), price (0.16), cost (0.25), value (0.10)

4
Boolean Queries
  • Boolean queries are natural in systems where
    weights are binary: a term either applies or does
    not apply to a document. Each term T is
    associated with the set of documents DT in which
    it appears.
  • A AND B: Retrieve all documents for which both
    A and B are relevant (DA ∩ DB)
  • A OR B: Retrieve all documents for which either
    A or B is relevant (DA ∪ DB)
  • A NOT B: Retrieve all documents for which A is
    relevant and B is not relevant (DA − DB)
  • Consider two unnatural situations:
  • Boolean queries in systems that index documents
    with weighted terms.
  • Weighted Boolean queries in systems that use
    non-weighted (binary) terms.
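
The three operators map directly onto set operations over each term's document set. A minimal Python sketch (the postings below are illustrative, not from the slides):

```python
# Hypothetical postings: each term maps to the set of document ids it indexes.
postings = {
    "oil":    {1, 2, 5},
    "alaska": {2, 5, 7},
    "price":  {3, 5},
}

def boolean_and(a, b):
    """A AND B: documents indexed by both terms (DA intersect DB)."""
    return postings[a] & postings[b]

def boolean_or(a, b):
    """A OR B: documents indexed by either term (DA union DB)."""
    return postings[a] | postings[b]

def boolean_not(a, b):
    """A NOT B: documents indexed by A but not by B (DA minus DB)."""
    return postings[a] - postings[b]

print(boolean_and("oil", "alaska"))  # {2, 5}
print(boolean_or("oil", "price"))    # {1, 2, 3, 5}
print(boolean_not("oil", "alaska"))  # {1}
```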

5
Boolean Queries in Weighted Systems
  • Environment:
  • A weighted system, where the relevance of a term
    to a document is expressed with a weight.
  • Boolean queries, involving AND and OR.
  • Possible approach: use a threshold to convert all
    weights to binary representations.
  • Possible approach:
  • Transform the query to disjunctive normal form:
    conjunctions of the form T1 AND T2 AND T3 ...,
    connected by OR operators.
  • Given a document D:
  • First, its relevance to each conjunct is computed
    as the minimum weight of any document term that
    appears in the conjunct.
  • Then, the document relevance for the complete
    query is the maximum of the conjunct weights.

6
Boolean Queries in Weighted Systems (cont.)
  • Example: two documents indexed by 3 terms
  • Doc1: Term1 / 0.2, Term2 / 0.5, Term3 / 0.6
  • Doc2: Term1 / 0.7, Term2 / 0.2, Term3 / 0.1
  • Query: (Term1 AND Term2) OR Term3
  • Relevance of Doc1 to the query is 0.6
    (max(min(0.2, 0.5), 0.6))
  • Relevance of Doc2 to the query is 0.2
    (max(min(0.7, 0.2), 0.1))
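
A minimal sketch of this min/max evaluation, reproducing the two relevance values above (the query is assumed to already be in disjunctive normal form):

```python
# Each document maps terms to weights; a missing term has weight 0.
doc1 = {"Term1": 0.2, "Term2": 0.5, "Term3": 0.6}
doc2 = {"Term1": 0.7, "Term2": 0.2, "Term3": 0.1}

# Query (Term1 AND Term2) OR Term3, written in disjunctive normal form
# as a list of conjuncts (each conjunct is a list of terms).
query_dnf = [["Term1", "Term2"], ["Term3"]]

def relevance(doc, dnf):
    """Minimum weight inside each conjunct, maximum over the conjuncts."""
    return max(min(doc.get(t, 0.0) for t in conjunct) for conjunct in dnf)

print(relevance(doc1, query_dnf))  # 0.6 = max(min(0.2, 0.5), 0.6)
print(relevance(doc2, query_dnf))  # 0.2 = max(min(0.7, 0.2), 0.1)
```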

7
Weighted Boolean Queries in Non-weighted Systems
  • Environment:
  • A conventional system, where a term is either
    relevant or non-relevant to a document.
  • Boolean queries, in which users associate a
    weight (importance) with each term.
  • Possible approach:
  • OR
  • A(1) OR B(1) includes all the documents in DA ∪ DB.
  • A(1) OR B(0) includes all the documents in DA.
  • As the weight of B changes from 0 to 1, documents
    from DB − DA are added to DA.
  • AND
  • A(1) AND B(1) includes all the documents in DA ∩ DB.
  • A(1) AND B(0) includes all the documents in DA.
  • As the weight of B changes from 1 to 0, documents
    from DA − DB are added to DA ∩ DB.

8
Weighted Boolean Queries in Non-weighted Systems
(cont.)
  • NOT
  • A(1) NOT B(1) includes all the documents in DA − DB.
  • A(1) NOT B(0) includes all the documents in DA.
  • As the weight of B changes from 1 to 0, documents
    from DA ∩ DB are added to DA − DB.
  • Algorithm (sketched below):
  • Determine the documents that satisfy each of the
    two extreme interpretations; the smaller result
    is the inner set.
  • Determine the centroid of the inner set.
  • Calculate the similarity between the documents
    outside of the inner set and the centroid.
  • Determine the number of documents to be added by
    multiplying the actual weight of B (a value
    between 0 and 1) by the number of documents
    outside of the inner set.
  • Select the documents to be added as those most
    similar to the centroid.
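
A sketch of this algorithm for the OR case (A OR B, with a weight on B), assuming binary document vectors and cosine similarity to the centroid; the slides do not fix the similarity measure, so both of those choices are assumptions:

```python
import math

def centroid(vectors):
    """Component-wise average of a set of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0

def weighted_or(doc_vectors, d_a, d_b, weight_b):
    """Evaluate A OR B(weight_b) in a binary-indexed system.

    doc_vectors: doc id -> binary term vector
    d_a, d_b:    sets of doc ids indexed by terms A and B
    weight_b:    importance of B, from 0 (ignore B) to 1 (ordinary OR)
    """
    inner = set(d_a)               # the weight-0 extreme: DA alone
    outside = set(d_b) - set(d_a)  # candidates from DB - DA
    if not inner or not outside:
        return inner | (outside if weight_b == 1 else set())
    c = centroid([doc_vectors[d] for d in inner])
    # The number of candidates added grows with the weight of B ...
    k = round(weight_b * len(outside))
    # ... and the candidates chosen are those most similar to the centroid.
    ranked = sorted(outside, key=lambda d: cosine(doc_vectors[d], c), reverse=True)
    return inner | set(ranked[:k])
```

The AND and NOT cases follow the same pattern, with the inner and candidate sets taken from the corresponding extreme interpretations listed above.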

9
Similarity Measures
  • Typically, similarity measures are used when both
    queries and documents are described by vectors.
  • A similarity measure gauges the similarity
    between two documents (for the purpose of search
    we do not consider term similarity here, but many
    of the considerations are identical).
  • The measure increases as similarity grows (0
    reflects total dissimilarity).
  • A variety of similarity measures have been
    proposed and experimented with.
  • As queries are analogous to documents, the same
    similarity measures can be used to measure
  • document-document similarity (used in document
    clustering)
  • document-query similarity (used in searching)
  • query-query similarity (?)

10
Similarity Measures: Inner Product
  • Consider again
  • SIM(Di, Dj) = Σk Wik × Wjk
  • where the weights Wik are simple frequency counts.
  • The problem with this simple measure is that it
    is not normalized to account for variances in the
    length of documents.
  • This might be corrected by dividing each
    frequency count by the length of the document.
  • It may also be corrected by dividing each
    frequency count by the maximum frequency count
    for the document.
  • Additional normalization is often performed to
    force all similarity values to the range between
    0 and 1.

11
Similarity Measures: Inner Product (cont.)
  • This is a refinement of the previous measure
    (alternatively, the measure remains the inner
    product, but the representations are different).
  • SIM(Q, D) = Σk qk × dk
  • where
  • m is the number of documents in the collection
  • n is the number of indexing terms
  • Each document is a sequence of n weights: D =
    (d1, ..., dn)
  • A query is also a sequence of n weights: Q =
    (q1, ..., qn)
  • Each weight qk or dk = IDFk × TFk / MF
  • IDFk: the inverse document frequency for term Tk,
    that is, a value that decreases as the frequency
    of the term in the collection increases (for
    example, log2(m / DFk) + 1, where DFk counts the
    number of documents in which term Tk appears)
  • TFk / MF: the frequency of term Tk in this
    document, divided by the maximal frequency of any
    term in this document
  • There are other constants for fine-tuning the
    formula's performance
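
A short sketch of this weighting scheme and of the inner product, assuming the IDF variant log2(m / DFk) + 1 given above; the collection statistics and term counts below are illustrative only:

```python
import math
from collections import Counter

def weights(term_freqs, doc_freqs, m):
    """IDF_k * TF_k / MF for every term of one document (or query).

    term_freqs: raw term counts for this document
    doc_freqs:  number of documents each term appears in (DF_k)
    m:          number of documents in the collection
    """
    mf = max(term_freqs.values())  # maximal frequency of any term in this document
    return {t: (math.log2(m / doc_freqs[t]) + 1) * tf / mf
            for t, tf in term_freqs.items()}

def inner_product(q, d):
    """SIM(Q, D) = sum over shared terms of q_k * d_k."""
    return sum(w * d[t] for t, w in q.items() if t in d)

# Illustrative collection statistics (assumed, not from the slides).
m = 1000
doc_freqs = {"oil": 120, "spill": 40, "alaska": 60, "price": 300}

doc = weights(Counter({"oil": 5, "spill": 2, "alaska": 1}), doc_freqs, m)
qry = weights(Counter({"oil": 1, "price": 1}), doc_freqs, m)
print(inner_product(qry, doc))
```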

12
Similarity Measures: Cosine
  • A document or a query is treated as an
    n-dimensional vector.
  • SIM(Q, D) = (Σk qk × dk) / (sqrt(Σk qk²) × sqrt(Σk dk²))
  • The formula measures the cosine of the angle
    between the two vectors.
  • As the cosine approaches 1, the two vectors
    become coincident (the document and the query
    represent related concepts); as it approaches 0,
    they are orthogonal (unrelated concepts).
  • Problem: does not take into account the length of
    the vectors.
  • Consider
  • Query (4, 8, 0)
  • Doc1 (1, 2, 0)
  • Doc2 (3, 6, 0)
  • SIM(Query, Doc1) and SIM(Query, Doc2) are
    identical, even though Doc2 has significantly
    higher weights in the terms in common.
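
A minimal sketch of the cosine computation, reproducing the identical scores for this example:

```python
import math

def cosine(q, d):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(qk * dk for qk, dk in zip(q, d))
    norm = math.sqrt(sum(qk * qk for qk in q)) * math.sqrt(sum(dk * dk for dk in d))
    return dot / norm if norm else 0.0

query = (4, 8, 0)
doc1 = (1, 2, 0)
doc2 = (3, 6, 0)
print(cosine(query, doc1))  # 1.0
print(cosine(query, doc2))  # 1.0 -- same angle, despite Doc2's larger weights
```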

13
Similarity Measures Summary
  • Four well-known measures of vector similarity
    sim(X, Y); for each, the evaluation for binary
    term vectors and for weighted term vectors
    (for binary vectors, |X| is the number of terms
    in X and X ∩ Y is the set of shared terms):
  • Inner product
    binary: |X ∩ Y|
    weighted: Σk xk yk
  • Dice coefficient
    binary: 2 |X ∩ Y| / (|X| + |Y|)
    weighted: 2 Σk xk yk / (Σk xk² + Σk yk²)
  • Cosine coefficient
    binary: |X ∩ Y| / sqrt(|X| · |Y|)
    weighted: Σk xk yk / sqrt(Σk xk² · Σk yk²)
  • Jaccard coefficient
    binary: |X ∩ Y| / |X ∪ Y|
    weighted: Σk xk yk / (Σk xk² + Σk yk² − Σk xk yk)

14
Similarity Measures Summary (cont.)
  • Observations:
  • All four measures use the same inner product as
    numerator.
  • The denominators of the last three may be viewed
    as normalizations of the inner product.
  • The definitions for binary term vectors are more
    intuitive.
  • All measures are 0 when X and Y are disjoint; the
    three normalized measures are 1 when X = Y.
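
A sketch of the four measures for weighted term vectors, using the definitions listed on the previous slide:

```python
import math

def inner(x, y):
    return sum(a * b for a, b in zip(x, y))

def dice(x, y):
    return 2 * inner(x, y) / (inner(x, x) + inner(y, y))

def cosine(x, y):
    return inner(x, y) / math.sqrt(inner(x, x) * inner(y, y))

def jaccard(x, y):
    return inner(x, y) / (inner(x, x) + inner(y, y) - inner(x, y))

x = (0.2, 0.5, 0.0, 0.6)
y = (0.7, 0.2, 0.1, 0.0)
print(inner(x, y), dice(x, y), cosine(x, y), jaccard(x, y))
# The three normalized measures equal 1 when the vectors are identical
# and 0 when they share no nonzero component.
```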

15
Thresholds
  • Use of similarity measures may return the entire
    database as a search result, because the
    similarity measure might yield close-to-zero
    values for most, if not all, of the documents.
  • Similarity measures must therefore be used with
    thresholds.
  • Threshold: a value that the similarity measure
    must exceed.
  • It might also be a limit on the size of the
    answer.
  • Example
  • Terms: American, geography, lake, Mexico,
    painter, oil, reserve, subject.
  • Doc1: geography of Mexico suggests oil reserves
    are available.
  • Doc1 = (0, 1, 0, 1, 0, 1, 1, 0)
  • Doc2: American geography has lakes available
    everywhere.
  • Doc2 = (1, 1, 1, 0, 0, 0, 0, 0)
  • Doc3: painter suggests Mexico lakes as subjects.
  • Doc3 = (0, 0, 1, 1, 1, 0, 0, 1)
  • Query: oil reserves in Mexico
  • Query = (0, 0, 0, 1, 0, 1, 1, 0)

16
Thresholds (cont.)
  • Example (cont.)
  • Using the inner product measure (see the sketch
    below):
  • SIM(Query, Doc1) = 3
  • SIM(Query, Doc2) = 0
  • SIM(Query, Doc3) = 1
  • If a threshold of 2 is selected, then only Doc1
    is retrieved.
  • Use of thresholds may decrease recall when
    documents are clustered, and search compares
    queries to centroids.
  • There may be documents in a cluster that are not
    retrieved, even though they are similar enough to
    the query, because their cluster centroid is not
    close enough to the query.
  • The risk increases as the deviation in the
    cluster increases (the documents are not tightly
    clustered around the centroid -- a bad cluster).
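
A minimal sketch applying the inner product and the threshold of 2 to the example documents from the previous slide:

```python
docs = {
    "Doc1": (0, 1, 0, 1, 0, 1, 1, 0),
    "Doc2": (1, 1, 1, 0, 0, 0, 0, 0),
    "Doc3": (0, 0, 1, 1, 1, 0, 0, 1),
}
query = (0, 0, 0, 1, 0, 1, 1, 0)

def inner(x, y):
    return sum(a * b for a, b in zip(x, y))

threshold = 2
scores = {name: inner(query, vec) for name, vec in docs.items()}
print(scores)  # Doc1: 3, Doc2: 0, Doc3: 1
print([name for name, s in scores.items() if s > threshold])  # only Doc1 exceeds 2
```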

17
Ranking
  • Similarity measures provide a means for ranking
    the set of retrieved documents:
  • ordering the documents from the most likely to
    satisfy the query to the least likely.
  • Ranking reduces the user overhead.
  • Because similarity measures are not accurate,
    precise ranking may be misleading; documents may
    be grouped into sets, and the sets ranked in
    order of relevance.

18
Relevance Feedback
  • An initial query might not provide an accurate
    description of the user's needs:
  • the user lacks knowledge of the domain;
  • the user's vocabulary does not match the authors'
    vocabulary.
  • After examining the results of his query, a user
    can often improve the description of his needs.
  • Querying is an iterative process.
  • Further iterations are generated either manually
    or automatically.
  • Relevance feedback: knowledge of which returned
    documents are relevant and which are not is used
    to generate the next query.
  • Assumption: the documents relevant to a query
    resemble each other (similar vectors).
  • Hence, if a document is known to be relevant, the
    query can be improved by increasing its
    similarity to that document.
  • Similarly, if a document is known to be
    non-relevant, the query can be improved by
    decreasing its similarity to that document.

19
Relevance Feedback (cont.)
  • Given a query (a vector), we
  • add to it the average (centroid) of the relevant
    documents in the result, and
  • subtract from it the average (centroid) of the
    non-relevant documents in the result.
  • As a vector algebra expression:
    Qi+1 = Qi + (1/r) ΣD∈R D − (1/nr) ΣD∈NR D
  • where
  • Qi: the present query.
  • Qi+1: the revised query.
  • D: a document in the result.
  • R: the relevant documents in the result (r =
    cardinality of R).
  • NR: the non-relevant documents in the result
    (nr = cardinality of NR).

20
Relevance Feedback (cont.)
  • A revised formula, giving more control over the
    various components:
    Qi+1 = α·Qi + (β/r) ΣD∈R D − (γ/nr) ΣD∈NR D
  • where
  • α, β, γ: tuning constants, for example 1.0, 0.5,
    0.25.
  • β: positive feedback factor. Uses the user's
    judgments on relevant documents to increase the
    values of terms. Moves the query to retrieve
    documents similar to the relevant documents
    retrieved (in the direction of more relevant
    documents).
  • γ: negative feedback factor. Uses the user's
    judgments on non-relevant documents to decrease
    the values of terms. Moves the query away from
    non-relevant documents.
  • Positive feedback is often weighted significantly
    more than negative feedback; often, only positive
    feedback is used.

21
Relevance Feedback (cont.)
  • Illustration: impact of relevance feedback. The
    illustration shows the effect of positive
    feedback only or negative feedback only.
  • Boxes: filled = present query; hollow = modified
    query.
  • Oval: set of documents retrieved by the present
    query.
  • Circles: filled = non-relevant documents; hollow
    = relevant documents.

22
Relevance Feedback (cont.)
  • Example
  • Assume query Q = (3, 0, 0, 2, 0) retrieved three
    documents: Doc1, Doc2, Doc3.
  • Assume Doc1 and Doc2 are judged relevant and Doc3
    is judged non-relevant.
  • Assume the constants used are α = 1.0, β = 0.5,
    γ = 0.25.
  • The revised query is
  • Q' = (3, 0, 0, 2, 0)
    + 0.5 · ((2+1)/2, (4+3)/2, (0+0)/2, (0+0)/2,
    (2+0)/2)
    − 0.25 · (0, 0, 4, 3, 2)
    = (3.75, 1.75, −1, 1.25, 0) → (3.75, 1.75, 0,
    1.25, 0) (negative weights are set to 0)
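
A sketch reproducing this computation with the standard Rocchio-style update; the document vectors Doc1 = (2, 4, 0, 0, 2), Doc2 = (1, 3, 0, 0, 0) and Doc3 = (0, 0, 4, 3, 2) are inferred from the per-component sums above, and negative weights are clipped to 0:

```python
def centroid(vectors):
    """Component-wise average of a set of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def revise(q, relevant, non_relevant, alpha=1.0, beta=0.5, gamma=0.25):
    """Q' = alpha*Q + beta*centroid(R) - gamma*centroid(NR), clipped at 0."""
    r = centroid(relevant)
    nr = centroid(non_relevant)
    return [max(0.0, alpha * qk + beta * rk - gamma * nk)
            for qk, rk, nk in zip(q, r, nr)]

q = (3, 0, 0, 2, 0)
doc1, doc2, doc3 = (2, 4, 0, 0, 2), (1, 3, 0, 0, 0), (0, 0, 4, 3, 2)
print(revise(q, [doc1, doc2], [doc3]))  # [3.75, 1.75, 0.0, 1.25, 0.0]
```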

23
Relevance Feedback (cont.)
  • Example (cont.)
  • Using the similarity formula, we can compare the
    similarity of Q and Q' to the three documents.
  • Compared to the original query, the new query is
    more similar to Doc1 and Doc2 (judged relevant),
    and less similar to Doc3 (judged non-relevant).
  • Notice how the new query added Term2, which was
    not in the original query.
  • For example, a user may be searching for a word
    processor to be used on a PC, and the revised
    query may introduce the term "Mac".

24
Relevance Feedback (cont.)
  • Problem: relevance feedback may not operate
    satisfactorily if the identified relevant
    documents do not form a tight cluster.
  • Possible solution: cluster the identified
    relevant documents, then split the original query
    into several, by constructing a new query for
    each cluster.
  • Problem: some of the query terms might not be
    found in any of the retrieved documents. This
    will lead to a reduction of their relative weight
    in the modified query (or even their
    elimination). This is undesirable, because these
    terms might still be found in future iterations.
  • Possible solutions: ensure the original terms are
    kept, or present all modified queries to the user
    for review.
  • Fully automatic relevance feedback: the rank
    values of the documents in the first answer are
    used as relevance feedback to automatically
    generate the second query (no human judgment).
  • The highest-ranking documents are assumed to be
    relevant (positive feedback only).