VLDB'99 TUTORIAL Metasearch Engines: Solutions and Challenges

1
VLDB'99 TUTORIAL Metasearch Engines
Solutions and Challenges
  • Clement Yu, Dept. of EECS, U. of Illinois at Chicago, Chicago, IL 60607, yu_at_eecs.uic.edu
  • Weiyi Meng, Dept. of Computer Science, SUNY at Binghamton, Binghamton, NY 13902, meng_at_cs.binghamton.edu

2
The Problem
How am I going to find the 5 best pages on
Internet Security?
[Figure: search engine 1, search engine 2, ..., search engine n, each searching its own text source 1, 2, ..., n]


3
Metasearch Engine Solution
[Figure: user, user interface, query dispatcher and result merger, search engines 1 to n, each over text source 1 to n; arrows labeled "query" and "result"]
4
Some Observations
  • most sources are not useful for a given query
  • sending a query to a useless source would incur unnecessary network traffic, waste local resources for evaluating the query, and increase the cost of merging the results
  • retrieving too many documents from a source is
    inefficient

5
A More Efficient Metasearch Engine
[Figure: user, user interface, database selector and document selector, query dispatcher and result merger, search engines 1 to n, each over text source 1 to n; arrows labeled "query" and "result"]
6
Tutorial Outline
  • 1. Introduction to Text Retrieval
  • consider only Vector Space Model
  • 2. Search Engines on the Web
  • 3. Introduction to Metasearch Engine
  • 4. Database Selection
  • 5. Document Selection
  • 6. Result Merging
  • 7. New Challenges

7
Introduction to Text Retrieval (1)
  • Document representation
  • remove stopwords: of, the, ...
  • stemming: stemming → stem
  • $d = (d_1, ..., d_i, ..., d_n)$
  • $d_i$: weight of the ith term in d
  • tf · idf formula for computing $d_i$
  • Example: consider term t of document d in a database of N documents.
  • tf weight of t in d, if tf > 0: $0.5 + 0.5 \cdot \mathit{tf} / \mathit{max\_tf}$
  • idf weight of t: $\log(N / \mathit{df})$
  • weight of t in d: $(0.5 + 0.5 \cdot \mathit{tf} / \mathit{max\_tf}) \cdot \log(N / \mathit{df})$
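The weighting formula on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `term_weight` is hypothetical, and the natural logarithm is an assumption because the slide does not state the base.

```python
import math

def term_weight(tf: int, max_tf: int, df: int, n_docs: int) -> float:
    """Weight of term t in document d: (0.5 + 0.5*tf/max_tf) * log(N/df)."""
    if tf <= 0:
        return 0.0  # a term that does not occur in d gets weight 0
    tf_weight = 0.5 + 0.5 * tf / max_tf   # tf part, always in (0.5, 1.0]
    idf_weight = math.log(n_docs / df)    # idf part (natural log assumed)
    return tf_weight * idf_weight

# A term occurring 3 times in a document whose most frequent term occurs 6 times,
# present in 10 of 1000 documents.
w = term_weight(tf=3, max_tf=6, df=10, n_docs=1000)
```

A term that appears in every document gets an idf of log(1) = 0, so it contributes nothing regardless of its tf.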

8
Introduction to Text Retrieval (2)
  • Query representation
  • $q = (q_1, ..., q_i, ..., q_n)$
  • $q_i$: weight of the ith term in q
  • compute $q_i$: tf weight only
  • alternative: use idf weight for query terms, not document terms
  • query expansion (e.g., add related terms)

9
Introduction to Text Retrieval (3)
  • Similarity Functions
  • simple dot product: favors long documents
  • Cosine function
  • other similarity functions exist
  • normalized similarities: [0, 1.0]

[Figure: query vector q and document vector d with the angle between them]
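The slide names the dot product and the Cosine function without preserving their formulas. The sketch below uses the standard definitions of both; the vectors are made-up values chosen only to show why the plain dot product favors long documents.

```python
import math

def dot(q, d):
    """Simple dot product similarity."""
    return sum(qi * di for qi, di in zip(q, d))

def cosine(q, d):
    """Dot product divided by the product of the vector lengths."""
    norm = math.sqrt(dot(q, q)) * math.sqrt(dot(d, d))
    return dot(q, d) / norm if norm else 0.0

q = [1.0, 1.0, 0.0]
short_doc = [1.0, 1.0, 0.0]   # contains only the query terms
long_doc = [3.0, 3.0, 4.0]    # repeats the query terms and adds other text

print(dot(q, short_doc), dot(q, long_doc))   # the long document wins on dot product
print(cosine(q, short_doc) > cosine(q, long_doc))   # the short one wins on cosine
```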
10
Introduction to Text Retrieval (4)
  • Retrieval Effectiveness
  • relevant documents: documents useful to the user of the query
  • recall: percentage of relevant documents that are retrieved
  • precision: percentage of retrieved documents that are relevant

[Figure: precision-recall curve; axes: precision, recall]
11
Search Engines on the Web (1)
  • Search engine as a document retrieval system
  • no control on web pages that can be searched
  • web pages have rich structures and semantics
  • web pages are extensively linked
  • additional information for each page (time last
    modified, organization publishing it, etc.)
  • databases are dynamic and can be very large
  • few general-purpose search engines and numerous
    special-purpose search engines

12
Search Engines on the Web (2)
  • New indexing techniques
  • partial-text indexing to improve scalability
  • ignore and/or discount spamming terms
  • use anchor terms to index linked pages
  • e.g., WWWW McBr94, Google BrPa98, Webor CSM97

[Figure: Page 1 contains the anchor text "airplane ticket and hotel" linking to Page 2 (http://travelocity.com/)]
13
Search Engines on the Web (3)
  • New term weighting schemes
  • higher weights to terms enclosed by special tags
  • title (SIBRIS WaWJ89, Altavista, HotBot, Yahoo)
  • special fonts (Google BrPa98)
  • special fonts tags (LASER BoFJ96)
  • Webor CSM97 approach
  • partition tags into disjoint classes (title,
    header, strong, anchor, list, plain text)
  • assign different importance factors to terms in
    different classes
  • determine optimal importance factors

14
Search Engines on the Web (4)
  • New document ranking methods
  • Vector Spreading Activation YuLe96
  • add a fraction of parents' similarities
  • Example: Suppose for query q
  • sim(q, d1) = 0.4, sim(q, d2) = 0.2, sim(q, d3) = 0.2
  • final score of d3 = 0.2 + 0.1 · 0.4 + 0.1 · 0.2 = 0.26

[Figure: link graph over d1, d2 and d3, in which d1 and d2 are the parents of d3]
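The arithmetic of this example can be reproduced with a short sketch. It assumes, as the example suggests, that d1 and d2 are the parents of d3 and that the fraction added is 0.1; the function name `spread` and the data layout are illustrative only.

```python
def spread(sim, parents, fraction=0.1):
    """Add a fraction of each parent's similarity to a document's own similarity."""
    return {
        doc: score + fraction * sum(sim[p] for p in parents.get(doc, ()))
        for doc, score in sim.items()
    }

sim = {"d1": 0.4, "d2": 0.2, "d3": 0.2}
parents = {"d3": ["d1", "d2"]}   # pages that link to d3

scores = spread(sim, parents)
print(round(scores["d3"], 2))    # 0.26
```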
15
Search Engines on the Web (5)
  • New document ranking methods
  • combine similarity with rank
  • PageRank PaBr98: an important page is linked to by many pages and/or by important pages
  • combine similarity with authority score
  • authority Klei98: an important content page is highly linked to among initially retrieved pages and their neighbors

16
Introduction to Metasearch Engine (1)
  • An Example
  • Query: Internet Security
  • Databases: NYT, WP, ...
  • Retrieved results: t1, t2, ... (from NYT) and p1, p2, ... (from WP)
  • Merged results: p1, t1, ...
17
Introduction to Metasearch Engine (2)
  • Database Selection Problem
  • Select potentially useful databases for a given
    query
  • essential if the number of local databases is
    large
  • reduce network traffic
  • avoid wasting local resources

18
Introduction to Metasearch Engine (3)
  • A potentially useful database contains potentially useful documents
  • Potentially useful documents: global similarity above a threshold, or global similarity among the m highest
  • Need some knowledge about each database in advance in order to perform database selection: the Database Representative

19
Introduction to Metasearch Engine (4)
  • Document Selection Problem
  • Select potentially useful documents from each
    selected local database efficiently
  • Step 1: Retrieve all potentially useful documents while minimizing the retrieval of useless documents
  • from global similarity threshold to tightest local similarity threshold
  • want all d such that Gsim(q, d) > GT
  • retrieve d from DBk such that Lsim(q, d) > LTk
  • LTk is the largest value such that Gsim(q, d) > GT implies Lsim(q, d) > LTk

20
Introduction to Metasearch Engine (5)
  • Efficient Document Selection
  • Step 2: Transmit all potentially useful documents to the result merger while minimizing the transmission of useless documents
  • further filtering to reduce transmission cost and merge cost
  • Example:

[Figure: local database DBk; retrieve d1, ..., ds; filter; transmit d2, d7, d10]
21
Introduction to Metasearch Engine (6)
  • Result Merging Problem
  • Objective: Merge returned documents from multiple sources into a single ranked list.
  • Difficulty: Local document similarities may be incomparable or not available.
  • Solutions: Generate "global similarities" for ranking.

[Figure: DB1 returns d11, d12, ...; ...; DBN returns dN1, dN2, ...; the Merger outputs a single list d12, d54, ...]
22
Introduction to Metasearch Engine (7)
  • An Ideal Metasearch Engine
  • Retrieval effectiveness: the same as if all documents were in a single collection.
  • Efficiency: optimize the retrieval process
  • Implications: should aim at
  • selecting only useful search engines
  • retrieving and transmitting only useful documents
  • ranking documents according to their degrees of
    relevance

23
Introduction to Metasearch Engine (8)
  • Main Sources of Difficulties MYL99
  • autonomy of local search engines
  • design autonomy
  • maintenance autonomy
  • heterogeneities among local search engines
  • indexing method
  • document/query term weighting schemes
  • similarity/ranking function
  • document database
  • document version
  • result presentation

24
Introduction to Metasearch Engine (9)
  • Impact of Autonomy and Heterogeneities MLY99
  • unwilling to provide database representatives or
    provide different types of representatives
  • difficult to find potentially useful documents
  • difficult to merge documents from multiple sources

25
Database Selection Basic Idea
  • Goal: Identify potentially useful databases for each user query.
  • General approach
  • use representative to indicate approximately the
    content of each database
  • use these representatives to select databases for
    each query
  • Diversity of solutions
  • different types of representatives
  • different algorithms using the representatives

26
Solution Classification
  • Naive Approach
  • select all databases (e.g. MetaCrawler,
    NCSTRL)
  • Qualitative Approaches: estimate the quality of each local database
  • based on rough representatives
  • based on detailed representatives
  • Quantitative Approaches: estimate quantities that measure the quality of each local database more directly and explicitly
  • Learning-based Approaches: database representatives are obtained through training or learning

27
Qualitative Approaches Using Rough Representatives
  • typical representative
  • a few words or a few paragraphs in certain format
  • manual construction often needed
  • can work well for special-purpose local search
    engines
  • very scalable storage requirement
  • selection can be inaccurate as the description is
    too rough

28
Qualitative Approaches Using Rough Representatives
  • Example 1: ALIWEB Kost94
  • Representative has a fixed format. Example: a site containing files for the Perl Language
      Template-Type: DOCUMENT
      Title: Perl
      Description: Information on the Perl Programming Language. Includes a local Hypertext Perl Manual, and the latest FAQ in Hypertext.
      Keywords: perl, perl-faq, language
  • user query can match against one or more fields

29
Qualitative Approaches Using Rough Representatives
  • Example 2: NetSerf ChHa95
  • Representative has a WordNet-based structure. Example: a site for world facts listed by country
      topic: country
        synset: nation, nationality, land, country, a_people
        synset: state, nation, country, land, commonwealth, res_publica, body_politic
        synset: country, state, land, nation
      info-type: facts
  • user query is transformed to a similar structure before matching

30
Qualitative Approaches Using Detailed
Representatives
  • Use detailed statistical information for each
    term
  • employ special measures to estimate the
    usefulness/quality of each search engine for each
    query
  • the measures reflect the usefulness in a less
    direct/explicit way compared to those used in
    quantitative approaches.
  • scalability starts to become an issue

31
Qualitative Approaches Using Detailed
Representatives
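  • Example 1: gGlOSS GrGa95
  • representative: $(df_i, W_i)$ for term ti
  • $df_i$ -- document frequency of ti
  • $W_i$ -- the sum of weights of ti in all documents
  • database usefulness: sum of high similarities
  • $\mathrm{usefulness}(q, D, T) = \sum_{d \in D,\ \mathrm{sim}(q, d) \ge T} \mathrm{sim}(q, d)$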

32
gGlOSS (continued)
  • Suppose for query q, we have
  • D1: d11 = 0.6, d12 = 0.5
  • D2: d21 = 0.3, d22 = 0.3, d23 = 0.2
  • D3: d31 = 0.7, d32 = 0.1, d33 = 0.1
  • usefulness(q, D1, 0.3) = 1.1
  • usefulness(q, D2, 0.3) = 0.6
  • usefulness(q, D3, 0.3) = 0.7
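The three usefulness values above follow from summing the similarities that reach the threshold. A minimal sketch, assuming the true similarities are known (gGlOSS itself only estimates this quantity from its representative); the function name is illustrative.

```python
def usefulness(similarities, threshold):
    """Sum of the similarities that are at least the threshold."""
    return sum(s for s in similarities if s >= threshold)

databases = {
    "D1": [0.6, 0.5],
    "D2": [0.3, 0.3, 0.2],
    "D3": [0.7, 0.1, 0.1],
}

for name, sims in databases.items():
    print(name, round(usefulness(sims, 0.3), 2))
```

Note that D2 only reaches 0.6 if documents with similarity exactly equal to the threshold are counted, which is why the comparison is `>=`.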

33
gGlOSS (continued)
  • gGlOSS usefulness is estimated for two cases
  • high-correlation case: if $df_i \le df_j$, then every document having ti also has tj.
  • Example: Consider q = (1, 1, 1) with df1 = 2, df2 = 3, df3 = 4, W1 = 0.6, W2 = 0.6 and W3 = 1.2.
  • Actual weights (left) and the weights assumed by the estimate (right):
            t1    t2    t3        t1    t2    t3
      d1    0.2   0.1   0.3       0.3   0.2   0.3
      d2    0.4   0.3   0.2       0.3   0.2   0.3
      d3    0     0.2   0.4       0     0.2   0.3
      d4    0     0     0.3       0     0     0.3
  • $\mathrm{usefulness}(q, D, 0.5) = W_1 + W_2 + df_2 \cdot W_3 / df_3 = 2.1$

34
gGlOSS (continued)
  • disjoint case: for any two query terms ti and tj, no document contains both ti and tj.
  • Example: Consider q = (1, 1, 1) with df1 = 2, df2 = 1, df3 = 1, W1 = 0.5, W2 = 0.2 and W3 = 0.4.
  • Actual weights (left) and the weights assumed by the estimate (right):
            t1    t2    t3        t1     t2    t3
      d1    0.2   0     0         0.25   0     0
      d2    0     0.2   0         0      0.2   0
      d3    0.3   0     0         0.25   0     0
      d4    0     0     0.4       0      0     0.4
  • $\mathrm{usefulness}(q, D, 0.3) = W_3 = 0.4$

35
gGlOSS (continued)
  • Some observations
  • usefulness dependent on threshold
  • representative has two quantities per term
  • strong assumptions are used
  • high-correlation tends to overestimate
  • disjoint tends to underestimate
  • the two estimates tend to form bounds on the sum of the similarities ≥ T

36
Qualitative Approaches Using Detailed
Representatives
  • Example 2: CORI Net CaLC95
  • representative: $(df_i, cf_i)$ for term ti
  • $df_i$ -- document frequency of ti
  • $cf_i$ -- collection frequency of ti
  • $cf_i$ can be shared by all databases
  • database usefulness: usefulness(q, D) = sim(q, representative of D)
  • analogy with document ranking: usefulness corresponds to similarity, $df_i$ to $tf_i$, and $cf_i$ to $df_i$

37
CORI Net (continued)
  • Some observations
  • estimates independent of threshold
  • representative has less than two quantities per
    term
  • similarity is computed based on inference network
  • same method for ranking documents and ranking
    databases

38
Qualitative Approaches Using Detailed
Representatives
  • Example 3: D-WISE YuLe97
  • representative: $df_{i,j}$ for term tj in database Di
  • database usefulness: a measure of query term concentration in different databases
  • usefulness(q, Di)
  • k: number of query terms
  • $CVV_j$: cue validity variance of term tj across all databases; a larger $CVV_j$ means tj is more useful in distinguishing different databases

39
D-WISE (continued)
  • N: number of databases
  • $n_i$: number of documents in database Di
  • $ACV_j$: average cue validity of tj over all databases
  • Observations
  • estimates independent of threshold
  • representative has one quantity per term
  • measure is difficult to understand
40
Quantitative Approaches
  • Two types of quantities may be estimated with respect to query q
  • the number of documents in a database D with similarities higher than a threshold T:
    $\mathrm{NoDoc}(q, D, T) = |\{\, d : d \in D \text{ and } \mathrm{sim}(q, d) > T \,\}|$
  • the global similarity of the most similar document in D:
    $\mathrm{msim}(q, D) = \max_{d \in D} \mathrm{sim}(q, d)$
  • can be used to rank databases in descending order of similarity (or any desirability measure)

41
Estimating NoDoc(q, D, T)
  • Basic Approach MLYW98
  • representative: $(p_i, w_i)$ for term ti
  • $p_i$: probability that ti appears in a document
  • $w_i$: average weight of ti among documents having ti
  • Example: normalized weights of ti in 10 documents are (0, 0, 0, 0, 0.2, 0.2, 0.4, 0.4, 0.6, 0.6).
  • $p_i = 0.6$, $w_i = 0.4$

42
Estimating NoDoc(q, D, T)
  • Basic Approach (continued)
  • Example: Consider query q = (1, 1).
  • Suppose p1 = 0.2, w1 = 2, p2 = 0.4, w2 = 1.
  • A generating function:
    $(0.2X^2 + 0.8)(0.4X + 0.6) = 0.08X^3 + 0.12X^2 + 0.32X + 0.48$
  • $aX^b$: a is the probability that a document in D has similarity b with q
  • NoDoc(q, D, 1) = 10 · (0.08 + 0.12) = 2, for a database of 10 documents
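The generating function in this example can be expanded mechanically. The sketch below represents a polynomial as a dictionary from exponent to coefficient and assumes one factor p·X^(q·w) + (1 − p) per query term, which matches the example; the function names and the database size of 10 are illustrative.

```python
from collections import defaultdict

def multiply(f, g):
    """Multiply two polynomials stored as {exponent: coefficient}."""
    product = defaultdict(float)
    for e1, c1 in f.items():
        for e2, c2 in g.items():
            product[e1 + e2] += c1 * c2
    return dict(product)

def generating_function(query, representative):
    """Product of one factor p*X^(q*w) + (1 - p) per query term."""
    result = {0: 1.0}
    for qi, (pi, wi) in zip(query, representative):
        factor = defaultdict(float)
        factor[qi * wi] += pi        # term present with its average weight
        factor[0] += 1.0 - pi        # term absent
        result = multiply(result, dict(factor))
    return result

def no_doc(gf, n_docs, threshold):
    """Expected number of documents with similarity above the threshold."""
    return n_docs * sum(c for e, c in gf.items() if e > threshold)

gf = generating_function([1, 1], [(0.2, 2), (0.4, 1)])
print({e: round(c, 2) for e, c in sorted(gf.items(), reverse=True)})
# {3: 0.08, 2: 0.12, 1: 0.32, 0: 0.48}
print(round(no_doc(gf, n_docs=10, threshold=1), 2))   # 2.0
```

Each extra query term can double the number of terms in the expanded polynomial, which is the exponential cost the tutorial mentions later for this family of methods.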

43
Estimating NoDoc(q, D, T)
  • Basic Approach (continued)
  • Consider query q (q1, ..., qr).
  • Proposition. If the terms are independent and the weight of term ti, whenever present in a document, is wi (the average weight), 1 ≤ i ≤ r, then the coefficient of $X^s$ in the following generating function is the probability that a document in D has similarity s with q:
    $\prod_{i=1}^{r} \left( p_i X^{w_i q_i} + (1 - p_i) \right)$

44
Estimating NoDoc(q, D, T)
  • Subrange-based Approach MLYW99
  • overcome the uniform term weight assumption
  • additional information for term ti
  • $\sigma_i$: standard deviation of weights of ti in all documents
  • $mnw_i$: maximum normalized weight of ti

45
Estimating NoDoc(q, D, T)
  • Example: weights of term ti: 4, 4, 1, 1, 1, 1, 0, 0, 0, 0
  • generating function (factor) using average weight: $0.6X^2 + 0.4$
  • a more accurate function using subranges of weights: $0.2X^4 + 0.4X + 0.4$
  • In general, weights are partitioned into k subranges: $p_{i1}X^{m_{i1}} + \dots + p_{ik}X^{m_{ik}} + (1 - p_i)$
  • Probability $p_{ij}$ and median $m_{ij}$ can be estimated using $\sigma_i$ and the average of weights of ti.
  • A special implementation: use the maximum normalized weight as the first subrange by itself.

46
Estimating NoDoc(q, D, T)
  • Combined-term Approach LYMW99
  • relieve the term independence assumption
  • Example: Consider query "Chinese medicine".
  • Suppose the generating functions are
  • Chinese: $0.1X^3 + 0.3X + 0.6$
  • medicine: $0.2X^2 + 0.4X + 0.4$
  • Chinese and medicine as independent terms: $0.02X^5 + 0.04X^4 + 0.1X^3 + \dots$
  • "Chinese medicine" as a combined term: $0.05X^w + \dots$

47
Estimating NoDoc(q, D, T)
  • Criteria for combining "Chinese" and "medicine"
  • The maximum normalized weight of the combined term is higher than the maximum normalized weight of each of the two individual terms (w > 3)
  • The sum of estimated probabilities of terms with exponents ≥ w under the term independence assumption is very different from 1/N, where N is the number of documents in the database
  • They are adjacent terms in previous queries.

48
Database Selection Using msim(q,D)
  • Optimal Ranking of Databases YLWM99b
  • User: for query q, find the m most similar documents, or the m documents with the largest degrees of relevance
  • Definition: Databases D1, D2, ..., Dp are optimally ranked with respect to q if there exists a k such that each of the databases D1, ..., Dk contains one of the m most similar documents, and all of these m documents are contained in these k databases.

49
Database Selection Using msim(q,D)
  • Optimal Ranking of Databases
  • Example: For a given query q
  • D1: d1 = 0.8, d2 = 0.5, d3 = 0.2, ...
  • D2: d9 = 0.7, d2 = 0.6, d10 = 0.4, ...
  • D3: d8 = 0.9, d12 = 0.3, ...
  • other databases have documents with small similarities
  • When m = 5: pick D1, D2, D3

50
Database Selection Using msim(q,D)
  • Proposition: Databases D1, D2, ..., Dp are optimally ranked with respect to a query q if and only if
  • $\mathrm{msim}(q, D_i) \ge \mathrm{msim}(q, D_j)$ for i < j
  • Example: D1: d1 = 0.8, ...
  • D2: d9 = 0.7, ...
  • D3: d8 = 0.9, ...
  • Optimal rank: D3, D1, D2, ...
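Under this proposition, optimal ranking reduces to sorting databases by the similarity of their best document. A small sketch using the numbers from the example; it assumes the true msim values are available, whereas a metasearch engine would have to estimate them.

```python
def rank_databases(databases):
    """Order database names by the similarity of their most similar document."""
    return sorted(
        databases,
        key=lambda name: max(databases[name].values()),
        reverse=True,
    )

databases = {
    "D1": {"d1": 0.8, "d2": 0.5, "d3": 0.2},
    "D2": {"d9": 0.7, "d2": 0.6, "d10": 0.4},
    "D3": {"d8": 0.9, "d12": 0.3},
}

print(rank_databases(databases))   # ['D3', 'D1', 'D2']
```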

51
Estimating msim(q, D)
  • Use subrange-based or combined-term method.
  • Example: Suppose there are 100 documents in a database.
  • For query q, the generating function is
  • $0.002X^4 + 0.009X^3 + \dots$
  • Since 100 · (0.002 + 0.009) ≥ 1, the global similarity of the most similar document is estimated to be 3.
  • Weakness of this approach
  • requires large storage for the database representative
  • exponential computation complexity
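The estimate in this example can be read as: walk the generating function from the highest exponent down and stop at the first exponent where the expected number of documents reaches 1. A sketch of that reading; the function name and the dictionary representation are assumptions.

```python
def estimate_msim(gf, n_docs):
    """Largest exponent at which the expected document count reaches 1.

    gf maps similarity (exponent) to probability (coefficient).
    Returns None if the expected count never reaches 1.
    """
    expected = 0.0
    for exponent in sorted(gf, reverse=True):
        expected += n_docs * gf[exponent]
        if expected >= 1.0:
            return exponent
    return None

# Leading terms of the generating function from the slide; lower terms omitted.
gf = {4: 0.002, 3: 0.009}
print(estimate_msim(gf, n_docs=100))   # 3
```

With only the X^4 term the expected count is 0.2, so similarity 4 is not expected to occur; adding the X^3 term brings it to about 1.1.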

52
Estimating msim(q, D)
  • A more efficient method
  • global database representative: global $df_i$ of term ti
  • local database representative:
  • $anw_i$: average normalized weight of ti
  • $mnw_i$: maximum normalized weight of ti
  • Example: term ti: d1 = 0.3, d2 = 0.4, d3 = 0, d4 = 0.74
  • $anw_i = (0.3 + 0.4 + 0 + 0.74)/4 = 0.36$
  • $mnw_i = 0.74$

53
Estimating msim(q, D)
  • A more efficient method (continued)
  • term weighting scheme
  • query term: tf · gidf
  • document term: tf
  • query q = (q1, q2)
  • $\mathrm{msim}(q, D) = \max \{\, q_1 \cdot gidf_1 \cdot mnw_1 + q_2 \cdot gidf_2 \cdot anw_2,\ q_2 \cdot gidf_2 \cdot mnw_2 + q_1 \cdot gidf_1 \cdot anw_1 \,\}$
  • linear computation complexity
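The two-term formula generalizes naturally: for each query term in turn, take its maximum normalized weight while every other term contributes its average normalized weight, and keep the largest total. The sketch below is one way to do this in a single pass; the generalization beyond two terms, the function name and the numeric values are assumptions for illustration.

```python
def estimate_msim_linear(query, gidf, mnw, anw):
    """Estimate msim(q, D) in time linear in the number of query terms."""
    # Contribution of each term when it appears with its average weight.
    base = [q * g * a for q, g, a in zip(query, gidf, anw)]
    total = sum(base)
    best = 0.0
    for i, (q, g, m) in enumerate(zip(query, gidf, mnw)):
        # Swap term i's average-weight contribution for its maximum-weight one.
        candidate = total - base[i] + q * g * m
        best = max(best, candidate)
    return best

# Hypothetical two-term query and representative values.
estimate = estimate_msim_linear(
    query=[1.0, 1.0], gidf=[2.0, 1.0], mnw=[0.5, 0.8], anw=[0.25, 0.5]
)
print(estimate)   # 1.5
```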

54
Estimating msim(q, D)
  • Combine terms to improve estimation accuracy
  • Restrictions for combining terms ti and tj into tij
  • ti and tj are adjacent query terms
  • $mnw_{ij} > \max \{\, mnw_i + anw_j,\ mnw_j + anw_i \,\}$
  • Given a query having ti, tj and tk in this order, decide which terms to combine, if any should be combined.
  • Combine ti and tj if
  • $mnw_{ij} > \max \{\, mnw_i + anw_j,\ mnw_j + anw_i \,\}$
  • and $mnw_{ij} - \max \{\, mnw_i + anw_j,\ mnw_j + anw_i \,\} > mnw_{kj} - \max \{\, mnw_k + anw_j,\ mnw_j + anw_k \,\}$

55
Learning-based Approaches
  • Use past retrieval experiences to determine
    usefulness
  • Assume no or little global database or local
    database statistics
  • Static learning: learning based on static training queries
  • Dynamic learning: learning based on evaluated user queries
  • Combined learning: knowledge learned from training queries is adjusted based on user queries

56
Static Learning
  • Example: MRDD (Modeling Relevant Document Distribution) VoGJ95
  • record the result of each training query for each local database
  • <r1, ..., rs>: ri indicates the minimum number of top-ranked documents to retrieve in order to obtain i relevant documents
  • <2, 5, ...>: need to retrieve 2 documents in order to obtain 1 relevant document

57
MRDD (continued)
  • For a new query
  • identify the k most similar training queries
  • obtain the average distribution vector from the k
    training queries for each database
  • use these vectors to determine databases to
    search and documents to retrieve to maximize
    precision
  • Example: Suppose for query q, three average distribution vectors are obtained
  • D1: <1, 4, 6, 7, 10, 12, 17>
  • D2: <1, 5, 7, 9, 15, 20>
  • D3: <2, 3, 6, 9, 11, 16>
  • To retrieve two relevant documents: select D1 and D2.

58
Dynamic Learning
  • Example: SavvySearch DrHo97
  • database representative: weight wi and cfi for term ti, and two penalty values ph and pr for each D
  • wi: indicates how well D responds to query term ti
  • cfi: number of databases containing ti
  • ph: penalty if the average number of hits h returned for the most recent five queries < Th
  • $p_h = (T_h - h)^2 / T_h^2$
  • pr: penalty if the average response time r for the most recent five queries > Tr
  • $p_r = (r - T_r)^2 / (45 - T_r)^2$

59
SavvySearch (continued)
  • Update of wi
  • initially zero
  • reduce by 1/k if no document is retrieved for a
    k-term query containing ti
  • increase by 1/k if some returned document is read
  • Compute the ranking score r of database D for query q = (t1, ..., tk)

60
Combined Learning
  • Example: ProFusion FaGa99
  • Phase 1: Static Learning
  • 13 categories/concepts are utilized
  • training queries in each category are selected
  • relevance assessment for each query is used to compute the average score of each local database with respect to each category
      category   D1    D2    ...   Dn
      C1         0.3   0.1   ...   0.2
      ...        ...   ...   ...   ...
      C13        0     0.4   ...   0.1

61
ProFusion (continued)
  • Phase 2: Database Selection and Dynamic Learning
  • Each user query is mapped to one or more categories
  • Databases are selected based on accumulated scores over the involved categories
  • Example: Suppose query q is mapped to C1, C4, C5
      category      D1    D2    D3    D4
      C1            0.2   0     0.1   0.3
      C4            0.1   0.2   0     0
      C5            0     0.4   0.3   0.2
      total score   0.3   0.6   0.4   0.5

62
ProFusion (continued)
  • Each retrieved document from all selected
    databases is re-ranked based on the product of
    local similarity of the document and the score of
    the database.
  • if the first document clicked by the user is not the top-ranked one:
  • increase the score of the database that produced
    the document in related categories
  • decrease the score of other searched databases in
    related categories

63
Other Database Selection Techniques
  • incorporating ranks YMLW99a
  • query expansion XuCa98
  • use of lightweight queries HaTh99
  • shorter
  • not evaluated like regular queries
  • use of representative hierarchies YMLW99b

64
Document Selection
  • Goal: Select all globally most similar documents from a selected local search engine while minimizing the retrieval of useless documents.
  • General approaches
  • determine the number k of documents to retrieve
    from a local search engine and then retrieve the
    k documents with the largest local similarities
    from the search engine
  • determine a local threshold for the local
    database and retrieve documents whose local
    similarities exceed the threshold
  • The two approaches are equivalent.

65
Solution Classification
  • Local Determination
  • all locally retrieved documents will be returned
  • Examples: NCSTRL, Search Broker MaBi97
  • User Determination
  • global user determines how many documents should be retrieved from each local database
  • neither effective nor practical when the number of databases is large.
  • Examples: MetaCrawler SeEt97, SavvySearch DrHo97

66
Solution Classification (continued)
  • Weighted Allocation
  • retrieve proportionally more documents from local
    databases that are ranked higher
  • Learning-based Approaches
  • use past retrieval experience for selection
  • Guaranteed Retrieval
  • aimed at guaranteeing the retrieval of globally
    most similar documents

67
Weighted Allocation
  • Suppose m documents are to be retrieved from N
    local databases.
  • Example 1: CORI net CaLC95
  • Retrieve $m \cdot \frac{2(1 + N - i)}{N(N + 1)}$ documents from the ith ranked local database.
  • Example 2: D-WISE YuLe97
  • Let ri be the ranking score of local database Di.
  • Retrieve $m \cdot r_i / \sum_{k=1}^{N} r_k$ documents from Di.
  • When retrieving k documents from local database Di, the k documents with the largest local similarities are retrieved.
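Both allocation rules can be sketched briefly. The CORI net formula is taken from the slide; the D-WISE denominator is not legible in this transcript, so the sketch assumes it is the sum of the ranking scores, which is what allocation proportional to score implies. Function names are illustrative, and rounding the shares to whole documents is left out because the slide does not say how it is done.

```python
def cori_allocation(m, n_databases):
    """Share of m for the i-th ranked database: m * 2(1 + N - i) / (N(N + 1))."""
    n = n_databases
    return [m * 2 * (1 + n - i) / (n * (n + 1)) for i in range(1, n + 1)]

def proportional_allocation(m, scores):
    """Share of m proportional to each ranking score (assumed D-WISE form)."""
    total = sum(scores)
    return [m * r / total for r in scores]

print(cori_allocation(12, 3))                       # [6.0, 4.0, 2.0]
print(proportional_allocation(10, [0.5, 0.3, 0.2]))
```

The CORI net shares always add up to m, because the factors (1 + N − i) for i = 1..N sum to N(N + 1)/2.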

68
Learning-based Approaches
  • determine the number of documents to retrieve
    from a local database based on past retrieval
    experiences with the local database.
  • Example: MRDD VoGJ95
  • For query q, three average distribution vectors are obtained
  • D1: <1, 4, 6, 7, 10, 12, 17>
  • D2: <1, 5, 7, 9, 15, 20>
  • D3: <2, 3, 6, 9, 11, 16>
  • To retrieve four relevant documents: retrieve 1 document from D1, 1 from D2 and 3 from D3.
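The allocation in this example can be checked by brute force: try every way of splitting the desired number of relevant documents across the databases and keep the split that retrieves the fewest documents overall, which maximizes precision. This is an illustrative exhaustive search, not the procedure from the cited work, and its cost grows quickly with the number of databases.

```python
from itertools import product

def allocate(distributions, wanted):
    """Return {database: documents to retrieve} minimizing total retrieval."""
    names = list(distributions)
    ranges = [range(len(distributions[n]) + 1) for n in names]
    best = None
    for relevant in product(*ranges):
        if sum(relevant) != wanted:
            continue
        # distributions[n][k - 1] = documents needed to obtain k relevant ones.
        retrieved = [
            distributions[n][k - 1] if k else 0
            for n, k in zip(names, relevant)
        ]
        if best is None or sum(retrieved) < sum(best):
            best = retrieved
    return dict(zip(names, best)) if best is not None else None

distributions = {
    "D1": [1, 4, 6, 7, 10, 12, 17],
    "D2": [1, 5, 7, 9, 15, 20],
    "D3": [2, 3, 6, 9, 11, 16],
}

print(allocate(distributions, 4))   # {'D1': 1, 'D2': 1, 'D3': 3}
print(allocate(distributions, 2))   # {'D1': 1, 'D2': 1, 'D3': 0}
```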

69
Guaranteed Retrieval
  • Aim at
  • guaranteeing that all potentially useful
    documents with respect to a query be retrieved
  • minimizing the retrieval of useless documents
  • Two cases
  • case 1: a global similarity threshold is known
  • case 2: the number of globally desired documents is known
  • The two cases are mutually translatable.

70
Case 1: Global Similarity Threshold GT Is Known
  • find all documents whose global similarities ≥ GT
  • Technique 1: Query modification MLYW98
  • Modify q to q' such that Gsim(q, d) = Lsim(q', d)
  • find all documents whose local similarities with q' ≥ GT
  • Example: $q = (q_1, q_2)$, $d = (d_1, d_2)$
  • $\mathrm{Gsim}(q, d) = gidf_1 q_1 d_1 + gidf_2 q_2 d_2$
  • $\mathrm{Lsim}(q, d) = lidf_1 q_1 d_1 + lidf_2 q_2 d_2$
  • $q' = \left( \frac{gidf_1}{lidf_1} q_1,\ \frac{gidf_2}{lidf_2} q_2 \right)$
  • $\mathrm{Lsim}(q', d) = lidf_1 \frac{gidf_1}{lidf_1} q_1 d_1 + lidf_2 \frac{gidf_2}{lidf_2} q_2 d_2 = \mathrm{Gsim}(q, d)$
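The identity Lsim(q', d) = Gsim(q, d) is easy to confirm numerically. The sketch assumes the dot-product similarity with idf weights used in the example; the function names and the numeric values are hypothetical.

```python
def weighted_sim(q, d, idf):
    """Dot-product similarity with per-term idf weights."""
    return sum(w * qi * di for w, qi, di in zip(idf, q, d))

def modify_query(q, gidf, lidf):
    """Scale each query weight by gidf/lidf so the local engine computes Gsim."""
    return [g / l * qi for g, l, qi in zip(gidf, lidf, q)]

q = [1.0, 2.0]
d = [0.5, 0.25]
gidf = [3.0, 1.5]   # global idf values (hypothetical)
lidf = [2.0, 4.0]   # local idf values (hypothetical)

q_prime = modify_query(q, gidf, lidf)
print(weighted_sim(q_prime, d, lidf), weighted_sim(q, d, gidf))
```

The technique needs every local idf to be non-zero and the metasearch engine to know both idf vectors.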

71
Case 1: Global Similarity Threshold GT Is Known
  • Technique 2: find the largest local threshold LT such that Gsim(q, d) ≥ GT implies Lsim(q, d) ≥ LT MLYW98
  • retrieve d such that Lsim(q, d) ≥ LT to form set S
  • transmit d from S if Gsim(q, d) ≥ GT
  • Example:
              Gsim(q, d)   Lsim(q, d)
      d1      0.8          0.7
      d2      0.75         0.35
      ...
      d3      0.4          0.6
  • If d2 is desired, then LT can be no higher than 0.35.
  • If GT = 0.6, d3 will not be transmitted.
  • Transmit ? m documents from each local database.

72
Case 1: Global Similarity Threshold GT Is Known
  • Define the tightest local threshold:
    $LT = \min_{d} \{\, \mathrm{Lsim}(q, d) \mid \mathrm{Gsim}(q, d) \ge GT \,\}$
  • Determining LT
  • if both Gsim and Lsim are linear functions, apply linear programming
  • otherwise, try Lagrange multipliers.

73
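Case 1: Global Similarity Threshold GT Is Known
  • Example: $\mathrm{Gsim}(q, d) = \mathrm{Cosine}(q_G, d)$, $\mathrm{Lsim}(q, d) = \mathrm{Cosine}(q_L, d)$
  • $LT = \min_{d} \{\, \mathrm{Cosine}(q_L, d) \mid \mathrm{Cosine}(q_G, d) \ge GT \,\}$
  • $= \mathrm{Cosine}(\theta + \theta_1)$ when $q_G$, $q_L$ and d are in the same plane
  • $= GT \cdot \cos \theta_1 - \sin \theta \sin \theta_1$

[Figure: vectors qL, qG and d in one plane, with angles θ1 and θ marked between them]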
74
Case 2: Number of Globally Desired Documents Is Known
  • Solution
  • rank databases optimally for a given query q
  • retrieve documents from databases in the optimal
    order

75
Case 2: Number of Globally Desired Documents Is Known
  • Algorithm OptDocRetrv YLWM99
  • while fewer than m documents have been obtained do
      1. select the next database in the order
      2. compute the actual similarity of its most similar document
      3. find the minimum, min_sim, of the actual similarities of the most similar documents of the selected databases
      4. select documents from each selected database whose actual global similarities ≥ min_sim
  • end loop
  • Sort the documents in descending order of similarity and present the top m to the user.

76
Case 2: Number of Globally Desired Documents Is Known
  • Example: Number of documents desired = 4.
  • Databases are ranked in the order D1, D2, D3, D4
  • D1: d1 = 0.53, d2 = 0.48, d3 = 0.39, ...
  • D2: d10 = 0.47, d21 = 0.43, d52 = 0.42, ...
  • D3: d23 = 0.54, d42 = 0.49, ...
  • D4: d33 = 0.40, ...
  • select D1: min_sim = 0.53, result: d1
  • select D2: min_sim = 0.47, result: d1, d2, d10
  • select D3: min_sim = 0.47, result: d1, d2, d10, d23, d42
  • result to user: d1, d2, d23, d42
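The trace above can be reproduced with a compact sketch of the loop. It assumes each database is given as a non-empty dictionary from document to its actual global similarity, already listed in ranked order; the function name and data layout are illustrative rather than the authors' implementation.

```python
def opt_doc_retrv(ranked_databases, m):
    """Select databases in order until m documents reach min_sim; return the top m."""
    selected = []      # databases chosen so far
    candidates = {}    # documents with similarity >= min_sim
    for db in ranked_databases:
        selected.append(db)
        # Smallest of the best similarities among the selected databases.
        min_sim = min(max(docs.values()) for docs in selected)
        candidates = {
            doc: sim
            for docs in selected
            for doc, sim in docs.items()
            if sim >= min_sim
        }
        if len(candidates) >= m:
            break
    ranked = sorted(candidates.items(), key=lambda item: item[1], reverse=True)
    return [doc for doc, _ in ranked[:m]]

databases = [
    {"d1": 0.53, "d2": 0.48, "d3": 0.39},     # D1
    {"d10": 0.47, "d21": 0.43, "d52": 0.42},  # D2
    {"d23": 0.54, "d42": 0.49},               # D3
    {"d33": 0.40},                            # D4
]

print(opt_doc_retrv(databases, 4))   # ['d23', 'd1', 'd42', 'd2']
```

D4 is never contacted, because five candidates are already available after D3 is selected.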

77
Case 2: Number of Globally Desired Documents Is Known
  • Proposition: If databases are optimally ranked, then all the m globally most similar documents will be retrieved by algorithm OptDocRetrv.
  • Proposition: For any single-term query, all the m globally most similar documents will be retrieved by algorithm OptDocRetrv.

78
Result Merging
  • Goal Merge returned documents from multiple
    sources into a single ranked list.
  • Difficulties
  • local similarities are usually not comparable due
    to
  • different similarity functions
  • different term weighting schemes
  • different statistical values
  • e.g., global idf vs. local idf
  • local similarities may be unavailable to
    metasearch engine (only ranks are provided)
  • Ideal: rank in non-increasing order of global similarities

79
Solution Classification
  • similarity normalization
  • normalize all local similarities into a
    common fixed range to improve comparability
  • similarity adjustment
  • adjust local similarities/ranks based on the
    quality of local databases
  • global similarity computation
  • aim at obtaining the actual global
    similarities
  • Merge based on normalized/adjusted/computed
    similarities.

80
Similarity Normalization
  • Example 1: MetaCrawler SeEt97
  • map all local similarities into [0, 1000]
  • map the largest local similarity from each source to 1000
  • map other local similarities proportionally
  • add normalized local similarities for documents retrieved from multiple sources
                          D1                   D2
                          d1    d2    d3       d1    d4    d5
      local similarity    100   200   400      0.3   0.2   0.5
      normalized          250   500   1000     600   400   1000
      final similarity    850   500   1000           400   1000
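The table can be recomputed with a short sketch: scale each source so that its largest similarity becomes 1000, then add up the normalized values of documents returned by more than one source. The function name and data layout are illustrative.

```python
from collections import defaultdict

def merge_normalized(sources, top=1000.0):
    """Normalize each source to [0, top] and sum scores of duplicate documents."""
    final = defaultdict(float)
    for results in sources:
        largest = max(results.values())
        for doc, sim in results.items():
            final[doc] += top * sim / largest
    return dict(final)

d1_results = {"d1": 100, "d2": 200, "d3": 400}   # source D1
d2_results = {"d1": 0.3, "d4": 0.2, "d5": 0.5}   # source D2

merged = merge_normalized([d1_results, d2_results])
print({doc: round(score) for doc, score in merged.items()})
# {'d1': 850, 'd2': 500, 'd3': 1000, 'd4': 400, 'd5': 1000}
```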

81
Similarity Normalization
  • Example 2: SavvySearch DrHo97
  • same as MetaCrawler except using range [0, 1]
  • documents with no local similarities are assigned 0.5
  • Retrieval based on Multiple Evidence
  • a normalized similarity between 0 and 1 can be considered as a confidence that a document is useful
  • let si be the confidence of source i that document d is useful to query q
  • estimate the overall confidence that d is useful
  • $S(d, q) = 1 - (1 - s_1) \cdots (1 - s_k)$
  • Example: s1 = 0.7, s2 = 0.8, so S(d, q) = 0.94
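The combination rule treats each source as independent evidence: the document is judged not useful only if every source is wrong. A minimal sketch of the formula; the function name is illustrative and `math.prod` requires Python 3.8 or later.

```python
from math import prod

def combined_confidence(confidences):
    """S(d, q) = 1 - (1 - s1)(1 - s2)...(1 - sk)."""
    return 1.0 - prod(1.0 - s for s in confidences)

print(round(combined_confidence([0.7, 0.8]), 2))   # 0.94
```

A document returned by several sources therefore ends up with a higher confidence than any single source gives it.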

82
Similarity Adjustment
  • Use local similarity of d and the ranking score
    of its database to estimate the global similarity
    of d.
  • database ranking score: the higher the better
  • Example: CORI net CaLC95
  • assign the following weight to database D:
  • $w(D) = 1 + N \cdot (r - r') / r'$
  • r: rank score of D with respect to q
  • r': average of the scores of the searched databases
  • N: number of local databases searched
  • adjust the local similarity s of document d in D to $s \cdot w(D)$
  • Similar approach employed in ProFusion GaWG96.

83
Similarity Adjustment
  • Use local rank of d and the ranking score of its
    database to estimate the global similarity of d.
  • Example: D-WISE YuLe97
  • $\mathrm{Gsim}(q, d) = 1 - (r - 1) \cdot R_{min} / (m \cdot R_i)$
  • $R_i$: ranking score of database Di
  • $R_{min}$: lowest database ranking score
  • r: local rank of document d from Di
  • m: total number of documents desired
  • Observation: the top-ranked document from any database has the same global similarity

84
D-WISE (continued)
  • Example: R1 = 0.3, R2 = 0.7, Rmin = 0.2, m = 4
  • $\mathrm{Gsim}(q, d) = 1 - (r - 1) \cdot 0.2 / (4 \cdot R_i)$
              D1                       D2
              r    Gsim                r    Gsim
      d1      1    1.0         d1'     1    1.0
      d2      2    0.83        d2'     2    0.93
      d3      3    0.67        d3'     3    0.86
  • more documents from databases with higher ranking scores have higher global similarities
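The values in this table follow directly from the D-WISE formula. A minimal sketch that recomputes them; the function name and parameter names are illustrative.

```python
def dwise_gsim(local_rank, db_score, min_score, m):
    """Gsim(q, d) = 1 - (r - 1) * Rmin / (m * Ri)."""
    return 1.0 - (local_rank - 1) * min_score / (m * db_score)

for name, score in (("D1", 0.3), ("D2", 0.7)):
    row = [round(dwise_gsim(r, score, min_score=0.2, m=4), 2) for r in (1, 2, 3)]
    print(name, row)
# D1 [1.0, 0.83, 0.67]
# D2 [1.0, 0.93, 0.86]
```

The estimate falls more slowly with local rank for a database with a higher ranking score, so such a database contributes more documents near the top of the merged list.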

85
Global Similarity Computation
  • Technique 1 Document Fetching (e.g. E2RD2,
    ParaCrawler)
  • fetch documents to the metasearch engine
  • collect desired statistics (tf, idf, ...)
  • compute global similarities
  • Problem may not scale well.

86
Global Similarity Computation
  • Technique 2 Knowledge Discovery
  • discover similarity functions and term weighting
    schemes used in different search engines
  • use the discovered knowledge to determine
  • what local similarities are reasonably
    comparable?
  • how to adjust local similarities to make them
    more comparable?
  • how to compute/estimate global similarities?

87
Knowledge Discovery (continued)
  • Example: All local search engines selected for a query employ the same methods for indexing local documents and computing local similarities
  • if they do not use idf information: local similarities are comparable
  • if idf information is used and q has a single term t:
  • $\mathrm{Lsim}(q, d) = tf_t(q) \cdot lidf_t \cdot tf_t(d) / (|q| \cdot |d|) = lidf_t \cdot tf_t(d) / |d|$
  • $\mathrm{Gsim}(q, d) = gidf_t \cdot tf_t(d) / |d|$
  • $\mathrm{Gsim}(q, d) = \mathrm{Lsim}(q, d) \cdot gidf_t / lidf_t$

88
Knowledge Discovery (continued)
  • Example (continued)
  • idf information is used and q has terms t1, ...,
    tk
  • Gsim(q, d)

  • can be determined using ti as a
    single-term
  • query.

89
Knowledge Discovery (continued)
  • submit ti as a single-term query and let $s_i = \mathrm{Lsim}(d, q(t_i))$

90
New Challenges
  • Incorporate new search techniques into
    metasearch.
  • Document ranks in Google
  • Kleinberg's hub and authority scores
  • Tag information in HTML documents
  • Implicit user feedback on previous retrieval
  • Pseudo relevance feedback on previous retrieval
  • Use of user profiles
  • Integrate local systems supporting different
    query types
  • less research exists on Boolean queries, proximity queries and hierarchical queries

91
New Challenges (continued)
  • Develop techniques to discover knowledge
    (representatives, ranking algorithms) about local
    search engines more accurately and more
    efficiently.
  • some search engines may be unwilling to provide the desired representatives or may provide inaccurate representatives
  • indexing techniques, term weighting schemes and similarity functions are typically proprietary.
  • Develop standard guidelines on what information each search engine should provide to a metasearch engine (some efforts: STARTS, Dublin Core).

92
New Challenges (continued)
  • Distributed implementation of metasearch engine
  • alternative ways to store local database
    representatives?
  • how to perform database selection and document
    selection at multiple sites in parallel?
  • Scale to a million databases
  • storage of database representatives
  • fast algorithms for database selection, document
    selection and result merging
  • efficient network utilization

93
New Challenges (continued)
  • Standard testbed for evaluation
  • need a large number of local databases
  • documents should have links for computing ranks,
    hub and authority scores
  • a large number of typical Internet queries
  • relevance assessment of documents to each query
  • Go beyond text databases
  • how to extend to databases containing text,
    images, video, audio, structured data?