Lecture 9: Search Engine Evaluation

Transcript and Presenter's Notes



1
Lecture 9: Search Engine Evaluation
  • Prof. Xiaotie Deng
  • Department of Computer Science

2
Outline
  • Background
  • Recall and Precision
  • Other Measures
  • Web Search Engine Evaluation

3
Background: Motivation
  • There are many search engines on the market;
    which one is best for your needs?
  • A search engine may use several models, e.g.,
    Boolean or vector, different indexing data
    structures, different user interfaces, etc.
  • Which combination is the best one?

4
Background: Two Major Aspects
  • Efficiency: speed
  • Effectiveness: how good is the result? (quality)
  • Speed is rather technical and relatively easy
    to evaluate.
  • Effectiveness is much more difficult to judge.
  • Our focus will be on effectiveness evaluation.

5
Background: Relevancy
  • Effectiveness is related to the relevancy of the
    documents retrieved.
  • Relevancy, from a human judgment standpoint, is
  • subjective - it depends upon a specific user's
    judgment
  • situational - it relates to the user's requirements
  • cognitive - it depends on human perception and
    behavior
  • temporal - it changes over time

6
Background: Relevancy Threshold Method
  • Relevancy is not a binary value, but a continuous
    function.
  • If the user considers the relevancy value of the
    document to exceed a threshold (the threshold may
    not exist, and if it exists, it is decided by the
    user), the document is deemed relevant; otherwise
    it is irrelevant.

7
Background: Document Space
8
Background: Parameters
  • Given a query, an IR system will return a set of
    documents as the answer.
  • R is the set of relevant documents for the query.
  • A is the returned answer set.
  • |R| and |A| denote the cardinalities of these sets.
  • D denotes the set of all docs.

9
Background: Parameter Availability
  • Among these numbers, only two are always
    available for Internet IR:
  • the total number of items retrieved, |A|
  • the number of relevant items retrieved, |R ∩ A|
  • The total number of relevant items, |R|, is
    usually not available.

10
Recall and Precision
  • Two important metrics for evaluating the relevance
    of documents returned by an IR system.

11
Recall and Precision: Definition of Recall
  • Recall = |R ∩ A| / |R|, between 0 and 1
  • If Recall = 1, the system retrieved all relevant
    documents.
  • If Recall = 0, all retrieved documents are
    irrelevant.
  • What is a simple way to achieve Recall = 1?

12
Recall and Precision: Definition of Precision
  • Precision = |R ∩ A| / |A|, between 0 and 1
  • If Precision = 1, all retrieved documents are
    relevant.
  • If Precision = 0, all retrieved documents are
    irrelevant.
  • How can we achieve Precision = 1? (A small code
    sketch of both measures follows below.)
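A minimal Python sketch (not from the original slides) of the recall and precision formulas just defined, using the set notation of the slides and the example data that appears later in the lecture:

def recall_precision(relevant, answer):
    """Compute recall and precision from the relevant set R and the answer set A."""
    relevant, answer = set(relevant), set(answer)
    hits = len(relevant & answer)                       # |R ∩ A|
    recall = hits / len(relevant) if relevant else 0.0  # |R ∩ A| / |R|
    precision = hits / len(answer) if answer else 0.0   # |R ∩ A| / |A|
    return recall, precision

# Using the example from the later slides: R = {d1..d5}, A = {d3, d6, d1, d4}
R = {"d1", "d2", "d3", "d4", "d5"}
A = {"d3", "d6", "d1", "d4"}
print(recall_precision(R, A))  # (0.6, 0.75)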

13
Recall and Precision: Roles
  • Recall measures the ability of the search to find
    all of the relevant items in the database.
  • Precision
  • evaluates the correlation of the query to the
    database
  • is an indirect measure of the completeness of the
    indexing algorithm

14
Recall and Precision: Evaluation
  • Precision can be evaluated exactly by dividing
    |R ∩ A| by |A|, since both numbers are available.
  • Recall cannot be evaluated exactly in general,
    since it is defined as the ratio of |R ∩ A| to
    |R|, where the latter is usually not available.

15
Recall and Precision: Estimating Recall
  • Randomly pick a set F of documents.
  • Heuristic argument (the sampling technique from
    statistics): the proportion of R ∩ A in R is
    approximately the same as the proportion of
    R ∩ A ∩ F in R ∩ F.
  • Recall can therefore be estimated by the ratio of
    |R ∩ A ∩ F| to |R ∩ F|.

16
Recall and Precision: Estimating Precision
  • Even though precision can be evaluated exactly,
    doing so may be costly, since R ∩ A can be huge
    for Internet IR and R is often determined
    subjectively by people.
  • It is laborious to determine R ∩ A, so the exact
    value can be costly to verify.
  • Again, we may randomly pick a set F of documents
    and estimate precision by the ratio of
    |R ∩ A ∩ F| to |A ∩ F|.
  • A ∩ F is relatively small, so it is much easier
    to find and to verify R ∩ A ∩ F (see the sketch
    below).
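A minimal sampling-based sketch of both estimates, assuming a hypothetical judge(doc) function that stands in for the human relevance judgment applied only to the sampled documents:

import random

def estimate_recall_precision(answer, all_docs, judge, sample_size, seed=0):
    """Estimate recall and precision from a random sample F of documents."""
    random.seed(seed)
    F = random.sample(sorted(all_docs), sample_size)       # random sample F of D
    answer = set(answer)
    r_f = [d for d in F if judge(d)]                        # R ∩ F (judged relevant)
    r_a_f = [d for d in r_f if d in answer]                 # R ∩ A ∩ F
    a_f = [d for d in F if d in answer]                     # A ∩ F
    est_recall = len(r_a_f) / len(r_f) if r_f else 0.0      # ≈ |R ∩ A| / |R|
    est_precision = len(r_a_f) / len(a_f) if a_f else 0.0   # ≈ |R ∩ A| / |A|
    return est_recall, est_precision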

17
Recall and Precision: Dual Objectives for IR Systems
  • We want both precision and recall to be one.
  • Unfortunately, precision and recall affect each
    other in opposite directions!
  • Given a system:
  • Broadening a query will increase recall but lower
    precision.
  • Increasing the number of documents returned has
    the same effect.
  • Different queries may yield different values of
    recall.
  • Use the average over a chosen set of queries.

18
Recall and Precision: The Recall-Precision Curve
  • Usually, recall and precision have a trade-off
    relationship: increased precision results in
    decreased recall, and vice versa.

19
Recall and Precision: Examples, Page 1
  • Consider a query for which the relevant set is
    R = {d1, d2, d3, d4, d5} out of a set D of 10 docs.
  • Assume that the given IR system returned
    A = {d3, d6, d1, d4}.
  • Recall = 3/5 = 60%, and Precision = 3/4 = 75%.
  • How do we visualize this relationship between
    recall and precision when ranking is considered?

20
Recall and Precision: Examples, Page 2
  • R = {d1, d2, d3, d4, d5}
  • A = {d3, d6, d1, d4}, returned in this order
  • The first result, {d3}, yields 100% precision at
    20% recall.
  • The first two results, {d3, d6}, yield 50%
    precision at 20% recall.
  • The first three results, {d3, d6, d1}, yield 66%
    precision at 40% recall.
  • All results, {d3, d6, d1, d4}, yield 75% precision
    at 60% recall.
  • NOTE: recall never decreases as we move down the
    ranking (a sketch that reproduces these numbers
    follows below).
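A short sketch (not from the original slides) that computes the precision and recall after each position in the ranked answer list:

def precision_recall_at_ranks(ranked_answers, relevant):
    """Precision and recall after each rank position."""
    relevant = set(relevant)
    hits, points = 0, []
    for i, doc in enumerate(ranked_answers, start=1):
        if doc in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / i))  # (recall, precision)
    return points

ranked_A = ["d3", "d6", "d1", "d4"]
R = {"d1", "d2", "d3", "d4", "d5"}
for recall, precision in precision_recall_at_ranks(ranked_A, R):
    print(f"recall={recall:.0%}  precision={precision:.0%}")
# 20%/100%, 20%/50%, 40%/67%, 60%/75%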

21
Recall and Precision: Examples, Page 3
22
Recall and Precision: Notes, Page 1
  • Different queries to the database/search engine
    may result in different precision/recall values.
  • We should evaluate the precision and recall of
    different IR methods as an average over all
    queries.

23
Recall and Precision: Notes, Page 2
  • One method is better than another
  • if, at the same recall level, it is more precise,
  • or, at the same precision level, it recalls more.
  • Note: this may not hold for every query across
    different IR methods and may even conflict for
    different queries. Therefore it can only be
    assessed on average.

24
Other Measures
  • Benchmark Approach
  • Average Precision at Seen Relevant Docs and
    R-Precision
  • Van Rijsbergen's Measure
  • Interpretation
  • User-Oriented Measures
  • Coverage and Novelty

25
Recall and Precision: Alternative Metric, Fallout
  • Definition: Fallout = |A - R| / |D - R|
  • D - R: all the irrelevant documents
  • A - R: the irrelevant documents in the answer set

26
Recall and Precision: Alternative Metric, Fallout
  • Definition: Fallout = |A - R| / |D - R|
  • Fallout is concerned with retrieved but
    non-relevant docs.
  • It is the percentage of non-relevant answers
    among all non-relevant documents.
  • It is the counterpart of recall on the
    non-relevant documents.
  • A good algorithm should have
  • Recall > Fallout.
  • Otherwise, one could simply return all the
    documents for a better recall (see the sketch
    below).
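A minimal sketch of the fallout definition, reusing the running example with D = {d1, ..., d10}:

def fallout(answer, relevant, all_docs):
    """Fallout = |A - R| / |D - R|: the fraction of non-relevant docs retrieved."""
    A, R, D = set(answer), set(relevant), set(all_docs)
    non_relevant = D - R                 # D - R
    retrieved_non_relevant = A - R       # A - R
    return len(retrieved_non_relevant) / len(non_relevant) if non_relevant else 0.0

D = {f"d{i}" for i in range(1, 11)}
R = {"d1", "d2", "d3", "d4", "d5"}
A = {"d3", "d6", "d1", "d4"}
print(fallout(A, R, D))  # 1/5 = 0.2, while recall is 0.6, so Recall > Fallout holds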

27
Other Measures: Benchmark Approach
  • Benchmark approach
  • Given a set of problems
  • with known answers
  • Test different IR methods
  • Yearly competition
  • TREC
  • TREC (see http://trec.nist.gov/) maintains about
    6 GB of SGML-tagged text, queries, and respective
    answers for evaluation purposes.
  • The answers to the queries are obtained manually
    in advance.
  • IR systems are tested against them and evaluated
    accordingly.

28
Other Measures: Average Precision at Seen Relevant
Docs and R-Precision
  • Average precision at seen relevant docs: compute
    the precision every time a relevant doc is found
    and report the average of these values.
  • R-Precision: the precision at the position of the
    lowest-ranked relevant doc (see the sketch below).
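A small sketch of both measures as defined on this slide; the example data is the running example from the earlier slides:

def avg_precision_at_seen_relevant(ranked_answers, relevant):
    """Average of the precision values taken each time a relevant doc appears."""
    relevant = set(relevant)
    hits, precisions = 0, []
    for i, doc in enumerate(ranked_answers, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / len(precisions) if precisions else 0.0

def r_precision(ranked_answers, relevant):
    """Precision at the position of the lowest-ranked relevant doc (slide definition)."""
    relevant = set(relevant)
    last = max((i for i, d in enumerate(ranked_answers, start=1) if d in relevant), default=0)
    if last == 0:
        return 0.0
    return sum(1 for d in ranked_answers[:last] if d in relevant) / last

ranked_A = ["d3", "d6", "d1", "d4"]
R = {"d1", "d2", "d3", "d4", "d5"}
print(avg_precision_at_seen_relevant(ranked_A, R))  # (1/1 + 2/3 + 3/4) / 3 ≈ 0.81
print(r_precision(ranked_A, R))                     # 3/4 = 0.75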

29
Other Measures: Van Rijsbergen
  • Given the j-th doc in the ranking, with recall r_j
    and precision p_j,
  • Van Rijsbergen proposed the following measure:
  • E_j = 1 - (1 + b^2) / (b^2/r_j + 1/p_j)
  • b is a parameter set by the user;
  • if b = 1, E_j = 1 - 2 / (1/r_j + 1/p_j)
  • Docs with high precision and high recall have a
    low E value, whereas docs with low precision and
    low recall have a high E value (a small sketch
    follows below).
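A direct transcription of the E measure into Python; the example values come from the running example (recall = 0.6, precision = 0.75):

def van_rijsbergen_e(recall_j, precision_j, b=1.0):
    """E_j = 1 - (1 + b^2) / (b^2/r_j + 1/p_j); lower E is better."""
    if recall_j == 0 or precision_j == 0:
        return 1.0  # worst case when either measure is zero
    return 1.0 - (1.0 + b**2) / (b**2 / recall_j + 1.0 / precision_j)

print(van_rijsbergen_e(0.6, 0.75, b=1.0))  # 1 - 2 / (1/0.6 + 1/0.75) ≈ 0.33
print(van_rijsbergen_e(0.6, 0.75, b=2.0))  # b != 1 shifts the weighting, as the next slide discusses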

30
Other Measures: Van Rijsbergen, Interpretation
  • If b > 1, the emphasis is on precision.
  • If b < 1, the user is more interested in recall.
  • The main aspect of the measure E is that it
    evaluates each ranked document, not the whole
    document set; thus anomalies can be seen.

31
Other Measures: User-Oriented
  • It is also important to take into account what
    different users feel about the answer sets.
  • Users may consider the same answer set to be of
    different usefulness; this is especially true if
    they know (to different degrees) the answers they
    should obtain.
  • In addition to R and A, let us also consider the
    following subsets of R:

32
Other Measures: User-Oriented, Illustration
  • K: the set of relevant answers which are known to
    the user
  • U: the set of relevant answers which were not
    known to the user and were retrieved

(Figure: the answer set A and the relevant docs R,
with K, the relevant docs known to the user, and U,
the relevant docs not known to the user but retrieved.)
33
Other Measures: Coverage and Novelty
  • Coverage = |A ∩ K| / |K| is the coverage of the
    answer set.
  • A high coverage ratio means that the system is
    finding most of what the user was expecting.
  • Novelty = |U| / (|K| + |U|) is the novelty of the
    answer.
  • A high novelty ratio means that the user is
    finding many new relevant docs which were not
    known before (see the sketch below).
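A minimal sketch of coverage and novelty as defined above; the example user knowledge K = {d1, d2} is hypothetical:

def coverage_and_novelty(answer, relevant, known_to_user):
    """Coverage = |A ∩ K| / |K|; Novelty = |U| / (|K| + |U|),
    where U = (R ∩ A) - K is the set of relevant answers retrieved
    that the user did not previously know."""
    A, R, K = set(answer), set(relevant), set(known_to_user)
    U = (R & A) - K
    coverage = len(A & K) / len(K) if K else 0.0
    novelty = len(U) / (len(K) + len(U)) if (K or U) else 0.0
    return coverage, novelty

A = {"d3", "d6", "d1", "d4"}
R = {"d1", "d2", "d3", "d4", "d5"}
K = {"d1", "d2"}      # hypothetical: the user already knew d1 and d2 were relevant
print(coverage_and_novelty(A, R, K))  # coverage = 1/2, novelty = 2/4 = 0.5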

34
Web Search Engine Evaluation: New Issues
  • Web
  • The number of web pages (docs) on the web was
    estimated at more than two billion in April 2001
    (www.searchenginewatch.com).
  • It is almost impossible to get all relevant web
    pages from the Internet.
  • Web pages are dynamic: some of them will
    disappear tomorrow, or be updated.
  • Users (different countries)
  • Even for the same query, different users may
    desire different results.
  • Users tend to use short query words.
  • Two main issues in web search engine evaluation:
  • Web Search Engine Coverage
  • Web Search Engine Effectiveness

35
Web Search Engine Evaluation: Coverage
  • Some published approaches to estimating coverage
    are based on the number of hits for certain
    queries as reported by the services themselves.
  • For example, the method used by Search Engine
    Showdown:
  • http://www.searchengineshowdown.com/stats/fast300.shtml
  • To compare the sizes of the search engine
    databases, the study uses 25 specific queries
    that meet the criteria listed below.
  • The results of each query are verified when
    possible, and only the number of hits that can be
    displayed are counted.

36
Web Search Engine Evaluation: Coverage, Query Criteria
  • Only single words are used, to avoid any variation
    in the processing of multiple-term searches.
  • Terms were drawn from a variety of reference
    books that cover different fields.

37
Web Search Engine Evaluation: Coverage, Selection of
Query Terms
  • Any term used must find fewer than 1,000 results
    in the AltaVista Advanced Search, since numbers
    higher than that cannot be verified on AltaVista.
  • Since Northern Light automatically searches both
    the English plural and singular forms of words,
    query terms were chosen that cannot generally be
    made plural. This was checked by pluralizing the
    word and running a search on AltaVista or Fast;
    only those terms whose plural form found zero
    results were used.

38
Web Search Engine Evaluation: Coverage, Real Data
http://www.searchengineshowdown.com/stats/sizeest.shtml
39
Web Search Engine Evaluation: Coverage Estimation, A
Simple Method
  • 1. For all the queries, we check all the results
    retrieved by each search engine, and get the total
    count of valid web pages for each search engine.
  • 2. Based on the known sizes of the Northern Light
    and Fast search engines, we can estimate each
    search engine's coverage.

40
Web Search Engine Evaluation: Coverage Estimation,
Another Method (Krishna Bharat and Andrei Broder)
41
Web Search Engine Evaluation: Coverage Estimation,
Background: Conditional Probability Method
  • Let Pr(A) represent the probability that an
    element belongs to the set A, and let Pr(A ∩ B | A)
    represent the conditional probability that an
    element belongs to both sets given that it
    belongs to A. Then,
  • Pr(A ∩ B | A) = Size(A ∩ B) / Size(A) and,
    similarly, Pr(A ∩ B | B) = Size(A ∩ B) / Size(B),
    and therefore
  • Size(A) / Size(B) = Pr(A ∩ B | B) / Pr(A ∩ B | A).

42
Web Search Engine Evaluation: Coverage Estimation,
Background: Conditional Probability Method, Two Major
Procedures
  • To implement this idea we need two procedures:
  • Sampling: a procedure for picking pages uniformly
    at random from the index of a particular engine.
  • Checking: a procedure for determining whether a
    particular page is indexed by a particular
    engine.

43
Web Search Engine Evaluation: Coverage Estimation,
Background: Conditional Probability Method, The
Solution
  • Overlap estimate: the fraction of E1's database
    indexed by E2 is estimated by the fraction of URLs
    sampled from E1 that are found in E2.
  • Size comparison: for search engines E1 and E2,
    the ratio Size(E1) / Size(E2) is estimated by
  • (fraction of URLs sampled from E2 found in E1) /
    (fraction of URLs sampled from E1 found in E2)
    (see the sketch below).
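A compact sketch of the size-ratio estimate; indexed_by_e1 and indexed_by_e2 are hypothetical checking procedures (e.g. queries that test whether a URL is in an engine's index), as described on the previous slide:

def estimate_size_ratio(sample_from_e1, sample_from_e2, indexed_by_e1, indexed_by_e2):
    """Estimate Size(E1)/Size(E2) from random URL samples of each engine's index."""
    # Pr(A ∩ B | A): fraction of URLs sampled from E1 that E2 also indexes
    frac_e1_in_e2 = sum(indexed_by_e2(u) for u in sample_from_e1) / len(sample_from_e1)
    # Pr(A ∩ B | B): fraction of URLs sampled from E2 that E1 also indexes
    frac_e2_in_e1 = sum(indexed_by_e1(u) for u in sample_from_e2) / len(sample_from_e2)
    # Size(E1)/Size(E2) = Pr(A ∩ B | B) / Pr(A ∩ B | A)
    return frac_e2_in_e1 / frac_e1_in_e2 if frac_e1_in_e2 else float("inf")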

44
Web Search Engine Evaluation: Effectiveness
  • How do we evaluate the quality of the results
    retrieved by different search engines?
  • Manual
  • Benefit: accuracy with respect to users'
    expectations
  • Drawback: subjective (how to choose reviewers)
    and time-consuming
  • Automatic
  • It is much better at adapting to the fast-changing
    Web and search engines, as well as the large
    amount of information on the web.

45
Web Search Engine Evaluation: Effectiveness,
Automatic, Page 1
  • In Longzhuang Li, Yi Shang, and Wei Zhang,
  • two sample query sets were used:
  • (a) The TKDE set, containing 1383 queries derived
    from the index terms of papers published in the
    IEEE Transactions on Knowledge and Data
    Engineering between January 1995 and June 2000.
  • (b) The TPDC set, containing 1726 queries derived
    from the index terms of papers published in the
    IEEE Transactions on Parallel and Distributed
    Systems between January 1995 and February 2000.

46
Web Search Engine Evaluation: Effectiveness,
Automatic, Page 2
  • For each query, the top 20 hits from each search
    engine are analysed.
  • To compute the relevance score, we followed each
    hit to retrieve the corresponding Web document.
  • The scores are calculated based on four models
    (an illustrative sketch of the first one follows
    below):
  • (a) Vector Space Model (VSM)
  • (b) Okapi Similarity Measurement (Okapi)
  • (c) Cover Density Ranking (CDR)
  • (d) Three-Level Scoring Method (TLS)
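A minimal, illustrative TF-IDF cosine score in the spirit of the Vector Space Model; the term statistics (doc_freq, num_docs) are assumptions for illustration, not the corpus or exact weighting used in the cited study:

import math
from collections import Counter

def vsm_score(query_terms, doc_terms, doc_freq, num_docs):
    """TF-IDF cosine similarity between a query and one retrieved document."""
    tf_q, tf_d = Counter(query_terms), Counter(doc_terms)
    def weight(tf, term):
        idf = math.log(num_docs / (1 + doc_freq.get(term, 0)))  # inverse document frequency
        return tf[term] * idf
    dot = sum(weight(tf_q, t) * weight(tf_d, t) for t in set(tf_q) & set(tf_d))
    norm_q = math.sqrt(sum(weight(tf_q, t) ** 2 for t in tf_q))
    norm_d = math.sqrt(sum(weight(tf_d, t) ** 2 for t in tf_d))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0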

47
Web Search Engine Evaluation: Effectiveness,
Automatic, Page 3
  • Rank the search engines based on their average
    relevance scores, computed using the above four
    scoring methods respectively.
  • Their conclusion: Google is the best.

48
Summary
  • Background
  • Recall and Precision
  • Other Measures
  • Web Search Engine Evaluation