Title: Lecture 9: Search Engine Evaluation
1. Lecture 9: Search Engine Evaluation
- Prof. Xiaotie Deng
- Department of Computer Science
2. Outline
- Background
- Recall and Precision
- Other Measures
- Web Search Engine Evaluation
3. Background: Motivation
- There are many search engines on the market; which one is best for your needs?
- A search engine may use one of several models (e.g., Boolean or vector), different indexing data structures, different user interfaces, etc.
- Which combination is the best one?
4. Background: Two Major Aspects
- Efficiency: speed
- Effectiveness: how good are the results? (quality)
- Speed is rather technical and relatively easy to evaluate.
- Effectiveness is much more difficult to judge.
- Our focus will be on effectiveness evaluation.
5. Background: Relevancy
- Effectiveness is related to the relevancy of the documents retrieved.
- Relevancy, from a human judgment standpoint, is
  - subjective: depends upon a specific user's judgment
  - situational: relates to the user's requirements
  - cognitive: depends on human perception and behavior
  - temporal: changes over time
6. Background: Relevancy Threshold Method
- Relevancy is not a binary value, but a continuous function.
- If the user considers the relevancy value of a document to exceed a threshold (the threshold may not exist, and if it exists, it is decided by the user), the document is deemed relevant; otherwise it is deemed irrelevant.
7. Background: Document Space
8. Background: Parameters
- Given a query I, an IR system will return a set of documents as the answer.
- R is the set of documents relevant to the query.
- A is the returned answer set.
- |R| and |A| denote the cardinalities of these sets.
- D denotes the set of all documents.
9. Background: Parameter Availability
- Among these numbers, only two are always available for Internet IR:
  - the total number of items retrieved, |A|
  - the number of relevant items retrieved, |R ∩ A|
- The total number of relevant items, |R|, is usually not available.
10. Recall and Precision
- Two important metrics for evaluating the relevance of the documents returned by an IR system.
11. Recall and Precision: Definition of Recall
- Recall = |R ∩ A| / |R|, a value between 0 and 1.
- If Recall = 1, the system retrieved all relevant documents.
- If Recall = 0, none of the retrieved documents are relevant.
- What is a simple way to achieve Recall = 1?
12. Recall and Precision: Definition of Precision
- Precision = |R ∩ A| / |A|, a value between 0 and 1.
- If Precision = 1, all retrieved documents are relevant.
- If Precision = 0, all retrieved documents are irrelevant.
- How can Precision = 1 be achieved?
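The two definitions above translate directly into set operations. Below is a minimal sketch in Python; the document IDs and the helper name recall_precision are illustrative, not from the lecture.

```python
def recall_precision(relevant, answer):
    """Compute (recall, precision) from the relevant set R and the answer set A."""
    relevant = set(relevant)
    answer = set(answer)
    hits = relevant & answer                      # R ∩ A: relevant documents that were retrieved
    recall = len(hits) / len(relevant) if relevant else 0.0
    precision = len(hits) / len(answer) if answer else 0.0
    return recall, precision

# Example used later in the slides: R = {d1..d5}, A = {d3, d6, d1, d4}
R = {"d1", "d2", "d3", "d4", "d5"}
A = {"d3", "d6", "d1", "d4"}
print(recall_precision(R, A))   # (0.6, 0.75)
```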
13. Recall and Precision: Roles
- Recall measures the ability of the search to find all of the relevant items in the database.
- Precision
  - evaluates the correlation of the query to the database
  - is an indirect measure of the completeness of the indexing algorithm
14. Recall and Precision: Evaluation
- Precision can be evaluated exactly by dividing |R ∩ A| by |A|, since both numbers are available.
- Recall cannot be evaluated exactly in general, since it is defined as the ratio of |R ∩ A| to |R|, and the latter is usually not available.
15. Recall and Precision: Estimating Recall
- Randomly pick a set F of documents.
- Heuristic argument (a sampling technique from statistics): the proportion of R ∩ A in R is approximately the same as the proportion of R ∩ A ∩ F in R ∩ F.
- Recall can therefore be estimated by the ratio of |R ∩ A ∩ F| to |R ∩ F|.
16. Recall and Precision: Estimating Precision
- Even though precision can be evaluated exactly, doing so may be costly, since R ∩ A can be huge for Internet IR and R is often determined subjectively by people.
- It is laborious to determine R ∩ A, and relevance claims can be costly to verify.
- Again, we may randomly pick a set F of documents and estimate precision by the ratio of |R ∩ A ∩ F| to |A ∩ F|.
- A ∩ F is relatively small, so it is much easier to find and verify R ∩ A ∩ F.
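A minimal sketch of the sampling idea from slides 15 and 16, assuming we can draw a random sample F from the collection and that a (possibly human) judge tells us which sampled documents are relevant; the function name and the is_relevant oracle are illustrative.

```python
import random

def estimate_recall_precision(answer, corpus, is_relevant, sample_size=100, seed=0):
    """Estimate recall and precision from a random sample F of the corpus.

    answer      -- the retrieved set A
    corpus      -- the full document collection D
    is_relevant -- oracle (e.g., a human judge) mapping doc -> True/False
    """
    rng = random.Random(seed)
    F = rng.sample(list(corpus), min(sample_size, len(corpus)))

    rel_in_F = [d for d in F if is_relevant(d)]                 # R ∩ F
    rel_retrieved_in_F = [d for d in rel_in_F if d in answer]   # R ∩ A ∩ F
    retrieved_in_F = [d for d in F if d in answer]              # A ∩ F

    recall_est = len(rel_retrieved_in_F) / len(rel_in_F) if rel_in_F else None
    precision_est = len(rel_retrieved_in_F) / len(retrieved_in_F) if retrieved_in_F else None
    return recall_est, precision_est
```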
17. Recall and Precision: Dual Objectives for IR Systems
- We want both precision and recall to be one.
- Unfortunately, precision and recall tend to move in opposite directions!
- Given a system:
  - Broadening a query will increase recall but lower precision.
  - Increasing the number of documents returned has the same effect.
- Different queries may yield different values of recall.
- Use the average over a chosen set of queries.
18. Recall and Precision: The Recall-Precision Curve
- Usually, recall and precision have a trade-off relationship: increased precision results in decreased recall, and vice versa.
19. Recall and Precision: Examples, Page 1
- Consider a query for which the relevant set is R = {d1, d2, d3, d4, d5} out of a collection D of 10 documents.
- Assume that a given IR system returned A = {d3, d6, d1, d4}.
- Recall = 3/5 = 60%, and Precision = 3/4 = 75%.
- How do we visualize the relationship between recall and precision when ranking is taken into account?
20. Recall and Precision: Examples, Page 2
- R = {d1, d2, d3, d4, d5}
- A = (d3, d6, d1, d4), in ranked order
- The first result, {d3}, yields 100% precision at 20% recall.
- The first two results, {d3, d6}, yield 50% precision at 20% recall.
- The first three results, {d3, d6, d1}, yield 66.7% precision at 40% recall.
- All four results, {d3, d6, d1, d4}, yield 75% precision at 60% recall.
- NOTE: recall never decreases as more results are returned.
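The numbers above come from evaluating precision and recall at each cut-off of the ranked result list. A minimal sketch, using the example ranking from this slide (the function name pr_at_each_rank is illustrative):

```python
def pr_at_each_rank(ranked_results, relevant):
    """Yield (rank k, recall@k, precision@k) for every prefix of the ranked list."""
    relevant = set(relevant)
    hits = 0
    for k, doc in enumerate(ranked_results, start=1):
        if doc in relevant:
            hits += 1
        yield k, hits / len(relevant), hits / k

R = {"d1", "d2", "d3", "d4", "d5"}
A = ["d3", "d6", "d1", "d4"]          # ranked order
for k, r, p in pr_at_each_rank(A, R):
    print(f"top {k}: recall {r:.0%}, precision {p:.1%}")
# top 1: recall 20%, precision 100.0%
# top 2: recall 20%, precision 50.0%
# top 3: recall 40%, precision 66.7%
# top 4: recall 60%, precision 75.0%
```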
21. Recall and Precision: Examples, Page 3
22. Recall and Precision: Notes, Page 1
- Different queries to the database/search engine may result in different precision/recall values.
- We should compare the precision and recall of different IR methods averaged over all the queries.
23. Recall and Precision: Notes, Page 2
- One system is better than another
  - if, at the same recall level, it is more precise than the other,
  - or if, at the same precision level, it recalls more than the other.
- Note: this may not hold for every query when comparing different IR methods, and may even conflict across queries. Therefore it can only be estimated on average.
24. Other Measures
- Benchmark Approach
- Average Precision at Seen Relevant Docs and R-Precision
- Van Rijsbergen's Measure
- Interpretation
- User-Oriented Measures
- Coverage and Novelty
25. Recall and Precision: An Alternative Metric, Fallout
- Definition: FO = |A - R| / |D - R|
- D - R: all the irrelevant documents
- A - R: the irrelevant documents in the answer set
26. Recall and Precision: An Alternative Metric, Fallout
- Definition: FO = |A - R| / |D - R|
- Fallout is concerned with retrieved but non-relevant docs.
- It is the fraction of the non-relevant document set that appears in the answer set.
- It is the counterpart of recall for the non-relevant documents.
- A good algorithm should have Recall > Fallout.
- Otherwise, one could just return all the documents to get a better recall.
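A minimal sketch of the fallout computation under the same set-based notation as above (the function name is illustrative):

```python
def fallout(answer, relevant, corpus):
    """Fallout = |A - R| / |D - R|: fraction of irrelevant documents that were retrieved."""
    answer, relevant, corpus = set(answer), set(relevant), set(corpus)
    irrelevant = corpus - relevant                  # D - R
    retrieved_irrelevant = answer - relevant        # A - R
    return len(retrieved_irrelevant) / len(irrelevant) if irrelevant else 0.0

D = {f"d{i}" for i in range(1, 11)}                 # the 10-document example collection
R = {"d1", "d2", "d3", "d4", "d5"}
A = {"d3", "d6", "d1", "d4"}
print(fallout(A, R, D))                             # 1/5 = 0.2, well below the recall of 0.6
```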
27. Other Measures: Benchmark Approach
- Benchmark approach:
  - Given a set of problems with known answers,
  - test different IR methods against them.
- Yearly competition: TREC
- TREC (see http://trec.nist.gov/) maintains about 6 GB of SGML-tagged text, plus queries and their respective answers, for evaluation purposes.
- The answers to the queries are obtained manually in advance.
- IR systems are tested against them and evaluated accordingly.
28. Other Measures: Average Precision at Seen Relevant Docs and R-Precision
- Average precision at seen relevant docs: compute the precision every time a relevant doc is found and report the overall average.
- R-Precision: the precision at the position of the lowest-ranked relevant doc.
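A minimal sketch of "average precision at seen relevant docs" for a single ranked result list (the function name is illustrative); it simply averages the precision values computed at the ranks where relevant documents appear:

```python
def avg_precision_at_seen_relevant(ranked_results, relevant):
    """Average of precision@k over the ranks k at which a relevant doc is seen."""
    relevant = set(relevant)
    precisions, hits = [], 0
    for k, doc in enumerate(ranked_results, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / k)     # precision at this relevant hit
    return sum(precisions) / len(precisions) if precisions else 0.0

# Example: (1.0 + 0.667 + 0.75) / 3 ≈ 0.806 for the ranked list from slide 20
print(avg_precision_at_seen_relevant(["d3", "d6", "d1", "d4"],
                                     {"d1", "d2", "d3", "d4", "d5"}))
```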
29. Other Measures: Van Rijsbergen's Measure
- Given the j-th doc in the ranking, let rj be its recall and pj its precision.
- Van Rijsbergen proposed the following measure:
  - Ej = 1 - (1 + b^2) / (b^2/rj + 1/pj)
- b is a parameter set by the user.
- If b = 1, then Ej = 1 - 2 / (1/rj + 1/pj).
- Docs with high precision and high recall have a low E value, whereas docs with low precision and low recall have a high E value.
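A minimal sketch of the E measure for a given recall/precision pair (variable names follow the slide; the function name is illustrative):

```python
def e_measure(recall, precision, b=1.0):
    """Van Rijsbergen's E: low for high recall/precision, high for low recall/precision."""
    if recall == 0 or precision == 0:
        return 1.0                                   # worst case by convention
    return 1.0 - (1.0 + b**2) / (b**2 / recall + 1.0 / precision)

print(e_measure(0.6, 0.75))          # b = 1: E = 1 - 2/(1/0.6 + 1/0.75) = 1/3
print(e_measure(0.6, 0.75, b=2.0))   # b != 1 shifts the relative weight of precision vs. recall
```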
30. Other Measures: Van Rijsbergen Interpretation
- If b > 1, the emphasis is on precision.
- If b < 1, the user is more interested in recall.
- The main feature of the measure E is that it evaluates each ranked document, not the whole document set, so anomalies can be seen.
31. Other Measures: User-Oriented
- It is also important to take into account how different users feel about the answer sets.
- Users may consider the same answer set to be of different usefulness; this is especially true if they already know (to differing degrees) the answers they should obtain.
- In addition to R and A, let us also consider the following subsets of R:
32. Other Measures: User-Oriented Illustration
- K: the set of relevant answers which are known to the user
- U: the set of relevant answers which were not known to the user and were retrieved
[Figure: Venn diagram showing the answer set A, the relevant docs R, the relevant docs known to the user (K), and the relevant docs not known to the user that were retrieved (U)]
33. Other Measures: Coverage and Novelty
- C = |A ∩ K| / |K| is the coverage of the answer set.
- A high coverage ratio means that the system is finding most of what the user was expecting.
- N = |U| / (|K| + |U|) is the novelty of the answer.
- A high novelty ratio means that the user is finding many new relevant docs which were not known before.
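A minimal sketch of the coverage and novelty ratios, following the slide's notation; the example sets and the function name are illustrative:

```python
def coverage_novelty(answer, known_relevant, unknown_relevant_retrieved):
    """Coverage C = |A ∩ K| / |K|; Novelty N = |U| / (|K| + |U|), as defined on slide 33."""
    A = set(answer)
    K = set(known_relevant)                 # relevant docs the user already knew about
    U = set(unknown_relevant_retrieved)     # relevant docs retrieved that the user did not know
    coverage = len(A & K) / len(K) if K else 0.0
    novelty = len(U) / (len(K) + len(U)) if (K or U) else 0.0
    return coverage, novelty

A = {"d1", "d3", "d4", "d6"}
K = {"d1", "d2", "d3"}
U = {"d4"}
print(coverage_novelty(A, K, U))   # ≈ (0.667, 0.25)
```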
34. Web Search Engine Evaluation: New Issues
- Web
  - The number of web pages (docs) on the web was estimated at more than two billion as of April 2001 (www.searchenginewatch.com).
  - It is almost impossible to get all relevant web pages from the Internet.
  - Web pages are dynamic: some will disappear tomorrow, or be updated.
- Users (in different countries)
  - Even for the same query, different users may desire different results.
  - Users tend to use short queries.
- Two main issues in web search engine evaluation:
  - Web search engine coverage
  - Web search engine effectiveness
35. Web Search Engine Evaluation: Coverage
- Some published approaches to estimating coverage are based on the number of hits for certain queries as reported by the services themselves.
- For example, the method used by Search Engine Showdown: http://www.searchengineshowdown.com/stats/fast300.shtml
- To compare the sizes of the search engine databases, the study uses 25 specific queries that meet the criteria listed on the following slides.
- The results of each query are verified when possible, and only the number of hits that can be displayed is counted.
36. Web Search Engine Evaluation: Coverage, Query Criteria
- Only single words are used, to avoid any variation in the processing of multiple-term searches.
- Terms were drawn from a variety of reference books that cover different fields.
37. Web Search Engine Evaluation: Coverage, Selection of Query Terms
- Any term used must find fewer than 1,000 results in the AltaVista Advanced Search, since numbers higher than that cannot be verified on AltaVista.
- Since Northern Light automatically searches both the English plural and singular forms of words, query terms were chosen that cannot generally be made plural. This was checked by pluralizing the word and running a search on AltaVista or Fast; only those terms whose plural form found zero results were used.
38. Web Search Engine Evaluation: Coverage, Real Data
- http://www.searchengineshowdown.com/stats/sizeest.shtml
39. Web Search Engine Evaluation: Coverage Estimation, A Simple Method
- (1) For all the queries, check all the results retrieved by each search engine and obtain the total count of valid web pages for each search engine.
- (2) Based on the known sizes of the Northern Light and Fast search engines, estimate each search engine's coverage.
40. Web Search Engine Evaluation: Coverage Estimation, Another Method (Krishna Bharat and Andrei Broder)
41. Web Search Engine Evaluation: Coverage Estimation Background, Conditional Probability Method
- Let Pr(A) be the probability that an element belongs to the set A, and let Pr(A ∩ B | A) be the conditional probability that an element belongs to both sets given that it belongs to A. Then
- Pr(A ∩ B | A) = Size(A ∩ B) / Size(A), and similarly Pr(A ∩ B | B) = Size(A ∩ B) / Size(B), and therefore
- Size(A) / Size(B) = Pr(A ∩ B | B) / Pr(A ∩ B | A).
42. Web Search Engine Evaluation: Coverage Estimation Background, Conditional Probability Method, Two Major Procedures
- To implement this idea we need two procedures:
  - Sampling: a procedure for picking pages uniformly at random from the index of a particular engine.
  - Checking: a procedure for determining whether a particular page is indexed by a particular engine.
43. Web Search Engine Evaluation: Coverage Estimation Background, Conditional Probability Method, The Solution
- Overlap estimate: the fraction of E1's database indexed by E2 is estimated by the fraction of URLs sampled from E1 that are found in E2.
- Size comparison: for search engines E1 and E2, the ratio Size(E1)/Size(E2) is estimated by
  (fraction of URLs sampled from E2 found in E1) / (fraction of URLs sampled from E1 found in E2)
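A minimal sketch of this style of estimate, assuming we already have random URL samples from each engine and a checking function for each engine; the function and parameter names are illustrative:

```python
def overlap_fraction(sample_urls, is_indexed_by_other):
    """Fraction of URLs sampled from one engine that the other engine also indexes."""
    found = [url for url in sample_urls if is_indexed_by_other(url)]
    return len(found) / len(sample_urls) if sample_urls else 0.0

def size_ratio(sample_e1, sample_e2, indexed_by_e1, indexed_by_e2):
    """Estimate Size(E1)/Size(E2) as Pr(E1 ∩ E2 | E2) / Pr(E1 ∩ E2 | E1)."""
    frac_e2_in_e1 = overlap_fraction(sample_e2, indexed_by_e1)   # sampled from E2, found in E1
    frac_e1_in_e2 = overlap_fraction(sample_e1, indexed_by_e2)   # sampled from E1, found in E2
    return frac_e2_in_e1 / frac_e1_in_e2 if frac_e1_in_e2 else float("inf")
```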
44. Web Search Engine Evaluation: Effectiveness
- How do we evaluate the quality of the results retrieved by different search engines?
- Manual evaluation
  - Benefit: accuracy with respect to users' expectations.
  - Drawbacks: subjective (how do we choose the reviewers?) and time-consuming.
- Automatic evaluation
  - Much better at adapting to the fast-changing Web and search engines, as well as to the large amount of information on the web.
45. Web Search Engine Evaluation: Effectiveness, Automatic, Page 1
- In a study by Longzhuang Li, Yi Shang, and Wei Zhang, two sample query sets were used:
  - (a) the TKDE set, containing 1383 queries derived from the index terms of papers published in the IEEE Transactions on Knowledge and Data Engineering between January 1995 and June 2000;
  - (b) the TPDS set, containing 1726 queries derived from the index terms of papers published in the IEEE Transactions on Parallel and Distributed Systems between January 1995 and February 2000.
46. Web Search Engine Evaluation: Effectiveness, Automatic, Page 2
- For each query, the top 20 hits from each search engine are analysed.
- To compute the relevance score, each hit is followed to retrieve the corresponding Web document.
- The scores are calculated based on four models:
  - (a) Vector Space Model
  - (b) Okapi Similarity Measurement (Okapi)
  - (c) Cover Density Ranking (CDR)
  - (d) Three-Level Scoring Method (TLS)
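As an illustration of the first of these models, here is a minimal sketch of vector-space (TF-IDF cosine) scoring of one retrieved document against a query; it is a generic textbook formulation, not the exact scoring code used in the study, and all names and numbers are illustrative:

```python
import math
from collections import Counter

def cosine_score(query_terms, doc_terms, doc_freq, num_docs):
    """TF-IDF cosine similarity between a query and one retrieved document."""
    q_tf, d_tf = Counter(query_terms), Counter(doc_terms)

    def weight(tf, term):
        idf = math.log(num_docs / (1 + doc_freq.get(term, 0)))   # inverse document frequency
        return tf * idf

    terms = set(q_tf) | set(d_tf)
    q_vec = {t: weight(q_tf[t], t) for t in terms}
    d_vec = {t: weight(d_tf[t], t) for t in terms}
    dot = sum(q_vec[t] * d_vec[t] for t in terms)
    q_norm = math.sqrt(sum(w * w for w in q_vec.values()))
    d_norm = math.sqrt(sum(w * w for w in d_vec.values()))
    return dot / (q_norm * d_norm) if q_norm and d_norm else 0.0

# Hypothetical example: doc_freq maps each term to its document frequency in a small corpus.
doc_freq = {"search": 50, "engine": 40, "evaluation": 10, "cat": 5}
print(cosine_score(["search", "engine", "evaluation"],
                   ["search", "engine", "evaluation", "recall", "precision"],
                   doc_freq, num_docs=1000))
```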
47. Web Search Engine Evaluation: Effectiveness, Automatic, Page 3
- The search engines are ranked by their average relevance scores, computed separately with each of the four scoring methods above.
- Their conclusion: Google was the best.
48. Summary
- Background
- Recall and Precision
- Other Measures
- Web Search Engine Evaluation