Title: Lecture 9: Search Engine Evaluation
1. Lecture 9: Search Engine Evaluation
- Prof. Xiaotie Deng
- Department of Computer Science
2. Outline
- Background
- Recall and Precision
- Other Measures
- Web Search Engine Evaluation
3. Background: Motivation
- There are many search engines on the market; which one is best for your needs?
- A search engine may use one of several models (e.g., Boolean or vector), different indexing data structures, different user interfaces, etc.
- Which combination is the best one?
4. Background: Two Major Aspects
- Efficiency: speed
- Effectiveness: how good are the results? (quality)
- Speed is rather technical and relatively easy to evaluate.
- Effectiveness is much more difficult to judge.
- Our focus will be on effectiveness evaluation.
5. Background: Relevancy
- Effectiveness is related to the relevancy of the documents retrieved.
- Relevancy, from a human judgment standpoint, is
  - subjective: depends upon a specific user's judgment
  - situational: relates to the user's requirements
  - cognitive: depends on human perception and behavior
  - temporal: changes over time
6. Background: Relevancy Threshold Method
- Relevancy is not a binary value, but a continuous function.
- If the user considers the relevancy value of a document to exceed a threshold (the threshold may not exist, and if it exists, it is decided by the user), the document is deemed relevant; otherwise it is deemed irrelevant.
7. Background: Document Space
8. Background: Parameters
- Given a query I, an IR system will return a set of documents as the answer.
- R is the set of documents relevant to the query.
- A is the returned answer set.
- |R| and |A| denote the cardinalities of these sets.
- D denotes the set of all documents.
9. Background: Parameter Availability
- Among these numbers, only two are always available for Internet IR:
  - the total number of items retrieved, |A|
  - the number of relevant items retrieved, |R ∩ A|
- The total number of relevant items, |R|, is usually not available.
10. Recall and Precision
- Two important metrics for evaluating the relevance of the documents returned by an IR system.
11. Recall and Precision: Definition of Recall
- Recall = |R ∩ A| / |R|, a value between 0 and 1.
- If Recall = 1, the system retrieved all relevant documents.
- If Recall = 0, none of the retrieved documents are relevant.
- What is a simple way to achieve Recall = 1?
12. Recall and Precision: Definition of Precision
- Precision = |R ∩ A| / |A|, a value between 0 and 1.
- If Precision = 1, all retrieved documents are relevant.
- If Precision = 0, all retrieved documents are irrelevant.
- How can Precision = 1 be achieved?
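The two definitions above translate directly into set operations. Below is a minimal sketch in Python; the document IDs and the helper name recall_precision are illustrative, not from the lecture.

```python
def recall_precision(relevant, answer):
    """Compute (recall, precision) from the relevant set R and the answer set A."""
    relevant = set(relevant)
    answer = set(answer)
    hits = relevant & answer                      # R ∩ A: relevant documents that were retrieved
    recall = len(hits) / len(relevant) if relevant else 0.0
    precision = len(hits) / len(answer) if answer else 0.0
    return recall, precision

# Example used later in the slides: R = {d1..d5}, A = {d3, d6, d1, d4}
R = {"d1", "d2", "d3", "d4", "d5"}
A = {"d3", "d6", "d1", "d4"}
print(recall_precision(R, A))   # (0.6, 0.75)
```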
13. Recall and Precision: Roles
- Recall measures the ability of the search to find all of the relevant items in the database.
- Precision
  - evaluates the correlation of the query to the database
  - is an indirect measure of the completeness of the indexing algorithm
14. Recall and Precision: Evaluation
- Precision can be evaluated exactly by dividing |R ∩ A| by |A|, since both numbers are available.
- Recall cannot be evaluated exactly in general, since it is defined as the ratio of |R ∩ A| to |R|, and the latter is usually not available.
15. Recall and Precision: Estimating Recall
- Randomly pick a set F of documents.
- Heuristic argument (a sampling technique from statistics): the proportion of R ∩ A in R is approximately the same as the proportion of R ∩ A ∩ F in R ∩ F.
- Recall can therefore be estimated by the ratio of |R ∩ A ∩ F| to |R ∩ F|.
16. Recall and Precision: Estimating Precision
- Even though precision can be evaluated exactly, doing so may be costly, since R ∩ A can be huge for Internet IR and R is often determined subjectively by people.
- It is laborious to determine R ∩ A, and relevance claims can be costly to verify.
- Again, we may randomly pick a set F of documents and estimate precision by the ratio of |R ∩ A ∩ F| to |A ∩ F|.
- A ∩ F is relatively small, so it is much easier to find and verify R ∩ A ∩ F.
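A minimal sketch of the sampling idea from slides 15 and 16, assuming we can draw a random sample F from the collection and that a (possibly human) judge tells us which sampled documents are relevant; the function name and the is_relevant oracle are illustrative.

```python
import random

def estimate_recall_precision(answer, corpus, is_relevant, sample_size=100, seed=0):
    """Estimate recall and precision from a random sample F of the corpus.

    answer      -- the retrieved set A
    corpus      -- the full document collection D
    is_relevant -- oracle (e.g., a human judge) mapping doc -> True/False
    """
    rng = random.Random(seed)
    F = rng.sample(list(corpus), min(sample_size, len(corpus)))

    rel_in_F = [d for d in F if is_relevant(d)]                 # R ∩ F
    rel_retrieved_in_F = [d for d in rel_in_F if d in answer]   # R ∩ A ∩ F
    retrieved_in_F = [d for d in F if d in answer]              # A ∩ F

    recall_est = len(rel_retrieved_in_F) / len(rel_in_F) if rel_in_F else None
    precision_est = len(rel_retrieved_in_F) / len(retrieved_in_F) if retrieved_in_F else None
    return recall_est, precision_est
```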
17. Recall and Precision: Dual Objectives for IR Systems
- We want both precision and recall to be one.
- Unfortunately, precision and recall tend to move in opposite directions!
- Given a system:
  - Broadening a query will increase recall but lower precision.
  - Increasing the number of documents returned has the same effect.
- Different queries may yield different values of recall.
- Use the average over a chosen set of queries.
18. Recall and Precision: The Recall-Precision Curve
- Usually, recall and precision have a trade-off relationship: increased precision results in decreased recall, and vice versa.
19. Recall and Precision: Examples, Page 1
- Consider a query for which the relevant set is R = {d1, d2, d3, d4, d5} out of a collection D of 10 documents.
- Assume that a given IR system returned A = {d3, d6, d1, d4}.
- Recall = 3/5 = 60%, and Precision = 3/4 = 75%.
- How do we visualize the relationship between recall and precision when ranking is taken into account?
20. Recall and Precision: Examples, Page 2
- R = {d1, d2, d3, d4, d5}
- A = (d3, d6, d1, d4), in ranked order
- The first result, {d3}, yields 100% precision at 20% recall.
- The first two results, {d3, d6}, yield 50% precision at 20% recall.
- The first three results, {d3, d6, d1}, yield 66.7% precision at 40% recall.
- All four results, {d3, d6, d1, d4}, yield 75% precision at 60% recall.
- NOTE: recall never decreases as more results are returned.
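The numbers above come from evaluating precision and recall at each cut-off of the ranked result list. A minimal sketch, using the example ranking from this slide (the function name pr_at_each_rank is illustrative):

```python
def pr_at_each_rank(ranked_results, relevant):
    """Yield (rank k, recall@k, precision@k) for every prefix of the ranked list."""
    relevant = set(relevant)
    hits = 0
    for k, doc in enumerate(ranked_results, start=1):
        if doc in relevant:
            hits += 1
        yield k, hits / len(relevant), hits / k

R = {"d1", "d2", "d3", "d4", "d5"}
A = ["d3", "d6", "d1", "d4"]          # ranked order
for k, r, p in pr_at_each_rank(A, R):
    print(f"top {k}: recall {r:.0%}, precision {p:.1%}")
# top 1: recall 20%, precision 100.0%
# top 2: recall 20%, precision 50.0%
# top 3: recall 40%, precision 66.7%
# top 4: recall 60%, precision 75.0%
```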
21. Recall and Precision: Examples, Page 3
22. Recall and Precision: Notes, Page 1
- Different queries to the database/search engine may result in different precision/recall values.
- We should compare the precision and recall of different IR methods averaged over all the queries.
23. Recall and Precision: Notes, Page 2
- One system is better than another
  - if, at the same recall level, it is more precise than the other,
  - or if, at the same precision level, it recalls more than the other.
- Note: this may not hold for every query when comparing different IR methods, and may even conflict across queries. Therefore it can only be estimated on average.
24. Other Measures
- Benchmark Approach
- Average Precision at Seen Relevant Docs and R-Precision
- Van Rijsbergen's Measure
- Interpretation
- User-Oriented Measures
- Coverage and Novelty
25. Recall and Precision: An Alternative Metric, Fallout
- Definition: FO = |A - R| / |D - R|
- D - R: all the irrelevant documents
- A - R: the irrelevant documents in the answer set
26. Recall and Precision: An Alternative Metric, Fallout
- Definition: FO = |A - R| / |D - R|
- Fallout is concerned with retrieved but non-relevant docs.
- It is the fraction of the non-relevant document set that appears in the answer set.
- It is the counterpart of recall for the non-relevant documents.
- A good algorithm should have Recall > Fallout.
- Otherwise, one could just return all the documents to get a better recall.
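A minimal sketch of the fallout computation under the same set-based notation as above (the function name is illustrative):

```python
def fallout(answer, relevant, corpus):
    """Fallout = |A - R| / |D - R|: fraction of irrelevant documents that were retrieved."""
    answer, relevant, corpus = set(answer), set(relevant), set(corpus)
    irrelevant = corpus - relevant                  # D - R
    retrieved_irrelevant = answer - relevant        # A - R
    return len(retrieved_irrelevant) / len(irrelevant) if irrelevant else 0.0

D = {f"d{i}" for i in range(1, 11)}                 # the 10-document example collection
R = {"d1", "d2", "d3", "d4", "d5"}
A = {"d3", "d6", "d1", "d4"}
print(fallout(A, R, D))                             # 1/5 = 0.2, well below the recall of 0.6
```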
27. Other Measures: Benchmark Approach
- Benchmark approach:
  - Given a set of problems with known answers,
  - test different IR methods against them.
- Yearly competition: TREC
- TREC (see http://trec.nist.gov/) maintains about 6 GB of SGML-tagged text, plus queries and their respective answers, for evaluation purposes.
- The answers to the queries are obtained manually in advance.
- IR systems are tested against them and evaluated accordingly.
28. Other Measures: Average Precision at Seen Relevant Docs and R-Precision
- Average precision at seen relevant docs: compute the precision every time a relevant doc is found and report the overall average.
- R-Precision: the precision at the position of the lowest-ranked relevant doc.
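A minimal sketch of "average precision at seen relevant docs" for a single ranked result list (the function name is illustrative); it simply averages the precision values computed at the ranks where relevant documents appear:

```python
def avg_precision_at_seen_relevant(ranked_results, relevant):
    """Average of precision@k over the ranks k at which a relevant doc is seen."""
    relevant = set(relevant)
    precisions, hits = [], 0
    for k, doc in enumerate(ranked_results, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / k)     # precision at this relevant hit
    return sum(precisions) / len(precisions) if precisions else 0.0

# Example: (1.0 + 0.667 + 0.75) / 3 ≈ 0.806 for the ranked list from slide 20
print(avg_precision_at_seen_relevant(["d3", "d6", "d1", "d4"],
                                     {"d1", "d2", "d3", "d4", "d5"}))
```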
29. Other Measures: Van Rijsbergen's Measure
- Given the j-th doc in the ranking, let rj be its recall and pj its precision.
- Van Rijsbergen proposed the following measure:
  - Ej = 1 - (1 + b^2) / (b^2/rj + 1/pj)
- b is a parameter set by the user.
- If b = 1, then Ej = 1 - 2 / (1/rj + 1/pj).
- Docs with high precision and high recall have a low E value, whereas docs with low precision and low recall have a high E value.
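A minimal sketch of the E measure for a given recall/precision pair (variable names follow the slide; the function name is illustrative):

```python
def e_measure(recall, precision, b=1.0):
    """Van Rijsbergen's E: low for high recall/precision, high for low recall/precision."""
    if recall == 0 or precision == 0:
        return 1.0                                   # worst case by convention
    return 1.0 - (1.0 + b**2) / (b**2 / recall + 1.0 / precision)

print(e_measure(0.6, 0.75))          # b = 1: E = 1 - 2/(1/0.6 + 1/0.75) = 1/3
print(e_measure(0.6, 0.75, b=2.0))   # b != 1 shifts the relative weight of precision vs. recall
```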
30. Other Measures: Van Rijsbergen Interpretation
- If b > 1, the emphasis is on precision.
- If b < 1, the user is more interested in recall.
- The main feature of the measure E is that it evaluates each ranked document, not the whole document set, so anomalies can be seen.
31. Other Measures: User-Oriented
- It is also important to take into account how different users feel about the answer sets.
- Users may consider the same answer set to be of different usefulness; this is especially true if they already know (to differing degrees) the answers they should obtain.
- In addition to R and A, let us also consider the following subsets of R:
32. Other Measures: User-Oriented Illustration
- K: the set of relevant answers which are known to the user
- U: the set of relevant answers which were not known to the user and were retrieved
[Figure: Venn diagram showing the answer set A, the relevant docs R, the relevant docs known to the user (K), and the relevant docs not known to the user that were retrieved (U)]
33. Other Measures: Coverage and Novelty
- C = |A ∩ K| / |K| is the coverage of the answer set.
- A high coverage ratio means that the system is finding most of what the user was expecting.
- N = |U| / (|K| + |U|) is the novelty of the answer.
- A high novelty ratio means that the user is finding many new relevant docs which were not known before.
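A minimal sketch of the coverage and novelty ratios, following the slide's notation; the example sets and the function name are illustrative:

```python
def coverage_novelty(answer, known_relevant, unknown_relevant_retrieved):
    """Coverage C = |A ∩ K| / |K|; Novelty N = |U| / (|K| + |U|), as defined on slide 33."""
    A = set(answer)
    K = set(known_relevant)                 # relevant docs the user already knew about
    U = set(unknown_relevant_retrieved)     # relevant docs retrieved that the user did not know
    coverage = len(A & K) / len(K) if K else 0.0
    novelty = len(U) / (len(K) + len(U)) if (K or U) else 0.0
    return coverage, novelty

A = {"d1", "d3", "d4", "d6"}
K = {"d1", "d2", "d3"}
U = {"d4"}
print(coverage_novelty(A, K, U))   # ≈ (0.667, 0.25)
```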
34. Web Search Engine Evaluation: New Issues
- Web
  - The number of web pages (docs) on the web was estimated at more than two billion as of April 2001 (www.searchenginewatch.com).
  - It is almost impossible to get all relevant web pages from the Internet.
  - Web pages are dynamic: some will disappear tomorrow, or be updated.
- Users (in different countries)
  - Even for the same query, different users may desire different results.
  - Users tend to use short queries.
- Two main issues in web search engine evaluation:
  - Web search engine coverage
  - Web search engine effectiveness
35. Web Search Engine Evaluation: Coverage
- Some published approaches to estimating coverage are based on the number of hits for certain queries as reported by the services themselves.
- For example, the method used by Search Engine Showdown: http://www.searchengineshowdown.com/stats/fast300.shtml
- To compare the sizes of the search engine databases, the study uses 25 specific queries that meet the criteria listed on the following slides.
- The results of each query are verified when possible, and only the number of hits that can be displayed is counted.
36. Web Search Engine Evaluation: Coverage, Query Criteria
- Only single words are used, to avoid any variation in the processing of multiple-term searches.
- Terms were drawn from a variety of reference books that cover different fields.
37. Web Search Engine Evaluation: Coverage, Selection of Query Terms
- Any term used must find fewer than 1,000 results in the AltaVista Advanced Search, since numbers higher than that cannot be verified on AltaVista.
- Since Northern Light automatically searches both the English plural and singular forms of words, query terms were chosen that cannot generally be made plural. This was checked by pluralizing the word and running a search on AltaVista or Fast; only those terms whose plural form found zero results were used.
38. Web Search Engine Evaluation: Coverage, Real Data
- http://www.searchengineshowdown.com/stats/sizeest.shtml
39. Web Search Engine Evaluation: Coverage Estimation, A Simple Method
- (1) For all the queries, check all the results retrieved by each search engine and obtain the total count of valid web pages for each search engine.
- (2) Based on the known sizes of the Northern Light and Fast search engines, estimate each search engine's coverage.
40. Web Search Engine Evaluation: Coverage Estimation, Another Method (Krishna Bharat and Andrei Broder)
41. Web Search Engine Evaluation: Coverage Estimation Background, Conditional Probability Method
- Let Pr(A) be the probability that an element belongs to the set A, and let Pr(A ∩ B | A) be the conditional probability that an element belongs to both sets given that it belongs to A. Then
- Pr(A ∩ B | A) = Size(A ∩ B) / Size(A), and similarly Pr(A ∩ B | B) = Size(A ∩ B) / Size(B), and therefore
- Size(A) / Size(B) = Pr(A ∩ B | B) / Pr(A ∩ B | A).
42. Web Search Engine Evaluation: Coverage Estimation Background, Conditional Probability Method, Two Major Procedures
- To implement this idea we need two procedures:
  - Sampling: a procedure for picking pages uniformly at random from the index of a particular engine.
  - Checking: a procedure for determining whether a particular page is indexed by a particular engine.
43. Web Search Engine Evaluation: Coverage Estimation Background, Conditional Probability Method, The Solution
- Overlap estimate: the fraction of E1's database indexed by E2 is estimated by the fraction of URLs sampled from E1 that are found in E2.
- Size comparison: for search engines E1 and E2, the ratio Size(E1)/Size(E2) is estimated by
  (fraction of URLs sampled from E2 found in E1) / (fraction of URLs sampled from E1 found in E2)
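A minimal sketch of this style of estimate, assuming we already have random URL samples from each engine and a checking function for each engine; the function and parameter names are illustrative:

```python
def overlap_fraction(sample_urls, is_indexed_by_other):
    """Fraction of URLs sampled from one engine that the other engine also indexes."""
    found = [url for url in sample_urls if is_indexed_by_other(url)]
    return len(found) / len(sample_urls) if sample_urls else 0.0

def size_ratio(sample_e1, sample_e2, indexed_by_e1, indexed_by_e2):
    """Estimate Size(E1)/Size(E2) as Pr(E1 ∩ E2 | E2) / Pr(E1 ∩ E2 | E1)."""
    frac_e2_in_e1 = overlap_fraction(sample_e2, indexed_by_e1)   # sampled from E2, found in E1
    frac_e1_in_e2 = overlap_fraction(sample_e1, indexed_by_e2)   # sampled from E1, found in E2
    return frac_e2_in_e1 / frac_e1_in_e2 if frac_e1_in_e2 else float("inf")
```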
44. Web Search Engine Evaluation: Effectiveness
- How do we evaluate the quality of the results retrieved by different search engines?
- Manual evaluation
  - Benefit: accuracy with respect to users' expectations.
  - Drawbacks: subjective (how do we choose the reviewers?) and time-consuming.
- Automatic evaluation
  - Much better at adapting to the fast-changing Web and search engines, as well as to the large amount of information on the web.
45. Web Search Engine Evaluation: Effectiveness, Automatic, Page 1
- In a study by Longzhuang Li, Yi Shang, and Wei Zhang, two sample query sets were used:
  - (a) the TKDE set, containing 1383 queries derived from the index terms of papers published in the IEEE Transactions on Knowledge and Data Engineering between January 1995 and June 2000;
  - (b) the TPDS set, containing 1726 queries derived from the index terms of papers published in the IEEE Transactions on Parallel and Distributed Systems between January 1995 and February 2000.
46. Web Search Engine Evaluation: Effectiveness, Automatic, Page 2
- For each query, the top 20 hits from each search engine are analysed.
- To compute the relevance score, each hit is followed to retrieve the corresponding Web document.
- The scores are calculated based on four models:
  - (a) Vector Space Model
  - (b) Okapi Similarity Measurement (Okapi)
  - (c) Cover Density Ranking (CDR)
  - (d) Three-Level Scoring Method (TLS)
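As an illustration of the first of these models, here is a minimal sketch of vector-space (TF-IDF cosine) scoring of one retrieved document against a query; it is a generic textbook formulation, not the exact scoring code used in the study, and all names and numbers are illustrative:

```python
import math
from collections import Counter

def cosine_score(query_terms, doc_terms, doc_freq, num_docs):
    """TF-IDF cosine similarity between a query and one retrieved document."""
    q_tf, d_tf = Counter(query_terms), Counter(doc_terms)

    def weight(tf, term):
        idf = math.log(num_docs / (1 + doc_freq.get(term, 0)))   # inverse document frequency
        return tf * idf

    terms = set(q_tf) | set(d_tf)
    q_vec = {t: weight(q_tf[t], t) for t in terms}
    d_vec = {t: weight(d_tf[t], t) for t in terms}
    dot = sum(q_vec[t] * d_vec[t] for t in terms)
    q_norm = math.sqrt(sum(w * w for w in q_vec.values()))
    d_norm = math.sqrt(sum(w * w for w in d_vec.values()))
    return dot / (q_norm * d_norm) if q_norm and d_norm else 0.0

# Hypothetical example: doc_freq maps each term to its document frequency in a small corpus.
doc_freq = {"search": 50, "engine": 40, "evaluation": 10, "cat": 5}
print(cosine_score(["search", "engine", "evaluation"],
                   ["search", "engine", "evaluation", "recall", "precision"],
                   doc_freq, num_docs=1000))
```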
47. Web Search Engine Evaluation: Effectiveness, Automatic, Page 3
- The search engines are ranked by their average relevance scores, computed separately with each of the four scoring methods above.
- Their conclusion: Google was the best.
48. Summary
- Background
- Recall and Precision
- Other Measures
- Web Search Engine Evaluation