Towards Comprehensive and Consistent Web Search

1
Towards Comprehensive and Consistent Web Search

Erik Selberg
University of Washington
April 22, 1999

2
The Ideal Web Information Service

Single universal Information Service
Comprehensive
Indexable Web, Invisible Web, Local Files, ...
Retrieves all relevant information
No irrelevant information
Results returned in under a second

3
Available Web Information Services

Spider-based Web search engines
AltaVista, Excite, Lycos, ...
Web Directories
Yahoo!, Hot100, ...
Local Web search engines
Search UW, Search Microsoft, ...
Online databases
IMDB, USWest Dex, ...

4
Combine What's Available, Move Closer to Ideal

Single text-based Information Service
Comprehensive
As much of the Indexable Web as possible
Consistent
What is relevant stays relevant
Increase relevant documents
Decrease irrelevant documents
Results returned in real time

5
Three Hypotheses

No available search service is comprehensive
No available search service will likely be
comprehensive
Can be addressed by combining services
Most available search services are inconsistent
Can be addressed by auxiliary services
Quality and Speed are not sacrificed

6
History of MetaCrawler

MetaCrawler at UW from 1995 to 1996
MetaCrawler licensed to NetBot, Inc. in 1996
NetBot licensed MetaCrawler to Go2Net
Excite acquired NetBot, @Home acquired Excite
Authors returned to UW in Fall of 1996
HuskySearch: research MetaCrawler at UW
This talk: MetaCrawler 1995-96, HuskySearch

7
Hypothesis I: No available search service is comprehensive

Comprehensive: All documents relevant to a given query are retrievable by that query
Obtain a significant number of queries
Submit them to each available service
Compare results
Obtaining queries is hard
Build a meta-engine, let real users use it!

8
MetaCrawler
9
Goals of MetaCrawler

Show no available search service is comprehensive
Improve quality of search results
Satisfy real time constraints

10
Evaluating Comprehensiveness

Analyzed MetaCrawler log files
Log results followed by users
Logs from Nov. 26 1995 - Dec 2 1995
20,906 queries
Logs from Jan. 1 1999 - Mar 31 1999
185,027 queries
Followed references imply information of interest
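
The log analysis described on this slide can be sketched as follows. This is a minimal illustration, not the MetaCrawler analysis code: the input format (a mapping from each followed URL to the set of services that returned it) and both function names are assumptions made for the example.

```python
from collections import Counter


def overlap_histogram(followed: dict[str, set[str]]) -> Counter:
    """Count followed URLs by how many services returned them."""
    return Counter(len(services) for services in followed.values())


def coverage(followed: dict[str, set[str]]) -> dict[str, float]:
    """Fraction of all followed URLs that each service returned."""
    total = len(followed)
    hits = Counter(s for services in followed.values() for s in services)
    return {service: count / total for service, count in hits.items()}


# Hypothetical log extract: URL -> services that returned it.
log = {"u1": {"A"}, "u2": {"A", "B"}, "u3": {"B"}, "u4": {"C"}}
print(overlap_histogram(log))
print(coverage(log))
```

If most followed URLs fall in the "returned by one service" bucket, the services are largely disjoint, and a low coverage value for every service suggests that none of them is comprehensive on its own.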

11
Search engines were disjoint in 1995
12
All engines returned information of interest in
1995
13
Search engines are still disjoint
14
All engines still return information of interest
15
Improving Search Result Quality

Combining results increases relevant results
Also increases irrelevant results!
Interleaving results produces better ranking
Use alternate forms of ranking
Clustering (Zamir), Site
Quality checking via Post-Processing
Existence
Relevance
Duplicate detection
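
One simple way to interleave results and drop duplicates is a round-robin merge. The sketch below is an assumption about how such a step could look, not MetaCrawler's actual ranking; the `normalize` rule (lowercase host, trailing slash ignored) is a deliberately crude stand-in for real duplicate detection.

```python
from itertools import zip_longest
from urllib.parse import urlsplit


def normalize(url: str) -> tuple[str, str, str]:
    """Crude canonical key used only to spot duplicate URLs (an assumption)."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    return parts.netloc.lower(), path, parts.query


def interleave(result_lists: list[list[str]]) -> list[str]:
    """Round-robin merge of ranked lists, keeping the first copy of each URL."""
    seen: set[tuple[str, str, str]] = set()
    merged: list[str] = []
    for tier in zip_longest(*result_lists):
        for url in tier:
            if url is None:  # this list is shorter than the others
                continue
            key = normalize(url)
            if key not in seen:
                seen.add(key)
                merged.append(url)
    return merged


engine_a = ["http://a.com/x", "http://b.com/"]
engine_b = ["http://A.com/x/", "http://c.com/"]
print(interleave([engine_a, engine_b]))
```

The merge takes the first result of every service before any second result, so no single service dominates the top of the combined list.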

16
Satisfying Real Time Constraints

Good parallel Web retrieval engine
Event-based model
Immediate feedback via Server Push
Tell user how far along we are
Interactive browsing via Java Client
Browse intermediate results
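
The parallel, event-based retrieval with progress feedback can be illustrated with a small `asyncio` sketch. The services here are simulated with `asyncio.sleep` rather than real network calls, and the names `query_engine`, `metasearch`, and `on_progress` are hypothetical, chosen only for this example.

```python
import asyncio
from typing import Callable


async def query_engine(name: str, delay: float, results: list[str]) -> tuple[str, list[str]]:
    """Stand-in for one search service; sleeps instead of doing network I/O."""
    await asyncio.sleep(delay)
    return name, results


async def metasearch(
    engines: dict[str, tuple[float, list[str]]],
    timeout: float,
    on_progress: Callable[[str], None] = print,
) -> dict[str, list[str]]:
    """Query all services concurrently; keep whatever arrives before the deadline."""
    tasks = [asyncio.create_task(query_engine(n, d, r)) for n, (d, r) in engines.items()]
    collected: dict[str, list[str]] = {}
    try:
        for finished in asyncio.as_completed(tasks, timeout=timeout):
            name, results = await finished
            collected[name] = results
            # Immediate feedback: tell the user how far along we are.
            on_progress(f"{len(collected)}/{len(tasks)} services answered")
    except asyncio.TimeoutError:
        for task in tasks:
            task.cancel()  # give up on services that are too slow
    return collected


simulated = {"fast": (0.01, ["u1"]), "slow": (5.0, ["u2"])}
print(asyncio.run(metasearch(simulated, timeout=0.2)))
```

The deadline bounds the total response time regardless of the slowest service, which is the property the real-time goal depends on.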

17
Goals of MetaCrawler

Show no available search service is comprehensive
Improve quality of search results
Satisfy real time constraints

18
Hypothesis II: No available search service will likely be comprehensive

Disjoint in 1995, 1999
But they're working on it!
How big is the Web?
How big are the indices?
How fast is the Web growing?
How fast are the indices growing?
If and when an index will cover the Web

19
Lawrence & Giles Web Estimate (Science, Apr. 98; study done Dec. 1997)

Extended Web evaluation done by MetaCrawler
Estimate on size of Indexable Web
$N = |AV| \cdot P(x \in HB) / P(x \in AV \cap HB)$
320M pages
200M (Bharat & Broder)

[Figure: Venn diagram of the Web (N) containing the AV and HB indices and their overlap AV ∩ HB]
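
The overlap estimate on this slide can be worked through in a few lines. In this sketch the ratio P(x in AV and HB) / P(x in HB) is approximated by the fraction of a sample of HB pages that is also found in AV. The function name and all numbers in the usage line are hypothetical and are not the study's data.

```python
def estimate_web_size(size_av: float, hb_sample: int, hb_sample_in_av: int) -> float:
    """Estimate N = |AV| * P(x in HB) / P(x in AV and HB).

    hb_sample: number of pages sampled from HB's index.
    hb_sample_in_av: how many of those sampled pages AV also indexes.
    """
    if hb_sample_in_av == 0:
        raise ValueError("no overlap observed; the estimate is unbounded")
    return size_av * hb_sample / hb_sample_in_av


# Hypothetical inputs: a 100M-page index that covers 40% of the other sample.
print(estimate_web_size(100e6, hb_sample=1000, hb_sample_in_av=400))
```

The estimate assumes the two indices sample the Web independently. If both favor the same popular pages, the overlap is inflated and N is underestimated.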
20
Search Engine Sizes (12/95 - 12/98) (www.searchenginewatch.com)
AltaVista
HotBot
NorthernLight
Excite
Lycos
InfoSeek
WebCrawler
21
Relative size estimates, Jan. 1, 1999
22
Projected size estimates, Jan. 1, 2000
23
Hypothesis III: Most available search services are inconsistent

Consistent: Retrieving the same set of results by submitting the same query
Unless better results are available
25 queries issued repeatedly over one month to 9
major search services
Queries part of Lawrence & Giles study
Each query issued using 3 search options
default, phrase ("x y z"), AllPlus (+x +y +z)
Top 200 documents requested

24
Measuring Change

Results from a search engine treated as a set
Positional data was ignored!
Only compared results from the same query
returned by same engine
Bi-directional set difference
$(T_1 - T_2) \cup (T_2 - T_1)$

[Figure: Venn diagram of result sets T1 and T2]
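
The bi-directional set difference maps directly onto Python's set operators. The sketch below treats each result list as a set, as the slide describes. Dividing by the size of the union to get a change fraction is an assumption, since the slides do not state how the percentage was normalized.

```python
def changed_urls(t1: list[str], t2: list[str]) -> set[str]:
    """URLs present in exactly one of the two result sets: (T1 - T2) | (T2 - T1)."""
    s1, s2 = set(t1), set(t2)  # positional data is deliberately ignored
    return (s1 - s2) | (s2 - s1)


def change_fraction(t1: list[str], t2: list[str]) -> float:
    """Changed URLs as a share of all URLs seen (normalization is an assumption)."""
    union = set(t1) | set(t2)
    return len(changed_urls(t1, t2)) / len(union) if union else 0.0


first_run = ["a", "b", "c"]
later_run = ["b", "c", "d"]
print(changed_urls(first_run, later_run), change_fraction(first_run, later_run))
```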
25
Results change by over 40% after one month in 8 of 9 engines
26
Engine results change faster than Web growth
estimate
27
Testing for result consistency

T1, T2, T3 results from an engine at three
different times
What percent of URLs
Appear in Top 10 at T1
Do not appear in Top 200 at T2
Appear in Top 10 at T3
In theory: zero
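
The three-snapshot test can be written as a short set computation. This is a sketch under one assumption: the percentage is taken relative to the Top 10 at T1, which the slides do not state explicitly. The function name is hypothetical.

```python
def temporarily_removed(t1: list[str], t2: list[str], t3: list[str]) -> float:
    """Share of T1's Top 10 URLs missing from T2's Top 200 but back in T3's Top 10."""
    top10_t1 = set(t1[:10])
    if not top10_t1:
        return 0.0
    top200_t2 = set(t2[:200])
    top10_t3 = set(t3[:10])
    flickered = (top10_t1 - top200_t2) & top10_t3
    return len(flickered) / len(top10_t1)


# Ranked result lists for one query on one engine at three times (made-up URLs).
print(temporarily_removed(["a", "b", "c", "d"], ["a", "b"], ["c", "a", "x"]))
```

A consistent engine would score zero here: a URL good enough for the Top 10 at T1 and T3 should not vanish from the Top 200 in between.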

28
34-49% of Top 10 URLs are temporarily removed!
29
60% of URLs in Top 200 differ using different query options
30
Long term search improvements

Relevant results may be temporarily unavailable
Each query is treated independently
Some limited query refinement available
Most users (77%) don't use it
URL ranking based on content
Popularity? Freshness?
If we can't fix it now, can we learn how to fix it?

31
Collaborative Index Enhancement

Helping users find relevant documents with input
from past queries
Log: record everything in all query sessions
Pages returned, pages viewed, result pages, etc.
Store information in databases / indices
Use information to help future queries
Enhance Web indices with access patterns

32
Microsoft IPO 1986
33
ipo microsoft
34
Implementation

Create URL statistics database (à la DirectHit)
Augments ranking of pages
Create new searchable indices
Include indices with HuskySearch queries
Adds new pages to results list
Augments pages returned from other sources
Ensures previously returned pages are returned
Evaluate using passive testing
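
A URL statistics database that augments ranking could look like the sketch below. It is an illustration only: the `UrlStats` class, its methods, and the choice to sort purely by global click count are assumptions, not the HuskySearch implementation.

```python
from collections import Counter
from typing import Iterable


class UrlStats:
    """In-memory stand-in for a URL statistics database (hypothetical design)."""

    def __init__(self) -> None:
        self.returned: Counter = Counter()
        self.clicked: Counter = Counter()

    def record_session(self, returned_urls: Iterable[str], clicked_urls: Iterable[str]) -> None:
        """Log what one query session returned and what the user followed."""
        self.returned.update(returned_urls)
        self.clicked.update(clicked_urls)

    def rerank(self, urls: list[str]) -> list[str]:
        """Move frequently clicked URLs up; ties keep their original order."""
        return sorted(urls, key=lambda url: -self.clicked[url])


stats = UrlStats()
stats.record_session(["a", "b", "c"], ["c"])
print(stats.rerank(["a", "b", "c"]))
```

Because Python's sort is stable, URLs without click history keep the order the underlying services gave them.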

35
Collaborative Databases

ReturnedURLs: Create an index of pages referenced in results
ClickedURLs: Create an index of referenced pages clicked on
This addresses the inconsistency of Web search services
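
A toy version of a ClickedURLs-style index is sketched below. It keys clicked pages by the terms of the query that led to them, which is an assumption made to keep the example short. The slides do not say which text was indexed, and the URL is a placeholder.

```python
from collections import defaultdict
from typing import Iterable


class ClickedIndex:
    """Tiny inverted index from query terms to clicked URLs (hypothetical design)."""

    def __init__(self) -> None:
        self._postings: dict[str, set[str]] = defaultdict(set)

    def add(self, query: str, clicked_urls: Iterable[str]) -> None:
        """Remember which URLs were followed for this query."""
        urls = set(clicked_urls)
        for term in query.lower().split():
            self._postings[term].update(urls)

    def search(self, query: str) -> set[str]:
        """Return URLs clicked in past sessions that match every query term."""
        terms = query.lower().split()
        if not terms:
            return set()
        return set.intersection(*(self._postings.get(t, set()) for t in terms))


index = ClickedIndex()
index.add("Microsoft IPO 1986", ["http://example.com/msft-ipo"])
print(index.search("ipo microsoft"))
```

A later, differently worded query can then recover a page that an earlier user found useful, even if the underlying services no longer return it.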

36
Indexed Results

Hypothesis: Snippets in HuskySearch results pages highlight relevant terms
ResultsPages: Create an index of result pages and search them
SuccessResPgs: Create an index of good result pages and search them
Note: this creates an implicit searchable query history

37
Evaluation of CIE

Do any of these auxiliaries improve performance?
Does performance improve over time?
Do the auxiliaries contribute useful additional information?
Does the re-ranking rank viewed documents higher?

38
Metrics

Log analysis
92,072 queries (12 weeks, Jun. 1 - Aug. 17 1998)
10,303,553 returned, 70,227 followed URLs
ViewRate = viewed docs / docs returned
Unique Contribution = unique URLs viewed / total viewed
Document Contribution = documents returned / total documents returned
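
The first two metrics can be computed as below. This is a sketch: the function names and the per-source input format are assumptions, and "unique" is read here as "viewed URLs that no other source returned".

```python
def view_rate(viewed: int, returned: int) -> float:
    """ViewRate = viewed docs / docs returned."""
    return viewed / returned if returned else 0.0


def unique_contribution(viewed_by_source: dict[str, set[str]], source: str) -> float:
    """Share of all viewed URLs that only `source` supplied."""
    others = set().union(
        *(urls for name, urls in viewed_by_source.items() if name != source)
    )
    all_viewed = others | viewed_by_source[source]
    if not all_viewed:
        return 0.0
    return len(viewed_by_source[source] - others) / len(all_viewed)


# Totals from the log analysis on this slide.
print(f"{view_rate(70_227, 10_303_553):.4%}")
```

With the totals above, the overall ViewRate comes out at roughly 0.68%, so a source is worth adding only if the few documents it contributes are viewed at a noticeably higher rate.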

39
ViewRate of CIE auxiliaries
40
ViewRate of ClickedURLs vs three search services
41
Additional followed URLs
42
Median and Average Height of Viewed URLs
43
After one week, half of the URLs viewed have been
viewed before
44
Towards Comprehensive Web Search