1
Towards Comprehensive and Consistent Web Search
  • Erik Selberg
  • University of Washington
  • April 22, 1999

2
The Ideal Web Information Service
  • Single universal Information Service
  • Comprehensive
  • Indexable Web, Invisible Web, Local Files,
    ...
  • Retrieves all relevant information
  • No irrelevant information
  • Results returned in under a second

3
Available Web Information Services
  • Spider-based Web search engines
  • AltaVista, Excite, Lycos, ...
  • Web Directories
  • Yahoo!, Hot100, ...
  • Local Web search engines
  • Search UW, Search Microsoft, ...
  • Online databases
  • IMDB, USWest Dex, ...

4
Combine What's Available, Move Closer to Ideal
  • Single text-based Information Service
  • Comprehensive
  • As much of the Indexable Web as possible
  • Consistent
  • What is relevant stays relevant
  • Increase relevant documents
  • Decrease irrelevant documents
  • Results returned in real time

5
Three Hypotheses
  • No available search service is comprehensive
  • No available search service will likely be
    comprehensive
  • Can be addressed by combining services
  • Most available search services are inconsistent
  • Can be addressed by auxiliary services
  • Quality and Speed are not sacrificed

6
History of MetaCrawler
  • MetaCrawler at UW from 1995 to 1996
  • MetaCrawler licensed to NetBot, Inc. in 1996
  • NetBot licensed MetaCrawler to Go2Net
  • Excite acquired NetBot, @Home acquired Excite
  • Authors returned to UW in Fall of 1996
  • HuskySearch: research MetaCrawler at UW
  • This talk: MetaCrawler 1995-96, HuskySearch

7
Hypothesis I: No available search service is
comprehensive
  • Comprehensive: all documents relevant to a given
    query are retrievable by that query
  • Obtain a significant number of queries
  • Submit them to each available service
  • Compare results
  • Obtaining queries is hard
  • Build a meta-engine, let real users use it!

8
MetaCrawler
9
Goals of MetaCrawler
  • Show no available search service is comprehensive
  • Improve quality of search results
  • Satisfy real time constraints

10
Evaluating Comprehensiveness
  • Analyzed MetaCrawler log files
  • Log results followed by users
  • Logs from Nov. 26 1995 - Dec 2 1995
  • 20,906 queries
  • Logs from Jan. 1 1999 - Mar 31 1999
  • 185,027 queries
  • Followed references imply information of interest
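
The log analysis above can be sketched roughly as follows. This is only an illustration: the log format (one set of engine names per followed reference), the function name `unique_share`, and the sample entries are assumptions, not MetaCrawler's actual log schema or data.

```python
from collections import Counter

def unique_share(followed):
    """Return, per engine, the fraction of followed references that
    only that engine returned.

    `followed` is a list of sets; each set names the engines that
    returned one followed reference (hypothetical log format).
    """
    total = len(followed)
    if total == 0:
        return {}
    unique = Counter()
    for engines in followed:
        if len(engines) == 1:
            # Exactly one engine returned this followed reference.
            unique[next(iter(engines))] += 1
    return {engine: count / total for engine, count in unique.items()}

# Made-up log entries with placeholder engine names, for illustration only.
log = [{"A"}, {"A", "B"}, {"B"}, {"C"}]
shares = unique_share(log)
```

If each engine has a large share of followed references that no other engine returned, the engines are disjoint in the sense used on the following slides.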

11
Search engines were disjoint in 1995
12
All engines returned information of interest in
1995
13
Search engines are still disjoint
14
All engines still return information of interest
15
Improving Search Result Quality
  • Combining results increases relevant results
  • Also increases irrelevant results!
  • Interleaving results produces better ranking
  • Use alternate forms of ranking
  • Clustering (Zamir), Site
  • Quality checking via Post-Processing
  • Existence
  • Relevance
  • Duplicate detection
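
One way to picture interleaving with duplicate detection is the sketch below. The slides do not say how MetaCrawler interleaves results, so round-robin merging is a stand-in, and exact string matching is a simplified form of duplicate detection. The function name `interleave` and the sample URLs are hypothetical.

```python
from itertools import zip_longest

def interleave(result_lists):
    """Round-robin merge of ranked URL lists, dropping exact duplicates."""
    seen = set()
    merged = []
    # zip_longest yields one "tier" per rank position, padding short
    # lists with None.
    for tier in zip_longest(*result_lists):
        for url in tier:
            if url is not None and url not in seen:
                seen.add(url)
                merged.append(url)
    return merged

# Placeholder result lists from three hypothetical services.
merged = interleave([["u1", "u2", "u3"], ["u2", "u4"], ["u5"]])
```

Each service's top hit appears before any service's second hit, and a URL returned by several services is kept only at its first position.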

16
Satisfying Real Time Constraints
  • Good parallel Web retrieval engine
  • Event-based model
  • Immediate feedback via Server Push
  • Tell user how far along we are
  • Interactive browsing via Java Client
  • Browse intermediate results
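
A minimal sketch of the parallel, event-based idea, using Python's `asyncio` as a stand-in for the original retrieval engine. The engine names, delays, and the `query_engine` helper are assumptions; real code would issue network requests instead of sleeping.

```python
import asyncio

async def query_engine(name, delay):
    """Stand-in for a network call to one search service."""
    await asyncio.sleep(delay)
    return name, [f"{name}-result-{i}" for i in range(2)]

async def metasearch(engines):
    """Query all engines concurrently and record progress as each finishes."""
    tasks = [asyncio.create_task(query_engine(n, d)) for n, d in engines.items()]
    results = {}
    progress = []
    for finished in asyncio.as_completed(tasks):
        name, hits = await finished
        results[name] = hits
        # Intermediate results are available here, before slower engines reply.
        progress.append(f"{len(results)}/{len(tasks)} engines done")
    return results, progress

results, progress = asyncio.run(metasearch({"A": 0.02, "B": 0.01}))
```

The `progress` list plays the role of the "how far along we are" feedback. A server-push or client UI would emit each entry as it is produced.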

17
Goals of MetaCrawler
  • Show no available search service is comprehensive
  • Improve quality of search results
  • Satisfy real time constraints

18
Hypothesis II: No available search service will
likely be comprehensive
  • Disjoint in 1995, 1999
  • But they're working on it!
  • How big is the Web?
  • How big are the indices?
  • How fast is the Web growing?
  • How fast are the indices growing?
  • If and when an index will cover the Web

19
Lawrence & Giles Web Estimate (Science, Apr. 98;
study done Dec. 1997)
  • Extended Web evaluation done by MetaCrawler
  • Estimate on size of Indexable Web
  • N = |AV| · P(x ∈ HB) / P(x ∈ AV ∩ HB)
  • 320M pages
  • 200M (Bharat & Broder)
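
The overlap estimate above can be worked through in a few lines. This sketch assumes the two indices are independent samples of the Web, and it estimates the probability ratio from a sample of pages drawn from one index. The function name and all numbers are hypothetical and are not the figures from the study.

```python
def estimate_web_size(index_size_a, sampled_from_b, also_in_a):
    """Overlap estimate: N = |A| * P(x in B) / P(x in A and B).

    The ratio P(x in B) / P(x in A and B) is estimated as
    sampled_from_b / also_in_a, i.e. the number of pages sampled from
    index B divided by how many of those are also found in index A.
    """
    if also_in_a == 0:
        raise ValueError("no overlap observed; estimate is undefined")
    return index_size_a * sampled_from_b / also_in_a

# Hypothetical numbers: index A holds 100 pages; of 50 pages sampled
# from index B, 20 are also in A.
n = estimate_web_size(100, 50, 20)
```

If the indices favor the same popular pages, the overlap is inflated and the estimate is likely to be too low.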

[Figure: Venn diagram of the Web (N) containing the AV and HB indices, which overlap in AV ∩ HB]
20
Search Engine Sizes (12/95 - 12/98)
(www.searchenginewatch.com)
[Chart legend: AltaVista, HotBot, NorthernLight, Excite, Lycos, InfoSeek, WebCrawler]
21
Relative size estimates, Jan. 1, 1999
22
Projected size estimates, Jan. 1, 2000
23
Hypothesis III: Most available search services
are inconsistent
  • Consistent: retrieving the same set of results by
    submitting the same query
  • Unless better results are available
  • 25 queries issued repeatedly over one month to 9
    major search services
  • Queries part of Lawrence & Giles study
  • Each query issued using 3 search options
  • default, phrase ("x y z"), AllPlus (+x +y +z)
  • Top 200 documents requested

24
Measuring Change
  • Results from a search engine treated as a set
  • Positional data was ignored!
  • Only compared results from the same query
    returned by same engine
  • Bi-directional set difference
  • (T1 - T2) ∪ (T2 - T1)
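
The bi-directional set difference maps directly onto set operations. In the sketch below, the normalization used in `percent_changed` (size of the difference over size of the union) is an assumption, since the slide does not state the denominator. The URLs are placeholders.

```python
def change(t1, t2):
    """Bi-directional set difference between two result sets."""
    return (t1 - t2) | (t2 - t1)

def percent_changed(t1, t2):
    """Difference size relative to the union (assumed normalization)."""
    union = t1 | t2
    if not union:
        return 0.0
    return 100 * len(change(t1, t2)) / len(union)

# Results for the same query from the same engine at two times.
t1 = {"a", "b", "c"}
t2 = {"b", "c", "d"}
diff = change(t1, t2)
pct = percent_changed(t1, t2)
```

Because the results are treated as sets, a URL that only moves from rank 1 to rank 200 counts as unchanged.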

[Figure: Venn diagram of result sets T1 and T2]
25
Results change by over 40% after one month in 8
of 9 engines
26
Engine results change faster than Web growth
estimate
27
Testing for result consistency
  • T1, T2, T3 results from an engine at three
    different times
  • What percent of URLs
  • Appear in Top 10 at T1
  • Do not appear in Top 200 at T2
  • Appear in Top 10 at T3
  • In theory: zero
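
The test above can be expressed as a small function. This sketch assumes the denominator is the number of URLs in the Top 10 at T1, which the slide does not spell out. The function name and sample URLs are hypothetical.

```python
def percent_temporarily_removed(top10_t1, top200_t2, top10_t3):
    """Percent of T1's Top 10 URLs that are absent from T2's Top 200
    but present again in T3's Top 10."""
    if not top10_t1:
        return 0.0
    at_t2 = set(top200_t2)
    at_t3 = set(top10_t3)
    removed = [u for u in top10_t1 if u not in at_t2 and u in at_t3]
    return 100 * len(removed) / len(top10_t1)

# Placeholder result lists: "c" disappears at T2 and comes back at T3.
pct = percent_temporarily_removed(
    ["a", "b", "c", "d"],   # Top 10 at T1 (shortened)
    ["a", "b"],             # Top 200 at T2 (shortened)
    ["a", "c", "e"],        # Top 10 at T3 (shortened)
)
```

A URL such as "d", which drops out and never returns, is not counted: it may simply have been displaced by better results.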

28
34-49% of Top 10 URLs are temporarily removed!
29
60% of URLs in Top 200 differ using different
query options
30
Long term search improvements
  • Relevant results may be temporarily unavailable
  • Each query is treated independently
  • Some limited query refinement available
  • Most users (77%) don't use it
  • URL ranking based on content
  • Popularity? Freshness?
  • If we can't fix it now, can we learn how to fix
    it?

31
Collaborative Index Enhancement
  • Helping users find relevant documents with input
    from past queries
  • Log: record everything in all query sessions
  • Pages returned, pages viewed, result pages, etc.
  • Store information in databases / indices
  • Use information to help future queries
  • Enhance Web indices with access patterns

32
Microsoft IPO 1986
33
ipo microsoft
34
Implementation
  • Create URL statistics database (à la DirectHit)
  • Augments ranking of pages
  • Create new searchable indices
  • Include indices with HuskySearch queries
  • Adds new pages to results list
  • Augments pages returned from other sources
  • Ensures previously returned pages are returned
  • Evaluate using passive testing

35
Collaborative Databases
  • ReturnedURLs: create an index of pages referenced
    in results
  • ClickedURLs: create an index of referenced pages
    clicked on
  • This addresses the inconsistency of Web search
    services
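
A toy version of the two collaborative databases is sketched below. The class name `ClickStats`, its methods, and the re-ranking rule (more-clicked URLs first) are illustrative assumptions, not HuskySearch's actual schema or ranking function.

```python
from collections import Counter

class ClickStats:
    """Toy URL statistics store (hypothetical; not HuskySearch's schema)."""

    def __init__(self):
        self.returned = Counter()  # plays the role of ReturnedURLs
        self.clicked = Counter()   # plays the role of ClickedURLs

    def log_session(self, returned_urls, clicked_urls):
        """Record one query session."""
        self.returned.update(returned_urls)
        self.clicked.update(clicked_urls)

    def rerank(self, urls):
        """Stable sort: more-clicked URLs first, original order breaks ties."""
        return sorted(urls, key=lambda u: -self.clicked[u])

stats = ClickStats()
stats.log_session(["u1", "u2", "u3"], ["u3"])
stats.log_session(["u2", "u3"], ["u3", "u2"])
ranked = stats.rerank(["u1", "u2", "u3"])
```

Because every returned URL is stored, a page that an engine temporarily drops can still be supplied from the `returned` table on a later query.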

36
Indexed Results
  • Hypothesis: snippets in HuskySearch results pages
    highlight relevant terms
  • ResultsPages: create an index of result pages and
    search them
  • SuccessResPgs: create an index of good result
    pages and search them
  • Note: this creates an implicit searchable query
    history

37
Evaluation of CIE
  • Do any of these auxiliaries improve performance?
  • Does performance improve over time?
  • Do the auxiliaries contribute useful additional
    information?
  • Does the re-ranking effect rank viewed documents
    higher?

38
Metrics
  • Log analysis
  • 92,072 queries (12 weeks, Jun. 1 - Aug. 17 1998)
  • 10,303,553 returned, 70,227 followed URLs
  • ViewRate
  • viewed docs / docs returned
  • Unique Contribution
  • unique URLs viewed / total viewed
  • Document Contribution
  • documents returned / total documents returned
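
The three metrics are simple ratios. The sketch below applies ViewRate to the overall log totals quoted on this slide, treating followed URLs as viewed docs, which is an assumption. The function names are hypothetical.

```python
def view_rate(viewed, returned):
    """ViewRate: viewed docs / docs returned."""
    return viewed / returned

def unique_contribution(unique_viewed, total_viewed):
    """Unique Contribution: unique URLs viewed / total viewed."""
    return unique_viewed / total_viewed

def document_contribution(returned_by_source, total_returned):
    """Document Contribution: documents returned by one source /
    total documents returned."""
    return returned_by_source / total_returned

# Overall ViewRate from the log totals on this slide:
# 70,227 followed URLs out of 10,303,553 returned.
overall = view_rate(70_227, 10_303_553)
```

On these totals, fewer than 1 in 100 returned URLs was followed, so per-source ViewRates are best read relative to each other.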

39
ViewRate of CIE auxiliaries
40
ViewRate of ClickedURLs vs three search services
41
Additional followed URLs
42
Median and Average Height of Viewed URLs
43
After one week, half of the URLs viewed have been
viewed before
44
Towards Comprehensive Web Search
  • No single search service is comprehensive
  • Addressed by combining services
  • Quality can be improved
  • Speed is not completely sacrificed
  • No search service will likely be comprehensive
  • Barring a great leap of technology and/or
    investment of capital
  • Next problem: effectively combining many services

45
Towards Consistent Web Search
  • Most available search services are inconsistent
  • 5/9 engines temporarily omit 34-49% of the Top 10
    URLs
  • 3/9 temporarily omit 4-12%
  • CIE auxiliaries address this problem
  • CIE auxiliaries also improve results
  • Contributing new documents
  • Improving document ranking

46
Related Work
  • Lots of meta-search
  • SavvySearch, Fusion, ProFusion, DogPile, ...
  • Distributed Search
  • Belkin, Fox (TREC), InfoSeek, Inktomi, ...
  • Query Routing
  • Gravano (STARTS), Leach (CIP), ...
  • Collaboration
  • FireFly (music), Amazon (books)
  • Ask Jeeves (www.ask.com)

47
Towards the Ideal Information Service
  • Still far from fully comprehensive
  • Intranets, Online DBs, Personal Files, ...
  • Huge Integration and Routing problems
  • More work with non-text
  • Audio, video, chemical, numbers, product IDs, ...
  • Improving result quality
  • Known item vs all available
  • Page semantics

48
Team MetaCrawler
  • Oren Etzioni, Oren Zamir, Zhenya Sigal, Melissa
    Johnson, Greg Lauckhart
  • Christin Boyd, Darren Schack
  • Erik Selberg
  • selberg@cs.washington.edu
  • http://huskysearch.cs.washington.edu
  • http://www.metacrawler.com