Metasearch: A View from the Gallery

1
Meta-search: A View from the Gallery
  • Erik Selberg
  • University of Washington
  • March 26, 1999

2
Outline
  • A (fast) history of Web searching
  • Metasearch with MetaCrawler
  • Evaluation of MetaCrawler
  • Evaluation of Web Search with MetaCrawler
  • Conclusions and Futurama

3
1994: Finding Information on the Web
  • SEW (Paul Barton-Davis)
  • E-mail URLs
  • Automatically added to personal What's New page
  • http://www/homes/speed/incoming/default.html
  • NCSA's What's New Page
  • WebCrawler (Brian Pinkerton)
  • Spider: automatically retrieve and follow URLs
  • Glimpse for text retrieval

4
Spring 1995: Search begins to scale
  • Lots of search engines / directories
  • WebCrawler, Lycos, InfoSeek, Yahoo
  • Galaxy, Open Text, Magellan
  • Lots and lots of kinks
  • Service downtime
  • Nonstandard syntax and features
  • Index Coherency
  • Error 404 File Not Found
  • False positives

5
Addressing problems of search: A Web Search Agent
User says What, Agent determines How and Where
  • Integrate Multiple Search Resources
  • Obtain Precise and Relevant Information
  • Satisfy Real Time Constraints
  • Build Customizable Interfaces

6
MetaCrawler
7
Search Improvements using MetaCrawler
  • Fresher data
  • As long as one service has a recent copy
  • Better coverage
  • As long as one service has a reference
  • Post-processing
  • Existence
  • Quality-check documents
  • Filter based on URL
  • Smart, but not too smart
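
The coverage and URL-filtering points above can be illustrated with a minimal Python sketch. This is not MetaCrawler's actual code: the function names `normalize` and `merge_results`, the normalization rules, and the engine names and URLs in the usage note are all assumptions made for illustration. The existence check (fetching each page) is left out.

```python
from urllib.parse import urlsplit


def normalize(url):
    """Crude URL normalization (an assumption, not MetaCrawler's rule):
    lowercase scheme and host, drop the fragment and any trailing slash."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/")
    query = f"?{parts.query}" if parts.query else ""
    return f"{parts.scheme.lower()}://{parts.netloc.lower()}{path}{query}"


def merge_results(results_by_engine, blocked_hosts=()):
    """Union the results of several engines.

    results_by_engine maps a (hypothetical) engine name to its list of URLs.
    Returns a dict from normalized URL to the set of engines that returned it,
    so a page is covered as long as one service has a reference to it.
    """
    merged = {}
    for engine, urls in results_by_engine.items():
        for url in urls:
            key = normalize(url)
            if urlsplit(key).netloc in blocked_hosts:
                continue  # "filter based on URL"
            merged.setdefault(key, set()).add(engine)
    return merged
```

Calling `merge_results({"engine_a": [...], "engine_b": [...]}, blocked_hosts={"spam.test"})` collapses duplicate references into one entry and records which engines agreed on it.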

8
Evaluation of MetaCrawler
  • 50-100M people search the Web daily
  • How well does MetaCrawler work?
  • How well do search engines work?
  • Where will improvements to MetaCrawler be needed
    in the future?

9
MetaCrawler allows massive, passive user studies
  • Traditional IR:
  • Small test corpus (50-100K docs)
  • Well formed docs
  • Small set of questions (50-300)
  • Relevance judgements for each question
  • Metrics: Precision and Recall
  • Evaluates the engine
  • Web IR:
  • Huge test corpus (350M docs)
  • Poorly formed docs
  • Huge set of questions (20K-10M)
  • No relevance judgements
  • Metrics: user clicks
  • Evaluates the system
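
For reference, the two traditional IR metrics named above have standard definitions: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that were retrieved. A minimal Python sketch follows; the function name and the document IDs in the usage note are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one question.

    retrieved: documents the engine returned.
    relevant:  documents judged relevant for the question.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

With `retrieved = ["d1", "d2", "d3", "d4"]` and `relevant = ["d2", "d4", "d5"]`, two of the four retrieved documents are relevant (precision 0.5) and two of the three relevant documents were found (recall 2/3). Both numbers need relevance judgements, which is why the Web setting falls back on user clicks.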

10
Evaluating MetaCrawler, Dec. 1995
  • Analyzed MetaCrawler log files
  • Log results followed by users
  • Logs from Nov. 26, 1995 to Dec. 2, 1995
  • 20,906 queries
  • Followed references imply information of interest
  • Similar findings by Lawrence & Giles, Dec. 1997

11
Search engines are disjoint (1995)
12
All engines return information of interest (1995)
13
Low percentage of URLs viewed (1995)
14
Summary of 1995 Results
  • Search engine results are disjoint
  • All engines return information of value
  • MetaCrawler addressed this by incorporating multiple
    engines
  • Search engines return lots of useless URLs
  • MetaCrawler addressed this by post-processing results
  • What doesn't MetaCrawler fix?

15
Evaluation of Search Engines (1999)
  • 25 queries issued repeatedly over one month to 9
    major search services
  • Time between submissions grew exponentially
  • Queries were part of the Lawrence & Giles study
  • Each query issued using 3 search options
  • default, phrase ("x y z"), AllPlus (+x +y +z)
  • Top 200 documents requested
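
A submission schedule whose gaps grow exponentially can be sketched as below. The slide does not give the first gap or the growth factor, so the defaults here (1 hour, doubling, a 30-day horizon) are purely hypothetical, as is the function name.

```python
def submission_times(first_gap_hours=1.0, factor=2.0, horizon_hours=30 * 24):
    """Return submission times in hours from the start of the study.

    Each gap between consecutive submissions is `factor` times the previous
    one. All three default values are illustrative assumptions.
    """
    times = [0.0]
    gap = first_gap_hours
    while times[-1] + gap <= horizon_hours:
        times.append(times[-1] + gap)
        gap *= factor
    return times
```

A schedule like this samples short-term churn densely at the start while still covering the full month with few submissions per query.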

16
Measuring Change
  • Results from a search engine treated as a set
  • Positional data was ignored!
  • Only compared results from the same query
    returned by same engine
  • Bi-directional set difference
  • (T1 - T2) ∪ (T2 - T1)

[Figure: Venn diagram of result sets T1 and T2]
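
The bi-directional set difference above can be sketched in a few lines of Python. The slide does not say how the difference is turned into a percentage, so the normalization in `change_fraction` (difference size over union size) is an assumption, and both function names are illustrative.

```python
def bidirectional_difference(t1, t2):
    """(T1 - T2) ∪ (T2 - T1): URLs present in exactly one of the two result sets.

    Results are treated as sets, so ranking position is ignored.
    """
    t1, t2 = set(t1), set(t2)
    return (t1 - t2) | (t2 - t1)


def change_fraction(t1, t2):
    """Assumed normalization: size of the difference over size of the union."""
    union = set(t1) | set(t2)
    if not union:
        return 0.0
    return len(bidirectional_difference(t1, t2)) / len(union)
```

For `t1 = ["a", "b", "c"]` and `t2 = ["b", "c", "d"]` the difference is `{"a", "d"}`, which is half of the four distinct URLs. Both result lists should come from the same query on the same engine.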
17
Results change by over 40% after one month in 8 of 9 engines (1999)
18
Engine results change faster than Web growth estimate (1999)
19
34-49% of Top 10 URLs are temporarily removed! (1999)
20
60% of URLs in Top 200 differ using different query options (1999)
21
Lawrence & Giles Web Estimate (Science, Apr. 98; study done Dec. 1997)
  • Estimate on size of Indexable Web
  • P(x ∈ AV) = AV / N
  • P(x ∈ AV) P(x ∈ HB) = P(x ∈ AV ∩ HB)
  • N = AV / (P(x ∈ AV ∩ HB) / P(x ∈ HB))
  • 320M pages (bigger than previous estimates!)

[Figure: Venn diagram of AV, HB, and AV ∩ HB inside the indexable Web N]
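
The overlap estimate above can be worked through with a short Python sketch. It assumes, as the second bullet does, that membership in AV and HB is independent, so the fraction of a sample of HB pages that AV also indexes estimates P(x ∈ AV). The function names and the numbers in the usage note are invented for illustration and are not the study's data.

```python
def overlap_fraction(sample_from_hb, indexed_by_av):
    """Estimate P(x ∈ AV ∩ HB) / P(x ∈ HB) from a sample of HB's pages."""
    sample = list(sample_from_hb)
    if not sample:
        raise ValueError("sample_from_hb must not be empty")
    return sum(1 for url in sample if url in indexed_by_av) / len(sample)


def estimate_indexable_web(size_av, fraction_of_hb_in_av):
    """N = AV / (P(x ∈ AV ∩ HB) / P(x ∈ HB))."""
    if not 0 < fraction_of_hb_in_av <= 1:
        raise ValueError("fraction must be in (0, 1]")
    return size_av / fraction_of_hb_in_av
```

With a made-up AV size of 100 pages and a sample in which one HB page in four is also in AV, `estimate_indexable_web(100, 0.25)` gives 400 pages. The estimate is only as good as the independence assumption; if the engines favor the same pages, the overlap is inflated and N is underestimated.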
22
Search Engine Sizes (12/95 - 12/98) (www.searchenginewatch.com)
[Chart: index size over time for AltaVista, HotBot, NorthernLight, Excite, Lycos, InfoSeek, WebCrawler]
23
Relative size estimates, Jan. 1, 1999
24
Conclusions and Futurama
  • Search engine results are disjoint
  • MetaCrawler fixed this
  • Search engine results aren't consistent
  • MetaCrawler hasn't fixed this (yet)
  • Search engines are covering less and less of the
    Web
  • Huge data integration problems on the horizon

25
Futurama (or, what's next in meta-land)
  • MetaCrawler '95: integrate 6-10 engines
  • MetaCrawler '99: integrate 60-100 engines
  • Query Routing - figuring out proper engines based
    on keyword
  • Improved interface - getting more information
    from the user (Search Channels)
  • More post-processing of results
  • Scalable business models

26
Ta-da!
  • User still says What, Agent still determines
    How and Where
  • There's just a whole lot more!
  • Erik Selberg
  • selberg_at_cs.washington.edu
  • http://huskysearch.cs.washington.edu
  • http://www.metacrawler.com