Title: Metasearch: A View from the Gallery
1. Meta-search: A View from the Gallery
- Erik Selberg
- University of Washington
- March 26, 1999
2. Outline
- A (fast) history of Web searching
- Metasearch with MetaCrawler
- Evaluation of MetaCrawler
- Evaluation of Web Search with MetaCrawler
- Conclusions and Futurama
3. 1994: Finding Information on the Web
- SEW (Paul Barton-Davis)
- E-mail URLs
- Automatically added to personal What's New page
- http://www/homes/speed/incoming/default.html
- NCSA's What's New Page
- WebCrawler (Brian Pinkerton)
- Spider: automatically retrieves and follows URLs
- Glimpse for text retrieval
4. Spring 1995: Search begins to scale
- Lots of search engines / directories
- WebCrawler, Lycos, InfoSeek, Yahoo
- Galaxy, Open Text, Magellan
- Lots and lots of kinks
- Service downtime
- Nonstandard syntax and features
- Index Coherency
- Error 404 File Not Found
- False positives
5. Addressing problems of search: A Web Search Agent
User says What, Agent determines How and Where
- Integrate Multiple Search Resources
- Obtain Precise and Relevant Information
- Satisfy Real Time Constraints
- Build Customizable Interfaces
6. MetaCrawler
7. Search Improvements using MetaCrawler
- Fresher data
- As long as one service has a recent copy
- Better coverage
- As long as one service has a reference
- Post-processing
- Existence
- Quality-check documents
- Filter based on URL
- Smart, but not too smart
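The coverage and post-processing ideas above can be sketched in a few lines. This is a minimal illustration, not MetaCrawler's actual code: the function names `normalize` and `merge_results`, the crude URL normalization, and ranking by the number of engines that returned a URL are all assumptions made for the example. The existence check (fetching each page) is left out so the sketch runs offline.

```python
from urllib.parse import urlsplit


def normalize(url):
    """Crude URL key (an assumption): lowercase scheme and host, drop
    the fragment and any trailing slash so near-identical URLs collate."""
    parts = urlsplit(url.strip())
    key = f"{parts.scheme.lower()}://{parts.netloc.lower()}{parts.path.rstrip('/')}"
    if parts.query:
        key += "?" + parts.query
    return key


def merge_results(results_by_engine, url_filter=None):
    """Collate results from several engines into one list.

    results_by_engine maps an engine name to the list of URLs it returned.
    url_filter is an optional predicate on the normalized URL
    (the "filter based on URL" post-processing step).
    Returns (url, engines) pairs, most widely returned first.
    """
    merged = {}
    for engine, urls in results_by_engine.items():
        for url in urls:
            key = normalize(url)
            if url_filter is not None and not url_filter(key):
                continue
            merged.setdefault(key, set()).add(engine)
    # Hypothetical ranking: more engines first, ties broken alphabetically.
    return sorted(merged.items(), key=lambda kv: (-len(kv[1]), kv[0]))
```

A URL appears in the merged list as long as at least one engine returned it, which is the "better coverage" point; the filter shows where post-processing hooks in.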
8. Evaluation of MetaCrawler
- 50-100M people search the Web daily
- How well does MetaCrawler work?
- How well do search engines work?
- Where will improvements to MetaCrawler be needed in the future?
9. MetaCrawler allows massive, passive user studies
- Traditional IR
- Small test corpus (50-100K docs)
- Well-formed docs
- Small set of questions (50-300)
- Relevance judgements for each question
- Metrics: precision and recall
- Evaluates the engine
- Web IR
- Huge test corpus (350M docs)
- Poorly formed docs
- Huge set of questions (20K-10M)
- No relevance judgements
- Metrics: user clicks
- Evaluates the system
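To make the contrast between the two kinds of metric concrete, here is a small sketch. Precision and recall follow their standard definitions; `fraction_followed` is a hypothetical name for one simple click-based measure (the share of returned URLs that a user followed) and is an assumption, not necessarily the measure used in the study.

```python
def precision_recall(retrieved, relevant):
    """Standard IR metrics; they need relevance judgements for the query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def fraction_followed(returned, clicked):
    """Click-based proxy (an assumption): share of returned URLs the
    user followed. It needs only log data, not relevance judgements."""
    returned = set(returned)
    if not returned:
        return 0.0
    return len(returned & set(clicked)) / len(returned)
```

The first function cannot be computed on the Web without judged queries; the second can be computed passively from logs, which is what makes large-scale studies possible.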
10. Evaluating MetaCrawler, Dec. 1995
- Analyzed MetaCrawler log files
- Log results followed by users
- Logs from Nov. 26 1995 - Dec 2 1995
- 20,906 queries
- Followed references imply information of interest
- Similar findings by Lawrence and Giles, Dec. 1997
11. Search engines are disjoint (1995)
12. All engines return information of interest (1995)
13. Low percentage of URLs viewed (1995)
14. Summary of 1995 Results
- Search engine results are disjoint
- All engines return information of value
- MetaCrawler addressed this by incorporating multiple engines
- Search engines return lots of useless URLs
- MetaCrawler addressed this by post-processing results
- What doesn't MetaCrawler fix?
15. Evaluation of Search Engines (1999)
- 25 queries issued repeatedly over one month to 9 major search services
- Time between submissions grew exponentially
- Queries part of Lawrence and Giles study
- Each query issued using 3 search options
- default, phrase ("x y z"), AllPlus (+x +y +z)
- Top 200 documents requested
16. Measuring Change
- Results from a search engine treated as a set
- Positional data was ignored!
- Only compared results from the same query returned by same engine
- Bi-directional set difference
- $(T_1 - T_2) \cup (T_2 - T_1)$
[Figure: Venn diagram of result sets T1 and T2]
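The bi-directional set difference can be computed directly on two result lists. The sketch below treats each list as a set, as the slide describes, so position is ignored. Reporting the change as a fraction of the union of the two sets is an assumption made here for illustration; the slides do not say how the percentage was normalized, and `result_change` is a hypothetical name.

```python
def result_change(t1, t2):
    """Compare two result lists for the same query on the same engine.

    Returns the URLs present in exactly one of the two lists, and that
    count as a fraction of all URLs seen (normalization is an assumption).
    """
    s1, s2 = set(t1), set(t2)
    changed = (s1 - s2) | (s2 - s1)  # same as s1 ^ s2
    union = s1 | s2
    fraction = len(changed) / len(union) if union else 0.0
    return changed, fraction
```

For example, `result_change(["a", "b", "c"], ["b", "c", "d"])` reports `a` and `d` as changed, which is half of the four distinct URLs seen.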
17. Results change by over 40% after one month in 8 of 9 engines (1999)
18. Engine results change faster than Web growth estimate (1999)
19. 34-49% of Top 10 URLs are temporarily removed! (1999)
20. 60% of URLs in Top 200 differ using different query options (1999)
21. Lawrence and Giles Web Estimate (Science, Apr. 98; study done Dec. 1997)
- Estimate on size of Indexable Web
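- $P(x \in AV) = AV / N$
- $P(x \in AV) \, P(x \in HB) = P(x \in AV \cap HB)$
- $N = AV / (P(x \in AV \cap HB) / P(x \in HB))$
- 320M pages (bigger than previous estimates!)
[Figure: Venn diagram of the indexable Web N containing overlapping sets AV and HB, with intersection AV ∩ HB]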
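The estimate above can be sketched as a short overlap calculation. The ratio P(x in both) / P(x in HB) is approximated by sampling pages known to be in one engine and checking what share of them the other engine also indexes. The function name, its parameters, and the numbers in the usage note are hypothetical and chosen only for illustration; they are not data from the study.

```python
def estimate_web_size(size_a, sample_from_b, in_a):
    """Overlap-based size estimate, assuming the two engines index
    pages independently of each other.

    size_a        -- known index size of engine A (AV in the slide)
    sample_from_b -- pages sampled from engine B's index (HB)
    in_a          -- predicate: is this page also indexed by A?
    """
    if not sample_from_b:
        raise ValueError("need a non-empty sample from engine B")
    overlap = sum(1 for page in sample_from_b if in_a(page))
    if overlap == 0:
        raise ValueError("no overlap observed; estimate is unbounded")
    # P(x in A), estimated as P(x in A and B) / P(x in B)
    p_in_a = overlap / len(sample_from_b)
    return size_a / p_in_a
```

With toy numbers: if engine A indexes 100 pages and 4 of 10 pages sampled from B are also in A, the estimate is 100 / 0.4 = 250 pages. The independence assumption matters; if engines favor the same popular pages, the overlap is inflated and the estimate comes out too low.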
22. Search Engine Sizes (12/95 - 12/98) (www.searchenginewatch.com)
[Chart: index size over time for AltaVista, HotBot, NorthernLight, Excite, Lycos, InfoSeek, WebCrawler]
23. Relative size estimates, Jan. 1, 1999
24. Conclusions and Futurama
- Search engine results are disjoint
- MetaCrawler fixed this
- Search engine results aren't consistent
- MetaCrawler hasn't fixed this (yet)
- Search engines are covering less and less of the Web
- Huge data integration problems on the horizon
25. Futurama (or, what's next in meta-land)
- MetaCrawler '95: integrate 6-10 engines
- MetaCrawler '99: integrate 60-100 engines
- Query Routing - figuring out proper engines based on keyword
- Improved interface - getting more information from the user (Search Channels)
- More post-processing of results
- Scalable business models
26. Ta-da!
- User still says What, Agent still determines How and Where
- There's just a whole lot more!
- Erik Selberg
- selberg_at_cs.washington.edu
- http://huskysearch.cs.washington.edu
- http://www.metacrawler.com