Title: Towards Comprehensive and Consistent Web Search
1Towards Comprehensive and Consistent Web Search
- Erik Selberg
- University of Washington
- April 22, 1999
2The Ideal Web Information Service
- Single universal Information Service
- Comprehensive
- Indexable Web, Invisible Web, Local Files, ...
- Retrieves all relevant information
- No irrelevant information
- Results returned in under a second
3Available Web Information Services
- Spider-based Web search engines
- AltaVista, Excite, Lycos, ...
- Web Directories
- Yahoo!, Hot100, ...
- Local Web search engines
- Search UW, Search Microsoft, ...
- Online databases
- IMDB, USWest Dex, ...
4Combine what's Available, Move closer to Ideal
- Single text-based Information Service
- Comprehensive
- As much of the Indexable Web as possible
- Consistent
- What is relevant stays relevant
- Increase relevant documents
- Decrease irrelevant documents
- Results returned in real time
5Three Hypotheses
- No available search service is comprehensive
- No available search service will likely be comprehensive
- Can be addressed by combining services
- Most available search services are inconsistent
- Can be addressed by auxiliary services
- Quality and Speed are not sacrificed
6History of MetaCrawler
- MetaCrawler at UW from 1995 - 1996
- MetaCrawler licensed to NetBot, Inc. in 1996
- NetBot licensed MetaCrawler to Go2Net
- Excite acquired NetBot, @Home acquired Excite
- Authors returned to UW in Fall of 1996
- HuskySearch: research MetaCrawler at UW
- This talk: MetaCrawler 1995-96, HuskySearch
7Hypothesis I: No available search service is comprehensive
- Comprehensive: All documents relevant to a given query are retrievable by that query
- Obtain a significant number of queries
- Submit them to each available service
- Compare results
- Obtaining queries is hard
- Build a meta-engine, let real users use it!
8MetaCrawler
9Goals of MetaCrawler
- Show no available search service is comprehensive
- Improve quality of search results
- Satisfy real time constraints
10Evaluating Comprehensiveness
- Analyzed MetaCrawler log files
- Log results followed by users
- Logs from Nov. 26 1995 - Dec 2 1995
- 20,906 queries
- Logs from Jan. 1 1999 - Mar 31 1999
- 185,027 queries
- Followed references imply information of interest
11Search engines were disjoint in 1995
12All engines returned information of interest in 1995
13Search engines are still disjoint
14All engines still return information of interest
15Improving Search Result Quality
- Combining results increases relevant results
- Also increases irrelevant results!
- Interleaving results produces better ranking
- Use alternate forms of ranking
- Clustering Zamir, Site
- Quality checking via Post-Processing
- Existence
- Relevance
- Duplicate detection
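
The interleaving and duplicate-detection bullets above can be illustrated with a minimal sketch. The talk does not say how MetaCrawler merges rankings or decides that two URLs are the same page, so the round-robin merge and the `normalize` rule below are assumptions made purely for illustration.

```python
from itertools import zip_longest
from urllib.parse import urlsplit


def normalize(url: str) -> tuple[str, str, str]:
    """Crude canonical key for duplicate detection (an assumed rule, not the talk's)."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()          # host names are case-insensitive
    path = parts.path.rstrip("/") or "/"  # treat "/x/" and "/x" as the same page
    return host, path, parts.query


def interleave(ranked_lists: list[list[str]]) -> list[str]:
    """Round-robin merge of per-engine rankings, keeping the first copy of each URL."""
    seen: set[tuple[str, str, str]] = set()
    merged: list[str] = []
    # zip_longest yields rank 1 from every engine, then rank 2, and so on.
    for tier in zip_longest(*ranked_lists):
        for url in tier:
            if url is None:  # this engine has run out of results
                continue
            key = normalize(url)
            if key not in seen:
                seen.add(key)
                merged.append(url)
    return merged
```

Existence and relevance checks would need to fetch each page, so they are left out of this sketch.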
16Satisfying Real Time Constraints
- Good parallel Web retrieval engine
- Event-based model
- Immediate feedback via Server Push
- Tell user how far along we are
- Interactive browsing via Java Client
- Browse intermediate results
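
A rough sketch of the parallel, event-driven retrieval idea with progress feedback, written with Python's `asyncio`. The engine names, the simulated delays, and the `query_engine` helper are hypothetical stand-ins for real network calls; nothing here reflects MetaCrawler's actual implementation.

```python
import asyncio


async def query_engine(name: str, query: str, delay: float) -> tuple[str, list[str]]:
    """Hypothetical engine call; sleeping stands in for the network round trip."""
    await asyncio.sleep(delay)
    return name, [f"http://{name}.example/{query}/{i}" for i in range(3)]


async def metasearch(query: str, engines: dict[str, float], timeout: float) -> dict[str, list[str]]:
    """Query all engines concurrently and report progress as each one answers."""
    tasks = [asyncio.create_task(query_engine(n, query, d)) for n, d in engines.items()]
    results: dict[str, list[str]] = {}
    try:
        # as_completed hands back results in the order engines finish.
        for finished in asyncio.as_completed(tasks, timeout=timeout):
            name, urls = await finished
            results[name] = urls
            print(f"{len(results)}/{len(tasks)} engines answered")
    except asyncio.TimeoutError:
        pass  # real-time constraint: return whatever arrived in time
    finally:
        for task in tasks:
            task.cancel()  # no effect on tasks that already finished
    return results
```

Calling `asyncio.run(metasearch("ipo", {"fast": 0.01, "slow": 5.0}, timeout=0.5))` returns only the results from the engine that answered within the time limit.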
17Goals of MetaCrawler
- Show no available search service is comprehensive
- Improve quality of search results
- Satisfy real time constraints
18Hypothesis II: No available search service will likely be comprehensive
- Disjoint in 1995, 1999
- But they're working on it!
- How big is the Web?
- How big are the indices?
- How fast is the Web growing?
- How fast are the indices growing?
- If and when an index will cover the Web
19Lawrence & Giles Web Estimate (Science, Apr. 98; study done Dec. 1997)
- Extended Web evaluation done by MetaCrawler
- Estimate on size of Indexable Web
- N = |AV| · P(x ∈ HB) / P(x ∈ AV ∩ HB)
- 320M pages
- 200M (Bharat & Broder)
[Figure: diagram with regions labeled N, AV, AV ∩ HB, HB]
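
The overlap estimate on this slide can be worked through in a few lines. This is a sketch under stated assumptions: the reported size of one index (AV) is known, a sample of pages is drawn, and we count how many sampled pages are in the second index (HB) and how many are in both. The function name and the numbers in the usage note are illustrative only and are not the study's data.

```python
def estimate_indexable_web(av_size: float, n_in_hb: int, n_in_both: int) -> float:
    """Estimate N, the size of the Indexable Web, from the overlap of two indices.

    With a sample of n pages, P(x in HB) is n_in_hb / n and P(x in AV and HB)
    is n_in_both / n, so the sample size n cancels in the ratio:
        N = |AV| * P(x in HB) / P(x in AV and HB) = |AV| * n_in_hb / n_in_both
    """
    if n_in_both <= 0:
        raise ValueError("no overlap observed; the estimate is undefined")
    if n_in_both > n_in_hb:
        raise ValueError("pages in both indices cannot exceed pages in HB")
    return av_size * n_in_hb / n_in_both
```

For example, with a made-up index of 1,000 pages where 80 sampled pages are in HB and 20 of those are also in AV, `estimate_indexable_web(1000, 80, 20)` gives 4000.0: AV covers a quarter of HB's pages, so it is assumed to cover a quarter of the whole.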
20Search Engine Sizes (12/95 - 12/98) (www.searchenginewatch.com)
[Chart: index sizes for AltaVista, HotBot, NorthernLight, Excite, Lycos, InfoSeek, WebCrawler]
21Relative size estimates, Jan. 1, 1999
22Projected size estimates, Jan. 1, 2000
23Hypothesis III: Most available search services are inconsistent
- Consistent: Retrieving the same set of results by submitting the same query
- Unless better results are available
- 25 queries issued repeatedly over one month to 9 major search services
- Queries part of Lawrence & Giles study
- Each query issued using 3 search options
- default, phrase ("x y z"), AllPlus (+x +y +z)
- Top 200 documents requested
24Measuring Change
- Results from a search engine treated as a set
- Positional data was ignored!
- Only compared results from the same query returned by same engine
- Bi-directional set difference
- (T1 - T2) ∪ (T2 - T1)
[Figure: diagram with sets labeled T1 and T2]
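
The bi-directional set difference is straightforward to express with Python sets. The slide does not state what the change is divided by to obtain a percentage, so normalizing by the size of the union of T1 and T2 is an assumption of this sketch.

```python
def changed_urls(t1: set[str], t2: set[str]) -> set[str]:
    """URLs present in exactly one of the two result sets: (T1 - T2) | (T2 - T1)."""
    return (t1 - t2) | (t2 - t1)


def percent_changed(t1: set[str], t2: set[str]) -> float:
    """Share of all URLs seen at either time that changed (assumed normalization)."""
    union = t1 | t2
    if not union:
        return 0.0
    return 100.0 * len(changed_urls(t1, t2)) / len(union)
```

As on the slide, rank positions are ignored: two result lists with the same URLs in a different order count as unchanged.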
25Results change by over 40% after one month in 8 of 9 engines
26Engine results change faster than Web growth estimate
27Testing for result consistency
- T1, T2, T3: results from an engine at three different times
- What percent of URLs:
- Appear in Top 10 at T1
- Do not appear in Top 200 at T2
- Appear in Top 10 at T3
- In theory: Zero
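
The three-snapshot test above reduces to a set expression. In this sketch the percentage is taken relative to the Top 10 at T1, which is an assumption; the slide does not name the denominator.

```python
def temporarily_removed(top10_t1: list[str], top200_t2: list[str], top10_t3: list[str]) -> set[str]:
    """URLs in the Top 10 at T1, missing from the Top 200 at T2, back in the Top 10 at T3."""
    return (set(top10_t1) - set(top200_t2)) & set(top10_t3)


def percent_temporarily_removed(top10_t1: list[str], top200_t2: list[str], top10_t3: list[str]) -> float:
    """Percentage of the T1 Top 10 that vanished at T2 and reappeared at T3."""
    base = set(top10_t1)
    if not base:
        return 0.0
    removed = temporarily_removed(top10_t1, top200_t2, top10_t3)
    return 100.0 * len(removed) / len(base)
```

A consistent engine would yield an empty set here for every query.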
28 34-49% of Top 10 URLs are temporarily removed!
29 60% of URLs in Top 200 differ using different query options
30Long term search improvements
- Relevant results may be temporarily unavailable
- Each query is treated independently
- Some limited query refinement available
- Most users (77%) don't use it
- URL ranking based on content
- Popularity? Freshness?
- If we can't fix it now, can we learn how to fix it?
31Collaborative Index Enhancement
- Helping users find relevant documents with input from past queries
- Log: record everything in all query sessions
- Pages returned, pages viewed, result pages, etc.
- Store information in databases / indices
- Use information to help future queries
- Enhance Web indices with access patterns
32Microsoft IPO 1986
33ipo microsoft
34Implementation
- Create URL statistics database (ala DirectHit)
- Augments ranking of pages
- Create new searchable indices
- Include indices with HuskySearch queries
- Adds new pages to results list
- Augments pages returned from other sources
- Ensures previously returned pages are returned
- Evaluate using passive testing
35Collaborative Databases
- ReturnedURLs: Create an index of pages referenced in results
- ClickedURLs: Create an index of referenced pages clicked on
- This addresses the inconsistency of Web search services
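
One way to picture the ReturnedURLs and ClickedURLs auxiliaries is as small searchable indices built from the query log. The sketch below is a deliberate simplification and an assumption: it keys URLs by the terms of the queries that surfaced them, whereas the slide describes indexing the pages themselves. The `UrlIndex` class and its methods are hypothetical names.

```python
from collections import Counter, defaultdict


class UrlIndex:
    """Toy in-memory index mapping query terms to the URLs logged for them."""

    def __init__(self) -> None:
        self._by_term: defaultdict[str, Counter] = defaultdict(Counter)

    def record(self, query: str, urls: list[str]) -> None:
        """Log that these URLs were returned (or clicked) for this query."""
        for term in query.lower().split():
            self._by_term[term].update(urls)

    def search(self, query: str, limit: int = 10) -> list[str]:
        """Return previously logged URLs, most frequently seen first."""
        scores: Counter = Counter()
        for term in query.lower().split():
            # .get avoids creating empty entries for unseen terms
            scores.update(self._by_term.get(term, {}))
        return [url for url, _ in scores.most_common(limit)]


# Two separate instances play the roles of the two auxiliaries.
returned_urls = UrlIndex()
clicked_urls = UrlIndex()
```

Because a logged URL stays in the index, a page that a search service drops temporarily can still be returned for a later, similar query.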
36Indexed Results
- Hypothesis: Snippets in HuskySearch results pages highlight relevant terms
- ResultsPages: Create an index of result pages and search them
- SuccessResPgs: Create an index of good result pages and search them
- Note: this creates an implicit searchable query history
37Evaluation of CIE
- Do any of these auxiliaries improve performance?
- Does performance improve over time?
- Which auxiliaries contribute useful additional information?
- Does the re-ranking effect rank viewed documents higher?
38Metrics
- Log analysis
- 92,072 queries (12 weeks, Jun. 1 - Aug. 17 1998)
- 10,303,553 returned, 70,227 followed URLs
- ViewRate
- viewed docs / docs returned
- Unique Contribution
- unique URLs viewed / total viewed
- Document Contribution
- documents returned / total documents returned
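
The three metrics are simple ratios, sketched here. The reading of "unique URLs viewed" as URLs viewed from one source and from no other source is an assumption, as are the function and parameter names.

```python
def view_rate(viewed_docs: int, returned_docs: int) -> float:
    """ViewRate: viewed docs / docs returned."""
    return viewed_docs / returned_docs if returned_docs else 0.0


def unique_contribution(viewed_by_source: dict[str, set[str]], source: str) -> float:
    """Unique Contribution: URLs viewed only via this source / all URLs viewed."""
    all_viewed = set().union(*viewed_by_source.values())
    if not all_viewed:
        return 0.0
    others = set().union(
        *(urls for name, urls in viewed_by_source.items() if name != source)
    )
    return len(viewed_by_source[source] - others) / len(all_viewed)


def document_contribution(returned_by_source: int, total_returned: int) -> float:
    """Document Contribution: documents returned by a source / total documents returned."""
    return returned_by_source / total_returned if total_returned else 0.0
```

Applied to the log totals above, `view_rate(70227, 10303553)` comes to roughly 0.0068, that is, under 1% of returned URLs were followed.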
39ViewRate of CIE auxiliaries
40ViewRate of ClickedURLs vs three search services
41Additional followed URLs
42Median and Average Height of Viewed URLs
43After one week, half of the URLs viewed have been viewed before
44Towards Comprehensive Web Search
- No single search service is comprehensive
- Addressed by combining services
- Quality can be improved
- Speed is not completely sacrificed
- No search service will likely be comprehensive
- Barring a great leap of technology and/or investment of capital
- Next problem: effectively combining many services
45Towards Consistent Web Search
- Most available search services are inconsistent
- 5/9 engines temporarily omit 34-49% of the Top 10 URLs
- 3/9 temporarily omit 4-12%
- CIE auxiliaries address this problem
- CIE auxiliaries also improve results
- Contributing new documents
- Improving document ranking
46Related Work
- Lots of meta-search
- SavvySearch, Fusion, ProFusion, DogPile, ...
- Distributed Search
- Belkin, Fox TREC InfoSeek, Inktomi, ...
- Query Routing
- Gravano (STARTS), Leach (CIP), ...
- Collaboration
- FireFly (music), Amazon (books)
- Ask Jeeves (www.ask.com)
47Towards the Ideal Information Service
- Still far from fully comprehensive
- Intranets, Online DBs, Personal Files, ...
- Huge Integration and Routing problems
- More work with non-text
- Audio, video, chemical, numbers, product IDs, ...
- Improving result quality
- Known item vs all available
- Page semantics
48Team MetaCrawler
- Oren Etzioni, Oren Zamir, Zhenya Sigal, Melissa Johnson, Greg Lauckhart
- Christin Boyd, Darren Schack
- Erik Selberg
- selberg@cs.washington.edu
- http://huskysearch.cs.washington.edu
- http://www.metacrawler.com