Title: Choosing and Using the Best Metas
1. Choosing and Using the Best Metas
Hyper-searching the Web
- Michael Hunter
- Reference Librarian
- Hobart and William Smith Colleges
- for Rochester Regional Library Council Member Libraries Staff
- Sponsored by the Rochester Regional Library Council
- Supported by Library Services and Technology Act (LSTA) and/or Regional Bibliographic Databases and Resources Sharing (RBDB) funds granted by the New York State Library, 2002
2. For Today
- Metas: History and Functions
- Search and Retrieval Issues
- Major Players in 2003
- Clustering Technology
- More Good Metas
- Web Search Agents
- Evaluating Metasearch Services
3. Metasearch defined . . .
- Group of search engines, subject directories and/or databases made searchable through a common interface.
- Results may or may not follow the original sources' rankings
- Today: our focus is free metaengines using subject directories (Yahoo, LII, OD) and crawler-based engines (Google, FAST, Teoma) as sources
- We will NOT examine specialized or Deep Web metas
4. A GOOD Meta will . . .
- Re-format queries to be compatible with search syntax of each source
- Enable searchers to use advanced features (when the sources support them)
- Indicate overlapping results without repeating them
- Perform additional processing of results, e.g. ranking for appropriateness, categorization, etc.
- Use only sources with unique databases
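The first bullet above, re-formatting a query for each source, can be sketched as follows. This is only an illustration: the `Query` class, the `reformat` function and the three syntax profiles are hypothetical and do not describe any real engine's syntax.

```python
from dataclasses import dataclass, field


@dataclass
class Query:
    """Source-neutral query: required terms, an optional phrase, excluded terms."""
    terms: list
    phrase: str = ""
    excluded: list = field(default_factory=list)


# Hypothetical syntax profiles (assumptions); real engines differ and change.
SYNTAX = {
    "plus_minus": {"require": "+{}", "exclude": "-{}", "joiner": " ", "phrase": True},
    "boolean": {"require": "{}", "exclude": "NOT {}", "joiner": " AND ", "phrase": True},
    "basic": {"require": "{}", "exclude": None, "joiner": " ", "phrase": False},
}


def reformat(query, style):
    """Render the neutral query in one source's syntax, dropping unsupported parts."""
    rules = SYNTAX[style]
    parts = [rules["require"].format(t) for t in query.terms]
    if query.phrase:
        if rules["phrase"]:
            parts.append(rules["require"].format('"' + query.phrase + '"'))
        else:
            # No phrase support: fall back to the individual words.
            parts.extend(rules["require"].format(w) for w in query.phrase.split())
    if rules["exclude"] is not None:
        parts.extend(rules["exclude"].format(t) for t in query.excluded)
    return rules["joiner"].join(parts)
```

The "basic" profile shows how a query gets weaker when a source lacks a feature: the phrase becomes loose words and the exclusion is silently dropped.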
5. The beginnings of metasearch
- A conceptual descendant of Veronica
- March 1995: Harvest (later Savvysearch, now Search.com) developed at Colorado State by Daniel Dreilinger
- July 1995: Metacrawler developed at U. of Washington by Selberg and Etzioni
- "Metacrawler Architecture for Resource Aggregation on the Web", 1996
7. The beginnings of metasearch
- 1996 - Dogpile
- 1998 - Ixquick
- 1999 - Kartoo
- 2000 - Ithaki
- 2001 - Vivisimo
8. More facts about metas
- Flavor determined by choice of sources
- Comprehensive: Vivisimo, Ixquick, Metacrawler
- General (lifestyle, popular culture): Dogpile, Profusion
- Commercial: Search.com, Excite@Home
9. Metas and retrieval
- Metas search quickly but not deeply
- Search time or a quantity of searches is purchased from sources (typically top 10-50 hits from each)
- Metas are subject to time-out limits from their sources
- Each source is usually NOT searched for each query
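The effect of time-outs and top-N limits can be sketched with simulated sources. Everything here is an assumption made for illustration: `make_source` stands in for a remote engine by sleeping, and `metasearch` is a hypothetical fan-out routine, not how any particular meta is built.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait


def make_source(name, delay, hits):
    """Build a stand-in for a remote engine: waits `delay` seconds, then answers."""
    def search(query, top_n):
        time.sleep(delay)
        return [f"{name}: {h}" for h in hits[:top_n]]  # only the top N hits
    return search


def metasearch(query, sources, top_n=10, timeout=0.2):
    """Query every source in parallel; keep only answers that arrive in time."""
    pool = ThreadPoolExecutor(max_workers=len(sources))
    futures = {pool.submit(fn, query, top_n): name for name, fn in sources.items()}
    done, not_done = wait(list(futures), timeout=timeout)
    pool.shutdown(wait=False)  # do not block on sources that are still running
    results = {futures[f]: f.result() for f in done}
    skipped = sorted(futures[f] for f in not_done)
    return results, skipped
```

A source that answers after the time-out contributes nothing to the result set, which is one reason the same query can return different counts at different time-out settings.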
10. Metas and retrieval
- "Dumbing Down the Query"
- Advanced features are often not available, and then only those that are shared among sources
- Default setting for time-out is the shortest; set to maximum for more comprehensive searches (when available)
- For most metas, advertising is the only source of revenue; software sales are rare
11. Metas and retrieval
- What is their place in my search strategy?
- Metas best used for simple searches, with little (or no) syntactic complexity
- Use them to find the top few sites on a topic
- For a quick overview of a topic's coverage on the Web in general
- Use them as a last resort for highly focused topics that elude your usual search tools
- As a possible indication of coverage of a topic among several engines (NOTE: problematic)
12. Searching the metas
- Results depend on:
- Choice of sources
- Query processing speed OF THE SOURCE
- Length of time spent at each source
13. A search comparison . . .
- Searched "heterotropia" (abnormal binocular vision) on 4/21/03
- Vivisimo: 77 shortest, 126 longest
- Ixquick: 37 from at least 450 results
- Profusion: 30 shortest, 39 longest
- Metacrawler: 42 shortest, 61 longest
- Webcrawler: 31 shortest, 80 longest
- Dogpile: 29 (no time-out option)
- Excite: 41 shortest, 31 longest
14. Stability of Results
- Searched "kids of survival" (modern art group) as a phrase at 3-minute intervals (time-outs at default setting), 4/21/03
15. Metas and ranking options
- Listing by SOURCE
- Usually retains ranking of source
- COMBINED Listing options
- Indicate source of each result
- Indicate duplicates without repeating them
- Indicate position in original source's ranking
- Most duplicated hits listed first
- Disclose paid listings (if disclosed by source)
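A combined listing with the properties above (source shown, duplicates noted once, original positions kept, most duplicated hits first) might look like this minimal sketch. The `combine` function and its crude URL normalization are assumptions for illustration, not the ranking method of any service named here.

```python
def combine(listings):
    """Merge per-source ranked URL lists into one de-duplicated listing.

    listings: {source_name: [url, ...]}, each list in that source's own rank order.
    """
    merged = {}
    for source, urls in listings.items():
        for position, url in enumerate(urls, start=1):
            # Crude normalization (assumption): ignore case and a trailing slash.
            key = url.lower().rstrip("/")
            entry = merged.setdefault(key, {"url": url, "found_in": {}})
            # Remember where each source ranked this hit.
            entry["found_in"].setdefault(source, position)

    def sort_key(entry):
        positions = entry["found_in"].values()
        # Most-duplicated first, then best original position, then URL.
        return (-len(positions), min(positions), entry["url"])

    return sorted(merged.values(), key=sort_key)
```

Each entry carries a `found_in` mapping such as `{"A": 1, "B": 3}`, so the display layer can show both the sources and the position in each source's ranking.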
16. Vivisimo
- http://vivisimo.com
- Sources: Altavista, Yahoo, MSN, Netscape, Lycos, LookSmart, Gigablast, Vizzavi, BBC, Librarians' Index to the Internet, plus 11 specialized news sources and 7 specialized business, medical and governmental sources
- Offers full Boolean and phrase search (if supported by the source)
17. Vivisimo
- Offers the following customizations:
- Selection of sources searched
- Total number of results retrieved
- Length of search (time-out period)
- Results combined
- Source for each result given
- Ranking data from that source given
- Duplicates noted, but not repeated
18. Vivisimo
- Other features:
- Results are clustered by keyword prevalence or website of origin
- Offers a preview of each result in a separate window
- Offers vertical searches: Top News, Business News, Tech News, Sports News
19. Clustering results (folders)
- Automated subject analysis
- Facilitates navigation and query refinement
- Can be hierarchical (folders within folders)
- One document may appear in several folders
- Northern Light: first public search engine to make use of folders
20. Clustering technology in a metasearch environment
- Real-time processing of results retrieved from sources
- Variety of data can be returned from each source:
- URL
- Title
- First few sentences
- Human-created summary
- Folder creation varies according to data from sources and processing time available at the moment of the query
21. Clustering -- Step 1
- Significant terms are identified from all results based on:
- Frequency of term(s)
- Position of term(s)
- Normalization algorithms applied
- Documents analyzed for word variants (stemming)
- Norms set (authority control)
- e.g. "game downloads", "download games" and "downloading games" are treated as one term
- Folder labels created
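The normalization step can be sketched as follows. This is a toy under stated assumptions: `stem` is a deliberately crude suffix stripper standing in for a real stemming algorithm, and `normalize` and `folder_labels` are hypothetical names; the commercial clustering engines are proprietary and almost certainly work differently.

```python
from collections import Counter


def stem(word):
    """Very crude suffix stripping; a stand-in for a real stemmer."""
    word = word.lower()
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(phrase):
    """Map word variants and word orders to one key ("authority control")."""
    return tuple(sorted(stem(w) for w in phrase.split()))


def folder_labels(phrases, min_count=2):
    """Group phrases by normalized key; label each frequent group by its commonest form."""
    variants = {}
    for phrase in phrases:
        variants.setdefault(normalize(phrase), Counter())[phrase.lower()] += 1
    labels = {}
    for key, forms in variants.items():
        if sum(forms.values()) >= min_count:  # keep only frequent terms
            labels[key] = forms.most_common(1)[0][0]
    return labels
```

With this sketch, the three variants from the slide share the key `("download", "game")` and therefore count toward a single folder label.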
22. Clustering -- Step 2
- Each result from the sources is matched against the set of folder labels and assigned to one or more folders:
- By linguistic analysis (term position, predictive descriptive importance)
- By statistical analysis (term frequency)
- Final, proprietary analysis combines these (and more)
- Remember: The full documents are not available to a meta for this type of processing
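Folder assignment from titles and snippets alone can be sketched like this. The scoring rule (title matches weighted three times a snippet match, with a fixed threshold) is an assumption chosen for illustration; `tokens`, `score` and `assign` are hypothetical names, and the real analysis is proprietary.

```python
import re


def tokens(text):
    """Lowercase word tokens from a title or snippet."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(result, label, title_weight=3.0):
    """Term frequency of label words, with title position weighted above snippet."""
    label_words = set(tokens(label))
    title_hits = sum(1 for t in tokens(result["title"]) if t in label_words)
    snippet_hits = sum(1 for t in tokens(result["snippet"]) if t in label_words)
    return title_weight * title_hits + snippet_hits


def assign(results, labels, threshold=2.0):
    """Place each result in every folder whose label it matches strongly enough."""
    folders = {label: [] for label in labels}
    for result in results:
        for label in labels:
            if score(result, label) >= threshold:
                folders[label].append(result["url"])
    return folders
```

Because only the title and snippet are scored, a page that is relevant in its body but not in its summary never reaches the folder, which is the limitation the last bullet points out.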
25. Profusion
- http://profusion.com
- Sources: Altavista, Yahoo, MSN, About.com, Adobe PDF, AOL, LookSmart, Lycos, Netscape, Raging Search, Teoma, WiseNut
- Offers full Boolean and phrase search (if supported by the source)
26. Profusion
- Offers the following customizations:
- Selection of sources searched
- Total number of results retrieved
- Length of search (time-out period)
- Offers option of results listed by source or combined listing
- Source for each result given
- Ranking data from that source given
- Duplicates noted, but not repeated
27. Profusion
- Other features:
- Results can be sorted by relevance score, title or URL
- "Similar Result" enhancement
- Profusion Relevance Score shown
- Search terms highlighted in results listing
- "Set Search Alert" feature stores searches and alerts user to page changes; requires setting up a (free) account
- Search Analysis available
- Offers vertical searches: Deep Web content in 21 broad categories, News
28. Ixquick
- http://ixquick.com
- Sources: Altavista, Netscape, Gigablast, Adobe PDF, Avaya PDF, AskJeeves, Teoma, Go, Open Directory, Overture, Kanoodle, LookSmart, WiseNut, FindWhat, Yahoo, MSN
- Offers full Boolean and phrase search (if supported by the source)
- Offers the following customizations:
- Selection of sources searched
- Length of search (time-out period)
29. Ixquick
- Results combined
- Source for each result given
- Ranking data from that source given
- Duplicates noted, but not repeated
30. Ixquick
- Other features:
- Offers 7 field searches (when supported by sources)
- Clusters hits from same site
- Highlights search terms in each hit
- Offers Related Searches
- Offers vertical searches: MP3, News, Pictures
31. iBoogie
- http://iboogie.com
- Sources: Altavista, Yahoo, MSN, FAST, FindWhat, Teoma, WiseNut, OpenFind
- Boolean and phrase search somewhat unreliable
- Offers the following customizations:
- Selection of sources searched
- Total number of results retrieved
- Length of search (time-out period)
32. iBoogie
- Results combined
- Source for each result given
- Duplicates noted, but not repeated
- Other features:
- Adult content filter (when supported by source)
- Language limit (when supported by source)
- Clusters results by keyword and/or website
- Offers "Similar Pages" enhancement
- Offers vertical searches: Newspapers, Bookstores, Reference, Shopping
33. Metacrawler
- http://metacrawler.com
- Sources: FAST, Google, About.com, AskJeeves, FindWhat, LookSmart, Inktomi (?), Open Directory, Overture, Search Hippo, Sprinks, Teoma
- Offers Boolean AND, OR (no NOT) and phrase search (if supported by the source)
- Offers the following customizations:
- Selection of sources searched
- Total number of results retrieved
- Length of search (time-out period)
34. Metacrawler
- Offers option of results listed by source or combined listing
- Source for each result given
- Duplicates noted, but not repeated
- Other features:
- Offers Related Searches
- "More like this" results enhancement
- Offers a wide range of vertical searches: Images, MP3, Shopping, Subject Directory, Multimedia, News, Message Boards
35. Dogpile
- http://dogpile.com
- Sources: Google, FAST, About.com, Ah-ha, AskJeeves, FindWhat, LookSmart, Open Directory, Search Hippo, Sprinks, Overture, Inktomi (?)
- Offers Boolean AND, OR (no NOT) and phrase search (if supported by the source)
- Offers the following customization:
- Selection of sources searched
36. Dogpile
- Results listed ONLY by source
- Source for each result given
- Other features:
- Offers Related Searches
- Offers a wide range of vertical searches, similar to Metacrawler: Images, MP3, Shopping, Subject Directory, Multimedia, News, Message Boards
37. Web Search Agents, aka desktop client search programs
- Software must be purchased
- Queries a fixed set of engines, directories, news and other databases
- Sites that review and feature search agents:
- Searchenginewatch.com
- Searchengineshowdown.com
- www.botspot.com
- www.agentland.com
38. Web Search Agents: typical features
- Queries are re-formulated to follow syntax of source databases
- Duplicates removed
- Additional ranking performed
- Source given
- Optional sort orders
- Optional grouping of results into folders
- Many output options (HTML, word processor, XML, e-mail and more)
39. Web Search Agents: different from other metas?
- Differences from the (good) free metas:
- Many more sources queried
- Several output options
- Update option (re-running the search at specified intervals)
- Customizable search parameters
40. Web Search Agents
- BullsEye Pro 3.0: $199
- BullsEye Plus: $49.99
- Covers 1000 sources
- Removes dead links
- Multiple language capability
- Government and News search groups
- Customization of sources available for an additional fee
- All other typical features
- Available at intelliseek.com
41. Web Search Agents
- Copernic Pro 5.02: $79.95
- Copernic 2001 Plus: $39.95
- Copernic Plus Basic: Free
- Pro version covers 1000 sources
- Removes dead links
- Post-search refinement and processing of retrieved results
- Automatic document summarizations (requires more software)
- All other typical features
- Available at www.copernic.com
43. Ultrabar: choosing your own sources
- Free download
- Searches a small set of pre-selected engines and allows more to be added, including Deep Web resources
- Offers search term highlighting
- Does not re-formulate queries for each source
- No output options
- Available at ultrabar.com
45. Evaluating metasearch services
- What are the sources for the results?
- Good general search engines and high-quality directories? Shopping engines? Do any sources share the same database?
- What search features are offered?
- Remember, these are only in effect for the sources that support them.
- What results-based enhancements are offered?
- Clustering? "More like this"? Highlighting of search terms? Related Searches?
46. Evaluating metasearch services
- What factors determine the ranking of results?
- Is there any processing of results after retrieval from the sources?
- Are the source and/or the ranking in that source given for each hit?
- Can the user expand the number of sources searched and/or the search time?
47. Evaluating metasearch services
- Use your own test-drive questions and compare with results from other meta-engines and good single engines and directories.
- Search for questions in specialized subject areas you are familiar with (tests database depth).
- Search for very recent topics (tests database freshness).
48. Evaluating metasearch services
- Check its popularity through an independent rating or popularity monitoring service:
- Media Metrix: http://www.mediametrix.com/
- The oldest user-based rating service on the Web; lists top 50 most visited sites.
- PC Data Online: http://www.pcdataonline.com/reports/
- Check for information at the site:
- About, FAQ, Contact Us
49. A GOOD meta will . . .
- Re-format queries to be compatible with search syntax of each source
- Enable searchers to use advanced features (when the sources support them)
- Indicate overlapping results without repeating them
- Perform additional processing of results, e.g. ranking for appropriateness, categorization, etc.
- Use only sources with unique databases
50. In conclusion . . .
- How do metas fit into my search strategy?
- Metas best used for simple searches, with little (or no) syntactic complexity
- Use them to find the top few sites on a topic
- For a quick overview of a topic's coverage on the Web in general
- Use them as a last resort for highly focused topics that elude your usual search tools
- As a possible indication of coverage of a topic among several engines (NOTE: problematic)
- Other uses??
51. Thank you and Best of Luck with Metaengines!
- Michael Hunter
- Warren Hunting Smith Library
- Hobart and William Smith Colleges
- Geneva, NY 14507
- (315) 781-3552
- hunter@hws.edu