Choosing and Using the Best Metas - PowerPoint PPT Presentation

Slides: 52
Provided by: hun4
Learn more at: https://people.hws.edu

Transcript and Presenter's Notes

Title: Choosing and Using the Best Metas


1
Choosing and Using the Best Metas
Hyper-searching the Web
  • Michael Hunter
  • Reference Librarian
  • Hobart and William Smith Colleges
  • for Rochester Regional Library Council
  • Member Libraries Staff
  • Sponsored by the
  • Rochester Regional Library Council
  • Supported by Library Services and Technology Act
    (LSTA) and/or
  • Regional Bibliographic Databases and Resources
    Sharing (RBDB) funds granted by the
  • New York State Library 2002

2
For Today
  • Metas: History and Functions
  • Search and Retrieval Issues
  • Major Players in 2003
  • Clustering Technology
  • More Good Metas
  • Web Search Agents
  • Evaluating Metasearch Services

3
Metasearch defined . . .
  • Group of search engines, subject directories
    and/or databases made searchable through a common
    interface.
  • Results may or may not follow the original
    sources' rankings
  • Today our focus is free metaengines using
    subject directories (Yahoo, LII, OD) and
    crawler-based engines as sources (Google, FAST,
    Teoma)
  • We will NOT examine specialized or Deep Web metas

4
A GOOD Meta will . . .
  • Re-format queries to be compatible with search
    syntax of each source
  • Enable searchers to use advanced features (when
    the sources support them)
  • Indicate overlapping results without repeating
    them
  • Perform additional processing of results, e.g.
    ranking for appropriateness, categorization, etc.
  • Use only sources with unique databases
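
A minimal sketch of the first point above, re-formatting one query for several sources. The three syntax styles (plus_minus, boolean, plain), the Query class and the reformat function are made-up names for illustration, not any real engine's operators or API. Features a source does not support are dropped or approximated.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Query:
    """A source-neutral query: required terms, one optional phrase, excluded terms."""
    terms: List[str] = field(default_factory=list)
    phrase: Optional[str] = None
    excluded: List[str] = field(default_factory=list)


# Hypothetical syntax profiles; real engines document their own operators.
SYNTAX = {
    "plus_minus": {"require": "+{t}", "exclude": "-{t}", "phrase": '"{p}"', "joiner": " "},
    "boolean": {"require": "{t}", "exclude": "NOT {t}", "phrase": '"{p}"', "joiner": " AND "},
    "plain": {"require": "{t}", "exclude": None, "phrase": None, "joiner": " "},
}


def reformat(query: Query, style: str) -> str:
    """Render the query in one source's syntax, dropping unsupported features."""
    rules = SYNTAX[style]
    parts = [rules["require"].format(t=t) for t in query.terms]
    if query.phrase:
        if rules["phrase"]:
            parts.append(rules["phrase"].format(p=query.phrase))
        else:
            # No phrase support: fall back to the individual words.
            parts.extend(rules["require"].format(t=w) for w in query.phrase.split())
    if rules["exclude"]:
        parts.extend(rules["exclude"].format(t=t) for t in query.excluded)
    return rules["joiner"].join(parts)


q = Query(terms=["vision"], phrase="binocular disorder", excluded=["surgery"])
for style in SYNTAX:
    print(style, "->", reformat(q, style))
```

The plain style shows the cost of the weakest source: the phrase becomes loose words and the exclusion is lost.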

5
The beginnings of metasearch
  • A conceptual descendant of Veronica
  • March 1995 - Harvest (later Savvysearch, now
    Search.com) developed at Colorado State by Daniel
    Dreilinger
  • July 1995 - Metacrawler developed at U. of
    Washington by Selberg and Etzioni
  • Metacrawler Architecture for Resource
    Aggregation on the Web, 1996

6
(No Transcript)
7
The beginnings of metasearch
  • 1996 - Dogpile
  • 1998 - Ixquick
  • 1999 - Kartoo
  • 2000 - Ithaki
  • 2001 - Vivisimo

8
More facts about metas
  • Flavor determined by choice of sources
  • Comprehensive: Vivisimo, Ixquick, Metacrawler
  • General (lifestyle, popular culture): Dogpile,
    Profusion
  • Commercial: Search.com, Excite@Home

9
Metas and retrieval
  • Metas search quickly but not deeply
  • Search time or a quantity of searches is
    purchased from sources (typically the top 10-50
    hits from each)
  • Metas are subject to time-out limits from their
    sources
  • Each source is usually NOT searched for each
    query
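
A rough simulation of the points above, with no network access. Each source has a pretend response time, the meta keeps only sources that answer within its time-out, and it takes only the top few hits from each. The Source tuple, the gather function, the source names and the latencies are all invented for illustration.

```python
from typing import Dict, List, NamedTuple


class Source(NamedTuple):
    name: str
    latency: float      # simulated seconds until the source answers
    hits: List[str]     # ranked result URLs the source would return


def gather(sources: List[Source], timeout: float, top_n: int) -> Dict[str, List[str]]:
    """Keep the top_n hits of every source that answers within the time-out."""
    collected = {}
    for src in sources:
        if src.latency <= timeout:
            collected[src.name] = src.hits[:top_n]
    return collected


# Hypothetical sources and latencies, for illustration only.
SOURCES = [
    Source("fast_engine", 0.5, [f"http://a.example/{i}" for i in range(40)]),
    Source("slow_engine", 4.0, [f"http://b.example/{i}" for i in range(40)]),
    Source("directory", 1.5, [f"http://c.example/{i}" for i in range(5)]),
]

short = gather(SOURCES, timeout=1.0, top_n=10)
long_ = gather(SOURCES, timeout=5.0, top_n=10)
print(sorted(short), sum(len(h) for h in short.values()))
print(sorted(long_), sum(len(h) for h in long_.values()))
```

With the short time-out only one source contributes; with the long one all three do, which is why the same query can return different totals at different settings.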

10
Metas and retrieval
  • Dumbing Down the Query
  • Advanced features are often not available; when
    they are, only those shared among sources work
  • The default time-out setting is the shortest; set
    it to maximum for more comprehensive searches
    (when available)
  • For most metas, advertising is the only source of
    revenue; software sales are rare

11
Metas and retrieval
  • What is their place in my search strategy?
  • Metas best used for simple searches, with little
    (or no) syntactic complexity
  • Use them to find the top few sites on a topic
  • For a quick overview of a topic's coverage on the
    Web in general
  • Use them as a last resort for highly focused
    topics that elude your usual search tools
  • As a possible indication of coverage of a topic
    among several engines (NOTE: problematic)

12
Searching the metas
  • Results depend on
  • Choice of sources
  • Query processing speed OF THE SOURCE
  • Length of time spent at each source

13
A search comparison . . .
  • Searched heterotropia (abnormal binocular vision)
    on 4/21/03; results at shortest and longest
    time-out settings
  • Vivisimo: 77 shortest, 126 longest
  • Ixquick: 37 from at least 450 results
  • Profusion: 30 shortest, 39 longest
  • Metacrawler: 42 shortest, 61 longest
  • Webcrawler: 31 shortest, 80 longest
  • Dogpile: 29 (no time-out option)
  • Excite: 41 shortest, 31 longest

14
Stability of Results
  • Searched "kids of survival" (modern art group) as
    a phrase at 3-minute intervals (time-outs at
    default setting), 4/21/03
15
Metas and ranking options
  • Listing by SOURCE
  • Usually retains ranking of source
  • COMBINED Listing options
  • Indicate source of each result
  • Indicate duplicates without repeating them
  • Indicate position in original source's ranking
  • Most duplicated hits listed first
  • Disclose paid listings (if disclosed by source)
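
The combined listing options above can be sketched as follows. This is only an illustration: the normalize rule for spotting duplicates is a crude assumption, and combine, engine_a and engine_b are hypothetical names. Each merged entry keeps its sources and original ranks, and the most duplicated hits come first.

```python
from typing import Dict, List, Tuple


def normalize(url: str) -> str:
    """Crude URL key for duplicate detection (an assumption, not a standard)."""
    key = url.lower().rstrip("/")
    for prefix in ("http://", "https://"):
        if key.startswith(prefix):
            key = key[len(prefix):]
    if key.startswith("www."):
        key = key[4:]
    return key


def combine(by_source: Dict[str, List[str]]) -> List[Tuple[str, List[Tuple[str, int]]]]:
    """Merge per-source rankings; each entry is (url, [(source, rank), ...])."""
    merged: Dict[str, Tuple[str, List[Tuple[str, int]]]] = {}
    for source, urls in by_source.items():
        for rank, url in enumerate(urls, start=1):
            key = normalize(url)
            if key not in merged:
                merged[key] = (url, [])
            merged[key][1].append((source, rank))
    # Most duplicated first; ties broken by best (lowest) original rank.
    return sorted(
        merged.values(),
        key=lambda item: (-len(item[1]), min(rank for _, rank in item[1])),
    )


results = {
    "engine_a": ["http://www.example.org/", "http://example.com/art"],
    "engine_b": ["http://example.net", "http://example.org"],
}
combined = combine(results)
for url, found_in in combined:
    print(url, found_in)
```

The duplicate is listed once, with both sources and both ranks attached, rather than being repeated.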

16
Vivisimo
  • http://vivisimo.com
  • Sources: Altavista, Yahoo, MSN, Netscape, Lycos,
    LookSmart, Gigablast, Vizzavi, BBC, Librarians'
    Index to the Internet plus 11 specialized news
    sources and 7 specialized business, medical and
    governmental sources
  • Offers full Boolean and phrase search (if
    supported by the source)

17
Vivisimo
  • Offers the following customizations
  • Selection of sources searched
  • Total number of results retrieved
  • Length of search (time-out period)
  • Results combined
  • Source for each result given
  • Ranking data from that source given
  • Duplicates noted, but not repeated

18
Vivisimo
  • Other features
  • Results are clustered by keyword prevalence or
    website of origin
  • Offers a preview of each result in a separate
    window
  • Offers vertical searches: Top News, Business
    News, Tech News, Sports News

19
Clustering results (folders)
  • Automated subject analysis
  • Facilitates navigation and query refinement
  • Can be hierarchical (folders within folders)
  • One document may appear in several folders
  • Northern Light was the first public search engine
    to make use of folders

20
Clustering technology in a metasearch environment
  • Real-time processing of results retrieved from
    sources
  • Variety of data can be returned from each source
  • URL
  • Title
  • First few sentences
  • Human-created summary
  • Folder creation varies according to data from
    sources and processing time available at the
    moment of the query

21
Clustering -- Step 1
  • Significant terms are identified from all results
    based on
  • Frequency of term(s)
  • Position of term(s)
  • Normalization algorithms applied
  • Documents analyzed for word variants (stemming)
  • Norms set (authority control), e.g. for the
    variants game downloads, download games,
    downloading games
  • Folder labels created
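
A toy sketch of Step 1, assuming a very crude suffix-stripping stem function in place of a real stemmer and using term frequency only (term position is ignored). The names stem, norm_key and folder_labels, the stopword list and the sample snippets are all illustrative; commercial clustering engines use proprietary methods.

```python
import re
from collections import Counter
from typing import List, Tuple


def stem(word: str) -> str:
    """Very crude suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def norm_key(phrase: str) -> Tuple[str, ...]:
    """Map word variants and word order onto one normalized key."""
    words = re.findall(r"[a-z]+", phrase.lower())
    return tuple(sorted(stem(w) for w in words))


def folder_labels(snippets: List[str], top: int = 2) -> List[str]:
    """Pick the most frequent stems across all snippets as folder labels."""
    stopwords = {"the", "and", "for", "of", "a", "to", "in", "free"}
    counts: Counter = Counter()
    for text in snippets:
        stems = {stem(w) for w in re.findall(r"[a-z]+", text.lower())}
        counts.update(stems - stopwords)  # count each stem once per snippet
    # Sort by frequency, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:top]]


snippets = [
    "Free game downloads for Windows",
    "Download games and demos",
    "Downloading games safely",
    "Board strategy guide",
]
print(norm_key("game downloads") == norm_key("downloading games"))
print(folder_labels(snippets))
```

All three variants from the slide collapse to the same key, which is the effect the normalization step is after.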

22
Clustering -- Step 2
  • Each result from the sources is matched against
    the set of folder labels and assigned to one or
    more folders
  • By linguistic analysis (term position, predictive
    descriptive importance)
  • By statistical analysis (term frequency)
  • Final, proprietary analysis combines these (and
    more)
  • Remember: the full documents are not available
    to a meta for this type of processing
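
Step 2 can be pictured with the sketch below, which assigns each result to every folder whose label appears in its title or snippet. It uses simple prefix matching as a stand-in for linguistic and statistical analysis; assign_to_folders, the "other" folder and the sample results are assumptions made for illustration.

```python
import re
from typing import Dict, List


def assign_to_folders(results: List[Dict[str, str]], labels: List[str]) -> Dict[str, List[str]]:
    """Place each result in every folder whose label occurs in its title or snippet."""
    folders: Dict[str, List[str]] = {label: [] for label in labels}
    folders["other"] = []
    for result in results:
        # Only the short text returned by the source is available, not the full page.
        text = (result["title"] + " " + result["snippet"]).lower()
        words = set(re.findall(r"[a-z]+", text))
        matched = [label for label in labels if any(w.startswith(label) for w in words)]
        for label in matched or ["other"]:
            folders[label].append(result["url"])
    return folders


results = [
    {"url": "http://example.com/1", "title": "Game downloads", "snippet": "Free games to download."},
    {"url": "http://example.com/2", "title": "Chess openings", "snippet": "Classic board game strategy."},
    {"url": "http://example.com/3", "title": "Weather", "snippet": "Five-day forecast."},
]
folders = assign_to_folders(results, ["game", "download"])
for label, urls in folders.items():
    print(label, urls)
```

The first result lands in two folders, matching the earlier point that one document may appear in several folders.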

23
(No Transcript)
24
(No Transcript)
25
Profusion
  • http://profusion.com
  • Sources: Altavista, Yahoo, MSN, About.com, Adobe
    PDF, AOL, LookSmart, Lycos, Netscape, Raging
    Search, Teoma, WiseNut
  • Offers full Boolean and phrase search (if
    supported by the source)

26
Profusion
  • Offers the following customizations
  • Selection of sources searched
  • Total number of results retrieved
  • Length of search (time-out period)
  • Offers option of results listed by source or
    combined listing
  • Source for each result given
  • Ranking data from that source given
  • Duplicates noted, but not repeated

27
Profusion
  • Other features
  • Results can be sorted by relevance score, title
    or URL
  • Similar Result enhancement
  • Profusion Relevance Score shown
  • Search terms highlighted in results listing
  • Set Search Alert feature stores searches and
    alerts user to page changes; requires setting up
    a (free) account
  • Search Analysis available
  • Offers vertical searches: Deep Web content in 21
    broad categories, News

28
Ixquick
  • http://ixquick.com
  • Sources: Altavista, Netscape, Gigablast, Adobe
    PDF, Avaya PDF, AskJeeves, Teoma, Go, Open
    Directory, Overture, Kanoodle, LookSmart,
    WiseNut, FindWhat, Yahoo, MSN
  • Offers full Boolean and phrase search (if
    supported by the source)
  • Offers the following customizations
  • Selection of sources searched
  • Length of search (time-out period)

29
Ixquick
  • Results combined
  • Source for each result given
  • Ranking data from that source given
  • Duplicates noted, but not repeated

30
Ixquick
  • Other features
  • Offers 7 field searches (when supported by
    sources)
  • Clusters hits from same site
  • Highlights search terms in each hit
  • Offers Related Searches
  • Offers vertical searches: MP3, News, Pictures

31
iBoogie
  • http://iboogie.com
  • Sources: Altavista, Yahoo, MSN, FAST, FindWhat,
    Teoma, WiseNut, OpenFind
  • Boolean and phrase search somewhat unreliable
  • Offers the following customizations
  • Selection of sources searched
  • Total number of results retrieved
  • Length of search (time-out period)

32
iBoogie
  • Results combined
  • Source for each result given
  • Duplicates noted, but not repeated
  • Other features
  • Adult content filter (when supported by source)
  • Language limit (when supported by source)
  • Clusters results by keyword and/or website
  • Offers Similar Pages enhancement
  • Offers vertical searches: Newspapers,
    Bookstores, Reference, Shopping

33
Metacrawler
  • http://metacrawler.com
  • Sources: FAST, Google, About.com, AskJeeves,
    FindWhat, LookSmart, Inktomi (?), Open Directory,
    Overture, Search Hippo, Sprinks, Teoma
  • Offers Boolean AND, OR (no NOT) and phrase
    search (if supported by the source)
  • Offers the following customizations
  • Selection of sources searched
  • Total number of results retrieved
  • Length of search (time-out period)

34
Metacrawler
  • Offers option of results listed by source or
    combined listing
  • Source for each result given
  • Duplicates noted, but not repeated
  • Other features
  • Offers Related Searches
  • "More like this" results enhancement
  • Offers a wide range of vertical searches: Images,
    MP3, Shopping, Subject Directory, Multimedia,
    News, Message Boards

35
Dogpile
  • http://dogpile.com
  • Sources: Google, Fast, About.com, Ah-ha,
    AskJeeves, FindWhat, LookSmart, Open Directory,
    Search Hippo, Sprinks, Overture, Inktomi (?)
  • Offers Boolean AND, OR (no NOT) and phrase
    search (if supported by the source)
  • Offers the following customization
  • Selection of sources searched

36
Dogpile
  • Results listed ONLY by source
  • Source for each result given
  • Other features
  • Offers Related Searches
  • Offers a wide range of vertical searches, similar
    to Metacrawler: Images, MP3, Shopping, Subject
    Directory, Multimedia, News, Message Boards

37
Web Search Agents: aka desktop client search programs
  • Software must be purchased
  • Queries a fixed set of engines, directories,
    news and other databases
  • Sites that review and feature search agents
  • Searchenginewatch.com
  • Searchengineshowdown.com
  • www.botspot.com
  • www.agentland.com

38
Web Search Agents: typical features
  • Queries are re-formulated to follow syntax of
    source databases
  • Duplicates removed
  • Additional ranking performed
  • Source given
  • Optional sort orders
  • Optional grouping of results into folders
  • Many output options (html, word processor, xml,
    e-mail and more)

39
Web Search Agents: different from other metas?
  • Differences from the (good) free metas
  • Many more sources queried
  • Several output options
  • Update option (re-running the search at specified
    intervals)
  • Customizable search parameters

40
Web Search Agents
  • BullsEye Pro 3.0: $199
  • BullsEye Plus: $49.99
  • Covers 1000 sources
  • Removes dead links
  • Multiple language capability
  • Government and News search groups
  • Customization of sources available for an
    additional fee
  • All other typical features
  • Available at intelliseek.com

41
Web Search Agents
  • Copernic Pro 5.02: $79.95
  • Copernic 2001 Plus: $39.95
  • Copernic Plus Basic: Free
  • Pro version covers 1000 sources
  • Removes dead links
  • Post-search refinement and processing of
    retrieved results
  • Automatic document summarizations (requires more
    software)
  • All other typical features
  • Available at www.copernic.com

42
(No Transcript)
43
Ultrabar: choosing your own sources
  • Free download
  • Searches a small set of pre-selected engines and
    allows more to be added, including Deep Web
    resources
  • Offers search term highlighting
  • Does not re-formulate queries for each source
  • No output options
  • Available at ultrabar.com

44
(No Transcript)
45
Evaluating metasearch services
  • What are the sources for the results?
  • Good general search engines and high-quality
    directories? Shopping engines? Do any sources
    share the same database?
  • What search features are offered?
  • Remember, these are only in effect for the
    sources that support them.
  • What results-based enhancements are offered?
  • Clustering? More like this? Highlighting of
    search terms? Related Searches?

46
Evaluating metasearch services
  • What factors determine the ranking of results?
  • Is there any processing of results after
    retrieval from the sources?
  • Is the source and/or ranking in that source given
    for each hit?
  • Can the user expand the number of sources
    searched and/or the search time?

47
Evaluating metasearch services
  • Use your own test-drive questions and compare
    with results from other meta-engines and good
    single engines and directories.
  • Search for questions in specialized subject areas
    you are familiar with (tests database depth).
  • Search for very recent topics (tests database
    freshness)
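
One way to put numbers on such a comparison is sketched below, assuming you have copied the result URLs from two services into lists. Jaccard overlap is chosen here only as a simple, common measure and is not prescribed by the slides. The overlap and unique_to functions and the sample URLs are illustrative.

```python
from typing import Iterable, List


def overlap(results_a: Iterable[str], results_b: Iterable[str]) -> float:
    """Jaccard overlap: shared URLs divided by all distinct URLs."""
    a, b = set(results_a), set(results_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def unique_to(results_a: Iterable[str], results_b: Iterable[str]) -> List[str]:
    """URLs found by the first service but not by the second, in original order."""
    seen = set(results_b)
    return [url for url in results_a if url not in seen]


# Made-up result lists for one test-drive question.
meta_hits = ["http://example.org/1", "http://example.org/2",
             "http://example.org/3", "http://example.org/4"]
engine_hits = ["http://example.org/2", "http://example.org/4",
               "http://example.org/5"]

print(overlap(meta_hits, engine_hits))
print(unique_to(meta_hits, engine_hits))
```

A low overlap with many unique hits suggests the meta is reaching sources your usual engine misses; the same functions can compare repeated runs of one meta to check stability.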

48
Evaluating metasearch services
  • Check its popularity through an independent
    rating or popularity monitoring service
  • Media Metrix: http://www.mediametrix.com/
  • The oldest user-based rating service on the Web;
    lists top 50 most visited sites.
  • PC Data Online: http://www.pcdataonline.com/reports/
  • Check for information at the site
  • About, FAQ, Contact Us

49
A GOOD meta will . . .
  • Re-format queries to be compatible with search
    syntax of each source
  • Enable searchers to use advanced features (when
    the sources support them)
  • Indicate overlapping results without repeating
    them
  • Perform additional processing of results, e.g.
    ranking for appropriateness, categorization, etc.
  • Use only sources with unique databases

50
In conclusion . . .
  • How do metas fit into my search strategy?
  • Metas best used for simple searches, with little
    (or no) syntactic complexity
  • Use them to find the top few sites on a topic
  • For a quick overview of a topic's coverage on the
    Web in general
  • Use them as a last resort for highly focused
    topics that elude your usual search tools
  • As a possible indication of coverage of a topic
    among several engines (NOTE: problematic)
  • Other uses??

51
Thank you and Best of Luck with Metaengines!
  • Michael Hunter
  • Warren Hunting Smith Library
  • Hobart and William Smith Colleges
  • Geneva, NY 14507
  • (315) 781-3552
  • hunter@hws.edu