Title: Metasearch Engines
1Metasearch Engines
2Course Outline (recap)
- Introduction and the MPEG standards
- The research issues in MPEG-7
- Introduction to speech processing for multimedia
- Introduction to statistical pattern recognition
- Media indexing and retrieval
- Past, present and future
- Content-based retrieval (CBR)
- Introduction to concept-based retrieval
- Metasearch engines
- Human-computer interface
- Human body movement analysis
- Human emotion recognition
- Media transmission over peer-to-peer networks
- Dynamic resource allocation in media transmission
3This class
- What are metasearch engines
- Common features
- Some metasearch engines
- Research issues
4Searching more than one database
- Users find more good documents, but must learn how to use each search engine and combine the results themselves
5Metasearch Engines
- Metasearch engines search many databases in parallel
- Combine results
6Metasearch engine
Read query
Choose databases
For each chosen database, translate and send the query
7Metasearch engine
Accept search results
Select a subset from each
Merge and display results
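The six steps on the two slides above can be sketched as one small pipeline. This is a minimal illustration, not how any particular metasearch engine is built: the two engines are hypothetical stubs that return fixed results, choose_databases and translate are placeholders, and keeping the best score per URL is just one possible merge rule.

```python
from typing import Callable, Dict, List, Tuple

Result = Tuple[str, float]  # (url, score)

# Hypothetical stand-ins for real search back ends.
def engine_a(query: str) -> List[Result]:
    return [("http://example.org/a", 0.9), ("http://example.org/b", 0.4)]

def engine_b(query: str) -> List[Result]:
    return [("http://example.org/b", 0.8), ("http://example.org/c", 0.3)]

ENGINES: Dict[str, Callable[[str], List[Result]]] = {"A": engine_a, "B": engine_b}

def choose_databases(query: str) -> List[str]:
    # Placeholder policy: query every known engine.
    return list(ENGINES)

def translate(query: str, engine: str) -> str:
    # Placeholder: a real system would rewrite operators per engine.
    return query

def metasearch(query: str, per_engine: int = 10) -> List[Result]:
    merged: Dict[str, float] = {}
    for name in choose_databases(query):                 # choose databases
        results = ENGINES[name](translate(query, name))  # translate and send
        for url, score in results[:per_engine]:          # select a subset from each
            # Merge: keep the best score seen for each URL.
            merged[url] = max(merged.get(url, 0.0), score)
    return sorted(merged.items(), key=lambda item: item[1], reverse=True)

if __name__ == "__main__":
    for url, score in metasearch("metasearch engines"):
        print(score, url)
```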
8Advantages
- A uniform query language
- Choose best databases for query
- Save users time
- Provide better retrieval results
9Common features
- Search most of the popular search engines.
- Fast, because they use "parallel" (i.e., simultaneous) querying and have high-speed processors
- Allow you to set length of wait time
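Parallel querying with a user-set wait time might look like the sketch below. The engines are hypothetical stubs (one sleeps to simulate a slow remote service), and max_wait is an assumed name for the wait-time setting. Engines that miss the deadline are simply left out of the results.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait

def fast_engine(query):
    return ["fast:" + query]

def slow_engine(query):
    time.sleep(1.0)  # simulates a slow remote engine
    return ["slow:" + query]

def parallel_search(query, engines, max_wait):
    """Send the query to all engines at once; keep answers that arrive in time."""
    pool = ThreadPoolExecutor(max_workers=len(engines))
    futures = {pool.submit(fn, query): name for name, fn in engines.items()}
    done, _late = wait(futures, timeout=max_wait)
    pool.shutdown(wait=False)  # do not block on engines that missed the deadline
    return {futures[f]: f.result() for f in done}

if __name__ == "__main__":
    engines = {"fast": fast_engine, "slow": slow_engine}
    print(parallel_search("metasearch", engines, max_wait=0.2))
```

A longer max_wait trades response time for coverage: with a wait of a few seconds, the slow engine's results would be included as well.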
10Differences
- How results are compiled when reported
- How and whether they can handle complex searches
- Whether you can customize the search strategy
11How results are compiled when reported
- Some report the results from each search engine in sequence
- Others sort the results, eliminating duplicates.
- In some you can specify how results are sorted
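Eliminating duplicates usually means deciding when two result URLs refer to the same page. The sketch below assumes a simple rule that is not taken from any specific engine: ignore host-name case, a trailing slash, and the fragment.

```python
from urllib.parse import urlsplit

def normalize(url):
    """Reduce a URL to a comparison key (assumed rule, for illustration)."""
    parts = urlsplit(url)
    return (parts.netloc.lower(), parts.path.rstrip("/"), parts.query)

def dedupe(urls):
    """Keep the first occurrence of each page, preserving order."""
    seen = set()
    unique = []
    for url in urls:
        key = normalize(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique

if __name__ == "__main__":
    hits = [
        "http://Example.org/page/",
        "http://example.org/page",
        "http://example.org/other",
    ]
    print(dedupe(hits))
```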
12How and whether they can handle complex searches
- Some allow phrase searching.
- Some allow Boolean operators (especially OR and NOT).
- Some strip out quotation marks or Boolean operators, or create garbage by passing them through as search terms.
- Few allow you to request truncation.
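The stripping behaviour described above can be illustrated with a small translation step. The capability table and engine names are hypothetical; real engines differ and change over time.

```python
# Hypothetical capability table, for illustration only.
CAPABILITIES = {
    "engine_full": {"phrase": True, "boolean": True},
    "engine_plain": {"phrase": False, "boolean": False},
}

BOOLEAN_WORDS = {"AND", "OR", "NOT"}

def translate(query, engine):
    """Rewrite a query so it only uses features the target engine supports."""
    caps = CAPABILITIES[engine]
    if not caps["phrase"]:
        query = query.replace('"', "")  # strip quotation marks
    tokens = query.split()
    if not caps["boolean"]:
        tokens = [t for t in tokens if t not in BOOLEAN_WORDS]  # strip operators
    return " ".join(tokens)

if __name__ == "__main__":
    q = '"meta search" OR metasearch'
    for name in CAPABILITIES:
        print(name, "->", translate(q, name))
```

Note that stripping NOT leaves the excluded term in the query as an ordinary search term, so a careful translator might prefer to drop that term too, or skip the engine altogether.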
13Whether you can customize the search strategy
- In some you have more flexibility to vary time limits and choose how results are reported.
- Some let you specify which search tool databases are queried and in what order.
14Metacrawler
- No choice
- Fast searches
- Sends query to AltaVista, Excite, Infoseek, LookSmart, Lycos, The Mining Co., WebCrawler, Yahoo!
- Identifies and removes duplicates
- Consolidates results in one large list, ranked by a "vote"
15Metacrawler
- Merges results by first normalizing all the scores to values from 0 to 1000
- Then adding the scores of documents retrieved more than once
- Query ALL terms (AND), ANY terms (OR), or exact PHRASE. Use +/- and " around phrases.
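The two merge steps (normalize to 0..1000, then add the scores of documents retrieved more than once) can be sketched as follows. The slides do not say how the normalization is done, so this sketch assumes each engine's scores are scaled so that its top result gets 1000. The URLs and raw scores are made up.

```python
from collections import defaultdict

def normalize(results, top=1000.0):
    """Scale one engine's scores so its best result gets `top` (assumed rule)."""
    best = max((score for _, score in results), default=0)
    if best <= 0:
        return []
    return [(url, top * score / best) for url, score in results]

def merge(result_lists):
    """Sum normalized scores per URL, so documents found by several engines rise."""
    totals = defaultdict(float)
    for results in result_lists:
        for url, score in normalize(results):
            totals[url] += score
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

if __name__ == "__main__":
    engine_1 = [("a", 80), ("b", 40)]   # raw scores on one engine's scale
    engine_2 = [("b", 50), ("c", 25)]   # a different scale on another engine
    print(merge([engine_1, engine_2]))
```

In this example document "b" is second on one engine and first on the other, and the summed score puts it at the top of the merged list, which is the "vote" effect.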
16Inference Find!
- Queries 6 search engines; currently uses WebCrawler, Yahoo!, Lycos, AltaVista, Infoseek, and Excite.
- Results are merged and clustered; redundancies are removed.
- Default is AND (can use OR and NOT; these are ignored in tools that don't support them)
- Allows phrases in
17Internet sleuth www.isleuth.com
- Users may search for an appropriate database (3000 available)
- Will search for appropriate database
- A search for databases with pictures (or recipes) finds a variety of databases
- Then users choose ones to search
- Does not merge results
18Dogpile
- Queries AltaVista, Excite, Excite Subj. Guide, GoTo.com, Infoseek, Lycos, Lycos' a2z, Magellan, The Mining Co., PlanetSearch, Thunderstone, WebCrawler, What-U-Seek, Yahoo!
19Dogpile
- Lists hits separately for each search tool queried; duplicates may occur
- If 10 or more hits are found among the first 3 tools tried, gives the option to search more.
- Click on a link to a search engine
20Cyber 411
- Fast. Contacts 15 search engines for each query.
- Query one word or phrase
- Does not merge results
21Savvysearch www.savvysearch.com
- (Colorado State, Howe)
- Search engines are selected based on:
- Query text
- Sources and types of information selected
- Estimated Internet traffic
- Anticipated response time
- The load on the CSU computer
22Research issues
- How to choose best DBs
- How to merge results
23Choosing the best databases automatically
- Depends on available information
- Different researchers and systems make different assumptions
- Choose DB X if it can provide good documents and if the user's query can be executed
24Stored queries/relevancy (Voorhees)
- Queries with relevant results are stored
- New query compared to stored queries
- Use previous results to select databases and the number of documents to merge from each
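A rough sketch of the stored-query idea follows. It is not the published method: the similarity measure (Jaccard overlap of query words), the stored data, and the proportional allocation rule are all assumptions chosen to keep the example short.

```python
def jaccard(a, b):
    """Word-overlap similarity between two queries (assumed measure)."""
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical stored queries: past query -> relevant documents found per database.
STORED = {
    "jazz music history": {"music_db": 8, "news_db": 2},
    "stock market news": {"music_db": 0, "news_db": 10},
}

def allocate(query, total_docs=10):
    """Decide how many documents to take from each database for a new query."""
    weights = {}
    for past_query, relevant in STORED.items():
        sim = jaccard(query, past_query)
        for db, count in relevant.items():
            weights[db] = weights.get(db, 0.0) + sim * count
    mass = sum(weights.values())
    if mass == 0:
        return {}  # no similar stored query: no basis for a choice
    return {db: round(total_docs * w / mass) for db, w in weights.items() if w > 0}

if __name__ == "__main__":
    print(allocate("history of jazz music"))
```

Because each share is rounded independently, the counts may not always add up to total_docs exactly; a fuller version would redistribute the remainder.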
25DB summary index (Callan)
- Collection information is available
- Commonly used keywords and their document frequencies (dfs)
- Query is compared to databases
- Similarity used to select databases and the number of documents from each
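Comparing a query to database summaries can be sketched as scoring each database the way one would score a document. The formula below is a simplified df-times-inverse-collection-frequency weighting chosen for illustration; it is not the exact ranking function of any published system, and the summaries are invented.

```python
import math

# Hypothetical summaries: database -> (number of documents, {term: df}).
SUMMARIES = {
    "cooking_db": (1000, {"recipe": 600, "pasta": 200, "image": 50}),
    "photo_db": (5000, {"image": 4000, "camera": 1500}),
    "news_db": (20000, {"image": 300, "recipe": 100, "election": 2500}),
}

def rank_databases(query):
    """Rank databases by a simple similarity between query and summary."""
    terms = query.lower().split()
    n_dbs = len(SUMMARIES)
    scores = {}
    for db, (n_docs, dfs) in SUMMARIES.items():
        score = 0.0
        for term in terms:
            # Number of databases whose summary contains the term.
            containing = sum(1 for _, d in SUMMARIES.values() if term in d)
            if containing == 0:
                continue
            icf = math.log(1 + n_dbs / containing)  # rarer across DBs counts more
            score += (dfs.get(term, 0) / n_docs) * icf
        scores[db] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

if __name__ == "__main__":
    for db, score in rank_databases("pasta recipe"):
        print(f"{db}: {score:.3f}")
```

The resulting scores could also drive how many documents to request from each database, for example in proportion to the score.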
26Gloss
- Assumes knowledge of each database's terms and their dfs
- Computes the probability of finding a document containing all of the query terms in a database
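One common way to turn dfs into such a probability is to assume that query terms occur independently, so the chance that a document contains all of them is the product of df/N over the terms. The sketch below uses that independence assumption; the database sizes and dfs are invented.

```python
def match_probability(query_terms, n_docs, dfs):
    """P(document contains every query term), assuming independent terms."""
    prob = 1.0
    for term in query_terms:
        prob *= dfs.get(term, 0) / n_docs
    return prob

def estimated_matches(query_terms, n_docs, dfs):
    """Expected number of documents containing all query terms."""
    return n_docs * match_probability(query_terms, n_docs, dfs)

if __name__ == "__main__":
    # Hypothetical databases: (size, {term: df}).
    databases = {
        "db_small": (1000, {"metasearch": 100, "engine": 500}),
        "db_large": (100000, {"metasearch": 200, "engine": 20000}),
    }
    query = ["metasearch", "engine"]
    for name, (n_docs, dfs) in databases.items():
        print(name, round(estimated_matches(query, n_docs, dfs), 2))
```

With these numbers the small database is estimated to hold about 50 matching documents and the large one about 40, so size alone does not decide which database is chosen.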
27Merging retrieval results
- Similarity values may not be available
- Similarity values may not be comparable
- Should similarity be modified when documents are retrieved by more than one search engine?
28Same search engine different databases
- Same ranking functions
- However, the same document gets a different similarity in each database because of different database characteristics (idfs)
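A small worked example shows the effect. It assumes a plain tf times log(N/df) score, which is only one of many possible ranking functions, and two invented collections of equal size in which the query term is common in one and rare in the other.

```python
import math

def tfidf_score(query_terms, doc_tf, n_docs, dfs):
    """Score one document with tf * log(N / df), using collection statistics."""
    score = 0.0
    for term in query_terms:
        df = dfs.get(term, 0)
        if df:
            score += doc_tf.get(term, 0) * math.log(n_docs / df)
    return score

if __name__ == "__main__":
    doc_tf = {"jaguar": 3}  # the same document in both collections
    query = ["jaguar"]
    # Car database: the term is common. General database: the term is rare.
    score_cars = tfidf_score(query, doc_tf, n_docs=1000, dfs={"jaguar": 500})
    score_general = tfidf_score(query, doc_tf, n_docs=1000, dfs={"jaguar": 10})
    print(f"cars: {score_cars:.2f}  general: {score_general:.2f}")
```

The document and the ranking function are identical, yet the score is several times higher in the general database, so raw scores from the two databases cannot be merged directly.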
29Experiments by Fox
- Used maximum score (good for relevant document)
- Used minimum score (good if non relevant)
- Sum of scores, average
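The four combination rules listed above can be compared side by side. In this sketch each run maps document ids to scores that are assumed to be already comparable; min and average are taken only over the engines that actually retrieved the document, which is one possible reading of the slide.

```python
from statistics import mean

COMBINERS = {"max": max, "min": min, "sum": sum, "avg": mean}

def combine(runs, method):
    """Combine per-engine scores for each document with the chosen rule."""
    scores = {}
    for run in runs:
        for doc, score in run.items():
            scores.setdefault(doc, []).append(score)
    fn = COMBINERS[method]
    return {doc: fn(values) for doc, values in scores.items()}

if __name__ == "__main__":
    runs = [
        {"d1": 0.5, "d2": 0.25},   # engine 1
        {"d1": 0.25, "d3": 0.75},  # engine 2
    ]
    for method in COMBINERS:
        print(method, combine(runs, method))
```

Only the sum rewards a document for being retrieved by more than one engine; max, min and average treat d1 no better than a document found once with the same score.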
30Difficulty of choosing and merging
- Search engines are constantly updated
- Interface changes
- Search changes
- Rank changes
- Display of results changes