Introduction to WWW Application - PowerPoint PPT Presentation

Slides: 29
Provided by: people48
Transcript and Presenter's Notes

Title: Introduction to WWW Application


1
Chapter 5
  • Introduction to WWW Application

2
WWW Applications
  • Search Engine / Meta-Search Engine
  • Web Data Mining
  • Bots and Internet Intelligent Agents
  • Electronic Commerce
  • Web Titles
  • e-Learning

3
Section 5-1
  • Search Engine / Meta-Search Engine

4
What is a Search Engine?
  • A mechanism that helps users find online
    resources quickly.

5
Popular Search Engines
  • AltaVista (http://www.altavista.com)
  • Excite (http://www.excite.com)
  • Google (http://www.google.com)
  • HotBot (http://www.hotbot.com)
  • Lycos (http://www.lycos.com)
  • Yahoo! (http://www.yahoo.com)
  • WebCrawler (http://www.webcrawler.com)
  • Openfind, GAIS, Yam, etc.

6
Types of Search Tools
  • Search Engines / Meta-Search Engines
    • Search Engine: Google
    • Meta-Search Engine: Metacrawler, SavvySearch
  • Subject Directories
    • Yahoo!
  • Specialized Databases (The Invisible Web)
    • Librarian's Index

7
How to choose a starting point?
  • Search Engines
    • Advantage: Can be fast.
    • Disadvantage: Irrelevant information can
      overwhelm useful information. (A good choice of
      keywords can help here.)
  • Specialized Web Site
    • Advantage: Leads to information inaccessible to
      search engines.
    • Disadvantage: May not exist for your topic.

8
How to choose a starting point? (Cont.)
  • FAQ
    • Advantage: A great place to start.
    • Disadvantage: Not all topics have FAQs.
  • Guess
    • Advantage: Can be very fast.
    • Disadvantage: Requires experience and intuition.
  • Discussion group
    • Advantage: Reaches a community of experts.
    • Disadvantage: Relatively slow. Experts may tire
      of beginner questions.

9
Search Engine Catalog
  • A catalog is the set of Web pages that a search
    engine knows how to find. Also called a database
    or index.
  • A search engine can find only the Web pages in
    its catalog.
  • No catalog covers the entire Internet. Because
    the Internet keeps changing, catalogs are never
    completely up to date.

10
Give a query, get a hit
  • A keyword is a word, partial word, or phrase that
    you can give to a search engine. Also called a
    search term.
  • A query is one or more keywords that, together,
    represent the concept that you want to find on
    the Net. Also called a search string.
  • A hit is a Web page in the catalog that matches
    your query. Also called a match.
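
The relationship between catalog, query, and hit can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the catalog is modeled as an inverted index (word to set of pages), a query matches a page only if every keyword appears in it, and the page URLs and the names `build_catalog` and `search` are made up for the example.

```python
from collections import defaultdict

def build_catalog(pages):
    """Map each lowercase word to the set of page URLs that contain it."""
    catalog = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            catalog[word].add(url)
    return catalog

def search(catalog, query):
    """Return the hits: pages that contain every keyword in the query."""
    keywords = query.lower().split()
    if not keywords:
        return set()
    hits = set(catalog.get(keywords[0], set()))
    for keyword in keywords[1:]:
        hits &= catalog.get(keyword, set())
    return hits

# Hypothetical pages standing in for the part of the Web the engine knows.
pages = {
    "http://example.com/a": "web search engines index pages",
    "http://example.com/b": "meta search engines forward queries",
    "http://example.com/c": "electronic commerce on the web",
}
catalog = build_catalog(pages)
print(sorted(search(catalog, "search engines")))
```

A page that was never added to `pages` can never be returned, which mirrors the point that a search engine finds only what is in its catalog.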

11
Techniques to build catalogs
  • Active Search Engine
    • Collects Web page information by itself.
    • Uses a program called a spider (also called a
      robot, wanderer, or crawler) that travels around
      the Net, locates Web pages, and adds entries to
      the catalog.
    • Some spiders run all the time, adding information
      to the catalog on a regular basis. Others run
      less frequently, perhaps updating the catalog
      weekly or monthly.
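
A spider can be sketched as a graph traversal. The example below is a minimal illustration, not how any particular engine works: it assumes a breadth-first visiting order, and it walks a hypothetical in-memory link graph (`LINKS`) instead of fetching real pages.

```python
from collections import deque

# Hypothetical link graph standing in for the Web: page -> pages it links to.
LINKS = {
    "home": ["news", "about"],
    "news": ["story1", "home"],
    "about": [],
    "story1": ["news"],
    "orphan": ["home"],  # nothing links here, so the spider never finds it
}

def crawl(start, fetch_links, max_pages=100):
    """Breadth-first traversal; returns pages in the order they were cataloged."""
    catalog = []
    seen = {start}
    queue = deque([start])
    while queue and len(catalog) < max_pages:
        page = queue.popleft()
        catalog.append(page)
        for link in fetch_links(page):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return catalog

print(crawl("home", lambda page: LINKS.get(page, [])))
```

The `orphan` page shows why an active catalog is never complete: a page that no cataloged page links to is invisible to the spider.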

12
(No Transcript)
13
Techniques to build catalogs (Cont.)
  • Passive Search Engine
    • Does not seek out Web pages by itself.
    • Allows people to register their Web pages, usually
      by filling out a form online. Once a page is
      registered with the search engine, the page can
      be found by queries.
  • Some search engines have both active and passive
    features. They use a spider to gather
    information, but also allow users to register
    pages.

14
Techniques to build catalogs (Cont.)
  • Meta-Search Engine
    • Does not catalog any Web pages itself.
    • Forwards users' queries to other search engines,
      which do the actual work.
    • When results come back from the other search
      engines, the meta-search engine presents them to
      the user, possibly summarizing them or at least
      giving them a consistent appearance.
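
The forwarding-and-merging behavior described above can be sketched as follows. The two engines are hypothetical stand-in functions that return fixed results; a real meta-search engine would send the query over the network, and the duplicate-removal step is an assumption about what "summarizing" might involve.

```python
def engine_a(query):
    # Hypothetical stand-in for a real search engine's result list.
    return ["http://example.com/1", "http://example.com/2"]

def engine_b(query):
    # A second hypothetical engine with partly overlapping results.
    return ["http://example.com/2", "http://example.com/3"]

def meta_search(query, engines):
    """Forward the query to each engine and merge the hits, dropping duplicates."""
    merged = []
    seen = set()
    for name, engine in engines.items():
        for url in engine(query):
            if url not in seen:
                seen.add(url)
                # Every hit gets the same shape, whichever engine produced it.
                merged.append({"url": url, "source": name})
    return merged

ENGINES = {"A": engine_a, "B": engine_b}
for hit in meta_search("web mining", ENGINES):
    print(hit["source"], hit["url"])
```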

15
(No Transcript)
16
Comparison of Search Engines
  • Active Search Engine
    • Advantage: Large catalog.
    • Disadvantage: Too many hits.
  • Passive Search Engine
    • Advantage: Possibly more organized.
    • Disadvantage: Smaller catalog; items may be
      cataloged in unexpected places.
  • Meta-Search Engine
    • Advantage: One query goes a long way.
    • Disadvantage: Longer search time.

17
Choose keywords with care
  • The success of a Web search depends heavily on
    the keywords you choose. Be sure to watch out
    for:
    • Misspellings
    • Alternate spellings
    • Synonyms
    • Word forms

18
The forms of advanced query
Concept | Appearance | Meaning
And | AND | Match all of these keywords.
Or | OR | Match at least one of these keywords.
Not | NOT, - | Match if this keyword is not present.
Some | Usually an on/off switch | Only some of the keywords must be matched.
Required keyword | | Along with the Some operator, indicates a keyword that must be matched.
Near | NEAR | Match these keywords if they are near each other.
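
The And, Or, Not, and Near operators map naturally onto set operations. The sketch below is an illustration only: the pages in `PAGES`, the function names, and the default Near distance of two words are all assumptions made for the example, since the slides do not define how close "near" is.

```python
# Hypothetical catalog: page id -> page text.
PAGES = {
    "p1": "python web crawler tutorial",
    "p2": "web search engine ranking",
    "p3": "python search library",
}

def pages_with(keyword):
    """Set of page ids whose text contains the keyword as a whole word."""
    return {pid for pid, text in PAGES.items() if keyword in text.split()}

def match_and(*keywords):
    """And: every keyword must be present."""
    return set.intersection(*(pages_with(k) for k in keywords))

def match_or(*keywords):
    """Or: at least one keyword must be present."""
    return set.union(*(pages_with(k) for k in keywords))

def match_not(hits, keyword):
    """Not: drop the hits that contain the keyword."""
    return hits - pages_with(keyword)

def match_near(first, second, distance=2):
    """Near: both keywords occur within `distance` words of each other."""
    hits = set()
    for pid, text in PAGES.items():
        words = text.split()
        pos1 = [i for i, w in enumerate(words) if w == first]
        pos2 = [i for i, w in enumerate(words) if w == second]
        if any(abs(i - j) <= distance for i in pos1 for j in pos2):
            hits.add(pid)
    return hits

print(match_and("python", "web"))
print(match_not(match_or("python", "web"), "search"))
```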
19
The forms of advanced query (Cont.)
Concept | Appearance | Meaning
Adjacent | Quotation marks | Match these keywords if they are next to each other, in order.
Grouping | Parentheses | Try to match these keywords before matching the rest of the keywords.
Allow misspellings | Usually an on/off switch | Match words that are almost the same spelling as these keywords.
Allow partial words | Usually an on/off switch | Also called substring match. Match words that contain your keyword.
20
The forms of advanced query (Cont.)
Concept | Appearance | Meaning
Case sensitivity | Usually an on/off switch | Ignore or obey capitalization when matching words.
Wildcard | | Match anything.
Limit search | Usually an on/off switch | Search only part of the search engine's catalog.
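
Several of the matching options in these tables can be tried out with the Python standard library. This is a rough sketch under assumptions: the word list is invented, `*` is used as the wildcard character because that is the `fnmatch` convention (the slides do not say which symbol an engine uses), and "almost the same spelling" is approximated with `difflib`'s similarity ratio and an arbitrary cutoff of 0.8.

```python
import difflib
import fnmatch

# Hypothetical words drawn from a catalog.
WORDS = ["search", "Search", "research", "searching", "serch", "engine"]

def partial_match(keyword, words):
    """Allow partial words (substring match): words that contain the keyword."""
    return [w for w in words if keyword in w]

def wildcard_match(pattern, words):
    """Wildcard: '*' in the pattern matches any run of characters (case-sensitive)."""
    return [w for w in words if fnmatch.fnmatchcase(w, pattern)]

def case_match(keyword, words, case_sensitive=False):
    """Case sensitivity switch: obey or ignore capitalization."""
    if case_sensitive:
        return [w for w in words if w == keyword]
    return [w for w in words if w.lower() == keyword.lower()]

def misspelling_match(keyword, words, cutoff=0.8):
    """Allow misspellings: words whose spelling is close to the keyword."""
    return difflib.get_close_matches(keyword, words, n=len(words), cutoff=cutoff)

print(partial_match("search", WORDS))
print(wildcard_match("search*", WORDS))
print(case_match("search", WORDS))
print(misspelling_match("search", WORDS))
```

Note how the substring match also returns "research", which is the kind of irrelevant hit that partial-word matching tends to produce.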
21
Search Strategies
  • General search: When you know little
    about your topic.
  • Specific search: When you know a lot
    about your topic.
  • Incremental search: Zeroing in on your
    topic.
  • Substring search: Matching several similar
    keywords at once.
  • Search-and-jump: A speedy, two-part
    search technique.
  • Category search: Convenient browsing of a
    topic area.
  • Search-and-rank: Locating the most
    relevant hits first.

22
Comparison of search strategies
Strategy | Advantage | Disadvantage
General search | Likely to get a relevant hit. | Likely to get many irrelevant hits too.
Specific search | Hits are more likely to be relevant. | Low odds of getting a hit.
Incremental search | Zero in on your goal. | Multiple queries are time-consuming.
Substring search | Can simplify queries. | Likely to produce irrelevant hits.
Search-and-jump | Faster than multiple queries. | Download time may be longer; less powerful than multiple queries.
Category search | Logical, organized, great for browsing. | Relies on the skill of the organizer, whose world view may or may not match yours.
Search-and-rank | Lists the most relevant hits first. | Effective ranking functions are still undiscovered.
23
Some Meta-Search Engines
  • WebCrawler
    • Characteristics
      • It uses a content-based, full-text indexing
        system to provide a high-quality index.
      • It uses a breadth-first search strategy to create
        a broad index.
      • It tries to include as many Web servers as
        possible.

24
Some Meta-Search Engines (Cont.)
  • Architecture
    • The search engine.
    • The agents.
    • The database.
    • The query server.

25
Some Meta-Search Engines (Cont.)
  • Lycos
    • It extracts the following pieces of information
      from each document that it retrieves:
      • Title
      • Headings and subheadings
      • 100 most important words
      • First 20 lines
      • Size in bytes
      • Number of words
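
To make the list above concrete, here is a rough sketch of pulling most of those fields out of an HTML page with Python's standard `html.parser`. It is not Lycos's actual code: the class and function names are hypothetical, "lines" are taken to be non-empty text lines outside the title, and the 100 most important words are left out because they depend on a weighting scheme.

```python
from html.parser import HTMLParser

class SummaryParser(HTMLParser):
    """Collects the title, the headings, and the body text of an HTML page."""
    HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.title = ""
        self.headings = []
        self.text_parts = []
        self._current = None  # tag whose text is currently being captured

    def handle_starttag(self, tag, attrs):
        if tag == "title" or tag in self.HEADING_TAGS:
            self._current = tag
            if tag != "title":
                self.headings.append("")

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current == "title":
            self.title += data
            return  # title text is kept out of the body text
        if self._current in self.HEADING_TAGS:
            self.headings[-1] += data
        self.text_parts.append(data)

def summarize(html, max_lines=20):
    """Return the per-document fields as a dict."""
    parser = SummaryParser()
    parser.feed(html)
    text = "".join(parser.text_parts)
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    return {
        "title": parser.title.strip(),
        "headings": [h.strip() for h in parser.headings],
        "first_lines": lines[:max_lines],
        "size_bytes": len(html.encode("utf-8")),
        "word_count": len(text.split()),
    }

# Hypothetical page used only for the demonstration.
SAMPLE = (
    "<html><head><title>Search Tips</title></head>\n"
    "<body><h1>Keywords</h1>\n"
    "<p>Choose keywords with care.</p>\n"
    "<h2>Synonyms</h2>\n"
    "<p>Try synonyms too.</p></body></html>"
)
print(summarize(SAMPLE))
```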

26
Some Meta-Search Engines (Cont.)
  • The 100 important words are selected using the
    Tf-IDf weighting algorithm.
  • Tf (Term Frequency) is the number of occurrences
    of a particular term in a document.
  • Df (Document Frequency) is the number of
    documents in the collection in which a particular
    term occurs.
  • IDf (Inverse Document Frequency)
    • N = the number of documents in a collection
    • IDf = log(N / Df)
  • weight = Tf × IDf = Tf × log(N / Df)
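
The weighting formula can be tried out directly. The sketch below is a minimal illustration, not Lycos's implementation: it assumes Tf is counted per document, uses the natural logarithm (the slides do not state a base), splits on whitespace, and uses made-up documents and the hypothetical helper names `tf_idf_weights` and `top_terms`.

```python
import math
from collections import Counter

def tf_idf_weights(documents):
    """Return one {term: weight} dict per document, weight = Tf * log(N / Df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Df: number of documents in which each term occurs at least once.
    df = Counter()
    for words in tokenized:
        df.update(set(words))
    weights = []
    for words in tokenized:
        tf = Counter(words)  # Tf: occurrences of each term in this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

def top_terms(weight_map, k=100):
    """The k highest-weighted terms, ties broken alphabetically."""
    return sorted(weight_map, key=lambda t: (-weight_map[t], t))[:k]

docs = [
    "web search engine web crawler",
    "web commerce site",
    "search engine ranking",
]
weights = tf_idf_weights(docs)
print(top_terms(weights[0], k=2))
```

In the first document, "crawler" outranks "web" even though "web" occurs twice, because "web" also appears in another document and so has a lower IDf.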

27
Some Meta-Search Engines (Cont.)
  • Harvest
    • It is an integrated tool that provides a
      scalable, customizable architecture for
      gathering, indexing, caching, replicating, and
      accessing Internet information.

28
Some Meta-Search Engines (Cont.)
  • Subsystems
    • Gatherer: collects indexing information.
    • Broker: provides a flexible interface to gathered
      information.
    • Index/Search subsystem: allows the information
      space to be flexibly indexed and searched in a
      variety of ways.
    • Object Cache: stores contents of retrieved objects
      to alleviate access bottlenecks to popular data.
    • Replicator: mirrors index information of Brokers
      to alleviate server bottlenecks.