CS 904: Natural Language Processing WEB SEARCH ENGINES

1
CS 904: Natural Language Processing
WEB SEARCH ENGINES
  • L. Venkata Subramaniam
  • January 31, 2002

2
The World Wide Web
  • The World Wide Web is estimated to contain more
    than seven billion pages of publicly-accessible
    information.
  • The Web continues to grow at an exponential rate,
    tripling in size over the past two years,
    according to one estimate.
  • All this data is uncatalogued and unclassified.

3
Is this a Library?
  • Definitely not!
  • Often there are no titles, no author names, no
    publication dates ...
  • There is no specific way of arranging the text:
    no classification or cataloguing.
  • New data appearing every day and some old data
    disappearing.

4
Variability in WWW pages
  • Documents on the web vary enormously, both
    internally and in the external meta information
    that might be available about them.
  • Documents differ internally in their language,
    vocabulary, type or format, etc.
  • External meta information includes reputation of
    the source, update frequency, quality,
    popularity or usage, and citations.

5
How do we search the WWW?
  • Subject directories allow the user to browse
    lists of WWW sites that are organized into
    hierarchical indexes of subject categories.
  • Search engines allow the user to enter keywords
    that are run against a database.
  • Many variants exist: general directories,
    subject-specific directories, general search
    engines, multi-threaded search engines, and
    subject-specific search engines.

6
What a Search Engine does not do
  • A search engine does not search the whole WWW.
  • As of Jan 31, 2002, Google reports its size as
    2,073,418,204 pages. This is not the complete
    WWW.
  • A search engine is not searching the Internet
    "live," as it exists at this very moment.
  • Its database is updated every few hours, days,
    or even months.

7
How it works
  • The three parts of a search engine:
  • A mechanism that identifies web pages to be
    included in the database.
  • A mechanism that indexes the sites.
  • A searching mechanism with an interface, which
    scans for keywords within the index.
  • At run time:
  • Users search the index through queries.
  • Documents in which the search terms occur are
    presented as "hits."
  • The documents are listed according to some
    relevance criteria.

8
Indexing
  • A search engine uses its index to retrieve web
    documents in which your search terms occur.
  • Hit list: the index lists each term and where it
    occurs (the URL or address of the web page,
    position in the page, font, capitalization,
    etc.), much like a book index.
  • Every single word is included in the index!
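
A minimal sketch of such an index, assuming pages are already available as plain text. The function name build_index, the example URLs, and the choice to record only position and capitalization (not font) are illustrative assumptions, not how any particular engine stores its hit lists.

```python
import re
from collections import defaultdict

def build_index(pages):
    """Map each lowercased word to a hit list of (url, position, capitalized)."""
    index = defaultdict(list)
    for url, text in pages.items():
        words = re.findall(r"[A-Za-z0-9']+", text)
        for position, word in enumerate(words):
            # Every single word is indexed, with where and how it occurred.
            index[word.lower()].append((url, position, word[0].isupper()))
    return index

# Hypothetical pages, used only for illustration.
pages = {
    "http://example.com/a": "Search engines index the Web",
    "http://example.com/b": "A web index is like a book index",
}
index = build_index(pages)
for hit in index["index"]:
    print(hit)
```

Looking up a term is then a single dictionary access, which is why the engine consults the index rather than the pages themselves.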

9
Hand Indexing and Automatic Indexing
  • Human-maintained indices (e.g. Yahoo!) cover
    popular topics effectively but are subjective,
    expensive to build and maintain, slow to
    improve, and cannot cover all esoteric topics.
  • Automatically generated indices (e.g. Google)
    can return low-quality matches and can be
    misled.

10
A 'bot
  • Also called an intelligent agent, spider,
    crawler, robot, or worm.
  • An automated device (software) which may be
    programmed to search for terms (data "strings")
    matching certain criteria.
  • One 'bot identifies and notes the URLs of web
    pages to be included in the database.
  • Another 'bot then works on the interiors of the
    web documents, recording occurrences of words
    and their positions within the text.
  • This information is used to create a huge index.
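
The first kind of 'bot, the one that collects URLs, can be sketched roughly as follows. To keep the example runnable without network access, the pages live in a hypothetical in-memory dictionary (SITE) and the fetch function is passed in as a parameter; a real crawler would fetch over HTTP and respect robots.txt. The names LinkParser and crawl are assumptions made for this sketch.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkParser(HTMLParser):
    """Collect the href targets of <a> tags on one page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, fetch, limit=100):
    """Breadth-first crawl from seed; fetch(url) returns HTML or None."""
    seen, queue, found = {seed}, deque([seed]), []
    while queue and len(found) < limit:
        url = queue.popleft()
        html = fetch(url)
        if html is None:          # page does not exist: skip it
            continue
        found.append(url)         # note the URL for the database
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            target = urljoin(url, href)
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return found

# Hypothetical three-page site held in memory instead of fetched over HTTP.
SITE = {
    "http://example.com/": '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "http://example.com/a.html": '<a href="/b.html">B</a>',
    "http://example.com/b.html": '<a href="/">home</a> <a href="/missing.html">?</a>',
}
urls = crawl("http://example.com/", SITE.get)
print(urls)
```

The second 'bot would then take each URL in this list and record word occurrences and positions for the index.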

11
Querying
  • The query terms are treated as keywords to be
    found in the documents.
  • In second-generation web search engines,
    natural language queries are understood and then
    acted upon.

12
Relevance (Results Ranking)
  • Relevance is calculated based on how many times
    the search terms are found in the site.
  • The engine also notes where each term occurs
    within the text and assigns that position a
    "weight" or level of importance.
  • Search terms occurring in the title, summary, in
    key positions within a paragraph or appearing
    several times within a paragraph usually carry
    more "weight."
  • For multiple terms, higher weights are given
    when the terms appear closer together.
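
These three signals (frequency, position, proximity) can be combined in a toy scoring function. The weights below (title_weight, proximity_weight) and the way the signals are added together are assumptions chosen for illustration; the slides do not say how any real engine combines them.

```python
import re

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def relevance(title, body, terms, title_weight=3.0, proximity_weight=2.0):
    """Toy score: term frequency + title bonus + closeness bonus."""
    terms = [t.lower() for t in terms]
    title_words, body_words = tokenize(title), tokenize(body)
    score = 0.0
    for term in terms:
        score += body_words.count(term)                  # how often it occurs
        score += title_weight * title_words.count(term)  # where it occurs
    # Proximity: reward the smallest gap between two different query terms.
    positions = {t: [i for i, w in enumerate(body_words) if w == t] for t in terms}
    gaps = [abs(i - j)
            for a in terms for b in terms if a < b
            for i in positions[a] for j in positions[b]]
    if gaps:
        score += proximity_weight / min(gaps)
    return score

near = relevance("Web search engines",
                 "A search engine builds an index",
                 ["search", "engine"])
far = relevance("Notes",
                "search is hard and every engine differs",
                ["search", "engine"])
print(near, far)
```

The first page scores higher because one term also appears in the title and the two terms are adjacent in the text.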

13
Relevance (Cont.)
  • Incorporating the popularity element.
  • Looking at how many links a web document has from
    other websites, and also the quality of the
    referring websites.
  • Ranking according to sites other searchers have
    chosen from their results to similar queries.
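
One well-known way to combine the number of inbound links with the quality of the referring pages is a PageRank-style iteration, in which a page passes a share of its own score to every page it links to. The sketch below is a simplified version under stated assumptions: the graph is hypothetical, and the damping factor of 0.85 and the fixed iteration count are conventional choices rather than values taken from the slides.

```python
def link_scores(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    score = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            # A page with no outgoing links spreads its score evenly.
            targets = links.get(page) or pages
            share = damping * score[page] / len(targets)
            for t in targets:
                new[t] += share
        score = new
    return score

# Hypothetical link graph: a, b and d all link to c; c links back to a.
graph = {
    "a": ["c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}
scores = link_scores(graph)
best = max(scores, key=scores.get)
print(best, scores)
```

Page c ranks first because it has the most inbound links, and page a outranks b and d because its single inbound link comes from the highly ranked c.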

14
Query Evaluation
  • Parse the query.
  • Scan through the documents to find those matching
    the query.
  • Rank the documents that matched the query and
    present the top K.
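
The three steps can be sketched as a single function. This is an illustration under simple assumptions: documents are a hypothetical in-memory dictionary, a match requires every query term to occur, and ranking uses raw term counts. A production engine would look terms up in the index instead of scanning each document.

```python
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def evaluate(query, documents, k=2):
    terms = tokenize(query)                         # 1. parse the query
    scored = []
    for doc_id, text in documents.items():          # 2. scan for matches
        counts = Counter(tokenize(text))
        if all(counts[t] > 0 for t in terms):
            scored.append((sum(counts[t] for t in terms), doc_id))
    # 3. rank by score (ties broken by id) and present the top K
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:k]]

# Hypothetical document collection.
docs = {
    "d1": "web search engines index the web",
    "d2": "search the web with a web search engine, web wide",
    "d3": "library catalogue",
    "d4": "web pages",
}
print(evaluate("Web search", docs))
```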

15
Examples of search engines
  • AltaVista (http://www.altavista.com)
  • Excite (http://www.excite.com/search)
  • FAST (http://www.alltheweb.com)
  • Google (http://www.google.com)
  • HotBot (http://hotbot.lycos.com)
  • Northern Light (http://www.northernlight.com)

16
Evaluating a Search Engine
  • Quality of results
  • Coverage
  • Scalability
  • Efficiency in Storage and Retrieval
  • Query handling speed
  • Interface quality and ease

Good quality results. Efficient crawling,
indexing and searching.

17
Techniques in Searching
  • Extend traditional IR techniques.
  • For example, the standard vector space model
    tries to return the document that most closely
    approximates the query, given that both query and
    document are vectors defined by their word
    occurrence. On the web, this strategy often
    returns very short documents that are the query
    plus a few words.
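
The effect is easy to reproduce with cosine similarity over raw word counts. The documents below are made up for illustration, and the helper names vector and cosine are assumptions of this sketch.

```python
import math
import re
from collections import Counter

def vector(text):
    """Represent a text as a bag of lowercased word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

query = vector("web search engines")
short_doc = vector("web search engines here")
long_doc = vector("this survey explains how web search engines crawl pages "
                  "build an index and rank results for users")

print(cosine(query, short_doc))
print(cosine(query, long_doc))
```

The four-word page scores far higher than the informative survey, even though both contain every query term, because every extra word in a document lowers its cosine with the query.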

18
Google
  • As of Jan 31, 2002, Google reports its size as
    2,073,418,204 pages.
  • Automatic indexing of pages.
  • Implemented in C/C++.
  • In addition to keyword locations on a page, it
    makes use of the link structure of the Web:
  • To improve search results.
  • To calculate a quality ranking for each web
    page.
  • Data structures designed to avoid disk seeks
    whenever possible.

19
Yahoo!
  • Human maintained indexes
  • Covers popular topics
  • Subjective
  • Expensive to build and maintain
  • Slow to improve
  • Cannot cover all esoteric topics
  • In Yahoo!, you are searching only the title and
    the short descriptive blurb about the site; by
    contrast, search engines usually give you access
    to the full text of the document.

20
AltaVista
  • Offers spell check.
  • Recognizes capitalization and proper nouns.
  • Offers search in numerous languages.
  • Ranks according to how many of the search terms a
    page contains, where in the document they occur,
    and how close to one another they are.

21
Ask Jeeves
  • Ask Jeeves is a natural language search engine
    which attempts to resolve user questions into
    appropriate answers.
  • Does semantic and syntactic processing of the
    query to understand the question.
  • Learns from previous interactions with other
    users to get to popular resources.
  • Guides (interacts with) the user into asking
    "useful" questions.
  • Retrieves the sites with the best answers.
  • It is a multi-threaded search engine.

22
References
  • Technical paper on Google:
    http://www7.scu.edu.au/programme/fullpapers/1921/com1921.htm