Searching the web - PowerPoint PPT Presentation

About This Presentation
Title:

Searching the web

Slides: 13
Provided by: paulaam
Category:
Tags: alta | searching | vista | web

Transcript and Presenter's Notes

Title: Searching the web


1
Searching the web
  • Enormous amount of information
  • In 1994, 100 thousand pages indexed
  • In 1997, 100 million pages indexed
  • In June, 2000, 500 million pages indexed
  • In January, 2001, 1.3 billion pages indexed
  • Estimate is that this is around 10-15% of the web
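
The last two figures imply a rough size for the whole web. A minimal sketch of that arithmetic, assuming the estimate refers to the January 2001 count and that "10-15" means 10 to 15 percent (the variable names are illustrative):

```python
# January 2001 figure from the slide.
indexed_pages = 1.3e9

# Assumption: the slide's "10-15" is a percentage of the whole web.
low_share, high_share = 0.10, 0.15


def implied_web_size(indexed, share):
    """Total pages implied if `indexed` pages make up `share` of the web."""
    return indexed / share


smallest = implied_web_size(indexed_pages, high_share)
largest = implied_web_size(indexed_pages, low_share)
print(f"Implied web size: {smallest / 1e9:.1f} to {largest / 1e9:.1f} billion pages")
```

Under those assumptions the web of early 2001 would have held very roughly 9 to 13 billion pages.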

2
Finding Out About
  • People interact with all that information because
    they want to KNOW something: there is a question
    they are trying to answer or a piece of
    information they want
  • Simplest approach:
  • Knowledge is organized into chunks (pages)
  • Goal is to return appropriate chunks

3
Search Engines
  • Goal of search engine is to return appropriate
    chunks
  • Steps involved include:
  • asking a question
  • finding answers
  • evaluating answers
  • presenting answers
  • Value of a search engine depends on how well it
    does on all of these.

4
Asking a question
  • Reflect some information need
  • Query Syntax needs to allow information need to
    be expressed
  • Keywords
  • Combining terms
  • Simple: required, NOT (+ and -)
  • Boolean expressions with and/or/not and nested
    parentheses
  • Variations: strings, NEAR, capitalization.
  • Simplest syntax that works
  • Typically more acceptable if predictable
  • Another set of problems when information isn't
    text: graphics, music
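
The simple required/NOT syntax above can be sketched in a few lines. This is only an illustration, assuming a leading + marks a required term, a leading - marks an excluded term, double quotes mark a string, and matching ignores capitalization; `parse_query` and `matches` are hypothetical names, not any engine's API.

```python
import re


def parse_query(query):
    """Split a query into required (+), excluded (-), and optional terms."""
    parsed = {"required": [], "excluded": [], "optional": []}
    # Each token is an optional sign followed by a quoted string or a bare word.
    for sign, phrase, word in re.findall(r'([+-]?)(?:"([^"]*)"|(\S+))', query):
        term = (phrase or word).lower()
        if not term:
            continue
        key = {"+": "required", "-": "excluded"}.get(sign, "optional")
        parsed[key].append(term)
    return parsed


def matches(text, parsed):
    """True if text has every required term and none of the excluded ones."""
    lowered = text.lower()
    words = set(re.findall(r"\w+", lowered))

    def present(term):
        # Quoted strings are matched as substrings, single words as whole words.
        return term in lowered if " " in term else term in words

    return (all(present(t) for t in parsed["required"])
            and not any(present(t) for t in parsed["excluded"]))


query = parse_query('+jaguar -car "big cat" habitat')
print(query)
print(matches("The jaguar is a big cat", query))
print(matches("Jaguar car dealership", query))
```

Optional terms are collected but not used for filtering here; a real engine would typically use them for ranking.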

5
Finding the Information
  • Goal is to retrieve all relevant chunks. Too
    time-consuming to do in real-time, so search
    engines index pages.
  • Two basic approaches
  • Index and classify by hand
  • Automate
  • For BOTH approaches, deciding what to index on
    (e.g., what is a keyword) is a significant issue.
  • Most major search sites now provide both

6
Indexing by Hand
  • Indexing by hand involves having a person look at
    web pages and assign them to categories.
  • Assumes a hierarchy of categories exists into
    which pages are placed
  • Each document can go into multiple categories
  • Produces very high quality indices
  • Can retrieve by browsing the hierarchy
  • Very expensive to create.
  • YAHOO is the best-known early example
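
A hand-built index of this kind can be pictured as a mapping from category paths to pages. The sketch below is illustrative only: the category names, URLs, and function names are made up, and it does not describe how any real directory was stored.

```python
# Hypothetical hand-built directory: category path -> pages filed there.
# All category names and URLs are made up for illustration.
DIRECTORY = {
    ("Science", "Biology"): ["cats.example", "cells.example"],
    ("Recreation", "Pets"): ["cats.example", "dogs.example"],
    ("Recreation", "Travel"): ["maps.example"],
}


def subcategories(path=()):
    """Names of the categories one level below `path`."""
    depth = len(path)
    return sorted({p[depth] for p in DIRECTORY
                   if len(p) > depth and p[:depth] == path})


def pages_in(path):
    """Pages filed directly under the category `path`."""
    return DIRECTORY.get(path, [])


def categories_of(url):
    """Every category a page was filed under (a page may have several)."""
    return sorted(path for path, pages in DIRECTORY.items() if url in pages)


print(subcategories())                  # top-level categories
print(subcategories(("Recreation",)))   # browse one level down
print(pages_in(("Recreation", "Pets")))
print(categories_of("cats.example"))    # one page, two categories
```

Browsing is just repeated calls to `subcategories` and `pages_in`; the expensive part is the human effort of filling in the mapping.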

7
Automated Indexing
  • Automated indexing involves parsing documents to
    pull out key words and creating a table which
    links keywords to documents
  • Doesn't have any predefined categories or
    keywords
  • Can cover a much higher proportion of the web
  • Can update more quickly
  • Much lower quality, therefore important to have
    some kind of relevance ranking
  • Alta Vista was a well-known early example
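
The table that links keywords to documents is commonly called an inverted index. A minimal sketch, assuming a keyword is any run of letters or digits and that documents are short in-memory strings (the sample documents and names are hypothetical):

```python
import re
from collections import defaultdict

# Made-up sample documents, keyed by a document id.
DOCS = {
    "doc1": "Alta Vista indexes web pages automatically",
    "doc2": "Yahoo editors classify web pages by hand",
}


def build_index(documents):
    """Map each keyword to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        # Assumption: a keyword is a lowercase run of letters or digits.
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(doc_id)
    return index


index = build_index(DOCS)
print(sorted(index["web"]))
print(sorted(index["yahoo"]))
```

Nothing here depends on predefined categories: the vocabulary is whatever words the parser finds, which is why quality depends so heavily on ranking.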

8
Automating Search
  • We will focus on automated search and indexing.
  • Always balancing various factors
  • Recall and Precision
  • If there are 100 relevant documents and you find
    50, your recall is 50%.
  • If you find 100 documents, and 10 of them are on
    topic, your precision is 10%.
  • Which is more important varies with query and
    with coverage
  • Speed, storage, completeness, timeliness
  • How fast can you locate and index documents?
    Answer a query?
  • How much room do you need on your server?
  • What percent of the web do you cover?
  • How many dead links do you have? How long before
    information is found by your search engine?
  • Ease of use vs power of queries
  • Full Boolean queries: very rich, very confusing.
    Alta Vista Advanced Search.
  • Simplest is 'and'ing together keywords: fast,
    straightforward. Google
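
The recall and precision figures above follow from two ratios: recall is relevant documents found divided by all relevant documents, and precision is relevant documents found divided by all documents returned. A small sketch using sets of document ids (the ids are made up, and the two slide examples are treated as separate scenarios):

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for sets of retrieved and relevant doc ids."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Scenario 1: 100 relevant documents exist and the search finds 50 of them.
relevant = set(range(100))
retrieved = set(range(50))
_, recall = precision_recall(retrieved, relevant)
print(f"recall = {recall:.0%}")

# Scenario 2: the search returns 100 documents and 10 of them are on topic.
retrieved = set(range(100))
relevant = set(range(90, 110))  # assumed: only ids 90-99 overlap the results
precision, _ = precision_recall(retrieved, relevant)
print(f"precision = {precision:.0%}")
```

Returning more documents tends to raise recall and lower precision, which is the balance the slide describes.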

9
Search Engine Basics
  • A spider or crawler starts at a web page,
    identifies all links on it, and follows them to
    new web pages.
  • A parser processes each web page and extracts
    individual words.
  • An indexer creates/updates a hash table which
    connects words with documents
  • A searcher uses the hash table to retrieve
    documents based on words
  • A ranking system decides the order in which to
    present the documents: their relevance
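
The five components above can be wired together in a toy example. This sketch makes several assumptions that are not in the slides: the "web" is a small in-memory dictionary rather than real HTTP, the searcher 'and's the keywords together, and ranking is by raw keyword counts. All URLs and function names are hypothetical.

```python
import re
from collections import Counter, defaultdict, deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
WEB = {
    "a.example": ("Search engines index web pages", ["b.example", "c.example"]),
    "b.example": ("A spider follows links between web pages", ["a.example"]),
    "c.example": ("Ranking orders pages; pages with more matches rank higher", []),
}


def crawl(start):
    """Spider: follow links breadth-first, yielding each page once."""
    seen, queue = {start}, deque([start])
    while queue:
        url = queue.popleft()
        text, links = WEB[url]
        yield url, text
        for link in links:
            if link in WEB and link not in seen:
                seen.add(link)
                queue.append(link)


def parse(text):
    """Parser: extract individual lowercase words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(start):
    """Indexer: hash table mapping word -> {url: occurrence count}."""
    index = defaultdict(dict)
    for url, text in crawl(start):
        for word, count in Counter(parse(text)).items():
            index[word][url] = count
    return index


def search(index, query):
    """Searcher and ranker: AND the keywords, then sort by total count."""
    words = parse(query)
    if not words:
        return []
    postings = [index.get(word, {}) for word in words]
    candidates = set(postings[0]).intersection(*postings[1:])

    def score(url):
        return sum(posting[url] for posting in postings)

    # Highest score first; ties broken alphabetically by URL.
    return sorted(candidates, key=lambda url: (-score(url), url))


index = build_index("a.example")
print(search(index, "web pages"))
print(search(index, "pages"))
```

Counting keyword occurrences is only a placeholder for the ranking system; the slides do not say how relevance is computed.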

10
Search Engines: Not So Basic
  • Summary of document
  • Cache of document
  • Format Filters (pdf, postscript, etc)
  • Duplicate identification and removal
  • More like this
  • Content Filters

11
Evaluating Search Engines
  • Generally, the usability of a search engine
    includes several factors
  • Coverage. Most important -- if it doesn't even
    look at a page it can't retrieve it. Matters
    more when looking for rare information.
  • Relevance ranking. If precision is low and
    relevance ranking is poor, takes too much wading
    to find desired result. Matters more in noisy
    domains.
  • Ease of use. This includes the query syntax, and
    also factors like how crowded the page is. A lot
    of personal preference here.
  • Other features. In some settings special
    features may be critical.

12
Some Well-Known Search Engines
  • Google. www.google.com
  • AltaVista
  • Yahoo
  • Lycos
  • ...