Information Retrieval on the World Wide Web - PowerPoint PPT Presentation

Slides: 20
Provided by: jamesrober
Learn more at: https://dsf.berkeley.edu
Transcript and Presenter's Notes

1
Information Retrieval on the World Wide Web
  • Authors: Venkat N. Gudivada, Vijay V. Raghavan,
    William I. Grosky, Rajesh Kasanagottu
  • Presented by: Rob von Behren

2
Roadmap
  • Information Retrieval
  • Implementation Issues and Techniques
  • Analysis

3
Definitions
  • Information retrieval - querying against a set of
    documents to find a subset of "relevant"
    documents
  • Objective terms - external descriptions, not
    related to content
  • Nonobjective terms (content terms) - descriptions
    of the informational content of the document
  • Indexing exhaustivity - degree to which the index
    covers the document space
  • Term specificity - how well a particular term
    narrows the search results
  • Recall - relevant docs found / relevant docs in
    collection
  • Precision - relevant docs found / total docs found
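
The two ratios above can be computed directly from sets of document ids. This is a minimal sketch; the function name and the sample ids are made up for illustration, and returning 0.0 for an empty set is an assumption, not something the slides specify.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: set of document ids the system returned
    relevant:  set of document ids judged relevant in the collection
    """
    hits = len(retrieved & relevant)  # relevant docs found
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 docs returned, 3 of them relevant,
# out of 6 relevant docs in the whole collection.
p, r = precision_recall(
    {"d1", "d2", "d3", "d9"},
    {"d1", "d2", "d3", "d4", "d5", "d6"},
)
print(p, r)
```

Returning every document in the collection drives recall to 1 while precision collapses, which is why the two measures are reported together.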

4
Key qualities
  • Document and query representations
  • Mechanisms for finding relevant documents and
    ranking the results
  • Mechanisms for obtaining user feedback

5
Types of IR models
  • Set Theoretic
  • Algebraic
  • Probabilistic
  • Hybrid

6
Set Theoretic Models
  • Boolean model - Simple Boolean queries regarding
    existence of terms within documents. Queries do
    not contain information about the context of the
    terms.
  • Fuzzy set model - Slight extension of the Boolean
    model. Allows results to include documents that
    meet most of the requirements of the Boolean search.
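
A Boolean query reduces to set operations on an inverted index. The sketch below assumes a toy three-document collection and whitespace tokenization; the names DOCS, build_index and docs_with are hypothetical. It covers only the strict Boolean model, not the fuzzy set variant.

```python
from collections import defaultdict

# Hypothetical toy collection; ids and text are made up for illustration.
DOCS = {
    1: "web search engines index documents",
    2: "boolean queries test term existence",
    3: "search engines answer boolean queries",
}


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


INDEX = build_index(DOCS)


def docs_with(term):
    """Documents containing the term (empty set if the term is unknown)."""
    return INDEX.get(term, set())


and_hits = docs_with("search") & docs_with("boolean")  # search AND boolean
or_hits = docs_with("search") | docs_with("boolean")   # search OR boolean
not_hits = set(DOCS) - docs_with("boolean")            # NOT boolean
print(and_hits, or_hits, not_hits)
```

The index records only whether a term occurs, so the query cannot say anything about where or in what context it occurs.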

7
Algebraic models (vector-space model)
  • Documents are represented by n-dimensional
    vectors.
  • Typically one dimension per term
  • Also possible to treat signatures as bit vectors
  • Queries are n-dimensional vectors
  • Query relevance is the scalar product of the
    document with the query
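
As a rough illustration of the scalar-product ranking, the sketch below uses raw term counts as vector components. The vocabulary, documents and function names are assumptions made for the example; practical systems usually weight and normalize the components.

```python
def to_vector(text, vocabulary):
    """Term-count vector with one dimension per vocabulary term."""
    words = text.lower().split()
    return [words.count(term) for term in vocabulary]


def scalar_product(u, v):
    """Sum of component-wise products of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


VOCAB = ["web", "search", "index", "query"]  # hypothetical vocabulary
docs = {
    "d1": "web search web index",
    "d2": "query index",
    "d3": "search query search query",
}
query_vec = to_vector("web search", VOCAB)

scores = {
    doc_id: scalar_product(to_vector(text, VOCAB), query_vec)
    for doc_id, text in docs.items()
}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores, ranking)
```

Unlike a Boolean match, the score is graded, so documents can be ranked instead of merely accepted or rejected.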

8
Probabilistic models
  • Start with some user-supplied relevance
    information about a training set of documents
  • Compute P(relevant | T) and P(non-relevant | T)
    based on the terms observed in the training set
  • Useful for theoretical analysis, but probably not
    in practice (?)
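
One simple way to read the second bullet is to estimate P(relevant | T) for each term T as the fraction of training documents containing T that the user marked relevant. The sketch below does exactly that and nothing more; the training data and names are hypothetical, and real probabilistic models add smoothing and combine the evidence from several terms.

```python
from collections import Counter

# Hypothetical training set: (terms in the document, user judged it relevant?)
TRAINING = [
    ({"web", "search"}, True),
    ({"web", "crawler"}, True),
    ({"search", "recipe"}, False),
    ({"recipe", "cake"}, False),
]


def term_relevance_probabilities(training):
    """Estimate P(relevant | T) for each term T seen in the training set."""
    containing = Counter()           # number of docs containing T
    relevant_containing = Counter()  # number of relevant docs containing T
    for terms, is_relevant in training:
        for term in terms:
            containing[term] += 1
            if is_relevant:
                relevant_containing[term] += 1
    return {t: relevant_containing[t] / containing[t] for t in containing}


P_REL = term_relevance_probabilities(TRAINING)
# P(non-relevant | T) is the complement of P(relevant | T).
P_NONREL = {t: 1 - p for t, p in P_REL.items()}
print(P_REL)
```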

9
Hybrid models (extended Boolean model)
  • Represent documents as vectors
  • Use the L-p norm to allow definition of Boolean
    operations on vectors
  • p = 1 => the vector model
  • p = infinity, terms equally weighted =>
    Boolean model
  • Empirically best values: 2 < p < 5
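
The effect of p can be seen in the usual p-norm definitions of OR and AND over a document's term weights in [0, 1]. This sketch assumes equally weighted query terms; the function names and sample weights are illustrative only.

```python
def pnorm_or(weights, p):
    """Similarity of a document to (t1 OR ... OR tn), equal query-term weights."""
    n = len(weights)
    return (sum(w ** p for w in weights) / n) ** (1 / p)


def pnorm_and(weights, p):
    """Similarity of a document to (t1 AND ... AND tn), equal query-term weights."""
    n = len(weights)
    return 1 - (sum((1 - w) ** p for w in weights) / n) ** (1 / p)


# Hypothetical document: strong on the first query term, weak on the second.
doc_weights = [0.9, 0.1]
for p in (1, 2, 5, 50):
    print(p, round(pnorm_or(doc_weights, p), 3), round(pnorm_and(doc_weights, p), 3))
```

At p = 1 both operators reduce to the mean of the weights, so AND and OR are indistinguishable, as in the vector model. As p grows, OR approaches the maximum weight and AND approaches the minimum, which is the strict Boolean (fuzzy set) behavior.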

10
User feedback
  • Modify query representation (can be done by the
    user)
    • modify term weights
    • query expansion (add new terms)
    • split the query
  • Modify document representation
    • change term weights within the database
  • agent-based filtering
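
Modifying term weights and expanding the query can both be illustrated with a Rocchio-style update. Rocchio is a standard relevance-feedback technique that the slides do not name, so treat it as a swapped-in example. The constants alpha, beta and gamma and the sample vectors are assumptions chosen for illustration.

```python
def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio-style feedback: move the query toward relevant documents.

    Vectors are dicts mapping term -> weight. alpha, beta and gamma are
    tuning constants chosen here for illustration only.
    """
    terms = set(query)
    for doc in relevant + nonrelevant:
        terms |= set(doc)

    def centroid(docs, term):
        """Average weight of the term over a list of document vectors."""
        return sum(d.get(term, 0.0) for d in docs) / len(docs) if docs else 0.0

    updated = {}
    for term in terms:
        w = (alpha * query.get(term, 0.0)
             + beta * centroid(relevant, term)
             - gamma * centroid(nonrelevant, term))
        if w > 0:  # drop terms pushed to zero or below
            updated[term] = w
    return updated


new_query = rocchio(
    query={"jaguar": 1.0},
    relevant=[{"jaguar": 1.0, "cat": 1.0}],
    nonrelevant=[{"jaguar": 1.0, "car": 1.0}],
)
print(new_query)
```

Here the term "cat" enters the query (query expansion), the weight of "jaguar" changes (term reweighting), and "car" is kept out because it only appears in the document judged non-relevant.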

11
Roadmap
  • Information Retrieval
  • Implementation Issues and Techniques
  • Analysis

12
Web Crawling
  • WWW is a directed graph
  • Use your favorite graph traversal algorithm!!
  • Netizenship issues
  • Starting points
    • individual page
    • set of pages
    • domain name searching (good because the web
      isn't necessarily connected)
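
A breadth-first traversal is one possible choice of graph algorithm. The sketch below walks an in-memory link graph instead of fetching real pages; the URLs and the LINKS table are hypothetical. A real crawler would fetch pages over HTTP and respect robots.txt and rate limits.

```python
from collections import deque

# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "a.example/": ["a.example/x", "b.example/"],
    "a.example/x": ["a.example/"],
    "b.example/": ["b.example/y"],
    "b.example/y": [],
    "c.example/": ["a.example/"],  # no page links here
}


def crawl(seeds, links):
    """Breadth-first traversal from the seed pages; returns the visit order."""
    seen = set(seeds)
    queue = deque(seeds)
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)
        for target in links.get(page, []):
            if target not in seen:  # the graph has cycles, so track visits
                seen.add(target)
                queue.append(target)
    return order


order = crawl(["a.example/"], LINKS)
print(order)
```

The page c.example/ is never reached from the single seed because no other page links to it, which is the situation the domain-name starting point is meant to address.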

13
Automatic Indexing
  • Single term - Just look at the existence or
    non-existence of the term in the document
  • Phrase - Additionally store other information
    about the position of the term in the document,
    and the positions of other terms relative to it
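
The difference between the two index types can be sketched with a positional index: keeping only the document ids gives a single-term index, while the stored positions make phrase matching possible. The function names and sample documents are made up, and tokenization is plain whitespace splitting.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map term -> {doc_id: [positions where the term occurs]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term][doc_id].append(pos)
    return index


def phrase_hits(index, phrase):
    """Documents where the phrase's terms occur at consecutive positions."""
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return set()
    hits = set()
    for doc_id, starts in index[terms[0]].items():
        for start in starts:
            if all(doc_id in index[t] and start + i in index[t][doc_id]
                   for i, t in enumerate(terms)):
                hits.add(doc_id)
                break
    return hits


DOCS = {
    1: "information retrieval on the web",
    2: "retrieval of web information",
}
INDEX = build_positional_index(DOCS)

single_term = set(INDEX["information"])             # existence only
phrase = phrase_hits(INDEX, "information retrieval")
print(single_term, phrase)
```

Both documents contain both words, but only the first contains them as a phrase.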

14
Automatic Indexing (cont)
  • Statistical - Term weights depend on how well
    they differentiate between documents
  • Information-theoretic - Signal to noise. Similar
    to some types of statistical indexing
  • Probabilistic - Compute the importance of terms
    based on user feedback on a subset of the
    documents
  • Linguistic - Use language syntax information such
    as part of speech
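
For the statistical approach, inverse document frequency (IDF) is one common weight that rewards terms for differentiating between documents. The slides do not name a specific formula, so the log(N / df) form and the sample documents below are assumptions.

```python
import math


def idf_weights(docs):
    """Inverse document frequency: log(N / df) for each term."""
    n = len(docs)
    df = {}  # number of documents containing each term
    for text in docs:
        for term in set(text.lower().split()):
            df[term] = df.get(term, 0) + 1
    return {term: math.log(n / count) for term, count in df.items()}


weights = idf_weights([
    "the web is large",
    "the crawler walks the web",
    "the index is inverted",
])
print(weights["the"], weights["web"], weights["crawler"])
```

A term that appears in every document, such as "the", gets weight zero because it cannot distinguish one document from another. A term that appears in a single document gets the highest weight.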

15
Current Search Engines
  • Type 1: automatically indexed
  • Type 2: (partially) human indexed, hierarchically
    organized
  • Common features
    • allow Boolean searches
    • do vector-like queries to find document
      relevance

16
Current Search Engines (cont)
  • Type 1
    • AltaVista, Excite, HotBot, InfoSeek, Lycos,
      OpenText
  • Type 2
    • Yahoo, Magellan, WWW Virtual Library, Galaxy

17
Roadmap
  • Information Retrieval
  • Implementation Issues and Techniques
  • Analysis

18
Analysis
  • Number of hits: disjunctive > conjunctive > phrase
    (DUH!)
  • Flaws
    • No tailoring of the search to the intent of the
      query (by adding/excluding terms), and no more
      complicated Boolean expressions.
    • No tailoring of the search to the specific
      capabilities of each search engine (lowest
      common denominator)

19
Future Directions
  • Use META tags to note content
  • Add user feedback mechanisms
  • Have small, specific databases, rather than
    monolithic databases
  • Create common interfaces (federation of
    databases)
  • Possibly allow better management of index content?