CMP 788 Distributed IR - PowerPoint PPT Presentation

Provided by: GJu

1
CMP 788 Distributed IR
  • Part 2, Lecture 6
  • Harvest, Part 1
  • Harvest: A Scalable, Customizable Discovery and
    Access System
  • Fall 05
  • Department of Mathematics and Computer Science
  • Lehman College, CUNY

2
Harvest Project
  • Project started at the University of Colorado
    (1994) in collaboration with researchers at
    other universities, research institutions, and
    corporations (USC, University of Arizona,
    Transarc Corp., etc.).
  • Some principal protocols for Internet
    information resources:
  • FTP, NetNews, WAIS (Wide Area Information
    Servers, 1992), Gopher (1992)
  • HTTP, Z39.50
    (http://www.dlib.org/dlib/april97/04lynch.html)
  • Problems in information storage and retrieval in
    the Internet environment:
  • An ever-increasing user base and information
    volume make relevant information difficult to
    locate.
  • Server and network bottlenecks due to excessive
    loads.
  • Little support from existing systems for
    structured and complex data.
  • Existing systems mainly support text and graphics
    for end-user viewing.
  • Many systems store an object that has internal
    structure (a database object, for example) as a
    plain file.
  • Harvest tries to support structured data through
    the use of attribute-value structured indexes.

3
Harvest Functionality
  • Major functionality:
  • Supports resource discovery through topic-specific
    content indexes.
  • Employs a very efficient distributed
    information-gathering architecture.
  • Topology-adaptive index replication and object
    caching.
  • Tries to support structured data through
    structure-preserving indexes and data-type-specific
    manipulation.
  • Major system components:
  • Gatherer: locates and processes information
    (web spider); includes the Essence subsystem
    (summarizer).
  • Broker: information server (interface to the
    gathered data).
  • Indexing/Search subsystems: specialized search
    interface to the Broker.
  • Distributed database technologies:
  • Object cache: supports rapid retrieval of
    frequently used objects.
  • Replicator: supports transparent mirror sites.
4
Two Types of Indexing in Harvest
  • File/menu name indexing of widely distributed
    information (e.g., Archie, which acts like a
    library card catalog: it searches the anonymous
    FTP file listings and returns the sites where a
    file is located).
  • Very space efficient, but supports only limited
    queries.
  • For example, Archie can find graphics packages
    only if their file names happen to reflect their
    contents.
  • Full-text indexing systems:
  • Support powerful queries but incur index space
    overhead.
  • Global flat indexes become less useful as the
    information space grows, because queries match
    too much indexed information.
  • Harvest achieves space-efficient content indexes
    of widely distributed information.
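
The trade-off between the two indexing styles can be illustrated with a small sketch. It is not Harvest or Archie code: the two-file corpus, the paths, and the function names are all invented for illustration, and tokenization is deliberately naive.

```python
from collections import defaultdict

# Hypothetical corpus (invented for illustration): path -> file contents.
FILES = {
    "/pub/gfx/xv-3.10.tar": "xv is an interactive image viewer for X11",
    "/pub/misc/tool42.tar": "a graphics package for drawing plots",
}


def build_name_index(files):
    """Archie-style index: only tokens taken from the file name."""
    index = defaultdict(set)
    for path in files:
        name = path.rsplit("/", 1)[-1].lower()
        for token in name.replace("-", ".").split("."):
            if token:
                index[token].add(path)
    return index


def build_fulltext_index(files):
    """Full-text index: every word of the contents is a key."""
    index = defaultdict(set)
    for path, text in files.items():
        for word in text.lower().split():
            index[word].add(path)
    return index


name_index = build_name_index(FILES)
text_index = build_fulltext_index(FILES)

# The graphics package is invisible to the name index because its
# file name ("tool42.tar") says nothing about its contents.
print(name_index.get("graphics", set()))
print(text_index.get("graphics", set()))
# The price of the richer queries is a larger index.
print(len(name_index), len(text_index))
```

Even on two files the full-text index has more than twice as many keys as the name index, which is the space overhead the slide refers to.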

5
Caching and Replication
  • There are efforts currently to build object
    caches on HTTP servers and to support replication
    in Internet information systems such as Archie
    and FTP servers.
  • In contrast to the flat organization of existing
    Internet caches, Harvest supports a hierarchical
    architecture of object caches modeled after DNS
    (Domain Name System).
  • In many systems, replicas are placed and
    configured manually.
  • Harvest configures the replicas automatically,
    adapting to measured network changes, which is
    more appropriate in a dynamic environment.
  • The object cache reduces network load, server
    load, and response latency when accessing located
    information objects.
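
A minimal sketch of the hierarchical idea follows. It is a simplification under stated assumptions, not the Harvest cache itself: the class name `CacheNode`, the three-level hierarchy, and the example URL are hypothetical, and expiry, sibling caches, and failure handling are left out.

```python
class CacheNode:
    """One cache in a DNS-like hierarchy (hypothetical, simplified)."""

    def __init__(self, name, parent=None, fetch_origin=None):
        self.name = name
        self.parent = parent              # next cache up the hierarchy
        self.fetch_origin = fetch_origin  # used only by the root cache
        self.store = {}
        self.hits = 0

    def get(self, url):
        if url in self.store:
            self.hits += 1
            return self.store[url]
        # Miss: ask the parent cache, or the origin server at the root.
        if self.parent is not None:
            obj = self.parent.get(url)
        else:
            obj = self.fetch_origin(url)
        self.store[url] = obj
        return obj


origin_requests = []


def fetch_from_origin(url):
    """Stand-in for a real network fetch; records each origin request."""
    origin_requests.append(url)
    return f"contents of {url}"


root = CacheNode("national", fetch_origin=fetch_from_origin)
regional = CacheNode("regional", parent=root)
site_a = CacheNode("site-a", parent=regional)
site_b = CacheNode("site-b", parent=regional)

site_a.get("http://example.org/paper.ps")  # misses everywhere, hits origin
site_b.get("http://example.org/paper.ps")  # served by the regional cache
print(len(origin_requests), regional.hits)
```

Two sites request the same object, but the origin server sees only one request: the second is answered by the shared regional cache, which is how the hierarchy reduces server load and wide-area traffic.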

6
Inefficient Information Gathering

7
Observations on Inefficient Gathering
  • A Provider is a server running one or more of the
    standard Internet information services (e.g.,
    FTP, HTTP).
  • These servers typically fork a child process for
    every request from an indexing system (bold boxes
    represent excessive load on the server).
  • Indexing systems retrieve entire objects (bold
    edges indicate excessive network traffic) but
    discard most of their content.
  • There is no coordination among the indexing
    systems; each gathers information independently.

8
Harvest Info Gathering Approach

9
Harvest Gatherer/Broker
  • The Gatherer collects indexing information from
    the Providers.
  • The Broker provides the indexing and query
    interface to this information.
  • Brokers update their indexes incrementally.
  • Gathered information can be shared by several
    Brokers (this reduces gathering cost
    substantially).
  • A Gatherer can run on the same machine as the
    Provider.
  • A Broker can itself get information from other
    Brokers, which allows further filtering and
    fine-tuning of the indexing information if
    needed.
  • Communication between Gatherer and Broker uses
    SOIF (Summary Object Interchange Format), an
    attribute-value stream protocol that is easily
    parsed but expressive enough to handle different
    object types.
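
The division of labor above (gather once, share among several Brokers, update incrementally) might be sketched as follows. The class and method names are hypothetical and the summarizer is a toy keyword extractor; the real Gatherer and Broker exchange SOIF records over the network.

```python
class Gatherer:
    """Summarizes each object once and keeps a change log (hypothetical)."""

    def __init__(self):
        self.summaries = {}    # url -> (version, summary)
        self.clock = 0         # logical clock for incremental updates
        self.gather_count = 0  # how many objects were actually processed

    def gather(self, url, content):
        self.clock += 1
        self.gather_count += 1
        keywords = sorted(set(content.lower().split()))[:5]
        self.summaries[url] = (self.clock, {"url": url, "keywords": keywords})

    def changes_since(self, version):
        """Return only the summaries newer than the given version."""
        return [(v, s) for v, s in self.summaries.values() if v > version]


class Broker:
    """Indexes summaries from a Gatherer, optionally filtered by topic."""

    def __init__(self, gatherer, topic_filter=lambda summary: True):
        self.gatherer = gatherer
        self.topic_filter = topic_filter
        self.index = {}
        self.last_seen = 0

    def update(self):
        changes = self.gatherer.changes_since(self.last_seen)
        for version, summary in changes:
            if self.topic_filter(summary):
                self.index[summary["url"]] = summary
            self.last_seen = max(self.last_seen, version)
        return len(changes)  # number of summaries transferred


gatherer = Gatherer()
gatherer.gather("ftp://example.org/pc/editor.zip", "pc software text editor")
gatherer.gather("http://example.org/papers/cache.ps",
                "computer architecture cache paper")

all_topics = Broker(gatherer)
software_only = Broker(gatherer,
                       topic_filter=lambda s: "software" in s["keywords"])
first = all_topics.update()
software_only.update()

gatherer.gather("ftp://example.org/pc/paint.zip", "pc software paint program")
second = all_topics.update()  # transfers only the new summary
software_only.update()
print(first, second, gatherer.gather_count)
```

Both Brokers are fed from the same three gathered summaries, so the Providers are visited once per object rather than once per index, and the second update moves a single record instead of the whole collection.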

10
Harvest Gatherer/Broker (contd)
  • Each SOIF template contains a type, a URL, and a
    list of byte-count-delimited field name/value
    pairs.
  • Mandatory fields can be defined too (e.g., the
    attributes for a Broker describe the server's
    administrator, location, software version, and
    the type of objects it contains).
  • Example:
  • @DOCUMENT { http://comet.lehman.cuny.edu/../cmp788DIR.htm
  • Title{25}: Distributed Information
  • Instructor-Name{14}: Prof. G. Jung
  • Description{91}: This course discusses.
11
Harvest Gatherer/Broker (contd)
  • The Harvest Server Registry (HSR) is a special
    instance of a Broker that registers information
    about every other component in the system.
  • Topic-specific Brokers are possible (e.g., one
    Broker can index PC software archives while
    another keeps track of scientific papers in
    computer architecture).
  • The HSR is contacted first for searches and for
    new server deployment efforts. The HSR can be
    replicated.
  • The gathering process can be divided among
    several servers (e.g., one server indexes each
    US regional network), which distribute partial
    updates among the replicas.
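
The role of the registry as a first point of contact can be sketched as a simple lookup table. The class `ServerRegistry`, its methods, and the host names are hypothetical; the real HSR is itself a Broker and holds its registrations as indexed records.

```python
class ServerRegistry:
    """Toy stand-in for the HSR: records components and their topics."""

    def __init__(self):
        self.entries = []

    def register(self, kind, host, topics=()):
        entry = {"kind": kind, "host": host, "topics": set(topics)}
        self.entries.append(entry)
        return entry

    def find_brokers(self, topic):
        """Return the hosts of the Brokers that cover the given topic."""
        return [e["host"] for e in self.entries
                if e["kind"] == "broker" and topic in e["topics"]]


registry = ServerRegistry()
registry.register("broker", "broker1.example.org", ["pc-software"])
registry.register("broker", "broker2.example.org", ["computer-architecture"])
registry.register("gatherer", "gatherer1.example.org")

# A search starts at the registry, then goes to the matching Broker.
print(registry.find_brokers("computer-architecture"))
```

A client that wants papers on computer architecture asks the registry first and is directed to the one Broker that specializes in that topic, instead of querying every Broker in the system.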

12
Harvest System Architecture

13
Harvest Part 2 (DIRP2-lecture7)
  • Will discuss:
  • more about the Broker subsystem
  • the index/search subsystem
  • the caching subsystem
  • the replication subsystem