CMP 788 Distributed IR

Slides: 14

Provided by: GJu

Transcript and Presenter's Notes

1
CMP 788 Distributed IR

2
Harvest Project

Project started at the University of Colorado (1994) in collaboration with researchers at other universities, research institutions, and corporations (USC, University of Arizona, Transarc Corp., etc.).
Some principal protocols for Internet information resources: FTP, NetNews, WAIS (Wide Area Information Servers, 1992), Gopher (1992), HTTP, Z39.50 (http://www.dlib.org/dlib/april97/04lynch.html).
Problems in information storage and retrieval in the Internet environment:
An ever-increasing user base and information volume make it difficult to locate relevant information.
Servers and networks become bottlenecks under excessive load.
Existing systems offer little support for structured and complex data.
They mainly support text and graphics for end-user viewing.
Many systems simply store an encountered object that has internal structure (a database object, for example) in a file.
Harvest tries to support structured data through the use of attribute-value structured indexes.

3
Harvest Functionality

Major functionality:
Supports resource discovery through topic-specific content indexes.
Employs a very efficient distributed information gathering architecture.
Provides topology-adaptive index replication and object caching.
Tries to support structured data through structure-preserving indexes and data-type-specific manipulation.
Major system components:
Gatherer: locates and processes information (web spider); also includes the Essence subsystem (summarizer).
Broker: information server (interface to the gathered data).
Indexing/Search subsystems: specialized search interfaces to the Broker.
Distributed database technologies:
Object cache: supports rapid retrieval of frequently used objects.
Replicator: supports transparent mirror sites.

4
Two Types of Indexing in Harvest

File/menu name indexing of widely distributed information (e.g., Archie, which acts like a library card catalog: it searches the anonymous FTP file lists to give you the location of the site holding a file).
Very space efficient, but supports only limited queries.
For example, Archie can find graphics packages only if their file names happen to reflect their contents.
Full-text indexing systems
Support powerful queries but require significant index space.
Global flat indexes become less useful as the information space grows, because queries match too much indexed information.
Harvest achieves space-efficient content indexes of widely distributed information.
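
The trade-off between the two index types can be illustrated with a minimal Python sketch. The file paths and contents below are hypothetical, and the posting count is only a rough stand-in for index space; this is not how Archie or Harvest is implemented.

```python
from collections import defaultdict

# Hypothetical toy corpus: path -> file contents (not from the slides).
FILES = {
    "/pub/gfx/xv-3.10.tar": "xv image viewer for X11, displays GIF and JPEG graphics",
    "/pub/misc/tool42.tar": "a graphics package for plotting scientific data",
    "/pub/net/ftpd.tar": "an FTP server daemon",
}


def tokenize(text):
    """Lowercase the text and yield maximal runs of alphanumeric characters."""
    word = []
    for ch in text.lower():
        if ch.isalnum():
            word.append(ch)
        elif word:
            yield "".join(word)
            word = []
    if word:
        yield "".join(word)


def build_index(docs):
    """Map each token to the set of document keys that contain it."""
    index = defaultdict(set)
    for key, text in docs.items():
        for token in tokenize(text):
            index[token].add(key)
    return index


def index_size(index):
    """Total number of postings, used here as a rough proxy for index space."""
    return sum(len(postings) for postings in index.values())


# Name index: only the path itself is indexed (Archie-style).
name_index = build_index({path: path for path in FILES})
# Full-text index: the path and the contents are indexed.
text_index = build_index({path: path + " " + body for path, body in FILES.items()})

# The name index is smaller but cannot answer a query on "graphics",
# because no file name happens to contain that word.
print(index_size(name_index), index_size(text_index))
print(sorted(name_index.get("graphics", set())))
print(sorted(text_index.get("graphics", set())))
```

In this toy corpus the query "graphics" finds nothing in the name index and two files in the full-text index, at the cost of a larger index.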

5
Caching and Replication

There are current efforts to build object caches on HTTP servers and to support replication in Internet information systems such as Archie and FTP servers.
In contrast to the flat organization of existing Internet caches, Harvest supports a hierarchical architecture of object caches modeled after DNS (Domain Name System).
In many systems, replicas are placed and configured manually.
Harvest configures the replicas automatically, adapting to measured network changes, which is more appropriate in a dynamic environment.
The object cache reduces network load, server load, and response latency when accessing located information objects.
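
A hierarchical cache of this kind can be sketched as follows. This is a simplified illustration, not the Harvest cache itself: the class name, the node names, and the parent-only lookup are assumptions, and expiry and replacement policies are left out.

```python
class ObjectCache:
    """A cache node that asks its parent on a miss; the root asks the origin."""

    def __init__(self, name, parent=None, origin=None):
        self.name = name
        self.parent = parent
        self.origin = origin  # callable url -> bytes, used only by the root
        self.store = {}
        self.hits = 0
        self.misses = 0

    def get(self, url):
        if url in self.store:
            self.hits += 1
            return self.store[url]
        self.misses += 1
        if self.parent is not None:
            obj = self.parent.get(url)   # resolve through the hierarchy
        else:
            obj = self.origin(url)       # only the root contacts the provider
        self.store[url] = obj
        return obj


origin_calls = []


def fetch_from_origin(url):
    """Hypothetical origin server: records each request it receives."""
    origin_calls.append(url)
    return ("content of " + url).encode()


# Hypothetical three-level hierarchy: two site caches share a regional cache.
root = ObjectCache("root", origin=fetch_from_origin)
regional = ObjectCache("regional", parent=root)
site_a = ObjectCache("site-a", parent=regional)
site_b = ObjectCache("site-b", parent=regional)

site_a.get("http://example.org/paper.ps")
site_b.get("http://example.org/paper.ps")
print(len(origin_calls), regional.hits)
```

The second site's request is served by the regional cache, so the origin server sees a single request for the object.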

6
Inefficient Information Gathering

7
Observations on Inefficient Gathering

A Provider is a server running one or more of the standard Internet information services (e.g., FTP, HTTP).
These servers typically fork a child process for every request from an indexing system (bold boxes represent excessive load on the server).
Indexing systems retrieve entire objects (bold edges indicate excessive network traffic) but discard most of their content.
There is also no coordination among the indexing systems; each gathers information independently.

8
Harvest Info Gathering Approach

9
Harvest Gatherer/Broker

The Gatherer collects indexing information from the Providers.
The Broker provides the indexing and query interface to this information.
Brokers update their indexes incrementally.
Gathered information can be shared by several Brokers (this saves substantial gathering cost).
The Gatherer can run on the same machine as the Provider.
A Broker can itself get information from other Brokers, which allows further filtering and fine-tuning of the indexing information if needed.
Communication between Gatherer and Broker uses SOIF (Summary Object Interchange Format), which is an attribute-value stream protocol. It is easily parsed but expressive enough to handle different object types.
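
The division of labor between Gatherer and Broker can be sketched in Python. The classes, the version counter, and the digest-based change check are illustrative assumptions, not the actual Harvest mechanism: the point is only that content is summarized once and that each Broker pulls just the summaries that changed since its last refresh.

```python
import hashlib


class Gatherer:
    """Summarizes provider objects once; Brokers pull only what changed."""

    def __init__(self):
        self.summaries = {}  # url -> (version, summary dict)
        self.version = 0

    def gather(self, url, content):
        """Summarize content; return True if the stored summary changed."""
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        old = self.summaries.get(url)
        if old is not None and old[1]["digest"] == digest:
            return False  # unchanged, nothing to re-summarize
        self.version += 1
        self.summaries[url] = (self.version, {"digest": digest, "size": len(content)})
        return True

    def updates_since(self, version):
        """Return the summaries that are newer than the given version."""
        return {url: s for url, (v, s) in self.summaries.items() if v > version}


class Broker:
    """Keeps an index built from one Gatherer and refreshes it incrementally."""

    def __init__(self, gatherer):
        self.gatherer = gatherer
        self.index = {}
        self.seen_version = 0

    def refresh(self):
        """Pull only new or changed summaries; return how many were pulled."""
        updates = self.gatherer.updates_since(self.seen_version)
        self.index.update(updates)
        self.seen_version = self.gatherer.version
        return len(updates)
```

Two Brokers created with the same Gatherer share one gathering pass over the Provider, and a later `refresh()` transfers only the entries whose content changed.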

10
Harvest Gatherer/Broker (contd)

Each SOIF template contains a type, a URL, and a list of byte-count delimited field name/value pairs.
Mandatory fields can be added too (e.g., attributes for a Broker that describe the server's administrator, location, software version, and the types of objects it contains).
(e.g.)
@DOCUMENT { http://comet.lehman.cuny.edu/../cmp788DIR.htm
Title{25}: Distributed Information
Instructor-Name{14}: Prof. G. Jung
Description{91}: This course discusses.
}
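
The template layout described above (a type, a URL, and byte-count delimited name/value pairs) can be read and written with a short Python sketch. The exact delimiters used here, `@TYPE { URL` on the first line, `name{count}:` followed by a tab before each value, and a closing `}`, are assumptions about the wire format; the function names are hypothetical.

```python
def write_soif(template_type, url, fields):
    """Serialize a template; each value is prefixed by its length in bytes."""
    out = [f"@{template_type} {{ {url}\n".encode("utf-8")]
    for name, value in fields.items():
        data = value.encode("utf-8")
        out.append(f"{name}{{{len(data)}}}:\t".encode("utf-8") + data + b"\n")
    out.append(b"}\n")
    return b"".join(out)


def parse_soif(blob):
    """Parse one template produced by write_soif into (type, url, fields)."""
    header, _, rest = blob.partition(b"\n")
    template_type, _, url = header.decode("utf-8")[1:].partition(" { ")
    fields = {}
    pos = 0
    while not rest.startswith(b"}", pos):
        brace = rest.index(b"{", pos)
        close = rest.index(b"}", brace)
        name = rest[pos:brace].decode("utf-8")
        size = int(rest[brace + 1:close])
        start = close + 3                # skip "}", ":" and the tab
        fields[name] = rest[start:start + size].decode("utf-8")
        pos = start + size + 1           # skip the newline after the value
    return template_type, url, fields


blob = write_soif(
    "DOCUMENT",
    "http://example.org/course.htm",     # hypothetical URL
    {"Title": "Distributed IR", "Description": "line one\nline } two"},
)
print(parse_soif(blob))
```

Because the parser trusts the byte count instead of scanning for a terminator, values may contain newlines or braces without any escaping, which is what keeps the format easy to parse.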

11
Harvest Gatherer/Broker (contd)

The Harvest Server Registry (HSR) is a special instance of a Broker that registers information about every other component in the system.
Topic-specific Brokers are possible (e.g., one Broker can index PC software archives while another keeps track of scientific papers in computer architecture).
The HSR is contacted first for searches and for new server deployment efforts. The HSR can be replicated.
The gathering process can be divided among several servers (e.g., one server indexes each US regional network), with the partial updates distributed among the replicas.
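
The role of the HSR as the first point of contact can be sketched as a small registry that maps topic-specific Brokers to the topics they cover. The class, the Broker names, and the keyword-overlap matching are hypothetical; the slides do not say how the HSR selects Brokers.

```python
class ServerRegistry:
    """Toy registry: records which topics each registered Broker covers."""

    def __init__(self):
        self.brokers = {}  # broker name -> set of lowercase topic keywords

    def register(self, name, topics):
        self.brokers[name] = {topic.lower() for topic in topics}

    def find_brokers(self, query):
        """Return the names of Brokers whose topics overlap the query words."""
        words = set(query.lower().split())
        return sorted(name for name, topics in self.brokers.items() if words & topics)


registry = ServerRegistry()
registry.register("pc-software", ["pc", "software", "archives"])
registry.register("comp-arch-papers", ["architecture", "papers"])

# A search first asks the registry which Brokers are worth querying.
print(registry.find_brokers("cache architecture papers"))
```

A client would then send the actual query only to the Brokers the registry returns.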

12
Harvest System Architecture

13
Harvest Part 2 (DIRP2-lecture7)
