CMP 788 Distributed IR

Provided by: GJu

1
CMP 788 Distributed IR

2
Broker
3
SOIF Example
4
Replicator and Object Cache

The Harvest Replicator can be used to replicate servers, to enhance user-base scalability.
For example, the HSR (Harvest Server Registry) will likely become heavily replicated, since it acts as a point of first contact for searches and new server deployment efforts.
The Replication subsystem can also be used to divide the gathering process among many servers (e.g., letting one server index each U.S. regional network), distributing the partial updates among the replicas.
The Harvest Object Cache reduces network load, server load, and response latency when accessing located information objects.
An LFU (Least Frequently Used) cache replacement strategy is often used.
Hierarchical caching: if the target object is not in the local cache, use the subnet's cache (often provided by firewall software) or a parent cache (a larger cache stored on a server shared by many machines).
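
A minimal sketch of the two ideas on this slide, LFU replacement and falling back to a parent cache on a miss. The class name LFUCache, the fetch_from_home callback, and the capacities are illustrative assumptions, not Harvest's actual implementation.

```python
from collections import Counter

class LFUCache:
    """Tiny LFU cache (hypothetical): on overflow, evict the least frequently used URL."""

    def __init__(self, capacity, parent=None):
        self.capacity = capacity
        self.parent = parent      # next cache up the hierarchy, or None
        self.store = {}           # url -> object
        self.uses = Counter()     # url -> number of accesses

    def get(self, url, fetch_from_home):
        if url in self.store:
            self.uses[url] += 1
            return self.store[url]
        # Miss: ask the parent cache if there is one, otherwise the home site.
        if self.parent is not None:
            obj = self.parent.get(url, fetch_from_home)
        else:
            obj = fetch_from_home(url)
        self._insert(url, obj)
        return obj

    def _insert(self, url, obj):
        if len(self.store) >= self.capacity:
            victim = min(self.store, key=lambda k: self.uses[k])
            del self.store[victim]
            del self.uses[victim]
        self.store[url] = obj
        self.uses[url] = 1

# Usage: a small local cache backed by a larger parent cache.
home_fetches = []

def fetch_from_home(url):
    home_fetches.append(url)          # record every trip to the home site
    return "body of " + url

parent = LFUCache(capacity=100)
local = LFUCache(capacity=2, parent=parent)
for url in ["a", "a", "b", "c"]:      # "b" is evicted locally when "c" arrives
    local.get(url, fetch_from_home)
local.get("b", fetch_from_home)       # served by the parent, not the home site
```

In this run the home site is contacted once per distinct URL; the second request for "b" is answered by the parent cache.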

5
Hierarchical Cache Arrangement
6
Caching Subsystem
7
Caching Resolution Protocol

Each cache in the hierarchy independently decides whether to fetch the reference from the object's home site or from its parent or sibling caches, using a simple resolution protocol.
If the URL contains any of a configurable list of substrings, then the object is fetched directly from the object's home, rather than through the cache hierarchy.
This feature is used to resolve non-cacheable URLs (e.g., cgi-bin, password-protected objects).
If the URL's domain name matches a configurable list of substrings, then the object is resolved through the particular parent bound to that domain.
If a cache receives a request for a URL that misses, it performs a remote procedure call to all of its siblings and parents, checking whether the URL hits any sibling or parent.
The cache retrieves the object from the site with the lowest measured latency.
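
The decision rules above can be sketched as a single function. This is a simplified illustration: the function name, the argument shapes, and the choice to fall back to the home site when no neighbor reports a hit are assumptions, and the replies from siblings and parents are passed in as plain data instead of being gathered over the network.

```python
def choose_source(url, host, bypass_substrings, domain_parents, neighbors):
    """Return the name of the source to fetch `url` from (hypothetical helper).

    neighbors maps a sibling/parent name to (has_object, latency_seconds),
    standing in for the replies to the hit/miss queries.
    """
    # Rule 1: URLs matching the bypass list go straight to the home site.
    if any(s in url for s in bypass_substrings):
        return "home"
    # Rule 2: a domain bound to a particular parent is resolved through it.
    for substring, parent in domain_parents.items():
        if substring in host:
            return parent
    # Rule 3: among neighbors reporting a hit, pick the lowest latency.
    hits = {name: lat for name, (has, lat) in neighbors.items() if has}
    if hits:
        return min(hits, key=hits.get)
    return "home"   # assumption: nobody has it, so go to the home site

# Usage with made-up configuration and replies.
bypass = ["cgi-bin"]
bound = {".uk": "uk_parent"}
replies = {"sibling1": (True, 0.040), "parent1": (True, 0.015), "sibling2": (False, 0.005)}

r1 = choose_source("http://x.example.org/cgi-bin/q", "x.example.org", bypass, bound, replies)
r2 = choose_source("http://www.site.ac.uk/p.ps", "www.site.ac.uk", bypass, bound, replies)
r3 = choose_source("http://www.example.edu/p.ps", "www.example.edu", bypass, bound, replies)
```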

8
Caching Resolution Protocol (contd)

Hierarchies as deep as three caches add little noticeable access latency.
The only case where the cache adds noticeable latency is when one of its parents fails but the child cache has not yet detected the failure.
In this case, references are delayed by two seconds, the parent-to-child cache timeout.
As the hierarchy deepens, the root caches become responsible for more and more clients.
To keep root cache servers from becoming overloaded, the Harvest hierarchy terminates at the first place in the regional or backbone network where bandwidth is plentiful.

9
Caching Resolution Protocol (contd)

Additionally, a cache option can be enabled that tricks the referenced URL's home site into taking part in the resolution protocol.
This option allows the cache to retrieve the object from the home site if it happens to be closer than any of the sibling or parent caches.
Closeness can be estimated from object access latency, measured using a network ping or echo.
A cache resolves a reference through the first sibling, parent, or home site to return a UDP Hit packet (the home site replies through its echo port).
If all caches miss and no Hit arrives within two seconds, the reference is resolved through the first parent to return a UDP Miss message.

10
Caching Resolution Protocol (contd)

The cache will not wait for a home machine to time out; it will begin transmitting as soon as all of the parent and sibling caches have responded.
The resolution protocol's goal is for a cache to resolve an object through the source (cache or home) that can provide it most efficiently.
This protocol is really a simple heuristic:
A fast response to a ping indicates low latency.
But bandwidth is more important for large objects.
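
A small worked example of why ping time alone is only a heuristic. It assumes a rough cost model (one round trip plus size divided by bandwidth) and two made-up sources; none of the numbers come from the slides.

```python
def transfer_time(rtt_s, bandwidth_kb_per_s, size_kb):
    """Rough fetch time: one round trip plus the time to move the bytes."""
    return rtt_s + size_kb / bandwidth_kb_per_s

# Hypothetical sources: (round-trip time in seconds, bandwidth in KB/s).
sources = {
    "near_but_thin": (0.01, 100.0),
    "far_but_fat": (0.10, 1000.0),
}

def best_source(size_kb):
    return min(sources, key=lambda name: transfer_time(*sources[name], size_kb))

small = best_source(1)         # 1 KB object: latency dominates
large = best_source(10_000)    # roughly 10 MB object: bandwidth dominates
```

For the 1 KB object the low-latency source wins (about 0.02 s versus 0.101 s); for the large object the high-bandwidth source wins (about 10.1 s versus 100.01 s), even though it answers a ping ten times more slowly.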

11
Non-cacheable Objects/Security

The wide variety of Internet information systems leads to a number of cases where objects should not be cached.
Objects that are password protected are not cached. Rather, the cache acts as an application gateway and discards the retrieved object as soon as it has been delivered. → This can resolve security and privacy problems.
CGI-bin results (output of server-side scripts) are not cached.
The cache may limit the size of the largest cacheable object, so that a few large FTP objects do not purge ten thousand smaller objects from the cache.
The caching subsystem does not prevent servers from encrypting or applying digital signatures to their documents.
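
The cacheability rules on this slide amount to a short predicate. A minimal sketch, assuming a 4 MB size limit and a one-entry substring list; both values and the function name are hypothetical.

```python
MAX_OBJECT_BYTES = 4 * 1024 * 1024          # assumed size limit
NON_CACHEABLE_SUBSTRINGS = ("cgi-bin",)     # assumed configurable list

def should_cache(url, size_bytes, password_protected):
    """Decide whether a retrieved object may be kept in the cache."""
    if password_protected:
        return False        # pass through as a gateway, then discard
    if any(s in url for s in NON_CACHEABLE_SUBSTRINGS):
        return False        # server-side script output
    return size_bytes <= MAX_OBJECT_BYTES   # keep huge objects out

page = should_cache("http://example.org/index.html", 20_000, False)
script = should_cache("http://example.org/cgi-bin/search", 2_000, False)
private = should_cache("http://example.org/grades.html", 2_000, True)
huge = should_cache("ftp://example.org/pub/image.iso", 600_000_000, False)
```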

12
Cache Updates Problem

Problems with caching:
It is difficult to know whether a cached object has been updated before its next use without checking (at least with a HEAD request).
The Web has no integrated mechanism for remotely forcing a cache flush.
Expiration can be controlled by an object header (e.g., Expires: 0 or Expires: Thu, 16 May 2001 14:40:30 GMT).
This mechanism only supports predictive expiration (it says in advance how long a copy may be used).
But what if the object changes unexpectedly before expiration, or persists unchanged after the specified time?
Cache updates:
Based on data access efficiency: use a log of usage statistics (LRU, Least Recently Used), triggered by a cache server.
A cache consistency problem may occur when one or more cache servers maintain copies of the same object.
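
Predictive expiration can be illustrated with a freshness check on the Expires header. This sketch uses Python's standard email.utils date parser and treats an unparseable value such as "0" as already expired; the function name is_fresh is an assumption, and the check says nothing about whether the object actually changed at the origin.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def is_fresh(expires_header, now):
    """True if a cached copy may still be used at time `now` (timezone-aware)."""
    try:
        expires = parsedate_to_datetime(expires_header)
    except (TypeError, ValueError):
        return False                      # e.g. "0": treat as already expired
    if expires.tzinfo is None:            # no usable zone given: assume UTC
        expires = expires.replace(tzinfo=timezone.utc)
    return now < expires

header = "Thu, 16 May 2001 14:40:30 GMT"
before = is_fresh(header, datetime(2001, 5, 16, 12, 0, 0, tzinfo=timezone.utc))
after = is_fresh(header, datetime(2001, 5, 16, 15, 0, 0, tzinfo=timezone.utc))
zero = is_fresh("0", datetime(2001, 5, 16, 12, 0, 0, tzinfo=timezone.utc))
```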

13
The Economy of Cache Updates
14
Negative Caching

To reduce the costs of repeated failures,
negative caching is used.
When a DNS lookup failure occurs, Harvest caches
the negative result for five minutes (chosen
because transient Internet conditions are
typically resolved this quickly).
When an object retrieval failure occurs, Harvest
caches the negative result for a parameterized
period of time, with a default of five minutes.
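
A minimal sketch of negative caching with the five-minute default mentioned above. The class and method names are hypothetical, and time is passed in explicitly so the behavior is easy to follow.

```python
class NegativeCache:
    """Remember recent failures so they are not retried immediately."""

    def __init__(self, ttl_seconds=300):      # five-minute default
        self.ttl = ttl_seconds
        self.failed_at = {}                   # key -> time of last failure

    def record_failure(self, key, now):
        self.failed_at[key] = now

    def is_known_bad(self, key, now):
        t = self.failed_at.get(key)
        if t is None:
            return False
        if now - t >= self.ttl:               # entry expired: allow a retry
            del self.failed_at[key]
            return False
        return True

dns_failures = NegativeCache()
dns_failures.record_failure("no-such-host.example", now=1000)
soon = dns_failures.is_known_bad("no-such-host.example", now=1100)    # within 5 minutes
later = dns_failures.is_known_bad("no-such-host.example", now=1300)   # 5 minutes elapsed
```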

15
Replication Subsystem

Motivations:
We would like to have (complete) regional copies, with a mechanism to ensure active consistency updates.
mirror-d (a replication tool for Harvest using ftp mirror)

[Figure: site1, site2, and site3. Thin black lines: mirror. Thick gray lines: locally maintained master copies.]
16
Replication Subsystem (contd)

Active consistent updates:
If a server changes its master copy, it notifies its mirror sites.
Harvest supports replication domains.
Mirroring within a domain is carefully coordinated/synchronized.
Mirroring/replication between domains involves gradual propagation of changes (between the sites responsible for inter-domain communication).

17
Replication Subsystem (contd)

mirror-d replication tool: a weakly consistent, replicated tree of files.
Motivation: multiple copies for future access (e.g., Europe, North America) → replication domain.
Problem: maintaining data consistency.
Logical topology: replication subgroups that coordinate consistency and internally share updates within the subgroup domain.
Physical issues (network bandwidth/usage): help determine how replication domains propagate (flood) updates among their neighbors → flood-handling issues (e.g., flooding issues in peer-to-peer networks).
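
To make the flooding idea concrete, here is a generic flooding sketch over a logical topology, not mirror-d's actual algorithm. The site names, the update identifier, and the per-site "seen" sets are illustrative assumptions; the duplicate check is what keeps an update from circulating forever.

```python
from collections import deque

def flood_update(topology, origin, update_id, seen):
    """Propagate update_id from origin to all reachable sites.

    topology: site -> list of neighbor sites (the logical topology).
    seen:     site -> set of update ids already applied at that site.
    Returns the number of messages sent.
    """
    messages = 0
    seen[origin].add(update_id)
    queue = deque([origin])
    while queue:
        site = queue.popleft()
        for neighbor in topology[site]:
            messages += 1                       # one message per link use
            if update_id not in seen[neighbor]:
                seen[neighbor].add(update_id)   # apply, then forward later
                queue.append(neighbor)
    return messages

topology = {
    "site1": ["site2", "site3"],
    "site2": ["site1", "site3"],
    "site3": ["site1", "site2"],
}
seen = {site: set() for site in topology}
sent = flood_update(topology, "site1", "update-1", seen)
```

In this fully connected three-site example every site ends up with the update, at the cost of six messages, four of which are redundant. That overhead is the flood-handling issue the slide refers to.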

18
Replication Group
19
Replication Group (contd)

Although replication domain members are stable, pathways for inter-domain communication may change based on dynamic properties of server load and bandwidth.

20
Physical vs. Logical Topology

The logical inter-domain network topology is a subset of the full physical topology (and is dynamically re-configurable based on network load and bandwidth).

21
Replication in Broker World
22
Index Subsystem

Please refer to the papers to understand the indexing subsystem.
GLIMPSE
Uses Essence SOIF objects to create inverted index entries.
NEBULA
Supports a hierarchical classification scheme → automatic Yahoo-style classification.
Provides views (pre-computed query responses) → basically vector clusters w.r.t. a given query.
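
As a rough illustration of building inverted index entries from attribute-value summaries, the sketch below indexes plain dictionaries standing in for SOIF objects. It does not parse real SOIF syntax and is not how GLIMPSE is implemented; the URLs, attribute names, and tokenization rule are assumptions.

```python
import re
from collections import defaultdict

def build_inverted_index(summaries):
    """Map each term to the set of URLs whose summary contains it."""
    index = defaultdict(set)
    for url, attributes in summaries.items():
        for value in attributes.values():
            for term in re.findall(r"[a-z0-9]+", value.lower()):
                index[term].add(url)
    return index

# Stand-ins for SOIF objects: URL -> attribute/value pairs.
summaries = {
    "http://example.org/a": {"Title": "Distributed caching", "Keywords": "harvest cache"},
    "http://example.org/b": {"Title": "Replication in Harvest", "Keywords": "mirror"},
}
index = build_inverted_index(summaries)
harvest_docs = index["harvest"]
mirror_docs = index["mirror"]
```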

23
Harvest NG Architecture