Title: CMP 788 Distributed IR
1CMP 788 Distributed IR
- Part 2 Lecture 8
- Harvest - Part 3
- Harvest Replication and Object Caching
- Fall 05
- Department of Mathematics and Computer Science
- Lehman College, CUNY
2Broker
3SOIF Example
4Replicator and Object Cache
- The Harvest Replicator can be used to replicate servers, to enhance user-base scalability.
- For example, the HSR (Harvest Server Registry) will likely become heavily replicated, since it acts as a point of first contact for searches and new server deployment efforts.
- The Replication subsystem can also be used to divide the gathering process among many servers (e.g., letting one server index each U.S. regional network), distributing the partial updates among the replicas.
- The Harvest Object Cache reduces network load, server load, and response latency when accessing located information objects.
- An LFU (Least Frequently Used) cache replacement strategy is often used.
- Hierarchical caching: if the target object is not in the local cache, use the subnet's cache (often provided by firewall software) or a parent cache (a larger cache stored on a server shared by many machines).
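The lookup order described above (local cache, then parent cache, then home site) combined with LFU replacement can be sketched as follows. This is a minimal illustration, not Harvest's implementation: the class name `LFUCache`, the `fetch_from_home` callback, and the capacity values are all assumptions made for the example.

```python
from collections import Counter


class LFUCache:
    """Tiny LFU cache that falls back to a parent cache on a miss (illustrative only)."""

    def __init__(self, capacity, parent=None):
        self.capacity = capacity
        self.parent = parent      # next cache up the hierarchy, or None at the top
        self.store = {}           # url -> object body
        self.hits = Counter()     # url -> number of accesses seen by this cache

    def get(self, url, fetch_from_home):
        if url in self.store:                      # local hit
            self.hits[url] += 1
            return self.store[url]
        if self.parent is not None:                # local miss: ask the parent cache
            body = self.parent.get(url, fetch_from_home)
        else:                                      # top of the hierarchy: go to the home site
            body = fetch_from_home(url)
        self._insert(url, body)
        return body

    def _insert(self, url, body):
        if len(self.store) >= self.capacity:
            # Evict the least frequently used entry.
            victim = min(self.store, key=lambda k: self.hits[k])
            del self.store[victim]
            del self.hits[victim]
        self.store[url] = body
        self.hits[url] = 1
```

A child built as `LFUCache(2, parent=LFUCache(10))` only reaches the home site when both levels miss, which is the load reduction the hierarchy is meant to provide.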
5Hierarchical Cache Arrangement
6Caching Subsystem
7Caching Resolution Protocol
- Each cache in the hierarchy independently decides whether to fetch the reference from the object's home site or from its parent or sibling caches, using a simple resolution protocol.
- If the URL contains any of a configurable list of substrings, then the object is fetched directly from the object's home, rather than through the cache hierarchy.
- This feature is used to resolve non-cacheable URLs (e.g., cgi-bin, password-protected objects).
- If the URL's domain name matches a configurable list of substrings, then the object is resolved through the particular parent bound to that domain.
- If a cache receives a request for a URL that misses, it performs a remote procedure call to all of its siblings and parents, checking whether the URL hits any sibling or parent.
- The cache retrieves the object from the site with the lowest measured latency.
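The decision steps on this slide can be sketched as a single function. This is an illustrative sketch under stated assumptions: `ResolverConfig`, `resolve`, the example substrings, and the `neighbor_probe` mapping (which stands in for the remote procedure calls to siblings and parents) are hypothetical names, not part of Harvest.

```python
from dataclasses import dataclass, field
from urllib.parse import urlparse


@dataclass
class ResolverConfig:
    # Substrings that force a direct fetch from the home site (hypothetical default).
    direct_substrings: list = field(default_factory=lambda: ["cgi-bin"])
    # Domain substring -> name of the parent cache bound to that domain (hypothetical).
    domain_parents: dict = field(default_factory=dict)


def resolve(url, config, neighbor_probe):
    """Return the name of the source to fetch `url` from, or None if no neighbor hits.

    `neighbor_probe` maps each sibling/parent name to (has_object, latency_seconds).
    """
    # Step 1: non-cacheable URLs bypass the hierarchy.
    if any(s in url for s in config.direct_substrings):
        return "home"
    # Step 2: a domain match routes the request to the parent bound to that domain.
    host = urlparse(url).hostname or ""
    for domain, parent in config.domain_parents.items():
        if domain in host:
            return parent
    # Step 3: among neighbors that hold the object, pick the lowest latency.
    hits = {name: lat for name, (has, lat) in neighbor_probe.items() if has}
    if hits:
        return min(hits, key=hits.get)
    return None  # handling of a miss everywhere is outside this sketch
```

For example, with `ResolverConfig(domain_parents={".edu": "edu-parent"})`, a request for a `.edu` host is routed to `edu-parent` without probing any neighbor.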
8Caching Resolution Protocol (cont'd)
- Hierarchies as deep as three caches add little noticeable access latency.
- The only case where the cache adds noticeable latency is when one of its parents fails, but the child cache has not yet detected it.
- In this case, references to this object are delayed by two seconds, the parent-to-child cache timeout.
- As the hierarchy deepens, the root caches become responsible for more and more clients.
- To keep root cache servers from becoming overloaded, the Harvest hierarchy terminates at the first place in the regional or backbone network where bandwidth is plentiful.
9Caching Resolution Protocol (cont'd)
- Additionally, a cache option can be enabled that tricks the referenced URL's home site into taking part in the resolution protocol.
- This option allows the cache to retrieve the object from the home site if it happens to be closer than any of the sibling or parent caches.
- Closeness can be estimated from object access latency, measured with a network ping or echo.
- A cache resolves a reference through the first sibling, parent, or home site to return a UDP Hit packet (the home site answers through its echo port).
- If all caches miss within two seconds, the reference is resolved through the first parent to return a UDP Miss message.
10Caching Resolution Protocol (cont'd)
- The cache will not wait for a home machine to time out; it will begin transmitting as soon as all of the parent and sibling caches have responded.
- The resolution protocol's goal is for a cache to resolve an object through the source (cache or home) that can provide it most efficiently.
- This protocol is really a simple heuristic:
- A fast response to a ping indicates low latency.
- But bandwidth is more important for large objects.
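The "first reply wins" rule, the two-second timeout, and the decision not to wait for the home machine can be illustrated with a small pure function. This is a sketch, not the Harvest code: the tuple layout `(arrival_seconds, source, kind, message)`, the function name `choose_source`, and the `None` result (meaning the caller must fall back by some other means) are assumptions made for the example.

```python
TIMEOUT_SECONDS = 2.0  # timeout quoted in the slides


def choose_source(replies, n_caches):
    """Pick the source to resolve through, given UDP replies.

    `replies` holds (arrival_seconds, source, kind, message) tuples, where kind is
    "sibling", "parent", or "home" and message is "HIT" or "MISS".
    `n_caches` is the number of sibling and parent caches that were queried.
    """
    first_parent_miss = None
    cache_replies = 0
    for arrival, source, kind, message in sorted(replies):
        if arrival > TIMEOUT_SECONDS:
            break                          # too late: ignore this and later replies
        if message == "HIT":
            return source                  # first Hit wins, whoever sent it
        if kind in ("sibling", "parent"):
            cache_replies += 1
            if kind == "parent" and first_parent_miss is None:
                first_parent_miss = source
        if cache_replies == n_caches:
            break                          # every cache answered: do not wait for home
    return first_parent_miss               # may be None if no parent replied in time
```

Because the choice depends only on which reply arrives first, it measures latency rather than bandwidth, which is the limitation the slide points out for large objects.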
11Non-cacheable Objects/Security
- The wide variety of Internet information systems leads to a number of cases where objects should not be cached.
- Objects that are password protected are not cached. Rather, the cache acts as an application gateway and discards the retrieved object as soon as it has been delivered. This can resolve security and privacy problems.
- CGI-bin output (server-side scripts) is not cached.
- The cache may limit the size of the largest cacheable object, so that a few large FTP objects do not purge ten thousand smaller objects from the cache.
- The caching subsystem does not prevent servers from encrypting or applying digital signatures to their documents.
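The rules on this slide amount to a cacheability check applied before an object is stored. A minimal sketch follows; the function name, the `password_protected` flag, and the 4 MB limit are hypothetical and chosen only to make the example concrete.

```python
MAX_CACHEABLE_BYTES = 4 * 1024 * 1024  # hypothetical size limit, not a Harvest default


def is_cacheable(url, size_bytes, password_protected):
    """Return True if the object may be stored in the cache."""
    if password_protected:
        return False        # deliver it, then discard it (gateway behaviour)
    if "cgi-bin" in url:
        return False        # output of server-side scripts is not cached
    # Keep a few very large objects from purging many small ones.
    return size_bytes <= MAX_CACHEABLE_BYTES
```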
12Cache Updates Problem
- Problems with caching
- It is difficult to know whether a cached object has been updated before its next use without checking (at least with a HEAD request).
- The Web has no integrated mechanism for a remotely forced cache flush.
- Expiration can be controlled by an object header (e.g., Expires: 0 or Expires: Thu, 16 May 2001 14:40:30 GMT).
- This mechanism only supports predictive expiration (it says in advance how long a copy may be used).
- But what if the object changes unexpectedly before expiration, or persists unchanged after the specified time?
- Cache Updates
- Based on data access efficiency: use a log of usage statistics (LRU, Least Recently Used), triggered by a cache server.
- A cache consistency problem may occur when more than one cache server maintains a copy of the same object.
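Predictive expiration through the Expires header can be checked with the Python standard library. The sketch below uses the header values from the slide; the function name `is_fresh` and the choice to treat an unparseable value such as `0` as already expired are assumptions made for the example.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def is_fresh(expires_header, now):
    """Return True if a cached copy may still be used at time `now` (timezone-aware)."""
    try:
        expires = parsedate_to_datetime(expires_header)
    except (TypeError, ValueError):
        return False            # unparseable values such as "0": treat as expired
    if expires.tzinfo is None:
        expires = expires.replace(tzinfo=timezone.utc)
    return now < expires
```

The check only tells the cache what the server predicted. It cannot detect a change made before the expiry time, which is the limitation raised above.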
13The Economy of Cache Updates
14Negative Caching
- To reduce the costs of repeated failures, negative caching is used.
- When a DNS lookup failure occurs, Harvest caches the negative result for five minutes (chosen because transient Internet conditions are typically resolved this quickly).
- When an object retrieval failure occurs, Harvest caches the negative result for a parameterized period of time, with a default of five minutes.
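Negative caching can be sketched as a table of recent failures with a time-to-live. The class and method names below are hypothetical; only the five-minute default comes from the slide. The current time is passed in explicitly so the behaviour is easy to test; a real caller might pass `time.monotonic()`.

```python
class NegativeCache:
    """Remember recent failures so they are not retried immediately (illustrative only)."""

    def __init__(self, ttl_seconds=300.0):   # five minutes, the default quoted above
        self.ttl = ttl_seconds
        self._failed_at = {}                  # key (hostname or URL) -> time of failure

    def record_failure(self, key, now):
        self._failed_at[key] = now

    def is_known_bad(self, key, now):
        failed = self._failed_at.get(key)
        if failed is None:
            return False
        if now - failed >= self.ttl:
            del self._failed_at[key]          # entry expired: allow a fresh attempt
            return False
        return True
```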
15Replication Subsystem
- Motivations
- We would like to have (complete) regional copies, with a mechanism to ensure actively consistent updates.
- mirror-d (replication tool for Harvest, using ftp mirror)
[Figure: site1, site2, site3. Legend: thin black = mirror; thick gray = locally maintained master copies.]
16Replication Subsystem (cont'd)
- Active consistent updates
- If a server changes its master copy, it notifies mirror sites.
- Harvest supports replication domains
- Mirroring within a domain is carefully coordinated and synchronized.
- Mirroring/replication between domains involves gradual propagation of changes (between the sites responsible for inter-domain communication).
17Replication Subsystem (cont'd)
- mirror-d replication tool: weakly consistent replicated tree of files
- Motivation: multiple copies for future access
- (e.g., Europe, North America) → replication domain
- Problem: maintaining data consistency
- Logical topology
- Replication subgroups coordinate consistency and share updates internally within the subgroup domain.
- Physical issues (network bandwidth/usage)
- These help determine how replication domains propagate (flood) updates among their neighbors → flooding must be handled (e.g., as with flooding in peer-to-peer networks).
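Flooding an update among neighboring replicas can be illustrated with a breadth-first traversal of the logical topology. This is a generic sketch of flooding, not mirror-d's actual algorithm (which the slides do not spell out); the function name, the site names, and the adjacency mapping are hypothetical.

```python
from collections import deque


def flood(neighbors, origin):
    """Return {site: hop count} for an update flooded from `origin`.

    `neighbors` maps each site to the sites it forwards updates to.
    """
    hops = {origin: 0}
    queue = deque([origin])
    while queue:
        site = queue.popleft()
        for peer in neighbors[site]:
            if peer not in hops:          # each site accepts a given update only once
                hops[peer] = hops[site] + 1
                queue.append(peer)
    return hops
```

The duplicate check is what keeps flooding from looping forever in a topology with cycles. The hop counts give a rough picture of how gradually a change reaches sites in other domains.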
18Replication Group
19Replication Group (cont'd)
- Although replication domain members are stable, pathways for inter-domain communication may change based on dynamic properties of server load and bandwidth.
20Physical vs. Logical Topology
- The logical inter-domain network topology is a subset of the full physical topology (and is dynamically re-configurable based on network load and bandwidth).
21Replication in Broker World
22Index Subsystem
- Please refer to the papers to understand the indexing subsystem.
- GLIMPSE
- Uses Essence SOIF objects to create inverted index entries.
- NEBULA
- Supports a hierarchical classification scheme → automatic Yahoo classification.
- Provides views (pre-computed query responses) → basically vector clusters w.r.t. a given query.
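The idea of building inverted index entries from attribute-value summaries can be sketched as follows. This is a generic inverted index, not GLIMPSE's actual index structure; plain dictionaries stand in for parsed SOIF objects, and the function name, URLs, and attribute names are hypothetical.

```python
import re
from collections import defaultdict


def build_index(records):
    """Map each lowercase term to the set of URLs whose summary contains it.

    `records` maps a URL to a dict of attribute name -> text value.
    """
    index = defaultdict(set)
    for url, attributes in records.items():
        for value in attributes.values():
            for term in re.findall(r"[a-z0-9]+", value.lower()):
                index[term].add(url)
    return index
```

A query for a single term is then a dictionary lookup, and a conjunctive query is the intersection of the URL sets for its terms.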
23Harvest NG Architecture
- User agents: goal directed
- extraction, analysis,
- even dialog
- Meta Brokers: meta search
- collection/query fusion
- Brokers (Index, Search)
- Gatherers: gathering and extracting (SOIF)
- Finders (Spiders): locate pages
- Content Provider (Web content pages)