CMP 788 Distributed IR - PowerPoint PPT Presentation

Slides: 24
Provided by: GJu
Transcript and Presenter's Notes

1
CMP 788 Distributed IR
  • Part 2, Lecture 8
  • Harvest, Part 3
  • Harvest Replication and Object Caching
  • Fall 05
  • Department of Mathematics and Computer Science
  • Lehman College, CUNY

2
Broker
3
SOIF Example
4
Replicator and Object Cache
  • The Harvest Replicator can be used to replicate
    servers, to enhance user-base scalability.
  • For example, the HSR (Harvest Server Registry)
    will likely become heavily replicated, since it
    acts as a point of first contact for searches
    and new server deployment efforts.
  • The Replication subsystem can also be used to
    divide the gathering process among many servers
    (e.g., letting one server index each U.S.
    regional network), distributing the partial
    updates among the replicas.
  • The Harvest Object Cache reduces network load,
    server load, and response latency when accessing
    located information objects.
  • An LFU (Least Frequently Used) cache replacement
    strategy is often used.
  • Hierarchical caching: if the target object is not
    in the local cache, use the subnet's cache (often
    provided by firewall software) or a parent cache
    (a larger cache stored on a server shared by many
    machines).
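The LFU replacement and the fall-through to a parent cache can be sketched as follows. This is a minimal illustration and not Harvest's actual code: the class name LFUCache, the fetch_from_home callback, and the capacities are assumptions made for the example.

```python
from collections import Counter


class LFUCache:
    """Toy LFU cache that falls back to a parent cache on a miss."""

    def __init__(self, capacity, parent=None):
        self.capacity = capacity
        self.parent = parent      # next cache up the hierarchy, if any
        self.store = {}           # url -> object
        self.hits = Counter()     # url -> access count

    def get(self, url, fetch_from_home):
        if url in self.store:
            self.hits[url] += 1
            return self.store[url]
        # Miss: ask the parent cache if there is one, else the home site.
        if self.parent is not None:
            obj = self.parent.get(url, fetch_from_home)
        else:
            obj = fetch_from_home(url)
        self._insert(url, obj)
        return obj

    def _insert(self, url, obj):
        if len(self.store) >= self.capacity:
            # Evict the least frequently used entry.
            victim = min(self.store, key=lambda u: self.hits[u])
            del self.store[victim]
            del self.hits[victim]
        self.store[url] = obj
        self.hits[url] = 1


home_fetches = []


def fetch_from_home(url):
    # Hypothetical stand-in for a real network fetch.
    home_fetches.append(url)
    return "body of " + url


parent = LFUCache(capacity=100)
local = LFUCache(capacity=2, parent=parent)
for url in ["a", "a", "b", "c", "b"]:
    local.get(url, fetch_from_home)
```

In this run the second request for "b" misses locally but hits the parent, so the home site is contacted only once per URL.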

5
Hierarchical Cache Arrangement
6
Caching Subsystem
7
Caching Resolution Protocol
  • Each cache in the hierarchy independently decides
    whether to fetch the reference from the object's
    home site or from its parent or sibling caches,
    using a simple resolution protocol.
  • If the URL contains any of a configurable list of
    substrings, then the object is fetched directly
    from the object's home, rather than through the
    cache hierarchy.
  • This feature is used to resolve non-cacheable
    URLs (e.g., cgi-bin, password-protected objects).
  • If the URL's domain name matches a configurable
    list of substrings, then the object is resolved
    through the particular parent bound to that
    domain.
  • If a cache receives a request for a URL that
    misses, it performs a remote procedure call to
    all of its siblings and parents, checking if the
    URL hits any sibling or parent.
  • The cache retrieves the object from the site with
    the lowest measured latency.
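The decision rules on this slide can be sketched as a single function. The substring list, the domain binding, and the host names below are hypothetical, and falling back to the home site when no neighbor reports a hit is a simplifying assumption of this sketch.

```python
from urllib.parse import urlparse

# Hypothetical configuration values, for illustration only.
DIRECT_SUBSTRINGS = ["cgi-bin", "/private/"]
DOMAIN_PARENTS = {".edu": "edu-parent.example"}


def choose_source(url, neighbor_latency_ms, neighbor_has_object):
    """Return the site from which a locally missing URL is fetched."""
    host = urlparse(url).hostname or ""
    # Rule 1: URLs on the substring list bypass the cache hierarchy.
    if any(s in url for s in DIRECT_SUBSTRINGS):
        return host
    # Rule 2: a domain match binds the URL to one particular parent.
    for fragment, parent in DOMAIN_PARENTS.items():
        if fragment in host:
            return parent
    # Rule 3: among neighbors reporting a hit, take the lowest latency.
    hits = [n for n in neighbor_latency_ms if neighbor_has_object.get(n)]
    if hits:
        return min(hits, key=neighbor_latency_ms.get)
    return host  # assumption: nobody has it, so go to the home site


latency = {"sib1": 40, "par1": 15, "sib2": 5}
has_obj = {"sib1": True, "par1": True, "sib2": False}
chosen = choose_source("http://www.example.org/a.html", latency, has_obj)
```

Here sib2 is the closest neighbor but does not hold the object, so par1 is chosen.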

8
Caching Resolution Protocol (contd)
  • Hierarchies as deep as three caches add little
    noticeable access latency.
  • The only case where the cache adds noticeable
    latency is when one of its parents fails, but the
    child cache has not yet detected it.
  • In this case, references to this object are
    delayed by two seconds, the parent-to-child cache
    timeout.
  • As the hierarchy deepens, the root caches become
    responsible for more and more clients.
  • To keep root cache servers from becoming
    overloaded, the Harvest hierarchy terminates at
    the first place in the regional or backbone
    network where bandwidth is plentiful.

9
Caching Resolution Protocol (contd)
  • Additionally, a cache option can be enabled that
    tricks the referenced URL's home site into
    taking part in the resolution protocol.
  • This option allows the cache to retrieve the
    object from the home site if it happens to be
    closer than any of the sibling or parent caches.
  • Closeness can be estimated from object access
    latency, measured using network ping or echo.
  • A cache resolves a reference through the first
    sibling, parent, or home site to return a UDP
    Hit packet (the home site replies through its
    echo port).
  • If all caches miss and no Hit arrives within two
    seconds, the reference is resolved through the
    first parent to return a UDP Miss message.
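The "first Hit wins, else first parent Miss" rule can be sketched over a list of simulated replies. The tuple layout (arrival time, source, role, kind) is an assumption made for this example; only the two-second timeout comes from the slides.

```python
TIMEOUT_S = 2.0  # reply timeout given in the slides


def resolve(replies):
    """Pick the source to fetch from, given simulated UDP replies.

    Each reply is (arrival_s, source, role, kind), where role is
    "sibling", "parent", or "home" and kind is "HIT" or "MISS".
    """
    in_time = sorted(r for r in replies if r[0] <= TIMEOUT_S)
    # The first Hit to arrive wins, whoever sent it.
    for _, source, _, kind in in_time:
        if kind == "HIT":
            return source
    # No Hit arrived in time: use the first parent that answered Miss.
    for _, source, role, kind in in_time:
        if role == "parent" and kind == "MISS":
            return source
    return None  # nothing usable arrived before the timeout


replies = [
    (0.30, "sib1", "sibling", "MISS"),
    (0.12, "par1", "parent", "MISS"),
    (0.20, "home", "home", "HIT"),
]
with_home_echo = resolve(replies)
without_home_echo = resolve(replies[:2])
```

With the echoed Hit from the home site the object is fetched from home; without it, the request goes through par1.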

10
Caching Resolution Protocol (contd)
  • The cache will not wait for a home machine to
    time out; it will begin transmitting as soon as
    all of the parent and sibling caches have
    responded.
  • The resolution protocol's goal is for a cache to
    resolve an object through the source (cache or
    home) that can provide it most efficiently.
  • This protocol is really a simple heuristic:
  • A fast response to a ping indicates low latency.
  • But bandwidth is more important for large objects.
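A small worked calculation shows why a ping-based choice can be wrong for large objects. It uses the rough model "fetch time = latency + size / bandwidth"; the two sources and all numbers are invented for illustration.

```python
def fetch_time_s(latency_s, bandwidth_bytes_per_s, size_bytes):
    """Rough fetch time: one latency delay plus the transfer time."""
    return latency_s + size_bytes / bandwidth_bytes_per_s


# Hypothetical sources: one close but slow, one distant but fast.
near_slow = {"latency_s": 0.010, "bandwidth_bytes_per_s": 50_000}
far_fast = {"latency_s": 0.100, "bandwidth_bytes_per_s": 1_000_000}

small, large = 2_000, 5_000_000  # object sizes in bytes

small_near = fetch_time_s(size_bytes=small, **near_slow)
small_far = fetch_time_s(size_bytes=small, **far_fast)
large_near = fetch_time_s(size_bytes=large, **near_slow)
large_far = fetch_time_s(size_bytes=large, **far_fast)
```

For the small object the low-latency source is faster, while for the large object the high-bandwidth source finishes far sooner even though it answers a ping more slowly.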

11
Non-cacheable Objects/Security
  • The wide variety of Internet information systems
    leads to a number of cases where objects should
    not be cached.
  • Objects that are password protected are not
    cached. Rather, the cache acts as an application
    gateway and discards the retrieved object as soon
    as it has been delivered. This can resolve
    security and privacy problems.
  • Output of CGI-bin programs (server-side scripts)
    is not cached.
  • The cache may limit the size of the largest
    cacheable object, so that a few large FTP objects
    do not purge ten thousand smaller objects from
    the cache.
  • The caching subsystem does not prevent servers
    from encrypting their documents or applying
    digital signatures to them.
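These rules amount to a small admission policy, sketched below. The size limit and the substring list are hypothetical configuration values, not Harvest defaults.

```python
MAX_OBJECT_BYTES = 4 * 1024 * 1024   # hypothetical size limit
STOP_SUBSTRINGS = ("cgi-bin",)        # hypothetical non-cacheable markers


def is_cacheable(url, size_bytes, password_protected):
    """Decide whether a retrieved object may be kept in the cache."""
    if password_protected:
        return False   # deliver it, then discard it
    if any(s in url for s in STOP_SUBSTRINGS):
        return False   # script output is generated per request
    return size_bytes <= MAX_OBJECT_BYTES
```

An object that fails the check is still delivered to the client; it is simply not stored afterwards.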

12
Cache Updates Problem
  • Problems with caching:
  • It is difficult to know whether a cached object
    has been updated before its next use without
    checking (at least with a HEAD request).
  • The Web has no integrated mechanism for a
    remotely forced cache flush.
  • Expiration can be controlled by an object header
    (e.g., Expires: 0 or Expires: Thu, 16 May 2001
    14:40:30 GMT).
  • This mechanism only supports predictive
    expiration (it says in advance how long a copy
    may be used).
  • But what if the object changes unexpectedly
    before expiration, or persists unchanged after
    the specified time?
  • Cache updates:
  • Based on data access efficiency: use a log of
    usage statistics (LRU, Least Recently Used),
    triggered by a cache server.
  • A cache consistency problem may occur when more
    than one cache server maintains a copy of the
    same object.
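Predictive expiration can be sketched as a freshness check against the Expires header. Treating an unparsable value such as "0" as already expired is a common convention and is assumed here; the function name is_fresh is hypothetical.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def is_fresh(expires_header, now):
    """Return True if a cached copy may still be used at time `now`."""
    try:
        expires = parsedate_to_datetime(expires_header)
    except (TypeError, ValueError):
        return False   # e.g. "0": not a date, treat as already expired
    if expires.tzinfo is None:
        # No usable zone information: assume UTC for this sketch.
        expires = expires.replace(tzinfo=timezone.utc)
    return now < expires


header = "Thu, 16 May 2001 14:40:30 GMT"
before = datetime(2001, 5, 16, 12, 0, 0, tzinfo=timezone.utc)
after = datetime(2001, 5, 17, 12, 0, 0, tzinfo=timezone.utc)
fresh_before = is_fresh(header, before)
fresh_after = is_fresh(header, after)
fresh_zero = is_fresh("0", before)
```

The check says nothing about whether the object actually changed at its home site, which is exactly the limitation raised above.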

13
The Economy of Cache Updates
14
Negative Caching
  • To reduce the costs of repeated failures,
    negative caching is used.
  • When a DNS lookup failure occurs, Harvest caches
    the negative result for five minutes (chosen
    because transient Internet conditions are
    typically resolved this quickly).
  • When an object retrieval failure occurs, Harvest
    caches the negative result for a parameterized
    period of time, with a default of five minutes.
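Negative caching can be sketched as a table of recent failures with a time-to-live. The five-minute default follows the slide; the class and method names are assumptions, and times are passed in explicitly to keep the example deterministic.

```python
class NegativeCache:
    """Remember recent failures so they are not retried immediately."""

    def __init__(self, ttl_s=300.0):      # five minutes
        self.ttl_s = ttl_s
        self.failed_at = {}               # key -> time of failure (s)

    def record_failure(self, key, now_s):
        self.failed_at[key] = now_s

    def is_known_bad(self, key, now_s):
        t = self.failed_at.get(key)
        if t is None:
            return False
        if now_s - t >= self.ttl_s:
            del self.failed_at[key]       # entry expired, allow a retry
            return False
        return True


neg = NegativeCache()
neg.record_failure("no-such-host.example", now_s=0.0)
blocked_soon = neg.is_known_bad("no-such-host.example", now_s=60.0)
blocked_later = neg.is_known_bad("no-such-host.example", now_s=301.0)
```

One minute after the failure the lookup is answered from the negative cache; after five minutes the entry has expired and the lookup is attempted again.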

15
Replication Subsystem
  • Motivations
  • We would like to have (complete) regional copies,
    with a mechanism to ensure actively consistent
    updates.
  • mirror-d (a replication tool for Harvest that
    uses FTP mirror)

[Figure: replication among site1, site2, and site3. Thin black: mirror copies. Thick gray: locally maintained master copies.]
16
Replication Subsystem (contd)
  • Active consistent updates:
  • If a server changes its master copy, it notifies
    mirror sites.
  • Harvest supports replication domains:
  • Mirroring within a domain is carefully
    coordinated and synchronized.
  • Mirroring/replication between domains involves
    gradual propagation of changes (between the sites
    responsible for inter-domain communication).
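Active notification inside one replication domain can be sketched as follows. This is a toy model and not mirror-d's actual protocol: the Site class, its fields, and the version counter are assumptions made for the example.

```python
class Site:
    """Toy site that owns one master copy and mirrors other sites."""

    def __init__(self, name):
        self.name = name
        self.version = 0      # version of this site's own master copy
        self.mirrors = []     # sites in the same domain that mirror it
        self.mirrored = {}    # master name -> last version notified

    def update_master(self):
        # Change the master copy, then actively notify every mirror.
        self.version += 1
        for mirror in self.mirrors:
            mirror.on_notify(self.name, self.version)

    def on_notify(self, master_name, version):
        self.mirrored[master_name] = version


site1, site2, site3 = Site("site1"), Site("site2"), Site("site3")
site1.mirrors = [site2, site3]
site1.update_master()
site1.update_master()
```

After two updates both mirrors know that site1's master copy is at version 2, without having polled for it.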

17
Replication Subsystem (contd)
  • mirror-d replication tool: maintains a weakly
    consistent, replicated tree of files.
  • Motivation: multiple copies for future access
    (e.g., Europe, North America) → replication
    domain.
  • Problem: maintaining data consistency.
  • Logical topology:
  • Replication subgroups coordinate consistency
    and share updates internally within the subgroup
    domain.
  • Physical issues (network bandwidth/usage):
  • These help determine how replication domains
    propagate (flood) updates among their neighbors
    → flood handling is an issue (e.g., as in
    peer-to-peer networks).
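Flooding an update among neighboring replication domains, with duplicate suppression by version number, can be sketched like this. The four-node topology and the message counting are illustrative assumptions, not a description of mirror-d internals.

```python
from collections import deque


def flood(neighbors, origin, version, known):
    """Flood `version` from `origin`; return the number of messages sent.

    neighbors: node -> list of neighboring nodes
    known:     node -> highest version seen so far (updated in place)
    """
    sent = 0
    known[origin] = version
    queue = deque([origin])
    while queue:
        node = queue.popleft()
        for peer in neighbors[node]:
            sent += 1
            # A peer forwards the update only the first time it sees it.
            if known.get(peer, 0) < version:
                known[peer] = version
                queue.append(peer)
    return sent


# Hypothetical logical topology of four replication domains.
neighbors = {
    "A": ["B", "C"],
    "B": ["A", "C"],
    "C": ["A", "B", "D"],
    "D": ["C"],
}
known = {}
messages = flood(neighbors, "A", version=1, known=known)
```

Every domain receives the update, but eight messages are sent to inform three domains, which illustrates the bandwidth cost of naive flooding.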

18
Replication Group
19
Replication Group (contd)
  • Although replication domain members are stable,
    pathways for inter-domain communication may
    change based on dynamic properties of server load
    and bandwidth.

20
Physical vs. Logical Topology
  • The logical inter-domain network topology is a
    subset of the full physical topology (and is
    dynamically re-configurable based on network load
    and bandwidth).

21
Replication in Broker World
22
Index Subsystem
  • Please refer to the papers for an understanding
    of the indexing subsystem.
  • GLIMPSE:
  • Uses Essence SOIF objects to create inverted
    index entries.
  • NEBULA:
  • Supports a hierarchical classification scheme →
    automatic Yahoo-style classification.
  • Provides views (pre-computed query responses) →
    basically vector clusters w.r.t. a given query.
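The idea of turning attribute-value summaries into inverted index entries can be sketched as follows. This is a generic inverted index, not GLIMPSE's actual index structure, and the records are plain dictionaries standing in for SOIF objects; the URLs and attribute values are invented.

```python
import re
from collections import defaultdict


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(records):
    """Map each term to the set of URLs whose summary contains it."""
    index = defaultdict(set)
    for url, attrs in records.items():
        for value in attrs.values():
            for term in tokenize(value):
                index[term].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every term of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical summaries, one dictionary per gathered object.
records = {
    "http://a.example/cache.html": {"Title": "Object cache design",
                                    "Keywords": "cache hierarchy"},
    "http://b.example/repl.html": {"Title": "Replication design",
                                   "Keywords": "mirror consistency"},
}
index = build_index(records)
hits = search(index, "cache design")
```

Both documents contain "design", but only the first also contains "cache", so the conjunctive query returns a single URL.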

23
Harvest NG Architecture
  • User agents: goal-directed extraction, analysis,
    even dialog
  • Meta Brokers: meta search, collection/query
    fusion
  • Brokers (Index, Search)
  • Gatherers: gathering and extracting (SOIF)
  • Finders (Spiders): locate pages
  • Content Provider (Web content pages)