Caching and Data Consistency in P2P

Provided by: MikeZ156

1
Caching and Data Consistency in P2P

Dai Bing Tian
Zeng Yiming

2
Caching and Data Consistency

Why Caching?
Caching helps use bandwidth more efficiently
The data consistency in this topic is different
from the consistency in distributed databases
It refers to the consistency between the cached
copy and the data on the servers.

3
Introduction

Caching is built on existing P2P architectures
such as CAN, BestPeer, and Pastry.
The caching layer sits between the application
layer and the P2P layer.
Every peer has its own cache control unit and
local cache, and publishes its cache contents

4
Presentation Order

We will present four papers:
Squirrel
PeerOLAP
Caching for Range Queries
With CAN
With DAG

5
Overview

Paper         Based on        Caching   Consistency
Squirrel      Pastry          Yes       Yes
PeerOLAP      BestPeer        Yes       No
RQ with CAN   CAN             Yes       Yes
RQ with DAG   Not Specified   Yes       Yes

6
Squirrel

Enables web browsers on desktop machines to share
their local caches
Uses Pastry, a self-organizing peer-to-peer
network, as its object location service
Pastry is fault resilient, and so is Squirrel

7
Web Caching

Web browsers generate HTTP GET requests
If the object is in the local cache, return it if
it is fresh enough
Freshness can be checked by submitting a
conditional GET (cGET) request
If there is no such object, issue a GET request
to the server
For simplicity, we assume all objects are cacheable

8
Home Node

As described in Pastry, every peer (node) has a
nodeID
objectID = SHA-1(object URL)
The object is assigned to the node whose nodeID is
numerically nearest to the objectID
The node that owns the object is called the home
node of the object
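
The home-node rule can be sketched as follows. This is a minimal illustration, not Squirrel's implementation: it assumes the full 160-bit SHA-1 digest is used as the identifier and that "numerically nearest" is measured on a circular ID space. The names object_id, ring_distance and home_node are hypothetical.

```python
import hashlib

ID_BITS = 160              # assumption: use the whole SHA-1 digest as the ID
ID_SPACE = 1 << ID_BITS    # size of the circular identifier space


def object_id(url: str) -> int:
    """objectID = SHA-1(object URL), interpreted as an unsigned integer."""
    return int.from_bytes(hashlib.sha1(url.encode("utf-8")).digest(), "big")


def ring_distance(a: int, b: int) -> int:
    """Distance between two IDs on the ring (assumed wrap-around space)."""
    d = abs(a - b)
    return min(d, ID_SPACE - d)


def home_node(url: str, node_ids):
    """Return the nodeID numerically nearest to the object's ID."""
    oid = object_id(url)
    return min(node_ids, key=lambda n: ring_distance(n, oid))
```

In a real deployment the nearest node is found by Pastry routing rather than by scanning a list of all nodeIDs; the scan here only makes the assignment rule concrete.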

9
Two approaches

Squirrel has two approaches:
Home-store
Directory
Home-store stores the object directly in the
cache of the home node
Directory stores, at the home node, pointers to
the nodes that have the object in their caches;
these nodes are called delegates
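
A rough sketch of how the two kinds of home node differ. Freshness checks are left out, and picking the first delegate is an assumption made to keep the example deterministic. The class and method names are hypothetical.

```python
class HomeStoreNode:
    """Home node that keeps the object itself (freshness checks omitted)."""

    def __init__(self):
        self.cache = {}  # url -> object body

    def handle(self, url, fetch_from_origin):
        # Fetch from the origin server only on a miss, then serve the copy.
        if url not in self.cache:
            self.cache[url] = fetch_from_origin(url)
        return self.cache[url]


class DirectoryNode:
    """Home node that keeps only pointers to delegates holding the object."""

    def __init__(self):
        self.directory = {}  # url -> list of delegate names

    def handle(self, url, requester):
        delegates = self.directory.setdefault(url, [])
        # Assumption: redirect to the first known delegate; with no
        # directory entry the requester must go to the origin server.
        source = delegates[0] if delegates else "origin"
        # The requester will hold a copy afterwards, so record it.
        if requester not in delegates:
            delegates.append(requester)
        return source
```

The home-store node pays for storage itself, while the directory node stores only small pointers and lets the requesters' own caches hold the data.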

10
Home-store
[Figure: home-store message flow. A "Request for A"
from the requester is routed through Pastry over
the LAN to the home node. The home node asks the
origin server across the WAN "Is my copy of A
fresh?"; the reply is either "Yes, it is fresh" or
"Send A over".]
11
Directory
[Figure: directory message flow. A "Request for A"
is routed through Pastry over the LAN to the home
node. With no directory, the home node answers "Get
it from Server"; the requester fetches A from the
origin server across the WAN and reports "I'm your
delegate". With a directory, the home node answers
"Get it from D", updates its meta-info and keeps
the directory ("Requester and I are your
delegates"); the delegate D sends A over, asking
the origin server "Is my copy of A fresh?" when
needed.]
12
Conclusion

The home-store approach is less complicated, but
it does not involve any collaboration
The directory approach is more collaborative: by
pointing the directory at peers with larger cache
capacity, it can store more objects on those peers

13
PeerOLAP

An Online Analytical Processing (OLAP) query
typically involves large amounts of data
Each peer has a cache containing some results
An OLAP query can be answered by combining
partial results from many peers
PeerOLAP acts as a large distributed cache

14
Data Warehouse & Chunk

A data warehouse is based on a multidimensional
data model which views data in the form of a data
cube.
(Han & Kamber)

http://www.cs.sfu.ca/~han/dmbook
15
PeerOLAP network

LIGLO servers provide global name lookup and
maintain a list of active peers

Except for LIGLO servers, the network is fully
distributed without any centralized
administration point

16
Query Processing

Assumption 1: Only chunks at the same aggregation
level as the query are considered
Assumption 2: The selection predicates are a
subset of the group-by predicates

17
Cost Model

Every chunk is associated with a cost value,
indicating how long it takes to obtain the chunk

18
Eager Query Processing (EQP)

Peer P sends requests for the missing chunks to
all its neighbors, Q1, Q2, ..., Qk
Each Qi offers as many of the desired chunks as
it can, replying to P with a cost associated with
each chunk
Qi then propagates the request to all its
neighbors recursively
To avoid flooding, hmax is set to limit
the depth of the search
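
The depth-limited flood can be sketched as a breadth-first search. This is an illustration only: the hop count stands in for the chunk cost (PeerOLAP's actual cost model is not reproduced here), and the function name and dictionary-based inputs are hypothetical.

```python
from collections import deque


def eager_search(start, neighbors, cached, wanted, hmax):
    """Flood a chunk request up to hmax hops from `start`.

    neighbors: peer -> list of neighboring peers
    cached:    peer -> set of chunks held in that peer's cache
    wanted:    set of missing chunks
    Returns chunk -> list of (peer, hops) offers, with hops used as a
    stand-in for cost (an assumption of this sketch).
    """
    offers = {}
    first_hop = list(neighbors.get(start, ()))
    visited = {start, *first_hop}
    queue = deque((q, 1) for q in first_hop)
    while queue:
        peer, depth = queue.popleft()
        # Each peer offers every wanted chunk it holds.
        for chunk in wanted & cached.get(peer, set()):
            offers.setdefault(chunk, []).append((peer, depth))
        # Propagate further only while the depth limit allows it.
        if depth < hmax:
            for nxt in neighbors.get(peer, ()):
                if nxt not in visited:
                    visited.add(nxt)
                    queue.append((nxt, depth + 1))
    return offers
```

Chunks that no peer within hmax hops can offer are missing from the result and would be requested from the warehouse.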

19
EQP (Contd.)

P collects (chunk, cost) pairs from all its
neighbors
P randomly selects one chunk ci and finds the
peer Qi that can provide it at the lowest cost
For each subsequent chunk, it takes the cheaper
of two cases: the lowest-cost peer, which may not
be connected yet, or an already connected peer
that can also provide the chunk
It asks these peers for their chunks and fetches
the remaining missing chunks from the warehouse.

20
Lazy Query Processing (LQP)

Instead of propagating the request from each Qi
to all its neighbors, each Qi selects its most
beneficial neighbor and forwards the request to it.
If the expected number of neighbors per peer
is k, EQP visits O(k^hmax) nodes, while LQP
visits only O(k * hmax)
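
The gap between the two bounds can be made concrete with a small count. The sketch assumes every peer has exactly k neighbors and that no peer is reached twice; the function names are hypothetical.

```python
def eqp_visits(k: int, hmax: int) -> int:
    """Peers reached when every peer forwards to all k neighbors."""
    return sum(k ** depth for depth in range(1, hmax + 1))


def lqp_visits(k: int, hmax: int) -> int:
    """Peers reached when P asks k neighbors and each chain then
    forwards to a single neighbor per hop."""
    return k * hmax
```

With k = 4 and hmax = 3, the eager count is 4 + 16 + 64 = 84 peers, against 12 for the lazy variant.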

21
Chunk Replacement

Least Benefit First (LBF)

Similar to LRU; every chunk has a weight
Once a chunk is used by P, its weight is reset
to the original benefit value
Every time a new chunk comes in, the
weights of the old chunks are reduced
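
A minimal sketch of the weight bookkeeping described above. The slide does not say by how much the weights drop or how a victim is admitted, so the fixed decay per insertion and the "evict the smallest weight when full" rule are assumptions of this example, as is the class name LBFCache.

```python
class LBFCache:
    """Least-Benefit-First style cache (simplified, see assumptions)."""

    def __init__(self, capacity, decay=1.0):
        self.capacity = capacity
        self.decay = decay      # assumed fixed reduction per new chunk
        self.benefit = {}       # chunk -> original benefit value
        self.weight = {}        # chunk -> current weight

    def use(self, chunk):
        # A hit resets the weight to the original benefit value.
        self.weight[chunk] = self.benefit[chunk]

    def insert(self, chunk, benefit):
        """Add a chunk; return the evicted chunk, or None."""
        if chunk in self.weight:
            self.use(chunk)
            return None
        # Every arrival reduces the weights of the chunks already held.
        for old in self.weight:
            self.weight[old] -= self.decay
        evicted = None
        if len(self.weight) >= self.capacity:
            evicted = min(self.weight, key=self.weight.get)
            del self.weight[evicted]
            del self.benefit[evicted]
        self.benefit[chunk] = benefit
        self.weight[chunk] = benefit
        return evicted
```

Chunks that are used often keep getting reset to their full benefit, so the chunks that are both rarely used and cheap to refetch drift to the bottom and are replaced first.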

22
Collaboration

LBF gives the local chunk replacement algorithm
3 variations of global behavior:
Isolated Caching Policy: non-collaborative
Hit Aware Caching Policy: collaborative
Voluntary Caching: highly collaborative

23
Network Reorganization

Optimization can be done by creating virtual
neighborhoods of peers with similar query
patterns
So that there is a high probability for P to get
missing chunks directly from neighbors
Each connection is assigned a benefit value and
the most beneficial connections are selected to
be the peer's neighbors

24
Conclusion

PeerOLAP is a distributed caching system for OLAP
results
By sharing the contents of individual caches,
PeerOLAP constructs a large virtual cache which
can benefit all peers
PeerOLAP is fully distributed and highly scalable

25
Caching For Range Queries

Range Query
E.g.
SELECT Student.name
FROM Student
WHERE Student.age > 20 AND Student.age < 30
Why cache?
Data source too far away from the requesting node
Data source overloaded with queries
Data source is a single point of failure
What to cache?
All tuples falling in the range
Who caches?
Peers responsible for the range

26
Problem Definition

Given a relation R, and a range attribute A, we
assume that the results of prior range-selection
queries of the form R.A(LOW, HIGH) are stored at
the peers. When a query is issued at a peer which
requires the retrieval of tuples from R in the
range R.A(low, high), we want to locate a peer in
the system which already stores tuples that can
be accessed to compute the answer.

27
A P2P Framework for Caching Range Queries

Based on CAN.
Map data into a 2d-dimensional virtual space,
where d is the number of dimensions (attributes)
of the relation.
For every dimension/attribute, say its domain is
[a, b], it is mapped to a square virtual hash
space whose corner coordinates are (a,a), (b,a),
(b,b) and (a,b).
The virtual hash space is further partitioned
into rectangular areas, each of which is called a
zone.

28
Example

Virtual hash space for an attribute whose domain
is [10,70]
zone-1: <(10,56),(15,70)>
zone-5: <(10,48),(25,56)>
zone-8: <(47,10),(70,54)>

29
Terminology

Each zone is assigned to a peer.
Active Peer
Owns a zone
Passive Peer
Does not participate in the partitioning;
registers itself with an active peer
Target Point
A range [low,high] is hashed to the point with
coordinates (low,high)
Target Zone
The zone where the target point resides
Target Node
The peer that owns the target zone
Stores the tuples falling into the ranges that
are mapped to its zone
Caches the tuples in the local cache OR
Stores a pointer to the peer that caches the tuples
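
The terminology can be tied together with a small lookup sketch. It uses the three zones from the [10,70] example, written as (lower-left, upper-right) corners; treating zone borders as inclusive is an assumption, and the function names are hypothetical.

```python
def target_point(low, high):
    """A range [low, high] hashes to the point (low, high)."""
    if low > high:
        raise ValueError("low must not exceed high")
    return (low, high)


def contains(zone, point):
    """True if the point lies in the zone (borders assumed inclusive)."""
    (x0, y0), (x1, y1) = zone
    return x0 <= point[0] <= x1 and y0 <= point[1] <= y1


def target_zone(zones, low, high):
    """Name of the zone holding the target point, or None if the zone
    is not among the ones listed."""
    point = target_point(low, high)
    for name, zone in zones.items():
        if contains(zone, point):
            return name
    return None


# The zones named in the example for the domain [10,70].
EXAMPLE_ZONES = {
    "zone-1": ((10, 56), (15, 70)),
    "zone-5": ((10, 48), (25, 56)),
    "zone-8": ((47, 10), (70, 54)),
}
```

For instance, the range [12,60] hashes to the point (12,60), which lies in zone-1, so the peer owning zone-1 is its target node.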

30
Zone Maintenance

Initially, only the data source is the active
node and the entire virtual hash space is its
zone
A zone split happens under two conditions
Heavy Answering Load
Heavy Routing Load

31
Example of Zone Splits

If a zone has too many queries to answer
It finds the x-median and y-median of the stored
results, then determines whether a split at the
x-median or the y-median gives a more even
distribution of the stored answers and the space.
If a zone is overloaded because of routing
queries
It splits the zone at the midpoint of the
longer side.
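
The routing-load split is simple enough to sketch directly. Zones are written as (lower-left, upper-right) corners; splitting along x when both sides are equal is an arbitrary choice of this example, and the function name is hypothetical.

```python
def split_for_routing_load(zone):
    """Split a zone in two at the midpoint of its longer side."""
    (x0, y0), (x1, y1) = zone
    if (x1 - x0) >= (y1 - y0):
        mid = (x0 + x1) / 2
        return ((x0, y0), (mid, y1)), ((mid, y0), (x1, y1))
    mid = (y0 + y1) / 2
    return ((x0, y0), (x1, mid)), ((x0, mid), (x1, y1))
```

Applied to zone-8 from the earlier example, <(47,10),(70,54)>, the longer side is the vertical one (44 against 23), so the zone is cut at y = 32.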

32
Answering A Range Query

If an active node poses the query, the query is
initiated from the corresponding zone; if a
passive node poses the query, it contacts any
active node, from which the query starts routing.
2 steps are involved:
Query Routing
Query Forwarding

33
Query Routing

If the target point falls in this zone
Return this zone
Else
Route the query to the neighbor who is closest
to the target point
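
The routing rule above can be sketched as a greedy walk over zones. The distance measure (Euclidean distance from the target point to the nearest edge of a zone), the step limit, and the four-zone layout used in the usage note are assumptions of this example, not details from the paper.

```python
import math


def contains(zone, point):
    (x0, y0), (x1, y1) = zone
    return x0 <= point[0] <= x1 and y0 <= point[1] <= y1


def distance(zone, point):
    """Distance from the point to the zone rectangle (0 if inside)."""
    (x0, y0), (x1, y1) = zone
    dx = max(x0 - point[0], 0, point[0] - x1)
    dy = max(y0 - point[1], 0, point[1] - y1)
    return math.hypot(dx, dy)


def route(zones, neighbors, start, target):
    """Return the list of zones visited from `start` to the target zone."""
    path = [start]
    current = start
    while not contains(zones[current], target):
        if len(path) > len(zones):
            raise RuntimeError("routing did not converge")
        # Hand the query to the neighbor closest to the target point.
        current = min(neighbors[current],
                      key=lambda n: distance(zones[n], target))
        path.append(current)
    return path


# Hypothetical layout: the space [10,70] x [10,70] cut into four zones.
ZONES = {
    "A": ((10, 10), (40, 40)), "B": ((40, 10), (70, 40)),
    "C": ((10, 40), (40, 70)), "D": ((40, 40), (70, 70)),
}
NEIGHBORS = {"A": ["B", "C"], "B": ["A", "D"],
             "C": ["A", "D"], "D": ["B", "C"]}
```

Calling route(ZONES, NEIGHBORS, "D", (26, 30)) walks from D through C to A, the zone that contains the target point (26,30).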

(26,30)
36
Forwarding

If the results are stored in the target node,
then the results are sent back to the querying
node
Else, zones lying in the upper left area of the
target point may still store the results, so we
need to forward the query to these zones too.

37
Example

If no results are found in zone-7, the shaded
region may still contain the results.
Reason: any prior range query q whose range
subsumes [x,y] must have been hashed into the
shaded region.

38
Forwarding (Cont.)

How far should it go?
For a range (low,high), we want to restrict to
results falling in (low - offset, high + offset),
where offset = AcceptableFit x |domain|.
AcceptableFit is in [0,1]
The shaded square defined by the target point and
offset is called the Acceptable Region
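
The acceptable-region test can be sketched as below. It assumes |domain| means the length of the attribute's domain and that a cached range is usable only if it also contains the query range; the function names are hypothetical.

```python
def offset_for(domain, acceptable_fit):
    """offset = AcceptableFit x |domain|, with AcceptableFit in [0, 1]."""
    lo, hi = domain
    if not 0 <= acceptable_fit <= 1:
        raise ValueError("AcceptableFit must lie in [0, 1]")
    return acceptable_fit * (hi - lo)


def is_acceptable(cached, query, domain, acceptable_fit):
    """True if the cached range contains the query range and its target
    point lies inside the acceptable region of the query."""
    off = offset_for(domain, acceptable_fit)
    (c_low, c_high), (q_low, q_high) = cached, query
    subsumes = c_low <= q_low and c_high >= q_high
    return subsumes and c_low >= q_low - off and c_high <= q_high + off
```

With the domain [10,70] and AcceptableFit = 0.25 the offset is 15, so for the query (26,30) a cached range (22,35) is acceptable, while (10,70) contains the query but starts below 26 - 15 = 11 and is rejected.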

39
Forwarding (Cont.)

Flood Forwarding
A naïve approach. Forward to the left and top
neighbors if they fall in the acceptable region
Directed Forwarding
Forward to the neighbor that maximally overlaps
with the acceptable region
Can bound the number of forwards by specifying a
limit d, which is decremented for every forward.

40
Discussion

Improvements
Lookup During Routing
Warm up queries
Peer soft-departure and failure events
Update: cache consistency
Say a tuple t with range attribute a = k is updated
in the data source; then the target zone of point
(k,k) and all zones lying in the upper left region
have to update their caches.
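
Which zones an update touches can be sketched with a rectangle test. A cached range [l,h] contains the value k exactly when l <= k <= h, that is, when its target point (l,h) lies on or to the upper left of (k,k). The inclusive borders, the function name, and the reuse of the three example zones are assumptions of this sketch.

```python
def zones_to_update(zones, k):
    """Zones that overlap the region {(x, y): x <= k and y >= k}.

    Zones are given as (lower-left, upper-right) corners.
    """
    return [name
            for name, ((x0, y0), (x1, y1)) in zones.items()
            if x0 <= k and y1 >= k]


# The zones named in the earlier example for the domain [10,70].
EXAMPLE_ZONES = {
    "zone-1": ((10, 56), (15, 70)),
    "zone-5": ((10, 48), (25, 56)),
    "zone-8": ((47, 10), (70, 54)),
}
```

For k = 20, zone-1 and zone-5 must refresh their caches, while zone-8 starts at x = 47 and cannot hold any range containing 20.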

41
Range Addressable Network: A P2P Cache
Architecture for Data Ranges

Assumption
Tuples stored in the system are labeled 1, 2, ..., N
according to the range attribute
A range [a,b] is a contiguous subset of
{1, 2, ..., N}, where 1 <= a <= b <= N
Objective
Given a query range [a,b], peers cooperatively
find results falling in the shortest superset of
[a,b], if they are cached somewhere.

42
Overview

Based on Range Addressable DAG (Directed Acyclic
Graph)
Map every active node in the P2P system to a
group of nodes in the DAG
A node is responsible for storing results and
answering queries falling into a specific range

43
Range Addressable DAG

The entire universe [1,N] is mapped to the root.
Recursively divide each node's range into 3
overlapping intervals of equal length.
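
One way to picture the recursive division is sketched below. The slide gives only "3 overlapping intervals of equal length", so the exact layout used here (each child covers half of its parent: the left half, the centered half, and the right half) is an assumption, as are the function names and the rule that ranges shorter than 4 are not divided further.

```python
def children(a, b):
    """Left, mid and right child intervals of the DAG node [a, b]."""
    half = (b - a + 1) // 2
    if half < 2:              # assumed stopping rule for short ranges
        return []
    quarter = half // 2
    return [(a, a + half - 1),
            (a + quarter, a + quarter + half - 1),
            (b - half + 1, b)]


def shortest_dag_range(q, root):
    """Descend from the root while some child still contains q."""
    node = root
    while True:
        fitting = [c for c in children(*node)
                   if c[0] <= q[0] and q[1] <= c[1]]
        if not fitting:
            return node
        node = fitting[0]
```

Under this layout the root [1,16] has children [1,8], [5,12] and [9,16], and the shortest DAG range containing the query [6,11] is [5,12].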

44
Range Lookup

Input: a query range q = [a,b],
a node v in the DAG
Output: the shortest range in the
DAG that contains q
boolean down = true
search(q, v)
  if q ⊄ i(v)
    search(q, parent(v))
  if q ⊆ i(child(v)) and down
    search(q, child(v))
  else
    if some range stored at v is a superset of q
      return the shortest range containing q that is
      stored at v or parent(v) (*)
    else
      down = false
      search(q, parent(v))

[Figure labels: ranges [7,13] and [5,12]; query Q = [7,10]]
45
Peer Protocol

Maps the logical DAG structure to physical peers
Two components
Peer Management
Handles peer joining, leaving, failure
Range Management
Deals with query routing and updates

46
Peer Management

It ensures that at any time,
every node in the DAG is assigned to some peer
the nodes belonging to one peer, called a zone,
form a connected component of the DAG
This is done by handling Join Requests, Leave
Requests, and Failure Events properly.

47
Join Request

The first peer joining the system takes over the
entire DAG
A new peer joining the system contacts one of the
peers in the system to take over one of its child
zones. Default strategy: left child, then mid
child, then right child.

51
Leave Request

When a peer wants to leave (soft departure), it
hands over its zone to the smallest neighboring
zone.
Neighboring zones: zones in which some node of
one is the parent or child of a node of the other

53
Failure Event

A zone maintains information on all its ancestors.
If it finds that one of its parents has failed, it
contacts the nearest live ancestor for zone
takeover.

54
Range Management

Range Lookup
Range Update
When a tuple is updated in the data source, we
locate the peer with the shortest range
containing that tuple, then update this peer and
all its ancestors.

55
Improvement

Cross Pointers
For a node v, if it is the left child of its
parent, then it keeps cross pointers to all the
left children of nodes that are in its parent's
level.
Similarly for the mid child.

56
Improvement (Cont.)

Load Balancing by Peer Sampling
Collapsed DAG: collapse each peer's zone to a
single node.
The system is balanced if the collapsed DAG is
balanced.
Lookup time is O(h), where h is the height of the
collapsed DAG. Hence a balanced system leads to
optimal performance.
When a new peer joins, it polls k peers randomly,
and sends its join request to the one whose zone
is rooted nearest to the root.
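
The sampling step can be sketched in a few lines. The depth_of callback (distance of a peer's zone root from the DAG root) and the function name are hypothetical; how a real peer would learn those depths is not covered here.

```python
import random


def choose_join_target(peers, depth_of, k, rng=random):
    """Poll up to k random peers and pick the one whose zone is rooted
    nearest to the DAG root (smallest depth)."""
    sample = rng.sample(peers, min(k, len(peers)))
    return min(sample, key=depth_of)
```

Sending joins toward shallow zones splits the zones near the top of the collapsed DAG first, which tends to keep its height, and therefore the lookup time, low.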

[Figure: DAG zones assigned to peers P1, P2, P3]
58
Conclusion