Title: ICS 214B: Transaction Processing and Distributed Data Management
1. ICS 214B: Transaction Processing and Distributed Data Management
- Lecture 18: Data Management in Peer-to-Peer Systems
- Professor Chen Li
- Based on slides developed by Beverley Yang and Hector Garcia-Molina
2. What is P2P?
[Figure: cloud of system names: Pastry, JXTA, CAN, Fiorano, Napster, Freenet, United Devices, OpenCola, AIM, OceanStore, NetMeeting, Farsite, Gnutella, ICQ, eBay, Morpheus, LimeWire, SETI@home, BearShare, UDDI, Groove, Jabber, Popular Power, KaZaA, Folding@home, Tapestry, Mojo Nation, Process Tree, Chord, ?]
3. Napster
[Figure labels: central server, ...]
4. Gnutella
5. PeerCast
[Figure labels: UCI, source]
6. What is Peer-to-Peer?
- Definition
- Nodes of equal roles exchanging information and services directly
- Is this a new idea?
- IP routing (1970s)
- Mariposa (1980s)
- Distributed databases!
- What are people really thinking?
7. Implicit Definition of P2P
- Scale: millions (billions?) of peers
- Nature of peers: PCs
- Application: lightweight semantics (e.g., file-sharing)
8. P2P vs. Distributed DBMS
- Traditional DDBMS Issues
- Transactions
- Network Partitions
- Distributed Query Optimization
- Interoperation of heterogeneous data sources
- Reliability/failure of nodes
- Complex features do not scale
9. P2P vs. Distributed DBMS
- Example application: file-sharing
- Simple data model and query language
- No complex query optimization
- Easy interoperation
- No guarantee on quality of results
- Individual site availability unimportant
- Local updates
- No transactions
- Network partitions OK
- Simple: amenable to a large-scale network of PCs
10. Potential Benefits
- Efficiency: harnessing unused resources
- Self-organizing
- Effectively sharing cost of ownership
- Robustness and availability through replication
- Anonymity/legal protection
11. Challenges
- No authority to enforce behavior
- Cooperation
- Unreliability of individual peers
- Efficiency of distributed operations (absolute resources)
12. Research Areas
- Resource Management
- Security
- Efficient Search
13. Resource Management
- Resource
- Storage/information
- CPU processing
- bandwidth
- Issues
- fairness
- load balancing
14. Example: Data Trading
[Figure: sites 1, 2, and 3 holding data items A1, C1, B1, A2, B2, C2]
15. Example: Data Trading
[Figure: sites 1, 2, and 3 holding data items A1, C1, B1, A2, B2, C2]
16. Data Trading
- Order of trades impacts availability
- Issues
- Swaps vs. Deeds
- Fixed price vs. bids
- Preference to
- sites with a lot of space?
- reliable sites?
- desperate sites?
17. Security
- Issues
- Reputation
- Trust
- Accountability
- Information Preservation
- Information Quality
- Denial of service attacks
- Problem: detecting and punishing bad behavior
18. Information Preservation
- Example policy: make 3 copies of documents
[Figure: document A1, "make copies"]
What can go wrong?
19. What Can Go Wrong?
- Bad site deletes copies
- Bad site alters copy
- Bad site publishes fake
- Bad site makes many copies at other sites
- ...
[Figure: copies of document A1, "make copies"]
20. Reputation Systems
- Peers evaluate each other
- Good reviews -> Good reputation
- Bad reviews -> Bad reputation
- No reviews -> ?
- Problems
- Trustworthiness of reviews
- Permanence of identity
21. Efficiency of Search
- Problem: finding a needle in a haystack
- Efficiency measured in terms of absolute resources consumed
22. Architecture
- Hybrid
- Centralized index, P2P file storage and transfer
- Super-peer
- A pure network of hybrid clusters
- Pure
- Functionality completely distributed
23. Goal
- Develop search techniques for loose systems that are
- Efficient
- Simple (easy to implement, no hidden costs)
- Realistically and thoroughly evaluated
24. Current Techniques: Gnutella
Breadth-First Search (BFS)
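As a rough illustration of the BFS flooding named on this slide, the sketch below floods a query outward from a source up to a depth limit and counts every forwarded copy. The names `bfs_search`, `graph`, `holdings`, and `max_depth` are hypothetical, the depth limit only stands in for a TTL, and for simplicity a node also sends a copy back toward the neighbor it heard the query from.

```python
from collections import deque

def bfs_search(graph, holdings, source, keyword, max_depth):
    """Flood a query outward from source, up to max_depth hops.

    graph: dict mapping node -> list of neighbors (hypothetical topology).
    holdings: dict mapping node -> set of keywords that node can answer.
    Returns (nodes with a match, number of query messages sent).
    """
    seen = {source}
    frontier = deque([(source, 0)])
    hits, messages = [], 0
    while frontier:
        node, depth = frontier.popleft()
        if node != source and keyword in holdings.get(node, set()):
            hits.append(node)
        if depth == max_depth:
            continue  # depth limit reached: do not forward further
        for neighbor in graph.get(node, []):
            messages += 1  # redundant copies still cost bandwidth
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return hits, messages

# Hypothetical four-node network in which only node "d" holds a match.
graph = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a", "d"], "d": ["b", "c"]}
holdings = {"d": {"jazz"}}
print(bfs_search(graph, holdings, "a", "jazz", 2))  # -> (['d'], 6)
```

Even on this tiny graph, one match costs six messages, which is the kind of aggregate cost the later slides try to reduce.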
25. Metrics
- Cost (aggregate)
- Bandwidth
- Processing Power
- Quality of Results
- Number of results
- Satisfaction (true if number of results > X, false otherwise)
- Time to satisfaction
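The two quality metrics can be made concrete with a small sketch. It assumes result arrivals are available as (time, count) pairs; that trace format and the function names are hypothetical, chosen only for illustration.

```python
def is_satisfied(num_results, threshold):
    """Satisfaction: true if the number of results exceeds the threshold X."""
    return num_results > threshold

def time_to_satisfaction(arrivals, threshold):
    """Return the arrival time at which the running result count first
    exceeds the threshold, or None if the query is never satisfied.

    arrivals: list of (time, number_of_results) pairs (assumed trace format).
    """
    total = 0
    for t, count in sorted(arrivals):
        total += count
        if is_satisfied(total, threshold):
            return t
    return None

# Hypothetical arrivals: 3 results at 0.5 s, 10 at 1.2 s, 50 at 2.0 s.
arrivals = [(0.5, 3), (1.2, 10), (2.0, 50)]
print(time_to_satisfaction(arrivals, 10))  # -> 1.2
```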
26. Iterative Deepening
- Interested in satisfaction, not number of results
- BFS returns too many results -> expensive
- Iterative Deepening: common technique to reduce the cost of BFS
- Intuition: a search at a small depth is much cheaper than at a larger depth
27. Iterative Deepening
[Figure legend: source, forward query, processed query, found result, forward response]
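The intuition behind iterative deepening can be sketched as a loop over increasing depths that stops as soon as the query is satisfied. This is a simplified model under stated assumptions: `policy` (the list of depths), `results_at`, and both function names are hypothetical, each iteration re-searches from scratch, and any waiting period between iterations is not modeled.

```python
from collections import deque

def results_within(graph, results_at, source, depth):
    """Count results held by nodes at most `depth` hops from source."""
    seen, frontier, total = {source}, deque([(source, 0)]), 0
    while frontier:
        node, d = frontier.popleft()
        if node != source:
            total += results_at.get(node, 0)
        if d < depth:
            for nb in graph.get(node, []):
                if nb not in seen:
                    seen.add(nb)
                    frontier.append((nb, d + 1))
    return total

def iterative_deepening(graph, results_at, source, policy, threshold):
    """Search at each depth in `policy` in turn; stop once satisfied.

    Returns (last depth searched, results found at that depth).
    """
    reached, found = None, 0
    for depth in policy:
        reached = depth
        found = results_within(graph, results_at, source, depth)
        if found > threshold:
            break  # satisfied: deeper, more expensive searches are skipped
    return reached, found

# Hypothetical chain s - a - b - c with 1, 2, and 5 results respectively.
chain = {"s": ["a"], "a": ["s", "b"], "b": ["a", "c"], "c": ["b"]}
results_at = {"a": 1, "b": 2, "c": 5}
print(iterative_deepening(chain, results_at, "s", [1, 2, 3], 2))  # -> (2, 3)
```

In the example the query is satisfied at depth 2, so node c is never contacted.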
28. Directed BFS
- Sends query to a subset of neighbors
- Maintains statistics on neighbors
- E.g., ping latency, history of number of results
- Chooses subset intelligently (via heuristics), to maximize quality of results
- E.g., neighbors with shortest message queue, since a long message queue implies the neighbor is saturated/dead
29. Directed BFS
[Figure legend: source, forward query, processed query, ?, found result, forward response]
30. Directed BFS Heuristics
RAND: Random
RES: Returned greatest number of results in past
TIME: Had shortest avg. time to satisfaction in past
HOPS: Had smallest avg. number of hops for response messages in past
MSG: Sent our client greatest number of messages
QLEN: Shortest message queue
DEG: Highest degree
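One way to picture these heuristics is as a ranking over per-neighbor statistics. The sketch below assumes each neighbor's statistics are kept in a dictionary; the field names (`results`, `avg_time`, and so on), the function `choose_neighbors`, and the parameter `k` are hypothetical and not taken from any Gnutella client.

```python
import random

# Heuristic name -> (statistic field, whether a larger value is better).
RULES = {
    "RES": ("results", True),
    "TIME": ("avg_time", False),
    "HOPS": ("avg_hops", False),
    "MSG": ("messages", True),
    "QLEN": ("queue_len", False),
    "DEG": ("degree", True),
}

def choose_neighbors(stats, heuristic, k, rng=None):
    """Pick up to k neighbors to receive the query under one heuristic.

    stats: dict mapping neighbor -> dict of statistics (assumed field names).
    """
    names = sorted(stats)
    if heuristic == "RAND":
        rng = rng or random.Random()
        return rng.sample(names, min(k, len(names)))
    field, larger_is_better = RULES[heuristic]
    ranked = sorted(names, key=lambda n: stats[n][field], reverse=larger_is_better)
    return ranked[:k]

# Hypothetical statistics for three neighbors.
stats = {
    "n1": {"results": 40, "avg_time": 2.0, "avg_hops": 3.1,
           "messages": 500, "queue_len": 7, "degree": 4},
    "n2": {"results": 5, "avg_time": 0.8, "avg_hops": 4.0,
           "messages": 900, "queue_len": 1, "degree": 9},
    "n3": {"results": 12, "avg_time": 1.5, "avg_hops": 2.2,
           "messages": 100, "queue_len": 30, "degree": 2},
}
print(choose_neighbors(stats, "RES", 1))   # -> ['n1']
print(choose_neighbors(stats, "QLEN", 2))  # -> ['n2', 'n1']
```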
31. Local Indices
- Each node maintains index over other nodes' collections
- r is the radius of the index
- Index covers all nodes within r hops away
- Can process query at fewer nodes, but get just as many results back
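A local index of radius r can be sketched as a keyword map built over every node within r hops. The data layout (`collections_by_node` as sets of keywords) and the function names are assumptions made for illustration, and index maintenance when nodes join or leave is not modeled.

```python
from collections import deque

def nodes_within(graph, start, r):
    """Return all nodes at most r hops from start (including start)."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == r:
            continue  # do not expand past the index radius
        for nb in graph.get(node, []):
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return set(dist)

def build_local_index(graph, collections_by_node, node, r):
    """Map keyword -> set of nodes within r hops of `node` that hold it."""
    index = {}
    for other in nodes_within(graph, node, r):
        for keyword in collections_by_node.get(other, set()):
            index.setdefault(keyword, set()).add(other)
    return index

# Hypothetical chain s - a - b - c; node a indexes itself, s, and b when r = 1.
chain = {"s": ["a"], "a": ["s", "b"], "b": ["a", "c"], "c": ["b"]}
collections_by_node = {"a": {"x"}, "b": {"x", "y"}, "c": {"y"}}
index_a = build_local_index(chain, collections_by_node, "a", 1)
print(sorted(index_a["x"]))  # -> ['a', 'b']
```

With such an index, node a can answer on behalf of s and b, so those nodes need not process the query themselves.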
32. Local Indices (r = 1)
[Figure legend: source, forward query, processed query, found result, forward response]
33. Evaluation
- Goal: realistic evaluation of techniques
- Cannot directly evaluate techniques in a real environment
- Simulation of large-scale distributed systems is hard
- Use Gnutella as a laboratory for gathering data
- Use analysis driven by query traces to project cost
34. Passive Observation
Gnutella Network
- Statistics
- Size of collection
- redundant messages
- Sample queries (Qrep)
35. Gathering Data
- hops traveled
- IP address
- Timestamp
- Individual result records
36. Gathering Data
L(Q): Length of query string
M(Q,n): Number of response messages from n hops away
R(Q,n): Number of results from n hops away
S(Q,n,Z): True if > Z results received from n hops away
T(Q,Z,W,P): Time to satisfaction
N(Q,n): Number of nodes n hops away
C(Q,n): Number of redundant edges n hops away
37. Example: Trace-driven Cost Projection
[Figure legend: source, forward query, processed query, found result, forward response]
38. Example: Calculating Message Size
- Use the Gnutella protocol, trace data
- E.g., Query message consists of
- Gnutella header (22 B)
- Options field (2 B)
- Query string (L(Q))
- TCP/IP and Ethernet headers (58 B)
- Total size of Query message for query Q
- 82 + L(Q) bytes
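The arithmetic on this slide (22 + 2 + 58 = 82 fixed bytes, plus the query string) can be written as a short function. The constant and function names are illustrative, and taking L(Q) to be the UTF-8 byte length of the string is an assumption.

```python
GNUTELLA_HEADER_BYTES = 22   # from the slide
OPTIONS_FIELD_BYTES = 2      # from the slide
TCP_IP_ETHERNET_BYTES = 58   # from the slide

def query_message_size(query_string):
    """Total size in bytes of a Query message carrying query_string.

    L(Q) is taken here as the UTF-8 byte length of the string (an assumption).
    """
    fixed = GNUTELLA_HEADER_BYTES + OPTIONS_FIELD_BYTES + TCP_IP_ETHERNET_BYTES
    return fixed + len(query_string.encode("utf-8"))

print(query_message_size("jazz"))  # -> 86
```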
39. Calculating Cost
- We know the sizes of each type of message
- We know the number of messages sent, for each type of message, for query Q
- Put together: aggregate bandwidth for Q
- Similar process to compute aggregate processing power
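Putting the two known quantities together amounts to a weighted sum over message types. In the sketch below, the dictionaries, the function name, and the example counts and sizes are all hypothetical; only the idea of multiplying per-type counts by per-type sizes comes from the slide.

```python
def aggregate_bandwidth(message_counts, message_sizes):
    """Sum count * size over every message type sent for one query.

    message_counts: dict mapping message type -> number of messages sent.
    message_sizes: dict mapping message type -> size in bytes.
    """
    return sum(count * message_sizes[kind]
               for kind, count in message_counts.items())

# Hypothetical numbers for a single query Q.
counts = {"query": 100, "response": 10}
sizes = {"query": 86, "response": 200}
print(aggregate_bandwidth(counts, sizes))  # -> 10600
```

The same shape of computation would apply to processing power, with a per-message processing cost in place of the byte size.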
40. Overall Comparison
[Figure: panels comparing Time to Satisfy, Prob. of Satisfying, results, and Bandwidth Cost. Legend: B = BFS, I = Iterative Deepening (d = 5, W = 6), D = Directed BFS (>RES), L = Local Indices (r = 1)]
41. Summary: Efficient Search
- What we've done
- Proposed techniques to improve performance
- Kept simple
- Evaluated techniques using extensive real data
- Improved performance, with tradeoffs
- Open issues
- More efficient!
- Make intelligent use of topology, replication
- Take advantage of heterogeneity (e.g., super-peers)