ICS 214B: Transaction Processing and Distributed Data Management


1
ICS 214B: Transaction Processing and Distributed
Data Management
  • Lecture 18: Data Management in Peer-to-Peer
    Systems
  • Professor Chen Li
  • Based on slides developed by
    Beverly Yang and Hector Garcia-Molina

2
What is P2P?
[Slide graphic: a cloud of system names and a "?": pastry, jxta, can, fiorana, napster, freenet, united devices, open cola, aim, ocean store, netmeeting, farsite, gnutella, icq, ebay, morpheus, limewire, seti_at_home, bearshare, uddi, grove, jabber, popular power, kazaa, folding_at_home, tapestry, mojo nation, process tree, chord]
3
Napster
[Figure label: central server ...]
4
Gnutella
5
PeerCast
[Figure labels: UCI, source]
6
What is Peer-to-Peer?
  • Definition
  • Nodes of equal roles exchanging information
    and services directly
  • Is this a new idea?
  • IP routing (1970s)
  • Mariposa (1980s)
  • Distributed Databases!
  • What are people really thinking?

7
Implicit Definition of P2P
  • Scale: millions (billions?) of peers
  • Nature of peers: PCs
  • Application: lightweight semantics
    (e.g., file-sharing)

8
P2P vs. Distributed DBMS
  • Traditional DDBMS Issues
  • Transactions
  • Network Partitions
  • Distributed Query Optimization
  • Interoperation of heterogeneous data sources
  • Reliability/failure of nodes
  • Complex features do not scale

9
P2P vs. Distributed DBMS
  • Example application: file-sharing
  • Simple data model and query language
  • No complex query optimization
  • Easy interoperation
  • No guarantee on quality of results
  • Individual site availability unimportant
  • Local updates
  • No transactions
  • Network partitions OK
  • Simple: amenable to a large-scale network of
    PCs

10
Potential Benefits
  • Efficiency: harnessing unused resources
  • Self-organizing
  • Effectively sharing cost of ownership
  • Robustness and availability through replication
  • Anonymity/legal protection

11
Challenges
  • No authority to enforce behavior
  • Cooperation
  • Unreliability of individual peers
  • Efficiency of distributed operations (absolute
    resources)

12
Research Areas
  • Resource Management
  • Security
  • Efficient Search

13
Resource Management
  • Resource
  • Storage/information
  • CPU processing
  • bandwidth
  • Issues
  • fairness
  • load balancing

14
Example: Data Trading
[Figure: sites 1, 2, and 3 holding documents A1, C1, B1, A2, B2, C2]
16
Data Trading
  • Order of trades impacts availability
  • Issues
  • Swaps vs. Deeds
  • Fixed price vs. bids
  • Preference to
  • sites with a lot of space?
  • reliable sites?
  • desperate sites?

17
Security
  • Issues
  • Reputation
  • Trust
  • Accountability
  • Information Preservation
  • Information Quality
  • Denial of service attacks
  • Problem: detecting and punishing bad behavior

18
Information Preservation
  • Example policy: make 3 copies of documents

[Figure: document A1, "make copies"]
What can go wrong?
19
What Can Go Wrong?
  • Bad site deletes copies
  • Bad site alters copy
  • Bad site publishes fake
  • Bad site makes many copies at other sites
  • ...

[Figure: copies of document A1, "make copies"]
20
Reputation Systems
  • Peers evaluate each other
  • Good reviews - Good reputation
  • Bad reviews - Bad reputation
  • No reviews - ?
  • Problems
  • Trustworthiness of reviews
  • Permanence of identity

21
Efficiency of Search
  • Problem: finding a needle in a haystack
  • Efficiency measured in terms of absolute
    resources consumed

22
Architecture
  • Hybrid: centralized index, P2P file storage
    and transfer
  • Super-peer: a pure network of hybrid clusters
  • Pure: functionality completely distributed

23
Goal
  • Develop search techniques for loose systems
    that are
  • Efficient
  • Simple (easy to implement, no hidden costs)
  • Realistically and thoroughly evaluated

24
Current Techniques: Gnutella
Breadth-First Search (BFS)
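
The BFS flooding named on this slide can be sketched roughly as follows. This is a minimal illustration, not the Gnutella implementation: the adjacency-dict `graph`, the `has_match` callback, and the `ttl` hop limit are all assumed names introduced here, and the sketch only counts query messages.

```python
from collections import deque


def bfs_flood(graph, source, has_match, ttl):
    """Flood a query outward from `source` for at most `ttl` hops.

    Returns (nodes_with_results, query_messages_sent).
    `graph` maps each node to a list of its neighbors (an assumed layout).
    """
    seen = {source}
    frontier = deque([(source, 0, None)])  # (node, depth, node it came from)
    results = []
    messages = 0
    while frontier:
        node, depth, sender = frontier.popleft()
        if node != source and has_match(node):
            results.append(node)
        if depth == ttl:
            continue  # hop limit reached: do not forward any further
        for neighbor in graph[node]:
            if neighbor == sender:
                continue  # do not send the query back where it came from
            messages += 1  # duplicate deliveries still cost bandwidth
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1, node))
    return results, messages
```

Counting a message for every forwarded copy, including copies that reach a node that has already seen the query, is what makes the cost grow quickly with depth.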
25
Metrics
  • Cost (aggregate)
  • Bandwidth
  • Processing Power
  • Quality of Results
  • Number of results
  • Satisfaction (true if # results ≥ X, false
    otherwise)
  • Time to satisfaction
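
The quality metrics above can be expressed as two small functions. This is a sketch under one assumption: that a query counts as satisfied once at least `threshold` results (the slide's X) have arrived. The function and parameter names are illustrative only.

```python
def is_satisfied(num_results, threshold):
    """Satisfaction metric: True once at least `threshold` results arrived."""
    return num_results >= threshold


def time_to_satisfaction(arrival_times, threshold):
    """Time at which the `threshold`-th result arrived, or None if the
    query never collected that many results.

    `arrival_times` holds one arrival time per result, in any order;
    `threshold` is assumed to be at least 1.
    """
    if len(arrival_times) < threshold:
        return None
    return sorted(arrival_times)[threshold - 1]
```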

26
Iterative Deepening
  • Interested in satisfaction, not # of results
  • BFS returns too many results → expensive
  • Iterative Deepening: common technique to reduce
    the cost of BFS
  • Intuition: a search at a small depth is much
    cheaper than at a larger depth
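
A rough sketch of the idea: try a list of increasing depths and stop at the first one that satisfies the query. The names `policy` (the depths to try) and `needed` (the number of results that counts as satisfied) are assumptions made for this example. For simplicity the sketch repeats the search from scratch at each depth; an actual protocol could avoid reprocessing nodes it has already visited.

```python
from collections import deque


def results_within(graph, source, holders, depth_limit):
    """Nodes in `holders` reachable from `source` in at most `depth_limit` hops."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if dist[node] == depth_limit:
            continue  # do not expand beyond the depth limit
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return sorted(n for n in dist if n in holders and n != source)


def iterative_deepening(graph, source, holders, policy, needed):
    """Search at each depth in `policy` until `needed` results are found.

    Returns (last_depth_searched, results_found_at_that_depth).
    """
    found, depth = [], None
    for depth in policy:
        found = results_within(graph, source, holders, depth)
        if len(found) >= needed:
            break  # satisfied: skip the deeper, more expensive searches
    return depth, found
```

When a query is satisfied at a shallow depth, the deeper searches never run, which is where the savings over a single deep BFS come from.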

27
Iterative Deepening
[Figure legend: source, forward query, processed query, found result, forward response]
28
Directed BFS
  • Sends query to a subset of neighbors
  • Maintains statistics on neighbors
    (e.g., ping latency, history of number of results)
  • Chooses subset intelligently (via heuristics) to
    maximize quality of results
    (e.g., neighbors with the shortest message queue,
    since a long message queue implies the neighbor is
    saturated or dead)
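
Neighbor selection of this kind might look like the sketch below. The `NeighborStats` record, its field names, and `choose_neighbors` are hypothetical and exist only for this example; the heuristic itself is passed in as a sort key.

```python
from dataclasses import dataclass


@dataclass
class NeighborStats:
    """Per-neighbor statistics a node might keep (illustrative fields)."""
    name: str
    past_results: int       # results returned for earlier queries
    queue_length: int       # messages currently waiting at this neighbor
    ping_latency_ms: float


def choose_neighbors(stats, k, key, reverse=False):
    """Return the names of the `k` best neighbors under the heuristic `key`."""
    ranked = sorted(stats, key=key, reverse=reverse)
    return [s.name for s in ranked[:k]]
```

For example, `choose_neighbors(stats, 2, key=lambda s: s.queue_length)` picks the two neighbors with the shortest message queues, while `key=lambda s: s.past_results` with `reverse=True` favors neighbors that returned the most results in the past.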

29
Directed BFS
[Figure legend: source, forward query, processed query, "?", found result, forward response]
30
Directed BFS Heuristics
31
Local Indices
  • Each node maintains an index over other nodes'
    collections
  • r is the radius of the index
  • Index covers all nodes within r hops
  • Can process query at fewer nodes, but get just as
    many results back
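
A minimal sketch of a local index of radius r, assuming the network is an adjacency dict and each node's collection is a set of item names. These data layouts and all function names are illustrative assumptions.

```python
from collections import deque


def nodes_within(graph, start, r):
    """All nodes at most `r` hops from `start`, including `start` itself."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == r:
            continue  # the index radius ends here
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return set(dist)


def build_local_index(graph, collections, node, r):
    """Map each item to the set of nodes within `r` hops of `node` that hold it."""
    index = {}
    for owner in nodes_within(graph, node, r):
        for item in collections.get(owner, ()):
            index.setdefault(item, set()).add(owner)
    return index


def process_query(index, item):
    """Answer a query from the local index on behalf of the covered nodes."""
    return sorted(index.get(item, ()))
```

Because one node answers for every node inside its radius, the query does not need to be processed at those nodes individually.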

[Figure label: r]
32
Local Indices (r = 1)
[Figure legend: source, forward query, processed query, found result, forward response]
33
Evaluation
  • Goal: realistic evaluation of techniques
  • Cannot directly evaluate techniques in a real
    environment
  • Simulation of large-scale distributed systems is
    hard
  • Use Gnutella as a laboratory for gathering data
  • Use analysis driven by query traces to project
    cost

34
Passive Observation
Gnutella Network
  • Statistics
  • Size of collection
  • redundant messages
  • Sample queries (Qrep)

35
Gathering Data
  • hops traveled
  • IP address
  • hops traveled
  • IP address
  • Timestamp
  • Individual result records

36
Gathering Data
  • For each query Q

37
Example: Trace-driven Cost Projection
[Figure legend: source, forward query, processed query, found result, forward response]
38
Example: Calculating Message Size
  • Use the Gnutella protocol, trace data
  • E.g., a Query message consists of:
  • Gnutella header (22 B)
  • Options field (2 B)
  • Query string (L(Q) B)
  • TCP/IP and Ethernet headers (58 B)
  • Total size of Query message for query Q:
    22 + 2 + L(Q) + 58 = 82 + L(Q) bytes
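
The arithmetic on this slide can be checked with a few lines of code. The byte counts are the ones given above; treating L(Q) as the UTF-8 encoded length of the query string is an assumption made for this sketch, as are the constant and function names.

```python
GNUTELLA_HEADER_BYTES = 22          # from the slide
OPTIONS_FIELD_BYTES = 2             # from the slide
TCP_IP_ETHERNET_HEADER_BYTES = 58   # from the slide


def query_message_size(query_string):
    """Total bytes on the wire for one Query message carrying `query_string`."""
    query_length = len(query_string.encode("utf-8"))  # L(Q), assumed UTF-8
    return (GNUTELLA_HEADER_BYTES
            + OPTIONS_FIELD_BYTES
            + query_length
            + TCP_IP_ETHERNET_HEADER_BYTES)
```

The three fixed terms sum to 82 bytes, so only the query string length varies from one query to the next.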

39
Calculating Cost
  • We know the sizes of each type of message
  • We know # messages sent, for each type of
    message, for query Q
  • Put together: aggregate bandwidth for Q
  • Similar process to compute aggregate processing
    power
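
Putting the two known quantities together is a weighted sum over message types. The sketch below assumes both inputs are dicts keyed by message type. The function name and the message-type keys are illustrative.

```python
def aggregate_bandwidth(message_counts, message_sizes):
    """Total bytes sent for one query.

    `message_counts` maps message type -> number of messages sent for the query.
    `message_sizes` maps message type -> size of one such message in bytes.
    """
    return sum(count * message_sizes[kind]
               for kind, count in message_counts.items())
```

Aggregate processing power could be projected the same way by replacing the per-message byte sizes with per-message processing costs.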

40
Overall Comparison
[Figure: comparison of the four techniques on Time to Satisfy, Prob. of Satisfying, number of results, and Bandwidth Cost]
  • B: BFS
  • I: Iterative Deepening (d = 5, W = 6)
  • D: Directed BFS (RES)
  • L: Local Indices (r = 1)
41
Summary: Efficient Search
  • What we've done
  • Proposed techniques to improve performance
  • Kept simple
  • Evaluated techniques using extensive real data
  • Improved performance, with tradeoffs
  • Open issues
  • More efficient!
  • Make intelligent use of topology, replication
  • Take advantage of heterogeneity (e.g.,
    super-peers)