P2P Systems

Slides: 45

Provided by: gza6

1
P2P Systems technologies

Zacharioudakis Giorgos

2
Presentation overview

P2P architectures and typical systems
Technical issues
Popular P2P Systems
Research areas
Project JXTA technology
Vision about SeLene project

3
What is Peer-to-Peer?

Definition: nodes of equal roles exchanging
information and services directly
Scale: millions (billions?) of peers
Nature of peers: PCs
Application: lightweight semantics (e.g.,
file-sharing)
Is this a new idea?
IP routing
DNS, NTP
Distributed Databases

4
P2P vs. Distributed DBMS

Traditional DDBMS Issues
Transactions
Network Partitions
Distributed Query Optimization
Interoperation of heterogeneous data sources
Reliability/failure of nodes
Complex features do not scale

Example P2P application: file-sharing
Simple data model and query language
No complex query optimization
Easy interoperation
No guarantee on quality of results
Individual site availability unimportant
Local updates
No transactions
Network partitions OK
Simple, hence amenable to a large-scale network of
PCs

5
P2P Applications

File sharing
Napster, Gnutella
Instant Messaging
Jabber
Distributed Computation
SETI@home
Web services
Akamai

Distributed storage
Freenet
Anonymity, censorship resistance
Mixmaster remailers
Red Rover, Publius
Cooperative work
Groove
Other ...

6
Technical issues

scalability
fault tolerance
speed
bandwidth consumption
processing cost
security
anonymity

publishing/retrieval
metadata
semantic querying
availability of results
interoperability
...

7
Metadata and Interoperability

Metadata
System metadata (e.g. filename, bitrate, filesize,
etc.)
Resource metadata (e.g. relations, hierarchies,
etc.)
Currently, queries take the form of keyword
matching
We would like to perform queries in more
expressive languages, taking advantage of
semantic knowledge and metadata
Technologies
Programming interfaces
XML-RPC, SOAP, HTTP, JXTA
Data and metadata representation - common
ontologies and format
XML, RDF
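
The gap between keyword matching and a more expressive query can be sketched as follows. This is only an illustration: the `Resource` record, its field names and both query functions are hypothetical, chosen to mirror the system metadata listed on the slide (filename, bitrate, filesize).

```python
from dataclasses import dataclass


@dataclass
class Resource:
    # System metadata fields named after the slide's examples (hypothetical schema).
    filename: str
    bitrate_kbps: int
    filesize_bytes: int


def keyword_match(resources, keyword):
    """Keyword matching: case-insensitive substring test on the filename."""
    kw = keyword.lower()
    return [r for r in resources if kw in r.filename.lower()]


def metadata_query(resources, keyword, min_bitrate_kbps):
    """A slightly more expressive query: keyword plus a metadata constraint."""
    return [r for r in keyword_match(resources, keyword)
            if r.bitrate_kbps >= min_bitrate_kbps]


catalog = [
    Resource("blue_song.mp3", 128, 4_000_000),
    Resource("Blue_Song_live.mp3", 320, 9_500_000),
    Resource("other.mp3", 192, 5_000_000),
]
high_quality = metadata_query(catalog, "blue", 192)
```

Queries over resource metadata (relations, hierarchies) would need a richer representation such as RDF; the sketch only covers flat system metadata.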

8
Different Approaches to Distributed Search

Network topology based architectures
Rely on the organization of peers within the
network to route requests
These approaches focus on how to reduce the
diameter of the graph representing the
distributed networks
Content based approaches
Message content is used in either the
organization of the network or the routing of
messages or both
These approaches focus on how to reduce the query
path-length of the access structure they use

9
Spectrum of Purity

Hybrid
Centralized index, P2P file storage and transfer
Napster, SETI@home
Super-peer
A pure network of hybrid clusters
Morpheus, eDonkey
Pure
functionality completely distributed
Freenet, Gnutella

10
Publishing/Requesting/Responding

hybrid
central indexing
each node registers to a central index
queries are performed to the central index
retrieval is done from other peer nodes

pure
each peer manages its own index about local
(remote) resources
queries are typically performed with broadcasts
retrieval is done from responding peers that
hold the requested resource

super-peers
some nodes act as coordinators and manage indices
for a subset of nodes
each node registers to its local coordinator
queries are performed to the coordinators, which
in turn communicate as in a distributed p2p
system with other super-peers
retrieval is done from other peers that hold
the requested resource
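
A minimal sketch of the super-peer scheme described above, under simplifying assumptions: the `SuperPeer` class and its methods are hypothetical, and a coordinator forwards a query only one hop to its neighbouring super-peers instead of running a full distributed search among them.

```python
class SuperPeer:
    """Coordinator that indexes the resources of the peers in its cluster."""

    def __init__(self, name):
        self.name = name
        self.index = {}       # resource name -> set of peer addresses
        self.neighbours = []  # other super-peers known to this one

    def register(self, peer_addr, resources):
        """A peer registers its local resources with its coordinator."""
        for res in resources:
            self.index.setdefault(res, set()).add(peer_addr)

    def query(self, resource, forwarded=False):
        """Answer from the local index, then ask neighbouring super-peers once."""
        hits = set(self.index.get(resource, set()))
        if not forwarded:
            for sp in self.neighbours:
                hits |= sp.query(resource, forwarded=True)
        return hits
```

The returned addresses are the peers that hold the resource; retrieval then happens directly between peers, without the coordinators.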

11
Representative P2P Systems

Network topology based architectures
Napster
Gnutella
Morpheus
Content based architectures
Chord
P-Grid

12
Napster (hybrid)

Membership: each client joins a server, where it
registers its local files in the central index
Query: a client sends queries to the central
server, which returns references to the clients
that actually hold the resources
Retrieval: the client connects to other peer
clients and retrieves the resource. The selection
is performed by the user, but it could be done
automatically based on bandwidth, load or other
criteria
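
The three steps above can be sketched with an in-memory index. This is an illustration only, not Napster's actual protocol: `CentralIndex`, `choose_source` and the bandwidth table are hypothetical names introduced here.

```python
class CentralIndex:
    """Server-side index: filename -> clients that registered it."""

    def __init__(self):
        self._holders = {}

    def register(self, client, filenames):
        """Membership: a joining client registers its local files."""
        for name in filenames:
            self._holders.setdefault(name, set()).add(client)

    def unregister(self, client):
        """Remove a departing client from every entry."""
        for holders in self._holders.values():
            holders.discard(client)

    def query(self, filename):
        """Query: return references to the clients holding the file."""
        return sorted(self._holders.get(filename, set()))


def choose_source(candidates, bandwidth_kbps):
    """Retrieval: automatic selection by highest advertised bandwidth."""
    return max(candidates, key=lambda c: bandwidth_kbps.get(c, 0))
```

The file transfer itself would then run directly between the requesting client and the chosen source.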

13
Napster (hybrid)

14
Gnutella (pure)

Gnutella is not a system; it is a protocol, with
various existing Gnutella clients that implement
it.
Membership: through a predefined static list
of addresses or through host caches, a peer
can connect to a set of Gnutella clients. After
connecting, a client expands its list of known
addresses with the lists obtained from other
peers.
Query: a peer broadcasts a query to its known
peers; these forward the query to their known
peers, and so on until the maximum TTL (the
packet's Time To Live) is reached, which is the
depth limit of the query.
Retrieval: peers that hold the requested resource
respond to the peer that issued the query.
The responses travel back along the reverse path
of the query, so the originating peer finally
discovers a list of peers having the resource and
then obtains it from one of them.

15
Gnutella (pure)
Breadth-First Search (BFS)
16
Gnutella (pure)

Each peer maintains a small minimum number of
simultaneous active connections
These peers are selected from a locally
maintained host catcher list containing the
addresses of all known peers
Peer discovery
watching PING-PONG messages
noting the addresses of peers initiating queries
receiving connections from previously unknown
hosts
out-of-band channels (IRC, Web)
host caches
Query propagation: upon receiving a query, a peer
broadcasts it to all peers it is currently
connected to, and so on, like a chain letter
If a peer has a file that matches the query, it
sends an answer back (though it still forwards
the query). This process continues to a maximum
depth (search horizon)
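
The TTL-limited flooding described above amounts to a breadth-first search with a depth limit. The sketch below simulates it on an in-memory graph; `flood_query`, the adjacency dictionary and the `files` table are hypothetical, and the `seen` set stands in for the duplicate suppression a real client would do on message identifiers.

```python
from collections import deque


def flood_query(graph, files, origin, wanted, ttl):
    """Return the peers holding `wanted` within `ttl` hops of `origin`."""
    seen = {origin}                   # peers that already received the query
    hits = set()
    frontier = deque([(origin, ttl)])
    while frontier:
        peer, remaining = frontier.popleft()
        if peer != origin and wanted in files.get(peer, ()):
            hits.add(peer)            # the peer answers, but still forwards
        if remaining == 0:
            continue                  # search horizon reached
        for neighbour in graph.get(peer, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                frontier.append((neighbour, remaining - 1))
    return hits
```

On a chain A-B-C-D where C and D hold the file, a query from A with TTL 2 finds only C; raising the TTL to 3 also finds D, at the cost of more messages.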

17
Morpheus (Super-Peer)

Self organizing network
Neither search requests nor actual downloads pass
through any central server
The network is multi-layered, so that more
powerful computers get to become search hubs
("SuperNodes")
Any client may become a SuperNode, if it meets
the criteria of processing power, bandwidth and
latency
Network management is automatic - SuperNodes
appear and disappear according to demand

18
Morpheus (Super-Peer)
[Figure: SuperNode network with SuperNodes SN1, SN2, SN3 and SN4; SN4 is shown with address 12.34.56.78]
19
Morpheus (Super-Peer)

Intelligent downloads
Morpheus implements a type of fail-over system
that attempts to locate another peer sharing the
same file, and automatically resume the download
where it left off at the failed host
When the Morpheus search engine finds that more
than one active peer is serving a particular file,
it associates the list of peers with the file for
later reference
If the user instructs Morpheus to download the
file, it can distribute the download task over
this list of peers
SuperNodes act like local search hubs and proxy
search requests on behalf of their connected
peers
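
How a download might be distributed over a list of peers and resumed after a failure can be sketched as below. This is not Morpheus's actual algorithm, which the slides do not describe; `plan_ranges` and `resume_after_failure` are hypothetical helpers that assume equal-sized contiguous byte ranges.

```python
def plan_ranges(file_size, peers):
    """Split [0, file_size) into contiguous byte ranges, one per peer."""
    if not peers:
        raise ValueError("need at least one peer")
    base, extra = divmod(file_size, len(peers))
    plan, start = [], 0
    for i, peer in enumerate(peers):
        length = base + (1 if i < extra else 0)
        plan.append((peer, start, start + length))
        start += length
    return plan


def resume_after_failure(plan, failed_peer, bytes_done, spare_peer):
    """Hand the unfinished part of the failed peer's range to another peer."""
    new_plan = []
    for peer, start, end in plan:
        if peer == failed_peer:
            # Resume where the download left off at the failed host.
            new_plan.append((spare_peer, start + bytes_done, end))
        else:
            new_plan.append((peer, start, end))
    return new_plan
```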

20
Chord (content based search)

Chord is a lookup service, not a search service
Based on consistent hashing, with lookups that
behave like a binary search
Provides just one operation:
a peer-to-peer hash lookup
Lookup(key) → IP address
Chord does not store the data
Uses a hash function
Key identifier = SHA-1(key)
Node identifier = SHA-1(IP address)
Both are uniformly distributed
Both exist in the same ID space
How to map key IDs to node IDs?
A key is stored at its successor: the node with
the next higher (or equal) ID, wrapping around
the identifier circle
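
The mapping from keys to nodes can be sketched in a few lines. The 16-bit identifier size and the function names are assumptions made for illustration; actual deployments keep the full 160 bits of the SHA-1 digest.

```python
import hashlib
from bisect import bisect_left

M = 16  # identifier bits; a small value assumed here for readability


def chord_id(text, m=M):
    """SHA-1 of the text, reduced to an m-bit identifier."""
    digest = hashlib.sha1(text.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** m)


def successor(node_ids, ident):
    """First node ID >= ident on the circle, wrapping to the smallest ID."""
    ring = sorted(node_ids)
    i = bisect_left(ring, ident)
    return ring[i % len(ring)]


# Keys and nodes share one ID space, so a key is placed by comparing IDs.
nodes = [chord_id(ip) for ip in ("10.0.0.1", "10.0.0.2", "10.0.0.3")]
home = successor(nodes, chord_id("some-file.mp3"))
```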

21
Chord (content based search)

The goal of Chord is to provide the performance
of a binary search, which means an $O(\log N)$ query
path-length
To keep the maximum path-length at $O(\log N)$,
each node maintains a routing table (called the
finger table) with at most $m$ entries (where
$m = \log N$)
The $i$-th entry in the table at node $n$ contains
the identity of the first node $s$ that succeeds $n$
by at least $2^{i-1}$ on the identifier circle (all
arithmetic modulo $2^m$)
i.e., $s = \mathrm{successor}(n + 2^{i-1})$, $1 \le i \le m$
Note that the first finger of $n$ is its
immediate successor on the circle
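
The finger-table rule can be checked on a tiny ring. The sketch below assumes a 3-bit identifier space with nodes 0, 1 and 3; the function names are hypothetical.

```python
from bisect import bisect_left


def successor(ring, ident):
    """First node ID >= ident in the sorted ring, wrapping around."""
    i = bisect_left(ring, ident)
    return ring[i % len(ring)]


def finger_table(n, node_ids, m):
    """Entry i (1-based) is successor((n + 2^(i-1)) mod 2^m)."""
    ring = sorted(node_ids)
    return [successor(ring, (n + 2 ** (i - 1)) % (2 ** m))
            for i in range(1, m + 1)]


# Node 1 targets IDs 2, 3 and 5, which resolve to nodes 3, 3 and 0.
table = finger_table(1, [0, 1, 3], m=3)
```

The first entry is always the node's immediate successor, as the slide notes.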

[Figure legend: existing nodes; non-existing nodes that are possible values in the ID space]
22
Chord (content based search)

Important characteristics
Each node stores info only about a small number
of possible IDs (at most log N)
It knows more about nodes closely following it
on the identifier circle
A node's finger table does not generally contain
enough info to locate the successor of an
arbitrary key k

[Figure: identifier circle with IDs 0 to 7]
23
Chord (content based search)
Finger Table Allows Log(n)-time Lookups

How do we locate the successor of a key k?
If n can find a node whose ID is closer than its
own to k, that node will know more about the
identifier circle in the region of k than n does
Thus n searches its finger table for the node j
whose ID most immediately precedes k, and asks j
for the node it knows whose ID is closest to k

By repeating this process, n learns about nodes
with IDs closer and closer to k
Gradually we will find the immediate predecessor
of k

[Figure: identifier circle with nodes N5, N10, N20, N32, N60, N80, N99 and N110, and key K19]
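
The lookup procedure can be simulated on the ring from the slide's figure labels (nodes 5, 10, 20, 32, 60, 80, 99, 110 and key 19). The 7-bit identifier space and all function names are assumptions; the sketch computes finger tables from a global node list, which a real node would not have.

```python
from bisect import bisect_left

M = 7  # assumed identifier bits, giving IDs 0..127
NODES = sorted([5, 10, 20, 32, 60, 80, 99, 110])


def successor(ident):
    """First node ID >= ident (mod 2^M), wrapping around the circle."""
    i = bisect_left(NODES, ident % 2 ** M)
    return NODES[i % len(NODES)]


def fingers(n):
    """Finger table of node n: successor(n + 2^(i-1)) for i = 1..M."""
    return [successor(n + 2 ** (i - 1)) for i in range(1, M + 1)]


def between(x, a, b):
    """True if x lies strictly inside the clockwise interval (a, b)."""
    if a < b:
        return a < x < b
    return x > a or x < b


def lookup(start, key):
    """Return (node responsible for key, list of nodes contacted)."""
    n, hops = start, [start]
    while True:
        succ = fingers(n)[0]
        if key == succ or between(key, n, succ):
            return succ, hops      # n is the immediate predecessor of key
        # Jump to the finger that most immediately precedes the key.
        n = next(f for f in reversed(fingers(n)) if between(f, n, key))
        hops.append(n)
```

Starting at node 80, the lookup for key 19 contacts nodes 80, 5 and 10 and returns node 20 as the key's successor.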
24
Chord Autonomy

When new keys are inserted the system is not
affected. It just finds the appropriate node and
stores the key there
When nodes join or leave, the finger tables must
be correctly maintained and some keys must
be transferred to other nodes
This incurs an $O(\log^2 N)$ cost for maintaining the
finger tables and assuring correctness of the
system while nodes join/leave the system
Also, every key is stored on only one node, which
means that if that node becomes unavailable the
key is also unavailable
This implies a restricted autonomy of the system
The only replicated information is (implicitly)
the finger tables, because each node has to
maintain its own
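
Which keys must move when a node joins can be computed by comparing key placement before and after the join. A minimal sketch, assuming a global view of the ring; `assign` and `keys_to_transfer` are hypothetical helpers and ignore the finger-table updates.

```python
from bisect import bisect_left


def successor(ring, ident):
    """First node ID >= ident in the sorted ring, wrapping around."""
    i = bisect_left(ring, ident)
    return ring[i % len(ring)]


def assign(keys, nodes):
    """Map every key ID to the node that stores it."""
    ring = sorted(nodes)
    return {k: successor(ring, k) for k in keys}


def keys_to_transfer(keys, nodes, joining):
    """Keys whose home changes when `joining` enters: key -> (old, new)."""
    before = assign(keys, nodes)
    after = assign(keys, list(nodes) + [joining])
    return {k: (before[k], after[k]) for k in keys if before[k] != after[k]}
```

With nodes 0, 1, 3 and keys 1, 2, 6, a joining node 6 takes over only key 6, which was previously stored at node 0.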

25
P-Grid

Basic characteristics
Based on building distributed, binary prefix
trees
Use of randomized algorithms for constructing the
access structure, updating the data and
performing the search
Scales gracefully, equally for all nodes
Access structure
We assume that the index terms are binary
strings, built from 0s and 1s
The search space is partitioned into intervals
Every peer takes over responsibility for one
interval
As each key corresponds to a path in the binary
prefix tree the peer is also responsible for one
path of the search tree
Each peer stores the peers responsible for the
other branches of the path for routing
Search requests are either processed locally or
forwarded to the peers on the alternative branches
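
Search over such an access structure can be sketched as prefix routing. The `Peer` class and `search` function below are hypothetical; each peer keeps, per level of its path, references to peers on the other branch at that level.

```python
import random


class Peer:
    def __init__(self, name, path):
        self.name = name
        self.path = path  # binary prefix this peer is responsible for
        self.refs = {}    # level -> peers responsible for the other branch


def search(peer, key, rng=random):
    """Route `key` (a binary string); return (responsible peer, hops)."""
    hops = [peer.name]
    while True:
        # First level at which the peer's path and the key disagree.
        level = next((i for i, (a, b) in enumerate(zip(peer.path, key))
                      if a != b), None)
        if level is None:
            return peer, hops  # no disagreement: process locally
        peer = rng.choice(peer.refs[level])  # forward to the other branch
        hops.append(peer.name)
```

Each forwarding step fixes at least one more bit of the key, so the number of hops is bounded by the length of the paths.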

26
P-Grid

P-Grid construction
Initially, all peers are responsible for the
whole search space
Whenever peers meet, they try to make a
refinement to the access structure
They split the search space into two parts and
each takes responsibility for one half
They also store a reference to the other peer
in order to cover the other part of the search
space
The same happens whenever two peers meet that
are responsible for the same interval at the same
level
To avoid overspecialization of peers, we restrict
the maximal length of paths that can be
constructed to a defined maxlength
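
The construction by pairwise meetings can be sketched as below. This is a simplified reading of the slide, not the full P-Grid exchange algorithm: the `Peer` class, `flip` and `meet` are hypothetical, and the rule for a peer whose path is a prefix of the other's (it takes the opposite branch) is an assumption added to make the sketch complete.

```python
class Peer:
    def __init__(self, name):
        self.name = name
        self.path = ""  # empty path: responsible for the whole search space
        self.refs = {}  # level -> names of peers covering the other branch


def flip(bit):
    return "1" if bit == "0" else "0"


def meet(a, b, maxlength):
    """Refine the access structure when peers a and b meet."""
    common = 0
    while (common < min(len(a.path), len(b.path))
           and a.path[common] == b.path[common]):
        common += 1
    if len(a.path) == len(b.path) == common:
        if common < maxlength:  # same interval: split it in two halves
            a.path += "0"
            b.path += "1"
    elif len(a.path) == common:  # a's interval contains b's (assumed rule)
        a.path += flip(b.path[common])
    elif len(b.path) == common:  # b's interval contains a's (assumed rule)
        b.path += flip(a.path[common])
    if len(a.path) > common and len(b.path) > common:
        # The paths now diverge at `common`: keep references to each other.
        a.refs.setdefault(common, set()).add(b.name)
        b.refs.setdefault(common, set()).add(a.name)
```

Peers that meet while holding the same path of length `maxlength` are left unchanged, so several peers can end up responsible for the same interval.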

27
P-Grid
[Figure: P-Grid key intervals and the paths of the peers responsible for them]
28
P-Grid
[Figure: P-Grid queries routed through key intervals at level 0 (0, 1), level 1 (00, 01, 10, 11) and level 2]
29
P-Grid Autonomy

The system assumes that peers eventually meet,
but does not specify how this occurs, i.e. it
is possible that they never meet
As many peers can be responsible for the same key,
the general problem is how to find all those
peers in case of an update
Proposed solutions
Multiple BFS or DFS searches for a key,
propagating the update to the peers found
Creating lists of buddies for each peer (i.e.
other peers that share the same key) and
propagating the update to all buddies
These imply that although the system is
decentralized and peers do not rely on central
authorities, the construction and update of the
access structure may impose some performance
issues, especially when updating a key

30
P-Grid Autonomy

When a new node enters the system, it assumes that
it is responsible for the whole prefix namespace
interval
When it meets other nodes, they split the
interval and each maintains a reference to the
other node
When a node leaves abruptly, the other nodes hold
incorrect references; as soon as they become
aware of it, they assume responsibility for
that prefix interval
The replicated information in this system is the
multiple references to the same keys and the
buddies lists (when used) in order to address the
update problem

31
P2P comparison
32
P2P performance metrics

Bandwidth
Storage (replication)
Processing cost
Path-length (required hops)
Quality of Results
Number of results
Satisfaction (true if results > X, false
otherwise)
Time to satisfaction
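
The last two metrics can be made concrete with a small sketch. It assumes the satisfaction rule means "more than X results" and that the arrival time of each result is known; both function names are hypothetical.

```python
def satisfied(num_results, x):
    """Satisfaction: true if the query returned more than x results."""
    return num_results > x


def time_to_satisfaction(arrival_times, x):
    """Arrival time of the result that makes the query satisfied, or None."""
    ordered = sorted(arrival_times)
    # More than x results means the (x + 1)-th arrival, at index x.
    return ordered[x] if len(ordered) > x else None
```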

33
Hybrid p2p

Advantages
Simple to manage, with good availability of
results, due to central indexing
Less (aggregated) bandwidth consumption
Small processing cost for peers
Idle nodes that do not offer resources do not
degrade system performance

Disadvantages
Does not scale
Single point of failure
Great processing cost for server
Vulnerable to censorship

34
Pure p2p

Advantages
Efficiency: harnessing unused resources
Self-organizing
Robustness and availability through replication
Anonymity/legal protection/censorship resistant

Disadvantages
Difficult to manage and poor results due to lack
of central indexing
Bandwidth consuming
Idle nodes degrade the overall performance
Higher processing cost for peers

35
Super peers

Advantages
Scalable
Fault tolerant
Adaptable and self-organizing
Efficient
Low path-length

Disadvantages
Hard to manage/maintain
Complex topology, difficult to evaluate its
metrics (through simulation or trace driven
analysis)

36
Content-based searching architectures

Advantages
Low search cost ( O(logN) )
Harnessing the content information into queries.
Good approach for content that can be described
with simple attributes.
Fewer messages per query than a random graph.
Load balancing.

Disadvantages
More restrictions than topology-based
architectures: when nodes join/leave, rehashing
and content migration need to be performed.
A peer needs to know what it is looking for, to
map it to an address.
Not practical for content described by multiple
attributes.
Storage and routing are closely connected

37
Conclusions about p2p systems

Benefits
Efficiency: harnessing unused resources
Self-organizing
Sharing cost of ownership
Robustness and availability through replication
Anonymity/legal protection

Challenges
No authority to enforce behavior
Cooperation
Unreliability of individual peers
Efficiency of distributed operations (absolute
resources)

Imposed research issues
Resource Management
Security
Efficient Search

38
Resource Management

Resource
Storage/information
CPU processing
Bandwidth
Issues
fairness
load balancing

39
Security

Issues
Reputation
Trust
Accountability
Information preservation and quality
Denial of service attacks
Problem: detecting and punishing bad behavior

40
Efficiency of Search

Problem: finding a needle in a haystack
Efficiency measured in terms of absolute
resources consumed
Bandwidth
Processing cost
Several factors
Purity
Control
Query expressiveness

41
Project JXTA

JXTA is a set of protocols which allow peers to
discover and communicate with each other
Protocols are defined in terms of XML messages
exchanged between peers
JXTA is platform (e.g. Windows), language (e.g.
Java) and transport (e.g. TCP/IP) independent

42
JXTA Concepts

Concepts
Peer - a node that speaks the JXTA protocols
Peer Group - a collection of cooperating peers
Message - a datagram containing an envelope,
protocol headers and bodies
Pipe - an async communication channel for
sending/receiving messages
Advertisement - an XML document that publishes
the existence of a resource (peer, peer group,
pipe, service)

43
JXTA Model
44
JXTA Protocols

Peer Discovery Protocol - used between any peers
to find other peers, peer groups, or
advertisements
Peer Information Protocol - used to learn about
another peer's properties
Peer Resolver Protocol - 'foundation protocol'
for the Peer Discovery Protocol and the Peer
Information Protocol. Can be used to build other
protocols as well. Defines send/receive 'generic
queries' and responses to be sent from one peer
to another

Peer Membership Protocol - used to find out
about, join and leave groups
Pipe Binding Protocol - used to bind a pipe to an
actual endpoint
Peer Endpoint Protocol - used to provide routing
information for paths between peers (if a direct
connection is not possible)

45
JXTA Search

JXTASearch is a framework for searching in
distributed networks
A protocol for registration, query and response
A series of services for interacting via this
protocol

46
JXTA Search

Advantages
Supports very dynamic networks
Reduces publishing and query response latency
Centralized control (centralized implementation
of security, accounting, membership, ...)