Title: Peer-to-Peer (P2P) and Sensor Networks
1Peer-to-Peer (P2P) and Sensor Networks
- Shivkumar Kalyanaraman
- Rensselaer Polytechnic Institute
- shivkuma_at_ecse.rpi.edu
- http://www.ecse.rpi.edu/Homepages/shivkuma
- Based in part upon slides of Don Towsley, Ion Stoica, Scott Shenker, Joe Hellerstein, Jim Kurose, Hung-Chang Hsiao, Chung-Ta King
2Overview
- P2P networks: Napster, Gnutella, Kazaa
- Distributed Hash Tables (DHTs)
- Database perspectives: data-centricity, data-independence
- Sensor networks and their connection to P2P
3P2P: Key Idea
- Share the content, storage and bandwidth of individual (home) users
Internet
6What is P2P (Peer-to-Peer)?
- P2P as a mindset
- Slashdot
- P2P as a model
- Gnutella
- P2P as an implementation choice
- Application-layer multicast
- P2P as an inherent property
- Ad-hoc networks
7P2P Application Taxonomy
P2P Systems
- Distributed Computing: SETI@home
- File Sharing: Gnutella
- Collaboration: Jabber
- Platforms: JXTA
8How to Find an Object in a Network?
Network
9A Straightforward Idea
Use a BIG server
Store the object
How to do it in a distributed way?
Network
Provide a directory
10Why Distributed?
- Client-server model
- Client is dumb
- Server does most things (compute, store, control)
- Centralization makes things simple, but introduces
- Single point of failure, performance bottleneck, tighter control, access fees and management cost
- What about ad hoc participation?
- Estimate of net PCs
- 10 billion MHz of CPUs
- 10,000 terabytes of storage
- Clients are not that dumb after all
- Use the resources in the clients (at net edges)
13First Idea: Napster
- Distributing objects, centralizing directory
Network
17Today: P2P Video Traffic Is Dominant
- Source: CacheLogic. Video, BitTorrent, eDonkey!
18 40-60% P2P traffic
19 2006 P2P Data
- Between 50 and 65 percent of all download traffic is P2P related.
- Between 75 and 90 percent of all upload traffic is P2P related.
- And it seems that more people are using P2P today
- In 2004, one CacheLogic server registered 3 million IP addresses in 30 days.
- In 2006, one CacheLogic server registered 3 million IP addresses in 8 days.
- So what do people download?
- 61.4 percent video, 11.3 percent audio, 27.2 percent games/software/etc.
- The average file size of shared files is 1 gigabyte!
- Source: http://torrentfreak.com/peer-to-peer-traffic-statistics/
21A More Aggressive Idea
- Distributing objects and directory
Blind flooding!
How to find objects w/o directory?
Network
23Gnutella
- Distribute file location
- Idea: flood the request
- How to find a file
- Send request to all neighbors
- Neighbors recursively multicast the request
- Eventually a machine that has the file receives the request, and it sends back the answer
- Advantages
- Totally decentralized, highly robust
- Disadvantages
- Not scalable: the entire network can be swamped with requests (to alleviate this problem, each request has a TTL)
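The flooding search with a TTL can be sketched as a breadth-first traversal. This is a minimal illustration, not the Gnutella wire protocol: the overlay `OVERLAY`, the peer names and the function `flood_query` are all hypothetical, and duplicate suppression is assumed to happen on arrival.

```python
from collections import deque

def flood_query(graph, holders, origin, ttl):
    """Flood a query from `origin`; return (peers that hold the file, messages sent)."""
    seen = {origin}
    frontier = deque([(origin, ttl)])
    hits, messages = set(), 0
    while frontier:
        node, remaining = frontier.popleft()
        if node in holders:
            hits.add(node)            # this peer would send back an answer
        if remaining == 0:
            continue                  # TTL exhausted: do not forward further
        for nbr in graph[node]:
            messages += 1             # every forwarded copy costs one message
            if nbr not in seen:       # duplicate copies are dropped on arrival
                seen.add(nbr)
                frontier.append((nbr, remaining - 1))
    return hits, messages

# Hypothetical 6-peer overlay (adjacency lists); only peer "F" holds the file.
OVERLAY = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D", "F"],
    "F": ["E"],
}

near = flood_query(OVERLAY, {"F"}, "A", ttl=2)   # too few hops to reach F
far = flood_query(OVERLAY, {"F"}, "A", ttl=4)    # reaches F at a higher message cost
```

In this toy overlay the small TTL keeps traffic low but misses the file, which is the recall/scalability trade-off the slide describes.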
24Gnutella: Unstructured P2P
- Ad-hoc topology
- Queries are flooded for bounded number of hops
- No guarantees on recall
[Figure: a query for "xyz" flooded through the overlay]
25Now: BitTorrent, eDonkey2000! (2006)
26Lessons and Limitations
- Client-Server performs well
- But not always feasible
- Ideal performance is often not the key issue!
- Things that flood-based systems do well
- Organic scaling
- Decentralization of visibility and liability
- Finding popular stuff
- Fancy local queries
- Things that flood-based systems do poorly
- Finding unpopular stuff [Loo, et al., VLDB 04]
- Fancy distributed queries
- Vulnerabilities: data poisoning, tracking, etc.
- Guarantees about anything (answer quality, privacy, etc.)
27Detour: BitTorrent
30BitTorrent: joining a torrent
[Figure labels: metadata file, peer list, join, data request]
- Peers divided into
- seeds: have the entire file
- leechers: still downloading
1. obtain the metadata file
2. contact the tracker
3. obtain a peer list (contains seeds and leechers)
4. contact peers from that list for data
31BitTorrent: exchanging data
[Figure label: "I have"]
- Verify pieces using hashes
- Download sub-pieces in parallel
- Advertise received pieces to the entire peer list
- Look for the rarest pieces
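Two of these steps, hash verification and rarest-first selection, can be sketched as follows. The sketch assumes SHA-1 piece hashes and breaks ties by lowest piece index; the function names and the `PEERS` data are hypothetical, and real clients typically randomize among equally rare pieces.

```python
import hashlib
from collections import Counter

def verify_piece(data: bytes, expected_sha1_hex: str) -> bool:
    """Accept a downloaded piece only if its SHA-1 matches the metadata file."""
    return hashlib.sha1(data).hexdigest() == expected_sha1_hex

def rarest_first(have: set, peer_pieces: dict):
    """Return the missing piece advertised by the fewest peers, or None."""
    counts = Counter()
    for pieces in peer_pieces.values():
        counts.update(pieces - have)      # count only pieces we still need
    if not counts:
        return None
    # Assumption: ties are broken by the lowest piece index.
    return min(counts, key=lambda piece: (counts[piece], piece))

# Hypothetical "I have" advertisements from three peers.
PEERS = {"p1": {0, 1, 2}, "p2": {1, 2, 3}, "p3": {1}}
next_piece = rarest_first({0}, PEERS)     # piece 3 is held by only one peer
```

Fetching the rarest piece first spreads scarce pieces quickly, so the torrent survives if the few peers holding them leave.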
32BitTorrent: unchoking
- Periodically calculate data-receiving rates
- Upload to (unchoke) the fastest downloaders
- Optimistic unchoking
- periodically select a peer at random and upload to it
- continuously look for the fastest partners
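A minimal sketch of one unchoking round, under stated assumptions: peers are ranked by the rate at which we currently receive data from them, three regular slots are used, and one extra peer is unchoked at random. The slot count, the function name `choose_unchoked` and the rate table are illustrative, not taken from any client.

```python
import random

def choose_unchoked(recv_rates, regular_slots=3, rng=random):
    """Pick peers to upload to for the next period.

    recv_rates maps peer -> measured data-receiving rate from that peer.
    Returns (regular unchokes, optimistic unchoke or None).
    """
    ranked = sorted(recv_rates, key=recv_rates.get, reverse=True)
    unchoked = ranked[:regular_slots]          # reward the fastest partners
    choked = ranked[regular_slots:]
    # Optimistic unchoke: give one random choked peer a chance to prove itself.
    optimistic = rng.choice(choked) if choked else None
    return unchoked, optimistic

# Hypothetical rates in KB/s measured over the last period.
rates = {"p1": 50, "p2": 10, "p3": 80, "p4": 5, "p5": 30}
regular, optimistic = choose_unchoked(rates)
```

The random slot is what lets a newly joined peer, which has nothing to offer yet, receive its first pieces.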
33End of Detour
34Back to P2P Structures
- Unstructured P2P architecture
- Napster, Gnutella, Freenet
- No logically deterministic structures to organize the participating peers
- No guarantee that objects will be found
- How to find objects within some number of hops?
- Extend hashing
- Structured P2P architecture
- CAN, Chord, Pastry, Tapestry, Tornado, ...
- Viewed as a distributed hash table for the directory
35How to Bound Search Quality?
Work on placement!
Network
36High-Level Idea: Indirection
- Indirection in space
- Logical (content-based) IDs, routing to those IDs
- Content-addressable network
- Tolerant of churn
- nodes joining and leaving the network
- Indirection in time
- Want some scheme to temporally decouple send and receive
- Persistence required. Typical Internet solution: soft state
- Combo of persistence via storage and via retry
- Publisher requests TTL on storage
- Republishes as needed
- Metaphor: Distributed Hash Table
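The soft-state idea (store with a TTL, republish as needed) can be sketched with a tiny in-memory store. The class `SoftStateStore` and its method names are hypothetical, and time is passed in explicitly so the example is deterministic.

```python
class SoftStateStore:
    """Toy soft-state storage: entries vanish unless the publisher refreshes them."""

    def __init__(self):
        self._items = {}                      # key -> (value, expiry time)

    def publish(self, key, value, ttl, now):
        """Store (or refresh) an item; the publisher requests the TTL."""
        self._items[key] = (value, now + ttl)

    def lookup(self, key, now):
        """Return the value if it has not expired, else None."""
        entry = self._items.get(key)
        if entry is None or entry[1] <= now:
            self._items.pop(key, None)        # expired state is simply dropped
            return None
        return entry[0]

store = SoftStateStore()
store.publish("song", "peer-17", ttl=10, now=0)
fresh = store.lookup("song", now=5)           # still stored
stale = store.lookup("song", now=10)          # expired, publisher must retry
store.publish("song", "peer-17", ttl=10, now=12)   # republish as needed
again = store.lookup("song", now=15)
```

No explicit delete or failure notification is needed: if the publisher leaves, its entries age out on their own.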
37Basic Idea
[Figure: P2P network in which peer x joins with key H(x) and object y is published with key H(y)]
- Objects have hash keys
- Peer nodes also have hash keys in the same hash space
- Place each object at the peer with the closest hash key
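The placement rule can be sketched in a few lines. The assumptions here are mine: MD5 stands in for H to give a 128-bit key space, "closest" means circular numeric distance, and the peer names are made up. Individual DHT designs define closeness differently.

```python
import hashlib

SPACE = 2 ** 128                     # assumed 128-bit key space

def H(name: str) -> int:
    """Hash a peer address or object name into the shared key space."""
    return int.from_bytes(hashlib.md5(name.encode()).digest(), "big")

def ring_distance(a: int, b: int) -> int:
    """Distance between two keys when the key space wraps around."""
    d = abs(a - b)
    return min(d, SPACE - d)

def responsible_peer(obj: str, peers):
    """Return the peer whose hash key is closest to the object's hash key."""
    key = H(obj)
    return min(peers, key=lambda p: (ring_distance(H(p), key), H(p)))

# Hypothetical peers; any node can compute the same answer locally.
peers = ["peer-a", "peer-b", "peer-c", "peer-d"]
home = responsible_peer("song.mp3", peers)
```

Because publishers and searchers apply the same hash, both arrive at the same peer without consulting a directory.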
38Distributed Hash Tables (DHTs)
- Abstraction: a distributed hash-table data structure
- insert(id, item)
- item = query(id) (or lookup(id))
- Note: item can be anything: a data object, document, file, pointer to a file
- Proposals
- CAN, Chord, Kademlia, Pastry, Tapestry, etc.
- Goals
- Make sure that an identified item (file) is always found
- Scales to hundreds of thousands of nodes
- Handles rapid arrival and failure of nodes
39Viewed as a Distributed Hash Table
[Figure: hash table spanning keys 0 to 2^128-1, partitioned among peer nodes]
- Each peer is responsible for a range of the hash table, according to the peer's hash key
- Objects are placed in the peer with the closest key
- Note that peers are at the Internet edges
40How to Find an Object?
[Figure: hash table spanning keys 0 to 2^128-1, partitioned among peer nodes]
- Simplest idea: everyone knows everyone else! One hop to find the object
- But we want to keep only a few entries!
41Structured Networks
- Distributed Hash Tables (DHTs)
- Hash table interface: put(key, item), get(key)
- O(log n) hops
- Guarantees on recall
42Content Addressable Network, CAN
- Distributed hash table
- Hash table as in a Cartesian coordinate space
- A peer only needs to know its logical neighbors
- Dimensional-ordered multihop routing
43Content Addressable Network (CAN)
- Associate to each node and item a unique id in a d-dimensional Cartesian space on a d-torus
- Properties
- Routing table size O(d)
- Guarantees that a file is found in at most d * n^(1/d) steps, where n is the total number of nodes
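To get a feel for the bound, the short sketch below evaluates d * n^(1/d) for a few dimensions. The function name and the sample values of n and d are arbitrary choices for illustration.

```python
def can_hop_bound(n: int, d: int) -> float:
    """Upper bound on CAN lookup steps quoted above: d * n^(1/d)."""
    return d * n ** (1.0 / d)

# For about one million nodes, more dimensions shorten routes
# at the price of a larger O(d) routing table.
n = 2 ** 20
table = {d: round(can_hop_bound(n, d), 1) for d in (2, 4, 10)}
```

With n = 2^20 the bound drops from 2048 steps at d = 2 to about 40 steps at d = 10.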
44CAN Example: Two-Dimensional Space
- Space divided between nodes
- Together, the nodes cover the entire space
- Each node covers either a square or a rectangular area of ratios 1:2 or 2:1
- Example
- Node n1(1, 2) is the first node that joins, so it covers the entire space
[Figure: 8 x 8 coordinate grid (axes 0-7) with n1 at (1, 2)]
45CAN Example: Two-Dimensional Space
- Node n2(4, 2) joins, so the space is divided between n1 and n2
[Figure: 8 x 8 coordinate grid (axes 0-7) showing n1 and n2]
46CAN Example: Two-Dimensional Space
- Node n3(3, 5) joins, so the space is divided between n1 and n3
[Figure: 8 x 8 coordinate grid (axes 0-7) showing n1, n2 and n3]
47CAN Example: Two-Dimensional Space
- Nodes n4(5, 5) and n5(6, 6) join
[Figure: 8 x 8 coordinate grid (axes 0-7) showing n1 through n5]
48CAN Example: Two-Dimensional Space
- Nodes: n1(1, 2), n2(4, 2), n3(3, 5), n4(5, 5), n5(6, 6)
- Items: f1(2, 3), f2(5, 1), f3(2, 1), f4(7, 5)
[Figure: 8 x 8 coordinate grid (axes 0-7) showing nodes n1 through n5 and items f1 through f4]
49CAN Example: Two-Dimensional Space
- Each item is stored by the node that owns its mapping in the space
[Figure: 8 x 8 coordinate grid (axes 0-7) showing each item inside the zone of its owning node]
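The ownership rule can be sketched with the node and item coordinates from the example. The zone boundaries below are an assumption: the figure did not survive, so the zones are reconstructed by halving the 8 x 8 space in join order, which is consistent with the node positions and the 1:2 shape rule but is not confirmed by the slides.

```python
# Assumed zones as (x_min, x_max, y_min, y_max), upper bounds exclusive.
ZONES = {
    "n1": (0, 4, 0, 4),
    "n2": (4, 8, 0, 4),
    "n3": (0, 4, 4, 8),
    "n4": (4, 6, 4, 8),
    "n5": (6, 8, 4, 8),
}
# Coordinates as listed in the example.
NODES = {"n1": (1, 2), "n2": (4, 2), "n3": (3, 5), "n4": (5, 5), "n5": (6, 6)}
ITEMS = {"f1": (2, 3), "f2": (5, 1), "f3": (2, 1), "f4": (7, 5)}

def owner(point):
    """Return the node whose zone contains the given point."""
    x, y = point
    for node, (x0, x1, y0, y1) in ZONES.items():
        if x0 <= x < x1 and y0 <= y < y1:
            return node
    raise ValueError(f"{point} lies outside the coordinate space")

placement = {item: owner(xy) for item, xy in ITEMS.items()}
```

Under these assumed zones, f1 and f3 land on n1, f2 on n2, and f4 on n5.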
50CAN: Query Example
- Each node knows its neighbors in the d-space
- Forward query to the neighbor that is closest to the query id
- Example: assume n1 queries f4
- Can route around some failures
[Figure: 8 x 8 coordinate grid (axes 0-7) showing the query path from n1 towards f4]
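Greedy forwarding for the query from n1 to f4 can be sketched as below. Several things are assumptions rather than part of the slides: the zone boundaries (reconstructed by halving the space in join order), the use of point-to-zone Euclidean distance as the "closest" metric, and ignoring the torus wraparound for simplicity.

```python
# Assumed zones as (x_min, x_max, y_min, y_max), upper bounds exclusive.
ZONES = {
    "n1": (0, 4, 0, 4),
    "n2": (4, 8, 0, 4),
    "n3": (0, 4, 4, 8),
    "n4": (4, 6, 4, 8),
    "n5": (6, 8, 4, 8),
}

def contains(zone, point):
    x0, x1, y0, y1 = zone
    return x0 <= point[0] < x1 and y0 <= point[1] < y1

def are_neighbors(a, b):
    """Zones are neighbors if they share a border segment (not just a corner)."""
    x_overlap = min(a[1], b[1]) - max(a[0], b[0])
    y_overlap = min(a[3], b[3]) - max(a[2], b[2])
    return (x_overlap == 0 and y_overlap > 0) or (y_overlap == 0 and x_overlap > 0)

def distance_to_zone(point, zone):
    """Euclidean distance from a point to the nearest edge of a zone."""
    x0, x1, y0, y1 = zone
    dx = max(x0 - point[0], 0, point[0] - x1)
    dy = max(y0 - point[1], 0, point[1] - y1)
    return (dx * dx + dy * dy) ** 0.5

def route(start, target):
    """Forward hop by hop to the neighbor closest to the target point."""
    path, current = [start], start
    while not contains(ZONES[current], target):
        nbrs = [n for n in ZONES
                if n != current and are_neighbors(ZONES[current], ZONES[n])]
        current = min(nbrs, key=lambda n: (distance_to_zone(target, ZONES[n]), n))
        if current in path:               # guard: greedy routing made no progress
            raise RuntimeError("routing loop")
        path.append(current)
    return path

path = route("n1", (7, 5))                # n1 queries f4 at (7, 5)
```

Each node needs only its immediate neighbors' zones, which is why the routing table stays at O(d) entries.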
51Another Design: Chord
- Node and object keys
- random location around a circle
- Neighbors
- nodes 2^-i around the circle
- found by routing to desired key
- Routing: greedy
- pick nbr closest to destination
- Storage: own interval
- node owns key range between her key and previous node's key
[Figure: nodes placed around the Chord circle, with one node's ownership range marked]
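A small simulation makes the neighbor, routing and storage rules concrete. It is a sketch, not a Chord implementation: it uses a 6-bit ring with ten made-up node ids, computes every finger table from global knowledge, and has no joins or failures. Fingers at distances 2^i on the 64-position ring correspond to the fractions 2^-i of the circle.

```python
from bisect import bisect_left

M = 6                                   # assumed identifier length in bits
RING = 2 ** M
NODES = sorted([1, 8, 14, 21, 32, 38, 42, 48, 51, 56])   # hypothetical node ids

def successor(key):
    """First node at or clockwise after key: the node that owns the key."""
    i = bisect_left(NODES, key % RING)
    return NODES[i % len(NODES)]

def predecessor(node):
    return NODES[NODES.index(node) - 1]

def fingers(node):
    """Neighbors: owners of the points node + 2^i around the circle."""
    return [successor((node + 2 ** i) % RING) for i in range(M)]

def in_open(x, a, b):
    """x in the circular interval (a, b)."""
    return a < x < b if a < b else (x > a or x < b)

def in_half_open(x, a, b):
    """x in the circular interval (a, b]."""
    return a < x <= b if a < b else (x > a or x <= b)

def lookup(start, key):
    """Greedy routing: hop to the finger closest to, but not past, the key."""
    key %= RING
    path, node = [start], start
    while not in_half_open(key, predecessor(node), node):   # node does not own key
        table = fingers(node)
        succ = table[0]
        if in_half_open(key, node, succ):
            nxt = succ                   # the immediate successor owns the key
        else:
            nxt = next((f for f in reversed(table) if in_open(f, node, key)), succ)
        node = nxt
        path.append(node)
    return path

hops = lookup(8, 54)                     # route from node 8 to the owner of key 54
```

Each hop roughly halves the remaining distance around the circle, which is where the O(log n) hop count of structured networks comes from.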
52OpenDHT
- A shared DHT service
- The Bamboo DHT
- Hosted on PlanetLab
- Simple RPC API
- You don't need to deploy or host to play with a real DHT!
53Review: DHTs vs Unstructured P2P
- DHTs good at
- exact match for rare items
- DHTs bad at
- keyword search, etc. (can't construct a DHT-based Google)
- tolerating extreme churn
- Gnutella etc. (unstructured P2P) good at
- general search
- finding common objects
- very dynamic environments
- Gnutella etc. bad at
- finding rare items
54Distributed Systems: Pre-Internet
- Connected by LANs (low loss and delay)
- Small scale (10s, maybe 100s per server)
- PODC literature focused on algorithms to achieve strict semantics in the face of failures
- Two-phase commits
- Synchronization
- Byzantine agreement
- Etc.
55Distributed Systems: Post-Internet
- Very different context
- Huge scales (thousands if not millions)
- Highly variable connectivity
- Failures common
- Organic growth
- Abandoned distributed strict semantics
- Adaptive apps rather than guaranteed infrastructure
- Adopted pairwise client-server approach
- Server is centralized (even if server farm)
- Relatively primitive approach (no sophisticated dist. algms.)
- Little support from infrastructure or middleware
56A Database Viewpoint on DHTs: Towards Data-centricity, Data Independence
57Host-centric Protocols
- Protocols defined in terms of IP addresses
- Unicast: IP address identifies a host
- Multicast: IP address identifies a set of hosts
- Destination address is given to protocol
- Protocol delivers data from one host to another
- unicast: conceptually trivial
- multicast: address is logical, not physical
58Host-centric Applications
- Classic applications: destination is intrinsic
- telnet: target machine
- FTP: location of files
- electronic mail: email address turns into mail server
- multimedia conferencing: machines of participants
- Destination is specified by user (not network)
- Usually specified by hostname, not address
- DNS translates names into addresses
59Domain Name System (DNS)
- DNS is built around recursive delegation
- Top level domains (TLDs): .com, .net, .edu, etc.
- TLDs delegate authority to subdomains
- berkeley.edu
- Subdomains can further delegate
- cs.berkeley.edu
- Hierarchy fits host administrative structure
- Local decentralized control
- Crucial to efficient hostname resolution
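Resolution by following delegations can be sketched with a toy zone table. Everything in it is hypothetical: the zone contents are invented, the addresses come from the 192.0.2.0/24 documentation range, and real DNS adds caching, record types and referrals to name servers rather than to zone names.

```python
# Hypothetical zone data: each zone knows its own hosts and the subdomains it delegates.
ZONES = {
    ".": {"delegates": ["edu"], "hosts": {}},
    "edu": {"delegates": ["berkeley.edu"], "hosts": {}},
    "berkeley.edu": {
        "delegates": ["cs.berkeley.edu"],
        "hosts": {"www.berkeley.edu": "192.0.2.10"},
    },
    "cs.berkeley.edu": {
        "delegates": [],
        "hosts": {"www.cs.berkeley.edu": "192.0.2.20"},
    },
}

def resolve(hostname):
    """Walk down from the root, following delegations until a zone knows the host."""
    zone, consulted = ".", ["."]
    while True:
        data = ZONES[zone]
        if hostname in data["hosts"]:
            return data["hosts"][hostname], consulted
        child = next((c for c in data["delegates"]
                      if hostname == c or hostname.endswith("." + c)), None)
        if child is None:
            raise KeyError(hostname)      # nobody is responsible for this name
        zone = child
        consulted.append(zone)

address, trail = resolve("www.cs.berkeley.edu")
```

The lookup path mirrors the administrative hierarchy, which is efficient for host names but is exactly what flat data names lack.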
60Modern Web -> Data-Centricity
- URLs often function as names of data
- users think of www.cnn.com as data, not a host
- Fact that www.cnn.com is a hostname is irrelevant
- Users want data, not access to particular host
- The web is now data-centric
61Data-centric App in Host-centric World
- Data still associated with host names (URLs)
- administrative structure of data same as hosts
- weak point in current web
- Key enabler: search engines
- Searchable databases map keywords to URLs
- Allowed users to find desired data
- Networkers focused on technical problems
- HTTP, persistence (URNs), replication (CDNs), ...
62A DNS for Data? DHTs
- Can we map data names into addresses?
- a data-centric DNS, distributed and scalable
- doesn't alter net protocols, but aids data location
- not just about stolen music, but a general facility
- A formidable challenge
- Data does not have a clear administrative hierarchy
- Likely need to support a flat namespace
- Can one do this scalably?
- Data-centrism requires scalable flat lookups => DHTs
63Data Independence In DB Design
- Decouple app-level API from data organization
- Can make changes to data layout without modifying applications
- Simple version: location-independent names
- Fancier: declarative queries
"As clear a paradigm shift as we can hope to find in computer science" - C. Papadimitriou
64The Pillars of Data Independence
- Indexes
- Value-based lookups have to compete with direct access
- Must adapt to shifting data distributions
- Must guarantee performance
- Query Optimization
- Support declarative queries beyond lookup/search
- Must adapt to shifting data distributions
- Must adapt to changes in environment
65Generalizing Data Independence
- A classic level of indirection scheme
- Indexes are exactly that
- Complex queries are a richer indirection
- The key for data independence
- It's all about rates of change
- Hellerstein's Data Independence Inequality
- Data independence matters when
- d(environment)/dt >> d(app)/dt
66Data Independence in Networks
- d(environment)/dt >> d(app)/dt
- In databases, the RHS is unusually small
- This drove the relational database revolution
- In extreme networked systems, LHS is unusually high
- And the applications are increasingly complex and data-driven
- Simple indirections (e.g. local lookaside tables) are insufficient
67Hierarchical Networks (& Queries)
- IP
- Hierarchical name space (www.vldb.org, 141.12.12.51)
- Hierarchical routing
- Autonomous Systems correlate with name space (though not perfectly)
- DNS
- Hierarchical name space (clients + hierarchy of servers)
- Hierarchical routing w/aggressive caching
- 13 managed root servers
- Traditional pros/cons of hierarchical data mgmt
- Works well for things aligned with the hierarchy
- Esp. physical locality a la Astrolabe
- Inflexible
- No data independence!
68The Pillars of Data Independence
- Indexes
- Value-based lookups have to compete with direct access
- Must adapt to shifting data distributions
- Must guarantee performance
- Query Optimization
- Support declarative queries beyond lookup/search
- Must adapt to shifting data distributions
- Must adapt to changes in environment
69Sensor Networks: The Internet Meets the Environment
70Today: Internet Meets Mobile Wireless Computing
- Computing: smaller, faster
- Disks: larger size, small form
- Communications: wireless voice, data
- Multimedia integration: voice, data, video, games
[Figure captions: iPod (impact of disk size/cost); Samsung cameraphone w/ camcorder; SONY PSP (mobile gaming); Blackberry (phone, PDA)]
71Tomorrow: Embedded Networked Sensing Apps
- Micro-sensors, on-board processing, wireless interfaces feasible at very small scale; can monitor phenomena up close
- Enables spatially and temporally dense environmental monitoring
- Embedded Networked Sensing will reveal previously unobservable phenomena
[Figure captions: Seismic structure response; Contaminant transport; Ecosystems, biocomplexity; Marine microorganisms]
72Embedded Networked Sensing: Motivation
- Imagine
- high-rise buildings self-detect structural faults (e.g., weld cracks)
- schools detect airborne toxins at low concentrations, trace contaminant transport to source
- buoys alert swimmers to dangerous bacterial levels
- earthquake-rubbled building infiltrated with robots and sensors: locate survivors, evaluate structural damage
- ecosystems infused with chemical, physical, acoustic, image sensors to track global change parameters
- battlefield sprinkled with sensors that identify and track friendly/foe air, ground vehicles, personnel
73Embedded Sensor Nets: Enabling Technologies
- Embedded: embed numerous distributed devices to monitor and interact with physical world
- Control system w/ small form factor, untethered nodes
- Networked: network devices to coordinate and perform higher-level tasks
- Exploit collaborative sensing, action
- Sensing: tightly coupled to physical world
- Exploit spatially/temporally dense, in situ/remote, sensing/actuation
74Sensornets
- Vision
- Many sensing devices with radio and processor
- Enable fine-grained measurements over large areas
- Huge potential impact on science and society
- Technical challenges
- untethered: power consumption must be limited
- unattended: robust and self-configuring
- wireless: ad hoc networking
75Similarity w/ P2P Networks
- Sensornets are inherently data-centric
- Users know what data they want, not where it is
- Estrin, Govindan, Heidemann (2000, etc.)
- Centralized database infeasible
- vast amount of data, constantly being updated
- small fraction of data will ever be queried
- sending to single site expends too much energy
76Sensor Nets: New Design Themes
- Self-configuring systems that adapt to unpredictable environment
- dynamic, messy (hard to model) environments preclude pre-configured behavior
- Leverage data processing inside the network
- exploit computation near data to reduce communication
- collaborative signal processing
- achieve desired global behavior with localized algorithms (distributed control)
- Long-lived, unattended, untethered, low duty cycle systems
- energy a central concern
- communication primary consumer of scarce energy resource
77From Embedded Sensing to Embedded Control
- embedded in unattended control systems
- control network, and act in environment
- critical apps extend beyond sensing to control and actuation
- transportation, precision agriculture, medical monitoring and drug delivery, battlefield apps
- concerns extend beyond traditional networked systems and apps: usability, reliability, safety
- need systems architecture to manage interactions
- current system development: one-off, incrementally tuned, stove-piped
- repercussions for piecemeal uncoordinated design: insufficient longevity, interoperability, safety, robustness, scaling
78Why can't we simply adapt Internet protocols, end-to-end architecture?
- Internet routes data using IP addresses in packets and lookup tables in routers
- humans get data by naming data to a search engine
- many levels of indirection between name and IP address
- embedded, energy-constrained (un-tethered, small-form-factor), unattended systems can't tolerate communication overhead of indirection
- special-purpose system function(s) don't need or want Internet general-purpose functionality designed for elastic applications
79Sample Layered Architecture
- User queries, external database
- In-network: application processing, data aggregation, query processing
- Data dissemination, storage, caching
- Adaptive topology, geo-routing
- MAC, time, location
- Phy: comm, sensing, actuation, SP
Resource constraints call for more tightly integrated layers. Open question: what are the defining architectural principles?
80Coverage measures
- area coverage: fraction of area covered by sensors
- detectability: probability sensors detect moving objects
- node coverage: fraction of sensors covered by other sensors
- control
- where to add new nodes for max coverage
- how to move existing nodes for max coverage
[Figure labels: D, x, S]
Given: sensor field (either known sensor locations, or spatial density)
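For known sensor locations, area coverage and node coverage can be estimated as sketched below. The disc sensing model (a point is covered if it lies within a fixed radius of some sensor), the grid sampling and the function names are assumptions made for illustration.

```python
def area_coverage(sensors, radius, width, height, cells_per_unit=4):
    """Fraction of a width x height field within `radius` of at least one sensor.

    Estimated by testing the center of every cell of a regular grid.
    """
    nx, ny = int(width * cells_per_unit), int(height * cells_per_unit)
    r2 = radius * radius
    covered = 0
    for i in range(nx):
        for j in range(ny):
            x = (i + 0.5) / cells_per_unit
            y = (j + 0.5) / cells_per_unit
            if any((x - sx) ** 2 + (y - sy) ** 2 <= r2 for sx, sy in sensors):
                covered += 1
    return covered / (nx * ny)

def node_coverage(sensors, radius):
    """Fraction of sensors that lie within `radius` of at least one other sensor."""
    if not sensors:
        return 0.0
    r2 = radius * radius
    covered = sum(
        1 for i, (x, y) in enumerate(sensors)
        if any(j != i and (x - ox) ** 2 + (y - oy) ** 2 <= r2
               for j, (ox, oy) in enumerate(sensors))
    )
    return covered / len(sensors)

# Hypothetical field: one sensor with sensing radius 2 in the middle of a 10 x 10 area.
single = area_coverage([(5, 5)], radius=2, width=10, height=10)
```

The same grid loop can answer the control questions by trying candidate positions for a new node and keeping the one that raises the covered fraction most.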
81In-Network Processing
- communication expensive when limited
- power
- bandwidth
- perform (data) processing in network
- close to (at) data
- forward fused/synthesized results
- e.g., find max. of data
- distributed data, distributed computation
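The "find max" example can be sketched on a small routing tree, comparing message counts with and without in-network processing. The tree, the readings and the function names are hypothetical, and one message per hop is used as a stand-in for energy cost.

```python
# Hypothetical routing tree (child -> parent) rooted at the sink, with one reading per node.
PARENT = {"a": "sink", "b": "sink", "c": "a", "d": "a", "e": "c"}
READINGS = {"a": 17, "b": 21, "c": 30, "d": 12, "e": 25}

def depth(parent, node):
    """Number of hops from node to the sink."""
    hops = 0
    while node in parent:
        node = parent[node]
        hops += 1
    return hops

def centralized_max(parent, readings):
    """Every raw reading is relayed hop by hop to the sink, which takes the max."""
    messages = sum(depth(parent, node) for node in readings)
    return max(readings.values()), messages

def in_network_max(parent, readings, sink="sink"):
    """Each node forwards one fused value: the max of its own and its children's data."""
    partial = dict(readings)
    messages = 0
    # Deepest nodes report first, so parents fuse before forwarding.
    for node in sorted(parent, key=lambda n: depth(parent, n), reverse=True):
        up = parent[node]
        partial[up] = max(partial.get(up, float("-inf")), partial[node])
        messages += 1
    return partial[sink], messages

raw = centralized_max(PARENT, READINGS)
fused = in_network_max(PARENT, READINGS)
```

Both approaches return the same maximum, but fusing at each hop sends one message per node instead of one per node per hop.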
82Distributed Representation and Storage
- Data-centric protocols, in-network processing goal
- Interpretation of spatially distributed data (per-node processing alone is not enough)
- network does in-network processing based on distribution of data
- Queries automatically directed towards nodes that maintain relevant/matching data
- pattern-triggered data collection
- Multi-resolution data storage and retrieval
- Distributed edge/feature detection
- Index data for easy temporal and spatial searching
- Finding global statistics (e.g., distribution)
83Directed Diffusion: Data-Centric Routing
- Basic idea
- name data (not nodes) with externally relevant attributes: data type, time, location of node, SNR, ...
- diffuse requests and responses across network using application-driven routing (e.g., geo-sensitive or not)
- support in-network aggregation and processing
- data sources publish data, data clients subscribe to data
- however, all nodes may play both roles
- node that aggregates/combines/processes incoming sensor node data becomes a source of new data
- node that only publishes when a combination of conditions arises is a client for the triggering event data
- true peer-to-peer system?
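Naming data by attributes and matching it against subscriptions can be sketched as below. This shows only the naming and publish/subscribe idea: it is not the directed diffusion protocol (there are no gradients, reinforcement or multi-hop forwarding), and the attribute names, the `Node` class and the range syntax are all hypothetical.

```python
def matches(interest, data):
    """True if every attribute named in the interest is satisfied by the data.

    An interest value is either an exact value or a (low, high) range.
    """
    for attr, want in interest.items():
        if attr not in data:
            return False
        value = data[attr]
        if isinstance(want, tuple):
            low, high = want
            if not (low <= value <= high):
                return False
        elif value != want:
            return False
    return True

class Node:
    """A node that can be both a data client (subscribe) and a data source (publish)."""

    def __init__(self):
        self.subscriptions = []              # list of (interest, callback)

    def subscribe(self, interest, callback):
        self.subscriptions.append((interest, callback))

    def publish(self, data):
        for interest, callback in self.subscriptions:
            if matches(interest, data):
                callback(data)

received = []
node = Node()
# The client asks for data by description, never by the address of a sensor.
node.subscribe({"type": "temperature", "x": (0, 50)}, received.append)
node.publish({"type": "temperature", "x": 20, "value": 31.5})
node.publish({"type": "temperature", "x": 80, "value": 29.0})   # outside the region
node.publish({"type": "humidity", "x": 20, "value": 0.4})       # wrong data type
```

Only the first sample is delivered, since it is the only one whose attributes satisfy the interest.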
84Traditional Approach: Warehousing
- data extracted from sensors, stored on server
- Query processing takes place on server
85Sensor Database System
- Sensor Database System supports distributed query processing over sensor network
[Figure label: Sensor Nodes]
86Sensor Database System
- Can existing database techniques be reused? What are the new problems and solutions?
- Representing sensor data
- Representing sensor queries
- Processing query fragments on sensor nodes
- Distributing query fragments
- Adapting to changing network conditions
- Dealing with site and communication failures
- Deploying and Managing a sensor database system
- Characteristics of a Sensor Network
- Streams of data
- Uncertain data
- Large number of nodes
- Multi-hop network
- No global knowledge about the network
- Node failure and interference is common
- Energy is the scarce resource
- Limited memory
- No administration, ...
87Summary
- P2P networks: Napster, Gnutella, Kazaa
- Distributed Hash Tables (DHTs)
- Database perspectives: data-centricity, data-independence
- Sensor networks and their connection to P2P