Title: Querying the Internet with PIER (PIER = Peer-to-peer Information Exchange and Retrieval)
1 Querying the Internet with PIER (PIER = Peer-to-peer Information Exchange and Retrieval)
2 What is PIER?
- Peer-to-Peer Information Exchange and Retrieval
- A query engine that runs on top of a P2P network
- A step toward distributed query processing at a much larger scale
- A way to query massively distributed, heterogeneous data
- An architecture that marries traditional database query processing with recent peer-to-peer technologies
3
- Key goal: a scalable indexing system for large-scale decentralized storage applications on the Internet
- Examples: P2P, large-scale storage management systems (OceanStore, Publius), wide-area name resolution services
4 What is Very Large? Depends on Who You Are
- Internet-scale systems vs. hundred-node systems
  - Database community: hundred-node systems
  - Network community: Internet-scale systems
- How to run DB-style queries at Internet scale!
5 What are the Key Properties?
- Lots of data that is
  - Naturally distributed (stored where it is generated)
  - Undesirable to collect centrally
  - Homogeneous in schema
  - More useful when viewed as a whole
6 Who Needs Internet Scale? Example 1: Filenames
- Simple, ubiquitous schemas
  - Filenames, sizes, ID3 tags
- Born from early P2P systems such as Napster, Gnutella, etc.
- Content is shared by normal, non-expert home users
- Systems were built by a few individuals in their garages → low barrier to entry
7 Example 2: Network Traces
- Schemas are mostly standardized
  - IP, SMTP, HTTP, SNMP log formats
- Network administrators look for patterns within their site AND across other sites
  - DoS attacks cross administrative boundaries
  - Tracking virus/worm infections
- Timeliness is very helpful
- It might surprise you how useful this is
  - Network bandwidth on PlanetLab (a world-wide distributed research testbed) is mostly filled by people monitoring network status
8 Our Challenge
- Our focus is on the challenge of scale
- Applications are homogeneous and distributed
- Already have significant interest
- Provide a flexible framework for a wide variety of applications
9 Four Design Principles (I)
- Relaxed consistency
  - ACID transactions severely limit the scalability and availability of distributed databases
  - We provide best-effort results instead
- Organic scaling
  - Applications may start small, without a priori knowledge of their eventual size
10 Four Design Principles (II)
- Natural habitat
  - No CREATE TABLE/INSERT
  - No publishing to a web server
  - Wrappers or gateways allow the information to be accessed where it is created
- Standard schemas via grassroots software
  - Data is produced by widespread software, providing a de facto schema to utilize
11 >> based on CAN
12 Applications
- P2P databases
  - Highly distributed and available data
- Network monitoring
  - Intrusion detection
  - Fingerprint queries
13 DHTs
- Implemented with CAN (Content Addressable Network)
- Each node is identified by a hyper-rectangle (zone) in d-dimensional space
- A key is hashed to a point and stored at the node whose zone contains that point
- A routing table of O(d) neighbours is maintained
14 Given a message with an ID, route the message to the computer currently responsible for that ID
[Figure: a 2-D CAN coordinate space spanning (0,0) to (16,16), partitioned into zones]
15 DHT Design
- Routing layer
  - Mapping for keys (dynamic as nodes leave and join)
- Storage manager
  - Storage for DHT-based data
- Provider
  - Storage access interface for higher levels
16 DHT Routing
- Routing layer
  - Maps a key to the IP address of the node currently responsible for that key
  - Provides exact lookups; calls back to higher levels when the set of keys the node is responsible for changes
- Routing layer API
  - lookup(key) → ipaddr (asynchronous)
  - join(landmarkNode)
  - leave()
  - locationMapChange()
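To make the shape of this API concrete, here is a minimal sketch of the routing layer as a Python interface. The method names follow the slide; the callback-based signatures and everything else are illustrative assumptions, not PIER's actual code.

```python
from abc import ABC, abstractmethod
from typing import Callable

class RoutingLayer(ABC):
    """Sketch of the DHT routing-layer API listed above (hypothetical signatures)."""

    @abstractmethod
    def lookup(self, key: bytes, on_result: Callable[[str], None]) -> None:
        """Asynchronously resolve `key` to the IP address of the responsible node,
        delivering the result through the `on_result` callback."""

    @abstractmethod
    def join(self, landmark_node: str) -> None:
        """Join the overlay by contacting a known landmark node."""

    @abstractmethod
    def leave(self) -> None:
        """Leave the overlay gracefully."""

    @abstractmethod
    def location_map_change(self, callback: Callable[[], None]) -> None:
        """Register a callback fired when the set of keys this node owns changes."""
```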
17 DHT Storage
- The storage manager stores and retrieves records, which consist of key/value pairs
- Keys are used to locate items and can be any supported data type or structure
- Storage Manager API
  - store(key, item)
  - retrieve(key) → item
  - remove(key)
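A minimal in-memory sketch of such a storage manager (a single-node illustration; the real component also deals with item lifetimes and soft state):

```python
from collections import defaultdict
from typing import Any, Hashable

class StorageManager:
    """Sketch of the local storage manager: a per-node map from keys to items."""

    def __init__(self) -> None:
        self._items: dict[Hashable, list[Any]] = defaultdict(list)

    def store(self, key: Hashable, item: Any) -> None:
        self._items[key].append(item)

    def retrieve(self, key: Hashable) -> list[Any]:
        return list(self._items.get(key, []))

    def remove(self, key: Hashable) -> None:
        self._items.pop(key, None)
```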
18 DHT Provider (1)
- Provider
  - Ties the routing and storage manager layers together and provides the interface used by higher levels
- Each object in the DHT has a namespace, resourceID and instanceID
  - DHT key = hash(namespace, resourceID)
  - namespace: application or group of objects, e.g. a table or relation
  - resourceID: primary key or any attribute of the object
  - instanceID: an integer used to separate items with the same namespace and resourceID
  - Lifetime: how long the item is stored
- CAN's mapping of resourceID → object is equivalent to an index
19 DHT Provider (2)
- Provider API
  - get(namespace, resourceID) → item
  - put(namespace, resourceID, item, lifetime)
  - renew(namespace, resourceID, instanceID, lifetime) → bool
  - multicast(namespace, resourceID, item)
  - lscan(namespace) → items
  - newData(namespace, item)
[Figure: table R (the namespace) with tuples 1..n and n1..m partitioned by resourceID (rID1, rID2, rID3) across nodes R1 and R2]
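A minimal sketch of how a provider might sit on top of the routing and storage layers, keyed by hash(namespace, resourceID) as described above. The local-only storage, the lifetime handling and the names are simplifying assumptions; a real put would ship the item to the node returned by the routing layer.

```python
import hashlib
import time
from typing import Any

class Provider:
    """Sketch of the provider tying the routing and storage layers together."""

    def __init__(self, routing, storage) -> None:
        self.routing = routing   # resolves a DHT key to the responsible node
        self.storage = storage   # StorageManager on that node (local here)

    @staticmethod
    def _key(namespace: str, resource_id: str) -> bytes:
        # DHT key = hash(namespace, resourceID)
        return hashlib.sha1(f"{namespace}/{resource_id}".encode()).digest()

    def put(self, namespace: str, resource_id: str, item: Any, lifetime: float) -> None:
        key = self._key(namespace, resource_id)
        # Stored with an expiry time; a real system would route to the owner first.
        self.storage.store(key, (item, time.time() + lifetime))

    def get(self, namespace: str, resource_id: str) -> list[Any]:
        key = self._key(namespace, resource_id)
        now = time.time()
        return [item for item, expiry in self.storage.retrieve(key) if expiry > now]
```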
20 Query Processor
- How does it work?
  - Performs selection, projection, joins, grouping and aggregation → operators
  - Operators push and pull data
  - Multiple operators execute simultaneously, pipelined together
  - Results are produced and queued as quickly as possible
- How does it modify data?
  - Inserts, updates and deletes items via the DHT interface
- How does it select the data to process?
  - A dilated-reachable snapshot: the data published by reachable nodes at query arrival time
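As a rough illustration of operator pipelining, here is a toy pull-style pipeline built from Python generators. PIER's operators actually push and pull data between asynchronous operators over the DHT; the relation, predicate and column names below are made up.

```python
from typing import Callable, Iterable, Iterator

def scan(tuples: Iterable[dict]) -> Iterator[dict]:
    """Leaf operator: stream tuples from local (DHT-resident) storage."""
    yield from tuples

def select(pred: Callable[[dict], bool], child: Iterator[dict]) -> Iterator[dict]:
    """Selection: pass through tuples satisfying the predicate."""
    return (t for t in child if pred(t))

def project(cols: list[str], child: Iterator[dict]) -> Iterator[dict]:
    """Projection: keep only the requested columns."""
    return ({c: t[c] for c in cols} for t in child)

# Toy pipeline: SELECT src FROM packets WHERE dport = 80
packets = [{"src": "10.0.0.1", "dport": 80}, {"src": "10.0.0.2", "dport": 22}]
plan = project(["src"], select(lambda t: t["dport"] == 80, scan(packets)))
print(list(plan))   # [{'src': '10.0.0.1'}]
```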
21 Join Algorithms
- Limited bandwidth
  - Symmetric hash join: rehashes both tables
  - Semi-joins: transfer only the matching tuples
- At around 40% selectivity, the bottleneck switches from the computation nodes to the query sites (a sketch of the symmetric hash join follows below)
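A single-process sketch of the symmetric hash join, assuming dict-based hash tables and two already-delivered input streams. In PIER both relations are rehashed into a shared DHT namespace and the probing happens at whichever node owns each join key.

```python
from collections import defaultdict
from itertools import zip_longest
from typing import Any, Iterable, Iterator, Tuple

_SENTINEL = object()

def symmetric_hash_join(
    r: Iterable[Tuple[Any, dict]],   # (join_key, tuple) stream for relation R
    s: Iterable[Tuple[Any, dict]],   # (join_key, tuple) stream for relation S
) -> Iterator[Tuple[dict, dict]]:
    """Build hash tables on *both* inputs and probe the opposite table as each
    tuple arrives, so join results stream out before either input finishes."""
    r_table: dict[Any, list[dict]] = defaultdict(list)
    s_table: dict[Any, list[dict]] = defaultdict(list)
    for r_item, s_item in zip_longest(r, s, fillvalue=_SENTINEL):
        if r_item is not _SENTINEL:
            key, tup = r_item
            r_table[key].append(tup)
            for match in s_table.get(key, []):
                yield tup, match
        if s_item is not _SENTINEL:
            key, tup = s_item
            s_table[key].append(tup)
            for match in r_table.get(key, []):
                yield match, tup
```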
22 Future Research
- Routing, Storage and Layering
- Catalogs and Query Optimization
- Hierarchical Aggregations
- Range Predicates
- Continuous Queries over Streams
- Sharing between Queries
- Semi-structured Data
26 Distributed Hash Tables (DHTs)
- What is a DHT?
  - Take an abstract ID space and partition it among a changing set of computers (nodes)
  - Given a message with an ID, route the message to the computer currently responsible for that ID
  - Messages can be stored at the nodes
  - This behaves like a distributed hash table
    - Provides a put()/get() API
    - Cheap maintenance when nodes come and go
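A toy, single-process illustration of that idea, using a one-dimensional circular ID space split among a fixed set of nodes (CAN, which PIER uses, partitions a d-dimensional space instead; the node and key names are made up):

```python
import hashlib
from bisect import bisect_right

class ToyDHT:
    """Toy DHT: a circular ID space partitioned among named nodes."""

    def __init__(self, nodes: list[str]) -> None:
        # Each node is responsible for IDs up to (and wrapping past) its position.
        self._ring = sorted((self._hash(n), n) for n in nodes)
        self._store: dict[str, dict] = {n: {} for n in nodes}

    @staticmethod
    def _hash(s: str) -> int:
        return int.from_bytes(hashlib.sha1(s.encode()).digest()[:8], "big")

    def _owner(self, key: str) -> str:
        positions = [pos for pos, _ in self._ring]
        idx = bisect_right(positions, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]

    def put(self, key: str, value: dict) -> None:
        self._store[self._owner(key)][key] = value

    def get(self, key: str):
        return self._store[self._owner(key)].get(key)

dht = ToyDHT(["node-a", "node-b", "node-c"])
dht.put("filename:song.mp3", {"size": 4_200_000})
print(dht.get("filename:song.mp3"))   # {'size': 4200000}
```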
27 Distributed Hash Tables (DHTs)
- Lots of effort is put into making DHTs better
  - Scalable (thousands → millions of nodes)
  - Resilient to failure
  - Secure (anonymity, encryption, etc.)
  - Efficient (fast access with minimal state)
  - Load balanced
  - etc.
28 PIER's Three Uses for DHTs
- A single elegant mechanism with many uses
  - Search: index
    - Like a hash index
  - Partitioning: value (key)-based routing
    - Like Gamma/Volcano
  - Routing: network routing for QP messages
    - Query dissemination
    - Bloom filters
    - Hierarchical QP operators (aggregation, join, etc.)
- It is not clear there is another substrate that supports all these uses
29 Metrics
- We are primarily interested in 3 metrics
  - Answer quality (recall and precision)
  - Bandwidth utilization
  - Latency
- Different DHTs provide different properties
  - Resilience to failures (recovery time) → answer quality
  - Path length → bandwidth and latency
  - Path convergence → bandwidth and latency
- Different QP join strategies
  - Symmetric hash join, fetch matches, symmetric semi-join, Bloom filters, etc.
- Big-picture tradeoff: bandwidth (extra rehashing) vs. latency
30 Symmetric Hash Join (SHJ)
31 Fetch Matches (FM)
32 Symmetric Semi-Join (SSJ)
- Both R and S are projected to save bandwidth
- The complete R and S tuples are fetched in parallel to improve latency
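A rough sketch of that idea, reusing the symmetric_hash_join sketch from slide 21: join only the narrow (join key, tuple id) projections, then fetch the full tuples for the survivors. The fetch_r/fetch_s callables stand in for DHT get() calls and are assumptions.

```python
from typing import Any, Callable, Iterable, Iterator, Tuple

def symmetric_semi_join(
    r_proj: Iterable[Tuple[Any, Any]],   # (join_key, r_tuple_id) projections of R
    s_proj: Iterable[Tuple[Any, Any]],   # (join_key, s_tuple_id) projections of S
    fetch_r: Callable[[Any], dict],      # fetch the full R tuple by id (e.g. a DHT get)
    fetch_s: Callable[[Any], dict],      # fetch the full S tuple by id (e.g. a DHT get)
) -> Iterator[Tuple[dict, dict]]:
    """Join the cheap-to-ship projections first, then fetch the complete tuples
    only for matching pairs; the two fetches could be issued in parallel.
    Assumes symmetric_hash_join from the earlier sketch is in scope."""
    for r_id, s_id in symmetric_hash_join(r_proj, s_proj):
        yield fetch_r(r_id), fetch_s(s_id)
```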
36 Overview
- CAN is a distributed system that maps keys onto values
- Keys are hashed into a d-dimensional space
- Interface
  - insert(key, value)
  - retrieve(key)
37 Overview
[Figure: state of the system at time t — a 2-dimensional coordinate space (x and y axes) divided into zones, with peers and resources marked]
In this 2-dimensional space a key is mapped to a point (x, y)
38 DESIGN
- A d-dimensional Cartesian coordinate space (a d-torus)
- Every node owns a distinct zone
- A key k1 is mapped onto a point p1 using a uniform hash function
- (k1, v1) is stored at the node Nx that owns the zone containing p1
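A small sketch of that mapping, assuming a 2-dimensional unit space and coordinates derived from slices of a SHA-1 digest; the hard-coded zones and node names are illustrative assumptions, not CAN's actual layout.

```python
import hashlib

def hash_to_point(key: str, d: int = 2) -> tuple[float, ...]:
    """Uniformly hash a key to a point in the d-dimensional unit space [0, 1)^d."""
    digest = hashlib.sha1(key.encode()).digest()
    # Use 4 bytes of the digest per coordinate (assumes d <= 5 here).
    return tuple(
        int.from_bytes(digest[4 * i: 4 * i + 4], "big") / 2**32 for i in range(d)
    )

class Zone:
    """A hyper-rectangle [lo_i, hi_i) in each dimension, owned by one node."""

    def __init__(self, lo: tuple[float, ...], hi: tuple[float, ...]) -> None:
        self.lo, self.hi = lo, hi

    def contains(self, p: tuple[float, ...]) -> bool:
        return all(l <= x < h for l, x, h in zip(self.lo, p, self.hi))

# (k1, v1) is stored at the node whose zone contains p1 = hash_to_point(k1)
p1 = hash_to_point("k1")
left_half, right_half = Zone((0.0, 0.0), (0.5, 1.0)), Zone((0.5, 0.0), (1.0, 1.0))
owner = "node-A" if left_half.contains(p1) else "node-B"
```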
39
- Each node maintains a routing table with its neighbors
- Example: node A holds B, C, D, E
- Routing follows the straight-line path through the Cartesian space
40 Routing
- A d-dimensional space with n zones
- Two zones are neighbors if they overlap in d-1 dimensions
- Routing path length: O(d · n^(1/d)) hops on average
- Algorithm
  - Forward to the neighbor nearest to the destination point
[Figure: a query Q(x,y) routed hop by hop across zones toward the peer that owns the point (x,y)]
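A greatly simplified sketch of that greedy forwarding rule, reusing the Zone sketch above. Each hop picks the neighbor whose zone center is closest to the destination; the torus wrap-around, tie-breaking and failure handling of real CAN routing are omitted, and the helper names are assumptions.

```python
import math

def zone_center(zone) -> tuple[float, ...]:
    """Center point of a Zone (midpoint of each dimension's interval)."""
    return tuple((l + h) / 2 for l, h in zip(zone.lo, zone.hi))

def route(dest_point, current, neighbors, zone_of):
    """Greedy CAN-style routing toward the node whose zone contains dest_point.
    `neighbors(node)` lists a node's neighbors; `zone_of(node)` returns its Zone."""
    path = [current]
    while not zone_of(current).contains(dest_point):
        current = min(
            neighbors(current),
            key=lambda n: math.dist(zone_center(zone_of(n)), dest_point),
        )
        path.append(current)
    return path
```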
41 CAN construction
[Figure: a new node and the bootstrap node in the CAN coordinate space]
42 CAN construction
1) Discover some node I already in the CAN (via the bootstrap node)
43 CAN construction
2) Pick a random point (x,y) in the space
44 CAN construction
3) I routes to (x,y) and discovers node J, the current owner of that point
45 CAN construction
4) Split J's zone in half; the new node owns one half
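A condensed sketch of that join procedure, building on the Zone sketch above; the split is always along dimension 0 for simplicity (CAN cycles through dimensions), and route_to_owner stands in for the routing step.

```python
import random

def split_zone(zone: Zone) -> tuple[Zone, Zone]:
    """Split a zone in half along dimension 0 (CAN alternates split dimensions)."""
    mid = (zone.lo[0] + zone.hi[0]) / 2
    return Zone(zone.lo, (mid,) + zone.hi[1:]), Zone((mid,) + zone.lo[1:], zone.hi)

def join(new_node, bootstrap_node, zones, route_to_owner, d=2):
    """Sketch of a node join: 1) contact the bootstrap node, 2) pick a random
    point, 3) route to that point's current owner J, 4) split J's zone in half
    and give one half to the new node. `zones` maps node -> Zone."""
    p = tuple(random.random() for _ in range(d))               # 2) random point in [0,1)^d
    owner = route_to_owner(bootstrap_node, p)                  # 1) + 3) find the owner J
    zones[owner], zones[new_node] = split_zone(zones[owner])   # 4) split and share
```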
46 Maintenance
- Use zone takeover when a node fails or leaves
- At a discrete time interval t, send your neighbor table to your neighbors to show that you are alive
- If a neighbor has not reported being alive within time t, take over its zone
- Zone reassignment is then needed
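A toy sketch of that liveness check, assuming a fixed interval and in-memory timestamps; the real protocol exchanges full neighbor tables and coordinates the takeover among the failed node's neighbors.

```python
import time

T = 5.0                               # heartbeat interval in seconds (assumption)
last_seen: dict[str, float] = {}      # neighbor -> time of its last heartbeat

def on_heartbeat(neighbor: str, neighbor_table: list[str]) -> None:
    """Record that `neighbor` is alive (it just sent us its neighbor table)."""
    last_seen[neighbor] = time.time()

def presumed_failed(now: float) -> list[str]:
    """Neighbors with no heartbeat in the last interval; their zones become
    candidates for takeover and later zone reassignment."""
    return [n for n, t in last_seen.items() if now - t > T]
```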
47 Node Departure
- Someone has to take over the zone
- The zone is explicitly handed over to one of its neighbors
  - Merge it into a valid zone if possible
  - If that is not possible, the two zones are temporarily handled by the smallest neighbor
48 Zone reassignment
[Figure: partition tree and the corresponding zoning for zones 1, 2, 3, 4]
49 Zone reassignment
[Figure: partition tree and zoning after reassignment, with zones 1, 3 and 4 remaining]
50 Design Improvements
- Multi-Dimension
- Multi-Coordinate Spaces
- Overloading the Zones
- Multiple Hash Functions
- Topologically Sensitive Construction
- Uniform Partitioning
- Caching
51 Multi-Dimension
- Increasing the number of dimensions reduces the path length
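A quick back-of-the-envelope check of that claim, using CAN's average path length of roughly (d/4)·n^(1/d) hops; the node count is just an example.

```python
n = 1_000_000                       # example: one million nodes
for d in (2, 4, 10):
    hops = (d / 4) * n ** (1 / d)   # CAN's average routing path length in hops
    print(f"d={d}: ~{hops:.0f} hops")
# d=2: ~500 hops, d=4: ~32 hops, d=10: ~10 hops
```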
52 Multi-Coordinate Spaces
- Maintain multiple coordinate spaces
- Each node is assigned a different zone in each of them
- Increases availability and reduces the path length
53 Overloading the Zones
- More than one peer is assigned to each zone
- Increases availability
- Reduces the path length
- Reduces per-hop latency
54 Uniform Partitioning
- Instead of directly splitting the occupant node's zone on a join
- Compare the volume of its zone with its neighbors' zones
- The zone that is split is the one with the biggest volume