Title: Querying the Internet with PIER


1
Querying the Internet with PIER
  • CS294-4
  • Paul Burstein
  • 11/10/2003

2
Outline
  • Motivation
  • Architecture
  • Join Algorithms
  • Evaluation
  • Discussion

3
Motivation
  • Inject a degree of distribution into databases
  • Internet-scale systems vs. hundred-node systems
  • Large-scale applications requiring database
    functionality

4
Applications
  • P2P Databases
  • Highly distributed and available data
  • Network Monitoring
  • Intrusion detection
  • Fingerprint queries

5
Design Principles
  • Relaxed Consistency
  • Sacrifice consistency in favor of availability and
    partition tolerance
  • Organic Scaling
  • Growth with deployment
  • Natural Habitats for Data
  • Data remains in original format with a DB
    interface
  • Standard Schemas
  • Achieved through common software

6
Outline
  • Motivation
  • Architecture
  • Join Algorithms
  • Evaluation
  • Discussion

7
PIER Architecture

8
DHT Design
  • Implemented with CAN and Chord
  • Routing Layer
  • Mapping for keys
  • Storage Manager
  • Node data storage
  • Provider
  • Storage access interface for higher levels

9
Routing and Storage
  • Routing Layer
  • DHT-based API
  • locationMapChange: callback on a change to the local key set
  • Storage Manager
  • Easy to realize API
  • Efficient performance relative to network
  • Main-memory storage manager used

10
Provider
  • Couples the routing and storage layers
  • namespace: relation
  • resourceId: primary key
  • namespace + resourceId → DHT key
  • instanceId: distinguishes objects with the same
    namespace and resourceId
  • lifetime: item storage duration
  • multicast: contacts all nodes of a namespace
  • lscan: iterates over a node's local data
  • newData: application callback on data arrival
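
The provider interface above can be illustrated with a minimal single-node sketch. This is not PIER's code: the class name LocalProvider, the use of SHA-1 for the key, and the explicit now argument are all assumptions made for illustration, and no routing or multicast is performed.

```python
import hashlib
import time
from dataclasses import dataclass


def dht_key(namespace: str, resource_id: str) -> int:
    """Map (namespace, resourceId) to a key. SHA-1 is an assumed hash."""
    digest = hashlib.sha1(f"{namespace}/{resource_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


@dataclass
class Item:
    namespace: str
    resource_id: str
    instance_id: int
    value: object
    expires_at: float  # absolute time derived from the lifetime


class LocalProvider:
    """Hypothetical single-node stand-in for the provider layer."""

    def __init__(self):
        self.store = {}      # key -> {instance_id: Item}
        self.callbacks = {}  # namespace -> list of newData callbacks

    def new_data(self, namespace, callback):
        # Register an application callback fired when data arrives.
        self.callbacks.setdefault(namespace, []).append(callback)

    def put(self, namespace, resource_id, instance_id, value, lifetime, now=None):
        now = time.time() if now is None else now
        item = Item(namespace, resource_id, instance_id, value, now + lifetime)
        bucket = self.store.setdefault(dht_key(namespace, resource_id), {})
        bucket[instance_id] = item
        for callback in self.callbacks.get(namespace, []):
            callback(item)

    def get(self, namespace, resource_id, now=None):
        now = time.time() if now is None else now
        bucket = self.store.get(dht_key(namespace, resource_id), {})
        return [
            item for item in bucket.values()
            if item.namespace == namespace
            and item.resource_id == resource_id
            and item.expires_at > now
        ]

    def lscan(self, namespace, now=None):
        # Iterate over the unexpired local items of one namespace.
        now = time.time() if now is None else now
        for bucket in self.store.values():
            for item in bucket.values():
                if item.namespace == namespace and item.expires_at > now:
                    yield item
```

Two items with the same namespace and resourceId but different instanceIds land under the same key and are both returned by get, which is the role the slide assigns to instanceId.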

11
PIER Query Processor
  • Query dataflow engine
  • Operators
  • Selection, projection, joins, grouping,
    aggregation
  • Operators push and pull data
  • Currently, data modification is done through the DHT
    interface
  • Relaxed consistency and reachable snapshot
  • Working only with nodes reachable at the time a
    query is issued
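
A rough sketch of the push side of such a dataflow engine follows. The operator classes (Select, Project, Collect) are hypothetical names chosen for illustration and do not reflect PIER's actual operator interface; pull-based iteration is omitted.

```python
class Operator:
    """Base class: each operator forwards results to one downstream operator."""

    def __init__(self, downstream=None):
        self.downstream = downstream

    def push(self, tup):
        raise NotImplementedError

    def emit(self, tup):
        if self.downstream is not None:
            self.downstream.push(tup)


class Select(Operator):
    """Selection: forward only tuples that satisfy the predicate."""

    def __init__(self, predicate, downstream=None):
        super().__init__(downstream)
        self.predicate = predicate

    def push(self, tup):
        if self.predicate(tup):
            self.emit(tup)


class Project(Operator):
    """Projection: keep only the named columns."""

    def __init__(self, columns, downstream=None):
        super().__init__(downstream)
        self.columns = columns

    def push(self, tup):
        self.emit({c: tup[c] for c in self.columns})


class Collect(Operator):
    """Sink that stores whatever reaches the end of the plan."""

    def __init__(self):
        super().__init__()
        self.rows = []

    def push(self, tup):
        self.rows.append(tup)


# Usage: tuples are pushed into the head of the plan as they arrive.
sink = Collect()
plan = Select(lambda t: t["port"] == 22, Project(["src"], sink))
for row in [{"src": "10.0.0.1", "port": 22}, {"src": "10.0.0.2", "port": 80}]:
    plan.push(row)
```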

12
Outline
  • Motivation
  • Architecture
  • Join Algorithms
  • Evaluation
  • Discussion

13
Join Algorithms
  • Symmetric Hash Join
  • Rehashes the relations
  • Scan and copy
  • Fetch Matches
  • One relation already hashed on join attribute
  • R, S relations
  • Nr, Ns relation namespaces
  • Nq - DHT-based temporary table
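
The symmetric hash join can be sketched as the logic one node would run over the tuples that arrive in the temporary namespace Nq, assuming both R and S have been rehashed on the join attribute so that matching tuples meet at the same node. The class and method names are illustrative assumptions, and the network is left out entirely.

```python
from collections import defaultdict


class SymmetricHashJoin:
    """Keeps one hash table per relation; every arrival builds and probes."""

    def __init__(self, key_r, key_s):
        self.key = {"R": key_r, "S": key_s}
        self.tables = {"R": defaultdict(list), "S": defaultdict(list)}

    def insert(self, side, tup):
        """Store tup under its join key and return the (r, s) pairs it completes."""
        other = "S" if side == "R" else "R"
        k = self.key[side](tup)
        self.tables[side][k].append(tup)
        matches = self.tables[other].get(k, [])
        if side == "R":
            return [(tup, m) for m in matches]
        return [(m, tup) for m in matches]


# Usage: tuples from either relation may arrive in any order.
join = SymmetricHashJoin(key_r=lambda r: r[0], key_s=lambda s: s[0])
results = []
for side, tup in [("R", (1, "r1")), ("S", (1, "s1")), ("S", (2, "s2")),
                  ("R", (1, "r2")), ("R", (2, "r3"))]:
    results.extend(join.insert(side, tup))
```

Because each tuple is inserted before probing the opposite table, every matching pair is produced exactly once, whichever side arrives first.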

14
Join Rewriting
  • Aimed at lowering the bandwidth utilization
  • Symmetric semi-join
  • Local projections to join keys
  • Global fetch matches join
  • Bloom joins
  • Local bloom filters are published into temporary
    namespaces
  • Filters are multicast to the opposite relation's nodes
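
The Bloom join rewrite can be sketched as follows. The filter size, the number of hash functions, the use of SHA-256, and the step that combines local filters with a bitwise OR are assumptions made for this sketch, not details stated on the slide.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter backed by a Python int used as a bit array."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def __contains__(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))

    def union(self, other):
        """Combine two local filters built with the same parameters."""
        merged = BloomFilter(self.num_bits, self.num_hashes)
        merged.bits = self.bits | other.bits
        return merged


def bloom_prefilter(local_tuples, key_fn, opposite_filter):
    """Keep only tuples whose join key might appear in the opposite relation."""
    return [t for t in local_tuples if key_fn(t) in opposite_filter]


# Usage: two nodes holding S publish local filters; an R node applies the union.
s_node1, s_node2 = BloomFilter(), BloomFilter()
for key in (1, 2):
    s_node1.add(key)
s_node2.add(3)
s_filter = s_node1.union(s_node2)
r_tuples = [(1, "a"), (3, "b"), (99, "c")]
to_rehash = bloom_prefilter(r_tuples, lambda t: t[0], s_filter)
```

A Bloom filter never yields false negatives, so every R tuple with a match in S survives the prefilter; false positives only cost some extra bandwidth and are removed by the join itself.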

15
  • How does this scale?

16
Outline
  • Motivation
  • Architecture
  • Join Algorithms
  • Evaluation
  • Discussion

17
Workload Parameters
  • CAN configuration d = 4
  • R 10 times larger than S
  • Constants provide 50% selectivity
  • f(x,y) evaluated after the join
  • 90% of R tuples match a tuple in S
  • Result tuples are 1KB each
  • Symmetric hash join used

18
Simulation Setup
  • Up to 10,000 nodes
  • Network cross-traffic, CPU and memory
    utilizations ignored
  • 1. 100ms and 10Mbps fully connected links
  • 2. GT-ITM transit-stub topology

19
Scalability
  • 1MB data per node
  • Fully-connected topology
  • Variable number of computation nodes
  • Network congestion is an issue with few
    computation nodes
  • How is the computation workload distributed?

20
Join Algorithms (1/2)
  • Infinite Bandwidth
  • 1024 data and computation nodes
  • Core join Algorithms
  • Perform faster
  • Rewrites
  • Bloom filter: two multicasts
  • Semi-join: two CAN lookups

21
Join Algorithms (2/2)
  • Limited Bandwidth
  • 10Mbps inbound capacity
  • 25GB relations, 1024 nodes
  • Symmetric Hash Join
  • Rehashes both tables
  • Semi-join
  • Transfers only matching tuples
  • At 40% selectivity, bottleneck switches from
    computation nodes to query sites

22
Soft State
  • Failure detection and recovery
  • 15 second failure detection
  • 4096 nodes
  • Refresh period
  • Time to reinsert lost tuples

23
Transit Stub Topology
  • GT-ITM
  • 4 Domains, 10 nodes per domain, 3 stubs per node
  • 50ms, 10ms, 2ms latency
  • 10Mbps inbound links
  • Similar trends as fully connected topology
  • A bit longer end-to-end delays

24
Experimental Results
  • 64 PCs on 1Gbps network
  • All nodes are computation nodes

25
Outline
  • Motivation
  • Architecture
  • Join Algorithms
  • Evaluation
  • Discussion

26
Discussion
  • PIER presents a distributed query engine
  • What remains to be done?
  • DB issues
  • Networking issues