The PIER Relational Query Processing System
Transcript and Presenter's Notes
1
The PIER Relational Query Processing System
  • Boon Thau Loo & Ryan Huebsch
  • {boonloo, huebsch}@eecs.berkeley.edu
  • Also developed with Matthew Harren
  • Faculty: Joe Hellerstein, Scott Shenker, Ion
    Stoica
  • Group Email: p2p@db.cs.berkeley.edu

UC Berkeley DB Seminar 2/8/02
2
What is PIER? Just another nonsense acronym?
  • PIER = Peer-to-Peer Infrastructure for
    Information Exchange and Retrieval
  • PIER = a relational query engine that runs on top
    of an existing P2P (semi-structured) network and
    its data
  • PIER = a better Napster, Gnutella, Freenet, or
    your other favorite file-sharing software
  • PIER = an answer to the world's information
    problems? ... well, probably not

3
Why? What's wrong with existing P2P systems?
  • Existing P2P systems were designed mainly by
    networking people
  • Excel in:
  • Fault tolerance
  • Flexibility
  • Decentralization (ad hoc, multiple administrators,
    limited controls)
  • Lack:
  • Real query facilities (they only support keyword
    searching)
  • Guarantees/promises (remember the networking
    motto: best effort)

4
Why? What's wrong with existing database systems?
  • Existing database systems:
  • Excel in:
  • Precise, correct answers
  • Powerful query facilities (SQL)
  • Guarantees and concurrency control (ACID
    transactions)
  • Optimization: "don't tell us how to do it, we
    tell you"
  • Lack:
  • Flexibility (servers come and go, loose
    structure)
  • Decentralized administration
  • Fault tolerance (they can tolerate only a few
    problems, if any; the DB answer to a problem is
    "it didn't work, try again later")

5
What about Napster, Gnutella, and the rest?
  • Napster is centralized
  • Subject to failure
  • Subject to censorship (in this case, by the music
    industry)
  • Requires administration
  • Gnutella does not scale
  • Queries are only asked in the nearby network
    (limited hop count)
  • Research projects: CAN, Chord (CFS, INS, TWINE),
    Pastry, Tapestry (OceanStore), etc.
  • These lack the query facilities

6
Why do you need complex queries? N-grams
  • A technique from information retrieval for doing
    inexact matching
  • I want "tyranny", but I can't spell it ("tyrrany")
  • First, n-grams are created (here, bi-grams; a Java
    sketch follows the table below)
  • Doc1 "tyranny" → creates 8 bi-grams: _t, ty,
    yr, ra, an, nn, ny, y_
  • Each bi-gram contains a pointer to the doc ID
  • So a DB might look like:

  ra → Doc2   rn → Doc3   ty → Doc1   vu → Doc2   xr → Doc4
  _t → Doc1   _t → Doc5   _t → Doc6   an → Doc1   an → Doc3
  an → Doc5   ba → Doc2   nn → Doc1   ny → Doc1   ra → Doc1
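
A minimal Java sketch of the bi-gram step (Java being
the project's implementation language, per the "Where
we are" slide). The underscore padding is an assumption
made so that "tyranny" yields the slide's count of 8
bi-grams:

  import java.util.ArrayList;
  import java.util.List;

  public class Bigrams {
      // Pad the word with '_' so the first and last characters each
      // get their own bi-gram; "tyranny" then yields 8 bi-grams.
      static List<String> bigrams(String word) {
          String padded = "_" + word + "_";
          List<String> grams = new ArrayList<>();
          for (int i = 0; i < padded.length() - 1; i++) {
              grams.add(padded.substring(i, i + 2));
          }
          return grams;
      }

      public static void main(String[] args) {
          // Prints [_t, ty, yr, ra, an, nn, ny, y_]
          System.out.println(bigrams("tyranny"));
      }
  }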
7
N-grams continued
  • Convert the search string to n-grams
  • Intersect (which is a join) the list of n-grams
    from the search word with the index of
    n-grams/docIDs
  • Aggregate the results by docID, counting the number
    of n-gram matches for each docID (more n-gram
    matches = closer to the request)
  • SELECT i.docID, count(i.ngram) AS matches
    FROM userlist u, indexlist i
    WHERE u.ngram = i.ngram
    GROUP BY i.docID
    ORDER BY matches ASC

8
What PIER IS and IS NOT about
  • IS NOT ABOUT: Performance
  • In most situations a centralized solution could
    be faster
  • IS ABOUT: Decentralized Features
  • No administrator, anonymity, shared resources,
    tolerates failures, resistant to censorship
  • IS NOT (Just) ABOUT: Hype
  • Buzzwords, catch phrases, and empty promises
  • IS ABOUT:
  • Uniting the networking community and the database
    community to get the best of both

9
General PIER Architecture
  • Based on Distributed Hash Tables (DHTs) to get
    many good networking properties
  • A query processor is built on top
  • Note: the data is stored separately from the
    query engine, which is not standard DB practice!

10
DHTs Overview
  • A wide-area distributed hash table that supports
    finding data efficiently
  • Data has an ID that is used to map the item to a
    node (or bucket) that stores it
  • Lookups/inserts → a routing problem
  • A number of implementations:
  • CAN: UC Berkeley/ICSI
  • Chord: MIT/UC Berkeley
  • Tapestry: UC Berkeley, based on Plaxton's
    algorithm (UTexas)
  • And many others, all with similar APIs and
    functionality

11
DHTs CAN in particular
  • Content Addressable Network (CAN) is one of many
    DHT algorithms
  • An overlay network that uses underlying TCP
  • Self-forming: a node only needs to find one
    existing node to join; no central administration
    is needed
  • Fault tolerant: adapts to many failures
  • Load balancing
  • Scalable: each node needs only a little state
    information about its neighbors, O(d) state, and
    primitives cost O(n^(1/d))
  • d = # of dimensions, n = # of nodes

12
CAN Overview
  • Establishes a d-dimensional Cartesian coordinate
    space
  • Nodes randomly choose a point in the space at
    which to join
  • The node currently responsible for that space
    splits it, and the space is divided equally
  • Nodes store information about their spatial
    neighbors in the coordinate space

13
CAN Overview II
  • Data is represented as a key/value pair
  • The key, or resource ID (RID), is hashed into the
    coordinate space using a hash function h(x), such
    as SHA-1 (a sketch follows below)
  • Data is stored at the node responsible for that
    space

Item 1: h(RID1) = <3,2>
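
A minimal Java sketch of this mapping; carving the
SHA-1 digest into one integer per dimension of a 2-d
unit square is an illustrative assumption:

  import java.nio.ByteBuffer;
  import java.nio.charset.StandardCharsets;
  import java.security.MessageDigest;

  public class CanHash {
      // Hash a resource ID to a point in the unit square [0,1) x [0,1).
      static double[] toPoint(String rid) throws Exception {
          byte[] digest = MessageDigest.getInstance("SHA-1")
                  .digest(rid.getBytes(StandardCharsets.UTF_8));
          ByteBuffer buf = ByteBuffer.wrap(digest);
          // First 4 digest bytes give x, the next 4 give y (as unsigned ints).
          double x = (buf.getInt() & 0xffffffffL) / (double) (1L << 32);
          double y = (buf.getInt() & 0xffffffffL) / (double) (1L << 32);
          return new double[] { x, y };
      }

      public static void main(String[] args) throws Exception {
          double[] p = toPoint("RID1");
          System.out.printf("h(RID1) = <%.3f, %.3f>%n", p[0], p[1]);
      }
  }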
14
CAN Overview III
Note that the space wraps (it is really a torus), so
node 6 has nodes 2, 3, 4, and 5 as neighbors; but
to keep this simple example working we ignore that
detail.
  • Data is located by hopping from neighbor to
    neighbor until the right node is found (see the
    routing sketch below)
  • Node 6 wants I1
  • Determine the coordinates: h(RID1) = <2,3>
  • Determine which neighbor is closest:
  • N4 or N2
  • Hop to that neighbor and repeat

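A self-contained Java sketch of this greedy
hop-by-hop routing; the Node and Zone types and the
unit-torus distance are simplified illustrations, not
PIER's actual classes:

  import java.util.List;

  public class CanRouting {
      record Point(double x, double y) {}
      record Zone(double x0, double y0, double x1, double y1) {
          boolean contains(Point p) {
              return p.x() >= x0 && p.x() < x1 && p.y() >= y0 && p.y() < y1;
          }
          Point center() { return new Point((x0 + x1) / 2, (y0 + y1) / 2); }
      }
      static class Node {
          Zone zone; List<Node> neighbors;
          Node(Zone z, List<Node> n) { zone = z; neighbors = n; }
      }

      // Distance on the unit torus: coordinates wrap at 1.0.
      static double torusDist(Point a, Point b) {
          double dx = Math.abs(a.x() - b.x()); dx = Math.min(dx, 1 - dx);
          double dy = Math.abs(a.y() - b.y()); dy = Math.min(dy, 1 - dy);
          return Math.hypot(dx, dy);
      }

      // Hop greedily to whichever neighbor's zone center is closest
      // to the target until reaching the node whose zone contains it.
      static Node route(Node start, Point target) {
          Node cur = start;
          while (!cur.zone.contains(target)) {
              Node next = cur;
              double best = torusDist(cur.zone.center(), target);
              for (Node n : cur.neighbors) {
                  double d = torusDist(n.zone.center(), target);
                  if (d < best) { best = d; next = n; }
              }
              if (next == cur) break; // dead end in this simplified sketch
              cur = next;
          }
          return cur;
      }
  }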
15
CAN Overview IV
  • The cost of operations is based on the path
    length from node to node: O(n^(1/d))
  • Basic API:
  • Publish(namespace, RID, object)
  • Lookup(namespace, RID)
  • Multicast(group, object)
  • Applications can only do EXACT lookups, and that
    just isn't enough for real users! (The interface
    sketch below spells this API out.)
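
The basic API written out as a Java interface; the
exact parameter types are assumptions, since the slide
gives only the operation names and arguments:

  public interface Dht {
      // Insert an object into the DHT under its resource ID.
      void publish(String namespace, byte[] rid, Object object);
      // Exact-match retrieval of the object stored under a resource ID.
      Object lookup(String namespace, byte[] rid);
      // Deliver an object to every node in a multicast group.
      void multicast(String group, Object object);
  }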

16
PIER Overview
  • PIER is just another application as far as CAN is
    concerned: CAN objects = PIER tuples
  • User applications can use PIER to query data
    using a subset of SQL (joins and GROUP BY)
  • Assume:
  • PIER is running on all CAN nodes
  • Metadata is stored in CAN, not the actual large
    files (although functionally they could be)
  • An object's RID is its primary key (in most
    cases)
  • The namespace is equivalent to the relation's name

17
CAN Extensions
Not needed for correctness; polling could be used
instead
  • For PIER to operate, it must have access
    to the local data store
  • LocalGet(namespace): retrieve the data stored
    locally for a particular namespace
  • LocalNotify(namespace): receive a callback when
    new data is inserted into the local store for the
    namespace
  • This violates the abstraction of location
    independence
  • However, access is limited to reading the data;
    applications cannot control the location of data
    through these extensions (see the sketch below)
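
These extensions can be sketched in the same style,
extending the hypothetical Dht interface above; the
signatures are assumptions based on the bullet
descriptions:

  import java.util.function.Consumer;

  public interface PierDht extends Dht {
      // Read every object stored at THIS node for a namespace.
      Iterable<Object> localGet(String namespace);
      // Register a callback fired when new data for the namespace
      // arrives in the local store (the alternative to polling).
      void localNotify(String namespace, Consumer<Object> callback);
  }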

18
Indexes. The lifeblood of a database engine.
  • CAN's mapping of RID → object is equivalent to an
    index
  • Additional indexes are created by adding another
    key/value pair, with the key being the value of
    the indexed field(s) and the value being a pointer
    to the object (the RID or primary key)
  • Original tuple: (35, abc.mp3, classical, 1837, ...)
  • Index on the 3rd/4th attributes: (classical.1837, 35)
  • Index lookups require 2 CAN lookups: one for the
    index entry, one for the object (see the sketch
    below)
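
A hedged sketch of publishing the slide's secondary
index entry through the Dht interface sketched
earlier; the namespace name and tuple layout are
illustrative:

  class SecondaryIndex {
      // Original tuple: (35, "abc.mp3", "classical", 1837, ...)
      static void indexTuple(Dht dht, Object[] tuple) {
          // Key = value of the indexed fields, value = primary key (RID).
          String indexKey = tuple[2] + "." + tuple[3];   // "classical.1837"
          Object primaryKey = tuple[0];                  // 35
          dht.publish("songs.byGenreYear", indexKey.getBytes(), primaryKey);
          // A lookup now costs 2 CAN lookups: index entry -> RID -> object.
      }
  }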

19
Rehash-Join
  • The tuple is checked against the predicates that
    apply to it (e.g., produced > 1970)
  • Unnecessary fields can be projected out
  • Re-insert the resulting tuple into the network
    using the join key value as the new RID, and use
    a new temporary namespace (both tables use the
    same namespace); a sketch of this step follows
    below

Example: "I want Hawaiian images that appeared in
movies produced since 1970." Create a query request:
SELECT name, URL FROM images, movies
WHERE image.ID = movie.ID AND ...
The request is multicast; when each node receives it,
it uses localGet to read all the data stored at the
node, and each object or tuple is analyzed.
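
A sketch of the per-node rehash step just described,
reusing the PierDht sketch above; the Tuple and Query
helpers are hypothetical stand-ins for PIER's real
operators:

  class RehashJoin {
      interface Tuple {}                    // stand-in for PIER's tuple type
      interface Query {                     // hypothetical query helpers
          boolean localPredicatesHold(Tuple t);
          Tuple project(Tuple t);
          byte[] joinKeyValue(Tuple t);
      }

      static void rehash(PierDht dht, String namespace,
                         String tempNamespace, Query q) {
          for (Object o : dht.localGet(namespace)) {
              Tuple t = (Tuple) o;
              if (!q.localPredicatesHold(t)) continue;  // e.g. produced > 1970
              Tuple projected = q.project(t);           // drop unneeded fields
              // The join key value becomes the new RID.
              dht.publish(tempNamespace, q.joinKeyValue(t), projected);
          }
          // Tuples from both tables with equal join keys now land on the
          // same node, where a local (pipelined hash) join produces results.
      }
  }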
20
Get-Join
  • The tuple is checked against the predicates that
    apply to it (e.g., produced > 1970)
  • Issue a Lookup request for the tuple(s) that may
    be able to join with it: the RID is the join key
    value and the namespace is that of the other table
  • ONE TABLE MUST BE INDEXED ON THE JOIN KEY!
    (A sketch follows below.)

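The matching sketch for Get-join, reusing the
hypothetical helpers from the Rehash-join sketch; the
emit callback is a stand-in for result delivery:

  class GetJoin {
      interface Tuple {}
      interface Query {
          boolean localPredicatesHold(Tuple t);
          byte[] joinKeyValue(Tuple t);
          Tuple join(Tuple left, Tuple right);
      }

      static void getJoin(PierDht dht, String namespace, String otherNamespace,
                          Query q, java.util.function.Consumer<Tuple> emit) {
          for (Object o : dht.localGet(namespace)) {
              Tuple t = (Tuple) o;
              if (!q.localPredicatesHold(t)) continue;
              // The other table must be keyed on the join key, so an
              // exact Lookup can fetch the matching tuple.
              Object match = dht.lookup(otherNamespace, q.joinKeyValue(t));
              if (match != null) emit.accept(q.join(t, (Tuple) match));
          }
      }
  }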
21
Hybrid-Join
  • Starts the same as the others: the request is
    multicast to the nodes
  • Instead of rehashing the entire tuple, only
    rehash the primary key and join key of the tuples
    that may satisfy the request (just like a
    semi-join or index-on-the-fly algorithm)
  • At the nodes holding the rehashed data, if a join
    is possible (matching join keys), Lookup the
    entire tuple and then join

22
Quick Comparison
  • # of data movements (latency):
  • Rehash: query request, rehash, result = 3
  • Get: query request, lookup, answer, result = 4
  • Hybrid: query request, small rehash, lookup,
    answer, result = 5
  • Amount of data sent:
  • Rehash: sends projected tuples matching some
    predicates, but can't pre-check the join predicate
  • Get: only retrieves tuples matching the join
    predicate, but the entire tuple must be fetched
    and may not match some other predicates
  • Hybrid: sends small tuples matching some
    predicates; the entire tuple is fetched only if a
    join is very likely

23
Applying Predicates
  • Example SQL query:

    SELECT emp.name, emp.salary
    FROM emp, dept
    WHERE emp.deptid = dept.id AND
          emp.age < 50 AND
          dept.name <> 'CS' AND
          emp.salary > dept.maxsalary

24
Quick Comparison
[Results chart comparing the join strategies:
160k msgs / 750 KB; 250k msgs / 4.6 MB;
300k msgs / 600 KB]
25
Rehash-Group By
  • Same as Rehash-Join: rehash the tuples matching
    the predicates
  • At the rehash sites, perform the aggregate
  • After a predefined time, send the result (see the
    sketch below)
  • Obviously this needs work:
  • Partial results
  • Statistics work (confidence levels, etc.)
  • Hierarchical aggregation (like in sensor nets)
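
A sketch of the aggregation step at one rehash site
for a COUNT query; the arrival callback and timer
hooks are assumptions standing in for PIER's actual
scheduling:

  import java.util.HashMap;
  import java.util.Map;

  class RehashAggregator {
      private final Map<Object, Integer> counts = new HashMap<>();

      // Tuples arrive keyed by their GROUP BY value.
      void onArrival(Object groupKey) {
          counts.merge(groupKey, 1, Integer::sum);  // running COUNT per group
      }

      // After the predefined time, ship one result tuple per group.
      void onTimer(Dht dht, String resultNamespace) {
          counts.forEach((k, v) ->
              dht.publish(resultNamespace, k.toString().getBytes(), v));
          counts.clear();
      }
  }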

26
Current Optimizations
  • Result tuple batching: why send results one by
    one?
  • A pipelined hash join naturally produces more
    results in one pass as more tuples are processed
  • The network may be optimized for packets of a
    certain size: batch 2, 3, 4, ..., x tuples per
    message
  • Selection and projection:
  • Before rehashing, throw out irrelevant tuples and
    project to reduce the size of each tuple
  • Rehash optimization:
  • Instead of rehashing to all sites, randomly
    select a smaller group to be used

27
Interesting note
  • The state of the join is stored in the DHT store
  • The query request is automatically delivered to
    new nodes when they join
  • Rehashed data is automatically re-routed to the
    proper node if the coordinate space is adjusted
  • When a node splits (to accept a new node into the
    network), its data is also split; this includes
    previously delivered rehashed tuples
  • This allows graceful re-organization of the
    network without interfering with ongoing operations

28
Where we are
  • 25,000 lines of Java code, hours of debugging,
    speed/memory enhancements, etc.
  • A working, real implementation of PIER on top of a
    CAN simulator
  • Initial work studying and analyzing the
    algorithms; nothing really ground-breaking YET!
  • Analyzing the design space and which problems
    seem most interesting to pursue

29
DHT API? Is it good?
  • The API isn't good for:
  • Range queries
  • Limited multicast: currently, all queries must
    be asked at all nodes; this is the same scaling
    problem as in Gnutella and Freenet
  • Batch Publish/Lookup operations
  • How do we properly extend the API if needed?
  • The end-to-end argument is always in play: which
    layer (CAN, PIER, or the user application) should
    do what?

30
Where to go from here? 1st: Understand where we are
  • Initial steps we need to take:
  • Testing so far has been limited to small
    micro-benchmarks; we need to take a careful look
    at:
  • Scaling
  • Effects of dynamic node re-organization
    (including failures, which result in data AND
    computation loss)
  • Heterogeneous network connections and various
    topologies
  • Different data distributions
  • Optimal levels of parallelism (too many nodes can
    be just as bad, if not worse, than too few)
  • More realistic simulations (including traces)
  • Develop or adopt performance metrics to ensure
    efficiency and measure alternatives

31
Where to go from here? Then: see where we can go
  • The hardcore research areas:
  • Replication for speed and fault tolerance (both
    of data and of computation)
  • Pre-computation of (intermediate) results
  • Caching, at both the DHT and PIER levels
  • Query optimization
  • More algorithms; distributed DBMSs have more
    tricks
  • Security
  • What new apps does the system enable?

32
Relationship to other Berkeley Projects. Are we
alone?
  • We have data coming from many sources, at
    different rates, always changing: Telegraph?
  • We expect that data is constantly published in
    this type of network; better aggregation
    algorithms: Sensor Nets?
  • Many similar or overlapping queries are likely to
    be asked from different nodes within a short time
    window: SteMs, XML filters?
  • Any ideas?