Title: The PIER Relational Query Processing System
1. The PIER Relational Query Processing System
- Boon Thau Loo, Ryan Huebsch
- boonloo, huebsch_at_eecs.berkeley.edu
- Also developed with Matthew Harren
- Faculty: Joe Hellerstein, Scott Shenker, Ion Stoica
- Group Email: p2p_at_db.cs.berkeley.edu
UC Berkeley DB Seminar 2/8/02
2. What is PIER? Just another nonsense acronym?
- PIER: Peer-to-Peer Infrastructure for Information Exchange and Retrieval
- PIER: A relational query engine that runs on top of an existing P2P (semi-structured) network and its data
- PIER: A better Napster, Gnutella, Freenet, or your other favorite file-sharing software
- PIER: An answer to the world's information problems... well, probably not
3. Why? What's wrong with existing P2P systems?
- Existing P2P systems were designed mainly by networking people
- Excel in:
  - Fault tolerance
  - Flexibility
  - Decentralization (ad hoc, multiple administrators, limited controls)
- Lack:
  - Real query facilities: they only support keyword searching
  - Guarantees/promises (remember the networking motto: best effort)
4. Why? What's wrong with existing database systems?
- Existing database systems
- Excel in:
  - Precise, correct answers
  - Powerful query facilities (SQL)
  - Guarantees and concurrency control (ACID transactions)
  - Optimization: don't tell us how to do it, we tell you
- Lack:
  - Flexibility (servers come and go, loose structure)
  - Decentralized administration
  - Fault tolerance (can tolerate only a few problems, if any; the DB answer to a problem: "it didn't work, try again later")
5. What about Napster, Gnutella, and the rest?
- Napster is centralized
  - Subject to failure
  - Subject to censorship (in this case, the music industry)
  - Requires administration
- Gnutella does not scale
  - Queries are only asked in the nearby network (limited hop count)
- Research projects: CAN, Chord (CFS, INS, TWINE), Pastry, Tapestry (OceanStore), etc.
  - Lack the query facilities
6. Why do you need complex queries? N-grams
- Technique from information retrieval to do inexact matching
- I want "tyranny", but I can't spell it ("tyrrany")
- First, n-grams are created (bi-grams here)
  - Doc1 "tyranny" → creates 8 bi-grams: t, ty, yr, ra, an, nn, ny, y
- Each bi-gram contains a pointer to the doc ID
- So a DB might look like:
  ra → Doc2   rn → Doc3   ty → Doc1   vu → Doc2   xr → Doc4
  t → Doc1    t → Doc5    t → Doc6    an → Doc1   an → Doc3
  an → Doc5   ba → Doc2   nn → Doc1   ny → Doc1   ra → Doc1
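A minimal Java sketch of the bi-gram step (illustrative only; it keeps just the interior grams, so it produces 6 bi-grams for "tyranny" rather than the 8 counted above, which presumably include boundary grams):

    import java.util.*;

    public class Bigrams {
        // Split a word into overlapping 2-character grams so inexact matches
        // can be found by comparing grams instead of whole words.
        static List<String> bigrams(String word) {
            List<String> grams = new ArrayList<>();
            for (int i = 0; i + 2 <= word.length(); i++)
                grams.add(word.substring(i, i + 2));
            return grams;
        }

        public static void main(String[] args) {
            System.out.println(bigrams("tyranny")); // [ty, yr, ra, an, nn, ny]
            System.out.println(bigrams("tyrrany")); // [ty, yr, rr, ra, an, ny] - 5 grams overlap
        }
    }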
7. N-grams continued
- Convert the search string to n-grams
- Intersect (which is a join) the list of n-grams from the search word with the index of n-grams/docIDs
- Aggregate the results by docID, counting the number of n-gram matches for each docID; more n-gram matches = closer to the request
- SELECT i.docID, count(i.ngram) AS matches
  FROM userlist u, indexlist i
  WHERE u.ngram = i.ngram
  GROUP BY i.docID
  ORDER BY matches DESC
8. What PIER IS and IS NOT about
- IS NOT ABOUT: Performance
  - In most situations a centralized solution could be faster
- IS ABOUT: Decentralized features
  - No administrator, anonymity, shared resources, tolerates failures, resistant to censorship
- IS NOT (just) ABOUT: Hype
  - Buzzwords, catch phrases, and empty promises
- IS ABOUT:
  - Uniting the networking community and the database community to get the best of both
9. General PIER Architecture
- Based on Distributed Hash Tables (DHTs) to get many good networking properties
- A query processor is built on top
- Note: the data is stored separately from the query engine, which is not standard DB practice!
10. DHTs Overview
- A wide-area distributed hash table that supports finding data in an efficient manner
- Data has an ID that is used to map the item to a node (or bucket) that stores it
- Lookups/inserts → a routing problem
- A number of implementations:
  - CAN: UC Berkeley/ICSI
  - Chord: MIT/UC Berkeley
  - Tapestry: UC Berkeley, based on Plaxton's algorithm (UTexas)
  - And many others, all with similar APIs and functionalities
11. DHTs: CAN in particular
- Content Addressable Network (CAN) is one of many DHT algorithms
- Overlay network that uses underlying TCP
- Self-forming: a new node only needs to find one existing node to join; no central administration needed
- Fault tolerant: adapts to many failures
- Load balancing
- Scalable: each node needs only a little state information about its neighbors, O(d) cost, and primitives are O(n^(1/d)) cost
  - d = number of dimensions, n = number of nodes
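For rough intuition (numbers illustrative, not from the slides): with d = 4 dimensions and n = 1,000,000 nodes, each node keeps about 2d = 8 neighbors, while a lookup takes on the order of n^(1/4) ≈ 32 hops.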
12. CAN Overview
- Establishes a d-dimensional Cartesian coordinate space
- Nodes randomly choose a point in the space to join
- The node currently responsible for that space splits, and the space is divided equally
- Nodes store information about their spatial neighbors in the coordinate space
13. CAN Overview II
- Data is represented as a key/value pair
- The key, or resource ID (RID), is hashed into the coordinate space using a hashing function h(x), such as SHA-1
- Data is stored at the node responsible for that space
[Figure: Item 1 (I1) is stored at the node owning the point h(RID1) = <3,2>]
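A rough Java sketch of hashing a RID into a d-dimensional coordinate; the slide only says a hash such as SHA-1 is used, so the exact byte-to-coordinate mapping below is an assumption:

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;

    public class CanHash {
        // Hash the RID and derive one coordinate in [0,1) per dimension
        // from the digest bytes.
        static double[] toCoordinate(String rid, int d) throws Exception {
            byte[] digest = MessageDigest.getInstance("SHA-1")
                                         .digest(rid.getBytes(StandardCharsets.UTF_8));
            double[] coord = new double[d];
            for (int i = 0; i < d; i++) {
                long v = 0;
                for (int j = 0; j < 4; j++)            // 4 digest bytes per dimension
                    v = (v << 8) | (digest[(4 * i + j) % digest.length] & 0xff);
                coord[i] = v / (double) (1L << 32);    // normalize to [0,1)
            }
            return coord;
        }

        public static void main(String[] args) throws Exception {
            double[] p = toCoordinate("RID1", 2);
            System.out.printf("h(RID1) = <%.2f, %.2f>%n", p[0], p[1]);
        }
    }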
14. CAN Overview III
Note that the space wraps (it is really a torus), so node 6 has nodes 2, 3, 4, and 5 as neighbors, but to make this simple example work we ignore that detail.
- Data is located by hopping from neighbor to neighbor until the right node is found
- Node 6 wants I1:
  - Determine the coordinates: h(RID1) = <2,3>
  - Determine which neighbor is closest: N4 or N2
  - Hop to that neighbor and repeat
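A toy Java sketch of the greedy hop choice just described; the coordinates and neighbor set are made up, and the torus wrap-around is ignored as in the example:

    public class GreedyHop {
        // Among the neighbors, pick the one closest to the target coordinate.
        static int closestNeighbor(double[][] neighbors, double[] target) {
            int best = 0;
            double bestDist = Double.MAX_VALUE;
            for (int i = 0; i < neighbors.length; i++) {
                double d = 0;
                for (int k = 0; k < target.length; k++)
                    d += (neighbors[i][k] - target[k]) * (neighbors[i][k] - target[k]);
                if (d < bestDist) { bestDist = d; best = i; }
            }
            return best;
        }

        public static void main(String[] args) {
            double[][] neighbors = {{1.0, 3.0}, {3.5, 1.0}}; // e.g., N2 and N4
            double[] target = {2.0, 3.0};                    // h(RID1) from the slide
            System.out.println("hop to neighbor #" + closestNeighbor(neighbors, target));
        }
    }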
15. CAN Overview IV
- The cost of operations is based on the path length from node to node: O(n^(1/d))
- Basic API:
  - Publish(namespace, RID, object)
  - Lookup(namespace, RID)
  - Multicast(group, object)
- Applications can only do EXACT lookups; that just isn't enough for real users!
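The three primitives above, written as a Java interface the way an application like PIER might see them (parameter types are assumptions; the slide gives only the call names and arguments):

    public interface DhtApi {
        // Store an object under a resource ID within a namespace (relation).
        void publish(String namespace, byte[] resourceId, Object object);

        // Exact-match retrieval: find the object stored under this RID.
        Object lookup(String namespace, byte[] resourceId);

        // Deliver an object (e.g., a query request) to every member of a group.
        void multicast(String group, Object object);
    }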
16. PIER Overview
- PIER is just another application as far as CAN is concerned: CAN objects = PIER tuples
- User applications can use PIER to query data using a subset of SQL (joins and group by)
- Assumptions:
  - PIER is running on all CAN nodes
  - Metadata is stored in CAN, not the actual large files, although functionally they could be
  - An object's RID is its primary key (in most cases)
  - The namespace is equivalent to the relation's name
17. CAN Extensions
(Not needed for correctness; polling could be used instead.)
- For PIER to operate, it must have access to the local data store:
  - LocalGet(namespace): retrieve the data stored locally for a particular namespace
  - LocalNotify(namespace): receive a callback when new data is inserted into the local store for the namespace
- This violates the abstraction of location independence
- However, access is limited to reading the data; applications cannot control the location of data through these extensions
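The two extensions as a Java interface sketch; the callback type and exact signatures are assumptions, since the slide names only LocalGet(namespace) and LocalNotify(namespace):

    import java.util.List;

    public interface DhtLocalExtensions {
        // Read-only scan of the tuples this node happens to store for a namespace.
        List<Object> localGet(String namespace);

        // Register a callback fired when new data arrives locally for a namespace.
        void localNotify(String namespace, LocalInsertListener listener);

        interface LocalInsertListener {
            void onInsert(String namespace, Object newObject);
        }
    }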
18. Indexes. The lifeblood of a database engine.
- CAN's mapping of RID → object is equivalent to an index
- Additional indexes are created by adding another key/value pair, with the key being the value of the indexed field(s) and the value being a pointer to the object (the RID or primary key)
  - Original tuple: (35, abc.mp3, classical, 1837, ...)
  - Index on 3rd/4th attributes: (classical.1837, 35)
- Index lookups require 2 CAN lookups: one for the index entry, one for the object
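A minimal Java sketch of the two-step index lookup, with plain HashMaps standing in for the DHT namespaces and the "." separator in the composite key assumed:

    import java.util.*;

    public class SecondaryIndexSketch {
        public static void main(String[] args) {
            // Primary namespace: RID (primary key) -> tuple.
            Map<String, List<String>> songs = new HashMap<>();
            songs.put("35", List.of("35", "abc.mp3", "classical", "1837"));

            // Index namespace: value of indexed fields -> RID of the object.
            Map<String, String> genreYearIndex = new HashMap<>();
            genreYearIndex.put("classical.1837", "35");

            // An index lookup costs two DHT lookups: index entry, then the object.
            String rid = genreYearIndex.get("classical.1837"); // lookup #1
            List<String> tuple = songs.get(rid);               // lookup #2
            System.out.println(tuple);
        }
    }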
19. Rehash-Join
- Each tuple is checked against the predicates that apply to it (e.g., produced > 1970)
- Unnecessary fields can be projected out
- Re-insert the resulting tuple into the network using the join key value as the new RID, and use a new temporary namespace (both tables use the same namespace)

When each node receives the multicast, it uses localGet to read all data stored at the node. Each object or tuple is analyzed.

"I want Hawaiian images that appeared in movies produced since 1970"

Create a query request: SELECT name, URL FROM images, movies WHERE images.ID = movies.ID AND ...
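A minimal, single-process Java sketch of the Rehash-Join flow; HashMaps stand in for the DHT namespaces, and the field names follow the example query (the Hawaii predicate from the truncated query is omitted):

    import java.util.*;

    public class RehashJoinSketch {
        // Tuples stored locally at this node (what localGet would return).
        static List<Map<String, Object>> localImages = new ArrayList<>();
        static List<Map<String, Object>> localMovies = new ArrayList<>();
        // Stand-in for the temporary DHT namespace: join-key value -> rehashed tuples.
        static Map<Object, List<Map<String, Object>>> tempNamespace = new HashMap<>();

        public static void main(String[] args) {
            localImages.add(Map.of("ID", 7, "name", "beach.jpg", "URL", "http://example/beach.jpg"));
            localMovies.add(Map.of("ID", 7, "title", "Surf", "produced", 1994));
            localMovies.add(Map.of("ID", 8, "title", "Old", "produced", 1950));

            // 1) On every node: scan the local store (localGet), apply local predicates,
            //    project out unneeded fields, and rehash on the join key (ID).
            for (Map<String, Object> img : localImages)
                rehash(img.get("ID"), Map.of("src", "images", "name", img.get("name"), "URL", img.get("URL")));
            for (Map<String, Object> mov : localMovies)
                if ((int) mov.get("produced") > 1970)              // predicate applied before rehash
                    rehash(mov.get("ID"), Map.of("src", "movies", "ID", mov.get("ID")));

            // 2) At each rehash site: tuples with the same join-key value meet and join.
            for (List<Map<String, Object>> bucket : tempNamespace.values())
                for (Map<String, Object> a : bucket)
                    for (Map<String, Object> b : bucket)
                        if (a.get("src").equals("images") && b.get("src").equals("movies"))
                            System.out.println(a.get("name") + " " + a.get("URL"));
        }

        static void rehash(Object joinKey, Map<String, Object> tuple) {
            tempNamespace.computeIfAbsent(joinKey, k -> new ArrayList<>()).add(tuple);
        }
    }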
20. Get-Join
- Each tuple is checked against the predicates that apply to it (e.g., produced > 1970)
- Issue a Lookup request for the tuple(s) that may be able to join with it: the RID is the join key and the namespace is that of the other table
- ONE TABLE MUST BE INDEXED ON THE JOIN KEY!

When each node receives the multicast, it uses localGet to read all data stored at the node. Each object or tuple is analyzed.

"I want Hawaiian images that appeared in movies produced since 1970"

Create a query request: SELECT name, URL FROM images, movies WHERE images.ID = movies.ID AND ...
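A matching Java sketch of the Get-Join flow, under the same stand-in assumptions; the images table is assumed to be published with its ID (the join key) as the RID:

    import java.util.*;

    public class GetJoinSketch {
        // Stand-in for the published "images" namespace: RID (= join key) -> tuple.
        static Map<Object, Map<String, Object>> imagesNamespace = new HashMap<>();
        static List<Map<String, Object>> localMovies = new ArrayList<>();

        public static void main(String[] args) {
            imagesNamespace.put(7, Map.of("ID", 7, "name", "beach.jpg", "URL", "http://example/beach.jpg"));
            localMovies.add(Map.of("ID", 7, "title", "Surf", "produced", 1994));
            localMovies.add(Map.of("ID", 8, "title", "Old", "produced", 1950));

            // On every node: scan local movies (localGet), apply local predicates,
            // then issue a Lookup into the other table's namespace using the join key.
            for (Map<String, Object> mov : localMovies) {
                if ((int) mov.get("produced") > 1970) {
                    Map<String, Object> img = imagesNamespace.get(mov.get("ID")); // Lookup(images, joinKey)
                    if (img != null)
                        System.out.println(img.get("name") + " " + img.get("URL"));
                }
            }
        }
    }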
21. Hybrid-Join
- Starts the same as the others: the request is multicast to the nodes
- Instead of rehashing the entire tuple, only rehash the primary key and join key of tuples that may satisfy the request (just like a semi-join algorithm or index-on-the-fly)
- At the nodes with the rehashed data, if a join is possible (matching join keys), Lookup the entire tuple and then join
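A compact Java sketch of the Hybrid-Join idea: only (source, primary key) pairs are rehashed on the join key, and full tuples are fetched by Lookup only when a match exists (stand-in HashMaps and made-up keys again):

    import java.util.*;

    public class HybridJoinSketch {
        static Map<Object, Map<String, Object>> imagesNamespace = new HashMap<>(); // RID -> full tuple
        static Map<Object, Map<String, Object>> moviesNamespace = new HashMap<>(); // RID -> full tuple
        static Map<Object, List<Object[]>> tempNamespace = new HashMap<>();        // joinKey -> {src, primaryKey}

        public static void main(String[] args) {
            imagesNamespace.put(7, Map.of("ID", 7, "name", "beach.jpg", "URL", "http://example/beach.jpg"));
            moviesNamespace.put(100, Map.of("ID", 7, "title", "Surf", "produced", 1994));

            // 1) Rehash only the small (source, primaryKey) pairs on the join key.
            rehash(7, new Object[]{"images", 7});
            rehash(7, new Object[]{"movies", 100});

            // 2) At the rehash site: if both sides are present for a join-key value,
            //    Lookup the full tuple(s) by primary key and produce the result.
            //    (Only the image is fetched here because only its fields are output;
            //    moviesNamespace would be looked up too if movie fields were needed.)
            for (List<Object[]> bucket : tempNamespace.values())
                for (Object[] a : bucket)
                    for (Object[] b : bucket)
                        if (a[0].equals("images") && b[0].equals("movies")) {
                            Map<String, Object> img = imagesNamespace.get(a[1]); // Lookup(images, RID)
                            System.out.println(img.get("name") + " " + img.get("URL"));
                        }
        }

        static void rehash(Object joinKey, Object[] ref) {
            tempNamespace.computeIfAbsent(joinKey, k -> new ArrayList<>()).add(ref);
        }
    }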
22. Quick Comparison
- Number of data movements (latency):
  - Rehash: query request, rehash, result = 3
  - Get: query request, lookup, answer, result = 4
  - Hybrid: query request, small rehash, lookup, answer, result = 5
- Amount of data sent:
  - Rehash: sends projected tuples matching some predicates, but can't pre-check the join predicate
  - Get: only retrieves tuples matching the join predicate, but the entire tuple must be fetched and may not match some other predicates
  - Hybrid: sends small tuples matching some predicates; the entire tuple is fetched only if a join is very likely
23. Applying Predicates
- Example SQL query:
  SELECT emp.name, emp.salary
  FROM emp, dept
  WHERE emp.deptid = dept.id AND
        emp.age < 50 AND
        dept.name <> 'CS' AND
        emp.salary > dept.maxsalary
24. Quick Comparison
[Chart: measured message counts and data volumes for the three join strategies - 160k msgs / 750 KB, 250k msgs / 4.6 MB, 300k msgs / 600 KB]
25. Rehash-Group By
- Same as Rehash-Join: rehash the tuples matching the predicates
- At the rehash sites, perform the aggregate
- After a predefined time, send the result
- Obviously this needs work:
  - Partial results
  - Statistics work (confidence levels, etc.)
  - Hierarchical aggregation (like in sensor nets)
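A minimal Java sketch of the Rehash-Group By flow, reusing the n-gram example: tuples are rehashed on the grouping key (docID) and each rehash site computes its counts (HashMaps stand in for the DHT):

    import java.util.*;

    public class RehashGroupBySketch {
        public static void main(String[] args) {
            // Locally stored tuples: (docID, ngram) pairs, as in the n-gram example.
            List<Object[]> local = new ArrayList<>();
            local.add(new Object[]{"Doc1", "ty"});
            local.add(new Object[]{"Doc1", "ra"});
            local.add(new Object[]{"Doc3", "an"});

            // 1) Rehash on the grouping key (docID) into a temporary namespace.
            Map<String, List<Object[]>> tempNamespace = new HashMap<>();
            for (Object[] t : local)
                tempNamespace.computeIfAbsent((String) t[0], k -> new ArrayList<>()).add(t);

            // 2) At each rehash site, compute the aggregate for its groups and,
            //    after the predefined timeout, ship the result to the requester.
            for (Map.Entry<String, List<Object[]>> g : tempNamespace.entrySet())
                System.out.println(g.getKey() + " matches=" + g.getValue().size());
        }
    }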
26. Current Optimizations
- Result tuple batching: why send results one by one?
  - A pipelined hash join naturally produces more results in one pass as more tuples are processed
  - The network may be optimized for packets of a certain size; batch 2, 3, 4, ... x tuples per message (see the sketch after this list)
- Selection and projection
  - Before rehashing, throw out irrelevant tuples and project to reduce the size of the tuple
- Rehash optimization
  - Instead of rehashing to all sites, randomly select a smaller group to be used
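A small Java sketch of the result-batching idea referenced above; the batch size and send() method are illustrative placeholders:

    import java.util.*;

    public class ResultBatcher {
        private final int batchSize;
        private final List<String> buffer = new ArrayList<>();

        ResultBatcher(int batchSize) { this.batchSize = batchSize; }

        // Buffer each result tuple; flush once a full batch has accumulated.
        void addResult(String tuple) {
            buffer.add(tuple);
            if (buffer.size() >= batchSize) flush();
        }

        void flush() {
            if (!buffer.isEmpty()) {
                send(new ArrayList<>(buffer)); // one network message carries the whole batch
                buffer.clear();
            }
        }

        void send(List<String> batch) {
            System.out.println("sending " + batch.size() + " tuples in one message");
        }

        public static void main(String[] args) {
            ResultBatcher b = new ResultBatcher(3);
            for (int i = 0; i < 7; i++) b.addResult("tuple" + i);
            b.flush(); // ship any leftover partial batch
        }
    }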
27. Interesting note
- The state of the join is stored in the DHT store
  - The query request is automatically delivered to new nodes when they join
  - Rehashed data is automatically re-routed to the proper node if the coordinate space is adjusted
  - When a node splits (to accept a new node into the network), the data is also split; this includes previously delivered rehashed tuples
- This allows graceful re-organization of the network without interfering with ongoing operations
28. Where we are
- 25,000 lines of Java code, hours of debugging, speed/memory enhancements, etc.
- A working, real implementation of PIER on top of a CAN simulator
- Initial work studying and analyzing algorithms; nothing really ground-breaking YET!
- Analyzing the design space and which problems seem most interesting to pursue
29. DHT API? Is it good?
- The API isn't good for:
  - Range queries
  - Limited multicast: currently, all queries must be asked at all nodes; this is the same scaling problem as Gnutella and Freenet
  - Batch Publish/Lookup operations
- How to properly extend the API if needed?
- The end-to-end argument is always in play: which layer (CAN, PIER, or the user application) should do what?
30. Where to go from here? 1st: Understand where we are
- Initial steps we need to take:
  - Testing so far has been limited to small micro-benchmarks; we need to take a careful look at:
    - Scaling
    - Effects of dynamic node re-organization (including failures, which result in data AND computation loss)
    - Heterogeneous network connections and various topologies
    - Different data distributions
    - Optimal levels of parallelism: too many nodes can be just as bad, if not worse, than too few
  - More realistic simulations (including traces)
  - Develop or adopt performance metrics to ensure efficiency and measure alternatives
31. Where to go from here? Then see where we can go
- The hardcore research areas:
  - Replication for speed and fault tolerance (both in data and computation)
  - Pre-computation of (intermediate) results
  - Caching, both at the DHT and PIER levels
  - Query optimization
  - More algorithms; distributed DBMSs have more tricks
  - Security
  - What are the new apps the system enables?
32. Relationship to other Berkeley projects. Are we alone?
- We have data coming from many sources, at different rates, always changing: Telegraph?
- We expect that data is constantly published in this type of network; better aggregation algorithms: Sensor Nets?
- Many similar or overlapping queries are likely to be asked from different nodes within a short time window: SteMs, XML Filters?
- Any ideas?